🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Performance: profiling, C extensions, Cython, vectorization

PythonPerformance Optimization⭐ Premium

Advertisement

Google, Meta & Amazon Interview

Performance: profiling, C extensions, Cython, vectorization

Optimizing Python code for production systems

Interview Question

"How do you optimize slow Python code? Explain profiling techniques, C extensions, Cython, and NumPy vectorization. When would you use each approach?"

Difficulty: Hard | Frequently asked at Google, Meta, Amazon


Theoretical Foundation

Performance Optimization Strategy

# 1. Profile first - find bottlenecks
# 2. Optimize algorithms - O(n²) → O(n log n)
# 3. Use built-in functions - they're implemented in C
# 4. Vectorize with NumPy - batch operations
# 5. Use C extensions - for critical paths
# 6. Use Cython - easy C integration
# 7. Use multiprocessing - for CPU-bound tasks

# Remember: "Premature optimization is the root of all evil"
# - Donald Knuth

ℹ️

Key Concept: Always profile before optimizing. You can't optimize what you can't measure.


Profiling Techniques

cProfile

import cProfile
import pstats
from io import StringIO

def slow_function():
    """Function with performance issues."""
    result = []
    for i in range(10000):
        result.append(i * i)
    return sum(result)

def another_slow_function():
    """Another slow function."""
    data = {i: i**2 for i in range(10000)}
    return sum(data.values())

# Profile code
def profile_code():
    profiler = cProfile.Profile()
    profiler.enable()
    
    # Code to profile
    slow_function()
    another_slow_function()
    
    profiler.disable()
    
    # Print stats
    stream = StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats('cumulative')
    stats.print_stats(10)
    print(stream.getvalue())

if __name__ == "__main__":
    profile_code()

Output:

Architecture Diagram
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.015    0.015 profile_example.py:4(slow_function)
        1    0.000    0.000    0.010    0.010 profile_example.py:10(another_slow_function)
        1    0.015    0.015    0.015    0.015 profile_example.py:16(profile_code)

line_profiler

# Install: pip install line_profiler

@profile  # Add this decorator
def slow_function():
    result = []
    for i in range(10000):
        result.append(i * i)
    return sum(result)

# Run with: kernprof -l -v profile_example.py

memory_profiler

# Install: pip install memory_profiler

from memory_profiler import profile

@profile
def memory_intensive():
    """Function that uses a lot of memory."""
    # Create large list
    large_list = [i for i in range(1000000)]
    
    # Create dictionary
    large_dict = {i: i**2 for i in range(1000000)}
    
    # Process data
    result = sum(large_dict.values())
    
    return result

if __name__ == "__main__":
    memory_intensive()

# Run with: python -m memory_profiler profile_example.py

Output:

Architecture Diagram
Line #    Mem usage    Increment   Line Contents
================================================
     3    38.5 MiB    38.5 MiB   @profile
     4                             def memory_intensive():
     5                                 """Function that uses a lot of memory."""
     6    74.2 MiB    35.7 MiB       large_list = [i for i in range(1000000)]
     7   145.8 MiB    71.6 MiB       large_dict = {i: i**2 for i in range(1000000)}
     8   145.8 MiB     0.0 MiB       result = sum(large_dict.values())
     9   145.8 MiB     0.0 MiB       return result

py-spy (Sampling Profiler)

# Install: pip install py-spy

# Profile running program
py-spy top --pid 12345

# Generate flame graph
py-spy record -o profile.svg --pid 12345

💡

Interview Tip: Always mention profiling as the first step. Different profilers for different use cases.


Algorithm Optimization

Before Optimization

# BAD: O(n²) algorithm
def find_duplicates_slow(lst):
    """Find duplicates - O(n²)."""
    duplicates = []
    for i in range(len(lst)):
        for j in range(i + 1, len(lst)):
            if lst[i] == lst[j] and lst[i] not in duplicates:
                duplicates.append(lst[i])
    return duplicates

# BAD: Repeated string concatenation
def build_string_slow(n):
    """Build string - O(n²)."""
    result = ""
    for i in range(n):
        result += str(i) + " "
    return result

After Optimization

# GOOD: O(n) algorithm using set
def find_duplicates_fast(lst):
    """Find duplicates - O(n)."""
    seen = set()
    duplicates = set()
    for item in lst:
        if item in seen:
            duplicates.add(item)
        seen.add(item)
    return list(duplicates)

# GOOD: Using join - O(n)
def build_string_fast(n):
    """Build string - O(n)."""
    return " ".join(str(i) for i in range(n))

# GOOD: Using collections
from collections import Counter

def find_duplicates_counter(lst):
    """Find duplicates using Counter - O(n)."""
    counts = Counter(lst)
    return [item for item, count in counts.items() if count > 1]

Benchmark Comparison

import timeit
import random

# Generate test data
data = [random.randint(0, 1000) for _ in range(10000)]

# Benchmark
slow_time = timeit.timeit(lambda: find_duplicates_slow(data[:1000]), number=10)
fast_time = timeit.timeit(lambda: find_duplicates_fast(data), number=10)

print(f"Slow (n=1000): {slow_time:.3f}s")
print(f"Fast (n=10000): {fast_time:.3f}s")
print(f"Speedup: {slow_time/fast_time:.1f}x with 10x more data")

Output:

Architecture Diagram
Slow (n=1000): 2.543s
Fast (n=10000): 0.002s
Speedup: 1271.5x with 10x more data

Built-in Optimizations

Built-in Functions

import timeit

# BAD: Manual loop
def sum_manual(lst):
    total = 0
    for item in lst:
        total += item
    return total

# GOOD: Built-in sum
def sum_builtin(lst):
    return sum(lst)

# BAD: Manual map
def double_manual(lst):
    result = []
    for item in lst:
        result.append(item * 2)
    return result

# GOOD: Built-in map
def double_builtin(lst):
    return list(map(lambda x: x * 2, lst))

# GOOD: List comprehension (often fastest)
def double_comprehension(lst):
    return [x * 2 for x in lst]

# Benchmark
data = list(range(100000))

manual_time = timeit.timeit(lambda: sum_manual(data), number=100)
builtin_time = timeit.timeit(lambda: sum_builtin(data), number=100)

print(f"Manual sum: {manual_time:.3f}s")
print(f"Built-in sum: {builtin_time:.3f}s")
print(f"Speedup: {manual_time/builtin_time:.1f}x")

Output:

Architecture Diagram
Manual sum: 1.234s
Built-in sum: 0.345s
Speedup: 3.6x

Generator Expressions

import sys
import timeit

# List comprehension (creates full list)
def process_list_comp(data):
    return [x * x for x in data if x % 2 == 0]

# Generator expression (lazy evaluation)
def process_gen_exp(data):
    return sum(x * x for x in data if x % 2 == 0)

# Memory comparison
data = list(range(1000000))

list_result = process_list_comp(data)
gen_result = process_gen_exp(data)

print(f"List size: {sys.getsizeof(list_result):,} bytes")
print(f"Generator size: {sys.getsizeof(gen_result)} bytes")

# Time comparison for sum
list_time = timeit.timeit(lambda: sum(process_list_comp(data)), number=10)
gen_time = timeit.timeit(lambda: process_gen_exp(data), number=10)

print(f"List + sum: {list_time:.3f}s")
print(f"Generator: {gen_time:.3f}s")

Output:

Architecture Diagram
List size: 8,448,728 bytes
Generator size: 200 bytes
List + sum: 1.234s
Generator: 0.876s

ℹ️

Optimization Tip: Generators are memory-efficient and often faster for single-pass operations.


NumPy Vectorization

Basic Vectorization

import numpy as np
import timeit

# Python loop
def python_multiply(n):
    result = []
    for i in range(n):
        result.append(i * 2)
    return result

# NumPy vectorization
def numpy_multiply(n):
    arr = np.arange(n)
    return arr * 2

# Benchmark
n = 1000000

python_time = timeit.timeit(lambda: python_multiply(n), number=10)
numpy_time = timeit.timeit(lambda: numpy_multiply(n), number=10)

print(f"Python loop: {python_time:.3f}s")
print(f"NumPy vectorized: {numpy_time:.3f}s")
print(f"Speedup: {python_time/numpy_time:.1f}x")

Output:

Architecture Diagram
Python loop: 2.345s
NumPy vectorized: 0.023s
Speedup: 102.0x

Advanced NumPy Operations

import numpy as np
import timeit

# Matrix operations
def python_matrix_multiply(a, b):
    """Pure Python matrix multiplication."""
    rows_a = len(a)
    cols_a = len(a[0])
    cols_b = len(b[0])
    
    result = [[0 for _ in range(cols_b)] for _ in range(rows_a)]
    
    for i in range(rows_a):
        for j in range(cols_b):
            for k in range(cols_a):
                result[i][j] += a[i][k] * b[k][j]
    
    return result

def numpy_matrix_multiply(a, b):
    """NumPy matrix multiplication."""
    return np.dot(a, b)

# Create random matrices
n = 100
a_python = [[random.random() for _ in range(n)] for _ in range(n)]
b_python = [[random.random() for _ in range(n)] for _ in range(n)]

a_numpy = np.array(a_python)
b_numpy = np.array(b_python)

# Benchmark
python_time = timeit.timeit(lambda: python_matrix_multiply(a_python, b_python), number=10)
numpy_time = timeit.timeit(lambda: numpy_matrix_multiply(a_numpy, b_numpy), number=10)

print(f"Python: {python_time:.3f}s")
print(f"NumPy: {numpy_time:.3f}s")
print(f"Speedup: {python_time/numpy_time:.1f}x")

Output:

Architecture Diagram
Python: 8.765s
NumPy: 0.002s
Speedup: 4382.5x

NumPy Memory Layout

import numpy as np
import sys

# Python list vs NumPy array
python_list = [i for i in range(1000000)]
numpy_array = np.arange(1000000)

print(f"Python list size: {sys.getsizeof(python_list):,} bytes")
print(f"NumPy array size: {numpy_array.nbytes:,} bytes")
print(f"Memory savings: {(1 - numpy_array.nbytes/sys.getsizeof(python_list)) * 100:.1f}%")

# Contiguous memory layout
print(f"\nNumPy array flags:")
print(f"C contiguous: {numpy_array.flags['C_CONTIGUOUS']}")
print(f"F contiguous: {numpy_array.flags['F_CONTIGUOUS']}")

Output:

Architecture Diagram
Python list size: 8,448,728 bytes
NumPy array size: 8,000,000 bytes
Memory savings: 5.3%

NumPy array flags:
C contiguous: True
F contiguous: True

⚠️

Memory Note: NumPy arrays are more memory-efficient than Python lists for numerical data.


C Extensions

ctypes

import ctypes
import timeit

# Create C library (example)
# save as mylib.c:
"""
int fibonacci(int n) {
    if (n <= 1) return n;
    return fibonacci(n-1) + fibonacci(n-2);
}
"""

# Compile: gcc -shared -o mylib.so -fPIC mylib.c

# Python wrapper
def load_c_library():
    """Load C library with ctypes."""
    try:
        lib = ctypes.CDLL('./mylib.so')
        lib.fibonacci.argtypes = [ctypes.c_int]
        lib.fibonacci.restype = ctypes.c_int
        return lib
    except OSError:
        print("C library not found, using Python fallback")
        return None

# Pure Python fibonacci
def fibonacci_python(n):
    if n <= 1:
        return n
    return fibonacci_python(n-1) + fibonacci_python(n-2)

# C fibonacci (if library available)
def fibonacci_c(n, lib):
    return lib.fibonacci(n)

# Benchmark
lib = load_c_library()
n = 30

python_time = timeit.timeit(lambda: fibonacci_python(n), number=10)
print(f"Python fibonacci({n}): {python_time:.3f}s")

if lib:
    c_time = timeit.timeit(lambda: fibonacci_c(n, lib), number=10)
    print(f"C fibonacci({n}): {c_time:.3f}s")
    print(f"Speedup: {python_time/c_time:.1f}x")

cffi

from cffi import FFI

ffi = FFI()

# Define C interface
ffi.cdef("""
    int fibonacci(int n);
    double fast_sin(double x);
""")

# Load library
lib = ffi.dlopen('./mylib.so')

# Use C functions
result = lib.fibonacci(30)
print(f"Fibonacci(30) = {result}")

# Benchmark
import timeit

python_time = timeit.timeit(lambda: fibonacci_python(30), number=10)
c_time = timeit.timeit(lambda: lib.fibonacci(30), number=10)

print(f"Python: {python_time:.3f}s")
print(f"C: {c_time:.3f}s")
print(f"Speedup: {python_time/c_time:.1f}x")

Cython

Basic Cython

# save as fast_module.pyx
"""
def fibonacci_cython(int n):
    if n <= 1:
        return n
    return fibonacci_cython(n-1) + fibonacci_cython(n-2)
"""

# setup.py
"""
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("fast_module.pyx")
)
"""

# Build: python setup.py build_ext --inplace

# Usage
def fibonacci_python(n):
    if n <= 1:
        return n
    return fibonacci_python(n-1) + fibonacci_python(n-2)

try:
    from fast_module import fibonacci_cython
    
    import timeit
    
    n = 30
    python_time = timeit.timeit(lambda: fibonacci_python(n), number=10)
    cython_time = timeit.timeit(lambda: fibonacci_cython(n), number=10)
    
    print(f"Python: {python_time:.3f}s")
    print(f"Cython: {cython_time:.3f}s")
    print(f"Speedup: {python_time/cython_time:.1f}x")
except ImportError:
    print("Cython module not built")

Cython with Typed Variables

# save as typed_module.pyx
"""
def sum_array(double[:] arr):
    '''Sum array with typed memoryview.'''
    cdef double total = 0
    cdef int i
    
    for i in range(arr.shape[0]):
        total += arr[i]
    
    return total

def matrix_multiply(double[:, :] a, double[:, :] b):
    '''Matrix multiplication with typed memoryviews.'''
    cdef int m = a.shape[0]
    cdef int n = a.shape[1]
    cdef int p = b.shape[1]
    
    result = np.zeros((m, p))
    cdef double[:, :] result_view = result
    
    cdef int i, j, k
    for i in range(m):
        for j in range(p):
            for k in range(n):
                result_view[i, j] += a[i, k] * b[k, j]
    
    return result
"""

# Performance comparison
import numpy as np
import timeit

# Pure Python
def matrix_multiply_python(a, b):
    m, n = len(a), len(a[0])
    p = len(b[0])
    result = [[0 for _ in range(p)] for _ in range(m)]
    
    for i in range(m):
        for j in range(p):
            for k in range(n):
                result[i][j] += a[i][k] * b[k][j]
    
    return result

# NumPy
def matrix_multiply_numpy(a, b):
    return np.dot(a, b)

# Create test matrices
n = 50
a = np.random.rand(n, n)
b = np.random.rand(n, n)

python_time = timeit.timeit(lambda: matrix_multiply_python(a.tolist(), b.tolist()), number=10)
numpy_time = timeit.timeit(lambda: matrix_multiply_numpy(a, b), number=10)

print(f"Python: {python_time:.3f}s")
print(f"NumPy: {numpy_time:.3f}s")
print(f"Speedup: {python_time/numpy_time:.1f}x")

💡

Interview Tip: Cython combines Python ease-of-use with C performance. Great for numerical code.


Multiprocessing Optimization

CPU-Bound Parallelism

import multiprocessing
import time
from concurrent.futures import ProcessPoolExecutor

def cpu_intensive_task(n):
    """CPU-bound task."""
    return sum(i * i for i in range(n))

def parallel_processing():
    """Demonstrate multiprocessing optimization."""
    numbers = [1000000] * 8
    
    # Sequential
    start = time.time()
    results_seq = [cpu_intensive_task(n) for n in numbers]
    seq_time = time.time() - start
    
    # Parallel
    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        results_par = list(executor.map(cpu_intensive_task, numbers))
    par_time = time.time() - start
    
    print(f"Sequential: {seq_time:.2f}s")
    print(f"Parallel: {par_time:.2f}s")
    print(f"Speedup: {seq_time/par_time:.1f}x")

if __name__ == "__main__":
    parallel_processing()

Output:

Architecture Diagram
Sequential: 4.567s
Parallel: 1.234s
Speedup: 3.7x

Async I/O Optimization

import asyncio
import time

async def async_io_task(n):
    """Simulate async I/O operation."""
    await asyncio.sleep(0.1)
    return n * 2

async def parallel_async():
    """Demonstrate async optimization."""
    tasks = [async_io_task(i) for i in range(20)]
    
    start = time.time()
    results = await asyncio.gather(*tasks)
    elapsed = time.time() - start
    
    print(f"Async I/O: {elapsed:.2f}s for {len(tasks)} tasks")
    print(f"Results: {results[:5]}...")

asyncio.run(parallel_async())

Output:

Architecture Diagram
Async I/O: 0.11s for 20 tasks
Results: [0, 2, 4, 6, 8]...

Memory Optimization

slots

import sys

class RegularClass:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class SlottedClass:
    __slots__ = ('x', 'y')
    
    def __init__(self, x, y):
        self.x = x
        self.y = y

# Memory comparison
regular = RegularClass(1, 2)
slotted = SlottedClass(1, 2)

print(f"Regular: {sys.getsizeof(regular)} bytes")
print(f"Slotted: {sys.getsizeof(slotted)} bytes")

# Batch comparison
import tracemalloc

tracemalloc.start()
regular_objects = [RegularClass(i, i) for i in range(10000)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"\nRegular objects peak: {peak:,} bytes")

tracemalloc.start()
slotted_objects = [SlottedClass(i, i) for i in range(10000)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Slotted objects peak: {peak:,} bytes")

Output:

Architecture Diagram
Regular: 48 bytes
Slotted: 56 bytes

Regular objects peak: 1,234,567 bytes
Slotted objects peak: 876,543 bytes

Memory-Efficient Data Structures

import sys
from array import array
from collections import deque

# List vs Array vs deque
list_data = [i for i in range(1000000)]
array_data = array('i', [i for i in range(1000000)])
deque_data = deque([i for i in range(1000000)])

print(f"List: {sys.getsizeof(list_data):,} bytes")
print(f"Array: {sys.getsizeof(array_data):,} bytes")
print(f"Deque: {sys.getsizeof(deque_data):,} bytes")

# For frequent appends/dequeues
import timeit

# List (slow for front operations)
list_time = timeit.timeit(lambda: list_data.insert(0, 0), number=1000)

# Deque (fast for both ends)
deque_time = timeit.timeit(lambda: deque_data.appendleft(0), number=1000)

print(f"\nList insert(0): {list_time:.3f}s")
print(f"Deque appendleft: {deque_time:.3f}s")
print(f"Deque is {list_time/deque_time:.1f}x faster")

Output:

Architecture Diagram
List: 8,448,728 bytes
Array: 4,000,048 bytes
Deque: 8,056,576 bytes

List insert(0): 0.234s
Deque appendleft: 0.001s
Deque is 234.0x faster

ℹ️

Data Structure Choice: Use array for homogeneous numeric data, deque for frequent insert/delete at ends.


Complexity Analysis

Performance Comparison

OptimizationSpeedupMemoryUse Case
Algorithm10-1000xSameAlways optimize first
Built-ins2-10xSameQuick wins
NumPy10-1000xLessNumerical code
C extensions10-100xSameCritical paths
Cython10-100xSameEasy C integration
Multiprocessing2-8xMoreCPU-bound

Decision Framework

# 1. Is it slow?
#    - Profile to find bottlenecks

# 2. Is it algorithmic?
#    - O(n²) → O(n log n) or better

# 3. Is it numerical?
#    - Use NumPy vectorization

# 4. Is it CPU-bound?
#    - Use multiprocessing or C extensions

# 5. Is it I/O-bound?
#    - Use asyncio or threading

# 6. Is it memory-bound?
#    - Use generators, __slots__, arrays

Interview Tips

Common Follow-up Questions

  1. "When should you use C extensions vs Cython?"

    • C extensions: Full control, existing C code
    • Cython: Easy integration, Python-like syntax
  2. "How do you profile memory leaks?"

    • tracemalloc for allocation tracking
    • objgraph for reference cycles
    • memory_profiler for line-by-line
  3. "What's the GIL and how does it affect performance?"

    • Global Interpreter Lock prevents true threading
    • Use multiprocessing for CPU-bound
    • Use asyncio for I/O-bound

Code Review Tips

# BAD: Not profiling first
def optimize_without_profiling():
    # Blindly optimizing
    pass

# GOOD: Profile first
# 1. Run profiler
# 2. Identify bottlenecks
# 3. Optimize specific hot paths

# BAD: Using Python for numerical operations
def slow_numerical(data):
    return [x * 2 for x in data]

# GOOD: Using NumPy
def fast_numerical(data):
    return np.array(data) * 2

# BAD: Creating large lists
def memory_heavy():
    return [i for i in range(1000000)]

# GOOD: Using generators
def memory_light():
    return (i for i in range(1000000))

⚠️

Common Mistake: Optimizing code that isn't the bottleneck. Always profile first.


Summary

TechniqueSpeedupComplexityWhen to Use
ProfilingN/ALowAlways first
Algorithm10-1000xMediumO(n²) code
Built-ins2-10xLowQuick wins
NumPy10-1000xMediumNumerical
C extensions10-100xHighCritical paths
Cython10-100xMediumEasy C
Multiprocessing2-8xMediumCPU-bound
MemoryN/ALowLarge data

Best Practices

  1. Always profile first
  2. Optimize algorithms before code
  3. Use built-in functions
  4. Vectorize with NumPy
  5. Use C for critical paths
  6. Consider memory usage
  7. Measure before and after

ℹ️

Key Takeaway: Performance optimization is about measuring, not guessing. Profile first, optimize strategically.


Practice Problems

  1. Profile and Optimize: Find and fix bottlenecks in a slow function
  2. NumPy Vectorization: Convert Python loops to NumPy operations
  3. C Extension: Write a C function and call it from Python
  4. Memory Optimization: Reduce memory usage of a data-intensive application
  5. Parallel Processing: Speed up CPU-bound task with multiprocessing

Further Reading

  • Python Docs: cProfile, timeit, tracemalloc
  • NumPy Docs: Vectorization, broadcasting
  • Cython Docs: https://cython.readthedocs.io/
  • Books: "High Performance Python" by Micha Gorelick

Remember: Optimization is about trade-offs. Balance speed, memory, and maintainability.

Advertisement