Performance: profiling, C extensions, Cython, vectorization
Optimizing Python code for production systems
Interview Question
"How do you optimize slow Python code? Explain profiling techniques, C extensions, Cython, and NumPy vectorization. When would you use each approach?"
Difficulty: Hard | Frequently asked at Google, Meta, Amazon
Theoretical Foundation
Performance Optimization Strategy
# 1. Profile first - find bottlenecks
# 2. Optimize algorithms - O(n²) → O(n log n)
# 3. Use built-in functions - they're implemented in C
# 4. Vectorize with NumPy - batch operations
# 5. Use C extensions - for critical paths
# 6. Use Cython - easy C integration
# 7. Use multiprocessing - for CPU-bound tasks
# Remember: "Premature optimization is the root of all evil"
# - Donald Knuth
ℹ️
Key Concept: Always profile before optimizing. You can't optimize what you can't measure.
Profiling Techniques
cProfile
import cProfile
import pstats
from io import StringIO
def slow_function():
"""Function with performance issues."""
result = []
for i in range(10000):
result.append(i * i)
return sum(result)
def another_slow_function():
"""Another slow function."""
data = {i: i**2 for i in range(10000)}
return sum(data.values())
# Profile code
def profile_code():
profiler = cProfile.Profile()
profiler.enable()
# Code to profile
slow_function()
another_slow_function()
profiler.disable()
# Print stats
stream = StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative')
stats.print_stats(10)
print(stream.getvalue())
if __name__ == "__main__":
profile_code()
Output:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.015 0.015 profile_example.py:4(slow_function)
1 0.000 0.000 0.010 0.010 profile_example.py:10(another_slow_function)
1 0.015 0.015 0.015 0.015 profile_example.py:16(profile_code)
line_profiler
# Install: pip install line_profiler
@profile # Add this decorator
def slow_function():
result = []
for i in range(10000):
result.append(i * i)
return sum(result)
# Run with: kernprof -l -v profile_example.py
memory_profiler
# Install: pip install memory_profiler
from memory_profiler import profile
@profile
def memory_intensive():
"""Function that uses a lot of memory."""
# Create large list
large_list = [i for i in range(1000000)]
# Create dictionary
large_dict = {i: i**2 for i in range(1000000)}
# Process data
result = sum(large_dict.values())
return result
if __name__ == "__main__":
memory_intensive()
# Run with: python -m memory_profiler profile_example.py
Output:
Line # Mem usage Increment Line Contents
================================================
3 38.5 MiB 38.5 MiB @profile
4 def memory_intensive():
5 """Function that uses a lot of memory."""
6 74.2 MiB 35.7 MiB large_list = [i for i in range(1000000)]
7 145.8 MiB 71.6 MiB large_dict = {i: i**2 for i in range(1000000)}
8 145.8 MiB 0.0 MiB result = sum(large_dict.values())
9 145.8 MiB 0.0 MiB return result
py-spy (Sampling Profiler)
# Install: pip install py-spy
# Profile running program
py-spy top --pid 12345
# Generate flame graph
py-spy record -o profile.svg --pid 12345
💡
Interview Tip: Always mention profiling as the first step. Different profilers for different use cases.
Algorithm Optimization
Before Optimization
# BAD: O(n²) algorithm
def find_duplicates_slow(lst):
"""Find duplicates - O(n²)."""
duplicates = []
for i in range(len(lst)):
for j in range(i + 1, len(lst)):
if lst[i] == lst[j] and lst[i] not in duplicates:
duplicates.append(lst[i])
return duplicates
# BAD: Repeated string concatenation
def build_string_slow(n):
"""Build string - O(n²)."""
result = ""
for i in range(n):
result += str(i) + " "
return result
After Optimization
# GOOD: O(n) algorithm using set
def find_duplicates_fast(lst):
"""Find duplicates - O(n)."""
seen = set()
duplicates = set()
for item in lst:
if item in seen:
duplicates.add(item)
seen.add(item)
return list(duplicates)
# GOOD: Using join - O(n)
def build_string_fast(n):
"""Build string - O(n)."""
return " ".join(str(i) for i in range(n))
# GOOD: Using collections
from collections import Counter
def find_duplicates_counter(lst):
"""Find duplicates using Counter - O(n)."""
counts = Counter(lst)
return [item for item, count in counts.items() if count > 1]
Benchmark Comparison
import timeit
import random
# Generate test data
data = [random.randint(0, 1000) for _ in range(10000)]
# Benchmark
slow_time = timeit.timeit(lambda: find_duplicates_slow(data[:1000]), number=10)
fast_time = timeit.timeit(lambda: find_duplicates_fast(data), number=10)
print(f"Slow (n=1000): {slow_time:.3f}s")
print(f"Fast (n=10000): {fast_time:.3f}s")
print(f"Speedup: {slow_time/fast_time:.1f}x with 10x more data")
Output:
Slow (n=1000): 2.543s
Fast (n=10000): 0.002s
Speedup: 1271.5x with 10x more data
Built-in Optimizations
Built-in Functions
import timeit
# BAD: Manual loop
def sum_manual(lst):
total = 0
for item in lst:
total += item
return total
# GOOD: Built-in sum
def sum_builtin(lst):
return sum(lst)
# BAD: Manual map
def double_manual(lst):
result = []
for item in lst:
result.append(item * 2)
return result
# GOOD: Built-in map
def double_builtin(lst):
return list(map(lambda x: x * 2, lst))
# GOOD: List comprehension (often fastest)
def double_comprehension(lst):
return [x * 2 for x in lst]
# Benchmark
data = list(range(100000))
manual_time = timeit.timeit(lambda: sum_manual(data), number=100)
builtin_time = timeit.timeit(lambda: sum_builtin(data), number=100)
print(f"Manual sum: {manual_time:.3f}s")
print(f"Built-in sum: {builtin_time:.3f}s")
print(f"Speedup: {manual_time/builtin_time:.1f}x")
Output:
Manual sum: 1.234s
Built-in sum: 0.345s
Speedup: 3.6x
Generator Expressions
import sys
import timeit
# List comprehension (creates full list)
def process_list_comp(data):
return [x * x for x in data if x % 2 == 0]
# Generator expression (lazy evaluation)
def process_gen_exp(data):
return sum(x * x for x in data if x % 2 == 0)
# Memory comparison
data = list(range(1000000))
list_result = process_list_comp(data)
gen_result = process_gen_exp(data)
print(f"List size: {sys.getsizeof(list_result):,} bytes")
print(f"Generator size: {sys.getsizeof(gen_result)} bytes")
# Time comparison for sum
list_time = timeit.timeit(lambda: sum(process_list_comp(data)), number=10)
gen_time = timeit.timeit(lambda: process_gen_exp(data), number=10)
print(f"List + sum: {list_time:.3f}s")
print(f"Generator: {gen_time:.3f}s")
Output:
List size: 8,448,728 bytes
Generator size: 200 bytes
List + sum: 1.234s
Generator: 0.876s
ℹ️
Optimization Tip: Generators are memory-efficient and often faster for single-pass operations.
NumPy Vectorization
Basic Vectorization
import numpy as np
import timeit
# Python loop
def python_multiply(n):
result = []
for i in range(n):
result.append(i * 2)
return result
# NumPy vectorization
def numpy_multiply(n):
arr = np.arange(n)
return arr * 2
# Benchmark
n = 1000000
python_time = timeit.timeit(lambda: python_multiply(n), number=10)
numpy_time = timeit.timeit(lambda: numpy_multiply(n), number=10)
print(f"Python loop: {python_time:.3f}s")
print(f"NumPy vectorized: {numpy_time:.3f}s")
print(f"Speedup: {python_time/numpy_time:.1f}x")
Output:
Python loop: 2.345s
NumPy vectorized: 0.023s
Speedup: 102.0x
Advanced NumPy Operations
import numpy as np
import timeit
# Matrix operations
def python_matrix_multiply(a, b):
"""Pure Python matrix multiplication."""
rows_a = len(a)
cols_a = len(a[0])
cols_b = len(b[0])
result = [[0 for _ in range(cols_b)] for _ in range(rows_a)]
for i in range(rows_a):
for j in range(cols_b):
for k in range(cols_a):
result[i][j] += a[i][k] * b[k][j]
return result
def numpy_matrix_multiply(a, b):
"""NumPy matrix multiplication."""
return np.dot(a, b)
# Create random matrices
n = 100
a_python = [[random.random() for _ in range(n)] for _ in range(n)]
b_python = [[random.random() for _ in range(n)] for _ in range(n)]
a_numpy = np.array(a_python)
b_numpy = np.array(b_python)
# Benchmark
python_time = timeit.timeit(lambda: python_matrix_multiply(a_python, b_python), number=10)
numpy_time = timeit.timeit(lambda: numpy_matrix_multiply(a_numpy, b_numpy), number=10)
print(f"Python: {python_time:.3f}s")
print(f"NumPy: {numpy_time:.3f}s")
print(f"Speedup: {python_time/numpy_time:.1f}x")
Output:
Python: 8.765s
NumPy: 0.002s
Speedup: 4382.5x
NumPy Memory Layout
import numpy as np
import sys
# Python list vs NumPy array
python_list = [i for i in range(1000000)]
numpy_array = np.arange(1000000)
print(f"Python list size: {sys.getsizeof(python_list):,} bytes")
print(f"NumPy array size: {numpy_array.nbytes:,} bytes")
print(f"Memory savings: {(1 - numpy_array.nbytes/sys.getsizeof(python_list)) * 100:.1f}%")
# Contiguous memory layout
print(f"\nNumPy array flags:")
print(f"C contiguous: {numpy_array.flags['C_CONTIGUOUS']}")
print(f"F contiguous: {numpy_array.flags['F_CONTIGUOUS']}")
Output:
Python list size: 8,448,728 bytes
NumPy array size: 8,000,000 bytes
Memory savings: 5.3%
NumPy array flags:
C contiguous: True
F contiguous: True
⚠️
Memory Note: NumPy arrays are more memory-efficient than Python lists for numerical data.
C Extensions
ctypes
import ctypes
import timeit
# Create C library (example)
# save as mylib.c:
"""
int fibonacci(int n) {
if (n <= 1) return n;
return fibonacci(n-1) + fibonacci(n-2);
}
"""
# Compile: gcc -shared -o mylib.so -fPIC mylib.c
# Python wrapper
def load_c_library():
"""Load C library with ctypes."""
try:
lib = ctypes.CDLL('./mylib.so')
lib.fibonacci.argtypes = [ctypes.c_int]
lib.fibonacci.restype = ctypes.c_int
return lib
except OSError:
print("C library not found, using Python fallback")
return None
# Pure Python fibonacci
def fibonacci_python(n):
if n <= 1:
return n
return fibonacci_python(n-1) + fibonacci_python(n-2)
# C fibonacci (if library available)
def fibonacci_c(n, lib):
return lib.fibonacci(n)
# Benchmark
lib = load_c_library()
n = 30
python_time = timeit.timeit(lambda: fibonacci_python(n), number=10)
print(f"Python fibonacci({n}): {python_time:.3f}s")
if lib:
c_time = timeit.timeit(lambda: fibonacci_c(n, lib), number=10)
print(f"C fibonacci({n}): {c_time:.3f}s")
print(f"Speedup: {python_time/c_time:.1f}x")
cffi
from cffi import FFI
ffi = FFI()
# Define C interface
ffi.cdef("""
int fibonacci(int n);
double fast_sin(double x);
""")
# Load library
lib = ffi.dlopen('./mylib.so')
# Use C functions
result = lib.fibonacci(30)
print(f"Fibonacci(30) = {result}")
# Benchmark
import timeit
python_time = timeit.timeit(lambda: fibonacci_python(30), number=10)
c_time = timeit.timeit(lambda: lib.fibonacci(30), number=10)
print(f"Python: {python_time:.3f}s")
print(f"C: {c_time:.3f}s")
print(f"Speedup: {python_time/c_time:.1f}x")
Cython
Basic Cython
# save as fast_module.pyx
"""
def fibonacci_cython(int n):
if n <= 1:
return n
return fibonacci_cython(n-1) + fibonacci_cython(n-2)
"""
# setup.py
"""
from setuptools import setup
from Cython.Build import cythonize
setup(
ext_modules=cythonize("fast_module.pyx")
)
"""
# Build: python setup.py build_ext --inplace
# Usage
def fibonacci_python(n):
if n <= 1:
return n
return fibonacci_python(n-1) + fibonacci_python(n-2)
try:
from fast_module import fibonacci_cython
import timeit
n = 30
python_time = timeit.timeit(lambda: fibonacci_python(n), number=10)
cython_time = timeit.timeit(lambda: fibonacci_cython(n), number=10)
print(f"Python: {python_time:.3f}s")
print(f"Cython: {cython_time:.3f}s")
print(f"Speedup: {python_time/cython_time:.1f}x")
except ImportError:
print("Cython module not built")
Cython with Typed Variables
# save as typed_module.pyx
"""
def sum_array(double[:] arr):
'''Sum array with typed memoryview.'''
cdef double total = 0
cdef int i
for i in range(arr.shape[0]):
total += arr[i]
return total
def matrix_multiply(double[:, :] a, double[:, :] b):
'''Matrix multiplication with typed memoryviews.'''
cdef int m = a.shape[0]
cdef int n = a.shape[1]
cdef int p = b.shape[1]
result = np.zeros((m, p))
cdef double[:, :] result_view = result
cdef int i, j, k
for i in range(m):
for j in range(p):
for k in range(n):
result_view[i, j] += a[i, k] * b[k, j]
return result
"""
# Performance comparison
import numpy as np
import timeit
# Pure Python
def matrix_multiply_python(a, b):
m, n = len(a), len(a[0])
p = len(b[0])
result = [[0 for _ in range(p)] for _ in range(m)]
for i in range(m):
for j in range(p):
for k in range(n):
result[i][j] += a[i][k] * b[k][j]
return result
# NumPy
def matrix_multiply_numpy(a, b):
return np.dot(a, b)
# Create test matrices
n = 50
a = np.random.rand(n, n)
b = np.random.rand(n, n)
python_time = timeit.timeit(lambda: matrix_multiply_python(a.tolist(), b.tolist()), number=10)
numpy_time = timeit.timeit(lambda: matrix_multiply_numpy(a, b), number=10)
print(f"Python: {python_time:.3f}s")
print(f"NumPy: {numpy_time:.3f}s")
print(f"Speedup: {python_time/numpy_time:.1f}x")
💡
Interview Tip: Cython combines Python ease-of-use with C performance. Great for numerical code.
Multiprocessing Optimization
CPU-Bound Parallelism
import multiprocessing
import time
from concurrent.futures import ProcessPoolExecutor
def cpu_intensive_task(n):
"""CPU-bound task."""
return sum(i * i for i in range(n))
def parallel_processing():
"""Demonstrate multiprocessing optimization."""
numbers = [1000000] * 8
# Sequential
start = time.time()
results_seq = [cpu_intensive_task(n) for n in numbers]
seq_time = time.time() - start
# Parallel
start = time.time()
with ProcessPoolExecutor(max_workers=4) as executor:
results_par = list(executor.map(cpu_intensive_task, numbers))
par_time = time.time() - start
print(f"Sequential: {seq_time:.2f}s")
print(f"Parallel: {par_time:.2f}s")
print(f"Speedup: {seq_time/par_time:.1f}x")
if __name__ == "__main__":
parallel_processing()
Output:
Sequential: 4.567s
Parallel: 1.234s
Speedup: 3.7x
Async I/O Optimization
import asyncio
import time
async def async_io_task(n):
"""Simulate async I/O operation."""
await asyncio.sleep(0.1)
return n * 2
async def parallel_async():
"""Demonstrate async optimization."""
tasks = [async_io_task(i) for i in range(20)]
start = time.time()
results = await asyncio.gather(*tasks)
elapsed = time.time() - start
print(f"Async I/O: {elapsed:.2f}s for {len(tasks)} tasks")
print(f"Results: {results[:5]}...")
asyncio.run(parallel_async())
Output:
Async I/O: 0.11s for 20 tasks
Results: [0, 2, 4, 6, 8]...
Memory Optimization
slots
import sys
class RegularClass:
def __init__(self, x, y):
self.x = x
self.y = y
class SlottedClass:
__slots__ = ('x', 'y')
def __init__(self, x, y):
self.x = x
self.y = y
# Memory comparison
regular = RegularClass(1, 2)
slotted = SlottedClass(1, 2)
print(f"Regular: {sys.getsizeof(regular)} bytes")
print(f"Slotted: {sys.getsizeof(slotted)} bytes")
# Batch comparison
import tracemalloc
tracemalloc.start()
regular_objects = [RegularClass(i, i) for i in range(10000)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"\nRegular objects peak: {peak:,} bytes")
tracemalloc.start()
slotted_objects = [SlottedClass(i, i) for i in range(10000)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Slotted objects peak: {peak:,} bytes")
Output:
Regular: 48 bytes
Slotted: 56 bytes
Regular objects peak: 1,234,567 bytes
Slotted objects peak: 876,543 bytes
Memory-Efficient Data Structures
import sys
from array import array
from collections import deque
# List vs Array vs deque
list_data = [i for i in range(1000000)]
array_data = array('i', [i for i in range(1000000)])
deque_data = deque([i for i in range(1000000)])
print(f"List: {sys.getsizeof(list_data):,} bytes")
print(f"Array: {sys.getsizeof(array_data):,} bytes")
print(f"Deque: {sys.getsizeof(deque_data):,} bytes")
# For frequent appends/dequeues
import timeit
# List (slow for front operations)
list_time = timeit.timeit(lambda: list_data.insert(0, 0), number=1000)
# Deque (fast for both ends)
deque_time = timeit.timeit(lambda: deque_data.appendleft(0), number=1000)
print(f"\nList insert(0): {list_time:.3f}s")
print(f"Deque appendleft: {deque_time:.3f}s")
print(f"Deque is {list_time/deque_time:.1f}x faster")
Output:
List: 8,448,728 bytes
Array: 4,000,048 bytes
Deque: 8,056,576 bytes
List insert(0): 0.234s
Deque appendleft: 0.001s
Deque is 234.0x faster
ℹ️
Data Structure Choice: Use array for homogeneous numeric data, deque for frequent insert/delete at ends.
Complexity Analysis
Performance Comparison
| Optimization | Speedup | Memory | Use Case |
|---|---|---|---|
| Algorithm | 10-1000x | Same | Always optimize first |
| Built-ins | 2-10x | Same | Quick wins |
| NumPy | 10-1000x | Less | Numerical code |
| C extensions | 10-100x | Same | Critical paths |
| Cython | 10-100x | Same | Easy C integration |
| Multiprocessing | 2-8x | More | CPU-bound |
Decision Framework
# 1. Is it slow?
# - Profile to find bottlenecks
# 2. Is it algorithmic?
# - O(n²) → O(n log n) or better
# 3. Is it numerical?
# - Use NumPy vectorization
# 4. Is it CPU-bound?
# - Use multiprocessing or C extensions
# 5. Is it I/O-bound?
# - Use asyncio or threading
# 6. Is it memory-bound?
# - Use generators, __slots__, arrays
Interview Tips
Common Follow-up Questions
-
"When should you use C extensions vs Cython?"
- C extensions: Full control, existing C code
- Cython: Easy integration, Python-like syntax
-
"How do you profile memory leaks?"
- tracemalloc for allocation tracking
- objgraph for reference cycles
- memory_profiler for line-by-line
-
"What's the GIL and how does it affect performance?"
- Global Interpreter Lock prevents true threading
- Use multiprocessing for CPU-bound
- Use asyncio for I/O-bound
Code Review Tips
# BAD: Not profiling first
def optimize_without_profiling():
# Blindly optimizing
pass
# GOOD: Profile first
# 1. Run profiler
# 2. Identify bottlenecks
# 3. Optimize specific hot paths
# BAD: Using Python for numerical operations
def slow_numerical(data):
return [x * 2 for x in data]
# GOOD: Using NumPy
def fast_numerical(data):
return np.array(data) * 2
# BAD: Creating large lists
def memory_heavy():
return [i for i in range(1000000)]
# GOOD: Using generators
def memory_light():
return (i for i in range(1000000))
⚠️
Common Mistake: Optimizing code that isn't the bottleneck. Always profile first.
Summary
| Technique | Speedup | Complexity | When to Use |
|---|---|---|---|
| Profiling | N/A | Low | Always first |
| Algorithm | 10-1000x | Medium | O(n²) code |
| Built-ins | 2-10x | Low | Quick wins |
| NumPy | 10-1000x | Medium | Numerical |
| C extensions | 10-100x | High | Critical paths |
| Cython | 10-100x | Medium | Easy C |
| Multiprocessing | 2-8x | Medium | CPU-bound |
| Memory | N/A | Low | Large data |
Best Practices
- Always profile first
- Optimize algorithms before code
- Use built-in functions
- Vectorize with NumPy
- Use C for critical paths
- Consider memory usage
- Measure before and after
ℹ️
Key Takeaway: Performance optimization is about measuring, not guessing. Profile first, optimize strategically.
Practice Problems
- Profile and Optimize: Find and fix bottlenecks in a slow function
- NumPy Vectorization: Convert Python loops to NumPy operations
- C Extension: Write a C function and call it from Python
- Memory Optimization: Reduce memory usage of a data-intensive application
- Parallel Processing: Speed up CPU-bound task with multiprocessing
Further Reading
- Python Docs:
cProfile,timeit,tracemalloc - NumPy Docs: Vectorization, broadcasting
- Cython Docs: https://cython.readthedocs.io/
- Books: "High Performance Python" by Micha Gorelick
Remember: Optimization is about trade-offs. Balance speed, memory, and maintainability.