Python Performance — Profiling & Optimization
Understanding performance helps you write faster, more efficient Python code. Always profile first to find actual bottlenecks before optimizing.
Learning Objectives
- Profile code to find bottlenecks with cProfile and line_profiler
- Use timeit for micro-benchmarks
- Apply common Python performance optimizations
- Use memory profiling tools and generators for memory efficiency
- Understand Big O notation for Python operations
- Optimize data structures and algorithms
timeit — Micro-benchmarks
Measure execution time of small code snippets:
import timeit
# Time a statement
time = timeit.timeit('sum(range(1000))', number=10000)
print(f"Total: {time:.4f}s") # ~0.5s
# Time a function
def fib(n):
if n < 2:
return n
return fib(n-1) + fib(n-2)
time = timeit.timeit(lambda: fib(30), number=10)
print(f"Fib(30) x10: {time:.4f}s")
# Compare approaches
list_time = timeit.timeit(
'[i ** 2 for i in range(1000)]',
number=10000
)
gen_time = timeit.timeit(
'sum(i ** 2 for i in range(1000))',
number=10000
)
print(f"List comprehension: {list_time:.4f}s")
print(f"Generator: {gen_time:.4f}s")
Timeit from Command Line
# Time a statement
python -m timeit "sum(range(1000))"
# Time with setup code
python -m timeit -s "import random; data = [random.randint(0,100) for _ in range(1000)]" "sorted(data)"
# Compare multiple approaches
python -m timeit "l = [1,2,3,4,5]; l.append(6)"
python -m timeit "l = [1,2,3,4,5]; l + [6]"
cProfile — Function-Level Profiling
Find which functions take the most time:
import cProfile
import pstats
def slow_function():
"""Simulate slow code."""
total = 0
for i in range(1000000):
total += i ** 2
return total
def another_slow():
"""Another slow function."""
data = [i for i in range(500000)]
return sum(data)
# Profile execution
cProfile.run('slow_function(); another_slow()')
# More detailed profiling
profiler = cProfile.Profile()
profiler.enable()
slow_function()
another_slow()
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative') # Sort by cumulative time
stats.print_stats(10) # Show top 10 functions
Profiling to File
import cProfile
import pstats
# Profile and save results
cProfile.run('slow_function()', 'profile_output')
# Load and analyze later
stats = pstats.Stats('profile_output')
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats(20)
# Save report to file
with open('profile_report.txt', 'w') as f:
stats = pstats.Stats('profile_output', stream=f)
stats.sort_stats('cumulative')
stats.print_stats(20)
line_profiler — Line-by-Line Profiling
Find exactly which lines are slow:
# Install: pip install line_profiler
# Add @profile decorator
@profile
def process_data(data):
filtered = [x for x in data if x > 0] # Line 4
squared = [x ** 2 for x in filtered] # Line 5
total = sum(squared) # Line 6
average = total / len(filtered) if filtered else 0 # Line 7
return average # Line 8
# Run: kernprof -l -v script.py
Practical Line Profiling
# profile_example.py
import random
def generate_data(n):
return [random.randint(0, 1000) for _ in range(n)]
def filter_data(data):
return [x for x in data if x > 100]
def transform_data(data):
return [x ** 2 for x in data]
def aggregate_data(data):
return sum(data) / len(data) if data else 0
def main():
data = generate_data(1000000)
filtered = filter_data(data)
transformed = transform_data(filtered)
result = aggregate_data(transformed)
return result
if __name__ == '__main__':
main()
# Profile with: python -m line_profiler profile_example.py
Memory Profiling
# Install: pip install memory_profiler
from memory_profiler import profile
@profile
def memory_intensive():
"""Function with memory issues."""
# List comprehension - high memory
large_list = [i ** 2 for i in range(1000000)]
# Filter creates another list
filtered = [x for x in large_list if x % 2 == 0]
# This doubles memory usage
result = sum(filtered)
return result
# Run: python -m memory_profiler script.py
# Output shows memory usage per line
Check Object Size
import sys
# Check size of different objects
print(f"int: {sys.getsizeof(42)} bytes")
print(f"float: {sys.getsizeof(3.14)} bytes")
print(f"string: {sys.getsizeof('hello')} bytes")
print(f"list: {sys.getsizeof([1, 2, 3])} bytes")
print(f"dict: {sys.getsizeof({'a': 1})} bytes")
print(f"set: {sys.getsizeof({1, 2, 3})} bytes")
# Large lists
big_list = list(range(1000000))
print(f"List of 1M ints: {sys.getsizeof(big_list)} bytes")
# ~8MB for the list object itself (each int is a separate object)
Common Python Optimizations
Use Sets for Membership Testing
import time
# Slow: O(n) lookup in list
large_list = list(range(1000000))
start = time.time()
999999 in large_list
print(f"List lookup: {time.time() - start:.6f}s")
# Fast: O(1) lookup in set
large_set = set(range(1000000))
start = time.time()
999999 in large_set
print(f"Set lookup: {time.time() - start:.6f}s")
# Sets are ~1000x faster for membership testing
Use Generators for Large Datasets
import sys
# List comprehension - loads all into memory
list_comp = [i ** 2 for i in range(1000000)]
print(f"List size: {sys.getsizeof(list_comp)} bytes") # ~8MB
# Generator expression - lazy evaluation
gen_expr = (i ** 2 for i in range(1000000))
print(f"Generator size: {sys.getsizeof(gen_expr)} bytes") # ~200 bytes
# Use generators for aggregation
total = sum(i ** 2 for i in range(1000000)) # Memory efficient
Use List Comprehensions Over Loops
import time
# Slow: manual loop
start = time.time()
result = []
for i in range(100000):
result.append(i * 2)
print(f"Loop: {time.time() - start:.4f}s")
# Fast: list comprehension
start = time.time()
result = [i * 2 for i in range(100000)]
print(f"Comprehension: {time.time() - start:.4f}s")
# Comprehensions are ~30% faster due to optimized bytecode
Use Built-in Functions and Libraries
import time
data = list(range(100000))
# Slow: manual sum
start = time.time()
total = 0
for x in data:
total += x
print(f"Manual sum: {time.time() - start:.4f}s")
# Fast: built-in sum
start = time.time()
total = sum(data)
print(f"Built-in sum: {time.time() - start:.4f}s")
# Fast: built-in sort
start = time.time()
sorted_data = sorted(data)
print(f"Built-in sort: {time.time() - start:.4f}s")
String Concatenation
import time
# Slow: string concatenation in loop
start = time.time()
result = ""
for i in range(10000):
result += str(i)
print(f"Concatenation: {time.time() - start:.4f}s")
# Fast: join method
start = time.time()
result = "".join(str(i) for i in range(10000))
print(f"Join: {time.time() - start:.4f}s")
Big O Notation for Python
# O(1) — Constant time
def get_first(items):
return items[0] # Always one operation
# O(n) — Linear time
def find_item(items, target):
for item in items: # Checks each item
if item == target:
return True
return False
# O(n²) — Quadratic time
def bubble_sort(items):
n = len(items)
for i in range(n):
for j in range(n): # Nested loop
if items[j] > items[j+1]:
items[j], items[j+1] = items[j+1], items[j]
# O(log n) — Logarithmic time
def binary_search(sorted_items, target):
low, high = 0, len(sorted_items) - 1
while low <= high:
mid = (low + high) // 2
if sorted_items[mid] == target:
return mid
elif sorted_items[mid] < target:
low = mid + 1
else:
high = mid - 1
return -1
# Python operation complexities:
# List append: O(1)
# List index: O(1)
# List contains: O(n)
# Dict get: O(1) average
# Dict contains: O(1) average
# Set contains: O(1) average
# Set add: O(1) average
Memory Optimization with slots
import sys
# Without __slots__
class PointRegular:
def __init__(self, x, y):
self.x = x
self.y = y
# With __slots__
class PointOptimized:
__slots__ = ('x', 'y')
def __init__(self, x, y):
self.x = x
self.y = y
# Compare memory usage
regular = PointRegular(1, 2)
optimized = PointOptimized(1, 2)
print(f"Regular: {sys.getsizeof(regular)} bytes + dict: {sys.getsizeof(regular.__dict__)} bytes")
print(f"Optimized: {sys.getsizeof(optimized)} bytes")
# Optimized uses ~40% less memory per instance
# For millions of objects, this adds up significantly
Real-World Examples
Example 1: Optimizing a Slow Function
import time
from collections import Counter
# SLOW VERSION
def count_words_slow(text):
"""Count word frequencies - slow version."""
words = text.lower().split()
word_count = {}
for word in words:
if word in word_count:
word_count[word] += 1
else:
word_count[word] = 1
return word_count
# OPTIMIZED VERSION
def count_words_fast(text):
"""Count word frequencies - fast version."""
return Counter(text.lower().split())
# Even faster for large texts
def count_words_optimized(text):
"""Use Counter with generator for memory efficiency."""
return Counter(word for word in text.lower().split() if word)
# Benchmark
text = "hello world " * 100000
start = time.time()
result1 = count_words_slow(text)
print(f"Slow: {time.time() - start:.4f}s")
start = time.time()
result2 = count_words_fast(text)
print(f"Fast: {time.time() - start:.4f}s")
Example 2: Database Query Optimization
import sqlite3
import time
def slow_query():
"""Inefficient database query pattern."""
conn = sqlite3.connect('app.db')
cursor = conn.cursor()
# Bad: N+1 query problem
users = cursor.execute("SELECT * FROM users").fetchall()
for user in users:
# This runs a query for EACH user!
orders = cursor.execute(
"SELECT * FROM orders WHERE user_id = ?", (user[0],)
).fetchall()
return users
def fast_query():
"""Efficient database query pattern."""
conn = sqlite3.connect('app.db')
cursor = conn.cursor()
# Good: Single JOIN query
results = cursor.execute('''
SELECT u.*, GROUP_CONCAT(o.id) as order_ids
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id
''').fetchall()
return results
Example 3: Caching Expensive Computations
import time
from functools import lru_cache
# Without caching
def fibonacci_slow(n):
if n < 2:
return n
return fibonacci_slow(n-1) + fibonacci_slow(n-2)
# With caching
@lru_cache(maxsize=None)
def fibonacci_fast(n):
if n < 2:
return n
return fibonacci_fast(n-1) + fibonacci_fast(n-2)
# Benchmark
start = time.time()
result1 = fibonacci_slow(35)
print(f"Slow fib(35): {result1} in {time.time() - start:.4f}s")
start = time.time()
result2 = fibonacci_fast(35)
print(f"Fast fib(35): {result2} in {time.time() - start:.4f}s")
# Caching makes it ~1000x faster
# Manual caching
cache = {}
def fibonacci_manual(n):
if n in cache:
return cache[n]
if n < 2:
return n
result = fibonacci_manual(n-1) + fibonacci_manual(n-2)
cache[n] = result
return result
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Optimizing without profiling | Wastes time on non-bottlenecks | Profile first to find actual bottlenecks |
| Using lists for membership testing | O(n) lookup | Use sets for O(1) lookup |
| String concatenation in loops | O(n²) time | Use "".join() |
| Not using generators | High memory usage | Use generators for large datasets |
| Ignoring built-in functions | Slower than C implementations | Use sum(), min(), max(), sorted() |
Not using __slots__ | High memory per instance | Use __slots__ for many instances |
Best Practices
# 1. Profile before optimizing
import cProfile
cProfile.run('my_function()')
# 2. Use appropriate data structures
# Need fast lookup? Use dict or set
# Need ordered data? Use list
# Need sorted data? Use heapq or sorted()
# 3. Use built-in functions
total = sum(data)
minimum = min(data)
maximum = max(data)
sorted_data = sorted(data)
# 4. Use generators for memory efficiency
total = sum(x**2 for x in range(1000000))
# 5. Cache expensive computations
from functools import lru_cache
@lru_cache(maxsize=128)
def expensive_computation(n):
return n ** n
# 6. Use __slots__ for many instances
class Point:
__slots__ = ('x', 'y')
def __init__(self, x, y):
self.x = x
self.y = y
Key Takeaways
- Profile first — don't optimize blindly; find the actual bottleneck
- Use sets for O(1) membership testing instead of lists
- Use generators for memory-efficient iteration over large datasets
- Use list comprehensions over manual loops — they're ~30% faster
- Use built-in functions (
sum,min,max,sorted) — they're implemented in C - Use
__slots__to reduce memory overhead for classes with many instances - Cache expensive computations with
@lru_cachefor repeated calls