Python Performance — Profiling & Optimization

Understanding performance helps you write faster, more efficient Python code. Always profile first to find actual bottlenecks before optimizing.

Learning Objectives

Profile code to find bottlenecks with cProfile and line_profiler
Use timeit for micro-benchmarks
Apply common Python performance optimizations
Use memory profiling tools and generators for memory efficiency
Understand Big O notation for Python operations
Optimize data structures and algorithms

timeit — Micro-benchmarks

Measure execution time of small code snippets:

import timeit

# Time a statement
time = timeit.timeit('sum(range(1000))', number=10000)
print(f"Total: {time:.4f}s")  # ~0.5s

# Time a function
def fib(n):
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)

time = timeit.timeit(lambda: fib(30), number=10)
print(f"Fib(30) x10: {time:.4f}s")

# Compare approaches
list_time = timeit.timeit(
    '[i ** 2 for i in range(1000)]',
    number=10000
)
gen_time = timeit.timeit(
    'sum(i ** 2 for i in range(1000))',
    number=10000
)
print(f"List comprehension: {list_time:.4f}s")
print(f"Generator: {gen_time:.4f}s")

Timeit from Command Line

# Time a statement
python -m timeit "sum(range(1000))"

# Time with setup code
python -m timeit -s "import random; data = [random.randint(0,100) for _ in range(1000)]" "sorted(data)"

# Compare multiple approaches
python -m timeit "l = [1,2,3,4,5]; l.append(6)"
python -m timeit "l = [1,2,3,4,5]; l + [6]"

cProfile — Function-Level Profiling

Find which functions take the most time:

import cProfile
import pstats

def slow_function():
    """Simulate slow code."""
    total = 0
    for i in range(1000000):
        total += i ** 2
    return total

def another_slow():
    """Another slow function."""
    data = [i for i in range(500000)]
    return sum(data)

# Profile execution
cProfile.run('slow_function(); another_slow()')

# More detailed profiling
profiler = cProfile.Profile()
profiler.enable()

slow_function()
another_slow()

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')  # Sort by cumulative time
stats.print_stats(10)  # Show top 10 functions

Profiling to File

import cProfile
import pstats

# Profile and save results
cProfile.run('slow_function()', 'profile_output')

# Load and analyze later
stats = pstats.Stats('profile_output')
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats(20)

# Save report to file
with open('profile_report.txt', 'w') as f:
    stats = pstats.Stats('profile_output', stream=f)
    stats.sort_stats('cumulative')
    stats.print_stats(20)

line_profiler — Line-by-Line Profiling

Find exactly which lines are slow:

# Install: pip install line_profiler
# Add @profile decorator
@profile
def process_data(data):
    filtered = [x for x in data if x > 0]         # Line 4
    squared = [x ** 2 for x in filtered]           # Line 5
    total = sum(squared)                            # Line 6
    average = total / len(filtered) if filtered else 0  # Line 7
    return average                                  # Line 8

# Run: kernprof -l -v script.py

Practical Line Profiling

# profile_example.py
import random

def generate_data(n):
    return [random.randint(0, 1000) for _ in range(n)]

def filter_data(data):
    return [x for x in data if x > 100]

def transform_data(data):
    return [x ** 2 for x in data]

def aggregate_data(data):
    return sum(data) / len(data) if data else 0

def main():
    data = generate_data(1000000)
    filtered = filter_data(data)
    transformed = transform_data(filtered)
    result = aggregate_data(transformed)
    return result

if __name__ == '__main__':
    main()

# Profile with: python -m line_profiler profile_example.py

Memory Profiling

# Install: pip install memory_profiler

from memory_profiler import profile

@profile
def memory_intensive():
    """Function with memory issues."""
    # List comprehension - high memory
    large_list = [i ** 2 for i in range(1000000)]

    # Filter creates another list
    filtered = [x for x in large_list if x % 2 == 0]

    # This doubles memory usage
    result = sum(filtered)

    return result

# Run: python -m memory_profiler script.py
# Output shows memory usage per line

Check Object Size

import sys

# Check size of different objects
print(f"int: {sys.getsizeof(42)} bytes")
print(f"float: {sys.getsizeof(3.14)} bytes")
print(f"string: {sys.getsizeof('hello')} bytes")
print(f"list: {sys.getsizeof([1, 2, 3])} bytes")
print(f"dict: {sys.getsizeof({'a': 1})} bytes")
print(f"set: {sys.getsizeof({1, 2, 3})} bytes")

# Large lists
big_list = list(range(1000000))
print(f"List of 1M ints: {sys.getsizeof(big_list)} bytes")
# ~8MB for the list object itself (each int is a separate object)

Common Python Optimizations

Use Sets for Membership Testing

import time

# Slow: O(n) lookup in list
large_list = list(range(1000000))
start = time.time()
999999 in large_list
print(f"List lookup: {time.time() - start:.6f}s")

# Fast: O(1) lookup in set
large_set = set(range(1000000))
start = time.time()
999999 in large_set
print(f"Set lookup: {time.time() - start:.6f}s")
# Sets are ~1000x faster for membership testing

Use Generators for Large Datasets

import sys

# List comprehension - loads all into memory
list_comp = [i ** 2 for i in range(1000000)]
print(f"List size: {sys.getsizeof(list_comp)} bytes")  # ~8MB

# Generator expression - lazy evaluation
gen_expr = (i ** 2 for i in range(1000000))
print(f"Generator size: {sys.getsizeof(gen_expr)} bytes")  # ~200 bytes

# Use generators for aggregation
total = sum(i ** 2 for i in range(1000000))  # Memory efficient

Use List Comprehensions Over Loops

import time

# Slow: manual loop
start = time.time()
result = []
for i in range(100000):
    result.append(i * 2)
print(f"Loop: {time.time() - start:.4f}s")

# Fast: list comprehension
start = time.time()
result = [i * 2 for i in range(100000)]
print(f"Comprehension: {time.time() - start:.4f}s")
# Comprehensions are ~30% faster due to optimized bytecode

Use Built-in Functions and Libraries

import time

data = list(range(100000))

# Slow: manual sum
start = time.time()
total = 0
for x in data:
    total += x
print(f"Manual sum: {time.time() - start:.4f}s")

# Fast: built-in sum
start = time.time()
total = sum(data)
print(f"Built-in sum: {time.time() - start:.4f}s")

# Fast: built-in sort
start = time.time()
sorted_data = sorted(data)
print(f"Built-in sort: {time.time() - start:.4f}s")

String Concatenation

import time

# Slow: string concatenation in loop
start = time.time()
result = ""
for i in range(10000):
    result += str(i)
print(f"Concatenation: {time.time() - start:.4f}s")

# Fast: join method
start = time.time()
result = "".join(str(i) for i in range(10000))
print(f"Join: {time.time() - start:.4f}s")

Big O Notation for Python

# O(1) — Constant time
def get_first(items):
    return items[0]  # Always one operation

# O(n) — Linear time
def find_item(items, target):
    for item in items:  # Checks each item
        if item == target:
            return True
    return False

# O(n²) — Quadratic time
def bubble_sort(items):
    n = len(items)
    for i in range(n):
        for j in range(n):  # Nested loop
            if items[j] > items[j+1]:
                items[j], items[j+1] = items[j+1], items[j]

# O(log n) — Logarithmic time
def binary_search(sorted_items, target):
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        elif sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1

# Python operation complexities:
# List append: O(1)
# List index: O(1)
# List contains: O(n)
# Dict get: O(1) average
# Dict contains: O(1) average
# Set contains: O(1) average
# Set add: O(1) average

Memory Optimization with slots

import sys

# Without __slots__
class PointRegular:
    def __init__(self, x, y):
        self.x = x
        self.y = y

# With __slots__
class PointOptimized:
    __slots__ = ('x', 'y')
    def __init__(self, x, y):
        self.x = x
        self.y = y

# Compare memory usage
regular = PointRegular(1, 2)
optimized = PointOptimized(1, 2)

print(f"Regular: {sys.getsizeof(regular)} bytes + dict: {sys.getsizeof(regular.__dict__)} bytes")
print(f"Optimized: {sys.getsizeof(optimized)} bytes")
# Optimized uses ~40% less memory per instance

# For millions of objects, this adds up significantly

Real-World Examples

Example 1: Optimizing a Slow Function

import time
from collections import Counter

# SLOW VERSION
def count_words_slow(text):
    """Count word frequencies - slow version."""
    words = text.lower().split()
    word_count = {}
    for word in words:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    return word_count

# OPTIMIZED VERSION
def count_words_fast(text):
    """Count word frequencies - fast version."""
    return Counter(text.lower().split())

# Even faster for large texts
def count_words_optimized(text):
    """Use Counter with generator for memory efficiency."""
    return Counter(word for word in text.lower().split() if word)

# Benchmark
text = "hello world " * 100000

start = time.time()
result1 = count_words_slow(text)
print(f"Slow: {time.time() - start:.4f}s")

start = time.time()
result2 = count_words_fast(text)
print(f"Fast: {time.time() - start:.4f}s")

Example 2: Database Query Optimization

import sqlite3
import time

def slow_query():
    """Inefficient database query pattern."""
    conn = sqlite3.connect('app.db')
    cursor = conn.cursor()

    # Bad: N+1 query problem
    users = cursor.execute("SELECT * FROM users").fetchall()
    for user in users:
        # This runs a query for EACH user!
        orders = cursor.execute(
            "SELECT * FROM orders WHERE user_id = ?", (user[0],)
        ).fetchall()

    return users

def fast_query():
    """Efficient database query pattern."""
    conn = sqlite3.connect('app.db')
    cursor = conn.cursor()

    # Good: Single JOIN query
    results = cursor.execute('''
        SELECT u.*, GROUP_CONCAT(o.id) as order_ids
        FROM users u
        LEFT JOIN orders o ON u.id = o.user_id
        GROUP BY u.id
    ''').fetchall()

    return results

Example 3: Caching Expensive Computations

import time
from functools import lru_cache

# Without caching
def fibonacci_slow(n):
    if n < 2:
        return n
    return fibonacci_slow(n-1) + fibonacci_slow(n-2)

# With caching
@lru_cache(maxsize=None)
def fibonacci_fast(n):
    if n < 2:
        return n
    return fibonacci_fast(n-1) + fibonacci_fast(n-2)

# Benchmark
start = time.time()
result1 = fibonacci_slow(35)
print(f"Slow fib(35): {result1} in {time.time() - start:.4f}s")

start = time.time()
result2 = fibonacci_fast(35)
print(f"Fast fib(35): {result2} in {time.time() - start:.4f}s")
# Caching makes it ~1000x faster

# Manual caching
cache = {}
def fibonacci_manual(n):
    if n in cache:
        return cache[n]
    if n < 2:
        return n
    result = fibonacci_manual(n-1) + fibonacci_manual(n-2)
    cache[n] = result
    return result

Common Mistakes

Mistake	Problem	Solution
Optimizing without profiling	Wastes time on non-bottlenecks	Profile first to find actual bottlenecks
Using lists for membership testing	O(n) lookup	Use sets for O(1) lookup
String concatenation in loops	O(n²) time	Use `"".join()`
Not using generators	High memory usage	Use generators for large datasets
Ignoring built-in functions	Slower than C implementations	Use `sum()`, `min()`, `max()`, `sorted()`
Not using `__slots__`	High memory per instance	Use `__slots__` for many instances

Best Practices

# 1. Profile before optimizing
import cProfile
cProfile.run('my_function()')

# 2. Use appropriate data structures
# Need fast lookup? Use dict or set
# Need ordered data? Use list
# Need sorted data? Use heapq or sorted()

# 3. Use built-in functions
total = sum(data)
minimum = min(data)
maximum = max(data)
sorted_data = sorted(data)

# 4. Use generators for memory efficiency
total = sum(x**2 for x in range(1000000))

# 5. Cache expensive computations
from functools import lru_cache

@lru_cache(maxsize=128)
def expensive_computation(n):
    return n ** n

# 6. Use __slots__ for many instances
class Point:
    __slots__ = ('x', 'y')
    def __init__(self, x, y):
        self.x = x
        self.y = y

Key Takeaways

Profile first — don't optimize blindly; find the actual bottleneck
Use sets for O(1) membership testing instead of lists
Use generators for memory-efficient iteration over large datasets
Use list comprehensions over manual loops — they're ~30% faster
Use built-in functions (sum, min, max, sorted) — they're implemented in C
Use __slots__ to reduce memory overhead for classes with many instances
Cache expensive computations with @lru_cache for repeated calls

Python Performance — Profiling & Optimization

Python Performance — Profiling & Optimization

Learning Objectives

timeit — Micro-benchmarks

Timeit from Command Line

cProfile — Function-Level Profiling

Profiling to File

line_profiler — Line-by-Line Profiling

Practical Line Profiling

Memory Profiling

Check Object Size

Common Python Optimizations

Use Sets for Membership Testing

Use Generators for Large Datasets

Use List Comprehensions Over Loops

Use Built-in Functions and Libraries

String Concatenation

Big O Notation for Python

Memory Optimization with slots

Real-World Examples

Example 1: Optimizing a Slow Function

Example 2: Database Query Optimization

Example 3: Caching Expensive Computations

Common Mistakes

Best Practices

Key Takeaways

Premium Content

Need Expert Python Help?