Hard Level: High-Performance Python Computing#

Real-World Context#

The Problem: Python is often criticized for being "slow" compared to C/C++. In reality, proper optimization can make Python competitive for many tasks, and understanding performance is crucial for:

  • Processing large datasets (millions/billions of records)

  • Real-time systems (trading, gaming, streaming)

  • Scientific computing and simulations

  • Machine learning model training

  • Web services handling millions of requests

Why This Matters:

  • Cost Savings: 10x faster code = 10x fewer servers

  • User Experience: Fast responses keep users engaged

  • Scalability: Efficient code handles more load

  • Energy: Less computation = lower power consumption

What You'll Learn:

  • How to profile and find bottlenecks

  • Memory and CPU optimization techniques

  • Just-In-Time (JIT) compilation with Numba

  • Vectorization with NumPy

  • Multiprocessing and threading

  • Asynchronous I/O for concurrency

  • Cython for C-speed Python


Part 1: Profiling - Finding the Bottleneck#

Golden Rule: "Premature optimization is the root of all evil" - Donald Knuth

Always profile first! 90% of execution time is typically spent in 10% of the code.

Profiling Tools#

  1. time/timeit: Simple timing

  2. cProfile: Built-in Python profiler

  3. line_profiler: Line-by-line profiling

  4. memory_profiler: Track memory usage

  5. py-spy: Sampling profiler (no code changes needed)

import time
import cProfile
import pstats
from io import StringIO

# Example: Slow vs Fast String Concatenation

def slow_concat(n):
    """Slow: O(nยฒ) due to string immutability."""
    result = ""
    for i in range(n):
        result += str(i)  # Creates new string each time
    return result

def fast_concat(n):
    """Fast: O(n) using list join."""
    parts = []
    for i in range(n):
        parts.append(str(i))
    return ''.join(parts)

# Timing comparison
n = 10000

start = time.perf_counter()
slow_result = slow_concat(n)
slow_time = time.perf_counter() - start

start = time.perf_counter()
fast_result = fast_concat(n)
fast_time = time.perf_counter() - start

print(f"Slow concat: {slow_time:.4f}s")
print(f"Fast concat: {fast_time:.4f}s")
print(f"Speedup: {slow_time/fast_time:.2f}x faster")

# Using timeit for more accurate measurements
import timeit

slow_avg = timeit.timeit(lambda: slow_concat(1000), number=100) / 100
fast_avg = timeit.timeit(lambda: fast_concat(1000), number=100) / 100

print(f"\nAverage over 100 runs (n=1000):")
print(f"Slow: {slow_avg*1000:.2f}ms")
print(f"Fast: {fast_avg*1000:.2f}ms")
# cProfile Example: Finding hotspots

def fibonacci_recursive(n):
    """Inefficient recursive fibonacci."""
    if n <= 1:
        return n
    return fibonacci_recursive(n-1) + fibonacci_recursive(n-2)

def fibonacci_iterative(n):
    """Efficient iterative fibonacci."""
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(n-1):
        a, b = b, a + b
    return b

# Profile the recursive version
profiler = cProfile.Profile()
profiler.enable()
result = fibonacci_recursive(25)
profiler.disable()

# Print stats
s = StringIO()
ps = pstats.Stats(profiler, stream=s).sort_stats('cumulative')
ps.print_stats(10)  # Top 10 functions
print("Recursive Fibonacci Profile:")
print(s.getvalue())

# Compare execution times
print("\nPerformance Comparison:")
n = 30

start = time.perf_counter()
rec_result = fibonacci_recursive(n)
rec_time = time.perf_counter() - start

start = time.perf_counter()
iter_result = fibonacci_iterative(n)
iter_time = time.perf_counter() - start

print(f"Recursive: {rec_time:.4f}s")
print(f"Iterative: {iter_time:.6f}s")
print(f"Speedup: {rec_time/iter_time:.0f}x faster!")

Key Profiling Insights#

What to Look For:

  1. High cumulative time: Functions called often or running long

  2. Number of calls: Unexpectedly high call counts indicate inefficiency

  3. Per-call time: Individual function slowness

Common Bottlenecks:

  • String concatenation in loops

  • Nested loops (O(n²) or worse)

  • Repeated file I/O

  • Database queries in loops (N+1 problem)

  • Inefficient algorithms (bubble sort vs quicksort)


Part 2: Memory Optimization#

Memory is often the limiting factor in data processing. Understanding memory usage prevents crashes and improves performance.

Memory Profiling#

import sys
from collections import namedtuple

# Memory comparison: List vs Tuple vs Generator

def memory_usage_demo():
    n = 1_000_000
    
    # List: stores all elements in memory
    list_data = [i for i in range(n)]
    list_size = sys.getsizeof(list_data)
    
    # Tuple: slightly more memory efficient
    tuple_data = tuple(i for i in range(n))
    tuple_size = sys.getsizeof(tuple_data)
    
    # Generator: minimal memory (just stores state)
    gen_data = (i for i in range(n))
    gen_size = sys.getsizeof(gen_data)
    
    print(f"Memory Usage for {n:,} integers:")
    print(f"List:      {list_size:>12,} bytes ({list_size/1024/1024:.2f} MB)")
    print(f"Tuple:     {tuple_size:>12,} bytes ({tuple_size/1024/1024:.2f} MB)")
    print(f"Generator: {gen_size:>12,} bytes ({gen_size/1024:.2f} KB)")
    print(f"\nGenerator is {list_size/gen_size:.0f}x more memory efficient!")

memory_usage_demo()

# Slots for memory-efficient classes
class Point:
    """Regular class with __dict__."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointSlots:
    """Memory-efficient class with __slots__."""
    __slots__ = ['x', 'y']
    
    def __init__(self, x, y):
        self.x = x
        self.y = y

# Named tuple: even more efficient
PointNT = namedtuple('PointNT', ['x', 'y'])

# Compare memory usage
p1 = Point(1, 2)
p2 = PointSlots(1, 2)
p3 = PointNT(1, 2)

print("\nClass Memory Comparison:")
print(f"Regular class: {sys.getsizeof(p1) + sys.getsizeof(vars(p1))} bytes")
print(f"With __slots__: {sys.getsizeof(p2)} bytes")
print(f"Named tuple: {sys.getsizeof(p3)} bytes")

# For millions of objects, this difference is massive!
n_objects = 1_000_000
regular_memory = n_objects * (sys.getsizeof(p1) + sys.getsizeof(vars(p1)))
slots_memory = n_objects * sys.getsizeof(p2)

print(f"\nFor {n_objects:,} objects:")
print(f"Regular: {regular_memory/1024/1024:.2f} MB")
print(f"Slots: {slots_memory/1024/1024:.2f} MB")
print(f"Savings: {(regular_memory - slots_memory)/1024/1024:.2f} MB")

Memory Optimization Techniques#

  1. Use generators for large sequences

  2. Use __slots__ for classes with many instances

  3. del unused variables in long-running functions

  4. gc.collect() to force garbage collection

  5. array module / NumPy arrays for homogeneous numeric data

  6. Memory-mapped files for huge datasets (items 5 and 6 are sketched below)

  7. Lazy loading - load data only when needed
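
A minimal sketch of items 5 and 6, using the standard library's array module and NumPy's memmap (big_data.bin is an illustrative filename):

import array
import numpy as np

n = 1_000_000

# array.array stores raw C doubles (8 bytes each), not full Python float objects
compact = array.array('d', (0.0 for _ in range(n)))
print(f"array.array payload: {compact.itemsize * len(compact) / 1024 / 1024:.1f} MB")

# np.memmap maps a file into virtual memory; the OS pages data in only when touched
mm = np.memmap('big_data.bin', dtype=np.float64, mode='w+', shape=(n,))
mm[:1000] = 1.0  # only the touched pages become resident in RAM
print(f"Mean of first 1000 memmapped values: {mm[:1000].mean():.1f}")
mm.flush()  # write modified pages back to disk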


Part 3: NumPy Vectorization#

NumPy operations are 10-100x faster than pure Python because they:

  1. Use compiled C code

  2. Operate on contiguous memory blocks

  3. Avoid Python interpreter overhead

  4. Use SIMD instructions when possible

import numpy as np

# Example: Sum of squares

def sum_of_squares_python(n):
    """Pure Python implementation."""
    total = 0
    for i in range(n):
        total += i ** 2
    return total

def sum_of_squares_numpy(n):
    """Vectorized NumPy implementation."""
    arr = np.arange(n, dtype=np.int64)  # explicit dtype: the default int32 on Windows would overflow
    return np.sum(arr ** 2)

# Benchmark
n = 1_000_000

# Python version
start = time.perf_counter()
result_py = sum_of_squares_python(n)
time_py = time.perf_counter() - start

# NumPy version
start = time.perf_counter()
result_np = sum_of_squares_numpy(n)
time_np = time.perf_counter() - start

print(f"Sum of squares for {n:,} numbers:")
print(f"Python: {time_py:.4f}s")
print(f"NumPy:  {time_np:.4f}s")
print(f"Speedup: {time_py/time_np:.1f}x faster!")

# More complex example: Distance matrix
def euclidean_distance_python(points):
    """Compute pairwise distances - Python loops."""
    n = len(points)
    distances = [[0.0] * n for _ in range(n)]
    
    for i in range(n):
        for j in range(n):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            distances[i][j] = (dx**2 + dy**2) ** 0.5
    
    return distances

def euclidean_distance_numpy(points):
    """Compute pairwise distances - NumPy broadcasting."""
    # Reshape for broadcasting: (n, 1, 2) - (1, n, 2) = (n, n, 2)
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

# Test with random points
np.random.seed(42)
n_points = 1000
points_list = [[np.random.rand(), np.random.rand()] for _ in range(n_points)]
points_array = np.array(points_list)

# Benchmark (use smaller n for Python version)
n_test = 100

start = time.perf_counter()
dist_py = euclidean_distance_python(points_list[:n_test])
time_py = time.perf_counter() - start

start = time.perf_counter()
dist_np = euclidean_distance_numpy(points_array[:n_test])
time_np = time.perf_counter() - start

print(f"\nDistance matrix for {n_test} points:")
print(f"Python: {time_py:.4f}s")
print(f"NumPy:  {time_np:.6f}s")
print(f"Speedup: {time_py/time_np:.1f}x faster!")

# NumPy can handle much larger datasets
start = time.perf_counter()
dist_large = euclidean_distance_numpy(points_array)  # All 1000 points
time_large = time.perf_counter() - start
print(f"\nNumPy with 1000 points: {time_large:.4f}s")
print(f"Creates {1000*1000:,} distance calculations!")

Vectorization Best Practices#

Do:

  • Use NumPy ufuncs (universal functions) instead of loops

  • Leverage broadcasting for element-wise operations

  • Use built-in aggregations (sum, mean, max, etc.)

  • Preallocate arrays when possible

Don't:

  • Loop over NumPy arrays (defeats the purpose)

  • Use .append() in loops (very slow; see the sketch after this list)

  • Mix Python loops with NumPy operations

  • Create unnecessary array copies
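
A minimal sketch of the append vs. preallocate advice, building the same array of squares two ways (function names are illustrative):

import time
import numpy as np

def squares_append(n):
    """Anti-pattern: grow a Python list element by element, then convert."""
    out = []
    for i in range(n):
        out.append(i ** 2)
    return np.array(out, dtype=np.float64)

def squares_prealloc(n):
    """Preallocate once, then fill with a single vectorized write."""
    out = np.empty(n, dtype=np.float64)
    out[:] = np.arange(n, dtype=np.float64) ** 2  # one bulk operation, no Python loop
    return out

n = 1_000_000
start = time.perf_counter(); a = squares_append(n); t_append = time.perf_counter() - start
start = time.perf_counter(); b = squares_prealloc(n); t_prealloc = time.perf_counter() - start

print(f"append loop:  {t_append:.4f}s")
print(f"preallocated: {t_prealloc:.4f}s (match: {np.array_equal(a, b)})")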


Part 4: Just-In-Time Compilation with Numba#

Numba compiles Python functions to machine code at runtime, achieving C-like performance with minimal code changes.

# Install: pip install numba
try:
    from numba import jit, njit, prange
    NUMBA_AVAILABLE = True
except ImportError:
    print("Numba not installed. Run: pip install numba")
    NUMBA_AVAILABLE = False

if NUMBA_AVAILABLE:
    # Example 1: Simple numerical computation
    
    import random  # stdlib RNG so the baseline is genuinely pure Python

    def monte_carlo_pi_python(n_samples):
        """Estimate π using Monte Carlo - Pure Python."""
        inside_circle = 0
        for i in range(n_samples):
            x = random.random()
            y = random.random()
            if x*x + y*y <= 1.0:
                inside_circle += 1
        return 4.0 * inside_circle / n_samples
    
    @jit(nopython=True)  # Compile to machine code
    def monte_carlo_pi_numba(n_samples):
        """Estimate ฯ€ using Monte Carlo - Numba JIT."""
        inside_circle = 0
        for i in range(n_samples):
            x = np.random.random()
            y = np.random.random()
            if x*x + y*y <= 1.0:
                inside_circle += 1
        return 4.0 * inside_circle / n_samples
    
    # Benchmark
    n = 10_000_000
    
    # Warm up Numba (first call compiles)
    _ = monte_carlo_pi_numba(100)
    
    start = time.perf_counter()
    pi_python = monte_carlo_pi_python(n)
    time_py = time.perf_counter() - start
    
    start = time.perf_counter()
    pi_numba = monte_carlo_pi_numba(n)
    time_numba = time.perf_counter() - start
    
    print(f"Monte Carlo ฯ€ estimation ({n:,} samples):")
    print(f"Python: {pi_python:.6f} in {time_py:.4f}s")
    print(f"Numba:  {pi_numba:.6f} in {time_numba:.4f}s")
    print(f"Speedup: {time_py/time_numba:.1f}x faster!")
    
    # Example 2: Parallel execution
    
    @njit(parallel=True)
    def parallel_sum_of_squares(n):
        """Parallel computation with Numba."""
        total = 0.0  # float accumulator: the exact sum overflows int64 for n = 100 million
        for i in prange(n):  # Parallel range
            total += i * i
        return total

    # Compare with sequential
    @njit
    def sequential_sum_of_squares(n):
        """Sequential Numba computation."""
        total = 0.0  # float accumulator, same overflow reasoning as above
        for i in range(n):
            total += i * i
        return total
    
    # Warm up
    _ = parallel_sum_of_squares(100)
    _ = sequential_sum_of_squares(100)
    
    n = 100_000_000
    
    start = time.perf_counter()
    result_seq = sequential_sum_of_squares(n)
    time_seq = time.perf_counter() - start
    
    start = time.perf_counter()
    result_par = parallel_sum_of_squares(n)
    time_par = time.perf_counter() - start
    
    print(f"\nSum of squares ({n:,}):")
    print(f"Sequential: {time_seq:.4f}s")
    print(f"Parallel:   {time_par:.4f}s")
    print(f"Speedup: {time_seq/time_par:.1f}x faster!")
else:
    print("Numba examples skipped (not installed)")

Numba Guidelines#

When to use Numba:

  • Numerical computations with loops

  • Array operations that can't be vectorized

  • Physics simulations, Monte Carlo methods

  • Signal processing algorithms

Limitations:

  • Limited Python feature support (no complex objects)

  • First call is slow (compilation)

  • Not useful for I/O-bound code

  • NumPy is still faster for pure vectorized ops

Best Practices:

  • Use nopython=True (or @njit) for maximum speed

  • Avoid object-mode fallback

  • Use parallel=True for CPU-bound loops

  • Cache compiled functions with cache=True (sketched below)
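
A minimal sketch of cache=True, assuming Numba is installed (dot_product is an illustrative function). The compiled machine code is stored on disk next to the source, so later runs in fresh processes skip the compilation pause:

import numpy as np
from numba import njit

@njit(cache=True)  # persist the compiled function; later runs load it instead of recompiling
def dot_product(a, b):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

v = np.arange(1_000, dtype=np.float64)
print(dot_product(v, v))  # first call in a fresh process loads the cached binary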


Part 5: Multiprocessing and Threading#

Python has two main concurrency models:

  • Threading: For I/O-bound tasks (GIL limits CPU parallelism)

  • Multiprocessing: For CPU-bound tasks (bypasses GIL)

The Global Interpreter Lock (GIL)#

The GIL prevents multiple threads from executing Python code simultaneously. This means:

  • Threads: Good for I/O (network, disk)

  • Processes: Needed for CPU parallelism

import threading
import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

# CPU-bound task for testing
def compute_intensive(n):
    """Simulate CPU-intensive work."""
    total = 0
    for i in range(n):
        total += i ** 2
    return total

# Sequential baseline
def sequential_processing(tasks):
    """Process tasks sequentially."""
    return [compute_intensive(n) for n in tasks]

# Threading (won't help for CPU-bound due to GIL)
def threaded_processing(tasks, max_workers=4):
    """Process tasks with threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(compute_intensive, tasks))

# Multiprocessing (bypasses GIL)
def multiprocess_processing(tasks, max_workers=4):
    """Process tasks with separate processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(compute_intensive, tasks))

# Benchmark
if __name__ == "__main__":
    tasks = [5_000_000] * 8  # 8 CPU-intensive tasks

    print("Processing 8 CPU-intensive tasks:")

    start = time.perf_counter()
    result_seq = sequential_processing(tasks)
    time_seq = time.perf_counter() - start
    print(f"Sequential:      {time_seq:.2f}s")

    start = time.perf_counter()
    result_thread = threaded_processing(tasks, max_workers=4)
    time_thread = time.perf_counter() - start
    print(f"Threading (4):   {time_thread:.2f}s (Speedup: {time_seq/time_thread:.2f}x)")

    start = time.perf_counter()
    result_multi = multiprocess_processing(tasks, max_workers=4)
    time_multi = time.perf_counter() - start
    print(f"Multiprocess(4): {time_multi:.2f}s (Speedup: {time_seq/time_multi:.2f}x)")

    print("\nKey Insight: Multiprocessing provides real CPU parallelism!")
# I/O-bound example: Threading shines here
import urllib.request

def fetch_url(url):
    """Simulate I/O-bound operation."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return len(response.read())
    except Exception:  # treat any network error as an empty response for this demo
        return 0

urls = [
    'http://www.example.com',
    'http://www.python.org',
    'http://www.github.com',
    'http://www.stackoverflow.com',
] * 3  # 12 requests total

# Sequential
print("Fetching 12 URLs:")
start = time.perf_counter()
sizes_seq = [fetch_url(url) for url in urls]
time_seq = time.perf_counter() - start
print(f"Sequential: {time_seq:.2f}s")

# Threaded (much better for I/O)
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    sizes_thread = list(executor.map(fetch_url, urls))
time_thread = time.perf_counter() - start
print(f"Threading:  {time_thread:.2f}s (Speedup: {time_seq/time_thread:.2f}x)")

print("\nFor I/O-bound tasks, threading is ideal!")

Concurrency Decision Tree#

Is your task CPU-bound or I/O-bound?
│
├─ CPU-bound (computation, data processing)
│  │
│  ├─ Vectorizable? → Use NumPy
│  ├─ Loops required? → Use Numba
│  └─ Multiple cores? → Use multiprocessing
│
└─ I/O-bound (network, disk, database)
   │
   ├─ Many concurrent operations? → Use asyncio
   └─ Mixed I/O and CPU? → Use threading

Part 6: Asynchronous I/O with asyncio#

asyncio enables concurrent I/O without threading overhead. Perfect for:

  • Web scraping

  • API calls

  • Database queries

  • Real-time applications

import asyncio

# Basic async example
async def fetch_data(url, delay):
    """Simulate async data fetching."""
    print(f"Fetching {url}...")
    await asyncio.sleep(delay)  # Simulate network delay
    print(f"Completed {url}")
    return f"Data from {url}"

async def sequential_async():
    """Fetch sequentially (awaits each)."""
    result1 = await fetch_data("API1", 2)
    result2 = await fetch_data("API2", 2)
    result3 = await fetch_data("API3", 2)
    return [result1, result2, result3]

async def concurrent_async():
    """Fetch concurrently (gather)."""
    tasks = [
        fetch_data("API1", 2),
        fetch_data("API2", 2),
        fetch_data("API3", 2),
    ]
    return await asyncio.gather(*tasks)

# Top-level await works in Jupyter/IPython; in a plain script, use asyncio.run(...)
print("Sequential async (total: ~6 seconds):")
start = time.perf_counter()
results_seq = await sequential_async()
time_seq = time.perf_counter() - start
print(f"Completed in {time_seq:.2f}s\n")

print("Concurrent async (total: ~2 seconds):")
start = time.perf_counter()
results_con = await concurrent_async()
time_con = time.perf_counter() - start
print(f"Completed in {time_con:.2f}s")
print(f"Speedup: {time_seq/time_con:.1f}x faster!")
# Real-world async pattern: Rate-limited API calls

async def rate_limited_fetch(semaphore, url, delay):
    """Fetch with concurrency limit."""
    async with semaphore:  # Limit concurrent requests
        print(f"Starting {url}")
        await asyncio.sleep(delay)
        print(f"Finished {url}")
        return f"Data from {url}"

async def fetch_many_with_limit(urls, max_concurrent=3):
    """Fetch many URLs with concurrency limit."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [rate_limited_fetch(semaphore, url, 1) for url in urls]
    return await asyncio.gather(*tasks)

# Fetch 10 URLs but only 3 at a time
urls = [f"URL{i}" for i in range(10)]

print("Fetching 10 URLs with max 3 concurrent:")
start = time.perf_counter()
results = await fetch_many_with_limit(urls, max_concurrent=3)
elapsed = time.perf_counter() - start

print(f"\nCompleted in {elapsed:.2f}s")
print(f"With 3 concurrent, 10 URLs take ~4 batches: ceil(10/3) = 4 seconds")

Async Best Practices#

Do:

  • Use async for I/O-bound concurrent operations

  • Use asyncio.gather() for parallel awaits

  • Use asyncio.Semaphore() for rate limiting

  • Use aiohttp for async HTTP requests

Don't:

  • Use async for CPU-bound tasks (they block the event loop; see the sketch below for the escape hatch)

  • Mix sync and async APIs carelessly

  • Forget to await async functions

  • Use time.sleep() instead of asyncio.sleep()
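
When some blocking or CPU-heavy call is unavoidable inside async code, hand it to a worker thread so the event loop stays responsive. A minimal sketch using asyncio.to_thread (Python 3.9+; hash_bytes is an illustrative blocking function):

import asyncio
import hashlib

def hash_bytes(data):
    """Blocking CPU work: hash a large payload."""
    return hashlib.sha256(data).hexdigest()

async def main():
    payload = b"x" * 50_000_000
    # Runs in a thread pool; other coroutines keep running meanwhile
    digest = await asyncio.to_thread(hash_bytes, payload)
    print(digest[:16])

# In a script: asyncio.run(main()); in a notebook, just: await main()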


Part 7: Cython - C Extension Speed#

Cython compiles Python-like code to C extensions for maximum performance.

When to use Cython:

  • Need C-level performance

  • Interfacing with C libraries

  • Tight loops that can't be vectorized

  • Numba isn't enough

Cython Example (fibonacci.pyx)#

# fibonacci.pyx
def fib_cython(int n):
    # C ints are fast but overflow past fib(46); use 'long long' for larger n
    cdef int a = 0, b = 1, temp
    cdef int i

    if n <= 1:
        return n

    for i in range(n - 1):
        temp = a
        a = b
        b = temp + b

    return b

Compile with:

cython fibonacci.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing \
    -I/usr/include/python3.x \
    -o fibonacci.so fibonacci.c
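
In practice, most projects build the extension through setuptools instead of invoking gcc by hand; a minimal setup.py sketch:

# setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("fibonacci.pyx"))

Build in place with python setup.py build_ext --inplace, then import fibonacci like any other module.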

Performance: 10-100x faster than pure Python for numerical code.


Part 8: Performance Optimization Checklist#

1. Profile First ✓#

  • Use cProfile to find hotspots

  • Use line_profiler for detailed analysis

  • Use memory_profiler for memory issues

2. Algorithm Optimization ✓#

  • Choose the right data structure (dict vs list vs set)

  • Use an appropriate algorithm (O(n log n) vs O(n²))

  • Cache repeated calculations

  • Avoid premature optimization

3. Data Structure Choice ✓#

  • Use set for membership testing (sketched below)

  • Use dict for lookups

  • Use collections.deque for queues

  • Use collections.Counter for counting
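
A minimal sketch of why set wins for membership testing: in on a list scans linearly, while a set does a constant-time hash lookup:

import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)

# Probe an element near the end - the worst case for the list scan
t_list = timeit.timeit(lambda: (n - 1) in as_list, number=1_000)
t_set = timeit.timeit(lambda: (n - 1) in as_set, number=1_000)

print(f"list membership: {t_list:.4f}s")
print(f"set membership:  {t_set:.6f}s ({t_list / t_set:.0f}x faster)")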

4. Vectorization ✓#

  • Replace loops with NumPy operations

  • Use broadcasting for element-wise ops

  • Avoid mixing loops and NumPy

5. Compilation ✓#

  • Use Numba for numerical code

  • Use Cython for C-level performance

  • Consider PyPy for long-running services

6. Concurrency ✓#

  • Multiprocessing for CPU-bound

  • Threading for I/O-bound (with caution)

  • Asyncio for high-concurrency I/O

7. Memory Optimization ✓#

  • Use generators for large sequences

  • Use __slots__ for many objects

  • del unused variables

  • Use memory-mapped files for huge data

8. I/O Optimization ✓#

  • Batch database queries (sketched below)

  • Use connection pooling

  • Cache file reads

  • Use binary formats (pickle, parquet)
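
A minimal sketch of query batching with the standard library's sqlite3 (table and data are illustrative); one executemany call in one transaction replaces thousands of round trips:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
rows = [("sensor-1", float(i)) for i in range(10_000)]

# Anti-pattern: one INSERT (and potentially one commit) per row
# for row in rows:
#     conn.execute("INSERT INTO readings VALUES (?, ?)", row)

with conn:  # single transaction
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)

print(conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0])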

9. String Operations ✓#

  • Use join() instead of += in loops

  • Use f-strings for formatting

  • Use str.format() for templates

  • Compile regex patterns once (sketched below)
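
A minimal sketch of precompiling a pattern. Note that re keeps an internal cache of recently compiled patterns, so the gap is modest, but precompiling still avoids repeated cache lookups in hot loops:

import re
import timeit

text = "order 123 shipped, order 456 pending, order 789 delivered" * 100
NUMBER = re.compile(r"\d+")  # compile once, at module level

t_compiled = timeit.timeit(lambda: NUMBER.findall(text), number=1_000)
t_inline = timeit.timeit(lambda: re.findall(r"\d+", text), number=1_000)

print(f"precompiled: {t_compiled:.4f}s")
print(f"inline:      {t_inline:.4f}s")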

10. Measure Results ✓#

  • Always benchmark optimizations

  • Use realistic data sizes

  • Test edge cases

  • Document performance characteristics


Part 9: Exercises#

Exercise 1: Optimize Matrix Multiplication (Difficulty: ★★★☆☆)#

Task: Implement matrix multiplication three ways and compare performance:

  1. Pure Python with nested loops

  2. NumPy vectorized

  3. Numba JIT compiled

Test with 500×500 matrices and measure speedup.

Expected outcome: NumPy should be 50-100x faster, Numba similar to NumPy.


Exercise 2: Parallel File Processing (Difficulty: ★★★★☆)#

Task: Process 100 text files in parallel:

  1. Read each file

  2. Count word frequency

  3. Return top 10 words

Implement using:

  • Sequential processing

  • ThreadPoolExecutor

  • ProcessPoolExecutor

Which is faster and why?


Exercise 3: Memory-Efficient Data Processing (Difficulty: ★★★☆☆)#

Task: Process a 1GB CSV file:

  1. Read and parse without loading entire file

  2. Calculate statistics (mean, median, std)

  3. Use generator-based approach

Compare memory usage vs loading all data.


Exercise 4: Async Web Scraper (Difficulty: ★★★★☆)#

Task: Build an async web scraper:

  1. Fetch 50 URLs concurrently

  2. Limit to 10 concurrent connections

  3. Handle errors gracefully

  4. Track completion time

Compare with sequential approach.


Exercise 5: Cache Decorator (Difficulty: ★★★☆☆)#

Task: Implement a caching decorator with:

  1. LRU cache (max size)

  2. TTL (time-to-live)

  3. Cache statistics (hits/misses)

Test with expensive function calls.


Exercise 6: Profile and Optimize (Difficulty: ★★★★★)#

Task: Given this slow code, optimize it:

def slow_function(data):
    result = []
    for item in data:
        if item not in result:
            result.append(item)
    return sorted(result)

  1. Profile to find bottlenecks

  2. Optimize (hint: use set)

  3. Measure speedup

  4. Document complexity improvement


Part 10: Self-Check Quiz#

Question 1#

What is the primary advantage of using generators over lists for large datasets?

A) Generators are faster
B) Generators use less memory
C) Generators are easier to write
D) Generators can be pickled

Answer B) Generators use less memory

Explanation: Generators compute values on-demand and don't store all elements in memory, making them ideal for large or infinite sequences.


Question 2#

When should you use multiprocessing instead of threading in Python?

A) For I/O-bound tasks
B) For CPU-bound tasks
C) When you need shared memory
D) When you want faster startup

Answer B) For CPU-bound tasks

Explanation: The GIL prevents threads from executing Python code in parallel on multiple cores. Multiprocessing bypasses the GIL by using separate processes.


Question 3#

What makes NumPy significantly faster than pure Python for numerical operations?

A) It's written in a faster language
B) It uses compiled C code and operates on contiguous memory
C) It automatically parallelizes operations
D) It caches all results

Answer B) It uses compiled C code and operates on contiguous memory

Explanation: NumPy operations are implemented in C and work on contiguous memory blocks, avoiding Python interpreter overhead and enabling CPU cache efficiency.


Question 4#

What is the main limitation of asyncio in Python?

A) It's too complex
B) It doesn't work with CPU-bound tasks
C) It requires special hardware
D) It only works on Linux

Answer B) It doesn't work with CPU-bound tasks

Explanation: Asyncio is designed for I/O-bound concurrency. CPU-bound tasks block the event loop, preventing other tasks from running.


Question 5#

What does the @jit decorator in Numba do?

A) Distributes code across multiple machines
B) Compiles Python functions to machine code at runtime
C) Caches function results
D) Converts code to JavaScript

Answer B) Compiles Python functions to machine code at runtime

Explanation: Numba's JIT (Just-In-Time) compiler translates Python functions to optimized machine code when first called, achieving C-like performance.


Key Takeaways#

  1. Profile first: Don't guess where the bottleneck is

  2. Algorithm matters: O(n) vs O(n²) beats any micro-optimization

  3. Use the right tool: NumPy for arrays, Numba for loops, asyncio for I/O

  4. Memory is finite: Use generators and streaming for large data

  5. GIL awareness: Multiprocessing for CPU, threading for I/O

  6. Vectorization wins: Avoid Python loops on arrays when possible

  7. Measure everything: Benchmark before and after optimization

  8. Premature optimization: Focus on correctness first, then optimize

  9. Compiled beats interpreted: Numba/Cython for critical hot paths

  10. Concurrency ≠ Parallelism: Understand the difference


Common Mistakes to Avoid#

  1. Optimizing before profiling - Wasting time on non-bottlenecks

  2. Using threads for CPU-bound work - GIL prevents speedup

  3. Loops on NumPy arrays - Defeats vectorization benefits

  4. Blocking asyncio event loop - CPU work in async functions

  5. Premature multiprocessing - Process overhead can be worse

  6. Ignoring memory usage - an out-of-memory kill is worse than slow execution

  7. Not caching expensive calls - Repeated work

  8. String concatenation in loops - O(n²) behavior

  9. Wrong data structure - list vs set for membership testing

  10. Not testing at scale - Small data hides problems


Pro Tips#

  1. Use cProfile + SnakeViz: Visual profiling is enlightening

  2. Try PyPy: 5-10x speedup for long-running pure Python code

  3. Learn NumPy broadcasting: Eliminates many loops

  4. Cache expensive I/O: Redis, memcached, or functools.lru_cache

  5. Use binary formats: Pickle, HDF5, Parquet beat JSON/CSV

  6. Batch operations: Database queries, API calls

  7. Measure memory: sys.getsizeof(), memory_profiler

  8. Use __slots__: For classes with many instances

  9. Compile regexes: re.compile() once, use many times

  10. Monitor production: Profile in real environments


What's Next?#

You've mastered high-performance Python! Next topics:

  1. GPU Computing: CUDA, CuPy, JAX (next notebook!)

  2. Distributed Computing: Dask, Ray, Spark

  3. Advanced Profiling: perf, valgrind, flame graphs

  4. C Extensions: ctypes, CFFI, Cython mastery

  5. Alternative Runtimes: PyPy, GraalPython

Practice Projects:

  • Optimize a real application (10x+ speedup possible!)

  • Build a parallel data processing pipeline

  • Create a high-performance web service

  • Implement numerical algorithms in Numba

Remember: Fast code is good. Correct code is better. Maintainable code is best. Optimize where it matters!

Ready for GPU computing? Proceed to the CUDA & Parallel Computing notebook! 🚀