Hard Level: High-Performance Python Computing#
Real-World Context#
The Problem: Python is often criticized for being “slow” compared to C/C++. In reality, proper optimization can make Python competitive for many tasks, and understanding performance is crucial for:
Processing large datasets (millions/billions of records)
Real-time systems (trading, gaming, streaming)
Scientific computing and simulations
Machine learning model training
Web services handling millions of requests
Why This Matters:
Cost Savings: 10x faster code = 10x fewer servers
User Experience: Fast responses keep users engaged
Scalability: Efficient code handles more load
Energy: Less computation = lower power consumption
What You'll Learn:
How to profile and find bottlenecks
Memory and CPU optimization techniques
Just-In-Time (JIT) compilation with Numba
Vectorization with NumPy
Multiprocessing and threading
Asynchronous I/O for concurrency
Cython for C-speed Python
Part 1: Profiling - Finding the Bottleneck#
Golden Rule: “Premature optimization is the root of all evil” - Donald Knuth
Always profile first! 90% of execution time is typically spent in 10% of the code.
Profiling Tools#
time/timeit: Simple timing
cProfile: Built-in Python profiler
line_profiler: Line-by-line profiling
memory_profiler: Track memory usage
py-spy: Sampling profiler (no code changes needed)
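For reference, typical py-spy invocations look like this (a sketch; the PID and script name are placeholders). Because py-spy samples a running process from the outside, no code changes are needed:
pip install py-spy
py-spy top --pid 12345                                 # live, top-like view of a running process
py-spy record -o profile.svg --pid 12345               # flame graph from a running process
py-spy record -o profile.svg -- python my_script.py    # profile a fresh run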
import time
import cProfile
import pstats
from io import StringIO
# Example: Slow vs Fast String Concatenation
def slow_concat(n):
    """Slow: O(n²) due to string immutability."""
    result = ""
    for i in range(n):
        result += str(i)  # Creates a new string each time
    return result

def fast_concat(n):
    """Fast: O(n) using list join."""
    parts = []
    for i in range(n):
        parts.append(str(i))
    return ''.join(parts)
# Timing comparison
n = 10000
start = time.perf_counter()
slow_result = slow_concat(n)
slow_time = time.perf_counter() - start
start = time.perf_counter()
fast_result = fast_concat(n)
fast_time = time.perf_counter() - start
print(f"Slow concat: {slow_time:.4f}s")
print(f"Fast concat: {fast_time:.4f}s")
print(f"Speedup: {slow_time/fast_time:.2f}x faster")
# Using timeit for more accurate measurements
import timeit
slow_avg = timeit.timeit(lambda: slow_concat(1000), number=100) / 100
fast_avg = timeit.timeit(lambda: fast_concat(1000), number=100) / 100
print(f"\nAverage over 100 runs (n=1000):")
print(f"Slow: {slow_avg*1000:.2f}ms")
print(f"Fast: {fast_avg*1000:.2f}ms")
# cProfile Example: Finding hotspots
def fibonacci_recursive(n):
    """Inefficient recursive fibonacci."""
    if n <= 1:
        return n
    return fibonacci_recursive(n-1) + fibonacci_recursive(n-2)

def fibonacci_iterative(n):
    """Efficient iterative fibonacci."""
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(n-1):
        a, b = b, a + b
    return b
# Profile the recursive version
profiler = cProfile.Profile()
profiler.enable()
result = fibonacci_recursive(25)
profiler.disable()
# Print stats
s = StringIO()
ps = pstats.Stats(profiler, stream=s).sort_stats('cumulative')
ps.print_stats(10) # Top 10 functions
print("Recursive Fibonacci Profile:")
print(s.getvalue())
# Compare execution times
print("\nPerformance Comparison:")
n = 30
start = time.perf_counter()
rec_result = fibonacci_recursive(n)
rec_time = time.perf_counter() - start
start = time.perf_counter()
iter_result = fibonacci_iterative(n)
iter_time = time.perf_counter() - start
print(f"Recursive: {rec_time:.4f}s")
print(f"Iterative: {iter_time:.6f}s")
print(f"Speedup: {rec_time/iter_time:.0f}x faster!")
Key Profiling Insights#
What to Look For:
High cumulative time: Functions called often or running long
Number of calls: Unexpectedly high call counts indicate inefficiency
Per-call time: Individual function slowness
Common Bottlenecks:
String concatenation in loops
Nested loops (O(n²) or worse)
Repeated file I/O
Database queries in loops (the N+1 problem; see the sketch after this list)
Inefficient algorithms (bubble sort vs quicksort)
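The N+1 problem deserves a concrete illustration. Below is a minimal sketch using the stdlib sqlite3 module and a hypothetical users table: the slow version issues one query per id, the fast version batches them into a single IN query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1000)])
ids = list(range(100))

# N+1 pattern: one round-trip per id (per-query overhead paid 100 times)
rows_slow = [conn.execute("SELECT name FROM users WHERE id = ?", (i,)).fetchone()
             for i in ids]

# Batched: a single query fetches all rows at once
placeholders = ",".join("?" * len(ids))
rows_fast = conn.execute(
    f"SELECT name FROM users WHERE id IN ({placeholders})", ids
).fetchall()
With a real database over the network, the batched version wins by far more than it does with in-memory SQLite, because every round-trip adds latency.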
Part 2: Memory Optimization#
Memory is often the limiting factor in data processing. Understanding memory usage prevents crashes and improves performance.
Memory Profiling#
import sys
from collections import namedtuple
# Memory comparison: List vs Tuple vs Generator
def memory_usage_demo():
    n = 1_000_000

    # Note: sys.getsizeof is shallow - it measures the container itself
    # (e.g. the list's pointer array), not the int objects inside.

    # List: stores all elements in memory
    list_data = [i for i in range(n)]
    list_size = sys.getsizeof(list_data)

    # Tuple: slightly more memory efficient
    tuple_data = tuple(i for i in range(n))
    tuple_size = sys.getsizeof(tuple_data)

    # Generator: minimal memory (just stores state)
    gen_data = (i for i in range(n))
    gen_size = sys.getsizeof(gen_data)

    print(f"Memory Usage for {n:,} integers:")
    print(f"List: {list_size:>12,} bytes ({list_size/1024/1024:.2f} MB)")
    print(f"Tuple: {tuple_size:>12,} bytes ({tuple_size/1024/1024:.2f} MB)")
    print(f"Generator: {gen_size:>12,} bytes ({gen_size/1024:.2f} KB)")
    print(f"\nGenerator is {list_size/gen_size:.0f}x more memory efficient!")
memory_usage_demo()
# Slots for memory-efficient classes
class Point:
    """Regular class with __dict__."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointSlots:
    """Memory-efficient class with __slots__."""
    __slots__ = ['x', 'y']
    def __init__(self, x, y):
        self.x = x
        self.y = y

# Named tuple: even more efficient
PointNT = namedtuple('PointNT', ['x', 'y'])
# Compare memory usage
p1 = Point(1, 2)
p2 = PointSlots(1, 2)
p3 = PointNT(1, 2)
print("\nClass Memory Comparison:")
print(f"Regular class: {sys.getsizeof(p1) + sys.getsizeof(vars(p1))} bytes")
print(f"With __slots__: {sys.getsizeof(p2)} bytes")
print(f"Named tuple: {sys.getsizeof(p3)} bytes")
# For millions of objects, this difference is massive!
n_objects = 1_000_000
regular_memory = n_objects * (sys.getsizeof(p1) + sys.getsizeof(vars(p1)))
slots_memory = n_objects * sys.getsizeof(p2)
print(f"\nFor {n_objects:,} objects:")
print(f"Regular: {regular_memory/1024/1024:.2f} MB")
print(f"Slots: {slots_memory/1024/1024:.2f} MB")
print(f"Savings: {(regular_memory - slots_memory)/1024/1024:.2f} MB")
Memory Optimization Techniques#
Use generators for large sequences
__slots__ for classes with many instances
del unused variables in long-running functions
gc.collect() to force garbage collection
Array/NumPy for homogeneous numeric data
Memory-mapped files for huge datasets (see the sketch after this list)
Lazy loading - load data only when needed
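To make the memory-mapped approach concrete, here is a minimal sketch using np.memmap; the file name, dtype, and chunk size are arbitrary choices for illustration:
import numpy as np

# Write a large array to disk once (~80 MB of float64)
data = np.memmap('big.dat', dtype=np.float64, mode='w+', shape=(10_000_000,))
data[:] = np.random.rand(10_000_000)
data.flush()

# Re-open read-only: the OS pages data in on demand instead of loading it all
view = np.memmap('big.dat', dtype=np.float64, mode='r', shape=(10_000_000,))

# Process in fixed-size chunks to keep peak memory small
chunk = 1_000_000
total = 0.0
for start in range(0, len(view), chunk):
    total += view[start:start + chunk].sum()
print(f"Mean: {total / len(view):.4f}")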
Part 3: NumPy Vectorization#
NumPy operations are 10-100x faster than pure Python because they:
Use compiled C code
Operate on contiguous memory blocks
Avoid Python interpreter overhead
Use SIMD instructions when possible
import numpy as np
# Example: Sum of squares
def sum_of_squares_python(n):
    """Pure Python implementation."""
    total = 0
    for i in range(n):
        total += i ** 2
    return total

def sum_of_squares_numpy(n):
    """Vectorized NumPy implementation."""
    arr = np.arange(n, dtype=np.int64)  # int64 avoids overflow where the default int is 32-bit
    return np.sum(arr ** 2)
# Benchmark
n = 1_000_000
# Python version
start = time.perf_counter()
result_py = sum_of_squares_python(n)
time_py = time.perf_counter() - start
# NumPy version
start = time.perf_counter()
result_np = sum_of_squares_numpy(n)
time_np = time.perf_counter() - start
print(f"Sum of squares for {n:,} numbers:")
print(f"Python: {time_py:.4f}s")
print(f"NumPy: {time_np:.4f}s")
print(f"Speedup: {time_py/time_np:.1f}x faster!")
# More complex example: Distance matrix
def euclidean_distance_python(points):
    """Compute pairwise distances - Python loops."""
    n = len(points)
    distances = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            distances[i][j] = (dx**2 + dy**2) ** 0.5
    return distances

def euclidean_distance_numpy(points):
    """Compute pairwise distances - NumPy broadcasting."""
    # Reshape for broadcasting: (n, 1, 2) - (1, n, 2) -> (n, n, 2)
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))
# Test with random points
np.random.seed(42)
n_points = 1000
points_list = [[np.random.rand(), np.random.rand()] for _ in range(n_points)]
points_array = np.array(points_list)
# Benchmark (use smaller n for Python version)
n_test = 100
start = time.perf_counter()
dist_py = euclidean_distance_python(points_list[:n_test])
time_py = time.perf_counter() - start
start = time.perf_counter()
dist_np = euclidean_distance_numpy(points_array[:n_test])
time_np = time.perf_counter() - start
print(f"\nDistance matrix for {n_test} points:")
print(f"Python: {time_py:.4f}s")
print(f"NumPy: {time_np:.6f}s")
print(f"Speedup: {time_py/time_np:.1f}x faster!")
# NumPy can handle much larger datasets
start = time.perf_counter()
dist_large = euclidean_distance_numpy(points_array) # All 1000 points
time_large = time.perf_counter() - start
print(f"\nNumPy with 1000 points: {time_large:.4f}s")
print(f"Creates {1000*1000:,} distance calculations!")
Vectorization Best Practices#
Do:
Use NumPy ufuncs (universal functions) instead of loops
Leverage broadcasting for element-wise operations
Use built-in aggregations (sum, mean, max, etc.)
Preallocate arrays when possible (see the sketch after this list)
Don't:
Loop over NumPy arrays (defeats the purpose)
Use .append() in loops (very slow)
Mix Python loops with NumPy operations
Create unnecessary array copies
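A minimal sketch of the preallocation point (array sizes are arbitrary): growing an array with np.append copies it on every call, while filling a preallocated array, or vectorizing outright, touches each element once.
import numpy as np

# Bad: np.append copies the whole array each call -> O(n²) overall
grown = np.array([])
for i in range(1_000):          # deliberately small; this is the slow pattern
    grown = np.append(grown, i * 0.5)

# Good: allocate once, fill in place -> O(n)
n = 100_000
prealloc = np.empty(n)
for i in range(n):
    prealloc[i] = i * 0.5

# Best: express the whole computation as one vectorized operation
vectorized = np.arange(n) * 0.5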
Part 4: Just-In-Time Compilation with Numba#
Numba compiles Python functions to machine code at runtime, achieving C-like performance with minimal code changes.
# Install: pip install numba
import random

try:
    from numba import jit, njit, prange
    NUMBA_AVAILABLE = True
except ImportError:
    print("Numba not installed. Run: pip install numba")
    NUMBA_AVAILABLE = False

if NUMBA_AVAILABLE:
    # Example 1: Simple numerical computation
    def monte_carlo_pi_python(n_samples):
        """Estimate π using Monte Carlo - Pure Python."""
        inside_circle = 0
        for i in range(n_samples):
            x = random.random()  # stdlib RNG keeps this version pure Python
            y = random.random()
            if x*x + y*y <= 1.0:
                inside_circle += 1
        return 4.0 * inside_circle / n_samples

    @jit(nopython=True)  # Compile to machine code
    def monte_carlo_pi_numba(n_samples):
        """Estimate π using Monte Carlo - Numba JIT."""
        inside_circle = 0
        for i in range(n_samples):
            x = np.random.random()  # Numba supports NumPy's RNG in nopython mode
            y = np.random.random()
            if x*x + y*y <= 1.0:
                inside_circle += 1
        return 4.0 * inside_circle / n_samples

    # Benchmark
    n = 10_000_000

    # Warm up Numba (first call compiles)
    _ = monte_carlo_pi_numba(100)

    start = time.perf_counter()
    pi_python = monte_carlo_pi_python(n)
    time_py = time.perf_counter() - start

    start = time.perf_counter()
    pi_numba = monte_carlo_pi_numba(n)
    time_numba = time.perf_counter() - start

    print(f"Monte Carlo π estimation ({n:,} samples):")
    print(f"Python: {pi_python:.6f} in {time_py:.4f}s")
    print(f"Numba: {pi_numba:.6f} in {time_numba:.4f}s")
    print(f"Speedup: {time_py/time_numba:.1f}x faster!")

    # Example 2: Parallel execution
    @njit(parallel=True)
    def parallel_sum_of_squares(n):
        """Parallel computation with Numba."""
        total = 0
        for i in prange(n):  # Parallel range
            total += i * i
        return total

    # Compare with sequential
    @njit
    def sequential_sum_of_squares(n):
        """Sequential Numba computation."""
        total = 0
        for i in range(n):
            total += i * i
        return total

    # Warm up
    _ = parallel_sum_of_squares(100)
    _ = sequential_sum_of_squares(100)

    n = 100_000_000
    start = time.perf_counter()
    result_seq = sequential_sum_of_squares(n)
    time_seq = time.perf_counter() - start

    start = time.perf_counter()
    result_par = parallel_sum_of_squares(n)
    time_par = time.perf_counter() - start

    print(f"\nSum of squares ({n:,}):")
    print(f"Sequential: {time_seq:.4f}s")
    print(f"Parallel: {time_par:.4f}s")
    print(f"Speedup: {time_seq/time_par:.1f}x faster!")
else:
    print("Numba examples skipped (not installed)")
Numba Guidelines#
When to use Numba:
Numerical computations with loops
Array operations that can't be vectorized
Physics simulations, Monte Carlo methods
Signal processing algorithms
Limitations:
Limited Python feature support (no complex objects)
First call is slow (compilation)
Not useful for I/O-bound code
NumPy is still faster for pure vectorized ops
Best Practices:
Use nopython=True (or @njit) for maximum speed
Avoid object mode fallback
Use parallel=True for CPU-bound loops
Cache compiled functions with cache=True (see the sketch after this list)
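A minimal sketch of those last two flags together (the dot-product function itself is an arbitrary example):
import numpy as np
from numba import njit, prange

@njit(parallel=True, cache=True)
def dot(a, b):
    # parallel=True splits the prange loop across CPU cores;
    # cache=True stores the compiled machine code on disk, so later
    # runs of the program skip compilation entirely
    total = 0.0
    for i in prange(a.shape[0]):
        total += a[i] * b[i]
    return total

x = np.random.rand(1_000_000)
print(dot(x, x))  # first call compiles; subsequent calls (and runs) are fast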
Part 5: Multiprocessing and Threading#
Python has two main concurrency models:
Threading: For I/O-bound tasks (GIL limits CPU parallelism)
Multiprocessing: For CPU-bound tasks (bypasses GIL)
The Global Interpreter Lock (GIL)#
The GIL prevents multiple threads from executing Python code simultaneously. This means:
Threads: Good for I/O (network, disk)
Processes: Needed for CPU parallelism
import threading
import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
# CPU-bound task for testing
def compute_intensive(n):
    """Simulate CPU-intensive work."""
    total = 0
    for i in range(n):
        total += i ** 2
    return total

# Sequential baseline
def sequential_processing(tasks):
    """Process tasks sequentially."""
    return [compute_intensive(n) for n in tasks]

# Threading (won't help for CPU-bound due to GIL)
def threaded_processing(tasks, max_workers=4):
    """Process tasks with threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(compute_intensive, tasks))

# Multiprocessing (bypasses GIL)
def multiprocess_processing(tasks, max_workers=4):
    """Process tasks with separate processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(compute_intensive, tasks))
# Benchmark
if __name__ == "__main__":
    tasks = [5_000_000] * 8  # 8 CPU-intensive tasks
    print("Processing 8 CPU-intensive tasks:")

    start = time.perf_counter()
    result_seq = sequential_processing(tasks)
    time_seq = time.perf_counter() - start
    print(f"Sequential: {time_seq:.2f}s")

    start = time.perf_counter()
    result_thread = threaded_processing(tasks, max_workers=4)
    time_thread = time.perf_counter() - start
    print(f"Threading (4): {time_thread:.2f}s (Speedup: {time_seq/time_thread:.2f}x)")

    start = time.perf_counter()
    result_multi = multiprocess_processing(tasks, max_workers=4)
    time_multi = time.perf_counter() - start
    print(f"Multiprocess (4): {time_multi:.2f}s (Speedup: {time_seq/time_multi:.2f}x)")

    print("\nKey Insight: Multiprocessing provides real CPU parallelism!")
# I/O-bound example: Threading shines here
import urllib.request
def fetch_url(url):
    """Simulate I/O-bound operation."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return len(response.read())
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        return 0
urls = [
'http://www.example.com',
'http://www.python.org',
'http://www.github.com',
'http://www.stackoverflow.com',
] * 3 # 12 requests total
# Sequential
print("Fetching 12 URLs:")
start = time.perf_counter()
sizes_seq = [fetch_url(url) for url in urls]
time_seq = time.perf_counter() - start
print(f"Sequential: {time_seq:.2f}s")
# Threaded (much better for I/O)
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    sizes_thread = list(executor.map(fetch_url, urls))
time_thread = time.perf_counter() - start
print(f"Threading: {time_thread:.2f}s (Speedup: {time_seq/time_thread:.2f}x)")
print("\nFor I/O-bound tasks, threading is ideal!")
Concurrency Decision Tree#
Is your task CPU-bound or I/O-bound?
│
├─ CPU-bound (computation, data processing)
│   │
│   ├─ Vectorizable?   → Use NumPy
│   ├─ Loops required? → Use Numba
│   └─ Multiple cores? → Use multiprocessing
│
└─ I/O-bound (network, disk, database)
    │
    ├─ Many concurrent operations? → Use asyncio
    └─ Mixed I/O and CPU? → Use threading
Part 6: Asynchronous I/O with asyncio#
asyncio enables concurrent I/O without threading overhead. Perfect for:
Web scraping
API calls
Database queries
Real-time applications
import asyncio
# Basic async example
async def fetch_data(url, delay):
    """Simulate async data fetching."""
    print(f"Fetching {url}...")
    await asyncio.sleep(delay)  # Simulate network delay
    print(f"Completed {url}")
    return f"Data from {url}"

async def sequential_async():
    """Fetch sequentially (awaits each)."""
    result1 = await fetch_data("API1", 2)
    result2 = await fetch_data("API2", 2)
    result3 = await fetch_data("API3", 2)
    return [result1, result2, result3]

async def concurrent_async():
    """Fetch concurrently (gather)."""
    tasks = [
        fetch_data("API1", 2),
        fetch_data("API2", 2),
        fetch_data("API3", 2),
    ]
    return await asyncio.gather(*tasks)
# Top-level await works in Jupyter/IPython; in a plain script, wrap these calls in asyncio.run()
print("Sequential async (total: ~6 seconds):")
start = time.perf_counter()
results_seq = await sequential_async()
time_seq = time.perf_counter() - start
print(f"Completed in {time_seq:.2f}s\n")
print("Concurrent async (total: ~2 seconds):")
start = time.perf_counter()
results_con = await concurrent_async()
time_con = time.perf_counter() - start
print(f"Completed in {time_con:.2f}s")
print(f"Speedup: {time_seq/time_con:.1f}x faster!")
# Real-world async pattern: Rate-limited API calls
async def rate_limited_fetch(semaphore, url, delay):
    """Fetch with concurrency limit."""
    async with semaphore:  # Limit concurrent requests
        print(f"Starting {url}")
        await asyncio.sleep(delay)
        print(f"Finished {url}")
        return f"Data from {url}"

async def fetch_many_with_limit(urls, max_concurrent=3):
    """Fetch many URLs with concurrency limit."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [rate_limited_fetch(semaphore, url, 1) for url in urls]
    return await asyncio.gather(*tasks)
# Fetch 10 URLs but only 3 at a time
urls = [f"URL{i}" for i in range(10)]
print("Fetching 10 URLs with max 3 concurrent:")
start = time.perf_counter()
results = await fetch_many_with_limit(urls, max_concurrent=3)
elapsed = time.perf_counter() - start
print(f"\nCompleted in {elapsed:.2f}s")
print(f"With 3 concurrent, 10 URLs take ~4 batches: ceil(10/3) = 4 seconds")
Async Best Practices#
Do:
Use async for I/O-bound concurrent operations
Use asyncio.gather() for parallel awaits
Use asyncio.Semaphore() for rate limiting
Use aiohttp for async HTTP requests (see the sketch after this list)
Don't:
Use async for CPU-bound tasks (blocks event loop)
Mix sync and async APIs carelessly
Forget to await async functions
Use time.sleep() instead of asyncio.sleep() in coroutines
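A minimal aiohttp sketch of the pattern above, assuming aiohttp is installed (pip install aiohttp); the URL is a placeholder:
import asyncio
import aiohttp

async def fetch(session, url):
    """Fetch one URL and return its body length."""
    async with session.get(url) as response:
        body = await response.text()
        return len(body)

async def main():
    urls = ["https://example.com"] * 5  # placeholder URLs
    # One session is reused for all requests (connection pooling)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

print(asyncio.run(main()))  # in a notebook, use: await main()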
Part 7: Cython - C Extension Speed#
Cython compiles Python-like code to C extensions for maximum performance.
When to use Cython:
Need C-level performance
Interfacing with C libraries
Tight loops that can't be vectorized
Numba isn't enough
Cython Example (Pseudocode)#
# fibonacci.pyx
def fib_cython(int n):
    cdef int a = 0, b = 1, temp
    cdef int i
    if n <= 1:          # guard matches the iterative Python version
        return n
    for i in range(n - 1):
        temp = a
        a = b
        b = temp + b
    return b
Compile with:
cython fibonacci.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing \
-I/usr/include/python3.x \
-o fibonacci.so fibonacci.c
Performance: 10-100x faster than pure Python for numerical code.
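In practice, most projects let setuptools drive the compilation instead of calling gcc by hand. A minimal sketch, assuming Cython is installed:
# setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("fibonacci.pyx"))
Build in place with python setup.py build_ext --inplace; afterwards, import fibonacci works like any other module.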
Part 8: Performance Optimization Checklist#
1. Profile First ✓#
Use cProfile to find hotspots
Use line_profiler for detailed analysis
Use memory_profiler for memory issues
2. Algorithm Optimization ✓#
Choose right data structure (dict vs list vs set)
Use appropriate algorithm (O(n log n) vs O(n²))
Cache repeated calculations (see the sketch after this list)
Avoid premature optimization
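One standard way to cache repeated calculations is functools.lru_cache; a minimal sketch:
from functools import lru_cache

@lru_cache(maxsize=None)   # memoize: each distinct n is computed only once
def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(100))            # fast: O(n) calls instead of O(2^n)
print(fib.cache_info())    # hit/miss statistics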
3. Data Structure Choice ✓#
Use set for membership testing (see the sketch after this list)
Use dict for lookups
Use collections.deque for queues
Use collections.Counter for counting
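A quick sketch of why the set choice matters for membership tests (timings vary by machine):
import timeit

items_list = list(range(100_000))
items_set = set(items_list)
target = 99_999   # worst case for the list: a full scan

t_list = timeit.timeit(lambda: target in items_list, number=1_000)
t_set = timeit.timeit(lambda: target in items_set, number=1_000)
print(f"list: {t_list:.4f}s   set: {t_set:.6f}s")  # O(n) scan vs O(1) hash lookup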
4. Vectorization ✓#
Replace loops with NumPy operations
Use broadcasting for element-wise ops
Avoid mixing loops and NumPy
5. Compilation ✓#
Use Numba for numerical code
Use Cython for C-level performance
Consider PyPy for long-running services
6. Concurrency ✓#
Multiprocessing for CPU-bound
Threading for I/O-bound (with caution)
Asyncio for high-concurrency I/O
7. Memory Optimization ✓#
Use generators for large sequences
Use __slots__ for many objects
Del unused variables
Use memory-mapped files for huge data
8. I/O Optimization ✓#
Batch database queries
Use connection pooling
Cache file reads
Use binary formats (pickle, parquet)
9. String Operations ✓#
Use join() instead of += in loops
Use f-strings for formatting
Use str.format() for templates
Compile regex patterns once (see the sketch after this list)
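The regex point as a minimal sketch:
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")   # compile once, at module level

def extract_dates(lines):
    # Reusing the compiled pattern avoids re-parsing the regex on every call
    return [m.group() for line in lines for m in DATE_RE.finditer(line)]

print(extract_dates(["released 2024-01-15", "updated 2024-06-30"]))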
10. Measure Results ✓#
Always benchmark optimizations
Use realistic data sizes
Test edge cases
Document performance characteristics
Part 9: Exercises#
Exercise 1: Optimize Matrix Multiplication (Difficulty: ★★★☆☆)#
Task: Implement matrix multiplication three ways and compare performance:
Pure Python with nested loops
NumPy vectorized
Numba JIT compiled
Test with 500×500 matrices and measure speedup.
Expected outcome: NumPy should be 50-100x faster, Numba similar to NumPy.
Exercise 2: Parallel File Processing (Difficulty: ★★★★☆)#
Task: Process 100 text files in parallel:
Read each file
Count word frequency
Return top 10 words
Implement using:
Sequential processing
ThreadPoolExecutor
ProcessPoolExecutor
Which is faster and why?
Exercise 3: Memory-Efficient Data Processing (Difficulty: ★★★☆☆)#
Task: Process a 1GB CSV file:
Read and parse without loading entire file
Calculate statistics (mean, median, std)
Use generator-based approach
Compare memory usage vs loading all data.
Exercise 4: Async Web Scraper (Difficulty: ★★★★☆)#
Task: Build an async web scraper:
Fetch 50 URLs concurrently
Limit to 10 concurrent connections
Handle errors gracefully
Track completion time
Compare with sequential approach.
Exercise 5: Cache Decorator (Difficulty: ★★★☆☆)#
Task: Implement a caching decorator with:
LRU cache (max size)
TTL (time-to-live)
Cache statistics (hits/misses)
Test with expensive function calls.
Exercise 6: Profile and Optimize (Difficulty: ★★★★★)#
Task: Given this slow code, optimize it:
def slow_function(data):
    result = []
    for item in data:
        if item not in result:
            result.append(item)
    return sorted(result)
Profile to find bottlenecks
Optimize (hint: use set)
Measure speedup
Document complexity improvement
Part 10: Self-Check Quiz#
Question 1#
What is the primary advantage of using generators over lists for large datasets?
A) Generators are faster
B) Generators use less memory
C) Generators are easier to write
D) Generators can be pickled
Answer
B) Generators use less memory
Explanation: Generators compute values on-demand and don't store all elements in memory, making them ideal for large or infinite sequences.
Question 2#
When should you use multiprocessing instead of threading in Python?
A) For I/O-bound tasks
B) For CPU-bound tasks
C) When you need shared memory
D) When you want faster startup
Answer
B) For CPU-bound tasks
Explanation: The GIL prevents threads from executing Python code in parallel on multiple cores. Multiprocessing bypasses the GIL by using separate processes.
Question 3#
What makes NumPy significantly faster than pure Python for numerical operations?
A) It's written in a faster language
B) It uses compiled C code and operates on contiguous memory
C) It automatically parallelizes operations
D) It caches all results
Answer
B) It uses compiled C code and operates on contiguous memory
Explanation: NumPy operations are implemented in C and work on contiguous memory blocks, avoiding Python interpreter overhead and enabling CPU cache efficiency.
Question 4#
What is the main limitation of asyncio in Python?
A) It's too complex
B) It doesnโt work with CPU-bound tasks
C) It requires special hardware
D) It only works on Linux
Answer
B) It doesn't work with CPU-bound tasks
Explanation: Asyncio is designed for I/O-bound concurrency. CPU-bound tasks block the event loop, preventing other tasks from running.
Question 5#
What does the @jit decorator in Numba do?
A) Distributes code across multiple machines
B) Compiles Python functions to machine code at runtime
C) Caches function results
D) Converts code to JavaScript
Answer
B) Compiles Python functions to machine code at runtime
Explanation: Numba's JIT (Just-In-Time) compiler translates Python functions to optimized machine code when first called, achieving C-like performance.
Key Takeaways#
Profile first: Don't guess where the bottleneck is
Algorithm matters: O(n) vs O(nยฒ) beats any micro-optimization
Use the right tool: NumPy for arrays, Numba for loops, asyncio for I/O
Memory is finite: Use generators and streaming for large data
GIL awareness: Multiprocessing for CPU, threading for I/O
Vectorization wins: Avoid Python loops on arrays when possible
Measure everything: Benchmark before and after optimization
Premature optimization: Focus on correctness first, then optimize
Compiled beats interpreted: Numba/Cython for critical hot paths
Concurrency ≠ Parallelism: Understand the difference
Common Mistakes to Avoid#
Optimizing before profiling - Wasting time on non-bottlenecks
Using threads for CPU-bound work - GIL prevents speedup
Loops on NumPy arrays - Defeats vectorization benefits
Blocking asyncio event loop - CPU work in async functions
Premature multiprocessing - Process overhead can be worse
Ignoring memory usage - OOM kills vs slow execution
Not caching expensive calls - Repeated work
String concatenation in loops - O(n²) behavior
Wrong data structure - list vs set for membership testing
Not testing at scale - Small data hides problems
Pro Tips#
Use cProfile + SnakeViz: Visual profiling is enlightening
Try PyPy: 5-10x speedup for long-running pure Python code
Learn NumPy broadcasting: Eliminates many loops
Cache expensive I/O: Redis, memcached, or functools.lru_cache
Use binary formats: Pickle, HDF5, Parquet beat JSON/CSV
Batch operations: Database queries, API calls
Measure memory: sys.getsizeof(), memory_profiler
Use __slots__: For classes with many instances
Compile regexes: re.compile() once, use many times
Monitor production: Profile in real environments
Whatโs Next?#
You've mastered high-performance Python! Next topics:
GPU Computing: CUDA, CuPy, JAX (next notebook!)
Distributed Computing: Dask, Ray, Spark
Advanced Profiling: perf, valgrind, flame graphs
C Extensions: ctypes, CFFI, Cython mastery
Alternative Runtimes: PyPy, GraalPython
Practice Projects:
Optimize a real application (10x+ speedup possible!)
Build a parallel data processing pipeline
Create a high-performance web service
Implement numerical algorithms in Numba
Remember: Fast code is good. Correct code is better. Maintainable code is best. Optimize where it matters!
Ready for GPU computing? Proceed to the CUDA & Parallel Computing notebook! 🚀