Hard Level: High-Performance Python Computing#
Real-World Context#
The Problem: Python is often criticized for being “slow” compared to C/C++. In reality, proper optimization can make Python competitive for many tasks, and understanding performance is crucial for:
Processing large datasets (millions/billions of records)
Real-time systems (trading, gaming, streaming)
Scientific computing and simulations
Machine learning model training
Web services handling millions of requests
Why This Matters:
Cost Savings: 10x faster code = 10x fewer servers
User Experience: Fast responses keep users engaged
Scalability: Efficient code handles more load
Energy: Less computation = lower power consumption
What You'll Learn:
How to profile and find bottlenecks
Memory and CPU optimization techniques
Just-In-Time (JIT) compilation with Numba
Vectorization with NumPy
Multiprocessing and threading
Asynchronous I/O for concurrency
Cython for C-speed Python
Part 1: Profiling - Finding the Bottleneck#
Golden Rule: “Premature optimization is the root of all evil” - Donald Knuth
Always profile first! 90% of execution time is typically spent in 10% of the code.
Profiling Tools#
time/timeit: Simple timing
cProfile: Built-in Python profiler
line_profiler: Line-by-line profiling
memory_profiler: Track memory usage
py-spy: Sampling profiler (no code changes needed)
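For reference, typical py-spy invocations look like this (a sketch; the PID and script name are placeholders). Because py-spy samples a running process from the outside, no code changes are needed:
pip install py-spy
py-spy top --pid 12345                                 # live, top-like view of a running process
py-spy record -o profile.svg --pid 12345               # flame graph from a running process
py-spy record -o profile.svg -- python my_script.py    # profile a fresh run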
import time
import cProfile
import pstats
from io import StringIO
# Example: Slow vs Fast String Concatenation
def slow_concat(n):
    """Slow: O(n²) due to string immutability."""
    result = ""
    for i in range(n):
        result += str(i)  # Creates a new string each time
    return result

def fast_concat(n):
    """Fast: O(n) using list join."""
    parts = []
    for i in range(n):
        parts.append(str(i))
    return ''.join(parts)
# Timing comparison
n = 10000
start = time.perf_counter()
slow_result = slow_concat(n)
slow_time = time.perf_counter() - start
start = time.perf_counter()
fast_result = fast_concat(n)
fast_time = time.perf_counter() - start
print(f"Slow concat: {slow_time:.4f}s")
print(f"Fast concat: {fast_time:.4f}s")
print(f"Speedup: {slow_time/fast_time:.2f}x faster")
# Using timeit for more accurate measurements
import timeit
slow_avg = timeit.timeit(lambda: slow_concat(1000), number=100) / 100
fast_avg = timeit.timeit(lambda: fast_concat(1000), number=100) / 100
print(f"\nAverage over 100 runs (n=1000):")
print(f"Slow: {slow_avg*1000:.2f}ms")
print(f"Fast: {fast_avg*1000:.2f}ms")
# cProfile Example: Finding hotspots
def fibonacci_recursive(n):
    """Inefficient recursive fibonacci."""
    if n <= 1:
        return n
    return fibonacci_recursive(n-1) + fibonacci_recursive(n-2)

def fibonacci_iterative(n):
    """Efficient iterative fibonacci."""
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(n-1):
        a, b = b, a + b
    return b
# Profile the recursive version
profiler = cProfile.Profile()
profiler.enable()
result = fibonacci_recursive(25)
profiler.disable()
# Print stats
s = StringIO()
ps = pstats.Stats(profiler, stream=s).sort_stats('cumulative')
ps.print_stats(10) # Top 10 functions
print("Recursive Fibonacci Profile:")
print(s.getvalue())
# Compare execution times
print("\nPerformance Comparison:")
n = 30
start = time.perf_counter()
rec_result = fibonacci_recursive(n)
rec_time = time.perf_counter() - start
start = time.perf_counter()
iter_result = fibonacci_iterative(n)
iter_time = time.perf_counter() - start
print(f"Recursive: {rec_time:.4f}s")
print(f"Iterative: {iter_time:.6f}s")
print(f"Speedup: {rec_time/iter_time:.0f}x faster!")
Key Profiling Insights#
What to Look For:
High cumulative time: Functions called often or running long
Number of calls: Unexpectedly high call counts indicate inefficiency
Per-call time: Individual function slowness
Common Bottlenecks:
String concatenation in loops
Nested loops (O(n²) or worse)
Repeated file I/O
Database queries in loops (the N+1 problem; see the sketch after this list)
Inefficient algorithms (bubble sort vs quicksort)
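The N+1 problem deserves a concrete illustration. Below is a minimal sketch using the stdlib sqlite3 module and a hypothetical users table: the slow version issues one query per id, the fast version batches them into a single IN query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1000)])
ids = list(range(100))

# N+1 pattern: one round-trip per id (per-query overhead paid 100 times)
rows_slow = [conn.execute("SELECT name FROM users WHERE id = ?", (i,)).fetchone()
             for i in ids]

# Batched: a single query fetches all rows at once
placeholders = ",".join("?" * len(ids))
rows_fast = conn.execute(
    f"SELECT name FROM users WHERE id IN ({placeholders})", ids
).fetchall()
With a real database over the network, the batched version wins by far more than it does with in-memory SQLite, because every round-trip adds latency.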
Part 2: Memory Optimization#
Memory is often the limiting factor in data processing. Understanding memory usage prevents crashes and improves performance.
Memory Profiling#
import sys
from collections import namedtuple
# Memory comparison: List vs Tuple vs Generator
def memory_usage_demo():
    n = 1_000_000

    # Note: sys.getsizeof is shallow - it measures the container itself
    # (e.g. the list's pointer array), not the int objects inside.

    # List: stores all elements in memory
    list_data = [i for i in range(n)]
    list_size = sys.getsizeof(list_data)

    # Tuple: slightly more memory efficient
    tuple_data = tuple(i for i in range(n))
    tuple_size = sys.getsizeof(tuple_data)

    # Generator: minimal memory (just stores state)
    gen_data = (i for i in range(n))
    gen_size = sys.getsizeof(gen_data)

    print(f"Memory Usage for {n:,} integers:")
    print(f"List: {list_size:>12,} bytes ({list_size/1024/1024:.2f} MB)")
    print(f"Tuple: {tuple_size:>12,} bytes ({tuple_size/1024/1024:.2f} MB)")
    print(f"Generator: {gen_size:>12,} bytes ({gen_size/1024:.2f} KB)")
    print(f"\nGenerator is {list_size/gen_size:.0f}x more memory efficient!")
memory_usage_demo()
# Slots for memory-efficient classes
class Point:
    """Regular class with __dict__."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointSlots:
    """Memory-efficient class with __slots__."""
    __slots__ = ['x', 'y']
    def __init__(self, x, y):
        self.x = x
        self.y = y

# Named tuple: even more efficient
PointNT = namedtuple('PointNT', ['x', 'y'])
# Compare memory usage
p1 = Point(1, 2)
p2 = PointSlots(1, 2)
p3 = PointNT(1, 2)
print("\nClass Memory Comparison:")
print(f"Regular class: {sys.getsizeof(p1) + sys.getsizeof(vars(p1))} bytes")
print(f"With __slots__: {sys.getsizeof(p2)} bytes")
print(f"Named tuple: {sys.getsizeof(p3)} bytes")
# For millions of objects, this difference is massive!
n_objects = 1_000_000
regular_memory = n_objects * (sys.getsizeof(p1) + sys.getsizeof(vars(p1)))
slots_memory = n_objects * sys.getsizeof(p2)
print(f"\nFor {n_objects:,} objects:")
print(f"Regular: {regular_memory/1024/1024:.2f} MB")
print(f"Slots: {slots_memory/1024/1024:.2f} MB")
print(f"Savings: {(regular_memory - slots_memory)/1024/1024:.2f} MB")
Memory Optimization Techniques#
Use generators for large sequences
__slots__ for classes with many instances
del unused variables in long-running functions
gc.collect() to force garbage collection
Array/NumPy for homogeneous numeric data
Memory-mapped files for huge datasets (see the sketch after this list)
Lazy loading - load data only when needed
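To make the memory-mapped approach concrete, here is a minimal sketch using np.memmap; the file name, dtype, and chunk size are arbitrary choices for illustration:
import numpy as np

# Write a large array to disk once (~80 MB of float64)
data = np.memmap('big.dat', dtype=np.float64, mode='w+', shape=(10_000_000,))
data[:] = np.random.rand(10_000_000)
data.flush()

# Re-open read-only: the OS pages data in on demand instead of loading it all
view = np.memmap('big.dat', dtype=np.float64, mode='r', shape=(10_000_000,))

# Process in fixed-size chunks to keep peak memory small
chunk = 1_000_000
total = 0.0
for start in range(0, len(view), chunk):
    total += view[start:start + chunk].sum()
print(f"Mean: {total / len(view):.4f}")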
Part 3: NumPy Vectorization#
NumPy operations are 10-100x faster than pure Python because they:
Use compiled C code
Operate on contiguous memory blocks
Avoid Python interpreter overhead
Use SIMD instructions when possible
import numpy as np
# Example: Sum of squares
def sum_of_squares_python(n):
    """Pure Python implementation."""
    total = 0
    for i in range(n):
        total += i ** 2
    return total

def sum_of_squares_numpy(n):
    """Vectorized NumPy implementation."""
    arr = np.arange(n, dtype=np.int64)  # int64 avoids overflow where the default int is 32-bit
    return np.sum(arr ** 2)
# Benchmark
n = 1_000_000
# Python version
start = time.perf_counter()
result_py = sum_of_squares_python(n)
time_py = time.perf_counter() - start
# NumPy version
start = time.perf_counter()
result_np = sum_of_squares_numpy(n)
time_np = time.perf_counter() - start
print(f"Sum of squares for {n:,} numbers:")
print(f"Python: {time_py:.4f}s")
print(f"NumPy: {time_np:.4f}s")
print(f"Speedup: {time_py/time_np:.1f}x faster!")
# More complex example: Distance matrix
def euclidean_distance_python(points):
    """Compute pairwise distances - Python loops."""
    n = len(points)
    distances = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            distances[i][j] = (dx**2 + dy**2) ** 0.5
    return distances

def euclidean_distance_numpy(points):
    """Compute pairwise distances - NumPy broadcasting."""
    # Reshape for broadcasting: (n, 1, 2) - (1, n, 2) -> (n, n, 2)
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))
# Test with random points
np.random.seed(42)
n_points = 1000
points_list = [[np.random.rand(), np.random.rand()] for _ in range(n_points)]
points_array = np.array(points_list)
# Benchmark (use smaller n for Python version)
n_test = 100
start = time.perf_counter()
dist_py = euclidean_distance_python(points_list[:n_test])
time_py = time.perf_counter() - start
start = time.perf_counter()
dist_np = euclidean_distance_numpy(points_array[:n_test])
time_np = time.perf_counter() - start
print(f"\nDistance matrix for {n_test} points:")
print(f"Python: {time_py:.4f}s")
print(f"NumPy: {time_np:.6f}s")
print(f"Speedup: {time_py/time_np:.1f}x faster!")
# NumPy can handle much larger datasets
start = time.perf_counter()
dist_large = euclidean_distance_numpy(points_array) # All 1000 points
time_large = time.perf_counter() - start
print(f"\nNumPy with 1000 points: {time_large:.4f}s")
print(f"Creates {1000*1000:,} distance calculations!")
Vectorization Best Practices#
Do:
Use NumPy ufuncs (universal functions) instead of loops
Leverage broadcasting for element-wise operations
Use built-in aggregations (sum, mean, max, etc.)
Preallocate arrays when possible (see the sketch after this list)
Don't:
Loop over NumPy arrays (defeats the purpose)
Use .append() in loops (very slow)
Mix Python loops with NumPy operations
Create unnecessary array copies
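A minimal sketch of the preallocation point (array sizes are arbitrary): growing an array with np.append copies it on every call, while filling a preallocated array, or vectorizing outright, touches each element once.
import numpy as np

# Bad: np.append copies the whole array each call -> O(n²) overall
grown = np.array([])
for i in range(1_000):          # deliberately small; this is the slow pattern
    grown = np.append(grown, i * 0.5)

# Good: allocate once, fill in place -> O(n)
n = 100_000
prealloc = np.empty(n)
for i in range(n):
    prealloc[i] = i * 0.5

# Best: express the whole computation as one vectorized operation
vectorized = np.arange(n) * 0.5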
Part 4: Just-In-Time Compilation with Numba#
Numba compiles Python functions to machine code at runtime, achieving C-like performance with minimal code changes.
# Install: pip install numba
import random

try:
    from numba import jit, njit, prange
    NUMBA_AVAILABLE = True
except ImportError:
    print("Numba not installed. Run: pip install numba")
    NUMBA_AVAILABLE = False

if NUMBA_AVAILABLE:
    # Example 1: Simple numerical computation
    def monte_carlo_pi_python(n_samples):
        """Estimate π using Monte Carlo - Pure Python."""
        inside_circle = 0
        for i in range(n_samples):
            x = random.random()  # stdlib RNG keeps this version pure Python
            y = random.random()
            if x*x + y*y <= 1.0:
                inside_circle += 1
        return 4.0 * inside_circle / n_samples

    @jit(nopython=True)  # Compile to machine code
    def monte_carlo_pi_numba(n_samples):
        """Estimate π using Monte Carlo - Numba JIT."""
        inside_circle = 0
        for i in range(n_samples):
            x = np.random.random()  # Numba supports NumPy's RNG in nopython mode
            y = np.random.random()
            if x*x + y*y <= 1.0:
                inside_circle += 1
        return 4.0 * inside_circle / n_samples

    # Benchmark
    n = 10_000_000

    # Warm up Numba (first call compiles)
    _ = monte_carlo_pi_numba(100)

    start = time.perf_counter()
    pi_python = monte_carlo_pi_python(n)
    time_py = time.perf_counter() - start

    start = time.perf_counter()
    pi_numba = monte_carlo_pi_numba(n)
    time_numba = time.perf_counter() - start

    print(f"Monte Carlo π estimation ({n:,} samples):")
    print(f"Python: {pi_python:.6f} in {time_py:.4f}s")
    print(f"Numba: {pi_numba:.6f} in {time_numba:.4f}s")
    print(f"Speedup: {time_py/time_numba:.1f}x faster!")

    # Example 2: Parallel execution
    @njit(parallel=True)
    def parallel_sum_of_squares(n):
        """Parallel computation with Numba."""
        total = 0
        for i in prange(n):  # Parallel range
            total += i * i
        return total

    # Compare with sequential
    @njit
    def sequential_sum_of_squares(n):
        """Sequential Numba computation."""
        total = 0
        for i in range(n):
            total += i * i
        return total

    # Warm up
    _ = parallel_sum_of_squares(100)
    _ = sequential_sum_of_squares(100)

    n = 100_000_000
    start = time.perf_counter()
    result_seq = sequential_sum_of_squares(n)
    time_seq = time.perf_counter() - start

    start = time.perf_counter()
    result_par = parallel_sum_of_squares(n)
    time_par = time.perf_counter() - start

    print(f"\nSum of squares ({n:,}):")
    print(f"Sequential: {time_seq:.4f}s")
    print(f"Parallel: {time_par:.4f}s")
    print(f"Speedup: {time_seq/time_par:.1f}x faster!")
else:
    print("Numba examples skipped (not installed)")
Numba Guidelines#
When to use Numba:
Numerical computations with loops
Array operations that can't be vectorized
Physics simulations, Monte Carlo methods
Signal processing algorithms
Limitations:
Limited Python feature support (no complex objects)
First call is slow (compilation)
Not useful for I/O-bound code
NumPy is still faster for pure vectorized ops
Best Practices:
Use nopython=True (or @njit) for maximum speed
Avoid object mode fallback
Use parallel=True for CPU-bound loops
Cache compiled functions with cache=True (see the sketch after this list)
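A minimal sketch of those last two flags together (the dot-product function itself is an arbitrary example):
import numpy as np
from numba import njit, prange

@njit(parallel=True, cache=True)
def dot(a, b):
    # parallel=True splits the prange loop across CPU cores;
    # cache=True stores the compiled machine code on disk, so later
    # runs of the program skip compilation entirely
    total = 0.0
    for i in prange(a.shape[0]):
        total += a[i] * b[i]
    return total

x = np.random.rand(1_000_000)
print(dot(x, x))  # first call compiles; subsequent calls (and runs) are fast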
Part 5: Multiprocessing and Threading#
Python has two main concurrency models:
Threading: For I/O-bound tasks (GIL limits CPU parallelism)
Multiprocessing: For CPU-bound tasks (bypasses GIL)
The Global Interpreter Lock (GIL)#
The GIL prevents multiple threads from executing Python code simultaneously. This means:
Threads: Good for I/O (network, disk)
Processes: Needed for CPU parallelism
import threading
import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
# CPU-bound task for testing
def compute_intensive(n):
    """Simulate CPU-intensive work."""
    total = 0
    for i in range(n):
        total += i ** 2
    return total

# Sequential baseline
def sequential_processing(tasks):
    """Process tasks sequentially."""
    return [compute_intensive(n) for n in tasks]

# Threading (won't help for CPU-bound due to GIL)
def threaded_processing(tasks, max_workers=4):
    """Process tasks with threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(compute_intensive, tasks))

# Multiprocessing (bypasses GIL)
def multiprocess_processing(tasks, max_workers=4):
    """Process tasks with separate processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(compute_intensive, tasks))
# Benchmark
if __name__ == "__main__":
    tasks = [5_000_000] * 8  # 8 CPU-intensive tasks
    print("Processing 8 CPU-intensive tasks:")

    start = time.perf_counter()
    result_seq = sequential_processing(tasks)
    time_seq = time.perf_counter() - start
    print(f"Sequential: {time_seq:.2f}s")

    start = time.perf_counter()
    result_thread = threaded_processing(tasks, max_workers=4)
    time_thread = time.perf_counter() - start
    print(f"Threading (4): {time_thread:.2f}s (Speedup: {time_seq/time_thread:.2f}x)")

    start = time.perf_counter()
    result_multi = multiprocess_processing(tasks, max_workers=4)
    time_multi = time.perf_counter() - start
    print(f"Multiprocess (4): {time_multi:.2f}s (Speedup: {time_seq/time_multi:.2f}x)")

    print("\nKey Insight: Multiprocessing provides real CPU parallelism!")
# I/O-bound example: Threading shines here
import urllib.request
def fetch_url(url):
    """Simulate I/O-bound operation."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return len(response.read())
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        return 0
urls = [
'http://www.example.com',
'http://www.python.org',
'http://www.github.com',
'http://www.stackoverflow.com',
] * 3 # 12 requests total
# Sequential
print("Fetching 12 URLs:")
start = time.perf_counter()
sizes_seq = [fetch_url(url) for url in urls]
time_seq = time.perf_counter() - start
print(f"Sequential: {time_seq:.2f}s")
# Threaded (much better for I/O)
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    sizes_thread = list(executor.map(fetch_url, urls))
time_thread = time.perf_counter() - start
print(f"Threading: {time_thread:.2f}s (Speedup: {time_seq/time_thread:.2f}x)")
print("\nFor I/O-bound tasks, threading is ideal!")
Concurrency Decision Tree#
Is your task CPU-bound or I/O-bound?
│
├─ CPU-bound (computation, data processing)
│   │
│   ├─ Vectorizable?   → Use NumPy
│   ├─ Loops required? → Use Numba
│   └─ Multiple cores? → Use multiprocessing
│
└─ I/O-bound (network, disk, database)
    │
    ├─ Many concurrent operations? → Use asyncio
    └─ Mixed I/O and CPU? → Use threading
Part 6: Asynchronous I/O with asyncio#
asyncio enables concurrent I/O without threading overhead. Perfect for:
Web scraping
API calls
Database queries
Real-time applications
import asyncio
# Basic async example
async def fetch_data(url, delay):
    """Simulate async data fetching."""
    print(f"Fetching {url}...")
    await asyncio.sleep(delay)  # Simulate network delay
    print(f"Completed {url}")
    return f"Data from {url}"

async def sequential_async():
    """Fetch sequentially (awaits each)."""
    result1 = await fetch_data("API1", 2)
    result2 = await fetch_data("API2", 2)
    result3 = await fetch_data("API3", 2)
    return [result1, result2, result3]

async def concurrent_async():
    """Fetch concurrently (gather)."""
    tasks = [
        fetch_data("API1", 2),
        fetch_data("API2", 2),
        fetch_data("API3", 2),
    ]
    return await asyncio.gather(*tasks)
# Top-level await works in Jupyter/IPython; in a plain script, wrap these calls in asyncio.run()
print("Sequential async (total: ~6 seconds):")
start = time.perf_counter()
results_seq = await sequential_async()
time_seq = time.perf_counter() - start
print(f"Completed in {time_seq:.2f}s\n")
print("Concurrent async (total: ~2 seconds):")
start = time.perf_counter()
results_con = await concurrent_async()
time_con = time.perf_counter() - start
print(f"Completed in {time_con:.2f}s")
print(f"Speedup: {time_seq/time_con:.1f}x faster!")
# Real-world async pattern: Rate-limited API calls
async def rate_limited_fetch(semaphore, url, delay):
    """Fetch with concurrency limit."""
    async with semaphore:  # Limit concurrent requests
        print(f"Starting {url}")
        await asyncio.sleep(delay)
        print(f"Finished {url}")
        return f"Data from {url}"

async def fetch_many_with_limit(urls, max_concurrent=3):
    """Fetch many URLs with concurrency limit."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [rate_limited_fetch(semaphore, url, 1) for url in urls]
    return await asyncio.gather(*tasks)
# Fetch 10 URLs but only 3 at a time
urls = [f"URL{i}" for i in range(10)]
print("Fetching 10 URLs with max 3 concurrent:")
start = time.perf_counter()
results = await fetch_many_with_limit(urls, max_concurrent=3)
elapsed = time.perf_counter() - start
print(f"\nCompleted in {elapsed:.2f}s")
print(f"With 3 concurrent, 10 URLs take ~4 batches: ceil(10/3) = 4 seconds")
Async Best Practices#
Do:
Use async for I/O-bound concurrent operations
Use asyncio.gather() for parallel awaits
Use asyncio.Semaphore() for rate limiting
Use aiohttp for async HTTP requests (see the sketch after this list)
Don't:
Use async for CPU-bound tasks (blocks event loop)
Mix sync and async APIs carelessly
Forget to await async functions
Use time.sleep() instead of asyncio.sleep() in coroutines
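A minimal aiohttp sketch of the pattern above, assuming aiohttp is installed (pip install aiohttp); the URL is a placeholder:
import asyncio
import aiohttp

async def fetch(session, url):
    """Fetch one URL and return its body length."""
    async with session.get(url) as response:
        body = await response.text()
        return len(body)

async def main():
    urls = ["https://example.com"] * 5  # placeholder URLs
    # One session is reused for all requests (connection pooling)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

print(asyncio.run(main()))  # in a notebook, use: await main()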
Part 7: Cython - C Extension Speed#
Cython compiles Python-like code to C extensions for maximum performance.
When to use Cython:
Need C-level performance
Interfacing with C libraries
Tight loops that can't be vectorized
Numba isn't enough
Cython Example (Pseudocode)#
# fibonacci.pyx
def fib_cython(int n):
    cdef int a = 0, b = 1, temp
    cdef int i
    if n <= 1:          # guard matches the iterative Python version
        return n
    for i in range(n - 1):
        temp = a
        a = b
        b = temp + b
    return b
Compile with:
cython fibonacci.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing \
-I/usr/include/python3.x \
-o fibonacci.so fibonacci.c
Performance: 10-100x faster than pure Python for numerical code.
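In practice, most projects let setuptools drive the compilation instead of calling gcc by hand. A minimal sketch, assuming Cython is installed:
# setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("fibonacci.pyx"))
Build in place with python setup.py build_ext --inplace; afterwards, import fibonacci works like any other module.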
Part 8: Performance Optimization Checklist#
1. Profile First ✓#
Use cProfile to find hotspots
Use line_profiler for detailed analysis
Use memory_profiler for memory issues
2. Algorithm Optimization ✓#
Choose right data structure (dict vs list vs set)
Use appropriate algorithm (O(n log n) vs O(n²))
Cache repeated calculations (see the sketch after this list)
Avoid premature optimization
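One standard way to cache repeated calculations is functools.lru_cache; a minimal sketch:
from functools import lru_cache

@lru_cache(maxsize=None)   # memoize: each distinct n is computed only once
def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(100))            # fast: O(n) calls instead of O(2^n)
print(fib.cache_info())    # hit/miss statistics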
3. Data Structure Choice ✓#
Use set for membership testing (see the sketch after this list)
Use dict for lookups
Use collections.deque for queues
Use collections.Counter for counting
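A quick sketch of why the set choice matters for membership tests (timings vary by machine):
import timeit

items_list = list(range(100_000))
items_set = set(items_list)
target = 99_999   # worst case for the list: a full scan

t_list = timeit.timeit(lambda: target in items_list, number=1_000)
t_set = timeit.timeit(lambda: target in items_set, number=1_000)
print(f"list: {t_list:.4f}s   set: {t_set:.6f}s")  # O(n) scan vs O(1) hash lookup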
4. Vectorization ✓#
Replace loops with NumPy operations
Use broadcasting for element-wise ops
Avoid mixing loops and NumPy
5. Compilation ✓#
Use Numba for numerical code
Use Cython for C-level performance
Consider PyPy for long-running services
6. Concurrency ✓#
Multiprocessing for CPU-bound
Threading for I/O-bound (with caution)
Asyncio for high-concurrency I/O
7. Memory Optimization ✓#
Use generators for large sequences
Use __slots__ for many objects
Del unused variables
Use memory-mapped files for huge data
8. I/O Optimization ✓#
Batch database queries
Use connection pooling
Cache file reads
Use binary formats (pickle, parquet)
9. String Operations ✓#
Use join() instead of += in loops
Use f-strings for formatting
Use str.format() for templates
Compile regex patterns once (see the sketch after this list)
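The regex point as a minimal sketch:
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")   # compile once, at module level

def extract_dates(lines):
    # Reusing the compiled pattern avoids re-parsing the regex on every call
    return [m.group() for line in lines for m in DATE_RE.finditer(line)]

print(extract_dates(["released 2024-01-15", "updated 2024-06-30"]))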
10. Measure Results ✓#
Always benchmark optimizations
Use realistic data sizes
Test edge cases
Document performance characteristics
Part 9: Exercises#
Exercise 1: Optimize Matrix Multiplication (Difficulty: ★★★☆☆)#
Task: Implement matrix multiplication three ways and compare performance:
Pure Python with nested loops
NumPy vectorized
Numba JIT compiled
Test with 500×500 matrices and measure speedup.
Expected outcome: NumPy should be 50-100x faster, Numba similar to NumPy.
Exercise 2: Parallel File Processing (Difficulty: ★★★★☆)#
Task: Process 100 text files in parallel:
Read each file
Count word frequency
Return top 10 words
Implement using:
Sequential processing
ThreadPoolExecutor
ProcessPoolExecutor
Which is faster and why?
Exercise 3: Memory-Efficient Data Processing (Difficulty: ★★★☆☆)#
Task: Process a 1GB CSV file:
Read and parse without loading entire file
Calculate statistics (mean, median, std)
Use generator-based approach
Compare memory usage vs loading all data.
Exercise 4: Async Web Scraper (Difficulty: ★★★★☆)#
Task: Build an async web scraper:
Fetch 50 URLs concurrently
Limit to 10 concurrent connections
Handle errors gracefully
Track completion time
Compare with sequential approach.
Exercise 5: Cache Decorator (Difficulty: ★★★☆☆)#
Task: Implement a caching decorator with:
LRU cache (max size)
TTL (time-to-live)
Cache statistics (hits/misses)
Test with expensive function calls.
Exercise 6: Profile and Optimize (Difficulty: ★★★★★)#
Task: Given this slow code, optimize it:
def slow_function(data):
    result = []
    for item in data:
        if item not in result:
            result.append(item)
    return sorted(result)
Profile to find bottlenecks
Optimize (hint: use set)
Measure speedup
Document complexity improvement
Part 10: Self-Check Quiz#
Question 1#
What is the primary advantage of using generators over lists for large datasets?
A) Generators are faster
B) Generators use less memory
C) Generators are easier to write
D) Generators can be pickled
Answer
B) Generators use less memory
Explanation: Generators compute values on-demand and don't store all elements in memory, making them ideal for large or infinite sequences.
Question 2#
When should you use multiprocessing instead of threading in Python?
A) For I/O-bound tasks
B) For CPU-bound tasks
C) When you need shared memory
D) When you want faster startup
Answer
B) For CPU-bound tasks
Explanation: The GIL prevents threads from executing Python code in parallel on multiple cores. Multiprocessing bypasses the GIL by using separate processes.
Question 3#
What makes NumPy significantly faster than pure Python for numerical operations?
A) It's written in a faster language
B) It uses compiled C code and operates on contiguous memory
C) It automatically parallelizes operations
D) It caches all results
Answer
B) It uses compiled C code and operates on contiguous memory
Explanation: NumPy operations are implemented in C and work on contiguous memory blocks, avoiding Python interpreter overhead and enabling CPU cache efficiency.
Question 4#
What is the main limitation of asyncio in Python?
A) It's too complex
B) It doesnโt work with CPU-bound tasks
C) It requires special hardware
D) It only works on Linux
Answer
B) It doesn't work with CPU-bound tasks
Explanation: Asyncio is designed for I/O-bound concurrency. CPU-bound tasks block the event loop, preventing other tasks from running.
Question 5#
What does the @jit decorator in Numba do?
A) Distributes code across multiple machines
B) Compiles Python functions to machine code at runtime
C) Caches function results
D) Converts code to JavaScript
Answer
B) Compiles Python functions to machine code at runtime
Explanation: Numba's JIT (Just-In-Time) compiler translates Python functions to optimized machine code when first called, achieving C-like performance.
Key Takeaways#
Profile first: Don't guess where the bottleneck is
Algorithm matters: O(n) vs O(nยฒ) beats any micro-optimization
Use the right tool: NumPy for arrays, Numba for loops, asyncio for I/O
Memory is finite: Use generators and streaming for large data
GIL awareness: Multiprocessing for CPU, threading for I/O
Vectorization wins: Avoid Python loops on arrays when possible
Measure everything: Benchmark before and after optimization
Premature optimization: Focus on correctness first, then optimize
Compiled beats interpreted: Numba/Cython for critical hot paths
Concurrency ≠ Parallelism: Understand the difference
Common Mistakes to Avoid#
Optimizing before profiling - Wasting time on non-bottlenecks
Using threads for CPU-bound work - GIL prevents speedup
Loops on NumPy arrays - Defeats vectorization benefits
Blocking asyncio event loop - CPU work in async functions
Premature multiprocessing - Process overhead can be worse
Ignoring memory usage - OOM kills vs slow execution
Not caching expensive calls - Repeated work
String concatenation in loops - O(n²) behavior
Wrong data structure - list vs set for membership testing
Not testing at scale - Small data hides problems
Pro Tips#
Use cProfile + SnakeViz: Visual profiling is enlightening
Try PyPy: 5-10x speedup for long-running pure Python code
Learn NumPy broadcasting: Eliminates many loops
Cache expensive I/O: Redis, memcached, or functools.lru_cache
Use binary formats: Pickle, HDF5, Parquet beat JSON/CSV
Batch operations: Database queries, API calls
Measure memory: sys.getsizeof(), memory_profiler
Use __slots__: For classes with many instances
Compile regexes: re.compile() once, use many times
Monitor production: Profile in real environments
Whatโs Next?#
You've mastered high-performance Python! Next topics:
GPU Computing: CUDA, CuPy, JAX (next notebook!)
Distributed Computing: Dask, Ray, Spark
Advanced Profiling: perf, valgrind, flame graphs
C Extensions: ctypes, CFFI, Cython mastery
Alternative Runtimes: PyPy, GraalPython
Practice Projects:
Optimize a real application (10x+ speedup possible!)
Build a parallel data processing pipeline
Create a high-performance web service
Implement numerical algorithms in Numba
Remember: Fast code is good. Correct code is better. Maintainable code is best. Optimize where it matters!
Ready for GPU computing? Proceed to the CUDA & Parallel Computing notebook! 🚀