Hard Level: CUDA & GPU Parallel Computing#
Real-World Context#
The Problem: Modern CPUs have 4-16 cores. Modern GPUs have thousands of cores. For the right workloads, GPUs can be 10-100x faster than CPUs.
Where GPUs Dominate:
Deep Learning: Training neural networks (PyTorch, TensorFlow)
Scientific Computing: Physics simulations, climate modeling
Image/Video Processing: Real-time rendering, computer vision
Cryptography: Password cracking, blockchain mining
Financial Modeling: Monte Carlo simulations, risk analysis
Bioinformatics: Gene sequencing, protein folding
Why This Matters:
Speed: Train models in hours instead of weeks
Scale: Process billions of data points in real-time
Cost: One GPU can replace dozens of CPU cores
Energy: Higher performance per watt
What You'll Learn:
GPU architecture and CUDA programming model
PyCUDA for Python-CUDA integration
CuPy - NumPy for GPUs
Parallel algorithms and patterns
GPU memory management
Multi-GPU programming
Real-world optimization techniques
Part 1: GPU Architecture Fundamentals#
CPU vs GPU: Different Design Philosophy#
CPU (Central Processing Unit):
Few powerful cores (4-16)
High clock speed (3-5 GHz)
Large cache (MB of L1/L2/L3)
Low latency: Optimized for sequential tasks
Complex control logic: Branch prediction, out-of-order execution
GPU (Graphics Processing Unit):
Thousands of simple cores (2,000-10,000+)
Lower clock speed (1-2 GHz)
Small cache per core: Focus on throughput
High throughput: Optimized for parallel tasks
Simple control: SIMT (Single Instruction, Multiple Threads)
NVIDIA GPU Architecture#
GPU
│
├─ Streaming Multiprocessor (SM) × 80-100+
│   │
│   ├─ CUDA Cores × 64-128 per SM
│   ├─ Tensor Cores (for AI)
│   ├─ Shared Memory (fast, 48-96 KB)
│   ├─ L1 Cache
│   └─ Registers
│
├─ L2 Cache (shared, several MB)
│
└─ Global Memory (VRAM, 8-80 GB)
    - High bandwidth (1000+ GB/s)
    - High latency (100s of cycles)
CUDA Programming Model#
Key Concepts:
Kernel: Function that runs on GPU
Thread: Smallest execution unit
Block: Group of threads (up to 1024)
Grid: Collection of blocks
Grid
├─ Block(0,0)       Block(1,0)       Block(2,0)
│   ├─ Thread(0,0)   ├─ Thread(0,0)   ├─ Thread(0,0)
│   ├─ Thread(1,0)   ├─ Thread(1,0)   ├─ Thread(1,0)
│   ├─ Thread(2,0)   ├─ Thread(2,0)   ├─ Thread(2,0)
│   └─ ...           └─ ...           └─ ...
│
└─ Block(0,1)       Block(1,1)       Block(2,1)
    └─ ...           └─ ...           └─ ...
Memory Hierarchy (fast to slow):
Registers: Per-thread, fastest (1 cycle)
Shared Memory: Per-block, very fast (1-2 cycles)
L1/L2 Cache: Automatic, fast
Global Memory: Slowest (100s of cycles) but largest
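To make these concepts concrete, here is a minimal kernel sketch. The section's toolbox is PyCUDA/CuPy, but as an assumption this sketch uses Numba's CUDA JIT (pip install numba, plus a CUDA GPU) because it keeps the whole example in Python. Each thread computes its global index from its block and thread coordinates, exactly as in the grid diagram above:

# Minimal kernel sketch with Numba (an assumption: numba + a CUDA GPU are available)
from numba import cuda
import numpy as np

@cuda.jit
def add_one(arr):
    # Global index = blockIdx.x * blockDim.x + threadIdx.x
    i = cuda.grid(1)
    if i < arr.size:  # Guard: the grid may be larger than the array
        arr[i] += 1.0

data = np.zeros(1_000_000, dtype=np.float32)
threads_per_block = 256  # A multiple of 32 (the warp size)
blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block

d_data = cuda.to_device(data)                        # CPU → GPU
add_one[blocks_per_grid, threads_per_block](d_data)  # Launch a grid of blocks
print(d_data.copy_to_host()[:5])                     # GPU → CPU

Note the launch configuration: the kernel itself never loops over the array; parallelism comes entirely from launching one thread per element.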
Part 2: Checking GPU Availability#
# Check if CUDA is available
import subprocess
import sys

def check_cuda():
    """Check CUDA and GPU availability."""
    print("=" * 60)
    print("CUDA & GPU Availability Check")
    print("=" * 60)

    # Check nvidia-smi
    try:
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        if result.returncode == 0:
            print("\n✓ NVIDIA GPU detected!\n")
            print(result.stdout)
        else:
            print("\n✗ nvidia-smi not available")
    except FileNotFoundError:
        print("\n✗ NVIDIA drivers not installed")

    # Check PyTorch CUDA
    try:
        import torch
        cuda_available = torch.cuda.is_available()
        print(f"\nPyTorch CUDA available: {cuda_available}")
        if cuda_available:
            print(f"GPU Device: {torch.cuda.get_device_name(0)}")
            print(f"CUDA Version: {torch.version.cuda}")
            print(f"Number of GPUs: {torch.cuda.device_count()}")
    except ImportError:
        print("\nPyTorch not installed. Install with: pip install torch")

    # Check CuPy
    try:
        import cupy as cp
        print(f"\nCuPy available: True")
        print(f"CuPy CUDA version: {cp.cuda.runtime.runtimeGetVersion()}")
    except ImportError:
        print("\nCuPy not installed. Install with: pip install cupy-cuda11x")
    except Exception as e:
        print(f"\nCuPy error: {e}")

    print("\n" + "=" * 60)

check_cuda()
Note: If you don't have a GPU, many cloud platforms offer GPU access:
Google Colab: Free T4 GPU (15GB VRAM)
Kaggle: Free P100 GPU (30 hours/week)
AWS/GCP/Azure: Pay-per-use GPU instances
Lambda Labs: Specialized GPU cloud
Part 3: CuPy - NumPy for GPUs#
CuPy is a NumPy-compatible library that runs on NVIDIA GPUs. It's the easiest way to start GPU computing in Python.
# CuPy basics (pseudocode if GPU not available)
import numpy as np
import time

try:
    import cupy as cp
    GPU_AVAILABLE = True
except ImportError:
    print("CuPy not available. Showing pseudocode examples.")
    GPU_AVAILABLE = False

if GPU_AVAILABLE:
    # Example 1: Array creation and operations
    print("Example 1: Basic Operations")
    print("=" * 40)

    # CPU (NumPy)
    x_cpu = np.array([1, 2, 3, 4, 5])
    y_cpu = np.array([6, 7, 8, 9, 10])
    z_cpu = x_cpu + y_cpu
    print(f"NumPy result: {z_cpu}")

    # GPU (CuPy) - Same syntax!
    x_gpu = cp.array([1, 2, 3, 4, 5])
    y_gpu = cp.array([6, 7, 8, 9, 10])
    z_gpu = x_gpu + y_gpu
    print(f"CuPy result: {z_gpu}")

    # Transfer between CPU and GPU
    cpu_array = cp.asnumpy(z_gpu)  # GPU → CPU
    gpu_array = cp.asarray(z_cpu)  # CPU → GPU

    print("\nExample 2: Performance Comparison")
    print("=" * 40)

    # Large matrix operations
    size = 10000

    # CPU
    a_cpu = np.random.rand(size, size).astype(np.float32)
    b_cpu = np.random.rand(size, size).astype(np.float32)
    start = time.perf_counter()
    c_cpu = np.dot(a_cpu, b_cpu)
    time_cpu = time.perf_counter() - start

    # GPU
    a_gpu = cp.random.rand(size, size, dtype=cp.float32)
    b_gpu = cp.random.rand(size, size, dtype=cp.float32)

    # Warm up
    _ = cp.dot(a_gpu, b_gpu)
    cp.cuda.Stream.null.synchronize()  # Wait for GPU

    start = time.perf_counter()
    c_gpu = cp.dot(a_gpu, b_gpu)
    cp.cuda.Stream.null.synchronize()  # Important: wait for GPU to finish!
    time_gpu = time.perf_counter() - start

    print(f"Matrix multiplication ({size}×{size}):")
    print(f"NumPy (CPU): {time_cpu:.4f}s")
    print(f"CuPy (GPU): {time_gpu:.4f}s")
    print(f"Speedup: {time_cpu/time_gpu:.1f}x faster!")
else:
    print("""
CuPy Example (Pseudocode):

import cupy as cp

# Create arrays on GPU
x_gpu = cp.array([1, 2, 3, 4, 5])
y_gpu = cp.array([6, 7, 8, 9, 10])

# Operations run on GPU automatically
z_gpu = x_gpu + y_gpu

# Transfer data: GPU ↔ CPU
cpu_array = cp.asnumpy(z_gpu)      # GPU → CPU
gpu_array = cp.asarray(cpu_array)  # CPU → GPU

# All NumPy operations work!
result = cp.mean(x_gpu)

Speedup: Typically 10-100x for large arrays
""")
CuPy Best Practices#
Minimize CPU ↔ GPU transfers: Keep data on GPU (see the sketch after this list)
Use synchronize(): GPU operations are async
Batch operations: Single large operation > many small ones
Use float32: Twice as fast as float64 on most GPUs
Reuse arrays: Avoid frequent allocation/deallocation
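A small sketch tying these practices together (assuming CuPy and a GPU, as above): every intermediate stays on the device, the dtype is float32, and only a single scalar crosses the PCIe bus at the end.

# Sketch: keep intermediate results on the GPU (assumes CuPy + GPU)
import cupy as cp

x = cp.random.rand(10_000_000, dtype=cp.float32)  # float32: fast on GPU

# Chain operations on-device; no transfers between steps
y = cp.sqrt(x) + cp.log1p(x)

# Only one scalar crosses CPU ↔ GPU, at the very end
result = float(cp.mean(y))
print(result)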
# Advanced CuPy: Custom kernels
if GPU_AVAILABLE:
    # Element-wise kernel (like NumPy ufunc)
    from cupy import ElementwiseKernel

    # Kernel definition (C++ syntax)
    add_kernel = ElementwiseKernel(
        'float32 x, float32 y',  # Input types
        'float32 z',             # Output type
        'z = x + y',             # Operation
        'add_kernel'             # Name
    )

    # Use it
    x = cp.arange(1000000, dtype=cp.float32)
    y = cp.arange(1000000, dtype=cp.float32)
    z = add_kernel(x, y)
    print(f"Custom kernel result: {z[:5]}...")

    # More complex: squared difference
    squared_diff_kernel = ElementwiseKernel(
        'float32 x, float32 y',
        'float32 z',
        'z = (x - y) * (x - y)',
        'squared_diff'
    )
    result = squared_diff_kernel(x, y)
    print(f"Squared difference: {result[:5]}...")
else:
    print("Custom CuPy kernels allow writing GPU code in C++ syntax!")
Part 4: PyTorch GPU Acceleration#
PyTorch provides the easiest path to GPU computing for deep learning and scientific computing.
import time

try:
    import torch
    TORCH_AVAILABLE = True
except ImportError:
    print("PyTorch not installed. Install with: pip install torch")
    TORCH_AVAILABLE = False

if TORCH_AVAILABLE:
    print("PyTorch GPU Example")
    print("=" * 40)

    # Check device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")

    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

    # Create tensors on GPU
    size = 5000

    # Method 1: Create on the target device directly
    x_gpu = torch.rand(size, size, device=device)
    y_gpu = torch.rand(size, size, device=device)

    # Method 2: Create on CPU, then move
    x_cpu = torch.rand(size, size)
    if torch.cuda.is_available():
        x_gpu = x_cpu.to('cuda')  # or .cuda()

    # Benchmark
    if torch.cuda.is_available():
        # GPU
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        z_gpu = torch.mm(x_gpu, y_gpu)  # Matrix multiply
        end.record()
        torch.cuda.synchronize()
        time_gpu = start.elapsed_time(end) / 1000  # ms to seconds

        # CPU
        x_cpu = torch.rand(size, size)
        y_cpu = torch.rand(size, size)
        start_cpu = time.perf_counter()
        z_cpu = torch.mm(x_cpu, y_cpu)
        time_cpu = time.perf_counter() - start_cpu

        print(f"\nMatrix multiplication ({size}×{size}):")
        print(f"CPU: {time_cpu:.4f}s")
        print(f"GPU: {time_gpu:.4f}s")
        print(f"Speedup: {time_cpu/time_gpu:.1f}x faster!")
    else:
        print("\nNo GPU available for benchmarking")
else:
    print("""
PyTorch GPU Example (Pseudocode):

import torch

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create tensors on GPU
x = torch.rand(1000, 1000, device='cuda')
y = torch.rand(1000, 1000, device='cuda')

# All operations run on GPU
z = torch.mm(x, y)

# Move between devices
x_cpu = x.cpu()       # GPU → CPU
x_gpu = x_cpu.cuda()  # CPU → GPU
""")
Part 5: Parallel Algorithm Patterns#
Certain algorithms are naturally parallel and map perfectly to GPUs.
Pattern 1: Map (Element-wise Operations)#
Apply same operation to each element independently.
Examples: Array addition, sigmoid activation, image filters
# CPU: Sequential
for i in range(n):
    output[i] = func(input[i])

# GPU: Parallel (CUDA-style pseudocode; each thread handles one element)
thread_id = blockIdx.x * blockDim.x + threadIdx.x
if thread_id < n:
    output[thread_id] = func(input[thread_id])
Pattern 2: Reduce (Aggregation)#
Combine all elements into single value.
Examples: Sum, max, min, mean
Tree-based reduction:
[1, 2, 3, 4, 5, 6, 7, 8]
 └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘    Step 1: Pair-wise
   3     7     11    15
   └──┬──┘     └──┬──┘      Step 2: Pair-wise
     10          26
      └─────┬─────┘         Step 3: Final
            36
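The same tree can be written out in a few lines of plain Python; this is only an illustration of the pairing order, not GPU code:

# Tree-based reduction: CPU illustration of the diagram above
data = [1, 2, 3, 4, 5, 6, 7, 8]
while len(data) > 1:
    # Sum adjacent pairs; carry an odd leftover element forward unchanged
    pairs = [data[i] + data[i + 1] for i in range(0, len(data) - 1, 2)]
    if len(data) % 2:
        pairs.append(data[-1])
    data = pairs
print(data[0])  # 36

On a GPU each level of the tree runs in parallel, so summing n elements takes about log2(n) steps instead of n.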
Pattern 3: Scan (Prefix Sum)#
Compute running aggregation.
Examples: Cumulative sum, histogram, sorting
Input: [1, 2, 3, 4, 5]
Output: [1, 3, 6, 10, 15] (cumulative sum)
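On the GPU, the library scan is a one-liner; a sketch assuming CuPy (torch.cumsum behaves the same way):

# Prefix sum (scan) on GPU, assuming CuPy is available
import cupy as cp

x = cp.array([1, 2, 3, 4, 5])
print(cp.cumsum(x))  # [ 1  3  6 10 15]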
Pattern 4: Stencil (Neighbor Operations)#
Compute based on neighbors in structured grid.
Examples: Convolution, blur, diffusion
3×3 kernel:
┌───┬───┬───┐
│ 1 │ 2 │ 1 │
├───┼───┼───┤
│ 2 │ 4 │ 2 │   Apply to each pixel
├───┼───┼───┤
│ 1 │ 2 │ 1 │
└───┴───┴───┘
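Applying this exact kernel is a standard 2D convolution. A sketch using SciPy on the CPU (an assumption: scipy is installed); for the GPU, cupyx.scipy.ndimage exposes the same convolve API on CuPy arrays:

# Stencil as a 2D convolution (sketch; assumes scipy is installed)
import numpy as np
from scipy.ndimage import convolve

kernel = np.array([[1, 2, 1],
                   [2, 4, 2],
                   [1, 2, 1]], dtype=np.float32) / 16  # Gaussian-like blur

image = np.random.rand(64, 64).astype(np.float32)
blurred = convolve(image, kernel, mode='nearest')  # Each output pixel = weighted sum of neighbors
print(blurred.shape)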
Part 6: GPU Memory Management#
Efficient memory usage is crucial for GPU performance.
if TORCH_AVAILABLE and torch.cuda.is_available():
    print("GPU Memory Management")
    print("=" * 40)

    # Memory stats
    def print_gpu_memory():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        total = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"Allocated: {allocated:.2f} GB")
        print(f"Reserved: {reserved:.2f} GB")
        print(f"Total: {total:.2f} GB")

    print("\nInitial state:")
    print_gpu_memory()

    # Allocate memory
    print("\nAfter creating 5000×5000 tensor:")
    x = torch.rand(5000, 5000, device='cuda')
    print_gpu_memory()

    # Free memory
    del x
    torch.cuda.empty_cache()  # Release reserved memory
    print("\nAfter deleting tensor and clearing cache:")
    print_gpu_memory()

    # Memory-efficient operations
    print("\n" + "=" * 40)
    print("Memory-Efficient Patterns:")
    print("=" * 40)

    # In-place operations save memory
    x = torch.rand(1000, 1000, device='cuda')

    # Bad: creates a new tensor
    # y = x + 1

    # Good: in-place (note the trailing underscore)
    x.add_(1)  # Modifies x directly

    # Context manager to select the device for new allocations
    with torch.cuda.device(0):
        temp = torch.rand(1000, 1000, device='cuda')
    # Note: the context manager only selects the device; temp is freed
    # once no references to it remain (e.g., after del temp)

    print("\nMemory Best Practices:")
    print("1. Use in-place operations: tensor.add_() vs tensor + 1")
    print("2. Delete large tensors when done: del tensor")
    print("3. Clear cache periodically: torch.cuda.empty_cache()")
    print("4. Use mixed precision (float16): Halves memory usage")
    print("5. Batch processing: Process data in chunks")
    print("6. Gradient checkpointing: Trade compute for memory")
else:
    print("GPU Memory Management (Conceptual):")
    print("""
GPU memory is limited (8-80 GB typical).

Best Practices:
1. Monitor: torch.cuda.memory_allocated()
2. Free: del tensor, torch.cuda.empty_cache()
3. In-place ops: tensor.add_(1) instead of tensor + 1
4. Mixed precision: use float16 when possible
5. Batch processing: don't load all data at once
""")
Part 7: Multi-GPU Programming#
Scale to multiple GPUs for even more performance.
if TORCH_AVAILABLE:
    n_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
    print(f"Multi-GPU Programming (Found {n_gpus} GPU(s))")
    print("=" * 40)

    if n_gpus > 1:
        # Data Parallel: same model, split data
        print("\nData Parallelism Example:")

        # Simple model
        class SimpleModel(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = torch.nn.Linear(1000, 1000)

            def forward(self, x):
                return self.linear(x)

        model = SimpleModel()

        # Wrap with DataParallel
        model = torch.nn.DataParallel(model)
        model = model.cuda()

        # Forward pass automatically splits across GPUs
        x = torch.rand(128, 1000).cuda()  # Batch size 128
        output = model(x)  # Splits batch across GPUs
        print(f"Model on {torch.cuda.device_count()} GPUs")
        print(f"Input: {x.shape}, Output: {output.shape}")

        # DistributedDataParallel (better for multi-node)
        print("\nFor production, use DistributedDataParallel:")
        print("""
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group
torch.distributed.init_process_group(backend='nccl')

# Wrap model
model = DDP(model, device_ids=[local_rank])
""")
    else:
        print("""
Multi-GPU Strategies:

1. Data Parallelism:
   - Same model replicated on each GPU
   - Different data batches
   - Most common approach

2. Model Parallelism:
   - Split model across GPUs
   - For models too large for a single GPU
   - More complex implementation

3. Pipeline Parallelism:
   - Different stages on different GPUs
   - Overlaps computation

Example:
model = torch.nn.DataParallel(model)  # Simple!
""")
else:
    print("Multi-GPU programming requires PyTorch")
Part 8: Real-World GPU Applications#
Application 1: Image Processing#
# GPU-accelerated image filtering
import numpy as np
import time

if GPU_AVAILABLE:
    import cupy as cp

    # Create fake image (1920×1080, RGB)
    image_cpu = np.random.rand(1080, 1920, 3).astype(np.float32)
    image_gpu = cp.asarray(image_cpu)

    # Gaussian blur kernel
    def gaussian_blur_cpu(image):
        """CPU version (placeholder; a real version would use scipy.ndimage)."""
        kernel = np.array([[1, 2, 1],
                           [2, 4, 2],
                           [1, 2, 1]], dtype=np.float32) / 16
        # Simplified: convolution omitted
        return image

    # Custom GPU kernel standing in for the blur (simplified per-pixel op)
    blur_kernel = cp.ElementwiseKernel(
        'float32 x',
        'float32 y',
        'y = x * 0.8',  # Simplified
        'blur'
    )

    # Benchmark
    n_iter = 100

    # GPU
    start = time.perf_counter()
    for _ in range(n_iter):
        result_gpu = blur_kernel(image_gpu)
    cp.cuda.Stream.null.synchronize()
    time_gpu = time.perf_counter() - start

    print(f"Image Processing ({n_iter} iterations):")
    print(f"GPU: {time_gpu:.4f}s ({time_gpu/n_iter*1000:.2f}ms per frame)")
    print(f"FPS: {n_iter/time_gpu:.1f} frames/second")
else:
    print("GPU image processing can achieve 100+ FPS for HD video!")
Application 2: Monte Carlo Simulation#
# GPU-accelerated Monte Carlo
if TORCH_AVAILABLE and torch.cuda.is_available():
    def monte_carlo_pi_gpu(n_samples):
        """Estimate π using GPU Monte Carlo."""
        # Generate random points on GPU
        x = torch.rand(n_samples, device='cuda')
        y = torch.rand(n_samples, device='cuda')

        # Check if inside unit circle
        inside = (x**2 + y**2) <= 1.0

        # Estimate π
        pi_estimate = 4.0 * inside.float().mean().item()
        return pi_estimate

    # Run simulation
    n_samples = 100_000_000  # 100 million!

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    pi = monte_carlo_pi_gpu(n_samples)
    end.record()
    torch.cuda.synchronize()
    elapsed = start.elapsed_time(end) / 1000

    print(f"\nMonte Carlo π Estimation:")
    print(f"Samples: {n_samples:,}")
    print(f"Result: π ≈ {pi:.6f} (true: 3.141593)")
    print(f"Error: {abs(pi - 3.141593):.6f}")
    print(f"Time: {elapsed:.4f}s")
    print(f"Throughput: {n_samples/elapsed/1e6:.1f} million samples/second")
else:
    print("Monte Carlo simulations benefit hugely from GPU parallelism!")
Part 9: GPU Optimization Techniques#
1. Coalesced Memory Access#
Problem: GPUs load memory in 128-byte chunks. Random access wastes bandwidth.
Bad (Strided):
Thread 0: array[0]
Thread 1: array[100]
Thread 2: array[200]   → many separate memory transactions

Good (Coalesced):
Thread 0: array[0]
Thread 1: array[1]
Thread 2: array[2]     → one memory transaction
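You can observe the effect from Python: a strided view touches far more memory segments per element than a contiguous one. A rough sketch (assuming CuPy and a GPU; exact numbers vary by card):

# Rough sketch: contiguous vs strided access cost (assumes CuPy + GPU)
import cupy as cp
import time

x = cp.random.rand(50_000_000, dtype=cp.float32)

for stride in (1, 32):
    view = x[::stride]          # stride 1 = coalesced; stride 32 = scattered
    _ = view * 2.0              # warm-up
    cp.cuda.Stream.null.synchronize()
    t0 = time.perf_counter()
    for _ in range(100):
        _ = view * 2.0
    cp.cuda.Stream.null.synchronize()
    # The strided view has 1/32 the elements, yet is nowhere near 32× faster:
    # most of each 128-byte transaction is wasted bandwidth
    print(f"stride {stride}: {time.perf_counter() - t0:.4f}s")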
2. Occupancy Optimization#
Occupancy = Active warps / Maximum possible warps
Higher occupancy hides memory latency better.
Factors:
Threads per block (multiple of 32)
Registers per thread (fewer is better)
Shared memory usage (less is better)
Sweet spot: 128-256 threads per block
3. Kernel Fusion#
Combine multiple operations to reduce kernel launches.
# Bad: three separate element-wise kernels and two temporary arrays
y = x + 1
z = y * 2
w = z - 3

# Better: one expression avoids the named temporaries
# (in CuPy this still launches several kernels; see cp.fuse below for true fusion)
w = (x + 1) * 2 - 3
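CuPy can genuinely fuse the chain into a single kernel with the cupy.fuse decorator; a sketch, assuming CuPy and a GPU:

# True kernel fusion with cupy.fuse (assumes CuPy + GPU)
import cupy as cp

@cp.fuse()
def fused_op(x):
    return (x + 1) * 2 - 3  # Compiled into one element-wise kernel

x = cp.arange(1_000_000, dtype=cp.float32)
w = fused_op(x)
print(w[:5])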
4. Mixed Precision#
Use float16 when possible:
2x less memory
2x faster on Tensor Cores
Minimal accuracy loss
model = model.half() # Convert to float16
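For training, PyTorch automates this with torch.cuda.amp. A minimal sketch of one mixed-precision training step (assuming a CUDA GPU; the tiny model, data, and optimizer here are placeholders):

# Minimal mixed-precision training step (assumes PyTorch + a CUDA GPU)
import torch

model = torch.nn.Linear(1000, 10).cuda()      # Placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()          # Rescales loss to avoid float16 underflow

x = torch.rand(64, 1000, device='cuda')       # Placeholder data
y = torch.rand(64, 10, device='cuda')

with torch.cuda.amp.autocast():               # Eligible ops run in float16
    loss = loss_fn(model(x), y)

optimizer.zero_grad()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()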
Part 10: Exercises#
Exercise 1: Vector Addition (Difficulty: ★★☆☆☆)#
Task: Implement vector addition on GPU using CuPy or PyTorch:
Create two large vectors (10 million elements)
Add them on CPU and GPU
Measure and compare performance
Verify results are identical
Exercise 2: Matrix Multiplication Optimization (Difficulty: ★★★★☆)#
Task: Compare different matrix multiplication methods:
Pure Python (nested loops)
NumPy (CPU)
CuPy or PyTorch (GPU)
Mixed precision (float16 on GPU)
Test with various sizes and plot speedup vs matrix size.
Exercise 3: Image Convolution (Difficulty: ★★★★☆)#
Task: Implement 2D convolution on GPU:
Load an image
Apply various filters (blur, sharpen, edge detection)
Compare CPU vs GPU performance
Implement as custom CuPy kernel
Exercise 4: Parallel Reduction (Difficulty: ★★★★☆)#
Task: Implement parallel sum reduction:
Create array of 100 million numbers
Implement tree-based reduction
Compare with built-in sum
Measure throughput (GB/s)
Exercise 5: Multi-GPU Training (Difficulty: ★★★★★)#
Task: If you have multiple GPUs:
Create a simple neural network
Implement data-parallel training
Measure speedup vs single GPU
Monitor GPU utilization
Exercise 6: Memory Bandwidth Test (Difficulty: ★★★☆☆)#
Task: Measure GPU memory bandwidth:
Copy large arrays between CPU and GPU
Measure transfer speed (GB/s)
Compare with GPU specs
Identify bottlenecks (PCIe vs GPU memory)
Part 11: Self-Check Quiz#
Question 1#
Why are GPUs faster than CPUs for parallel workloads?
A) Higher clock speed
B) Thousands of cores for massive parallelism
C) Larger cache
D) Better branch prediction
Answer
B) Thousands of cores for massive parallelism

Explanation: GPUs sacrifice per-core performance for massive parallelism, with thousands of simpler cores that excel at data-parallel tasks.
Question 2#
What is the main bottleneck when using GPUs?
A) Computation speed
B) Data transfer between CPU and GPU
C) Power consumption
D) Programming difficulty
Answer
B) Data transfer between CPU and GPU

Explanation: PCIe bandwidth is limited (16-32 GB/s), much slower than GPU memory bandwidth (1000+ GB/s). Minimize CPU ↔ GPU transfers!
Question 3#
What does synchronize() do in GPU programming?
A) Copies data to GPU
B) Waits for GPU operations to complete
C) Frees GPU memory
D) Compiles kernels
Answer
B) Waits for GPU operations to complete

Explanation: GPU operations are asynchronous. synchronize() ensures operations finish before continuing, which is necessary for accurate timing.
Question 4#
When should you use float16 instead of float32 on GPU?
A) Always, it's always faster
B) Never, it's less accurate
C) When memory is limited and precision loss is acceptable
D) Only for integer operations
Answer
C) When memory is limited and precision loss is acceptable

Explanation: float16 uses half the memory and is faster on Tensor Cores, but has less precision. Good for deep learning; check carefully for other applications.
Question 5#
What is DataParallel used for?
A) Training different models on different GPUs
B) Splitting same model across multiple GPUs
C) Distributing data batches across multiple GPUs with same model
D) Compressing model size
Answer
C) Distributing data batches across multiple GPUs with same model

Explanation: DataParallel replicates the model on each GPU and splits the batch across GPUs, then combines results. It is the most common multi-GPU approach.
Key Takeaways#
GPUs excel at parallelism: Thousands of cores for data-parallel tasks
Transfer is expensive: Keep data on GPU, minimize CPU ↔ GPU copies
CuPy = NumPy on GPU: Easiest way to start GPU computing
PyTorch for deep learning: Seamless GPU acceleration
Memory is limited: Monitor usage, use float16 when possible
Synchronization matters: GPU ops are async, synchronize for timing
Batch operations: Large batches amortize launch overhead
Coalesced access: Contiguous memory access is critical
Multi-GPU scales: DataParallel for easy multi-GPU training
Right tool for job: GPU for parallel, CPU for sequential
Common Mistakes#
Frequent CPU ↔ GPU transfers: Keep data on GPU
Small workloads: Overhead dominates, GPU slower than CPU
Forgetting synchronize(): Timing without sync is wrong
Memory leaks: Delete tensors, clear cache
Wrong precision: float64 on GPU is slow
Sequential operations: GPU needs parallelism
Not profiling: Assumptions about bottlenecks
Ignoring occupancy: Too many/few threads per block
Pro Tips#
Use Google Colab: Free GPU access for learning
Profile with nvprof: Identify kernel bottlenecks
torch.cuda.amp: Automatic mixed precision
Pin memory: Faster CPU → GPU transfers (see the sketch after this list)
Async transfers: Overlap compute and transfer
NVIDIA Nsight: Visual profiling tool
Benchmarking: Warm up kernels, multiple runs
GPU utils: nvidia-smi for monitoring
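The pinned-memory and async-transfer tips combine naturally. A sketch of a page-locked, asynchronous host-to-device copy in PyTorch (assuming a CUDA GPU):

# Pinned (page-locked) host memory + async copy (assumes PyTorch + a CUDA GPU)
import torch

cpu_tensor = torch.rand(10_000_000, pin_memory=True)   # Page-locked host allocation
gpu_tensor = cpu_tensor.to('cuda', non_blocking=True)  # Async copy; can overlap with compute
torch.cuda.synchronize()                               # Wait before timing or reading results
print(gpu_tensor.device)

Pinned memory lets the driver use DMA for the transfer, and non_blocking=True means the copy can overlap with other GPU work already queued.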
Whatโs Next?#
You're now ready for GPU-accelerated computing!
Advanced Topics:
CUDA C++: Write custom kernels for maximum performance
JAX: Composable transformations for ML research
TensorRT: Optimize models for inference
Distributed Training: Multi-node GPU clusters
GPU Optimization: Advanced memory patterns
Projects to Build:
Real-time image processing pipeline
GPU-accelerated data science workflow
Deep learning model from scratch
Physics simulation (N-body, fluid dynamics)
Cryptocurrency miner (educational!)
Remember: GPUs are powerful but not magic. Profile first, optimize bottlenecks, and use the right tool for each task!
Congratulations on completing the Education Playground curriculum! 🎉