Hard Level: CUDA & GPU Parallel Computing#

Real-World Context#

The Problem: Modern CPUs typically have 4-16 cores; modern GPUs have thousands. For the right workloads, GPUs can be 10-100x faster than CPUs.

Where GPUs Dominate:

  • Deep Learning: Training neural networks (PyTorch, TensorFlow)

  • Scientific Computing: Physics simulations, climate modeling

  • Image/Video Processing: Real-time rendering, computer vision

  • Cryptography: Password cracking, blockchain mining

  • Financial Modeling: Monte Carlo simulations, risk analysis

  • Bioinformatics: Gene sequencing, protein folding

Why This Matters:

  • Speed: Train models in hours instead of weeks

  • Scale: Process billions of data points in real-time

  • Cost: One GPU can replace dozens of CPU cores

  • Energy: Higher performance per watt

What You'll Learn:

  • GPU architecture and CUDA programming model

  • PyCUDA for Python-CUDA integration

  • CuPy - NumPy for GPUs

  • Parallel algorithms and patterns

  • GPU memory management

  • Multi-GPU programming

  • Real-world optimization techniques


Part 1: GPU Architecture Fundamentals#

CPU vs GPU: Different Design Philosophy#

CPU (Central Processing Unit):

  • Few powerful cores (4-16)

  • High clock speed (3-5 GHz)

  • Large cache (MB of L1/L2/L3)

  • Low latency: Optimized for sequential tasks

  • Complex control logic: Branch prediction, out-of-order execution

GPU (Graphics Processing Unit):

  • Thousands of simple cores (2,000-10,000+)

  • Lower clock speed (1-2 GHz)

  • Small cache per core: Focus on throughput

  • High throughput: Optimized for parallel tasks

  • Simple control: SIMT (Single Instruction, Multiple Threads)

NVIDIA GPU Architecture#

GPU
│
├─ Streaming Multiprocessor (SM) × 80-100+
│  │
│  ├─ CUDA Cores × 64-128 per SM
│  ├─ Tensor Cores (for AI)
│  ├─ Shared Memory (fast, 48-96 KB)
│  ├─ L1 Cache
│  └─ Registers
│
├─ L2 Cache (shared, several MB)
│
└─ Global Memory (VRAM, 8-80 GB)
   - High bandwidth (1000+ GB/s)
   - High latency (100s of cycles)

CUDA Programming Model#

Key Concepts:

  1. Kernel: Function that runs on GPU

  2. Thread: Smallest execution unit

  3. Block: Group of threads (up to 1024)

  4. Grid: Collection of blocks

Grid
├─ Block(0,0)     Block(1,0)     Block(2,0)
│  ├─ Thread(0,0) ├─ Thread(0,0) ├─ Thread(0,0)
│  ├─ Thread(1,0) ├─ Thread(1,0) ├─ Thread(1,0)
│  ├─ Thread(2,0) ├─ Thread(2,0) ├─ Thread(2,0)
│  └─ ...         └─ ...         └─ ...
│
├─ Block(0,1)     Block(1,1)     Block(2,1)
   └─ ...         └─ ...         └─ ...
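Putting the four concepts together, here is a minimal PyCUDA sketch (an illustration, assuming PyCUDA and a working NVIDIA driver are installed): a kernel is launched over a grid of blocks, and each thread computes its global index from blockIdx, blockDim and threadIdx.

# Minimal PyCUDA sketch (assumes `pip install pycuda` and an NVIDIA driver)
import numpy as np
import pycuda.autoinit              # creates a CUDA context on import
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void add(float *out, const float *a, const float *b, int n)
{
    // Global thread index: block offset + position within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard threads past the end of the array
        out[i] = a[i] + b[i];
}
""")
add = mod.get_function("add")

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.empty_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block  # enough blocks to cover n
add(drv.Out(out), drv.In(a), drv.In(b), np.int32(n),
    block=(threads_per_block, 1, 1), grid=(blocks, 1))

assert np.allclose(out, a + b)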

Memory Hierarchy (fast to slow):

  1. Registers: Per-thread, fastest (1 cycle)

  2. Shared Memory: Per-block, very fast (1-2 cycles)

  3. L1/L2 Cache: Automatic, fast

  4. Global Memory: Slowest (100s of cycles) but largest


Part 2: Checking GPU Availability#

# Check if CUDA is available
import subprocess
import sys

def check_cuda():
    """Check CUDA and GPU availability."""
    print("=" * 60)
    print("CUDA & GPU Availability Check")
    print("=" * 60)
    
    # Check nvidia-smi
    try:
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        if result.returncode == 0:
            print("\n✓ NVIDIA GPU detected!\n")
            print(result.stdout)
        else:
            print("\n✗ nvidia-smi not available")
    except FileNotFoundError:
        print("\n✗ NVIDIA drivers not installed")
    
    # Check PyTorch CUDA
    try:
        import torch
        cuda_available = torch.cuda.is_available()
        print(f"\nPyTorch CUDA available: {cuda_available}")
        if cuda_available:
            print(f"GPU Device: {torch.cuda.get_device_name(0)}")
            print(f"CUDA Version: {torch.version.cuda}")
            print(f"Number of GPUs: {torch.cuda.device_count()}")
    except ImportError:
        print("\nPyTorch not installed. Install with: pip install torch")
    
    # Check CuPy
    try:
        import cupy as cp
        print(f"\nCuPy available: True")
        print(f"CuPy CUDA version: {cp.cuda.runtime.runtimeGetVersion()}")
    except ImportError:
        print("\nCuPy not installed. Install with: pip install cupy-cuda11x")
    except Exception as e:
        print(f"\nCuPy error: {e}")
    
    print("\n" + "=" * 60)

check_cuda()

Note: If you don't have a GPU, many cloud platforms offer GPU access:

  • Google Colab: Free T4 GPU (15GB VRAM)

  • Kaggle: Free P100 GPU (30 hours/week)

  • AWS/GCP/Azure: Pay-per-use GPU instances

  • Lambda Labs: Specialized GPU cloud


Part 3: CuPy - NumPy for GPUs#

CuPy is a NumPy-compatible library that runs on NVIDIA GPUs. It's the easiest way to start GPU computing in Python.

# CuPy basics (pseudocode if GPU not available)
import numpy as np
import time

try:
    import cupy as cp
    GPU_AVAILABLE = True
except ImportError:
    print("CuPy not available. Showing pseudocode examples.")
    GPU_AVAILABLE = False

if GPU_AVAILABLE:
    # Example 1: Array creation and operations
    print("Example 1: Basic Operations")
    print("=" * 40)
    
    # CPU (NumPy)
    x_cpu = np.array([1, 2, 3, 4, 5])
    y_cpu = np.array([6, 7, 8, 9, 10])
    z_cpu = x_cpu + y_cpu
    print(f"NumPy result: {z_cpu}")
    
    # GPU (CuPy) - Same syntax!
    x_gpu = cp.array([1, 2, 3, 4, 5])
    y_gpu = cp.array([6, 7, 8, 9, 10])
    z_gpu = x_gpu + y_gpu
    print(f"CuPy result: {z_gpu}")
    
    # Transfer between CPU and GPU
    cpu_array = cp.asnumpy(z_gpu)  # GPU → CPU
    gpu_array = cp.asarray(z_cpu)  # CPU → GPU
    
    print("\nExample 2: Performance Comparison")
    print("=" * 40)
    
    # Large matrix operations
    size = 10000
    
    # CPU
    a_cpu = np.random.rand(size, size).astype(np.float32)
    b_cpu = np.random.rand(size, size).astype(np.float32)
    
    start = time.perf_counter()
    c_cpu = np.dot(a_cpu, b_cpu)
    time_cpu = time.perf_counter() - start
    
    # GPU
    a_gpu = cp.random.rand(size, size, dtype=cp.float32)
    b_gpu = cp.random.rand(size, size, dtype=cp.float32)
    
    # Warm up
    _ = cp.dot(a_gpu, b_gpu)
    cp.cuda.Stream.null.synchronize()  # Wait for GPU
    
    start = time.perf_counter()
    c_gpu = cp.dot(a_gpu, b_gpu)
    cp.cuda.Stream.null.synchronize()  # Important: wait for GPU to finish!
    time_gpu = time.perf_counter() - start
    
    print(f"Matrix multiplication ({size}ร—{size}):")
    print(f"NumPy (CPU): {time_cpu:.4f}s")
    print(f"CuPy (GPU):  {time_gpu:.4f}s")
    print(f"Speedup: {time_cpu/time_gpu:.1f}x faster!")
    
else:
    print("""
    CuPy Example (Pseudocode):
    
    import cupy as cp
    
    # Create arrays on GPU
    x_gpu = cp.array([1, 2, 3, 4, 5])
    y_gpu = cp.array([6, 7, 8, 9, 10])
    
    # Operations run on GPU automatically
    z_gpu = x_gpu + y_gpu
    
    # Transfer data: GPU ↔ CPU
    cpu_array = cp.asnumpy(z_gpu)  # GPU → CPU
    gpu_array = cp.asarray(cpu_array)  # CPU → GPU
    
    # All NumPy operations work!
    result = cp.mean(x_gpu)
    
    Speedup: Typically 10-100x for large arrays
    """)

CuPy Best Practices#

  1. Minimize CPU ↔ GPU transfers: Keep data on GPU (see the sketch after this list)

  2. Use synchronize(): GPU operations are async

  3. Batch operations: Single large operation > many small ones

  4. Use float32: float64 is much slower on most GPUs (consumer cards often run it at a small fraction of float32 throughput)

  5. Reuse arrays: Avoid frequent allocation/deallocation
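A brief sketch of practices 1-3 in action (assuming CuPy and a GPU are available): do the work on the device, synchronize, and copy back only the small final result.

import cupy as cp

x = cp.random.rand(10_000_000, dtype=cp.float32)  # data lives on the GPU

# Bad: pulling intermediate results back to the host at every step
# partial = cp.asnumpy(x * 2)          # GPU → CPU over PCIe (slow)

# Good: chain the operations on the GPU, transfer one scalar at the end
y = (x * 2 + 1).sum()
cp.cuda.Stream.null.synchronize()       # GPU ops are async: wait before reading
result = float(y)                       # single small transfer to the host
print(result)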

# Advanced CuPy: Custom kernels
if GPU_AVAILABLE:
    # Element-wise kernel (like NumPy ufunc)
    from cupy import ElementwiseKernel
    
    # Kernel definition (C++ syntax)
    add_kernel = ElementwiseKernel(
        'float32 x, float32 y',  # Input types
        'float32 z',  # Output type
        'z = x + y',  # Operation
        'add_kernel'  # Name
    )
    
    # Use it
    x = cp.arange(1000000, dtype=cp.float32)
    y = cp.arange(1000000, dtype=cp.float32)
    z = add_kernel(x, y)
    
    print(f"Custom kernel result: {z[:5]}...")
    
    # More complex: squared difference
    squared_diff_kernel = ElementwiseKernel(
        'float32 x, float32 y',
        'float32 z',
        'z = (x - y) * (x - y)',
        'squared_diff'
    )
    
    result = squared_diff_kernel(x, y)
    print(f"Squared difference: {result[:5]}...")
else:
    print("Custom CuPy kernels allow writing GPU code in C++ syntax!")

Part 4: PyTorch GPU Acceleration#

PyTorch provides the easiest path to GPU computing for deep learning and scientific computing.

import time

try:
    import torch
    TORCH_AVAILABLE = True
except ImportError:
    print("PyTorch not installed. Install with: pip install torch")
    TORCH_AVAILABLE = False

if TORCH_AVAILABLE:
    print("PyTorch GPU Example")
    print("=" * 40)
    
    # Check device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")
    
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    
    # Create tensors on the target device
    size = 5000
    
    # Method 1: Create on the device directly (reuses `device` from above)
    x_gpu = torch.rand(size, size, device=device)
    y_gpu = torch.rand(size, size, device=device)
    
    # Method 2: Create on CPU, then move
    x_cpu = torch.rand(size, size)
    if torch.cuda.is_available():
        x_gpu = x_cpu.to(device)  # or .cuda()
    
    # Benchmark
    if torch.cuda.is_available():
        # GPU
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        
        start.record()
        z_gpu = torch.mm(x_gpu, y_gpu)  # Matrix multiply
        end.record()
        torch.cuda.synchronize()
        
        time_gpu = start.elapsed_time(end) / 1000  # ms to seconds
        
        # CPU
        x_cpu = torch.rand(size, size)
        y_cpu = torch.rand(size, size)
        
        start_cpu = time.perf_counter()
        z_cpu = torch.mm(x_cpu, y_cpu)
        time_cpu = time.perf_counter() - start_cpu
        
        print(f"\nMatrix multiplication ({size}ร—{size}):")
        print(f"CPU: {time_cpu:.4f}s")
        print(f"GPU: {time_gpu:.4f}s")
        print(f"Speedup: {time_cpu/time_gpu:.1f}x faster!")
    else:
        print("\nNo GPU available for benchmarking")
        
else:
    print("""
    PyTorch GPU Example (Pseudocode):
    
    import torch
    
    # Check GPU availability
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    # Create tensor on GPU
    x = torch.rand(1000, 1000, device='cuda')
    y = torch.rand(1000, 1000, device='cuda')
    
    # All operations run on GPU
    z = torch.mm(x, y)
    
    # Move between devices
    x_cpu = x.cpu()  # GPU → CPU
    x_gpu = x_cpu.cuda()  # CPU → GPU
    """)

Part 5: Parallel Algorithm Patterns#

Certain algorithms are naturally parallel and map perfectly to GPUs.

Pattern 1: Map (Element-wise Operations)#

Apply same operation to each element independently.

Examples: Array addition, sigmoid activation, image filters

# CPU: Sequential
for i in range(n):
    output[i] = func(input[i])

# GPU: Parallel (each thread handles one element)
thread_id = blockIdx.x * blockDim.x + threadIdx.x
if thread_id < n:
    output[thread_id] = func(input[thread_id])
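As a concrete illustration (a sketch assuming CuPy and a GPU), the sigmoid activation is a pure map: every element is transformed independently, so the GPU can process all of them in parallel.

import cupy as cp

x = cp.linspace(-5, 5, 1_000_000, dtype=cp.float32)
y = 1.0 / (1.0 + cp.exp(-x))   # sigmoid applied element-wise: the map pattern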

Pattern 2: Reduce (Aggregation)#

Combine all elements into single value.

Examples: Sum, max, min, mean

Tree-based reduction:
[1, 2, 3, 4, 5, 6, 7, 8]
 └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘   Step 1: Pair-wise
   3     7     11    15
   └──┬──┘     └──┬──┘      Step 2: Pair-wise
      10          26
      └─────┬─────┘          Step 3: Final
            36
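The same tree can be written as a short sketch (assuming PyTorch with CUDA and a power-of-two length); in practice you would call the library's built-in sum, which runs an optimized reduction internally.

import torch

x = torch.arange(1, 9, dtype=torch.float32, device='cuda')  # [1, 2, ..., 8]
while x.numel() > 1:
    x = x[0::2] + x[1::2]   # pair-wise sums: 8 → 4 → 2 → 1 elements
print(x.item())             # 36.0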

Pattern 3: Scan (Prefix Sum)#

Compute running aggregation.

Examples: Cumulative sum, histogram, sorting

Input:  [1, 2, 3, 4, 5]
Output: [1, 3, 6, 10, 15]  (cumulative sum)
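In practice, libraries expose the scan pattern directly as a cumulative sum (a minimal sketch, assuming CuPy and a GPU are available).

import cupy as cp

x = cp.array([1, 2, 3, 4, 5], dtype=cp.float32)
print(cp.cumsum(x))   # [ 1.  3.  6. 10. 15.]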

Pattern 4: Stencil (Neighbor Operations)#

Compute based on neighbors in structured grid.

Examples: Convolution, blur, diffusion

3×3 kernel:
  ┌───┬───┬───┐
  │ 1 │ 2 │ 1 │
  ├───┼───┼───┤
  │ 2 │ 4 │ 2 │  Apply to each pixel
  ├───┼───┼───┤
  │ 1 │ 2 │ 1 │
  └───┴───┴───┘
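The 3×3 kernel above can be applied with an off-the-shelf convolution; here is a hedged sketch using PyTorch's conv2d on a random image (assumes PyTorch with CUDA).

import torch
import torch.nn.functional as F

kernel = torch.tensor([[1., 2., 1.],
                       [2., 4., 2.],
                       [1., 2., 1.]], device='cuda') / 16.0

image = torch.rand(1, 1, 1080, 1920, device='cuda')            # N, C, H, W
blurred = F.conv2d(image, kernel.view(1, 1, 3, 3), padding=1)  # same-size output
print(blurred.shape)   # torch.Size([1, 1, 1080, 1920])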

Part 6: GPU Memory Management#

Efficient memory usage is crucial for GPU performance.

if TORCH_AVAILABLE and torch.cuda.is_available():
    print("GPU Memory Management")
    print("=" * 40)
    
    # Memory stats
    def print_gpu_memory():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        total = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"Allocated: {allocated:.2f} GB")
        print(f"Reserved:  {reserved:.2f} GB")
        print(f"Total:     {total:.2f} GB")
    
    print("\nInitial state:")
    print_gpu_memory()
    
    # Allocate memory
    print("\nAfter creating 5000ร—5000 tensor:")
    x = torch.rand(5000, 5000, device='cuda')
    print_gpu_memory()
    
    # Free memory
    del x
    torch.cuda.empty_cache()  # Release reserved memory
    
    print("\nAfter deleting tensor and clearing cache:")
    print_gpu_memory()
    
    # Memory-efficient operations
    print("\n" + "="*40)
    print("Memory-Efficient Patterns:")
    print("="*40)
    
    # In-place operations save memory
    x = torch.rand(1000, 1000, device='cuda')
    
    # Bad: creates a new tensor
    # y = x + 1
    
    # Good: in-place (methods ending in an underscore modify the tensor directly)
    x.add_(1)
    
    # Note: torch.cuda.device() only selects the active device; GPU memory is
    # released when the last reference to a tensor is dropped
    with torch.cuda.device(0):
        temp = torch.rand(1000, 1000, device='cuda')
    del temp  # free it explicitly once no longer needed
    
    print("\nMemory Best Practices:")
    print("1. Use in-place operations: tensor.add_() vs tensor + 1")
    print("2. Delete large tensors when done: del tensor")
    print("3. Clear cache periodically: torch.cuda.empty_cache()")
    print("4. Use mixed precision (float16): Halves memory usage")
    print("5. Batch processing: Process data in chunks")
    print("6. Gradient checkpointing: Trade compute for memory")
    
else:
    print("GPU Memory Management (Conceptual):")
    print("""
    GPU memory is limited (8-80 GB typical).
    
    Best Practices:
    1. Monitor: torch.cuda.memory_allocated()
    2. Free: del tensor, torch.cuda.empty_cache()
    3. In-place ops: tensor.add_(1) instead of tensor + 1
    4. Mixed precision: Use float16 when possible
    5. Batch processing: Don't load all data at once
    """)

Part 7: Multi-GPU Programming#

Scale to multiple GPUs for even more performance.

if TORCH_AVAILABLE:
    n_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
    
    print(f"Multi-GPU Programming (Found {n_gpus} GPU(s))")
    print("=" * 40)
    
    if n_gpus > 1:
        # Data Parallel: Same model, split data
        print("\nData Parallelism Example:")
        
        # Simple model
        class SimpleModel(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = torch.nn.Linear(1000, 1000)
            
            def forward(self, x):
                return self.linear(x)
        
        model = SimpleModel()
        
        # Wrap with DataParallel
        model = torch.nn.DataParallel(model)
        model = model.cuda()
        
        # Forward pass automatically splits across GPUs
        x = torch.rand(128, 1000).cuda()  # Batch size 128
        output = model(x)  # Splits batch across GPUs
        
        print(f"Model on {torch.cuda.device_count()} GPUs")
        print(f"Input: {x.shape}, Output: {output.shape}")
        
        # DistributedDataParallel (better for multi-node)
        print("\nFor production, use DistributedDataParallel:")
        print("""
        from torch.nn.parallel import DistributedDataParallel as DDP
        
        # Initialize process group
        torch.distributed.init_process_group(backend='nccl')
        
        # Wrap model
        model = DDP(model, device_ids=[local_rank])
        """)
        
    else:
        print("""
        Multi-GPU Strategies:
        
        1. Data Parallelism:
           - Same model replicated on each GPU
           - Different data batches
           - Most common approach
           
        2. Model Parallelism:
           - Split model across GPUs
           - For models too large for single GPU
           - More complex implementation
           
        3. Pipeline Parallelism:
           - Different stages on different GPUs
           - Overlaps computation
           
        Example:
        model = torch.nn.DataParallel(model)  # Simple!
        """)
else:
    print("Multi-GPU programming requires PyTorch")

Part 8: Real-World GPU Applications#

Application 1: Image Processing#

# GPU-accelerated image filtering
import time

import numpy as np

if GPU_AVAILABLE:
    import cupy as cp
    
    # Create fake image (1920×1080, RGB)
    image_cpu = np.random.rand(1080, 1920, 3).astype(np.float32)
    image_gpu = cp.asarray(image_cpu)
    
    # Gaussian blur kernel
    def gaussian_blur_cpu(image):
        """CPU version using scipy's convolution for comparison."""
        from scipy.ndimage import convolve  # assumes scipy is installed
        kernel = np.array([[1, 2, 1],
                           [2, 4, 2],
                           [1, 2, 1]], dtype=np.float32) / 16
        # Convolve each colour channel separately
        return np.stack([convolve(image[..., c], kernel)
                         for c in range(image.shape[-1])], axis=-1)
    
    # Custom GPU kernel for blur
    blur_kernel = cp.ElementwiseKernel(
        'float32 x',
        'float32 y',
        'y = x * 0.8',  # Simplified placeholder (a real blur would read neighbouring pixels)
        'blur'
    )
    
    # Benchmark
    n_iter = 100
    
    # GPU
    start = time.perf_counter()
    for _ in range(n_iter):
        result_gpu = blur_kernel(image_gpu)
    cp.cuda.Stream.null.synchronize()
    time_gpu = time.perf_counter() - start
    
    print(f"Image Processing ({n_iter} iterations):")
    print(f"GPU: {time_gpu:.4f}s ({time_gpu/n_iter*1000:.2f}ms per frame)")
    print(f"FPS: {n_iter/time_gpu:.1f} frames/second")
    
else:
    print("GPU image processing can achieve 100+ FPS for HD video!")

Application 2: Monte Carlo Simulation#

# GPU-accelerated Monte Carlo
if TORCH_AVAILABLE and torch.cuda.is_available():
    def monte_carlo_pi_gpu(n_samples):
        """Estimate ฯ€ using GPU Monte Carlo."""
        # Generate random points on GPU
        x = torch.rand(n_samples, device='cuda')
        y = torch.rand(n_samples, device='cuda')
        
        # Check if inside unit circle
        inside = (x**2 + y**2) <= 1.0
        
        # Estimate ฯ€
        pi_estimate = 4.0 * inside.float().mean().item()
        return pi_estimate
    
    # Run simulation
    n_samples = 100_000_000  # 100 million!
    
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    
    start.record()
    pi = monte_carlo_pi_gpu(n_samples)
    end.record()
    torch.cuda.synchronize()
    
    elapsed = start.elapsed_time(end) / 1000
    
    print(f"\nMonte Carlo ฯ€ Estimation:")
    print(f"Samples: {n_samples:,}")
    print(f"Result: ฯ€ โ‰ˆ {pi:.6f} (true: 3.141593)")
    print(f"Error: {abs(pi - 3.141593):.6f}")
    print(f"Time: {elapsed:.4f}s")
    print(f"Throughput: {n_samples/elapsed/1e6:.1f} million samples/second")
    
else:
    print("Monte Carlo simulations benefit hugely from GPU parallelism!")

Part 9: GPU Optimization Techniques#

1. Coalesced Memory Access#

Problem: GPUs load memory in 128-byte chunks. Random access wastes bandwidth.

Bad (Strided):
Thread 0: array[0]
Thread 1: array[100]
Thread 2: array[200]  → Many memory transactions

Good (Coalesced):
Thread 0: array[0]
Thread 1: array[1]
Thread 2: array[2]  → One memory transaction
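A rough way to feel this from Python (a sketch assuming CuPy and a GPU; exact numbers vary by hardware): sum the same number of elements through a contiguous view and through a strided view that wastes part of every memory transaction.

import time
import cupy as cp

x = cp.random.rand(1 << 25, dtype=cp.float32)

def bench(view, n_iter=100):
    cp.cuda.Stream.null.synchronize()
    start = time.perf_counter()
    for _ in range(n_iter):
        view.sum()
    cp.cuda.Stream.null.synchronize()
    return time.perf_counter() - start

contiguous = bench(x[: 1 << 24])   # first half: contiguous, coalesced reads
strided = bench(x[::2])            # every other element: strided reads
print(f"Contiguous: {contiguous:.4f}s   Strided: {strided:.4f}s")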

2. Shared Memory#

Use fast shared memory (48-96 KB per SM) for data reuse.

// CUDA kernel (pseudocode)
__shared__ float tile[TILE_SIZE][TILE_SIZE];

// Load from global → shared memory (once per tile)
tile[ty][tx] = global_mem[...];
__syncthreads();

// Compute using shared memory (fast!)
float result = 0;
for (int i = 0; i < TILE_SIZE; i++)
    result += tile[ty][i] * tile[i][tx];

3. Occupancy Optimization#

Occupancy = Active warps / Maximum possible warps

Higher occupancy hides memory latency better.

Factors:

  • Threads per block (multiple of 32)

  • Registers per thread (fewer is better)

  • Shared memory usage (less is better)

Sweet spot: 128-256 threads per block

4. Kernel Fusion#

Combine multiple operations to reduce kernel launches.

# Bad: three separate kernel launches, each producing a temporary array
y = x + 1
z = y * 2
w = z - 3

# Good: fuse the three steps into a single kernel
# (in CuPy this needs cupy.fuse or a custom ElementwiseKernel; see the sketch below)
w = (x + 1) * 2 - 3
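A minimal fusion sketch using CuPy's fuse decorator (assuming CuPy and a GPU): the three element-wise steps are compiled into a single kernel, avoiding the intermediate arrays.

import cupy as cp

@cp.fuse()
def fused(x):
    return (x + 1) * 2 - 3   # compiled into one element-wise kernel

x = cp.arange(1_000_000, dtype=cp.float32)
w = fused(x)                 # one kernel launch instead of three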

5. Mixed Precision#

Use float16 when possible:

  • 2x less memory

  • 2x faster on Tensor Cores

  • Minimal accuracy loss

model = model.half()  # Convert to float16
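For training, PyTorch's automatic mixed precision handles the float16/float32 juggling for you. A self-contained sketch (assuming a CUDA GPU; the tiny linear model here is only for illustration):

import torch

model = torch.nn.Linear(1000, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()          # rescales gradients to avoid fp16 underflow

inputs = torch.rand(64, 1000, device='cuda')
targets = torch.randint(0, 10, (64,), device='cuda')

optimizer.zero_grad()
with torch.cuda.amp.autocast():               # eligible ops run in float16
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()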

Part 10: Exercises#

Exercise 1: Vector Addition (Difficulty: ★★☆☆☆)#

Task: Implement vector addition on GPU using CuPy or PyTorch:

  1. Create two large vectors (10 million elements)

  2. Add them on CPU and GPU

  3. Measure and compare performance

  4. Verify results are identical


Exercise 2: Matrix Multiplication Optimization (Difficulty: ★★★★☆)#

Task: Compare different matrix multiplication methods:

  1. Pure Python (nested loops)

  2. NumPy (CPU)

  3. CuPy or PyTorch (GPU)

  4. Mixed precision (float16 on GPU)

Test with various sizes and plot speedup vs matrix size.


Exercise 3: Image Convolution (Difficulty: ★★★★☆)#

Task: Implement 2D convolution on GPU:

  1. Load an image

  2. Apply various filters (blur, sharpen, edge detection)

  3. Compare CPU vs GPU performance

  4. Implement as custom CuPy kernel


Exercise 4: Parallel Reduction (Difficulty: ★★★★☆)#

Task: Implement parallel sum reduction:

  1. Create array of 100 million numbers

  2. Implement tree-based reduction

  3. Compare with built-in sum

  4. Measure throughput (GB/s)


Exercise 5: Multi-GPU Training (Difficulty: ★★★★★)#

Task: If you have multiple GPUs:

  1. Create a simple neural network

  2. Implement data-parallel training

  3. Measure speedup vs single GPU

  4. Monitor GPU utilization


Exercise 6: Memory Bandwidth Test (Difficulty: ★★★☆☆)#

Task: Measure GPU memory bandwidth:

  1. Copy large arrays between CPU and GPU

  2. Measure transfer speed (GB/s)

  3. Compare with GPU specs

  4. Identify bottlenecks (PCIe vs GPU memory)


Part 11: Self-Check Quiz#

Question 1#

Why are GPUs faster than CPUs for parallel workloads?

A) Higher clock speed
B) Thousands of cores for massive parallelism
C) Larger cache
D) Better branch prediction

Answer: B) Thousands of cores for massive parallelism

Explanation: GPUs sacrifice per-core performance for massive parallelism, with thousands of simpler cores that excel at data-parallel tasks.


Question 2#

What is the main bottleneck when using GPUs?

A) Computation speed
B) Data transfer between CPU and GPU
C) Power consumption
D) Programming difficulty

Answer: B) Data transfer between CPU and GPU

Explanation: PCIe bandwidth is limited (16-32 GB/s), much slower than GPU memory bandwidth (1000+ GB/s). Minimize CPU ↔ GPU transfers!


Question 3#

What does synchronize() do in GPU programming?

A) Copies data to GPU
B) Waits for GPU operations to complete
C) Frees GPU memory
D) Compiles kernels

Answer: B) Waits for GPU operations to complete

Explanation: GPU operations are asynchronous. synchronize() ensures operations finish before continuing, necessary for accurate timing.


Question 4#

When should you use float16 instead of float32 on GPU?

A) Always, it's always faster
B) Never, it's less accurate
C) When memory is limited and precision loss is acceptable
D) Only for integer operations

Answer: C) When memory is limited and precision loss is acceptable

Explanation: float16 uses half the memory and is faster on Tensor Cores, but has less precision. It is a good fit for deep learning; check accuracy carefully for other applications.


Question 5#

What is DataParallel used for?

A) Training different models on different GPUs
B) Splitting same model across multiple GPUs
C) Distributing data batches across multiple GPUs with same model
D) Compressing model size

Answer: C) Distributing data batches across multiple GPUs with same model

Explanation: DataParallel replicates the model on each GPU and splits the batch across GPUs, then combines results. Most common multi-GPU approach.


Key Takeaways#

  1. GPUs excel at parallelism: Thousands of cores for data-parallel tasks

  2. Transfer is expensive: Keep data on GPU, minimize CPU ↔ GPU copies

  3. CuPy = NumPy on GPU: Easiest way to start GPU computing

  4. PyTorch for deep learning: Seamless GPU acceleration

  5. Memory is limited: Monitor usage, use float16 when possible

  6. Synchronization matters: GPU ops are async, synchronize for timing

  7. Batch operations: Large batches amortize launch overhead

  8. Coalesced access: Contiguous memory access is critical

  9. Multi-GPU scales: DataParallel for easy multi-GPU training

  10. Right tool for job: GPU for parallel, CPU for sequential


Common Mistakes#

  1. Frequent CPU ↔ GPU transfers: Keep data on GPU

  2. Small workloads: Overhead dominates, GPU slower than CPU

  3. Forgetting synchronize(): Timing without sync is wrong

  4. Memory leaks: Delete tensors, clear cache

  5. Wrong precision: float64 on GPU is slow

  6. Sequential operations: GPU needs parallelism

  7. Not profiling: Assumptions about bottlenecks

  8. Ignoring occupancy: Too many/few threads per block


Pro Tips#

  1. Use Google Colab: Free GPU access for learning

  2. Profile with nvprof: Identify kernel bottlenecks

  3. torch.cuda.amp: Automatic mixed precision

  4. Pin memory: Faster CPU → GPU transfers (see the sketch after this list)

  5. Async transfers: Overlap compute and transfer

  6. NVIDIA Nsight: Visual profiling tool

  7. Benchmarking: Warm up kernels, multiple runs

  8. GPU utils: nvidia-smi for monitoring
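A short sketch of tips 4 and 5 (assuming PyTorch with CUDA): page-locked (pinned) host memory allows faster, asynchronous copies to the GPU that can overlap with computation.

import torch

cpu_tensor = torch.rand(1 << 20).pin_memory()           # page-locked host buffer
gpu_tensor = cpu_tensor.to('cuda', non_blocking=True)   # async copy from pinned memory
torch.cuda.synchronize()                                 # wait before using the result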


Whatโ€™s Next?#

You're now ready for GPU-accelerated computing!

Advanced Topics:

  1. CUDA C++: Write custom kernels for maximum performance

  2. JAX: Composable transformations for ML research

  3. TensorRT: Optimize models for inference

  4. Distributed Training: Multi-node GPU clusters

  5. GPU Optimization: Advanced memory patterns

Projects to Build:

  • Real-time image processing pipeline

  • GPU-accelerated data science workflow

  • Deep learning model from scratch

  • Physics simulation (N-body, fluid dynamics)

  • Cryptocurrency miner (educational!)

Remember: GPUs are powerful but not magic. Profile first, optimize bottlenecks, and use the right tool for each task!

Congratulations on completing the Education Playground curriculum! 🎓🚀