Hard Level: CUDA & GPU Parallel Computing#

Real-World Context#

The Problem: Modern CPUs typically have 4-16 cores; modern GPUs have thousands. For the right workloads, GPUs can be 10-100x faster than CPUs.

Where GPUs Dominate:

  • Deep Learning: Training neural networks (PyTorch, TensorFlow)

  • Scientific Computing: Physics simulations, climate modeling

  • Image/Video Processing: Real-time rendering, computer vision

  • Cryptography: Password cracking, blockchain mining

  • Financial Modeling: Monte Carlo simulations, risk analysis

  • Bioinformatics: Gene sequencing, protein folding

Why This Matters:

  • Speed: Train models in hours instead of weeks

  • Scale: Process billions of data points in real-time

  • Cost: One GPU can replace dozens of CPU cores

  • Energy: Higher performance per watt

What You'll Learn:

  • GPU architecture and CUDA programming model

  • PyCUDA for Python-CUDA integration

  • CuPy - NumPy for GPUs

  • Parallel algorithms and patterns

  • GPU memory management

  • Multi-GPU programming

  • Real-world optimization techniques


Part 1: GPU Architecture Fundamentals#

CPU vs GPU: Different Design Philosophy#

CPU (Central Processing Unit):

  • Few powerful cores (4-16)

  • High clock speed (3-5 GHz)

  • Large cache (MB of L1/L2/L3)

  • Low latency: Optimized for sequential tasks

  • Complex control logic: Branch prediction, out-of-order execution

GPU (Graphics Processing Unit):

  • Thousands of simple cores (2,000-10,000+)

  • Lower clock speed (1-2 GHz)

  • Small cache per core: Focus on throughput

  • High throughput: Optimized for parallel tasks

  • Simple control: SIMT (Single Instruction, Multiple Threads)

NVIDIA GPU Architecture#

GPU
│
├─ Streaming Multiprocessor (SM) × 80-100+
│  │
│  ├─ CUDA Cores × 64-128 per SM
│  ├─ Tensor Cores (for AI)
│  ├─ Shared Memory (fast, 48-96 KB)
│  ├─ L1 Cache
│  └─ Registers
│
├─ L2 Cache (shared, several MB)
│
└─ Global Memory (VRAM, 8-80 GB)
   - High bandwidth (1000+ GB/s)
   - High latency (100s of cycles)

CUDA Programming Model#

Key Concepts:

  1. Kernel: Function that runs on GPU

  2. Thread: Smallest execution unit

  3. Block: Group of threads (up to 1024)

  4. Grid: Collection of blocks

Grid
├─ Block(0,0)     Block(1,0)     Block(2,0)
│  ├─ Thread(0,0) ├─ Thread(0,0) ├─ Thread(0,0)
│  ├─ Thread(1,0) ├─ Thread(1,0) ├─ Thread(1,0)
│  ├─ Thread(2,0) ├─ Thread(2,0) ├─ Thread(2,0)
│  └─ ...         └─ ...         └─ ...
│
├─ Block(0,1)     Block(1,1)     Block(2,1)
   └─ ...         └─ ...         └─ ...
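Putting the four concepts together, here is a minimal PyCUDA sketch (an illustration, assuming PyCUDA and a working NVIDIA driver are installed): a kernel is launched over a grid of blocks, and each thread computes its global index from blockIdx, blockDim and threadIdx.

# Minimal PyCUDA sketch (assumes `pip install pycuda` and an NVIDIA driver)
import numpy as np
import pycuda.autoinit              # creates a CUDA context on import
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void add(float *out, const float *a, const float *b, int n)
{
    // Global thread index: block offset + position within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard threads past the end of the array
        out[i] = a[i] + b[i];
}
""")
add = mod.get_function("add")

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.empty_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block  # enough blocks to cover n
add(drv.Out(out), drv.In(a), drv.In(b), np.int32(n),
    block=(threads_per_block, 1, 1), grid=(blocks, 1))

assert np.allclose(out, a + b)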

Memory Hierarchy (fast to slow):

  1. Registers: Per-thread, fastest (1 cycle)

  2. Shared Memory: Per-block, very fast (1-2 cycles)

  3. L1/L2 Cache: Automatic, fast

  4. Global Memory: Slowest (100s of cycles) but largest


Part 2: Checking GPU Availability#

# Check if CUDA is available
import subprocess
import sys

def check_cuda():
    """Check CUDA and GPU availability."""
    print("=" * 60)
    print("CUDA & GPU Availability Check")
    print("=" * 60)
    
    # Check nvidia-smi
    try:
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        if result.returncode == 0:
            print("\n✓ NVIDIA GPU detected!\n")
            print(result.stdout)
        else:
            print("\n✗ nvidia-smi not available")
    except FileNotFoundError:
        print("\n✗ NVIDIA drivers not installed")
    
    # Check PyTorch CUDA
    try:
        import torch
        cuda_available = torch.cuda.is_available()
        print(f"\nPyTorch CUDA available: {cuda_available}")
        if cuda_available:
            print(f"GPU Device: {torch.cuda.get_device_name(0)}")
            print(f"CUDA Version: {torch.version.cuda}")
            print(f"Number of GPUs: {torch.cuda.device_count()}")
    except ImportError:
        print("\nPyTorch not installed. Install with: pip install torch")
    
    # Check CuPy
    try:
        import cupy as cp
        print(f"\nCuPy available: True")
        print(f"CuPy CUDA version: {cp.cuda.runtime.runtimeGetVersion()}")
    except ImportError:
        print("\nCuPy not installed. Install with: pip install cupy-cuda11x")
    except Exception as e:
        print(f"\nCuPy error: {e}")
    
    print("\n" + "=" * 60)

check_cuda()

Note: If you don't have a GPU, many cloud platforms offer GPU access:

  • Google Colab: Free T4 GPU (15GB VRAM)

  • Kaggle: Free P100 GPU (30 hours/week)

  • AWS/GCP/Azure: Pay-per-use GPU instances

  • Lambda Labs: Specialized GPU cloud


Part 3: CuPy - NumPy for GPUs#

CuPy is a NumPy-compatible library that runs on NVIDIA GPUs. It's the easiest way to start GPU computing in Python.

# CuPy basics (pseudocode if GPU not available)
import numpy as np
import time

try:
    import cupy as cp
    GPU_AVAILABLE = True
except ImportError:
    print("CuPy not available. Showing pseudocode examples.")
    GPU_AVAILABLE = False

if GPU_AVAILABLE:
    # Example 1: Array creation and operations
    print("Example 1: Basic Operations")
    print("=" * 40)
    
    # CPU (NumPy)
    x_cpu = np.array([1, 2, 3, 4, 5])
    y_cpu = np.array([6, 7, 8, 9, 10])
    z_cpu = x_cpu + y_cpu
    print(f"NumPy result: {z_cpu}")
    
    # GPU (CuPy) - Same syntax!
    x_gpu = cp.array([1, 2, 3, 4, 5])
    y_gpu = cp.array([6, 7, 8, 9, 10])
    z_gpu = x_gpu + y_gpu
    print(f"CuPy result: {z_gpu}")
    
    # Transfer between CPU and GPU
    cpu_array = cp.asnumpy(z_gpu)  # GPU → CPU
    gpu_array = cp.asarray(z_cpu)  # CPU → GPU
    
    print("\nExample 2: Performance Comparison")
    print("=" * 40)
    
    # Large matrix operations
    size = 10000
    
    # CPU
    a_cpu = np.random.rand(size, size).astype(np.float32)
    b_cpu = np.random.rand(size, size).astype(np.float32)
    
    start = time.perf_counter()
    c_cpu = np.dot(a_cpu, b_cpu)
    time_cpu = time.perf_counter() - start
    
    # GPU
    a_gpu = cp.random.rand(size, size, dtype=cp.float32)
    b_gpu = cp.random.rand(size, size, dtype=cp.float32)
    
    # Warm up
    _ = cp.dot(a_gpu, b_gpu)
    cp.cuda.Stream.null.synchronize()  # Wait for GPU
    
    start = time.perf_counter()
    c_gpu = cp.dot(a_gpu, b_gpu)
    cp.cuda.Stream.null.synchronize()  # Important: wait for GPU to finish!
    time_gpu = time.perf_counter() - start
    
    print(f"Matrix multiplication ({size}ร—{size}):")
    print(f"NumPy (CPU): {time_cpu:.4f}s")
    print(f"CuPy (GPU):  {time_gpu:.4f}s")
    print(f"Speedup: {time_cpu/time_gpu:.1f}x faster!")
    
else:
    print("""
    CuPy Example (Pseudocode):
    
    import cupy as cp
    
    # Create arrays on GPU
    x_gpu = cp.array([1, 2, 3, 4, 5])
    y_gpu = cp.array([6, 7, 8, 9, 10])
    
    # Operations run on GPU automatically
    z_gpu = x_gpu + y_gpu
    
    # Transfer data: GPU ↔ CPU
    cpu_array = cp.asnumpy(z_gpu)  # GPU → CPU
    gpu_array = cp.asarray(cpu_array)  # CPU → GPU
    
    # All NumPy operations work!
    result = cp.mean(x_gpu)
    
    Speedup: Typically 10-100x for large arrays
    """)

CuPy Best Practices#

  1. Minimize CPU ↔ GPU transfers: Keep data on GPU (see the sketch after this list)

  2. Use synchronize(): GPU operations are async

  3. Batch operations: Single large operation > many small ones

  4. Use float32: float64 is much slower on most GPUs (consumer cards often run it at a small fraction of float32 throughput)

  5. Reuse arrays: Avoid frequent allocation/deallocation
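A brief sketch of practices 1-3 in action (assuming CuPy and a GPU are available): do the work on the device, synchronize, and copy back only the small final result.

import cupy as cp

x = cp.random.rand(10_000_000, dtype=cp.float32)  # data lives on the GPU

# Bad: pulling intermediate results back to the host at every step
# partial = cp.asnumpy(x * 2)          # GPU → CPU over PCIe (slow)

# Good: chain the operations on the GPU, transfer one scalar at the end
y = (x * 2 + 1).sum()
cp.cuda.Stream.null.synchronize()       # GPU ops are async: wait before reading
result = float(y)                       # single small transfer to the host
print(result)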

# Advanced CuPy: Custom kernels
if GPU_AVAILABLE:
    # Element-wise kernel (like NumPy ufunc)
    from cupy import ElementwiseKernel
    
    # Kernel definition (C++ syntax)
    add_kernel = ElementwiseKernel(
        'float32 x, float32 y',  # Input types
        'float32 z',  # Output type
        'z = x + y',  # Operation
        'add_kernel'  # Name
    )
    
    # Use it
    x = cp.arange(1000000, dtype=cp.float32)
    y = cp.arange(1000000, dtype=cp.float32)
    z = add_kernel(x, y)
    
    print(f"Custom kernel result: {z[:5]}...")
    
    # More complex: squared difference
    squared_diff_kernel = ElementwiseKernel(
        'float32 x, float32 y',
        'float32 z',
        'z = (x - y) * (x - y)',
        'squared_diff'
    )
    
    result = squared_diff_kernel(x, y)
    print(f"Squared difference: {result[:5]}...")
else:
    print("Custom CuPy kernels allow writing GPU code in C++ syntax!")

Part 4: PyTorch GPU Acceleration#

PyTorch provides the easiest path to GPU computing for deep learning and scientific computing.

import time

try:
    import torch
    TORCH_AVAILABLE = True
except ImportError:
    print("PyTorch not installed. Install with: pip install torch")
    TORCH_AVAILABLE = False

if TORCH_AVAILABLE:
    print("PyTorch GPU Example")
    print("=" * 40)
    
    # Check device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")
    
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    
    # Create tensors on the target device
    size = 5000
    
    # Method 1: Create on the device directly (reuses `device` from above)
    x_gpu = torch.rand(size, size, device=device)
    y_gpu = torch.rand(size, size, device=device)
    
    # Method 2: Create on CPU, then move
    x_cpu = torch.rand(size, size)
    if torch.cuda.is_available():
        x_gpu = x_cpu.to(device)  # or .cuda()
    
    # Benchmark
    if torch.cuda.is_available():
        # GPU
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        
        start.record()
        z_gpu = torch.mm(x_gpu, y_gpu)  # Matrix multiply
        end.record()
        torch.cuda.synchronize()
        
        time_gpu = start.elapsed_time(end) / 1000  # ms to seconds
        
        # CPU
        x_cpu = torch.rand(size, size)
        y_cpu = torch.rand(size, size)
        
        start_cpu = time.perf_counter()
        z_cpu = torch.mm(x_cpu, y_cpu)
        time_cpu = time.perf_counter() - start_cpu
        
        print(f"\nMatrix multiplication ({size}ร—{size}):")
        print(f"CPU: {time_cpu:.4f}s")
        print(f"GPU: {time_gpu:.4f}s")
        print(f"Speedup: {time_cpu/time_gpu:.1f}x faster!")
    else:
        print("\nNo GPU available for benchmarking")
        
else:
    print("""
    PyTorch GPU Example (Pseudocode):
    
    import torch
    
    # Check GPU availability
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    # Create tensor on GPU
    x = torch.rand(1000, 1000, device='cuda')
    y = torch.rand(1000, 1000, device='cuda')
    
    # All operations run on GPU
    z = torch.mm(x, y)
    
    # Move between devices
    x_cpu = x.cpu()  # GPU → CPU
    x_gpu = x_cpu.cuda()  # CPU → GPU
    """)

Part 5: Parallel Algorithm Patterns#

Certain algorithms are naturally parallel and map perfectly to GPUs.

Pattern 1: Map (Element-wise Operations)#

Apply same operation to each element independently.

Examples: Array addition, sigmoid activation, image filters

# CPU: Sequential
for i in range(n):
    output[i] = func(input[i])

# GPU: Parallel (each thread handles one element)
thread_id = blockIdx.x * blockDim.x + threadIdx.x
if thread_id < n:
    output[thread_id] = func(input[thread_id])
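As a concrete illustration (a sketch assuming CuPy and a GPU), the sigmoid activation is a pure map: every element is transformed independently, so the GPU can process all of them in parallel.

import cupy as cp

x = cp.linspace(-5, 5, 1_000_000, dtype=cp.float32)
y = 1.0 / (1.0 + cp.exp(-x))   # sigmoid applied element-wise: the map pattern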

Pattern 2: Reduce (Aggregation)#

Combine all elements into single value.

Examples: Sum, max, min, mean

Tree-based reduction:
[1, 2, 3, 4, 5, 6, 7, 8]
 └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘   Step 1: Pair-wise
   3     7     11    15
   └──┬──┘     └──┬──┘      Step 2: Pair-wise
      10          26
      └─────┬─────┘          Step 3: Final
            36
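The same tree can be written as a short sketch (assuming PyTorch with CUDA and a power-of-two length); in practice you would call the library's built-in sum, which runs an optimized reduction internally.

import torch

x = torch.arange(1, 9, dtype=torch.float32, device='cuda')  # [1, 2, ..., 8]
while x.numel() > 1:
    x = x[0::2] + x[1::2]   # pair-wise sums: 8 → 4 → 2 → 1 elements
print(x.item())             # 36.0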

Pattern 3: Scan (Prefix Sum)#

Compute running aggregation.

Examples: Cumulative sum, histogram, sorting

Input:  [1, 2, 3, 4, 5]
Output: [1, 3, 6, 10, 15]  (cumulative sum)
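In practice, libraries expose the scan pattern directly as a cumulative sum (a minimal sketch, assuming CuPy and a GPU are available).

import cupy as cp

x = cp.array([1, 2, 3, 4, 5], dtype=cp.float32)
print(cp.cumsum(x))   # [ 1.  3.  6. 10. 15.]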

Pattern 4: Stencil (Neighbor Operations)#

Compute based on neighbors in structured grid.

Examples: Convolution, blur, diffusion

3×3 kernel:
  ┌───┬───┬───┐
  │ 1 │ 2 │ 1 │
  ├───┼───┼───┤
  │ 2 │ 4 │ 2 │  Apply to each pixel
  ├───┼───┼───┤
  │ 1 │ 2 │ 1 │
  └───┴───┴───┘
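The 3×3 kernel above can be applied with an off-the-shelf convolution; here is a hedged sketch using PyTorch's conv2d on a random image (assumes PyTorch with CUDA).

import torch
import torch.nn.functional as F

kernel = torch.tensor([[1., 2., 1.],
                       [2., 4., 2.],
                       [1., 2., 1.]], device='cuda') / 16.0

image = torch.rand(1, 1, 1080, 1920, device='cuda')            # N, C, H, W
blurred = F.conv2d(image, kernel.view(1, 1, 3, 3), padding=1)  # same-size output
print(blurred.shape)   # torch.Size([1, 1, 1080, 1920])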

Part 6: GPU Memory Management#

Efficient memory usage is crucial for GPU performance.

if TORCH_AVAILABLE and torch.cuda.is_available():
    print("GPU Memory Management")
    print("=" * 40)
    
    # Memory stats
    def print_gpu_memory():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        total = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"Allocated: {allocated:.2f} GB")
        print(f"Reserved:  {reserved:.2f} GB")
        print(f"Total:     {total:.2f} GB")
    
    print("\nInitial state:")
    print_gpu_memory()
    
    # Allocate memory
    print("\nAfter creating 5000ร—5000 tensor:")
    x = torch.rand(5000, 5000, device='cuda')
    print_gpu_memory()
    
    # Free memory
    del x
    torch.cuda.empty_cache()  # Release reserved memory
    
    print("\nAfter deleting tensor and clearing cache:")
    print_gpu_memory()
    
    # Memory-efficient operations
    print("\n" + "="*40)
    print("Memory-Efficient Patterns:")
    print("="*40)
    
    # In-place operations save memory
    x = torch.rand(1000, 1000, device='cuda')
    
    # Bad: creates a new tensor
    # y = x + 1
    
    # Good: in-place (methods ending in an underscore modify the tensor directly)
    x.add_(1)
    
    # Note: torch.cuda.device() only selects the active device; GPU memory is
    # released when the last reference to a tensor is dropped
    with torch.cuda.device(0):
        temp = torch.rand(1000, 1000, device='cuda')
    del temp  # free it explicitly once no longer needed
    
    print("\nMemory Best Practices:")
    print("1. Use in-place operations: tensor.add_() vs tensor + 1")
    print("2. Delete large tensors when done: del tensor")
    print("3. Clear cache periodically: torch.cuda.empty_cache()")
    print("4. Use mixed precision (float16): Halves memory usage")
    print("5. Batch processing: Process data in chunks")
    print("6. Gradient checkpointing: Trade compute for memory")
    
else:
    print("GPU Memory Management (Conceptual):")
    print("""
    GPU memory is limited (8-80 GB typical).
    
    Best Practices:
    1. Monitor: torch.cuda.memory_allocated()
    2. Free: del tensor, torch.cuda.empty_cache()
    3. In-place ops: tensor.add_(1) instead of tensor + 1
    4. Mixed precision: Use float16 when possible
    5. Batch processing: Don't load all data at once
    """)

Part 7: Multi-GPU Programming#

Scale to multiple GPUs for even more performance.

if TORCH_AVAILABLE:
    n_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
    
    print(f"Multi-GPU Programming (Found {n_gpus} GPU(s))")
    print("=" * 40)
    
    if n_gpus > 1:
        # Data Parallel: Same model, split data
        print("\nData Parallelism Example:")
        
        # Simple model
        class SimpleModel(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = torch.nn.Linear(1000, 1000)
            
            def forward(self, x):
                return self.linear(x)
        
        model = SimpleModel()
        
        # Wrap with DataParallel
        model = torch.nn.DataParallel(model)
        model = model.cuda()
        
        # Forward pass automatically splits across GPUs
        x = torch.rand(128, 1000).cuda()  # Batch size 128
        output = model(x)  # Splits batch across GPUs
        
        print(f"Model on {torch.cuda.device_count()} GPUs")
        print(f"Input: {x.shape}, Output: {output.shape}")
        
        # DistributedDataParallel (better for multi-node)
        print("\nFor production, use DistributedDataParallel:")
        print("""
        from torch.nn.parallel import DistributedDataParallel as DDP
        
        # Initialize process group
        torch.distributed.init_process_group(backend='nccl')
        
        # Wrap model
        model = DDP(model, device_ids=[local_rank])
        """)
        
    else:
        print("""
        Multi-GPU Strategies:
        
        1. Data Parallelism:
           - Same model replicated on each GPU
           - Different data batches
           - Most common approach
           
        2. Model Parallelism:
           - Split model across GPUs
           - For models too large for single GPU
           - More complex implementation
           
        3. Pipeline Parallelism:
           - Different stages on different GPUs
           - Overlaps computation
           
        Example:
        model = torch.nn.DataParallel(model)  # Simple!
        """)
else:
    print("Multi-GPU programming requires PyTorch")

Part 8: Real-World GPU Applications#

Application 1: Image Processing#

# GPU-accelerated image filtering
import time

import numpy as np

if GPU_AVAILABLE:
    import cupy as cp
    
    # Create fake image (1920×1080, RGB)
    image_cpu = np.random.rand(1080, 1920, 3).astype(np.float32)
    image_gpu = cp.asarray(image_cpu)
    
    # Gaussian blur kernel
    def gaussian_blur_cpu(image):
        """CPU version using scipy's convolution for comparison."""
        from scipy.ndimage import convolve  # assumes scipy is installed
        kernel = np.array([[1, 2, 1],
                           [2, 4, 2],
                           [1, 2, 1]], dtype=np.float32) / 16
        # Convolve each colour channel separately
        return np.stack([convolve(image[..., c], kernel)
                         for c in range(image.shape[-1])], axis=-1)
    
    # Custom GPU kernel for blur
    blur_kernel = cp.ElementwiseKernel(
        'float32 x',
        'float32 y',
        'y = x * 0.8',  # Simplified placeholder (a real blur would read neighbouring pixels)
        'blur'
    )
    
    # Benchmark
    n_iter = 100
    
    # GPU
    start = time.perf_counter()
    for _ in range(n_iter):
        result_gpu = blur_kernel(image_gpu)
    cp.cuda.Stream.null.synchronize()
    time_gpu = time.perf_counter() - start
    
    print(f"Image Processing ({n_iter} iterations):")
    print(f"GPU: {time_gpu:.4f}s ({time_gpu/n_iter*1000:.2f}ms per frame)")
    print(f"FPS: {n_iter/time_gpu:.1f} frames/second")
    
else:
    print("GPU image processing can achieve 100+ FPS for HD video!")

Application 2: Monte Carlo Simulation#

# GPU-accelerated Monte Carlo
if TORCH_AVAILABLE and torch.cuda.is_available():
    def monte_carlo_pi_gpu(n_samples):
        """Estimate ฯ€ using GPU Monte Carlo."""
        # Generate random points on GPU
        x = torch.rand(n_samples, device='cuda')
        y = torch.rand(n_samples, device='cuda')
        
        # Check if inside unit circle
        inside = (x**2 + y**2) <= 1.0
        
        # Estimate ฯ€
        pi_estimate = 4.0 * inside.float().mean().item()
        return pi_estimate
    
    # Run simulation
    n_samples = 100_000_000  # 100 million!
    
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    
    start.record()
    pi = monte_carlo_pi_gpu(n_samples)
    end.record()
    torch.cuda.synchronize()
    
    elapsed = start.elapsed_time(end) / 1000
    
    print(f"\nMonte Carlo ฯ€ Estimation:")
    print(f"Samples: {n_samples:,}")
    print(f"Result: ฯ€ โ‰ˆ {pi:.6f} (true: 3.141593)")
    print(f"Error: {abs(pi - 3.141593):.6f}")
    print(f"Time: {elapsed:.4f}s")
    print(f"Throughput: {n_samples/elapsed/1e6:.1f} million samples/second")
    
else:
    print("Monte Carlo simulations benefit hugely from GPU parallelism!")

Part 9: GPU Optimization Techniques#

1. Coalesced Memory Access#

Problem: GPUs load memory in 128-byte chunks. Random access wastes bandwidth.

Bad (Strided):
Thread 0: array[0]
Thread 1: array[100]
Thread 2: array[200]  → Many memory transactions

Good (Coalesced):
Thread 0: array[0]
Thread 1: array[1]
Thread 2: array[2]  → One memory transaction
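A rough way to feel this from Python (a sketch assuming CuPy and a GPU; exact numbers vary by hardware): sum the same number of elements through a contiguous view and through a strided view that wastes part of every memory transaction.

import time
import cupy as cp

x = cp.random.rand(1 << 25, dtype=cp.float32)

def bench(view, n_iter=100):
    cp.cuda.Stream.null.synchronize()
    start = time.perf_counter()
    for _ in range(n_iter):
        view.sum()
    cp.cuda.Stream.null.synchronize()
    return time.perf_counter() - start

contiguous = bench(x[: 1 << 24])   # first half: contiguous, coalesced reads
strided = bench(x[::2])            # every other element: strided reads
print(f"Contiguous: {contiguous:.4f}s   Strided: {strided:.4f}s")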

2. Shared Memory#

Use fast shared memory (48-96 KB per SM) for data reuse.

// CUDA kernel (pseudocode)
__shared__ float tile[TILE_SIZE][TILE_SIZE];

// Load from global → shared memory (once per tile)
tile[ty][tx] = global_mem[...];
__syncthreads();

// Compute using shared memory (fast!)
float result = 0;
for (int i = 0; i < TILE_SIZE; i++)
    result += tile[ty][i] * tile[i][tx];

3. Occupancy Optimization#

Occupancy = Active warps / Maximum possible warps

Higher occupancy hides memory latency better.

Factors:

  • Threads per block (multiple of 32)

  • Registers per thread (fewer is better)

  • Shared memory usage (less is better)

Sweet spot: 128-256 threads per block

4. Kernel Fusion#

Combine multiple operations to reduce kernel launches.

# Bad: three separate kernel launches, each producing a temporary array
y = x + 1
z = y * 2
w = z - 3

# Good: fuse the three steps into a single kernel
# (in CuPy this needs cupy.fuse or a custom ElementwiseKernel; see the sketch below)
w = (x + 1) * 2 - 3
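A minimal fusion sketch using CuPy's fuse decorator (assuming CuPy and a GPU): the three element-wise steps are compiled into a single kernel, avoiding the intermediate arrays.

import cupy as cp

@cp.fuse()
def fused(x):
    return (x + 1) * 2 - 3   # compiled into one element-wise kernel

x = cp.arange(1_000_000, dtype=cp.float32)
w = fused(x)                 # one kernel launch instead of three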

5. Mixed Precision#

Use float16 when possible:

  • 2x less memory

  • 2x faster on Tensor Cores

  • Minimal accuracy loss

model = model.half()  # Convert to float16
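For training, PyTorch's automatic mixed precision handles the float16/float32 juggling for you. A self-contained sketch (assuming a CUDA GPU; the tiny linear model here is only for illustration):

import torch

model = torch.nn.Linear(1000, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()          # rescales gradients to avoid fp16 underflow

inputs = torch.rand(64, 1000, device='cuda')
targets = torch.randint(0, 10, (64,), device='cuda')

optimizer.zero_grad()
with torch.cuda.amp.autocast():               # eligible ops run in float16
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()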

Part 10: Exercises#

Exercise 1: Vector Addition (Difficulty: ★★☆☆☆)#

Task: Implement vector addition on GPU using CuPy or PyTorch:

  1. Create two large vectors (10 million elements)

  2. Add them on CPU and GPU

  3. Measure and compare performance

  4. Verify results are identical


Exercise 2: Matrix Multiplication Optimization (Difficulty: ★★★★☆)#

Task: Compare different matrix multiplication methods:

  1. Pure Python (nested loops)

  2. NumPy (CPU)

  3. CuPy or PyTorch (GPU)

  4. Mixed precision (float16 on GPU)

Test with various sizes and plot speedup vs matrix size.


Exercise 3: Image Convolution (Difficulty: ★★★★☆)#

Task: Implement 2D convolution on GPU:

  1. Load an image

  2. Apply various filters (blur, sharpen, edge detection)

  3. Compare CPU vs GPU performance

  4. Implement as custom CuPy kernel


Exercise 4: Parallel Reduction (Difficulty: ★★★★☆)#

Task: Implement parallel sum reduction:

  1. Create array of 100 million numbers

  2. Implement tree-based reduction

  3. Compare with built-in sum

  4. Measure throughput (GB/s)


Exercise 5: Multi-GPU Training (Difficulty: ★★★★★)#

Task: If you have multiple GPUs:

  1. Create a simple neural network

  2. Implement data-parallel training

  3. Measure speedup vs single GPU

  4. Monitor GPU utilization


Exercise 6: Memory Bandwidth Test (Difficulty: ★★★☆☆)#

Task: Measure GPU memory bandwidth:

  1. Copy large arrays between CPU and GPU

  2. Measure transfer speed (GB/s)

  3. Compare with GPU specs

  4. Identify bottlenecks (PCIe vs GPU memory)


Part 11: Self-Check Quiz#

Question 1#

Why are GPUs faster than CPUs for parallel workloads?

A) Higher clock speed
B) Thousands of cores for massive parallelism
C) Larger cache
D) Better branch prediction

Answer: B) Thousands of cores for massive parallelism

Explanation: GPUs sacrifice per-core performance for massive parallelism, with thousands of simpler cores that excel at data-parallel tasks.


Question 2#

What is the main bottleneck when using GPUs?

A) Computation speed
B) Data transfer between CPU and GPU
C) Power consumption
D) Programming difficulty

Answer: B) Data transfer between CPU and GPU

Explanation: PCIe bandwidth is limited (16-32 GB/s), much slower than GPU memory bandwidth (1000+ GB/s). Minimize CPU ↔ GPU transfers!


Question 3#

What does synchronize() do in GPU programming?

A) Copies data to GPU
B) Waits for GPU operations to complete
C) Frees GPU memory
D) Compiles kernels

Answer: B) Waits for GPU operations to complete

Explanation: GPU operations are asynchronous. synchronize() ensures operations finish before continuing, necessary for accurate timing.


Question 4#

When should you use float16 instead of float32 on GPU?

A) Always, it's always faster
B) Never, it's less accurate
C) When memory is limited and precision loss is acceptable
D) Only for integer operations

Answer: C) When memory is limited and precision loss is acceptable

Explanation: float16 uses half the memory and is faster on Tensor Cores, but has less precision. It is a good fit for deep learning; check accuracy carefully for other applications.


Question 5#

What is DataParallel used for?

A) Training different models on different GPUs
B) Splitting same model across multiple GPUs
C) Distributing data batches across multiple GPUs with same model
D) Compressing model size

Answer: C) Distributing data batches across multiple GPUs with same model

Explanation: DataParallel replicates the model on each GPU and splits the batch across GPUs, then combines results. Most common multi-GPU approach.


Key Takeaways#

  1. GPUs excel at parallelism: Thousands of cores for data-parallel tasks

  2. Transfer is expensive: Keep data on GPU, minimize CPU ↔ GPU copies

  3. CuPy = NumPy on GPU: Easiest way to start GPU computing

  4. PyTorch for deep learning: Seamless GPU acceleration

  5. Memory is limited: Monitor usage, use float16 when possible

  6. Synchronization matters: GPU ops are async, synchronize for timing

  7. Batch operations: Large batches amortize launch overhead

  8. Coalesced access: Contiguous memory access is critical

  9. Multi-GPU scales: DataParallel for easy multi-GPU training

  10. Right tool for job: GPU for parallel, CPU for sequential


Common Mistakes#

  1. Frequent CPU ↔ GPU transfers: Keep data on GPU

  2. Small workloads: Overhead dominates, GPU slower than CPU

  3. Forgetting synchronize(): Timing without sync is wrong

  4. Memory leaks: Delete tensors, clear cache

  5. Wrong precision: float64 on GPU is slow

  6. Sequential operations: GPU needs parallelism

  7. Not profiling: Assumptions about bottlenecks

  8. Ignoring occupancy: Too many/few threads per block


Pro Tips#

  1. Use Google Colab: Free GPU access for learning

  2. Profile with nvprof: Identify kernel bottlenecks

  3. torch.cuda.amp: Automatic mixed precision

  4. Pin memory: Faster CPU → GPU transfers (see the sketch after this list)

  5. Async transfers: Overlap compute and transfer

  6. NVIDIA Nsight: Visual profiling tool

  7. Benchmarking: Warm up kernels, multiple runs

  8. GPU utils: nvidia-smi for monitoring
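A short sketch of tips 4 and 5 (assuming PyTorch with CUDA): page-locked (pinned) host memory allows faster, asynchronous copies to the GPU that can overlap with computation.

import torch

cpu_tensor = torch.rand(1 << 20).pin_memory()           # page-locked host buffer
gpu_tensor = cpu_tensor.to('cuda', non_blocking=True)   # async copy from pinned memory
torch.cuda.synchronize()                                 # wait before using the result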


Whatโ€™s Next?#

You're now ready for GPU-accelerated computing!

Advanced Topics:

  1. CUDA C++: Write custom kernels for maximum performance

  2. JAX: Composable transformations for ML research

  3. TensorRT: Optimize models for inference

  4. Distributed Training: Multi-node GPU clusters

  5. GPU Optimization: Advanced memory patterns

Projects to Build:

  • Real-time image processing pipeline

  • GPU-accelerated data science workflow

  • Deep learning model from scratch

  • Physics simulation (N-body, fluid dynamics)

  • Cryptocurrency miner (educational!)

Remember: GPUs are powerful but not magic. Profile first, optimize bottlenecks, and use the right tool for each task!

Congratulations on completing the Education Playground curriculum! 🎓🚀