Hard Lesson 02: Generators and Iterators - Memory-Efficient Data Processing#

Master the art of memory-efficient data processing using generators, iterators, and coroutines.

Learning Objectives#

By the end of this lesson, you will be able to:

  • ✅ Understand the iterator protocol and implement custom iterators

  • ✅ Create generator functions using yield for lazy evaluation

  • ✅ Build memory-efficient data pipelines with generator expressions

  • ✅ Use advanced generator patterns: yield from, send(), throw(), close()

  • ✅ Implement coroutines for cooperative multitasking

  • ✅ Create infinite sequences and bounded iterators

  • ✅ Apply generators to real-world problems (file processing, data streaming)

  • ✅ Analyze memory and performance tradeoffs

Prerequisites#

  • Strong understanding of Python functions and scope

  • Familiarity with decorators and closures

  • Knowledge of list comprehensions

  • Understanding of memory management concepts

Why Generators and Iterators Matter#

Real-World Applications:

  • Big Data Processing: Stream terabytes of data without loading into memory

  • Web Scraping: Process paginated results efficiently

  • Log Analysis: Parse multi-gigabyte log files line by line

  • Machine Learning: Generate training batches on-the-fly

  • ETL Pipelines: Transform data streams in real-time

  • API Rate Limiting: Control request timing with generator-based delays


Part 1: The Iterator Protocol - Building Blocks of Iteration#

What is an Iterator?#

An iterator is an object that implements two methods:

  • __iter__(): Returns the iterator object itself

  • __next__(): Returns the next value or raises StopIteration

This protocol enables the for loop and other iteration contexts.

# Understanding how iteration works under the hood
numbers = [1, 2, 3, 4, 5]

# When you use 'for', Python calls iter() to get an iterator
iterator = iter(numbers)
print(f"Iterator object: {iterator}")
print(f"Type: {type(iterator)}")

# Then repeatedly calls next() until StopIteration
print(f"\nManual iteration:")
print(next(iterator))  # 1
print(next(iterator))  # 2
print(next(iterator))  # 3
print(next(iterator))  # 4
print(next(iterator))  # 5
# print(next(iterator))  # Would raise StopIteration

Creating a Custom Iterator#

Let's build a custom iterator from scratch:

class Countdown:
    """
    Custom iterator that counts down from a number.
    
    This demonstrates the iterator protocol:
    - __iter__() returns self (the iterator object)
    - __next__() returns next value or raises StopIteration
    """
    def __init__(self, start):
        self.current = start
    
    def __iter__(self):
        """Return the iterator object (self)."""
        return self
    
    def __next__(self):
        """Return the next value or raise StopIteration."""
        if self.current <= 0:
            raise StopIteration
        
        value = self.current
        self.current -= 1
        return value

# Using the custom iterator
print("Countdown from 5:")
counter = Countdown(5)
for num in counter:
    print(num, end=" ")

# Can't iterate again (iterator is exhausted)
print("\n\nTrying to iterate again:")
for num in counter:
    print(num, end=" ")
print("(Nothing printed - iterator exhausted)")

Iterable vs Iterator#

Important distinction:

  • Iterable: Object that can return an iterator (has __iter__())

  • Iterator: Object that produces values (has __iter__() and __next__())

An iterable can be iterated multiple times; an iterator is single-use.

class CountdownIterable:
    """
    An ITERABLE (not iterator) that creates new iterators.
    This allows multiple iterations.
    """
    def __init__(self, start):
        self.start = start
    
    def __iter__(self):
        """Return a NEW iterator each time."""
        return CountdownIterator(self.start)

class CountdownIterator:
    """The actual iterator."""
    def __init__(self, start):
        self.current = start
    
    def __iter__(self):
        return self
    
    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

# Now we can iterate multiple times
countdown = CountdownIterable(3)

print("First iteration:")
for num in countdown:
    print(num, end=" ")

print("\n\nSecond iteration:")
for num in countdown:
    print(num, end=" ")

print("\n\nWorks because each 'for' gets a fresh iterator!")

Part 2: Generator Functions - Elegant Iterators#

Why Generators?#

Writing custom iterator classes is verbose. Generators provide a simpler syntax using the yield keyword.

Key Benefits:

  • Simple Syntax: No need for __iter__() and __next__()

  • Automatic State Management: Local variables are preserved between calls

  • Memory Efficient: Values are generated on-demand

  • Lazy Evaluation: Compute only what's needed

def countdown(n):
    """
    Generator function for counting down.
    
    Much simpler than the class-based iterator!
    """
    while n > 0:
        yield n  # Pause here and return n
        n -= 1   # Resume here on next call

# Using the generator
print("Countdown from 5:")
for num in countdown(5):
    print(num, end=" ")

# Generators are single-use (like iterators)
gen = countdown(3)
print("\n\nFirst iteration:", list(gen))
print("Second iteration:", list(gen))  # Empty!

How Generators Work: Execution Flow#

When you call a generator function:

  1. It returns a generator object (doesn't execute the body)

  2. Calling next() executes until the first yield

  3. yield pauses execution and returns a value

  4. Next next() call resumes after the yield

  5. When the function body ends, StopIteration is raised

def demo_generator():
    """Demonstrate generator execution flow."""
    print("  [Generator started]")
    
    print("  [About to yield 1]")
    yield 1
    
    print("  [Resumed after yield 1]")
    print("  [About to yield 2]")
    yield 2
    
    print("  [Resumed after yield 2]")
    print("  [About to yield 3]")
    yield 3
    
    print("  [Generator ending]")

print("Creating generator:")
gen = demo_generator()
print(f"Type: {type(gen)}\n")

print("First next():")
value = next(gen)
print(f"Got value: {value}\n")

print("Second next():")
value = next(gen)
print(f"Got value: {value}\n")

print("Third next():")
value = next(gen)
print(f"Got value: {value}\n")

print("Fourth next() (will raise StopIteration):")
try:
    next(gen)
except StopIteration:
    print("StopIteration raised!")

Classic Example: Fibonacci Sequence#

Generators shine when producing sequences:

def fibonacci(n):
    """
    Generate the first n Fibonacci numbers.
    
    Memory efficient: doesn't store all numbers in a list.
    """
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b

print("First 15 Fibonacci numbers:")
for i, fib in enumerate(fibonacci(15)):  # index from 0, matching the convention F(0) = 0
    print(f"F({i}) = {fib}")

# Can convert to list if needed
print("\nAs a list:", list(fibonacci(10)))

Part 3: Memory Efficiency - The Power of Lazy Evaluation#

List vs Generator: Memory Comparison#

Let's see why generators are memory-efficient:

import sys

# List approach: stores all values in memory
def squares_list(n):
    """Return list of squares from 0 to n-1."""
    return [x**2 for x in range(n)]

# Generator approach: computes on-demand
def squares_generator(n):
    """Yield squares from 0 to n-1."""
    for x in range(n):
        yield x**2

# Compare memory usage
n = 100000

# List version
squares_l = squares_list(n)
list_size = sys.getsizeof(squares_l)
print(f"List of {n:,} squares:")
print(f"  Memory: {list_size:,} bytes ({list_size / 1024 / 1024:.2f} MB)")
print(f"  First 5: {squares_l[:5]}")

# Generator version
squares_g = squares_generator(n)
gen_size = sys.getsizeof(squares_g)
print(f"\nGenerator for {n:,} squares:")
print(f"  Memory: {gen_size:,} bytes ({gen_size / 1024:.2f} KB)")
print(f"  First 5: {[next(squares_g) for _ in range(5)]}")

# Memory savings
savings = (list_size - gen_size) / list_size * 100
print(f"\n๐ŸŽฏ Memory savings: {savings:.2f}%")
print(f"   ({list_size / gen_size:.0f}x smaller)")

Generator Expressions#

Like list comprehensions, but with () instead of []:

# List comprehension - creates entire list
list_comp = [x**2 for x in range(10)]
print(f"List comprehension: {list_comp}")
print(f"Type: {type(list_comp)}")
print(f"Size: {sys.getsizeof(list_comp)} bytes\n")

# Generator expression - creates generator
gen_exp = (x**2 for x in range(10))
print(f"Generator expression: {gen_exp}")
print(f"Type: {type(gen_exp)}")
print(f"Size: {sys.getsizeof(gen_exp)} bytes")
print(f"Values: {list(gen_exp)}")

# Perfect for operations that don't need the full list
print("\n๐ŸŽฏ Use cases for generator expressions:")

# Sum (only needs one value at a time)
total = sum(x**2 for x in range(1000000))
print(f"Sum of first million squares: {total:,}")

# Any/all (can short-circuit)
has_large = any(x > 50 for x in range(100))
print(f"Has number > 50: {has_large}")

# Max/min
largest = max(x**2 for x in range(1000))
print(f"Largest square: {largest:,}")

Real-World Example: Processing Large Files#

Generators excel at processing large files line by line:

def process_large_file(filename):
    """
    Generator that processes file line by line.
    
    Memory-efficient: doesn't load entire file into memory.
    Useful for multi-gigabyte log files.
    """
    with open(filename, 'r') as f:
        for line in f:  # File objects are iterators!
            # Process each line
            cleaned = line.strip()
            if cleaned and not cleaned.startswith('#'):
                yield cleaned

def count_errors_in_log(filename):
    """Count ERROR lines in a log file (memory-efficient)."""
    return sum(1 for line in process_large_file(filename) 
               if 'ERROR' in line)

# Example simulation (without actual file)
def simulate_log_lines():
    """Simulate log file processing."""
    logs = [
        "INFO: Application started",
        "DEBUG: Loading config",
        "ERROR: Failed to connect to database",
        "INFO: Retrying connection",
        "ERROR: Connection timeout",
        "INFO: Using fallback database",
        "# This is a comment",
        "",
        "ERROR: Invalid user input",
    ]
    for log in logs:
        yield log.strip()

print("Processing log file:")
error_count = sum(1 for line in simulate_log_lines() 
                  if line and not line.startswith('#') and 'ERROR' in line)
print(f"Found {error_count} errors")

print("\n๐Ÿ“Š Log summary:")
log_types = {}
for line in simulate_log_lines():
    if line and not line.startswith('#'):
        log_type = line.split(':')[0] if ':' in line else 'UNKNOWN'
        log_types[log_type] = log_types.get(log_type, 0) + 1

for log_type, count in sorted(log_types.items()):
    print(f"  {log_type}: {count}")

Part 4: Generator Pipelines - Composing Data Transformations#

Building Data Pipelines#

Generators can be chained to create elegant data processing pipelines:

def read_data(n=20):
    """Stage 1: Generate data source."""
    print("[Stage 1: Generating data]")
    for i in range(1, n + 1):
        yield i

def filter_even(numbers):
    """Stage 2: Filter only even numbers."""
    print("[Stage 2: Filtering even numbers]")
    for num in numbers:
        if num % 2 == 0:
            print(f"  โœ“ {num} is even")
            yield num
        else:
            print(f"  โœ— {num} is odd (skipped)")

def square(numbers):
    """Stage 3: Square each number."""
    print("[Stage 3: Squaring numbers]")
    for num in numbers:
        result = num ** 2
        print(f"  {num}ยฒ = {result}")
        yield result

def take(n, iterable):
    """Stage 4: Take only first n items."""
    print(f"[Stage 4: Taking first {n} items]")
    for i, item in enumerate(iterable):
        if i >= n:
            break
        yield item

# Build the pipeline
print("Building pipeline: data โ†’ filter_even โ†’ square โ†’ take(5)\n")
pipeline = take(5, square(filter_even(read_data(20))))

print("\nExecuting pipeline (lazy evaluation):")
print("="*50)
result = list(pipeline)
print("="*50)
print(f"\nFinal result: {result}")
print(f"\n๐ŸŽฏ Notice: Each stage processes on-demand!")

Pipeline Pattern: ETL (Extract, Transform, Load)#

A common pattern in data engineering:

def extract_records():
    """Extract: Simulate reading from data source."""
    records = [
        {"id": 1, "name": "Alice", "age": 30, "city": "NYC"},
        {"id": 2, "name": "Bob", "age": 25, "city": "LA"},
        {"id": 3, "name": "Charlie", "age": 35, "city": "NYC"},
        {"id": 4, "name": "David", "age": 28, "city": "SF"},
        {"id": 5, "name": "Eve", "age": 32, "city": "NYC"},
    ]
    for record in records:
        yield record

def transform_filter_city(records, city):
    """Transform: Filter by city."""
    for record in records:
        if record['city'] == city:
            yield record

def transform_add_category(records):
    """Transform: Add age category."""
    for record in records:
        if record['age'] < 30:
            record['category'] = 'Young'
        else:
            record['category'] = 'Senior'
        yield record

def load_to_storage(records):
    """Load: Simulate saving to database."""
    results = []
    for record in records:
        print(f"Saving: {record}")
        results.append(record)
    return results

# ETL Pipeline
print("ETL Pipeline: Extract โ†’ Filter(NYC) โ†’ Add Category โ†’ Load\n")
pipeline = transform_add_category(
    transform_filter_city(
        extract_records(),
        city='NYC'
    )
)

saved_records = load_to_storage(pipeline)
print(f"\nโœ… Loaded {len(saved_records)} records")

Part 5: Infinite Generators - Unbounded Sequences#

Creating Infinite Sequences#

Generators can represent infinite sequences (use with caution!):

def infinite_counter(start=0):
    """Generate infinite sequence of integers."""
    n = start
    while True:  # Infinite loop!
        yield n
        n += 1

# Safe: use with a limit
counter = infinite_counter(100)
print("First 10 numbers starting from 100:")
for _ in range(10):
    print(next(counter), end=" ")

print("\n\n๐Ÿ” Infinite Fibonacci:")
def fibonacci_infinite():
    """Generate Fibonacci numbers forever."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Take only what you need
fib = fibonacci_infinite()
print("First 20 Fibonacci numbers:")
for i, num in enumerate(fib):
    if i >= 20:
        break
    print(num, end=" ")

Practical Use: Cycle and Repeat Patterns#

from itertools import chain, cycle, repeat, islice

# Cycle: repeat sequence infinitely
colors = cycle(['red', 'green', 'blue'])
print("Cycling through colors (first 10):")
for i, color in enumerate(colors):
    if i >= 10:
        break
    print(f"  {i}: {color}")

# Repeat: repeat single value
print("\nRepeat 'X' 5 times:")
for val in repeat('X', 5):
    print(val, end=" ")

# Combining with zip for padding
print("\n\nZipping with infinite repeat:")
names = ['Alice', 'Bob', 'Charlie']
scores = [95, 87]  # Fewer scores than names

# Pad scores with 0 using an infinite repeat (zip stops at the shortest input)
padded_scores = chain(scores, repeat(0))
for name, score in zip(names, padded_scores):
    print(f"  {name}: {score}")

Part 6: Advanced Generator Features#

Generator Methods: send(), throw(), close()#

Generators can receive values and exceptions:

def running_average():
    """
    Coroutine that maintains a running average.
    
    Uses send() to receive values.
    """
    total = 0
    count = 0
    average = None
    
    while True:
        # Receive value sent via send()
        value = yield average
        
        if value is None:
            break
        
        total += value
        count += 1
        average = total / count

# Create coroutine
avg = running_average()

# MUST call next() or send(None) to prime the coroutine
next(avg)  # Advance to first yield

print("Running average coroutine:")
print(f"  Send 10: {avg.send(10)}")
print(f"  Send 20: {avg.send(20)}")
print(f"  Send 30: {avg.send(30)}")
print(f"  Send 40: {avg.send(40)}")
print(f"\nโœ… Average of [10, 20, 30, 40] = {avg.send(50)}")

Using throw() to Send Exceptions#

def generator_with_exception_handling():
    """
    Generator that can handle exceptions sent via throw().
    """
    try:
        while True:
            value = yield
            print(f"  Received: {value}")
    except ValueError as e:
        print(f"  โš ๏ธ  Caught ValueError: {e}")
        yield "Error handled"
    finally:
        print("  ๐Ÿ”š Generator closing")

gen = generator_with_exception_handling()
next(gen)  # Prime

print("Sending values:")
gen.send(10)
gen.send(20)

print("\nThrowing exception:")
try:
    result = gen.throw(ValueError("Invalid input!"))  # pass an instance; the multi-argument form is deprecated
    print(f"  Result after exception: {result}")
except StopIteration:
    print("  Generator stopped")

Using close() to Stop a Generator#

def generator_with_cleanup():
    """
    Generator with cleanup logic.
    """
    try:
        print("  ๐Ÿ”ง Setting up resources...")
        for i in range(10):
            yield i
    finally:
        print("  ๐Ÿงน Cleaning up resources...")

gen = generator_with_cleanup()
print("Getting first 3 values:")
for _ in range(3):
    print(f"  Value: {next(gen)}")

print("\nClosing generator early:")
gen.close()

print("\nTrying to use closed generator:")
try:
    next(gen)
except StopIteration:
    print("  โŒ Generator is closed!")

Part 7: yield from - Delegating to Subgenerators#

Generator Delegation#

The yield from syntax delegates to another generator:

def generator1():
    """First generator."""
    yield 1
    yield 2
    yield 3

def generator2():
    """Second generator."""
    yield 'a'
    yield 'b'
    yield 'c'

# Without yield from (manual delegation)
def combined_manual():
    """Manually combine generators."""
    for value in generator1():
        yield value
    for value in generator2():
        yield value

# With yield from (cleaner)
def combined_yield_from():
    """Use yield from for delegation."""
    yield from generator1()
    yield from generator2()

print("Manual delegation:")
print(list(combined_manual()))

print("\nUsing yield from:")
print(list(combined_yield_from()))
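
yield from does more than concatenate values: it forwards send() and throw() to the subgenerator, and the subgenerator's return value becomes the value of the yield from expression. A sketch of this delegation pattern:

def averager():
    """Subgenerator: accumulates sent values, returns their average."""
    total = 0.0
    count = 0
    while True:
        value = yield
        if value is None:
            break
        total += value
        count += 1
    return total / count  # becomes the value of the 'yield from' expression

def grouper(results, key):
    """Delegating generator: send() passes straight through to averager."""
    while True:
        results[key] = yield from averager()

results = {}
g = grouper(results, 'scores')
next(g)           # prime: runs until averager's first yield
for v in (10, 20, 30):
    g.send(v)     # forwarded to the subgenerator
g.send(None)      # ends averager; its return value is captured by grouper
print(results)    # {'scores': 20.0}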

Flattening Nested Structures#

def flatten(nested_list):
    """
    Recursively flatten a nested list.
    
    Uses yield from for elegant recursion.
    """
    for item in nested_list:
        if isinstance(item, list):
            # Recursively flatten sublists
            yield from flatten(item)
        else:
            yield item

nested = [1, [2, 3, [4, 5]], 6, [7, [8, 9]]]
print(f"Nested: {nested}")
print(f"Flattened: {list(flatten(nested))}")

# More complex example
complex_nested = [
    1,
    [2, 3],
    [[4, 5], [6]],
    [[[7]], 8],
    9
]
print(f"\nComplex nested: {complex_nested}")
print(f"Flattened: {list(flatten(complex_nested))}")

Part 8: Real-World Applications#

Example 1: Batch Processing for Machine Learning#

def batch_generator(data, batch_size):
    """
    Generate batches for training neural networks.
    
    Memory-efficient: doesn't load all batches at once.
    """
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

# Simulate training data
training_data = list(range(1, 101))  # 100 samples
batch_size = 10

print(f"Training on {len(training_data)} samples in batches of {batch_size}\n")

for epoch in range(1, 3):  # 2 epochs
    print(f"Epoch {epoch}:")
    for batch_num, batch in enumerate(batch_generator(training_data, batch_size), 1):
        # Simulate training
        avg = sum(batch) / len(batch)
        print(f"  Batch {batch_num}: size={len(batch)}, avg={avg:.1f}")
    print()

Example 2: API Pagination Handler#

def fetch_paginated_api(max_pages=5):
    """
    Simulate fetching paginated API results.
    
    In real code, this would make HTTP requests.
    Generator allows processing results as they arrive.
    """
    page = 1
    while page <= max_pages:
        # Simulate API response
        results = [
            {"id": (page - 1) * 10 + i, "value": f"Item {(page - 1) * 10 + i}"}
            for i in range(1, 11)
        ]
        
        print(f"  ๐Ÿ“ฅ Fetched page {page}")
        
        # Yield each result
        for result in results:
            yield result
        
        page += 1
        
        # Check if there are more pages (in real code, check API response)
        if page > max_pages:
            print(f"  โœ… No more pages\n")
            break

print("Fetching API results:\n")
for i, item in enumerate(fetch_paginated_api(max_pages=3), 1):
    if i <= 5 or i > 25:  # Show first and last few
        print(f"  Item {i}: {item}")
    elif i == 6:
        print(f"  ... (processing items 6-25) ...")

Example 3: Moving Average Calculator#

from collections import deque

def moving_average(data, window_size):
    """
    Calculate moving average over a sliding window.
    
    Memory-efficient for large datasets.
    """
    window = deque(maxlen=window_size)
    
    for value in data:
        window.append(value)
        if len(window) == window_size:
            yield sum(window) / window_size

# Stock prices simulation
prices = [100, 102, 98, 105, 110, 108, 112, 115, 111, 114]
window = 3

print(f"Stock prices: {prices}")
print(f"\nMoving average (window={window}):")

for i, avg in enumerate(moving_average(prices, window), window):
    print(f"  Day {i}: ${avg:.2f}")

Part 9: Performance Comparison#

Benchmark: List vs Generator#

import time
import sys

def benchmark_list_vs_generator():
    """Compare performance of list vs generator."""
    n = 1000000
    
    # List approach
    start = time.time()
    list_result = [x**2 for x in range(n)]
    first_10_list = list_result[:10]
    list_time = time.time() - start
    list_memory = sys.getsizeof(list_result)
    
    # Generator approach
    start = time.time()
    gen_result = (x**2 for x in range(n))
    first_10_gen = [next(gen_result) for _ in range(10)]
    gen_time = time.time() - start
    gen_memory = sys.getsizeof(gen_result)
    
    print(f"Computing first 10 squares from {n:,} numbers:\n")
    
    print("List Comprehension:")
    print(f"  Time: {list_time*1000:.2f} ms")
    print(f"  Memory: {list_memory:,} bytes ({list_memory/1024/1024:.2f} MB)")
    print(f"  Result: {first_10_list}")
    
    print("\nGenerator Expression:")
    print(f"  Time: {gen_time*1000:.4f} ms")
    print(f"  Memory: {gen_memory:,} bytes")
    print(f"  Result: {first_10_gen}")
    
    print("\n๐Ÿ“Š Comparison:")
    print(f"  Speed: Generator is {list_time/gen_time:.0f}x faster")
    print(f"  Memory: Generator uses {list_memory/gen_memory:.0f}x less memory")

benchmark_list_vs_generator()

Exercises#

Exercise 1: Custom Range Iterator#

Implement a custom MyRange class that mimics Python's range() behavior using the iterator protocol.

# Your code here
class MyRange:
    """
    Custom range implementation.
    
    Should support:
    - MyRange(stop)
    - MyRange(start, stop)
    - MyRange(start, stop, step)
    """
    def __init__(self, *args):
        # TODO: Implement __init__
        pass
    
    def __iter__(self):
        # TODO: Return iterator
        pass
    
    def __next__(self):
        # TODO: Return next value or raise StopIteration
        pass

# Test your implementation
# print("MyRange(5):", list(MyRange(5)))
# print("MyRange(2, 8):", list(MyRange(2, 8)))
# print("MyRange(0, 10, 2):", list(MyRange(0, 10, 2)))
# print("MyRange(10, 0, -1):", list(MyRange(10, 0, -1)))

Exercise 2: Generator Pipeline for Data Processing#

Build the generator pipeline from the original exercise:

  1. Generate random numbers between 1 and 100

  2. Filter numbers divisible by both 3 and 5 (divisible by 15)

  3. Transform each number by multiplying by 2

  4. Stop after finding 10 numbers that meet the criteria

Compare memory usage to a list-based approach.

# Your code here
import random
import sys

# TODO: Implement generator pipeline

def generate_random_numbers():
    """Generate infinite stream of random numbers between 1 and 100."""
    pass  # TODO

def filter_divisible_by_15(numbers):
    """Filter numbers divisible by 15."""
    pass  # TODO

def multiply_by_2(numbers):
    """Multiply each number by 2."""
    pass  # TODO

def take_n(iterable, n):
    """Take first n items from iterable."""
    pass  # TODO

# Build pipeline
# pipeline = ...
# result = list(pipeline)
# print(f"Result: {result}")

# Compare with list-based approach
# TODO: Implement list-based version and compare memory usage

Exercise 3: File Processing with Generators#

Create a generator function that:

  1. Reads a CSV-like string (simulate file reading)

  2. Parses each line into a dictionary

  3. Filters rows where a specific column meets a condition

  4. Yields the processed records

# Your code here

# Sample CSV data
csv_data = """name,age,city,salary
Alice,30,NYC,80000
Bob,25,LA,65000
Charlie,35,NYC,95000
David,28,SF,75000
Eve,32,NYC,88000
Frank,29,LA,70000"""

def parse_csv_lines(csv_string):
    """
    Generator that parses CSV string into dictionaries.
    
    Yields one dictionary per row.
    """
    pass  # TODO

def filter_records(records, column, condition):
    """
    Filter records based on condition.
    
    Args:
        records: Generator of dictionaries
        column: Column name to check
        condition: Function that returns True/False
    """
    pass  # TODO

# Test your implementation
# records = parse_csv_lines(csv_data)
# high_earners = filter_records(records, 'salary', lambda x: int(x) > 75000)
# for record in high_earners:
#     print(record)

Exercise 4: Coroutine-Based Logger#

Create a coroutine that receives log messages via send() and:

  • Categorizes them by level (INFO, WARNING, ERROR)

  • Maintains counts for each level

  • Returns summary statistics when receiving None

# Your code here

def log_analyzer():
    """
    Coroutine that analyzes log messages.
    
    Send log messages like: "ERROR: Connection failed"
    Send None to get summary statistics.
    """
    pass  # TODO

# Test your implementation
# logger = log_analyzer()
# next(logger)  # Prime the coroutine

# logger.send("INFO: Application started")
# logger.send("ERROR: Connection failed")
# logger.send("WARNING: High memory usage")
# logger.send("ERROR: Timeout occurred")
# logger.send("INFO: Request completed")

# summary = logger.send(None)
# print(f"Summary: {summary}")

Pro Tips#

🎯 Best Practices#

  1. Use generators for large datasets: When data doesn't fit in memory

  2. Prefer generator expressions: More concise than generator functions for simple transformations

  3. Prime coroutines: Always call next() or send(None) before using send()

  4. Be careful with infinite generators: Always use limiting mechanisms (take_n, islice, break)

  5. Use yield from for delegation: Cleaner than manual for-loop delegation

  6. Consider itertools: Built-in module with powerful generator utilities (see the short sketch after this list)

  7. Document generator state: Make clear if generator is single-use or reusable

  8. Use generators for pipelines: Chain operations for readable, efficient code
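
A quick taste of itertools (everything here is standard library):

from itertools import chain, count, islice, takewhile

print(list(islice(count(10, 5), 4)))                     # [10, 15, 20, 25]
print(list(takewhile(lambda x: x < 40, count(10, 5))))   # [10, 15, 20, 25, 30, 35]
print(list(chain([1, 2], (3, 4), "ab")))                 # [1, 2, 3, 4, 'a', 'b']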

โš ๏ธ Common Mistakes#

  1. Forgetting generators are single-use: Can't iterate twice without recreating (or use itertools.tee, sketched after this list)

  2. Not priming coroutines: Must call next() before send()

  3. Converting to list unnecessarily: Defeats the purpose of lazy evaluation

  4. Infinite generators without limits: Can cause infinite loops

  5. Ignoring StopIteration: Should be handled in manual iteration

  6. Mixing iteration protocols: Don't mix __iter__/__next__ with yield in the same class

  7. Not using yield from: Manual delegation is more error-prone

  8. Forgetting cleanup: Use try/finally or context managers for resource cleanup
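
If you genuinely need two passes over a generator's output, itertools.tee splits it into independent iterators (it buffers values internally, so avoid letting the two consumers drift far apart):

from itertools import tee

gen = (x ** 2 for x in range(5))
first, second = tee(gen, 2)   # don't touch 'gen' directly after this
print(list(first))   # [0, 1, 4, 9, 16]
print(list(second))  # [0, 1, 4, 9, 16] - a second, independent pass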

๐Ÿ” When to Use What#

Use Lists When:

  • Data fits comfortably in memory

  • Need random access or indexing

  • Need to iterate multiple times

  • Want to modify elements in-place

Use Generators When:

  • Data is very large or infinite

  • Only need to iterate once

  • Processing data in a pipeline

  • Want to minimize memory usage

  • Implementing iteration protocol


Key Takeaways#

  1. Iterators implement __iter__() and __next__() for custom iteration logic

  2. Generators provide elegant iterator creation using yield keyword

  3. Lazy evaluation means values are computed on-demand, saving memory

  4. Generator expressions offer memory-efficient alternative to list comprehensions

  5. Pipelines chain generators for readable, efficient data processing

  6. Infinite sequences are possible with generators (use carefully)

  7. Coroutines use send(), throw(), and close() for bidirectional communication

  8. yield from simplifies generator delegation and subgenerator handling

  9. Performance: Generators are often faster when results are only partially consumed, and far more memory-efficient for large data

  10. Real-world: Essential for big data, streaming, ETL, ML batching, and API handling


Next Steps#

  1. Explore itertools: Study the itertools module (chain, product, permutations, etc.)

  2. Async generators: Learn about async def and async for for asynchronous iteration

  3. Context managers: Combine generators with context managers using contextlib.contextmanager (sketched below)

  4. Data streaming: Build real-time data processing pipelines

  5. Performance profiling: Use timeit and memory_profiler to measure improvements

  6. Practice with large datasets: Process real CSV files, logs, or API data
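
As a preview of item 3, a minimal contextlib.contextmanager sketch: the single yield marks the boundary between setup and teardown.

from contextlib import contextmanager

@contextmanager
def managed_resource(name):
    print(f"Acquiring {name}")      # setup, runs on entering 'with'
    try:
        yield name                  # value bound by 'as'
    finally:
        print(f"Releasing {name}")  # teardown, runs on exit (even on error)

with managed_resource("connection") as res:
    print(f"Using {res}")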



Continue to the next lesson on Algorithms and Complexity to apply these concepts to algorithm design!