Lesson 5: Advanced Machine Learning and NLP#
Master ensemble methods, modern NLP, and production ML techniques
Real-World Context#
This lesson covers the techniques used at top tech companies: ensemble methods power Kaggle winning solutions, transformers revolutionized NLP (GPT, BERT, ChatGPT), and hyperparameter optimization is crucial for production systems. You’ll learn the same approaches used at Google, OpenAI, and Meta.
What You’ll Learn#
Ensemble Learning: Bagging, boosting, stacking, and voting
Advanced Tree Methods: XGBoost, LightGBM, CatBoost
NLP Fundamentals: Tokenization, vectorization, TF-IDF
Word Embeddings: Word2Vec, GloVe, contextual embeddings
Modern NLP: Transformers, BERT, GPT architecture explained
Text Classification: Sentiment analysis, topic modeling
Hyperparameter Optimization: Grid search, random search, Bayesian optimization
Model Interpretability: SHAP, LIME, feature importance
Production ML: Deployment, monitoring, A/B testing
Prerequisites: Python, scikit-learn, basic ML knowledge
Time: 4-5 hours
Part 1: Ensemble Learning Fundamentals#
Why Ensemble Methods?#
“Wisdom of crowds”: Multiple weak models can create a strong model.
Example (illustrative numbers): a single decision tree might reach ~85% accuracy, while an ensemble of 100 trees reaches ~95%.
Three Main Approaches#
| Method | How It Works | Best For | Examples |
|---|---|---|---|
| Bagging | Train models on random subsets (parallel) | Reduce variance | Random Forest |
| Boosting | Train models sequentially, focus on errors | Reduce bias | XGBoost, AdaBoost |
| Stacking | Use model outputs as features for a meta-model | Maximum performance | Netflix Prize winner |
Bias-Variance Tradeoff#
Error = Bias² + Variance + Irreducible Error
High bias: Model too simple (underfit)
High variance: Model too complex (overfit)
Bagging: Reduces variance
Boosting: Reduces bias
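To see the variance claim in action, here is a small self-contained sketch (illustrative, not part of the main lesson code) that cross-validates a single decision tree against a bagged ensemble of the same tree; the bagged version typically shows a smaller spread across folds:

# Illustrative sketch: bagging reduces variance
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                 n_estimators=50, random_state=0)

for name, model in [("Single tree", single_tree), ("Bagged trees", bagged_trees)]:
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    # A smaller std across folds indicates lower variance
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")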
# Install required packages (uncomment if needed):
# !pip install scikit-learn xgboost lightgbm catboost nltk transformers torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import (
RandomForestClassifier,
GradientBoostingClassifier,
VotingClassifier,
BaggingClassifier,
AdaBoostClassifier
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import xgboost as xgb
# Set random seeds
np.random.seed(42)
print("📦 Libraries loaded successfully!")
# Generate classification dataset
X, y = make_classification(
n_samples=2000,
n_features=20,
n_informative=15,
n_redundant=5,
n_classes=2,
weights=[0.6, 0.4], # Imbalanced classes
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Dataset: {X_train.shape[0]} training, {X_test.shape[0]} test samples")
print(f"Features: {X_train.shape[1]}")
print(f"Class distribution: {np.bincount(y_train)}")
Part 2: Bagging Methods#
Random Forest#
Algorithm:
Bootstrap samples from training data (sample with replacement)
Train decision tree on each sample
At each split, consider random subset of features
Average predictions (regression) or vote (classification)
Key Parameters:
n_estimators: Number of trees (100-1000)
max_depth: Maximum tree depth (control overfitting)
max_features: Features to consider per split (√n for classification)
min_samples_split: Minimum samples to split a node
# Train Random Forest
print("🌲 Training Random Forest...\n")
rf_model = RandomForestClassifier(
n_estimators=100,
max_depth=10,
max_features='sqrt',
min_samples_split=5,
random_state=42,
n_jobs=-1 # Use all CPU cores
)
rf_model.fit(X_train, y_train)
# Predictions
rf_pred = rf_model.predict(X_test)
rf_acc = accuracy_score(y_test, rf_pred)
print(f"✅ Random Forest Accuracy: {rf_acc * 100:.2f}%\n")
# Feature importance
print("📊 Top 5 Important Features:")
feature_importance = sorted(
enumerate(rf_model.feature_importances_),
key=lambda x: x[1],
reverse=True
)[:5]
for idx, (feat, importance) in enumerate(feature_importance, 1):
    print(f"   {idx}. Feature {feat}: {importance:.4f}")
# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(range(10), sorted(rf_model.feature_importances_, reverse=True)[:10], color='steelblue')
plt.xlabel('Importance')
plt.ylabel('Feature Rank')
plt.title('Top 10 Feature Importances (Random Forest)', fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3, axis='x')
plt.show()
Part 3: Boosting Methods#
Gradient Boosting#
Algorithm:
Start with simple model (e.g., predicting mean)
Calculate residuals (errors)
Train new model to predict residuals
Add scaled prediction to ensemble
Repeat for N iterations
Mathematics:
F₀(x) = initial prediction
For m = 1 to M:
    rₘ = y - F_{m-1}(x)              # Calculate residuals
    hₘ = train_tree(rₘ)              # Fit to residuals
    Fₘ(x) = F_{m-1}(x) + α·hₘ(x)     # Update ensemble
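The update rule above can be implemented in a few lines. A toy from-scratch sketch for regression with squared error (illustrative only; real libraries add regularization, shrinkage schedules, and second-order information):

# Minimal gradient boosting by hand: repeatedly fit small trees to the residuals
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_gb = rng.uniform(-3, 3, size=(200, 1))
y_gb = np.sin(X_gb).ravel() + rng.normal(0, 0.1, 200)

learning_rate = 0.1                        # alpha in the formula above
F = np.full_like(y_gb, y_gb.mean())        # F0(x): start from the mean
trees = []

for m in range(50):
    residuals = y_gb - F                                             # r_m = y - F_{m-1}(x)
    tree = DecisionTreeRegressor(max_depth=2).fit(X_gb, residuals)   # h_m fit to residuals
    F += learning_rate * tree.predict(X_gb)                          # F_m = F_{m-1} + alpha * h_m
    trees.append(tree)

print(f"Final training MSE: {np.mean((y_gb - F) ** 2):.4f}")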
XGBoost vs LightGBM vs CatBoost#
| Library | Speed | Memory | GPU Support | Categorical Features | Use Case |
|---|---|---|---|---|---|
| XGBoost | Medium | Medium | Yes | Manual encoding | General purpose |
| LightGBM | Fast | Low | Yes | Native (integer-encoded) | Large datasets |
| CatBoost | Slow | Medium | Yes | Native support | Categorical data |
# Compare Boosting Methods
boosting_models = {
'GradientBoosting': GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
random_state=42
),
'XGBoost': xgb.XGBClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
random_state=42,
eval_metric='logloss'
),
'AdaBoost': AdaBoostClassifier(
n_estimators=100,
learning_rate=1.0,
random_state=42
)
}
results = {}
print("⚡ Comparing Boosting Algorithms...\n")
for name, model in boosting_models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    results[name] = acc
    print(f"   ✅ Accuracy: {acc * 100:.2f}%\n")
# Plot comparison
plt.figure(figsize=(10, 6))
plt.bar(results.keys(), [v * 100 for v in results.values()], color=['#1f77b4', '#ff7f0e', '#2ca02c'])
plt.ylabel('Accuracy (%)')
plt.title('Boosting Algorithms Comparison', fontsize=14, fontweight='bold')
plt.ylim([85, 100])
plt.grid(True, alpha=0.3, axis='y')
# Add value labels on bars
for i, (name, acc) in enumerate(results.items()):
    plt.text(i, acc * 100 + 0.5, f'{acc * 100:.2f}%', ha='center', fontweight='bold')
plt.show()
print("\n🏆 Best Model: XGBoost typically wins on structured data")
Part 4: Stacking and Voting#
Voting Classifier#
Hard Voting: Majority vote (classification)
Model A: Class 1, Model B: Class 0, Model C: Class 1 → Prediction: Class 1
Soft Voting: Average probabilities
Model A: [0.7, 0.3], Model B: [0.4, 0.6], Model C: [0.8, 0.2]
Average: [0.63, 0.37] → Prediction: Class 0
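To make the averaging step concrete, here is a tiny NumPy check of the numbers above (a sketch; a real soft-voting classifier does this internally):

# Soft voting by hand: average the per-model probabilities, pick the largest
import numpy as np

probs = np.array([
    [0.7, 0.3],   # Model A: [P(class 0), P(class 1)]
    [0.4, 0.6],   # Model B
    [0.8, 0.2],   # Model C
])
avg = probs.mean(axis=0)      # ≈ [0.63, 0.37]
prediction = avg.argmax()     # index of the highest average probability
print(avg.round(2), "→ Class", prediction)   # [0.63 0.37] → Class 0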
Stacking#
Level 0: Base models train on the data
Level 1: A meta-model trains on the base models' predictions (see the stacking sketch after the voting example below)
# Voting Ensemble
print("🗳️ Building Voting Ensemble...\n")
voting_model = VotingClassifier(
estimators=[
('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
('xgb', xgb.XGBClassifier(n_estimators=50, random_state=42, eval_metric='logloss')),
('lr', LogisticRegression(max_iter=1000, random_state=42))
],
voting='soft' # Use probability averaging
)
voting_model.fit(X_train, y_train)
voting_pred = voting_model.predict(X_test)
voting_acc = accuracy_score(y_test, voting_pred)
print("📊 Results Comparison:")
print(f" Random Forest: {rf_acc * 100:.2f}%")
print(f" XGBoost: {results['XGBoost'] * 100:.2f}%")
print(f" Voting Ensemble: {voting_acc * 100:.2f}%")
if voting_acc > max(rf_acc, results['XGBoost']):
    print("\n✅ Ensemble improved over individual models!")
else:
    print("\n⚠️ Individual model performed better (not always the case)")
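Stacking, described above, can be tried with scikit-learn's StackingClassifier. A minimal sketch that reuses the training data and imports from earlier cells; cv=5 means the meta-model is trained on out-of-fold base-model predictions:

# Stacking: base models feed their predictions to a logistic-regression meta-model
from sklearn.ensemble import StackingClassifier

stacking_model = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('xgb', xgb.XGBClassifier(n_estimators=50, random_state=42, eval_metric='logloss'))
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5  # out-of-fold predictions avoid leaking training labels to the meta-model
)
stacking_model.fit(X_train, y_train)
print(f"Stacking accuracy: {stacking_model.score(X_test, y_test) * 100:.2f}%")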
Part 5: Natural Language Processing Fundamentals#
Text Preprocessing Pipeline#
Tokenization: Split into words/sentences
Lowercasing: Normalize capitalization
Remove punctuation/special characters
Remove stop words: Common words (“the”, “is”, “and”)
Stemming/Lemmatization: Reduce to root form
Stemming: “running” → “run” (crude)
Lemmatization: “better” → “good” (linguistic)
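A quick way to see the difference is NLTK (a sketch; assumes nltk is installed and downloads the WordNet data on first run):

# Stemming vs. lemmatization with NLTK
import nltk
nltk.download('wordnet', quiet=True)   # data needed by WordNetLemmatizer
nltk.download('omw-1.4', quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                  # 'run'   (rule-based, crude)
print(stemmer.stem("studies"))                  # 'studi' (can produce non-words)
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'  (dictionary-based, needs POS tag)
print(lemmatizer.lemmatize("mice"))             # 'mouse'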
Text Vectorization#
| Method | Description | Pros | Cons |
|---|---|---|---|
| Bag of Words | Count word occurrences | Simple, fast | Ignores order, large sparse matrices |
| TF-IDF | Weight by importance | Reduces common word impact | Still sparse, no semantics |
| Word2Vec | Dense embeddings | Captures semantics | Requires large data |
| BERT | Contextual embeddings | SOTA, context-aware | Slow, requires GPUs |
# NLP Libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import re
# Sample movie reviews dataset
movie_reviews = [
"This movie was absolutely amazing! Best film I've ever seen.",
"Terrible movie, complete waste of time and money. Very disappointing.",
"Great acting and plot. Highly recommend this movie to everyone!",
"Boring and predictable. Would not watch again. Save your money.",
"Excellent cinematography and soundtrack. A true masterpiece!",
"Worst movie ever made. Don't waste your time on this garbage.",
"Absolutely loved it! The cast was perfect and story engaging.",
"Disappointing ending ruined the whole experience for me.",
"Brilliant direction and storytelling. Oscar-worthy performance!",
"Terrible acting and weak plot. Couldn't finish watching it."
]
sentiments = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] # 1=positive, 0=negative
def preprocess_text(text):
    """Clean and normalize text."""
    text = text.lower()                       # Lowercase
    text = re.sub(r'[^a-z\s]', '', text)      # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text
# Preprocess
processed_reviews = [preprocess_text(review) for review in movie_reviews]
print("📝 Original Review:")
print(f" {movie_reviews[0]}\n")
print("🔧 Preprocessed:")
print(f" {processed_reviews[0]}\n")
# TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=50, stop_words='english')
X_tfidf = tfidf.fit_transform(processed_reviews)
print(f"TF-IDF Matrix Shape: {X_tfidf.shape}")
print(f"Vocabulary Size: {len(tfidf.vocabulary_)}")
print(f"\nTop 15 Features: {list(tfidf.get_feature_names_out()[:15])}")
Understanding TF-IDF#
TF (Term Frequency):
TF(word) = (Number of times word appears in document) / (Total words in document)
IDF (Inverse Document Frequency):
IDF(word) = log(Total documents / Documents containing word)
TF-IDF Score:
TF-IDF = TF × IDF
Effect: Common words (“the”, “is”) get low scores, rare important words get high scores.
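Here is the formula worked by hand for one word in a toy corpus (a sketch; scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its numbers differ slightly):

# Hand-computing TF-IDF for the word "great" in the first toy document
import numpy as np

docs = [
    "the movie was great great great",
    "the movie was terrible",
    "a great experience",
]
word, doc = "great", docs[0].split()

tf = doc.count(word) / len(doc)             # 3 / 6 = 0.5
df = sum(word in d.split() for d in docs)   # appears in 2 of 3 documents
idf = np.log(len(docs) / df)                # log(3 / 2) ≈ 0.405
print(f"TF={tf:.3f}, IDF={idf:.3f}, TF-IDF={tf * idf:.3f}")   # ≈ 0.203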
# Visualize TF-IDF scores for a document
doc_idx = 0
feature_names = tfidf.get_feature_names_out()
doc_vector = X_tfidf[doc_idx].toarray()[0]
# Get top words
top_indices = doc_vector.argsort()[-10:][::-1]
top_words = [(feature_names[i], doc_vector[i]) for i in top_indices if doc_vector[i] > 0]
if top_words:
    words, scores = zip(*top_words)
    plt.figure(figsize=(10, 6))
    plt.barh(range(len(words)), scores, color='coral')
    plt.yticks(range(len(words)), words)
    plt.xlabel('TF-IDF Score')
    plt.title(f'Top Words in Review: "{movie_reviews[doc_idx][:50]}..."', fontweight='bold')
    plt.gca().invert_yaxis()
    plt.grid(True, alpha=0.3, axis='x')
    plt.tight_layout()
    plt.show()
print("\n📊 Observation: Important, rare words get higher scores")
Part 6: Sentiment Analysis with Real Data#
Let’s build a production-quality sentiment classifier.
# Expanded dataset (simulating real-world data)
positive_reviews = [
"Excellent product, highly recommended!",
"Love it! Best purchase I've ever made.",
"Amazing quality and very fast delivery.",
"Perfect! Exactly what I was looking for.",
"Outstanding customer service and great product.",
"Fantastic! Exceeded all my expectations.",
"Wonderful experience from start to finish.",
"Brilliant product, works like a charm!",
"Superb quality, definitely worth the price.",
"Absolutely satisfied with my purchase!"
]
negative_reviews = [
"Terrible quality, very disappointed with purchase.",
"Complete waste of money, do not buy this.",
"Poor customer service and defective product.",
"Not as described at all, requesting full refund.",
"Awful experience, would give zero stars if possible.",
"Broke within days, cheap materials used.",
"Horrible product, does not work as advertised.",
"Disappointed and frustrated with this purchase.",
"Very poor quality, fell apart immediately.",
"Worst purchase ever, complete rip-off!"
]
# Combine and create labels
all_reviews = positive_reviews + negative_reviews
labels = np.array([1] * len(positive_reviews) + [0] * len(negative_reviews))
print(f"📊 Dataset: {len(all_reviews)} reviews")
print(f" Positive: {sum(labels)}")
print(f" Negative: {len(labels) - sum(labels)}")
# Build sentiment analysis pipeline
sentiment_pipeline = Pipeline([
('vectorizer', TfidfVectorizer(max_features=100, stop_words='english')),
('classifier', MultinomialNB())
])
# Train
sentiment_pipeline.fit(all_reviews, labels)
# Test on new reviews
test_reviews = [
"This is absolutely fantastic and amazing!",
"Very poor quality and terrible customer service",
"Good value for money, happy with purchase",
"Disappointed with the product, not worth it",
"Excellent! Will definitely buy again!"
]
predictions = sentiment_pipeline.predict(test_reviews)
probabilities = sentiment_pipeline.predict_proba(test_reviews)
print("\n🎯 Sentiment Predictions:\n")
for review, pred, prob in zip(test_reviews, predictions, probabilities):
    sentiment = "😊 Positive" if pred == 1 else "😞 Negative"
    confidence = max(prob) * 100
    bar = '█' * int(confidence / 10)
    print(f"Review: \"{review}\"")
    print(f"   → {sentiment} ({confidence:.1f}% confident) {bar}\n")
Part 7: Word Embeddings#
From Sparse to Dense Representations#
Traditional (Sparse):
"king" → [0, 0, 1, 0, 0, ..., 0] (10,000 dimensions, mostly zeros)
"queen" → [0, 0, 0, 1, 0, ..., 0]
Embeddings (Dense):
"king" → [0.2, -0.5, 0.8, ..., 0.3] (300 dimensions, all meaningful)
"queen" → [0.1, -0.6, 0.7, ..., 0.4] (semantically similar)
Word2Vec Magic#
Famous equation:
king - man + woman ≈ queen
This works because embeddings capture semantic relationships!
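You can try the analogy yourself with pre-trained GloVe vectors through gensim's downloader (a sketch; the model name and the roughly 65 MB download are assumptions about what gensim currently hosts):

# Word analogy with pre-trained GloVe vectors via gensim
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # returns a KeyedVectors object
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
for word, score in result:
    print(f"{word}: {score:.3f}")   # 'queen' is typically the top match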
# Word embeddings with Keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.models import Sequential
# Prepare text data
tokenizer = Tokenizer(num_words=1000, oov_token='<OOV>')
tokenizer.fit_on_texts(all_reviews)
# Convert to sequences
sequences = tokenizer.texts_to_sequences(all_reviews)
padded_sequences = pad_sequences(sequences, maxlen=20, padding='post', truncating='post')
print(f"Vocabulary size: {len(tokenizer.word_index)}")
print(f"Padded sequence shape: {padded_sequences.shape}")
print(f"\nExample:")
print(f" Original: '{all_reviews[0]}'")
print(f" Tokenized: {sequences[0]}")
print(f" Padded: {padded_sequences[0]}")
# Build LSTM model with embeddings
embedding_dim = 32
vocab_size = min(1000, len(tokenizer.word_index) + 1)
lstm_model = Sequential([
# Embedding layer: Converts word indices to dense vectors
Embedding(vocab_size, embedding_dim, input_length=20, name='embedding'),
# Bidirectional LSTM: Processes text forward and backward
Bidirectional(LSTM(64, return_sequences=True), name='bi_lstm_1'),
Dropout(0.3),
# Second LSTM layer
Bidirectional(LSTM(32), name='bi_lstm_2'),
Dropout(0.3),
# Dense layers
Dense(64, activation='relu', name='dense_1'),
Dropout(0.3),
Dense(1, activation='sigmoid', name='output')
], name='sentiment_lstm')
lstm_model.compile(
optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy']
)
lstm_model.summary()
print(f"\n📊 Total parameters: {lstm_model.count_params():,}")
print(f"📊 Embedding matrix shape: (vocab={vocab_size}, dim={embedding_dim})")
Understanding LSTM#
Why LSTM for text?
Captures long-range dependencies
Handles variable-length sequences
Understands context
LSTM Cell:
Forget gate: What to forget from memory
Input gate: What new information to store
Output gate: What to output
Bidirectional: Reads text forward and backward for better context understanding
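The model above is compiled but never trained. A minimal training sketch on the 20-review dataset (illustrative only; this dataset is far too small for an LSTM to learn anything reliable, so expect noisy results):

# Train the LSTM defined above with early stopping
from tensorflow.keras.callbacks import EarlyStopping

history = lstm_model.fit(
    padded_sequences, labels,
    validation_split=0.2,
    epochs=20,
    batch_size=4,
    callbacks=[EarlyStopping(patience=3, restore_best_weights=True)],
    verbose=0
)
print(f"Final training accuracy: {history.history['accuracy'][-1]:.2f}")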
Part 8: Introduction to Transformers#
The Revolution: Attention is All You Need (2017)#
Before Transformers:
RNNs/LSTMs: Sequential processing (slow)
Limited context window
Hard to parallelize
Transformers:
Parallel processing (fast!)
Whole-sequence context: every token can attend to every other token (within the model's context window)
Attention mechanism: Focus on relevant words
Architecture Overview#
Input Text → Tokenization → Embeddings → Positional Encoding
↓
Multi-Head Self-Attention
↓
Feed-Forward Network
↓
Repeat N times
↓
Output
Self-Attention Simplified#
Example: “The animal didn’t cross the street because it was too tired.”
Question: What does “it” refer to?
Attention weights:
“it” → “animal”: 0.8 ✅
“it” → “street”: 0.1
“it” → “cross”: 0.1
The model learns to focus on “animal”!
Major Transformer Models#
| Model | Type | Parameters | Training Data | Use Case |
|---|---|---|---|---|
| BERT | Encoder-only | 110M - 340M | Books + Wikipedia | Classification, NER, QA |
| GPT-3/4 | Decoder-only | 175B+ | Internet text | Text generation, ChatGPT |
| T5 | Encoder-Decoder | 60M - 11B | C4 dataset | Translation, summarization |
| RoBERTa | Encoder-only | 125M - 355M | Improved BERT data | Drop-in BERT replacement (stronger benchmarks) |
| BART | Encoder-Decoder | 140M - 400M | Denoising objective (corrupted text) | Summarization |
BERT vs GPT#
BERT (Bidirectional):
Reads text in both directions
Great for understanding (classification, QA)
Pre-training: Masked Language Modeling
Input: “The [MASK] sat on the mat”
Task: Predict [MASK] = “cat”
GPT (Autoregressive):
Reads text left-to-right only
Great for generation (writing, chat)
Pre-training: Next Token Prediction
Input: “The cat sat on the”
Task: Predict next word = “mat”
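Before working through the attention math below, note that pre-trained transformers are easy to use via the Hugging Face transformers library (installed in the first cell). A sketch; the default checkpoint, a DistilBERT fine-tuned on SST-2, downloads on first run:

# Sentiment analysis with a pre-trained transformer in two lines
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
print(sentiment("This lesson on transformers is fantastic!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.999...}]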
# Simplified Attention Mechanism (educational)
def simple_attention(query, keys, values):
    """
    Simplified attention mechanism.

    Args:
        query: What we're looking for
        keys: What's available
        values: Actual information

    Returns:
        Weighted combination of values
    """
    # Calculate attention scores
    scores = np.dot(query, keys.T)  # Dot product
    # Apply softmax to get weights (sum to 1)
    weights = np.exp(scores) / np.sum(np.exp(scores))
    # Weighted sum of values
    output = np.dot(weights, values)
    return output, weights
# Example: Sentence "I love NLP"
# Simple word embeddings (normally 300-768 dimensions)
embeddings = np.array([
[0.1, 0.2, 0.3], # "I"
[0.4, 0.5, 0.6], # "love"
[0.7, 0.8, 0.9], # "NLP"
])
# Calculate attention for word "love" (index 1)
query = embeddings[1]
keys = embeddings
values = embeddings
output, weights = simple_attention(query, keys, values)
print("🔍 Attention Mechanism Demo\n")
print("Query word: 'love'")
print(f"\nAttention weights:")
words = ["I", "love", "NLP"]
for word, weight in zip(words, weights):
    bar = '█' * int(weight * 50)
    print(f"   {word:6s}: {weight:.3f} {bar}")
print(f"\nContext-aware representation: {output}")
print("\n💡 The model learned which words to focus on!")
Part 9: Hyperparameter Optimization#
Strategies#
| Method | How It Works | Speed | Effectiveness |
|---|---|---|---|
| Manual | Try values by hand | Slow | Poor |
| Grid Search | Try all combinations | Very slow | Good |
| Random Search | Random sampling | Medium | Good |
| Bayesian Optimization | Smart sampling | Fast | Best |
| Hyperband | Adaptive allocation | Fast | Best |
Search Space Example#
param_grid = {
'n_estimators': [50, 100, 200], # 3 options
'max_depth': [5, 10, 15, None], # 4 options
'learning_rate': [0.01, 0.1, 0.3] # 3 options
}
# Grid Search: 3 × 4 × 3 = 36 combinations!
# Random Search: Sample N combinations (e.g., 10)
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import uniform, randint
# Define parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
print("🔍 Hyperparameter Tuning Comparison\n")
# Random Search (faster)
print("⚡ Running Random Search...")
rf_random = RandomForestClassifier(random_state=42, n_jobs=-1)
random_search = RandomizedSearchCV(
rf_random,
param_distributions=param_grid,
n_iter=10, # Try 10 random combinations
cv=3, # 3-fold cross-validation
scoring='accuracy',
random_state=42,
n_jobs=-1
)
random_search.fit(X_train, y_train)
print(f"✅ Random Search Complete")
print(f" Best Score: {random_search.best_score_ * 100:.2f}%")
print(f" Best Params: {random_search.best_params_}")
# Test on holdout set
best_model = random_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(f" Test Accuracy: {test_score * 100:.2f}%")
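The strategy table earlier rates Bayesian-style optimization highest. A hedged sketch using Optuna (an assumption: it is not installed by the first cell, so pip install optuna first), tuning the same Random Forest search space:

# Bayesian-style hyperparameter search with Optuna
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 10),
    }
    model = RandomForestClassifier(random_state=42, n_jobs=-1, **params)
    return cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
print(f"Best score: {study.best_value * 100:.2f}%")
print(f"Best params: {study.best_params}")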
Part 10: Model Interpretation with SHAP#
Why Interpretability Matters#
Trust: Understand why model makes predictions
Debugging: Find biases and errors
Compliance: Regulations (GDPR, healthcare)
Science: Discover insights
SHAP (SHapley Additive exPlanations)#
Idea: How much does each feature contribute to prediction?
Example:
Base prediction: 0.5
Feature 'price': +0.2 (pushes toward positive)
Feature 'rating': +0.15 (pushes toward positive)
Feature 'shipping': -0.05 (pushes toward negative)
Final prediction: 0.8 (positive)
# Model Interpretation (Feature Importance)
print("📊 Feature Importance Analysis\n")
# Get feature importances from best model
importances = best_model.feature_importances_
indices = np.argsort(importances)[::-1][:10] # Top 10
# Plot
plt.figure(figsize=(10, 6))
plt.title('Top 10 Feature Importances', fontsize=14, fontweight='bold')
plt.bar(range(10), importances[indices], color='steelblue')
plt.xticks(range(10), [f'Feature {i}' for i in indices], rotation=45)
plt.ylabel('Importance')
plt.xlabel('Feature')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
print("Top 5 Features:")
for rank, idx in enumerate(indices[:5], 1):
    print(f"   {rank}. Feature {idx}: {importances[idx]:.4f}")
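The plot above relies on impurity-based importances. SHAP, introduced at the start of this part, attributes each individual prediction to its features. A sketch, assuming the shap package is installed (pip install shap); note that the shape of shap_values varies slightly across shap versions:

# Per-prediction feature attributions with SHAP (tree ensembles)
import shap

explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test[:100])   # per-sample feature contributions
# Older shap versions return a list (one array per class); newer ones a 3D array
if isinstance(shap_values, list):
    shap_values = shap_values[1]            # keep the positive class
elif shap_values.ndim == 3:
    shap_values = shap_values[:, :, 1]
shap.summary_plot(shap_values, X_test[:100])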
Part 11: Production ML Considerations#
Deployment Checklist#
✅ Model Serialization: Save with pickle/joblib
✅ API Endpoint: Flask/FastAPI wrapper (see the sketch after this checklist)
✅ Input Validation: Check data types, ranges
✅ Monitoring: Track accuracy, latency, errors
✅ Versioning: Model version control
✅ A/B Testing: Compare models in production
✅ Rollback Plan: Revert to previous version
✅ Documentation: API docs, model card
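As a concrete example of the API-endpoint and input-validation items, here is a minimal FastAPI sketch (file layout, endpoint name, and feature count are illustrative; it assumes the model file saved later in this part). Run it with: uvicorn app:app --reload

# app.py -- minimal prediction service around the saved Random Forest
from typing import List

import joblib
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
model = joblib.load('production_models/best_rf_model.pkl')   # saved below

class PredictRequest(BaseModel):
    features: List[float]   # the 20 numeric features the model expects

@app.post("/predict")
def predict(req: PredictRequest):
    if len(req.features) != 20:   # basic input validation
        raise HTTPException(status_code=422, detail="Expected 20 features")
    proba = model.predict_proba([req.features])[0]
    return {"prediction": int(proba.argmax()), "confidence": float(proba.max())}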
Model Monitoring#
Key Metrics:
Prediction latency (p50, p95, p99)
Accuracy over time
Data drift detection
Error rate
Traffic volume
# Save model for production
import joblib
import os
# Create models directory
os.makedirs('production_models', exist_ok=True)
# Save best model
model_path = 'production_models/best_rf_model.pkl'
joblib.dump(best_model, model_path)
print(f"✅ Model saved to {model_path}")
# Save preprocessing pipeline
pipeline_path = 'production_models/sentiment_pipeline.pkl'
joblib.dump(sentiment_pipeline, pipeline_path)
print(f"✅ Sentiment pipeline saved to {pipeline_path}")
# File sizes
model_size = os.path.getsize(model_path) / 1024 # KB
pipeline_size = os.path.getsize(pipeline_path) / 1024
print(f"\n📦 Model sizes:")
print(f" RF Model: {model_size:.1f} KB")
print(f" Sentiment Pipeline: {pipeline_size:.1f} KB")
# Load and use model (production simulation)
loaded_model = joblib.load(model_path)
# Make prediction
sample_data = X_test[:5]
predictions = loaded_model.predict(sample_data)
probabilities = loaded_model.predict_proba(sample_data)
print("🚀 Production Model Inference:\n")
for i, (pred, prob) in enumerate(zip(predictions, probabilities)):
    confidence = max(prob) * 100
    print(f"Sample {i+1}: Class {pred} (confidence: {confidence:.1f}%)")
print("\n✅ Model loaded and working correctly!")
Part 12: Exercises#
Exercise 1: Ensemble Comparison (⭐⭐)#
Compare ensemble methods on a real dataset:
Load the Iris or Wine dataset
Train 5 different models:
Single Decision Tree
Random Forest
XGBoost
AdaBoost
Voting Ensemble (combine RF, XGB, LogReg)
Use cross-validation for all
Plot comparison bar chart
Identify which performs best
# Exercise 1: Your code here
# Hint: from sklearn.datasets import load_wine
# Hint: Use cross_val_score for each model
Exercise 2: Advanced Text Classification (⭐⭐⭐)#
Build a multi-class text classifier:
Create/find a dataset with 3+ categories (news topics, product categories, etc.)
Implement preprocessing: lowercasing, stop words, stemming
Compare vectorization methods:
Bag of Words (CountVectorizer)
TF-IDF
Word embeddings + LSTM
Train classifiers on each
Plot confusion matrix for best model
Identify misclassified examples
# Exercise 2: Your code here
# Hint: Use 20 Newsgroups dataset: fetch_20newsgroups()
# Hint: from nltk.stem import PorterStemmer
Exercise 3: Hyperparameter Tuning (⭐⭐)#
Optimize XGBoost with different search strategies:
Define comprehensive parameter grid
Run Grid Search (small grid)
Run Random Search (larger space, 20 iterations)
Compare time taken and best scores
Plot search history (score vs iteration)
# Exercise 3: Your code here
# Hint: Track time with time.time()
# Hint: Access cv_results_ for search history
Exercise 4: LSTM Sentiment Analysis (⭐⭐⭐)#
Build LSTM model from scratch:
Collect larger sentiment dataset (IMDB or custom)
Implement full preprocessing pipeline
Build LSTM with:
Embedding layer (trainable)
2 LSTM layers
Dropout regularization
Dense output
Train with early stopping
Plot training curves
Compare with traditional ML (Naive Bayes, SVM)
# Exercise 4: Your code here
# Hint: keras.datasets.imdb.load_data()
# Hint: Use callbacks=[EarlyStopping()]
Exercise 5: Model Interpretability (⭐⭐⭐⭐)#
Analyze model decisions:
Train Random Forest or XGBoost
Extract feature importances
Visualize top 20 features
For text model: Find most important words per class
Bonus: Implement LIME for local explanations
Compare global vs local importance
# Exercise 5: Your code here
# Hint: !pip install lime
# Hint: from lime.lime_text import LimeTextExplainer
Exercise 6: Production API (⭐⭐⭐)#
Build deployment-ready prediction service:
Save trained model with joblib
Create Flask/FastAPI endpoint
Implement input validation
Add error handling
Return predictions with confidence scores
Bonus: Add simple logging
Test with curl/Postman
# Exercise 6: Your code here
# Hint: from flask import Flask, request, jsonify
# Hint: Validate input before prediction
Self-Check Quiz#
Test your understanding:
What is the main difference between bagging and boosting?
A) Bagging is faster
B) Bagging trains models in parallel, boosting sequentially
C) Boosting uses fewer models
D) No difference
Which boosting algorithm is fastest for large datasets?
A) AdaBoost
B) Gradient Boosting
C) LightGBM
D) Random Forest
What does TF-IDF stand for?
A) Text Frequency - Important Document Frequency
B) Term Frequency - Inverse Document Frequency
C) Total Frequency - Indexed Document Frequency
D) Token Frequency - Information Document Frequency
What is the main advantage of word embeddings over TF-IDF?
A) Faster computation
B) Captures semantic relationships
C) Requires less data
D) Always performs better
What is the key innovation of transformers?
A) Recurrent connections
B) Convolutional layers
C) Self-attention mechanism
D) Dropout regularization
Which model is best for text generation?
A) BERT
B) GPT
C) Random Forest
D) Naive Bayes
What is the purpose of hyperparameter tuning?
A) Train model faster
B) Reduce overfitting only
C) Find optimal model configuration
D) Increase model size
Which search method is most efficient?
A) Manual
B) Grid Search
C) Random Search
D) Bayesian Optimization
Why is model interpretability important?
A) Makes models faster
B) Increases accuracy
C) Builds trust and enables debugging
D) Reduces model size
What should you monitor in production?
A) Accuracy only
B) Latency only
C) Accuracy, latency, and data drift
D) Nothing after deployment
Answers: 1-B, 2-C, 3-B, 4-B, 5-C, 6-B, 7-C, 8-D, 9-C, 10-C
Key Takeaways#
Ensemble Learning#
✅ Ensemble methods combine multiple models for better performance
✅ Bagging reduces variance (Random Forest)
✅ Boosting reduces bias (XGBoost, LightGBM)
✅ Gradient-boosted trees (XGBoost, LightGBM) dominate Kaggle leaderboards for tabular data
NLP Fundamentals#
✅ Text preprocessing is critical (tokenization, cleaning, normalization)
✅ TF-IDF weights words by importance
✅ Word embeddings capture semantic meaning
✅ LSTMs handle sequential text data
Modern NLP#
✅ Transformers revolutionized NLP (parallel processing)
✅ Self-attention focuses on relevant context
✅ BERT for understanding, GPT for generation
✅ Pre-trained models save time and data
Production ML#
✅ Hyperparameter tuning improves performance
✅ Model interpretability builds trust
✅ Monitor models in production
✅ Version control models like code
Pro Tips#
Start with XGBoost: It works well out-of-the-box for structured data
Always try ensembles: Combining models rarely hurts
Text preprocessing matters: Clean data = better results
Use pre-trained embeddings: Don’t train Word2Vec from scratch with small data
Fine-tune transformers: BERT/GPT transfer learning beats training from scratch
Random search before grid: Sample 10-20 random configs, then refine
Cross-validation always: Never trust single train/test split
Monitor data drift: Production data changes over time
Start simple: Logistic Regression baseline, then add complexity
Document everything: Model cards, API docs, decision rationale
Common Mistakes#
❌ Skipping text preprocessing
❌ Not removing stop words for classification
❌ Training Word2Vec on small datasets
❌ Using BERT for generation (use GPT)
❌ Grid searching huge spaces
❌ Not validating production inputs
❌ Ignoring class imbalance
❌ Overfitting to validation set
Debug Checklist#
⚠️ Low accuracy → Check preprocessing, try ensemble
⚠️ High variance → Add regularization, more data
⚠️ High bias → More complex model, better features
⚠️ Slow inference → Reduce model size, optimize code
⚠️ Production accuracy drops → Data drift, retrain model
What’s Next?#
Continue in Hard Track:#
Lesson 6: Computer Systems and Theory
Lesson 7: Project Ideas (apply everything!)
Lesson 8: Classic Problems (interview preparation)
Deepen Your NLP Knowledge:#
Hugging Face Transformers: Pre-trained models library
fast.ai NLP: Practical deep learning for text
spaCy: Industrial-strength NLP
Papers With Code: Latest NLP research
Practice Projects:#
Sentiment analysis on Twitter data
News article classifier (Reuters, 20 Newsgroups)
Chatbot with GPT-2/3
Named Entity Recognition (NER)
Text summarization with T5
Deploy ML API with FastAPI
Resources:#
Books: “Speech and Language Processing” (Jurafsky & Martin)
Courses: Stanford CS224N (NLP with Deep Learning)
Competitions: Kaggle NLP challenges
Tools: Weights & Biases (experiment tracking)
Congratulations! You now understand advanced ML and modern NLP. You can build ensemble models, process text, and deploy production systems! 🚀