Lesson 4: Machine Learning Basics
Introduction
Machine Learning is teaching computers to learn from data rather than explicitly programming every rule. It's like teaching a child to recognize animals by showing them pictures instead of describing every detail.
Why Machine Learning Matters:
Pattern Recognition: Find patterns humans can't easily see
Predictions: Forecast future events based on historical data
Automation: Make decisions without constant human intervention
Scale: Process vast amounts of data quickly
What You'll Learn:
ML fundamentals and terminology
Supervised learning (classification & regression)
Unsupervised learning (clustering)
Data preprocessing and feature engineering
Model evaluation and validation
Cross-validation and hyperparameter tuning
Avoiding overfitting and underfitting
Real-world ML workflows
Prerequisites: Install the required packages: pip install scikit-learn numpy pandas matplotlib
1. Machine Learning Fundamentals
Key Concepts
Features (X): Input variables used for prediction (like house size, location)
Labels/Target (y): Output variable we want to predict (like house price)
Training: Learning patterns from data
Testing: Evaluating how well the model learned
Model: Mathematical representation of patterns learned from data
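The terms above map onto scikit-learn's fit/predict pattern. The following is a minimal sketch; the house sizes, bedroom counts, and prices are made-up numbers chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Features (X): one row per house, columns = [size_sqft, bedrooms]
# (illustrative values, not real data)
X = np.array([
    [1000, 2],
    [1500, 3],
    [2000, 3],
    [2500, 4],
])

# Labels/target (y): price in thousands of dollars, one value per row of X
y = np.array([150, 210, 280, 340])

# Model: the object that will hold the learned pattern
model = LinearRegression()

# Training: learn the relationship between X and y
model.fit(X, y)

# Prediction: apply the learned pattern to a house the model has not seen
new_house = np.array([[1800, 3]])
predicted_price = model.predict(new_house)

print(f"X shape: {X.shape} (samples, features)")
print(f"y shape: {y.shape} (one label per sample)")
print(f"Predicted price: ${predicted_price[0]:.0f}k")
```

Testing, the remaining term in the list, means scoring the model on rows it did not see during training.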
Types of Machine Learning
1. Supervised Learning (Learning with a teacher)
Have labeled examples: input → known output
Goal: Learn to predict outputs for new inputs
Examples: Email spam detection, image recognition, price prediction
2. Unsupervised Learning (Learning without labels)
Have only inputs, no labels
Goal: Find hidden patterns or structure
Examples: Customer segmentation, anomaly detection
3. Reinforcement Learning (Learning through trial and error)
Learn by interacting with environment
Get rewards or penalties for actions
Examples: Game AI, robotics, recommendation systems
2. Your First ML Model: Classification
Classification assigns inputs to categories. Let's predict flower species!
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
import pandas as pd
# Load the famous iris dataset
iris = load_iris()
X = iris.data # Features: sepal/petal measurements
y = iris.target # Labels: flower species (0, 1, or 2)
print("Dataset Information:")
print(f"Number of samples: {len(X)}")
print(f"Number of features: {X.shape[1]}")
print(f"\nFeature names: {iris.feature_names}")
print(f"Species: {iris.target_names}")
print(f"\nFirst 5 samples:")
print(pd.DataFrame(X[:5], columns=iris.feature_names))
print(f"\nTheir species: {[iris.target_names[i] for i in y[:5]]}")
Train-Test Split
Critical concept: Never test on training data!
Training set: Data the model learns from (typically 70-80%)
Test set: Unseen data to evaluate performance (20-30%)
# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing
    random_state=42   # For reproducibility
)
print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"\nClass distribution in training:")
unique, counts = np.unique(y_train, return_counts=True)
for species_id, count in zip(unique, counts):
    print(f" {iris.target_names[species_id]}: {count}")
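A plain random split can leave the classes slightly unevenly represented in the two subsets. As a sketch of one common remedy, the stratify argument of train_test_split keeps the class proportions of y in both the training and the test set:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# stratify=y asks for the same class proportions in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count how many samples of each species ended up in each subset
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
for species_id, name in enumerate(iris.target_names):
    print(f"{name:12} train: {train_counts[species_id]:3d}  test: {test_counts[species_id]:3d}")
```

Because iris has 50 samples per species, an 80/20 stratified split puts 40 of each species in training and 10 of each in testing.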
Training the Model
# Create a Decision Tree classifier
model = DecisionTreeClassifier(
    max_depth=3,  # Limit tree depth to avoid overfitting
    random_state=42
)
# Train the model
model.fit(X_train, y_train)
print("Model trained successfully!")
print(f"Tree depth: {model.get_depth()}")
print(f"Number of leaves: {model.get_n_leaves()}")
Making Predictions
# Predict on test set
y_pred = model.predict(X_test)
# Show predictions vs actual
print("Predictions vs Actual (first 10):")
for i in range(10):
    pred_species = iris.target_names[y_pred[i]]
    actual_species = iris.target_names[y_test[i]]
    correct = "✓" if y_pred[i] == y_test[i] else "✗"
    print(f"{correct} Predicted: {pred_species:15} Actual: {actual_species}")
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy * 100:.2f}%")
Detailed Evaluation
from sklearn.metrics import confusion_matrix, classification_report
# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(pd.DataFrame(
    cm,
    index=[f"Actual {name}" for name in iris.target_names],
    columns=[f"Pred {name}" for name in iris.target_names]
))
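The precision and recall columns of the classification report can be derived from the confusion matrix. The sketch below recomputes them by hand for the same kind of iris split and checks them against scikit-learn's own functions; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

true_positives = np.diag(cm)        # correct predictions per class
predicted_totals = cm.sum(axis=0)   # how often each class was predicted
actual_totals = cm.sum(axis=1)      # how often each class really occurs

# np.maximum(..., 1) avoids dividing by zero for a class with no samples
precision = true_positives / np.maximum(predicted_totals, 1)
recall = true_positives / np.maximum(actual_totals, 1)

for name, p, r in zip(iris.target_names, precision, recall):
    print(f"{name:12} precision: {p:.2f}  recall: {r:.2f}")

# Cross-check against scikit-learn's per-class metrics
sk_precision = precision_score(y_test, y_pred, average=None, zero_division=0)
sk_recall = recall_score(y_test, y_pred, average=None, zero_division=0)
```

Precision answers "of the samples predicted as this class, how many were right?", while recall answers "of the samples that really are this class, how many were found?".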
3. Regression: Predicting Continuous Values
Regression predicts numeric values instead of categories.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Simple dataset: house size -> price
house_sizes = np.array([1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800, 3000, 3500]).reshape(-1, 1)
prices = np.array([150, 180, 210, 240, 280, 300, 340, 380, 420, 500]) # in thousands
# Split data
X_train, X_test, y_train, y_test = train_test_split(
house_sizes, prices, test_size=0.3, random_state=42
)
# Train model
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)
# Make predictions
y_pred = reg_model.predict(X_test)
print("House Price Predictions:")
for size, actual, pred in zip(X_test.flatten(), y_test, y_pred):
    error = abs(actual - pred)
    print(f"Size: {size:4.0f} sq ft | Actual: ${actual:3.0f}k | Predicted: ${pred:3.0f}k | Error: ${error:.1f}k")
# Evaluation metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"\nMean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f} (1.0 is perfect)")
# Model coefficients
print(f"\nModel equation: Price = {reg_model.intercept_:.2f} + {reg_model.coef_[0]:.4f} × Size")
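To make the two regression metrics less of a black box, here is a small sketch that computes MSE and R² by hand and compares them with scikit-learn. The three actual and predicted prices are invented for the arithmetic and are not output from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values only (in thousands of dollars)
y_true = np.array([200.0, 300.0, 400.0])
y_hat = np.array([210.0, 290.0, 420.0])

residuals = y_true - y_hat                       # [-10, 10, -20]

# MSE: average of the squared residuals
mse_manual = np.mean(residuals ** 2)             # (100 + 100 + 400) / 3 = 200

# RMSE: square root of MSE, in the same units as the target
rmse_manual = np.sqrt(mse_manual)

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)                  # 600
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 20000
r2_manual = 1 - ss_res / ss_tot                  # 0.97

print(f"MSE:  {mse_manual:.2f} (sklearn: {mean_squared_error(y_true, y_hat):.2f})")
print(f"RMSE: {rmse_manual:.2f}")
print(f"R²:   {r2_manual:.2f} (sklearn: {r2_score(y_true, y_hat):.2f})")
```

MSE is expressed in squared units (here, thousands of dollars squared), so RMSE is often easier to interpret alongside the predictions.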
4. Multiple Algorithm Comparison
Try different algorithms to find the best one for your data.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
# Load iris data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
# Dictionary of models to compare
models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Support Vector Machine': SVC(random_state=42),
    'Naive Bayes': GaussianNB()
}
# Train and evaluate each model
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    results[name] = accuracy
    print(f"{name:25} Accuracy: {accuracy * 100:.2f}%")
# Find best model
best_model = max(results, key=results.get)
print(f"\nBest model: {best_model} with {results[best_model] * 100:.2f}% accuracy")
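5. Feature Engineering
Creating new features can improve model performance when they expose relationships that the raw features do not show directly. The gain is not guaranteed, so always compare against the original features, as the code below does.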
# Load iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Original features
print("Original features:")
print(iris.feature_names)
# Create new features
# Feature engineering: combine existing features
sepal_area = X[:, 0] * X[:, 1] # sepal length * sepal width
petal_area = X[:, 2] * X[:, 3] # petal length * petal width
sepal_to_petal_ratio = (X[:, 0] + X[:, 1]) / (X[:, 2] + X[:, 3])
# Combine original and new features
X_engineered = np.column_stack([X, sepal_area, petal_area, sepal_to_petal_ratio])
print(f"\nOriginal features: {X.shape[1]}")
print(f"Engineered features: {X_engineered.shape[1]}")
# Compare models with and without feature engineering
for features, name in [(X, "Original"), (X_engineered, "Engineered")]:
    X_train, X_test, y_train, y_test = train_test_split(
        features, y, test_size=0.3, random_state=42
    )
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    print(f"{name:12} features accuracy: {accuracy * 100:.2f}%")
6. Data Preprocessing
Handling Missing Values
from sklearn.impute import SimpleImputer
# Create data with missing values
X_with_missing = np.array([
    [1, 2],
    [np.nan, 3],
    [7, 6],
    [5, np.nan]
])
print("Data with missing values:")
print(X_with_missing)
# Impute missing values with mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X_with_missing)
print("\nAfter imputation (mean):")
print(X_imputed)
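In a real workflow the imputer should learn its fill values from the training data only and then reuse them on new data. The sketch below shows that pattern with the median strategy; X_new is a hypothetical batch of unseen rows introduced for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Training data (same values as above)
X_train = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [7.0, 6.0],
    [5.0, np.nan],
])

# Hypothetical unseen rows that also contain gaps
X_new = np.array([
    [np.nan, 4.0],
    [2.0, np.nan],
])

# fit() learns one fill value per column from the training data only
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)
print("Learned medians per column:", imputer.statistics_)

# transform() reuses those training medians; it never looks at X_new's own statistics
X_new_imputed = imputer.transform(X_new)
print("New data after imputation:")
print(X_new_imputed)
```

The median is less sensitive to outliers than the mean, which can matter when a column contains a few extreme values.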
Feature Scaling
Many algorithms perform better when features are on similar scales.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Sample data with different scales
data = np.array([
    [1000, 2],  # House: size in sq ft, bedrooms
    [1500, 3],
    [2000, 4]
])
print("Original data:")
print(data)
# Standard scaling (mean=0, std=1)
scaler_std = StandardScaler()
data_standardized = scaler_std.fit_transform(data)
print("\nStandardized (mean=0, std=1):")
print(data_standardized)
# Min-Max scaling (range 0-1)
scaler_minmax = MinMaxScaler()
data_normalized = scaler_minmax.fit_transform(data)
print("\nNormalized (range 0-1):")
print(data_normalized)
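Both scalers apply a simple per-column formula. As a sketch, the NumPy expressions below reproduce what StandardScaler and MinMaxScaler compute for the same three houses:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([
    [1000.0, 2.0],
    [1500.0, 3.0],
    [2000.0, 4.0],
])

# Standardization: z = (x - column mean) / column std
# StandardScaler uses the population standard deviation (ddof=0), NumPy's default
z_manual = (data - data.mean(axis=0)) / data.std(axis=0)

# Min-max scaling: (x - column min) / (column max - column min)
col_min = data.min(axis=0)
col_max = data.max(axis=0)
minmax_manual = (data - col_min) / (col_max - col_min)

print("Manual standardization:")
print(z_manual)
print("Manual min-max scaling:")
print(minmax_manual)

# Compare with scikit-learn's results
z_sklearn = StandardScaler().fit_transform(data)
minmax_sklearn = MinMaxScaler().fit_transform(data)
```

After either transformation, the square-footage column no longer dominates distance calculations simply because its raw numbers are larger.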
7. Cross-Validation
Get more reliable performance estimates by testing on multiple splits.
from sklearn.model_selection import cross_val_score, KFold
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Create model
model = DecisionTreeClassifier(random_state=42)
# 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)
print("Cross-Validation Scores (5 folds):")
for i, score in enumerate(cv_scores, 1):
    print(f" Fold {i}: {score * 100:.2f}%")
print(f"\nMean accuracy: {cv_scores.mean() * 100:.2f}%")
print(f"Std deviation: {cv_scores.std() * 100:.2f}%")
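To see what cross_val_score does internally, the loop below is a rough hand-written equivalent. It assumes the default behavior for classifiers, where an integer cv is turned into stratified folds without shuffling, so it uses StratifiedKFold directly.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=42)

# Split the data into 5 folds that each keep the class proportions
skf = StratifiedKFold(n_splits=5)

manual_scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), 1):
    # Start from a fresh, unfitted copy of the model for every fold
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])

    # Score on the one fold that was held out of training
    score = fold_model.score(X[val_idx], y[val_idx])
    manual_scores.append(score)
    print(f" Fold {fold}: trained on {len(train_idx)}, validated on {len(val_idx)}, accuracy {score * 100:.2f}%")

manual_scores = np.array(manual_scores)
print(f"\nMean accuracy: {manual_scores.mean() * 100:.2f}%")

# The helper function should produce the same per-fold scores
helper_scores = cross_val_score(model, X, y, cv=5)
```

Every sample is used for validation exactly once, which is why the mean of the fold scores is a steadier estimate than a single train-test split.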
8. Hyperparameter Tuning
Find the best settings for your model.
from sklearn.model_selection import GridSearchCV
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Define parameter grid to search
param_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Create base model
dt = DecisionTreeClassifier(random_state=42)
# Grid search with cross-validation
grid_search = GridSearchCV(
    dt, param_grid, cv=5, scoring='accuracy', verbose=1
)
grid_search.fit(X, y)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_ * 100:.2f}%")
# Use best model
best_model = grid_search.best_estimator_
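Grid search tries every combination in the grid, so the cost grows multiplicatively with each added parameter. A quick sketch using ParameterGrid shows how many models the search above has to train:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Expand the grid into the explicit list of candidate settings
combinations = list(ParameterGrid(param_grid))
n_folds = 5

# Each candidate is trained once per cross-validation fold
total_fits = len(combinations) * n_folds

print(f"Candidate settings: {len(combinations)}")   # 5 * 3 * 3 = 45
print(f"Model fits with {n_folds}-fold CV: {total_fits}")  # 45 * 5 = 225
print(f"First candidate: {combinations[0]}")
```

With the default refit=True, GridSearchCV performs one additional fit of the best candidate on all the data it was given.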
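9. Overfitting vs Underfitting
Understanding the Bias-Variance Tradeoff
An underfit model is too simple to capture the pattern, so it scores poorly on both training and test data (high bias). An overfit model memorizes the training set, so training accuracy is high while test accuracy lags behind (high variance). The thresholds used in the code (0.85 and 0.1) are rough rules of thumb, not fixed standards.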
# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
# Test different tree depths
depths = [1, 2, 3, 5, 10, 20]
print("Tree Depth | Train Acc | Test Acc | Status")
print("-" * 50)
for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    # Diagnose overfitting/underfitting
    if train_acc < 0.85:
        status = "Underfitting"
    elif train_acc - test_acc > 0.1:
        status = "Overfitting"
    else:
        status = "Good fit"
    print(f" {depth:2d} | {train_acc:.2f} | {test_acc:.2f} | {status}")
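A single train-test split can make the comparison above noisy. As one possible refinement, scikit-learn's validation_curve runs the same depth sweep with cross-validation and returns training and validation scores for every fold:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
depths = [1, 2, 3, 5, 10, 20]

# Both result arrays have shape (number of depths, number of folds)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
)

# Average over the folds to get one number per depth
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

print("Depth | Train Acc | Validation Acc | Gap")
for depth, tr, va in zip(depths, train_mean, val_mean):
    print(f"{depth:5d} | {tr:9.2f} | {va:14.2f} | {tr - va:.2f}")
```

A growing gap between the two columns as depth increases is the usual sign of overfitting, while low values in both columns point to underfitting.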
10. Unsupervised Learning: Clustering
Find natural groupings in data without labels.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Customer data: [age, annual_income_k]
customers = np.array([
    [25, 40], [27, 45], [30, 48], [32, 50], [35, 52],  # Young, moderate income
    [45, 80], [48, 85], [50, 90], [52, 88], [55, 95],  # Middle-age, high income
    [60, 30], [62, 32], [65, 35], [67, 28], [70, 25]   # Senior, low income
])
# Try different numbers of clusters
print("Finding optimal number of clusters...\n")
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(customers)
    # Silhouette score: measures cluster quality (-1 to 1, higher is better)
    score = silhouette_score(customers, clusters)
    print(f"k={k}: Silhouette Score = {score:.3f}")
# Use 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(customers)
print("\nCustomer Segments:")
for i in range(3):
    cluster_customers = customers[clusters == i]
    avg_age = cluster_customers[:, 0].mean()
    avg_income = cluster_customers[:, 1].mean()
    print(f"Cluster {i}: {len(cluster_customers)} customers, Avg Age: {avg_age:.0f}, Avg Income: ${avg_income:.0f}k")
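Once K-Means has been fitted, it can also assign new points to the nearest existing cluster. The sketch below reuses the customer data and assigns three hypothetical new customers (values invented for this example) to segments:

```python
import numpy as np
from sklearn.cluster import KMeans

# Customer data: [age, annual_income_k]
customers = np.array([
    [25, 40], [27, 45], [30, 48], [32, 50], [35, 52],
    [45, 80], [48, 85], [50, 90], [52, 88], [55, 95],
    [60, 30], [62, 32], [65, 35], [67, 28], [70, 25],
])

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(customers)

# Each row of cluster_centers_ is the mean [age, income] of one segment
print("Cluster centers (age, income in $k):")
print(np.round(kmeans.cluster_centers_, 1))

# Hypothetical new customers, one near each existing group
new_customers = np.array([[28, 46], [51, 87], [66, 30]])

# predict() assigns each new point to the closest cluster center
new_labels = kmeans.predict(new_customers)
for (age, income), label in zip(new_customers, new_labels):
    print(f"Age {age}, income ${income}k -> cluster {label}")
```

The cluster numbers themselves are arbitrary labels; only the grouping is meaningful. Age and income happen to be on similar scales here, but features on very different scales would usually be scaled before clustering.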
11. Feature Importance
Understand which features matter most for predictions.
# Load iris data
iris = load_iris()
X, y = iris.data, iris.target
# Train Random Forest (provides feature importance)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
# Get feature importance
importances = rf.feature_importances_
# Sort features by importance
indices = np.argsort(importances)[::-1]
print("Feature Importance Ranking:")
for i, idx in enumerate(indices, 1):
    print(f"{i}. {iris.feature_names[idx]:20} {importances[idx]:.4f}")
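The feature_importances_ attribute is computed from the training data. A complementary check, sketched here as one option rather than the lesson's own method, is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and record the accuracy drop
result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42
)

print("Permutation Importance (mean accuracy drop):")
for idx in result.importances_mean.argsort()[::-1]:
    name = iris.feature_names[idx]
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"{name:20} {mean:.4f} +/- {std:.4f}")
```

A value near zero means that shuffling the feature barely changed the test accuracy, so the model does not rely on it much for these samples.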
12. Complete ML Workflow Example
Putting it all together with best practices.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# 1. Load and explore data
iris = load_iris()
X, y = iris.data, iris.target
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 3. Create pipeline (preprocessing + model)
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scale features
    ('classifier', RandomForestClassifier(random_state=42))  # Model
])
# 4. Define hyperparameters to tune
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [3, 5, 7]
}
# 5. Grid search with cross-validation
grid_search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring='accuracy'
)
grid_search.fit(X_train, y_train)
# 6. Evaluate on test set
test_score = grid_search.score(X_test, y_test)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_ * 100:.2f}%")
print(f"Test score: {test_score * 100:.2f}%")
# 7. Make predictions on new data
new_flowers = np.array([[5.0, 3.5, 1.5, 0.2], [6.5, 3.0, 5.5, 2.0]])
predictions = grid_search.predict(new_flowers)
print(f"\nPredictions for new flowers: {[iris.target_names[p] for p in predictions]}")
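One reason to wrap preprocessing in a pipeline is that the scaler only ever learns from the rows passed to fit. The sketch below checks this by comparing the mean stored in the fitted scaler with the mean of the training rows and of the full dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# fit() runs scaler.fit_transform on X_train, then trains the classifier
pipeline.fit(X_train, y_train)

# The scaler's statistics come from the training rows only
fitted_mean = pipeline.named_steps["scaler"].mean_
print("Scaler mean learned by the pipeline:", np.round(fitted_mean, 3))
print("Mean of the training rows only:     ", np.round(X_train.mean(axis=0), 3))
print("Mean of the full dataset:           ", np.round(X.mean(axis=0), 3))

# score() scales X_test with the training statistics before predicting
print(f"Test accuracy: {pipeline.score(X_test, y_test) * 100:.2f}%")
```

When a pipeline is passed to GridSearchCV or cross_val_score, the same thing happens inside every fold, so validation rows never influence the scaling.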
Exercises
Exercise 1: Binary Classification
Create a simple spam classifier:
Features: [word_count, has_money_keywords, has_urgent_keywords, num_links]
Label: 0 (not spam) or 1 (spam)
Create 20 sample emails
Train a model and evaluate accuracy
# Your code here
Exercise 2: Regression Challenge
Predict student test scores based on:
Hours studied
Hours slept
Previous test score
Create synthetic data for 50 students and build a regression model.
# Your code here
Exercise 3: Customer Segmentation
Use K-Means to segment customers based on:
Purchase frequency (purchases per month)
Average purchase value
Create 30 synthetic customers and find 3-4 meaningful segments.
# Your code here
Exercise 4: Model Comparison
Compare 5 different classification algorithms on the iris dataset:
Use cross-validation
Report mean and std of accuracy
Identify the best model
# Your code here
Self-Check Quiz
1. What's the difference between supervised and unsupervised learning?
Answer: Supervised learning uses labeled data (input-output pairs) to learn predictions. Unsupervised learning finds patterns in unlabeled data.
2. Why do we split data into training and test sets?
Answer: To evaluate how well the model generalizes to new, unseen data. Testing on training data would give overly optimistic results.
3. What is overfitting?
Answer: When a model learns the training data too well, including noise and outliers, resulting in poor performance on new data.
4. What does cross-validation do?
Answer: Tests model performance on multiple train-test splits to get a more reliable estimate of how well it will generalize.
5. When should you scale features?
Answer: When features have different scales or units and the model is distance-based (KNN, SVM) or trained with gradient descent (neural networks, for example).
6. What's the difference between classification and regression?
Answer: Classification predicts categories/classes. Regression predicts continuous numeric values.
7. What is feature engineering?
Answer: Creating new features from existing ones to improve model performance (e.g., combining features, extracting information).
8. What does the R² score measure in regression?
Answer: How well the model explains variance in the data. 1.0 is perfect, 0 means the model is no better than predicting the mean, and negative values mean it is worse than that baseline.
9. What is a hyperparameter?
Answer: A setting you configure before training (e.g., tree depth, number of neighbors) that controls how the algorithm learns.
10. What does K-Means clustering do?
Answer: Groups data into K clusters where points in the same cluster are similar to each other.
Key Takeaways
✅ Supervised learning uses labeled data; unsupervised finds patterns without labels
✅ Always split data into training and test sets
✅ Classification predicts categories; regression predicts numbers
✅ Cross-validation provides more reliable performance estimates
✅ Feature engineering can dramatically improve model performance
✅ Feature scaling is crucial for distance-based algorithms
✅ Overfitting happens when model memorizes training data
✅ Hyperparameter tuning optimizes model settings
✅ Pipelines combine preprocessing and modeling steps
✅ Compare multiple algorithms to find the best one for your data
Pro Tips
💡 Start simple - Begin with simple models before trying complex ones
💡 More data beats better algorithms - Focus on getting quality data
💡 Check class balance - Imbalanced classes need special handling
💡 Use stratified splits - Maintains class proportions in train/test sets
💡 Feature engineering > model tuning - Often gives bigger improvements
💡 Set random_state - Makes results reproducible
💡 Monitor train vs test performance - Detects overfitting early
💡 Use pipelines - Prevents data leakage and simplifies code
💡 Understand your metrics - Accuracy isn't always the right choice
💡 Domain knowledge matters - Understanding the problem helps feature engineering
Common Mistakes to Avoid
❌ Testing on training data - Always use separate test set ✅ Use train_test_split or cross-validation
❌ Not scaling features - Can hurt model performance ✅ Use StandardScaler or MinMaxScaler when needed
❌ Ignoring class imbalance - Model biased toward majority class ✅ Use stratified sampling, SMOTE, or adjust class weights
❌ Using too complex models - Leads to overfitting ✅ Start simple, add complexity only if needed
❌ Not handling missing values - Many models can't handle NaN ✅ Use imputation or remove missing data strategically
❌ Forgetting to set random_state - Results not reproducible ✅ Always set random_state for consistency
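To illustrate why class imbalance and metric choice matter, here is a small sketch with synthetic labels (95 negatives, 5 positives, invented for this example). A baseline that always predicts the majority class looks accurate but never finds a positive case:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 95 negatives and 5 positives
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # placeholder features; this baseline ignores them

# Stratify so the rare class appears in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

acc = accuracy_score(y_test, y_pred)   # 19 of 20 test labels are negative
rec = recall_score(y_test, y_pred)     # the single positive is never predicted

print(f"Accuracy: {acc:.2f}")
print(f"Recall for the positive class: {rec:.2f}")
```

A real model should beat this kind of baseline on a metric that reflects the rare class, such as recall, precision, or F1, before its accuracy is taken at face value.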
Next Steps
You now understand ML fundamentals! Next topics:
Deep Learning - Neural networks with TensorFlow/PyTorch
Natural Language Processing - Text classification, sentiment analysis
Computer Vision - Image classification, object detection
Time Series Analysis - Forecasting, trend analysis
Model Deployment - Serving models in production
Practice Projects:
Build a movie recommendation system
Create a sentiment analyzer for product reviews
Predict stock prices using historical data
Build a customer churn prediction model
Resources:
Scikit-learn documentation: https://scikit-learn.org
Kaggle competitions for practice: https://kaggle.com
Andrew Ng's ML course: https://coursera.org/learn/machine-learning
Machine Learning is transforming every industry - keep practicing!