Lesson 4: Machine Learning Basics#

Introduction#

Machine Learning is teaching computers to learn from data rather than explicitly programming every rule. It's like teaching a child to recognize animals by showing them pictures instead of describing every detail.

Why Machine Learning Matters:

  • Pattern Recognition: Find patterns humans can't easily see

  • Predictions: Forecast future events based on historical data

  • Automation: Make decisions without constant human intervention

  • Scale: Process vast amounts of data quickly

What You'll Learn:

  • ML fundamentals and terminology

  • Supervised learning (classification & regression)

  • Unsupervised learning (clustering)

  • Data preprocessing and feature engineering

  • Model evaluation and validation

  • Cross-validation and hyperparameter tuning

  • Avoiding overfitting and underfitting

  • Real-world ML workflows

Prerequisites: Install scikit-learn and its companion libraries: pip install scikit-learn numpy pandas matplotlib

1. Machine Learning Fundamentals#

Key Concepts#

  • Features (X): Input variables used for prediction (like house size, location)

  • Labels/Target (y): Output variable we want to predict (like house price)

  • Training: Learning patterns from data

  • Testing: Evaluating how well the model learned

  • Model: Mathematical representation of patterns learned from data (see the sketch below)
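
A minimal sketch tying these terms together, using made-up house data purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Features (X): one row per example, one column per input variable
X = np.array([
    [1500, 3],   # size in sq ft, bedrooms (illustrative values)
    [2000, 4],
    [2500, 4],
])

# Label (y): the value we want to predict for each example
y = np.array([300, 400, 480])  # price in thousands

# Training: the model learns the pattern relating X to y
model = LinearRegression().fit(X, y)

# Prediction: apply the learned pattern to an unseen example
print(model.predict(np.array([[1800, 3]])))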

Types of Machine Learning#

1. Supervised Learning (Learning with a teacher)

  • Have labeled examples: input → known output

  • Goal: Learn to predict outputs for new inputs

  • Examples: Email spam detection, image recognition, price prediction

2. Unsupervised Learning (Learning without labels)

  • Have only inputs, no labels

  • Goal: Find hidden patterns or structure

  • Examples: Customer segmentation, anomaly detection

3. Reinforcement Learning (Learning through trial and error)

  • Learn by interacting with environment

  • Get rewards or penalties for actions

  • Examples: Game AI, robotics, recommendation systems
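
The difference between the first two paradigms shows up directly in code: supervised estimators need labels to fit, unsupervised ones do not. A minimal sketch with toy data (the values are illustrative only):

import numpy as np
from sklearn.linear_model import LogisticRegression  # supervised
from sklearn.cluster import KMeans                   # unsupervised

X = np.array([[1.0, 2.0], [1.1, 1.9], [8.0, 9.0], [7.9, 9.1]])
y = np.array([0, 0, 1, 1])  # labels exist only in the supervised case

clf = LogisticRegression().fit(X, y)  # learns from inputs AND known outputs
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)  # inputs only

print(clf.predict(np.array([[1.0, 2.1]])))  # predicts a known label
print(km.labels_)  # groupings discovered without any labels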

2. Your First ML Model: Classification#

Classification assigns inputs to categories. Let's predict flower species!

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
import pandas as pd

# Load the famous iris dataset
iris = load_iris()
X = iris.data  # Features: sepal/petal measurements
y = iris.target  # Labels: flower species (0, 1, or 2)

print("Dataset Information:")
print(f"Number of samples: {len(X)}")
print(f"Number of features: {X.shape[1]}")
print(f"\nFeature names: {iris.feature_names}")
print(f"Species: {iris.target_names}")
print(f"\nFirst 5 samples:")
print(pd.DataFrame(X[:5], columns=iris.feature_names))
print(f"\nTheir species: {[iris.target_names[i] for i in y[:5]]}")

Train-Test Split#

Critical concept: Never test on training data!

  • Training set: Data the model learns from (typically 70-80%)

  • Test set: Unseen data to evaluate performance (20-30%)

# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,  # 20% for testing
    random_state=42  # For reproducibility
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"\nClass distribution in training:")
unique, counts = np.unique(y_train, return_counts=True)
for species_id, count in zip(unique, counts):
    print(f"  {iris.target_names[species_id]}: {count}")

Training the Model#

# Create a Decision Tree classifier
model = DecisionTreeClassifier(
    max_depth=3,  # Limit tree depth to avoid overfitting
    random_state=42
)

# Train the model
model.fit(X_train, y_train)

print("Model trained successfully!")
print(f"Tree depth: {model.get_depth()}")
print(f"Number of leaves: {model.get_n_leaves()}")

Making Predictions#

# Predict on test set
y_pred = model.predict(X_test)

# Show predictions vs actual
print("Predictions vs Actual (first 10):")
for i in range(10):
    pred_species = iris.target_names[y_pred[i]]
    actual_species = iris.target_names[y_test[i]]
    correct = "✓" if y_pred[i] == y_test[i] else "✗"
    print(f"{correct} Predicted: {pred_species:15} Actual: {actual_species}")

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy * 100:.2f}%")

Detailed Evaluation#

from sklearn.metrics import confusion_matrix, classification_report

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(pd.DataFrame(
    cm,
    index=[f"Actual {name}" for name in iris.target_names],
    columns=[f"Pred {name}" for name in iris.target_names]
))

3. Regression: Predicting Continuous Values#

Regression predicts numeric values instead of categories.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Simple dataset: house size -> price
house_sizes = np.array([1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800, 3000, 3500]).reshape(-1, 1)
prices = np.array([150, 180, 210, 240, 280, 300, 340, 380, 420, 500])  # in thousands

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    house_sizes, prices, test_size=0.3, random_state=42
)

# Train model
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)

# Make predictions
y_pred = reg_model.predict(X_test)

print("House Price Predictions:")
for size, actual, pred in zip(X_test.flatten(), y_test, y_pred):
    error = abs(actual - pred)
    print(f"Size: {size:4.0f} sq ft | Actual: ${actual:3.0f}k | Predicted: ${pred:3.0f}k | Error: ${error:.1f}k")

# Evaluation metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"\nMean Squared Error: {mse:.2f}")
print(f"Rยฒ Score: {r2:.2f} (1.0 is perfect)")

# Model coefficients
print(f"\nModel equation: Price = {reg_model.intercept_:.2f} + {reg_model.coef_[0]:.4f} ร— Size")

4. Multiple Algorithm Comparison#

Try different algorithms to find the best one for your data.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Load iris data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Dictionary of models to compare
models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Support Vector Machine': SVC(random_state=42),
    'Naive Bayes': GaussianNB()
}

# Train and evaluate each model
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    results[name] = accuracy
    print(f"{name:25} Accuracy: {accuracy * 100:.2f}%")

# Find best model
best_model = max(results, key=results.get)
print(f"\nBest model: {best_model} with {results[best_model] * 100:.2f}% accuracy")

5. Feature Engineering#

Creating new features can dramatically improve model performance.

# Load iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Original features
print("Original features:")
print(iris.feature_names)

# Create new features
# Feature engineering: combine existing features
sepal_area = X[:, 0] * X[:, 1]  # sepal length * sepal width
petal_area = X[:, 2] * X[:, 3]  # petal length * petal width
sepal_to_petal_ratio = (X[:, 0] + X[:, 1]) / (X[:, 2] + X[:, 3])

# Combine original and new features
X_engineered = np.column_stack([X, sepal_area, petal_area, sepal_to_petal_ratio])

print(f"\nOriginal features: {X.shape[1]}")
print(f"Engineered features: {X_engineered.shape[1]}")

# Compare models with and without feature engineering
for features, name in [(X, "Original"), (X_engineered, "Engineered")]:
    X_train, X_test, y_train, y_test = train_test_split(
        features, y, test_size=0.3, random_state=42
    )
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    print(f"{name:12} features accuracy: {accuracy * 100:.2f}%")

6. Data Preprocessing#

Handling Missing Values#

from sklearn.impute import SimpleImputer

# Create data with missing values
X_with_missing = np.array([
    [1, 2],
    [np.nan, 3],
    [7, 6],
    [5, np.nan]
])

print("Data with missing values:")
print(X_with_missing)

# Impute missing values with mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X_with_missing)

print("\nAfter imputation (mean):")
print(X_imputed)

Feature Scaling#

Many algorithms perform better when features are on similar scales.

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample data with different scales
data = np.array([
    [1000, 2],      # House: size in sq ft, bedrooms
    [1500, 3],
    [2000, 4]
])

print("Original data:")
print(data)

# Standard scaling (mean=0, std=1)
scaler_std = StandardScaler()
data_standardized = scaler_std.fit_transform(data)
print("\nStandardized (mean=0, std=1):")
print(data_standardized)

# Min-Max scaling (range 0-1)
scaler_minmax = MinMaxScaler()
data_normalized = scaler_minmax.fit_transform(data)
print("\nNormalized (range 0-1):")
print(data_normalized)

7. Cross-Validation#

Get more reliable performance estimates by testing on multiple splits.

from sklearn.model_selection import cross_val_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Create model
model = DecisionTreeClassifier(random_state=42)

# 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)

print("Cross-Validation Scores (5 folds):")
for i, score in enumerate(cv_scores, 1):
    print(f"  Fold {i}: {score * 100:.2f}%")

print(f"\nMean accuracy: {cv_scores.mean() * 100:.2f}%")
print(f"Std deviation: {cv_scores.std() * 100:.2f}%")

8. Hyperparameter Tuning#

Find the best settings for your model.

from sklearn.model_selection import GridSearchCV

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Define parameter grid to search
param_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create base model
dt = DecisionTreeClassifier(random_state=42)

# Grid search with cross-validation
grid_search = GridSearchCV(
    dt, param_grid, cv=5, scoring='accuracy', verbose=1
)
grid_search.fit(X, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_ * 100:.2f}%")

# Use best model
best_model = grid_search.best_estimator_

9. Overfitting vs Underfitting#

Understanding the Bias-Variance Tradeoff#

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Test different tree depths
depths = [1, 2, 3, 5, 10, 20]

print("Tree Depth | Train Acc | Test Acc | Status")
print("-" * 50)

for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    
    # Diagnose overfitting/underfitting
    if train_acc < 0.85:
        status = "Underfitting"
    elif train_acc - test_acc > 0.1:
        status = "Overfitting"
    else:
        status = "Good fit"
    
    print(f"    {depth:2d}     |  {train_acc:.2f}    |  {test_acc:.2f}   | {status}")

10. Unsupervised Learning: Clustering#

Find natural groupings in data without labels.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Customer data: [age, annual_income_k]
customers = np.array([
    [25, 40], [27, 45], [30, 48], [32, 50], [35, 52],  # Young, moderate income
    [45, 80], [48, 85], [50, 90], [52, 88], [55, 95],  # Middle-age, high income
    [60, 30], [62, 32], [65, 35], [67, 28], [70, 25]   # Senior, low income
])

# Try different numbers of clusters
print("Finding optimal number of clusters...\n")
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(customers)
    
    # Silhouette score: measures cluster quality (-1 to 1, higher is better)
    score = silhouette_score(customers, clusters)
    print(f"k={k}: Silhouette Score = {score:.3f}")

# Use 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(customers)

print("\nCustomer Segments:")
for i in range(3):
    cluster_customers = customers[clusters == i]
    avg_age = cluster_customers[:, 0].mean()
    avg_income = cluster_customers[:, 1].mean()
    print(f"Cluster {i}: {len(cluster_customers)} customers, Avg Age: {avg_age:.0f}, Avg Income: ${avg_income:.0f}k")

11. Feature Importance#

Understand which features matter most for predictions.

# Load iris data
iris = load_iris()
X, y = iris.data, iris.target

# Train Random Forest (provides feature importance)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance
importances = rf.feature_importances_

# Sort features by importance
indices = np.argsort(importances)[::-1]

print("Feature Importance Ranking:")
for i, idx in enumerate(indices, 1):
    print(f"{i}. {iris.feature_names[idx]:20} {importances[idx]:.4f}")

12. Complete ML Workflow Example#

Putting it all together with best practices.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Load and explore data
iris = load_iris()
X, y = iris.data, iris.target
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Create pipeline (preprocessing + model)
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scale features
    ('classifier', RandomForestClassifier(random_state=42))  # Model
])

# 4. Define hyperparameters to tune
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [3, 5, 7]
}

# 5. Grid search with cross-validation
grid_search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring='accuracy'
)
grid_search.fit(X_train, y_train)

# 6. Evaluate on test set
test_score = grid_search.score(X_test, y_test)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_ * 100:.2f}%")
print(f"Test score: {test_score * 100:.2f}%")

# 7. Make predictions on new data
new_flowers = np.array([[5.0, 3.5, 1.5, 0.2], [6.5, 3.0, 5.5, 2.0]])
predictions = grid_search.predict(new_flowers)
print(f"\nPredictions for new flowers: {[iris.target_names[p] for p in predictions]}")

Exercises#

Exercise 1: Binary Classification#

Create a simple spam classifier:

  • Features: [word_count, has_money_keywords, has_urgent_keywords, num_links]

  • Label: 0 (not spam) or 1 (spam)

  • Create 20 sample emails

  • Train a model and evaluate accuracy

# Your code here

Exercise 2: Regression Challenge#

Predict student test scores based on:

  • Hours studied

  • Hours slept

  • Previous test score

Create synthetic data for 50 students and build a regression model.

# Your code here

Exercise 3: Customer Segmentation#

Use K-Means to segment customers based on:

  • Purchase frequency (purchases per month)

  • Average purchase value

Create 30 synthetic customers and find 3-4 meaningful segments.

# Your code here

Exercise 4: Model Comparison#

Compare 5 different classification algorithms on the iris dataset:

  • Use cross-validation

  • Report mean and std of accuracy

  • Identify the best model

# Your code here

Self-Check Quiz#

1. What's the difference between supervised and unsupervised learning?

Answer: Supervised learning uses labeled data (input-output pairs) to learn predictions. Unsupervised learning finds patterns in unlabeled data.

2. Why do we split data into training and test sets?

Answer: To evaluate how well the model generalizes to new, unseen data. Testing on training data would give overly optimistic results.

3. What is overfitting?

Answer: When a model learns the training data too well, including noise and outliers, resulting in poor performance on new data.

4. What does cross-validation do?

Answer: Tests model performance on multiple train-test splits to get a more reliable estimate of how well it will generalize.

5. When should you scale features?

Answer: When features have different scales/units and you are using distance-based algorithms (KNN, SVM, neural networks) or gradient descent.

6. What's the difference between classification and regression?

Answer: Classification predicts categories/classes. Regression predicts continuous numeric values.

7. What is feature engineering?

Answer: Creating new features from existing ones to improve model performance (e.g., combining features, extracting information).

8. What does the R² score measure in regression?

Answer: How well the model explains variance in the data. 1.0 is perfect, 0 means the model is no better than predicting the mean.
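
For reference, R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)², where ŷᵢ is the model's prediction for sample i and ȳ is the mean of the actual values.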

9. What is a hyperparameter?

Answer: A setting you configure before training (e.g., tree depth, number of neighbors) that controls how the algorithm learns.

10. What does K-Means clustering do?

Answer: Groups data into K clusters where points in the same cluster are similar to each other.

Key Takeaways#

✅ Supervised learning uses labeled data; unsupervised finds patterns without labels

✅ Always split data into training and test sets

✅ Classification predicts categories; regression predicts numbers

✅ Cross-validation provides more reliable performance estimates

✅ Feature engineering can dramatically improve model performance

✅ Feature scaling is crucial for distance-based algorithms

✅ Overfitting happens when a model memorizes training data

✅ Hyperparameter tuning optimizes model settings

✅ Pipelines combine preprocessing and modeling steps

✅ Compare multiple algorithms to find the best one for your data

Pro Tips#

💡 Start simple - Begin with simple models before trying complex ones

💡 More data beats better algorithms - Focus on getting quality data

💡 Check class balance - Imbalanced classes need special handling

💡 Use stratified splits - Maintains class proportions in train/test sets

💡 Feature engineering > model tuning - Often gives bigger improvements

💡 Set random_state - Makes results reproducible

💡 Monitor train vs test performance - Detects overfitting early

💡 Use pipelines - Prevents data leakage and simplifies code (see the sketch below)

💡 Understand your metrics - Accuracy isn't always the right choice

💡 Domain knowledge matters - Understanding the problem helps feature engineering
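
On the pipeline tip above, here is a minimal sketch of why pipelines prevent leakage: inside cross-validation, the scaler is re-fit on each training fold instead of once on the full dataset, so the test folds never influence preprocessing.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling happens inside each CV fold; no information leaks from test folds
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Leak-free CV accuracy: {scores.mean() * 100:.2f}%")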

Common Mistakes to Avoid#

โŒ Testing on training data - Always use separate test set โœ… Use train_test_split or cross-validation

โŒ Not scaling features - Can hurt model performance โœ… Use StandardScaler or MinMaxScaler when needed

โŒ Ignoring class imbalance - Model biased toward majority class โœ… Use stratified sampling, SMOTE, or adjust class weights

โŒ Using too complex models - Leads to overfitting โœ… Start simple, add complexity only if needed

โŒ Not handling missing values - Many models canโ€™t handle NaN โœ… Use imputation or remove missing data strategically

โŒ Forgetting to set random_state - Results not reproducible โœ… Always set random_state for consistency

Next Steps#

You now understand ML fundamentals! Next topics:

  1. Deep Learning - Neural networks with TensorFlow/PyTorch

  2. Natural Language Processing - Text classification, sentiment analysis

  3. Computer Vision - Image classification, object detection

  4. Time Series Analysis - Forecasting, trend analysis

  5. Model Deployment - Serving models in production

Practice Projects:

  • Build a movie recommendation system

  • Create a sentiment analyzer for product reviews

  • Predict stock prices using historical data

  • Build a customer churn prediction model


Machine Learning is transforming every industry - keep practicing! 🚀