Lesson 4: Machine Learning Basics#

Introduction#

Machine Learning is teaching computers to learn from data rather than explicitly programming every rule. It's like teaching a child to recognize animals by showing them pictures instead of describing every detail.

Why Machine Learning Matters:

  • Pattern Recognition: Find patterns humans can't easily see

  • Predictions: Forecast future events based on historical data

  • Automation: Make decisions without constant human intervention

  • Scale: Process vast amounts of data quickly

What You'll Learn:

  • ML fundamentals and terminology

  • Supervised learning (classification & regression)

  • Unsupervised learning (clustering)

  • Data preprocessing and feature engineering

  • Model evaluation and validation

  • Cross-validation and hyperparameter tuning

  • Avoiding overfitting and underfitting

  • Real-world ML workflows

Prerequisites: Install scikit-learn and its companion libraries: pip install scikit-learn numpy pandas matplotlib

1. Machine Learning Fundamentals#

Key Concepts#

  • Features (X): Input variables used for prediction (like house size, location)

  • Labels/Target (y): Output variable we want to predict (like house price)

  • Training: Learning patterns from data

  • Testing: Evaluating how well the model learned

  • Model: Mathematical representation of patterns learned from data (see the sketch below)
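
A minimal sketch tying these terms together, using made-up house data purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Features (X): one row per example, one column per input variable
X = np.array([
    [1500, 3],   # size in sq ft, bedrooms (illustrative values)
    [2000, 4],
    [2500, 4],
])

# Label (y): the value we want to predict for each example
y = np.array([300, 400, 480])  # price in thousands

# Training: the model learns the pattern relating X to y
model = LinearRegression().fit(X, y)

# Prediction: apply the learned pattern to an unseen example
print(model.predict(np.array([[1800, 3]])))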

Types of Machine Learning#

1. Supervised Learning (Learning with a teacher)

  • Have labeled examples: input → known output

  • Goal: Learn to predict outputs for new inputs

  • Examples: Email spam detection, image recognition, price prediction

2. Unsupervised Learning (Learning without labels)

  • Have only inputs, no labels

  • Goal: Find hidden patterns or structure

  • Examples: Customer segmentation, anomaly detection

3. Reinforcement Learning (Learning through trial and error)

  • Learn by interacting with environment

  • Get rewards or penalties for actions

  • Examples: Game AI, robotics, recommendation systems
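
The difference between the first two paradigms shows up directly in code: supervised estimators need labels to fit, unsupervised ones do not. A minimal sketch with toy data (the values are illustrative only):

import numpy as np
from sklearn.linear_model import LogisticRegression  # supervised
from sklearn.cluster import KMeans                   # unsupervised

X = np.array([[1.0, 2.0], [1.1, 1.9], [8.0, 9.0], [7.9, 9.1]])
y = np.array([0, 0, 1, 1])  # labels exist only in the supervised case

clf = LogisticRegression().fit(X, y)  # learns from inputs AND known outputs
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)  # inputs only

print(clf.predict(np.array([[1.0, 2.1]])))  # predicts a known label
print(km.labels_)  # groupings discovered without any labels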

2. Your First ML Model: Classification#

Classification assigns inputs to categories. Let's predict flower species!

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
import pandas as pd

# Load the famous iris dataset
iris = load_iris()
X = iris.data  # Features: sepal/petal measurements
y = iris.target  # Labels: flower species (0, 1, or 2)

print("Dataset Information:")
print(f"Number of samples: {len(X)}")
print(f"Number of features: {X.shape[1]}")
print(f"\nFeature names: {iris.feature_names}")
print(f"Species: {iris.target_names}")
print(f"\nFirst 5 samples:")
print(pd.DataFrame(X[:5], columns=iris.feature_names))
print(f"\nTheir species: {[iris.target_names[i] for i in y[:5]]}")

Train-Test Split#

Critical concept: Never test on training data!

  • Training set: Data the model learns from (typically 70-80%)

  • Test set: Unseen data to evaluate performance (20-30%)

# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,  # 20% for testing
    random_state=42  # For reproducibility
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"\nClass distribution in training:")
unique, counts = np.unique(y_train, return_counts=True)
for species_id, count in zip(unique, counts):
    print(f"  {iris.target_names[species_id]}: {count}")

Training the Model#

# Create a Decision Tree classifier
model = DecisionTreeClassifier(
    max_depth=3,  # Limit tree depth to avoid overfitting
    random_state=42
)

# Train the model
model.fit(X_train, y_train)

print("Model trained successfully!")
print(f"Tree depth: {model.get_depth()}")
print(f"Number of leaves: {model.get_n_leaves()}")

Making Predictions#

# Predict on test set
y_pred = model.predict(X_test)

# Show predictions vs actual
print("Predictions vs Actual (first 10):")
for i in range(10):
    pred_species = iris.target_names[y_pred[i]]
    actual_species = iris.target_names[y_test[i]]
    correct = "✓" if y_pred[i] == y_test[i] else "✗"
    print(f"{correct} Predicted: {pred_species:15} Actual: {actual_species}")

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy * 100:.2f}%")

Detailed Evaluation#

from sklearn.metrics import confusion_matrix, classification_report

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(pd.DataFrame(
    cm,
    index=[f"Actual {name}" for name in iris.target_names],
    columns=[f"Pred {name}" for name in iris.target_names]
))

3. Regression: Predicting Continuous Values#

Regression predicts numeric values instead of categories.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Simple dataset: house size -> price
house_sizes = np.array([1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800, 3000, 3500]).reshape(-1, 1)
prices = np.array([150, 180, 210, 240, 280, 300, 340, 380, 420, 500])  # in thousands

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    house_sizes, prices, test_size=0.3, random_state=42
)

# Train model
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)

# Make predictions
y_pred = reg_model.predict(X_test)

print("House Price Predictions:")
for size, actual, pred in zip(X_test.flatten(), y_test, y_pred):
    error = abs(actual - pred)
    print(f"Size: {size:4.0f} sq ft | Actual: ${actual:3.0f}k | Predicted: ${pred:3.0f}k | Error: ${error:.1f}k")

# Evaluation metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"\nMean Squared Error: {mse:.2f}")
print(f"Rยฒ Score: {r2:.2f} (1.0 is perfect)")

# Model coefficients
print(f"\nModel equation: Price = {reg_model.intercept_:.2f} + {reg_model.coef_[0]:.4f} ร— Size")

4. Multiple Algorithm Comparison#

Try different algorithms to find the best one for your data.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Load iris data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Dictionary of models to compare
models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Support Vector Machine': SVC(random_state=42),
    'Naive Bayes': GaussianNB()
}

# Train and evaluate each model
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    results[name] = accuracy
    print(f"{name:25} Accuracy: {accuracy * 100:.2f}%")

# Find best model
best_model = max(results, key=results.get)
print(f"\nBest model: {best_model} with {results[best_model] * 100:.2f}% accuracy")

5. Feature Engineering#

Creating new features can dramatically improve model performance.

# Load iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Original features
print("Original features:")
print(iris.feature_names)

# Create new features
# Feature engineering: combine existing features
sepal_area = X[:, 0] * X[:, 1]  # sepal length * sepal width
petal_area = X[:, 2] * X[:, 3]  # petal length * petal width
sepal_to_petal_ratio = (X[:, 0] + X[:, 1]) / (X[:, 2] + X[:, 3])

# Combine original and new features
X_engineered = np.column_stack([X, sepal_area, petal_area, sepal_to_petal_ratio])

print(f"\nOriginal features: {X.shape[1]}")
print(f"Engineered features: {X_engineered.shape[1]}")

# Compare models with and without feature engineering
for features, name in [(X, "Original"), (X_engineered, "Engineered")]:
    X_train, X_test, y_train, y_test = train_test_split(
        features, y, test_size=0.3, random_state=42
    )
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    print(f"{name:12} features accuracy: {accuracy * 100:.2f}%")

6. Data Preprocessing#

Handling Missing Values#

from sklearn.impute import SimpleImputer

# Create data with missing values
X_with_missing = np.array([
    [1, 2],
    [np.nan, 3],
    [7, 6],
    [5, np.nan]
])

print("Data with missing values:")
print(X_with_missing)

# Impute missing values with mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X_with_missing)

print("\nAfter imputation (mean):")
print(X_imputed)

Feature Scaling#

Many algorithms perform better when features are on similar scales.

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample data with different scales
data = np.array([
    [1000, 2],      # House: size in sq ft, bedrooms
    [1500, 3],
    [2000, 4]
])

print("Original data:")
print(data)

# Standard scaling (mean=0, std=1)
scaler_std = StandardScaler()
data_standardized = scaler_std.fit_transform(data)
print("\nStandardized (mean=0, std=1):")
print(data_standardized)

# Min-Max scaling (range 0-1)
scaler_minmax = MinMaxScaler()
data_normalized = scaler_minmax.fit_transform(data)
print("\nNormalized (range 0-1):")
print(data_normalized)

7. Cross-Validation#

Get more reliable performance estimates by testing on multiple splits.

from sklearn.model_selection import cross_val_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Create model
model = DecisionTreeClassifier(random_state=42)

# 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)

print("Cross-Validation Scores (5 folds):")
for i, score in enumerate(cv_scores, 1):
    print(f"  Fold {i}: {score * 100:.2f}%")

print(f"\nMean accuracy: {cv_scores.mean() * 100:.2f}%")
print(f"Std deviation: {cv_scores.std() * 100:.2f}%")

8. Hyperparameter Tuning#

Find the best settings for your model.

from sklearn.model_selection import GridSearchCV

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Define parameter grid to search
param_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create base model
dt = DecisionTreeClassifier(random_state=42)

# Grid search with cross-validation
grid_search = GridSearchCV(
    dt, param_grid, cv=5, scoring='accuracy', verbose=1
)
grid_search.fit(X, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_ * 100:.2f}%")

# Use best model
best_model = grid_search.best_estimator_

9. Overfitting vs Underfitting#

Understanding the Bias-Variance Tradeoff#

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Test different tree depths
depths = [1, 2, 3, 5, 10, 20]

print("Tree Depth | Train Acc | Test Acc | Status")
print("-" * 50)

for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    
    # Diagnose overfitting/underfitting
    if train_acc < 0.85:
        status = "Underfitting"
    elif train_acc - test_acc > 0.1:
        status = "Overfitting"
    else:
        status = "Good fit"
    
    print(f"    {depth:2d}     |  {train_acc:.2f}    |  {test_acc:.2f}   | {status}")

10. Unsupervised Learning: Clustering#

Find natural groupings in data without labels.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Customer data: [age, annual_income_k]
customers = np.array([
    [25, 40], [27, 45], [30, 48], [32, 50], [35, 52],  # Young, moderate income
    [45, 80], [48, 85], [50, 90], [52, 88], [55, 95],  # Middle-age, high income
    [60, 30], [62, 32], [65, 35], [67, 28], [70, 25]   # Senior, low income
])

# Try different numbers of clusters
print("Finding optimal number of clusters...\n")
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(customers)
    
    # Silhouette score: measures cluster quality (-1 to 1, higher is better)
    score = silhouette_score(customers, clusters)
    print(f"k={k}: Silhouette Score = {score:.3f}")

# Use 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(customers)

print("\nCustomer Segments:")
for i in range(3):
    cluster_customers = customers[clusters == i]
    avg_age = cluster_customers[:, 0].mean()
    avg_income = cluster_customers[:, 1].mean()
    print(f"Cluster {i}: {len(cluster_customers)} customers, Avg Age: {avg_age:.0f}, Avg Income: ${avg_income:.0f}k")

11. Feature Importance#

Understand which features matter most for predictions.

# Load iris data
iris = load_iris()
X, y = iris.data, iris.target

# Train Random Forest (provides feature importance)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance
importances = rf.feature_importances_

# Sort features by importance
indices = np.argsort(importances)[::-1]

print("Feature Importance Ranking:")
for i, idx in enumerate(indices, 1):
    print(f"{i}. {iris.feature_names[idx]:20} {importances[idx]:.4f}")

12. Complete ML Workflow Example#

Putting it all together with best practices.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Load and explore data
iris = load_iris()
X, y = iris.data, iris.target
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Create pipeline (preprocessing + model)
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scale features
    ('classifier', RandomForestClassifier(random_state=42))  # Model
])

# 4. Define hyperparameters to tune
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [3, 5, 7]
}

# 5. Grid search with cross-validation
grid_search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring='accuracy'
)
grid_search.fit(X_train, y_train)

# 6. Evaluate on test set
test_score = grid_search.score(X_test, y_test)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_ * 100:.2f}%")
print(f"Test score: {test_score * 100:.2f}%")

# 7. Make predictions on new data
new_flowers = np.array([[5.0, 3.5, 1.5, 0.2], [6.5, 3.0, 5.5, 2.0]])
predictions = grid_search.predict(new_flowers)
print(f"\nPredictions for new flowers: {[iris.target_names[p] for p in predictions]}")

Exercises#

Exercise 1: Binary Classification#

Create a simple spam classifier:

  • Features: [word_count, has_money_keywords, has_urgent_keywords, num_links]

  • Label: 0 (not spam) or 1 (spam)

  • Create 20 sample emails

  • Train a model and evaluate accuracy

# Your code here

Exercise 2: Regression Challenge#

Predict student test scores based on:

  • Hours studied

  • Hours slept

  • Previous test score

Create synthetic data for 50 students and build a regression model.

# Your code here

Exercise 3: Customer Segmentation#

Use K-Means to segment customers based on:

  • Purchase frequency (purchases per month)

  • Average purchase value

Create 30 synthetic customers and find 3-4 meaningful segments.

# Your code here

Exercise 4: Model Comparison#

Compare 5 different classification algorithms on the iris dataset:

  • Use cross-validation

  • Report mean and std of accuracy

  • Identify the best model

# Your code here

Self-Check Quiz#

1. What's the difference between supervised and unsupervised learning?

Answer: Supervised learning uses labeled data (input-output pairs) to learn predictions. Unsupervised learning finds patterns in unlabeled data.

2. Why do we split data into training and test sets?

Answer: To evaluate how well the model generalizes to new, unseen data. Testing on training data would give overly optimistic results.

3. What is overfitting?

Answer: When a model learns the training data too well, including noise and outliers, resulting in poor performance on new data.

4. What does cross-validation do?

Answer: Tests model performance on multiple train-test splits to get a more reliable estimate of how well it will generalize.

5. When should you scale features?

Answer: When features have different scales/units and you are using distance-based algorithms (KNN, SVM, neural networks) or gradient descent.

6. What's the difference between classification and regression?

Answer: Classification predicts categories/classes. Regression predicts continuous numeric values.

7. What is feature engineering?

Answer: Creating new features from existing ones to improve model performance (e.g., combining features, extracting information).

8. What does the R² score measure in regression?

Answer: How well the model explains variance in the data. 1.0 is perfect, 0 means the model is no better than predicting the mean.
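
For reference, R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)², where ŷᵢ is the model's prediction for sample i and ȳ is the mean of the actual values.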

9. What is a hyperparameter?

Answer: A setting you configure before training (e.g., tree depth, number of neighbors) that controls how the algorithm learns.

10. What does K-Means clustering do?

Answer: Groups data into K clusters where points in the same cluster are similar to each other.

Key Takeaways#

✅ Supervised learning uses labeled data; unsupervised finds patterns without labels

✅ Always split data into training and test sets

✅ Classification predicts categories; regression predicts numbers

✅ Cross-validation provides more reliable performance estimates

✅ Feature engineering can dramatically improve model performance

✅ Feature scaling is crucial for distance-based algorithms

✅ Overfitting happens when a model memorizes training data

✅ Hyperparameter tuning optimizes model settings

✅ Pipelines combine preprocessing and modeling steps

✅ Compare multiple algorithms to find the best one for your data

Pro Tips#

💡 Start simple - Begin with simple models before trying complex ones

💡 More data beats better algorithms - Focus on getting quality data

💡 Check class balance - Imbalanced classes need special handling

💡 Use stratified splits - Maintains class proportions in train/test sets

💡 Feature engineering > model tuning - Often gives bigger improvements

💡 Set random_state - Makes results reproducible

💡 Monitor train vs test performance - Detects overfitting early

💡 Use pipelines - Prevents data leakage and simplifies code (see the sketch below)

💡 Understand your metrics - Accuracy isn't always the right choice

💡 Domain knowledge matters - Understanding the problem helps feature engineering
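
On the pipeline tip above, here is a minimal sketch of why pipelines prevent leakage: inside cross-validation, the scaler is re-fit on each training fold instead of once on the full dataset, so the test folds never influence preprocessing.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling happens inside each CV fold; no information leaks from test folds
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Leak-free CV accuracy: {scores.mean() * 100:.2f}%")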

Common Mistakes to Avoid#

โŒ Testing on training data - Always use separate test set โœ… Use train_test_split or cross-validation

โŒ Not scaling features - Can hurt model performance โœ… Use StandardScaler or MinMaxScaler when needed

โŒ Ignoring class imbalance - Model biased toward majority class โœ… Use stratified sampling, SMOTE, or adjust class weights

โŒ Using too complex models - Leads to overfitting โœ… Start simple, add complexity only if needed

โŒ Not handling missing values - Many models canโ€™t handle NaN โœ… Use imputation or remove missing data strategically

โŒ Forgetting to set random_state - Results not reproducible โœ… Always set random_state for consistency

Next Steps#

You now understand ML fundamentals! Next topics:

  1. Deep Learning - Neural networks with TensorFlow/PyTorch

  2. Natural Language Processing - Text classification, sentiment analysis

  3. Computer Vision - Image classification, object detection

  4. Time Series Analysis - Forecasting, trend analysis

  5. Model Deployment - Serving models in production

Practice Projects:

  • Build a movie recommendation system

  • Create a sentiment analyzer for product reviews

  • Predict stock prices using historical data

  • Build a customer churn prediction model


Machine Learning is transforming every industry - keep practicing! 🚀