Lesson 5: Data Analysis with Pandas#

Introduction#

Pandas is Python's powerhouse library for data analysis. Think of it as Excel on steroids - it handles data manipulation, cleaning, analysis, and visualization with elegant, readable code.

Why Pandas Matters:

  • Industry Standard: Used by data scientists worldwide

  • Powerful: Handle millions of rows efficiently

  • Flexible: Read from CSV, Excel, SQL, JSON, and more

  • Integrated: Works seamlessly with NumPy, Matplotlib, scikit-learn

What You'll Learn:

  • Series and DataFrames fundamentals

  • Loading and saving data (CSV, Excel, JSON)

  • Data exploration and inspection

  • Selecting, filtering, and indexing

  • Data cleaning and handling missing values

  • Grouping and aggregation

  • Merging and joining datasets

  • Time series basics

  • Data visualization

Prerequisites: pip install pandas numpy matplotlib openpyxl

1. Pandas Fundamentals#

Series: One-Dimensional Array#

import pandas as pd
import numpy as np

# Create a Series
temperatures = pd.Series([72, 75, 78, 74, 71], 
                         index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
                         name='Temperature')

print("Series:")
print(temperatures)
print(f"\nAccess by label: Wed = {temperatures['Wed']}")
print(f"Access by position: [1] = {temperatures.iloc[1]}")
print(f"\nMean: {temperatures.mean():.1f}")
print(f"Max: {temperatures.max()}")

DataFrames: Two-Dimensional Tables#

# Create DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['NYC', 'Paris', 'London', 'Tokyo', 'Berlin'],
    'Salary': [70000, 80000, 75000, 90000, 85000]
}

df = pd.DataFrame(data)
print("DataFrame:")
print(df)
print(f"\nShape: {df.shape} (rows, columns)")
print(f"Columns: {df.columns.tolist()}")
print(f"Index: {df.index.tolist()}")

2. Loading and Saving Data#

Reading CSV Files#

# Create sample CSV data
import io

csv_data = """Name,Age,Department,Salary
Alice,25,Engineering,75000
Bob,30,Marketing,65000
Charlie,35,Engineering,85000
Diana,28,Sales,70000
Eve,32,Marketing,72000"""

# Read from string (simulating file read)
df = pd.read_csv(io.StringIO(csv_data))
print("Data loaded from CSV:")
print(df)

# Save to CSV
# df.to_csv('employees.csv', index=False)

# Read with options
# df = pd.read_csv('data.csv', 
#                  sep=',',           # Delimiter
#                  header=0,          # Row to use as column names
#                  na_values=['NA', 'missing'],  # Values to treat as NaN
#                  parse_dates=['Date']  # Parse date columns
# )
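
Those options are easier to understand in action. Below is a minimal sketch against an in-memory string; the Date column and the 'NA' marker are made up for illustration:

messy_csv = """Name,Age,Date
Alice,25,2024-01-15
Bob,NA,2024-02-20"""

df_opts = pd.read_csv(io.StringIO(messy_csv),
                      na_values=['NA'],       # 'NA' strings become NaN
                      parse_dates=['Date'])   # Date column parsed as datetime64
print(df_opts.dtypes)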

Working with Excel Files#

# Read Excel (requires openpyxl or xlrd)
# df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Write to Excel
# df.to_excel('output.xlsx', sheet_name='Results', index=False)

# Multiple sheets
# with pd.ExcelWriter('multi_sheet.xlsx') as writer:
#     df1.to_excel(writer, sheet_name='Sales')
#     df2.to_excel(writer, sheet_name='Inventory')

print("Excel I/O examples shown (commented out)")

JSON and Other Formats#

# JSON
json_str = '{"Name": ["Alice", "Bob"], "Age": [25, 30]}'
df_json = pd.read_json(io.StringIO(json_str))
print("From JSON:")
print(df_json)

# To JSON
json_output = df_json.to_json(orient='records')
print(f"\nTo JSON: {json_output}")

# Other formats:
# pd.read_sql(query, connection)  # SQL database
# pd.read_html(url)  # HTML tables
# pd.read_clipboard()  # From clipboard
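
Of these, pd.read_sql is the easiest to try without extra setup, using Python's built-in sqlite3 module. A minimal sketch with a throwaway in-memory database (the users table and its columns are invented for the example):

import sqlite3

conn = sqlite3.connect(':memory:')  # temporary in-memory database
conn.execute('CREATE TABLE users (name TEXT, age INTEGER)')
conn.executemany('INSERT INTO users VALUES (?, ?)', [('Alice', 25), ('Bob', 30)])

df_sql = pd.read_sql('SELECT * FROM users WHERE age > 26', conn)
print("From SQL:")
print(df_sql)
conn.close()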

3. Data Exploration#

Inspecting Data#

# Create sample dataset
np.random.seed(42)
df = pd.DataFrame({
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet'], 100),
    'Price': np.random.randint(200, 1500, 100),
    'Quantity': np.random.randint(1, 10, 100),
    'Rating': np.random.uniform(3.0, 5.0, 100)
})

# Basic info
print("First 5 rows:")
print(df.head())

print("\nLast 3 rows:")
print(df.tail(3))

print("\nDataFrame Info:")
df.info()  # info() prints directly and returns None, so don't wrap it in print()

print("\nStatistical Summary:")
print(df.describe())

print("\nValue Counts:")
print(df['Product'].value_counts())

4. Selecting and Filtering Data#

Column Selection#

# Single column (returns Series)
prices = df['Price']
print("Type:", type(prices))
print(prices.head())

# Multiple columns (returns DataFrame)
subset = df[['Product', 'Price']]
print("\nMultiple columns:")
print(subset.head())

Row Selection: loc vs iloc#

# iloc: Integer-based position
print("First row (iloc):")
print(df.iloc[0])

print("\nFirst 3 rows, first 2 columns:")
print(df.iloc[0:3, 0:2])

# loc: Label-based
df_indexed = df.set_index('Product')
print("\nSelect 'Laptop' rows (loc):")
print(df_indexed.loc['Laptop'].head())

Boolean Filtering#

# Simple condition
expensive = df[df['Price'] > 1000]
print(f"Expensive items (>{1000}): {len(expensive)} rows")
print(expensive.head())

# Multiple conditions (AND: &, OR: |)
high_value = df[(df['Price'] > 800) & (df['Rating'] >= 4.5)]
print(f"\nHigh value items: {len(high_value)} rows")

# Using isin()
mobile_devices = df[df['Product'].isin(['Phone', 'Tablet'])]
print(f"\nMobile devices: {len(mobile_devices)} rows")

# String methods work through the .str accessor
phones = df[df['Product'].str.contains('Phone')]
print(f"\nRows containing 'Phone': {len(phones)}")
# Also available: .str.upper(), .str.startswith('Lap'), .str.len(), ...

5. Data Manipulation#

Adding and Modifying Columns#

# Add new column
df['Revenue'] = df['Price'] * df['Quantity']

# Conditional column
df['Price_Category'] = pd.cut(df['Price'], 
                               bins=[0, 500, 1000, 2000],
                               labels=['Low', 'Medium', 'High'])

# Apply a function element-wise (for simple math, df['Price'] * 0.9 is faster)
df['Discount_Price'] = df['Price'].apply(lambda x: x * 0.9)

# Using np.where (vectorized if-else)
df['Quality'] = np.where(df['Rating'] >= 4.5, 'Excellent', 'Good')

print(df[['Product', 'Price', 'Price_Category', 'Quality']].head())

Sorting Data#

# Sort by single column
sorted_price = df.sort_values('Price', ascending=False)
print("Top 5 by price:")
print(sorted_price[['Product', 'Price']].head())

# Sort by multiple columns
sorted_multi = df.sort_values(['Product', 'Price'], ascending=[True, False])
print("\nSorted by Product (asc), then Price (desc):")
print(sorted_multi[['Product', 'Price']].head())

6. Handling Missing Data#

# Create data with missing values
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, 5]
})

print("Original data:")
print(df_missing)

# Detect missing values
print("\nMissing values:")
print(df_missing.isnull().sum())

# Drop rows with any missing values
print("\nAfter dropna():")
print(df_missing.dropna())

# Fill missing values
print("\nFill with 0:")
print(df_missing.fillna(0))

# Fill with mean
df_filled = df_missing.copy()
df_filled['A'] = df_filled['A'].fillna(df_filled['A'].mean())  # assign back; inplace on a column is deprecated
print("\nFill column A with mean:")
print(df_filled)

# Forward fill (propagate last valid value)
print("\nForward fill:")
print(df_missing.ffill())  # fillna(method='ffill') is deprecated; use ffill()

7. Grouping and Aggregation#

# Group by single column
product_stats = df.groupby('Product')['Revenue'].sum().sort_values(ascending=False)
print("Total Revenue by Product:")
print(product_stats)

# Multiple aggregations
summary = df.groupby('Product').agg({
    'Price': ['mean', 'min', 'max'],
    'Quantity': 'sum',
    'Rating': 'mean'
})
print("\nDetailed summary:")
print(summary)

# Group by multiple columns
multi_group = df.groupby(['Product', 'Price_Category'], observed=False).size()  # explicit observed= avoids a FutureWarning with categoricals
print("\nCount by Product and Price Category:")
print(multi_group)

# Custom aggregation
def price_range(x):
    return x.max() - x.min()

price_ranges = df.groupby('Product')['Price'].agg(price_range)
print("\nPrice Range by Product:")
print(price_ranges)
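
The dict-style agg above produces MultiIndex columns, which can be awkward to work with. If you prefer flat column names, pandas also supports named aggregation; a quick sketch on the same data:

flat_summary = df.groupby('Product').agg(
    avg_price=('Price', 'mean'),     # new_column=(source_column, function)
    total_qty=('Quantity', 'sum'),
    avg_rating=('Rating', 'mean')
)
print("\nNamed aggregation (flat columns):")
print(flat_summary)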

8. Merging and Joining DataFrames#

# Create sample dataframes
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [10, 20, 10, 30]
})

departments = pd.DataFrame({
    'dept_id': [10, 20, 30],
    'dept_name': ['Engineering', 'Marketing', 'Sales']
})

# Inner join (default)
inner = pd.merge(employees, departments, on='dept_id')
print("Inner join:")
print(inner)

# Left join
left = pd.merge(employees, departments, on='dept_id', how='left')
print("\nLeft join:")
print(left)

# Concatenate vertically
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
combined = pd.concat([df1, df2], ignore_index=True)
print("\nConcatenated:")
print(combined)

9. Pivot Tables and Reshaping#

# Create sales data
sales = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=12, freq='ME'),  # 'ME' = month-end; plain 'M' is deprecated in recent pandas
    'Product': ['A', 'B'] * 6,
    'Region': ['North', 'South', 'North', 'South'] * 3,
    'Sales': np.random.randint(100, 1000, 12)
})

print("Sales data:")
print(sales.head())

# Pivot table
pivot = pd.pivot_table(sales, 
                       values='Sales',
                       index='Product',
                       columns='Region',
                       aggfunc='sum')
print("\nPivot table:")
print(pivot)

# Melt (unpivot)
melted = pivot.reset_index().melt(id_vars='Product', 
                                   var_name='Region',
                                   value_name='Sales')
print("\nMelted back:")
print(melted)

10. Time Series Basics#

# Create time series data
dates = pd.date_range('2024-01-01', periods=100, freq='D')
ts = pd.Series(np.random.randn(100).cumsum(), index=dates)

print("Time series (first 5):")
print(ts.head())

# Date-based selection
print("\nJanuary data:")
print(ts.loc['2024-01'].head())  # use .loc for partial date-string selection

# Resampling
weekly = ts.resample('W').mean()
print("\nWeekly average (first 3):")
print(weekly.head(3))

# Rolling window
rolling_mean = ts.rolling(window=7).mean()
print("\n7-day rolling mean (first 10):")
print(rolling_mean.head(10))

11. Data Visualization with Pandas#

import matplotlib.pyplot as plt

# Bar chart
product_stats.plot(kind='bar', figsize=(10, 6), color='skyblue')
plt.title('Total Revenue by Product')
plt.ylabel('Revenue ($)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Histogram
df['Price'].hist(bins=20, figsize=(10, 6), edgecolor='black')
plt.title('Price Distribution')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

# Box plot
df.boxplot(column='Price', by='Product', figsize=(10, 6))
plt.title('Price Distribution by Product')
plt.suptitle('')  # Remove default title
plt.tight_layout()
plt.show()

# Scatter plot
df.plot.scatter(x='Price', y='Rating', figsize=(10, 6), alpha=0.5)
plt.title('Price vs Rating')
plt.tight_layout()
plt.show()

12. Real-World Example: Sales Analysis#

# Create realistic sales dataset
np.random.seed(42)
n_records = 500

sales_df = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=n_records, freq='h'),  # lowercase 'h'; 'H' is deprecated in recent pandas
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], n_records),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),
    'Salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana'], n_records),
    'Quantity': np.random.randint(1, 5, n_records),
    'Unit_Price': np.random.randint(200, 1500, n_records)
})

# Calculate revenue
sales_df['Revenue'] = sales_df['Quantity'] * sales_df['Unit_Price']

print("Sales Dataset:")
print(sales_df.head())

# Analysis 1: Top products by revenue
top_products = sales_df.groupby('Product')['Revenue'].sum().sort_values(ascending=False)
print("\nTop Products by Revenue:")
print(top_products)

# Analysis 2: Best performing regions
region_performance = sales_df.groupby('Region').agg({
    'Revenue': 'sum',
    'Quantity': 'sum',
    'Salesperson': 'count'
}).rename(columns={'Salesperson': 'Transactions'})
print("\nRegion Performance:")
print(region_performance)

# Analysis 3: Top salespeople
salesperson_stats = sales_df.groupby('Salesperson').agg({
    'Revenue': ['sum', 'mean'],
    'Quantity': 'sum'
}).round(2)
print("\nSalesperson Performance:")
print(salesperson_stats)

# Analysis 4: Daily trends
sales_df['Date_Only'] = sales_df['Date'].dt.date
daily_revenue = sales_df.groupby('Date_Only')['Revenue'].sum()
print("\nDaily Revenue (first 5 days):")
print(daily_revenue.head())

Exercises#

Exercise 1: Student Performance Analysis#

Create and analyze a student grades dataset:

  1. Create DataFrame with students, subjects, and scores

  2. Calculate average score per student

  3. Find top 3 students

  4. Calculate average score per subject

  5. Identify students scoring below 70 in any subject

# Your code here

Exercise 2: E-commerce Data Cleaning#

Given messy e-commerce data:

  1. Handle missing values appropriately

  2. Remove duplicates

  3. Convert data types correctly

  4. Create new calculated columns

  5. Export cleaned data

# Your code here

Exercise 3: Sales Dashboard#

Create a mini sales dashboard:

  1. Load sales data

  2. Calculate key metrics (total revenue, avg order value, etc.)

  3. Create visualizations (revenue trend, top products, regional breakdown)

  4. Generate summary report

# Your code here

Key Takeaways#

✅ DataFrames are 2D labeled data structures (like Excel spreadsheets)

✅ Read/write data in many formats (CSV, Excel, JSON, SQL)

✅ loc/iloc for label-based and integer-based indexing

✅ Boolean indexing for powerful filtering

✅ GroupBy for aggregation and summary statistics

✅ Merge/join to combine datasets

✅ Handle missing data with dropna(), fillna()

✅ Vectorized operations are fast - avoid loops!

✅ Built-in plotting for quick visualizations

✅ Method chaining creates readable data pipelines (see the sketch below)
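
Method chaining means stringing operations together so each step feeds the next without intermediate variables. A small sketch of the pattern, reusing the product DataFrame from section 3:

top_laptops = (
    df[df['Product'] == 'Laptop']                            # filter rows
    .assign(Revenue=lambda d: d['Price'] * d['Quantity'])    # derive a column
    .sort_values('Revenue', ascending=False)                 # order
    .head(3)                                                 # keep top 3
)
print(top_laptops[['Product', 'Price', 'Quantity', 'Revenue']])

Wrapping the chain in parentheses lets each step sit on its own line, which is what keeps the pipeline readable.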

Pro Tips#

💡 Avoid inplace=True - it's safer to assign results to a new variable

💡 Check dtypes early - Wrong types cause unexpected behavior

💡 copy() when needed - Avoid unintended modifications

💡 Use query() for complex filters: df.query('Price > 100 & Rating >= 4')

💡 Vectorize operations - Much faster than loops

💡 Use categorical dtype for repeated string values (saves memory)

💡 Profile memory usage - df.memory_usage(deep=True)

💡 Read in chunks for large files: pd.read_csv('big.csv', chunksize=10000)

💡 Use .loc for setting values - Avoids SettingWithCopyWarning

💡 For interactive profiling reports, try the ydata-profiling package (formerly pandas-profiling)
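
A few of these tips in code. This is a sketch against the product DataFrame from section 3; 'big.csv' is a hypothetical file, shown commented out:

# query(): complex filters without repeating the DataFrame name
picks = df.query('Price > 100 and Rating >= 4')
print(f"query() matches: {len(picks)} rows")

# Categorical dtype: saves memory for repeated string values
before = df['Product'].memory_usage(deep=True)
df['Product'] = df['Product'].astype('category')
after = df['Product'].memory_usage(deep=True)
print(f"Product column memory: {before} -> {after} bytes")

# Chunked reading for files larger than memory (hypothetical file)
# for chunk in pd.read_csv('big.csv', chunksize=10000):
#     process(chunk)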

Common Mistakes to Avoid#

โŒ Chained indexing - df[df.A > 0][df.B < 5] = value โœ… Use Boolean mask: df.loc[(df.A > 0) & (df.B < 5)] = value

โŒ Iterating with loops - Very slow โœ… Use vectorized operations or apply()

โŒ Modifying while iterating - Causes errors โœ… Create new DataFrame or use copy()

โŒ Not handling missing data - NaN breaks calculations โœ… Use dropna(), fillna(), or handle explicitly

โŒ Forgetting axis parameter - axis=0 (rows), axis=1 (columns) โœ… Always check which axis youโ€™re operating on

Next Steps#

You now have solid Pandas fundamentals! Next topics:

  1. Advanced Pandas - MultiIndex, window functions, performance optimization

  2. Data Visualization - Seaborn, Plotly for advanced plots

  3. SQL with Pandas - Read from databases, write queries

  4. Big Data - Dask, PySpark for datasets larger than RAM

  5. Data Science Projects - End-to-end analysis workflows

Practice Projects:

  • Analyze COVID-19 data

  • Build a stock market analyzer

  • Create a customer segmentation analysis

  • Analyze your own data (finance, health, etc.)

Resources:

  • Official pandas documentation: https://pandas.pydata.org/docs/

  • "10 Minutes to pandas" user guide: https://pandas.pydata.org/docs/user_guide/10min.html

Pandas is the foundation of data analysis in Python - master it! 🚀