Lesson 5: Data Analysis with Pandas#

Introduction#

Pandas is Python's powerhouse library for data analysis. Think of it as Excel on steroids - it handles data manipulation, cleaning, analysis, and visualization with elegant, readable code.

Why Pandas Matters:

  • Industry Standard: Used by data scientists worldwide

  • Powerful: Handle millions of rows efficiently

  • Flexible: Read from CSV, Excel, SQL, JSON, and more

  • Integrated: Works seamlessly with NumPy, Matplotlib, scikit-learn

What You'll Learn:

  • Series and DataFrames fundamentals

  • Loading and saving data (CSV, Excel, JSON)

  • Data exploration and inspection

  • Selecting, filtering, and indexing

  • Data cleaning and handling missing values

  • Grouping and aggregation

  • Merging and joining datasets

  • Time series basics

  • Data visualization

Prerequisites: pip install pandas numpy matplotlib openpyxl

1. Pandas Fundamentals#

Series: One-Dimensional Array#

import pandas as pd
import numpy as np

# Create a Series
temperatures = pd.Series([72, 75, 78, 74, 71], 
                         index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
                         name='Temperature')

print("Series:")
print(temperatures)
print(f"\nAccess by label: Wed = {temperatures['Wed']}")
print(f"Access by position: [1] = {temperatures.iloc[1]}")
print(f"\nMean: {temperatures.mean():.1f}")
print(f"Max: {temperatures.max()}")

DataFrames: Two-Dimensional Tables#

# Create DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['NYC', 'Paris', 'London', 'Tokyo', 'Berlin'],
    'Salary': [70000, 80000, 75000, 90000, 85000]
}

df = pd.DataFrame(data)
print("DataFrame:")
print(df)
print(f"\nShape: {df.shape} (rows, columns)")
print(f"Columns: {df.columns.tolist()}")
print(f"Index: {df.index.tolist()}")

2. Loading and Saving Data#

Reading CSV Files#

# Create sample CSV data
import io

csv_data = """Name,Age,Department,Salary
Alice,25,Engineering,75000
Bob,30,Marketing,65000
Charlie,35,Engineering,85000
Diana,28,Sales,70000
Eve,32,Marketing,72000"""

# Read from string (simulating file read)
df = pd.read_csv(io.StringIO(csv_data))
print("Data loaded from CSV:")
print(df)

# Save to CSV
# df.to_csv('employees.csv', index=False)

# Read with options
# df = pd.read_csv('data.csv', 
#                  sep=',',           # Delimiter
#                  header=0,          # Row to use as column names
#                  na_values=['NA', 'missing'],  # Values to treat as NaN
#                  parse_dates=['Date']  # Parse date columns
# )
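
Those options are easier to understand in action. Below is a minimal sketch against an in-memory string; the Date column and the 'NA' marker are made up for illustration:

messy_csv = """Name,Age,Date
Alice,25,2024-01-15
Bob,NA,2024-02-20"""

df_opts = pd.read_csv(io.StringIO(messy_csv),
                      na_values=['NA'],       # 'NA' strings become NaN
                      parse_dates=['Date'])   # Date column parsed as datetime64
print(df_opts.dtypes)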

Working with Excel Files#

# Read Excel (requires openpyxl or xlrd)
# df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Write to Excel
# df.to_excel('output.xlsx', sheet_name='Results', index=False)

# Multiple sheets
# with pd.ExcelWriter('multi_sheet.xlsx') as writer:
#     df1.to_excel(writer, sheet_name='Sales')
#     df2.to_excel(writer, sheet_name='Inventory')

print("Excel I/O examples shown (commented out)")

JSON and Other Formats#

# JSON
json_str = '{"Name": ["Alice", "Bob"], "Age": [25, 30]}'
df_json = pd.read_json(io.StringIO(json_str))
print("From JSON:")
print(df_json)

# To JSON
json_output = df_json.to_json(orient='records')
print(f"\nTo JSON: {json_output}")

# Other formats:
# pd.read_sql(query, connection)  # SQL database
# pd.read_html(url)  # HTML tables
# pd.read_clipboard()  # From clipboard
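
Of these, pd.read_sql is the easiest to try without extra setup, using Python's built-in sqlite3 module. A minimal sketch with a throwaway in-memory database (the users table and its columns are invented for the example):

import sqlite3

conn = sqlite3.connect(':memory:')  # temporary in-memory database
conn.execute('CREATE TABLE users (name TEXT, age INTEGER)')
conn.executemany('INSERT INTO users VALUES (?, ?)', [('Alice', 25), ('Bob', 30)])

df_sql = pd.read_sql('SELECT * FROM users WHERE age > 26', conn)
print("From SQL:")
print(df_sql)
conn.close()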

3. Data Exploration#

Inspecting Data#

# Create sample dataset
np.random.seed(42)
df = pd.DataFrame({
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet'], 100),
    'Price': np.random.randint(200, 1500, 100),
    'Quantity': np.random.randint(1, 10, 100),
    'Rating': np.random.uniform(3.0, 5.0, 100)
})

# Basic info
print("First 5 rows:")
print(df.head())

print("\nLast 3 rows:")
print(df.tail(3))

print("\nDataFrame Info:")
df.info()  # info() prints directly and returns None, so don't wrap it in print()

print("\nStatistical Summary:")
print(df.describe())

print("\nValue Counts:")
print(df['Product'].value_counts())

4. Selecting and Filtering Data#

Column Selection#

# Single column (returns Series)
prices = df['Price']
print("Type:", type(prices))
print(prices.head())

# Multiple columns (returns DataFrame)
subset = df[['Product', 'Price']]
print("\nMultiple columns:")
print(subset.head())

Row Selection: loc vs iloc#

# iloc: Integer-based position
print("First row (iloc):")
print(df.iloc[0])

print("\nFirst 3 rows, first 2 columns:")
print(df.iloc[0:3, 0:2])

# loc: Label-based
df_indexed = df.set_index('Product')
print("\nSelect 'Laptop' rows (loc):")
print(df_indexed.loc['Laptop'].head())

Boolean Filtering#

# Simple condition
expensive = df[df['Price'] > 1000]
print(f"Expensive items (>{1000}): {len(expensive)} rows")
print(expensive.head())

# Multiple conditions (AND: &, OR: |)
high_value = df[(df['Price'] > 800) & (df['Rating'] >= 4.5)]
print(f"\nHigh value items: {len(high_value)} rows")

# Using isin()
mobile_devices = df[df['Product'].isin(['Phone', 'Tablet'])]
print(f"\nMobile devices: {len(mobile_devices)} rows")

# String methods work through the .str accessor
phones = df[df['Product'].str.contains('Phone')]
print(f"\nRows containing 'Phone': {len(phones)}")
# Also available: .str.upper(), .str.startswith('Lap'), .str.len(), ...

5. Data Manipulation#

Adding and Modifying Columns#

# Add new column
df['Revenue'] = df['Price'] * df['Quantity']

# Conditional column
df['Price_Category'] = pd.cut(df['Price'], 
                               bins=[0, 500, 1000, 2000],
                               labels=['Low', 'Medium', 'High'])

# Apply a function element-wise (for simple math, df['Price'] * 0.9 is faster)
df['Discount_Price'] = df['Price'].apply(lambda x: x * 0.9)

# Using np.where (vectorized if-else)
df['Quality'] = np.where(df['Rating'] >= 4.5, 'Excellent', 'Good')

print(df[['Product', 'Price', 'Price_Category', 'Quality']].head())

Sorting Data#

# Sort by single column
sorted_price = df.sort_values('Price', ascending=False)
print("Top 5 by price:")
print(sorted_price[['Product', 'Price']].head())

# Sort by multiple columns
sorted_multi = df.sort_values(['Product', 'Price'], ascending=[True, False])
print("\nSorted by Product (asc), then Price (desc):")
print(sorted_multi[['Product', 'Price']].head())

6. Handling Missing Data#

# Create data with missing values
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, 5]
})

print("Original data:")
print(df_missing)

# Detect missing values
print("\nMissing values:")
print(df_missing.isnull().sum())

# Drop rows with any missing values
print("\nAfter dropna():")
print(df_missing.dropna())

# Fill missing values
print("\nFill with 0:")
print(df_missing.fillna(0))

# Fill with mean
df_filled = df_missing.copy()
df_filled['A'] = df_filled['A'].fillna(df_filled['A'].mean())  # assign back; inplace on a column is deprecated
print("\nFill column A with mean:")
print(df_filled)

# Forward fill (propagate last valid value)
print("\nForward fill:")
print(df_missing.ffill())  # fillna(method='ffill') is deprecated; use ffill()

7. Grouping and Aggregation#

# Group by single column
product_stats = df.groupby('Product')['Revenue'].sum().sort_values(ascending=False)
print("Total Revenue by Product:")
print(product_stats)

# Multiple aggregations
summary = df.groupby('Product').agg({
    'Price': ['mean', 'min', 'max'],
    'Quantity': 'sum',
    'Rating': 'mean'
})
print("\nDetailed summary:")
print(summary)

# Group by multiple columns
multi_group = df.groupby(['Product', 'Price_Category'], observed=False).size()  # explicit observed= avoids a FutureWarning with categoricals
print("\nCount by Product and Price Category:")
print(multi_group)

# Custom aggregation
def price_range(x):
    return x.max() - x.min()

price_ranges = df.groupby('Product')['Price'].agg(price_range)
print("\nPrice Range by Product:")
print(price_ranges)
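
The dict-style agg above produces MultiIndex columns, which can be awkward to work with. If you prefer flat column names, pandas also supports named aggregation; a quick sketch on the same data:

flat_summary = df.groupby('Product').agg(
    avg_price=('Price', 'mean'),     # new_column=(source_column, function)
    total_qty=('Quantity', 'sum'),
    avg_rating=('Rating', 'mean')
)
print("\nNamed aggregation (flat columns):")
print(flat_summary)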

8. Merging and Joining DataFrames#

# Create sample dataframes
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [10, 20, 10, 30]
})

departments = pd.DataFrame({
    'dept_id': [10, 20, 30],
    'dept_name': ['Engineering', 'Marketing', 'Sales']
})

# Inner join (default)
inner = pd.merge(employees, departments, on='dept_id')
print("Inner join:")
print(inner)

# Left join
left = pd.merge(employees, departments, on='dept_id', how='left')
print("\nLeft join:")
print(left)

# Concatenate vertically
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
combined = pd.concat([df1, df2], ignore_index=True)
print("\nConcatenated:")
print(combined)

9. Pivot Tables and Reshaping#

# Create sales data
sales = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=12, freq='ME'),  # 'ME' = month-end; plain 'M' is deprecated in recent pandas
    'Product': ['A', 'B'] * 6,
    'Region': ['North', 'South', 'North', 'South'] * 3,
    'Sales': np.random.randint(100, 1000, 12)
})

print("Sales data:")
print(sales.head())

# Pivot table
pivot = pd.pivot_table(sales, 
                       values='Sales',
                       index='Product',
                       columns='Region',
                       aggfunc='sum')
print("\nPivot table:")
print(pivot)

# Melt (unpivot)
melted = pivot.reset_index().melt(id_vars='Product', 
                                   var_name='Region',
                                   value_name='Sales')
print("\nMelted back:")
print(melted)

10. Time Series Basics#

# Create time series data
dates = pd.date_range('2024-01-01', periods=100, freq='D')
ts = pd.Series(np.random.randn(100).cumsum(), index=dates)

print("Time series (first 5):")
print(ts.head())

# Date-based selection
print("\nJanuary data:")
print(ts.loc['2024-01'].head())  # use .loc for partial date-string selection

# Resampling
weekly = ts.resample('W').mean()
print("\nWeekly average (first 3):")
print(weekly.head(3))

# Rolling window
rolling_mean = ts.rolling(window=7).mean()
print("\n7-day rolling mean (first 10):")
print(rolling_mean.head(10))

11. Data Visualization with Pandas#

import matplotlib.pyplot as plt

# Bar chart
product_stats.plot(kind='bar', figsize=(10, 6), color='skyblue')
plt.title('Total Revenue by Product')
plt.ylabel('Revenue ($)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Histogram
df['Price'].hist(bins=20, figsize=(10, 6), edgecolor='black')
plt.title('Price Distribution')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

# Box plot
df.boxplot(column='Price', by='Product', figsize=(10, 6))
plt.title('Price Distribution by Product')
plt.suptitle('')  # Remove default title
plt.tight_layout()
plt.show()

# Scatter plot
df.plot.scatter(x='Price', y='Rating', figsize=(10, 6), alpha=0.5)
plt.title('Price vs Rating')
plt.tight_layout()
plt.show()

12. Real-World Example: Sales Analysis#

# Create realistic sales dataset
np.random.seed(42)
n_records = 500

sales_df = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=n_records, freq='h'),  # lowercase 'h'; 'H' is deprecated in recent pandas
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], n_records),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),
    'Salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana'], n_records),
    'Quantity': np.random.randint(1, 5, n_records),
    'Unit_Price': np.random.randint(200, 1500, n_records)
})

# Calculate revenue
sales_df['Revenue'] = sales_df['Quantity'] * sales_df['Unit_Price']

print("Sales Dataset:")
print(sales_df.head())

# Analysis 1: Top products by revenue
top_products = sales_df.groupby('Product')['Revenue'].sum().sort_values(ascending=False)
print("\nTop Products by Revenue:")
print(top_products)

# Analysis 2: Best performing regions
region_performance = sales_df.groupby('Region').agg({
    'Revenue': 'sum',
    'Quantity': 'sum',
    'Salesperson': 'count'
}).rename(columns={'Salesperson': 'Transactions'})
print("\nRegion Performance:")
print(region_performance)

# Analysis 3: Top salespeople
salesperson_stats = sales_df.groupby('Salesperson').agg({
    'Revenue': ['sum', 'mean'],
    'Quantity': 'sum'
}).round(2)
print("\nSalesperson Performance:")
print(salesperson_stats)

# Analysis 4: Daily trends
sales_df['Date_Only'] = sales_df['Date'].dt.date
daily_revenue = sales_df.groupby('Date_Only')['Revenue'].sum()
print("\nDaily Revenue (first 5 days):")
print(daily_revenue.head())

Exercises#

Exercise 1: Student Performance Analysis#

Create and analyze a student grades dataset:

  1. Create DataFrame with students, subjects, and scores

  2. Calculate average score per student

  3. Find top 3 students

  4. Calculate average score per subject

  5. Identify students scoring below 70 in any subject

# Your code here

Exercise 2: E-commerce Data Cleaning#

Given messy e-commerce data:

  1. Handle missing values appropriately

  2. Remove duplicates

  3. Convert data types correctly

  4. Create new calculated columns

  5. Export cleaned data

# Your code here

Exercise 3: Sales Dashboard#

Create a mini sales dashboard:

  1. Load sales data

  2. Calculate key metrics (total revenue, avg order value, etc.)

  3. Create visualizations (revenue trend, top products, regional breakdown)

  4. Generate summary report

# Your code here

Key Takeaways#

✅ DataFrames are 2D labeled data structures (like Excel spreadsheets)

✅ Read/write data in many formats (CSV, Excel, JSON, SQL)

✅ loc/iloc for label-based and integer-based indexing

✅ Boolean indexing for powerful filtering

✅ GroupBy for aggregation and summary statistics

✅ Merge/join to combine datasets

✅ Handle missing data with dropna(), fillna()

✅ Vectorized operations are fast - avoid loops!

✅ Built-in plotting for quick visualizations

✅ Method chaining creates readable data pipelines (see the sketch below)
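
Method chaining means stringing operations together so each step feeds the next without intermediate variables. A small sketch of the pattern, reusing the product DataFrame from section 3:

top_laptops = (
    df[df['Product'] == 'Laptop']                            # filter rows
    .assign(Revenue=lambda d: d['Price'] * d['Quantity'])    # derive a column
    .sort_values('Revenue', ascending=False)                 # order
    .head(3)                                                 # keep top 3
)
print(top_laptops[['Product', 'Price', 'Quantity', 'Revenue']])

Wrapping the chain in parentheses lets each step sit on its own line, which is what keeps the pipeline readable.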

Pro Tips#

💡 Avoid inplace=True - it's safer to assign results to a new variable

💡 Check dtypes early - Wrong types cause unexpected behavior

💡 copy() when needed - Avoid unintended modifications

💡 Use query() for complex filters: df.query('Price > 100 & Rating >= 4')

💡 Vectorize operations - Much faster than loops

💡 Use categorical dtype for repeated string values (saves memory)

💡 Profile memory usage - df.memory_usage(deep=True)

💡 Read in chunks for large files: pd.read_csv('big.csv', chunksize=10000)

💡 Use .loc for setting values - Avoids SettingWithCopyWarning

💡 For interactive profiling reports, try the ydata-profiling package (formerly pandas-profiling)
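
A few of these tips in code. This is a sketch against the product DataFrame from section 3; 'big.csv' is a hypothetical file, shown commented out:

# query(): complex filters without repeating the DataFrame name
picks = df.query('Price > 100 and Rating >= 4')
print(f"query() matches: {len(picks)} rows")

# Categorical dtype: saves memory for repeated string values
before = df['Product'].memory_usage(deep=True)
df['Product'] = df['Product'].astype('category')
after = df['Product'].memory_usage(deep=True)
print(f"Product column memory: {before} -> {after} bytes")

# Chunked reading for files larger than memory (hypothetical file)
# for chunk in pd.read_csv('big.csv', chunksize=10000):
#     process(chunk)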

Common Mistakes to Avoid#

โŒ Chained indexing - df[df.A > 0][df.B < 5] = value โœ… Use Boolean mask: df.loc[(df.A > 0) & (df.B < 5)] = value

โŒ Iterating with loops - Very slow โœ… Use vectorized operations or apply()

โŒ Modifying while iterating - Causes errors โœ… Create new DataFrame or use copy()

โŒ Not handling missing data - NaN breaks calculations โœ… Use dropna(), fillna(), or handle explicitly

โŒ Forgetting axis parameter - axis=0 (rows), axis=1 (columns) โœ… Always check which axis youโ€™re operating on

Next Steps#

You now have solid Pandas fundamentals! Next topics:

  1. Advanced Pandas - MultiIndex, window functions, performance optimization

  2. Data Visualization - Seaborn, Plotly for advanced plots

  3. SQL with Pandas - Read from databases, write queries

  4. Big Data - Dask, PySpark for datasets larger than RAM

  5. Data Science Projects - End-to-end analysis workflows

Practice Projects:

  • Analyze COVID-19 data

  • Build a stock market analyzer

  • Create a customer segmentation analysis

  • Analyze your own data (finance, health, etc.)

Resources:

  • Official pandas documentation: https://pandas.pydata.org/docs/

  • "10 Minutes to pandas" user guide: https://pandas.pydata.org/docs/user_guide/10min.html

Pandas is the foundation of data analysis in Python - master it! 🚀