Python Data Analysis — Pandas & NumPy

Pandas is the core library for data analysis in Python. NumPy provides fast numerical computing. Together they form the foundation of the Python data science ecosystem.

Learning Objectives

Load and explore datasets with pandas
Filter, group, and aggregate data
Handle missing data effectively
Create visualizations with matplotlib and seaborn
Apply real-world data processing patterns

What is Pandas?

Pandas provides two main data structures:

# Series — one-dimensional labeled array
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# a    10
# b    20
# c    30
# d    40

# DataFrame — two-dimensional labeled table (like a spreadsheet)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})
#      name  age     city
# 0    Alice   25      NYC
# 1      Bob   30       LA
# 2  Charlie   35  Chicago

Pandas Data Structures

Structure	Dimensions	Description	Use Case
Series	1D	Labeled array	Single column, time series
DataFrame	2D	Labeled table	Tabular data, spreadsheets
Panel	3D	Labeled cube	Multi-dimensional (deprecated)

Loading Data

import pandas as pd

# From CSV (most common)
df = pd.read_csv('data.csv')

# From CSV with options
df = pd.read_csv('data.csv',
    index_col='id',           # Set index column
    parse_dates=['date'],     # Parse dates
    na_values=['N/A', ''],    # Custom NA values
    dtype={'age': int},       # Column types
    usecols=['name', 'age']   # Load only specific columns
)

# From Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# From JSON
df = pd.read_json('data.json')

# From database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM users', conn)

# From parquet (fast, columnar format)
df = pd.read_parquet('data.parquet')

# Quick overview of your data
print(df.head())       # First 5 rows
print(df.info())       # Column types and missing values
print(df.describe())   # Statistical summary
print(df.shape)        # (rows, columns)
print(df.columns)      # Column names
print(df.dtypes)       # Data types
print(df.memory_usage(deep=True))  # Memory usage per column

Selection and Filtering

# Select a single column
names = df['name']

# Select multiple columns
subset = df[['name', 'age']]

# Filter rows with boolean indexing
adults = df[df['age'] >= 18]
ny_residents = df[df['city'] == 'NYC']

# Multiple conditions (use & for AND, | for OR)
filtered = df[(df['age'] > 25) & (df['city'] == 'NYC')]
any_condition = df[(df['age'] < 25) | (df['city'] == 'LA')]

# Query method (SQL-like syntax)
result = df.query('age > 25 and city == "NYC"')

# loc — label-based selection
row = df.loc[0]                    # First row
subset = df.loc[0:5, 'name':'age'] # Rows 0-5, columns name to age

# iloc — integer-based selection
first_row = df.iloc[0]
first_three = df.iloc[:3]

# Filter with isin
ny_or_la = df[df['city'].isin(['NYC', 'LA'])]

# Filter with string methods
a_names = df[df['name'].str.startswith('A')]
contains_a = df[df['name'].str.contains('a', case=False)]

# Filter with between
mid_age = df[df['age'].between(25, 35)]

GroupBy and Aggregation

# Group by a column and aggregate
city_stats = df.groupby('city').agg({
    'salary': ['mean', 'median', 'std', 'min', 'max'],
    'age': ['count', 'mean']
})

# Multiple groupby
dept_city = df.groupby(['department', 'city']).agg({
    'salary': 'mean',
    'employee_id': 'count'
}).rename(columns={'employee_id': 'headcount'})

# Apply custom aggregation
def salary_range(group):
    return group.max() - group.min()

df.groupby('department')['salary'].agg(salary_range)

# Pivot tables (cross-tabulation)
pivot = df.pivot_table(
    values='salary',
    index='department',
    columns='city',
    aggfunc='mean',
    fill_value=0  # Fill missing with 0
)

# Value counts
city_counts = df['city'].value_counts()
print(city_counts)

# Cumulative operations
df['salary_cumsum'] = df.groupby('department')['salary'].cumsum()

# Rolling window
df['salary_rolling_avg'] = df['salary'].rolling(window=3).mean()

Merging and Joining

# Two DataFrames
employees = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'dept_id': [101, 102, 101]
})

departments = pd.DataFrame({
    'dept_id': [101, 102, 103],
    'dept_name': ['Engineering', 'Marketing', 'Sales']
})

# Inner join (only matching rows)
merged = pd.merge(employees, departments, on='dept_id', how='inner')

# Left join (keep all from left)
merged = pd.merge(employees, departments, on='dept_id', how='left')

# Right join (keep all from right)
merged = pd.merge(employees, departments, on='dept_id', how='right')

# Outer join (keep all)
merged = pd.merge(employees, departments, on='dept_id', how='outer')

# Concatenate DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Stack vertically
combined = pd.concat([df1, df2], ignore_index=True)

# Stack horizontally
combined = pd.concat([df1, df2], axis=1)

Handling Missing Data

# Check for missing values
print(df.isnull().sum())           # Count per column
print(df.isnull().sum().sum())     # Total missing
print(df.isnull().mean() * 100)    # Percentage missing

# Drop missing values
df_clean = df.dropna()                    # Drop any row with missing
df_clean = df.dropna(subset=['age'])      # Drop only if 'age' is missing
df_clean = df.dropna(thresh=3)            # Keep rows with at least 3 non-null

# Fill missing values
df['age'].fillna(df['age'].median(), inplace=True)   # Fill with median
df['city'].fillna('Unknown', inplace=True)            # Fill with constant
df['salary'].fillna(method='ffill', inplace=True)    # Forward fill
df['salary'].fillna(method='bfill', inplace=True)    # Backward fill

# Interpolate (for time series)
df['temperature'].interpolate(method='linear', inplace=True)

# Drop columns with too many missing values
threshold = len(df) * 0.5  # 50% threshold
df = df.dropna(axis=1, thresh=threshold)

Data Transformation

# Apply function to column
df['salary_monthly'] = df['salary'] / 12

# Apply function to rows
df['full_name'] = df.apply(
    lambda row: f"{row['first_name']} {row['last_name']}", axis=1
)

# Map values
grade_map = {'A': 4, 'B': 3, 'C': 2, 'D': 1, 'F': 0}
df['gpa'] = df['grade'].map(grade_map)

# Cut into bins
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 35, 50, 100],
    labels=['Youth', 'Young Adult', 'Middle Age', 'Senior']
)

# Get dummies (one-hot encoding)
dummies = pd.get_dummies(df['city'], prefix='city')
df = pd.concat([df, dummies], axis=1)

# Rename columns
df = df.rename(columns={'old_name': 'new_name'})

# Type conversion
df['date'] = pd.to_datetime(df['date'])
df['price'] = df['price'].astype(float)

# String operations
df['name_upper'] = df['name'].str.upper()
df['name_clean'] = df['name'].str.strip().str.lower()

Data Cleaning

# Remove duplicates
df = df.drop_duplicates()
df = df.drop_duplicates(subset=['name', 'email'])  # Based on specific columns

# Fix data types
df['price'] = pd.to_numeric(df['price'], errors='coerce')  # Invalid -> NaN
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Handle outliers with IQR
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df_clean = df[(df['salary'] >= lower_bound) & (df['salary'] <= upper_bound)]

# Or cap outliers
df['salary'] = df['salary'].clip(lower=lower_bound, upper=upper_bound)

# Clean text data
df['email'] = df['email'].str.strip().str.lower()
df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)

Pivot Tables and Reshaping

# Pivot table
pivot = df.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    margins=True,           # Add row/column totals
    fill_value=0
)

# Melt (wide to long)
melted = pd.melt(df, id_vars=['name'], value_vars=['math', 'science', 'english'])

# Stack/unstack
stacked = df.set_index(['region', 'product']).unstack(level='product')

# Cross tabulation
cross_tab = pd.crosstab(df['region'], df['product'], margins=True)

Basic Visualization

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram
df['salary'].hist(bins=20, edgecolor='black', figsize=(10, 6))
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Count')
plt.savefig('salary_dist.png')
plt.close()

# Bar chart
city_counts = df['city'].value_counts()
city_counts.plot(kind='bar', figsize=(10, 6))
plt.title('Employees by City')
plt.xlabel('City')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('city_counts.png')
plt.close()

# Scatter plot
df.plot.scatter(x='age', y='salary', alpha=0.5, figsize=(10, 6))
plt.title('Age vs Salary')
plt.savefig('age_salary.png')
plt.close()

# Line plot (time series)
df.set_index('date')['sales'].plot(figsize=(12, 6))
plt.title('Sales Over Time')
plt.savefig('sales_trend.png')
plt.close()

# Correlation heatmap
corr = df.select_dtypes(include='number').corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation.png')
plt.close()

# Box plot
df.boxplot(column='salary', by='department', figsize=(10, 6))
plt.title('Salary by Department')
plt.savefig('salary_boxplot.png')
plt.close()

# Subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
df['salary'].hist(ax=axes[0, 0], bins=20)
axes[0, 0].set_title('Salary Distribution')
df['age'].hist(ax=axes[0, 1], bins=20)
axes[0, 1].set_title('Age Distribution')
df.plot.scatter(x='age', y='salary', ax=axes[1, 0])
axes[1, 0].set_title('Age vs Salary')
df['city'].value_counts().plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_title('Employees by City')
plt.tight_layout()
plt.savefig('dashboard.png')
plt.close()

Real-World Example: Analyzing Sales Data

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('sales_data.csv')

# Quick exploration
print(f"Dataset shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nBasic stats:\n{df.describe()}")

# Clean data
df['date'] = pd.to_datetime(df['date'])
df['revenue'] = df['quantity'] * df['price']
df = df.drop_duplicates()

# Feature engineering
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter

# Analysis 1: Revenue by product
product_revenue = df.groupby('product')['revenue'].sum().sort_values(ascending=False)
print(f"\nRevenue by Product:\n{product_revenue}")

# Analysis 2: Monthly trend
monthly = df.groupby('month')['revenue'].sum()

# Analysis 3: Top customers
top_customers = df.groupby('customer_id')['revenue'].sum().nlargest(10)

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Product revenue
product_revenue.plot(kind='bar', ax=axes[0, 0], color='steelblue')
axes[0, 0].set_title('Revenue by Product')
axes[0, 0].set_ylabel('Revenue ($)')

# Monthly trend
monthly.plot(kind='line', ax=axes[0, 1], marker='o', color='green')
axes[0, 1].set_title('Monthly Revenue Trend')
axes[0, 1].set_ylabel('Revenue ($)')

# Top customers
top_customers.plot(kind='barh', ax=axes[1, 0], color='coral')
axes[1, 0].set_title('Top 10 Customers')

# Day of week
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_sales = df.groupby('day_of_week')['revenue'].sum().reindex(day_order)
day_sales.plot(kind='bar', ax=axes[1, 1], color='purple')
axes[1, 1].set_title('Revenue by Day of Week')

plt.tight_layout()
plt.savefig('sales_analysis.png', dpi=150)
plt.show()

Common Mistakes

Mistake	Problem	Solution
Not checking dtypes	Unexpected behavior	Always check `df.dtypes`
Modifying original data	Unexpected side effects	Use `.copy()` or method chaining
Not handling missing values	Errors in calculations	Check `df.isnull().sum()` first
Using `inplace=True` everywhere	Hard to chain operations	Use method chaining instead
Forgetting `reset_index()`	Misaligned index after groupby	Reset index when needed
Ignoring duplicates	Skewed results	Check and remove duplicates

Key Takeaways

pd.read_csv() is the most common data loading method
Use boolean indexing for filtering: df[df['col'] > value]
groupby() is powerful for aggregation
Always check for missing data first with df.isnull().sum()
Use describe() for quick statistical overview
pandas integrates well with matplotlib and seaborn
Use inplace=True to modify DataFrame directly
Method chaining keeps transformations readable
Use .copy() to avoid SettingWithCopyWarning
Prefer vectorized operations over apply() for performance

Python Data Analysis — Pandas & NumPy

Python Data Analysis — Pandas & NumPy

Learning Objectives

What is Pandas?

Pandas Data Structures

Loading Data

Selection and Filtering

GroupBy and Aggregation

Merging and Joining

Handling Missing Data

Data Transformation

Data Cleaning

Pivot Tables and Reshaping

Basic Visualization

Real-World Example: Analyzing Sales Data

Common Mistakes

Key Takeaways

Premium Content

Need Expert Python Help?