Python Data Analysis — Pandas & NumPy
Pandas is the core library for data analysis in Python. NumPy provides fast numerical computing. Together they form the foundation of the Python data science ecosystem.
Learning Objectives
- Load and explore datasets with pandas
- Filter, group, and aggregate data
- Handle missing data effectively
- Create visualizations with matplotlib and seaborn
- Apply real-world data processing patterns
What is Pandas?
Pandas provides two main data structures:
# Series — one-dimensional labeled array
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# a 10
# b 20
# c 30
# d 40
# DataFrame — two-dimensional labeled table (like a spreadsheet)
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['NYC', 'LA', 'Chicago']
})
# name age city
# 0 Alice 25 NYC
# 1 Bob 30 LA
# 2 Charlie 35 Chicago
Pandas Data Structures
| Structure | Dimensions | Description | Use Case |
|---|---|---|---|
| Series | 1D | Labeled array | Single column, time series |
| DataFrame | 2D | Labeled table | Tabular data, spreadsheets |
| Panel | 3D | Labeled cube | Multi-dimensional (deprecated) |
Loading Data
import pandas as pd
# From CSV (most common)
df = pd.read_csv('data.csv')
# From CSV with options
df = pd.read_csv('data.csv',
index_col='id', # Set index column
parse_dates=['date'], # Parse dates
na_values=['N/A', ''], # Custom NA values
dtype={'age': int}, # Column types
usecols=['name', 'age'] # Load only specific columns
)
# From Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# From JSON
df = pd.read_json('data.json')
# From database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM users', conn)
# From parquet (fast, columnar format)
df = pd.read_parquet('data.parquet')
# Quick overview of your data
print(df.head()) # First 5 rows
print(df.info()) # Column types and missing values
print(df.describe()) # Statistical summary
print(df.shape) # (rows, columns)
print(df.columns) # Column names
print(df.dtypes) # Data types
print(df.memory_usage(deep=True)) # Memory usage per column
Selection and Filtering
# Select a single column
names = df['name']
# Select multiple columns
subset = df[['name', 'age']]
# Filter rows with boolean indexing
adults = df[df['age'] >= 18]
ny_residents = df[df['city'] == 'NYC']
# Multiple conditions (use & for AND, | for OR)
filtered = df[(df['age'] > 25) & (df['city'] == 'NYC')]
any_condition = df[(df['age'] < 25) | (df['city'] == 'LA')]
# Query method (SQL-like syntax)
result = df.query('age > 25 and city == "NYC"')
# loc — label-based selection
row = df.loc[0] # First row
subset = df.loc[0:5, 'name':'age'] # Rows 0-5, columns name to age
# iloc — integer-based selection
first_row = df.iloc[0]
first_three = df.iloc[:3]
# Filter with isin
ny_or_la = df[df['city'].isin(['NYC', 'LA'])]
# Filter with string methods
a_names = df[df['name'].str.startswith('A')]
contains_a = df[df['name'].str.contains('a', case=False)]
# Filter with between
mid_age = df[df['age'].between(25, 35)]
GroupBy and Aggregation
# Group by a column and aggregate
city_stats = df.groupby('city').agg({
'salary': ['mean', 'median', 'std', 'min', 'max'],
'age': ['count', 'mean']
})
# Multiple groupby
dept_city = df.groupby(['department', 'city']).agg({
'salary': 'mean',
'employee_id': 'count'
}).rename(columns={'employee_id': 'headcount'})
# Apply custom aggregation
def salary_range(group):
return group.max() - group.min()
df.groupby('department')['salary'].agg(salary_range)
# Pivot tables (cross-tabulation)
pivot = df.pivot_table(
values='salary',
index='department',
columns='city',
aggfunc='mean',
fill_value=0 # Fill missing with 0
)
# Value counts
city_counts = df['city'].value_counts()
print(city_counts)
# Cumulative operations
df['salary_cumsum'] = df.groupby('department')['salary'].cumsum()
# Rolling window
df['salary_rolling_avg'] = df['salary'].rolling(window=3).mean()
Merging and Joining
# Two DataFrames
employees = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie'],
'dept_id': [101, 102, 101]
})
departments = pd.DataFrame({
'dept_id': [101, 102, 103],
'dept_name': ['Engineering', 'Marketing', 'Sales']
})
# Inner join (only matching rows)
merged = pd.merge(employees, departments, on='dept_id', how='inner')
# Left join (keep all from left)
merged = pd.merge(employees, departments, on='dept_id', how='left')
# Right join (keep all from right)
merged = pd.merge(employees, departments, on='dept_id', how='right')
# Outer join (keep all)
merged = pd.merge(employees, departments, on='dept_id', how='outer')
# Concatenate DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Stack vertically
combined = pd.concat([df1, df2], ignore_index=True)
# Stack horizontally
combined = pd.concat([df1, df2], axis=1)
Handling Missing Data
# Check for missing values
print(df.isnull().sum()) # Count per column
print(df.isnull().sum().sum()) # Total missing
print(df.isnull().mean() * 100) # Percentage missing
# Drop missing values
df_clean = df.dropna() # Drop any row with missing
df_clean = df.dropna(subset=['age']) # Drop only if 'age' is missing
df_clean = df.dropna(thresh=3) # Keep rows with at least 3 non-null
# Fill missing values
df['age'].fillna(df['age'].median(), inplace=True) # Fill with median
df['city'].fillna('Unknown', inplace=True) # Fill with constant
df['salary'].fillna(method='ffill', inplace=True) # Forward fill
df['salary'].fillna(method='bfill', inplace=True) # Backward fill
# Interpolate (for time series)
df['temperature'].interpolate(method='linear', inplace=True)
# Drop columns with too many missing values
threshold = len(df) * 0.5 # 50% threshold
df = df.dropna(axis=1, thresh=threshold)
Data Transformation
# Apply function to column
df['salary_monthly'] = df['salary'] / 12
# Apply function to rows
df['full_name'] = df.apply(
lambda row: f"{row['first_name']} {row['last_name']}", axis=1
)
# Map values
grade_map = {'A': 4, 'B': 3, 'C': 2, 'D': 1, 'F': 0}
df['gpa'] = df['grade'].map(grade_map)
# Cut into bins
df['age_group'] = pd.cut(
df['age'],
bins=[0, 18, 35, 50, 100],
labels=['Youth', 'Young Adult', 'Middle Age', 'Senior']
)
# Get dummies (one-hot encoding)
dummies = pd.get_dummies(df['city'], prefix='city')
df = pd.concat([df, dummies], axis=1)
# Rename columns
df = df.rename(columns={'old_name': 'new_name'})
# Type conversion
df['date'] = pd.to_datetime(df['date'])
df['price'] = df['price'].astype(float)
# String operations
df['name_upper'] = df['name'].str.upper()
df['name_clean'] = df['name'].str.strip().str.lower()
Data Cleaning
# Remove duplicates
df = df.drop_duplicates()
df = df.drop_duplicates(subset=['name', 'email']) # Based on specific columns
# Fix data types
df['price'] = pd.to_numeric(df['price'], errors='coerce') # Invalid -> NaN
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# Handle outliers with IQR
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Remove outliers
df_clean = df[(df['salary'] >= lower_bound) & (df['salary'] <= upper_bound)]
# Or cap outliers
df['salary'] = df['salary'].clip(lower=lower_bound, upper=upper_bound)
# Clean text data
df['email'] = df['email'].str.strip().str.lower()
df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)
Pivot Tables and Reshaping
# Pivot table
pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
margins=True, # Add row/column totals
fill_value=0
)
# Melt (wide to long)
melted = pd.melt(df, id_vars=['name'], value_vars=['math', 'science', 'english'])
# Stack/unstack
stacked = df.set_index(['region', 'product']).unstack(level='product')
# Cross tabulation
cross_tab = pd.crosstab(df['region'], df['product'], margins=True)
Basic Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram
df['salary'].hist(bins=20, edgecolor='black', figsize=(10, 6))
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Count')
plt.savefig('salary_dist.png')
plt.close()
# Bar chart
city_counts = df['city'].value_counts()
city_counts.plot(kind='bar', figsize=(10, 6))
plt.title('Employees by City')
plt.xlabel('City')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('city_counts.png')
plt.close()
# Scatter plot
df.plot.scatter(x='age', y='salary', alpha=0.5, figsize=(10, 6))
plt.title('Age vs Salary')
plt.savefig('age_salary.png')
plt.close()
# Line plot (time series)
df.set_index('date')['sales'].plot(figsize=(12, 6))
plt.title('Sales Over Time')
plt.savefig('sales_trend.png')
plt.close()
# Correlation heatmap
corr = df.select_dtypes(include='number').corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation.png')
plt.close()
# Box plot
df.boxplot(column='salary', by='department', figsize=(10, 6))
plt.title('Salary by Department')
plt.savefig('salary_boxplot.png')
plt.close()
# Subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
df['salary'].hist(ax=axes[0, 0], bins=20)
axes[0, 0].set_title('Salary Distribution')
df['age'].hist(ax=axes[0, 1], bins=20)
axes[0, 1].set_title('Age Distribution')
df.plot.scatter(x='age', y='salary', ax=axes[1, 0])
axes[1, 0].set_title('Age vs Salary')
df['city'].value_counts().plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_title('Employees by City')
plt.tight_layout()
plt.savefig('dashboard.png')
plt.close()
Real-World Example: Analyzing Sales Data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load dataset
df = pd.read_csv('sales_data.csv')
# Quick exploration
print(f"Dataset shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nBasic stats:\n{df.describe()}")
# Clean data
df['date'] = pd.to_datetime(df['date'])
df['revenue'] = df['quantity'] * df['price']
df = df.drop_duplicates()
# Feature engineering
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
# Analysis 1: Revenue by product
product_revenue = df.groupby('product')['revenue'].sum().sort_values(ascending=False)
print(f"\nRevenue by Product:\n{product_revenue}")
# Analysis 2: Monthly trend
monthly = df.groupby('month')['revenue'].sum()
# Analysis 3: Top customers
top_customers = df.groupby('customer_id')['revenue'].sum().nlargest(10)
# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Product revenue
product_revenue.plot(kind='bar', ax=axes[0, 0], color='steelblue')
axes[0, 0].set_title('Revenue by Product')
axes[0, 0].set_ylabel('Revenue ($)')
# Monthly trend
monthly.plot(kind='line', ax=axes[0, 1], marker='o', color='green')
axes[0, 1].set_title('Monthly Revenue Trend')
axes[0, 1].set_ylabel('Revenue ($)')
# Top customers
top_customers.plot(kind='barh', ax=axes[1, 0], color='coral')
axes[1, 0].set_title('Top 10 Customers')
# Day of week
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_sales = df.groupby('day_of_week')['revenue'].sum().reindex(day_order)
day_sales.plot(kind='bar', ax=axes[1, 1], color='purple')
axes[1, 1].set_title('Revenue by Day of Week')
plt.tight_layout()
plt.savefig('sales_analysis.png', dpi=150)
plt.show()
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Not checking dtypes | Unexpected behavior | Always check df.dtypes |
| Modifying original data | Unexpected side effects | Use .copy() or method chaining |
| Not handling missing values | Errors in calculations | Check df.isnull().sum() first |
Using inplace=True everywhere | Hard to chain operations | Use method chaining instead |
Forgetting reset_index() | Misaligned index after groupby | Reset index when needed |
| Ignoring duplicates | Skewed results | Check and remove duplicates |
Key Takeaways
pd.read_csv()is the most common data loading method- Use boolean indexing for filtering:
df[df['col'] > value] groupby()is powerful for aggregation- Always check for missing data first with
df.isnull().sum() - Use
describe()for quick statistical overview - pandas integrates well with matplotlib and seaborn
- Use
inplace=Trueto modify DataFrame directly - Method chaining keeps transformations readable
- Use
.copy()to avoid SettingWithCopyWarning - Prefer vectorized operations over
apply()for performance