Python Machine Learning — Getting Started
Machine learning lets computers learn from data and make predictions without being explicitly programmed. This tutorial covers the fundamentals with scikit-learn.
Learning Objectives
- Understand the ML workflow
- Build regression and classification models
- Evaluate model performance with proper metrics
- Avoid common ML pitfalls
- Apply feature engineering techniques
What is Machine Learning?
Traditional Programming:
Input + Rules ------► Output
(data + if/else) (result)
Machine Learning:
Input + Output ------► Rules
(data + labels) (learned model)
Example: Instead of writing rules to identify spam emails:
- Traditional: if "free money" in email: spam = True
- ML: Show the model 10,000 emails labeled "spam" or "not spam", and it learns the patterns itself.
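The contrast can be sketched in a few lines of plain Python. This is only a toy illustration: the word-counting "learner", the function names, and the four training emails are all hypothetical and stand in for a real model such as the scikit-learn classifiers used later.

```python
from collections import Counter


def rule_based_is_spam(email):
    """Traditional programming: a person writes the rule by hand."""
    return "free money" in email.lower()


def learn_word_counts(emails, labels):
    """Toy 'learning': count how often each word occurs in spam vs. non-spam."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in zip(emails, labels):
        target = spam_counts if is_spam else ham_counts
        target.update(text.lower().split())
    return spam_counts, ham_counts


def learned_is_spam(email, spam_counts, ham_counts):
    """Classify by comparing how 'spammy' the words were in the training data."""
    words = email.lower().split()
    spam_score = sum(spam_counts[w] for w in words)
    ham_score = sum(ham_counts[w] for w in words)
    return spam_score > ham_score


# Hypothetical labeled examples (True = spam)
training_emails = [
    "win a free prize now",
    "claim your free reward",
    "meeting moved to friday",
    "please review the report",
]
training_labels = [True, True, False, False]

spam_counts, ham_counts = learn_word_counts(training_emails, training_labels)

new_email = "free prize inside"
rule_result = rule_based_is_spam(new_email)
learned_result = learned_is_spam(new_email, spam_counts, ham_counts)
print(f"Rule-based: {rule_result}, learned: {learned_result}")
```

The hand-written rule misses the new email because it only looks for one exact phrase, while the learned counts pick up "free" and "prize" from the labeled examples.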
Types of Machine Learning
| Type | Goal | Example Algorithms |
|---|---|---|
| Supervised (Regression) | Predict continuous value | Linear Regression, Ridge |
| Supervised (Classification) | Predict category | Logistic Regression, SVM |
| Unsupervised (Clustering) | Find groups | K-Means, DBSCAN |
| Unsupervised (Dimensionality) | Reduce features | PCA, t-SNE |
| Reinforcement | Learn from actions | Q-Learning, Policy Gradient |
The ML Workflow
1. Collect Data
|
v
2. Clean & Prepare Data
|
v
3. Feature Engineering
|
v
4. Split into Train/Test Sets
|
v
5. Choose a Model
|
v
6. Train the Model
|
v
7. Evaluate Performance
|
v
8. Tune Hyperparameters
|
v
9. Deploy
Data Preparation and Splitting
Train/Test Split
from sklearn.model_selection import train_test_split
# Always split BEFORE fitting any preprocessing step (scalers, encoders, imputers)
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42, # Reproducible splits
stratify=y # Maintain class distribution (classification)
)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
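To see what `stratify=y` is meant to achieve, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; the helper name `stratified_split` and the 20/80 spam/ham labels are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) with similar class ratios in both."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same fraction from every class
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical imbalanced labels: 20% spam, 80% ham
labels = ["spam"] * 20 + ["ham"] * 80
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(f"Train class counts: {dict(train_counts)}")
print(f"Test class counts:  {dict(test_counts)}")
```

Both subsets keep the original 20/80 ratio, which a purely random split does not guarantee, especially for small or heavily imbalanced datasets.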
Cross-Validation
Cross-validation gives a more robust estimate of model performance than a single train/test split.
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np
# 5-fold cross-validation (assumes an unfitted estimator `model` plus X and y)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
# Stratified K-Fold (preserves class distribution in every fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
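The mechanics behind k-fold cross-validation can be sketched without any library. The helper below is a hypothetical, unshuffled version written for illustration. It shows how every sample lands in the test fold exactly once.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    indices = list(range(n_samples))

    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=5))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train={train} test={test}")
```

A model is trained and scored once per fold, and the reported score is the mean over the five test folds.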
Why Split First?
# BAD: Data leakage — test data influences training
scaler.fit(X) # Sees ALL data including test
X_train, X_test = train_test_split(X)
# GOOD: Split first, then preprocess
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler.fit(X_train) # Only training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use same scaler
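A small numeric sketch shows why the order matters. The five values below are made up, with the last one playing the role of a test-set outlier. The scaling is done by hand with the `statistics` module, using the population standard deviation as a stand-in for what a standard scaler computes.

```python
from statistics import mean, pstdev

# Hypothetical feature values; the last one ends up in the test set
values = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = values[:4], values[4:]


def standardize(x, center, scale):
    """Apply (x - mean) / std with externally supplied statistics."""
    return (x - center) / scale


# Leaky: statistics computed on ALL data, so the test outlier shifts them
leaky_mean, leaky_std = mean(values), pstdev(values)

# Clean: statistics computed on the training portion only
clean_mean, clean_std = mean(train), pstdev(train)

leaky_train = [standardize(x, leaky_mean, leaky_std) for x in train]
clean_train = [standardize(x, clean_mean, clean_std) for x in train]

print(f"Leaky mean: {leaky_mean:.1f}, clean mean: {clean_mean:.1f}")
print(f"Leaky training values: {[round(v, 2) for v in leaky_train]}")
print(f"Clean training values: {[round(v, 2) for v in clean_train]}")
```

With the leaky statistics, every training value is pushed below zero by a test sample the model should never have seen. With the clean statistics, the training data is centered as intended.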
Feature Engineering
Feature engineering transforms raw data into features that better represent the underlying problem.
import pandas as pd
import numpy as np
# Numeric features
df['log_income'] = np.log1p(df['income']) # Handle skewed data
df['age_squared'] = df['age'] ** 2 # Non-linear relationship
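# Categorical features: integer mapping (implies an order, so best kept for tree-based models)
df['city_encoded'] = df['city'].map({'NYC': 0, 'LA': 1, 'Chicago': 2})
# One-hot encoding (no implied order; safer for linear models)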
dummies = pd.get_dummies(df['city'], prefix='city')
df = pd.concat([df, dummies], axis=1)
# Binning continuous variables
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 100],
labels=['Youth', 'Young Adult', 'Middle Age', 'Senior'])
# Date features
df['day_of_week'] = pd.to_datetime(df['date']).dt.dayofweek
df['month'] = pd.to_datetime(df['date']).dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
# Interaction features
df['income_per_age'] = df['income'] / df['age']
Scaling Features
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# StandardScaler: mean=0, std=1 (most common)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# MinMaxScaler: scales to [0, 1]
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# RobustScaler: robust to outliers
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
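The three scalers differ only in which statistics they subtract and divide by. The following hand-rolled sketch, on made-up values with one outlier, is meant to show the formulas and is not a replacement for the scikit-learn classes. The robust variant assumes the common choice of median and interquartile range (25th to 75th percentile).

```python
from statistics import mean, median, pstdev, quantiles

# Hypothetical feature column with one large outlier
values = [1.0, 2.0, 3.0, 4.0, 100.0]

# Standard scaling: (x - mean) / standard deviation
mu, sigma = mean(values), pstdev(values)
standard = [(x - mu) / sigma for x in values]

# Min-max scaling: (x - min) / (max - min), giving values in [0, 1]
lo, hi = min(values), max(values)
min_max = [(x - lo) / (hi - lo) for x in values]

# Robust scaling: (x - median) / IQR, so the outlier barely affects the statistics
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
robust = [(x - median(values)) / iqr for x in values]

print(f"Standard: {[round(v, 2) for v in standard]}")
print(f"Min-max:  {[round(v, 2) for v in min_max]}")
print(f"Robust:   {[round(v, 2) for v in robust]}")
```

The outlier squeezes the first four values into a narrow band under standard and min-max scaling, while robust scaling keeps them evenly spread and leaves only the outlier far from zero.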
Regression: Predicting Numbers
Regression predicts continuous values (price, temperature, salary).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np
# Sample data: house size vs price
X = np.array([[800], [1000], [1200], [1400], [1600],
[1800], [2000], [2200], [2400], [2600]])
y = np.array([150000, 180000, 210000, 250000, 280000,
310000, 350000, 380000, 420000, 450000])
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.0f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.0f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.0f}")
# Predict new values
new_house = np.array([[1500]])
predicted_price = model.predict(new_house)
print(f"Predicted price for 1500 sqft: ${predicted_price[0]:,.0f}")
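For a single feature, the line that ordinary least squares fits has a closed form, which can be worked out by hand as a sanity check. The sketch below uses all ten house records from the example rather than only the training rows, so its numbers will differ somewhat from the fitted `model` above.

```python
# House sizes (sqft) and prices ($) copied from the example
sizes = [800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600]
prices = [150000, 180000, 210000, 250000, 280000,
          310000, 350000, 380000, 420000, 450000]

n = len(sizes)
mean_size = sum(sizes) / n
mean_price = sum(prices) / n

# Closed-form least squares for one feature:
#   slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
sum_xy = sum((x - mean_size) * (y - mean_price) for x, y in zip(sizes, prices))
sum_xx = sum((x - mean_size) ** 2 for x in sizes)
slope = sum_xy / sum_xx
intercept = mean_price - slope * mean_size

predicted_price = intercept + slope * 1500
print(f"Slope: ${slope:.2f} per sqft")
print(f"Intercept: ${intercept:,.2f}")
print(f"Predicted price for 1500 sqft: ${predicted_price:,.0f}")
```

The slope reads as the estimated price increase per additional square foot, which is the kind of interpretability that makes linear regression a good first model.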
Regression Algorithms Comparison
| Algorithm | When to Use | Pros | Cons |
|---|---|---|---|
| Linear Regression | Linear relationships | Interpretable, fast | Assumes linearity |
| Ridge | Many correlated features | Prevents overfitting | Needs alpha tuning |
| Lasso | Feature selection | Automatic feature selection | Can zero out features |
| ElasticNet | High-dimensional data | Combines Ridge + Lasso | Complex tuning |
| Random Forest | Non-linear data | Handles non-linearity | Less interpretable |
| Gradient Boosting | Best accuracy | High performance | Slower, needs tuning |
Classification: Predicting Categories
Classification predicts discrete labels (spam/not spam, cat/dog/bird).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# Load iris dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print(f"Features: {feature_names}")
print(f"Classes: {target_names}")
print(f"Samples: {len(X)}")
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_scaled, y_train)
# Predict
y_pred = clf.predict(X_test_scaled)
# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))
# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Feature importance
importances = clf.feature_importances_
for name, importance in zip(feature_names, importances):
print(f" {name}: {importance:.3f}")
Evaluation Metrics
Regression Metrics
| Metric | What it Measures | Good Value | Formula |
|---|---|---|---|
| R² Score | Variance explained | Close to 1.0 | 1 - (SS_res / SS_tot) |
| MSE | Average squared error | Close to 0 | mean((y - y_pred)²) |
| RMSE | Error in original units | Close to 0 | sqrt(MSE) |
| MAE | Average absolute error | Close to 0 | mean(abs(y - y_pred)) |
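The formulas in the table can be checked by hand on a tiny example. The four true and predicted values below are made up for illustration.

```python
from math import sqrt

# Hypothetical true values and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 10.0]

n = len(y_true)
residuals = [t - p for t, p in zip(y_true, y_pred)]

# MSE: mean of squared residuals
mse = sum(r ** 2 for r in residuals) / n

# RMSE: square root of MSE, back in the units of y
rmse = sqrt(mse)

# MAE: mean of absolute residuals
mae = sum(abs(r) for r in residuals) / n

# R²: 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse:.3f}  RMSE: {rmse:.3f}  MAE: {mae:.3f}  R²: {r2:.3f}")
```

RMSE is larger than MAE here because squaring gives the single error of 1.0 more weight than the two errors of 0.5.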
Classification Metrics
| Metric | What it Measures | Good Value | When to Use |
|---|---|---|---|
| Accuracy | Correct predictions / total | Close to 1.0 | Balanced classes |
| Precision | True positives / predicted positives | Close to 1.0 | Cost of false positive is high |
| Recall | True positives / actual positives | Close to 1.0 | Cost of false negative is high |
| F1 Score | Harmonic mean of precision and recall | Close to 1.0 | Imbalanced classes |
| AUC-ROC | Ranking quality | Close to 1.0 | Binary classification |
| Log Loss | Confidence of predictions | Close to 0 | Probabilistic output needed |
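A worked example makes the difference between these metrics concrete. The confusion-matrix counts below are hypothetical: 1,000 samples, of which only 10 are truly positive.

```python
# Hypothetical confusion-matrix counts for an imbalanced binary problem
tp = 5    # true positives: positives correctly flagged
fp = 5    # false positives: negatives wrongly flagged
fn = 5    # false negatives: positives that were missed
tn = 985  # true negatives: negatives correctly ignored

total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of everything flagged, how much was right
recall = tp / (tp + fn)      # of all real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1 Score:  {f1:.2%}")
```

Accuracy looks excellent at 99% even though the model finds only half of the positives, which is why precision, recall, and F1 are the more informative numbers when classes are imbalanced.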
When to Use Each Metric
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
# Balanced dataset: accuracy is fine
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
# Imbalanced dataset: use F1 or AUC
print(f"F1 Score: {f1_score(y_test, y_pred, average='weighted'):.2%}")
# The AUC, recall, and precision calls below assume a binary classifier `clf`.
# For multiclass targets such as iris, pass multi_class='ovr' to roc_auc_score
# (with the full predict_proba output) and average='macro' to the other two.
print(f"AUC: {roc_auc_score(y_test, clf.predict_proba(X_test_scaled)[:, 1]):.2%}")
# When false negatives are costly (e.g., disease detection)
# Maximize recall
print(f"Recall: {recall_score(y_test, y_pred):.2%}")
# When false positives are costly (e.g., spam filtering)
# Maximize precision
print(f"Precision: {precision_score(y_test, y_pred):.2%}")
Common ML Algorithms
| Algorithm | Type | Best For | Complexity |
|---|---|---|---|
| Linear Regression | Regression | Simple relationships | Low |
| Logistic Regression | Classification | Binary outcomes | Low |
| Decision Trees | Both | Interpretable models | Medium |
| Random Forest | Both | General purpose | Medium |
| SVM | Both | High-dimensional data | High |
| K-Nearest Neighbors | Both | Small datasets | Low |
| K-Means | Clustering | Grouping data | Medium |
| Gradient Boosting | Both | Best accuracy | High |
Avoiding Common Pitfalls
1. Data Leakage
# BAD: Using test data for training decisions
scaler.fit(X) # Fits on ALL data including test
X_train, X_test = train_test_split(X)
# GOOD: Split first, then fit
X_train, X_test = train_test_split(X)
scaler.fit(X_train) # Only fits on training data
X_test_scaled = scaler.transform(X_test)
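One way to make the split-then-fit rule hard to violate is to bundle the scaler and the model into a single scikit-learn pipeline. The sketch below assumes the iris dataset and a logistic regression model, both chosen only for illustration. During cross-validation the scaler is refit on the training portion of each fold, so the held-out fold never influences the scaling statistics.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaler and classifier travel together; fit() on the pipeline fits both in order
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Each of the 5 folds refits the scaler on that fold's training data only
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV Accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
```

Passing pre-scaled data to `cross_val_score` instead would reintroduce a mild form of the leak, because the scaler would already have seen every fold.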
2. Overfitting vs Underfitting
from sklearn.model_selection import cross_val_score
# Check for overfitting
train_accuracy = clf.score(X_train_scaled, y_train)
cv_scores = cross_val_score(clf, X_train_scaled, y_train, cv=5)
print(f"Training Accuracy: {train_accuracy:.2%}")
print(f"CV Accuracy: {cv_scores.mean():.2%}")
# If training >> CV, model is overfitting
# If both are low, model is underfitting
# Solutions for overfitting:
# 1. Reduce model complexity
# 2. Add regularization
# 3. Get more training data
# 4. Use feature selection
# Solutions for underfitting:
# 1. Use more complex model
# 2. Add more features
# 3. Reduce regularization
3. Class Imbalance
# Check class distribution
print(f"Class distribution: {np.bincount(y)}")
# Solutions for imbalance
# Option 1: Use class_weight parameter
clf = RandomForestClassifier(class_weight='balanced')
# Option 2: Resample the training set only (never the test set)
# Requires the separate imbalanced-learn package: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
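The `'balanced'` setting weights each class inversely to its frequency, using n_samples / (n_classes * class_count). A hand computation on a hypothetical 90/10 label split shows the effect.

```python
from collections import Counter

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y = [0] * 90 + [1] * 10

counts = Counter(y)
n_samples = len(y)
n_classes = len(counts)

# Balanced weighting: n_samples / (n_classes * count_of_that_class)
weights = {label: n_samples / (n_classes * count)
           for label, count in counts.items()}

for label in sorted(weights):
    print(f"Class {label}: count={counts[label]}, weight={weights[label]:.3f}")
```

Each minority-class sample counts nine times as much as a majority-class sample, so both classes contribute equally to the training loss overall.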
4. Not Enough Data
# Always check your dataset size
print(f"Samples: {len(X)}")
print(f"Features: {X.shape[1]}")
print(f"Samples per class: {np.bincount(y)}")
# Rule of thumb: need at least 10x more samples than features
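The rule of thumb is easy to turn into a quick check. The helper name `has_enough_samples` and the second dataset's shape are hypothetical, and the 10x ratio is a rough guide, not a guarantee.

```python
def has_enough_samples(n_samples, n_features, min_ratio=10):
    """Rough check: at least `min_ratio` samples per feature."""
    return n_samples >= min_ratio * n_features


# Iris: 150 samples, 4 features
iris_ok = has_enough_samples(150, 4)

# Hypothetical small dataset: 30 samples, 8 features
small_ok = has_enough_samples(30, 8)

print(f"Iris passes the rule of thumb: {iris_ok}")
print(f"Small dataset passes the rule of thumb: {small_ok}")
```

A dataset that fails the check is a candidate for simpler models, fewer features, or more data collection.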
Real-World Example: Predicting House Prices
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Load dataset
df = pd.read_csv('house_prices.csv')
# Feature engineering
df['total_area'] = df['sqft_living'] + df['sqft_lot']
df['bed_bath_ratio'] = df['bedrooms'] / (df['bathrooms'] + 1)
df['age'] = 2024 - df['year_built']
df['renovated'] = (df['yr_renovated'] > 0).astype(int)
# Select features
features = ['sqft_living', 'bedrooms', 'bathrooms', 'floors',
'age', 'renovated', 'total_area', 'bed_bath_ratio']
X = df[features]
y = df['price']
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train
model = GradientBoostingRegressor(n_estimators=200, max_depth=4, random_state=42)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)):,.0f}")
# Cross-validation
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='r2')
print(f"CV R² Score: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
Real-World Example: Spam Classification
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
# Sample data
emails = [
"Win a free iPhone now!", "Meeting at 3pm tomorrow",
"Congratulations! You won $1 million", "Please review the attached report",
"Get rich quick! Click here", "Team lunch on Friday",
"URGENT: Verify your account", "Project deadline extended",
"Buy cheap medications online", "Quarterly results attached"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] # 1=spam, 0=not spam
# Build pipeline (vectorizer + classifier)
pipeline = Pipeline([
('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
('classifier', MultinomialNB())
])
# Train
pipeline.fit(emails, labels)
# Predict
new_emails = ["Win free money now!", "Team meeting tomorrow"]
predictions = pipeline.predict(new_emails)
print(f"Predictions: {predictions}") # [1, 0]
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Not splitting data | Can't evaluate properly | Always train/test split |
| Using accuracy for imbalance | Misleading metric | Use F1, AUC, or precision/recall |
| Not scaling features | Poor performance | Scale for SVM, KNN, etc. |
| Overfitting | Great training, bad test | Detect with cross-validation; simplify or regularize the model |
| Not handling missing values | Errors or bias | Impute or remove |
| Ignoring outliers | Skewed model | Detect and handle outliers |
| Too many features | Curse of dimensionality | Feature selection |
Key Takeaways
- Always split data into train/test BEFORE fitting any preprocessing step (scalers, encoders, imputers)
- Scale features for algorithms sensitive to magnitude (SVM, KNN)
- Use cross-validation for robust evaluation
- Start simple, add complexity only when needed
- Check for class imbalance in classification problems
- Feature engineering often matters more than algorithm choice
- R² close to 1.0 on held-out data indicates a good regression fit
- Accuracy alone is misleading for imbalanced datasets
- Use pipelines to prevent data leakage
- Evaluate with multiple metrics, not just one