Python Machine Learning — Getting Started

Machine learning lets computers learn from data and make predictions without being explicitly programmed. This tutorial covers the fundamentals with scikit-learn.

Learning Objectives

Understand the ML workflow
Build regression and classification models
Evaluate model performance with proper metrics
Avoid common ML pitfalls
Apply feature engineering techniques

What is Machine Learning?

Traditional Programming:
  Input + Rules ------► Output
  (data + if/else)     (result)

Machine Learning:
  Input + Output ------► Rules
  (data + labels)       (learned model)

Example: Instead of writing rules to identify spam emails:

Traditional: if "free money" in email: spam = True
ML: Show the model 10,000 emails labeled "spam" or "not spam", and it learns the patterns itself.

Types of Machine Learning

Type	Goal	Example Algorithms
Supervised (Regression)	Predict continuous value	Linear Regression, Ridge
Supervised (Classification)	Predict category	Logistic Regression, SVM
Unsupervised (Clustering)	Find groups	K-Means, DBSCAN
Unsupervised (Dimensionality)	Reduce features	PCA, t-SNE
Reinforcement	Learn from actions	Q-Learning, Policy Gradient
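
The tutorial's later examples are all supervised, so a brief unsupervised sketch may help make the table concrete. The example below assumes synthetic data from `make_blobs` (an illustrative choice, not part of the original tutorial) and groups it with K-Means without ever seeing labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled data: 300 points drawn around 3 centers (illustrative only)
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# K-Means receives only X; there is no y to learn from
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X_blobs)

print(f"Cluster sizes: {np.bincount(cluster_ids)}")
print(f"Cluster centers shape: {kmeans.cluster_centers_.shape}")
```

In practice the number of clusters is rarely known in advance; `n_clusters=3` works here only because the data was generated with three centers.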

The ML Workflow

1. Collect Data
       |
       v
2. Clean & Prepare Data
       |
       v
3. Feature Engineering
       |
       v
4. Split into Train/Test Sets
       |
       v
5. Choose a Model
       |
       v
6. Train the Model
       |
       v
7. Evaluate Performance
       |
       v
8. Tune Hyperparameters
       |
       v
9. Deploy
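
Steps 4 through 8 of the workflow above can be sketched in a few lines. This is a minimal illustration, assuming the bundled breast cancer dataset stands in for collected and cleaned data and that logistic regression is a reasonable first model; the parameter grid is an arbitrary example, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-3: a bundled dataset stands in for collected, cleaned data
X, y = load_breast_cancer(return_X_y=True)

# Step 4: hold out a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 5: choose a model (scaling lives inside the pipeline)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Steps 6 and 8: train while searching over the regularization strength C
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Step 7: evaluate once on the held-out test set
test_accuracy = search.score(X_test, y_test)
print(f"Best C: {search.best_params_['clf__C']}")
print(f"Test accuracy: {test_accuracy:.2%}")
```

Tuning happens through cross-validation on the training set only, so the test set is touched exactly once, at the end.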

Data Preparation and Splitting

Train/Test Split

from sklearn.model_selection import train_test_split

# Always split BEFORE any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,       # 20% for testing
    random_state=42,     # Reproducible splits
    stratify=y           # Maintain class distribution (classification)
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

Cross-Validation

Cross-validation gives a more robust estimate of model performance than a single train/test split.

from sklearn.model_selection import cross_val_score, StratifiedKFold

# `model` is any scikit-learn classifier; X and y are the features and labels
# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")

# Stratified K-Fold (preserves class distribution in every fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

Why Split First?

# BAD: Data leakage — test data influences training
scaler.fit(X)  # Sees ALL data including test
X_train, X_test = train_test_split(X)

# GOOD: Split first, then preprocess
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler.fit(X_train)           # Only training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same scaler
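
The same split-first rule applies inside cross-validation: the scaler should be re-fit on each fold's training portion. One way to get this automatically is to wrap the scaler and model in a pipeline. The sketch below assumes the bundled wine dataset and a logistic regression model purely for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# The scaler is re-fit on the training portion of every fold,
# so the held-out fold never influences the scaling statistics
leak_free = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(leak_free, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
```

Passing pre-scaled data to `cross_val_score` would reintroduce the leak, because the scaler would already have seen every fold.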

Feature Engineering

Feature engineering transforms raw data into features that better represent the underlying problem.

import pandas as pd
import numpy as np

# Numeric features
df['log_income'] = np.log1p(df['income'])  # Handle skewed data
df['age_squared'] = df['age'] ** 2          # Non-linear relationship

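# Categorical features: integer (label) encoding
# Note: integer codes imply an order; for unordered categories such as cities, one-hot encoding is usually safer
df['city_encoded'] = df['city'].map({'NYC': 0, 'LA': 1, 'Chicago': 2})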

# One-hot encoding
dummies = pd.get_dummies(df['city'], prefix='city')
df = pd.concat([df, dummies], axis=1)

# Binning continuous variables
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 100],
                         labels=['Youth', 'Young Adult', 'Middle Age', 'Senior'])

# Date features
df['day_of_week'] = pd.to_datetime(df['date']).dt.dayofweek
df['month'] = pd.to_datetime(df['date']).dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Interaction features
df['income_per_age'] = df['income'] / df['age']
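
The snippet above assumes an existing `df`. To see several of the transformations run end to end, here is a small sketch on a hypothetical four-row DataFrame; the values are invented for illustration and the column names simply mirror those used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the snippet above
df = pd.DataFrame({
    "income": [30000, 52000, 250000, 41000],
    "age": [17, 29, 46, 63],
    "city": ["NYC", "LA", "Chicago", "NYC"],
    "date": ["2024-01-06", "2024-01-08", "2024-03-17", "2024-07-04"],
})

df["log_income"] = np.log1p(df["income"])  # compress the long right tail
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 50, 100],
                         labels=["Youth", "Young Adult", "Middle Age", "Senior"])

dates = pd.to_datetime(df["date"])          # parse once, reuse
df["day_of_week"] = dates.dt.dayofweek      # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)

# One-hot encode the city column
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)

print(df[["age", "age_group", "day_of_week", "is_weekend"]])
```

Note that `pd.cut` uses right-inclusive bins by default, so an age of exactly 18 falls in 'Youth'.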

Scaling Features

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# StandardScaler: mean=0, std=1 (most common)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# MinMaxScaler: scales to [0, 1]
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# RobustScaler: robust to outliers
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
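
To make the difference between the three scalers visible, the following sketch applies each to a single illustrative feature containing one extreme outlier. The values are made up for the demonstration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with a single extreme outlier (illustrative values)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

results = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    results[name] = scaler.fit_transform(values).ravel()
    print(f"{name:>14}: {np.round(results[name], 2)}")
```

MinMaxScaler squeezes the four ordinary values into a narrow band near 0 because the outlier defines the maximum, whereas RobustScaler centers on the median and scales by the interquartile range, so the ordinary values stay well spread.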

Regression: Predicting Numbers

Regression predicts continuous values (price, temperature, salary).

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np

# Sample data: house size vs price
X = np.array([[800], [1000], [1200], [1400], [1600],
              [1800], [2000], [2200], [2400], [2600]])
y = np.array([150000, 180000, 210000, 250000, 280000,
              310000, 350000, 380000, 420000, 450000])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.0f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.0f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.0f}")

# Predict new values
new_house = np.array([[1500]])
predicted_price = model.predict(new_house)
print(f"Predicted price for 1500 sqft: ${predicted_price[0]:,.0f}")

Regression Algorithms Comparison

Algorithm	When to Use	Pros	Cons
Linear Regression	Linear relationships	Interpretable, fast	Assumes linearity
Ridge	Many correlated features	Prevents overfitting	Needs alpha tuning
Lasso	Feature selection	Automatic feature selection	Can zero out features
ElasticNet	High-dimensional data	Combines Ridge + Lasso	Complex tuning
Random Forest	Non-linear data	Handles non-linearity	Less interpretable
Gradient Boosting	Best accuracy	High performance	Slower, needs tuning
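
A quick way to compare the linear variants in the table is to cross-validate them on the same data. The sketch below assumes synthetic data from `make_regression` in which only 5 of 20 features carry signal; the `alpha` values are arbitrary defaults, not tuned choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data: 20 features, only 5 of which carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
}
cv_r2 = {}
for name, reg in models.items():
    cv_r2[name] = cross_val_score(reg, X, y, cv=5, scoring="r2").mean()
    print(f"{name:>6} CV R²: {cv_r2[name]:.3f}")

# Lasso can drive coefficients to exactly zero, which acts as feature selection
lasso = Lasso(alpha=1.0).fit(X, y)
n_zero = int((lasso.coef_ == 0).sum())
print(f"Lasso zeroed {n_zero} of {X.shape[1]} coefficients")
```

On real data the ranking between these models depends heavily on the dataset and on tuning `alpha`, so treat this as a template for comparison rather than a verdict.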

Classification: Predicting Categories

Classification predicts discrete labels (spam/not spam, cat/dog/bird).

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load iris dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
target_names = iris.target_names

print(f"Features: {feature_names}")
print(f"Classes: {target_names}")
print(f"Samples: {len(X)}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

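# Scale features (tree-based models such as Random Forest do not need scaling; shown here for a consistent workflow)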
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_scaled, y_train)

# Predict
y_pred = clf.predict(X_test_scaled)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Feature importance
importances = clf.feature_importances_
for name, importance in zip(feature_names, importances):
    print(f"  {name}: {importance:.3f}")

Evaluation Metrics

Regression Metrics

Metric	What it Measures	Good Value	Formula
R² Score	Variance explained	Close to 1.0	1 - (SS_res / SS_tot)
MSE	Average squared error	Close to 0	mean((y - y_pred)²)
RMSE	Error in original units	Close to 0	sqrt(MSE)
MAE	Average absolute error	Close to 0	mean(|y - y_pred|)
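
The formulas in the table can be checked by computing each metric by hand with NumPy and comparing against scikit-learn. The true and predicted values below are illustrative numbers chosen to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative true and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 8.5])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)                    # mean((y - y_pred)^2)
rmse = np.sqrt(mse)                              # sqrt(MSE)
mae = np.mean(np.abs(residuals))                 # mean(|y - y_pred|)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                         # 1 - (SS_res / SS_tot)

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R²={r2:.3f}")

# The hand-computed values agree with scikit-learn
print(np.isclose(mse, mean_squared_error(y_true, y_hat)),
      np.isclose(mae, mean_absolute_error(y_true, y_hat)),
      np.isclose(r2, r2_score(y_true, y_hat)))
```

Because MSE squares the residuals, the single error of 1.0 contributes four times as much as each error of 0.5, which is why MSE and RMSE are more sensitive to large misses than MAE.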

Classification Metrics

Metric	What it Measures	Good Value	When to Use
Accuracy	Correct predictions / total	Close to 1.0	Balanced classes
Precision	True positives / predicted positives	Close to 1.0	Cost of false positive is high
Recall	True positives / actual positives	Close to 1.0	Cost of false negative is high
F1 Score	Harmonic mean of precision and recall	Close to 1.0	Imbalanced classes
AUC-ROC	Ranking quality	Close to 1.0	Binary classification
Log Loss	Confidence of predictions	Close to 0	Probabilistic output needed
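
Precision, recall, and F1 all derive from the four cells of a binary confusion matrix. The sketch below uses ten illustrative labels to compute them by hand and then confirms the results against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Illustrative binary labels: 1 = positive class
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Accuracy={accuracy:.2f} Precision={precision:.2f} "
      f"Recall={recall:.2f} F1={f1:.2f}")

# Cross-check against scikit-learn
print(np.isclose(precision, precision_score(y_true, y_hat)),
      np.isclose(recall, recall_score(y_true, y_hat)),
      np.isclose(f1, f1_score(y_true, y_hat)))
```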

When to Use Each Metric

# Assumes a fitted binary classifier `clf` and its test-set predictions `y_pred`
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Balanced dataset: accuracy is fine
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")

# Imbalanced dataset: use F1 or AUC
print(f"F1 Score: {f1_score(y_test, y_pred, average='weighted'):.2%}")
# AUC needs the predicted probability of the positive class (column 1)
print(f"AUC: {roc_auc_score(y_test, clf.predict_proba(X_test_scaled)[:, 1]):.2%}")

# When false negatives are costly (e.g., disease detection)
# Maximize recall
print(f"Recall: {recall_score(y_test, y_pred):.2%}")

# When false positives are costly (e.g., spam filtering)
# Maximize precision
print(f"Precision: {precision_score(y_test, y_pred):.2%}")

Common ML Algorithms

Algorithm	Type	Best For	Complexity
Linear Regression	Regression	Simple relationships	Low
Logistic Regression	Classification	Binary outcomes	Low
Decision Trees	Both	Interpretable models	Medium
Random Forest	Both	General purpose	Medium
SVM	Both	High-dimensional data	High
K-Nearest Neighbors	Both	Small datasets	Low
K-Means	Clustering	Grouping data	Medium
Gradient Boosting	Both	Best accuracy	High

Avoiding Common Pitfalls

1. Data Leakage

# BAD: Using test data for training decisions
scaler.fit(X)  # Fits on ALL data including test
X_train, X_test = train_test_split(X)

# GOOD: Split first, then fit
X_train, X_test = train_test_split(X)
scaler.fit(X_train)  # Only fits on training data
X_test_scaled = scaler.transform(X_test)

2. Overfitting vs Underfitting

from sklearn.model_selection import cross_val_score

# Check for overfitting
train_accuracy = clf.score(X_train_scaled, y_train)
cv_scores = cross_val_score(clf, X_train_scaled, y_train, cv=5)

print(f"Training Accuracy: {train_accuracy:.2%}")
print(f"CV Accuracy: {cv_scores.mean():.2%}")

# If training >> CV, model is overfitting
# If both are low, model is underfitting

# Solutions for overfitting:
# 1. Reduce model complexity
# 2. Add regularization
# 3. Get more training data
# 4. Use feature selection

# Solutions for underfitting:
# 1. Use more complex model
# 2. Add more features
# 3. Reduce regularization
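
The gap between training and cross-validated accuracy is easiest to appreciate on a model that can memorize its data. The sketch below assumes noisy synthetic data and compares a shallow decision tree with an unrestricted one; the dataset parameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y randomly reassigns a share of labels)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)

gaps = {}
for depth in (2, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    train_acc = tree.fit(X, y).score(X, y)
    cv_acc = cross_val_score(tree, X, y, cv=5).mean()
    gaps[depth] = train_acc - cv_acc
    print(f"max_depth={depth}: train={train_acc:.2%}, CV={cv_acc:.2%}, "
          f"gap={gaps[depth]:.2%}")
```

The unrestricted tree fits the training set almost perfectly, noise included, so its gap is much larger; limiting `max_depth` is one instance of "reduce model complexity" from the list above.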

3. Class Imbalance

# Check class distribution
print(f"Class distribution: {np.bincount(y)}")

# Solutions for imbalance

# Option 1: Use class_weight parameter
clf = RandomForestClassifier(class_weight='balanced')

# Option 2: Resample the training set only, never the test set
# (SMOTE comes from the separate imbalanced-learn package)
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
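
To see why imbalance matters, it helps to score a baseline that always predicts the majority class. The sketch below assumes synthetic data with about 5% positives and compares that baseline with a class-weighted logistic regression; both model choices are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 95% negatives and 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           flip_y=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# A "model" that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_base = baseline.predict(X_test)
baseline_accuracy = accuracy_score(y_test, y_base)
baseline_recall = recall_score(y_test, y_base)

# class_weight='balanced' penalizes mistakes on the rare class more heavily
weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted.fit(X_train, y_train)
weighted_recall = recall_score(y_test, weighted.predict(X_test))

print(f"Baseline accuracy: {baseline_accuracy:.2%}, recall: {baseline_recall:.2%}")
print(f"Weighted model recall: {weighted_recall:.2%}")
```

The baseline looks strong on accuracy while finding none of the positives, which is the failure mode that accuracy alone hides.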

4. Not Enough Data

# Always check your dataset size
print(f"Samples: {len(X)}")
print(f"Features: {X.shape[1]}")
print(f"Samples per class: {np.bincount(y)}")

# Rule of thumb (a rough heuristic, not a guarantee): aim for at least 10 times as many samples as features

Real-World Example: Predicting House Prices

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv('house_prices.csv')

# Feature engineering
df['total_area'] = df['sqft_living'] + df['sqft_lot']
df['bed_bath_ratio'] = df['bedrooms'] / (df['bathrooms'] + 1)
df['age'] = 2024 - df['year_built']
df['renovated'] = (df['yr_renovated'] > 0).astype(int)

# Select features
features = ['sqft_living', 'bedrooms', 'bathrooms', 'floors',
            'age', 'renovated', 'total_area', 'bed_bath_ratio']
X = df[features]
y = df['price']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train
model = GradientBoostingRegressor(n_estimators=200, max_depth=4, random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)):,.0f}")

# Cross-validation
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='r2')
print(f"CV R² Score: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

Real-World Example: Spam Classification

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Sample data
emails = [
    "Win a free iPhone now!", "Meeting at 3pm tomorrow",
    "Congratulations! You won $1 million", "Please review the attached report",
    "Get rich quick! Click here", "Team lunch on Friday",
    "URGENT: Verify your account", "Project deadline extended",
    "Buy cheap medications online", "Quarterly results attached"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1=spam, 0=not spam

# Build pipeline (vectorizer + classifier)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
    ('classifier', MultinomialNB())
])

# Train
pipeline.fit(emails, labels)

# Predict
new_emails = ["Win free money now!", "Team meeting tomorrow"]
predictions = pipeline.predict(new_emails)
print(f"Predictions: {predictions}")  # [1 0], i.e. spam, not spam

Common Mistakes

Mistake	Problem	Solution
Not splitting data	Can't evaluate properly	Always hold out a test set
Using accuracy for imbalance	Misleading metric	Use F1, AUC, or precision/recall
Not scaling features	Poor performance for distance- and margin-based models	Scale for SVM, KNN, etc.
Overfitting	Great training score, poor test score	Regularize or simplify; check with cross-validation
Not handling missing values	Errors or bias	Impute or remove
Ignoring outliers	Skewed model	Detect and handle outliers
Too many features	Curse of dimensionality	Feature selection
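
The "impute or remove" advice for missing values follows the same fit-on-train rule as scaling. Below is a minimal sketch using `SimpleImputer` with the median strategy; the small arrays are illustrative, and the median is just one reasonable choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative feature matrices with missing entries (np.nan)
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 50.0]])
X_test = np.array([[np.nan, 20.0]])

# Fit the imputer on training data only, then reuse its medians on test data
imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train)
X_test_filled = imputer.transform(X_test)

print(f"Learned medians: {imputer.statistics_}")
print(f"Imputed test row: {X_test_filled[0]}")
```

Placing the imputer inside a `Pipeline` ahead of the model keeps this step leak-free during cross-validation as well.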

Key Takeaways

Always split data into train/test BEFORE any preprocessing
Scale features for algorithms sensitive to magnitude (SVM, KNN)
Use cross-validation for robust evaluation
Start simple, add complexity only when needed
Check for class imbalance in classification problems
Feature engineering often matters more than algorithm choice
R² close to 1.0 on held-out data indicates a good regression fit
Accuracy alone is misleading for imbalanced datasets
Use pipelines to prevent data leakage
Evaluate with multiple metrics, not just one
