
Python Machine Learning — Getting Started


Machine learning lets computers learn from data and make predictions without being explicitly programmed. This tutorial covers the fundamentals with scikit-learn.

Learning Objectives

  • Understand the ML workflow
  • Build regression and classification models
  • Evaluate model performance with proper metrics
  • Avoid common ML pitfalls
  • Apply feature engineering techniques

What is Machine Learning?

Traditional Programming:
  Input + Rules ------► Output
  (data + if/else)     (result)

Machine Learning:
  Input + Output ------► Rules
  (data + labels)       (learned model)

Example: Instead of writing rules to identify spam emails:

  • Traditional: if "free money" in email: spam = True
  • ML: Show the model 10,000 emails labeled "spam" or "not spam", and it learns the patterns itself.
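The same contrast can be shown on a numeric problem. The sketch below is an illustration only: the temperature conversion task and the names `celsius_to_fahrenheit`, `learned_slope`, and `learned_intercept` are assumptions made for this example, not part of the spam scenario above. A hand-written rule produces the labels, and a linear model then recovers that rule from the input/output pairs alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Traditional programming: the rule is written by hand.
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

# Machine learning: supply inputs and known outputs, let the model find the rule.
celsius = np.array([[-40.0], [0.0], [10.0], [25.0], [37.0], [100.0]])
fahrenheit = celsius_to_fahrenheit(celsius).ravel()  # labels used for training

model = LinearRegression()
model.fit(celsius, fahrenheit)

learned_slope = model.coef_[0]
learned_intercept = model.intercept_
print(f"Learned rule: F = {learned_slope:.2f} * C + {learned_intercept:.2f}")
```

Because the labels here are noise-free and exactly linear, the learned slope and intercept land very close to 1.8 and 32. Real data is rarely this clean, which is why the evaluation steps later in this tutorial matter.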

Types of Machine Learning

| Type | Goal | Example Algorithms |
|---|---|---|
| Supervised (Regression) | Predict continuous value | Linear Regression, Ridge |
| Supervised (Classification) | Predict category | Logistic Regression, SVM |
| Unsupervised (Clustering) | Find groups | K-Means, DBSCAN |
| Unsupervised (Dimensionality) | Reduce features | PCA, t-SNE |
| Reinforcement | Learn from actions | Q-Learning, Policy Gradient |
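Most of this tutorial covers supervised learning, so here is a minimal sketch of the unsupervised row for comparison. The six points are invented for illustration, and the choice of two clusters is an assumption that happens to match how the data was made up. Note that no labels are passed to the model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two well-separated groups (no labels given).
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # group near (1, 1)
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.8],   # group near (8, 8)
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(points)

print(f"Cluster assignments: {cluster_ids}")
print(f"Cluster centers:\n{kmeans.cluster_centers_}")
```

The cluster numbers themselves (0 or 1) are arbitrary; what matters is that points from the same group receive the same number.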

The ML Workflow

1. Collect Data
       |
       v
2. Clean & Prepare Data
       |
       v
3. Feature Engineering
       |
       v
4. Split into Train/Test Sets
       |
       v
5. Choose a Model
       |
       v
6. Train the Model
       |
       v
7. Evaluate Performance
       |
       v
8. Tune Hyperparameters
       |
       v
9. Deploy
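The workflow above can be sketched end to end in a few lines. This is one possible arrangement, not the only one: the wine dataset, the logistic regression model, and the candidate values for `C` are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-3: a built-in dataset stands in for data collection; it needs
# no cleaning or extra feature engineering.
X, y = load_wine(return_X_y=True)

# Step 4: split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 5: choose a model (scaling is bundled with it in one pipeline).
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Steps 6 and 8: train while tuning the regularization strength C with 5-fold CV.
search = GridSearchCV(pipeline, {"classifier__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Step 7: evaluate the tuned model once on the held-out test set.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(f"Best C: {search.best_params_['classifier__C']}")
print(f"Test accuracy: {test_accuracy:.2%}")

# Step 9 (deployment) is out of scope for this sketch.
```

In practice steps 6 to 8 often loop several times, but the test set should still be used only once, at the very end.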

Data Preparation and Splitting

Train/Test Split

from sklearn.model_selection import train_test_split

# Always split BEFORE any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,       # 20% for testing
    random_state=42,     # Reproducible splits
    stratify=y           # Maintain class distribution (classification)
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
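To see what `stratify=y` does, the sketch below uses invented, imbalanced labels (90 negatives and 10 positives) and a placeholder feature column. Both are assumptions made only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)  # placeholder feature, one column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_positive_rate = y_train.mean()
test_positive_rate = y_test.mean()
print(f"Positive rate in train: {train_positive_rate:.0%}")
print(f"Positive rate in test:  {test_positive_rate:.0%}")
```

With stratification, both splits keep the original 10% positive rate. Without it, a small test set could end up with very few positives, or none at all.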

Cross-Validation

Cross-validation gives a more robust estimate of model performance than a single train/test split.

from sklearn.model_selection import cross_val_score, StratifiedKFold

# `model` is any unfitted scikit-learn classifier; X and y are the features and labels

# 5-fold cross-validation (for classifiers, an integer cv already uses stratified folds)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")

# Stratified K-Fold with shuffling (preserves class distribution in every fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
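The claim that cross-validation is more robust than a single split can be checked directly. The sketch below is an illustration under assumed choices (the iris dataset, logistic regression, and five arbitrary random seeds): it repeats a single train/test split with different seeds, then runs 5-fold cross-validation on the same data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# A single split: the score depends on which rows happen to land in the test set.
split_scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    split_scores.append(model.fit(X_train, y_train).score(X_test, y_test))
print(f"Single-split accuracies: {np.round(split_scores, 3)}")

# Cross-validation: every row is used for testing exactly once.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.2%} (+/- {cv_scores.std():.2%})")
```

Any one of the single-split numbers could have been the only number reported. The cross-validated mean and spread give a fuller picture of how the model behaves across different subsets.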

Why Split First?

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# BAD: Data leakage (test data influences training)
scaler.fit(X)  # Sees ALL data including test
X_train, X_test, y_train, y_test = train_test_split(X, y)

# GOOD: Split first, then preprocess
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler.fit(X_train)           # Only training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Reuse the scaler fitted on training data
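A runnable version of the "good" pattern makes the point concrete. The data below is synthetic (random numbers generated for this sketch), so the exact values carry no meaning; what matters is where the scaler's statistics come from.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y = rng.integers(0, 2, size=200)                      # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler().fit(X_train)  # statistics come from training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The stored mean matches the training mean, not the mean of the full dataset.
print(f"Scaler mean:      {scaler.mean_.round(3)}")
print(f"Training mean:    {X_train.mean(axis=0).round(3)}")
print(f"Full-data mean:   {X.mean(axis=0).round(3)}")
# Test features end up close to, but not exactly, zero-mean after scaling.
print(f"Scaled test mean: {X_test_scaled.mean(axis=0).round(3)}")
```

The scaled test set not being perfectly centered is expected and correct: it is transformed with the training statistics, just as new data would be in production.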

Feature Engineering

Feature engineering transforms raw data into features that better represent the underlying problem.

import pandas as pd
import numpy as np

# Numeric features
df['log_income'] = np.log1p(df['income'])  # Handle skewed data
df['age_squared'] = df['age'] ** 2          # Non-linear relationship

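# Categorical features: integer codes imply an order (NYC < LA < Chicago),
# so prefer one-hot encoding for unordered categories used with linear models
df['city_encoded'] = df['city'].map({'NYC': 0, 'LA': 1, 'Chicago': 2})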

# One-hot encoding
dummies = pd.get_dummies(df['city'], prefix='city')
df = pd.concat([df, dummies], axis=1)

# Binning continuous variables
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 100],
                         labels=['Youth', 'Young Adult', 'Middle Age', 'Senior'])

# Date features
df['day_of_week'] = pd.to_datetime(df['date']).dt.dayofweek
df['month'] = pd.to_datetime(df['date']).dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Interaction features
df['income_per_age'] = df['income'] / df['age']
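The snippets above assume an existing `df`. To try a few of them, the sketch below builds a three-row DataFrame whose values are invented purely for illustration, then applies the log transform, binning, date features, and one-hot encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row dataset, invented for illustration only.
df = pd.DataFrame({
    "income": [30000, 60000, 120000],
    "age": [17, 30, 60],
    "city": ["NYC", "LA", "NYC"],
    "date": ["2024-01-06", "2024-01-08", "2024-01-07"],  # Sat, Mon, Sun
})

df["log_income"] = np.log1p(df["income"])
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 50, 100],
                         labels=["Youth", "Young Adult", "Middle Age", "Senior"])
df["day_of_week"] = pd.to_datetime(df["date"]).dt.dayofweek  # Monday=0
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)

print(df[["age", "age_group", "day_of_week", "is_weekend", "city_LA", "city_NYC"]])
```

Note that `pd.cut` uses right-inclusive bins by default, so an age of exactly 18 falls in "Youth" and an age of 0 would fall outside every bin.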

Scaling Features

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# StandardScaler: mean=0, std=1 (most common)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# MinMaxScaler: scales to [0, 1]
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# RobustScaler: robust to outliers
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
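The difference between the three scalers is easiest to see on a feature with an outlier. The five values below are invented for this sketch; everything else uses the scalers exactly as shown above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with a single extreme outlier (values invented for illustration).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)
minmax = MinMaxScaler().fit_transform(X)
robust = RobustScaler().fit_transform(X)  # centers on the median, scales by the IQR

print("StandardScaler:", standard.ravel().round(2))
print("MinMaxScaler:  ", minmax.ravel().round(2))
print("RobustScaler:  ", robust.ravel().round(2))
```

With MinMaxScaler the outlier takes the value 1 and squeezes the four ordinary values into a narrow band near 0. RobustScaler keeps those four values spread between -1 and 0.5, because the median and interquartile range are barely affected by the outlier.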

Regression: Predicting Numbers

Regression predicts continuous values (price, temperature, salary).

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np

# Sample data: house size vs price
X = np.array([[800], [1000], [1200], [1400], [1600],
              [1800], [2000], [2200], [2400], [2600]])
y = np.array([150000, 180000, 210000, 250000, 280000,
              310000, 350000, 380000, 420000, 450000])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.0f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.0f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.0f}")

# Predict new values
new_house = np.array([[1500]])
predicted_price = model.predict(new_house)
print(f"Predicted price for 1500 sqft: ${predicted_price[0]:,.0f}")

Regression Algorithms Comparison

| Algorithm | When to Use | Pros | Cons |
|---|---|---|---|
| Linear Regression | Linear relationships | Interpretable, fast | Assumes linearity |
| Ridge | Many correlated features | Prevents overfitting | Needs alpha tuning |
| Lasso | Feature selection | Automatic feature selection | Can zero out features |
| ElasticNet | High-dimensional data | Combines Ridge + Lasso | Complex tuning |
| Random Forest | Non-linear data | Handles non-linearity | Less interpretable |
| Gradient Boosting | Best accuracy | High performance | Slower, needs tuning |
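A table like this is a starting point; the usual next step is to compare candidates on your own data. The sketch below does that for three of the linear models using cross-validation. The dataset is synthetic (generated with `make_regression`) and the `alpha` values are arbitrary starting points, so the scores say nothing about how these models rank on real problems.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data: 10 features, of which only 5 carry signal.
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),   # alpha values are arbitrary starting points
    "Lasso": Lasso(alpha=1.0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    results[name] = scores.mean()
    print(f"{name}: mean CV R² = {scores.mean():.3f}")
```

The same loop works for Random Forest or Gradient Boosting regressors; only the entries in `models` change.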

Classification: Predicting Categories

Classification predicts discrete labels (spam/not spam, cat/dog/bird).

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load iris dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
target_names = iris.target_names

print(f"Features: {feature_names}")
print(f"Classes: {target_names}")
print(f"Samples: {len(X)}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features (tree-based models such as Random Forest do not need this; shown as general practice)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_scaled, y_train)

# Predict
y_pred = clf.predict(X_test_scaled)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Feature importance
importances = clf.feature_importances_
for name, importance in zip(feature_names, importances):
    print(f"  {name}: {importance:.3f}")

Evaluation Metrics

Regression Metrics

| Metric | What it Measures | Good Value | Formula |
|---|---|---|---|
| R² Score | Variance explained | Close to 1.0 | 1 - (SS_res / SS_tot) |
| MSE | Average squared error | Close to 0 | mean((y - y_pred)²) |
| RMSE | Error in original units | Close to 0 | sqrt(MSE) |
| MAE | Average absolute error | Close to 0 | mean(abs(y - y_pred)) |
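Each formula in the table can be computed by hand with NumPy and checked against scikit-learn. The four true and predicted values below are made up for this sketch; `ss_res` and `ss_tot` are the residual and total sums of squares from the R² formula.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(errors))
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"Manual:  MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}, R²={r2:.3f}")
print(f"sklearn: MSE={mean_squared_error(y_true, y_pred):.3f}, "
      f"MAE={mean_absolute_error(y_true, y_pred):.3f}, "
      f"R²={r2_score(y_true, y_pred):.3f}")
```

MSE penalizes the single error of 1.0 more heavily than the two errors of 0.5, while MAE treats all errors in proportion to their size. That difference is why MSE and RMSE are more sensitive to outliers.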

Classification Metrics

| Metric | What it Measures | Good Value | When to Use |
|---|---|---|---|
| Accuracy | Correct predictions / total | Close to 1.0 | Balanced classes |
| Precision | True positives / predicted positives | Close to 1.0 | Cost of false positive is high |
| Recall | True positives / actual positives | Close to 1.0 | Cost of false negative is high |
| F1 Score | Harmonic mean of precision and recall | Close to 1.0 | Imbalanced classes |
| AUC-ROC | Ranking quality | Close to 1.0 | Binary classification |
| Log Loss | Confidence of predictions | Close to 0 | Probabilistic output needed |
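Accuracy, precision, recall, and F1 all derive from the four cells of a binary confusion matrix. The sketch below computes them by hand from invented labels (10 samples, chosen so the counts are easy to verify) and can be compared with scikit-learn's own functions.

```python
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score)

# Invented binary labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of everything predicted positive, how much was right
recall = tp / (tp + fn)      # of everything actually positive, how much was found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, "
      f"Recall={recall:.2f}, F1={f1:.2f}")
print(f"sklearn: Precision={precision_score(y_true, y_pred):.2f}, "
      f"Recall={recall_score(y_true, y_pred):.2f}, "
      f"F1={f1_score(y_true, y_pred):.2f}")
```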

When to Use Each Metric

from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             recall_score, precision_score)

# Assumes a fitted binary classifier `clf` and its test-set predictions `y_pred`.
# For multi-class problems, pass average='macro' or 'weighted' to recall/precision.

# Balanced dataset: accuracy is fine
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")

# Imbalanced dataset: use F1 or AUC
print(f"F1 Score: {f1_score(y_test, y_pred, average='weighted'):.2%}")
# AUC needs predicted probabilities for the positive class, not hard labels
print(f"AUC: {roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]):.2%}")

# When false negatives are costly (e.g., disease detection)
# Maximize recall
print(f"Recall: {recall_score(y_test, y_pred):.2%}")

# When false positives are costly (e.g., spam filtering)
# Maximize precision
print(f"Precision: {precision_score(y_test, y_pred):.2%}")
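To see why accuracy misleads on imbalanced data, consider a "model" that always predicts the majority class. The sketch below uses scikit-learn's `DummyClassifier` for that; the labels (95 negatives, 5 positives) are invented for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Invented imbalanced labels: 95 negatives, 5 positives.
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant to this baseline

# A baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
recall = recall_score(y, y_pred)
print(f"Accuracy: {accuracy:.2%}")
print(f"Recall:   {recall:.2%}")
```

The baseline scores 95% accuracy while finding none of the positive cases. Any real model on this data should be judged against that baseline, not against 50%.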

Common ML Algorithms

| Algorithm | Type | Best For | Complexity |
|---|---|---|---|
| Linear Regression | Regression | Simple relationships | Low |
| Logistic Regression | Classification | Binary outcomes | Low |
| Decision Trees | Both | Interpretable models | Medium |
| Random Forest | Both | General purpose | Medium |
| SVM | Both | High-dimensional data | High |
| K-Nearest Neighbors | Both | Small datasets | Low |
| K-Means | Clustering | Grouping data | Medium |
| Gradient Boosting | Both | Best accuracy | High |

Avoiding Common Pitfalls

1. Data Leakage

# BAD: Using test data for training decisions
scaler.fit(X)  # Fits on ALL data including test
X_train, X_test = train_test_split(X)

# GOOD: Split first, then fit
X_train, X_test = train_test_split(X)
scaler.fit(X_train)  # Only fits on training data
X_test_scaled = scaler.transform(X_test)
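Leakage can also creep in during cross-validation if the data is scaled once before the folds are made. One common way to avoid this is to put the scaler and the model in a single pipeline, so the scaler is refit inside each fold. The sketch below assumes the breast cancer dataset and a K-Nearest Neighbors classifier purely as examples of a scale-sensitive setup.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is refit on the training part of every fold, so the validation
# part never contributes to the scaling statistics.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

print(f"CV accuracy: {cv_scores.mean():.2%} (+/- {cv_scores.std():.2%})")
```

The raw, unscaled `X` is passed to `cross_val_score`; the pipeline takes care of fitting and applying the scaler at the right moment.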

2. Overfitting vs Underfitting

from sklearn.model_selection import cross_val_score

# Check for overfitting
train_accuracy = clf.score(X_train_scaled, y_train)
cv_scores = cross_val_score(clf, X_train_scaled, y_train, cv=5)

print(f"Training Accuracy: {train_accuracy:.2%}")
print(f"CV Accuracy: {cv_scores.mean():.2%}")

# If training >> CV, model is overfitting
# If both are low, model is underfitting

# Solutions for overfitting:
# 1. Reduce model complexity
# 2. Add regularization
# 3. Get more training data
# 4. Use feature selection

# Solutions for underfitting:
# 1. Use more complex model
# 2. Add more features
# 3. Reduce regularization
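The train-versus-CV comparison can be tried on a model that overfits easily. The sketch below is an illustration on synthetic data with deliberately noisy labels; the depth limit of 3 is an arbitrary choice standing in for "reduce model complexity".

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; about 20% of labels are reassigned at random to simulate noise.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=42)

results = {}
for max_depth in (None, 3):  # None lets the tree grow until it fits every row
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    train_accuracy = tree.fit(X, y).score(X, y)
    cv_accuracy = cross_val_score(tree, X, y, cv=5).mean()
    results[max_depth] = (train_accuracy, cv_accuracy)
    print(f"max_depth={max_depth}: train={train_accuracy:.2%}, CV={cv_accuracy:.2%}")
```

The unrestricted tree memorizes the training rows, noise included, so its training accuracy is perfect while its cross-validated accuracy is noticeably lower. Limiting the depth typically narrows that gap.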

3. Class Imbalance

# Check class distribution
print(f"Class distribution: {np.bincount(y)}")

# Solutions for imbalance
from sklearn.ensemble import RandomForestClassifier

# Option 1: Use class_weight parameter
clf = RandomForestClassifier(class_weight='balanced')

# Option 2: Resample (requires the separate imbalanced-learn package)
# Resample only the training split so the test set keeps its real distribution
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
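It helps to see what `class_weight='balanced'` actually computes. scikit-learn's `compute_class_weight` exposes the same calculation: each class gets the weight `n_samples / (n_classes * count_of_that_class)`. The labels below (90 negatives, 10 positives) are invented for this sketch.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Invented imbalanced training labels: 90 negatives, 10 positives.
y_train = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_train)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weights = dict(zip(classes.tolist(), weights.tolist()))

print(f"Balanced class weights: {class_weights}")
```

The minority class receives a weight nine times larger than the majority class here, so each positive sample counts as much toward the loss as nine negatives.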

4. Not Enough Data

# Always check your dataset size
print(f"Samples: {len(X)}")
print(f"Features: {X.shape[1]}")
print(f"Samples per class: {np.bincount(y)}")

# Rule of thumb: aim for at least 10 times as many samples as features
# (a rough heuristic; complex models and noisy data usually need more)
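A more direct way to judge whether more data would help is a learning curve, which trains the model on growing subsets and records the validation score for each. The sketch below is an illustration on synthetic data with an assumed logistic regression model; the subset fractions are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=42)

# Train on 10% to 100% of each training fold and score on the validation fold.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    shuffle=True, random_state=42,
)

for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size:>4} training samples: mean validation accuracy = {score:.2%}")
```

If the validation score is still climbing at the largest size, collecting more data is likely to help. If it has flattened, better features or a different model are usually the more productive next step.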

Real-World Example: Predicting House Prices

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv('house_prices.csv')

# Feature engineering
df['total_area'] = df['sqft_living'] + df['sqft_lot']
df['bed_bath_ratio'] = df['bedrooms'] / (df['bathrooms'] + 1)
df['age'] = 2024 - df['year_built']
df['renovated'] = (df['yr_renovated'] > 0).astype(int)

# Select features
features = ['sqft_living', 'bedrooms', 'bathrooms', 'floors',
            'age', 'renovated', 'total_area', 'bed_bath_ratio']
X = df[features]
y = df['price']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train
model = GradientBoostingRegressor(n_estimators=200, max_depth=4, random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)):,.0f}")

# Cross-validation
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='r2')
print(f"CV R² Score: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

Real-World Example: Spam Classification

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Sample data
emails = [
    "Win a free iPhone now!", "Meeting at 3pm tomorrow",
    "Congratulations! You won $1 million", "Please review the attached report",
    "Get rich quick! Click here", "Team lunch on Friday",
    "URGENT: Verify your account", "Project deadline extended",
    "Buy cheap medications online", "Quarterly results attached"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1=spam, 0=not spam

# Build pipeline (vectorizer + classifier)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
    ('classifier', MultinomialNB())
])

# Train
pipeline.fit(emails, labels)

# Predict
new_emails = ["Win free money now!", "Team meeting tomorrow"]
predictions = pipeline.predict(new_emails)
print(f"Predictions: {predictions}")  # Predictions: [1 0]

Common Mistakes

| Mistake | Problem | Solution |
|---|---|---|
| Not splitting data | Can't evaluate properly | Always train/test split |
| Using accuracy for imbalance | Misleading metric | Use F1, AUC, or precision/recall |
| Not scaling features | Poor performance | Scale for SVM, KNN, etc. |
| Overfitting | Great training, bad test | Use cross-validation |
| Not handling missing values | Errors or bias | Impute or remove |
| Ignoring outliers | Skewed model | Detect and handle outliers |
| Too many features | Curse of dimensionality | Feature selection |

Key Takeaways

  1. Always split data into train/test BEFORE any preprocessing
  2. Scale features for algorithms sensitive to magnitude (SVM, KNN)
  3. Use cross-validation for robust evaluation
  4. Start simple, add complexity only when needed
  5. Check for class imbalance in classification problems
  6. Feature engineering often matters more than algorithm choice
  7. R² close to 1.0 on held-out data indicates a good regression fit
  8. Accuracy alone is misleading for imbalanced datasets
  9. Use pipelines to prevent data leakage
  10. Evaluate with multiple metrics, not just one
