
Logistic Regression — Complete Guide for Classification


Supervised Learning

Classification with Probability — From Linear to Sigmoid

Logistic regression transforms linear outputs into probabilities using the sigmoid function. It is the foundation of classification in machine learning.

  • Sigmoid Function — Map any real number to a probability between 0 and 1
  • Cross-Entropy Loss — The cost function that powers classification training
  • Multiclass Extension — Softmax regression for multiple classes

"The goal is to turn data into information, and information into insight."


Despite its name, logistic regression is a classification algorithm. It predicts the probability that an input belongs to a class.


From Linear to Logistic Regression

Definition: Logistic Regression

Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ with $y^{(i)} \in \{0, 1\}$, logistic regression models $P(y=1 \mid x)$ using the sigmoid function applied to a linear combination of features: $P(y=1 \mid x) = \sigma(\mathbf{w}^T\mathbf{x} + b)$.

Sigmoid Function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Here,

  • $\sigma(z)$ = Output probability (0 to 1)
  • $z$ = Linear combination $\mathbf{w}^T\mathbf{x} + b$
  • $e$ = Euler's number (~2.71828)
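
Figure: Sigmoid function $\sigma(z) = 1 / (1 + e^{-z})$, plotted for $z$ from -6 to 6, with the inflection point at $z = 0$, $\sigma = 0.5$.

Key Properties

  • $\sigma(0) = 0.5$
  • $\sigma(z) \to 1$ as $z \to +\infty$
  • $\sigma(z) \to 0$ as $z \to -\infty$
  • Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
  • Output range: $(0, 1)$, a valid probability
  • Decision rule: $\sigma(z) \ge 0.5$ → Class 1; $\sigma(z) < 0.5$ → Class 0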

Example: Sigmoid Output

For different values of $z = \mathbf{w}^T\mathbf{x} + b$:

  • $z = 0 \Rightarrow \sigma(z) = 0.5$ (decision boundary)
  • $z = 2 \Rightarrow \sigma(z) \approx 0.88$ (likely Class 1)
  • $z = -2 \Rightarrow \sigma(z) \approx 0.12$ (likely Class 0)
  • $z \to +\infty \Rightarrow \sigma(z) \to 1$
  • $z \to -\infty \Rightarrow \sigma(z) \to 0$
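
The values in this example can be reproduced with a few lines of NumPy. This is a minimal sketch, not the lesson's own code: the helper name `sigmoid` is an assumption, and it uses `np.logaddexp` to avoid overflow for large negative `z`, which a direct `1 / (1 + np.exp(-z))` would warn about.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid computed as exp(-log(1 + e^{-z})) for numerical stability."""
    z = np.asarray(z, dtype=float)
    # np.logaddexp(0, -z) evaluates log(e^0 + e^{-z}) without overflowing.
    return np.exp(-np.logaddexp(0.0, -z))

z_values = np.array([0.0, 2.0, -2.0])
probs = sigmoid(z_values)

for z, p in zip(z_values, probs):
    print(f"z = {z:5.1f} -> sigma(z) = {p:.2f}")
# z =   0.0 -> sigma(z) = 0.50
# z =   2.0 -> sigma(z) = 0.88
# z =  -2.0 -> sigma(z) = 0.12
```

Extreme inputs such as `sigmoid(-1000.0)` return `0.0` and `sigmoid(1000.0)` returns `1.0`, matching the two limits listed above.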

Cost Function

Why Not MSE?

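Linear regression uses MSE, but in logistic regression the prediction passes through the sigmoid, and MSE composed with the sigmoid is non-convex in the parameters. Gradient descent can then stall in flat regions or settle in a poor local minimum. Cross-entropy loss is convex in $\mathbf{w}$ and $b$ and has a simple gradient.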
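
A quick numerical check makes the convexity claim concrete. The sketch below is an illustration under stated assumptions: a one-parameter model with a single training example $x = 1$, $y = 1$, and two hand-picked weights. It tests the midpoint inequality $f(\tfrac{w_1 + w_2}{2}) \le \tfrac{1}{2}(f(w_1) + f(w_2))$, which every convex function must satisfy. One failing pair is enough to show non-convexity; a passing pair is only consistent with convexity, not a proof.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(w):
    # Squared error for the single example x = 1, y = 1.
    return (sigmoid(w) - 1.0) ** 2

def ce_loss(w):
    # Cross-entropy for the same example: -log(sigmoid(w)).
    return -np.log(sigmoid(w))

# Two hand-picked weights where the model is confidently wrong.
w1, w2 = -6.0, -1.0
mid = 0.5 * (w1 + w2)

mse_passes = mse_loss(mid) <= 0.5 * (mse_loss(w1) + mse_loss(w2))
ce_passes = ce_loss(mid) <= 0.5 * (ce_loss(w1) + ce_loss(w2))

print(mse_passes)  # False: MSE violates the midpoint inequality here
print(ce_passes)   # True: cross-entropy satisfies it
```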

Definition: Binary Cross-Entropy (Log Loss)

The cost function for logistic regression. For a single example: $L = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$. The full cost over $N$ examples:

$$J(\mathbf{w}, b) = -\frac{1}{N}\sum_{i=1}^{N}\left[y^{(i)}\log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)})\right]$$
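
Figure: Cross-entropy loss behavior, plotted against the predicted probability $\hat{y}$.

  • When $y = 1$ (true label): $L = -\log(\hat{y})$, so $\hat{y} = 1 \Rightarrow L = 0$ and $\hat{y} \to 0 \Rightarrow L \to \infty$
  • When $y = 0$ (true label): $L = -\log(1 - \hat{y})$, so $\hat{y} = 0 \Rightarrow L = 0$ and $\hat{y} \to 1 \Rightarrow L \to \infty$

Cross-entropy heavily penalizes confident wrong predictions: the gradient is largest when the model is most wrong.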

Example: Cross-Entropy Loss

Case 1: $y = 1$

  • $\hat{y} = 0.9 \Rightarrow L = -\log(0.9) \approx 0.11$ (good: confident and correct)
  • $\hat{y} = 0.1 \Rightarrow L = -\log(0.1) \approx 2.30$ (bad: confident and wrong)

Case 2: $y = 0$

  • $\hat{y} = 0.1 \Rightarrow L = -\log(1 - 0.1) = -\log(0.9) \approx 0.11$ (good)
  • $\hat{y} = 0.9 \Rightarrow L = -\log(1 - 0.9) = -\log(0.1) \approx 2.30$ (bad)

Here $\log$ is the natural logarithm.
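
The four cases above can be checked numerically. This is a minimal sketch rather than a reference implementation: the function name `binary_cross_entropy` and the clipping constant `eps` are assumptions, added so that a prediction of exactly 0 or 1 never produces `log(0)`.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Per-example log loss, using the natural logarithm."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logs stay finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

# Same four cases as the worked example: (y, y_hat) pairs.
y_true = np.array([1, 1, 0, 0])
y_prob = np.array([0.9, 0.1, 0.1, 0.9])

losses = binary_cross_entropy(y_true, y_prob)
cost = losses.mean()  # J is the average of the per-example losses

print(np.round(losses, 2))  # [0.11 2.3  0.11 2.3 ]
print(f"J = {cost:.2f}")    # J = 1.20
```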

Decision Boundary

Definition: Decision Boundary

The decision boundary is the hyperplane where $\sigma(\mathbf{w}^T\mathbf{x} + b) = 0.5$, which simplifies to $\mathbf{w}^T\mathbf{x} + b = 0$. This is a linear boundary in feature space. For nonlinear boundaries, add polynomial features or use kernel methods.
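
The equivalence between thresholding the probability at 0.5 and checking the sign of the linear score can be seen directly. The weights, bias, and points below are hypothetical values chosen for illustration, not parameters from a fitted model.

```python
import numpy as np

# Hypothetical parameters for a 2-feature model (illustrative values only).
w = np.array([1.5, -2.0])
b = 0.5

X = np.array([
    [2.0, 1.0],  # z =  1.5, on the Class 1 side
    [0.0, 1.0],  # z = -1.5, on the Class 0 side
    [1.0, 1.0],  # z =  0.0, exactly on the boundary
])

z = X @ w + b
prob = 1.0 / (1.0 + np.exp(-z))

# Thresholding the probability at 0.5 ...
pred_from_prob = (prob >= 0.5).astype(int)
# ... gives the same labels as checking the sign of the linear score.
pred_from_z = (z >= 0).astype(int)

print(z)               # [ 1.5 -1.5  0. ]
print(pred_from_prob)  # [1 0 1]
print(pred_from_z)     # [1 0 1]
```

Because only the sign of $z$ matters for the predicted label, the boundary itself is fully described by $\mathbf{w}$ and $b$; the sigmoid only affects how confident the prediction is on either side.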

Figure: Decision boundary in 2D feature space (axes: $x_1$, feature 1, and $x_2$, feature 2). The line $w_1 x_1 + w_2 x_2 + b = 0$ separates Class 0 from Class 1.

Multi-class: Softmax

Softmax converts logits to probabilities:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Example with 3 classes:

  • $z = [2.0, 1.0, 0.1]$
  • $e^{z} = [7.39, 2.72, 1.11]$
  • $\mathrm{softmax}(z) = [0.659, 0.242, 0.099]$
  • Sum = 1.0 ✓
  • Predicted: Class 0 (highest probability)

The output is a valid probability distribution over $K$ classes.
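
The three-class softmax example can be reproduced as follows. This is a sketch: the function name `softmax` is an assumption, and subtracting the maximum logit before exponentiating is a standard stability trick that is not part of the formula as stated but does not change its result.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    # Shifting by the max leaves the output unchanged but avoids overflow in exp.
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
predicted_class = int(np.argmax(probs))

print(np.round(probs, 3))  # [0.659 0.242 0.099]
print(predicted_class)     # 0
```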

Gradient Derivation

Gradient of Cross-Entropy Loss

$$\nabla_{\mathbf{w}} J = \frac{1}{N}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y})$$

Here,

  • $\hat{\mathbf{y}}$ = Vector of predicted probabilities $\sigma(\mathbf{X}\mathbf{w} + b)$
  • $\mathbf{y}$ = Vector of true labels
  • $\mathbf{X}$ = Design matrix ($N \times d$)
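
The formula follows from the chain rule applied to one example and then averaged. The derivation below is a sketch using the symbols already defined, with $z = \mathbf{w}^T\mathbf{x} + b$ and $\hat{y} = \sigma(z)$; the bias gradient in the last line is not stated in the lesson but comes out of the same steps.

```latex
\begin{align*}
L &= -\left[y \log \hat{y} + (1-y)\log(1-\hat{y})\right] \\
\frac{\partial L}{\partial \hat{y}}
  &= -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}
   = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})} \\
\frac{\partial \hat{y}}{\partial z}
  &= \sigma(z)\bigl(1-\sigma(z)\bigr) = \hat{y}(1-\hat{y}) \\
\frac{\partial L}{\partial z}
  &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}
   = \hat{y} - y \\
\nabla_{\mathbf{w}} L &= (\hat{y} - y)\,\mathbf{x},
  \qquad \frac{\partial L}{\partial b} = \hat{y} - y \\
\nabla_{\mathbf{w}} J
  &= \frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{y}^{(i)} - y^{(i)}\bigr)\,\mathbf{x}^{(i)}
   = \frac{1}{N}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}),
  \qquad
  \frac{\partial J}{\partial b}
   = \frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{y}^{(i)} - y^{(i)}\bigr)
\end{align*}
```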

Key Insight

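The gradient $\frac{1}{N}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y})$ has the same form as the MSE gradient in linear regression (up to a constant factor), with $\hat{\mathbf{y}} = \sigma(\mathbf{X}\mathbf{w} + b)$ in place of $\mathbf{X}\mathbf{w} + b$. This is because the sigmoid's derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$ cancels the $\hat{y}(1-\hat{y})$ denominator produced by differentiating the log terms, which leaves a clean, interpretable gradient.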
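
One way to gain confidence in the gradient formula is to compare it with a finite-difference estimate and then use it in plain gradient descent. The sketch below makes several assumptions for illustration: synthetic data drawn with a fixed seed, hypothetical "true" weights, a learning rate of 0.5, and 500 iterations. None of these come from the lesson.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradients(w, b, X, y):
    residual = sigmoid(X @ w + b) - y   # y_hat - y
    grad_w = X.T @ residual / len(y)    # (1/N) X^T (y_hat - y)
    grad_b = residual.mean()            # (1/N) sum(y_hat - y)
    return grad_w, grad_b

# Synthetic, noisy binary labels (assumed setup for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])     # hypothetical generating weights
y = (rng.random(200) < sigmoid(X @ true_w)).astype(float)

w, b = np.zeros(3), 0.0

# Central finite differences should agree with the analytic gradient.
analytic_w, _ = gradients(w, b, X, y)
h = 1e-6
numeric_w = np.array([
    (cost(w + h * e, b, X, y) - cost(w - h * e, b, X, y)) / (2 * h)
    for e in np.eye(3)
])
max_gap = np.max(np.abs(numeric_w - analytic_w))

# Batch gradient descent using the analytic gradient.
initial_cost = cost(w, b, X, y)
learning_rate = 0.5
for _ in range(500):
    grad_w, grad_b = gradients(w, b, X, y)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
final_cost = cost(w, b, X, y)

print(f"max gradient gap: {max_gap:.2e}")
print(f"cost: {initial_cost:.4f} -> {final_cost:.4f}")
```

With all-zero parameters every prediction is 0.5, so the starting cost is $\log 2 \approx 0.6931$ regardless of the data; the cost after training should be lower.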


Python Implementation

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data: 10 features, 5 of them informative
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)

# Hold out 20% for testing; a fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# C is the inverse of the regularization strength (smaller C = stronger L2 penalty)
model = LogisticRegression(C=1.0, penalty='l2', solver='lbfgs')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard class labels
y_prob = model.predict_proba(X_test)[:, 1]   # estimated P(y = 1 | x)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.3f}")
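
To connect the fitted scikit-learn model back to the formulas, the sketch below rebuilds the linear score and the probability by hand from `coef_` and `intercept_` and compares them with `decision_function` and `predict_proba`. It assumes a binary problem and the default `LogisticRegression` settings; the variable names are illustrative, and fitting on the full dataset is only for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
model = LogisticRegression(C=1.0).fit(X, y)

# Linear score z = w^T x + b, rebuilt from the fitted parameters.
z_manual = X @ model.coef_.ravel() + model.intercept_[0]
z_sklearn = model.decision_function(X)

# Probability of class 1 via the sigmoid of the linear score.
prob_manual = 1.0 / (1.0 + np.exp(-z_manual))
prob_sklearn = model.predict_proba(X)[:, 1]

scores_match = np.allclose(z_manual, z_sklearn)
probs_match = np.allclose(prob_manual, prob_sklearn)
# Predicted labels follow the sign of the linear score.
labels_match = np.array_equal(model.predict(X), (z_sklearn > 0).astype(int))

print(scores_match, probs_match, labels_match)
```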

Key Takeaways

Summary: Logistic Regression

  1. Logistic regression outputs $P(y=1 \mid x) = \sigma(\mathbf{w}^T\mathbf{x} + b) \in (0,1)$
  2. Binary cross-entropy $J = -\frac{1}{N}\sum[y\log\hat{y} + (1-y)\log(1-\hat{y})]$ is the cost function
  3. Decision boundary is a linear hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$
  4. Multiclass: use Softmax regression (multinomial) or One-vs-Rest
  5. AUC-ROC is threshold-independent and more informative than accuracy on imbalanced datasets; under severe imbalance, check precision and recall as well
  6. Regularization (L1 or L2, with strength set by the inverse parameter C in scikit-learn) reduces overfitting
  7. Logistic regression is fast, interpretable, and a strong baseline for classification
  8. The gradient has the elegant form $\nabla_{\mathbf{w}} J = \frac{1}{N}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y})$

What to Learn Next

-> Linear Regression From scatter plots to predictions — the simplest ML algorithm.

-> Naive Bayes Bayes' theorem in action — fast, simple, surprisingly powerful.

-> SVM Finding the optimal boundary — maximum margin classification.
