Supervised Learning

Classification with Probability — From Linear to Sigmoid

Logistic regression passes the output of a linear model through the sigmoid function to obtain a probability. It is one of the foundational methods for classification in machine learning.

Sigmoid Function — Map any real number to a probability between 0 and 1
Cross-Entropy Loss — The cost function that powers classification training
Multiclass Extension — Softmax regression for multiple classes

"The goal is to turn data into information, and information into insight."

Logistic Regression — Complete Guide for Classification

Despite its name, logistic regression is a classification algorithm. It predicts the probability that an input belongs to a class.

From Linear to Logistic Regression

Definition: Logistic Regression

Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ with $y^{(i)} \in \{0, 1\}$, logistic regression models $P(y=1|x)$ using the sigmoid function applied to a linear combination of features: $P(y=1|x) = \sigma(\mathbf{w}^T\mathbf{x} + b)$.

Sigmoid Function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Here,

$\sigma(z)$ = output probability (between 0 and 1)
$z$ = linear combination $\mathbf{w}^T\mathbf{x} + b$
$e$ = Euler's number (~2.71828)

Example: Sigmoid Output

For different values of $z = \mathbf{w}^T\mathbf{x} + b$:

$z = 0 \Rightarrow \sigma(z) = 0.5$ (decision boundary)
$z = 2 \Rightarrow \sigma(z) \approx 0.88$ (likely Class 1)
$z = -2 \Rightarrow \sigma(z) \approx 0.12$ (likely Class 0)
$z \to +\infty \Rightarrow \sigma(z) \to 1$
$z \to -\infty \Rightarrow \sigma(z) \to 0$
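The values above can be reproduced with a few lines of NumPy. This is a minimal sketch: the helper name `sigmoid` is our own, and computing through `exp(-|z|)` is one common way (an implementation choice, not something the formula requires) to avoid overflow for large negative `z`.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid, equal to 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    # exp(-|z|) always lies in (0, 1], so it can never overflow.
    e = np.exp(-np.abs(z))
    # z >= 0: 1 / (1 + exp(-z));  z < 0: exp(z) / (1 + exp(z)). Both equal sigma(z).
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

z_values = np.array([-2.0, 0.0, 2.0])
probs = sigmoid(z_values)
print(np.round(probs, 2))
```

Rounded to two decimals, the printed probabilities match the 0.12, 0.5, and 0.88 listed in the example.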

Cost Function

Why Not MSE?

Linear regression uses MSE, but plugging the sigmoid into MSE gives a cost that is non-convex in the weights, so gradient descent can stall in flat regions or local minima. Cross-entropy loss is convex in $\mathbf{w}$ and $b$ and has a simple gradient.
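A quick numerical check can make the convexity difference concrete. The sketch below assumes a toy setup of our own choosing (a single training example with $x = 1$, $y = 1$, and no bias) and applies the midpoint test: a convex function never lies above the average of its endpoint values. One failed midpoint is enough to show non-convexity; a passed midpoint is only consistent with convexity, not a proof of it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy setup (assumption): one example with x = 1, y = 1, so y_hat = sigmoid(w).
def mse_loss(w):
    return (sigmoid(w) - 1.0) ** 2

def ce_loss(w):
    return -np.log(sigmoid(w))

# Midpoint test on the interval [a, b]: convexity requires
# f(midpoint) <= (f(a) + f(b)) / 2, i.e. a gap that is <= 0.
a, b = -10.0, 0.0
mid = (a + b) / 2

mse_gap = mse_loss(mid) - (mse_loss(a) + mse_loss(b)) / 2
ce_gap = ce_loss(mid) - (ce_loss(a) + ce_loss(b)) / 2

print(f"MSE gap: {mse_gap:+.3f} (positive: the convexity test fails)")
print(f"Cross-entropy gap: {ce_gap:+.3f} (negative: consistent with convexity)")
```

The MSE curve is nearly flat at a loss of about 1 when `w` is very negative, which is also why its gradient almost vanishes for confidently wrong predictions.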

Definition: Binary Cross-Entropy (Log Loss)

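The cost function for logistic regression. For a single example with prediction $\hat{y} = \sigma(\mathbf{w}^T\mathbf{x} + b)$: $L = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$, where $\log$ is the natural logarithm. The full cost over $N$ examples: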

$$J(\mathbf{w}, b) = -\frac{1}{N}\sum_{i=1}^{N}\left[y^{(i)}\log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)})\right]$$

Example: Cross-Entropy Loss

Case 1: $y = 1$

$\hat{y} = 0.9 \Rightarrow L = -\log(0.9) \approx 0.11$ (good: confident and correct)
$\hat{y} = 0.1 \Rightarrow L = -\log(0.1) \approx 2.30$ (bad: confident and wrong)

Case 2: $y = 0$

$\hat{y} = 0.1 \Rightarrow L = -\log(1 - 0.1) = -\log(0.9) \approx 0.11$ (good)
$\hat{y} = 0.9 \Rightarrow L = -\log(1 - 0.9) = -\log(0.1) \approx 2.30$ (bad)
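The four cases above can be checked with a small NumPy function. This is a sketch, not a library API: the name `binary_cross_entropy` and the clipping constant `eps` are our own choices, and the clipping is there only to keep `log(0)` from occurring.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Per-example log loss: -[y log(p) + (1 - y) log(1 - p)]."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from exactly 0 and 1 so the logs stay finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

# The four cases from the example: (y=1, 0.9), (y=1, 0.1), (y=0, 0.1), (y=0, 0.9)
y_true = np.array([1, 1, 0, 0])
y_prob = np.array([0.9, 0.1, 0.1, 0.9])

per_example = binary_cross_entropy(y_true, y_prob)
print(np.round(per_example, 2))
print(f"Mean cost J: {per_example.mean():.3f}")
```

Averaging the per-example losses gives the cost $J$ for this four-example dataset.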

Decision Boundary

Definition: Decision Boundary

The decision boundary is the hyperplane where $\sigma(\mathbf{w}^T\mathbf{x} + b) = 0.5$, which simplifies to $\mathbf{w}^T\mathbf{x} + b = 0$. This is a linear boundary in feature space. For nonlinear boundaries, add polynomial features or use kernel methods.
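The equivalence between thresholding the probability at 0.5 and thresholding $z$ at 0 can be seen in a short sketch. The weights `w`, bias `b`, and the three points are hypothetical values picked for illustration, not fitted parameters.

```python
import numpy as np

# Hypothetical parameters for a 2-feature model (illustration only).
w = np.array([1.0, -2.0])
b = 0.5

X = np.array([
    [3.0, 1.0],   # z = 3.0 - 2.0 + 0.5 =  1.5 -> Class 1 side
    [0.0, 1.0],   # z = 0.0 - 2.0 + 0.5 = -1.5 -> Class 0 side
    [1.5, 1.0],   # z = 1.5 - 2.0 + 0.5 =  0.0 -> exactly on the boundary
])

z = X @ w + b
prob = 1.0 / (1.0 + np.exp(-z))

# Thresholding the probability at 0.5 is the same as thresholding z at 0.
labels_from_prob = (prob >= 0.5).astype(int)
labels_from_z = (z >= 0).astype(int)

print(labels_from_prob, labels_from_z)
```

How a point lying exactly on the boundary is labeled is a convention; this sketch assigns it to Class 1 by using `>=`.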

Gradient Derivation

Gradient of Cross-Entropy Loss

$$\nabla_{\mathbf{w}} J = \frac{1}{N}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y})$$

Here,

$\hat{\mathbf{y}}$ = vector of predicted probabilities $\sigma(\mathbf{X}\mathbf{w} + b)$
$\mathbf{y}$ = vector of true labels
$\mathbf{X}$ = design matrix ($N \times d$)
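The formula follows from the chain rule. A sketch for a single example, writing $z = \mathbf{w}^T\mathbf{x} + b$ and $\hat{y} = \sigma(z)$, with $L$ the per-example cross-entropy loss defined earlier:

```latex
\begin{aligned}
\frac{\partial L}{\partial \hat{y}}
  &= -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}
   = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})}, \\
\frac{\partial \hat{y}}{\partial z}
  &= \sigma(z)\bigl(1-\sigma(z)\bigr) = \hat{y}(1-\hat{y}), \\
\frac{\partial L}{\partial z}
  &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}
   = \hat{y} - y, \\
\nabla_{\mathbf{w}} L &= (\hat{y} - y)\,\mathbf{x},
\qquad
\frac{\partial L}{\partial b} = \hat{y} - y.
\end{aligned}
```

Averaging over the $N$ examples and stacking the $\mathbf{x}^{(i)}$ as rows of $\mathbf{X}$ gives $\nabla_{\mathbf{w}} J = \frac{1}{N}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y})$ and $\frac{\partial J}{\partial b} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}^{(i)} - y^{(i)})$.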

Key Insight

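The gradient $\frac{1}{N}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y})$ has the same form as the MSE gradient in linear regression (up to a constant factor): the prediction error weighted by the inputs. This happens because the derivative of the cross-entropy loss with respect to $\hat{y}$ has $\hat{y}(1-\hat{y})$ in its denominator, which cancels the sigmoid's derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$ in the chain rule and leaves just $\hat{y} - y$.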
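To see the gradient formula at work, here is a from-scratch sketch of batch gradient descent in NumPy, with a finite-difference check of the analytic gradient. Everything specific in it is an assumption made for illustration: the synthetic data, the `true_w` used to generate labels, the learning rate, and the iteration count.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    # Binary cross-entropy J(w, b); clipping keeps the logs finite.
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradients(w, b, X, y):
    error = sigmoid(X @ w + b) - y           # (y_hat - y), shape (N,)
    return X.T @ error / len(y), error.mean()  # dJ/dw, dJ/db

# Synthetic data (assumption): labels drawn from a known logistic model.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (rng.random(200) < sigmoid(X @ true_w)).astype(float)

w, b = np.zeros(3), 0.0

# Finite-difference check of the analytic gradient at the starting point.
grad_w, grad_b = gradients(w, b, X, y)
h = 1e-6
numeric = np.array([
    (cost(w + h * e, b, X, y) - cost(w - h * e, b, X, y)) / (2 * h)
    for e in np.eye(3)
])

# Plain batch gradient descent.
initial_cost = cost(w, b, X, y)
learning_rate = 0.5
for _ in range(500):
    gw, gb = gradients(w, b, X, y)
    w -= learning_rate * gw
    b -= learning_rate * gb
final_cost = cost(w, b, X, y)

print(f"max gradient mismatch: {np.max(np.abs(grad_w - numeric)):.2e}")
print(f"cost: {initial_cost:.3f} -> {final_cost:.3f}")
```

With all parameters at zero every prediction is 0.5, so the starting cost is $\log 2 \approx 0.693$; the cost after training should be lower.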

Python Implementation

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data: 1000 samples, 10 features (5 informative)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)

# Hold out 20% for testing; fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# L2-regularized logistic regression; C is the inverse of the regularization strength
model = LogisticRegression(C=1.0, penalty='l2', solver='lbfgs')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard class labels
y_prob = model.predict_proba(X_test)[:, 1]   # estimated P(y=1 | x)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.3f}")
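To connect the fitted scikit-learn model back to the formulas, the sketch below recomputes the probabilities by hand from `coef_` and `intercept_`. It refits on the full synthetic dataset to stay self-contained, and it relies on the default L2 penalty rather than passing `penalty` explicitly; both are simplifications of our own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
model = LogisticRegression(C=1.0, solver='lbfgs').fit(X, y)

# z = w^T x + b, built from the learned weights and intercept
z = X @ model.coef_.ravel() + model.intercept_[0]

# Applying the sigmoid by hand should reproduce predict_proba for class 1
prob_manual = 1.0 / (1.0 + np.exp(-z))
prob_sklearn = model.predict_proba(X)[:, 1]

print(np.allclose(z, model.decision_function(X)))
print(np.allclose(prob_manual, prob_sklearn))
```

For a binary problem, `decision_function` returns the same $z$ values, and `predict` labels a sample as Class 1 when $z > 0$.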

Key Takeaways

Summary: Logistic Regression

Logistic regression outputs $P(y=1|x) = \sigma(\mathbf{w}^T\mathbf{x} + b) \in (0,1)$
Binary cross-entropy $J = -\frac{1}{N}\sum[y\log\hat{y} + (1-y)\log(1-\hat{y})]$ is the cost function
Decision boundary is a linear hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$
Multiclass: use Softmax regression (multinomial) or One-vs-Rest
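AUC-ROC is threshold-independent and more informative than accuracy on imbalanced datasets; under severe imbalance, precision-recall metrics are often more revealing
Regularization (L1 or L2, with strength set by C, the inverse of the regularization strength in scikit-learn) helps prevent overfitting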
Logistic regression is fast, interpretable, and a strong baseline for classification
The gradient has the elegant form $\nabla_{\mathbf{w}} J = \frac{1}{N}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y})$
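The softmax extension mentioned in the summary replaces the sigmoid with a function that turns a vector of class scores into probabilities that sum to 1. A minimal NumPy sketch follows; the function name `softmax` and the example scores are our own, and subtracting the row maximum is a standard stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    """Softmax along the last axis: exp(z_k) / sum_j exp(z_j)."""
    z = np.asarray(z, dtype=float)
    # Subtracting the max leaves the output unchanged but prevents overflow in exp.
    shifted = z - z.max(axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

# Hypothetical class scores for two samples and three classes.
scores = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(scores)
predicted_class = probs.argmax(axis=1)

# With two classes and scores [0, z], softmax reduces to the sigmoid of z.
z = 2.0
two_class = softmax(np.array([0.0, z]))

print(np.round(probs, 3))
print(f"softmax([0, 2])[1] = {two_class[1]:.4f}, sigmoid(2) = {1 / (1 + np.exp(-z)):.4f}")
```

The two-class case shows why binary logistic regression can be seen as a special case of softmax regression.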

What to Learn Next

-> Linear Regression From scatter plots to predictions — the simplest ML algorithm.

-> Naive Bayes Bayes' theorem in action — fast, simple, surprisingly powerful.

-> SVM Finding the optimal boundary — maximum margin classification.
