Classification with Probability — From Linear to Sigmoid
Logistic regression transforms linear outputs into probabilities using the sigmoid function. It is the foundation of classification in machine learning.
- Sigmoid Function — Map any real number to a probability between 0 and 1
- Cross-Entropy Loss — The cost function that powers classification training
- Multiclass Extension — Softmax regression for multiple classes
"The goal is to turn data into information, and information into insight."
Logistic Regression — Complete Guide for Classification
Despite its name, logistic regression is a classification algorithm. It predicts the probability that an input belongs to a class.
From Linear to Logistic Regression
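Definition: Logistic Regression
Given training data $\{(x_i, y_i)\}_{i=1}^{N}$ with $y_i \in \{0, 1\}$, logistic regression models $P(y = 1 \mid x)$ using the sigmoid function applied to a linear combination of features: $P(y = 1 \mid x) = \sigma(w^T x + b)$.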
Sigmoid Function
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Here,
- $\sigma(z)$ = Output probability (0 to 1)
- $z$ = Linear combination $w^T x + b$
- $e$ = Euler's number (~2.71828)
Example: Sigmoid Output
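For different values of $z$:
- $z = 0$: $\sigma(0) = 0.5$ (decision boundary)
- $z = 2$: $\sigma(2) \approx 0.88$ (likely Class 1)
- $z = -2$: $\sigma(-2) \approx 0.12$ (likely Class 0)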
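The mapping from a real-valued score to a probability can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the guide: the function name `sigmoid` and the sample inputs are assumptions, and the piecewise form is just one common way to keep `exp` from overflowing for large negative inputs.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid, 1 / (1 + exp(-z)), for array-like input."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the direct formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)) to avoid overflow.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

# Sample scores (assumed for illustration): negative, zero, positive.
probs = sigmoid([-2.0, 0.0, 2.0])
print(probs)
```

A score of zero lands exactly on 0.5, and scores of equal magnitude but opposite sign give probabilities that sum to 1.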
Cost Function
Why Not MSE?
Linear regression uses MSE, but in logistic regression the sigmoid makes MSE non-convex in the weights, so gradient descent can stall in flat regions or settle in a poor local minimum. Cross-entropy loss is convex in the weights and has a simple gradient.
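Definition: Binary Cross-Entropy (Log Loss)
The cost function for logistic regression. For a single example with label $y$ and predicted probability $\hat{y} = \sigma(w^T x + b)$: $L(y, \hat{y}) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$. The full cost over $N$ examples: $J(w, b) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$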
Example: Cross-Entropy Loss
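Case 1: $y = 1$ (using the natural log)
- $\hat{y} = 0.9$: $L = -\log(0.9) \approx 0.105$ (good: confident and correct)
- $\hat{y} = 0.1$: $L = -\log(0.1) \approx 2.303$ (bad: confident and wrong)
Case 2: $y = 0$
- $\hat{y} = 0.1$: $L = -\log(1 - 0.1) \approx 0.105$ (good)
- $\hat{y} = 0.9$: $L = -\log(1 - 0.9) \approx 2.303$ (bad)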
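The asymmetry between confident-correct and confident-wrong predictions is easy to check numerically. The sketch below is illustrative only: `binary_cross_entropy`, the `eps` clipping constant, and the sample probabilities are assumptions rather than anything specified in the guide.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy (natural log) over all examples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so log() stays finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return losses.mean()

# True label is 1 in both cases; only the predicted probability differs.
confident_right = binary_cross_entropy([1], [0.9])
confident_wrong = binary_cross_entropy([1], [0.1])
print(f"confident and correct: {confident_right:.3f}")
print(f"confident and wrong:   {confident_wrong:.3f}")
```

The loss grows without bound as a wrong prediction approaches full confidence, which is why the clipping step is included.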
Decision Boundary
Definition: Decision Boundary
The decision boundary is the hyperplane where $\sigma(w^T x + b) = 0.5$, which simplifies to $w^T x + b = 0$. This is a linear boundary in feature space. For nonlinear boundaries, add polynomial features or use kernel methods.
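Because the sigmoid equals 0.5 exactly when its input is zero, thresholding the probability at 0.5 is equivalent to checking the sign of the linear score. The sketch below illustrates this with hypothetical weights, bias, and points chosen purely for the example.

```python
import numpy as np

# Hypothetical parameters for a 2-feature model (illustrative values only).
w = np.array([1.5, -2.0])
b = 0.5

X = np.array([[2.0, 1.0],   # z =  1.5 (positive side)
              [0.0, 1.0],   # z = -1.5 (negative side)
              [1.0, 1.0]])  # z =  0.0 (exactly on the boundary)

z = X @ w + b
prob = 1.0 / (1.0 + np.exp(-z))

# Thresholding the probability at 0.5 matches checking the sign of z.
by_prob = (prob >= 0.5).astype(int)
by_sign = (z >= 0).astype(int)
print(by_prob, by_sign)
```

Points on the boundary are assigned to Class 1 here only because the comparison uses `>=`; that tie-breaking choice is a convention, not part of the model.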
Gradient Derivation
Gradient of Cross-Entropy Loss
$$\nabla_w J = \frac{1}{N} X^T (\hat{y} - y)$$
Here,
- $\hat{y}$ = Vector of predicted probabilities $\sigma(Xw + b)$
- $y$ = Vector of true labels
- $X$ = Design matrix ($N \times d$)
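One way to put this gradient to work is plain batch gradient descent, repeatedly stepping the weights by $-\eta \cdot \frac{1}{N} X^T(\hat{y} - y)$. The sketch below is a minimal illustration under stated assumptions: the function name `fit_logistic`, the learning rate, the iteration count, and the toy dataset are all hypothetical, and it omits regularization and convergence checks.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the mean binary cross-entropy."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        error = y_hat - y                           # shape (n,)
        w -= lr * (X.T @ error) / n                 # gradient w.r.t. w
        b -= lr * error.mean()                      # gradient w.r.t. b
    return w, b

# Toy data (assumed for illustration): label is 1 when x0 + x1 > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = fit_logistic(X, y)
train_acc = np.mean(((X @ w + b) >= 0) == (y == 1))
print(f"weights={w}, bias={b:.3f}, train accuracy={train_acc:.3f}")
```

In practice a library solver such as scikit-learn's `lbfgs` is usually preferable; the loop is only meant to show how the gradient formula drives the update.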
Key Insight
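The gradient has the same form as in linear regression: the prediction error $\hat{y} - y$ weighted by the features. This is because the sigmoid's derivative, $\sigma(z)(1 - \sigma(z))$, cancels the denominator that appears when differentiating the cross-entropy loss in the chain rule, leaving a clean and interpretable gradient.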
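The cancellation can be made explicit for a single example. The following is a sketch of the chain rule, assuming $\hat{y} = \sigma(z)$, $z = w^T x + b$, and the per-example loss $L = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$:

$$
\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})},
\qquad
\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y}),
\qquad
\frac{\partial z}{\partial w} = x
$$

$$
\frac{\partial L}{\partial w}
= \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y}) \cdot x
= (\hat{y} - y)\, x
$$

Averaging this over $N$ examples and stacking the rows $x_i^T$ into $X$ gives $\frac{1}{N} X^T(\hat{y} - y)$.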
Python Implementation
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
# Synthetic binary classification data: 1000 samples, 10 features (5 informative)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
# Hold out 20% for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# C is the inverse regularization strength: smaller C means a stronger L2 penalty
model = LogisticRegression(C=1.0, penalty='l2', solver='lbfgs')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)              # hard class labels
y_prob = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.3f}")
Key Takeaways
Summary: Logistic Regression
- Logistic regression outputs $P(y = 1 \mid x) = \sigma(w^T x + b)$
- Binary cross-entropy is the cost function
- Decision boundary is a linear hyperplane
- Multiclass: use Softmax regression (multinomial) or One-vs-Rest
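- AUC-ROC is more informative than accuracy on imbalanced datasets; under heavy imbalance, also check precision and recall
- Regularization (L1/L2) helps reduce overfitting; in scikit-learn, C is the inverse regularization strength, so a smaller C means stronger regularization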
- Logistic regression is fast, interpretable, and a strong baseline for classification
- The gradient has the elegant form $\nabla_w J = \frac{1}{N} X^T(\hat{y} - y)$
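For the multiclass case mentioned above, softmax plays the role that the sigmoid plays for two classes: it turns a vector of class scores into probabilities that sum to 1. The sketch below is illustrative only; the function name `softmax` and the example scores are assumptions.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; each row of the result sums to 1."""
    logits = np.asarray(logits, dtype=float)
    # Subtracting the row max leaves the result unchanged but avoids overflow.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical class scores for two inputs and three classes.
scores = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(scores)
predicted = probs.argmax(axis=1)
print(probs)
print(predicted)
```

With only two classes, softmax over the scores $(z, 0)$ reduces to the sigmoid $\sigma(z)$, so binary logistic regression is a special case.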
What to Learn Next
-> Linear Regression: From scatter plots to predictions, the simplest ML algorithm.
-> Naive Bayes: Bayes' theorem in action. Fast, simple, surprisingly powerful.
-> SVM: Finding the optimal boundary with maximum margin classification.