Logistic Regression

Why It Matters

Logistic regression is the baseline classifier and the foundation for neural network classification. It models binary outcomes using the sigmoid function and produces interpretable odds ratios. Understanding its coefficients, evaluation metrics, and relationship to more complex models is essential, because it is the usual starting point for a classification task.

Overview

Logistic regression models the probability that a binary outcome $Y = 1$ given features $X$. Unlike linear regression, it passes the linear predictor through the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$ to obtain a probability between 0 and 1. The model is linear in log-odds space: $\log(p/(1-p)) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$. Exponentiated coefficients ($e^{\beta_j}$) are odds ratios, which give intuitive effect-size estimates. A classification threshold (typically 0.5) converts probabilities to binary predictions. The model is fitted by maximum likelihood estimation (MLE), not ordinary least squares (OLS).

Key Concepts

Logistic Regression Model

$$P(Y=1|X) = \sigma(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)$$

Here,

$\sigma(z)$ = Sigmoid function: $1 / (1 + e^{-z})$
$\beta_0, \beta_1, \ldots$ = Model coefficients
$x_1, \ldots, x_p$ = Input features
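
The model equation can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function names `sigmoid` and `predict_proba` and the coefficient values are assumptions chosen for the example.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued linear predictor to a probability."""
    # Branch on the sign of z so math.exp never overflows for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(intercept: float, coefs: list, x: list) -> float:
    """P(Y=1 | x) = sigmoid(beta_0 + sum_j beta_j * x_j)."""
    z = intercept + sum(b * xj for b, xj in zip(coefs, x))
    return sigmoid(z)

# Hypothetical coefficients and feature values, chosen only for illustration.
p = predict_proba(intercept=-1.0, coefs=[0.5, 2.0], x=[2.0, 0.0])
print(round(p, 3))  # the linear predictor is 0 here, so p = 0.5
```

A linear predictor of 0 always corresponds to a probability of 0.5, which is why a 0.5 threshold is the same as asking whether the linear predictor is positive.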

Log-Odds (Logit) Link

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

Here,

$p$ = $P(Y=1|X)$, the probability of class 1
$\frac{p}{1-p}$ = Odds of the outcome
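
To make the link concrete, the sketch below converts a few probabilities to odds and log-odds and back again. The helper names `logit` and `inv_logit` are illustrative assumptions. For simplicity `inv_logit` uses the textbook formula, which can overflow for very large negative inputs.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z: float) -> float:
    """Sigmoid: converts log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.1, 0.5, 0.9):
    odds = p / (1.0 - p)
    back = inv_logit(logit(p))
    print(f"p={p}  odds={odds:.3f}  log-odds={logit(p):+.3f}  round-trip={back:.3f}")
```

Probabilities of 0.1 and 0.9 give log-odds of equal size and opposite sign, which reflects the symmetry of the logit around $p = 0.5$.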

Odds Ratio

$$\text{OR} = e^{\beta_j}$$

Here,

$\beta_j$ = Coefficient for feature $j$
$e^{\beta_j}$ = Multiplicative effect on the odds for a one-unit increase in $x_j$
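
The following sketch checks numerically that a one-unit increase in a feature multiplies the odds by $e^{\beta_j}$ wherever the increase starts. The one-feature model and its coefficient values are hypothetical.

```python
import math

def odds(intercept: float, beta: float, x: float) -> float:
    """Odds of Y=1 in a one-feature model: exp(beta_0 + beta_1 * x)."""
    return math.exp(intercept + beta * x)

# Hypothetical one-feature model, for illustration only.
intercept, beta = -2.0, 0.7

odds_ratio = math.exp(beta)
for x in (0.0, 1.0, 5.0):
    ratio = odds(intercept, beta, x + 1.0) / odds(intercept, beta, x)
    print(f"x={x}: odds(x+1)/odds(x) = {ratio:.4f}")
print(f"exp(beta) = {odds_ratio:.4f}")
```

The ratio is the same at every starting value of `x`. The change in probability, by contrast, does depend on the starting point.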

Log-Likelihood

$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i) \log(1-\hat{p}_i) \right]$$

Here,

$y_i$ = Observed outcome (0 or 1)
$\hat{p}_i$ = Predicted probability for observation $i$
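
As a rough sketch of what maximizing this quantity looks like, the code below evaluates the log-likelihood for a one-feature model and improves it with plain gradient ascent. The toy data, learning rate, and step count are all made up for illustration. Production libraries typically use faster solvers such as Newton-type methods.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0: float, b1: float, xs: list, ys: list) -> float:
    """Sum of y*log(p) + (1-y)*log(1-p) over all observations."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit(xs: list, ys: list, lr: float = 0.1, steps: int = 5000):
    """Plain gradient ascent on the mean log-likelihood (one feature)."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the log-likelihood: residuals (y - p) times each input.
        g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
        g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Made-up toy data whose classes overlap, so the maximum is finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit(xs, ys)
print(f"start: {log_likelihood(0.0, 0.0, xs, ys):.3f}")
print(f"fit:   {log_likelihood(b0, b1, xs, ys):.3f}  (b0={b0:.2f}, b1={b1:.2f})")
```

The fitted log-likelihood is higher (less negative) than the value at the all-zero starting point. The toy classes overlap on purpose: with perfectly separable data the coefficients would grow without bound.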

Classification Metrics

Metric	Formula	Use Case
Accuracy	$\frac{TP+TN}{TP+TN+FP+FN}$	Balanced classes
Precision	$\frac{TP}{TP+FP}$	False positives are costly
Recall	$\frac{TP}{TP+FN}$	False negatives are costly
F1 Score	$\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$	Balance precision and recall
AUC-ROC	Area under ROC curve	Threshold-independent evaluation
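
AUC-ROC has a useful interpretation that a short sketch can show: it equals the share of (positive, negative) pairs in which the positive example gets the higher score. The function name `auc_roc` and the labels and scores below are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def auc_roc(labels: list, scores: list) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(round(auc_roc(labels, scores), 3))  # 8 of 9 pairs are ordered correctly
```

Because only the ordering of the scores matters, no threshold appears anywhere in the computation.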

Odds Ratio Interpretation

OR Value	Interpretation
OR = 1	No association
OR > 1	Positive association (increases odds)
OR < 1	Negative association (decreases odds)
OR = 2	Doubles the odds
OR = 0.5	Halves the odds

Quick Example

Interpreting Logistic Coefficient

$\hat{\beta}_1 = 0.5$ for age in a logistic regression predicting disease.

Odds ratio = $e^{0.5} \approx 1.65$. For each one-year increase in age, the odds of disease are multiplied by about 1.65 (a 65% increase in odds). If the 95% CI for the OR were, say, [1.2, 2.3], the interval would exclude 1, so the effect would be statistically significant at the 5% level.
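
An odds ratio multiplies odds, not probabilities, and a small calculation makes the difference visible. The 10% baseline probability below is a hypothetical value introduced only for this sketch. The coefficient is the one from the example.

```python
import math

beta_age = 0.5                    # coefficient from the example
odds_ratio = math.exp(beta_age)   # about 1.65

# Hypothetical baseline: a 10% disease probability at some reference age.
p_before = 0.10
odds_before = p_before / (1.0 - p_before)
odds_after = odds_before * odds_ratio      # one more year of age
p_after = odds_after / (1.0 + odds_after)

print(f"odds ratio : {odds_ratio:.3f}")
print(f"odds       : {odds_before:.3f} -> {odds_after:.3f}")
print(f"probability: {p_before:.3f} -> {p_after:.3f}")
```

Under this assumed baseline the probability rises from 10% to roughly 15.5%, which is less than the 16.5% that multiplying the probability itself by 1.65 would suggest.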

Threshold Trade-Off

A model predicts $P(\text{spam}) = 0.45$ for an email. With a threshold of 0.5, the email is classified as not spam. If missing spam (a false negative) is costly, lower the threshold to 0.3: this email is then flagged as spam, and the filter catches more spam overall but also flags more legitimate emails. The right threshold depends on the relative costs of false positives and false negatives.
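
The same decision can be written as a one-line rule. In this sketch the helper name `classify` is hypothetical, and treating a probability exactly equal to the threshold as positive is an assumed convention.

```python
def classify(prob: float, threshold: float = 0.5) -> int:
    """Return 1 (spam) when the predicted probability reaches the threshold."""
    return int(prob >= threshold)

p_spam = 0.45  # the email from the example
for t in (0.5, 0.3):
    label = "spam" if classify(p_spam, t) else "not spam"
    print(f"threshold={t}: {label}")
# threshold=0.5: not spam
# threshold=0.3: spam
```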

Confusion Matrix

A classifier makes 80 correct and 20 incorrect predictions on 100 samples:

	Predicted Positive	Predicted Negative
Actual Positive	TP = 45	FN = 5
Actual Negative	FP = 15	TN = 35

Accuracy = (45+35)/100 = 80%. Precision = 45/(45+15) = 75%. Recall = 45/(45+5) = 90%. F1 = 2(0.75)(0.9)/(0.75+0.9) = 81.8%.
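
These figures can be reproduced directly from the four cell counts. The function name `classification_metrics` is an assumption for this sketch, and it does not guard against empty rows or columns, which would cause a division by zero.

```python
def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

m = classification_metrics(tp=45, fn=5, fp=15, tn=35)
for name, value in m.items():
    print(f"{name:9s} {value:.3f}")
# accuracy  0.800
# precision 0.750
# recall    0.900
# f1        0.818
```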

Key Takeaways

Summary: Logistic Regression

Sigmoid Function: Maps any real number to $(0, 1)$, making it well suited to probability estimation.
Log-Odds Linear: The model is linear in log-odds space, not probability space, so predicted probabilities can never fall outside $[0, 1]$.
Odds Ratio: $e^{\beta_j}$ gives the multiplicative effect on the odds for a one-unit increase in $x_j$. It is the model's most interpretable output.
MLE: Fitted by maximizing the log-likelihood, not by OLS. There is no closed-form solution, so the fit is iterative.
Threshold: Default 0.5, but can be adjusted to trade off precision vs. recall based on business costs.
Evaluation: Use accuracy for balanced classes, precision when false positives are costly, recall when false negatives are costly, F1 to balance the two, and AUC-ROC for threshold-independent evaluation.
Foundation: Logistic regression is the baseline for binary classification. Neural networks generalize it with hidden layers and non-linear activations.
