Logistic Regression: Decision Boundary, Cost Function & Multiclass
The workhorse of binary classification in industry
Interview Question
"Explain the decision boundary in logistic regression. Why do we use the logistic (sigmoid) function instead of a linear function for classification? How do you extend logistic regression to multiclass problems?"
Difficulty: Medium | Frequently asked at Meta, Microsoft, Amazon
Theoretical Foundation
The Logistic Model
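Logistic regression models the probability that a binary outcome $y = 1$ occurs given input features $x$:
$$P(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z = w^\top x + b$ is the linear combination of features and the sigmoid function $\sigma$ maps any real number to the interval $(0, 1)$.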
Properties of the sigmoid function:
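- $\sigma(0) = 0.5$ (decision boundary)
- $\sigma(-z) = 1 - \sigma(z)$ (symmetry)
- $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$ (efficient gradient computation)
- As $z \to \infty$, $\sigma(z) \to 1$
- As $z \to -\infty$, $\sigma(z) \to 0$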
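The properties listed above can be checked numerically. The following is a minimal pure-Python sketch; the function names `sigmoid` and `sigmoid_grad`, and the branch on the sign of `z` (used so that `math.exp` never overflows), are illustrative choices rather than anything prescribed by the text.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite as e^z / (1 + e^z) so exp never overflows.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sigmoid_grad(z: float) -> float:
    """Derivative computed from the function value: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))                    # value at the decision boundary
print(sigmoid(-2.0) + sigmoid(2.0))    # symmetry: the two values sum to 1
print(sigmoid(40.0), sigmoid(-40.0))   # saturation towards 1 and 0
print(sigmoid_grad(0.0))               # the slope is largest at z = 0
```

Because the derivative reuses the already computed value of the sigmoid, the backward pass costs only one extra multiplication per example.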
ℹ️
Key Insight: The sigmoid function is the inverse of the logit function: $\operatorname{logit}(p) = \ln\frac{p}{1 - p}$, so $\operatorname{logit}(\sigma(z)) = z$. This means logistic regression models the log-odds of the outcome as a linear function of the features.
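To make the log-odds reading concrete, the sketch below applies the logit to the output of a one-feature model and recovers the linear score. The parameter values `w` and `b` are hypothetical and chosen only for illustration.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical one-feature model: z = w * x + b.
w, b = 0.8, -1.0
for x in (0.0, 1.0, 2.0):
    p = sigmoid(w * x + b)
    # logit(p) recovers the linear score, so it rises by w per unit of x.
    print(x, round(p, 4), round(logit(p), 4))
```

Each unit increase in `x` adds `w` to the log-odds, which is the same as multiplying the odds by `exp(w)`.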
Why Not Use Linear Regression for Classification?
There are three fundamental problems:
- Unbounded Output: Linear regression predicts values in $(-\infty, \infty)$, but probabilities must be in $[0, 1]$
- Non-normal Errors: The error term is not normally distributed for binary outcomes (the outcome follows a Bernoulli distribution, so for any given input the error can take only two values)
- Heteroscedasticity: The variance of the error term, $p(1 - p)$, depends on the input, violating the constant-variance assumption of OLS
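The unbounded-output problem is easy to reproduce. The sketch below fits an ordinary least-squares line to a made-up one-feature binary dataset (the data and the helper name `fit_line` are assumptions for illustration) and evaluates it outside the training range.

```python
def fit_line(xs, ys):
    """Closed-form simple linear regression (ordinary least squares, one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up data: the label is 1 for larger values of x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]
slope, intercept = fit_line(xs, ys)

for x in (0.0, 3.5, 10.0):
    # The fitted "probability" falls below 0 at x = 0 and rises above 1 at x = 10.
    print(x, round(slope * x + intercept, 3))
```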
The Decision Boundary
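The decision boundary is the hypersurface where the model switches between predicting class 0 and class 1. For logistic regression:
$$w^\top x + b = 0$$
This is because, with $z = w^\top x + b$:
- When $z > 0$: $\sigma(z) > 0.5$ → Predict class 1
- When $z < 0$: $\sigma(z) < 0.5$ → Predict class 0
- When $z = 0$: $\sigma(z) = 0.5$ → Boundary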
The decision boundary is always linear (a hyperplane in feature space). This is both a strength (simple, interpretable) and a limitation (cannot capture non-linear relationships without feature engineering).
⚠️
Common Misconception: Many candidates think the decision boundary in logistic regression is curved because the probability is non-linear. The boundary itself is linear; only the probability mapping is non-linear.
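For two features the boundary can be written out explicitly as a straight line. The sketch below uses hypothetical parameters `w1`, `w2`, and `b` (not taken from any fitted model) and assumes the usual 0.5 threshold.

```python
import math

# Hypothetical parameters for a two-feature model.
w1, w2, b = 1.5, -2.0, 0.5

def predict_proba(x1: float, x2: float) -> float:
    z = w1 * x1 + w2 * x2 + b
    return 1.0 / (1.0 + math.exp(-z))

def boundary_x2(x1: float) -> float:
    """Solve w1*x1 + w2*x2 + b = 0 for x2: a straight line in the (x1, x2) plane."""
    return -(w1 * x1 + b) / w2

for x1 in (-1.0, 0.0, 2.0):
    x2 = boundary_x2(x1)
    # On the line the probability is 0.5 (up to rounding); stepping off it picks a class.
    print(x1, x2, predict_proba(x1, x2), predict_proba(x1, x2 - 1.0) > 0.5)
```

The probability changes along an S-shaped curve as a point moves away from the line, but the set of points where it equals 0.5 is still the line itself.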
The Cost Function
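Logistic regression uses Binary Cross-Entropy (BCE) loss, also called log loss:
$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$
where $\hat{y}^{(i)} = \sigma(w^\top x^{(i)} + b)$ is the predicted probability for example $i$ and $m$ is the number of training examples.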
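As a rough illustration, BCE can be computed in a few lines of plain Python. The clipping constant `eps` is an assumption added here to avoid `log(0)`; it is a common practical guard, not part of the definition.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
print(binary_cross_entropy(y_true, [0.9, 0.1, 0.8, 0.2]))  # confident and right: small loss
print(binary_cross_entropy(y_true, [0.1, 0.9, 0.2, 0.8]))  # confident and wrong: large loss
```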
Why not use Mean Squared Error?
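If we used MSE:
$$J_{\text{MSE}}(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left( \sigma(w^\top x^{(i)} + b) - y^{(i)} \right)^2$$
The problem is that composing MSE with the sigmoid creates a non-convex loss surface in the parameters, which can have local minima and flat regions where the sigmoid saturates, making optimization difficult. BCE is convex in the parameters, so any local minimum is a global minimum.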
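One way to see the difference is to compare the per-example loss and its derivative with respect to the score $z$ as a prediction becomes more confidently wrong. This is a small sketch with illustrative helper names; the derivative formulas follow from the chain rule applied to each loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_and_grad(z, y):
    """Per-example BCE loss and its derivative with respect to the score z."""
    p = sigmoid(z)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss, p - y

def mse_and_grad(z, y):
    """Per-example squared error on the sigmoid output and its derivative w.r.t. z."""
    p = sigmoid(z)
    return (p - y) ** 2, 2 * (p - y) * p * (1 - p)

# The true label is 1; the score z becomes more and more confidently wrong.
for z in (0.0, -2.0, -5.0, -10.0):
    print(z, bce_and_grad(z, 1), mse_and_grad(z, 1))
```

The BCE loss keeps growing and its gradient stays close to -1, while the squared error is capped below 1 and its gradient shrinks towards 0, so a badly wrong example provides almost no learning signal under MSE.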
Mathematical Derivation of BCE:
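From maximum likelihood estimation, with $\hat{y}^{(i)} = \sigma(w^\top x^{(i)} + b)$ and $m$ independent examples:
$$L(w, b) = \prod_{i=1}^{m} \left(\hat{y}^{(i)}\right)^{y^{(i)}} \left(1 - \hat{y}^{(i)}\right)^{1 - y^{(i)}}$$
Taking the log:
$$\log L(w, b) = \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$
Minimizing the negative log-likelihood, averaged over the $m$ examples, gives us BCE.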
Gradient of the Cost Function
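The gradient has a beautiful form:
$$\nabla_w J = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$
where $\hat{y}^{(i)} = \sigma(w^\top x^{(i)} + b)$.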
This is identical in form to the OLS gradient, which is why logistic regression can be solved with similar optimization algorithms.
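A quick way to build confidence in the "prediction minus label, times input" form is to compare it with a finite-difference estimate. The data, parameter values, and step size `h` below are made up for the check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, X, y):
    """Mean binary cross-entropy for weights w and bias b."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total += -(yi * math.log(p) + (1 - yi) * math.log(1 - p))
    return total / len(X)

def gradient(w, b, X, y):
    """Analytic gradient: average of (p - y) * x for w, and of (p - y) for b."""
    grad_w, grad_b = [0.0] * len(w), 0.0
    for xi, yi in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
        grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
        grad_b += err
    m = len(X)
    return [g / m for g in grad_w], grad_b / m

# Made-up data and parameters, used only to compare against a finite difference.
X = [[0.5, 1.2], [1.5, -0.3], [-1.0, 0.8], [2.0, 2.0]]
y = [0, 1, 0, 1]
w, b, h = [0.3, -0.2], 0.1, 1e-6

grad_w, grad_b = gradient(w, b, X, y)
numeric_w0 = (cost([w[0] + h, w[1]], b, X, y) - cost([w[0] - h, w[1]], b, X, y)) / (2 * h)
numeric_b = (cost(w, b + h, X, y) - cost(w, b - h, X, y)) / (2 * h)
print(grad_w[0], numeric_w0)  # the two values should agree closely
print(grad_b, numeric_b)
```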
Optimization Algorithms
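1. Gradient Descent:
$$w \leftarrow w - \alpha \nabla_w J(w)$$
where $\alpha$ is the learning rate.
2. Newton-Raphson (Second-Order):
$$w \leftarrow w - H^{-1} \nabla_w J(w)$$
where $H$ is the Hessian matrix. Faster convergence but $O(n^3)$ per iteration for $n$ features.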
3. L-BFGS (Limited-memory BFGS):
- Approximates the Hessian using limited memory
- Default solver for LogisticRegression in scikit-learn
- Good balance of speed and memory efficiency
4. Coordinate Descent (for regularized logistic regression):
- Used when L1/L2 penalties are added
- Solves one coordinate at a time
- Very efficient for high-dimensional sparse data
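Of the algorithms above, plain gradient descent is the simplest to write down. The following is a minimal sketch, assuming batch updates, a fixed learning rate `lr`, and a made-up non-separable dataset; the function name `fit_gradient_descent` and its defaults are illustrative, not a reference implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_gradient_descent(X, y, lr=0.1, n_iters=2000):
    """Batch gradient descent on mean binary cross-entropy."""
    n_features, m = len(X[0]), len(X)
    w, b = [0.0] * n_features, 0.0
    for _ in range(n_iters):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step against the averaged gradient.
        w = [wj - lr * gj / m for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b

# Made-up, non-separable 1-D data, so the optimal weights are finite.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 1, 0, 1, 1]
w, b = fit_gradient_descent(X, y)
preds = [int(sigmoid(w[0] * xi[0] + b) >= 0.5) for xi in X]
print(w, b, preds)
```

On perfectly separable data the unregularized weights keep growing without bound, which is one reason a penalty term is usually added in practice.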
Multiclass Extensions
One-vs-Rest (OvR) / One-vs-All
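Trains $K$ binary classifiers, one for each class $k = 1, \dots, K$:
$$P(y = k \mid x) = \sigma(w_k^\top x + b_k)$$
Prediction: Choose the class with the highest probability, $\hat{y} = \arg\max_k \sigma(w_k^\top x + b_k)$.
Pros: Simple, parallelizable
Cons: Class probabilities don't sum to 1; can be suboptimal when classes are not balanced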
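The sketch below shows the prediction step and the "don't sum to 1" caveat. The three classifiers, their class names, and their parameters are hypothetical stand-ins for separately trained binary models.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical (weights, bias) pairs from K = 3 separately trained binary models.
classifiers = {
    "cat": ([1.0, -0.5], 0.2),
    "dog": ([-0.3, 0.8], -0.1),
    "bird": ([0.1, 0.1], -1.0),
}

def ovr_scores(x):
    """Each binary model scores 'this class vs everything else' independently."""
    return {
        name: sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        for name, (w, b) in classifiers.items()
    }

scores = ovr_scores([1.0, 2.0])
prediction = max(scores, key=scores.get)
print(scores)
print(sum(scores.values()))  # not constrained to equal 1
print(prediction)
```

If a proper distribution over classes is needed, the scores are sometimes renormalized by dividing each one by their sum, though this is a post-hoc fix rather than something the individual classifiers were trained for.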
Multinomial Logistic Regression (Softmax)
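Directly models all $K$ classes using the softmax function:
$$P(y = k \mid x) = \frac{e^{w_k^\top x + b_k}}{\sum_{j=1}^{K} e^{w_j^\top x + b_j}}$$
Properties:
- Probabilities sum to 1: $\sum_{k=1}^{K} P(y = k \mid x) = 1$
- Requires $K$ weight vectors (or $K - 1$ if one class is the reference)
- Uses categorical cross-entropy loss: $J = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \mathbb{1}[y^{(i)} = k] \log P(y = k \mid x^{(i)})$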
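A minimal sketch of softmax and categorical cross-entropy follows. Subtracting the maximum score before exponentiating is a standard stability trick assumed here; it leaves the result unchanged because softmax is invariant to adding a constant to every score.

```python
import math

def softmax(scores):
    """Softmax with the max subtracted first to avoid overflow."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(prob_rows, labels):
    """Mean negative log-probability assigned to the true class index."""
    return -sum(math.log(p[k]) for p, k in zip(prob_rows, labels)) / len(labels)

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(softmax([1000.0, 1001.0]))                # large scores do not overflow
print(categorical_cross_entropy([probs], [0]))  # loss when class 0 is the true class
```

With two classes, `softmax([z, 0.0])[0]` equals the sigmoid of `z`, which is why binary logistic regression is the $K = 2$ special case with one class as the reference.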
💡
Production Tip: In practice, multinomial logistic regression (softmax) usually outperforms one-vs-rest. Meta uses softmax regression for their content recommendation ranking systems.
Code Implementation
Explanation of Code
- Decision Boundary: Visualizes the linear decision boundary and shows the equation of the separating hyperplane.
- Probability Output: Demonstrates how logistic regression outputs probabilities and how different thresholds affect predictions.
- Multiclass Comparison: Compares One-vs-Rest vs Multinomial (Softmax) approaches, showing that Softmax produces probabilities that sum to 1 across classes.
- Cost Function: Illustrates why BCE is preferred over MSE for classification, showing BCE penalizes confident wrong predictions more heavily.
- Regularization: Shows how C (inverse of λ) controls model complexity in logistic regression.
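The regularization point can be sketched without any library. The example below is an illustration under stated assumptions, not the listing this section describes: it minimizes mean BCE plus an L2 penalty with $\lambda = 1/C$ on a single weight, leaves the bias unpenalized, and uses made-up data. The exact scaling between `C` and the penalty differs between libraries.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_l2(xs, ys, C, lr=0.05, n_iters=5000):
    """Gradient descent on mean BCE + (1 / (2 * C)) * w**2 for one feature.

    Assumption: lambda = 1 / C, and the bias b is left unpenalized.
    """
    w, b, m = 0.0, 0.0, len(xs)
    lam = 1.0 / C
    for _ in range(n_iters):
        errs = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errs, xs)) / m + lam * w
        grad_b = sum(errs) / m
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up, non-separable data.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
weights = {C: fit_l2(xs, ys, C)[0] for C in (0.1, 1.0, 10.0)}
print(weights)  # the weight grows in magnitude as C increases (weaker penalty)
```

A small `C` pulls the weight towards zero and flattens the predicted probabilities; a large `C` lets the model fit the training data more closely.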
Real-World Applications
Meta: Content Ranking
Meta uses logistic regression (and its neural network extensions) for:
- News Feed Ranking: Predicting probability of user engagement
- Ad Targeting: Estimating click-through rates (CTR)
- Content Moderation: Classifying potentially harmful content
Microsoft: Spam Detection
Microsoft's email spam filters use multinomial logistic regression with:
- TF-IDF features from email text
- Header features (sender reputation, time sent)
- Behavioral features (sender-recipient interaction history)
Industry Best Practices
- Feature Scaling: Standardize features before fitting, especially when using regularization or gradient-based solvers
- Class Imbalance: Use class_weight='balanced' or SMOTE
- Multicollinearity: Check VIF values; use regularization if VIF > 10
- Model Calibration: Use CalibratedClassifierCV if probabilities need to be well-calibrated
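Two of these practices reduce to short formulas. The sketch below standardizes a feature column and computes class weights as `n_samples / (n_classes * count)`, which is the heuristic scikit-learn documents for `class_weight='balanced'`. The helper names and data are made up, and the code is a plain-Python illustration rather than the library's implementation.

```python
from collections import Counter
from statistics import mean, pstdev

def standardize(column):
    """Z-score a feature column: zero mean, unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

def balanced_class_weights(labels):
    """weight_k = n_samples / (n_classes * count_k)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

scaled = standardize([10.0, 20.0, 30.0, 40.0])
class_weights = balanced_class_weights([0] * 90 + [1] * 10)
print(scaled)
print(class_weights)  # the rare class receives the larger weight
```

In a real pipeline the mean and standard deviation would be estimated on the training split only and then reused for validation and test data.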
💡
Meta Interview Tip: Be prepared to discuss how logistic regression scales to billions of samples. Mention techniques like stochastic gradient descent, feature hashing, and parameter servers.
Common Follow-Up Questions
Q1: Why is the sigmoid function preferred over other activation functions for binary classification?
The sigmoid function is preferred because:
- It's the inverse of the logit function, giving a natural probabilistic interpretation
- Its derivative is easy to compute
- It arises naturally from the exponential family distribution
- It ensures outputs are in $(0, 1)$
Q2: How do you handle multiclass problems where classes are not mutually exclusive?
Use One-vs-Rest (OvR) instead of Softmax. Softmax assumes mutual exclusivity (probabilities sum to 1). For multi-label classification, train independent binary classifiers for each class.
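The difference shows up as soon as an example genuinely carries two labels. In the sketch below the label names and scores are hypothetical; independent sigmoids can switch on several labels at once, while softmax forces them to compete for a total of 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-label scores for one article about both sports and politics.
labels = ["sports", "politics", "cooking"]
scores = [3.0, 2.5, -4.0]

independent = [sigmoid(s) for s in scores]  # one binary decision per label
exclusive = softmax(scores)                 # probabilities share a total of 1

multi_label = [name for name, p in zip(labels, independent) if p >= 0.5]
single_label = [name for name, p in zip(labels, exclusive) if p >= 0.5]
print(multi_label)
print(single_label)
```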
Q3: What is the connection between logistic regression and neural networks?
Logistic regression is equivalent to a single-layer neural network with:
- One output neuron
- Sigmoid activation function
- Binary cross-entropy loss
Deep learning extends this by adding hidden layers and non-linear activations.
Q4: How do you detect and handle class imbalance in logistic regression?
- Resampling: SMOTE (oversampling minority) or undersampling majority
- Class weights: Set class_weight='balanced' in sklearn
- Threshold tuning: Lower threshold to increase recall
- Metrics: Use F1, AUC-PR instead of accuracy
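Threshold tuning in particular is easy to demonstrate. The sketch below computes precision and recall for the positive class at two thresholds; the labels and predicted probabilities are made up to mimic an imbalanced problem.

```python
def precision_recall(y_true, y_prob, threshold):
    """Precision and recall of the positive class at a given probability threshold."""
    preds = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and predicted probabilities: 7 negatives, 3 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_prob = [0.05, 0.1, 0.15, 0.2, 0.3, 0.35, 0.6, 0.4, 0.45, 0.8]

for t in (0.5, 0.3):
    print(t, precision_recall(y_true, y_prob, t))
```

On this toy data, lowering the threshold from 0.5 to 0.3 catches every positive. Precision happens to stay the same here, but in general it tends to drop as the threshold is lowered, which is why the threshold is usually chosen on a validation set against a metric such as F1.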
Company-Specific Tips
Meta Interview Tips
- Discuss online learning variants for streaming data
- Be ready to explain probability calibration techniques
- Mention A/B testing frameworks for model comparison
- Talk about feature importance in high-dimensional sparse data
Microsoft Interview Tips
- Focus on interpretability requirements in regulated industries
- Discuss model serving at scale (batch vs real-time)
- Be prepared to explain regularization choices in production
- Mention monitoring for model drift in live systems