Simple Linear Regression
Regression Analysis
Modeling the Relationship Between Two Variables
Simple linear regression quantifies how one variable changes with another, forming the foundation of predictive modeling. It estimates the line that best fits the data by minimizing squared residuals.
- Economics: Predict consumer spending based on income levels
- Healthcare: Model the relationship between dosage and patient response
- Engineering: Relate temperature to material expansion coefficients
Every complex model begins with understanding a single straight line.
Simple linear regression models the linear relationship between a predictor variable $x$ and a response variable $y$.
The Statistical Model
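Definition: Simple Linear Regression Model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n$$
where:
- $\beta_0$ is the intercept (predicted $y$ when $x = 0$)
- $\beta_1$ is the slope (expected change in $y$ per unit change in $x$)
- $\varepsilon_i$ is the error term: the part of $y_i$ not explained by $x_i$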
The Error Term
The error $\varepsilon_i$ represents the combined effect of all unmeasured factors. Under the classical assumptions, $\varepsilon_i \sim N(0, \sigma^2)$ i.i.d. This implies $E[y_i \mid x_i] = \beta_0 + \beta_1 x_i$, so the conditional mean is linear in $x$.
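To make the model concrete, here is a minimal simulation sketch: it draws data from a straight line plus independent normal errors and checks that a least-squares fit lands near the line used to generate it. The parameter values (`beta0`, `beta1`, `sigma`), the sample size, and the range of `x` are arbitrary choices made for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical "true" parameters, chosen only for illustration
beta0, beta1, sigma = 2.0, 0.5, 1.0
n = 2000

x = [random.uniform(0, 10) for _ in range(n)]
# Each response is the line plus an independent N(0, sigma^2) error
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]

# Least-squares fit using the centered sums of squares and cross-products
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

print(f"true line:   y = {beta0} + {beta1} x")
print(f"fitted line: y = {b0:.3f} + {b1:.3f} x")
```

With a large sample the fitted coefficients should sit close to the generating values; with a small sample or a larger `sigma` they scatter more widely around them.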
Ordinary Least Squares (OLS) Estimation
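OLS finds the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the sum of squared residuals:
Sum of Squared Residuals
$$SSR = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Here,
- $e_i$ = Residual: observed minus predicted ($y_i - \hat{y}_i$)
- $\hat{y}_i$ = Predicted value at $x_i$ ($\hat{\beta}_0 + \hat{\beta}_1 x_i$)
- $SSR$ = Sum of squared residuals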
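Theorem: OLS Closed-Form Solution
Setting $\partial SSR / \partial \hat{\beta}_0 = 0$ and $\partial SSR / \partial \hat{\beta}_1 = 0$ yields:
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
where $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$.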
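The closed-form solution can be sketched in a few lines of plain Python: the slope is the centered sum of cross-products divided by the centered sum of squares of $x$, and the intercept makes the line pass through the point of means. The function name `fit_ols` and the five data points are made up for illustration.

```python
def fit_ols(x, y):
    """Return (b0, b1) from the closed-form simple-regression formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered sum of cross-products and centered sum of squares of x
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx          # slope
    b0 = y_bar - b1 * x_bar   # intercept: line passes through (x_bar, y_bar)
    return b0, b1


# Illustrative data (not from any real study)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_ols(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
# prints: intercept = 2.200, slope = 0.600
```

Here the means are 3 and 4, the cross-product sum is 6, and the sum of squares of $x$ is 10, which gives the slope 0.6 and intercept 2.2 by hand as well.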
Geometric Interpretation
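The OLS fitted vector $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ is the orthogonal projection of the observed vector $\mathbf{y}$ onto the column space of the design matrix $\mathbf{X}$. The residuals are orthogonal to the fitted values: they lie in the null space of $\mathbf{X}^\top$.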
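The orthogonality can be checked numerically. In simple regression the design matrix has two columns, a column of ones and the predictor, so the residuals should have zero dot product with each of them and therefore with the fitted values. The sketch below uses made-up data and plain Python; small nonzero values are floating-point rounding.

```python
# Illustrative data (not from any real study)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Closed-form least-squares fit
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

# Dot products of the residual vector with each design-matrix column
dot_ones = sum(resid)                                   # column of ones
dot_x = sum(xi * ei for xi, ei in zip(x, resid))        # predictor column
# ...and with the fitted values, which are a combination of those columns
dot_fitted = sum(fi * ei for fi, ei in zip(fitted, resid))

print(f"{dot_ones:.2e}  {dot_x:.2e}  {dot_fitted:.2e}")
```

All three values should be zero up to rounding error.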
Properties of OLS Estimators
Theorem: Gauss–Markov
Under the classical assumptions (linearity, errors with mean 0, homoscedasticity, no autocorrelation), the OLS estimators are BLUE (Best Linear Unbiased Estimators): no other linear unbiased estimator of $\beta_0$ or $\beta_1$ has smaller variance. Normality of the errors is not required for this result.
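Variance of OLS Estimators
$$\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}, \qquad \operatorname{Var}(\hat{\beta}_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)$$
Here,
- $\sigma^2$ = Error variance (estimated by $s^2 = SSR/(n-2)$)
- $S_{xx}$ = Sum of squared deviations of $x$, $\sum_{i} (x_i - \bar{x})^2$
- $n$ = Sample size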
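As a worked sketch, the standard errors follow by plugging the estimate $s^2 = SSR/(n-2)$ into the standard variance formulas: $s^2$ divided by the sum of squared deviations of $x$ for the slope, and $s^2 (1/n + \bar{x}^2 / S_{xx})$ for the intercept. The data are made up for illustration.

```python
import math

# Illustrative data (not from any real study)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Closed-form least-squares fit
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Estimate the error variance from the residuals (n - 2 degrees of freedom)
ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s2 = ssr / (n - 2)

# Plug s2 into the variance formulas to get standard errors
se_b1 = math.sqrt(s2 / s_xx)
se_b0 = math.sqrt(s2 * (1 / n + x_bar ** 2 / s_xx))

print(f"s^2 = {s2:.3f}, SE(b1) = {se_b1:.4f}, SE(b0) = {se_b0:.4f}")
```

Note how the slope's standard error shrinks as the $x$ values spread out (larger sum of squared deviations), which is why a wide range of predictor values gives a more precise slope.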
The Coefficient of Determination ($R^2$)
R-Squared
$$R^2 = 1 - \frac{SSR}{SST}$$
Here,
- $R^2$ = Proportion of variance in $y$ explained by $x$
- $SSR$ = Sum of squared residuals
- $SST$ = Total sum of squares = $\sum_{i} (y_i - \bar{y})^2$
$0 \le R^2 \le 1$. An $R^2 = 0.75$ means 75% of the variability in $y$ is explained by the linear relationship with $x$.
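A short sketch of the calculation on made-up data: it computes the coefficient of determination as one minus the ratio of the residual sum of squares to the total sum of squares, and also checks the known fact that in simple linear regression this equals the squared Pearson correlation between the two variables.

```python
# Illustrative data (not from any real study)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Closed-form least-squares fit
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual SS
sst = sum((yi - y_bar) ** 2 for yi in y)                        # total SS
r_squared = 1 - ssr / sst

# Squared Pearson correlation, for comparison
corr_squared = s_xy ** 2 / (s_xx * sst)

print(f"R^2 = {r_squared:.3f}, r^2 = {corr_squared:.3f}")
```

For these five points the residual sum of squares is 2.4 and the total is 6, so 60% of the variability in the response is accounted for by the fitted line.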
The Four Regression Assumptions (LINE)
| Assumption | Mathematical Statement | Diagnostic Check |
|-----------|----------------------|------------------|
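| Linearity | $E[y \mid x] = \beta_0 + \beta_1 x$ | Scatter plot; residual vs. fitted plot |
| Independence | $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \ne j$ | Study design; Durbin–Watson test |
| Normality | $\varepsilon_i \sim N(0, \sigma^2)$ | Q–Q plot; Shapiro–Wilk test |
| Equal variance | $\operatorname{Var}(\varepsilon_i) = \sigma^2$ (constant) | Residual vs. fitted plot; Breusch–Pagan test |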
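As one example of the diagnostic checks in the table, the Durbin–Watson statistic can be computed directly from the residuals taken in observation order: the sum of squared successive differences divided by the sum of squared residuals. This is a minimal sketch on made-up data; the function name `durbin_watson` is a label chosen here, and the sketch computes only the statistic, not a formal test decision.

```python
def durbin_watson(resid):
    """Durbin-Watson statistic for residuals listed in observation order."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den


# Illustrative data (not from any real study), assumed ordered in time
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Closed-form least-squares fit
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
dw = durbin_watson(resid)
print(f"Durbin-Watson = {dw:.3f}")
```

The statistic lies between 0 and 4. Values near 2 are consistent with no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.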
Consequence of Violations
- Non-linearity: OLS estimates are biased; fit a polynomial or use non-linear regression.
- Non-independence: Standard errors are wrong; use clustered or time-series methods.
- Non-normality: Affects small-sample inference; CLT helps for large $n$.
- Heteroscedasticity: Standard errors are wrong; use robust (HC) standard errors.
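To illustrate the robust standard errors mentioned for heteroscedasticity, the sketch below computes the simplest variant, HC0 (White's estimator), for the slope only. In simple regression the slope is a weighted sum of the responses with weights $(x_i - \bar{x}) / S_{xx}$, so the HC0 variance reduces to the sum of squared weights times squared residuals. The data are made up, and HC0 is just one of several HC variants; it is chosen here because it needs no extra correction factors.

```python
import math

# Illustrative data (not from any real study)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Closed-form least-squares fit
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Classical SE: assumes one common error variance for all observations
s2 = sum(e ** 2 for e in resid) / (n - 2)
se_classical = math.sqrt(s2 / s_xx)

# HC0 SE: each observation contributes its own squared residual
hc0_var = sum(((xi - x_bar) ** 2) * (e ** 2) for xi, e in zip(x, resid)) / s_xx ** 2
se_hc0 = math.sqrt(hc0_var)

print(f"classical SE = {se_classical:.4f}, HC0 SE = {se_hc0:.4f}")
```

The two standard errors differ even on this tiny data set. With only five points neither is reliable; the comparison becomes meaningful on larger samples where the residual spread visibly changes with the predictor.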
Hypothesis Tests for the Slope
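t-Test for Slope
$$t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \sim t_{n-2} \quad \text{under } H_0: \beta_1 = 0$$
Here,
- $\hat{\beta}_1$ = Estimated slope
- $SE(\hat{\beta}_1)$ = Standard error of the slope, $s / \sqrt{S_{xx}}$ with $S_{xx} = \sum_{i} (x_i - \bar{x})^2$
- $n - 2$ = Degrees of freedom
The $t$-test with $H_0: \beta_1 = 0$ is equivalent to the $F$-test in simple regression: $F = t^2$.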
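The equivalence of the two tests can be verified numerically. The sketch below computes the slope's test statistic as the estimate divided by its standard error, computes the overall regression statistic as the explained sum of squares (1 degree of freedom) over the residual mean square, and confirms that the second is the square of the first. The data are made up, and p-values are left out because they need a t or F distribution function from a statistics library.

```python
import math

# Illustrative data (not from any real study)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Closed-form least-squares fit
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual SS
sst = sum((yi - y_bar) ** 2 for yi in y)                        # total SS
s2 = ssr / (n - 2)                                              # residual mean square

# t statistic for H0: slope = 0
t_stat = b1 / math.sqrt(s2 / s_xx)

# F statistic: explained SS on 1 df over residual mean square on n - 2 df
f_stat = (sst - ssr) / s2

print(f"t = {t_stat:.4f}, t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```

For these data both the squared slope statistic and the regression statistic come out to 4.5, on 3 residual degrees of freedom.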
Prediction Intervals
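For a new observation at $x_0$, the prediction interval is wider than the confidence interval for the mean:
Prediction Interval
$$\hat{y}_0 \pm t_{\alpha/2,\, n-2} \; s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
Here,
- $\hat{y}_0$ = Predicted value at $x_0$
- $s$ = Residual standard error
- $S_{xx}$ = Sum of squared deviations of $x$, $\sum_{i} (x_i - \bar{x})^2$
- $1$ = Extra term accounts for individual prediction uncertainty
The $1$ under the square root makes the prediction interval always wider than the confidence interval for the mean response.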
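The difference between the two intervals can be seen by computing both half-widths at the same point. In this sketch the two differ only by the extra 1 under the square root. The data are made up, the function name `interval_half_widths` is a label chosen here, and the critical value 3.182 is the 97.5th percentile of the t distribution with 3 degrees of freedom taken from a standard t table, passed in by hand to avoid depending on a statistics library.

```python
import math


def interval_half_widths(x, y, x0, t_crit):
    """Return (y0_hat, ci_half, pi_half) at x0 for a simple linear fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ssr / (n - 2))  # residual standard error

    leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx
    ci_half = t_crit * s * math.sqrt(leverage)       # mean response
    pi_half = t_crit * s * math.sqrt(1 + leverage)   # new observation
    return b0 + b1 * x0, ci_half, pi_half


# Illustrative data (not from any real study)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

T_CRIT = 3.182  # t table value: 97.5th percentile, 3 degrees of freedom
y0_hat, ci_half, pi_half = interval_half_widths(x, y, 3.0, T_CRIT)

print(f"prediction at x0 = 3: {y0_hat:.2f}")
print(f"95% CI for mean response: +/- {ci_half:.3f}")
print(f"95% prediction interval:  +/- {pi_half:.3f}")
```

Both half-widths grow as the new point moves away from the mean of the predictor, but the prediction interval never shrinks below the critical value times the residual standard error, however large the sample.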
Key Takeaways
Summary: Simple Linear Regression
- $\hat{\beta}_1$ is the slope: the change in $y$ per unit change in $x$
- $R^2$ = proportion of variance in $y$ explained by $x$; ranges from 0 to 1
- OLS estimates are BLUE (Best Linear Unbiased Estimators) under the Gauss–Markov theorem
- Always plot residuals: patterns indicate assumption violations
- Correlation $\ne$ causation: regression shows linear association, not causal direction
- 95% prediction intervals are always wider than 95% confidence intervals for the mean response at the same $x_0$
- The $t$-test for the slope is equivalent to the $F$-test in simple regression