Simple Linear Regression

Modeling the Relationship Between Two Variables

Simple linear regression quantifies how a response variable changes with a single predictor, and it forms the foundation of predictive modeling. It estimates the straight line that best fits the data by minimizing the sum of squared residuals.

Economics — Predict consumer spending based on income levels
Healthcare — Model the relationship between dosage and patient response
Engineering — Relate temperature to material expansion coefficients

Every complex model begins with understanding a single straight line.

Simple linear regression models the linear relationship between a predictor variable $X$ and a response variable $Y$ .

The Statistical Model

Definition: Simple Linear Regression Model

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n

where:

$\beta_0$ is the intercept (predicted $Y$ when $X = 0$ )
$\beta_1$ is the slope (change in $Y$ per unit change in $X$ )
$\varepsilon_i$ is the error term — the part of $Y$ not explained by $X$

The Error Term

The error $\varepsilon_i$ represents the combined effect of all unmeasured factors. Under the classical assumptions, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ i.i.d. This implies $E[Y_i \mid X_i] = \beta_0 + \beta_1 X_i$ — the conditional mean is linear in $X$ .
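To make the model concrete, the sketch below simulates draws from it using only Python's standard library. The function name `simulate_y` and the parameter values (intercept 1.0, slope 0.5, error standard deviation 1.0) are illustrative assumptions, not values from the text. Holding $X$ fixed and averaging many simulated responses should land close to $\beta_0 + \beta_1 X$.

```python
import random

def simulate_y(beta0, beta1, sigma, xs, seed=0):
    """Draw Y_i = beta0 + beta1 * x_i + eps_i with eps_i ~ N(0, sigma^2), i.i.d."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical parameter values, chosen only for illustration.
beta0, beta1, sigma = 1.0, 0.5, 1.0
x_fixed = 2.0
ys = simulate_y(beta0, beta1, sigma, [x_fixed] * 20000)

sample_mean = sum(ys) / len(ys)
conditional_mean = beta0 + beta1 * x_fixed  # E[Y | X = 2] under the model
print(sample_mean, conditional_mean)
```

The two printed numbers should agree to within sampling noise, which shrinks as the number of draws grows.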

Ordinary Least Squares (OLS) Estimation

OLS finds the estimates $\hat{\beta}_0, \hat{\beta}_1$ that minimize the sum of squared residuals:

Sum of Squared Residuals

\text{SSR} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2

Here,

$e_i$ = Residual: observed minus predicted, $y_i - \hat{y}_i$
$\hat{y}_i$ = Predicted value at $x_i$
$\text{SSR}$ = Sum of squared residuals
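As a rough illustration of what "minimizing SSR" means, the sketch below evaluates the sum of squared residuals for three candidate lines on a small made-up data set. The data and the helper name `ssr` are assumptions for illustration. The pair (2.2, 0.6) is the least-squares line for this particular data, and nudging either coefficient away from it increases SSR.

```python
def ssr(b0, b1, x, y):
    """Sum of squared residuals for the candidate line y = b0 + b1 * x."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

# Small made-up data set (an assumption, not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

ssr_at_ols = ssr(2.2, 0.6, x, y)   # least-squares line for this data
ssr_steeper = ssr(2.2, 0.7, x, y)  # same intercept, steeper slope
ssr_shifted = ssr(2.5, 0.6, x, y)  # same slope, higher intercept
print(ssr_at_ols, ssr_steeper, ssr_shifted)
```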

Theorem: OLS Closed-Form Solution

Setting $\frac{\partial \text{SSR}}{\partial \beta_0} = 0$ and $\frac{\partial \text{SSR}}{\partial \beta_1} = 0$ yields:

\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

where $S_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum(x_i - \bar{x})^2$ .
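The closed-form solution translates almost line for line into code. The following is a minimal sketch in plain Python; the function name `ols_fit` and the five data points are assumptions chosen so the arithmetic is easy to follow by hand ($\bar{x} = 3$, $\bar{y} = 4$, $S_{xy} = 6$, $S_{xx} = 10$).

```python
def ols_fit(x, y):
    """Return (b0, b1) from the closed-form OLS formulas b1 = S_xy / S_xx, b0 = y_bar - b1 * x_bar."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = ols_fit(x, y)
print(b0, b1)  # slope 6/10 = 0.6, intercept 4 - 0.6 * 3 = 2.2 (up to floating-point rounding)
```

In practice a library routine would normally be used instead; the hand-rolled version is only meant to mirror the formulas.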

Geometric Interpretation

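The OLS fitted vector $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ is the orthogonal projection of the observed vector $\mathbf{y}$ onto the column space of the design matrix $\mathbf{X}$. The residuals $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ satisfy $\mathbf{X}^T \mathbf{e} = \mathbf{0}$: they lie in the null space of $\mathbf{X}^T$, so they are orthogonal to every column of $\mathbf{X}$ and therefore to the fitted values.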
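The orthogonality can be checked numerically. In simple regression the design matrix has two columns, a column of ones and the column of $x$ values, so the residuals should have zero dot product with both, and with the fitted values. The sketch below uses made-up data and a hypothetical helper `ols_fit`.

```python
def ols_fit(x, y):
    """Closed-form OLS estimates (b0, b1) for simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = ols_fit(x, y)
fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

dot_with_ones = sum(resid)                                    # e . 1
dot_with_x = sum(e * xi for e, xi in zip(resid, x))           # e . x
dot_with_fitted = sum(e * fi for e, fi in zip(resid, fitted)) # e . y_hat
print(dot_with_ones, dot_with_x, dot_with_fitted)  # all zero up to rounding error
```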

Properties of OLS Estimators

Theorem: Gauss–Markov

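Under the Gauss–Markov assumptions (linearity in the parameters, errors with mean zero, constant variance, and no correlation between errors), the OLS estimators are BLUE (Best Linear Unbiased Estimators). Normality of the errors is not required for this result. In particular, the slope estimator has variance: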

\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum(x_i - \bar{x})^2} = \frac{\sigma^2}{S_{xx}}

No other linear unbiased estimator of $\beta_1$ has smaller variance.
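A small Monte Carlo experiment can make two of these claims tangible: that $\hat{\beta}_1$ is unbiased and that its variance is $\sigma^2 / S_{xx}$. The sketch below is only a simulation check under assumed parameter values (intercept 2.0, slope 0.5, $\sigma = 1$) and a fixed design; it does not demonstrate the "best" part of BLUE, which would require comparing against other linear unbiased estimators.

```python
import random

def slope_estimate(x, y):
    """Closed-form OLS slope S_xy / S_xx."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

rng = random.Random(42)                 # seeded for reproducibility
x = [1.0, 2.0, 3.0, 4.0, 5.0]           # fixed design, S_xx = 10
beta0, beta1, sigma = 2.0, 0.5, 1.0     # hypothetical true parameters

slopes = []
for _ in range(5000):
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    slopes.append(slope_estimate(x, y))

mean_slope = sum(slopes) / len(slopes)
var_slope = sum((b - mean_slope) ** 2 for b in slopes) / (len(slopes) - 1)

x_bar = sum(x) / len(x)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
theoretical_var = sigma ** 2 / s_xx     # sigma^2 / S_xx = 0.1 here
print(mean_slope, var_slope, theoretical_var)
```

The average of the simulated slopes should sit near the true slope, and their sample variance near the theoretical value.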

Variance of OLS Estimators

\text{Var}(\hat{\beta}_0) = \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right), \quad \text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}

Here,

$\sigma^2$ = Error variance (estimated by $s^2 = \text{SSR}/(n-2)$)
$S_{xx}$ = Sum of squared deviations of $X$ from its mean
$n$ = Sample size
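Since $\sigma^2$ is unknown in practice, the variance formulas are evaluated with $s^2$ plugged in, and their square roots give the standard errors. A minimal sketch, with made-up data and a hypothetical helper name `fit_with_standard_errors`:

```python
import math

def fit_with_standard_errors(x, y):
    """OLS estimates plus plug-in standard errors based on s^2 = SSR / (n - 2)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = ssr / (n - 2)                                   # estimate of sigma^2
    se_b0 = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / s_xx))
    se_b1 = math.sqrt(s2 / s_xx)
    return {"b0": b0, "b1": b1, "s2": s2, "se_b0": se_b0, "se_b1": se_b1}

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

result = fit_with_standard_errors(x, y)
print(result)
```

For this data SSR is 2.4, so $s^2 = 2.4 / 3 = 0.8$, and the two standard errors are $\sqrt{0.88}$ and $\sqrt{0.08}$.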

The Coefficient of Determination ( $R^2$ )

R-Squared

R^2 = 1 - \frac{\text{SSR}}{\text{SST}} = \frac{\text{SSR}_{\text{regression}}}{\text{SST}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}

Here,

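$R^2$ = Proportion of variance in $Y$ explained by $X$
$\text{SSR}$ = Sum of squared residuals
$\text{SSR}_{\text{regression}}$ = Regression (explained) sum of squares, $\sum(\hat{y}_i - \bar{y})^2$
$\text{SST}$ = Total sum of squares, $\sum(y_i - \bar{y})^2$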

For an OLS fit that includes an intercept, $R^2 \in [0, 1]$. An $R^2$ of 0.75 means that 75% of the variability in $Y$ is explained by the linear relationship with $X$.
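The two expressions for $R^2$ agree because the total sum of squares splits into a regression part and a residual part when the model has an intercept. The sketch below computes both forms on made-up data; the helper name `r_squared_both_ways` is an assumption.

```python
def r_squared_both_ways(x, y):
    """Return R^2 computed as 1 - SS_res / SST and as SS_reg / SST."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual sum of squares
    ss_reg = sum((fi - y_bar) ** 2 for fi in fitted)           # regression sum of squares
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
    return 1.0 - ss_res / ss_tot, ss_reg / ss_tot

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r2_from_residuals, r2_from_regression = r_squared_both_ways(x, y)
print(r2_from_residuals, r2_from_regression)
```

Here the residual sum of squares is 2.4 and the total is 6, so both forms give 0.6 up to rounding.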

The Four Regression Assumptions (LINE)

| Assumption | Mathematical Statement | Diagnostic Check |
|-----------|----------------------|------------------|
| Linearity | $E[Y \mid X] = \beta_0 + \beta_1 X$ | Scatter plot; residual vs. fitted plot |
| Independence | $\text{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$ | Study design; Durbin–Watson test |
| Normality | $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ | Q–Q plot; Shapiro–Wilk test |
| Equal variance | $\text{Var}(\varepsilon_i) = \sigma^2$ (constant) | Residual vs. fitted plot; Breusch–Pagan test |

Consequence of Violations

Non-linearity: OLS estimates are biased; fit a polynomial or use non-linear regression.
Non-independence: Standard errors are wrong; use clustered or time-series methods.
Non-normality: Affects small-sample inference; CLT helps for large $n$ .
Heteroscedasticity: Standard errors are wrong; use robust (HC) standard errors.
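Most of the diagnostics above start from the residuals. As one example, the Durbin–Watson statistic used to check independence is simple enough to compute by hand: it is the sum of squared successive differences of the residuals divided by their sum of squares. Values near 2 suggest little first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The sketch below assumes the observations are ordered in time and uses five made-up points, far too few for a real diagnosis.

```python
def ols_residuals(x, y):
    """Residuals y_i - (b0 + b1 * x_i) from the closed-form OLS fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up data, treated as if collected in time order (an assumption).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

dw = durbin_watson(ols_residuals(x, y))
print(dw)
```

The statistic is only a screening tool; a formal test compares it with tabulated bounds, and plotting the residuals remains the first step.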

Hypothesis Tests for the Slope

t-Test for Slope

t = \frac{\hat{\beta}_1 - 0}{\text{SE}(\hat{\beta}_1)} \sim t_{n-2}

Here,

$\hat{\beta}_1$ = Estimated slope
$\text{SE}(\hat{\beta}_1)$ = Standard error of the slope, $s / \sqrt{S_{xx}}$
$n-2$ = Degrees of freedom

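The statistic follows the $t_{n-2}$ distribution when $H_0: \beta_1 = 0$ is true and the model assumptions hold. In simple regression, the two-sided $t$-test of $H_0$ is equivalent to the overall $F$-test: $F = t^2$.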
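The identity $F = t^2$ can be verified directly by computing both statistics from the same fit. The sketch below stops at the test statistics and does not compute a p-value, since that needs the $t$ distribution's CDF, which is not in Python's standard library. The data and the helper name `slope_t_and_f` are assumptions for illustration.

```python
import math

def slope_t_and_f(x, y):
    """Return (t, F): the slope t-statistic and the overall F-statistic."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_reg = sum((fi - y_bar) ** 2 for fi in fitted)
    s2 = ss_res / (n - 2)                 # residual mean square
    t_stat = b1 / math.sqrt(s2 / s_xx)    # (b1 - 0) / SE(b1)
    f_stat = (ss_reg / 1.0) / s2          # regression mean square / residual mean square
    return t_stat, f_stat

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

t_stat, f_stat = slope_t_and_f(x, y)
print(t_stat, f_stat, t_stat ** 2)
```

For this data the regression sum of squares is 3.6 and $s^2 = 0.8$, so $F = 4.5$ and $t = \sqrt{4.5}$.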

Prediction Intervals

For a new observation at $x_0$ , the prediction interval is wider than the confidence interval for the mean:

Prediction Interval

\hat{y}_0 \pm t_{\alpha/2, n-2} \cdot s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

Here,

$\hat{y}_0$ = Predicted value at $x_0$, $\hat{\beta}_0 + \hat{\beta}_1 x_0$
$s$ = Residual standard error, $\sqrt{\text{SSR}/(n-2)}$
$1$ (first term under the root) = Extra variance from the new observation's own error

The $+1$ under the square root makes the prediction interval always wider than the confidence interval for the mean response.
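The difference between the two intervals is easiest to see side by side. The sketch below computes the half-widths of both at a chosen $x_0$; the confidence interval for the mean uses the same expression without the leading 1 under the root. The critical value is passed in as a parameter because the standard library has no $t$ quantile function; 3.182 is the usual table value for $t_{0.025, 3}$ and is taken as given here. Data and function name are assumptions for illustration.

```python
import math

def interval_half_widths(x0, x, y, t_crit):
    """Return (y0_hat, ci_half, pi_half) for the mean response and a new observation at x0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ssr / (n - 2))                         # residual standard error
    leverage = 1.0 / n + (x0 - x_bar) ** 2 / s_xx
    ci_half = t_crit * s * math.sqrt(leverage)           # confidence interval for the mean
    pi_half = t_crit * s * math.sqrt(1.0 + leverage)     # prediction interval for a new y
    return b0 + b1 * x0, ci_half, pi_half

# Made-up data; n = 5 gives n - 2 = 3 degrees of freedom.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
T_CRIT_95_DF3 = 3.182  # table value for t_{0.025, 3}, supplied by hand

y0_hat, ci_half, pi_half = interval_half_widths(4.0, x, y, T_CRIT_95_DF3)
print(y0_hat, ci_half, pi_half)
```

Both half-widths grow as $x_0$ moves away from $\bar{x}$, but only the confidence interval shrinks toward zero as $n$ increases; the prediction interval stays at least $t \cdot s$ wide.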

Key Takeaways

Summary: Simple Linear Regression

$\beta_1$ is the slope — the change in $Y$ per unit change in $X$
$R^2$ = proportion of variance in $Y$ explained by $X$ — ranges from 0 to 1
OLS estimates are BLUE (Best Linear Unbiased Estimators) under the Gauss–Markov theorem
Always plot residuals — patterns indicate assumption violations
Correlation $\neq$ causation — regression shows linear association, not causal direction
At the same confidence level and the same $x_0$, prediction intervals are always wider than confidence intervals for the mean response
The $t$ -test for the slope is equivalent to the $F$ -test in simple regression
