Covariance

Do Two Variables Move Together? Covariance Tells You

Covariance measures how two variables change together. Positive covariance means they tend to move in the same direction; negative means opposite directions.

Direction of relationship — Positive, negative, or zero covariance reveals the sign of association
Magnitude is scale-dependent — The raw number is hard to interpret without normalization
Covariance matrix — The foundation of multivariate statistics and portfolio theory
Gateway to correlation — Standardizing covariance produces the Pearson correlation coefficient

Covariance is the raw material from which correlation is built. Understanding it gives you the foundation for everything that follows.

What is Covariance?

Definition

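Covariance is the average product of two variables' deviations from their respective means. It is positive when the variables tend to sit on the same side of their means together, negative when they tend to sit on opposite sides, and near zero when there is no consistent linear pattern.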

Sample Covariance

$$\text{Cov}(X, Y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

Here,

$x_i, y_i$ = individual data points
$\bar{x}, \bar{y}$ = sample means of X and Y
$n$ = number of observations

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
n = 100

# Three relationships
x = np.random.normal(50, 10, n)
y_pos = 2*x + np.random.normal(0, 10, n)   # positive covariance
y_neg = -2*x + np.random.normal(0, 10, n)  # negative covariance
y_zero = np.random.normal(50, 10, n)        # independent of x: covariance near zero

for y, label in [(y_pos,"Positive"),(y_neg,"Negative"),(y_zero,"Near-Zero")]:
    cov = np.cov(x, y)[0,1]
    print(f"Cov(X, Y) = {cov:+.4f}  -> {label}")

Manual Calculation

data = pd.DataFrame({
    'study_hours': [2, 3, 5, 4, 6, 1, 7, 3, 5, 8],
    'exam_score':  [60,65,75,70,80,55,85,65,72,90]
})

mean_x = data['study_hours'].mean()
mean_y = data['exam_score'].mean()
deviations_xy = (data['study_hours'] - mean_x) * (data['exam_score'] - mean_y)
cov_manual = deviations_xy.sum() / (len(data) - 1)
cov_numpy  = np.cov(data['study_hours'], data['exam_score'])[0, 1]

print(f"Manual covariance: {cov_manual:.4f}")
print(f"NumPy covariance:  {cov_numpy:.4f}")
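
The sample formula divides by n − 1, which is also NumPy's default. As a minimal sketch, reusing the study-hours data above as plain arrays, the snippet below contrasts that with the population version that divides by n. The variable names are illustrative only.

```python
import numpy as np

# Same toy data as the study-hours example, as plain arrays
study_hours = np.array([2, 3, 5, 4, 6, 1, 7, 3, 5, 8], dtype=float)
exam_score = np.array([60, 65, 75, 70, 80, 55, 85, 65, 72, 90], dtype=float)

n = len(study_hours)

# Sum of the products of deviations from each mean
cross_products = np.sum(
    (study_hours - study_hours.mean()) * (exam_score - exam_score.mean())
)

cov_sample = cross_products / (n - 1)  # sample covariance (formula above)
cov_population = cross_products / n    # population covariance

# np.cov divides by n - 1 by default; ddof=0 switches the divisor to n
assert np.isclose(cov_sample, np.cov(study_hours, exam_score)[0, 1])
assert np.isclose(cov_population, np.cov(study_hours, exam_score, ddof=0)[0, 1])

print(f"Sample covariance (n - 1):  {cov_sample:.4f}")
print(f"Population covariance (n):  {cov_population:.4f}")
```

The two values differ by the factor (n − 1)/n, so the gap shrinks as the sample grows.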

Covariance Matrix

For multiple variables, the covariance matrix contains pairwise covariances:

iris = sns.load_dataset('iris')
numeric = iris.select_dtypes(include='number')

cov_matrix = numeric.cov()
print("Covariance Matrix:")
print(cov_matrix.round(4))

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(cov_matrix, annot=True, fmt='.2f', cmap='RdBu_r', center=0, ax=ax)
ax.set_title('Covariance Matrix (Iris Dataset)')
plt.tight_layout()
plt.savefig('covariance_matrix.png', dpi=150)
plt.show()
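
A covariance matrix has a few structural properties worth checking: its diagonal holds each variable's variance, it is symmetric, and its eigenvalues are non-negative. The following is a small sketch on synthetic data (the array `X` is made up for illustration, not taken from the Iris example) that verifies these with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations (rows) of 3 variables (columns)
X = rng.normal(size=(200, 3))
X[:, 1] += 0.8 * X[:, 0]  # make column 1 co-vary with column 0

# rowvar=False tells np.cov that each column is a variable
cov_matrix = np.cov(X, rowvar=False)

# Diagonal entries equal the sample variance of each column
variances = X.var(axis=0, ddof=1)
assert np.allclose(np.diag(cov_matrix), variances)

# Cov(X, Y) equals Cov(Y, X), so the matrix is symmetric
assert np.allclose(cov_matrix, cov_matrix.T)

# Eigenvalues are non-negative up to rounding (positive semi-definite)
eigenvalues = np.linalg.eigvalsh(cov_matrix)
assert np.all(eigenvalues >= -1e-12)

print(cov_matrix.round(3))
```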

Covariance vs Correlation

Feature	Covariance	Pearson Correlation
Range	(−∞, +∞)	[−1, +1]
Units	Units of X × Units of Y	Dimensionless
Scale-dependent?	Yes	No
Interpretable magnitude?	No	Yes
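
To make the scale-dependence row concrete, here is a short sketch with made-up height and weight data (the numbers and variable names are assumptions for illustration). Changing the unit of one variable rescales the covariance by the same factor but leaves the correlation unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: height in metres, weight in kilograms
height_m = rng.normal(1.70, 0.10, 500)
weight_kg = 60 + 50 * (height_m - 1.70) + rng.normal(0, 5, 500)

# Same measurements expressed in centimetres
height_cm = height_m * 100

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
r_m = np.corrcoef(height_m, weight_kg)[0, 1]
r_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(f"Cov (m, kg):  {cov_m:.4f}")
print(f"Cov (cm, kg): {cov_cm:.4f}")  # 100 times larger
print(f"r (m, kg):    {r_m:.4f}")
print(f"r (cm, kg):   {r_cm:.4f}")    # identical to the line above
```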

Covariance to Correlation

$$r = \frac{\text{Cov}(X,Y)}{s_X \cdot s_Y}$$

Here,

$\text{Cov}(X,Y)$ = covariance between X and Y
$s_X, s_Y$ = sample standard deviations of X and Y
$r$ = Pearson correlation coefficient

cov = np.cov(data['study_hours'], data['exam_score'])[0,1]
sx = data['study_hours'].std(ddof=1)
sy = data['exam_score'].std(ddof=1)
r_from_cov = cov / (sx * sy)
r_numpy = np.corrcoef(data['study_hours'], data['exam_score'])[0,1]

print(f"r from covariance formula: {r_from_cov:.6f}")
print(f"r from np.corrcoef:        {r_numpy:.6f}")
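
The same standardization extends to a whole covariance matrix: divide each entry by the product of the two corresponding standard deviations, which are the square roots of the diagonal. A minimal sketch on synthetic data (the array `X` and its column scales are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 4 variables on very different scales
X = rng.normal(size=(150, 4)) * np.array([1.0, 10.0, 0.1, 5.0])
X[:, 3] += 2.0 * X[:, 0]  # introduce some dependence between columns 0 and 3

cov_matrix = np.cov(X, rowvar=False)

# Standard deviations are the square roots of the diagonal (the variances)
std = np.sqrt(np.diag(cov_matrix))

# Entry (i, j) becomes Cov(i, j) / (s_i * s_j)
corr_matrix = cov_matrix / np.outer(std, std)

# Agrees with NumPy's built-in correlation matrix
assert np.allclose(corr_matrix, np.corrcoef(X, rowvar=False))

print(corr_matrix.round(3))
```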

Covariance in Machine Learning

ML Application	Covariance Usage	Why
PCA	Covariance matrix → eigenvectors	Dimensionality reduction
Multicollinearity	High covariance (large |r| after standardizing) between features	Remove redundant features
Portfolio optimization	Asset covariance → risk	Financial ML
Feature selection	Near-zero covariance with target → candidate for removal	Little linear predictive signal

import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

# Covariance matrix
cov_matrix = np.cov(X.T)
print("Covariance matrix:")
print(cov_matrix.round(2))

# PCA's components are the eigenvectors of the covariance matrix
# (scikit-learn computes them via an SVD of the centered data)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(f"\nExplained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.3f}")
print("PCA finds directions of maximum variance!")
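
To show the link between PCA and the covariance matrix explicitly, the sketch below performs PCA by hand with NumPy: center the data, eigendecompose the covariance matrix, and sort by eigenvalue. It uses synthetic data rather than Iris so that it runs without scikit-learn; the variable names are illustrative, and this is one way to compute the decomposition, not a description of scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical correlated 3-D data driven by one latent factor
latent = rng.normal(size=(300, 1))
X = np.hstack([latent * 3.0, latent * 1.5, rng.normal(size=(300, 1))])
X = X + rng.normal(scale=0.3, size=(300, 3))

# Step 1: center each column
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix of the centered data
cov_matrix = np.cov(X_centered, rowvar=False)

# Step 3: eigendecomposition (eigh returns ascending eigenvalues)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]  # sort descending
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 4: share of total variance along each principal direction
explained_variance_ratio = eigenvalues / eigenvalues.sum()

# Step 5: project onto the top two components
X_projected = X_centered @ eigenvectors[:, :2]

# The variance of each projected coordinate equals its eigenvalue
assert np.allclose(X_projected.var(axis=0, ddof=1), eigenvalues[:2])

# Cross-check: squared singular values / (n - 1) give the same eigenvalues
singular_values = np.linalg.svd(X_centered, compute_uv=False)
assert np.allclose(singular_values**2 / (len(X) - 1), eigenvalues)

print("Explained variance ratio:", explained_variance_ratio.round(3))
```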

Key Takeaways

Summary: Covariance

Positive covariance: both variables tend to be above (or below) their means together
Negative covariance: one variable tends to be above its mean when the other is below
Covariance is scale-dependent — use correlation (standardized covariance) for interpretable strength
np.cov() returns the covariance matrix — [0,1] or [1,0] element is the cross-covariance
The diagonal of the covariance matrix contains each variable's own variance
Portfolio variance = wᵀ Σ w where Σ is the covariance matrix — covariance drives diversification
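
The portfolio formula in the last point can be sketched for two assets. The covariance matrix and weights below are invented numbers for illustration, not market data.

```python
import numpy as np

# Hypothetical covariance matrix of returns for two assets
# (variances 0.04 and 0.09 on the diagonal, covariance 0.006 off it)
sigma = np.array([
    [0.040, 0.006],
    [0.006, 0.090],
])

# Hypothetical portfolio weights (sum to 1)
weights = np.array([0.6, 0.4])

# Portfolio variance: w^T Σ w
portfolio_variance = weights @ sigma @ weights
portfolio_volatility = np.sqrt(portfolio_variance)

# Weighted average of the individual volatilities, for comparison
asset_volatility = np.sqrt(np.diag(sigma))
weighted_avg_volatility = weights @ asset_volatility

print(f"Portfolio variance:          {portfolio_variance:.5f}")
print(f"Portfolio volatility:        {portfolio_volatility:.4f}")
print(f"Weighted average volatility: {weighted_avg_volatility:.4f}")
```

Because the covariance between the two assets is small relative to their variances, the portfolio's volatility comes out below the weighted average of the individual volatilities. That gap is the diversification effect.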
