Covariance
Descriptive Statistics
Do Two Variables Move Together? Covariance Tells You
Covariance measures how two variables change together. Positive covariance means they tend to move in the same direction; negative means opposite directions.
- Direction of relationship — Positive, negative, or zero covariance reveals the sign of association
- Magnitude is scale-dependent — The raw number is hard to interpret without normalization
- Covariance matrix — The foundation of multivariate statistics and portfolio theory
- Gateway to correlation — Standardizing covariance produces the Pearson correlation coefficient
Covariance is the raw material from which correlation is built. Understanding it gives you the foundation for everything that follows.
What is Covariance?
Definition
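Covariance is the average product of two variables' deviations from their respective means. When X and Y tend to sit on the same side of their means at the same time, the products are mostly positive and so is the covariance; when they tend to sit on opposite sides, the covariance is negative.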
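Sample Covariance
$$\operatorname{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
Here,
- $x_i, y_i$ = Individual data points
- $\bar{x}, \bar{y}$ = Sample means of X and Y
- $n$ = Number of observations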
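As a small worked example of the sample covariance (with the n − 1 denominator), take three made-up observations, X = (1, 2, 3) and Y = (2, 4, 6). The numbers are purely illustrative:

```latex
\bar{x} = \frac{1 + 2 + 3}{3} = 2, \qquad \bar{y} = \frac{2 + 4 + 6}{3} = 4

\sum_{i=1}^{3} (x_i - \bar{x})(y_i - \bar{y})
  = (-1)(-2) + (0)(0) + (1)(2) = 4

\operatorname{cov}(X, Y) = \frac{4}{3 - 1} = 2
```

Every product of deviations is non-negative here, so the covariance comes out positive, as expected for two variables that rise together.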
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
n = 100
# Three relationships
x = np.random.normal(50, 10, n)
y_pos = 2*x + np.random.normal(0, 10, n) # positive covariance
y_neg = -2*x + np.random.normal(0, 10, n) # negative covariance
y_zero = np.random.normal(50, 10, n) # independent of x: covariance near zero
for y, label in [(y_pos, "Positive"), (y_neg, "Negative"), (y_zero, "Near-Zero")]:
    cov = np.cov(x, y)[0, 1]
    print(f"Cov(X, Y) = {cov:+.4f} -> {label}")
Manual Calculation
data = pd.DataFrame({
    'study_hours': [2, 3, 5, 4, 6, 1, 7, 3, 5, 8],
    'exam_score': [60, 65, 75, 70, 80, 55, 85, 65, 72, 90]
})
mean_x = data['study_hours'].mean()
mean_y = data['exam_score'].mean()
deviations_xy = (data['study_hours'] - mean_x) * (data['exam_score'] - mean_y)
cov_manual = deviations_xy.sum() / (len(data) - 1)
cov_numpy = np.cov(data['study_hours'], data['exam_score'])[0, 1]
print(f"Manual covariance: {cov_manual:.4f}")
print(f"NumPy covariance: {cov_numpy:.4f}")
Covariance Matrix
For multiple variables, the covariance matrix contains pairwise covariances:
iris = sns.load_dataset('iris')
numeric = iris.select_dtypes(include='number')
cov_matrix = numeric.cov()
print("Covariance Matrix:")
print(cov_matrix.round(4))
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(cov_matrix, annot=True, fmt='.2f', cmap='RdBu_r', center=0, ax=ax)
ax.set_title('Covariance Matrix (Iris Dataset)')
plt.tight_layout()
plt.savefig('covariance_matrix.png', dpi=150)
plt.show()
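Two structural properties of any covariance matrix are worth checking numerically: the diagonal holds each variable's own variance, and the matrix is symmetric because Cov(X, Y) = Cov(Y, X). The sketch below uses synthetic data rather than the Iris dataset so that it runs without a download; the names `samples` and `cov_demo` are placeholders introduced here.

```python
import numpy as np

# Hypothetical synthetic data: 200 observations (rows) of 3 variables (columns).
rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 3))
samples[:, 1] += 0.8 * samples[:, 0]  # make columns 0 and 1 co-vary

# rowvar=False tells NumPy that each column is a variable.
cov_demo = np.cov(samples, rowvar=False)

# The diagonal holds each variable's sample variance (ddof=1 matches np.cov).
variances = samples.var(axis=0, ddof=1)
print("Diagonal equals variances:", np.allclose(np.diag(cov_demo), variances))

# Cov(X, Y) == Cov(Y, X), so the matrix equals its transpose.
print("Symmetric:", np.allclose(cov_demo, cov_demo.T))
```

Both checks should print `True`. The same two properties hold for the pandas `DataFrame.cov()` result used above.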
Covariance vs Correlation
| Feature | Covariance | Pearson Correlation |
|---|---|---|
| Range | (−∞, +∞) | [−1, +1] |
| Units | Units of X × Units of Y | Dimensionless |
| Scale-dependent? | Yes | No |
| Interpretable magnitude? | No | Yes |
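The scale-dependence row in the table can be seen directly by changing units. In the sketch below, the heights and weights are synthetic values invented for illustration; converting height from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (illustrative only): heights in metres, weights in kilograms.
height_m = rng.normal(1.70, 0.10, 500)
weight_kg = 60 + 50 * (height_m - 1.70) + rng.normal(0, 5, 500)

# Same measurements expressed in a different unit.
height_cm = height_m * 100

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
r_m = np.corrcoef(height_m, weight_kg)[0, 1]
r_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(f"Covariance  (m): {cov_m:.4f}   (cm): {cov_cm:.4f}")  # differs by a factor of 100
print(f"Correlation (m): {r_m:.4f}   (cm): {r_cm:.4f}")      # identical
```

This is why a raw covariance value cannot be judged as "large" or "small" without knowing the units of both variables.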
Covariance to Correlation
$$r = \frac{\operatorname{cov}(X, Y)}{s_x \, s_y}$$
Here,
- $\operatorname{cov}(X, Y)$ = Covariance between X and Y
- $s_x, s_y$ = Sample standard deviations of X and Y
- $r$ = Pearson correlation coefficient
cov = np.cov(data['study_hours'], data['exam_score'])[0,1]
sx = data['study_hours'].std(ddof=1)
sy = data['exam_score'].std(ddof=1)
r_from_cov = cov / (sx * sy)
r_numpy = np.corrcoef(data['study_hours'], data['exam_score'])[0,1]
print(f"r from covariance formula: {r_from_cov:.6f}")
print(f"r from np.corrcoef: {r_numpy:.6f}")
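The same standardization extends from a single pair to a whole covariance matrix: divide entry (i, j) by the product of the standard deviations of variables i and j. The helper `cov_to_corr` below is a hypothetical name introduced for this sketch, not a NumPy function, and the data are synthetic.

```python
import numpy as np

def cov_to_corr(cov):
    """Convert a covariance matrix to a correlation matrix (illustrative helper)."""
    cov = np.asarray(cov, dtype=float)
    std = np.sqrt(np.diag(cov))        # standard deviations from the diagonal
    return cov / np.outer(std, std)    # entry (i, j) divided by s_i * s_j

# Synthetic data: 300 observations of 3 variables, the third built from the first two.
rng = np.random.default_rng(2)
samples = rng.normal(size=(300, 3))
samples[:, 2] = samples[:, 0] - 0.5 * samples[:, 1] + rng.normal(0, 0.5, 300)

corr_from_cov = cov_to_corr(np.cov(samples, rowvar=False))
corr_numpy = np.corrcoef(samples, rowvar=False)

print(corr_from_cov.round(3))
print("Matches np.corrcoef:", np.allclose(corr_from_cov, corr_numpy))
```

The diagonal of the result is all ones, since each variable's covariance with itself is its variance.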
Covariance in Machine Learning
| ML Application | Covariance Usage | Why |
|---|---|---|
| PCA | Covariance matrix → eigenvectors | Dimensionality reduction |
| Multicollinearity | Strong correlation (standardized covariance) between features | Detect and remove redundant features |
| Portfolio optimization | Asset covariance → risk | Financial ML |
| Feature selection | Near-zero covariance with target → candidate for removal | Little linear association with the target (nonlinear dependence is still possible) |
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
# Covariance matrix
cov_matrix = np.cov(X.T)
print("Covariance matrix:")
print(cov_matrix.round(2))
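# scikit-learn's PCA centers the data and applies an SVD, which is
# equivalent to an eigendecomposition of the covariance matrix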
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(f"\nExplained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.3f}")
print("PCA finds directions of maximum variance!")
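To make the link between PCA and the covariance matrix explicit, the following sketch performs PCA by hand with NumPy: it eigendecomposes the covariance matrix and projects the centered data onto the eigenvectors. It uses synthetic 2-D data rather than Iris, and it illustrates the idea rather than reproducing scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic 2-D data stretched along one direction (illustrative only).
raw = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])

centered = raw - raw.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort so the largest variance comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = centered @ eigvecs                # project onto the principal axes
explained_ratio = eigvals / eigvals.sum()

print("Eigenvalues:             ", eigvals.round(3))
print("Variance of projections: ", scores.var(axis=0, ddof=1).round(3))
print("Explained variance ratio:", explained_ratio.round(3))
```

The variance of each projected column equals the corresponding eigenvalue, which is the sense in which the eigenvectors of the covariance matrix are the directions of maximum variance.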
Key Takeaways
Summary: Covariance
- Positive covariance: both variables tend to be above (or below) their means together
- Negative covariance: one variable tends to be above its mean when the other is below its mean
- Covariance is scale-dependent — use correlation (standardized covariance) for interpretable strength
- np.cov(x, y) returns the 2×2 covariance matrix: the [0,1] and [1,0] elements are equal and hold Cov(X, Y), computed with the n − 1 denominator by default
- The diagonal of the covariance matrix contains each variable's own variance
- Portfolio variance = wᵀ Σ w where Σ is the covariance matrix — covariance drives diversification
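The portfolio variance formula wᵀ Σ w from the last takeaway can be sketched in a few lines. The covariance matrix and weights below are made-up numbers for a hypothetical two-asset portfolio, chosen only to show the arithmetic.

```python
import numpy as np

# Hypothetical covariance matrix of returns for two assets (made-up numbers).
sigma = np.array([[0.04, 0.006],
                  [0.006, 0.09]])
w = np.array([0.6, 0.4])  # portfolio weights, summing to 1

# Quadratic form w^T Sigma w.
portfolio_var = w @ sigma @ w

# Two-asset expansion: w1^2 * var1 + w2^2 * var2 + 2 * w1 * w2 * cov12.
expanded = (w[0] ** 2 * sigma[0, 0]
            + w[1] ** 2 * sigma[1, 1]
            + 2 * w[0] * w[1] * sigma[0, 1])

# Weighted average of the individual standard deviations, for comparison.
weighted_std = w @ np.sqrt(np.diag(sigma))

print(f"Portfolio variance:   {portfolio_var:.5f}")
print(f"Portfolio std dev:    {np.sqrt(portfolio_var):.4f}")
print(f"Weighted avg std dev: {weighted_std:.4f}")
```

With these numbers the portfolio variance is 0.03168, giving a standard deviation of about 0.178, which is below the 0.24 weighted average of the two assets' standard deviations. The gap comes from the small covariance term, which is how covariance drives diversification.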