Skewness
Descriptive Statistics
When Distributions Lean to One Side
Skewness reveals the hidden asymmetry in your data — and tells you whether the mean is trustworthy or misleading.
Understanding skewness helps you:
- Interpret the mean — know when mean > median signals a long right tail
- Choose the right model — decide between symmetric and skewed distributions
- Fix data — apply log or square-root transforms to restore symmetry
- Spot real-world patterns — income, house prices, and reaction times are almost always skewed
A symmetric distribution hides nothing. A skewed one whispers where the outliers hide.
What is Skewness?
Definition
Skewness quantifies the asymmetry of a distribution. Positive skew stretches the right tail; negative skew stretches the left tail. A symmetric distribution has zero skewness, although zero skewness alone does not guarantee symmetry.
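Skewness (Fisher's)
$$\text{Skewness} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^{3}$$
Here,
- $x_i$ = The i-th observation
- $\bar{x}$ = Sample mean
- $s$ = Sample standard deviation
- $n$ = Number of observations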
Positive -> longer right tail. Negative -> longer left tail. Zero -> the tails balance, as in a symmetric distribution.
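To make the definition concrete, here is a minimal sketch that computes Fisher's skewness by hand as the third central moment divided by the second central moment raised to the power 3/2. The function name `fisher_skewness` and the five-point sample are illustrative assumptions. The sketch also assumes the standard deviation uses divisor n, which is the convention `scipy.stats.skew` follows at its default `bias=True`, so the two should agree.

```python
import numpy as np

def fisher_skewness(values):
    """Moment-based skewness: m3 / m2**1.5, with divisor n in both moments."""
    x = np.asarray(values, dtype=float)
    if x.size < 2:
        raise ValueError("need at least two observations")
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (variance, divisor n)
    m3 = np.mean(deviations ** 3)  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return float(m3 / m2 ** 1.5)

# Hypothetical sample: mean = 4, m2 = 50/5 = 10, m3 = 180/5 = 36
sample = [1, 2, 3, 4, 10]
result = fisher_skewness(sample)
print(f"skewness = {result:.4f}")  # 36 / 10**1.5, roughly 1.1384
```

The single large value (10) dominates the cubed deviations, which is why one long right tail is enough to push the result above 1.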
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
np.random.seed(42)
right_skew = np.random.lognormal(0, 0.8, 2000) # income-like
symmetric = np.random.normal(0, 1, 2000)
left_skew = -np.random.lognormal(0, 0.8, 2000)
for name, data in [("Right-Skewed", right_skew),
                   ("Symmetric", symmetric),
                   ("Left-Skewed", left_skew)]:
    sk = stats.skew(data)
    print(f"{name:<15}: skew={sk:+.4f}, mean={np.mean(data):.3f}, median={np.median(data):.3f}")
Mean vs Median Under Skewness
Right-Skewed: Mode < Median < Mean
Symmetric: Mode ≈ Median ≈ Mean
Left-Skewed: Mean < Median < Mode
These orderings are the usual pattern for unimodal data, not a strict rule.
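The right-skewed ordering above can be checked on a tiny example. The sketch below uses a made-up list of ten incomes (in thousands) with one very high earner; the numbers are purely illustrative and rely only on Python's standard `statistics` module.

```python
import statistics

# Hypothetical incomes in thousands; the single 250 creates a long right tail
incomes = [20, 30, 30, 30, 40, 40, 50, 60, 90, 250]

mode = statistics.mode(incomes)      # most frequent value: 30
median = statistics.median(incomes)  # average of the two middle values: 40
mean = statistics.mean(incomes)      # 640 / 10 = 64

print(f"mode={mode}, median={median}, mean={mean}")
print("Mode < Median < Mean:", mode < median < mean)
```

Dropping the 250 from the list brings the mean down to about 43, much closer to the median, which shows how strongly a single tail value moves the mean.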
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
datasets = [("Right-Skewed", right_skew, '#f8d7da'),
("Symmetric", symmetric, '#d4edda'),
("Left-Skewed", left_skew, '#d1ecf1')]
for ax, (name, data, color) in zip(axes, datasets):
    ax.hist(data, bins=50, density=True, color=color, edgecolor='gray', alpha=0.7)
    ax.axvline(np.mean(data), color='red', lw=2, ls='--', label=f'Mean={np.mean(data):.2f}')
    ax.axvline(np.median(data), color='blue', lw=2, ls='-', label=f'Median={np.median(data):.2f}')
    ax.set_title(f'{name}\nskewness={stats.skew(data):.3f}')
    ax.legend(fontsize=8)
plt.tight_layout()
plt.savefig('skewness.png', dpi=150)
plt.show()
Interpretation Guide
| Absolute Skewness | Interpretation |
|---|---|
| less than 0.5 | Approximately symmetric |
| 0.5–1.0 | Moderately skewed |
| greater than 1.0 | Highly skewed — consider transformation |
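The rule-of-thumb thresholds in the table can be turned into a small helper. This is a sketch only: the function name `describe_skewness` and the exact label strings are assumptions, and the boundary values 0.5 and 1.0 are assigned to the "moderately skewed" band to match the table's wording.

```python
def describe_skewness(value):
    """Map a skewness value to the rule-of-thumb labels from the table."""
    magnitude = abs(value)
    if magnitude < 0.5:
        return "approximately symmetric"
    label = "moderately skewed" if magnitude <= 1.0 else "highly skewed"
    direction = "right" if value > 0 else "left"
    return f"{label} ({direction} tail)"

for s in (0.2, -0.7, 3.1):
    print(f"{s:+.1f} -> {describe_skewness(s)}")
```

The cut-offs are conventions rather than statistical tests, so borderline values are best confirmed with a histogram.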
Fixing Skewness with Transformations
skewed = np.random.lognormal(0, 1, 500)
print(f"Original skewness: {stats.skew(skewed):.4f}")
# Log transform (works for positive right-skewed data)
log_transformed = np.log(skewed)
print(f"Log-transformed skewness: {stats.skew(log_transformed):.4f}")
# Square root (moderate right skew)
sqrt_transformed = np.sqrt(skewed)
print(f"Sqrt-transformed skewness: {stats.skew(sqrt_transformed):.4f}")
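The comparison above can be wrapped in a helper that tries each candidate transform and reports which one leaves the least skewness. This is a hedged sketch: `least_skewed_transform` is a hypothetical name, the candidate set is limited to the two transforms discussed here, and the data must be strictly positive for the log to be defined.

```python
import numpy as np
from scipy.stats import skew

def least_skewed_transform(values):
    """Return (best_name, scores): the candidate with the smallest |skewness|."""
    x = np.asarray(values, dtype=float)
    if np.any(x <= 0):
        raise ValueError("log and sqrt candidates assume strictly positive data")
    candidates = {
        "none": x,           # untransformed baseline
        "sqrt": np.sqrt(x),  # milder correction
        "log": np.log(x),    # stronger correction
    }
    scores = {name: abs(float(skew(v))) for name, v in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores

rng = np.random.default_rng(0)
demo = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # synthetic right-skewed data
best, scores = least_skewed_transform(demo)
print(best, {k: round(v, 3) for k, v in scores.items()})
```

For lognormal data the log is the natural winner because the logged values are normally distributed by construction; on real data the square root may come out ahead when the skew is milder.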
Skewness in Machine Learning
| ML Application | Skewness Usage | Why |
|---|---|---|
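| Feature transformation | Log/Box-Cox transform skewed features | Linear and distance-based models often behave better on roughly symmetric features |
| Loss function design | Skewed targets → asymmetric loss | Penalize over-prediction and under-prediction differently |
| Data augmentation | Know which direction to augment | Add samples in the sparse tail to balance training data |
| Model selection | Skewed data → robust models | Tree-based models such as Random Forest are less sensitive to skewed features than linear models |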
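As an illustration of the "asymmetric loss" row, one common choice is the pinball (quantile) loss, which charges under-prediction and over-prediction at different rates. The sketch below is an assumption about what such a loss could look like, not a method taken from this article; `pinball_loss` and the parameter name `tau` are hypothetical.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau=0.5):
    """Mean pinball loss; tau in (0, 1) controls the asymmetry."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred  # positive when the model under-predicts
    # Under-prediction costs tau per unit, over-prediction costs (1 - tau)
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

under = pinball_loss([10.0], [8.0], tau=0.9)  # under-predicts by 2
over = pinball_loss([10.0], [12.0], tau=0.9)  # over-predicts by 2
print(f"under-prediction: {under:.2f}, over-prediction: {over:.2f}")
```

With `tau=0.9`, missing low costs about nine times as much as missing high, which suits right-skewed targets where large values are the expensive ones to underestimate.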
import numpy as np
from scipy.stats import skew, boxcox
from sklearn.preprocessing import PowerTransformer
np.random.seed(42)
# Skewed feature → log transform
skewed_data = np.random.lognormal(3, 1, 1000)
print(f"Before transform: skewness = {skew(skewed_data):.3f}")
log_data = np.log(skewed_data)
print(f"After log transform: skewness = {skew(log_data):.3f}")
# Box-Cox transformation (λ is fitted automatically; requires strictly positive data)
bc_data, lam = boxcox(skewed_data)
print(f"After Box-Cox (λ={lam:.2f}): skewness = {skew(bc_data):.3f}")
# PowerTransformer (sklearn)
pt = PowerTransformer(method='yeo-johnson')
pt_data = pt.fit_transform(skewed_data.reshape(-1,1)).flatten()
print(f"After Yeo-Johnson: skewness = {skew(pt_data):.3f}")
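For reference, the one-parameter Box-Cox family fitted by `boxcox` above is usually written as follows. This is the standard textbook form, added here as background rather than taken from the snippet itself.

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
  \ln x, & \lambda = 0
\end{cases}
\qquad x > 0
```

Setting λ = 0 recovers the log transform, and λ = 0.5 gives 2√x − 2, which has the same skewness as the plain square root because shifting and positive rescaling leave skewness unchanged. Yeo-Johnson extends the same idea to data containing zeros or negative values.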
Key Takeaways
- Positive skew = long right tail: the mean usually exceeds the median because the tail pulls it rightward
- |skew| greater than 1: strongly skewed, so consider a transformation or non-parametric methods
- Log transformation often reduces right skewness in strictly positive data such as income, prices, and reaction times
- Always visualize: a single skewness number doesn't tell you the full story
"Skewness is the data's way of telling you the mean is not the whole story."