What is Statistics?
Foundations of Statistics
Turn Raw Data Into Actionable Knowledge
Statistics is the science that transforms uncertainty into understanding. In a world drowning in data, it gives you the tools to separate signal from noise, make evidence-based decisions, and quantify how much you actually know — and how much you don't.
- Describe Data — Summarize large datasets with a few meaningful numbers that capture the essential patterns
- Draw Conclusions — Use samples to make reliable inferences about entire populations, with uncertainty quantified
- Avoid Pitfalls — Recognize traps like correlation-causation confusion, survivorship bias, and p-hacking before they mislead you
- Make Better Decisions — Apply rigorous reasoning to medicine, finance, engineering, business, and everyday life
Statistics is not just math — it is a way of thinking about the world with intellectual honesty.
What is Statistics?
Definition
Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. It gives us tools to make sense of a world full of uncertainty — turning raw numbers into actionable knowledge.
"Statistics is the grammar of science." — Karl Pearson
Why Statistics Matters
Every field that uses data uses statistics:
| Field | Statistical Application | Example |
|---|---|---|
| Medicine | Clinical trial analysis, disease prevalence | Testing if a new drug reduces blood pressure |
| Finance | Risk modeling, portfolio optimization | Calculating Value at Risk (VaR) |
| Engineering | Quality control, reliability testing | Six Sigma defect rate analysis |
| Social Science | Survey analysis, causal inference | Estimating voter turnout from polls |
| Machine Learning | Model evaluation, feature selection | A/B testing algorithm performance |
| Business | Demand forecasting, pricing optimization | Predicting quarterly revenue |
Without statistics, we are swimming in data but drowning in uncertainty.
Two Pillars: Descriptive vs Inferential
Descriptive Statistics
Summarizes and describes the data you have. No generalizations beyond your dataset.
Examples:
- The average salary of 500 employees at a company
- The distribution of exam scores in a class
- A pie chart of market share by product
Key measures:
- Mean, Median, Mode
- Standard Deviation, Variance
- Percentiles, Quartiles
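As a small illustration of the key measures above, here is a minimal sketch using Python's built-in `statistics` module. The exam scores and variable names are made up for illustration.

```python
import statistics

# Hypothetical exam scores for a class of 11 students (made-up data).
scores = [55, 62, 68, 70, 70, 74, 78, 81, 85, 90, 95]

mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value
variance = statistics.variance(scores)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)        # sample standard deviation

# Quartiles: the three cut points that split the sorted data into four parts.
# method="inclusive" treats the data as including its own min and max.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")

print(f"Mean = {mean_score:.2f}, Median = {median_score}, Mode = {mode_score}")
print(f"Variance = {variance:.2f}, Std Dev = {std_dev:.2f}")
print(f"Quartiles: Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```

The second quartile is the median, so `q2` and `median_score` agree. The mean is pulled slightly above the median by the few high scores.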
Inferential Statistics
Uses a sample to draw conclusions about a larger population.
Examples:
- Estimating the average salary of all workers in a country (from a survey of 5,000)
- Testing whether a new drug works better than a placebo
- Predicting election outcomes from polling data
Key methods:
- Hypothesis Testing
- Confidence Intervals
- Regression Analysis
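To make hypothesis testing concrete, the sketch below runs an exact permutation test on the drug-versus-placebo example. This is one simple way to test a difference in means, not the only one. The blood pressure readings are hypothetical, and the test assumes patients were assigned to groups at random.

```python
from itertools import combinations
from statistics import mean

# Hypothetical systolic blood pressure readings (mmHg) after treatment.
drug = [128, 125, 130, 122, 126]
placebo = [135, 138, 132, 140, 136]

# Observed difference in group means.
observed = mean(placebo) - mean(drug)

# Null hypothesis: the drug has no effect, so group labels are arbitrary.
# Enumerate every way to relabel 5 of the 10 readings as "drug".
pooled = drug + placebo
total = 0
count_extreme = 0
for idx in combinations(range(len(pooled)), len(drug)):
    chosen = set(idx)
    group_a = [pooled[i] for i in chosen]
    group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    diff = mean(group_b) - mean(group_a)
    # Two-sided test: count relabelings at least as extreme as the observed one.
    if abs(diff) >= abs(observed):
        count_extreme += 1
    total += 1

p_value = count_extreme / total
print(f"Observed difference: {observed:.1f} mmHg")
print(f"Relabelings: {total}, at least as extreme: {count_extreme}")
print(f"Two-sided p-value: {p_value:.4f}")
```

A small p-value means a difference this large would rarely arise from relabeling alone. That counts as evidence against the "no effect" hypothesis, not proof that the drug works.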
The Inference Pipeline
The Statistical Thinking Process
1. Ask a clear question
- Example: "Does the new teaching method improve test scores?"
2. Design the study
- Who to collect data from (sample vs. population)
- How to collect it (experiment, survey, observation)
- What to measure
3. Collect data
- Ensure data quality and consistency
4. Explore the data (EDA)
- Visualize distributions
- Check for outliers, missingness
5. Analyze
- Apply appropriate statistical methods
6. Interpret & communicate
- Translate results into actionable insights
- Quantify uncertainty honestly
Key Vocabulary
| Term | Symbol | Definition | Example |
|---|---|---|---|
| Population | — | The entire group of interest | All US voters |
| Sample | — | A subset of the population that is measured | 1,000 voters surveyed |
| Parameter | μ, σ, π | A numerical property of the population | True average height of all adults |
| Statistic | x̄, s, p̂ | A numerical property of the sample | Average height in our sample |
| Variable | X, Y | A characteristic being measured | Height, weight, income |
| Observation | xᵢ | A single data point | One person's height: 172 cm |
The Parameter vs Statistic Distinction
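Population Parameter
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$
Here,
- $\mu$ = Population mean (parameter)
- $N$ = Population size
- $x_i$ = The i-th observation in the population
Sample Statistic
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Here,
- $\bar{x}$ = Sample mean (statistic)
- $n$ = Sample size
- $x_i$ = The i-th observation in the sample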
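The distinction can be seen in a short simulation. This is a minimal sketch that assumes a made-up population of 10,000 heights; in practice the population mean is usually unknown, which is why we estimate it from a sample.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: heights (cm) of N = 10,000 adults.
N = 10_000
population = [random.gauss(170, 10) for _ in range(N)]
mu = mean(population)  # parameter: uses every member of the population

# Simple random sample of n = 100 people, drawn without replacement.
n = 100
sample = random.sample(population, n)
x_bar = mean(sample)  # statistic: uses only the sample

print(f"Population mean (parameter): {mu:.2f} cm")
print(f"Sample mean (statistic):     {x_bar:.2f} cm")
print(f"Sampling error:              {x_bar - mu:+.2f} cm")
```

Changing the seed changes the sample and therefore the statistic, while the parameter stays fixed for a given population.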
Branches of Statistics
Frequentist
Probability is the long-run frequency of events. Parameters are fixed unknowns; data provides evidence.
Key tools:
- Hypothesis testing
- Confidence intervals
- Maximum likelihood estimation
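As an illustration of maximum likelihood estimation, the sketch below estimates the probability of heads from a hypothetical series of coin flips. It searches a grid of candidate values instead of using calculus, purely to keep the example short.

```python
import math

# Hypothetical coin flips: 1 = heads, 0 = tails (7 heads out of 10).
flips = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

def log_likelihood(p, data):
    """Log-probability of the observed flips if P(heads) = p."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

# Candidate values 0.01, 0.02, ..., 0.99 for the unknown parameter.
candidates = [i / 100 for i in range(1, 100)]

# The maximum likelihood estimate is the candidate that makes
# the observed data most probable.
p_mle = max(candidates, key=lambda p: log_likelihood(p, flips))
sample_proportion = sum(flips) / len(flips)

print(f"MLE of P(heads):   {p_mle}")
print(f"Sample proportion: {sample_proportion}")
```

For coin flips the maximum likelihood estimate coincides with the sample proportion of heads, which is why the two printed values match.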
Bayesian
Probability represents degrees of belief. We update beliefs as new evidence arrives using Bayes' Theorem.
Key tools:
- Prior/Posterior distributions
- Credible intervals
- MCMC sampling
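A minimal sketch of a Bayesian update, assuming a coin with unknown probability of heads and a Beta prior. The Beta-Binomial pairing is chosen because the posterior has a closed form; the flip counts are hypothetical.

```python
# Prior belief about P(heads): Beta(1, 1), which is uniform on [0, 1].
prior_alpha, prior_beta = 1, 1

# Hypothetical evidence: 7 heads and 3 tails in 10 flips.
heads, tails = 7, 3

# Conjugate update: with a Beta prior and binomial data, Bayes' Theorem
# gives a Beta posterior whose parameters are the prior plus the counts.
post_alpha = prior_alpha + heads
post_beta = prior_beta + tails

prior_mean = prior_alpha / (prior_alpha + prior_beta)
posterior_mean = post_alpha / (post_alpha + post_beta)

print(f"Prior:     Beta({prior_alpha}, {prior_beta}), mean = {prior_mean:.3f}")
print(f"Posterior: Beta({post_alpha}, {post_beta}), mean = {posterior_mean:.3f}")
```

The posterior mean sits between the prior mean (0.5) and the observed proportion (0.7). With more data, the evidence increasingly outweighs the prior.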
Nonparametric
Makes fewer assumptions about the distribution of the data. Useful when normality cannot be assumed.
Key tools:
- Rank-based tests
- Bootstrapping
- Kernel density estimation
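Bootstrapping can be sketched in a few lines. The example below builds a percentile bootstrap interval for the median of a small, right-skewed dataset; the income figures, seed, and number of resamples are assumptions for illustration.

```python
import random
from statistics import median

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical right-skewed data: annual incomes in thousands.
incomes = [22, 25, 27, 28, 30, 31, 33, 35, 38, 41, 45, 52, 60, 75, 120]

n_boot = 5000
boot_medians = []
for _ in range(n_boot):
    # Resample the data with replacement, keeping the original size.
    resample = random.choices(incomes, k=len(incomes))
    boot_medians.append(median(resample))

# Percentile method: take the middle 95% of the bootstrap medians.
boot_medians.sort()
lower = boot_medians[n_boot * 25 // 1000]       # 2.5th percentile
upper = boot_medians[n_boot * 975 // 1000 - 1]  # 97.5th percentile

print(f"Sample median: {median(incomes)}")
print(f"95% bootstrap interval for the median: ({lower}, {upper})")
```

No normality assumption is needed here: the interval comes entirely from resampling the observed data.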
Python: First Steps
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
# Create a sample dataset
np.random.seed(42)
data = np.random.normal(loc=170, scale=10, size=100) # Heights in cm
# --- Descriptive statistics ---
print("=== Descriptive Statistics ===")
print(f"n = {len(data)}")
print(f"Mean = {np.mean(data):.2f} cm")
print(f"Median = {np.median(data):.2f} cm")
print(f"Std Dev = {np.std(data, ddof=1):.2f} cm")
print(f"Min = {np.min(data):.2f} cm")
print(f"Max = {np.max(data):.2f} cm")
# --- Inferential: 95% confidence interval for the mean ---
ci = stats.t.interval(0.95, df=len(data)-1,
                      loc=np.mean(data),
                      scale=stats.sem(data))
print(f"\n95% CI for mean height: ({ci[0]:.2f}, {ci[1]:.2f}) cm")
Output:
=== Descriptive Statistics ===
n = 100
Mean = 168.96 cm
Median = 168.73 cm
Std Dev = 9.08 cm
Min = 143.80 cm
Max = 188.52 cm
95% CI for mean height: (167.16, 170.76) cm
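For reference, the interval that `stats.t.interval` returns in the code above corresponds to the standard t-based confidence interval for a mean. In the sketch below, $\bar{x}$ is the sample mean, $s$ is the sample standard deviation (computed with $n-1$ in the denominator), $n$ is the sample size, and $t_{0.975,\,n-1}$ is the 97.5th percentile of a t-distribution with $n-1$ degrees of freedom:

$$\bar{x} \pm t_{0.975,\,n-1} \cdot \frac{s}{\sqrt{n}}$$

In the code, `stats.sem(data)` supplies the $s/\sqrt{n}$ term (the standard error of the mean). The formula assumes the data are roughly normal, or that $n$ is large enough for the sample mean to be approximately normal.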
Statistics in Machine Learning, Data Science, Deep Learning & LLMs
| Field | How Statistics is Used | Key Concepts |
|---|---|---|
| Machine Learning | Model evaluation, feature selection, A/B testing | Bias-variance tradeoff, p-values, confidence intervals |
| Data Science | Exploratory analysis, dashboarding, reporting | Descriptive stats, distributions, correlations |
| Deep Learning | Loss functions, regularization, batch normalization | Mean squared error, dropout as regularization |
| LLMs | Token probability, temperature sampling, perplexity | Softmax, cross-entropy loss, attention weights |
| NLP | Sentiment analysis, topic modeling | TF-IDF (frequency statistics), n-grams |
| Computer Vision | Object detection, image classification | IoU (intersection over union), mAP metrics |
Simple Example — How ML Uses Statistics:
# A machine learning model is just statistics in disguise
from sklearn.linear_model import LinearRegression
import numpy as np
# Study hours vs exam scores (sample data)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
y = np.array([45, 55, 65, 70, 78, 85, 90, 95])
# This is literally the statistical formula: y = β₀ + β₁x + ε
model = LinearRegression()
model.fit(X, y)
print(f"Intercept (β₀): {model.intercept_:.2f}") # Statistics: β₀
print(f"Slope (β₁): {model.coef_[0]:.2f}") # Statistics: β₁
print(f"R² Score: {model.score(X, y):.4f}") # Statistics: explained variance
# Predict for a new student
new_student = np.array([[9]])
prediction = model.predict(new_student)
print(f"\nPredicted score for 9 hours study: {prediction[0]:.1f}")
Output:
Intercept (β₀): 41.11
Slope (β₁): 7.06
R² Score: 0.9860
Predicted score for 9 hours study: 104.6
Note that the predicted score is above 100. If the exam is scored out of 100, this is an impossible value: a fitted line should not be trusted far outside the range of the data it was fitted on (here, 1 to 8 hours).
Common Pitfalls in Statistical Thinking
1. Correlation ≠ Causation
Ice cream sales correlate with drowning rates. Both are caused by summer heat — not each other.
Always ask: Is there a confounding variable?
ML connection: Feature importance in models shows correlation, not causation. A model predicting house prices might use "number of bathrooms" as a feature — but bathrooms don't cause high prices; both reflect house size.
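The ice cream example can be simulated to show how a confounder creates correlation. In this sketch all numbers are invented: temperature drives both variables, and neither affects the other. The helper functions `pearson` and `residuals` are written here for illustration.

```python
import random
from statistics import mean

random.seed(7)  # fixed seed so the run is repeatable

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing a least-squares line on xs."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Confounder: daily temperature (Celsius) on 500 hypothetical days.
temps = [random.uniform(10, 35) for _ in range(500)]
# Both outcomes depend on temperature plus independent noise.
ice_cream = [20 + 3.0 * t + random.gauss(0, 8) for t in temps]
drownings = [1 + 0.2 * t + random.gauss(0, 1) for t in temps]

raw_r = pearson(ice_cream, drownings)
# Control for temperature by correlating what it does not explain.
adjusted_r = pearson(residuals(ice_cream, temps), residuals(drownings, temps))

print(f"Correlation, ice cream vs drownings:   {raw_r:.2f}")
print(f"Same, after controlling for temperature: {adjusted_r:.2f}")
```

The raw correlation is strong, yet it largely disappears once temperature is accounted for, which is the signature of a confounding variable.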
2. Survivorship Bias
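In WWII, analysts mapped the bullet holes on bombers that returned and planned to armor the most-damaged areas. Abraham Wald pointed out the flaw: planes hit in the other areas never made it back, so the armor belongs where the returning planes show no holes.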
ML connection: Training data only contains "surviving" examples. A fraud detection model trained on caught fraudsters misses the ones that weren't caught.
3. Simpson's Paradox
A trend can reverse when subgroups are combined. Hospital A can have a higher overall survival rate while Hospital B has better rates at every individual severity level, for example because Hospital B treats far more severe cases.
ML connection: Aggregated metrics can mislead. A model might look accurate overall but fail for specific subgroups (fairness issue).
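A worked example with hypothetical patient counts shows how the reversal arises. The numbers are invented so that Hospital B treats mostly severe cases.

```python
# Hypothetical counts per severity level: (survivors, patients).
hospitals = {
    "Hospital A": {"mild": (870, 900), "severe": (30, 100)},
    "Hospital B": {"mild": (196, 200), "severe": (320, 800)},
}

def stratum_rate(name, level):
    """Survival rate for one hospital at one severity level."""
    survived, patients = hospitals[name][level]
    return survived / patients

def overall_rate(name):
    """Survival rate for one hospital with all severity levels pooled."""
    survived = sum(s for s, _ in hospitals[name].values())
    patients = sum(p for _, p in hospitals[name].values())
    return survived / patients

for name in hospitals:
    print(f"{name}: mild = {stratum_rate(name, 'mild'):.1%}, "
          f"severe = {stratum_rate(name, 'severe'):.1%}, "
          f"overall = {overall_rate(name):.1%}")
```

Hospital B is better within each severity level, but its pooled rate is dragged down because 80% of its patients are severe cases. Comparing pooled rates alone would favor the wrong hospital.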
4. P-Hacking
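Running many tests until one yields p < 0.05 inflates the false positive rate: with 20 independent tests of true null hypotheses, the chance of at least one false positive is about 64%. Pre-register your hypotheses, or correct for multiple comparisons.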
ML connection: Trying many hyperparameters until you get good test performance is the ML version of p-hacking. Tune on a separate validation set and evaluate on the test set only once, at the end.
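The inflation from repeated testing can be quantified with simple arithmetic. The sketch below assumes the tests are independent and that every null hypothesis is true; the function name `family_wise_error_rate` is introduced here for illustration.

```python
def family_wise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests,
    each run at significance level alpha, when every null is true."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20, 100):
    print(f"{k:>3} tests -> P(at least one false positive) = "
          f"{family_wise_error_rate(k):.3f}")

# Bonferroni correction: test each hypothesis at alpha / k instead,
# which keeps the family-wise error rate at or below alpha.
k = 20
bonferroni_alpha = 0.05 / k
corrected = family_wise_error_rate(k, alpha=bonferroni_alpha)
print(f"Bonferroni with {k} tests: per-test alpha = {bonferroni_alpha}, "
      f"family-wise rate = {corrected:.3f}")
```

The Bonferroni correction is conservative but simple. It is one of several ways to correct for multiple comparisons.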
Practice Exercises
Exercise 1: In your own words, explain the difference between a parameter and a statistic. Give one example of each.
Exercise 2: Classify each scenario as descriptive or inferential statistics:
- a) Finding the average age of students in your classroom
- b) Using a survey of 1,000 adults to estimate the proportion of all adults who prefer remote work
- c) Creating a bar chart of monthly sales for the past year
Exercise 3 (Code): Load the tips dataset from seaborn and compute:
- Mean, median, and standard deviation of the total_bill column
- A 95% confidence interval for the mean tip percentage
import seaborn as sns
tips = sns.load_dataset('tips')
# Your code here
Solution:
import seaborn as sns
import numpy as np
from scipy import stats
tips = sns.load_dataset('tips')
tips['tip_pct'] = tips['tip'] / tips['total_bill'] * 100
bill = tips['total_bill']
tip_pct = tips['tip_pct']
print(f"Total Bill — Mean: {bill.mean():.2f}, Median: {bill.median():.2f}, SD: {bill.std():.2f}")
ci = stats.t.interval(0.95, df=len(tip_pct)-1,
                      loc=tip_pct.mean(),
                      scale=stats.sem(tip_pct))
print(f"95% CI for mean tip %: ({ci[0]:.2f}%, {ci[1]:.2f}%)")
Key Takeaways
- Statistics converts raw data into knowledge through collection, analysis, and interpretation.
- Descriptive statistics summarize what you have; inferential statistics generalize to what you don't.
- Every ML model, deep learning network, and LLM is built on statistical foundations.
- Data quality matters more than data quantity: garbage in, garbage out.
"Without data, you're just another person with an opinion." — W. Edwards Deming
What to Learn Next
-> Types of Data: Learn the difference between qualitative and quantitative data, the first step in any analysis.
-> Levels of Measurement: Nominal, ordinal, interval, ratio. Which statistics are valid for each?
-> Descriptive Statistics: Master mean, median, and mode, the numbers that summarize any dataset.
-> Probability Theory: The math of uncertainty and the foundation of all inference.
-> Normal Distribution: The bell curve that appears throughout statistics, and why it matters for ML.
-> Hypothesis Testing: How to weigh the evidence for or against a claim using data.