Applications in Data Science

Why It Matters

Statistical thinking is essential for trustworthy data science, from experiments to causal claims. Without rigorous statistics, A/B tests produce false positives, models overfit, and causal claims confuse correlation with causation. Mastering the full statistical toolkit — from hypothesis testing to causal inference — ensures your conclusions are reliable, reproducible, and actionable.

Overview

Statistics powers the complete data science lifecycle. A/B testing uses two-sample proportion or mean tests to compare treatment and control groups, enabling data-driven product decisions. Power analysis determines required sample sizes before experiments, preventing wasted resources on underpowered studies. Causal inference distinguishes correlation from causation: randomized experiments are the gold standard, while propensity scores, instrumental variables, and difference-in-differences address observational data. Feature selection uses chi-square tests, permutation importance, and mutual information. Model evaluation relies on cross-validation, AUC-ROC, and calibration curves. Understanding how these pieces fit together transforms data analysis from ad hoc number-crunching into rigorous, reproducible science.

Key Concepts

Two-Proportion Z-Test (A/B Testing)

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}}$$

Here,

$\hat{p}_1, \hat{p}_2$ = Sample proportions for control and treatment, $\hat{p}_i = x_i / n_i$
$x_i, n_i$ = Number of successes and sample size in group $i$
$\hat{p}$ = Pooled proportion: $(x_1 + x_2)/(n_1 + n_2)$

Power Analysis (Sample Size)

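$$n = \frac{(z_{\alpha/2} + z_{\beta})^2 \cdot 2\sigma^2}{\delta^2}$$

Here,

$n$ = Required sample size per group
$\sigma^2$ = Variance of the metric within each group (for a proportion, approximately $p(1-p)$)
$\delta$ = Minimum detectable effect (MDE)
$z_{\alpha/2}$ = Two-sided critical value for the significance level (1.96 for $\alpha = 0.05$)
$z_{\beta}$ = Critical value for the desired power (0.842 for power = 80%)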
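The formula above can be evaluated directly. The sketch below is a minimal illustration using only the Python standard library; the function name `sample_size_per_group` and the example inputs (`sigma=1.0`, `delta=0.2`) are assumptions chosen for illustration, and the calculation relies on the normal approximation rather than an exact t-based power computation.

```python
import math
from statistics import NormalDist


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample comparison (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.842 for power = 0.80
    n = (z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)  # round up so achieved power is not below the target


# Hypothetical inputs: unit standard deviation, MDE of 0.2 standard deviations
n_small_effect = sample_size_per_group(sigma=1.0, delta=0.2)
print(n_small_effect)  # 393
```

Because $n$ scales with $1/\delta^2$, halving the minimum detectable effect roughly quadruples the required sample size.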

Cohen's d (Effect Size)

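$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$$

Here,

$\bar{x}_1, \bar{x}_2$ = Sample means of the two groups
$s_p$ = Pooled standard deviation: $\sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$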
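As a minimal sketch, Cohen's d can be computed with the standard library. The function name `cohens_d` and the data are hypothetical, and the pooled standard deviation is assumed to use the usual degrees-of-freedom weighting of the two sample variances.

```python
import math
from statistics import mean, variance


def cohens_d(group_1, group_2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group_1), len(group_2)
    # variance() is the sample variance (n - 1 denominator)
    pooled_var = (
        (n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)
    ) / (n1 + n2 - 2)
    return (mean(group_1) - mean(group_2)) / math.sqrt(pooled_var)


# Hypothetical page-load times in seconds
control = [2.0, 4.0, 6.0]
treatment = [1.0, 3.0, 5.0]
d = cohens_d(control, treatment)
print(d)  # 0.5: the means differ by 1 and the pooled SD is 2
```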

Chi-Square Feature Selection

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

Here,

$O_i$ = Observed frequency of feature-category combination $i$
$E_i$ = Expected frequency under independence
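For a single binary feature and a binary target, the statistic reduces to a 2x2 table. The following is a sketch under stated assumptions: the counts are made up, the function name `chi_square_2x2` is hypothetical, no continuity correction is applied, and the p-value uses the fact that a chi-square variable with one degree of freedom is a squared standard normal.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square for a 2x2 table [[a, b], [c, d]], no continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total  # E under independence
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows = word present / absent, columns = spam / ham
counts = [[30, 10], [20, 40]]
chi2, p_value = chi_square_2x2(counts)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2g}")
```

Here the expected counts are 20, 20, 30, and 30, so each cell deviates by 10 and the statistic is about 16.67.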

Causal Inference Methods

Method	Description	Key Assumption	When to Use
Randomized Experiment	Gold standard	Random assignment	When feasible
Propensity Score Matching	Match treated and control on covariates	No unmeasured confounders	Observational, pre-treatment covariates available
Instrumental Variables	Use exogenous variation	Exclusion restriction	When confounders are unmeasurable
Difference-in-Differences	Compare pre/post changes	Parallel trends	Before/after with control group
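To make the last row of the table concrete, here is a minimal difference-in-differences sketch. The function name `diff_in_diff` and the revenue figures are hypothetical, and the estimate is only interpretable as a causal effect if the parallel trends assumption holds.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical weekly revenue per store, before and after a rollout
effect = diff_in_diff(
    treated_pre=[10.0, 12.0], treated_post=[15.0, 17.0],
    control_pre=[9.0, 11.0], control_post=[11.0, 13.0],
)
print(effect)  # 3.0: treated rose by 5, control rose by 2
```

Subtracting the control group's change removes any shift that affected both groups equally, which is why the method needs a control group observed over the same period.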

A/B Testing Workflow

Define the metric: Choose what to measure (conversion rate, revenue, latency)
Formulate hypotheses: $H_0$: no difference; $H_1$: a difference exists
Power analysis: Determine sample size before collecting data
Random assignment: Users randomly assigned to control (A) and treatment (B)
Collect data: Run experiment for predetermined duration
Compute test statistic: Z-test for proportions or t-test for means
Make decision: Reject $H_0$ if p-value ≤ $\alpha$; act on the result only if the effect is also practically meaningful
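The random assignment step in the workflow above is often implemented deterministically, so that a returning user always sees the same variant. The sketch below shows one such approach, hashing a user ID together with an experiment name. This technique is an assumption and is not described in the text; `assign_variant` and the experiment name `checkout_button` are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_share=0.5):
    """Map a user to 'A' (control) or 'B' (treatment), stable across calls."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the first 32 bits of the hash as a number in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < treatment_share else "A"


# Hypothetical user IDs 0..9999
assignments = [assign_variant(uid, "checkout_button") for uid in range(10_000)]
share_b = assignments.count("B") / len(assignments)
print(f"share assigned to B: {share_b:.3f}")
```

Including the experiment name in the hash keeps assignments in one experiment from lining up with assignments in another.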

Quick Example

A/B Test: Conversion Rate

Control: 50/1000 converted. Treatment: 70/1000 converted.

$$\hat{p}_1 = 0.05, \quad \hat{p}_2 = 0.07, \quad \hat{p} = 0.06$$

$$Z = \frac{0.05 - 0.07}{\sqrt{0.06 \times 0.94 \times (1/1000 + 1/1000)}} = \frac{-0.02}{0.01062} \approx -1.883$$

The two-sided p-value is $p \approx 0.060 > 0.05$, so we fail to reject $H_0$ at $\alpha = 0.05$: the difference is not statistically significant. However, the effect size (2 percentage points) may be practically meaningful; consider a larger, pre-specified follow-up experiment or weigh the business context.
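The hand calculation can be checked with a few lines of Python. This is a minimal sketch using the standard library's `NormalDist`; the function name `two_proportion_z_test` is an assumption, and the p-value is two-sided. The output should land close to the figures above, with small differences depending on how the standard error is rounded.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Control: 50/1000 converted. Treatment: 70/1000 converted.
z, p_value = two_proportion_z_test(50, 1000, 70, 1000)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```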

Sample Size Calculation

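To detect a 5 percentage point improvement in conversion rate (from 10% to 15%) with 80% power at $\alpha = 0.05$:

Using the power analysis formula with $\sigma^2 \approx \bar{p}(1-\bar{p}) = 0.125 \times 0.875 \approx 0.1094$ and $\delta = 0.05$: $n \approx 7.85 \times 2 \times 0.1094 / 0.0025 \approx 687$ per group. At this sample size the study has an 80% chance of detecting the effect if it exists. Always compute this before running the experiment: underpowered studies waste resources and produce inconclusive results.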

Feature Selection with Chi-Square

In NLP, suppose you have 1000 word features and a binary target (spam/ham). For each word, test whether its presence is independent of the target using chi-square. Words with low p-values (evidence of association with the target) are kept; words with high p-values are removed. Apply Benjamini-Hochberg FDR correction to control false discoveries across the 1000 tests, then select the top 50 features (for example, ranked by chi-square statistic) for your classifier.
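The Benjamini-Hochberg step can be sketched in a few lines. The function name `benjamini_hochberg` and the p-values below are hypothetical; the procedure sorts the p-values, finds the largest rank $k$ with $p_{(k)} \le (k/m) \cdot q$, and rejects every hypothesis up to that rank.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of hypotheses rejected by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k (1-based) whose sorted p-value is <= (k / m) * fdr
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_k = rank
    # Everything ranked at or below max_k is rejected
    return sorted(order[:max_k])


# Hypothetical p-values from per-word chi-square tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
kept = benjamini_hochberg(p_values, fdr=0.05)
print(kept)  # [0, 1]
```

Note that five of these ten p-values are below 0.05, but only two survive the correction.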

Common Pitfalls in Applied Statistics

Pitfall	Why It's Wrong	Correct Approach
Stopping the experiment as soon as p < 0.05	Repeated looks at the data inflate the false positive rate	Pre-specify sample size, run to completion
Ignoring practical significance	Trivial effects become "significant" with large $n$	Report effect sizes and confidence intervals
Cherry-picking subgroups	Inflates false discovery rate	Pre-specify subgroups, adjust for multiple testing
Using accuracy for imbalanced classes	95% accuracy by always predicting majority class	Use F1, AUC-ROC, or precision-recall curves
Treating correlation as causation	Observational association doesn't imply causation	Use experiments or causal inference methods
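The first pitfall in the table can be demonstrated by simulation. The sketch below runs A/A tests, where both arms come from the same distribution, so every rejection is a false positive. It compares checking at 10 interim looks against a single test at the planned sample size. All parameters (`n_sims`, `n_per_arm`, `n_looks`, the seed) are illustrative assumptions, and a z-test with known unit variance is used for simplicity.

```python
import math
import random

Z_CRIT = 1.96  # two-sided critical value for alpha = 0.05


def simulate_false_positive_rates(n_sims=1000, n_per_arm=500, n_looks=10, seed=0):
    """A/A simulation: returns (rate with peeking, rate with one final test).

    Assumes n_per_arm is a multiple of n_looks so the last look is the final test.
    """
    rng = random.Random(seed)
    look_every = n_per_arm // n_looks
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        rejected_at_any_look = False
        z = 0.0
        for n in range(1, n_per_arm + 1):
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
            if n % look_every == 0:
                # z-test for a difference in means with known unit variance
                z = (sum_a / n - sum_b / n) / math.sqrt(2 / n)
                if abs(z) > Z_CRIT:
                    rejected_at_any_look = True
        peeking_hits += rejected_at_any_look
        fixed_hits += abs(z) > Z_CRIT  # z from the final, pre-specified look
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = simulate_false_positive_rates()
print(f"stop at first p < 0.05: {peeking_rate:.3f}, fixed sample size: {fixed_rate:.3f}")
```

The fixed-sample rate should sit near the nominal 5%, while the peeking rate should be several times higher.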

Key Takeaways

Summary: Applications in Data Science

A/B Testing: Use two-sample proportion or mean tests. Randomize, pre-specify $\alpha$, and compute power before collecting data.
Power Analysis: Determine sample size from the expected effect size (such as Cohen's d), desired power (commonly 0.80), and $\alpha$ (commonly 0.05). Underpowered studies waste resources.
Causal Inference: Randomized experiments are the gold standard. For observational data, use propensity scores, IV, or DiD under strong assumptions.
Feature Selection: Chi-square tests for categorical features; permutation importance for any model; mutual information for non-linear relationships.
Model Evaluation: Cross-validate for unbiased performance estimates. Use AUC-ROC for threshold-independent evaluation. Check calibration.
Multiple Comparisons: Every additional test raises the chance of at least one false positive. Use Bonferroni, Holm, or FDR correction when running many tests.
Reproducibility: Pre-register hypotheses, report all tests, provide confidence intervals alongside p-values, and share code/data.
Beyond p-values: Effect sizes, confidence intervals, and practical significance matter more than binary significant/not-significant decisions.
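As an illustration of reporting a confidence interval alongside a p-value, the sketch below computes a 95% interval for the lift in the Quick Example (50/1000 vs. 70/1000). The function name `diff_proportion_ci` is hypothetical, and the interval is a simple Wald interval with an unpooled standard error, which is one common choice among several.

```python
import math
from statistics import NormalDist


def diff_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Wald confidence interval for p2 - p1 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p2 - p1
    return diff - z * se, diff + z * se


# Control: 50/1000 converted. Treatment: 70/1000 converted.
low, high = diff_proportion_ci(50, 1000, 70, 1000)
print(f"lift = 0.020, 95% CI = ({low:.4f}, {high:.4f})")
```

The interval runs from just below zero to about 4 percentage points. That is more informative than "not significant" alone: the data are compatible with no effect and with a lift twice the observed one.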

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Machine Learning Applications

Statistics in Machine Learning — How statistical methods power ML: hypothesis testing for model comparison, confidence intervals for metrics, and Bayesian approaches

Review and Roadmap

Statistics Review and Roadmap — Comprehensive review of all statistical concepts with a structured learning roadmap
