Statistics: Hypothesis Testing, p-values, Confidence Intervals

GOOGLE & NETFLIX INTERVIEW QUESTION

Statistical Inference & Decision Making

The Interview Question

Question: You're a data scientist at Netflix testing whether a new recommendation algorithm increases user engagement. You run an A/B test with:

  • Control group: 50,000 users, mean watch time = 45 min/day, std = 12 min
  • Treatment group: 50,000 users, mean watch time = 47 min/day, std = 13 min
  1. Set up the hypothesis test properly
  2. Calculate the p-value and interpret it
  3. Determine if the result is practically significant
  4. What are the potential pitfalls and how do you address them?

Detailed Answer

1. Hypothesis Testing Framework

Hypothesis testing is the foundation of statistical inference. It provides a structured way to make decisions about population parameters based on sample data.

Step 1: Define Hypotheses

Null Hypothesis (H₀): μ_treatment - μ_control = 0
Alternative Hypothesis (H₁): μ_treatment - μ_control ≠ 0 (two-tailed)
                          OR: μ_treatment - μ_control > 0 (one-tailed)

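Pro Tip: For business decisions, a one-tailed test can be appropriate when only an increase in engagement would change the decision; in that case use H₁: μ_treatment > μ_control. Fix the direction before looking at the data, because switching to a one-tailed test after seeing the result halves the p-value and inflates the Type I error rate.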
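
To make the one-tailed versus two-tailed choice concrete, here is a minimal sketch that computes both p-values from the summary statistics in the question. It assumes the large-sample z approximation, and the variable names are illustrative only.

```python
import math
from scipy import stats

# Summary statistics from the question
n_c, n_t = 50_000, 50_000
mean_c, mean_t = 45.0, 47.0
sd_c, sd_t = 12.0, 13.0

# Standard error of the difference in means (each group keeps its own variance)
se = math.sqrt(sd_c**2 / n_c + sd_t**2 / n_t)
z = (mean_t - mean_c) / se

# Two-tailed: H1 says the means differ in either direction
p_two_tailed = 2 * stats.norm.sf(abs(z))
# One-tailed: H1 says the treatment mean is larger
p_one_tailed = stats.norm.sf(z)

print(f"z = {z:.2f}")
print(f"two-tailed p = {p_two_tailed:.3e}")
print(f"one-tailed p = {p_one_tailed:.3e}")
```

When the observed difference is in the hypothesized direction, the one-tailed p-value is exactly half the two-tailed one, which is why the direction has to be chosen in advance.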

Step 2: Choose Significance Level (α)

The significance level is the probability of rejecting the null hypothesis when it's actually true (Type I error).

α = 0.05 (5%) — Standard for most tests
α = 0.01 (1%) — For high-stakes decisions
α = 0.10 (10%) — For exploratory analysis
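
One way to see what α controls is to simulate many A/A tests, where both groups come from the same distribution and every rejection is a false positive. The sketch below uses simulated data with made-up group sizes; only the mean of 45 and standard deviation of 12 are borrowed from the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

n_experiments = 2000   # number of simulated A/A tests
n_per_group = 500      # users per group in each simulated test
alpha = 0.05

# Both groups share the same distribution, so H0 is true in every experiment
a = rng.normal(loc=45, scale=12, size=(n_experiments, n_per_group))
b = rng.normal(loc=45, scale=12, size=(n_experiments, n_per_group))

# Two-sample z-test for each simulated experiment
se = np.sqrt(a.var(axis=1, ddof=1) / n_per_group + b.var(axis=1, ddof=1) / n_per_group)
z = (b.mean(axis=1) - a.mean(axis=1)) / se
p_values = 2 * stats.norm.sf(np.abs(z))

false_positive_rate = np.mean(p_values < alpha)
print(f"Share of A/A tests rejected at alpha={alpha}: {false_positive_rate:.3f}")
```

The rejected share should land close to α, which is the sense in which α is the Type I error rate.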

2. Calculating the Test Statistic

For comparing two means with large samples, we use the z-test:

import numpy as np
from scipy import stats

# Given data
n_control = 50000
n_treatment = 50000
mean_control = 45
mean_treatment = 47
std_control = 12
std_treatment = 13

# Standard error of the difference in means
# (unpooled: each group keeps its own variance, despite the variable name)
pooled_std = np.sqrt((std_control**2 / n_control) + (std_treatment**2 / n_treatment))

# Calculate z-statistic
z_stat = (mean_treatment - mean_control) / pooled_std

# Calculate p-value (two-tailed)
# norm.sf is used because 1 - norm.cdf rounds to exactly 0 for a z this large
p_value = 2 * stats.norm.sf(abs(z_stat))

print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.3e}")

Mathematical Formula:

z = (x̄₁ - x̄₂) / √(s₁²/n₁ + s₂²/n₂)

where:
x̄₁, x̄₂ = sample means
s₁, s₂ = sample standard deviations
n₁, n₂ = sample sizes
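
As a cross-check on the formula, the same comparison can be run as Welch's t-test directly from the summary statistics. This is a sketch using `scipy.stats.ttest_ind_from_stats`; with 50,000 users per group the t-statistic has the same value as the z-statistic, and the t and normal reference distributions are nearly identical.

```python
from scipy import stats

# Welch's t-test computed directly from the summary statistics in the question
result = stats.ttest_ind_from_stats(
    mean1=47, std1=13, nobs1=50_000,   # treatment
    mean2=45, std2=12, nobs2=50_000,   # control
    equal_var=False,                   # do not assume equal variances
)

print(f"t-statistic: {result.statistic:.4f}")
print(f"p-value: {result.pvalue:.3e}")
```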

3. Interpreting the P-value

# Interpret the results
alpha = 0.05

print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"Significance level (α): {alpha}")
print(f"\nInterpretation:")
if p_value < alpha:
    print(f"Since p-value ({p_value:.6f}) < α ({alpha})")
    print("We reject the null hypothesis")
    print("The new recommendation algorithm has a statistically significant effect")
else:
    print(f"Since p-value ({p_value:.6f}) >= α ({alpha})")
    print("We fail to reject the null hypothesis")
    print("There's insufficient evidence to conclude the algorithm has an effect")

Common Misconceptions About P-values:

Misconception: "P-value is the probability H₀ is true"
Reality: P-value is P(data at least this extreme | H₀), not P(H₀ | data)

Misconception: "P < 0.05 means the effect is real"
Reality: P < 0.05 means we'd rarely see data this extreme if H₀ were true

Misconception: "Larger p-value means no effect"
Reality: Large p-value means insufficient evidence, not proof of no effect

Misconception: "P-value measures effect size"
Reality: P-value depends on sample size; large n can make tiny effects significant

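The point about sample size can be illustrated with a small sketch: the same tiny difference in means is tested at increasing group sizes. The 0.1-minute difference is a hypothetical value chosen for illustration; the standard deviations are the ones from the question.

```python
import math
from scipy import stats

def two_sample_z_pvalue(diff, sd1, sd2, n_per_group):
    """Two-tailed p-value for a difference in means under the z approximation."""
    se = math.sqrt(sd1**2 / n_per_group + sd2**2 / n_per_group)
    return 2 * stats.norm.sf(abs(diff) / se)

# Hypothetical tiny effect: 0.1 minutes/day
tiny_diff = 0.1
group_sizes = (1_000, 10_000, 100_000, 1_000_000)
results = {n: two_sample_z_pvalue(tiny_diff, 12, 13, n) for n in group_sizes}

for n, p in results.items():
    print(f"n per group = {n:>9,}: p = {p:.4g}")
```

The effect never changes, yet the p-value shrinks steadily as n grows, which is why a p-value alone says nothing about how large or useful an effect is.
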
4. Confidence Intervals

A confidence interval provides a range of plausible values for the true difference in means.

# Calculate 95% confidence interval for difference in means
diff = mean_treatment - mean_control
se_diff = pooled_std

# 95% CI: diff ± 1.96 * SE
ci_lower = diff - 1.96 * se_diff
ci_upper = diff + 1.96 * se_diff

print(f"Point estimate: {diff} minutes")
print(f"95% CI: ({ci_lower:.3f}, {ci_upper:.3f})")
print(f"CI width: {ci_upper - ci_lower:.3f} minutes")

# Interpretation
print(f"\nInterpretation:")
print(f"We are 95% confident that the true difference in mean watch time")
print(f"between treatment and control is between {ci_lower:.2f} and {ci_upper:.2f} minutes")

Mathematical Formula:

CI = (x̄₁ - x̄₂) ± z_(α/2) × √(s₁²/n₁ + s₂²/n₂)

For 95% CI: z_(α/2) = 1.96
For 99% CI: z_(α/2) = 2.576
For 90% CI: z_(α/2) = 1.645
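
The critical values above do not need to be memorized; they can be derived from the normal quantile function. The following sketch applies the formula to the numbers in the question at three confidence levels.

```python
import math
from scipy import stats

diff = 47 - 45                                    # observed difference in means (minutes)
se = math.sqrt(12**2 / 50_000 + 13**2 / 50_000)   # standard error of the difference

intervals = {}
for level in (0.90, 0.95, 0.99):
    z_crit = stats.norm.ppf(1 - (1 - level) / 2)  # z_(alpha/2) for this confidence level
    intervals[level] = (diff - z_crit * se, diff + z_crit * se)
    lo, hi = intervals[level]
    print(f"{level:.0%} CI: z = {z_crit:.3f}, ({lo:.3f}, {hi:.3f})")
```

Higher confidence levels give wider intervals, and none of the three contains 0, which agrees with rejecting H₀ in the two-tailed test.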

5. Effect Size and Practical Significance

Statistical significance ≠ Practical significance. We need to measure effect size.

# Cohen's d for effect size
pooled_std_cohensd = np.sqrt(
    ((n_control - 1) * std_control**2 + (n_treatment - 1) * std_treatment**2) / 
    (n_control + n_treatment - 2)
)

cohens_d = (mean_treatment - mean_control) / pooled_std_cohensd

print(f"Cohen's d: {cohens_d:.4f}")
print(f"\nEffect size interpretation:")
if abs(cohens_d) < 0.2:
    print("Negligible effect")
elif abs(cohens_d) < 0.5:
    print("Small effect")
elif abs(cohens_d) < 0.8:
    print("Medium effect")
else:
    print("Large effect")

# Practical significance calculation
revenue_per_minute = 0.05  # Assumption: each minute of viewing generates $0.05 in revenue
annual_impact_per_user = diff * revenue_per_minute * 365
total_annual_impact = annual_impact_per_user * 1000000  # 1M users

print(f"\nPractical Impact:")
print(f"Additional watch time per user per day: {diff} minutes")
print(f"Annual revenue impact per user: ${annual_impact_per_user:.2f}")
print(f"Total annual impact (1M users): ${total_annual_impact:,.0f}")

6. Power Analysis

Power is the probability of correctly rejecting a false null hypothesis (1 - β).

from statsmodels.stats.power import TTestIndPower

# Calculate required sample size for 80% power
effect_size = cohens_d
alpha = 0.05
power = 0.80

analysis = TTestIndPower()
required_n = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    ratio=1.0  # Equal group sizes
)

print(f"Required sample size per group: {int(np.ceil(required_n))}")
print(f"Current sample size: {n_control}")
print(f"Sufficient power: {'Yes' if n_control >= required_n else 'No'}")

# Calculate actual power with current sample size
actual_power = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    nobs1=n_control,
    ratio=1.0
)
print(f"Actual power: {actual_power:.4f}")

Power Analysis Formula:

n = (z_(α/2) + z_β)² × 2σ² / δ²

where:
n = sample size per group
z_(α/2) = critical value for significance level
z_β = critical value for power (1 - β)
σ = standard deviation
δ = minimum detectable difference in means (in the same units as σ)
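
The formula can be evaluated by hand as a sanity check on library output. The sketch below is one possible implementation; it assumes a two-sided test with equal group sizes and takes σ as the root-mean-square of the two group standard deviations, which is a simplification introduced here rather than something stated in the question.

```python
import math
from scipy import stats

def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample z-test with equal group sizes."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # z_(alpha/2)
    z_beta = stats.norm.ppf(power)           # z_beta for the target power
    return (z_alpha + z_beta) ** 2 * 2 * sigma**2 / delta**2

# Assumption: sigma is the root-mean-square of the two group standard deviations
sigma = math.sqrt((12**2 + 13**2) / 2)

n_required = math.ceil(sample_size_per_group(sigma, delta=2.0))
print(f"n per group to detect a 2-minute difference: {n_required}")

# Inverting the formula gives the smallest difference detectable with 50,000 users per group
z_sum = stats.norm.ppf(0.975) + stats.norm.ppf(0.80)
mde = z_sum * math.sqrt(2 * sigma**2 / 50_000)
print(f"Minimum detectable difference at n = 50,000: {mde:.3f} minutes")
```

Under these assumptions the experiment in the question is far larger than needed to detect a 2-minute difference, and could detect differences of a fraction of a minute.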

7. Potential Pitfalls and Solutions

# Pitfall 1: Multiple Comparisons
# Testing multiple metrics inflates the family-wise Type I error rate, so tighten alpha per test
n_tests = 5
bonferroni_alpha = 0.05 / n_tests
print(f"Bonferroni-corrected alpha: {bonferroni_alpha:.4f}")

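# Pitfall 2: Peeking at Results
# Use sequential testing or always-valid inference procedures
from scipy import stats

# Calculate alpha spending function
def alpha_spending(spent_fraction, total_alpha=0.05):
    """O'Brien-Fleming-type (Lan-DeMets) spending function; spent_fraction in (0, 1]"""
    z_crit = stats.norm.ppf(1 - total_alpha / 2)
    # Cumulative alpha spent: very little early on, all of total_alpha at spent_fraction = 1
    return 2 * (1 - stats.norm.cdf(z_crit / np.sqrt(spent_fraction)))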

# Pitfall 3: Simpson's Paradox
# Check for confounding variables
print("\nExample of Simpson's Paradox:")
print("Overall: Treatment looks better")
print("But when stratified by user segment:")
print("- New users: Control better")
print("- Medium users: Control better")
print("- Power users: Control better")
print("The reversal appears when the segment mix differs between groups,")
print("e.g., the treatment group contains a larger share of power users")

# Pitfall 4: Selection Bias
# Ensure random assignment
print("\nChecking for selection bias:")
print("Pre-test characteristics should be similar across groups.")
print("Compare pre-experiment metrics (e.g., prior mean watch time), not the in-test")
print(f"means of {mean_control:.2f} and {mean_treatment:.2f}, which already include the treatment effect.")

Critical Warning: Never stop a test early just because you see "significant" results. This inflates Type I error. Use proper sequential testing methods instead.
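
The size of the inflation can be estimated with a simulation of A/A tests that are checked repeatedly. This is a sketch on simulated data: the number of looks, the batch size, and the use of a known σ = 12 are all assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed for reproducibility

n_sims = 2000        # simulated A/A experiments (no true effect)
n_looks = 10         # interim analyses per experiment
batch = 100          # new users per group between looks
z_crit = 1.96        # naive two-sided 5% threshold applied at every look

# Both groups are drawn from the same distribution, so H0 is true
a = rng.normal(45, 12, size=(n_sims, n_looks, batch))
b = rng.normal(45, 12, size=(n_sims, n_looks, batch))

# Cumulative sums over batches give the running group means at each look
n_seen = batch * np.arange(1, n_looks + 1)
mean_a = a.sum(axis=2).cumsum(axis=1) / n_seen
mean_b = b.sum(axis=2).cumsum(axis=1) / n_seen

# Known sigma = 12 is used for simplicity (an assumption of this sketch)
se = np.sqrt(2 * 12**2 / n_seen)
z = (mean_b - mean_a) / se

reject_any_look = (np.abs(z) > z_crit).any(axis=1).mean()
reject_final_only = (np.abs(z[:, -1]) > z_crit).mean()

print(f"False positive rate, stop at first significant look: {reject_any_look:.3f}")
print(f"False positive rate, single analysis at the end:     {reject_final_only:.3f}")
```

Stopping at the first "significant" look rejects far more often than the nominal 5%, while a single analysis at the planned end stays near it.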

8. Common Follow-Up Questions

Follow-up 1: What if the data isn't normally distributed?

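# Use non-parametric tests
import numpy as np
from scipy.stats import mannwhitneyu

# Per-user watch times are not given in the question, so simulated
# right-skewed samples stand in for the raw data (illustration only)
rng = np.random.default_rng(0)
control_group_data = rng.gamma(shape=2.0, scale=22.5, size=5000)
treatment_group_data = rng.gamma(shape=2.0, scale=23.5, size=5000)

# Mann-Whitney U test (rank-based alternative to the t-test; it compares
# the two distributions as a whole rather than their means)
stat, p_value_mw = mannwhitneyu(
    treatment_group_data, 
    control_group_data, 
    alternative='greater'
)
print(f"Mann-Whitney U test p-value: {p_value_mw:.6f}")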

# Bootstrap confidence interval
def bootstrap_ci(data1, data2, n_bootstrap=10000, ci=0.95):
    """Calculate bootstrap confidence interval for difference in means"""
    boot_diffs = []
    for _ in range(n_bootstrap):
        boot1 = np.random.choice(data1, size=len(data1), replace=True)
        boot2 = np.random.choice(data2, size=len(data2), replace=True)
        boot_diffs.append(np.mean(boot1) - np.mean(boot2))
    
    lower = np.percentile(boot_diffs, (1-ci)/2 * 100)
    upper = np.percentile(boot_diffs, (1+ci)/2 * 100)
    return lower, upper
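
A self-contained usage sketch of the percentile bootstrap on skewed data follows. The gamma-distributed samples are hypothetical stand-ins for raw watch times, and `bootstrap_diff_ci` is a compact variant written for this example rather than the function above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated right-skewed watch times (hypothetical data, not from the question)
control = rng.gamma(shape=2.0, scale=22.5, size=2000)     # mean about 45
treatment = rng.gamma(shape=2.0, scale=23.5, size=2000)   # mean about 47

def bootstrap_diff_ci(x, y, n_bootstrap=2000, ci=0.95):
    """Percentile bootstrap CI for mean(x) - mean(y)."""
    diffs = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        # Resample each group with replacement, keeping the original group size
        diffs[i] = rng.choice(x, size=x.size).mean() - rng.choice(y, size=y.size).mean()
    tail = (1 - ci) / 2 * 100
    return np.percentile(diffs, tail), np.percentile(diffs, 100 - tail)

observed = treatment.mean() - control.mean()
lower, upper = bootstrap_diff_ci(treatment, control)
print(f"Observed difference: {observed:.2f}")
print(f"95% bootstrap CI: ({lower:.2f}, {upper:.2f})")
```

The bootstrap makes no normality assumption about the raw data, at the cost of needing the per-user observations rather than summary statistics.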

Follow-up 2: How do you handle multiple metrics?

# Family-wise error rate control
from statsmodels.stats.multitest import multipletests

p_values = [0.02, 0.04, 0.08, 0.12, 0.03]
metric_names = ['Watch Time', 'Completion Rate', 'Searches', 'Downloads', 'Return Visits']

# Bonferroni correction
rejected, corrected_p, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')

print("Results with Bonferroni correction:")
for name, p, corr_p, rej in zip(metric_names, p_values, corrected_p, rejected):
    print(f"{name}: p={p:.4f}, corrected_p={corr_p:.4f}, significant={rej}")

# False Discovery Rate (less conservative)
rejected_fdr, corrected_p_fdr, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
print("\nResults with FDR correction:")
for name, p, corr_p, rej in zip(metric_names, p_values, corrected_p_fdr, rejected_fdr):
    print(f"{name}: p={p:.4f}, corrected_p={corr_p:.4f}, significant={rej}")
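
To show where the corrected p-values come from, here is a hand-rolled sketch of both adjustments for the same five p-values. `benjamini_hochberg` is a helper written for this example; in practice a library routine is preferable.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices that sort p ascending
    scaled = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / i
    # Enforce monotonicity, working down from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0, 1)
    return adjusted

p_values = [0.02, 0.04, 0.08, 0.12, 0.03]

# Bonferroni: multiply each p-value by the number of tests, capped at 1
bonferroni = np.minimum(np.array(p_values) * len(p_values), 1.0)
bh = benjamini_hochberg(p_values)

print("Bonferroni:", np.round(bonferroni, 4))
print("BH (FDR):  ", np.round(bh, 4))
```

For these five values the Benjamini-Hochberg adjusted p-values are smaller than the Bonferroni ones, but none falls below 0.05 under either method.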

Company-Specific Tips

Google Tips:

  • Google often asks about Bayesian vs Frequentist approaches
  • Be prepared to explain p-values in business terms
  • Know when to use z-test vs t-test vs chi-square
  • Practice power analysis calculations

Netflix Tips:

  • Netflix interviews focus heavily on A/B testing methodology
  • Understand sequential testing and early stopping rules
  • Know how to handle network effects and interference
  • Be comfortable with regression-based analysis of experiments
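
For the regression-based view of an experiment, a minimal sketch on simulated data shows that regressing the outcome on a 0/1 treatment indicator recovers the difference in group means. The data-generating numbers are hypothetical, and plain least squares via NumPy stands in for a full regression package.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated experiment (hypothetical data): 0/1 treatment flag and watch time
n = 4000
treated = rng.integers(0, 2, size=n)
watch_time = 45 + 2 * treated + rng.normal(0, 12, size=n)

# OLS of watch_time on an intercept and the treatment indicator
X = np.column_stack([np.ones(n), treated])
coef, *_ = np.linalg.lstsq(X, watch_time, rcond=None)

diff_in_means = watch_time[treated == 1].mean() - watch_time[treated == 0].mean()
print(f"Regression coefficient on treatment: {coef[1]:.4f}")
print(f"Difference in group means:           {diff_in_means:.4f}")
```

The practical appeal of the regression form is that pre-experiment covariates can be added as extra columns, which often tightens the estimate.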
