Data Collection Methods — Surveys, Experiments, Observations


Garbage In, Garbage Out — Collect Data Right or Fail

The quality of any statistical analysis is limited by the quality of its data. If the data are collected poorly, no amount of sophisticated analysis will rescue the conclusions.

  • Experimental studies — The gold standard for establishing causation through randomization
  • Observational studies — Powerful for association but vulnerable to confounding
  • Survey design — Question wording and response modes can make or break your results
  • Error prevention — Understanding bias, nonresponse, and measurement error before they corrupt your data

The best statistician in the world cannot fix bad data. Design your collection carefully.


What is Data Collection?

Definition

Data collection is the process of gathering information to answer a research question. The method you choose determines what conclusions you can validly draw.

"Garbage in, garbage out." — Computer science proverb that applies equally to statistics.


Primary vs Secondary Data

| Type | Definition | Examples |
|---|---|---|
| Primary | Collected directly for the current study | Survey you design, experiment you run |
| Secondary | Pre-existing data collected by others | Government census, hospital records |

Primary advantages: tailored to your question, you control quality
Secondary advantages: cheap, large scale, historical depth


Experimental Studies

The gold standard for establishing causation. The researcher:

  1. Randomly assigns subjects to treatment/control groups
  2. Applies a treatment (intervention)
  3. Measures the outcome

Key Principle

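Randomization is the key: it balances confounders, measured and unmeasured, across groups on average, so a systematic difference in outcomes can be attributed to the treatment.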

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

np.random.seed(42)
n = 100  # 100 participants

# Random assignment to treatment or control
assignments = np.random.choice(['treatment', 'control'], size=n)

# Simulate outcomes (treatment has true effect of +5 points)
outcomes = np.where(assignments == 'treatment',
                    np.random.normal(75, 10, n),   # treatment group
                    np.random.normal(70, 10, n))   # control group

df = pd.DataFrame({'group': assignments, 'score': outcomes})

# Compare groups
for group in ['treatment', 'control']:
    subset = df[df['group'] == group]['score']
    print(f"{group}: mean={subset.mean():.2f}, n={len(subset)}")

# Hypothesis test
t_stat, p_val = ttest_ind(
    df[df['group']=='treatment']['score'],
    df[df['group']=='control']['score']
)
print(f"\nt-statistic = {t_stat:.3f}, p-value = {p_val:.4f}")
print("Conclusion:", "Significant difference" if p_val < 0.05 else "No significant difference")
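To see why randomization matters, here is a minimal sketch that compares random assignment with self-selection. The `motivation` variable and the opt-in rule are hypothetical, introduced only for illustration; the point is the size of the confounder gap between groups under each assignment scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounder: baseline motivation (an assumption of this sketch)
motivation = rng.normal(0, 1, n)

# (a) Random assignment: a coin flip that ignores motivation
randomized = rng.random(n) < 0.5

# (b) Self-selection: motivated people are more likely to opt into treatment
p_opt_in = 1 / (1 + np.exp(-1.5 * motivation))
self_selected = rng.random(n) < p_opt_in

def confounder_gap(treated):
    """Difference in mean motivation between treated and untreated groups."""
    return motivation[treated].mean() - motivation[~treated].mean()

gap_randomized = confounder_gap(randomized)
gap_self_selected = confounder_gap(self_selected)

print(f"Gap under random assignment: {gap_randomized:+.3f}")
print(f"Gap under self-selection:    {gap_self_selected:+.3f}")
```

Under random assignment the gap should sit close to zero, while self-selection leaves the treated group systematically more motivated, which would contaminate any comparison of outcomes.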

Observational Studies

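Definition: Observational Study

The researcher observes subjects without intervening. Such a study can show association but cannot, on its own, establish causation, because a confounder may influence both the exposure and the outcome.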

Types of Observational Studies

| Type | Direction in Time | Strength of Evidence | Example |
|---|---|---|---|
| Cross-sectional | Snapshot (no time) | Weak | Survey of current diet and BMI |
| Case-control | Backward (retrospective) | Moderate | Lung cancer patients vs. controls -> smoking history |
| Cohort | Forward (prospective) | Strong | Follow smokers vs. non-smokers for 20 years |

# Observational study simulation: coffee and productivity
# True causal structure: Exercise -> (Coffee consumption + Productivity)
# Naive analysis might conclude coffee CAUSES productivity

np.random.seed(0)
n = 500

exercise = np.random.normal(5, 2, n)  # hours/week — the true cause
coffee = 0.5 * exercise + np.random.normal(2, 1, n)  # coffee correlated with exercise
productivity = 0.8 * exercise + np.random.normal(7, 2, n)  # productivity caused by exercise

# Naive correlation
from scipy.stats import pearsonr
r_naive, p_naive = pearsonr(coffee, productivity)
print(f"Coffee × Productivity correlation: r = {r_naive:.3f}, p = {p_naive:.4f}")

# Partial correlation (controlling for exercise): the truth
from numpy.linalg import lstsq
# Residualize out exercise: regress each variable on [1, exercise] and keep the residuals
X = np.column_stack([np.ones(n), exercise])
coffee_resid = coffee - X @ lstsq(X, coffee, rcond=None)[0]
prod_resid = productivity - X @ lstsq(X, productivity, rcond=None)[0]

r_partial, p_partial = pearsonr(coffee_resid, prod_resid)
print(f"Coffee × Productivity (controlling exercise): r = {r_partial:.3f}, p = {p_partial:.4f}")
print("\nCoffee's apparent effect was mostly due to exercise (confounding)!")
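Another way to look at the same confounding is regression adjustment. The sketch below regenerates data with the same causal structure as the simulation above and fits two least-squares models; the helper `ols` is a hypothetical convenience function, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Same causal structure as above: exercise drives both coffee and productivity
exercise = rng.normal(5, 2, n)
coffee = 0.5 * exercise + rng.normal(2, 1, n)
productivity = 0.8 * exercise + rng.normal(7, 2, n)

def ols(y, *predictors):
    """Least-squares coefficients for y ~ 1 + predictors (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

naive = ols(productivity, coffee)               # productivity ~ coffee
adjusted = ols(productivity, coffee, exercise)  # productivity ~ coffee + exercise

print(f"Coffee slope, naive model:    {naive[1]:.3f}")
print(f"Coffee slope, adjusted model: {adjusted[1]:.3f}")
print(f"Exercise slope, adjusted:     {adjusted[2]:.3f}")
```

Once exercise is in the model, the coffee coefficient should shrink toward zero. This only works because the confounder was measured; an unmeasured confounder cannot be adjusted away.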

Survey Methods

Key Survey Design Principles

1. Question Wording

  • Clear, unambiguous language
  • Avoid leading questions: "Don't you agree that X is better?" -> BAD
  • Neutral framing: "Which do you prefer, X or Y?" -> BETTER

2. Response Options

  • Exhaustive (cover all possibilities)
  • Mutually exclusive (no overlap)
  • Balanced scale (equal positive and negative options)
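For numeric answer brackets (age, income), the exhaustive and mutually exclusive rules can be checked mechanically. The following is a minimal sketch; `check_brackets` is a hypothetical helper, and it assumes inclusive integer brackets that lie within the stated range.

```python
def check_brackets(brackets, lowest, highest):
    """Return a list of problems in inclusive integer brackets [(lo, hi), ...].

    An empty list means the options are exhaustive and mutually exclusive
    over the range lowest..highest.
    """
    problems = []
    covered_up_to = lowest - 1  # highest value covered by the brackets seen so far
    for lo, hi in sorted(brackets):
        if lo <= covered_up_to:
            problems.append(f"overlap: {lo}-{min(hi, covered_up_to)} offered more than once")
        elif lo > covered_up_to + 1:
            problems.append(f"gap: {covered_up_to + 1}-{lo - 1} not covered")
        covered_up_to = max(covered_up_to, hi)
    if covered_up_to < highest:
        problems.append(f"gap: {covered_up_to + 1}-{highest} not covered")
    return problems

# Ages 25 and 35 each fit two options, and respondents over 50 have no option
bad_options = [(18, 25), (25, 35), (35, 50)]
good_options = [(18, 24), (25, 34), (35, 49), (50, 120)]

bad_problems = check_brackets(bad_options, 18, 120)
good_problems = check_brackets(good_options, 18, 120)
print(bad_problems)
print(good_problems)
```

A respondent aged exactly 25 facing the first set of options has to guess which box to tick, which is precisely the measurement error that mutually exclusive options are meant to prevent.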

3. Survey Modes

| Mode | Cost | Response Rate | Coverage | Best For |
|---|---|---|---|---|
| In-person | High | ~80% | Limited | Detailed interviews |
| Phone | Medium | ~15-30% | Broad | National surveys |
| Mail | Low-Medium | ~10-20% | Very broad | Sensitive topics |
| Online | Very low | ~10-30% | Internet users | Large-scale, fast |

[Figure: Survey Mode Comparison, Response Rate vs Cost]

# Simulating response bias in surveys
# Suppose true population approval = 55%

np.random.seed(1)
true_approval = 0.55

# Random phone sample (representative)
n_phone = 1000
phone_responses = np.random.binomial(1, true_approval, n_phone)
print(f"Phone survey: {phone_responses.mean():.3f} (true: {true_approval})")

# Online opt-in sample (selection bias — engaged users more opinionated)
# People who disapprove are angrier and more likely to respond
prob_respond_approve = 0.3
prob_respond_disapprove = 0.6
population = np.random.binomial(1, true_approval, 10000)
responded = np.where(population == 1,
                     np.random.binomial(1, prob_respond_approve, 10000),
                     np.random.binomial(1, prob_respond_disapprove, 10000))
online_sample = population[responded == 1]
print(f"Online opt-in: {online_sample.mean():.3f} (true: {true_approval})")
print("Selection bias introduced error!")
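One common correction for this kind of bias is to weight each respondent by the inverse of their probability of responding. The sketch below reuses the response probabilities from the simulation above; in a real survey they are unknown and would have to be estimated from auxiliary variables such as age or region, so treat this as an idealized illustration rather than a practical recipe.

```python
import numpy as np

rng = np.random.default_rng(1)
true_approval = 0.55
population = rng.binomial(1, true_approval, 100_000)

# Response propensities copied from the simulation above.
# Assumption of this sketch: they are known exactly, which is never true in practice.
p_respond = np.where(population == 1, 0.3, 0.6)
responded = rng.random(population.size) < p_respond

sample = population[responded]
weights = 1 / p_respond[responded]  # inverse-probability weights

raw_estimate = sample.mean()
weighted_estimate = np.average(sample, weights=weights)

print(f"Raw opt-in estimate: {raw_estimate:.3f}")
print(f"Weighted estimate:   {weighted_estimate:.3f} (true: {true_approval})")
```

Respondents from the group that answers less often count for more, which pulls the estimate back toward the population value. The correction is only as good as the estimated propensities.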

Common Data Collection Errors

| Error Type | Description | Example |
|---|---|---|
| Sampling error | Random variation from sample to sample | Poll shows 52% support; true value is 50% |
| Coverage error | Population not fully covered | Phone survey misses people without phones |
| Nonresponse error | Non-responders differ systematically | Dissatisfied customers less likely to respond |
| Measurement error | Inaccurate responses | People underreport alcohol consumption |
| Processing error | Data entry mistakes | Mistyped values during transcription |
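
Sampling error is the one entry in this table that can be quantified from the sample alone. As a rough sketch, the usual normal-approximation margin of error for a proportion is 1.96 * sqrt(p(1 - p) / n), assuming a simple random sample; the function name below is illustrative.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.52
for n in (100, 1000, 10000):
    moe = margin_of_error(p_hat, n)
    print(f"n={n:>6}: 52% +/- {moe:.1%}")
```

With 1,000 respondents the margin is about 3.1 percentage points, so a poll showing 52% is consistent with a true value of 50%. The other error types in the table are not captured by this formula and do not shrink as n grows.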

Data Collection in Machine Learning

Data sources feeding the ML training pipeline:

  • Web scraping: logs, APIs, crawls
  • Sensors / IoT: real-time streams
  • User interaction: clicks, views, purchases
  • Labeling: human annotation

ML training pipeline: Clean → Feature Engineer → Train → Evaluate → Deploy

In ML, data collection sets the ceiling on model quality: a model can only be as good as the data it is trained on.

| ML Data Source | Statistics Equivalent | ML Concern |
|---|---|---|
| Web scraping | Census data | Selection bias: only public data |
| User logs | Observational study | Survivorship bias: only active users |
| Labeled dataset | Controlled experiment | Label quality: noisy labels hurt models |
| Sensor data | Measurement study | Drift: data distribution changes over time |
| Survey + ML | Mixed methods | Nonresponse bias: who answers? |

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Simulate ML data collection: customer churn dataset
np.random.seed(42)
n = 5000

data = pd.DataFrame({
    'tenure_months': np.random.randint(1, 72, n),
    'monthly_charge': np.random.normal(65, 20, n).clip(20, 120),
    'num_support_calls': np.random.poisson(2, n),
    'contract_type': np.random.choice(['month-to-month', 'one-year', 'two-year'],
                                       n, p=[0.5, 0.3, 0.2])
})

# True churn logic (simplified)
prob_churn = 1 / (1 + np.exp(-(
    -2 - 0.03 * data['tenure_months']   # longer tenure lowers churn risk
    + 0.02 * data['monthly_charge']
    + 0.3 * data['num_support_calls']
    + (data['contract_type'] == 'month-to-month').astype(int) * 1.5
)))
data['churned'] = np.random.binomial(1, prob_churn)

print(f"Total data collected: {len(data)}")
print(f"Churn rate: {data['churned'].mean():.2%}")

# Data split — sampling for ML
train, test = train_test_split(data, test_size=0.2, random_state=42)
print(f"Training set: {len(train)} (sample for learning)")
print(f"Test set: {len(test)} (proxy for population performance)")

# Train a simple model
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train_enc = train.copy()
test_enc = test.copy()
train_enc['contract_type'] = le.fit_transform(train_enc['contract_type'])
test_enc['contract_type'] = le.transform(test_enc['contract_type'])

features = ['tenure_months', 'monthly_charge', 'num_support_calls', 'contract_type']
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(train_enc[features], train_enc['churned'])

train_acc = accuracy_score(train_enc['churned'], model.predict(train_enc[features]))
test_acc = accuracy_score(test_enc['churned'], model.predict(test_enc[features]))
print(f"\nTrain accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
print(f"Generalization gap: {train_acc - test_acc:.3f}")
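
The drift concern from the table above can be monitored with a two-sample test that compares newly collected data against the training data. This is a minimal sketch using SciPy's `ks_2samp`; the sensor readings, batch names, and the 0.01 threshold are all hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical sensor readings: the training batch and two later batches
train_batch = rng.normal(20.0, 2.0, 2000)
later_batches = {
    "stable": rng.normal(20.0, 2.0, 2000),   # same distribution as training
    "drifted": rng.normal(21.5, 2.0, 2000),  # mean has shifted
}

results = {}
for name, batch in later_batches.items():
    stat, p_val = ks_2samp(train_batch, batch)
    results[name] = (stat, p_val)
    flag = "possible drift" if p_val < 0.01 else "no evidence of drift"
    print(f"{name:>8}: KS statistic = {stat:.3f}, p = {p_val:.4g} -> {flag}")
```

A flagged batch does not say why the distribution changed, only that the new data no longer look like the training data, which is a cue to investigate the collection process before trusting the model's predictions.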

Key Takeaways

Summary: Data Collection Methods

  1. Randomized experiments are the most reliable way to establish causation
  2. Observational studies show association — confounding is always a threat
  3. Surveys require careful design — question wording and mode affect results dramatically
  4. Nonresponse bias is one of the biggest practical threats to survey validity
  5. Secondary data is valuable but comes with limitations you didn't control
  6. Pre-register your study design before collecting data to avoid p-hacking
