Data Collection Methods
Garbage In, Garbage Out — Collect Data Right or Fail
The quality of any statistical analysis depends entirely on the quality of the data. Poor data collection means no amount of sophisticated analysis will save your conclusions.
- Experimental studies — The gold standard for establishing causation through randomization
- Observational studies — Powerful for association but vulnerable to confounding
- Survey design — Question wording and response modes can make or break your results
- Error prevention — Understanding bias, nonresponse, and measurement error before they corrupt your data
The best statistician in the world cannot fix bad data. Design your collection carefully.
What is Data Collection?
Definition
Data collection is the process of gathering information to answer a research question. The method you choose determines what conclusions you can validly draw.
"Garbage in, garbage out." — Computer science proverb that applies equally to statistics.
Primary vs Secondary Data
| Type | Definition | Examples |
|---|---|---|
| Primary | Collected directly for the current study | Survey you design, experiment you run |
| Secondary | Pre-existing data collected by others | Government census, hospital records |
Primary advantages: tailored to your question, you control quality
Secondary advantages: cheap, large scale, historical depth
Experimental Studies
The gold standard for establishing causation. The researcher:
- Randomly assigns subjects to treatment/control groups
- Applies a treatment (intervention)
- Measures the outcome
Key Principle
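Randomization is the key: on average, it balances both known and unknown confounders across groups, so a difference in outcomes can be attributed to the treatment rather than to who ended up in which group.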
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
np.random.seed(42)
n = 100 # 100 participants
# Random assignment to treatment or control
assignments = np.random.choice(['treatment', 'control'], size=n)
# Simulate outcomes (treatment has true effect of +5 points)
outcomes = np.where(assignments == 'treatment',
                    np.random.normal(75, 10, n),  # treatment group
                    np.random.normal(70, 10, n))  # control group
df = pd.DataFrame({'group': assignments, 'score': outcomes})
# Compare groups
for group in ['treatment', 'control']:
    subset = df[df['group'] == group]['score']
    print(f"{group}: mean={subset.mean():.2f}, n={len(subset)}")
# Hypothesis test
t_stat, p_val = ttest_ind(
    df[df['group'] == 'treatment']['score'],
    df[df['group'] == 'control']['score']
)
print(f"\nt-statistic = {t_stat:.3f}, p-value = {p_val:.4f}")
print("Conclusion:", "Significant difference" if p_val < 0.05 else "No significant difference")
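The claim that randomization balances confounders can be checked directly. The sketch below is illustrative only: the `age` covariate, the self-selection rule, and the helper `mean_gap` are assumptions introduced here, not part of the simulation above. It compares how far apart the groups are on a pre-treatment covariate under random assignment versus self-selection.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical pre-treatment covariate that could confound the outcome
age = rng.normal(40, 12, n)

# Random assignment: a coin flip that ignores age entirely
treated = rng.random(n) < 0.5

# Self-selection for contrast (assumed rule): older participants opt in more often
p_opt_in = 1 / (1 + np.exp(-(age - 40) / 12))
self_selected = rng.random(n) < p_opt_in

def mean_gap(covariate, in_group):
    """Difference in covariate means between the group and everyone else."""
    return covariate[in_group].mean() - covariate[~in_group].mean()

gap_random = mean_gap(age, treated)
gap_self = mean_gap(age, self_selected)
print(f"Age gap under random assignment: {gap_random:+.2f} years")
print(f"Age gap under self-selection:    {gap_self:+.2f} years")
```

Under random assignment the gap should be close to zero, and it shrinks further as the sample grows. Under self-selection the groups differ in age before any treatment is applied, so a later outcome difference could reflect age rather than the treatment.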
Observational Studies
Definition: Observational Study
The researcher observes without intervening. Such a study can show association, but on its own it cannot establish causation, because a confounder may drive both variables.
Types of Observational Studies
| Type | Direction in Time | Strength | Example |
|---|---|---|---|
| Cross-sectional | Snapshot (no time) | Weak | Survey of current diet and BMI |
| Case-Control | Backward (retrospective) | Moderate | Lung cancer patients vs. controls -> smoking history |
| Cohort | Forward (prospective) | Strong | Follow smokers vs. non-smokers for 20 years |
# Observational study simulation: coffee and productivity
# True causal structure: Exercise -> (Coffee consumption + Productivity)
# Naive analysis might conclude coffee CAUSES productivity
np.random.seed(0)
n = 500
exercise = np.random.normal(5, 2, n) # hours/week — the true cause
coffee = 0.5 * exercise + np.random.normal(2, 1, n) # coffee correlated with exercise
productivity = 0.8 * exercise + np.random.normal(7, 2, n) # productivity caused by exercise
# Naive correlation
from scipy.stats import pearsonr
r_naive, p_naive = pearsonr(coffee, productivity)
print(f"Coffee × Productivity correlation: r = {r_naive:.3f}, p = {p_naive:.4f}")
# Partial correlation (controlling for exercise):
# regress each variable on exercise, then correlate the residuals
X = np.column_stack([np.ones(n), exercise])  # design matrix: intercept + exercise
coffee_resid = coffee - X @ np.linalg.lstsq(X, coffee, rcond=None)[0]
prod_resid = productivity - X @ np.linalg.lstsq(X, productivity, rcond=None)[0]
r_partial, p_partial = pearsonr(coffee_resid, prod_resid)
print(f"Coffee × Productivity (controlling exercise): r = {r_partial:.3f}, p = {p_partial:.4f}")
print("\nCoffee's apparent effect came from exercise (confounding)!")
Survey Methods
Key Survey Design Principles
1. Question Wording
- Clear, unambiguous language
- Avoid leading questions: "Don't you agree that X is better?" -> BAD
- Neutral framing: "Which do you prefer, X or Y?" -> BETTER
2. Response Options
- Exhaustive (cover all possibilities)
- Mutually exclusive (no overlap)
- Balanced scale (equal positive and negative options)
3. Survey Modes
| Mode | Cost | Response Rate | Coverage | Best For |
|---|---|---|---|---|
| In-person | High | ~80% | Limited | Detailed interviews |
| Phone | Medium | ~15-30% | Broad | National surveys |
| Mail | Low-Medium | ~10-20% | Very broad | Sensitive topics |
| Online | Very low | ~10-30% | Internet users | Large-scale, fast |
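The exhaustive and mutually exclusive requirements from principle 2 can be checked mechanically when the answer options are numeric brackets. The sketch below is a minimal illustration: `check_brackets` is a hypothetical helper, the age brackets are made-up examples, and it assumes integer values with inclusive bounds and no bracket starting below `lowest`.

```python
def check_brackets(brackets, lowest, highest):
    """Return a list of problems found in integer answer brackets.

    brackets: list of (low, high) pairs, both ends inclusive.
    lowest, highest: the range of values a respondent could report.
    """
    problems = []
    covered_to = lowest - 1  # highest value covered by the brackets seen so far
    for low, high in sorted(brackets):
        if low > covered_to + 1:
            problems.append(f"gap: no option covers {covered_to + 1} to {low - 1}")
        elif low <= covered_to:
            problems.append(
                f"overlap: {low} to {min(high, covered_to)} fits more than one option"
            )
        covered_to = max(covered_to, high)
    if covered_to < highest:
        problems.append(f"gap: no option covers {covered_to + 1} to {highest}")
    return problems

# Hypothetical age question for adults aged 18 to 120
good = [(18, 24), (25, 34), (35, 44), (45, 120)]
bad = [(18, 25), (25, 35), (35, 45), (50, 120)]

good_problems = check_brackets(good, 18, 120)
bad_problems = check_brackets(bad, 18, 120)
print("good:", good_problems)
for problem in bad_problems:
    print("bad: ", problem)
```

In the `bad` set, a 25-year-old and a 35-year-old each fit two options (not mutually exclusive), and nobody aged 46 to 49 has an option at all (not exhaustive).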
# Simulating response bias in surveys
# Suppose true population approval = 55%
np.random.seed(1)
true_approval = 0.55
# Random phone sample (representative)
n_phone = 1000
phone_responses = np.random.binomial(1, true_approval, n_phone)
print(f"Phone survey: {phone_responses.mean():.3f} (true: {true_approval})")
# Online opt-in sample (selection bias — engaged users more opinionated)
# People who disapprove are angrier and more likely to respond
prob_respond_approve = 0.3
prob_respond_disapprove = 0.6
population = np.random.binomial(1, true_approval, 10000)
responded = np.where(population == 1,
                     np.random.binomial(1, prob_respond_approve, 10000),
                     np.random.binomial(1, prob_respond_disapprove, 10000))
online_sample = population[responded == 1]
print(f"Online opt-in: {online_sample.mean():.3f} (true: {true_approval})")
print("Selection bias introduced error!")
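The size of the opt-in bias does not need a simulation to predict. Using the same approval rate and response probabilities as the simulation, the share of approvers among respondents follows from Bayes' rule. The function name `expected_sample_rate` is introduced here for illustration.

```python
def expected_sample_rate(true_rate, p_respond_yes, p_respond_no):
    """Expected share of 'approve' answers among those who respond."""
    yes = true_rate * p_respond_yes       # approvers who respond
    no = (1 - true_rate) * p_respond_no   # disapprovers who respond
    return yes / (yes + no)

biased = expected_sample_rate(0.55, 0.3, 0.6)
unbiased = expected_sample_rate(0.55, 0.4, 0.4)
print(f"Disapprovers respond twice as often: {biased:.3f}")
print(f"Equal response rates:                {unbiased:.3f}")
```

With these numbers the opt-in estimate centers on 0.165 / 0.435, about 0.379, far below the true 0.55. When both groups respond at the same rate, the estimate equals the true rate regardless of how low that response rate is: it is the difference in response rates, not the rate itself, that creates the bias.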
Common Data Collection Errors
| Error Type | Description | Example |
|---|---|---|
| Sampling error | Random variation from sample to sample | Poll shows 52% support; true value is 50% |
| Coverage error | Population not fully covered | Phone survey misses people without phones |
| Nonresponse error | Non-responders differ systematically | Dissatisfied customers less likely to respond |
| Measurement error | Inaccurate responses | People underreport alcohol consumption |
| Processing error | Data entry mistakes | Mistyped values during transcription |
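Sampling error is the one row in this table that can be quantified from the sample itself. The sketch below uses the standard normal-approximation margin of error for a proportion. It assumes a simple random sample, and the sample size of 1000 is a hypothetical value chosen for illustration.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a proportion (simple random sample)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.52  # poll result from the table's example
n = 1000      # hypothetical sample size
moe = margin_of_error(p_hat, n)
low, high = p_hat - moe, p_hat + moe
print(f"Estimate: {p_hat:.2f} +/- {moe:.3f}  (95% interval: {low:.3f} to {high:.3f})")
```

Here the margin is about 3 percentage points, so a poll showing 52% is consistent with a true value of 50%. This interval covers sampling error only; coverage, nonresponse, measurement, and processing errors are not reflected in it and do not shrink as the sample grows.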
Data Collection in Machine Learning
In ML, data collection is everything. The model is only as good as the data it trains on.
| ML Data Source | Statistics Equivalent | ML Concern |
|---|---|---|
| Web scraping | Census data | Selection bias — only public data |
| User logs | Observational study | Survivorship bias — only active users |
| Labeled dataset | Controlled experiment | Label quality — noisy labels hurt models |
| Sensor data | Measurement study | Drift — data distribution changes over time |
| Survey + ML | Mixed methods | Nonresponse bias — who answers? |
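Drift, the concern listed for sensor data, can be screened for by comparing the distribution of a feature at training time against a later batch. One common option, used here as an illustration rather than a recommendation, is the two-sample Kolmogorov-Smirnov test from SciPy (`scipy.stats.ks_2samp`). The temperature readings and batch names are simulated assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)

# Hypothetical temperature readings: training data vs. two later batches
train_readings = rng.normal(20.0, 2.0, 2000)
batches = {
    "stable": rng.normal(20.0, 2.0, 2000),   # same distribution as training
    "drifted": rng.normal(21.0, 2.0, 2000),  # mean shifted by one degree
}

results = {}
for label, batch in batches.items():
    stat, p_value = ks_2samp(train_readings, batch)
    results[label] = (stat, p_value)
    print(f"{label:8s} KS statistic = {stat:.3f}, p = {p_value:.4g}")
```

A large statistic with a very small p-value, as expected for the drifted batch, suggests that the new data no longer looks like the training data. With large batches even tiny shifts become statistically significant, so the size of the statistic matters as well as the p-value.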
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Simulate ML data collection: customer churn dataset
np.random.seed(42)
n = 5000
data = pd.DataFrame({
    'tenure_months': np.random.randint(1, 72, n),
    'monthly_charge': np.random.normal(65, 20, n).clip(20, 120),
    'num_support_calls': np.random.poisson(2, n),
    'contract_type': np.random.choice(['month-to-month', 'one-year', 'two-year'],
                                      n, p=[0.5, 0.3, 0.2])
})
# True churn logic (simplified)
prob_churn = 1 / (1 + np.exp(-(
    -2 + 0.03 * data['tenure_months']
    + 0.02 * data['monthly_charge']
    + 0.3 * data['num_support_calls']
    + (data['contract_type'] == 'month-to-month').astype(int) * 1.5
)))
data['churned'] = np.random.binomial(1, prob_churn)
print(f"Total data collected: {len(data)}")
print(f"Churn rate: {data['churned'].mean():.2%}")
# Data split — sampling for ML
train, test = train_test_split(data, test_size=0.2, random_state=42)
print(f"Training set: {len(train)} (sample for learning)")
print(f"Test set: {len(test)} (proxy for population performance)")
# Train a simple model
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train_enc = train.copy()
test_enc = test.copy()
train_enc['contract_type'] = le.fit_transform(train_enc['contract_type'])
test_enc['contract_type'] = le.transform(test_enc['contract_type'])
features = ['tenure_months', 'monthly_charge', 'num_support_calls', 'contract_type']
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(train_enc[features], train_enc['churned'])
train_acc = accuracy_score(train_enc['churned'], model.predict(train_enc[features]))
test_acc = accuracy_score(test_enc['churned'], model.predict(test_enc[features]))
print(f"\nTrain accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
print(f"Generalization gap: {train_acc - test_acc:.3f}")
Key Takeaways
Summary: Data Collection Methods
- Randomized experiments are the most reliable way to establish causation, because random assignment balances confounders across groups on average
- Observational studies show association — confounding is always a threat
- Surveys require careful design — question wording and mode affect results dramatically
- Nonresponse bias is one of the biggest practical threats to survey validity
- Secondary data is valuable but comes with limitations you didn't control
- Pre-register your study design before collecting data to avoid p-hacking