Types of Data in Statistics
Know Your Data Before You Analyze It
Understanding data types is the first and most critical step in any statistical analysis. The type of data you have determines which statistical methods are valid, which visualizations are appropriate, and what conclusions you can draw. Choose wrong, and your entire analysis falls apart.
Here is what mastering data types helps you do:
- Select the Right Tests — Different data types require different statistical methods; using the wrong one produces meaningless results.
- Visualize Effectively — The chart type that works for continuous data is useless for categorical data, and vice versa.
- Avoid Common Mistakes — Stop treating ZIP codes as numbers, Likert scales as intervals, or discrete counts as continuous values.
- Communicate Clearly — Classifying variables correctly lets you describe your data accurately to stakeholders and peers.
Identifying your data type is not busywork — it is the foundation every analysis stands on.
Definition
Data types classify variables based on their mathematical properties. The type determines which statistical methods, visualizations, and operations are valid.
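The Data Type Hierarchy
Data splits into two broad branches: qualitative (nominal or ordinal) and quantitative (discrete or continuous).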
Qualitative (Categorical) Data
Definition: Qualitative Data
Qualitative data represents categories or groups — things that are described rather than measured numerically.
Nominal Data
Categories with no natural order. You can only determine equality or inequality.
Examples:
- Eye color: brown, blue, green, hazel
- Blood type: A, B, AB, O
- Country of birth
- Product category: electronics, clothing, food
- Survey responses: Yes / No
Valid operations: Count, mode, chi-square test
Invalid operations: Mean, median, subtraction
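As a minimal sketch of these rules, assuming pandas and a made-up sample of blood types, counting and the mode work on nominal labels while averaging does not:

```python
import pandas as pd

# Hypothetical sample of blood types (nominal: labels with no order)
blood = pd.Series(["A", "O", "B", "O", "AB", "O", "A"], dtype="category")

counts = blood.value_counts()   # frequency of each category
mode = blood.mode()[0]          # most frequent category

print(counts)
print("Mode:", mode)

# Arithmetic is undefined for labels: pandas refuses to average a categorical.
try:
    blood.mean()
    mean_supported = True
except TypeError:
    mean_supported = False
print("Mean supported:", mean_supported)
```

Storing labels with the `category` dtype is a design choice here: it makes pandas reject numeric reductions instead of silently producing a number.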
Ordinal Data
Categories with a meaningful order, but the gaps between categories are not necessarily equal.
Examples:
- Education level: high school < bachelor's < master's < PhD
- Customer satisfaction: Poor < Fair < Good < Excellent
- Military rank: Private < Corporal < Sergeant < Captain
- Star ratings: ★ < ★★ < ★★★ < ★★★★ < ★★★★★
Valid operations: Ordering, median, percentiles, Spearman correlation
Invalid operations: Arithmetic mean (controversial), subtraction (intervals unknown)
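One way to make the order explicit in code is an ordered pandas categorical. The sketch below uses invented satisfaction ratings; the names `levels` and `ratings` are illustrative only:

```python
import pandas as pd

# Category order is stated explicitly, lowest to highest
levels = ["Poor", "Fair", "Good", "Excellent"]
ratings = pd.Series(
    pd.Categorical(
        ["Good", "Poor", "Excellent", "Fair", "Good", "Good", "Fair"],
        categories=levels,
        ordered=True,
    )
)

# Order-based operations are valid on ordinal data
lowest, highest = ratings.min(), ratings.max()
at_least_good = int((ratings >= "Good").sum())

# Median: the middle value after sorting (n is odd here, so it is unambiguous)
median = ratings.sort_values().iloc[len(ratings) // 2]

print(lowest, highest, at_least_good, median)
```

Comparisons and sorting use the declared order rather than alphabetical order, which is why "Excellent" ranks above "Good" here.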
Key Distinction
"Excellent" is better than "Good", but by how much? Ordinal scales record the order only; they say nothing about the size of the gap, so claims like "twice as good" have no meaning.
Quantitative (Numerical) Data
Definition: Quantitative Data
Quantitative data represents measured or counted quantities — numbers that have mathematical meaning.
Discrete Data
Can only take specific, countable values — usually whole numbers. There are gaps between possible values.
Examples:
- Number of children in a family (0, 1, 2, 3, ... — not 1.7)
- Number of cars in a parking lot
- Number of defects in a product
- Shoe sizes (though not whole numbers, they're discrete: 8, 8.5, 9...)
- Number of goals scored in a soccer match
Valid operations: All arithmetic and counting. Common models: Poisson and binomial distributions.
Continuous Data
Can take any value within a range, including fractions and decimals. Limited only by measurement precision.
Examples:
- Height (1.753847... meters)
- Temperature (23.7°C)
- Time to complete a task
- Weight, blood pressure, distance
- Stock prices
Valid operations: All arithmetic, plus calculus-based tools such as integration and differentiation. Common model: the normal distribution.
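The practical difference shows up as soon as you tabulate the data. In this sketch (the sample values and bin edges are invented for illustration), discrete counts are tallied value by value, while continuous measurements are grouped into intervals first:

```python
import pandas as pd

# Hypothetical data for illustration
children = pd.Series([0, 2, 1, 3, 2, 2, 0, 1])                          # discrete counts
heights = pd.Series([1.62, 1.75, 1.68, 1.81, 1.70, 1.77, 1.66, 1.73])   # continuous (meters)

# Discrete: each distinct value gets its own frequency
child_freq = children.value_counts().sort_index()

# Continuous: exact values rarely repeat, so group them into intervals first
bins = [1.60, 1.70, 1.80, 1.90]
height_freq = pd.cut(heights, bins=bins, right=False).value_counts().sort_index()

print(child_freq)
print(height_freq)
```

The bin edges are a modeling choice; different edges give a different table, which is not the case for the discrete tally.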
Interval vs Ratio (A Deeper Cut)
Within quantitative data, we can further distinguish:
| Feature | Interval | Ratio |
|---|---|---|
| Equal intervals | ✅ Yes | ✅ Yes |
| True zero (zero = absence) | ❌ No | ✅ Yes |
| Meaningful ratios | ❌ No | ✅ Yes |
| Example | Temperature (°C), IQ | Height, weight, income |
Interval example: 0°C is not "no temperature." 40°C is not twice as hot as 20°C (in the thermodynamic sense). Temperature in Kelvin has a true zero, so it is ratio data.
Ratio example: A person who weighs 80 kg is genuinely twice as heavy as someone who weighs 40 kg.
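The temperature example can be checked with a few lines of arithmetic. The helper name `celsius_to_kelvin` is just for this sketch:

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15

# Ratio of the Celsius readings: 2.0, but 0 °C is an arbitrary zero point
naive_ratio = 40 / 20

# Ratio on the Kelvin scale, which has a true zero: about 1.07
true_ratio = celsius_to_kelvin(40) / celsius_to_kelvin(20)

# Differences, by contrast, survive the change of scale
diff_c = 40 - 20
diff_k = celsius_to_kelvin(40) - celsius_to_kelvin(20)

print(naive_ratio, round(true_ratio, 3), diff_c, round(diff_k, 2))
```

Differences agree on both scales, which is why subtraction is valid for interval data while division is not.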
Why Data Types Matter for Statistics
| Analysis Goal | Nominal | Ordinal | Discrete/Continuous |
|---|---|---|---|
| Central tendency | Mode | Mode, Median | Mean, Median, Mode |
| Spread | Frequency | IQR | Std Dev, Variance |
| Correlation | Cramér's V | Spearman ρ | Pearson r |
| Group comparison | Chi-square | Kruskal-Wallis | ANOVA, t-test |
| Regression | Dummy variables | Ordinal logistic | Linear regression |
| Visualization | Bar chart | Bar/box | Histogram, scatter |
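To make the correlation row concrete, here is a rough sketch that computes one association measure per data type using only pandas and NumPy. The survey data and the helper `cramers_v` are hypothetical, and Cramér's V is computed without a continuity correction:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data for illustration
df_demo = pd.DataFrame({
    "plan":    ["basic", "pro", "basic", "pro", "basic", "pro", "basic", "pro"],  # nominal
    "churned": ["yes", "no", "yes", "no", "yes", "no", "no", "yes"],              # nominal
    "rating":  [1, 4, 2, 5, 2, 4, 3, 3],                                           # ordinal (1-5)
    "spend":   [10.0, 42.0, 15.0, 55.0, 12.0, 40.0, 30.0, 28.0],                   # continuous
    "visits":  [2, 9, 3, 12, 2, 8, 6, 5],                                          # discrete
})

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V from the chi-square statistic (no continuity correction)."""
    observed = pd.crosstab(x, y).to_numpy()
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Nominal x nominal: Cramér's V
v = cramers_v(df_demo["plan"], df_demo["churned"])

# Ordinal x continuous: Spearman's rho, computed as Pearson on the ranks
rho = df_demo["rating"].rank().corr(df_demo["spend"].rank())

# Quantitative x quantitative: Pearson's r (the pandas default)
r = df_demo["spend"].corr(df_demo["visits"])

print(f"Cramér's V = {v:.2f}, Spearman rho = {rho:.2f}, Pearson r = {r:.2f}")
```

In practice `scipy.stats` offers tested implementations of these measures; the manual versions here only show what each one consumes.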
Python: Identifying and Working with Data Types
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load a rich dataset
df = sns.load_dataset('tips')
print("Dataset shape:", df.shape)
print("\nData types (pandas dtypes):")
print(df.dtypes)
Output:
Dataset shape: (244, 7)
Data types (pandas dtypes):
total_bill float64 <- Continuous quantitative
tip float64 <- Continuous quantitative
sex category <- Nominal qualitative
smoker category <- Nominal qualitative
day category <- Ordinal qualitative (Thur < Fri < Sat < Sun by position in the week)
time category <- Nominal qualitative
size int64 <- Discrete quantitative
# --- Statistical summaries differ by type ---
print("\n=== Quantitative Variables ===")
print(df[['total_bill', 'tip', 'size']].describe())
print("\n=== Qualitative Variables ===")
for col in ['sex', 'smoker', 'day', 'time']:
    print(f"\n{col} — value counts:")
    print(df[col].value_counts())
    print(f"Mode: {df[col].mode()[0]}")
# --- Visualizations appropriate to each type ---
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
# Continuous: histogram
axes[0, 0].hist(df['total_bill'], bins=20, color='steelblue', edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Total Bill (Continuous)\n-> Histogram')
axes[0, 0].set_xlabel('Amount ($)')
# Discrete: bar chart
size_counts = df['size'].value_counts().sort_index()
axes[0, 1].bar(size_counts.index, size_counts.values, color='coral', edgecolor='black')
axes[0, 1].set_title('Party Size (Discrete)\n-> Bar Chart')
axes[0, 1].set_xlabel('Size')
# Nominal: pie chart
sex_counts = df['sex'].value_counts()
axes[0, 2].pie(sex_counts.values, labels=sex_counts.index, autopct='%1.1f%%', startangle=90)
axes[0, 2].set_title('Sex (Nominal)\n-> Pie Chart')
# Ordinal: ordered bar
day_order = ['Thur', 'Fri', 'Sat', 'Sun']
day_counts = df['day'].value_counts().reindex(day_order)
axes[1, 0].bar(day_counts.index, day_counts.values, color='mediumseagreen', edgecolor='black')
axes[1, 0].set_title('Day (Ordinal)\n-> Ordered Bar Chart')
# Continuous: box plot by category
df.boxplot(column='tip', by='day', ax=axes[1, 1])
axes[1, 1].set_title('Tip by Day\n-> Box Plot')
fig.suptitle('')  # drop the "Boxplot grouped by day" title pandas adds to the whole figure
# Scatter: two continuous
axes[1, 2].scatter(df['total_bill'], df['tip'], alpha=0.5, color='purple')
axes[1, 2].set_title('Bill vs Tip (Continuous × Continuous)\n-> Scatter Plot')
axes[1, 2].set_xlabel('Total Bill ($)')
axes[1, 2].set_ylabel('Tip ($)')
plt.tight_layout()
plt.savefig('data_types_visualization.png', dpi=150)
plt.show()
Data Type Classification in Practice
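def classify_variable(series: pd.Series, nunique_threshold: int = 15) -> str:
    """Heuristically classify a pandas Series into a statistical data type.

    A dtype cannot reveal meaning: integer-coded ratings or ZIP codes stored
    as numbers still need a human to label them ordinal or nominal.
    """
    dtype = series.dtype
    nunique = series.nunique()
    if isinstance(dtype, pd.CategoricalDtype):
        # Ordered categoricals carry a ranking; unordered ones do not
        return 'Ordinal Categorical' if dtype.ordered else 'Nominal Categorical'
    elif pd.api.types.is_bool_dtype(dtype):
        return 'Nominal (Binary)'
    elif pd.api.types.is_object_dtype(dtype) or pd.api.types.is_string_dtype(dtype):
        return 'Nominal Categorical'
    elif pd.api.types.is_integer_dtype(dtype):
        if nunique <= nunique_threshold:
            return f'Discrete Quantitative ({nunique} unique values)'
        else:
            return 'Discrete Quantitative (high cardinality)'
    elif pd.api.types.is_float_dtype(dtype):
        return 'Continuous Quantitative'
    else:
        return f'Unknown ({dtype})'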
# Apply to the tips dataset
print("Variable Classification:")
print("-" * 50)
for col in df.columns:
    classification = classify_variable(df[col])
    print(f"{col:<15} -> {classification}")
Common Mistakes
Treating Ordinal as Interval
Averaging Likert-scale responses (1–5) as if they are interval data is common but technically incorrect. The difference between "Strongly Agree" and "Agree" may not equal the difference between "Neutral" and "Disagree."
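A small, made-up example shows what the mean can hide. Both response sets below average exactly 3, yet they describe very different opinions:

```python
import pandas as pd

# Two hypothetical sets of responses on a 1-5 Likert scale
polarized = pd.Series([1, 1, 1, 5, 5, 5, 3])
neutral = pd.Series([3, 3, 3, 3, 3, 3, 3])

# The means are identical even though the groups answered very differently
means = (polarized.mean(), neutral.mean())

# Frequency tables keep the ordinal information visible
polarized_counts = polarized.value_counts().sort_index()
neutral_counts = neutral.value_counts().sort_index()

print("Means:", means)
print(polarized_counts)
print(neutral_counts)
```

Reporting the frequency table (or the median and mode) alongside any mean is a safer habit for Likert data.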
ZIP Codes as Quantitative
ZIP code 90210 is not 40,210 "more" than ZIP code 50000. It is a nominal identifier, so arithmetic on it is meaningless.
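Treating ZIP codes as numbers also causes a practical bug: leading zeros disappear. The sketch below uses an invented three-row CSV to show the problem and one common fix, reading the column as text:

```python
import io
import pandas as pd

# Hypothetical CSV content for illustration
csv_text = "name,zip\nAda,02134\nBen,90210\nCy,02134\n"

# Default parsing infers an integer column and silently drops the leading zero
as_number = pd.read_csv(io.StringIO(csv_text))

# Reading the column as text preserves the identifier exactly
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})

print(as_number["zip"].tolist())   # → [2134, 90210, 2134]
print(as_text["zip"].tolist())     # → ['02134', '90210', '02134']

# Counting is the operation that makes sense for an identifier
zip_counts = as_text["zip"].value_counts()
print(zip_counts)
```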
Treating Discrete Data as Continuous
Modeling the number of children with a continuous distribution assigns probability to impossible values such as 1.7 children, or even to negative counts. Use a count model such as Poisson or negative binomial instead.
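The gap between the two approaches can be illustrated with the standard library alone. This sketch assumes a hypothetical average of 1.7 children and compares a Poisson model with a normal model of the same mean and variance; the helper names are made up for the example:

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """P(X <= x) for a normal distribution."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mean_children = 1.7  # hypothetical average, for illustration only

# Poisson: probability mass sits only on the valid counts 0, 1, 2, ...
poisson_probs = {k: poisson_pmf(k, mean_children) for k in range(6)}

# Normal with the same mean and variance: some probability falls below zero
p_negative = normal_cdf(0, mu=mean_children, sigma=math.sqrt(mean_children))

for k, p in poisson_probs.items():
    print(f"P(children = {k}) = {p:.3f}")
print(f"Normal model: P(children < 0) = {p_negative:.3f}")
```

Under these assumptions the normal model puts roughly a tenth of its probability on negative family sizes, while the Poisson model cannot.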
Practice Exercises
Exercise 1: Classify each variable:
- a) Temperature in Fahrenheit
- b) Movie genre (Action, Comedy, Drama)
- c) Customer age
- d) Job satisfaction rating (1 = Very Unsatisfied, 5 = Very Satisfied)
- e) Number of siblings
Exercise 2: For each variable in the iris dataset, identify the type and choose the most appropriate visualization.
import seaborn as sns
iris = sns.load_dataset('iris')
print(iris.dtypes)
# Your classifications and visualizations here
Solution:
# sepal_length: float64 -> Continuous -> histogram or box plot
# sepal_width: float64 -> Continuous -> histogram or box plot
# petal_length: float64 -> Continuous -> histogram or box plot
# petal_width: float64 -> Continuous -> histogram or box plot
# species: object/category -> Nominal -> bar chart
import seaborn as sns
import matplotlib.pyplot as plt
iris = sns.load_dataset('iris')
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Continuous: petal_length distribution by species
for species in iris['species'].unique():
    subset = iris[iris['species'] == species]['petal_length']
    axes[0].hist(subset, bins=15, alpha=0.6, label=species)
axes[0].set_title('Petal Length by Species\n(Continuous, grouped)')
axes[0].legend()
# Nominal: species counts
iris['species'].value_counts().plot(kind='bar', ax=axes[1], color='coral', edgecolor='black')
axes[1].set_title('Species Count\n(Nominal)')
axes[1].tick_params(rotation=0)
plt.tight_layout()
plt.show()
Data Types in Machine Learning & Deep Learning
| Data Type | ML Encoding | Deep Learning | Example |
|---|---|---|---|
| Nominal | One-Hot, Label | Embedding layer | Color: [1,0,0] for Red |
| Ordinal | Ordinal encoding | Embedding layer | Rating: 1,2,3,4,5 |
| Discrete | Count, Bin | Embedding | Children: 0,1,2,3 |
| Continuous | StandardScaler, MinMax | Normalization layer | Height: 1.72m |
Example — Encoding Data for ML:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Sample data
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],  # Nominal
    'size': ['S', 'M', 'L', 'XL', 'M'],  # Ordinal
    'price': [10.5, 20.3, 15.7, 12.1, 25.0]  # Continuous
})
# One-Hot Encode nominal data (color)
# sparse_output replaced the older sparse argument in scikit-learn 1.2
encoder = OneHotEncoder(sparse_output=False)
color_encoded = encoder.fit_transform(df[['color']])
print("One-Hot Encoded Color:")
print(pd.DataFrame(color_encoded, columns=encoder.get_feature_names_out()))
# Standardize continuous data (price): subtract the mean, divide by the population std
scaler = StandardScaler()
price_scaled = scaler.fit_transform(df[['price']])
print("\nStandardized Price:")
print(price_scaled.flatten().round(2))
Output:
One-Hot Encoded Color:
   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          0.0        1.0
4         1.0          0.0        0.0
Standardized Price:
[-1.17  0.67 -0.19 -0.87  1.55]
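The ordinal `size` column from the sample data is not encoded in the example above. One possible approach, sketched here with scikit-learn's `OrdinalEncoder` and an explicitly supplied category order, maps each size to its rank:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same sizes as the sample data above
sizes = pd.DataFrame({"size": ["S", "M", "L", "XL", "M"]})

# State the order explicitly; the default would sort alphabetically (L, M, S, XL)
size_encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
size_encoded = size_encoder.fit_transform(sizes[["size"]]).ravel()

print(size_encoded)   # → [0. 1. 2. 3. 1.]
```

The resulting integers preserve order only. A model that treats them as evenly spaced numbers is making the same interval assumption discussed under Common Mistakes.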
Key Takeaways
- Data type determines your entire analysis pipeline, so identify types before doing anything else.
- ML models need numeric input, and the right encoding strategy depends on the data type.
- Deep learning often uses embedding layers for categorical data; they tend to scale better than one-hot encoding when cardinality is high.
- A wrong type leads to a wrong method, which leads to a wrong conclusion. This is not just academic pedantry.
"Get the data type right, and the statistics follow. Get it wrong, and no method can save you."
What to Learn Next
- Levels of Measurement: Nominal, ordinal, interval, ratio. Which statistics are valid for each?
- Frequency Distributions: Organize raw data into tables and charts, the first step in any analysis.
- Data Collection Methods: Surveys, experiments, observations. How to gather quality data.
- Sampling Techniques: Random, stratified, cluster. How to choose who gets measured.
- Mean, Median, Mode: The three ways to find the center of your data.
- Probability Basics: The math of uncertainty, the foundation of all inference.