Types of Data in Statistics

Know Your Data Before You Analyze It

Understanding data types is the first and most critical step in any statistical analysis. The type of data you have determines which statistical methods are valid, which visualizations are appropriate, and what conclusions you can draw. Choose wrong, and your entire analysis falls apart.

Here is what mastering data types helps you do:

Select the Right Tests — Different data types require different statistical methods; using the wrong one produces meaningless results.
Visualize Effectively — The chart type that works for continuous data is useless for categorical data, and vice versa.
Avoid Common Mistakes — Stop treating ZIP codes as numbers, Likert scales as intervals, or discrete counts as continuous values.
Communicate Clearly — Classifying variables correctly lets you describe your data accurately to stakeholders and peers.

Identifying your data type is not busywork — it is the foundation every analysis stands on.

Types of Data in Statistics

Definition

Data types classify variables based on their mathematical properties. The type determines which statistical methods, visualizations, and operations are valid.

The Data Type Hierarchy

Qualitative (Categorical) Data

Definition: Qualitative Data

Qualitative data represents categories or groups — things that are described rather than measured numerically.

Nominal Data

Categories with no natural order. You can only determine equality or inequality.

Examples:

Eye color: brown, blue, green, hazel
Blood type: A, B, AB, O
Country of birth
Product category: electronics, clothing, food
Survey responses: Yes / No

Valid operations: Count, mode, chi-square test
Invalid operations: Mean, median, subtraction
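
The valid operations for nominal data can be sketched with the standard library alone. The eye-color sample below is hypothetical and exists only to illustrate counting, the mode, and equality checks.

```python
from collections import Counter
from statistics import multimode

# Hypothetical sample of eye colors (nominal: labels with no order).
eye_colors = ["brown", "blue", "brown", "green", "hazel", "brown", "blue"]

# Counting is valid: how often does each category occur?
counts = Counter(eye_colors)
print(counts.most_common())

# The mode is valid: the most frequent category (multimode also handles ties).
modes = multimode(eye_colors)
print("Mode(s):", modes)

# Equality checks are valid; ordering or averaging the labels is not.
print("First two match:", eye_colors[0] == eye_colors[1])
```

There is deliberately no mean or median here: neither has a defined meaning for unordered labels.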

[Figure: Eye Color Distribution (Nominal Data)]

Ordinal Data

Categories with a meaningful order, but the gaps between categories are not necessarily equal.

Examples:

Education level: high school < bachelor's < master's < PhD
Customer satisfaction: Poor < Fair < Good < Excellent
Military rank: Private < Corporal < Sergeant < Captain
Star ratings: ★ < ★★ < ★★★ < ★★★★ < ★★★★★

Valid operations: Ordering, median, percentiles, Spearman correlation
Invalid operations: Arithmetic mean (controversial), subtraction (intervals unknown)
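
As a minimal sketch of ordering and the median on ordinal data, the example below maps each label to its rank and uses `statistics.median_low`, which always returns an actual data point. The survey responses are hypothetical, and the rank dictionary is just one way to encode the order.

```python
from statistics import median_low

# Order taken from the satisfaction example: Poor < Fair < Good < Excellent.
levels = ["Poor", "Fair", "Good", "Excellent"]
rank = {label: position for position, label in enumerate(levels)}

# Hypothetical survey responses.
responses = ["Good", "Poor", "Excellent", "Good", "Fair", "Good", "Excellent"]

# Sorting by rank respects the category order (alphabetical sorting would not).
ordered = sorted(responses, key=rank.__getitem__)

# median_low picks a real observation, so the result maps back to a category.
median_label = levels[median_low(rank[r] for r in responses)]

print("Sorted:", ordered)
print("Median:", median_label)
```

The ranks are used only for ordering. Subtracting or averaging them would assume equal gaps between categories, which ordinal data does not guarantee.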

[Figure: Customer Satisfaction (Ordinal Data)]

Key Distinction

"Excellent" is better than "Good", but is it exactly twice as good? Ordinal scales can't tell us.

Quantitative (Numerical) Data

Definition: Quantitative Data

Quantitative data represents measured or counted quantities — numbers that have mathematical meaning.

Discrete Data

Can only take specific, countable values — usually whole numbers. There are gaps between possible values.

Examples:

Number of children in a family (0, 1, 2, 3, ... — not 1.7)
Number of cars in a parking lot
Number of defects in a product
Shoe sizes (though not whole numbers, they're discrete: 8, 8.5, 9...)
Number of goals scored in a soccer match

Valid operations: All arithmetic, count, Poisson distribution, binomial distribution
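
To illustrate how a count distribution puts probability only on whole numbers, here is a small Poisson sketch. The rate of 1.7 is a hypothetical average number of children per family, and `poisson_pmf` is a helper written for this example, not a library function.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 1.7  # hypothetical average number of children per family

# Probability is assigned only to the whole numbers 0, 1, 2, ...
for k in range(5):
    print(f"P(children = {k}) = {poisson_pmf(k, lam):.3f}")

# Summing over enough integers recovers essentially all of the probability.
total = sum(poisson_pmf(k, lam) for k in range(50))
print(f"Total probability: {total:.6f}")
```

The mean of 1.7 is not itself a possible observation, but every outcome the model can produce is a non-negative integer.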

[Figure: Number of Children per Family (Discrete Data)]

Continuous Data

Can take any value within a range, including fractions and decimals. Limited only by measurement precision.

Examples:

Height (1.753847... meters)
Temperature (23.7°C)
Time to complete a task
Weight, blood pressure, distance
Stock prices

Valid operations: All arithmetic, normal distribution, integration, derivatives
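
With continuous data, probability attaches to intervals rather than to single values. The sketch below uses `statistics.NormalDist` from the standard library. The mean of 1.70 m and standard deviation of 0.10 m are assumed for illustration, not measured.

```python
from statistics import NormalDist

# Hypothetical model: heights with mean 1.70 m and standard deviation 0.10 m.
heights = NormalDist(mu=1.70, sigma=0.10)

# Probability of falling within one standard deviation of the mean.
p_within_one_sd = heights.cdf(1.80) - heights.cdf(1.60)
print(f"P(1.60 m < height < 1.80 m) = {p_within_one_sd:.4f}")

# Narrowing the interval around 1.75 m drives the probability toward zero.
for width in (0.1, 0.01, 0.001):
    p = heights.cdf(1.75 + width / 2) - heights.cdf(1.75 - width / 2)
    print(f"interval width {width} m: probability {p:.5f}")
```

This is why histograms and density plots, which work with ranges, suit continuous variables better than per-value counts.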

[Figure: Height Distribution (Continuous Data)]

Interval vs Ratio (A Deeper Cut)

Within quantitative data, we can further distinguish:

Feature	Interval	Ratio
Equal intervals	✅ Yes	✅ Yes
True zero (zero = absence)	❌ No	✅ Yes
Meaningful ratios	❌ No	✅ Yes
Example	Temperature (°C), IQ	Height, weight, income

Interval example: 0°C is not "no temperature." 40°C is not twice as hot as 20°C (in the thermodynamic sense). Temperature in Kelvin is ratio.

Ratio example: A person who weighs 80 kg is genuinely twice as heavy as someone who weighs 40 kg.
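
The temperature example can be checked with a few lines of arithmetic. This sketch converts the two Celsius readings to Kelvin and compares ratios and differences. `celsius_to_kelvin` is a helper defined here for illustration.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scale Celsius reading to ratio-scale Kelvin."""
    return celsius + 273.15

warm_c, cool_c = 40.0, 20.0

# The Celsius ratio is arithmetic on an arbitrary zero point.
naive_ratio = warm_c / cool_c

# Kelvin has a true zero, so this ratio is physically meaningful.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Differences, by contrast, agree on both scales.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

print(f"Celsius ratio: {naive_ratio:.3f}")
print(f"Kelvin ratio:  {kelvin_ratio:.3f}")
print(f"Difference: {diff_c:.1f} °C = {diff_k:.1f} K")
```

The Kelvin ratio comes out near 1.07 rather than 2, while the 20-degree difference is the same on both scales. Equal intervals hold on an interval scale, but ratios require a true zero.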

Why Data Types Matter for Statistics

Analysis Goal	Nominal	Ordinal	Discrete/Continuous
Central tendency	Mode	Mode, Median	Mean, Median, Mode
Spread	Frequency	IQR	Std Dev, Variance
Correlation	Cramér's V	Spearman ρ	Pearson r
Group comparison	Chi-square	Kruskal-Wallis	ANOVA, t-test
Regression	Dummy variables	Ordinal logistic	Linear regression
Visualization	Bar chart	Bar/box	Histogram, scatter
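
To make the correlation row concrete, here is a minimal sketch contrasting Pearson r with Spearman ρ on made-up data. Spearman is computed as Pearson correlation on the ranks, so the example needs only pandas. The values are hypothetical.

```python
import pandas as pd

# Hypothetical data: y rises with x, but along a curve rather than a line.
x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3

# Pearson r measures linear association between the raw values.
pearson = x.corr(y)

# Spearman rho is Pearson r applied to the ranks, so it uses only order.
spearman = x.rank().corr(y.rank())

print(f"Pearson r:    {pearson:.3f}")
print(f"Spearman rho: {spearman:.3f}")
```

Because Spearman depends only on ranks, it suits ordinal variables, where the spacing between values is unknown.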

Python: Identifying and Working with Data Types

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load a rich dataset
df = sns.load_dataset('tips')  # seaborn fetches this example dataset online and caches it
print("Dataset shape:", df.shape)
print("\nData types (pandas dtypes):")
print(df.dtypes)

Output:

Dataset shape: (244, 7)

Data types (pandas dtypes):
total_bill     float64   <- Continuous quantitative
tip            float64   <- Continuous quantitative
sex           category   <- Nominal qualitative
smoker        category   <- Nominal qualitative
day           category   <- Ordinal qualitative (Thur < Fri < Sat < Sun by position in the week)
time          category   <- Nominal qualitative
size            int64    <- Discrete quantitative

# --- Statistical summaries differ by type ---

print("\n=== Quantitative Variables ===")
print(df[['total_bill', 'tip', 'size']].describe())

print("\n=== Qualitative Variables ===")
for col in ['sex', 'smoker', 'day', 'time']:
    print(f"\n{col} — value counts:")
    print(df[col].value_counts())
    print(f"Mode: {df[col].mode()[0]}")

# --- Visualizations appropriate to each type ---
fig, axes = plt.subplots(2, 3, figsize=(14, 8))

# Continuous: histogram
axes[0, 0].hist(df['total_bill'], bins=20, color='steelblue', edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Total Bill (Continuous)\n-> Histogram')
axes[0, 0].set_xlabel('Amount ($)')

# Discrete: bar chart
size_counts = df['size'].value_counts().sort_index()
axes[0, 1].bar(size_counts.index, size_counts.values, color='coral', edgecolor='black')
axes[0, 1].set_title('Party Size (Discrete)\n-> Bar Chart')
axes[0, 1].set_xlabel('Size')

# Nominal: pie chart
sex_counts = df['sex'].value_counts()
axes[0, 2].pie(sex_counts.values, labels=sex_counts.index, autopct='%1.1f%%', startangle=90)
axes[0, 2].set_title('Sex (Nominal)\n-> Pie Chart')

# Ordinal: ordered bar
day_order = ['Thur', 'Fri', 'Sat', 'Sun']
day_counts = df['day'].value_counts().reindex(day_order)
axes[1, 0].bar(day_counts.index, day_counts.values, color='mediumseagreen', edgecolor='black')
axes[1, 0].set_title('Day (Ordinal)\n-> Ordered Bar Chart')

# Continuous: box plot by category
df.boxplot(column='tip', by='day', ax=axes[1, 1])
axes[1, 1].set_title('Tip by Day\n-> Box Plot')

# Scatter: two continuous
axes[1, 2].scatter(df['total_bill'], df['tip'], alpha=0.5, color='purple')
axes[1, 2].set_title('Bill vs Tip (Continuous × Continuous)\n-> Scatter Plot')
axes[1, 2].set_xlabel('Total Bill ($)')
axes[1, 2].set_ylabel('Tip ($)')

plt.tight_layout()
plt.savefig('data_types_visualization.png', dpi=150)
plt.show()

Data Type Classification in Practice

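def classify_variable(series: pd.Series, nunique_threshold: int = 15) -> str:
    """Classify a pandas Series into a statistical data type (dtype-based heuristic)."""
    dtype = series.dtype
    nunique = series.nunique()

    if isinstance(dtype, pd.CategoricalDtype):
        # An ordered categorical carries an explicit ranking, so treat it as ordinal
        return 'Ordinal Categorical' if dtype.ordered else 'Nominal Categorical'
    elif pd.api.types.is_bool_dtype(dtype):
        return 'Nominal (Binary)'
    elif pd.api.types.is_object_dtype(dtype) or pd.api.types.is_string_dtype(dtype):
        return 'Nominal Categorical'
    elif pd.api.types.is_integer_dtype(dtype):
        # Covers every integer width, not only int32 and int64
        if nunique <= nunique_threshold:
            return f'Discrete Quantitative ({nunique} unique values)'
        else:
            return 'Discrete Quantitative (high cardinality)'
    elif pd.api.types.is_float_dtype(dtype):
        return 'Continuous Quantitative'
    else:
        return f'Unknown ({dtype})'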

# Apply to the tips dataset
print("Variable Classification:")
print("-" * 50)
for col in df.columns:
    classification = classify_variable(df[col])
    print(f"{col:<15} -> {classification}")
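
A dtype check alone cannot tell ordinal from nominal categories unless the order has been declared. As a sketch of one way to record that order in pandas, the example below builds an ordered `CategoricalDtype`. The `is_ordinal` helper and the ratings are hypothetical.

```python
import pandas as pd

def is_ordinal(series: pd.Series) -> bool:
    """Return True only for categoricals whose order was declared explicitly."""
    return isinstance(series.dtype, pd.CategoricalDtype) and bool(series.cat.ordered)

# Hypothetical satisfaction ratings stored as plain strings.
ratings = pd.Series(["Good", "Poor", "Excellent", "Fair", "Good"])

# Declare the ranking once; pandas then enforces it in sorts and comparisons.
rating_dtype = pd.CategoricalDtype(["Poor", "Fair", "Good", "Excellent"], ordered=True)
ordinal_ratings = ratings.astype(rating_dtype)
nominal_ratings = ratings.astype("category")

print("Declared order:", is_ordinal(ordinal_ratings))
print("No declared order:", is_ordinal(nominal_ratings))
print("Highest rating:", ordinal_ratings.max())
```

Declaring the order is a modeling decision made by the analyst. pandas will not infer it from the labels.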

Common Mistakes

Treating Ordinal as Interval

Averaging Likert-scale responses (1–5) as if they are interval data is common but technically incorrect. The difference between "Strongly Agree" and "Agree" may not equal the difference between "Neutral" and "Disagree."
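
The sketch below shows why a mean can hide what Likert responses actually say. The two groups are invented so that they share the same mean while disagreeing completely in shape.

```python
from collections import Counter
from statistics import mean

# Hypothetical 1-5 Likert responses from two groups.
neutral_group = [3, 3, 3, 3, 3, 3]
polarized_group = [1, 1, 1, 5, 5, 5]

summaries = {}
for name, responses in [("neutral", neutral_group), ("polarized", polarized_group)]:
    counts = Counter(responses)
    # Report how many respondents chose each point on the scale.
    distribution = {score: counts.get(score, 0) for score in range(1, 6)}
    summaries[name] = (mean(responses), distribution)
    print(f"{name}: mean = {mean(responses)}, counts = {distribution}")
```

Both groups average 3, yet one is uniformly neutral and the other is split between the extremes. Reporting the response counts, or the median and mode, keeps that difference visible.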

ZIP Codes as Quantitative

ZIP code 90210 is not "40,210 more" than ZIP code 50000. The digits form a nominal identifier, so arithmetic on them has no meaning.
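
One practical consequence is that identifiers should be loaded as text. Here is a minimal sketch using the `dtype` parameter of `pandas.read_csv`. The two-row CSV is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content with ZIP codes, one of which starts with a zero.
csv_text = "zip,city\n02134,Boston\n90210,Beverly Hills\n"

# Default parsing infers integers and silently drops the leading zero.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Reading the column as strings keeps each ZIP code intact as a label.
as_labels = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})

print(as_numbers["zip"].tolist())
print(as_labels["zip"].tolist())
```

Storing the column as text also stops summary functions from reporting a meaningless "average ZIP code".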

Treating Discrete Data as Continuous

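A continuous model such as the normal distribution assigns probability to impossible outcomes like 1.7 or -0.3 children. A count model such as Poisson or negative binomial puts probability only on 0, 1, 2, and so on. An average of 1.7 children is still a valid summary; the problem is predicting that value for a single family.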

Practice Exercises

Exercise 1: Classify each variable:

a) Temperature in Fahrenheit
b) Movie genre (Action, Comedy, Drama)
c) Customer age
d) Job satisfaction rating (1 = Very Unsatisfied, 5 = Very Satisfied)
e) Number of siblings

Exercise 2: For each variable in the iris dataset, identify the type and choose the most appropriate visualization.

import seaborn as sns
iris = sns.load_dataset('iris')
print(iris.dtypes)
# Your classifications and visualizations here

Solution:

# sepal_length: float64 -> Continuous -> histogram or box plot
# sepal_width: float64 -> Continuous -> histogram or box plot
# petal_length: float64 -> Continuous -> histogram or box plot
# petal_width: float64 -> Continuous -> histogram or box plot
# species: object/category -> Nominal -> bar chart

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Continuous: petal_length distribution by species
for species in iris['species'].unique():
    subset = iris[iris['species'] == species]['petal_length']
    axes[0].hist(subset, bins=15, alpha=0.6, label=species)
axes[0].set_title('Petal Length by Species\n(Continuous, grouped)')
axes[0].legend()

# Nominal: species counts
iris['species'].value_counts().plot(kind='bar', ax=axes[1], color='coral', edgecolor='black')
axes[1].set_title('Species Count\n(Nominal)')
axes[1].tick_params(rotation=0)

plt.tight_layout()
plt.show()

Data Types in Machine Learning & Deep Learning

Data Type	ML Encoding	Deep Learning	Example
Nominal	One-Hot, Label	Embedding layer	Color: [1,0,0] for Red
Ordinal	Ordinal encoding	Embedding layer	Rating: 1,2,3,4,5
Discrete	Count, Bin	Embedding	Children: 0,1,2,3
Continuous	StandardScaler, MinMax	Normalization layer	Height: 1.72m

Example — Encoding Data for ML:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sample data
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'XL', 'M'],  # Ordinal
    'price': [10.5, 20.3, 15.7, 12.1, 25.0]  # Continuous
})

# One-Hot Encode nominal data (color)
# sparse_output replaced the older sparse argument in scikit-learn 1.2
encoder = OneHotEncoder(sparse_output=False)
color_encoded = encoder.fit_transform(df[['color']])
print("One-Hot Encoded Color:")
print(pd.DataFrame(color_encoded, columns=encoder.get_feature_names_out()))

# Standardize continuous data (price)
scaler = StandardScaler()
price_scaled = scaler.fit_transform(df[['price']])
print("\nStandardized Price:")
print(price_scaled.flatten().round(2))

Output:

One-Hot Encoded Color:
   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          0.0        1.0
4         1.0          0.0        0.0

Standardized Price:
[-1.17  0.67 -0.19 -0.87  1.55]
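
The sample data marks `size` as ordinal, but the example above does not encode it. One possible approach, sketched here with an ordered pandas `Categorical`, maps each size to its position in a declared order. The order S < M < L < XL is an assumption about what the labels mean.

```python
import pandas as pd

# The same sizes as in the sample data above.
sizes = pd.Series(["S", "M", "L", "XL", "M"], name="size")

# Assumed ranking of the labels, smallest to largest.
size_order = ["S", "M", "L", "XL"]

# Integer codes follow the declared order rather than alphabetical order.
size_codes = pd.Categorical(sizes, categories=size_order, ordered=True).codes

print(size_codes.tolist())
```

These codes preserve rank only. A model that treats them as numbers will also assume the step from S to M equals the step from L to XL, which may or may not be acceptable for the task.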

Key Takeaways

Data type determines your entire analysis pipeline — identify types before doing anything else.

ML models need numeric input — encoding strategy depends on data type.

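Deep learning often uses embedding layers for categorical data: each category maps to a learned dense vector, which stays compact for high-cardinality variables where one-hot vectors grow with the number of categories.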

Wrong type leads to wrong method leads to wrong conclusion — this is not just academic pedantry.

"Get the data type right, and the statistics follow. Get it wrong, and no method can save you."

What to Learn Next

-> Levels of Measurement: Nominal, ordinal, interval, ratio, and which statistics are valid for each.

-> Frequency Distributions: Organize raw data into tables and charts, the first step in any analysis.

-> Data Collection Methods: Surveys, experiments, observations, and how to gather quality data.

-> Sampling Techniques: Random, stratified, cluster, and how to choose who gets measured.

-> Mean, Median, Mode: The three ways to find the center of your data.

-> Probability Basics: The math of uncertainty, the foundation of all inference.
