What is Data Science?

Data science sits at the intersection of statistics, computer science, and domain expertise. It is the practice of extracting meaningful insights and predictions from structured and unstructured data. In this lesson, you will learn what data science actually is, how it differs from related fields, and where it is used in the real world.

The three pillars of data science are often drawn as a Venn diagram: statistics for inference, programming for implementation, and domain knowledge for context.

A Brief History of Data Science

The term "data science" has evolved over decades:

1960s–1970s: John Tukey championed "data analysis" as a field in its own right and advocated for exploratory methods beyond formal statistics.
1974: Peter Naur used the term "data science" throughout his book Concise Survey of Computer Methods, offering it as an alternative name for computer science.
1989: Gregory Piatetsky-Shapiro organized the first Knowledge Discovery in Databases (KDD) workshop.
2001: William Cleveland published "Data Science: An Action Plan," pushing for academic programs.
2008: DJ Patil and Jeff Hammerbacher popularized the title "Data Scientist" at LinkedIn and Facebook.
2012: Harvard Business Review called data scientist "the sexiest job of the 21st century."

Today, data science powers recommendations at Netflix, fraud detection at banks, autonomous driving at Tesla, and drug discovery at pharmaceutical companies.

Data Science vs Related Fields

Understanding the differences helps you choose your path:

Field	Focus	Key Skills
Data Analysis	Describing what happened	SQL, Excel, basic Python
Data Engineering	Building data infrastructure	Python, Spark, Airflow, cloud
Data Science	Predicting and prescribing	Statistics, ML, Python, domain knowledge
Machine Learning Engineering	Deploying ML models at scale	MLOps, Docker, Kubernetes
Research Science	Advancing scientific knowledge	Statistics, domain expertise, writing

The Data Analyst

A data analyst answers business questions using existing data. They build dashboards, write SQL queries, and create reports. Their work is retrospective – they explain what already happened.
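
As a minimal sketch of this kind of retrospective question, the snippet below totals revenue by region with pandas. The `orders` table and its column names are made up for illustration; in practice the data would usually come from a SQL query.

```python
import pandas as pd

# Hypothetical order data (illustrative values only)
orders = pd.DataFrame({
    "region": ["North", "South", "North", "West", "South", "North"],
    "revenue": [120.0, 80.0, 200.0, 50.0, 70.0, 30.0],
})

# Roughly equivalent to:
#   SELECT region, SUM(revenue), COUNT(*) FROM orders GROUP BY region
summary = (
    orders.groupby("region")["revenue"]
    .agg(total_revenue="sum", num_orders="count")
    .sort_values("total_revenue", ascending=False)
)
print(summary)
```

The result describes what already happened (which region earned the most), which is the analyst's typical starting point.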

The Data Engineer

A data engineer builds the pipelines and infrastructure that make data available. They design ETL processes, manage data warehouses, and ensure data quality and reliability at scale.
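
The extract-transform-load idea can be sketched with the standard library alone. This is a toy illustration, not a production pipeline: the CSV content, the `customers` table, and the quality rule (drop rows with no spend value) are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; a real pipeline would read from files, APIs, or another database
RAW_CSV = """customer_id,signup_date,total_spend
1,2023-01-15,120.50
2,2023-02-01,
3,2023-02-20,89.99
"""

def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Cast types and drop rows that fail a simple quality check."""
    clean = []
    for row in rows:
        if not row["total_spend"]:  # skip rows with a missing amount
            continue
        clean.append(
            (int(row["customer_id"]), row["signup_date"], float(row["total_spend"]))
        )
    return clean

def load(rows, conn):
    """Write the cleaned rows into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(customer_id INTEGER PRIMARY KEY, signup_date TEXT, total_spend REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database stands in for a warehouse
load(transform(extract(RAW_CSV)), conn)
row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(row_count)
```

Real pipelines add scheduling, retries, and monitoring (for example with tools such as Airflow), but the three stages keep the same shape.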

The Data Scientist

A data scientist goes further than analysis – they build predictive models, design experiments, and translate business problems into mathematical formulations. They need both coding skill and statistical intuition.
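
To make "design experiments" concrete, here is a small sketch of analyzing an A/B test with a two-proportion z-test. The conversion counts are invented, and the function name `two_proportion_z_test` is a helper defined here for illustration, not part of any library.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical experiment: variant A vs. variant B, 2,000 users each
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be chance alone, though a real analysis would also fix the sample size and significance level before the experiment starts.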

The Data Science Workflow

A typical data science project follows these steps:

Define the Question: What problem are you solving? What does success look like?
Collect Data: Gather data from databases, APIs, web scraping, or sensors.
Clean and Explore: Handle missing values, outliers, and understand distributions.
Feature Engineering: Create new variables that improve model performance.
Model Building: Choose and train algorithms.
Evaluate: Test on held-out data, check for bias, measure performance.
Deploy: Put the model into production.
Monitor: Track model performance over time and retrain as needed.

# Example: A simplified data science workflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Load data
df = pd.read_csv("customer_churn.csv")

# 2. Explore
print(df.shape)
print(df.head())
print(df["churn"].value_counts(normalize=True))

# 3. Feature engineering
# Clip tenure at 1 month so brand-new customers do not cause division by zero
df["avg_monthly_spend"] = df["total_spend"] / df["tenure_months"].clip(lower=1)

# 4. Prepare features and target
features = ["avg_monthly_spend", "tenure_months", "num_support_calls"]
X = df[features]
y = df["churn"]

# 5. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 6. Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 7. Evaluate
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2%}")
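
Accuracy alone can be misleading when one class is rare, as churn often is. The sketch below repeats the evaluation step on synthetic, imbalanced data (generated with scikit-learn's `make_classification`, since `customer_churn.csv` is not available here) and reports a confusion matrix and per-class precision and recall.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for churn data: roughly 10% positive class
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, predictions)
print(cm)

# Precision and recall per class show how well the rare class is handled
print(classification_report(y_test, predictions, digits=3))
```

A model that predicts "no churn" for everyone would score about 90% accuracy on data like this, which is why recall on the minority class is worth checking.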

Essential Tools and Technologies

Programming Languages

Python: The dominant language for data science. Rich ecosystem (pandas, scikit-learn, TensorFlow, PyTorch).
R: Strong in statistics and academic research. Excellent visualization with ggplot2.
SQL: Essential for data extraction from relational databases.
Julia: Emerging language for high-performance numerical computing.

Libraries and Frameworks

# Core Python data science stack
import numpy as np          # Numerical computing
import pandas as pd         # Data manipulation
import matplotlib.pyplot as plt  # Basic visualization
import seaborn as sns       # Statistical visualization
from sklearn.ensemble import RandomForestClassifier  # Machine learning (import only what you need)

Infrastructure

Jupyter Notebooks: Interactive computing and storytelling.
Git: Version control for code and notebooks.
Docker: Containerized environments for reproducibility.
Cloud Platforms: AWS, GCP, Azure for scalable computing.

Real-World Applications

Healthcare

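# Example: Predicting patient readmission
# Features: age, diagnosis, length of stay, num prior visits
# Assumes X_train (a DataFrame of numerically encoded features) and y_train
# (readmission labels) come from an earlier train/test split.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=200, max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Feature importances rank which inputs the model relies on most;
# they indicate association, not proven causes of readmission
feature_names = X_train.columns
importance = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importance)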

Hospitals use data science to predict which patients are likely to be readmitted within 30 days, allowing proactive intervention.

Finance

Banks use data science for:

Fraud detection: Real-time anomaly detection on transactions.
Credit scoring: Predicting default risk from borrower features.
Algorithmic trading: Using NLP and time series for market prediction.
Risk management: Portfolio optimization and stress testing.
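
As a rough illustration of anomaly detection for fraud, the sketch below fits scikit-learn's `IsolationForest` to simulated transaction amounts. The data, the single feature, and the `contamination` value are assumptions chosen for the example; real systems use many more features and tuned thresholds.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Simulated data: 500 typical amounts plus three unusually large ones
normal = rng.normal(loc=50, scale=10, size=(500, 1))
suspicious = np.array([[900.0], [1200.0], [1500.0]])
amounts = np.vstack([normal, suspicious])

# contamination is the assumed share of anomalies in the data
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(amounts)  # -1 = anomaly, 1 = normal

flagged = amounts[labels == -1].ravel()
print(sorted(flagged))
```

Flagged transactions would normally go to a review queue rather than being blocked outright, since an unusual amount is not proof of fraud.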

Retail and E-Commerce

Recommendation engines: "Customers who bought X also bought Y."
Demand forecasting: Predicting inventory needs by store and region.
Dynamic pricing: Adjusting prices based on demand and competition.
Customer segmentation: Grouping customers by behavior for targeted marketing.
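
The "also bought" idea can be sketched by counting how often items appear in the same basket. This is a deliberately simple co-occurrence approach with made-up baskets; production recommenders typically use more sophisticated models. The helper name `also_bought` is defined here for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase history: each set is one customer's basket
baskets = [
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse"},
    {"phone", "phone case"},
    {"laptop", "laptop bag"},
    {"mouse", "keyboard"},
]

# Count every ordered pair of items that share a basket
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1

def also_bought(item, top_n=2):
    """Return up to top_n items most often bought together with item."""
    scores = {other: n for (first, other), n in pair_counts.items() if first == item}
    # Sort by count (descending), then name, so ties are resolved consistently
    return sorted(scores, key=lambda other: (-scores[other], other))[:top_n]

print(also_bought("laptop"))
```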

Transportation

Route optimization: Minimizing delivery time and fuel costs.
Predictive maintenance: Anticipating vehicle failures before they happen.
Autonomous vehicles: Computer vision and sensor fusion for self-driving.
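
Route optimization is a hard problem in general, but a greedy nearest-neighbor heuristic gives a feel for it. The sketch below is not an optimal solver: it always drives to the closest unvisited stop. The coordinates and the function names are invented for the example.

```python
import math

def nearest_neighbor_route(depot, stops):
    """Greedy route: from the current point, always visit the closest remaining stop."""
    route = [depot]
    remaining = list(stops)
    while remaining:
        current = route[-1]
        nxt = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(nxt)
        route.append(nxt)
    return route

def route_length(route):
    """Total straight-line distance along the route."""
    return sum(math.dist(a, b) for a, b in zip(route, route[1:]))

# Hypothetical delivery stops on a flat grid, starting from a depot at (0, 0)
stops = [(5, 5), (1, 0), (2, 2), (0, 1)]
route = nearest_neighbor_route((0, 0), stops)
print(route)
print(round(route_length(route), 2))
```

Greedy routes can be noticeably longer than the best possible route, so real logistics systems rely on dedicated optimization solvers and account for traffic, time windows, and vehicle capacity.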

Career Paths in Data Science

Entry-Level Roles

Junior Data Analyst: $55K–$75K – SQL, Excel, basic Python, dashboards.
Data Science Intern: $40K–$70K – Learning the full workflow under mentorship.

Mid-Level Roles

Data Scientist: $90K–$140K – End-to-end project ownership, modeling, deployment.
Machine Learning Engineer: $100K–$150K – Building and deploying production ML systems.
Analytics Engineer: $90K–$130K – Data modeling, dbt, warehouse design.

Senior and Leadership Roles

Senior Data Scientist: $130K–$180K – Technical leadership, mentoring, complex problems.
Staff/Principal Scientist: $170K–$250K+ – Strategic direction, research, cross-team impact.
Director of Data Science: $180K–$300K+ – Team management, business strategy, budgeting.

Skills Roadmap

Year 1: Python, SQL, Pandas, Matplotlib, Statistics basics
Year 2: Machine Learning, Feature Engineering, A/B Testing
Year 3: Deep Learning, Cloud Deployment, MLOps
Year 4+: Leadership, Business Strategy, Research, Specialization

Common Misconceptions

"Data science is just machine learning" – ML is one tool. Statistics, visualization, and communication are equally important.
"You need a PhD" – Many successful data scientists are self-taught or came through bootcamps.
"More data is always better" – Quality matters more than quantity. Dirty data leads to garbage predictions.
"Models are the hard part" – Understanding the problem and cleaning the data typically takes 80% of the time.

Key Takeaways

Data science combines statistics, programming, and domain knowledge.
The field has distinct roles: analyst, engineer, scientist, and ML engineer.
Python and SQL are the most important tools to learn first.
Real-world impact comes from solving actual business problems, not from fancy algorithms.
The demand for data professionals continues to grow across every industry.

Core Statistical Formulas

Understanding the math behind data science is essential. Here are the foundational formulas:

Mean (Average):

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

Variance:

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

Standard Deviation:

\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}

Covariance:

\text{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

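These formulas form the backbone of descriptive statistics used in data science. As written, they divide by n, which is the population form; when estimating from a sample, the variance, standard deviation, and covariance are usually computed with n − 1 in the denominator instead.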
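
The formulas above can be checked numerically. The sketch below computes each one directly with NumPy on two small made-up arrays and compares the results with NumPy's built-in functions, using `ddof=0` so that they divide by n as the formulas do.

```python
import numpy as np

# Small illustrative samples (values chosen so the arithmetic is easy to follow)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0])
n = len(x)

mean_x = x.sum() / n                                     # mean
var_x = ((x - mean_x) ** 2).sum() / n                    # variance (divide by n)
std_x = np.sqrt(var_x)                                   # standard deviation
cov_xy = ((x - mean_x) * (y - y.mean())).sum() / n       # covariance (divide by n)

# Cross-check against NumPy's built-ins with ddof=0 (population form)
assert np.isclose(mean_x, np.mean(x))
assert np.isclose(var_x, np.var(x, ddof=0))
assert np.isclose(std_x, np.std(x, ddof=0))
assert np.isclose(cov_xy, np.cov(x, y, ddof=0)[0, 1])

print(mean_x, var_x, std_x, cov_xy)
```

Passing `ddof=1` to `np.var`, `np.std`, or `np.cov` switches to the n − 1 denominator used for sample estimates.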

In the next lesson, you will dive into Python data types and structures – the building blocks for everything that follows.
