What is Data Science?
Data science sits at the intersection of statistics, computer science, and domain expertise. It is the practice of extracting meaningful insights and predictions from structured and unstructured data. In this lesson, you will learn what data science actually is, how it differs from related fields, and where it is used in the real world.
The three pillars of data science form a Venn diagram: you need Statistics for inference, Programming for implementation, and Domain Knowledge for context.
A Brief History of Data Science
The term "data science" has evolved over decades:
- 1960s-1970s: John Tukey coined "data analysis" and advocated for exploratory methods beyond formal statistics.
- 1974: Peter Naur proposed "data science" as an independent discipline in his book Concise Survey of Computer Methods.
- 1989: Gregory Piatetsky-Shapiro organized the first Knowledge Discovery in Databases (KDD) workshop.
- 2001: William Cleveland published "Data Science: An Action Plan," pushing for academic programs.
- 2008: DJ Patil and Jeff Hammerbacher popularized the title "Data Scientist" at LinkedIn and Facebook.
- 2012: Harvard Business Review called data scientist "the sexiest job of the 21st century."
Today, data science powers recommendations at Netflix, fraud detection at banks, autonomous driving at Tesla, and drug discovery at pharmaceutical companies.
Data Science vs Related Fields
Understanding the differences helps you choose your path:
| Field | Focus | Key Skills |
|---|---|---|
| Data Analysis | Describing what happened | SQL, Excel, basic Python |
| Data Engineering | Building data infrastructure | Python, Spark, Airflow, cloud |
| Data Science | Predicting and prescribing | Statistics, ML, Python, domain knowledge |
| Machine Learning Engineering | Deploying ML models at scale | MLOps, Docker, Kubernetes |
| Research Science | Advancing scientific knowledge | Statistics, domain expertise, writing |
The Data Analyst
A data analyst answers business questions using existing data. They build dashboards, write SQL queries, and create reports. Their work is retrospective: they explain what already happened.
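As a rough illustration of this retrospective style of work, the sketch below summarizes past sales by region with pandas, much as a SQL `GROUP BY` query would. The `sales` table, its column names, and its values are invented for the example.

```python
import pandas as pd

# Hypothetical sales records; columns and values are invented for illustration.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "revenue": [120.0, 80.0, 200.0, 50.0, 150.0],
})

# Similar in spirit to:
#   SELECT region, SUM(revenue), COUNT(*) FROM sales GROUP BY region
summary = (
    sales.groupby("region")["revenue"]
    .agg(total_revenue="sum", num_orders="count")
    .reset_index()
)
print(summary)
```

The result describes what already happened (revenue and order counts per region); it makes no prediction about future sales.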
The Data Engineer
A data engineer builds the pipelines and infrastructure that make data available. They design ETL processes, manage data warehouses, and ensure data quality and reliability at scale.
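A minimal sketch of an ETL (extract, transform, load) process is shown below, using only the Python standard library. The CSV content, the `orders` table, and the in-memory SQLite database standing in for a warehouse are all assumptions made for illustration; production pipelines typically rely on dedicated tools such as those listed in the table above.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file, API, or queue.
RAW_CSV = """order_id,amount,country
1,19.99,us
2,,de
3,5.00,US
"""

def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Drop rows with a missing amount and normalize types and casing."""
    clean = []
    for row in rows:
        if not row["amount"]:
            continue  # basic data-quality rule: skip incomplete records
        clean.append(
            (int(row["order_id"]), float(row["amount"]), row["country"].upper())
        )
    return clean

def load(rows, conn):
    """Write the cleaned rows into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database stands in for a warehouse
load(transform(extract(RAW_CSV)), conn)
print(conn.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country"
).fetchall())
```

Keeping the three stages as separate functions makes each one easy to test and swap out, which is one common way to keep a pipeline reliable as it grows.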
The Data Scientist
A data scientist goes further than analysis: they build predictive models, design experiments, and translate business problems into mathematical formulations. They need both coding skill and statistical intuition.
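To make "design experiments" a little more concrete, here is a small sketch of how an A/B test result might be checked with a two-proportion z-test. The conversion counts are invented, and the helper name `two_proportion_z_test` is hypothetical; a real analysis would also consider sample-size planning and how many comparisons are being run.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment counts, invented for illustration.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

Here the business question ("did the new variant convert better?") is translated into a statistical one: is the observed difference larger than chance alone would plausibly produce?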
The Data Science Workflow
A typical data science project follows these steps:
- Define the Question: What problem are you solving? What does success look like?
- Collect Data: Gather data from databases, APIs, web scraping, or sensors.
- Clean and Explore: Handle missing values, outliers, and understand distributions.
- Feature Engineering: Create new variables that improve model performance.
- Model Building: Choose and train algorithms.
- Evaluate: Test on held-out data, check for bias, measure performance.
- Deploy: Put the model into production.
- Monitor: Track model performance over time and retrain as needed.
# Example: A simplified data science workflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# 1. Load data
df = pd.read_csv("customer_churn.csv")
# 2. Explore
print(df.shape)
print(df.head())
print(df["churn"].value_counts(normalize=True))
# 3. Feature engineering (clip tenure to at least 1 month to avoid division by zero)
df["avg_monthly_spend"] = df["total_spend"] / df["tenure_months"].clip(lower=1)
# 4. Prepare features and target
features = ["avg_monthly_spend", "tenure_months", "num_support_calls"]
X = df[features]
y = df["churn"]
# 5. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 6. Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# 7. Evaluate
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2%}")
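The workflow example above assumes the file is already clean. The "Clean and Explore" step might look roughly like the sketch below, which fills a missing value with the median and clips an outlier using the 1.5 × IQR rule of thumb. The tiny DataFrame is made up, and median imputation and IQR clipping are only one common choice among many.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing value and one extreme outlier.
raw = pd.DataFrame({"total_spend": [120.0, 95.0, np.nan, 110.0, 5000.0, 130.0]})

# 1. Inspect missingness before changing anything.
print(raw.isna().sum())

# 2. Impute missing values with the median, which is robust to outliers.
clean = raw.copy()
clean["total_spend"] = clean["total_spend"].fillna(clean["total_spend"].median())

# 3. Clip values that fall more than 1.5 * IQR outside the quartiles.
q1, q3 = clean["total_spend"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["total_spend"] = clean["total_spend"].clip(
    lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr
)

print(clean.describe())
```

Whether to clip, drop, or keep an outlier depends on the domain: a 5000 spend may be a data-entry error or your most valuable customer.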
Essential Tools and Technologies
Programming Languages
- Python: The dominant language for data science. Rich ecosystem (pandas, scikit-learn, TensorFlow, PyTorch).
- R: Strong in statistics and academic research. Excellent visualization with ggplot2.
- SQL: Essential for data extraction from relational databases.
- Julia: Emerging language for high-performance numerical computing.
Libraries and Frameworks
# Core Python data science stack
import numpy as np # Numerical computing
import pandas as pd # Data manipulation
import matplotlib.pyplot as plt # Basic visualization
import seaborn as sns # Statistical visualization
from sklearn.ensemble import RandomForestClassifier # Machine learning (import only the estimators you need)
Infrastructure
- Jupyter Notebooks: Interactive computing and storytelling.
- Git: Version control for code and notebooks.
- Docker: Containerized environments for reproducibility.
- Cloud Platforms: AWS, GCP, Azure for scalable computing.
Real-World Applications
Healthcare
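# Example: Predicting patient readmission
# Features: age, diagnosis, length of stay, num prior visits
# Assumes X_train (a DataFrame of these features) and y_train
# (readmission labels) were prepared as in the workflow example.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=200, max_depth=4)
model.fit(X_train, y_train)
# Feature importance shows which inputs the model relies on most;
# it suggests, but does not prove, what drives readmission
feature_names = X_train.columns
importance = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importance)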
Hospitals use data science to predict which patients are likely to be readmitted within 30 days, allowing proactive intervention.
Finance
Banks use data science for:
- Fraud detection: Real-time anomaly detection on transactions.
- Credit scoring: Predicting default risk from borrower features.
- Algorithmic trading: Using NLP and time series for market prediction.
- Risk management: Portfolio optimization and stress testing.
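As one possible illustration of anomaly detection for fraud, the sketch below fits scikit-learn's `IsolationForest` to synthetic transaction amounts. The data, the single `amount` feature, and the `contamination` value are all assumptions; this is a batch example, not a real-time system, and real fraud models use many more features.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic transaction amounts: mostly ordinary purchases plus a few huge ones.
normal = rng.normal(loc=50, scale=15, size=(500, 1))
suspicious = np.array([[900.0], [1200.0], [1500.0]])
amounts = np.vstack([normal, suspicious])

# contamination is a guess at the share of anomalies; tune it on real data.
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(amounts)  # -1 = flagged as anomaly, 1 = normal

flagged = amounts[labels == -1].ravel()
print(f"Flagged {len(flagged)} of {len(amounts)} transactions")
```

Flagged transactions would normally go to a review queue or a second model rather than being blocked outright, since some unusual transactions are legitimate.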
Retail and E-Commerce
- Recommendation engines: "Customers who bought X also bought Y."
- Demand forecasting: Predicting inventory needs by store and region.
- Dynamic pricing: Adjusting prices based on demand and competition.
- Customer segmentation: Grouping customers by behavior for targeted marketing.
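The "customers who bought X also bought Y" idea can be sketched with simple co-occurrence counts, as below. The baskets are invented, and the helper `also_bought` is a hypothetical name; production recommenders typically use far richer signals and models.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order baskets, invented for illustration.
baskets = [
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse"},
    {"phone", "phone case"},
    {"laptop", "laptop bag"},
    {"mouse", "mouse pad"},
]

# Count how often each ordered pair of items appears in the same basket.
co_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def also_bought(item, top_n=2):
    """Return the items most often bought together with `item`."""
    related = Counter({b: n for (a, b), n in co_counts.items() if a == item})
    return [name for name, _ in related.most_common(top_n)]

print(also_bought("laptop"))
```

Raw counts favor popular items, so real systems usually normalize them (for example by how often each item is bought overall).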
Transportation
- Route optimization: Minimizing delivery time and fuel costs.
- Predictive maintenance: Anticipating vehicle failures before they happen.
- Autonomous vehicles: Computer vision and sensor fusion for self-driving.
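To give a feel for route optimization, the sketch below uses a greedy nearest-neighbor heuristic on made-up coordinates. This heuristic is simple and fast but does not guarantee the shortest route; the depot, stops, and straight-line distances are assumptions for illustration, and real systems account for road networks, traffic, and time windows.

```python
import math

# Hypothetical delivery stops as (x, y) coordinates; the depot is the start.
depot = (0.0, 0.0)
stops = [(2.0, 3.0), (5.0, 1.0), (1.0, 1.0), (6.0, 4.0)]

def nearest_neighbor_route(start, points):
    """Greedy heuristic: always drive to the closest unvisited stop."""
    route, remaining, current = [start], list(points), start
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route

def route_length(route):
    """Total straight-line distance along the route."""
    return sum(math.dist(a, b) for a, b in zip(route, route[1:]))

route = nearest_neighbor_route(depot, stops)
print(route, round(route_length(route), 2))
```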
Career Paths in Data Science
Entry-Level Roles
- Junior Data Analyst (75K): SQL, Excel, basic Python, dashboards.
- Data Science Intern (70K): Learning the full workflow under mentorship.
Mid-Level Roles
- Data Scientist (140K): End-to-end project ownership, modeling, deployment.
- Machine Learning Engineer (150K): Building and deploying production ML systems.
- Analytics Engineer (130K): Data modeling, dbt, warehouse design.
Senior and Leadership Roles
- Senior Data Scientist (180K): Technical leadership, mentoring, complex problems.
- Staff/Principal Scientist (250K+): Strategic direction, research, cross-team impact.
- Director of Data Science (300K+): Team management, business strategy, budgeting.
Skills Roadmap
Year 1: Python, SQL, Pandas, Matplotlib, Statistics basics
Year 2: Machine Learning, Feature Engineering, A/B Testing
Year 3: Deep Learning, Cloud Deployment, MLOps
Year 4+: Leadership, Business Strategy, Research, Specialization
Common Misconceptions
- "Data science is just machine learning": ML is one tool. Statistics, visualization, and communication are equally important.
- "You need a PhD": Many successful data scientists are self-taught or came through bootcamps.
- "More data is always better": Quality matters more than quantity. Dirty data leads to garbage predictions.
- "Models are the hard part": Understanding the problem and cleaning the data usually take most of a project's time; a figure of around 80% is often quoted.
Key Takeaways
- Data science combines statistics, programming, and domain knowledge.
- The field has distinct roles: analyst, engineer, scientist, and ML engineer.
- Python and SQL are the most important tools to learn first.
- Real-world impact comes from solving actual business problems, not from fancy algorithms.
- The demand for data professionals continues to grow across every industry.
Core Statistical Formulas
Understanding the math behind data science is essential. Here are the foundational formulas:
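Mean (Average):
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Variance (sample form; the population form divides by $n$ instead of $n - 1$):
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Standard Deviation:
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Covariance:
$$\operatorname{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$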
These formulas form the backbone of descriptive statistics used in data science.
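As a quick check of these definitions, the sketch below computes the sample mean, variance, standard deviation, and covariance by hand (dividing by n - 1) and compares them with NumPy's built-ins. The two small arrays are made up for illustration.

```python
import numpy as np

# Two small made-up samples of equal length.
x = np.array([2.0, 4.0, 4.0, 5.0, 7.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 6.0])
n = len(x)

mean_x = x.sum() / n
mean_y = y.sum() / n
var_x = ((x - mean_x) ** 2).sum() / (n - 1)            # sample variance
std_x = np.sqrt(var_x)                                  # sample standard deviation
cov_xy = ((x - mean_x) * (y - mean_y)).sum() / (n - 1)  # sample covariance

# Cross-check against NumPy (ddof=1 selects the n - 1 denominator).
assert np.isclose(mean_x, np.mean(x))
assert np.isclose(var_x, np.var(x, ddof=1))
assert np.isclose(std_x, np.std(x, ddof=1))
assert np.isclose(cov_xy, np.cov(x, y)[0, 1])

print(mean_x, var_x, std_x, cov_xy)
```

Note that `np.var` and `np.std` default to `ddof=0` (the population form), while `np.cov` defaults to the sample form, so it is worth being explicit about which one you want.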
In the next lesson, you will dive into Python data types and structures, the building blocks for everything that follows.