Linux and Networking Essentials for Data Engineers

Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Most data infrastructure runs on Linux, and networking knowledge is essential for debugging pipeline connectivity issues.

Overview

Linux Fundamentals

Process Management

# List processes
ps aux                              # All processes
ps aux | grep python                # Find Python processes
top                                 # Interactive process viewer
htop                                # Better interactive viewer

# Kill processes
kill PID                            # Send SIGTERM (graceful)
kill -9 PID                         # Send SIGKILL (force)
pkill -f "etl_pipeline"            # Kill by pattern matched against the full command line

# Background / foreground
nohup python etl.py &              # Run in background, survive logout
jobs                                # List background jobs
fg %1                               # Bring job 1 to foreground
bg %1                               # Resume job 1 in background

# System resources
free -h                             # Memory usage
df -h                               # Disk usage
du -sh /data/*                      # Directory sizes
uptime                              # Load average
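
The resource commands above can be turned into simple scripted checks. The following is a minimal sketch, assuming a POSIX shell and the `df -P` output format. The function names `parse_used_pct` and `is_over_threshold` and the default limit of 90 are illustrative choices, not part of any standard tool.

```shell
#!/bin/sh
# Read `df -P` output on stdin and print the used-space percentage of the
# first filesystem listed, as an integer without the % sign.
parse_used_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when a percentage is at or above the limit.
# The default of 90 is an arbitrary example value.
is_over_threshold() {
    [ "$1" -ge "${2:-90}" ]
}

# df -P forces the one-line-per-filesystem POSIX layout that the parser expects.
pct=$(df -P / | parse_used_pct)
if is_over_threshold "$pct"; then
    echo "WARNING: / is ${pct}% full"
else
    echo "OK: / is ${pct}% full"
fi
```

A check like this could run from cron or a systemd timer before a large load starts, so that a full disk is reported before the pipeline fails on it.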

Process Signals Reference

Signal	Number (Linux x86/ARM)	Action	Use Case
SIGHUP	1	Reload config	Reload Nginx config
SIGINT	2	Interrupt (Ctrl+C)	Stop running script
SIGTERM	15	Graceful termination	`kill PID` default
SIGKILL	9	Force kill	Last resort, uncatchable
SIGUSR1	10	User-defined	Custom handler in your app
SIGSTOP	19	Pause process	`kill -STOP PID`
SIGCONT	18	Resume process	`kill -CONT PID`
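
To illustrate how a script can react to the catchable signals in the table, here is a minimal sketch using the shell's `trap` builtin. The handler names and the flag variables are hypothetical. The worker runs in a child shell (fed through a here-document) so that `$$` refers to the child and the demo never signals the calling shell. Signal names are used instead of numbers because the numbers differ between architectures.

```shell
#!/bin/sh
# Run a small worker in a child shell. The quoted 'EOF' stops the parent
# from expanding $$ and the other variables inside the here-document.
demo_signals() {
    sh -s <<'EOF'
shutdown_requested=0
reload_count=0

# Handlers only set flags; the main loop would decide what to do with them.
on_term() { shutdown_requested=1; }
on_usr1() { reload_count=$((reload_count + 1)); }

trap on_term TERM INT    # graceful stop requested (kill PID, Ctrl+C)
trap on_usr1 USR1        # user-defined: here it counts "reload" requests

# For the demo, the worker signals itself.
kill -USR1 $$
kill -TERM $$

echo "reloads=$reload_count shutdown_requested=$shutdown_requested"
EOF
}

demo_signals
```

SIGKILL and SIGSTOP cannot be trapped, which is why `kill -9` should stay a last resort: the process gets no chance to flush buffers or release locks.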

File System Navigation

# Directory structure
ls -la                              # List all with details
ls -lh /var/log                    # Human-readable sizes
tree -L 2 /app                     # Tree view (2 levels deep)

# Find files
find /data -name "*.csv" -mtime +7         # CSV files older than 7 days
find /data -size +1G                        # Files larger than 1GB
find /data -user etl_user                   # Files owned by user
find /data -name "*.log" -mtime +7 -delete  # Delete log files older than 7 days
find /data -type f -perm 0600               # Files with 600 permissions

# Disk usage analysis
du -sh /var/log/*                    # Size of each file and directory in /var/log
ncdu /data                           # Interactive disk usage tool

# File permissions
chmod 755 script.sh                   # rwxr-xr-x
chmod 600 secrets.env                 # rw------- (owner only)
chmod +x deploy.sh                    # Make executable
chown -R etl_user:etl_group /data    # Change ownership
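
Because `find ... -delete` is irreversible, it can help to list the matches first. The sketch below wraps the `-mtime` test from the commands above in a function. The name `list_stale_files`, the sample file names, and the 7-day cutoff are illustrative assumptions, and the demo runs in a throwaway directory created with `mktemp -d`.

```shell
#!/bin/sh
# List regular files under DIR matching PATTERN whose modification time is
# more than DAYS days in the past. Nothing is deleted.
list_stale_files() {
    dir=$1; pattern=$2; days=$3
    find "$dir" -type f -name "$pattern" -mtime +"$days"
}

# Demo in a temporary directory so no real data is touched.
demo_dir=$(mktemp -d)
touch -t 202001010000 "$demo_dir/old_export.csv"   # mtime forced to 2020-01-01
touch "$demo_dir/fresh_export.csv"                  # mtime is now
list_stale_files "$demo_dir" "*.csv" 7              # prints only old_export.csv
rm -rf "$demo_dir"
```

Once the list looks right, the same `find` expression can be given `-delete`, or piped into an archiving step instead.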

Permission Bits Explained

Permission	Octal	Effect
r (read)	4	Read file or list directory
w (write)	2	Write to file or create/delete in directory
x (execute)	1	Execute file or enter directory
rwxr-xr-x	755	Owner: full; Group and others: read + execute
rw-------	600	Owner: read + write; Group and others: nothing
rwxrwx---	770	Owner and group: full; Others: nothing
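
The octal values in the table are sums of the read (4), write (2), and execute (1) bits, one digit each for owner, group, and others. As a worked illustration, the following sketch converts a three-digit mode to its symbolic form. The helper names `digit_to_rwx` and `octal_to_symbolic` are made up for this example, and special bits (setuid, setgid, sticky) are not handled.

```shell
#!/bin/sh
# Convert one octal digit (0-7) to its rwx triplet by testing each bit.
digit_to_rwx() {
    d=$1
    r=-; w=-; x=-
    if [ $((d & 4)) -ne 0 ]; then r=r; fi
    if [ $((d & 2)) -ne 0 ]; then w=w; fi
    if [ $((d & 1)) -ne 0 ]; then x=x; fi
    printf '%s%s%s' "$r" "$w" "$x"
}

# Convert a three-digit mode such as 755 to rwxr-xr-x.
octal_to_symbolic() {
    mode=$1
    owner=${mode%??}     # first digit
    rest=${mode#?}       # last two digits
    group=${rest%?}      # middle digit
    other=${mode#??}     # last digit
    printf '%s%s%s\n' "$(digit_to_rwx "$owner")" \
        "$(digit_to_rwx "$group")" "$(digit_to_rwx "$other")"
}

for mode in 755 600 770; do
    printf '%s -> %s\n' "$mode" "$(octal_to_symbolic "$mode")"
done
```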

Package Management

# Debian/Ubuntu (apt)
apt update
apt install -y python3-pip postgresql-client
apt upgrade
apt remove package_name

# Red Hat/CentOS (yum/dnf)
dnf install -y python3-pip postgresql
dnf update

# Python packages
pip install --user pandas
pip install --upgrade pip
pip freeze > requirements.txt

Networking Concepts

TCP/IP Model

Layer	Protocol	Data Unit	Data Engineer Relevance
Application	HTTP, DNS, SSH, SMTP	Message	API calls, database connections
Transport	TCP, UDP	Segment	Reliable data transfer, port selection
Internet	IP, ICMP	Packet	Routing, connectivity
Network Access	Ethernet, Wi-Fi	Frame	Physical connectivity

DNS Resolution

# DNS lookup
nslookup mydb.example.com
dig mydb.example.com
dig +short mydb.example.com

# Trace DNS resolution
dig +trace mydb.example.com

# Check /etc/hosts
cat /etc/hosts

# Flush DNS cache (varies by OS)
resolvectl flush-caches             # Linux (systemd-resolved; older releases: systemd-resolve --flush-caches)
sudo dscacheutil -flushcache        # macOS
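
An entry in `/etc/hosts` usually takes precedence over DNS, so a stale entry there can silently override what `dig` reports. The sketch below looks a name up in a hosts-format file. The function name `hosts_lookup` and the sample addresses are assumptions for illustration, and the comparison is case-sensitive, which is stricter than real resolvers.

```shell
#!/bin/sh
# Print the first address that a hosts-format file maps NAME to.
# FILE defaults to /etc/hosts. Prints nothing when there is no match.
hosts_lookup() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v name="$name" '
        { sub(/#.*/, "") }    # strip comments; awk re-splits the fields
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$file"
}

# Demo against a sample file rather than the real /etc/hosts.
sample=$(mktemp)
printf '127.0.0.1 localhost\n10.0.0.5 mydb.example.com mydb # primary\n' > "$sample"
hosts_lookup mydb "$sample"
rm -f "$sample"
```

Calling `hosts_lookup mydb.example.com` with no second argument checks the real `/etc/hosts`; empty output means the name will be resolved through DNS instead.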

Common Ports for Data Engineering

Service	Default Port	Protocol
PostgreSQL	5432	TCP
MySQL	3306	TCP
Redis	6379	TCP
MongoDB	27017	TCP
Kafka	9092	TCP
SSH	22	TCP
HTTP	80	TCP
HTTPS	443	TCP
Jupyter Notebook	8888	TCP
Airflow Web UI	8080	TCP
Spark UI	4040	TCP

HTTP/HTTPS for Data Engineers

# Test connectivity
curl -v https://api.example.com/health
curl -o /dev/null -s -w "%{http_code}\n" https://api.example.com

# Download data
curl -O https://data.example.com/dataset.csv

# POST with JSON body
curl -X POST https://api.example.com/ingest \
    -H "Content-Type: application/json" \
    -d '{"event": "click", "user_id": 123}'

# Check API response time
curl -w "Total time: %{time_total}s\n" -o /dev/null -s https://api.example.com

HTTP Status Codes Reference

Code	Meaning	Data Engineer Action
200	OK	Success — process the response
201	Created	Resource created successfully
400	Bad Request	Fix request body / parameters
401	Unauthorized	Check API key / credentials
403	Forbidden	Check permissions / IAM role
404	Not Found	Check endpoint URL
429	Too Many Requests	Implement backoff / rate limiting
500	Internal Server Error	Retry with exponential backoff
503	Service Unavailable	Service is down — wait and retry
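
The retry guidance in the table can be sketched as two small helpers: one that decides whether a status code is worth retrying, and one that computes an exponentially growing delay. The names `is_retryable` and `backoff_delay`, the 1-second base, and the 60-second cap are assumptions for illustration. Only the codes the table marks for retry are treated as retryable here.

```shell
#!/bin/sh
# Succeed (exit 0) for status codes that the table above says to retry.
is_retryable() {
    case "$1" in
        429|500|503) return 0 ;;
        *)           return 1 ;;
    esac
}

# Print the delay in seconds before retry number ATTEMPT (1-based):
# base * 2^(attempt-1), never more than cap.
backoff_delay() {
    attempt=$1; base=${2:-1}; cap=${3:-60}
    delay=$base
    i=1
    while [ "$i" -lt "$attempt" ]; do
        delay=$((delay * 2))
        if [ "$delay" -ge "$cap" ]; then break; fi
        i=$((i + 1))
    done
    if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
    echo "$delay"
}

for n in 1 2 3 4 5; do
    echo "attempt $n: wait $(backoff_delay "$n")s"
done
```

In a real loop, the status code could come from the `curl -w "%{http_code}"` call shown earlier, with `sleep "$(backoff_delay "$n")"` between attempts. Adding random jitter to the delay is a common refinement that this sketch leaves out.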

Firewall and Security

# UFW (Ubuntu)
ufw status
ufw allow 5432/tcp                  # Allow PostgreSQL
ufw allow from 10.0.0.0/24         # Allow subnet
ufw deny 22/tcp                     # Block SSH (this also cuts off your own remote SSH access)

# iptables (lower-level)
iptables -L -n                      # List rules
iptables -A INPUT -p tcp --dport 5432 -j ACCEPT

# Check open ports
ss -tlnp                            # List listening TCP ports
netstat -tlnp                       # Same (older tool)
lsof -i :5432                       # What process is using port 5432
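
For a default-deny setup, one cautious approach is to generate the rule set and review it before running anything as root. The sketch below only prints `ufw` commands and does not execute them. The function name `ufw_allowlist` is hypothetical, and the port list is an example.

```shell
#!/bin/sh
# Print (do not execute) a ufw rule set that denies inbound traffic by
# default and allows only the TCP ports given as arguments.
ufw_allowlist() {
    echo "ufw default deny incoming"
    echo "ufw default allow outgoing"
    for port in "$@"; do
        echo "ufw allow ${port}/tcp"
    done
    echo "ufw enable"
}

# SSH, HTTP, and PostgreSQL on their default ports.
ufw_allowlist 22 80 5432
```

Keeping the SSH rule ahead of `ufw enable` matters on a remote machine: enabling a default-deny policy without it would end the current session's access.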

Troubleshooting Network Issues

# Connectivity testing
ping -c 4 db-host.example.com       # Basic connectivity
traceroute db-host.example.com      # Network path
telnet db-host.example.com 5432     # Port connectivity
nc -zv db-host.example.com 5432     # Netcat port check

# Connection string testing
psql "postgresql://user:pass@db-host:5432/mydb" -c "SELECT 1;"

# Check DNS resolution order
grep hosts /etc/nsswitch.conf
cat /etc/resolv.conf

# Monitor network traffic
tcpdump -i eth0 port 5432 -n       # Capture PostgreSQL traffic
iftop                                # Live bandwidth monitoring
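
The wording of a client error often indicates which layer to check first. The sketch below maps some common `psql`, `curl`, and `nc` messages to a likely layer and a next command from this section. The function names, the category labels, and the pattern list are illustrative assumptions and far from exhaustive.

```shell
#!/bin/sh
# Map a client error message to the layer most likely at fault.
classify_conn_error() {
    case "$1" in
        *"could not translate host name"*|*"Could not resolve host"*|*"Name or service not known"*)
            echo "dns" ;;
        *"Connection refused"*)
            echo "port" ;;       # name resolved, but nothing accepted the connection
        *"timed out"*)
            echo "network" ;;    # packets dropped: firewall or routing
        *"password authentication failed"*)
            echo "auth" ;;
        *)
            echo "unknown" ;;
    esac
}

# Suggest a next command for each category.
next_check() {
    case "$1" in
        dns)     echo "dig +short HOST" ;;
        port)    echo "nc -zv HOST PORT, and ss -tlnp on the server" ;;
        network) echo "traceroute HOST, then review firewall rules" ;;
        auth)    echo "verify the credentials in the connection string" ;;
        *)       echo "capture traffic with tcpdump" ;;
    esac
}

msg='connection to server at "db-host" (10.0.0.5), port 5432 failed: Connection refused'
layer=$(classify_conn_error "$msg")
echo "$layer: $(next_check "$layer")"
```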

Linux Service Management (systemd)

# Service management
systemctl start postgresql
systemctl stop postgresql
systemctl restart postgresql
systemctl status postgresql
systemctl enable postgresql          # Start on boot
systemctl disable postgresql         # Don't start on boot

# View logs
journalctl -u postgresql             # All logs
journalctl -u postgresql -f          # Follow logs
journalctl -u postgresql --since "1 hour ago"

# Create custom service
cat > /etc/systemd/system/etl-pipeline.service << EOF
[Unit]
Description=ETL Pipeline Service
After=postgresql.service

[Service]
Type=simple
User=etl_user
WorkingDirectory=/opt/etl
ExecStart=/opt/etl/venv/bin/python -m src.pipeline
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl start etl-pipeline
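
When several pipelines need near-identical unit files, the here-document above can be parameterized. The sketch below renders a unit file to stdout from four arguments. The function name `render_unit` and its argument order are assumptions, and the unit keeps the same `After=` dependency and restart policy as the example above.

```shell
#!/bin/sh
# Render a unit file to stdout. The here-document delimiter is unquoted,
# so $name, $user, $workdir and $exec_start are expanded by the shell.
render_unit() {
    name=$1; user=$2; workdir=$3; exec_start=$4
    cat <<EOF
[Unit]
Description=$name
After=postgresql.service

[Service]
Type=simple
User=$user
WorkingDirectory=$workdir
ExecStart=$exec_start
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
}

render_unit "ETL Pipeline Service" etl_user /opt/etl \
    "/opt/etl/venv/bin/python -m src.pipeline"
```

Redirecting the output to a file under `/etc/systemd/system/` (as root) and running `systemctl daemon-reload` would install it, and `systemctl enable` would additionally start the service on boot.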

Best Practices for Data Engineers

Practice	Rationale
Use `systemctl` for service management	Proper startup, shutdown, and restart handling
Monitor with `top` / `htop` / `dstat`	Detect resource bottlenecks early
Check DNS before debugging connectivity	Name-resolution failures are a common cause of connection errors and are quick to rule out
Use named volumes for data persistence	Data written inside a container is lost when the container is removed unless it is stored on a volume
Set up log rotation	Prevent disk space exhaustion from log files
Use firewall rules	Only expose necessary ports
Test with `nc -zv` before writing code	Verify port connectivity before debugging application code
Keep `/etc/hosts` updated	For local development environments

Summary Takeaways

Linux is the standard for data infrastructure: master process management (ps, kill), file operations (find, du), and service management (systemctl).
Understand TCP/IP layers: application-layer debugging requires knowledge of DNS, HTTP, and port-based connectivity.
Check DNS early: many connection failures are name-resolution errors, whereas "connection refused" means the host resolved but nothing accepted the connection on that port.
Know common port assignments: PostgreSQL (5432), MySQL (3306), Redis (6379), SSH (22).
Use systemctl for production services: it provides proper lifecycle management with auto-restart on failure.
Firewalls control access: use ufw or iptables to restrict which ports are exposed.
Monitor resources proactively: top, htop, df -h, and free -h catch issues before they cause pipeline failures.
DNS + port testing = fast troubleshooting: dig and nc -zv narrow down most connectivity issues in seconds.

Practice Exercises

Process management: Write a script that monitors a running Python ETL process and restarts it if it crashes.
Disk cleanup: Create a script that finds and archives log files older than 30 days, then reports freed space.
Network debugging: Debug a failing database connection using dig, nc, telnet, and psql. Document each step.
Firewall setup: Configure ufw to allow only SSH, PostgreSQL, and HTTP on a new server.
systemd service: Create a systemd service file for a data pipeline that starts after PostgreSQL and restarts on failure.
