Linux and Networking Essentials for Data Engineers
Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Most data infrastructure runs on Linux, and networking knowledge is essential for debugging pipeline connectivity issues.
Overview
Linux Fundamentals
Process Management
# List processes
ps aux # All processes
ps aux | grep python # Find Python processes
top # Interactive process viewer
htop # Better interactive viewer
# Kill processes
kill PID # Send SIGTERM (graceful)
kill -9 PID # Send SIGKILL (force)
pkill -f "etl_pipeline" # Send SIGTERM to every process whose full command line matches the pattern
# Background / foreground
nohup python etl.py & # Run in background, survive logout
jobs # List background jobs
fg %1 # Bring job 1 to foreground
bg %1 # Resume job 1 in background
# System resources
free -h # Memory usage
df -h # Disk usage
du -sh /data/* # Directory sizes
uptime # Load average
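The resource commands above are often wrapped in a small check that a cron job or pipeline pre-flight step can call. The following is a minimal sketch, not a definitive implementation: the function names `usage_pct` and `check_disk` are hypothetical, and it assumes `df` and `awk` are available on the host.

```shell
# usage_pct PATH: prints the integer usage percentage that df reports for PATH.
usage_pct() {
  # -P forces the POSIX one-line-per-filesystem format, so column 5 is the percentage
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_disk PATH LIMIT: returns 0 when usage is below LIMIT percent, 1 otherwise.
check_disk() {
  pct=$(usage_pct "$1")
  case "$pct" in
    ''|*[!0-9]*) echo "could not read usage for $1" >&2; return 2 ;;
  esac
  if [ "$pct" -ge "$2" ]; then
    echo "WARNING: $1 is ${pct}% full (limit ${2}%)"
    return 1
  fi
  echo "OK: $1 is ${pct}% full"
}
```

A call such as `check_disk /data 90 || exit 1` at the top of an ETL script would stop the run before it fills the disk; the path and the 90% limit are illustrative.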
Process Signals Reference
| Signal | Number | Action | Use Case |
|---|---|---|---|
| SIGHUP | 1 | Hangup (terminates by default; many daemons reload config) | Reload Nginx config |
| SIGINT | 2 | Interrupt (Ctrl+C) | Stop running script |
| SIGTERM | 15 | Graceful termination | kill PID default |
| SIGKILL | 9 | Force kill | Last resort, uncatchable |
| SIGUSR1 | 10 | User-defined | Custom handler in your app |
| SIGSTOP | 19 | Pause process | kill -STOP PID |
| SIGCONT | 18 | Resume process | kill -CONT PID |
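Because SIGTERM and SIGINT can be caught while SIGKILL cannot, a long-running script can use `trap` to finish its current unit of work before exiting. This is a sketch under stated assumptions: `on_stop`, `stop_requested`, and `process_batches` are hypothetical names, and the loop body stands in for real ETL work.

```shell
# Flag set by the signal handler; the work loop checks it between batches.
stop_requested=0
on_stop() { stop_requested=1; }
trap on_stop TERM INT

# process_batches N: runs up to N batches and prints how many completed.
process_batches() {
  completed=0
  while [ "$completed" -lt "$1" ]; do
    # Check between batches so a batch is never cut off halfway
    if [ "$stop_requested" -eq 1 ]; then
      break
    fi
    # ... real batch work would go here ...
    completed=$((completed + 1))
  done
  echo "$completed"
}
```

With this pattern, `kill PID` lets the script stop at a batch boundary, while `kill -9 PID` still ends it immediately because the handler never runs.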
File System Navigation
# Directory structure
ls -la # List all with details
ls -lh /var/log # Human-readable sizes
tree -L 2 /app # Tree view (2 levels deep)
# Find files
find /data -name "*.csv" -mtime +7 # CSV files older than 7 days
find /data -size +1G # Files larger than 1GB
find /data -user etl_user # Files owned by user
find /data -name "*.log" -mtime +7 -delete # Delete log files older than 7 days
find /data -type f -perm 0600 # Files with 600 permissions
# Disk usage analysis
du -sh /var/log/* # Size of each file and subdirectory
ncdu /data # Interactive disk usage tool
# File permissions
chmod 755 script.sh # rwxr-xr-x
chmod 600 secrets.env # rw------- (owner only)
chmod +x deploy.sh # Make executable
chown -R etl_user:etl_group /data # Change ownership
Permission Bits Explained
| Permission | Octal | Effect |
|---|---|---|
| r (read) | 4 | Read file or list directory |
| w (write) | 2 | Write to file or create/delete in directory |
| x (execute) | 1 | Execute file or enter directory |
| rwxr-xr-x | 755 | Owner: full; group and others: read + execute |
| rw------- | 600 | Owner: read + write; group and others: nothing |
| rwxrwx--- | 770 | Owner and group: full; Others: nothing |
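Each octal digit is the sum of the bit values in the table (4 + 2 + 1), one digit each for owner, group, and others. The conversion can be sketched as follows; `digit_to_rwx` and `octal_to_symbolic` are hypothetical helper names, and only plain three-digit modes are handled.

```shell
# digit_to_rwx D: expands one octal digit (0-7) into its rwx triplet.
digit_to_rwx() {
  r=-; w=-; x=-
  if [ $(( $1 & 4 )) -ne 0 ]; then r=r; fi
  if [ $(( $1 & 2 )) -ne 0 ]; then w=w; fi
  if [ $(( $1 & 1 )) -ne 0 ]; then x=x; fi
  printf '%s%s%s' "$r" "$w" "$x"
}

# octal_to_symbolic MODE: converts a three-digit mode such as 755 to rwxr-xr-x.
octal_to_symbolic() {
  case "$1" in
    [0-7][0-7][0-7]) ;;
    *) echo "expected three octal digits, got: $1" >&2; return 1 ;;
  esac
  out=""
  # Split the mode into single digits: owner, group, others
  for digit in $(printf '%s' "$1" | sed 's/./& /g'); do
    out="$out$(digit_to_rwx "$digit")"
  done
  printf '%s\n' "$out"
}
```

For example, `octal_to_symbolic 640` prints `rw-r-----`: read and write for the owner, read for the group, nothing for others.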
Package Management
# Debian/Ubuntu (apt)
apt update
apt install -y python3-pip postgresql-client
apt upgrade
apt remove package_name
# Red Hat/CentOS (yum/dnf)
dnf install -y python3-pip postgresql
dnf update
# Python packages
pip install --user pandas
pip install --upgrade pip
pip freeze > requirements.txt
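Setup scripts that must run on both distribution families usually detect the package manager first. A minimal sketch, assuming only the three tools named in this section; `detect_pkg_manager` is a hypothetical function name.

```shell
# detect_pkg_manager: prints the first available package manager, or "unknown".
detect_pkg_manager() {
  for candidate in apt-get dnf yum; do
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  echo "unknown"
  return 1
}
```

A bootstrap script could then branch on the result, for instance running `apt-get install -y postgresql-client` on Debian-family hosts and `dnf install -y postgresql` elsewhere.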
Networking Concepts
TCP/IP Model
| Layer | Protocol | Data Unit | Data Engineer Relevance |
|---|---|---|---|
| Application | HTTP, DNS, SSH, SMTP | Message | API calls, database connections |
| Transport | TCP, UDP | Segment | Reliable data transfer, port selection |
| Internet | IP, ICMP | Packet | Routing, connectivity |
| Network Access | Ethernet, Wi-Fi | Frame | Physical connectivity |
DNS Resolution
# DNS lookup
nslookup mydb.example.com
dig mydb.example.com
dig +short mydb.example.com
# Trace DNS resolution
dig +trace mydb.example.com
# Check /etc/hosts
cat /etc/hosts
# Flush DNS cache (varies by OS)
resolvectl flush-caches # Linux with systemd-resolved (older releases: systemd-resolve --flush-caches)
sudo dscacheutil -flushcache # macOS
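Entries in `/etc/hosts` normally take precedence over DNS, so a stale line there can silently override what `dig` reports. The sketch below looks up a name in a hosts-format file; `hosts_lookup` is a hypothetical helper, and the match is case-sensitive for simplicity even though hostnames are not.

```shell
# hosts_lookup NAME [FILE]: prints the first address mapped to NAME.
# FILE defaults to /etc/hosts; exits with status 1 when NAME is not listed.
hosts_lookup() {
  awk -v name="$1" '
    { sub(/#.*/, "") }                  # drop trailing comments
    {
      for (i = 2; i <= NF; i++)         # fields 2..NF are the hostname and its aliases
        if ($i == name) { print $1; found = 1; exit }
    }
    END { exit (found ? 0 : 1) }
  ' "${2:-/etc/hosts}"
}
```

If `hosts_lookup mydb.example.com` prints an address that differs from `dig +short mydb.example.com`, the hosts file is the likely source of the mismatch.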
Common Ports for Data Engineering
| Service | Default Port | Protocol |
|---|---|---|
| PostgreSQL | 5432 | TCP |
| MySQL | 3306 | TCP |
| Redis | 6379 | TCP |
| MongoDB | 27017 | TCP |
| Kafka | 9092 | TCP |
| SSH | 22 | TCP |
| HTTP | 80 | TCP |
| HTTPS | 443 | TCP |
| Jupyter Notebook | 8888 | TCP |
| Airflow Web UI | 8080 | TCP |
| Spark UI | 4040 | TCP |
HTTP/HTTPS for Data Engineers
# Test connectivity
curl -v https://api.example.com/health
curl -o /dev/null -s -w "%{http_code}\n" https://api.example.com
# Download data
curl -O https://data.example.com/dataset.csv
# POST with JSON body
curl -X POST https://api.example.com/ingest \
-H "Content-Type: application/json" \
-d '{"event": "click", "user_id": 123}'
# Check API response time
curl -w "Total time: %{time_total}s\n" -o /dev/null -s https://api.example.com
HTTP Status Codes Reference
| Code | Meaning | Data Engineer Action |
|---|---|---|
| 200 | OK | Success: process the response |
| 201 | Created | Resource created successfully |
| 400 | Bad Request | Fix request body / parameters |
| 401 | Unauthorized | Check API key / credentials |
| 403 | Forbidden | Check permissions / IAM role |
| 404 | Not Found | Check endpoint URL |
| 429 | Too Many Requests | Implement backoff / rate limiting |
| 500 | Internal Server Error | Retry with exponential backoff |
| 503 | Service Unavailable | Service is down: wait and retry |
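The retry guidance in the table can be sketched in shell. This is an illustration under assumptions, not a definitive client: the function names are hypothetical, 502 and 504 are added to the retryable set as a common convention, and `000` is included because curl reports it when no HTTP response arrives.

```shell
# is_retryable CODE: succeeds for status codes that usually merit another attempt.
is_retryable() {
  case "$1" in
    000|429|500|502|503|504) return 0 ;;
    *) return 1 ;;
  esac
}

# backoff_delay ATTEMPT [BASE] [CAP]: prints BASE * 2^(ATTEMPT-1) seconds, capped at CAP.
backoff_delay() {
  delay=${2:-1}
  cap=${3:-60}
  n=1
  while [ "$n" -lt "$1" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    n=$((n + 1))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay=$cap
  fi
  echo "$delay"
}

# fetch_with_retry URL [MAX_ATTEMPTS]: prints the final status code.
# Returns 0 once a non-retryable code is seen, 1 when attempts run out.
fetch_with_retry() {
  attempt=1
  max=${2:-5}
  while [ "$attempt" -le "$max" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$1" || true)
    if ! is_retryable "$code"; then
      echo "$code"
      return 0
    fi
    sleep "$(backoff_delay "$attempt")"
    attempt=$((attempt + 1))
  done
  echo "$code"
  return 1
}
```

With the defaults, the waits between attempts are 1, 2, 4, 8, and 16 seconds. Production clients often add random jitter to each delay so that many workers do not retry in lockstep.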
Firewall and Security
# UFW (Ubuntu)
ufw status
ufw allow 5432/tcp # Allow PostgreSQL
ufw allow from 10.0.0.0/24 # Allow subnet
ufw deny 22/tcp # Block new SSH connections (avoid on a host you reach only over SSH)
# iptables (lower-level)
iptables -L -n # List rules
iptables -A INPUT -p tcp --dport 5432 -j ACCEPT
# Check open ports
ss -tlnp # List listening TCP ports
netstat -tlnp # Same (older tool)
lsof -i :5432 # What process is using port 5432
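A common way to expose only the necessary ports is a default-deny inbound policy plus an explicit allow list. The sketch below only prints the `ufw` commands so they can be reviewed before anything changes; `ufw_plan` is a hypothetical helper, and it assumes every listed port is TCP.

```shell
# ufw_plan PORT...: prints ufw commands that deny inbound traffic by default
# and allow only the given TCP ports. Nothing is executed.
ufw_plan() {
  # Validate every argument first so a typo never yields a partial plan
  for port in "$@"; do
    case "$port" in
      ''|*[!0-9]*) echo "not a port number: $port" >&2; return 1 ;;
    esac
  done
  echo "ufw default deny incoming"
  echo "ufw default allow outgoing"
  for port in "$@"; do
    echo "ufw allow ${port}/tcp"
  done
  echo "ufw enable"
}
```

Running `ufw_plan 22 5432` shows the plan; once it looks right, the printed lines can be run as root. Keeping the SSH port in the list matters on remote hosts.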
Troubleshooting Network Issues
# Connectivity testing
ping -c 4 db-host.example.com # Basic connectivity
traceroute db-host.example.com # Network path
telnet db-host.example.com 5432 # Port connectivity
nc -zv db-host.example.com 5432 # Netcat port check
# Connection string testing
psql "postgresql://user:pass@db-host:5432/mydb" -c "SELECT 1;"
# Check DNS resolution order
grep hosts /etc/nsswitch.conf
cat /etc/resolv.conf
# Monitor network traffic
tcpdump -i eth0 port 5432 -n # Capture PostgreSQL traffic
iftop # Live bandwidth monitoring
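The error text from a failed client call usually indicates which of the tools above to reach for first. The following sketch maps common `psql` and `curl` messages to a layer; `classify_conn_error` and its category labels are hypothetical, and the patterns cover only a handful of well-known messages.

```shell
# classify_conn_error "MESSAGE": prints dns, port, network, auth, or unknown.
classify_conn_error() {
  case "$1" in
    *"could not translate host name"*|*"Could not resolve host"*|*"Name or service not known"*)
      echo "dns" ;;      # the name never resolved: check dig, /etc/hosts, /etc/resolv.conf
    *"Connection refused"*)
      echo "port" ;;     # host reached, nothing accepted on that port: check ss -tlnp on the server
    *"timed out"*|*"timeout expired"*|*"No route to host"*)
      echo "network" ;;  # packets dropped or unroutable: check firewall rules and traceroute
    *"password authentication failed"*|*"no pg_hba.conf entry"*)
      echo "auth" ;;     # the connection worked: check credentials and server access rules
    *)
      echo "unknown" ;;
  esac
}
```

For instance, `classify_conn_error "$(psql "$DSN" -c 'SELECT 1;' 2>&1)"` would label a failed attempt, where `DSN` is an assumed variable holding the connection string.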
Linux Service Management (systemd)
# Service management
systemctl start postgresql
systemctl stop postgresql
systemctl restart postgresql
systemctl status postgresql
systemctl enable postgresql # Start on boot
systemctl disable postgresql # Don't start on boot
# View logs
journalctl -u postgresql # All logs
journalctl -u postgresql -f # Follow logs
journalctl -u postgresql --since "1 hour ago"
# Create custom service
cat > /etc/systemd/system/etl-pipeline.service << EOF
[Unit]
Description=ETL Pipeline Service
After=postgresql.service
[Service]
Type=simple
User=etl_user
WorkingDirectory=/opt/etl
ExecStart=/opt/etl/venv/bin/python -m src.pipeline
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl start etl-pipeline
Best Practices for Data Engineers
| Practice | Rationale |
|---|---|
| Use `systemctl` for service management | Proper startup, shutdown, and restart handling |
| Monitor with `top` / `htop` / `dstat` | Detect resource bottlenecks early |
| Check DNS before debugging connectivity | Name-resolution failures are common and quick to rule out; a true "connection refused" means the name resolved but nothing accepted the connection on that port |
| Use named volumes for data persistence | Data in a container's writable layer is lost when the container is removed or recreated |
| Set up log rotation | Prevent disk space exhaustion from log files |
| Use firewall rules | Only expose necessary ports |
| Test with `nc -zv` before writing code | Verify port connectivity before debugging application code |
| Keep `/etc/hosts` updated | For local development environments |
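Log rotation on most Linux distributions is handled by `logrotate`, which reads one stanza per log path. The sketch below writes such a stanza; `write_logrotate_conf` is a hypothetical helper, and the retention values and the log path in the usage note are illustrative assumptions.

```shell
# write_logrotate_conf DEST LOG_GLOB: writes a logrotate stanza for LOG_GLOB to DEST.
write_logrotate_conf() {
  cat > "$1" <<EOF
$2 {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
EOF
}
```

For example, `write_logrotate_conf /etc/logrotate.d/etl-pipeline "/var/log/etl/*.log"` (run as root) keeps 14 daily compressed archives. `copytruncate` suits processes that keep their log file open, at the cost of possibly losing a few lines written during the copy.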
Summary Takeaways
- Linux is the standard for data infrastructure: master process management (`ps`, `kill`), file operations (`find`, `du`), and service management (`systemctl`).
- Understand TCP/IP layers: application-layer debugging requires knowledge of DNS, HTTP, and port-based connectivity.
- DNS is the first thing to check: many connection failures are really name-resolution failures, whereas "connection refused" means the name resolved but nothing accepted the connection on that port.
- Know common port assignments: PostgreSQL (5432), MySQL (3306), Redis (6379), SSH (22).
- Use `systemctl` for production services: proper lifecycle management with auto-restart on failure.
- Firewalls control access: use `ufw` or `iptables` to restrict which ports are exposed.
- Monitor resources proactively: `top`, `htop`, `df -h`, and `free -h` catch issues before they cause pipeline failures.
- DNS + port testing = fast troubleshooting: `dig` + `nc -zv` resolve most connectivity issues in seconds.
See Also
- What is Data Engineering: Introduction to data engineering
- Command Line & Shell Scripting: Bash fundamentals
- Docker for Data Engineers: Containerizing data pipelines
- Version Control with Git: Git for data engineers
- Cloud Platforms Overview: AWS, GCP, and Azure comparison
Practice Exercises
- Process management: Write a script that monitors a running Python ETL process and restarts it if it crashes.
- Disk cleanup: Create a script that finds and archives log files older than 30 days, then reports freed space.
- Network debugging: Debug a failing database connection using `dig`, `nc`, `telnet`, and `psql`. Document each step.
- Firewall setup: Configure `ufw` to allow only SSH, PostgreSQL, and HTTP on a new server.
- systemd service: Create a systemd service file for a data pipeline that starts after PostgreSQL and restarts on failure.