
Linux and Networking Essentials for Data Engineers

Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Most data infrastructure runs on Linux, and networking knowledge is essential for debugging pipeline connectivity issues.

[Figure: TCP/IP stack diagram. Application layer (HTTP, HTTPS, SSH, DNS, SMTP, FTP); Transport layer (TCP reliable, UDP fast, ports); Network layer (IP addressing, routing, ICMP); Link layer (Ethernet, MAC addresses, switches). Example labels: ports 80, 443, 22, 5432; IP address 192.168.1.1; MAC address AA:BB:CC:DD:EE.]

Overview

Linux Fundamentals

Process Management

# List processes
ps aux                              # All processes
ps aux | grep python                # Find Python processes
top                                 # Interactive process viewer
htop                                # Better interactive viewer

# Kill processes
kill PID                            # Send SIGTERM (graceful)
kill -9 PID                         # Send SIGKILL (force)
pkill -f "etl_pipeline"            # Kill processes whose full command line matches the pattern

# Background / foreground
nohup python etl.py &              # Run in background, survive logout
jobs                                # List background jobs
fg %1                               # Bring job 1 to foreground
bg %1                               # Resume job 1 in background

# System resources
free -h                             # Memory usage
df -h                               # Disk usage
du -sh /data/*                      # Directory sizes
uptime                              # Load average
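
The resource commands above are easy to wrap in a pre-flight check. The following is a minimal sketch, not part of the original lesson: the function names `disk_usage_pct` and `check_disk` and the default threshold of 90% are illustrative assumptions. It reads the use percentage from `df -P` so a pipeline can refuse to start on a nearly full disk.

```shell
#!/usr/bin/env bash
# Print the used-space percentage (integer, no % sign) of the filesystem holding $1.
# df -P forces the POSIX one-line-per-filesystem format, so field 5 is the capacity column.
disk_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 if usage is below the threshold, 1 if at or above it, 2 if usage is unreadable.
# The default threshold of 90 is an arbitrary example value.
check_disk() {
    local path="$1" threshold="${2:-90}" used
    used="$(disk_usage_pct "$path")"
    case "$used" in
        ''|*[!0-9]*)
            echo "could not read disk usage for $path" >&2
            return 2
            ;;
    esac
    if [ "$used" -ge "$threshold" ]; then
        echo "WARN: $path is ${used}% full (threshold ${threshold}%)" >&2
        return 1
    fi
    echo "OK: $path is ${used}% full"
}
```

A pipeline wrapper could then run something like `check_disk /data 85 && python etl.py`, so the job is skipped when the disk is too full.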

Process Signals Reference

Signal | Number (Linux x86/ARM) | Action | Use Case
SIGHUP | 1 | Reload config | Reload Nginx config
SIGINT | 2 | Interrupt (Ctrl+C) | Stop running script
SIGTERM | 15 | Graceful termination | kill PID default
SIGKILL | 9 | Force kill | Last resort, uncatchable
SIGUSR1 | 10 | User-defined | Custom handler in your app
SIGSTOP | 19 | Pause process | kill -STOP PID
SIGCONT | 18 | Resume process | kill -CONT PID
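
The usual way to combine SIGTERM and SIGKILL is to ask politely first and force only after a timeout. The sketch below illustrates that pattern; the function name `stop_gracefully` and the 10-second default are assumptions made for the example, not something the lesson prescribes.

```shell
#!/usr/bin/env bash
# Send SIGTERM, wait up to TIMEOUT seconds for the process to exit, then send SIGKILL.
# Usage: stop_gracefully PID [TIMEOUT_SECONDS]   (default of 10 is an arbitrary example)
stop_gracefully() {
    local pid="$1" timeout="${2:-10}" waited=0
    # If SIGTERM cannot be delivered, the process is already gone (or is not ours to signal).
    kill -TERM "$pid" 2>/dev/null || return 0
    # kill -0 delivers no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "PID $pid still running after ${timeout}s; sending SIGKILL" >&2
            kill -KILL "$pid" 2>/dev/null || true
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

For example, `stop_gracefully "$(pgrep -f etl_pipeline | head -n 1)" 30` would give a matching process 30 seconds to finish its current batch before it is killed.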

File System Navigation

# Directory structure
ls -la                              # List all with details
ls -lh /var/log                    # Human-readable sizes
tree -L 2 /app                     # Tree view (2 levels deep)

# Find files
find /data -name "*.csv" -mtime +7         # CSV files older than 7 days
find /data -size +1G                        # Files larger than 1GB
find /data -user etl_user                   # Files owned by user
find /data -name "*.log" -mtime +30 -delete  # Delete log files older than 30 days (without -mtime, ALL .log files are deleted)
find /data -type f -perm 0600               # Files with 600 permissions

# Disk usage analysis
du -sh /var/log/*                    # Size of each file and subdirectory
ncdu /data                           # Interactive disk usage tool

# File permissions
chmod 755 script.sh                   # rwxr-xr-x
chmod 600 secrets.env                 # rw------- (owner only)
chmod +x deploy.sh                    # Make executable
chown -R etl_user:etl_group /data    # Change ownership

Permission Bits Explained

Permission | Octal | Effect
r (read) | 4 | Read file or list directory
w (write) | 2 | Write to file or create/delete in directory
x (execute) | 1 | Execute file or enter directory
rwxr-xr-x | 755 | Owner: full; Group and others: read + execute
rw------- | 600 | Owner: read + write; Group and others: nothing
rwxrwx--- | 770 | Owner and group: full; Others: nothing
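
Each octal digit is the sum of the r (4), w (2), and x (1) bits for one class: owner, group, then others. The helper below works through that arithmetic. It is a small sketch with a hypothetical function name (`sym_to_octal`), and it handles only plain r/w/x/- characters, not setuid, setgid, or sticky bits.

```shell
#!/usr/bin/env bash
# Convert a 9-character symbolic mode such as rwxr-xr-x to its octal form (755).
sym_to_octal() {
    local sym="$1" out="" digit=0 i=0 ch
    # Validate length and that each position holds its own letter or a dash.
    case "$sym" in
        [r-][w-][x-][r-][w-][x-][r-][w-][x-]) ;;
        *) echo "expected a mode like rwxr-xr-x, got: $sym" >&2; return 1 ;;
    esac
    while [ "$i" -lt 9 ]; do
        # cut -c is 1-based; using it keeps the function portable to plain sh.
        ch="$(printf '%s' "$sym" | cut -c "$((i + 1))")"
        case "$ch" in
            r) digit=$((digit + 4)) ;;
            w) digit=$((digit + 2)) ;;
            x) digit=$((digit + 1)) ;;
        esac
        i=$((i + 1))
        # Every third character closes one class (owner, group, others).
        if [ $((i % 3)) -eq 0 ]; then
            out="${out}${digit}"
            digit=0
        fi
    done
    echo "$out"
}
```

Running `sym_to_octal rw-r-----` prints 640: owner 4+2, group 4, others 0.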

Package Management

# Debian/Ubuntu (apt)
apt update
apt install -y python3-pip postgresql-client
apt upgrade
apt remove package_name

# Red Hat/CentOS (yum/dnf)
dnf install -y python3-pip postgresql
dnf update

# Python packages
pip install --user pandas
pip install --upgrade pip
pip freeze > requirements.txt

Networking Concepts

TCP/IP Model

Layer | Protocol | Data Unit | Data Engineer Relevance
Application | HTTP, DNS, SSH, SMTP | Message | API calls, database connections
Transport | TCP, UDP | Segment | Reliable data transfer, port selection
Internet | IP, ICMP | Packet | Routing, connectivity
Network Access | Ethernet, Wi-Fi | Frame | Physical connectivity

DNS Resolution

# DNS lookup
nslookup mydb.example.com
dig mydb.example.com
dig +short mydb.example.com

# Trace DNS resolution
dig +trace mydb.example.com

# Check /etc/hosts
cat /etc/hosts

# Flush DNS cache (varies by OS)
resolvectl flush-caches             # Linux (systemd-resolved); older releases: systemd-resolve --flush-caches
sudo dscacheutil -flushcache        # macOS
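
One detail worth knowing when comparing these tools: `dig` and `nslookup` query DNS servers directly and ignore /etc/hosts, while most applications resolve names through the system resolver configured in /etc/nsswitch.conf. On Linux, `getent hosts` follows that same path. The helper below is a minimal sketch; the function name `resolve_host` is an assumption made for the example.

```shell
#!/usr/bin/env bash
# Resolve a hostname through the system resolver (NSS), which typically consults
# /etc/hosts first and then DNS. Prints the first address found, or returns 1.
resolve_host() {
    local host="$1" addr
    addr="$(getent hosts "$host" | awk 'NR == 1 { print $1 }')"
    if [ -z "$addr" ]; then
        echo "cannot resolve $host via the system resolver" >&2
        return 1
    fi
    echo "$addr"
}
```

If `dig +short mydb.example.com` and `resolve_host mydb.example.com` disagree, an /etc/hosts entry or a local cache is the likely cause.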

Common Ports for Data Engineering

Service | Default Port | Protocol
PostgreSQL | 5432 | TCP
MySQL | 3306 | TCP
Redis | 6379 | TCP
MongoDB | 27017 | TCP
Kafka | 9092 | TCP
SSH | 22 | TCP
HTTP | 80 | TCP
HTTPS | 443 | TCP
Jupyter Notebook | 8888 | TCP
Airflow Web UI | 8080 | TCP
Spark UI | 4040 | TCP

HTTP/HTTPS for Data Engineers

# Test connectivity
curl -v https://api.example.com/health
curl -o /dev/null -s -w "%{http_code}\n" https://api.example.com

# Download data
curl -O https://data.example.com/dataset.csv

# POST with JSON body
curl -X POST https://api.example.com/ingest \
    -H "Content-Type: application/json" \
    -d '{"event": "click", "user_id": 123}'

# Check API response time
curl -w "Total time: %{time_total}s\n" -o /dev/null -s https://api.example.com

HTTP Status Codes Reference

Code | Meaning | Data Engineer Action
200 | OK | Success; process the response
201 | Created | Resource created successfully
400 | Bad Request | Fix request body / parameters
401 | Unauthorized | Check API key / credentials
403 | Forbidden | Check permissions / IAM role
404 | Not Found | Check endpoint URL
429 | Too Many Requests | Implement backoff / rate limiting
500 | Server Error | Retry with exponential backoff
503 | Service Unavailable | Service is down; wait and retry
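
The retry advice in the table can be sketched in shell. Everything named here is an assumption made for illustration: the functions `http_check` and `retry_with_backoff`, and the convention that exit status 1 means "transient, retry" while 2 means "permanent, stop". The delay doubles after each failed attempt.

```shell
#!/usr/bin/env bash
# Hypothetical probe: exit 0 on 2xx, 1 on codes worth retrying, 2 on everything else.
http_check() {
    local code
    code="$(curl -s -o /dev/null -w '%{http_code}' "$1" || true)"
    case "$code" in
        2[0-9][0-9]) return 0 ;;
        429|5[0-9][0-9]|000) return 1 ;;   # 000: curl received no HTTP response at all
        *) return 2 ;;                     # e.g. 400, 401, 403, 404: fix the request instead
    esac
}

# Run a command up to MAX_ATTEMPTS times, doubling the delay after each failure.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
retry_with_backoff() {
    local max_attempts="$1" delay="$2" attempt=1 status
    shift 2
    while true; do
        if "$@"; then
            return 0
        else
            status=$?
        fi
        if [ "$status" -eq 2 ]; then
            echo "permanent failure, not retrying: $*" >&2
            return 2
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}
```

A call such as `retry_with_backoff 5 1 http_check https://api.example.com/health` would wait 1, 2, 4, and 8 seconds between its five attempts. A production version would usually add random jitter and honor a Retry-After header on 429 responses.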

Firewall and Security

# UFW (Ubuntu)
ufw status
ufw allow 5432/tcp                  # Allow PostgreSQL
ufw allow from 10.0.0.0/24         # Allow subnet
ufw deny 22/tcp                     # Block SSH (locks you out if SSH is your only access)

# iptables (lower-level)
iptables -L -n                      # List rules
iptables -A INPUT -p tcp --dport 5432 -j ACCEPT

# Check open ports
ss -tlnp                            # List listening TCP ports
netstat -tlnp                       # Same (older tool)
lsof -i :5432                       # What process is using port 5432

Troubleshooting Network Issues

# Connectivity testing
ping -c 4 db-host.example.com       # Basic connectivity
traceroute db-host.example.com      # Network path
telnet db-host.example.com 5432     # Port connectivity
nc -zv db-host.example.com 5432     # Netcat port check

# Connection string testing
psql "postgresql://user:pass@db-host:5432/mydb" -c "SELECT 1;"

# Check DNS resolution order
grep hosts /etc/nsswitch.conf
cat /etc/resolv.conf

# Monitor network traffic
tcpdump -i eth0 port 5432 -n       # Capture PostgreSQL traffic
iftop                                # Live bandwidth monitoring
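
The checks above are most useful when run in layer order: name resolution first, then the TCP port. The following sketch strings them together. The function names `port_open` and `diagnose` are hypothetical, and the port test relies on bash's `/dev/tcp` pseudo-device plus the coreutils `timeout` command, so it also works on hosts where `nc` and `telnet` are not installed.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to HOST:PORT succeeds within SECS seconds (default 3).
port_open() {
    local host="$1" port="$2" secs="${3:-3}"
    # /dev/tcp/HOST/PORT is handled by bash itself; opening it attempts a TCP connect.
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Report the first failing layer: returns 1 for name resolution, 2 for the TCP port.
diagnose() {
    local host="$1" port="$2"
    if ! getent hosts "$host" >/dev/null; then
        echo "DNS: cannot resolve $host (check /etc/hosts and /etc/resolv.conf)"
        return 1
    fi
    echo "DNS: $host resolves"
    if ! port_open "$host" "$port"; then
        echo "TCP: $host:$port refused or timed out (check the service and firewall rules)"
        return 2
    fi
    echo "TCP: $host:$port accepts connections"
}
```

Running `diagnose db-host.example.com 5432` before reaching for `psql` separates a naming problem from a closed or filtered port.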

Linux Service Management (systemd)

# Service management
systemctl start postgresql
systemctl stop postgresql
systemctl restart postgresql
systemctl status postgresql
systemctl enable postgresql          # Start on boot
systemctl disable postgresql         # Don't start on boot

# View logs
journalctl -u postgresql             # All logs
journalctl -u postgresql -f          # Follow logs
journalctl -u postgresql --since "1 hour ago"

# Create custom service (run as root; quoting 'EOF' stops the shell from expanding $ inside the unit file)
cat > /etc/systemd/system/etl-pipeline.service << 'EOF'
[Unit]
Description=ETL Pipeline Service
After=postgresql.service

[Service]
Type=simple
User=etl_user
WorkingDirectory=/opt/etl
ExecStart=/opt/etl/venv/bin/python -m src.pipeline
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl start etl-pipeline

Best Practices for Data Engineers

Practice | Rationale
Use systemctl for service management | Proper startup, shutdown, and restart handling
Monitor with top / htop / dstat | Detect resource bottlenecks early
Check DNS before debugging connectivity | Resolution failures ("could not resolve host") are quick to rule out; "connection refused" means the name resolved but nothing accepted the connection on that port
Use named volumes for data persistence | Data in a container's writable layer is lost when the container is removed or recreated
Set up log rotation | Prevent disk space exhaustion from log files
Use firewall rules | Only expose necessary ports
Test with nc -zv before writing code | Verify port connectivity before debugging application code
Keep /etc/hosts updated | For local development environments

Summary Takeaways

  1. Linux is the standard for data infrastructure: master process management (ps, kill), file operations (find, du), and service management (systemctl).
  2. Understand TCP/IP layers: application-layer debugging requires knowledge of DNS, HTTP, and port-based connectivity.
  3. DNS is the first thing to check: resolution failures are quick to rule out with dig. "Connection refused" is not a DNS error; it means the host was reached but nothing is listening on the port, or a firewall rejected the connection.
  4. Know common port assignments: PostgreSQL (5432), MySQL (3306), Redis (6379), SSH (22).
  5. Use systemctl for production services: proper lifecycle management with auto-restart on failure.
  6. Firewalls control access: use ufw or iptables to restrict which ports are exposed.
  7. Monitor resources proactively: top, htop, df -h, and free -h catch issues before they cause pipeline failures.
  8. DNS + port testing = fast troubleshooting: dig and nc -zv narrow down most connectivity issues in seconds.

Practice Exercises

  1. Process management: Write a script that monitors a running Python ETL process and restarts it if it crashes.

  2. Disk cleanup: Create a script that finds and archives log files older than 30 days, then reports freed space.

  3. Network debugging: Debug a failing database connection using dig, nc, telnet, and psql. Document each step.

  4. Firewall setup: Configure ufw to allow only SSH, PostgreSQL, and HTTP on a new server.

  5. systemd service: Create a systemd service file for a data pipeline that starts after PostgreSQL and restarts on failure.
