πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Command Line and Shell Scripting for Data Engineers

Data Engineering FoundationsShell Scripting🟒 Free Lesson

Advertisement

Why the Command Line Matters

The command line is the data engineer's Swiss Army knife. From running pipelines to managing servers, from processing log files to automating backups β€” shell scripting is indispensable.

Shell Pipeline Compositioncat access.logRead raw data|grep "200 OK"Filter lines|awk '{print $4}'Extract field|sort | uniq -cCount unique|head -10Top resultscat access.log | grep "200 OK" | awk '{print $4}' | sort | uniq -c | head -10
Architecture Diagram
+-------------------------------------------------------------+
|              COMMAND LINE USE CASES                         |
+-------------------------------------------------------------+
|  ETL Automation      |  Cron jobs, shell scripts            |
|  Server Management   |  SSH, file transfers, monitoring     |
|  Data Processing     |  grep, awk, sed for quick transforms |
|  Pipeline Control    |  Airflow CLI, spark-submit           |
|  Log Analysis        |  grep, awk, sort, uniq               |
|  Backup & Recovery   |  rsync, tar, compression             |
|  CI/CD               |  Build scripts, deployment           |
|  Docker & K8s        |  Container management                |
+-------------------------------------------------------------+

Theory: How the Shell Works

When you type a command in Bash, the shell performs these steps:

  • Exit code 0 means success; non-zero means failure. Always check $? or use set -e.
  • Environment variables ($PATH, $HOME) control where the shell finds executables and where files go.
  • Signals (SIGTERM, SIGINT) allow graceful shutdown β€” use trap to handle them.

Bash Fundamentals

Variables and Operations

#!/bin/bash

# Variables
NAME="data-pipeline"
DATE=$(date +%Y-%m-%d)
LOG_DIR="/var/log/pipelines"
S3_BUCKET="s3://my-data-lake"

# String operations
echo "Pipeline: $NAME"
echo "Date: $DATE"
echo "Uppercase: ${NAME^^}"
echo "Length: ${#NAME}"

# Arithmetic
COUNT=10
INCREMENT=$((COUNT + 1))
echo "Next: $INCREMENT"

# Conditional
if [ -d "$LOG_DIR" ]; then
    echo "Log directory exists"
else
    mkdir -p "$LOG_DIR"
fi

# Loops
for TABLE in users orders products; do
    echo "Processing table: $TABLE"
    psql -c "SELECT COUNT(*) FROM $TABLE;"
done

# While loop with file
while IFS= read -r line; do
    echo "Processing: $line"
done < tables.txt

File Operations

# Directory operations
mkdir -p /data/raw /data/processed /data/archive

# File operations
touch data.csv
cp data.csv data_backup.csv
mv data.csv archived/
rm -f old_file.csv

# Find files
find /data -name "*.csv" -mtime +7         # CSV files older than 7 days
find /data -size +1G                        # Files larger than 1GB
find /data -name "*.log" -delete            # Delete old log files

# File information
ls -lh /data                               # List with sizes
du -sh /data/*                             # Directory sizes
wc -l data.csv                             # Count lines
head -5 data.csv                           # First 5 lines
tail -5 data.csv                           # Last 5 lines

# Permissions
chmod 755 script.sh                         # Make executable
chmod 600 secrets.env                       # Restrict access
chown -R user:group /data                   # Change ownership

# Symbolic links
ln -s /data/processed/latest.csv current.csv

Piping and Redirection

# Pipe output to next command
cat data.csv | grep "error" | wc -l

# Redirect output to file
echo "log message" >> pipeline.log          # Append
echo "new content" > output.txt             # Overwrite

# Redirect stderr
command 2> errors.log                       # stderr to file
command 2>&1                                # Merge stderr to stdout
command > output.log 2>&1                   # Both to file

# Suppress output
command > /dev/null 2>&1                    # Silent execution

# Here document
cat << EOF > config.yaml
database:
  host: localhost
  port: 5432
  name: production
EOF

# Here string
grep "error" <<< "$LOG_CONTENT"

Piping Reference Table

SyntaxMeaning
`cmd1cmd2`
cmd > fileRedirect stdout to file (overwrite)
cmd >> fileAppend stdout to file
cmd 2> fileRedirect stderr to file
cmd 2>&1Redirect stderr to stdout
cmd &> fileRedirect both stdout and stderr
cmd < fileFeed file as stdin to cmd
`cmd1tee file

grep β€” Pattern Searching

# Basic search
grep "error" application.log

# Case-insensitive
grep -i "error" application.log

# Recursive search
grep -r "TODO" /project/src/

# Count matches
grep -c "error" application.log

# Show line numbers
grep -n "error" application.log

# Invert match (lines NOT containing pattern)
grep -v "debug" application.log

# Regular expressions
grep -E "^[0-9]{4}-[0-9]{2}-[0-9]{2}" data.csv    # Date format
grep -E "^[A-Z][a-z]+" names.txt                   # Starts with uppercase
grep -E "\b[0-9]{3}-[0-9]{3}-[0-9]{4}\b" data.txt # Phone numbers

# Multiple patterns
grep -E "error|warning|critical" application.log

# Context lines
grep -B 2 -A 2 "exception" application.log          # 2 lines before/after

# Perl regex for complex patterns
grep -P "\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}" access.log  # IP addresses

awk β€” Text Processing

# Print specific columns
awk '{print $1, $3}' data.csv                    # Columns 1 and 3

# Custom delimiter
awk -F',' '{print $2, $4}' data.csv              # CSV with comma delimiter

# Conditional filtering
awk -F',' '$3 > 1000 {print $1, $3}' orders.csv  # Amount > 1000

# Calculate sum
awk -F',' '{sum += $3} END {print "Total:", sum}' orders.csv

# Pattern matching
awk '/error/ {print NR, $0}' application.log     # Lines containing "error"

# Built-in variables
awk '{print NR, NF, $0}' data.csv                # Line number, field count, full line

# Complex processing
awk -F',' '
    NR > 1 {  # Skip header
        total[$2] += $3
        count[$2]++
    }
    END {
        for (region in total) {
            printf "%s: $%.2f (avg: $%.2f)\n", region, total[region], total[region]/count[region]
        }
    }
' sales.csv

# Process log files
awk '
    /ERROR/ {errors++}
    /WARNING/ {warnings++}
    /INFO/ {infos++}
    END {
        print "Errors:", errors
        print "Warnings:", warnings
        print "Info:", infos
    }
' application.log

grep vs awk vs sed: When to Use Which

ToolStrengthExample
grepFind lines matching a patterngrep "error" log.txt
sedFind and replace textsed 's/old/new/g' file.txt
awkColumn extraction and aggregationawk -F',' '{print $2}' data.csv
sortSort linessort -k2 -rn data.txt
uniq -cCount duplicates`sort data.txt
cutExtract fixed columnscut -d',' -f1,3 data.csv
trTranslate characterstr '[:upper:]' '[:lower:]'

sed β€” Stream Editing

# Replace text
sed 's/error/warning/g' application.log          # Replace all occurrences
sed -i 's/error/warning/g' application.log       # In-place edit
sed -i.bak 's/error/warning/g' file.txt          # In-place with backup

# Delete lines
sed '/^#/d' config.yaml                          # Remove comments
sed '/^$/d' data.csv                             # Remove empty lines
sed '1d' data.csv                                # Remove header

# Insert/append
sed '3a\New line after line 3' file.txt          # Append after line 3
sed '1i\Header line' file.txt                    # Insert before line 1

# Extract ranges
sed -n '10,20p' data.csv                         # Lines 10-20
sed -n '/START/,/END/p' file.txt                 # Between markers

# Multiple operations
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt

# Address by regex
sed '/^2024-01/d' data.csv                       # Delete lines starting with 2024-01

Practical Data Processing

Log Analysis Script

#!/bin/bash
# Analyze web server access logs

LOG_FILE="$1"
REPORT_DIR="/var/reports/$(date +%Y-%m-%d)"

mkdir -p "$REPORT_DIR"

# Top 10 IP addresses
echo "Top 10 IP Addresses:" > "$REPORT_DIR/access_report.txt"
awk '{print $1}' "$LOG_FILE" | sort | uniq -c | sort -rn | head -10 >> "$REPORT_DIR/access_report.txt"

# Request statistics
echo -e "\nHTTP Status Codes:" >> "$REPORT_DIR/access_report.txt"
awk '{print $9}' "$LOG_FILE" | sort | uniq -c | sort -rn >> "$REPORT_DIR/access_report.txt"

# Top endpoints
echo -e "\nTop Endpoints:" >> "$REPORT_DIR/access_report.txt"
awk '{print $7}' "$LOG_FILE" | sort | uniq -c | sort -rn | head -10 >> "$REPORT_DIR/access_report.txt"

# Bandwidth by IP
echo -e "\nBandwidth by IP (MB):" >> "$REPORT_DIR/access_report.txt"
awk '{ip[$1] += $10} END {for (i in ip) printf "%s: %.2f MB\n", i, ip[i]/1048576}' "$LOG_FILE" | sort -t: -k2 -rn | head -10 >> "$REPORT_DIR/access_report.txt"

echo "Report generated: $REPORT_DIR/access_report.txt"

ETL Shell Script

#!/bin/bash
# Daily ETL pipeline

set -euo pipefail  # Exit on error, undefined vars, pipe failures

# Configuration
SOURCE_DB="postgresql://user:pass@source-host/production"
TARGET_DB="postgresql://user:pass@target-host/analytics"
S3_BUCKET="s3://data-lake"
DATE=$(date +%Y-%m-%d)
LOG_FILE="/var/log/etl/daily_etl_${DATE}.log"

# Logging function
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Error handling
error_exit() {
    log "ERROR: $1"
    log "Pipeline failed at $(date)"
    exit 1
}

trap error_exit ERR

log "Starting daily ETL pipeline"

# Step 1: Extract
log "Step 1: Extracting data from source"
psql "$SOURCE_DB" -c "\COPY orders TO '/tmp/orders_${DATE}.csv' WITH CSV HEADER" || error_exit "Extraction failed"

# Step 2: Validate
log "Step 2: Validating extracted data"
ROW_COUNT=$(wc -l < "/tmp/orders_${DATE}.csv")
if [ "$ROW_COUNT" -lt 2 ]; then
    error_exit "No data extracted"
fi
log "Extracted $((ROW_COUNT - 1)) rows"

# Step 3: Transform
log "Step 3: Transforming data"
python3 << EOF
import pandas as pd
df = pd.read_csv('/tmp/orders_${DATE}.csv')
df = df.drop_duplicates(subset=['order_id'])
df = df[df['amount'] > 0]
df['processed_at'] = '${DATE}'
df.to_csv('/tmp/orders_transformed_${DATE}.csv', index=False)
print(f"Transformed {len(df)} rows")
EOF

# Step 4: Load to S3
log "Step 4: Uploading to S3"
aws s3 cp "/tmp/orders_transformed_${DATE}.csv" "${S3_BUCKET}/bronze/orders/${DATE}/"

# Step 5: Load to warehouse
log "Step 5: Loading to warehouse"
psql "$TARGET_DB" -c "\COPY orders FROM '/tmp/orders_transformed_${DATE}.csv' WITH CSV HEADER" || error_exit "Load failed"

# Step 6: Cleanup
log "Step 6: Cleaning up temporary files"
rm -f /tmp/orders_*.csv

log "Pipeline completed successfully"

Cron Jobs

Cron Syntax

Cron Expression Visual Reference02***MinuteHourDayMonthWeekday0 2 * * *Daily at 2 AM0 */6 * * *Every 6 hours30 1 * * 0Weekly Sun 1:30*/5 * * *Every 5 minutes0 0 1 * *Monthly

Common Cron Patterns

# Edit crontab
crontab -e

# View crontab
crontab -l

# Common schedules
0 2 * * * /path/to/daily_etl.sh          # Daily at 2 AM
0 */6 * * * /path/to/sync.sh             # Every 6 hours
30 1 * * 0 /path/to/weekly_report.sh     # Weekly Sunday 1:30 AM
0 0 1 * * /path/to/monthly_archive.sh    # Monthly at midnight

# Cron with logging
0 2 * * * /path/to/script.sh >> /var/log/script.log 2>&1

# Cron with environment
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin
0 2 * * * /path/to/script.sh

Cron Schedule Reference

PatternMeaning
* * * * *Every minute
0 * * * *Every hour, on the hour
0 2 * * *Daily at 2:00 AM
0 */6 * * *Every 6 hours
30 1 * * 0Weekly, Sunday at 1:30 AM
0 0 1 * *First day of every month, midnight
0 9-17 * * 1-5Every hour 9AM-5PM, Mon-Fri
*/5 * * * *Every 5 minutes

SSH and Remote Operations

# Basic SSH
ssh user@remote-host

# SSH with key
ssh -i ~/.ssh/key.pem user@remote-host

# Execute remote command
ssh user@remote-host "df -h"

# SSH tunnel (port forwarding)
ssh -L 5432:localhost:5432 user@remote-host

# SCP (secure copy)
scp file.csv user@remote-host:/data/
scp user@remote-host:/data/file.csv ./local/

# Rsync (efficient sync)
rsync -avz /local/data/ user@remote-host:/remote/data/
rsync -avz --delete /local/data/ user@remote-host:/remote/data/
rsync -avz --exclude="*.log" /local/data/ user@remote-host:/remote/data/

# SSH config (~/.ssh/config)
Host production
    HostName 192.168.1.100
    User deploy
    IdentityFile ~/.ssh/prod_key
    Port 22

# Use: ssh production

rsync Flags Reference

FlagMeaning
-aArchive mode (preserves permissions, timestamps, symlinks)
-vVerbose output
-zCompress data during transfer
--deleteDelete files on destination that are not on source
--excludeExclude files matching pattern
-nDry run (show what would be transferred)
--progressShow transfer progress
-e sshUse SSH as remote shell

Docker for Data Engineers

# Pull and run containers
docker pull postgres:15
docker run -d --name postgres-dev \
    -e POSTGRES_PASSWORD=secret \
    -p 5432:5432 \
    -v pgdata:/var/lib/postgresql/data \
    postgres:15

# Docker Compose
docker-compose up -d
docker-compose down
docker-compose logs -f

# Execute commands in containers
docker exec -it postgres-dev psql -U postgres

# Build custom image
docker build -t my-etl-tool .
docker run my-etl-tool

# Cleanup
docker system prune -a
docker volume prune

Best Practices for Shell Scripts

PracticeWhy It Matters
set -euo pipefailStops on errors, undefined vars, and pipe failures
trap ERR/EXITHandles cleanup and alerts on failure
Double-quote variablesPrevents word splitting on paths with spaces
Use [[ ]] over [ ]More robust conditional expressions
Prefer $(...) over backticksNestable, more readable command substitution
Log to file and stdouttee -a enables both real-time and persisted output
Validate inputs earlyCheck file existence, argument count, DB connectivity
Clean up temp filesUse trap ... EXIT to remove /tmp artifacts

MathSummary Takeaways

  1. Bash is essential β€” master variables, loops, conditionals, and functions for daily data engineering tasks.
  2. grep, awk, sed are powerful β€” use them for quick data processing, log analysis, and text transformations without loading full programs.
  3. Pipe everything β€” chain small, focused commands for complex transformations (cmd1 | cmd2 | cmd3).
  4. set -euo pipefail β€” always use this in production scripts to prevent silent failures.
  5. Cron automates tasks β€” schedule ETL jobs, maintenance, and reports using cron expressions.
  6. SSH and rsync β€” manage remote servers and synchronize data efficiently over the network.
  7. Shell scripts for ETL β€” combine Python, SQL, and shell commands in orchestrated scripts.
  8. Docker simplifies environments β€” containerize data tools for reproducibility across development and production.

See Also

Practice Exercises

  1. Log parser: Write a shell script that parses a web server log file and generates a report of top 10 IPs, status codes, and bandwidth.

  2. ETL script: Create a shell script that extracts data from PostgreSQL, transforms with awk/sed, and loads to S3.

  3. Cron automation: Set up a cron job that runs a data quality check every hour and sends an email alert on failure.

  4. Remote sync: Write a script that uses rsync to synchronize data between local and remote servers, with logging and error handling.

  5. Docker pipeline: Create a Docker Compose setup with PostgreSQL, Airflow, and a Jupyter notebook for data exploration.

⭐

Premium Content

Command Line and Shell Scripting for Data Engineers

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Data Engineering Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement