Text Processing in Linux: grep, awk, and Pipes That Actually Get Work Done
The Problem: Manually Searching Through Files
You need to find all error messages in a 10,000-line log file. Or extract usernames from system files. Or count how many times a specific IP address appears in access logs.
Opening the file in an editor and searching manually? That’s slow and error-prone.
Linux text processing tools turn these tasks into one-line commands.
The cut Command: Extract Columns
cut extracts specific characters or fields from each line.
By Character Position
# Get first character from each line
cut -c1 file.txt
# Get characters 1-3
cut -c1-3 file.txt
# Get characters 1, 2, and 4
cut -c1,2,4 file.txt
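By Delimited Field
cut can also split on a delimiter with -d and pick fields with -f:
# Get the first colon-separated field from /etc/passwd (usernames)
cut -d: -f1 /etc/passwd
# Get fields 1 and 3 (username and UID)
cut -d: -f1,3 /etc/passwd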
Real Example: Extract File Permissions
ls -l | cut -c1-10
# Output: drwxr-xr-x, -rw-r--r--, etc.
The awk Command: Pattern Scanning and Processing
awk is powerful for extracting and manipulating fields (columns).
Basic Field Extraction
# Print first column
awk '{print $1}' file.txt
# Print first and third columns
awk '{print $1, $3}' file.txt
# Print last column (NF = number of fields)
ls -l | awk '{print $NF}'
# Shows filenames from ls -l output
Search and Print
# Find lines containing "Jerry" and print them
awk '/Jerry/ {print}' file.txt
# Or shorter:
awk '/Jerry/' file.txt
Change Field Delimiter
By default, awk splits fields on whitespace (spaces and tabs). Use -F to change the delimiter:
# Use colon as delimiter (common in /etc/passwd)
awk -F: '{print $1}' /etc/passwd
# Output: List of all usernames
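awk can also filter on a field's value before printing. For example, regular user accounts conventionally start at UID 1000 (the exact cutoff varies by distribution):
# Print usernames whose third field (UID) is 1000 or higher
awk -F: '$3 >= 1000 {print $1}' /etc/passwd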
Modify Fields
# Replace second field with "JJ"
echo "Hello Tom" | awk '{$2="JJ"; print $0}'
# Output: Hello JJ
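awk handles arithmetic across lines too. A minimal sketch that totals the size column of ls -l:
# Sum the 5th column (size in bytes) and print the total at the end
ls -l | awk '{sum += $5} END {print sum, "bytes"}'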
Filter by Length
# Get lines longer than 15 characters
awk 'length($0) > 15' file.txt
Real-World Example: Extract IP Addresses
# Get IP addresses from access log
awk '{print $1}' /var/log/nginx/access.log
# Count requests per unique IP
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c
The grep Command: Search Text
grep (global regular expression print) searches for keywords in files or output.
Basic Search
# Find keyword in file
grep keyword filename
# Search in output
ls -l | grep Desktop
Useful Flags
# Count occurrences
grep -c keyword file.txt
# Ignore case
grep -i keyword file.txt
# Finds: keyword, Keyword, KEYWORD
# Show line numbers
grep -n keyword file.txt
# Output: 5:line with keyword
# Exclude lines with keyword (invert match)
grep -v keyword file.txt
Real-World Example: Find Errors in Logs
# Find all error lines
grep -i error /var/log/syslog
# Count errors
grep -i error /var/log/syslog | wc -l
# Find errors but exclude specific ones
grep -i error /var/log/syslog | grep -v "ignore_this_error"
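Since grep counts matching lines itself (the -c flag from above), the wc -l pipe can be shortened:
# Same count, no wc needed
grep -ic error /var/log/syslog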
The egrep Command: Multiple Keywords
egrep (equivalent to grep -E) searches for multiple patterns at once using extended regular expressions.
# Search for keyword1 OR keyword2
egrep -i "keyword1|keyword2" file.txt
# Find lines with error or warning
egrep -i "error|warning" /var/log/syslog
The sort Command: Alphabetical Ordering
# Sort alphabetically
sort file.txt
# Reverse sort
sort -r file.txt
# Sort by second field
sort -k2 file.txt
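To see field sorting in action, here is a quick demo with three fabricated lines:
printf 'alice 30\nbob 12\ncarol 25\n' | sort -k2 -n
# Output:
# bob 12
# carol 25
# alice 30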
Real Example: Sort By File Size
# Sort files by size (5th column in ls -l)
ls -l | sort -k5 -n
# -n flag for numerical sort
The uniq Command: Remove Duplicates
uniq filters out repeated lines, but only when they are adjacent. Important: sort the input first.
# Remove duplicates
sort file.txt | uniq
# Count duplicates
sort file.txt | uniq -c
# Output: 3 line_content (appears 3 times)
# Show only duplicates
sort file.txt | uniq -d
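To see why sorting matters, compare unsorted and sorted input (uniq only collapses adjacent duplicates):
printf 'a\nb\na\n' | uniq
# Output: a, b, a (the two a's aren't adjacent, so both remain)
printf 'a\nb\na\n' | sort | uniq
# Output: a, b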
Real Example: Most Common Log Entries
# Find most common errors
grep error /var/log/syslog | sort | uniq -c | sort -rn | head -10
Breaking it down:
- grep error – Find error lines
- sort – Sort so duplicates are together
- uniq -c – Count duplicates
- sort -rn – Sort by count (reverse numerical)
- head -10 – Show top 10
The wc Command: Count Lines, Words, Bytes
wc (word count) counts lines, words, and bytes in files or input.
# Count lines, words, bytes
wc file.txt
# Output: 45 300 2000 file.txt
# Only lines
wc -l file.txt
# Only words
wc -w file.txt
# Only bytes
wc -c file.txt
Real Examples
# Count entries in a directory listing (ls -l adds a 'total' line, so subtract 1)
ls -l | wc -l
# Count how many times keyword appears
grep keyword file.txt | wc -l
# Count total lines of code in Python files
find . -name "*.py" -exec wc -l {} \; | awk '{sum+=$1} END {print sum}'
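An alternative is to concatenate the files and count once; using -print0 with xargs -0 keeps filenames with spaces intact:
# Same total with a single wc invocation
find . -name "*.py" -print0 | xargs -0 cat | wc -l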
Comparing Files
diff: Line-by-Line Comparison
# Compare files
diff file1.txt file2.txt
# Output shows differences:
# < line in file1
# > line in file2
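A minimal worked example with two throwaway files:
printf 'apple\nbanana\n' > file1.txt
printf 'apple\ncherry\n' > file2.txt
diff file1.txt file2.txt
# Output:
# 2c2
# < banana
# ---
# > cherry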
cmp: Byte-by-Byte Comparison
# Compare files
cmp file1.txt file2.txt
# Output: first byte that differs
# No output if files are identical
Combining Commands with Pipes
The real power comes from chaining commands together.
Example 1: Find and Count
# How many users have /bin/bash as their shell?
grep "/bin/bash" /etc/passwd | wc -l
Example 2: Top 5 Largest Files
ls -lh | sort -k5 -h -r | head -5
Example 3: Extract and Sort
# Get all usernames and sort them
awk -F: '{print $1}' /etc/passwd | sort
Example 4: Search, Extract, Count
# Find IP addresses that accessed /admin
grep "/admin" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn
This shows which IPs hit /admin most frequently.
Example 5: Log Analysis
# Find most common error types
grep -i error /var/log/app.log | awk '{print $5}' | sort | uniq -c | sort -rn | head -10
Real-World Scenarios
Scenario 1: Find Large Files
# Files larger than 100MB
find / -type f -size +100M 2>/dev/null | xargs ls -lh | awk '{print $5, $NF}'
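If GNU find is available, its -printf option skips the xargs round-trip and avoids problems with spaces in paths:
# Size in bytes, then path, largest first
find / -type f -size +100M -printf '%s %p\n' 2>/dev/null | sort -rn | head -10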
Scenario 2: Monitor Active Connections
# Count connections per IP
netstat -an | grep ESTABLISHED | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn
Scenario 3: Check Failed Login Attempts
# Count failed SSH attempts by IP
grep "Failed password" /var/log/auth.log | awk '{print $11}' | sort | uniq -c | sort -rn
Scenario 4: Disk Usage by Directory
# Top 10 directories by size
du -h /var | sort -h -r | head -10
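Because du recurses into every level, parent directories dominate the listing; GNU du's --max-depth restricts it to immediate subdirectories (assuming GNU du):
du -h --max-depth=1 /var 2>/dev/null | sort -hr | head -10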
Scenario 5: Extract Email Addresses
# Find all email addresses in file
grep -Eo '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' file.txt | sort | uniq
Common Patterns
Pattern 1: Search, Extract, Sort, Count
grep pattern file | awk '{print $2}' | sort | uniq -c | sort -rn
Pattern 2: Filter and Process
grep -v exclude_pattern file | awk '{print $1}'
Pattern 3: Multiple Conditions
egrep "error|warning" file | grep -v "ignore" | wc -l
Quick Reference
cut
cut -c1-3 file # Characters 1-3
cut -d: -f1 file # First field (delimiter :)
awk
awk '{print $1}' file # First column
awk -F: '{print $1}' file # Custom delimiter
awk '/pattern/ {print}' file # Pattern matching
awk '{print $NF}' file # Last column
awk 'length($0) > 15' file # Lines > 15 chars
grep
grep pattern file # Search
grep -i pattern file # Ignore case
grep -c pattern file # Count
grep -n pattern file # Line numbers
grep -v pattern file # Invert (exclude)
egrep "pat1|pat2" file # Multiple patterns
sort
sort file # Alphabetical
sort -r file # Reverse
sort -k2 file # By second field
sort -n file # Numerical
uniq
sort file | uniq # Remove duplicates
sort file | uniq -c # Count occurrences
sort file | uniq -d # Show only duplicates
wc
wc file # Lines, words, bytes
wc -l file # Lines only
wc -w file # Words only
wc -c file # Bytes only
diff/cmp
diff file1 file2 # Line comparison
cmp file1 file2 # Byte comparison
Tips for Efficiency
Tip 1: Use pipes instead of temporary files
# Instead of:
grep pattern file > temp.txt
sort temp.txt > sorted.txt
# Do:
grep pattern file | sort
Tip 2: Combine grep with awk
# Filter then extract
grep error log.txt | awk '{print $1, $5}'
Tip 3: Use awk instead of multiple cuts
# Instead of:
cut -d: -f1 file | cut -d- -f1
# Do (one awk call with a regex delimiter):
awk -F'[-:]' '{print $1}' file
Tip 4: Test patterns on small samples first
# Test on first 10 lines
head -10 large_file.txt | grep pattern
Key Takeaways
- cut – Extract characters or fields
- awk – Process fields, pattern matching, calculations
- grep – Search for patterns
- egrep – Search multiple patterns
- sort – Order lines
- uniq – Remove duplicates (must sort first)
- wc – Count lines, words, bytes
- Pipes (|) – Chain commands together
- diff/cmp – Compare files
These commands aren’t just for showing off. They solve real problems:
- Analyzing logs
- Extracting data
- Monitoring systems
- Processing reports
- Debugging issues
Master these tools and manual file searching becomes a thing of the past.
What text processing task do you do most often? Share your go-to command combinations in the comments.