Text Processing in Linux: grep, awk, and Pipes That Actually Get Work Done

The Problem: Manually Searching Through Files

You need to find all error messages in a 10,000-line log file. Or extract usernames from system files. Or count how many times a specific IP address appears in access logs.

Opening the file in an editor and searching manually? That’s slow and error-prone.

Linux text processing tools turn these tasks into one-line commands.
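
For example, counting how often one IP address shows up in an access log becomes a single command (the address and path here are illustrative):

grep -c "203.0.113.7" /var/log/nginx/access.log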

The cut Command: Extract Columns

cut extracts specific characters or fields from each line.

By Character Position

# Get first character from each line
cut -c1 file.txt

# Get characters 1-3
cut -c1-3 file.txt

# Get characters 1, 2, and 4
cut -c1,2,4 file.txt

Real Example: Extract File Permissions

ls -l | cut -c1-10
# Output: drwxr-xr-x, -rw-r--r--, etc.
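
By Field

cut can also split on a delimiter instead of character positions: -d sets the delimiter, -f picks the field.

# First field of each colon-separated line (usernames)
cut -d: -f1 /etc/passwd

# Username and login shell
cut -d: -f1,7 /etc/passwd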

The awk Command: Pattern Scanning and Processing

awk is powerful for extracting and manipulating fields (columns).

Basic Field Extraction

# Print first column
awk '{print $1}' file.txt

# Print first and third columns
awk '{print $1, $3}' file.txt

# Print last column (NF = number of fields)
ls -l | awk '{print $NF}'
# Shows filenames from ls -l output

Search and Print

# Find lines containing "Jerry" and print them
awk '/Jerry/ {print}' file.txt

# Or shorter:
awk '/Jerry/' file.txt

Change Field Delimiter

By default, awk splits fields on whitespace (spaces and tabs). Use -F to change the delimiter:

# Use colon as delimiter (common in /etc/passwd)
awk -F: '{print $1}' /etc/passwd
# Output: List of all usernames

Modify Fields

# Replace the second field with "JJ"; $0 is the whole line, rebuilt after the change
echo "Hello Tom" | awk '{$2="JJ"; print $0}'
# Output: Hello JJ

Filter by Length

# Get lines longer than 15 characters
awk 'length($0) > 15' file.txt

Real-World Example: Extract IP Addresses

# Get IP addresses from access log
awk '{print $1}' /var/log/nginx/access.log

# Count unique IPs
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c
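
Sum a Column

awk keeps variables across lines, and an END block runs once after the last line. A minimal sketch for totaling traffic, assuming nginx's default "combined" log format where field 10 is the response size:

# Total bytes served (verify the field number against your log format)
awk '{sum += $10} END {print sum " bytes"}' /var/log/nginx/access.log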

The grep Command: Search Text

grep (global regular expression print) searches for keywords in files or output.

Basic Search

# Find keyword in file
grep keyword filename

# Search in output
ls -l | grep Desktop

Useful Flags

# Count occurrences
grep -c keyword file.txt

# Ignore case
grep -i keyword file.txt
# Finds: keyword, Keyword, KEYWORD

# Show line numbers
grep -n keyword file.txt
# Output: 5:line with keyword

# Exclude lines with keyword (invert match)
grep -v keyword file.txt

Real-World Example: Find Errors in Logs

# Find all error lines
grep -i error /var/log/syslog

# Count errors
grep -i error /var/log/syslog | wc -l

# Find errors but exclude specific ones
grep -i error /var/log/syslog | grep -v "ignore_this_error"

The egrep Command: Multiple Keywords

egrep (equivalent to grep -E, the preferred modern spelling) searches for multiple patterns at once.

# Search for keyword1 OR keyword2
egrep -i "keyword1|keyword2" file.txt

# Find lines with error or warning
egrep -i "error|warning" /var/log/syslog

The sort Command: Alphabetical Ordering

# Sort alphabetically
sort file.txt

# Reverse sort
sort -r file.txt

# Sort by second field
sort -k2 file.txt

Real Example: Sort By File Size

# Sort files by size (5th column in ls -l)
ls -l | sort -k5 -n
# -n flag for numerical sort

The uniq Command: Remove Duplicates

uniq filters out repeated lines. Important: the input must be sorted first; the demonstration below shows why.

# Remove duplicates
sort file.txt | uniq

# Count duplicates
sort file.txt | uniq -c
# Output: 3 line_content (appears 3 times)

# Show only duplicates
sort file.txt | uniq -d
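
The reason sorting matters: uniq only collapses adjacent duplicates. A quick demonstration with inline input:

printf 'a\nb\na\n' | uniq
# a
# b
# a   (the second "a" survives because the copies weren't adjacent)

printf 'a\nb\na\n' | sort | uniq
# a
# b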

Real Example: Most Common Log Entries

# Find most common errors
grep error /var/log/syslog | sort | uniq -c | sort -rn | head -10

Breaking it down:

  1. grep error – Find error lines
  2. sort – Sort so duplicates are together
  3. uniq -c – Count duplicates
  4. sort -rn – Sort by count (reverse numerical)
  5. head -10 – Show top 10
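
To sanity-check the output shape, here is the same pipeline run on a throwaway file with made-up contents:

printf 'error: disk full\nerror: timeout\nerror: disk full\n' | sort | uniq -c | sort -rn | head -10
#       2 error: disk full
#       1 error: timeout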

The wc Command: Count Lines, Words, Bytes

wc (word count) reads files and counts.

# Count lines, words, bytes
wc file.txt
# Output: 45 300 2000 file.txt

# Only lines
wc -l file.txt

# Only words
wc -w file.txt

# Only bytes
wc -c file.txt

Real Examples

# Count entries in a directory (ls -l adds a "total" header line, so subtract 1)
ls -l | wc -l

# Count how many times keyword appears
grep keyword file.txt | wc -l

# Count total lines of code in Python files (note the escaped \; for -exec)
find . -name "*.py" -exec wc -l {} \; | awk '{sum+=$1} END {print sum}'

Comparing Files

diff: Line-by-Line Comparison

# Compare files
diff file1.txt file2.txt

# Output shows differences:
# < line in file1
# > line in file2

cmp: Byte-by-Byte Comparison

# Compare files
cmp file1.txt file2.txt

# Output: first byte that differs
# No output if files are identical
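
A concrete run makes the markers easier to read (two hypothetical two-line files):

printf 'alpha\nbeta\n' > file1.txt
printf 'alpha\ngamma\n' > file2.txt
diff file1.txt file2.txt
# 2c2       (line 2 changed; "c" means changed, "a" added, "d" deleted)
# < beta    (the file1.txt version)
# ---
# > gamma   (the file2.txt version)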

Combining Commands with Pipes

The real power comes from chaining commands together.
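
Each | connects one command's standard output to the next command's standard input, so data streams between programs without any temporary files:

# Count subdirectories: ls -l marks directories with a leading "d"
ls -l | grep "^d" | wc -l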

Example 1: Find and Count

# How many users have /bin/bash as their shell?
grep "/bin/bash" /etc/passwd | wc -l

Example 2: Top 5 Largest Files

ls -lh | sort -k5 -h -r | head -5
# -h sorts human-readable sizes (K, M, G) correctly

Example 3: Extract and Sort

# Get all usernames and sort them
awk -F: '{print $1}' /etc/passwd | sort

Example 4: Search, Extract, Count

# Find IP addresses that accessed /admin
grep "/admin" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn

This shows which IPs hit /admin most frequently.

Example 5: Log Analysis

# Find most common error types (adjust $5 to wherever the type sits in your format)
grep -i error /var/log/app.log | awk '{print $5}' | sort | uniq -c | sort -rn | head -10

Real-World Scenarios

Scenario 1: Find Large Files

# Files larger than 100MB (null-delimited so filenames with spaces survive)
find / -type f -size +100M -print0 2>/dev/null | xargs -0 ls -lh | awk '{print $5, $NF}'

Scenario 2: Monitor Active Connections

# Count connections per IP
netstat -an | grep ESTABLISHED | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn
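
On newer distributions netstat may be absent; ss is the usual replacement. A sketch of the same idea, assuming ss -tn's default columns (State, Recv-Q, Send-Q, Local, Peer); verify the field position on your system:

ss -tn | awk 'NR>1 {print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn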

Scenario 3: Check Failed Login Attempts

# Count failed SSH attempts by IP ($11 holds the IP in the usual message format;
# it shifts for "invalid user" lines, so check a sample line first)
grep "Failed password" /var/log/auth.log | awk '{print $11}' | sort | uniq -c | sort -rn

Scenario 4: Disk Usage by Directory

# Top 10 directories by size
du -h /var | sort -h -r | head -10

Scenario 5: Extract Email Addresses

# Find all email addresses in a file
grep -Eo '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' file.txt | sort | uniq

Common Patterns

Pattern 1: Search, Extract, Sort, Count

grep pattern file | awk '{print $2}' | sort | uniq -c | sort -rn

Pattern 2: Filter and Process

grep -v exclude_pattern file | awk '{print $1}'

Pattern 3: Multiple Conditions

egrep "error|warning" file | grep -v "ignore" | wc -l

Quick Reference

cut

cut -c1-3 file        # Characters 1-3
cut -d: -f1 file      # First field (delimiter :)

awk

awk '{print $1}' file              # First column
awk -F: '{print $1}' file          # Custom delimiter
awk '/pattern/ {print}' file       # Pattern matching
awk '{print $NF}' file             # Last column
awk 'length($0) > 15' file         # Lines > 15 chars

grep

grep pattern file                   # Search
grep -i pattern file               # Ignore case
grep -c pattern file               # Count
grep -n pattern file               # Line numbers
grep -v pattern file               # Invert (exclude)
egrep "pat1|pat2" file             # Multiple patterns

sort

sort file                          # Alphabetical
sort -r file                       # Reverse
sort -k2 file                      # By second field
sort -n file                       # Numerical

uniq

sort file | uniq                   # Remove duplicates
sort file | uniq -c                # Count occurrences
sort file | uniq -d                # Show only duplicates

wc

wc file                            # Lines, words, bytes
wc -l file                         # Lines only
wc -w file                         # Words only
wc -c file                         # Bytes only

diff/cmp

diff file1 file2                   # Line comparison
cmp file1 file2                    # Byte comparison

Tips for Efficiency

Tip 1: Use pipes instead of temporary files

# Instead of:
grep pattern file > temp.txt
sort temp.txt > sorted.txt

# Do:
grep pattern file | sort

Tip 2: Combine grep with awk

# Filter then extract
grep error log.txt | awk '{print $1, $5}'

Tip 3: Use awk instead of multiple cuts

# Instead of:
cut -d: -f1 file | cut -d- -f1

# Do it in one process:
awk -F: '{split($1, a, "-"); print a[1]}' file

Tip 4: Test patterns on small samples first

# Test on first 10 lines
head -10 large_file.txt | grep pattern

Key Takeaways

  1. cut – Extract characters or fields
  2. awk – Process fields, pattern matching, calculations
  3. grep – Search for patterns
  4. egrep – Search multiple patterns
  5. sort – Order lines
  6. uniq – Remove duplicates (must sort first)
  7. wc – Count lines, words, bytes
  8. Pipes (|) – Chain commands together
  9. diff/cmp – Compare files

These commands aren’t just for showing off. They solve real problems:

  • Analyzing logs
  • Extracting data
  • Monitoring systems
  • Processing reports
  • Debugging issues

Master these tools and manual file searching becomes a thing of the past.

What text processing task do you do most often? Share your go-to command combinations in the comments.
