Handling 100+ Website Scrapers with Python’s asyncio
A Quick Note on Timeline
Although this is Part 2 of the CollegeBuzz series, it was actually the first major component I built. The MongoDB archiving system came later because I needed to solve data consistency issues after scraping was already running in production.
Lesson learned: Start with archiving from Day 1. But hindsight is 20/20, right? 😅
The Problem: Scraping 100+ Colleges Without Losing My Mind
When I started building CollegeBuzz — an AICTE academic news aggregator — I needed to scrape notices, events, and announcements from 100+ Indian engineering colleges daily.
My first naive attempt:
import requests
from bs4 import BeautifulSoup

def scrape_college(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract data from soup...
    data = {}
    return data

# The slow way
college_urls = ["https://iitb.ac.in", "https://nitt.edu", ...]
all_data = []
for url in college_urls:
    data = scrape_college(url)
    all_data.append(data)

# 🐌 Total time: 4+ hours
Why so slow?
Each request waits for the previous one to complete. If one site takes 8 seconds to respond, my scraper just sits there… waiting. Multiply that across 100+ sites and you get an eternity.
I needed something better.
Discovering Crawl4AI: The Game Changer
After wrestling with BeautifulSoup and Selenium, I found Crawl4AI by @unclecode.
Why Crawl4AI for async scraping?
- ⚡ Built for asyncio from the ground up — native async/await support
- 🎯 CSS-based extraction strategies — no more manual BeautifulSoup parsing
- 📦 Works out of the box — handles browser automation, retries, error handling
- 🚀 Battle-tested — 50k+ GitHub stars
Resources:
- 📺 YouTube Channel — Excellent tutorials by the creator
- 🐙 GitHub Repository
- 📚 Official Documentation
My Async Scraping Architecture
Instead of trying to scrape everything at once, I built a controlled async pipeline:
import json
import os

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# MongoDBHandler, urls and process_and_store come from the project's own modules

async def extract_notices_and_events():
    """Main async scraping orchestrator"""
    # Initialize MongoDB handler
    mongo_handler = MongoDBHandler(uri=os.environ.get("MONGO_URI"))

    # Single crawler instance handles all sites
    async with AsyncWebCrawler(verbose=True) as crawler:
        for site in urls:  # Sequential at site level
            # Configure extraction strategy
            extraction_strategy = JsonCssExtractionStrategy(
                site["schema"],
                verbose=True
            )
            config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,  # Always fresh data
                extraction_strategy=extraction_strategy
            )

            # Async scrape (non-blocking!)
            result = await crawler.arun(url=site["url"], config=config)

            if result.success:
                data = json.loads(result.extracted_content)
                # Process and store data
                process_and_store(data, mongo_handler)
Key Design Decision: Sequential Sites, Async Pages
I intentionally scrape sites one-by-one but use async for the actual HTTP requests. Why?
- Avoid IP bans — 100 concurrent requests to different domains = red flags
- Resource management — One browser at a time keeps memory under control
- Error isolation — If one site fails, others continue
The Magic: CSS-Based Extraction
Instead of writing BeautifulSoup parsing for each site, I use declarative schemas:
# In crawler_config.py
urls = [
    {
        "url": "https://www.iitb.ac.in/",
        "schema": {
            "name": "IIT Bombay Notices",
            "baseSelector": ".notice-item",
            "fields": [
                {
                    "name": "title",
                    "selector": "h3.title",
                    "type": "text"
                },
                {
                    "name": "notice_url",
                    "selector": "a",
                    "type": "attribute",
                    "attribute": "href"
                }
            ]
        }
    },
    # ... 100+ more colleges
]
Then in my scraper:
for site in urls:
    extraction_strategy = JsonCssExtractionStrategy(
        site["schema"],
        verbose=True
    )
    result = await crawler.arun(
        url=site["url"],
        config=CrawlerRunConfig(extraction_strategy=extraction_strategy)
    )

    # Get clean JSON data, no BeautifulSoup needed!
    data = json.loads(result.extracted_content)
Why this is powerful for async scraping:
- ✅ No manual parsing — Crawl4AI handles HTML extraction
- ✅ Maintainable — Update schemas without touching scraper logic
- ✅ Scalable — Add new colleges by adding new schema objects (see the example below)
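For instance, onboarding one more college is just another schema object in crawler_config.py, with no changes to the scraper itself. A sketch (the college and its selectors are hypothetical; inspect the real page to fill them in):

# crawler_config.py: hypothetical new entry, selectors are illustrative only
new_college = {
    "url": "https://www.nitdgp.ac.in/",
    "schema": {
        "name": "NIT Durgapur Notices",
        "baseSelector": ".news-list li",
        "fields": [
            {"name": "title", "selector": "a", "type": "text"},
            {
                "name": "notice_url",
                "selector": "a",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    },
}
urls.append(new_college)  # the existing scraper loop picks it up unchanged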
Real-World Async Patterns I Used
Pattern 1: Context Manager for Resource Cleanup
async with AsyncWebCrawler(verbose=True) as crawler:
    # Use crawler
    result = await crawler.arun(url=url, config=config)
# Automatically closes browser and releases resources
Even if errors occur, the browser gets cleaned up. Critical for long-running scrapers.
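If you want to convince yourself that cleanup really happens on errors, here's a tiny stand-in (nothing Crawl4AI-specific, just a generic async context manager) that shows the same behavior:

import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def fake_browser():
    print("browser started")
    try:
        yield "browser-handle"
    finally:
        # This runs even when the body below raises
        print("browser closed")

async def main():
    try:
        async with fake_browser():
            raise RuntimeError("site timed out")
    except RuntimeError:
        pass  # "browser closed" was already printed

asyncio.run(main())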
Pattern 2: Handling Failures Gracefully
result = await crawler.arun(url=site["url"], config=config)

if not result.success:
    print(f"❌ Crawl failed for {site['url']}: {result.error_message}")
    continue  # Skip this site, move to next

# Process successful result
data = json.loads(result.extracted_content)
One failed site doesn’t crash the entire pipeline.
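result.success covers failed crawls, but the processing step can still raise (json.loads on unexpected content, for example). One way to harden the loop further, sketched on top of the orchestrator above rather than copied from my production code:

for site in urls:
    try:
        result = await crawler.arun(url=site["url"], config=config)
        if not result.success:
            print(f"❌ Crawl failed for {site['url']}: {result.error_message}")
            continue

        data = json.loads(result.extracted_content)
        process_and_store(data, mongo_handler)
    except Exception as e:
        # JSON errors, timeouts, schema mismatches: log and move on
        print(f"⚠️ Skipping {site['url']}: {e}")
        continue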
Pattern 3: Async Configuration
config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,  # Fresh data every time
    extraction_strategy=extraction_strategy
)
result = await crawler.arun(url=site["url"], config=config)
Crawl4AI’s CrawlerRunConfig lets you customize behavior per site without creating new crawler instances.
Handling Real-World Edge Cases
Edge Case 1: Data Volume Control
Some colleges list 1000+ notices on their homepage. I don’t need all of them:
# Site-specific limits
if site["url"] in ["https://www.nitt.edu/", "https://www.iitkgp.ac.in/"]:
    data = data[:10]  # Only recent 10
elif site["url"] in ["https://www.iitk.ac.in/", "https://www.iiti.ac.in/"]:
    data = data[:4]  # These update slowly
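Hard-coded URL lists work, but they're one more thing to edit when a college changes. A variant worth considering is a per-site limit key next to each schema in crawler_config.py (the "limit" key is my own idea, not part of the current config):

# crawler_config.py: hypothetical "limit" key alongside the existing schema
urls = [
    {
        "url": "https://www.nitt.edu/",
        "limit": 10,              # keep only the 10 most recent notices
        "schema": nitt_schema,    # placeholder name for the schema dict shown earlier
    },
    # ... colleges without a "limit" keep everything
]

# In the scraper, replacing the hard-coded URL checks:
limit = site.get("limit")
if limit is not None:
    data = data[:limit]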
Edge Case 2: URL Normalization
College websites have inconsistent URL formats:
from urllib.parse import urljoin

def process_url(base_url, extracted_url):
    """Convert relative URLs to absolute"""
    if not extracted_url:
        return extracted_url

    extracted_url = extracted_url.strip()

    # Handle relative URLs
    if not extracted_url.startswith(("http://", "https://")):
        return urljoin(base_url, extracted_url)

    return extracted_url
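A couple of quick examples of what the normalization does (the paths are illustrative):

process_url("https://www.iitb.ac.in/", "/newsevents/notice.pdf")
# -> "https://www.iitb.ac.in/newsevents/notice.pdf"

process_url("https://www.iitb.ac.in/", "https://cdn.example.edu/file.pdf")
# -> "https://cdn.example.edu/file.pdf"  (already absolute, returned unchanged)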
Edge Case 3: JavaScript URL Madness (IIT Roorkee)
IIT Roorkee embeds JavaScript in href attributes:
<a href="window.open('/events/workshop.pdf')">View Event</a>
Solution:
if site["url"] == "https://www.iitr.ac.in/":
    for entry in data:
        if "upcoming_Event_url" in entry and "window.open(" in entry["upcoming_Event_url"]:
            # Extract the actual URL from the JavaScript call
            match = re.search(r"window\.open\('([^']+)'\)", entry["upcoming_Event_url"])
            if match:
                entry["upcoming_Event_url"] = match.group(1)
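A quick sanity check of that pattern against the href from the example above:

import re

href = "window.open('/events/workshop.pdf')"
match = re.search(r"window\.open\('([^']+)'\)", href)
print(match.group(1))  # /events/workshop.pdf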
Performance: The Numbers
Before (Sequential with Requests + BeautifulSoup):
⏱️ Time: 4 hours, 23 minutes
🐌 Average: ~150 seconds per site
💾 Memory: ~200MB stable
After (Async with Crawl4AI):
⏱️ Time: 12 minutes, 30 seconds
⚡ Average: ~7.5 seconds per site
💾 Memory: ~600MB peak (browser overhead)
20x faster with better reliability!
Why Not Fully Concurrent?
You might ask: “Why not scrape all 100+ sites simultaneously with asyncio.gather()?”
# Why I DON'T do this:
tasks = [scrape_college(crawler, site) for site in urls]
results = await asyncio.gather(*tasks)
I tried this. Results:
- ❌ IP bans from 12 colleges
- ❌ Memory explosion (100 browsers = 8GB+ RAM)
- ❌ Browser crashes
Lesson: For daily scraping of 100+ different domains, sequential with async is the sweet spot:
- Fast enough (12 minutes vs 4 hours)
- Respectful to websites (no hammering)
- Stable and maintainable
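If daily sequential runs ever stop being fast enough, the middle ground I'd reach for before full asyncio.gather() is bounded concurrency with asyncio.Semaphore: a handful of sites in flight at once instead of all 100+. A rough sketch (the limit of 5 and the scrape_college helper are assumptions, not what runs in production):

import asyncio

semaphore = asyncio.Semaphore(5)  # at most 5 sites in flight at a time

async def scrape_with_limit(crawler, site):
    async with semaphore:
        return await scrape_college(crawler, site)  # same per-site logic as before

async def scrape_all(crawler, sites):
    tasks = [scrape_with_limit(crawler, site) for site in sites]
    # return_exceptions=True keeps one failure from cancelling the rest
    return await asyncio.gather(*tasks, return_exceptions=True)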
Integration with the Full Pipeline
Here’s how async scraping fits into CollegeBuzz:
# In aictcscraper.py
async def extract_notices_and_events():
    mongo_handler = MongoDBHandler(uri=os.environ.get("MONGO_URI"))

    try:
        async with AsyncWebCrawler(verbose=True) as crawler:
            for site in urls:
                # Async scraping
                result = await crawler.arun(url=site["url"], config=config)
                data = json.loads(result.extracted_content)

                # Insert into MongoDB (with deduplication from Part 1)
                mongo_handler.insert_data(collection_name, records)
    except Exception as e:
        print(f"Error: {e}")
        raise
    finally:
        mongo_handler.close_connection()
The scraper feeds data into the MongoDB handler, which automatically:
- Deduplicates records (from Part 1)
- Updates timestamps
- Archives old data
Running the Scraper
Manual Trigger
python aictcscraper.py
Via Flask API
# In app.py
@app.route('/api/scrape', methods=['POST'])
def run_scraper():
    try:
        result = asyncio.run(extract_notices_and_events())
        return jsonify({"status": "success"}), 200
    except Exception as e:
        return jsonify({"status": "error", "message": str(e)}), 500
Trigger via HTTP:
curl -X POST http://localhost:8080/api/scrape
Scheduled with Cron
# crontab -e
0 2 * * * cd /path/to/collegebuzz && python aictcscraper.py >> scraper.log 2>&1
Key Takeaways for Async Scraping
1. Context Managers Are Essential
async with AsyncWebCrawler() as crawler:
    # Always use context managers
    result = await crawler.arun(url)
# Automatic cleanup, even on errors
2. Don’t Over-Optimize
Sequential scraping of 100+ sites in 12 minutes is good enough for daily jobs. Don’t chase 100% concurrency at the cost of stability.
3. Schema-Based Extraction > Manual Parsing
Declarative CSS schemas are:
- Easier to maintain
- Easier to debug
- Easier to scale
4. Handle Failures Gracefully
if not result.success:
    print(f"Failed: {result.error_message}")
    continue  # Don't crash entire pipeline
Resources & Credits
Huge thanks to @unclecode for creating Crawl4AI! This library made async scraping approachable.
Learn More:
- 📺 Crawl4AI YouTube Tutorials
- 🐙 GitHub: unclecode/crawl4ai
- 📚 Official Documentation
- 🐍 Python asyncio Docs
CollegeBuzz Series:
- Part 1: Archiving and deduplicating scraped data with MongoDB
- Part 2: Handling 100+ website scrapers with Python's asyncio (this post)
- Part 3: Coming soon
Closing Thoughts
Async scraping with Crawl4AI transformed CollegeBuzz from a 4-hour batch job to a 12-minute operation. But the real win was simplicity.
No threading nightmares. No multiprocessing complexity. Just clean async/await code with schema-based extraction.
If you’re scraping multiple websites in 2025, start with Crawl4AI and asyncio. Your future self will thank you.
Found this helpful? Hit that ❤️ and follow for Part 3!
Questions? Drop a comment or reach out @pradippanjiyar
This is Part 2 of the CollegeBuzz engineering series. All code examples are from my production system.