Handling 100+ Website Scrapers with Python’s asyncio

A Quick Note on Timeline

Although this is Part 2 of the CollegeBuzz series, it was actually the first major component I built. The MongoDB archiving system came later because I needed to solve data consistency issues after scraping was already running in production.

Lesson learned: Start with archiving from Day 1. But hindsight is 20/20, right? 😅

The Problem: Scraping 100+ Colleges Without Losing My Mind

When I started building CollegeBuzz — an AICTE academic news aggregator — I needed to scrape notices, events, and announcements from 100+ Indian engineering colleges daily.

My first naive attempt:

import requests
from bs4 import BeautifulSoup

def scrape_college(url):
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract notice titles (simplified; the real parsing differs per site)
    data = [tag.get_text(strip=True) for tag in soup.select("h3")]
    return data

# The slow way
college_urls = ["https://iitb.ac.in", "https://nitt.edu", ...]
all_data = []

for url in college_urls:
    data = scrape_college(url)  
    all_data.append(data)

# 🐌 Total time: 4+ hours

Why so slow?

Each request waits for the previous one to complete. If one site takes 8 seconds to respond, my scraper just sits there… waiting. Multiply that across 100+ sites and you get an eternity.

I needed something better.

Discovering Crawl4AI: The Game Changer

After wrestling with BeautifulSoup and Selenium, I found Crawl4AI by @unclecode.

Why Crawl4AI for async scraping?

  • ⚡ Built for asyncio from the ground up — native async/await support
  • 🎯 CSS-based extraction strategies — no more manual BeautifulSoup parsing
  • 📦 Works out of the box — handles browser automation, retries, error handling
  • 🚀 Battle-tested — 50k+ GitHub stars
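
To get a feel for it, here's roughly the smallest Crawl4AI program (a sketch against the current API; check the docs for your installed version):

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Smallest possible "hello world": one page, default config
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://www.iitb.ac.in/")
        if result.success:
            print(result.markdown)   # the page converted to Markdown
        else:
            print(result.error_message)

asyncio.run(main())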

My Async Scraping Architecture

Instead of trying to scrape everything at once, I built a controlled async pipeline:

import json
import os

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# `urls` (site list + schemas), `MongoDBHandler`, and `process_and_store`
# come from the project's own modules (crawler_config.py and the Part 1 handler)

async def extract_notices_and_events():
    """Main async scraping orchestrator"""

    # Initialize MongoDB handler
    mongo_handler = MongoDBHandler(uri=os.environ.get("MONGO_URI"))

    # Single crawler instance handles all sites
    async with AsyncWebCrawler(verbose=True) as crawler:
        for site in urls:  # Sequential at site level
            # Configure extraction strategy
            extraction_strategy = JsonCssExtractionStrategy(
                site["schema"], 
                verbose=True
            )

            config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,  # Always fresh data
                extraction_strategy=extraction_strategy
            )

            # Async scrape (non-blocking!)
            result = await crawler.arun(url=site["url"], config=config)

            if result.success:
                data = json.loads(result.extracted_content)
                # Process and store data
                process_and_store(data, mongo_handler)

Key Design Decision: Sequential Sites, Async Pages

I intentionally scrape sites one-by-one but use async for the actual HTTP requests. Why?

  1. Avoid IP bans — 100 concurrent requests to different domains = red flags
  2. Resource management — One browser at a time keeps memory under control
  3. Error isolation — If one site fails, others continue

The Magic: CSS-Based Extraction

Instead of writing BeautifulSoup parsing for each site, I use declarative schemas:

# In crawler_config.py
urls = [
    {
        "url": "https://www.iitb.ac.in/",
        "schema": {
            "name": "IIT Bombay Notices",
            "baseSelector": ".notice-item",
            "fields": [
                {
                    "name": "title",
                    "selector": "h3.title",
                    "type": "text"
                },
                {
                    "name": "notice_url",
                    "selector": "a",
                    "type": "attribute",
                    "attribute": "href"
                }
            ]
        }
    },
    # ... 100+ more colleges
]

Then in my scraper:

for site in urls:
    extraction_strategy = JsonCssExtractionStrategy(
        site["schema"], 
        verbose=True
    )

    result = await crawler.arun(
        url=site["url"], 
        config=CrawlerRunConfig(extraction_strategy=extraction_strategy)
    )

    # Get clean JSON data, no BeautifulSoup needed!
    data = json.loads(result.extracted_content)

Why this is powerful for async scraping:

  • No manual parsing — Crawl4AI handles HTML extraction
  • Maintainable — Update schemas without touching scraper logic
  • Scalable — Add new colleges by adding new schema objects
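
In practice, "add new colleges by adding new schema objects" can even be semi-automated. A hypothetical helper (make_college_schema is not in the real crawler_config.py, and the selectors below are made up) could stamp out the common "list of items with a title link" pattern:

# Hypothetical helper: most college sites fit the same shape,
# so new entries can be generated instead of hand-written
def make_college_schema(name, base_selector, title_selector="h3", link_selector="a"):
    return {
        "name": name,
        "baseSelector": base_selector,
        "fields": [
            {"name": "title", "selector": title_selector, "type": "text"},
            {"name": "notice_url", "selector": link_selector,
             "type": "attribute", "attribute": "href"},
        ],
    }

urls.append({
    "url": "https://www.nitk.ac.in/",
    "schema": make_college_schema("NITK Notices", ".news-item"),
})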

Real-World Async Patterns I Used

Pattern 1: Context Manager for Resource Cleanup

async with AsyncWebCrawler(verbose=True) as crawler:
    # Use crawler
    result = await crawler.arun(url=url, config=config)
# Automatically closes browser and releases resources

Even if errors occur, the browser gets cleaned up. Critical for long-running scrapers.
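
Depending on your Crawl4AI version, the context manager is roughly equivalent to an explicit start/close lifecycle (a sketch, handy if you need to keep one crawler alive across many calls):

crawler = AsyncWebCrawler(verbose=True)
await crawler.start()          # launch the browser
try:
    result = await crawler.arun(url=url, config=config)
finally:
    await crawler.close()      # always release the browser, even on errors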

Pattern 2: Handling Failures Gracefully

result = await crawler.arun(url=site["url"], config=config)

if not result.success:
    print(f"❌ Crawl failed for {site['url']}: {result.error_message}")
    continue  # Skip this site, move to next

# Process successful result
data = json.loads(result.extracted_content)

One failed site doesn’t crash the entire pipeline.
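
For flaky college servers, I could also wrap arun() in a small retry helper. This isn't part of the production pipeline, just a sketch of how I'd bolt it on:

import asyncio

async def arun_with_retry(crawler, url, config, attempts=3, delay=5):
    """Retry transient failures with a growing pause between attempts."""
    for attempt in range(1, attempts + 1):
        result = await crawler.arun(url=url, config=config)
        if result.success:
            return result
        print(f"Attempt {attempt} failed for {url}: {result.error_message}")
        await asyncio.sleep(delay * attempt)   # 5s, 10s, 15s...
    return result   # hand back the last failed result so the caller can log it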

Pattern 3: Async Configuration

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,  # Fresh data every time
    extraction_strategy=extraction_strategy
)

result = await crawler.arun(url=site["url"], config=config)

Crawl4AI’s CrawlerRunConfig lets you customize behavior per site without creating new crawler instances.
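
The same mechanism handles per-site quirks. Parameter names below assume a recent Crawl4AI release, so treat this as a sketch and verify against your version:

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    extraction_strategy=extraction_strategy,
    wait_for=site.get("wait_for"),    # e.g. a CSS selector for JS-rendered notice lists
    page_timeout=60_000,              # slower college servers need extra headroom (ms)
)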

Handling Real-World Edge Cases

Edge Case 1: Data Volume Control

Some colleges list 1000+ notices on their homepage. I don’t need all of them:

# Site-specific limits
if site["url"] in ["https://www.nitt.edu/", "https://www.iitkgp.ac.in/"]:
    data = data[:10]  # Only recent 10
elif site["url"] in ["https://www.iitk.ac.in/", "https://www.iiti.ac.in/"]:
    data = data[:4]   # These update slowly
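
A cleaner variant is to carry the limit in the site config itself instead of hard-coding URL lists in the scraper. The "limit" key here is hypothetical, not part of the real crawler_config.py:

# Hypothetical refactor: per-site "limit" key in crawler_config.py
data = data[: site.get("limit", 25)]   # default cap of 25 items per site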

Edge Case 2: URL Normalization

College websites have inconsistent URL formats:

from urllib.parse import urljoin

def process_url(base_url, extracted_url):
    """Convert relative URLs to absolute"""
    if not extracted_url:
        return extracted_url

    extracted_url = extracted_url.strip()

    # Handle relative URLs
    if not extracted_url.startswith(("http://", "https://")):
        return urljoin(base_url, extracted_url)

    return extracted_url
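
A quick sanity check of what that gives you (the paths here are made up):

process_url("https://www.nitt.edu/", "home/academic_notice.pdf")
# -> "https://www.nitt.edu/home/academic_notice.pdf"

process_url("https://www.nitt.edu/", "https://example.com/external.pdf")
# -> returned unchanged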

Edge Case 3: JavaScript URL Madness (IIT Roorkee)

IIT Roorkee embeds JavaScript in href attributes:

<a href="window.open('/events/workshop.pdf')">View Event</a>

Solution:

if site["url"] == "https://www.iitr.ac.in/":
    for entry in data:
        if "upcoming_Event_url" in entry and "window.open(" in entry["upcoming_Event_url"]:
            # Extract actual URL from JavaScript
            match = re.search(r"window.open('([^']+)')", entry["upcoming_Event_url"])
            if match:
                entry["upcoming_Event_url"] = match.group(1)

Performance: The Numbers

Before (Sequential with Requests + BeautifulSoup):

⏱️ Time: 4 hours, 23 minutes
🐌 Average: ~150 seconds per site
💾 Memory: ~200MB stable

After (Async with Crawl4AI):

⏱️ Time: 12 minutes, 30 seconds
⚡ Average: ~7.5 seconds per site
💾 Memory: ~600MB peak (browser overhead)

20x faster with better reliability!

Why Not Fully Concurrent?

You might ask: “Why not scrape all 100+ sites simultaneously with asyncio.gather()?”

# Why I DON'T do this (scrape_college here is a hypothetical per-site
# coroutine that wraps crawler.arun for one college):
tasks = [scrape_college(crawler, site) for site in urls]
results = await asyncio.gather(*tasks)

I tried this. Results:

  • IP bans from 12 colleges
  • Memory explosion (100 browsers = 8GB+ RAM)
  • Browser crashes

Lesson: For daily scraping of 100+ different domains, sequential with async is the sweet spot:

  • Fast enough (12 minutes vs 4 hours)
  • Respectful to websites (no hammering)
  • Stable and maintainable
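
If I ever need more throughput, the usual middle ground is bounded concurrency with asyncio.Semaphore rather than an unbounded gather(). A sketch I haven't shipped to production:

import asyncio

async def scrape_with_limit(crawler, site, semaphore):
    # At most `max_concurrency` sites are in flight at any moment
    async with semaphore:
        strategy = JsonCssExtractionStrategy(site["schema"])
        config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS,
                                  extraction_strategy=strategy)
        return await crawler.arun(url=site["url"], config=config)

async def scrape_all(urls, max_concurrency=5):
    semaphore = asyncio.Semaphore(max_concurrency)
    async with AsyncWebCrawler(verbose=True) as crawler:
        tasks = [scrape_with_limit(crawler, site, semaphore) for site in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)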

Integration with the Full Pipeline

Here’s how async scraping fits into CollegeBuzz:

# In aictcscraper.py
async def extract_notices_and_events():
    mongo_handler = MongoDBHandler(uri=os.environ.get("MONGO_URI"))

    try:
        async with AsyncWebCrawler(verbose=True) as crawler:
            for site in urls:
                # Async scraping (per-site extraction strategy/config built as shown earlier)
                result = await crawler.arun(url=site["url"], config=config)
                data = json.loads(result.extracted_content)

                # Shape `data` into `records` for the target collection, then
                # insert into MongoDB (with deduplication from Part 1)
                mongo_handler.insert_data(collection_name, records)

    except Exception as e:
        print(f"Error: {e}")
        raise
    finally:
        mongo_handler.close_connection()

The scraper feeds data into the MongoDB handler, which automatically:

  • Deduplicates records (from Part 1)
  • Updates timestamps
  • Archives old data

Running the Scraper

Manual Trigger

python aictcscraper.py

Via Flask API

# In app.py
import asyncio
from flask import Flask, jsonify
from aictcscraper import extract_notices_and_events

app = Flask(__name__)

@app.route('/api/scrape', methods=['POST'])
def run_scraper():
    try:
        asyncio.run(extract_notices_and_events())
        return jsonify({"status": "success"}), 200
    except Exception as e:
        return jsonify({"status": "error", "message": str(e)}), 500

Trigger via HTTP:

curl -X POST http://localhost:8080/api/scrape
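
One caveat: asyncio.run() inside the request handler blocks that worker for the full ~12-minute scrape. If that matters, a background thread keeps the endpoint snappy (a sketch, using a hypothetical /api/scrape-async route):

# Variation: fire-and-forget in a background thread
import threading

@app.route('/api/scrape-async', methods=['POST'])
def run_scraper_background():
    threading.Thread(
        target=lambda: asyncio.run(extract_notices_and_events()),
        daemon=True,
    ).start()
    return jsonify({"status": "started"}), 202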

Scheduled with Cron

# crontab -e
0 2 * * * cd /path/to/collegebuzz && python aictcscraper.py >> scraper.log 2>&1

Key Takeaways for Async Scraping

1. Context Managers Are Essential

async with AsyncWebCrawler() as crawler:
    # Always use context managers
    result = await crawler.arun(url)
# Automatic cleanup, even on errors

2. Don’t Over-Optimize

Sequential scraping of 100+ sites in 12 minutes is good enough for daily jobs. Don’t chase 100% concurrency at the cost of stability.

3. Schema-Based Extraction > Manual Parsing

Declarative CSS schemas are:

  • Easier to maintain
  • Easier to debug
  • Easier to scale

4. Handle Failures Gracefully

if not result.success:
    print(f"Failed: {result.error_message}")
    continue  # Don't crash entire pipeline

Resources & Credits

Huge thanks to @unclecode for creating Crawl4AI! This library made async scraping approachable.

Closing Thoughts

Async scraping with Crawl4AI transformed CollegeBuzz from a 4-hour batch job to a 12-minute operation. But the real win was simplicity.

No threading nightmares. No multiprocessing complexity. Just clean async/await code with schema-based extraction.

If you’re scraping multiple websites in 2025, start with Crawl4AI and asyncio. Your future self will thank you.

Found this helpful? Hit that ❤️ and follow for Part 3!

Questions? Drop a comment or reach out @pradippanjiyar

This is Part 2 of the CollegeBuzz engineering series. All code examples are from my production system.
