Handling 100+ Website Scrapers with Python’s asyncio
A Quick Note on Timeline
Although this is Part 2 of the CollegeBuzz series, it was actually the first major component I built. The MongoDB archiving system came later because I needed to solve data consistency issues after scraping was already running in production.
Lesson learned: Start with archiving from Day 1. But hindsight is 20/20, right? 😅
The Problem: Scraping 100+ Colleges Without Losing My Mind
When I started building CollegeBuzz — an AICTE academic news aggregator — I needed to scrape notices, events, and announcements from 100+ Indian engineering colleges daily.
My first naive attempt:
import requests
from bs4 import BeautifulSoup

def scrape_college(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract data from soup...
    data = {}
    return data

# The slow way
college_urls = ["https://iitb.ac.in", "https://nitt.edu", ...]
all_data = []
for url in college_urls:
    data = scrape_college(url)
    all_data.append(data)

# 🐌 Total time: 4+ hours
Why so slow?
Each request waits for the previous one to complete. If one site takes 8 seconds to respond, my scraper just sits there… waiting. Multiply that across 100+ sites and you get an eternity.
I needed something better.
Discovering Crawl4AI: The Game Changer
After wrestling with BeautifulSoup and Selenium, I found Crawl4AI by @unclecode.
Why Crawl4AI for async scraping?
- ⚡ Built for asyncio from the ground up — native async/await support
- 🎯 CSS-based extraction strategies — no more manual BeautifulSoup parsing
- 📦 Works out of the box — handles browser automation, retries, error handling
- 🚀 Battle-tested — 50k+ GitHub stars
Resources:
- 📺 YouTube Channel — Excellent tutorials by the creator
- 🐙 GitHub Repository
- 📚 Official Documentation
My Async Scraping Architecture
Instead of trying to scrape everything at once, I built a controlled async pipeline:
import json
import os

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# MongoDBHandler, urls and process_and_store come from the project's own modules

async def extract_notices_and_events():
    """Main async scraping orchestrator"""
    # Initialize MongoDB handler
    mongo_handler = MongoDBHandler(uri=os.environ.get("MONGO_URI"))

    # Single crawler instance handles all sites
    async with AsyncWebCrawler(verbose=True) as crawler:
        for site in urls:  # Sequential at site level
            # Configure extraction strategy
            extraction_strategy = JsonCssExtractionStrategy(
                site["schema"],
                verbose=True
            )
            config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,  # Always fresh data
                extraction_strategy=extraction_strategy
            )

            # Async scrape (non-blocking!)
            result = await crawler.arun(url=site["url"], config=config)

            if result.success:
                data = json.loads(result.extracted_content)
                # Process and store data
                process_and_store(data, mongo_handler)
Key Design Decision: Sequential Sites, Async Pages
I intentionally scrape sites one-by-one but use async for the actual HTTP requests. Why?
- Avoid IP bans — 100 concurrent requests to different domains = red flags
- Resource management — One browser at a time keeps memory under control
- Error isolation — If one site fails, others continue
The Magic: CSS-Based Extraction
Instead of writing BeautifulSoup parsing for each site, I use declarative schemas:
# In crawler_config.py
urls = [
    {
        "url": "https://www.iitb.ac.in/",
        "schema": {
            "name": "IIT Bombay Notices",
            "baseSelector": ".notice-item",
            "fields": [
                {
                    "name": "title",
                    "selector": "h3.title",
                    "type": "text"
                },
                {
                    "name": "notice_url",
                    "selector": "a",
                    "type": "attribute",
                    "attribute": "href"
                }
            ]
        }
    },
    # ... 100+ more colleges
]
Then in my scraper:
for site in urls:
    extraction_strategy = JsonCssExtractionStrategy(
        site["schema"],
        verbose=True
    )
    result = await crawler.arun(
        url=site["url"],
        config=CrawlerRunConfig(extraction_strategy=extraction_strategy)
    )

    # Get clean JSON data, no BeautifulSoup needed!
    data = json.loads(result.extracted_content)
Why this is powerful for async scraping:
- ✅ No manual parsing — Crawl4AI handles HTML extraction
- ✅ Maintainable — Update schemas without touching scraper logic
- ✅ Scalable — Add new colleges by adding new schema objects (see the example below)
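For instance, onboarding one more college is just another schema object in crawler_config.py, with no changes to the scraper itself. A sketch (the college and its selectors are hypothetical; inspect the real page to fill them in):

# crawler_config.py: hypothetical new entry, selectors are illustrative only
new_college = {
    "url": "https://www.nitdgp.ac.in/",
    "schema": {
        "name": "NIT Durgapur Notices",
        "baseSelector": ".news-list li",
        "fields": [
            {"name": "title", "selector": "a", "type": "text"},
            {
                "name": "notice_url",
                "selector": "a",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    },
}
urls.append(new_college)  # the existing scraper loop picks it up unchanged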
Real-World Async Patterns I Used
Pattern 1: Context Manager for Resource Cleanup
async with AsyncWebCrawler(verbose=True) as crawler:
    # Use crawler
    result = await crawler.arun(url=url, config=config)
# Automatically closes browser and releases resources
Even if errors occur, the browser gets cleaned up. Critical for long-running scrapers.
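If you want to convince yourself that cleanup really happens on errors, here's a tiny stand-in (nothing Crawl4AI-specific, just a generic async context manager) that shows the same behavior:

import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def fake_browser():
    print("browser started")
    try:
        yield "browser-handle"
    finally:
        # This runs even when the body below raises
        print("browser closed")

async def main():
    try:
        async with fake_browser():
            raise RuntimeError("site timed out")
    except RuntimeError:
        pass  # "browser closed" was already printed

asyncio.run(main())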
Pattern 2: Handling Failures Gracefully
result = await crawler.arun(url=site["url"], config=config)

if not result.success:
    print(f"❌ Crawl failed for {site['url']}: {result.error_message}")
    continue  # Skip this site, move to next

# Process successful result
data = json.loads(result.extracted_content)
One failed site doesn’t crash the entire pipeline.
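result.success covers failed crawls, but the processing step can still raise (json.loads on unexpected content, for example). One way to harden the loop further, sketched on top of the orchestrator above rather than copied from my production code:

for site in urls:
    try:
        result = await crawler.arun(url=site["url"], config=config)
        if not result.success:
            print(f"❌ Crawl failed for {site['url']}: {result.error_message}")
            continue

        data = json.loads(result.extracted_content)
        process_and_store(data, mongo_handler)
    except Exception as e:
        # JSON errors, timeouts, schema mismatches: log and move on
        print(f"⚠️ Skipping {site['url']}: {e}")
        continue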
Pattern 3: Async Configuration
config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,  # Fresh data every time
    extraction_strategy=extraction_strategy
)
result = await crawler.arun(url=site["url"], config=config)
Crawl4AI’s CrawlerRunConfig lets you customize behavior per site without creating new crawler instances.
Handling Real-World Edge Cases
Edge Case 1: Data Volume Control
Some colleges list 1000+ notices on their homepage. I don’t need all of them:
# Site-specific limits
if site["url"] in ["https://www.nitt.edu/", "https://www.iitkgp.ac.in/"]:
    data = data[:10]  # Only recent 10
elif site["url"] in ["https://www.iitk.ac.in/", "https://www.iiti.ac.in/"]:
    data = data[:4]  # These update slowly
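Hard-coded URL lists work, but they're one more thing to edit when a college changes. A variant worth considering is a per-site limit key next to each schema in crawler_config.py (the "limit" key is my own idea, not part of the current config):

# crawler_config.py: hypothetical "limit" key alongside the existing schema
urls = [
    {
        "url": "https://www.nitt.edu/",
        "limit": 10,              # keep only the 10 most recent notices
        "schema": nitt_schema,    # placeholder name for the schema dict shown earlier
    },
    # ... colleges without a "limit" keep everything
]

# In the scraper, replacing the hard-coded URL checks:
limit = site.get("limit")
if limit is not None:
    data = data[:limit]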
Edge Case 2: URL Normalization
College websites have inconsistent URL formats:
from urllib.parse import urljoin

def process_url(base_url, extracted_url):
    """Convert relative URLs to absolute"""
    if not extracted_url:
        return extracted_url

    extracted_url = extracted_url.strip()

    # Handle relative URLs
    if not extracted_url.startswith(("http://", "https://")):
        return urljoin(base_url, extracted_url)

    return extracted_url
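A couple of quick examples of what the normalization does (the paths are illustrative):

process_url("https://www.iitb.ac.in/", "/newsevents/notice.pdf")
# -> "https://www.iitb.ac.in/newsevents/notice.pdf"

process_url("https://www.iitb.ac.in/", "https://cdn.example.edu/file.pdf")
# -> "https://cdn.example.edu/file.pdf"  (already absolute, returned unchanged)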
Edge Case 3: JavaScript URL Madness (IIT Roorkee)
IIT Roorkee embeds JavaScript in href attributes:
<a href="window.open('/events/workshop.pdf')">View Event</a>
Solution:
if site["url"] == "https://www.iitr.ac.in/":
    for entry in data:
        if "upcoming_Event_url" in entry and "window.open(" in entry["upcoming_Event_url"]:
            # Extract the actual URL from the JavaScript call
            match = re.search(r"window\.open\('([^']+)'\)", entry["upcoming_Event_url"])
            if match:
                entry["upcoming_Event_url"] = match.group(1)
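A quick sanity check of that pattern against the href from the example above:

import re

href = "window.open('/events/workshop.pdf')"
match = re.search(r"window\.open\('([^']+)'\)", href)
print(match.group(1))  # /events/workshop.pdf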
Performance: The Numbers
Before (Sequential with Requests + BeautifulSoup):
⏱️ Time: 4 hours, 23 minutes
🐌 Average: ~150 seconds per site
💾 Memory: ~200MB stable
After (Async with Crawl4AI):
⏱️ Time: 12 minutes, 30 seconds
⚡ Average: ~7.5 seconds per site
💾 Memory: ~600MB peak (browser overhead)
20x faster with better reliability!
Why Not Fully Concurrent?
You might ask: “Why not scrape all 100+ sites simultaneously with asyncio.gather()?”
# Why I DON'T do this:
tasks = [scrape_college(crawler, site) for site in urls]
results = await asyncio.gather(*tasks)
I tried this. Results:
- ❌ IP bans from 12 colleges
- ❌ Memory explosion (100 browsers = 8GB+ RAM)
- ❌ Browser crashes
Lesson: For daily scraping of 100+ different domains, sequential with async is the sweet spot:
- Fast enough (12 minutes vs 4 hours)
- Respectful to websites (no hammering)
- Stable and maintainable
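If daily sequential runs ever stop being fast enough, the middle ground I'd reach for before full asyncio.gather() is bounded concurrency with asyncio.Semaphore: a handful of sites in flight at once instead of all 100+. A rough sketch (the limit of 5 and the scrape_college helper are assumptions, not what runs in production):

import asyncio

semaphore = asyncio.Semaphore(5)  # at most 5 sites in flight at a time

async def scrape_with_limit(crawler, site):
    async with semaphore:
        return await scrape_college(crawler, site)  # same per-site logic as before

async def scrape_all(crawler, sites):
    tasks = [scrape_with_limit(crawler, site) for site in sites]
    # return_exceptions=True keeps one failure from cancelling the rest
    return await asyncio.gather(*tasks, return_exceptions=True)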
Integration with the Full Pipeline
Here’s how async scraping fits into CollegeBuzz:
# In aictcscraper.py
async def extract_notices_and_events():
    mongo_handler = MongoDBHandler(uri=os.environ.get("MONGO_URI"))

    try:
        async with AsyncWebCrawler(verbose=True) as crawler:
            for site in urls:
                # Async scraping
                result = await crawler.arun(url=site["url"], config=config)
                data = json.loads(result.extracted_content)

                # Insert into MongoDB (with deduplication from Part 1)
                mongo_handler.insert_data(collection_name, records)
    except Exception as e:
        print(f"Error: {e}")
        raise
    finally:
        mongo_handler.close_connection()
The scraper feeds data into the MongoDB handler, which automatically:
- Deduplicates records (from Part 1)
- Updates timestamps
- Archives old data
Running the Scraper
Manual Trigger
python aictcscraper.py
Via Flask API
# In app.py
@app.route('/api/scrape', methods=['POST'])
def run_scraper():
    try:
        result = asyncio.run(extract_notices_and_events())
        return jsonify({"status": "success"}), 200
    except Exception as e:
        return jsonify({"status": "error", "message": str(e)}), 500
Trigger via HTTP:
curl -X POST http://localhost:8080/api/scrape
Scheduled with Cron
# crontab -e
0 2 * * * cd /path/to/collegebuzz && python aictcscraper.py >> scraper.log 2>&1
Key Takeaways for Async Scraping
1. Context Managers Are Essential
async with AsyncWebCrawler() as crawler:
    # Always use context managers
    result = await crawler.arun(url)
# Automatic cleanup, even on errors
2. Don’t Over-Optimize
Sequential scraping of 100+ sites in 12 minutes is good enough for daily jobs. Don’t chase 100% concurrency at the cost of stability.
3. Schema-Based Extraction > Manual Parsing
Declarative CSS schemas are:
- Easier to maintain
- Easier to debug
- Easier to scale
4. Handle Failures Gracefully
if not result.success:
    print(f"Failed: {result.error_message}")
    continue  # Don't crash entire pipeline
Resources & Credits
Huge thanks to @unclecode for creating Crawl4AI! This library made async scraping approachable.
Learn More:
- 📺 Crawl4AI YouTube Tutorials
- 🐙 GitHub: unclecode/crawl4ai
- 📚 Official Documentation
- 🐍 Python asyncio Docs
CollegeBuzz Series:
- Part 1: Archiving and deduplicating scraped data with MongoDB
- Part 2: Handling 100+ website scrapers with Python's asyncio (this post)
- Part 3: Coming soon
Closing Thoughts
Async scraping with Crawl4AI transformed CollegeBuzz from a 4-hour batch job to a 12-minute operation. But the real win was simplicity.
No threading nightmares. No multiprocessing complexity. Just clean async/await code with schema-based extraction.
If you’re scraping multiple websites in 2025, start with Crawl4AI and asyncio. Your future self will thank you.
Found this helpful? Hit that ❤️ and follow for Part 3!
Questions? Drop a comment or reach out @pradippanjiyar
This is Part 2 of the CollegeBuzz engineering series. All code examples are from my production system.