Building a Roboflow Universe Search Agent: Automating ML Model Discovery

The Problem

As a machine learning enthusiast, I often find myself browsing Roboflow Universe looking for pre-trained models. But manually searching, clicking through pages, and copying API endpoints is tedious. I wanted a way to:

  • Search for models by keywords
  • Extract detailed information (metrics, classes, API endpoints)
  • Get structured data I could use programmatically

So I built a Python web scraper that does exactly that! 🚀

What It Does

The Roboflow Universe Search Agent is a Python tool that:

✅ Searches Roboflow Universe with custom keywords

✅ Extracts model details (title, author, metrics, classes)

✅ Finds API endpoints using multiple extraction strategies

✅ Outputs structured JSON data

✅ Handles retries and errors gracefully

The Challenge: Finding API Endpoints

The trickiest part was reliably extracting API endpoints. Roboflow displays them in various places:

  • JavaScript code snippets
  • Model ID variables
  • Input fields
  • Page text
  • Legacy endpoint formats

I needed a robust solution that wouldn’t break if the website structure changed.

The Solution: Multi-Strategy Extraction

Instead of relying on a single method, I implemented 6 different extraction strategies with fallbacks:

Strategy 1: JavaScript Code Blocks

The most reliable source – API endpoints appear in code snippets:

js_patterns = [
    r'url:\s*["\']https://serverless\.roboflow\.com/([^"\'?\s]+)["\']',
    r'"https://serverless\.roboflow\.com/([^"\'?\s]+)"',
    r'https://serverless\.roboflow\.com/([a-z0-9-_]+/\d+)',
]
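As a quick sanity check, these patterns can be exercised against a made-up page-source snippet (the HTML below is illustrative, not real Roboflow markup):

```python
import re

# Strategy 1 patterns, with escaping intact
js_patterns = [
    r'url:\s*["\']https://serverless\.roboflow\.com/([^"\'?\s]+)["\']',
    r'"https://serverless\.roboflow\.com/([^"\'?\s]+)"',
    r'https://serverless\.roboflow\.com/([a-z0-9-_]+/\d+)',
]

# Hypothetical code snippet as it might appear in the page source
sample_html = '''
axios({
    method: "POST",
    url: "https://serverless.roboflow.com/basketball-detection/1",
})
'''

def extract_endpoint(html):
    """Return the first model path matched by any pattern, or None."""
    for pattern in js_patterns:
        match = re.search(pattern, html)
        if match:
            return match.group(1)
    return None

print(extract_endpoint(sample_html))  # basketball-detection/1
```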

Strategy 2: Model ID Patterns

Extract from JavaScript variables:

model_id_patterns = [
    r'model_id["\']?\s*[:=]\s*["\']([a-z0-9-_]+/\d+)["\']',
    r'MODEL_ENDPOINT["\']?\s*[:=]\s*["\']([a-z0-9-_]+/\d+)["\']',
]

Strategy 3: Input Fields & Textareas

Check form elements and code blocks:

input_selectors = [
    "input[value*='serverless.roboflow.com']",
    "textarea",
    "code",
]

Strategy 4: Page Text Search

Fallback to visible text on the page
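A minimal sketch of what this fallback might look like, assuming the visible text has already been pulled with something like `page.inner_text("body")` (the function name here is illustrative, not the actual implementation):

```python
import re

def endpoint_from_text(page_text):
    """Fallback: scan visible page text for a serverless endpoint URL."""
    match = re.search(
        r'https://serverless\.roboflow\.com/([a-z0-9-_]+/\d+)', page_text
    )
    return match.group(1) if match else None

text = "Use the hosted API: https://serverless.roboflow.com/soccer-ball/2"
print(endpoint_from_text(text))  # soccer-ball/2
```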

Strategy 5: Legacy Endpoints

Support older endpoint formats:

  • detect.roboflow.com
  • classify.roboflow.com
  • segment.roboflow.com

Strategy 6: URL Construction

Build endpoint from page URL structure if all else fails
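Tying the strategies together, the dispatch logic can be as simple as trying each extractor in priority order until one returns a value. Here is a condensed sketch with stand-in extractors (the real tool has six strategies; these three are simplified illustrations, and the version number guessed in the URL fallback is an assumption):

```python
import re

ENDPOINT_RE = re.compile(r'https://serverless\.roboflow\.com/([a-z0-9-_]+/\d+)')

def from_code_blocks(source):
    """Strategies 1-3: scan raw page source (scripts, variables, inputs)."""
    m = ENDPOINT_RE.search(source)
    return m.group(1) if m else None

def from_page_text(text):
    """Strategy 4: scan visible text."""
    m = ENDPOINT_RE.search(text)
    return m.group(1) if m else None

def from_url(url):
    """Strategy 6: rebuild from the page URL; version '1' is a guess."""
    m = re.search(r'universe\.roboflow\.com/[^/]+/([a-z0-9-_]+)', url)
    return f"{m.group(1)}/1" if m else None

def extract_api_endpoint(source, text, url):
    for strategy in (lambda: from_code_blocks(source),
                     lambda: from_page_text(text),
                     lambda: from_url(url)):
        endpoint = strategy()
        if endpoint:
            return endpoint
    return None

# Only the URL fallback fires here:
print(extract_api_endpoint(
    "<html></html>", "no endpoint shown",
    "https://universe.roboflow.com/acme/basketball-detection"))
# basketball-detection/1
```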

This multi-strategy approach ensures we find the API endpoint even if the page structure changes!

Tech Stack

  • Playwright: Browser automation (more reliable than requests for dynamic content)
  • Python 3.7+: Core language
  • Regex: Pattern matching for extraction

Usage

Basic Example

# Search for basketball detection models
SEARCH_KEYWORDS="basketball model object detection" \
MAX_PROJECTS=5 \
python roboflow_search_agent.py

JSON Output

# Get structured JSON output
SEARCH_KEYWORDS="soccer ball instance segmentation" \
OUTPUT_JSON=true \
python roboflow_search_agent.py

Example Output

[
  {
    "project_title": "Basketball Detection",
    "url": "https://universe.roboflow.com/workspace/basketball-detection",
    "author": "John Doe",
    "project_type": "Object Detection",
    "has_model": true,
    "mAP": "85.2%",
    "precision": "87.1%",
    "recall": "83.5%",
    "training_images": "5000",
    "classes": ["basketball", "player"],
    "api_endpoint": "https://serverless.roboflow.com/basketball-detection/1",
    "model_identifier": "workspace/basketball-detection"
  }
]
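Because the output is plain JSON, downstream filtering is straightforward. A small sketch of picking the strongest model by mAP (the entries below are abbreviated copies of the example output, and the second entry is invented for contrast):

```python
# Results as produced by the scraper's JSON output (abbreviated)
results = [
    {"project_title": "Basketball Detection", "mAP": "85.2%",
     "api_endpoint": "https://serverless.roboflow.com/basketball-detection/1"},
    {"project_title": "Hoop Tracker", "mAP": "79.8%",
     "api_endpoint": "https://serverless.roboflow.com/hoop-tracker/3"},
]

def map_score(entry):
    """Parse '85.2%' -> 85.2; treat missing or non-numeric metrics as 0."""
    value = entry.get("mAP", "0").rstrip("%")
    try:
        return float(value)
    except ValueError:
        return 0.0

best = max(results, key=map_score)
print(best["project_title"])  # Basketball Detection
```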

Key Features

1. Intelligent Search

The tool applies the “Has a Model” filter automatically and handles keyword prioritization.

2. Comprehensive Data Extraction

Extracts:

  • Performance metrics (mAP@50, Precision, Recall)
  • Training data info (image count, classes)
  • Project metadata (author, update time, tags)
  • API endpoints (the hard part!)

3. Robust Error Handling

  • Automatic retries (3 attempts)
  • Graceful failure handling
  • Timeout management
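The retry behavior can be sketched as a small wrapper (this is a simplified illustration of the "3 attempts" idea, not the tool's exact code):

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn(); on failure, retry up to `attempts` times, pausing between tries."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    raise last_error

# Usage: wrap any flaky scraping step, e.g.
# data = with_retries(lambda: scrape_project(page, url))
```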

4. Flexible Output

  • Human-readable console output
  • JSON format for programmatic use
  • Configurable via environment variables
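Reading that configuration can look like the sketch below; the variable names come from the usage examples earlier, while the defaults are illustrative assumptions:

```python
import os

def load_config():
    """Read scraper settings from environment variables, with defaults."""
    return {
        "keywords": os.environ.get("SEARCH_KEYWORDS", "object detection"),
        "max_projects": int(os.environ.get("MAX_PROJECTS", "5")),
        "output_json": os.environ.get("OUTPUT_JSON", "false").lower() == "true",
    }

os.environ["MAX_PROJECTS"] = "10"
os.environ["OUTPUT_JSON"] = "true"
config = load_config()
print(config["max_projects"], config["output_json"])  # 10 True
```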

Technical Highlights

Browser Automation with Playwright

def connect_browser(headless=True):
    playwright = sync_playwright().start()
    browser = playwright.chromium.launch(
        headless=headless,
        args=["--no-sandbox", "--disable-setuid-sandbox"]
    )
    context = browser.new_context(viewport={"width": 1440, "height": 900})
    page = context.new_page()
    return playwright, browser, context, page

Smart Scrolling

Instead of fixed waits, the scraper detects when content stops loading:

def scroll_page(page, max_scrolls=15):
    last_height = 0
    for i in range(max_scrolls):
        page.evaluate("window.scrollBy(0, window.innerHeight)")
        page.wait_for_timeout(800)
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

Lessons Learned

  1. Multiple Strategies > Single Strategy: Having fallbacks makes the scraper much more reliable
  2. Playwright > Requests: For dynamic sites, browser automation is essential
  3. Pattern Matching: Regex patterns need careful testing with real data
  4. Error Handling: Web scraping is fragile – always have retry logic

Use Cases

  • Research: Quickly find models for specific tasks
  • API Discovery: Extract endpoints for integration
  • Model Comparison: Compare metrics across multiple models
  • Automation: Integrate into ML pipelines

Installation

# Clone the repository
git clone https://github.com/SumitS10/Roboflow-.git
cd Roboflow-

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium

Future Improvements

  • [ ] Add filtering by metrics (e.g., mAP > 80%)
  • [ ] Support for batch processing multiple searches
  • [ ] Export to CSV/Excel
  • [ ] Add model comparison features
  • [ ] Cache results to avoid re-scraping

Conclusion

Building this scraper taught me a lot about web scraping, browser automation, and handling edge cases. The multi-strategy approach for API extraction was key to making it reliable.

If you’re working with Roboflow models or need to automate model discovery, give it a try! Contributions and feedback are welcome.

Links

🔗 GitHub Repository: https://github.com/SumitS10/Roboflow-.git
🌐 Roboflow Universe: universe.roboflow.com

Tags: #python #webscraping #machinelearning #roboflow #playwright #automation #api #ml
