Extract structured company intelligence from any website — no paid scraping APIs needed.
Built with Pydantic schemas + OpenAI structured outputs, so the extraction schema lives in code, not in prompts. Adding a new field = adding one line to models.py. Zero prompt changes.
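As a rough illustration of what "schema in code" could look like, the sketch below defines a small Pydantic model whose field names are taken from the sample output in this README. The real contents of models.py are not shown here, so the class names `Founder` and `CompanyIntel`, the field types, and the commented-out `employee_count` field are all assumptions.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Founder(BaseModel):
    # Hypothetical nested model; field names follow the sample output.
    name: str
    role: str = ""
    linkedin: Optional[str] = None


class CompanyIntel(BaseModel):
    # Hypothetical top-level schema; the real models.py may differ.
    company_name: str
    description: str = ""
    founders: List[Founder] = Field(default_factory=list)
    year_founded: Optional[str] = None
    headquarters: Optional[str] = None
    is_open_source: bool = False
    # Extending the extraction would be one more line here, for example:
    # employee_count: Optional[int] = None


# Nested dicts are validated and converted into Founder instances.
record = CompanyIntel(
    company_name="Provident Estate",
    founders=[{"name": "Loai Al Fakir", "role": "CEO"}],
    year_founded="2008",
)
print(record.founders[0].name)
```

Because the model carries the types and defaults, the same class can validate whatever the extraction step returns, which is why no prompt text needs to change when a field is added.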
- 3-Tier Fetch — automatic escalation: requests → cloudscraper (Cloudflare bypass) → Playwright (full headless browser)
- Auto Sub-Page Discovery — finds and crawls /about, /team, /pricing, /contact pages automatically
- Pydantic-Driven Extraction — schema defined as Python models, enforced by OpenAI structured outputs
- Multi-Engine LinkedIn Search — DuckDuckGo → Brave Search → SearXNG fallback chain for founder profiles
- HTML Caching — 24-hour local cache to avoid redundant fetches during development
- Batch Mode — scrape a CSV of URLs and export results as JSON or CSV
- JS Rendering — optional Playwright support for JavaScript-heavy SPAs
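The escalation behind the 3-tier fetch (and, in the same spirit, the search-engine fallback chain) can be sketched as a loop over ordered strategies. This is a minimal illustration, not the code in fetcher.py: the function `fetch_with_escalation` and the three stand-in fetchers are hypothetical, and the stand-ins return canned strings so the example runs without network access or third-party packages.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes a URL and returns HTML, or raises on failure.
Fetcher = Callable[[str], str]


def fetch_with_escalation(
    url: str, tiers: Sequence[Tuple[str, Fetcher]]
) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, html) from the first success."""
    last_error: Optional[Exception] = None
    for name, fetch in tiers:
        try:
            html = fetch(url)
        except Exception as exc:  # any failure escalates to the next tier
            last_error = exc
            continue
        if html and html.strip():
            return name, html
        last_error = ValueError(f"{name} returned an empty body")
    raise RuntimeError(f"all tiers failed for {url}") from last_error


# Stand-in fetchers (hypothetical) so the sketch runs offline.
def plain_http(url: str) -> str:
    raise ConnectionError("403 Forbidden")


def cloudflare_aware(url: str) -> str:
    return "<html><body>ok</body></html>"


def headless_browser(url: str) -> str:
    return "<html><body>rendered</body></html>"


tiers = [
    ("requests", plain_http),
    ("cloudscraper", cloudflare_aware),
    ("playwright", headless_browser),
]
tier_used, html = fetch_with_escalation("https://example.com", tiers)
print(tier_used)  # cloudscraper
```

Ordering the tiers from cheapest to most expensive means the headless browser is only started when the lighter clients have already failed.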
# Clone
git clone https://github.com/Autonitia/web-intel.git
cd web-intel
# Setup
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
playwright install chromium
# Configure
cp .env.example .env
# Edit .env and add your OpenAI API key
# Run
python -m web_intel https://example.com

# Single URL
python -m web_intel https://providentestate.com
# Batch mode (CSV with a 'url' column)
python -m web_intel --batch examples/batch_urls.csv
# Export as CSV
python -m web_intel https://example.com --output csv
# Export both JSON and CSV
python -m web_intel https://example.com --output both
# Force JS rendering via Playwright
python -m web_intel https://spa-heavy-site.com --js
# Skip cache
python -m web_intel https://example.com --no-cache
# Clear all cached HTML
python -m web_intel --clear-cache
# Quiet mode (no progress logs)
python -m web_intel https://example.com --quiet

{
  "company_name": "Provident Estate",
  "description": "Your one-stop for all real estate services, including selling, renting, snagging, conveyancing, mortgages, property management, & expert property consultants.",
  "founders": [
    {
      "name": "Loai Al Fakir",
      "role": "CEO",
      "linkedin": "https://ae.linkedin.com/in/loaifakir"
    }
  ],
  "social_media_links": {
    "website": "https://providentestate.com/"
  },
  "features": [
    "Property Management",
    "Mortgages",
    "Conveyancing",
    "Short Term Rentals",
    "Property Snagging",
    "Partner Program",
    "Currency Exchange"
  ],
  "contact": {"email": "", "phone": "", "address": ""},
  "year_founded": "2008",
  "headquarters": "Dubai, UAE",
  "is_open_source": false
}

web-intel/
├── .env.example          # API key template
├── requirements.txt
├── examples/
│   └── batch_urls.csv    # Sample batch input
├── output/               # JSON + CSV exports
└── web_intel/
    ├── models.py         # Pydantic schema — the single source of truth
    ├── config.py         # Settings loaded from .env
    ├── fetcher.py        # 3-tier fetch with auto-escalation
    ├── cleaner.py        # HTML → clean text + meta tags + links
    ├── crawler.py        # Auto-discover relevant sub-pages
    ├── search.py         # Multi-engine LinkedIn search
    ├── extractor.py      # OpenAI structured outputs via Pydantic
    ├── export.py         # JSON + CSV export
    ├── cache.py          # 24hr HTML cache
    ├── pipeline.py       # Orchestrates the full pipeline
    └── cli.py            # CLI entry point
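
The tree lists cache.py as a 24-hour HTML cache. One plausible way to build such a cache with only the standard library is sketched below. The file layout (one file per URL, named by a SHA-256 hash) and the helper names `cache_path`, `read_cache`, and `write_cache` are assumptions for illustration, not the project's actual implementation.

```python
import hashlib
import tempfile
import time
from pathlib import Path
from typing import Optional

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24 hours, as stated in this README


def cache_path(cache_dir: Path, url: str) -> Path:
    # Hash the URL so any URL maps to a safe, fixed-length file name.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def read_cache(cache_dir: Path, url: str, now: Optional[float] = None) -> Optional[str]:
    """Return cached HTML if present and younger than the TTL, else None."""
    path = cache_path(cache_dir, url)
    if not path.exists():
        return None
    age = (now if now is not None else time.time()) - path.stat().st_mtime
    if age > CACHE_TTL_SECONDS:
        return None  # expired entries are treated as a miss
    return path.read_text(encoding="utf-8")


def write_cache(cache_dir: Path, url: str, html: str) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path(cache_dir, url).write_text(html, encoding="utf-8")


cache_dir = Path(tempfile.mkdtemp())
write_cache(cache_dir, "https://example.com", "<html>hi</html>")
fresh = read_cache(cache_dir, "https://example.com")
# Pretend 25 hours have passed to show the expiry path.
stale = read_cache(cache_dir, "https://example.com", now=time.time() + 25 * 3600)
print(fresh is not None, stale is None)  # True True
```

Using the file's modification time as the timestamp avoids a separate index file, and a flag such as --no-cache would simply skip the `read_cache` call.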
- Fetch — tries plain HTTP, escalates to cloudscraper (Cloudflare bypass), then Playwright (headless browser)
- Discover — scans links for relevant sub-pages (/about, /team, /pricing, /contact)
- Clean — strips scripts, styles, and noise; extracts meta tags and all links
- Extract — sends cleaned content to OpenAI with the Pydantic schema enforced via structured outputs
- Enrich — searches DuckDuckGo/Brave/SearXNG for founder LinkedIn profiles
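
The Discover step above can be sketched with the standard library alone: collect every link on the page, resolve it against the base URL, and keep same-site links whose path looks relevant. The helper names `LinkCollector` and `discover_sub_pages`, and the choice to match on path prefixes, are assumptions; crawler.py may select pages differently.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse

# Path prefixes named in this README; prefix matching is an assumption.
RELEVANT_PREFIXES = ("/about", "/team", "/pricing", "/contact")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_sub_pages(base_url: str, html: str) -> List[str]:
    """Return same-site links whose path starts with a relevant prefix."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    found: List[str] = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.netloc != base_host:
            continue  # skip external sites
        if parsed.path.lower().startswith(RELEVANT_PREFIXES) and absolute not in found:
            found.append(absolute)
    return found


sample = """
<a href="/about-us">About</a>
<a href="/blog/post-1">Blog</a>
<a href="https://other.example/team">Elsewhere</a>
<a href="contact">Contact</a>
"""
print(discover_sub_pages("https://example.com/", sample))
# ['https://example.com/about-us', 'https://example.com/contact']
```

Prefix matching also picks up variants such as /about-us, at the cost of occasional false positives that a stricter rule would exclude.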
Add to .env for broader founder LinkedIn search coverage:
# Brave Search — free tier, 2000 queries/month
BRAVE_API_KEY=your-key
# SearXNG — self-hosted meta-search engine
SEARXNG_URL=http://localhost:8080

License: MIT