web-intel

Extract structured company intelligence from any website — no paid scraping APIs needed.

Built with Pydantic schemas + OpenAI structured outputs, so the extraction schema lives in code, not in prompts. Adding a new field = adding one line to models.py. Zero prompt changes.
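As a rough illustration of what "schema lives in code" can look like, here is a minimal sketch of a Pydantic model whose fields mirror the Example Output further down. The class names (`Founder`, `Contact`, `CompanyIntel`) and the defaults are assumptions made for this example; the actual contents of models.py may differ. The OpenAI Python SDK can take a Pydantic model as the response format for structured outputs, which is what would make a one-line field addition sufficient.

```python
from typing import Dict, List

from pydantic import BaseModel, Field


class Founder(BaseModel):
    name: str
    role: str = ""
    linkedin: str = ""


class Contact(BaseModel):
    email: str = ""
    phone: str = ""
    address: str = ""


class CompanyIntel(BaseModel):
    """Hypothetical top-level schema; field names follow the Example Output."""

    company_name: str
    description: str = ""
    founders: List[Founder] = Field(default_factory=list)
    social_media_links: Dict[str, str] = Field(default_factory=dict)
    features: List[str] = Field(default_factory=list)
    contact: Contact = Field(default_factory=Contact)
    year_founded: str = ""
    headquarters: str = ""
    is_open_source: bool = False
    # Adding a field would be a one-line change here, for example a
    # hypothetical `employee_count: int = 0`.


# Nested dicts are validated and converted into the nested models.
record = CompanyIntel(
    company_name="Provident Estate",
    founders=[{"name": "Loai Al Fakir", "role": "CEO"}],
)
```

Because every field except `company_name` has a default, a sparse page still yields a valid record instead of a validation error.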

Features

  • 3-Tier Fetch — automatic escalation: requests → cloudscraper (Cloudflare bypass) → Playwright (full headless browser)
  • Auto Sub-Page Discovery — finds and crawls /about, /team, /pricing, /contact pages automatically
  • Pydantic-Driven Extraction — schema defined as Python models, enforced by OpenAI structured outputs
  • Multi-Engine LinkedIn Search — DuckDuckGo → Brave Search → SearXNG fallback chain for founder profiles
  • HTML Caching — 24-hour local cache to avoid redundant fetches during development
  • Batch Mode — scrape a CSV of URLs and export results as JSON or CSV
  • JS Rendering — optional Playwright support for JavaScript-heavy SPAs
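
The escalation idea behind the 3-tier fetch can be sketched with plain callables. This is not the code in fetcher.py: the function names, the `looks_blocked` heuristic, and its 200-character threshold are all assumptions for illustration, and the three tiers are stubbed so the example runs without network access.

```python
from typing import Callable, Iterable, Optional, Tuple

Fetcher = Callable[[str], str]


def looks_blocked(html: str) -> bool:
    """Assumed heuristic: an empty or very short body counts as a failed fetch."""
    return len(html.strip()) < 200


def fetch_with_escalation(url: str, tiers: Iterable[Tuple[str, Fetcher]]) -> Tuple[str, str]:
    """Try each tier in order and return (tier_name, html) from the first usable one."""
    last_error: Optional[Exception] = None
    for name, fetch in tiers:
        try:
            html = fetch(url)
        except Exception as exc:  # any failure escalates to the next tier
            last_error = exc
            continue
        if not looks_blocked(html):
            return name, html
    raise RuntimeError(f"all fetch tiers failed for {url}") from last_error


# Stub tiers standing in for requests, cloudscraper, and Playwright.
def plain_http(url: str) -> str:
    raise ConnectionError("403 Forbidden")


def cloudflare_tier(url: str) -> str:
    return "<html></html>"  # too short, treated as a challenge page


def browser_tier(url: str) -> str:
    return "<html><body>" + "content " * 50 + "</body></html>"


tier_used, page = fetch_with_escalation(
    "https://example.com",
    [("requests", plain_http), ("cloudscraper", cloudflare_tier), ("playwright", browser_tier)],
)
```

Ordering the tiers from cheapest to most expensive means the headless browser only starts when the lighter clients fail or return an unusable page.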

Quickstart

# Clone
git clone https://github.com/Autonitia/web-intel.git
cd web-intel

# Setup
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
playwright install chromium

# Configure
cp .env.example .env
# Edit .env and add your OpenAI API key

# Run
python -m web_intel https://example.com

Usage

# Single URL
python -m web_intel https://providentestate.com

# Batch mode (CSV with a 'url' column)
python -m web_intel --batch examples/batch_urls.csv

# Export as CSV
python -m web_intel https://example.com --output csv

# Export both JSON and CSV
python -m web_intel https://example.com --output both

# Force JS rendering via Playwright
python -m web_intel https://spa-heavy-site.com --js

# Skip cache
python -m web_intel https://example.com --no-cache

# Clear all cached HTML
python -m web_intel --clear-cache

# Quiet mode (no progress logs)
python -m web_intel https://example.com --quiet
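
For batch mode, the input only needs a `url` column. A minimal sketch of reading such a file with the standard library follows; `read_batch_urls` is a hypothetical helper, and dropping duplicate URLs is an assumption rather than documented behaviour.

```python
import csv
import io
from typing import List, TextIO


def read_batch_urls(handle: TextIO) -> List[str]:
    """Return unique, non-empty values from the 'url' column, in file order."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "url" not in reader.fieldnames:
        raise ValueError("batch CSV needs a 'url' column")
    seen = set()
    urls = []
    for row in reader:
        url = (row.get("url") or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# In-memory stand-in for a file such as examples/batch_urls.csv.
sample = io.StringIO(
    "url\n"
    "https://example.com\n"
    "\n"
    "https://providentestate.com\n"
    "https://example.com\n"
)
batch_urls = read_batch_urls(sample)
```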

Example Output

{
    "company_name": "Provident Estate",
    "description": "Your one-stop for all real estate services, including selling, renting, snagging, conveyancing, mortgages, property management, & expert property consultants.",
    "founders": [
        {
            "name": "Loai Al Fakir",
            "role": "CEO",
            "linkedin": "https://ae.linkedin.com/in/loaifakir"
        }
    ],
    "social_media_links": {
        "website": "https://providentestate.com/"
    },
    "features": [
        "Property Management",
        "Mortgages",
        "Conveyancing",
        "Short Term Rentals",
        "Property Snagging",
        "Partner Program",
        "Currency Exchange"
    ],
    "contact": {"email": "", "phone": "", "address": ""},
    "year_founded": "2008",
    "headquarters": "Dubai, UAE",
    "is_open_source": false
}
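
The JSON above is nested, so a CSV export has to flatten it somehow. One possible approach, shown here as a sketch and not as the logic in export.py, is to join nested keys with dots and store lists as JSON strings. The `flatten` and `to_csv` helpers are hypothetical.

```python
import csv
import io
import json
from typing import Any, Dict, List


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, str]:
    """Flatten nested dicts into dotted keys; lists become JSON strings."""
    flat: Dict[str, str] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            flat[name] = json.dumps(value, ensure_ascii=False)
        else:
            flat[name] = "" if value is None else str(value)
    return flat


def to_csv(records: List[Dict[str, Any]]) -> str:
    """Write flattened records to CSV text, using the first record's columns."""
    rows = [flatten(r) for r in records]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


flat_row = flatten(
    {
        "company_name": "Provident Estate",
        "contact": {"email": "", "phone": ""},
        "features": ["Mortgages", "Conveyancing"],
        "is_open_source": False,
    }
)
csv_text = to_csv([{"company_name": "Provident Estate", "contact": {"email": ""}}])
```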

Project Structure

web-intel/
├── .env.example          # API key template
├── requirements.txt
├── examples/
│   └── batch_urls.csv    # Sample batch input
├── output/               # JSON + CSV exports
└── web_intel/
    ├── models.py         # Pydantic schema — the single source of truth
    ├── config.py         # Settings loaded from .env
    ├── fetcher.py        # 3-tier fetch with auto-escalation
    ├── cleaner.py        # HTML → clean text + meta tags + links
    ├── crawler.py        # Auto-discover relevant sub-pages
    ├── search.py         # Multi-engine LinkedIn search
    ├── extractor.py      # OpenAI structured outputs via Pydantic
    ├── export.py         # JSON + CSV export
    ├── cache.py          # 24hr HTML cache
    ├── pipeline.py       # Orchestrates the full pipeline
    └── cli.py            # CLI entry point
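
To make the 24-hour HTML cache concrete, here is a small file-based sketch. It is not the code in cache.py: hashing the URL into a file name and using the file's modification time as the timestamp are assumptions chosen to keep the example short.

```python
import hashlib
import tempfile
import time
from pathlib import Path
from typing import Optional

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24 hours


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable file name inside the cache directory."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def write_cache(cache_dir: Path, url: str, html: str) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path(cache_dir, url).write_text(html, encoding="utf-8")


def read_cache(cache_dir: Path, url: str, now: Optional[float] = None) -> Optional[str]:
    """Return cached HTML, or None if the entry is missing or older than the TTL."""
    path = cache_path(cache_dir, url)
    if not path.exists():
        return None
    current = time.time() if now is None else now
    if current - path.stat().st_mtime > CACHE_TTL_SECONDS:
        return None
    return path.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "cache"
    write_cache(demo_dir, "https://example.com", "<html>hello</html>")
    fresh = read_cache(demo_dir, "https://example.com")
    # Pretend 25 hours have passed to exercise the expiry branch.
    stale = read_cache(demo_dir, "https://example.com", now=time.time() + 25 * 60 * 60)
    missing = read_cache(demo_dir, "https://example.org")
```

The optional `now` parameter exists only so expiry can be tested without waiting.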

How It Works

  1. Fetch — tries plain HTTP, escalates to cloudscraper (Cloudflare bypass), then Playwright (headless browser)
  2. Discover — scans links for relevant sub-pages (/about, /team, /pricing, /contact)
  3. Clean — strips scripts, styles, and noise; extracts meta tags and all links
  4. Extract — sends cleaned content to OpenAI with the Pydantic schema enforced via structured outputs
  5. Enrich — searches DuckDuckGo/Brave/SearXNG for founder LinkedIn profiles
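
The Discover step can be approximated with the standard library alone. The sketch below collects anchor tags, resolves them against the base URL, and keeps same-host links whose path mentions one of the sub-page keywords. The keyword matching rule and the names `LinkCollector` and `discover_subpages` are assumptions, not the logic in crawler.py.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urldefrag, urljoin, urlparse

KEYWORDS = ("about", "team", "pricing", "contact")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_subpages(base_url: str, html: str) -> List[str]:
    """Return same-host links whose path contains a keyword, without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    found: List[str] = []
    for href in collector.hrefs:
        absolute = urldefrag(urljoin(base_url, href)).url  # resolve and drop #fragment
        parsed = urlparse(absolute)
        if parsed.netloc != base_host:
            continue  # skips external sites and mailto: links
        if any(word in parsed.path.lower() for word in KEYWORDS) and absolute not in found:
            found.append(absolute)
    return found


homepage = (
    '<a href="/about">About</a>'
    '<a href="/blog">Blog</a>'
    '<a href="https://other.com/team">Partner</a>'
    '<a href="contact/#form">Contact</a>'
    '<a href="mailto:hi@example.com">Mail</a>'
    '<a href="/about">About again</a>'
)
subpages = discover_subpages("https://example.com/", homepage)
```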

Optional: Extra Search Engines

Add to .env for broader founder LinkedIn search coverage:

# Brave Search — free tier, 2000 queries/month
BRAVE_API_KEY=your-key

# SearXNG — self-hosted meta-search engine
SEARXNG_URL=http://localhost:8080
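
How these two settings could feed the fallback chain is sketched below. The helper names are hypothetical, the search engines are stubbed so nothing touches the network, and the rule that DuckDuckGo is always tried first while the other engines join only when configured is an assumption drawn from the order given in the Features list.

```python
from typing import Callable, List, Mapping, Optional, Tuple

SearchFn = Callable[[str], List[str]]


def build_engine_chain(env: Mapping[str, str]) -> List[str]:
    """Return engine names in fallback order, based on which settings are present."""
    chain = ["duckduckgo"]
    if env.get("BRAVE_API_KEY"):
        chain.append("brave")
    if env.get("SEARXNG_URL"):
        chain.append("searxng")
    return chain


def first_linkedin_profile(query: str, engines: List[Tuple[str, SearchFn]]) -> Optional[str]:
    """Ask each engine in turn and return the first LinkedIn profile URL found."""
    for _name, search in engines:
        try:
            results = search(query)
        except Exception:
            continue  # engine unavailable, fall through to the next one
        for url in results:
            if "linkedin.com/in/" in url:
                return url
    return None


# Stub engines: the first fails, the second returns a usable hit.
def failing_engine(query: str) -> List[str]:
    raise TimeoutError("rate limited")


def working_engine(query: str) -> List[str]:
    return ["https://example.com/news", "https://ae.linkedin.com/in/loaifakir"]


chain = build_engine_chain({"BRAVE_API_KEY": "your-key"})
profile = first_linkedin_profile(
    "Loai Al Fakir Provident Estate LinkedIn",
    [("duckduckgo", failing_engine), ("brave", working_engine)],
)
```

In a real run the mapping passed to `build_engine_chain` would be `os.environ` after the .env file has been loaded.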

License

MIT
