Scrapit Documentation
Everything you need to scrape any website with just a YAML file — including AI agent integration, transforms, validation, and storage.
Installation
Core
$ pip install scrapit-scraper
Optional extras
# Playwright backend (JS-rendered sites)
$ pip install scrapit-scraper[playwright]
$ playwright install chromium

# AI integrations
$ pip install scrapit-scraper[anthropic]     # Claude
$ pip install scrapit-scraper[openai]        # GPT-4o
$ pip install scrapit-scraper[langchain]     # LangChain / CrewAI
$ pip install scrapit-scraper[llamaindex]    # LlamaIndex
$ pip install scrapit-scraper[mcp]           # MCP server

# Storage backends
$ pip install scrapit-scraper[mongo]
$ pip install scrapit-scraper[excel]
$ pip install scrapit-scraper[sheets]

# Everything
$ pip install scrapit-scraper[all]
From source
$ git clone https://github.com/joaobenedetmachado/scrapit
$ cd scrapit
$ python -m venv .venv && source .venv/bin/activate
$ pip install -e .[dev]
Quick start
1. Create a directive
site: https://news.ycombinator.com
use: beautifulsoup
scrape:
  titles:
    - '.titleline > a'
    - attr: text
      all: true
  links:
    - '.titleline > a'
    - attr: href
      all: true
2. Run it
$ scrapit scrape hn --json
→ saved: output/hn.json
$ scrapit scrape hn --json --diff     # detect changes
$ scrapit scrape hn --csv --sqlite    # multiple outputs
$ scrapit scrape hn --preview         # print, don't save
3. Query results
$ scrapit query --backend sqlite --directive hn --limit 5
CLI reference
scrapit scrape
| Flag | Description |
|---|---|
| directive | Directive name or path to a YAML file (required positional argument) |
| --json | Save result as output/<name>.json |
| --csv | Append result to output/<name>.csv |
| --sqlite | Save to output/scrapit.db |
| --mongo | Save to MongoDB (requires MONGO_URI in .env) |
| --preview | Print result, do not save anything |
| --diff | Compare against previous JSON run, fire webhook on change |
scrapit batch
Run all directives in a folder. Accepts the same output flags as scrape.
$ scrapit batch scraper/directives/ --json --sqlite
scrapit list
List available directives in the default or a custom directory.
$ scrapit list
$ scrapit list --dir /path/to/my/directives
scrapit query
| Flag | Default | Description |
|---|---|---|
| --backend | sqlite | sqlite or mongo |
| --directive | — | Filter by directive name |
| --url | — | Filter by URL fragment |
| --limit | 20 | Max results |
scrapit cache
$ scrapit cache stats
$ scrapit cache clear
$ scrapit cache invalidate --url https://example.com
Environment (.env)
Create a .env file at the project root. All variables are optional.
# MongoDB
MONGO_URI=mongodb://localhost:27017
MONGO_DB=scrapit
MONGO_COLLECTION=results

# RabbitMQ
RABBITMQ_HOST=localhost
RABBITMQ_QUEUE=scrapit

# Output directory (default: ./output)
OUTPUT_DIR=./output
Directive structure
A directive is a YAML file that fully describes a scraping job. Place them anywhere and reference by name or path.
| Key | Type | Required | Description |
|---|---|---|---|
| site | string | required | URL to scrape |
| sites | list | optional | List of URLs (multi-site mode) |
| use | string | required | beautifulsoup or playwright |
| scrape | dict | required | Field definitions with selectors |
| transform | dict | optional | Transform pipeline per field |
| validate | dict | optional | Validation rules per field |
| paginate | dict | optional | Pagination config |
| follow | dict | optional | Spider / link-following config |
| headers | dict | optional | Custom HTTP headers |
| cookies | dict | optional | Cookies to send with request |
| proxy | string | optional | Proxy URL |
| cache | dict | optional | HTTP cache config (ttl in seconds) |
| retries | int | optional | Retry attempts on failure (default: 3) |
| notify | dict | optional | Webhook notification config |
| wait_for | string | optional | Playwright: CSS selector to wait for before parsing |
| screenshot | bool | optional | Playwright: save full-page screenshot |
scrape:
Defines the fields to extract. Each field is a two-item list: a CSS selector (or a list of fallback selectors), followed by an attr config.
scrape:
  # Single selector, get text
  title:
    - 'h1'
    - attr: text

  # Get an HTML attribute
  image:
    - 'img.hero'
    - attr: src

  # Return ALL matches as a list
  all_links:
    - 'a.result'
    - attr: href
      all: true

  # Fallback selectors: tries each in order
  price:
    - ['span.price-new', 'span.price', '.cost']
    - attr: text
attr values
| attr | Returns |
|---|---|
| text | Inner text of the element (default) |
| html | Inner HTML of the element |
| href | The href attribute |
| src | The src attribute |
| class | Full class string (e.g. "star-rating Three") |
| any HTML attr | Any attribute name: data-id, aria-label, etc. |
paginate:
Automatically follow "next page" links. Scrapit fetches each page and merges results into lists.
paginate:
  selector: 'a.next'   # CSS selector for the next-page link
  attr: href           # attribute to extract the URL from
  max_pages: 5         # safety limit
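The loop this config describes can be sketched in plain Python. This is a minimal illustration, not Scrapit's implementation: `scrape_paginated`, the `fetch` callable, and the `FAKE_SITE` data are all hypothetical stand-ins, with a dictionary lookup replacing real HTTP requests.

```python
def scrape_paginated(start_url, fetch, max_pages=5):
    """Follow next-page links and concatenate field values across pages.

    fetch(url) is assumed to return (fields, next_url), where next_url is
    None on the last page.
    """
    merged = {}
    seen = set()
    url = start_url
    pages = 0
    while url and url not in seen and pages < max_pages:
        seen.add(url)  # guard against next-page loops
        fields, next_url = fetch(url)
        for name, value in fields.items():
            values = value if isinstance(value, list) else [value]
            merged.setdefault(name, []).extend(values)
        url = next_url
        pages += 1
    return merged


# Fake three-page site used instead of real HTTP requests.
FAKE_SITE = {
    "/p1": ({"titles": ["a", "b"]}, "/p2"),
    "/p2": ({"titles": ["c"]}, "/p3"),
    "/p3": ({"titles": ["d"]}, None),
}

result = scrape_paginated("/p1", FAKE_SITE.__getitem__, max_pages=2)
print(result)
```

With `max_pages=2` the third page is never requested, which is the role of the safety limit.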
Fields with all: true are concatenated across pages.
follow: (spider mode)
Discover the links on a page that match a selector and scrape each one. Good for blog indexes, product catalogs, and documentation sites.
site: https://myblog.com/posts
use: beautifulsoup
follow:
  selector: 'a.post-link'
  attr: href
  max: 100
  same_domain: true
scrape:
  title:
    - 'h1'
    - attr: text
  body:
    - 'article'
    - attr: text
Returns a list of dicts, one per page scraped. Each item includes the url field automatically.
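The link-selection step behind max and same_domain might look roughly like the sketch below. It uses only the standard library; `select_links` is a hypothetical helper, and the assumption that duplicates are dropped is mine, not something the directive reference states.

```python
from urllib.parse import urljoin, urlparse


def select_links(base_url, hrefs, max_links=100, same_domain=True):
    """Resolve hrefs against base_url, drop duplicates and off-site links, cap the count."""
    base_host = urlparse(base_url).netloc
    picked = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # handles relative and absolute hrefs
        if same_domain and urlparse(absolute).netloc != base_host:
            continue
        if absolute not in picked:
            picked.append(absolute)
        if len(picked) >= max_links:
            break
    return picked


links = select_links(
    "https://myblog.com/posts",
    ["/posts/one", "two", "https://other.example/x", "/posts/one"],
)
print(links)
```

Each selected URL would then be scraped with the directive's scrape: block, giving one dict per page.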
headers / proxy / cache
headers:
  Accept-Language: en-US,en;q=0.9
  Referer: https://google.com
cookies:
  session_id: abc123
proxy: http://user:pass@proxy:8080
cache:
  ttl: 3600   # seconds. 0 to disable.
retries: 3    # exponential backoff

# Playwright only
wait_for: '#content'
screenshot: true
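For readers unfamiliar with the retry strategy named in the comment above, here is a generic sketch of retrying with exponential backoff. It is not Scrapit's code: `fetch_with_retries`, the one-second base delay, and the reading of retries as "extra attempts after the first" are assumptions made for illustration.

```python
import time


def fetch_with_retries(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); after each failure wait base_delay * 2**attempt, then retry."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise last_error


# Demo: a fetcher that fails twice, then succeeds. Delays are recorded, not slept.
calls = {"n": 0}
delays = []


def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return f"body of {url}"


body = fetch_with_retries(flaky, "https://example.com", retries=3, sleep=delays.append)
print(body, delays)
```

Passing `sleep` as a parameter keeps the demo instant and makes the delay schedule easy to inspect.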
notify:
Fire a webhook when --diff detects a change. Works with Slack, Discord, or any HTTP endpoint.
notify:
  webhook: https://hooks.slack.com/services/...
The webhook receives a POST with a JSON body containing the changed fields and the full new result.
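As an illustration of what "changed fields plus the full new result" could mean, the sketch below compares two runs and builds a JSON body. The `diff_results` helper and the `changes` / `result` key names are hypothetical; the actual payload layout is not specified here.

```python
import json


def diff_results(previous, current):
    """Return {field: {"old": ..., "new": ...}} for every field whose value changed."""
    changes = {}
    for key in sorted(set(previous) | set(current)):
        old, new = previous.get(key), current.get(key)
        if old != new:
            changes[key] = {"old": old, "new": new}
    return changes


previous = {"title": "Widget", "price": 12.99}
current = {"title": "Widget", "price": 10.49, "stock": "low"}

changes = diff_results(previous, current)
if changes:
    # Hypothetical payload layout: changed fields plus the full new result.
    payload = json.dumps({"changes": changes, "result": current})
    print(payload)
```

An unchanged run produces an empty dict, so no webhook would fire.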
Multi-site directives
Scrape multiple URLs with the same directive using sites: instead of site:.
sites:
  - https://books.toscrape.com/catalogue/page-1.html
  - https://books.toscrape.com/catalogue/page-2.html
  - https://books.toscrape.com/catalogue/page-3.html
use: beautifulsoup
scrape:
  titles:
    - 'h3 > a'
    - attr: title
      all: true
Transforms
Transforms run after scraping, per field, in order. Each step receives the output of the previous one.
| Transform | Argument | Description | Example |
|---|---|---|---|
| strip | — | Remove leading/trailing whitespace | - strip |
| lower / upper / title | — | Change case | - lower |
| int / float | — | Parse number, strips non-numeric chars | - float |
| regex | pattern | Extract first regex match | {regex: '\d+'} |
| regex_group | {pattern, group} | Extract a capture group | {regex_group: {pattern: '(\w+)', group: 1}} |
| replace | {old: new} | String substitution (multiple pairs) | {replace: {"£": ""}} |
| split / join | separator | Split string to list / join list to string | {split: ","} |
| first / last | — | Pick first or last item from list | - first |
| default | value | Fallback if value is None | {default: "N/A"} |
| slice | {start, end} | Substring or sublist | {slice: {end: 200}} |
| prepend / append | string | Add text before or after value | {prepend: "https:"} |
| remove_tags | — | Strip HTML tags | - remove_tags |
| template | "prefix {value}" | String template with {value} | {template: "USD {value}"} |
| slugify | — | "Hello World" → "hello-world" | - slugify |
| truncate | N | Cut at N chars without breaking words | {truncate: 150} |
| normalize_whitespace | — | Collapse multiple spaces/newlines | - normalize_whitespace |
transform:
  price:
    - strip                        # " £ 12,99 " → "£ 12,99"
    - replace: {"£": "", ",": "."}
    - float                        # → 12.99
  slug:
    - strip
    - normalize_whitespace
    - slugify                      # "My Title!" → "my-title"
  summary:
    - remove_tags
    - normalize_whitespace
    - truncate: 200
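To make the "each step receives the output of the previous one" rule concrete, here is a small stand-alone sketch of such a pipeline. It covers only a handful of the transforms listed above, and every function name in it (`apply_transforms`, `_slugify`, and so on) is hypothetical rather than part of Scrapit.

```python
import re


def _slugify(value):
    # Lowercase, turn runs of non-alphanumerics into "-", trim stray dashes.
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def _to_float(value):
    # Keep digits, sign and decimal point, then parse.
    return float(re.sub(r"[^0-9.\-]", "", value))


def _replace(value, pairs):
    for old, new in pairs.items():
        value = value.replace(old, new)
    return value


# Transforms written as a bare name, e.g. "- strip".
NO_ARG = {
    "strip": str.strip,
    "lower": str.lower,
    "float": _to_float,
    "slugify": _slugify,
    "normalize_whitespace": lambda v: re.sub(r"\s+", " ", v).strip(),
}

# Transforms written as a one-key mapping, e.g. "- replace: {...}".
WITH_ARG = {
    "replace": _replace,
    "prepend": lambda v, text: text + v,
}


def apply_transforms(value, steps):
    """Run each step on the output of the previous one."""
    for step in steps:
        if isinstance(step, str):
            value = NO_ARG[step](value)
        else:
            name, arg = next(iter(step.items()))
            value = WITH_ARG[name](value, arg)
    return value


price = apply_transforms(" £ 12,99 ", ["strip", {"replace": {"£": "", ",": "."}}, "float"])
slug = apply_transforms("  My   Title! ", ["strip", "normalize_whitespace", "slugify"])
print(price, slug)
```

The step lists mirror what a YAML parser would produce for the price and slug fields above: bare strings for argument-free transforms and single-key dicts for the rest.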
Validation
Runs after transforms. Invalid records are flagged with _valid: false and _errors. The scrape still completes.
| Rule | Type | Description |
|---|---|---|
| required | bool | Must not be None |
| not_empty | bool | Must not be empty string or list |
| type | string | str, int, float, list, bool |
| min / max | number | Numeric range |
| min_length / max_length | int | String or list length |
| pattern | string | Regex must match |
| in | list | Value must be one of the listed options |
validate:
  price:
    required: true
    type: float
    min: 0
  title:
    required: true
    min_length: 2
    max_length: 500
  status:
    in: [active, inactive, pending]
  sku:
    pattern: '^[A-Z]{2}\d{4}$'
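The flag-but-do-not-abort behaviour can be pictured with the sketch below, which checks a record against rules of the same shape and attaches `_valid` and `_errors`. The `validate_record` function and its error message wording are assumptions for illustration, not Scrapit's own validator.

```python
import re

TYPES = {"str": str, "int": int, "float": float, "list": list, "bool": bool}


def validate_record(record, rules):
    """Return a copy of record with _valid and _errors added."""
    errors = []
    for field, spec in rules.items():
        value = record.get(field)
        if value is None:
            if spec.get("required"):
                errors.append(f"{field}: required")
            continue  # the remaining rules need a value
        if spec.get("not_empty") and value in ("", []):
            errors.append(f"{field}: empty")
        if "type" in spec and not isinstance(value, TYPES[spec["type"]]):
            errors.append(f"{field}: expected {spec['type']}")
        if isinstance(value, (int, float)):
            if "min" in spec and value < spec["min"]:
                errors.append(f"{field}: below min {spec['min']}")
            if "max" in spec and value > spec["max"]:
                errors.append(f"{field}: above max {spec['max']}")
        if isinstance(value, (str, list)):
            if "min_length" in spec and len(value) < spec["min_length"]:
                errors.append(f"{field}: shorter than {spec['min_length']}")
            if "max_length" in spec and len(value) > spec["max_length"]:
                errors.append(f"{field}: longer than {spec['max_length']}")
        if "pattern" in spec and not re.search(spec["pattern"], str(value)):
            errors.append(f"{field}: does not match pattern")
        if "in" in spec and value not in spec["in"]:
            errors.append(f"{field}: not an allowed value")
    return {**record, "_valid": not errors, "_errors": errors}


rules = {
    "price": {"required": True, "type": "float", "min": 0},
    "sku": {"pattern": r"^[A-Z]{2}\d{4}$"},
    "status": {"in": ["active", "inactive", "pending"]},
}

good = validate_record({"price": 12.99, "sku": "AB1234", "status": "active"}, rules)
bad = validate_record({"price": -1.0, "sku": "ab12", "status": "gone"}, rules)
print(good["_valid"], bad["_errors"])
```

Because the function returns the record instead of raising, invalid rows can still be saved and filtered later.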
Hooks
Register Python callbacks for scrape lifecycle events.
| Event | Called when | Arguments |
|---|---|---|
| before_scrape | Before fetching the page | (dados) |
| after_scrape | After scraping and transforms | (result, dados) |
| on_error | On any exception | (exc, dados) |
| on_save | After saving to storage | (result, path) |
| on_change | When --diff detects a change | (changes, result) |
from scraper import hooks

# dados ("data" in Portuguese) is the directive dict; the examples read dados['site'].

@hooks.on("after_scrape")
def log_result(result, dados):
    print(f"scraped {result['url']} - {len(result)} fields")

@hooks.on("on_change")
def alert(changes, result):
    print(f"changed fields: {list(changes.keys())}")

@hooks.on("on_error")
def handle_error(exc, dados):
    print(f"failed on {dados['site']}: {exc}")
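A decorator-based registry of this kind is a common pattern, and a minimal version fits in a few lines. The sketch below is a generic illustration with hypothetical names (`on`, `fire`, `_registry`); it does not reproduce the internals of `scraper.hooks`.

```python
from collections import defaultdict

_registry = defaultdict(list)


def on(event):
    """Decorator that registers a callback for the named event."""
    def register(func):
        _registry[event].append(func)
        return func  # leave the decorated function usable as-is
    return register


def fire(event, *args):
    """Call every callback registered for event, in registration order."""
    for callback in _registry[event]:
        callback(*args)


seen = []


@on("after_scrape")
def record(result, dados):
    seen.append((dados["site"], len(result)))


fire(
    "after_scrape",
    {"url": "https://example.com", "title": "Example"},
    {"site": "https://example.com"},
)
print(seen)
```

Firing an event with no registered callbacks is a no-op, so the scraper can emit every lifecycle event unconditionally.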
Storage: JSON
Flag: --json · Output: output/<directive>.json
Saves the latest scrape result. Overwritten on each run. Best for monitoring, APIs, or feeding into other scripts.
Storage: CSV
Flag: --csv · Output: output/<directive>.csv
Appends one row per run. Header is written only on the first run. Best for time-series datasets.
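The append-with-single-header behaviour can be reproduced with the standard csv module. This is a sketch under assumptions: `append_row` is a hypothetical helper, and it treats a missing or empty file as the "first run".

```python
import csv
import os
import tempfile


def append_row(path, row):
    """Append row to a CSV file, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Two simulated runs against a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hn.csv")
    append_row(path, {"title": "First run", "points": 10})
    append_row(path, {"title": "Second run", "points": 12})
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))

print(rows)
```

The sketch assumes every run produces the same fields; a directive whose fields change between runs would need the header reconciled.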
Storage: SQLite
Flag: --sqlite · Output: output/scrapit.db
Zero-config. All directives share one database file. Query with scrapit query --backend sqlite.
Storage: MongoDB
Flag: --mongo · Requires MONGO_URI in .env
$ pip install scrapit-scraper[mongo]
MONGO_URI=mongodb://localhost:27017
MONGO_DB=scrapit
MONGO_COLLECTION=results
AI: Quick API
Use Scrapit programmatically without YAML — ideal for feeding pages into LLMs directly.
from scraper.integrations import (
    scrape_url,             # clean text
    scrape_page,            # structured metadata
    scrape_with_selectors,  # CSS-driven extraction
    scrape_many,            # parallel URLs
    scrape_directive,       # run a YAML directive
)

text = scrape_url("https://example.com")

page = scrape_page("https://example.com")
# → {url, title, description, main_content, links, word_count}

data = scrape_with_selectors(
    "https://books.toscrape.com/...",
    selectors={"title": "h1", "price": "p.price_color"},
    all_matches={"tags": True},
)

pages = scrape_many(
    ["https://a.com", "https://b.com"],
    mode="page",
    max_workers=8,
)
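A max_workers parameter usually signals a worker pool. The sketch below shows that general pattern with the standard library's thread pool; whether `scrape_many` is built this way is an assumption, and `fetch_many` and `fake_fetch` are hypothetical names standing in for real network calls.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_many(urls, fetch, max_workers=8):
    """Apply fetch to every URL in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Hypothetical stand-in for a real page fetch.
    return {"url": url, "word_count": len(url)}


pages = fetch_many(["https://a.com", "https://b.com"], fake_fetch, max_workers=2)
print(pages)
```

Threads suit this workload because the time is spent waiting on the network rather than on CPU.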
AI: MCP Server
Run Scrapit as an MCP server so Claude Desktop, Cursor, and Claude Code can call it as a native tool.
$ pip install scrapit-scraper[mcp]
Claude Code
$ claude mcp add scrapit -- python -m scraper.integrations.mcp
Claude Desktop
Add to ~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"scrapit": {
"command": "python",
"args": ["-m", "scraper.integrations.mcp"],
"cwd": "/path/to/scrapit"
}
}
}
Tools exposed
| Tool | Description |
|---|---|
| scrape_url_tool | Fetch any URL, return clean text |
| scrape_page_tool | Fetch any URL, return title + description + links + content |
| scrape_with_selectors_tool | Extract specific fields using CSS selectors |
| run_directive_tool | Run a pre-configured YAML directive by name |
AI: Anthropic SDK
$ pip install scrapit-scraper[anthropic]
Built-in agent loop
from scraper.integrations.anthropic import ScrapitAnthropicAgent

agent = ScrapitAnthropicAgent(
    model="claude-opus-4-6",
    max_iterations=10,
    system="You are a research assistant...",
)

answer = agent.run("What are the top 5 stories on Hacker News?")
print(answer)
Manual tool use
import anthropic
from scraper.integrations.anthropic import as_anthropic_tools, handle_tool_call

client = anthropic.Anthropic()
tools = as_anthropic_tools()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "..."}],
)

for block in response.content:
    if block.type == "tool_use":
        result = handle_tool_call(block.name, block.input)
AI: OpenAI SDK
$ pip install scrapit-scraper[openai]
Built-in agent loop
from scraper.integrations.openai import ScrapitOpenAIAgent

agent = ScrapitOpenAIAgent(model="gpt-4o")
answer = agent.run("Summarize the Python Wikipedia page.")
print(answer)
Manual function calling
from openai import OpenAI
from scraper.integrations.openai import as_openai_functions, handle_function_call

client = OpenAI()
tools = as_openai_functions()

response = client.chat.completions.create(
    model="gpt-4o",
    tools=tools,
    messages=[{"role": "user", "content": "..."}],
)

# tool_calls is None when the model answers without calling a tool
for call in response.choices[0].message.tool_calls or []:
    result = handle_function_call(call.function.name, call.function.arguments)
AI: LangChain / CrewAI / LangGraph
$ pip install scrapit-scraper[langchain]
ScrapitToolkit
from scraper.integrations.langchain import ScrapitToolkit

tools = ScrapitToolkit().get_tools()
tools = ScrapitToolkit(directives=["wikipedia", "hn"]).get_tools()
Available tools
| Tool | Input | Returns |
|---|---|---|
| ScrapitTool | URL string | Clean page text |
| ScrapitPageTool | URL string | JSON with title, description, links, word_count |
| ScrapitSelectorTool | {"url": "...", "selectors": {...}} | JSON with extracted fields |
| ScrapitDirectiveTool | Directive name | JSON with scraped data |
LangChain agent
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from scraper.integrations.langchain import ScrapitToolkit

agent = initialize_agent(
    tools=ScrapitToolkit().get_tools(),
    llm=ChatOpenAI(model="gpt-4o"),
    agent=AgentType.OPENAI_FUNCTIONS,
)

agent.run("What's on the front page of Hacker News?")
CrewAI
from crewai import Agent
from scraper.integrations.langchain import ScrapitToolkit

researcher = Agent(
    role="Web Researcher",
    goal="Find and summarize information from the web",
    tools=ScrapitToolkit().get_tools(),
    llm="gpt-4o",
)
Document Loader (RAG)
from scraper.integrations.langchain import ScrapitLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = ScrapitLoader("https://example.com")
docs = loader.load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000).split_documents(docs)
AI: LlamaIndex
$ pip install scrapit-scraper[llamaindex]
from scraper.integrations.llamaindex import ScrapitReader
from llama_index.core import VectorStoreIndex

reader = ScrapitReader()
docs = reader.load_data(url="https://example.com")
docs = reader.load_data(urls=["https://a.com", "https://b.com"])
docs = reader.load_data(directive="wikipedia")

engine = VectorStoreIndex.from_documents(docs).as_query_engine()
result = engine.query("What is the main topic?")
Python API
Run directives programmatically from your own Python code.
import asyncio
from scraper.scrapers import grab_elements_by_directive
from scraper.storage import json_file, sqlite

result = asyncio.run(
    grab_elements_by_directive("scraper/directives/hn.yaml")
)

json_file.save(result, "hn")
sqlite.save(result, "hn")
RabbitMQ queue
Send directives to a background worker queue. Useful for scheduled scraping at scale.
$ pip install scrapit-scraper[rabbitmq]
# Producer: send a directive to the queue
from scraper.queue.producer import call_producer
call_producer("directives/hn.yaml")

# Consumer: start a blocking worker
$ python -m scraper.queue.consumer
Configure via RABBITMQ_HOST and RABBITMQ_QUEUE in .env. Workers save results to MongoDB automatically.