Scrapit Documentation

Everything you need to scrape any website with just a YAML file — including AI agent integration, transforms, validation, and storage.

Installation

Core

$ pip install scrapit-scraper

Optional extras

# Playwright backend (JS-rendered sites)
$ pip install scrapit-scraper[playwright]
$ playwright install chromium

# AI integrations
$ pip install scrapit-scraper[anthropic]   # Claude
$ pip install scrapit-scraper[openai]      # GPT-4o
$ pip install scrapit-scraper[langchain]   # LangChain / CrewAI
$ pip install scrapit-scraper[llamaindex]  # LlamaIndex
$ pip install scrapit-scraper[mcp]         # MCP server

# Storage backends
$ pip install scrapit-scraper[mongo]
$ pip install scrapit-scraper[excel]
$ pip install scrapit-scraper[sheets]

# Everything
$ pip install scrapit-scraper[all]

i

All integration extras are optional. The core (YAML scraping, transforms, validation, SQLite) works with no extras installed.

From source

$ git clone https://github.com/joaobenedetmachado/scrapit
$ cd scrapit
$ python -m venv .venv && source .venv/bin/activate
$ pip install -e .[dev]

Quick start

1. Create a directive

directives/hn.yaml

site: https://news.ycombinator.com
use: beautifulsoup

scrape:
  titles:
    - '.titleline > a'
    - attr: text
      all: true
  links:
    - '.titleline > a'
    - attr: href
      all: true

2. Run it

$ scrapit scrape hn --json
→ saved: output/hn.json

$ scrapit scrape hn --json --diff  # detect changes
$ scrapit scrape hn --csv --sqlite  # multiple outputs
$ scrapit scrape hn --preview       # print, don't save

3. Query results

$ scrapit query --backend sqlite --directive hn --limit 5

CLI reference

`scrapit scrape`

Flag	Description
directive	Directive name or path to YAML file. required
--json	Save result as `output/<name>.json`
--csv	Append result to `output/<name>.csv`
--sqlite	Save to `output/scrapit.db`
--mongo	Save to MongoDB (requires `MONGO_URI` in .env)
--preview	Print result, do not save anything
--diff	Compare against previous JSON run, fire webhook on change

`scrapit batch`

Run all directives in a folder. Accepts the same output flags as scrape.

$ scrapit batch scraper/directives/ --json --sqlite

`scrapit list`

List available directives in the default or a custom directory.

$ scrapit list
$ scrapit list --dir /path/to/my/directives

`scrapit query`

Flag	Default	Description
--backend	`sqlite`	`sqlite` or `mongo`
--directive	—	Filter by directive name
--url	—	Filter by URL fragment
--limit	`20`	Max results

`scrapit cache`

$ scrapit cache stats
$ scrapit cache clear
$ scrapit cache invalidate --url https://example.com

Environment (.env)

Create a .env file at the project root. All variables are optional.

.env

# MongoDB
MONGO_URI=mongodb://localhost:27017
MONGO_DB=scrapit
MONGO_COLLECTION=results

# RabbitMQ
RABBITMQ_HOST=localhost
RABBITMQ_QUEUE=scrapit

# Output directory (default: ./output)
OUTPUT_DIR=./output

Directive structure

A directive is a YAML file that fully describes a scraping job. Place them anywhere and reference by name or path.

Key	Type		Description
site	string	required	URL to scrape
sites	list	optional	List of URLs (multi-site mode)
use	string	required	`beautifulsoup` or `playwright`
scrape	dict	required	Field definitions with selectors
transform	dict	optional	Transform pipeline per field
validate	dict	optional	Validation rules per field
paginate	dict	optional	Pagination config
follow	dict	optional	Spider / link-following config
headers	dict	optional	Custom HTTP headers
cookies	dict	optional	Cookies to send with request
proxy	string	optional	Proxy URL
cache	dict	optional	HTTP cache config (`ttl` in seconds)
retries	int	optional	Retry attempts on failure (default: 3)
notify	dict	optional	Webhook notification config
wait_for	string	optional	Playwright: CSS selector to wait for before parsing
screenshot	bool	optional	Playwright: save full-page screenshot

scrape:

Defines the fields to extract. Each field is a list: selector → attr config.

scrape:
  # Single selector, get text
  title:
    - 'h1'
    - attr: text

  # Get an HTML attribute
  image:
    - 'img.hero'
    - attr: src

  # Return ALL matches as a list
  all_links:
    - 'a.result'
    - attr: href
      all: true

  # Fallback selectors — tries each in order
  price:
    - ['span.price-new', 'span.price', '.cost']
    - attr: text

attr values

attr	Returns
text	Inner text of the element (default)
html	Inner HTML of the element
href	The `href` attribute
src	The `src` attribute
class	Full class string (e.g. `"star-rating Three"`)
any HTML attr	Any attribute name: `data-id`, `aria-label`, etc.

paginate:

Automatically follow "next page" links. Scrapit fetches each page and merges results into lists.

paginate:
  selector: 'a.next'     # CSS selector for the next-page link
  attr: href              # attribute to extract the URL from
  max_pages: 5           # safety limit

i

Supported by both backends. Fields with all: true are concatenated across pages.

follow: (spider mode)

Discover all links on a page and scrape each one. Good for blog indexes, product catalogs, and documentation sites.

site: https://myblog.com/posts
use: beautifulsoup

follow:
  selector: 'a.post-link'
  attr: href
  max: 100
  same_domain: true

scrape:
  title:
    - 'h1'
    - attr: text
  body:
    - 'article'
    - attr: text

Returns a list of dicts, one per page scraped. Each item includes the url field automatically.

headers / proxy / cache

headers:
  Accept-Language: en-US,en;q=0.9
  Referer: https://google.com

cookies:
  session_id: abc123

proxy: http://user:pass@proxy:8080

cache:
  ttl: 3600    # seconds. 0 to disable.

retries: 3    # exponential backoff

# Playwright only
wait_for: '#content'
screenshot: true

notify:

Fire a webhook when --diff detects a change. Works with Slack, Discord, or any HTTP endpoint.

notify:
  webhook: https://hooks.slack.com/services/...

The webhook receives a POST with a JSON body containing the changed fields and the full new result.

Multi-site directives

Scrape multiple URLs with the same directive using sites: instead of site:.

sites:
  - https://books.toscrape.com/catalogue/page-1.html
  - https://books.toscrape.com/catalogue/page-2.html
  - https://books.toscrape.com/catalogue/page-3.html

use: beautifulsoup

scrape:
  titles:
    - 'h3 > a'
    - attr: title
      all: true

Transforms

Transforms run after scraping, per field, in order. Each step receives the output of the previous one.

Transform	Argument	Description	Example
strip	—	Remove leading/trailing whitespace	`- strip`
lower / upper / title	—	Change case	`- lower`
int / float	—	Parse number, strips non-numeric chars	`- float`
regex	`pattern`	Extract first regex match	`{regex: '\d+'}`
regex_group	`{pattern, group}`	Extract a capture group	`{regex_group: {pattern: '(\w+)', group: 1}}`
replace	`{old: new}`	String substitution (multiple pairs)	`{replace: {"£": ""}}`
split / join	`separator`	Split string to list / join list to string	`{split: ","}`
first / last	—	Pick first or last item from list	`- first`
default	`value`	Fallback if value is None	`{default: "N/A"}`
slice	`{start, end}`	Substring or sublist	`{slice: {end: 200}}`
prepend / append	`string`	Add text before or after value	`{prepend: "https:"}`
remove_tags	—	Strip HTML tags	`- remove_tags`
template	`"prefix {value}"`	String template with `{value}`	`{template: "USD {value}"}`
slugify	—	"Hello World" → "hello-world"	`- slugify`
truncate	`N`	Cut at N chars without breaking words	`{truncate: 150}`
normalize_whitespace	—	Collapse multiple spaces/newlines	`- normalize_whitespace`

chained example

transform:
  price:
    - strip                      # "  £ 12,99  " → "£ 12,99"
    - replace: {"£": "", ",": "."}
    - float                      # → 12.99

  slug:
    - strip
    - normalize_whitespace
    - slugify                    # "My  Title!" → "my-title"

  summary:
    - remove_tags
    - normalize_whitespace
    - truncate: 200

Validation

Runs after transforms. Invalid records are flagged with _valid: false and _errors. The scrape still completes.

Rule	Type	Description
required	bool	Must not be `None`
not_empty	bool	Must not be empty string or list
type	string	`str`, `int`, `float`, `list`, `bool`
min / max	number	Numeric range
min_length / max_length	int	String or list length
pattern	string	Regex must match
in	list	Value must be one of the listed options

validate:
  price:
    required: true
    type: float
    min: 0
  title:
    required: true
    min_length: 2
    max_length: 500
  status:
    in: [active, inactive, pending]
  sku:
    pattern: '^[A-Z]{2}\d{4}$'

Hooks

Register Python callbacks for scrape lifecycle events.

Event	Called when	Arguments
before_scrape	Before fetching the page	`(dados)`
after_scrape	After scraping and transforms	`(result, dados)`
on_error	On any exception	`(exc, dados)`
on_save	After saving to storage	`(result, path)`
on_change	When `--diff` detects a change	`(changes, result)`

from scraper import hooks

@hooks.on("after_scrape")
def log_result(result, dados):
    print(f"scraped {result['url']} — {len(result)} fields")

@hooks.on("on_change")
def alert(changes, result):
    print(f"changed fields: {list(changes.keys())}")

@hooks.on("on_error")
def handle_error(exc, dados):
    print(f"failed on {dados['site']}: {exc}")

Storage: JSON

Flag: --json · Output: output/<directive>.json

Saves the latest scrape result. Overwritten on each run. Best for monitoring, APIs, or feeding into other scripts.

Storage: CSV

Flag: --csv · Output: output/<directive>.csv

Appends one row per run. Header is written only on the first run. Best for time-series datasets.

Storage: SQLite

Flag: --sqlite · Output: output/scrapit.db

Zero-config. All directives share one database file. Query with scrapit query --backend sqlite.

✓

Recommended default. No server required — SQLite is included in Python's stdlib.

Storage: MongoDB

Flag: --mongo · Requires MONGO_URI in .env

$ pip install scrapit-scraper[mongo]

.env

MONGO_URI=mongodb://localhost:27017
MONGO_DB=scrapit
MONGO_COLLECTION=results

AI: Quick API

Use Scrapit programmatically without YAML — ideal for feeding pages into LLMs directly.

from scraper.integrations import (
    scrape_url,             # clean text
    scrape_page,            # structured metadata
    scrape_with_selectors,  # CSS-driven extraction
    scrape_many,            # parallel URLs
    scrape_directive,       # run a YAML directive
)

text = scrape_url("https://example.com")

page = scrape_page("https://example.com")
# → {url, title, description, main_content, links, word_count}

data = scrape_with_selectors(
    "https://books.toscrape.com/...",
    selectors={"title": "h1", "price": "p.price_color"},
    all_matches={"tags": True},
)

pages = scrape_many(
    ["https://a.com", "https://b.com"],
    mode="page",
    max_workers=8,
)

AI: MCP Server

Run Scrapit as an MCP server so Claude Desktop, Cursor, and Claude Code can call it as a native tool.

$ pip install scrapit-scraper[mcp]

Claude Code

$ claude mcp add scrapit -- python -m scraper.integrations.mcp

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "scrapit": {
      "command": "python",
      "args": ["-m", "scraper.integrations.mcp"],
      "cwd": "/path/to/scrapit"
    }
  }
}

Tools exposed

Tool	Description
scrape_url_tool	Fetch any URL, return clean text
scrape_page_tool	Fetch any URL, return title + description + links + content
scrape_with_selectors_tool	Extract specific fields using CSS selectors
run_directive_tool	Run a pre-configured YAML directive by name

AI: Anthropic SDK

$ pip install scrapit-scraper[anthropic]

Built-in agent loop

from scraper.integrations.anthropic import ScrapitAnthropicAgent

agent = ScrapitAnthropicAgent(
    model="claude-opus-4-6",
    max_iterations=10,
    system="You are a research assistant...",
)

answer = agent.run("What are the top 5 stories on Hacker News?")
print(answer)

Manual tool use

import anthropic
from scraper.integrations.anthropic import as_anthropic_tools, handle_tool_call

client   = anthropic.Anthropic()
tools    = as_anthropic_tools()

response = client.messages.create(
    model="claude-opus-4-6", max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "..."}],
)

for block in response.content:
    if block.type == "tool_use":
        result = handle_tool_call(block.name, block.input)

AI: OpenAI SDK

$ pip install scrapit-scraper[openai]

Built-in agent loop

from scraper.integrations.openai import ScrapitOpenAIAgent

agent  = ScrapitOpenAIAgent(model="gpt-4o")
answer = agent.run("Summarize the Python Wikipedia page.")
print(answer)

Manual function calling

from openai import OpenAI
from scraper.integrations.openai import as_openai_functions, handle_function_call

client   = OpenAI()
tools    = as_openai_functions()

response = client.chat.completions.create(
    model="gpt-4o", tools=tools,
    messages=[{"role": "user", "content": "..."}],
)

for call in response.choices[0].message.tool_calls:
    result = handle_function_call(call.function.name, call.function.arguments)

AI: LangChain / CrewAI / LangGraph

$ pip install scrapit-scraper[langchain]

ScrapitToolkit

from scraper.integrations.langchain import ScrapitToolkit

tools = ScrapitToolkit().get_tools()
tools = ScrapitToolkit(directives=["wikipedia", "hn"]).get_tools()

Available tools

Tool	Input	Returns
ScrapitTool	URL string	Clean page text
ScrapitPageTool	URL string	JSON with title, description, links, word_count
ScrapitSelectorTool	`{"url": "...", "selectors": {...}}`	JSON with extracted fields
ScrapitDirectiveTool	Directive name	JSON with scraped data

LangChain agent

from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI

agent = initialize_agent(
    tools=ScrapitToolkit().get_tools(),
    llm=ChatOpenAI(model="gpt-4o"),
    agent=AgentType.OPENAI_FUNCTIONS,
)
agent.run("What's on the front page of Hacker News?")

CrewAI

from crewai import Agent

researcher = Agent(
    role="Web Researcher",
    goal="Find and summarize information from the web",
    tools=ScrapitToolkit().get_tools(),
    llm="gpt-4o",
)

Document Loader (RAG)

from scraper.integrations.langchain import ScrapitLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = ScrapitLoader("https://example.com")
docs   = loader.load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000).split_documents(docs)

AI: LlamaIndex

$ pip install scrapit-scraper[llamaindex]

from scraper.integrations.llamaindex import ScrapitReader
from llama_index.core import VectorStoreIndex

reader = ScrapitReader()

docs = reader.load_data(url="https://example.com")
docs = reader.load_data(urls=["https://a.com", "https://b.com"])
docs = reader.load_data(directive="wikipedia")

engine = VectorStoreIndex.from_documents(docs).as_query_engine()
result = engine.query("What is the main topic?")

Python API

Run directives programmatically from your own Python code.

import asyncio
from scraper.scrapers import grab_elements_by_directive
from scraper.storage import json_file, sqlite

result = asyncio.run(
    grab_elements_by_directive("scraper/directives/hn.yaml")
)

json_file.save(result, "hn")
sqlite.save(result, "hn")

RabbitMQ queue

Send directives to a background worker queue. Useful for scheduled scraping at scale.

$ pip install scrapit-scraper[rabbitmq]

# Producer — send a directive to the queue
from scraper.queue.producer import call_producer
call_producer("directives/hn.yaml")

# Consumer — start a blocking worker
$ python -m scraper.queue.consumer

Configure via RABBITMQ_HOST and RABBITMQ_QUEUE in .env. Workers save results to MongoDB automatically.