Scrapit Documentation

Everything you need to scrape any website with just a YAML file — including AI agent integration, transforms, validation, and storage.

Installation

Core

$ pip install scrapit-scraper

Optional extras

# Playwright backend (JS-rendered sites)
$ pip install scrapit-scraper[playwright]
$ playwright install chromium

# AI integrations
$ pip install scrapit-scraper[anthropic]   # Claude
$ pip install scrapit-scraper[openai]      # GPT-4o
$ pip install scrapit-scraper[langchain]   # LangChain / CrewAI
$ pip install scrapit-scraper[llamaindex]  # LlamaIndex
$ pip install scrapit-scraper[mcp]         # MCP server

# Storage backends
$ pip install scrapit-scraper[mongo]
$ pip install scrapit-scraper[excel]
$ pip install scrapit-scraper[sheets]

# Everything
$ pip install scrapit-scraper[all]
All integration extras are optional. The core (YAML scraping, transforms, validation, SQLite) works with no extras installed.

From source

$ git clone https://github.com/joaobenedetmachado/scrapit
$ cd scrapit
$ python -m venv .venv && source .venv/bin/activate
$ pip install -e .[dev]

Quick start

1. Create a directive

directives/hn.yaml
site: https://news.ycombinator.com
use: beautifulsoup

scrape:
  titles:
    - '.titleline > a'
    - attr: text
      all: true
  links:
    - '.titleline > a'
    - attr: href
      all: true

2. Run it

$ scrapit scrape hn --json
→ saved: output/hn.json

$ scrapit scrape hn --json --diff  # detect changes
$ scrapit scrape hn --csv --sqlite  # multiple outputs
$ scrapit scrape hn --preview       # print, don't save

3. Query results

$ scrapit query --backend sqlite --directive hn --limit 5

CLI reference

scrapit scrape

Flag       Description
directive  Directive name or path to a YAML file (required)
--json     Save result as output/<name>.json
--csv      Append result to output/<name>.csv
--sqlite   Save to output/scrapit.db
--mongo    Save to MongoDB (requires MONGO_URI in .env)
--preview  Print result, do not save anything
--diff     Compare against previous JSON run, fire webhook on change

scrapit batch

Run all directives in a folder. Accepts the same output flags as scrape.

$ scrapit batch scraper/directives/ --json --sqlite

scrapit list

List available directives in the default or a custom directory.

$ scrapit list
$ scrapit list --dir /path/to/my/directives

scrapit query

Flag         Default  Description
--backend    sqlite   sqlite or mongo
--directive  (none)   Filter by directive name
--url        (none)   Filter by URL fragment
--limit      20       Max results

scrapit cache

$ scrapit cache stats
$ scrapit cache clear
$ scrapit cache invalidate --url https://example.com

Environment (.env)

Create a .env file at the project root. All variables are optional.

.env
# MongoDB
MONGO_URI=mongodb://localhost:27017
MONGO_DB=scrapit
MONGO_COLLECTION=results

# RabbitMQ
RABBITMQ_HOST=localhost
RABBITMQ_QUEUE=scrapit

# Output directory (default: ./output)
OUTPUT_DIR=./output

Directive structure

A directive is a YAML file that fully describes a scraping job. Place directives anywhere and reference them by name or path.

Key         Type    Required  Description
site        string  required  URL to scrape
sites       list    optional  List of URLs (multi-site mode)
use         string  required  beautifulsoup or playwright
scrape      dict    required  Field definitions with selectors
transform   dict    optional  Transform pipeline per field
validate    dict    optional  Validation rules per field
paginate    dict    optional  Pagination config
follow      dict    optional  Spider / link-following config
headers     dict    optional  Custom HTTP headers
cookies     dict    optional  Cookies to send with request
proxy       string  optional  Proxy URL
cache       dict    optional  HTTP cache config (ttl in seconds)
retries     int     optional  Retry attempts on failure (default: 3)
notify      dict    optional  Webhook notification config
wait_for    string  optional  Playwright: CSS selector to wait for before parsing
screenshot  bool    optional  Playwright: save full-page screenshot

scrape:

Defines the fields to extract. Each field is a two-item list: a CSS selector (or a list of fallback selectors), followed by an attr config.

scrape:
  # Single selector, get text
  title:
    - 'h1'
    - attr: text

  # Get an HTML attribute
  image:
    - 'img.hero'
    - attr: src

  # Return ALL matches as a list
  all_links:
    - 'a.result'
    - attr: href
      all: true

  # Fallback selectors — tries each in order
  price:
    - ['span.price-new', 'span.price', '.cost']
    - attr: text
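
The fallback behaviour in the last field can be pictured as "try each selector until one matches". The following is a minimal sketch of that idea, not Scrapit's own code: first_match and the select callable are hypothetical names, and a plain dict stands in for a parsed page so the example runs without any HTML library.

```python
from typing import Callable, Optional, Sequence


def first_match(selectors: Sequence[str],
                select: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return the value from the first selector that matches, else None."""
    for selector in selectors:
        value = select(selector)
        if value is not None:
            return value
    return None


# Hypothetical stand-in for a parsed page: selector -> extracted text.
fake_page = {"span.price": "£12.99"}

# 'span.price-new' is absent, so the second selector supplies the value.
price = first_match(["span.price-new", "span.price", ".cost"], fake_page.get)
print(price)
```

Ordering the list from most specific to most generic keeps the preferred selector in charge whenever it is present.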

attr values

attr           Returns
text           Inner text of the element (default)
html           Inner HTML of the element
href           The href attribute
src            The src attribute
class          Full class string (e.g. "star-rating Three")
any HTML attr  Any attribute name: data-id, aria-label, etc.

paginate:

Automatically follow "next page" links. Scrapit fetches each page and merges results into lists.

paginate:
  selector: 'a.next'     # CSS selector for the next-page link
  attr: href              # attribute to extract the URL from
  max_pages: 5           # safety limit
Supported by both backends. Fields with all: true are concatenated across pages.
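
To make the merge behaviour concrete, here is a rough sketch of a pagination loop with a max_pages safety limit. It is an illustration under assumptions, not Scrapit's implementation: scrape_paginated, the fetch callable, and the in-memory SITE dict are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A page fetcher returns (fields, next_url); next_url is None on the last page.
Page = Tuple[Dict[str, List[str]], Optional[str]]


def scrape_paginated(start_url: str,
                     fetch: Callable[[str], Page],
                     max_pages: int = 5) -> Dict[str, List[str]]:
    """Follow next-page links and concatenate list fields across pages."""
    merged: Dict[str, List[str]] = {}
    url: Optional[str] = start_url
    pages_seen = 0
    while url is not None and pages_seen < max_pages:
        fields, url = fetch(url)
        for name, values in fields.items():
            merged.setdefault(name, []).extend(values)
        pages_seen += 1
    return merged


# Hypothetical in-memory stand-in for a three-page site.
SITE: Dict[str, Page] = {
    "/p1": ({"titles": ["a", "b"]}, "/p2"),
    "/p2": ({"titles": ["c"]}, "/p3"),
    "/p3": ({"titles": ["d"]}, None),
}

# With max_pages=2 the loop stops before reaching /p3.
result = scrape_paginated("/p1", SITE.__getitem__, max_pages=2)
print(result)
```

The limit matters because a site whose "next" link never disappears would otherwise keep the loop running indefinitely.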

follow: (spider mode)

Discover the links on a page that match a selector and scrape each one. Good for blog indexes, product catalogs, and documentation sites.

site: https://myblog.com/posts
use: beautifulsoup

follow:
  selector: 'a.post-link'
  attr: href
  max: 100
  same_domain: true

scrape:
  title:
    - 'h1'
    - attr: text
  body:
    - 'article'
    - attr: text

Returns a list of dicts, one per page scraped. Each item includes the url field automatically.
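
The link-discovery step can be sketched with the standard library alone. The function below is a hypothetical illustration of what same_domain and max might mean (resolve relative links, drop off-site and duplicate URLs, stop at a limit); Scrapit's actual rules may differ.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlparse


def links_to_follow(base_url: str, hrefs: Iterable[str],
                    same_domain: bool = True, limit: int = 100) -> List[str]:
    """Resolve hrefs against base_url, drop duplicates and off-site links."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result: List[str] = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if same_domain and urlparse(absolute).netloc != base_host:
            continue  # link points at another host
        if absolute in seen:
            continue  # already queued
        seen.add(absolute)
        result.append(absolute)
        if len(result) >= limit:
            break
    return result


links = links_to_follow(
    "https://myblog.com/posts",
    ["/posts/one", "two", "https://other.com/x", "/posts/one"],
)
print(links)
```

Note that the relative link "two" resolves against the parent of /posts, which is why trailing slashes in the site URL can change which pages get followed.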

headers / proxy / cache

headers:
  Accept-Language: en-US,en;q=0.9
  Referer: https://google.com

cookies:
  session_id: abc123

proxy: http://user:pass@proxy:8080

cache:
  ttl: 3600    # seconds. 0 to disable.

retries: 3    # exponential backoff

# Playwright only
wait_for: '#content'
screenshot: true
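
For readers unfamiliar with exponential backoff, the sketch below shows the general pattern the retries setting refers to. It is not Scrapit's code: with_retries, base_delay, and the doubling schedule are assumptions, as is the reading that retries counts the attempts made after the first failure.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(action: Callable[[], T], retries: int = 3,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call action; on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(retries + 1):
        try:
            return action()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable for retries >= 0")


# Demo: an action that fails twice, then succeeds. Delays are recorded
# instead of slept so the example runs instantly.
calls = {"n": 0}
delays: List[float] = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


outcome = with_retries(flaky, retries=3, sleep=delays.append)
print(outcome, delays)
```

Doubling the wait after each failure gives a struggling server progressively more room to recover.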

notify:

Fire a webhook when --diff detects a change. Works with Slack, Discord, or any HTTP endpoint.

notify:
  webhook: https://hooks.slack.com/services/...

The webhook receives a POST with a JSON body containing the changed fields and the full new result.
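
As an illustration of what "changed fields" could look like, the sketch below compares two runs field by field. The function name, the old/new structure, and the payload keys are hypothetical; the real JSON body sent by Scrapit may use different names.

```python
from typing import Any, Dict


def changed_fields(previous: Dict[str, Any],
                   current: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Map each field whose value differs to its old and new values."""
    changes: Dict[str, Dict[str, Any]] = {}
    for key in sorted(set(previous) | set(current)):
        old, new = previous.get(key), current.get(key)
        if old != new:
            changes[key] = {"old": old, "new": new}
    return changes


previous_run = {"title": "Widget", "price": 12.99}
current_run = {"title": "Widget", "price": 10.49, "stock": 3}

changes = changed_fields(previous_run, current_run)
# Hypothetical payload shape: changed fields plus the full new result.
payload = {"changes": changes, "result": current_run}
print(payload)
```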

Multi-site directives

Scrape multiple URLs with the same directive using sites: instead of site:.

sites:
  - https://books.toscrape.com/catalogue/page-1.html
  - https://books.toscrape.com/catalogue/page-2.html
  - https://books.toscrape.com/catalogue/page-3.html

use: beautifulsoup

scrape:
  titles:
    - 'h3 > a'
    - attr: title
      all: true

Transforms

Transforms run after scraping, per field, in order. Each step receives the output of the previous one.

Transform              Argument          Description                                 Example
strip                  (none)            Remove leading/trailing whitespace          - strip
lower / upper / title  (none)            Change case                                 - lower
int / float            (none)            Parse number, strips non-numeric chars      - float
regex                  pattern           Extract first regex match                   {regex: '\d+'}
regex_group            {pattern, group}  Extract a capture group                     {regex_group: {pattern: '(\w+)', group: 1}}
replace                {old: new}        String substitution (multiple pairs)        {replace: {"£": ""}}
split / join           separator         Split string to list / join list to string  {split: ","}
first / last           (none)            Pick first or last item from list           - first
default                value             Fallback if value is None                   {default: "N/A"}
slice                  {start, end}      Substring or sublist                        {slice: {end: 200}}
prepend / append       string            Add text before or after value              {prepend: "https:"}
remove_tags            (none)            Strip HTML tags                             - remove_tags
template               "prefix {value}"  String template with {value}                {template: "USD {value}"}
slugify                (none)            "Hello World" → "hello-world"               - slugify
truncate               N                 Cut at N chars without breaking words       {truncate: 150}
normalize_whitespace   (none)            Collapse multiple spaces/newlines           - normalize_whitespace

Chained example

transform:
  price:
    - strip                      # "  £ 12,99  " → "£ 12,99"
    - replace: {"£": "", ",": "."}
    - float                      # → 12.99

  slug:
    - strip
    - normalize_whitespace
    - slugify                    # "My  Title!" → "my-title"

  summary:
    - remove_tags
    - normalize_whitespace
    - truncate: 200
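
The "each step receives the output of the previous one" rule can be sketched in a few lines of Python. This is a toy pipeline covering only strip, lower, replace, and float, written under assumptions: apply_steps and the helper names are hypothetical, and the real transforms may treat edge cases differently.

```python
import re
from typing import Any, Callable, Dict, List, Union

# A step is either a bare name ("strip") or a one-key dict ({"replace": {...}}).
Step = Union[str, Dict[str, Any]]


def _to_float(value: str) -> float:
    # Keep digits, sign and decimal point; drop everything else.
    return float(re.sub(r"[^0-9.\-]", "", value))


def _replace(value: str, pairs: Dict[str, str]) -> str:
    for old, new in pairs.items():
        value = value.replace(old, new)
    return value


SIMPLE: Dict[str, Callable[[Any], Any]] = {
    "strip": str.strip,
    "lower": str.lower,
    "float": _to_float,
}
WITH_ARG: Dict[str, Callable[[Any, Any], Any]] = {
    "replace": _replace,
}


def apply_steps(value: Any, steps: List[Step]) -> Any:
    """Run each step on the output of the previous one."""
    for step in steps:
        if isinstance(step, str):
            value = SIMPLE[step](value)
        else:
            name, arg = next(iter(step.items()))
            value = WITH_ARG[name](value, arg)
    return value


# Mirrors the price chain above: strip, replace, then float.
price = apply_steps("  £ 12,99  ", ["strip", {"replace": {"£": "", ",": "."}}, "float"])
print(price)
```

Order matters here: running float before replace would turn "12,99" into 1299.0, because the comma would be stripped rather than converted to a decimal point.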

Validation

Runs after transforms. Invalid records are flagged with _valid: false and _errors. The scrape still completes.

Rule                     Type    Description
required                 bool    Must not be None
not_empty                bool    Must not be empty string or list
type                     string  str, int, float, list, bool
min / max                number  Numeric range
min_length / max_length  int     String or list length
pattern                  string  Regex must match
in                       list    Value must be one of the listed options

validate:
  price:
    required: true
    type: float
    min: 0
  title:
    required: true
    min_length: 2
    max_length: 500
  status:
    in: [active, inactive, pending]
  sku:
    pattern: '^[A-Z]{2}\d{4}$'
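
The flag-but-do-not-fail behaviour can be sketched as follows. This is a simplified illustration covering only required, type, min, in, and pattern; validate_record and the wording of the error strings are hypothetical, and only the _valid and _errors keys come from the description above.

```python
import re
from typing import Any, Dict, List


def validate_record(record: Dict[str, Any],
                    rules: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of record with _valid and _errors added."""
    types = {"str": str, "int": int, "float": float, "list": list, "bool": bool}
    errors: List[str] = []
    for field, rule in rules.items():
        value = record.get(field)
        if value is None:
            if rule.get("required"):
                errors.append(f"{field}: required")
            continue  # the remaining rules need a value
        if "type" in rule and not isinstance(value, types[rule["type"]]):
            errors.append(f"{field}: expected {rule['type']}")
            continue  # range checks are meaningless on the wrong type
        if "min" in rule and value < rule["min"]:
            errors.append(f"{field}: below min {rule['min']}")
        if "in" in rule and value not in rule["in"]:
            errors.append(f"{field}: not one of {rule['in']}")
        if "pattern" in rule and not re.search(rule["pattern"], str(value)):
            errors.append(f"{field}: does not match pattern")
    return {**record, "_valid": not errors, "_errors": errors}


rules = {
    "price": {"required": True, "type": "float", "min": 0},
    "status": {"in": ["active", "inactive", "pending"]},
    "sku": {"pattern": r"^[A-Z]{2}\d{4}$"},
}

good = validate_record({"price": 12.99, "status": "active", "sku": "AB1234"}, rules)
bad = validate_record({"price": -1.0, "status": "gone", "sku": "ab12"}, rules)
print(good["_valid"], bad["_errors"])
```

Because invalid records are returned rather than raised, downstream code can filter on _valid instead of wrapping every scrape in error handling.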

Hooks

Register Python callbacks for scrape lifecycle events.

Event          Called when                    Arguments
before_scrape  Before fetching the page       (dados)
after_scrape   After scraping and transforms  (result, dados)
on_error       On any exception               (exc, dados)
on_save        After saving to storage        (result, path)
on_change      When --diff detects a change   (changes, result)

from scraper import hooks

@hooks.on("after_scrape")
def log_result(result, dados):
    print(f"scraped {result['url']} — {len(result)} fields")

@hooks.on("on_change")
def alert(changes, result):
    print(f"changed fields: {list(changes.keys())}")

@hooks.on("on_error")
def handle_error(exc, dados):
    print(f"failed on {dados['site']}: {exc}")

Storage: JSON

Flag: --json  ·  Output: output/<directive>.json

Saves the latest scrape result. Overwritten on each run. Best for monitoring, APIs, or feeding into other scripts.

Storage: CSV

Flag: --csv  ·  Output: output/<directive>.csv

Appends one row per run. Header is written only on the first run. Best for time-series datasets.
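
The append-with-single-header behaviour is easy to reproduce with the standard csv module. The sketch below is an assumption about the general technique, not Scrapit's storage code; append_row is a hypothetical helper and a temporary file stands in for output/<directive>.csv.

```python
import csv
import os
import tempfile
from typing import Any, Dict


def append_row(path: str, row: Dict[str, Any]) -> None:
    """Append one row; write the header only when the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Two "runs" against a temporary file.
csv_path = os.path.join(tempfile.mkdtemp(), "hn.csv")
append_row(csv_path, {"title": "First", "points": 10})
append_row(csv_path, {"title": "Second", "points": 20})

with open(csv_path, newline="", encoding="utf-8") as handle:
    rows = list(csv.reader(handle))
print(rows)
```

One consequence of this design is that changing the fields in a directive between runs leaves old rows under a header that no longer matches, so it is worth starting a fresh file when the schema changes.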

Storage: SQLite

Flag: --sqlite  ·  Output: output/scrapit.db

Zero-config. All directives share one database file. Query with scrapit query --backend sqlite.

Recommended default. No server required — SQLite is included in Python's stdlib.

Storage: MongoDB

Flag: --mongo  ·  Requires MONGO_URI in .env

$ pip install scrapit-scraper[mongo]

.env
MONGO_URI=mongodb://localhost:27017
MONGO_DB=scrapit
MONGO_COLLECTION=results

AI: Quick API

Use Scrapit programmatically without YAML — ideal for feeding pages into LLMs directly.

from scraper.integrations import (
    scrape_url,             # clean text
    scrape_page,            # structured metadata
    scrape_with_selectors,  # CSS-driven extraction
    scrape_many,            # parallel URLs
    scrape_directive,       # run a YAML directive
)

text = scrape_url("https://example.com")

page = scrape_page("https://example.com")
# → {url, title, description, main_content, links, word_count}

data = scrape_with_selectors(
    "https://books.toscrape.com/...",
    selectors={"title": "h1", "price": "p.price_color"},
    all_matches={"tags": True},
)

pages = scrape_many(
    ["https://a.com", "https://b.com"],
    mode="page",
    max_workers=8,
)

AI: MCP Server

Run Scrapit as an MCP server so Claude Desktop, Cursor, and Claude Code can call it as a native tool.

$ pip install scrapit-scraper[mcp]

Claude Code

$ claude mcp add scrapit -- python -m scraper.integrations.mcp

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "scrapit": {
      "command": "python",
      "args": ["-m", "scraper.integrations.mcp"],
      "cwd": "/path/to/scrapit"
    }
  }
}

Tools exposed

Tool                        Description
scrape_url_tool             Fetch any URL, return clean text
scrape_page_tool            Fetch any URL, return title + description + links + content
scrape_with_selectors_tool  Extract specific fields using CSS selectors
run_directive_tool          Run a pre-configured YAML directive by name

AI: Anthropic SDK

$ pip install scrapit-scraper[anthropic]

Built-in agent loop

from scraper.integrations.anthropic import ScrapitAnthropicAgent

agent = ScrapitAnthropicAgent(
    model="claude-opus-4-6",
    max_iterations=10,
    system="You are a research assistant...",
)

answer = agent.run("What are the top 5 stories on Hacker News?")
print(answer)

Manual tool use

import anthropic
from scraper.integrations.anthropic import as_anthropic_tools, handle_tool_call

client   = anthropic.Anthropic()
tools    = as_anthropic_tools()

response = client.messages.create(
    model="claude-opus-4-6", max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "..."}],
)

for block in response.content:
    if block.type == "tool_use":
        result = handle_tool_call(block.name, block.input)

AI: OpenAI SDK

$ pip install scrapit-scraper[openai]

Built-in agent loop

from scraper.integrations.openai import ScrapitOpenAIAgent

agent  = ScrapitOpenAIAgent(model="gpt-4o")
answer = agent.run("Summarize the Python Wikipedia page.")
print(answer)

Manual function calling

from openai import OpenAI
from scraper.integrations.openai import as_openai_functions, handle_function_call

client   = OpenAI()
tools    = as_openai_functions()

response = client.chat.completions.create(
    model="gpt-4o", tools=tools,
    messages=[{"role": "user", "content": "..."}],
)

# tool_calls is None when the model answers without calling a tool
for call in response.choices[0].message.tool_calls or []:
    result = handle_function_call(call.function.name, call.function.arguments)

AI: LangChain / CrewAI / LangGraph

$ pip install scrapit-scraper[langchain]

ScrapitToolkit

from scraper.integrations.langchain import ScrapitToolkit

tools = ScrapitToolkit().get_tools()
tools = ScrapitToolkit(directives=["wikipedia", "hn"]).get_tools()

Available tools

Tool                  Input                               Returns
ScrapitTool           URL string                          Clean page text
ScrapitPageTool       URL string                          JSON with title, description, links, word_count
ScrapitSelectorTool   {"url": "...", "selectors": {...}}  JSON with extracted fields
ScrapitDirectiveTool  Directive name                      JSON with scraped data

LangChain agent

from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from scraper.integrations.langchain import ScrapitToolkit

agent = initialize_agent(
    tools=ScrapitToolkit().get_tools(),
    llm=ChatOpenAI(model="gpt-4o"),
    agent=AgentType.OPENAI_FUNCTIONS,
)
agent.run("What's on the front page of Hacker News?")

CrewAI

from crewai import Agent
from scraper.integrations.langchain import ScrapitToolkit

researcher = Agent(
    role="Web Researcher",
    goal="Find and summarize information from the web",
    tools=ScrapitToolkit().get_tools(),
    llm="gpt-4o",
)

Document Loader (RAG)

from scraper.integrations.langchain import ScrapitLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = ScrapitLoader("https://example.com")
docs   = loader.load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000).split_documents(docs)

AI: LlamaIndex

$ pip install scrapit-scraper[llamaindex]

from scraper.integrations.llamaindex import ScrapitReader
from llama_index.core import VectorStoreIndex

reader = ScrapitReader()

docs = reader.load_data(url="https://example.com")
docs = reader.load_data(urls=["https://a.com", "https://b.com"])
docs = reader.load_data(directive="wikipedia")

engine = VectorStoreIndex.from_documents(docs).as_query_engine()
result = engine.query("What is the main topic?")

Python API

Run directives programmatically from your own Python code.

import asyncio
from scraper.scrapers import grab_elements_by_directive
from scraper.storage import json_file, sqlite

result = asyncio.run(
    grab_elements_by_directive("scraper/directives/hn.yaml")
)

json_file.save(result, "hn")
sqlite.save(result, "hn")

RabbitMQ queue

Send directives to a background worker queue. Useful for scheduled scraping at scale.

$ pip install scrapit-scraper[rabbitmq]

Producer: send a directive to the queue

from scraper.queue.producer import call_producer
call_producer("directives/hn.yaml")

Consumer: start a blocking worker

$ python -m scraper.queue.consumer

Configure via RABBITMQ_HOST and RABBITMQ_QUEUE in .env. Workers save results to MongoDB automatically.