Scrapit — YAML-driven web scraper framework

01 —

Write YAML.
Get data.

Define your target once. Re-run anytime with different output flags.

01

Describe your target — site URL, CSS selectors, transforms, and validation rules in a YAML file.
02

Run one command — Scrapit fetches the page, applies your pipeline, and saves the result.
03

Pick your output — JSON, CSV, SQLite, or MongoDB. Same directive, different flag.
04

Plug into AI — use ScrapitToolkit or the MCP server to give any agent web access.

terminal

$ scrapit scrape product --json --diff

{
  "title": "A Light in the Attic",
  "price": 51.77,
  "rating": "Three",
  "url": "https://books.toscrape.com/...",
  "timestamp": "2026-03-05 09:00:00"
}

→ saved: output/product.json
→ no changes since last run

directives/blog.yaml

site: https://myblog.com
use: beautifulsoup

follow:
  selector: 'a.post-link'
  attr: href
  max: 50
  same_domain: true

scrape:
  title:
    - 'h1.post-title'
    - attr: text
  body:
    - 'div.post-content'
    - attr: text
  date:
    - 'time'
    - attr: datetime

agent.py

from scraper.integrations.anthropic import ScrapitAnthropicAgent

agent = ScrapitAnthropicAgent(model="claude-opus-4-6")

answer = agent.run(
  "What are the top 5 Hacker News posts today? "
  "Summarize each one in one sentence."
)

# Agent automatically calls scrape_page("https://news.ycombinator.com"),
# reads the results, and produces a final answer.

directives/product.yaml

site: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000
use: beautifulsoup   # or: playwright

headers:
  Accept-Language: en-US,en;q=0.9

cache:
  ttl: 3600    # don't re-fetch during dev

scrape:
  title:
    - 'h1'
    - attr: text
  price:
    - 'p.price_color'
    - attr: text
  rating:
    - 'p.star-rating'
    - attr: class
  tags:
    - 'ul.breadcrumb li'
    - attr: text
      all: true    # return all matches as list

transform:
  price:
    - strip
    - replace: {"£": ""}
    - float
  rating:
    - last         # "star-rating Three" → "Three"
  title:
    - strip
    - slugify      # "A Light in the Attic" → "a-light-in-the-attic"

validate:
  price:
    required: true
    type: float
    min: 0
  title:
    required: true
    min_length: 2

notify:
  webhook: https://hooks.slack.com/...
  # fires when --diff detects a price change

02 —

Everything you need,
nothing you don't.

All configured from YAML. No Python needed to add a new scraping target.

// engines

Five scraping backends

BeautifulSoup, Playwright (JS), httpx (async), GraphQL, and Bright Data. Switch with use:.

// pagination

Pagination

Follow "next page" links automatically with paginate:. Controls selector, attr, and max_pages.

// spider

Spider mode

Discover and scrape linked pages from an index. Set follow.parallel: 10 for async concurrent fetching with httpx.

// pipeline

Transform pipeline

28+ declarative transforms per field: strip, regex, float, date, hash, boolean, slugify and more.

// validation

Per-field validation

Enforce data quality at scrape time: required, type, min/max, pattern, in. Runs after transforms.

// storage

8 storage backends

JSON, CSV, SQLite, MongoDB, PostgreSQL, Excel, Google Sheets, Parquet. Mix and match with CLI flags on each run.

// cache

HTTP cache

File-based or Redis-backed cache with TTL. Share cache across instances with cache: {backend: redis}.

// diff

Change detection

Diff results against the previous run. Fire a webhook when content changes. Use with --diff.

// proxy

Proxy rotation

Define a pool of proxies per directive. Scrapit rotates round-robin or randomly and retries on failure.

// stealth

Stealth mode

Playwright with stealth: true — randomised UA, viewport, locale, and navigator fingerprint injection.

// schedule

Built-in scheduler

Add schedule: "*/30 * * * *" to any directive and run scrapit run as a daemon. Cron syntax supported.

// stream

Streaming output

Use --stream to emit NDJSON lines to stdout as each spider page completes. No waiting for the full run.

// export

Backend export

Migrate data between backends: scrapit export --from sqlite --to csv --directive hn. Supports --since date filter.

// dashboard

Web dashboard

Run scrapit serve to open a dark-theme UI for browsing results, running directives, and downloading output.

// hooks

Lifecycle hooks

// plugins

Plugin system

Publish custom transforms or storage backends as pip packages. Scrapit discovers them via entry_points at startup.

// ai

AI-native

Native integrations for Claude, OpenAI, LangChain, CrewAI, LlamaIndex, and MCP. One import away.

// reporting

Run reports

Timing, field coverage, and validation summary printed after every scrape. Always know what happened.

03 —

Clean data
before it's stored.

Transforms run in order, per field, after scraping. Chain as many as you need in YAML.

Transform	Example	Description
strip	- strip	Remove leading/trailing whitespace
lower / upper / title	- lower	Change case
int / float	- float	Parse number, removes non-numeric chars
regex	{regex: '\d+'}	Extract first match
regex_group	{regex_group: {pattern: '(\d+)', group: 1}}	Extract a capture group
replace	{replace: {"€": ""}}	String substitution (multiple pairs)
split	{split: ","}	Split string into list
join	{join: ", "}	Join list into string
first / last	- first	Pick first or last item from list
default	{default: "N/A"}	Fallback when value is None
slice	{slice: {end: 200}}	Substring or sublist
prepend / append	{prepend: "https:"}	Add text before or after
remove_tags	- remove_tags	Strip HTML tags from string
template	{template: "USD {value}"}	String template with {value} or {field}
slugify	- slugify	"Hello World" → "hello-world"
normalize_whitespace	- normalize_whitespace	Collapse multiple spaces/newlines
truncate	{truncate: 100}	Cut at N chars without breaking words
capitalize	- capitalize	First char upper, rest unchanged
sentence_case	- sentence_case	First char upper, rest lower
count	- count	Length of string or list
boolean	- boolean	"true"/"yes"/"1" → True, "false"/"no" → False
date	- date	Parse date string to ISO YYYY-MM-DD (auto-detects format)
parse_date	{parse_date: {input_format: "%d/%m/%Y"}}	Parse date with custom format, optional output format
pad	{pad: {width: 5, char: "0", side: left}}	Pad string to fixed width
hash	{hash: sha256}	Hash value with md5/sha1/sha256/sha512

Transforms chain left to right. Each step receives the output of the previous one.

price cleaning

transform:
  price:
    - strip                    # "  £ 12,99  " → "£ 12,99"
    - replace: {"£": "", ",": "."}
                              # "£ 12,99" → " 12.99"
    - float                    # " 12.99" → 12.99

  title:
    - strip
    - normalize_whitespace
    - slugify                  # "My  Title!" → "my-title"

  description:
    - remove_tags              # strip <b>, <span> etc.
    - strip
    - truncate: 200

  tags:
    - split: ","               # "a, b, c" → ["a", " b", " c"]
    - join: " | "             # ["a", "b", "c"] → "a | b | c"

  image_url:
    - prepend: "https://site.com"
    - default: "/placeholder.jpg"

validation

validate:
  price:
    required: true
    type: float
    min: 0
    max: 10000

  title:
    required: true
    min_length: 2
    max_length: 500

  status:
    in: [active, inactive, pending]

  sku:
    pattern: '^[A-Z]{2}\d{4}$'

04 —

Pick your output.

Same directive, different flag. Mix multiple outputs per run.

JSON

--json

Last scrape saved as a single JSON file. Overwritten on each run. Good for monitoring and APIs.

CSV

--csv

Appends one row per run. Good for building a time-series dataset or importing into spreadsheets.

SQLite

--sqlite

Zero-config local database. All runs stored in output/scrapit.db. Query with scrapit query.

MongoDB

--mongo

Stores results in a MongoDB collection. Configure via MONGO_URI in .env.

PostgreSQL

--postgres

Enterprise-grade relational storage. Configure via POSTGRES_URL. Requires psycopg2.

Excel

--excel

Appends rows to an .xlsx workbook. One sheet per directive. Requires openpyxl.

Google Sheets

--sheets

Appends rows to a live Google Sheets spreadsheet. Uses service account credentials.

Parquet

--parquet

Columnar binary format for analytics pipelines. Requires pyarrow. Great for Pandas / DuckDB.

multiple outputs + query

$ scrapit scrape product --json --csv --sqlite --diff

$ scrapit query --backend sqlite --directive product --limit 10

$ scrapit batch scraper/directives/ --sqlite  # run all directives

05 —

Plug into
any AI agent.

Native support for every major framework. All dependencies are optional — install only what you need.

MCP Server

Claude Desktop · Cursor · Claude Code

# Claude Code
$ claude mcp add scrapit -- \
    python -m scraper.integrations.mcp

# claude_desktop_config.json
{
  "mcpServers": {
    "scrapit": {
      "command": "python",
      "args": ["-m", "scraper.integrations.mcp"]
    }
  }
}

$ pip install scrapit-scraper[mcp]

Anthropic SDK

claude-opus-4-6

from scraper.integrations.anthropic import (
    ScrapitAnthropicAgent,
    as_anthropic_tools,
    handle_tool_call,
)

# Built-in agentic loop
agent = ScrapitAnthropicAgent(
    model="claude-opus-4-6"
)
answer = agent.run(
    "What are today's top HN posts?"
)

$ pip install scrapit-scraper[anthropic]

OpenAI SDK

gpt-4o

from scraper.integrations.openai import (
    ScrapitOpenAIAgent,
    as_openai_functions,
    handle_function_call,
)

# Built-in agentic loop
agent = ScrapitOpenAIAgent(model="gpt-4o")
answer = agent.run(
    "Summarize the Python Wikipedia page."
)

$ pip install scrapit-scraper[openai]

LangChain / CrewAI / LangGraph

Toolkit

from scraper.integrations.langchain import ScrapitToolkit

tools = ScrapitToolkit().get_tools()
# → [ScrapitTool, ScrapitPageTool, ScrapitSelectorTool]

# LangChain agent
agent = initialize_agent(
    tools=tools, llm=ChatOpenAI(model="gpt-4o"),
    agent=AgentType.OPENAI_FUNCTIONS,
)

# CrewAI — same tools, different framework
researcher = Agent(role="Researcher", tools=tools)

$ pip install scrapit-scraper[langchain]

LlamaIndex

RAG

from scraper.integrations.llamaindex import ScrapitReader
from llama_index.core import VectorStoreIndex

reader = ScrapitReader()

# Parallel URL scraping
docs = reader.load_data(urls=[
    "https://site1.com",
    "https://site2.com",
])

index  = VectorStoreIndex.from_documents(docs)
engine = index.as_query_engine()
engine.query("Summarize the main points.")

$ pip install scrapit-scraper[llamaindex]

Quick API

No YAML needed

from scraper.integrations import (
    scrape_url,           # → clean text
    scrape_page,          # → structured metadata
    scrape_with_selectors,# → CSS-defined fields
    scrape_many,          # → parallel URLs
)

page = scrape_page("https://example.com")
# → {title, description, main_content, links, word_count}

data = scrape_with_selectors("https://example.com", {
    "title": "h1",
    "price": ".price-color",
})

$ pip install scrapit-scraper

06 —

Built in public.
Open to everyone.

All contributions are welcome — code, directives, docs, or just opening an issue.

// bug

Bug report

Include your directive YAML and full error traceback.

Open a bug report →

// issues

Good first issues

10 issues open — new transforms, directives for Reddit/IMDb/PyPI, and tests.

View open issues →

// directive

Share a directive

Got a working YAML for a site? Share it — it's the easiest way to contribute.

Share a directive →

// transform

New transform

Add a function to transforms/__init__.py, document it, open a PR.

Read CONTRIBUTING.md →

08 —

Common questions.

Do I need to write Python to use Scrapit?

No. For standard scraping you only write YAML. Python is only needed if you want to add custom hooks or a new storage backend.

Does it work on JavaScript-rendered sites?

Yes. Set use: playwright in your directive. Install with pip install scrapit-scraper[playwright] and run playwright install chromium.

How do I give Claude or GPT-4 web access with Scrapit?

Use the MCP server (python -m scraper.integrations.mcp) for Claude Desktop/Cursor, or use ScrapitAnthropicAgent / ScrapitOpenAIAgent in your own code.

What's the difference between `scrape_url` and `scrape_page`?

scrape_url returns clean plain text. scrape_page returns a structured dict with title, description, links, and word count — more useful for agent context.

Can I scrape multiple pages at once?

Yes, three ways: scrape_many(urls) for parallel URL scraping, paginate: in a directive for sequential pages, or follow: for spider mode.

How does change detection work?

Run with --diff. Scrapit compares the current result against the previous JSON output (ignoring timestamps). If anything changed, it fires a webhook if configured.

Scrape any sitewith a YAML file. // no code needed

Write YAML.Get data.

Everything you need,nothing you don't.