Define your target in a config file. Scrapit handles fetching, parsing, transforms, validation, storage, and AI agent integration — zero boilerplate.
Define your target once. Re-run anytime with different output flags.
ScrapitToolkit or the MCP server to give any agent web access.$ scrapit scrape product --json --diff { "title": "A Light in the Attic", "price": 51.77, "rating": "Three", "url": "https://books.toscrape.com/...", "timestamp": "2026-03-05 09:00:00" } → saved: output/product.json → no changes since last run
site: https://myblog.com use: beautifulsoup follow: selector: 'a.post-link' attr: href max: 50 same_domain: true scrape: title: - 'h1.post-title' - attr: text body: - 'div.post-content' - attr: text date: - 'time' - attr: datetime
from scraper.integrations.anthropic import ScrapitAnthropicAgent agent = ScrapitAnthropicAgent(model="claude-opus-4-6") answer = agent.run( "What are the top 5 Hacker News posts today? " "Summarize each one in one sentence." ) # Agent automatically calls scrape_page("https://news.ycombinator.com"), # reads the results, and produces a final answer.
site: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000 use: beautifulsoup # or: playwright headers: Accept-Language: en-US,en;q=0.9 cache: ttl: 3600 # don't re-fetch during dev scrape: title: - 'h1' - attr: text price: - 'p.price_color' - attr: text rating: - 'p.star-rating' - attr: class tags: - 'ul.breadcrumb li' - attr: text all: true # return all matches as list transform: price: - strip - replace: {"£": ""} - float rating: - last # "star-rating Three" → "Three" title: - strip - slugify # "A Light in the Attic" → "a-light-in-the-attic" validate: price: required: true type: float min: 0 title: required: true min_length: 2 notify: webhook: https://hooks.slack.com/... # fires when --diff detects a price change
All configured from YAML. No Python needed to add a new scraping target.
BeautifulSoup, Playwright (JS), httpx (async), GraphQL, and Bright Data. Switch with use:.
Follow "next page" links automatically with paginate:. Controls selector, attr, and max_pages.
Discover and scrape linked pages from an index. Set follow.parallel: 10 for async concurrent fetching with httpx.
28+ declarative transforms per field: strip, regex, float, date, hash, boolean, slugify and more.
Enforce data quality at scrape time: required, type, min/max, pattern, in. Runs after transforms.
JSON, CSV, SQLite, MongoDB, PostgreSQL, Excel, Google Sheets, Parquet. Mix and match with CLI flags on each run.
File-based or Redis-backed cache with TTL. Share cache across instances with cache: {backend: redis}.
Diff results against the previous run. Fire a webhook when content changes. Use with --diff.
Define a pool of proxies per directive. Scrapit rotates round-robin or randomly and retries on failure.
Playwright with stealth: true — randomised UA, viewport, locale, and navigator fingerprint injection.
Add schedule: "*/30 * * * *" to any directive and run scrapit run as a daemon. Cron syntax supported.
Use --stream to emit NDJSON lines to stdout as each spider page completes. No waiting for the full run.
Migrate data between backends: scrapit export --from sqlite --to csv --directive hn. Supports --since date filter.
Run scrapit serve to open a dark-theme UI for browsing results, running directives, and downloading output.
Register Python callbacks for before_scrape, after_scrape, on_error, on_change, on_save.
Publish custom transforms or storage backends as pip packages. Scrapit discovers them via entry_points at startup.
Native integrations for Claude, OpenAI, LangChain, CrewAI, LlamaIndex, and MCP. One import away.
Timing, field coverage, and validation summary printed after every scrape. Always know what happened.
Transforms run in order, per field, after scraping. Chain as many as you need in YAML.
| Transform | Example | Description |
|---|---|---|
| strip | - strip | Remove leading/trailing whitespace |
| lower / upper / title | - lower | Change case |
| int / float | - float | Parse number, removes non-numeric chars |
| regex | {regex: '\d+'} | Extract first match |
| regex_group | {regex_group: {pattern: '(\d+)', group: 1}} | Extract a capture group |
| replace | {replace: {"€": ""}} | String substitution (multiple pairs) |
| split | {split: ","} | Split string into list |
| join | {join: ", "} | Join list into string |
| first / last | - first | Pick first or last item from list |
| default | {default: "N/A"} | Fallback when value is None |
| slice | {slice: {end: 200}} | Substring or sublist |
| prepend / append | {prepend: "https:"} | Add text before or after |
| remove_tags | - remove_tags | Strip HTML tags from string |
| template | {template: "USD {value}"} | String template with {value} or {field} |
| slugify | - slugify | "Hello World" → "hello-world" |
| normalize_whitespace | - normalize_whitespace | Collapse multiple spaces/newlines |
| truncate | {truncate: 100} | Cut at N chars without breaking words |
| capitalize | - capitalize | First char upper, rest unchanged |
| sentence_case | - sentence_case | First char upper, rest lower |
| count | - count | Length of string or list |
| boolean | - boolean | "true"/"yes"/"1" → True, "false"/"no" → False |
| date | - date | Parse date string to ISO YYYY-MM-DD (auto-detects format) |
| parse_date | {parse_date: {input_format: "%d/%m/%Y"}} | Parse date with custom format, optional output format |
| pad | {pad: {width: 5, char: "0", side: left}} | Pad string to fixed width |
| hash | {hash: sha256} | Hash value with md5/sha1/sha256/sha512 |
Transforms chain left to right. Each step receives the output of the previous one.
transform: price: - strip # " £ 12,99 " → "£ 12,99" - replace: {"£": "", ",": "."} # "£ 12,99" → " 12.99" - float # " 12.99" → 12.99 title: - strip - normalize_whitespace - slugify # "My Title!" → "my-title" description: - remove_tags # strip <b>, <span> etc. - strip - truncate: 200 tags: - split: "," # "a, b, c" → ["a", " b", " c"] - join: " | " # ["a", "b", "c"] → "a | b | c" image_url: - prepend: "https://site.com" - default: "/placeholder.jpg"
validate: price: required: true type: float min: 0 max: 10000 title: required: true min_length: 2 max_length: 500 status: in: [active, inactive, pending] sku: pattern: '^[A-Z]{2}\d{4}$'
Same directive, different flag. Mix multiple outputs per run.
Last scrape saved as a single JSON file. Overwritten on each run. Good for monitoring and APIs.
Appends one row per run. Good for building a time-series dataset or importing into spreadsheets.
Zero-config local database. All runs stored in output/scrapit.db. Query with scrapit query.
Stores results in a MongoDB collection. Configure via MONGO_URI in .env.
Enterprise-grade relational storage. Configure via POSTGRES_URL. Requires psycopg2.
Appends rows to an .xlsx workbook. One sheet per directive. Requires openpyxl.
Appends rows to a live Google Sheets spreadsheet. Uses service account credentials.
Columnar binary format for analytics pipelines. Requires pyarrow. Great for Pandas / DuckDB.
$ scrapit scrape product --json --csv --sqlite --diff $ scrapit query --backend sqlite --directive product --limit 10 $ scrapit batch scraper/directives/ --sqlite # run all directives
Native support for every major framework. All dependencies are optional — install only what you need.
# Claude Code $ claude mcp add scrapit -- \ python -m scraper.integrations.mcp # claude_desktop_config.json { "mcpServers": { "scrapit": { "command": "python", "args": ["-m", "scraper.integrations.mcp"] } } }
from scraper.integrations.anthropic import ( ScrapitAnthropicAgent, as_anthropic_tools, handle_tool_call, ) # Built-in agentic loop agent = ScrapitAnthropicAgent( model="claude-opus-4-6" ) answer = agent.run( "What are today's top HN posts?" )
from scraper.integrations.openai import ( ScrapitOpenAIAgent, as_openai_functions, handle_function_call, ) # Built-in agentic loop agent = ScrapitOpenAIAgent(model="gpt-4o") answer = agent.run( "Summarize the Python Wikipedia page." )
from scraper.integrations.langchain import ScrapitToolkit tools = ScrapitToolkit().get_tools() # → [ScrapitTool, ScrapitPageTool, ScrapitSelectorTool] # LangChain agent agent = initialize_agent( tools=tools, llm=ChatOpenAI(model="gpt-4o"), agent=AgentType.OPENAI_FUNCTIONS, ) # CrewAI — same tools, different framework researcher = Agent(role="Researcher", tools=tools)
from scraper.integrations.llamaindex import ScrapitReader from llama_index.core import VectorStoreIndex reader = ScrapitReader() # Parallel URL scraping docs = reader.load_data(urls=[ "https://site1.com", "https://site2.com", ]) index = VectorStoreIndex.from_documents(docs) engine = index.as_query_engine() engine.query("Summarize the main points.")
from scraper.integrations import ( scrape_url, # → clean text scrape_page, # → structured metadata scrape_with_selectors,# → CSS-defined fields scrape_many, # → parallel URLs ) page = scrape_page("https://example.com") # → {title, description, main_content, links, word_count} data = scrape_with_selectors("https://example.com", { "title": "h1", "price": ".price-color", })
All contributions are welcome — code, directives, docs, or just opening an issue.
10 issues open — new transforms, directives for Reddit/IMDb/PyPI, and tests.
View open issues →Got a working YAML for a site? Share it — it's the easiest way to contribute.
Share a directive →Add a function to transforms/__init__.py, document it, open a PR.
Live from GitHub. Pick one up.
No. For standard scraping you only write YAML. Python is only needed if you want to add custom hooks or a new storage backend.
Yes. Set use: playwright in your directive. Install with pip install scrapit-scraper[playwright] and run playwright install chromium.
Use the MCP server (python -m scraper.integrations.mcp) for Claude Desktop/Cursor, or use ScrapitAnthropicAgent / ScrapitOpenAIAgent in your own code.
scrape_url and scrape_page?scrape_url returns clean plain text. scrape_page returns a structured dict with title, description, links, and word count — more useful for agent context.
Yes, three ways: scrape_many(urls) for parallel URL scraping, paginate: in a directive for sequential pages, or follow: for spider mode.
Run with --diff. Scrapit compares the current result against the previous JSON output (ignoring timestamps). If anything changed, it fires a webhook if configured.