Guide

JSON Extraction

Describe the data you want in plain English, and WebExtract returns a structured JSON object for every crawled page: product prices, article titles, author names, review scores, or any other information visible on the page.

How it works

1. You write a prompt

Describe the data structure you want extracted in natural language.

2. AI reads the page

Each crawled page is analyzed against your prompt using an LLM.

3. Structured JSON returned

Results are stored as extractedJson on each page object.
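
From the client's side, step 3 amounts to reading one field off each page object. The following is a minimal sketch, not the documented response format: only the extractedJson field is named in this guide, while the list-of-pages shape, the url key, and the sample data are assumptions made for illustration.

```python
import json

# Hypothetical crawl result. Only `extractedJson` comes from this guide;
# the list-of-pages shape and the `url` key are assumed for illustration.
raw_result = """
[
  {"url": "https://store.example.com/keyboard",
   "extractedJson": {"name": "Wireless Mechanical Keyboard", "price": 89.99, "inStock": true}},
  {"url": "https://store.example.com/mouse",
   "extractedJson": {"name": "Ergonomic Mouse", "price": 39.5, "inStock": false}}
]
"""

def collect_extracted(pages):
    """Map each page URL to the structured object extracted from that page."""
    return {page["url"]: page.get("extractedJson") for page in pages}

pages = json.loads(raw_result)
extracted = collect_extracted(pages)

for url, data in extracted.items():
    print(url, "->", data)
```

Using page.get("extractedJson") rather than indexing directly keeps the loop from failing on a page object that carries no extraction result.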

Example: Product data

Crawl an e-commerce site and extract product names, prices, and availability from every page.

Request
{
  "url": "https://store.example.com",
  "formats": ["json"],
  "jsonOptions": {
    "prompt": "Extract the product name, price in USD, and whether it is in stock.",
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "product",
        "schema": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "price": { "type": "number" },
            "inStock": { "type": "boolean" }
          }
        }
      }
    }
  }
}

page.extractedJson
{
  "name": "Wireless Mechanical Keyboard",
  "price": 89.99,
  "inStock": true
}
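
When the same request is sent for many sites, it can help to assemble the body in code. The sketch below only builds and serializes a body with the same shape as the request above. The endpoint and authentication are not covered in this section, so nothing is sent. The helper name build_extraction_request is hypothetical and not part of WebExtract.

```python
import json

def build_extraction_request(url, prompt, schema, schema_name, formats=("json",)):
    """Assemble a request body shaped like the example request in this guide.

    Hypothetical helper: it mirrors the documented fields and sends nothing.
    """
    return {
        "url": url,
        "formats": list(formats),
        "jsonOptions": {
            "prompt": prompt,
            "response_format": {
                "type": "json_schema",
                "json_schema": {"name": schema_name, "schema": schema},
            },
        },
    }

product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "inStock": {"type": "boolean"},
    },
}

body = build_extraction_request(
    url="https://store.example.com",
    prompt="Extract the product name, price in USD, and whether it is in stock.",
    schema=product_schema,
    schema_name="product",
)

# json.dumps escapes the prompt correctly, so the string never spans raw lines.
payload = json.dumps(body, indent=2)
print(payload)
```

Building the body from a dict and serializing it with json.dumps avoids hand-written JSON mistakes, such as a raw line break inside the prompt string.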

Prompt tips

Be specific about field names

Say "extract the product name as productName" rather than just "extract the name".

Use response_format for strict shapes

Providing a JSON schema in response_format keeps field names and types consistent across all pages.
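
A client-side check of the returned shape can still be a cheap safeguard before the data reaches a database. The following is a minimal sketch rather than a full JSON Schema validator: shape_problems is a hypothetical helper that handles only flat objects with string, number, and boolean fields, and it accepts null values because a field that is not found is returned as null.

```python
# Type checks for the JSON Schema primitives used in this guide's example.
# bool is excluded from "number" explicitly because bool subclasses int in Python.
_TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}

def shape_problems(data, schema):
    """Return a list of mismatches between `data` and a flat object schema."""
    problems = []
    for field, spec in schema.get("properties", {}).items():
        if field not in data:
            problems.append(f"missing field: {field}")
            continue
        value = data[field]
        if value is None:
            continue  # null means the field was not found on the page
        check = _TYPE_CHECKS.get(spec.get("type"))
        if check is not None and not check(value):
            problems.append(
                f"{field}: expected {spec['type']}, got {type(value).__name__}"
            )
    return problems

product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "inStock": {"type": "boolean"},
    },
}

good = {"name": "Wireless Mechanical Keyboard", "price": 89.99, "inStock": True}
bad = {"name": "Wireless Mechanical Keyboard", "price": "89.99"}

print(shape_problems(good, product_schema))  # []
print(shape_problems(bad, product_schema))
```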

Combine with markdown format

Request both "markdown" and "json" to get the full page text plus structured data.

Null fields are normal

If a field isn't found on a page, it will be null — not an error.
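
Because null fields are expected, downstream code usually needs to separate complete records from partial ones instead of treating null as a failure. The following is a minimal sketch, assuming page objects carry a url key next to extractedJson. The url key, the missing_fields helper, and the sample pages are illustrative assumptions.

```python
def missing_fields(extracted):
    """Return the names of fields that came back as null (None in Python)."""
    return sorted(name for name, value in extracted.items() if value is None)

# Hypothetical page objects; only `extractedJson` is named in this guide.
pages = [
    {"url": "https://store.example.com/keyboard",
     "extractedJson": {"name": "Wireless Mechanical Keyboard", "price": 89.99, "inStock": True}},
    {"url": "https://store.example.com/gift-card",
     "extractedJson": {"name": "Gift Card", "price": None, "inStock": True}},
]

# Records with every field present are ready to use.
usable = [p["extractedJson"] for p in pages if not missing_fields(p["extractedJson"])]

# Records with null fields are kept aside with the names of the missing fields.
incomplete = {
    p["url"]: missing_fields(p["extractedJson"])
    for p in pages
    if missing_fields(p["extractedJson"])
}

print(len(usable), "complete record(s)")
print(incomplete)
```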
