Exporting Data

All crawled pages are stored permanently in your account. Export them at any time in four formats — no lock-in, no expiry, no re-crawl required.

Export formats

JSON (.json) — Array of page objects with all fields. Best for programmatic processing. Good for: APIs, JavaScript, Python.

JSONL (.jsonl) — One JSON object per line. Ideal for streaming large datasets or LLM fine-tuning. Good for: LLM training, data pipelines.

CSV (.csv) — Spreadsheet-compatible, with URL, status, title, and metadata columns. No HTML/Markdown content. Good for: Excel, Google Sheets, analytics.

Markdown (.zip) — One .md file per page, named by URL slug. Includes all content, ready for docs pipelines. Good for: RAG, knowledge bases, docs.

Export from the dashboard

Open any completed crawl job in your dashboard. Click the Export dropdown in the top-right corner and select your format. The file downloads immediately.

Export sizes

Dashboard exports include all pages for jobs with up to 100 pages. For larger jobs, use the API to paginate and process pages in batches.

Export via API

For large crawls or automated pipelines, paginate through pages using the list endpoint and process them in your own code.

Paginate through all pages:

let cursor = 0;
const allPages = [];

while (true) {
  const res = await fetch(
    `/api/crawl/${jobId}/pages?limit=100&cursor=${cursor}`,
    { headers: { Authorization: `Bearer ${key}` } }
  );
  const { pages, cursor: next } = await res.json();
  allPages.push(...pages);
  if (pages.length < 100) break; // short batch means this was the last page
  cursor = next;                 // otherwise continue from the returned cursor
}

Each page in the list response omits heavy fields (html, markdown, extractedJson) by default. To retrieve the full content of an individual page, call GET /api/crawl/:id/pages/:pageId.
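Combining the two endpoints, a small helper can fetch the full content for any page returned by the list call. This is a sketch; `jobId`, `pageId`, and `key` are placeholders, and the response fields follow the description above:

```javascript
// Build the detail-endpoint URL for one page of a crawl job.
function pageUrl(jobId, pageId) {
  return `/api/crawl/${jobId}/pages/${pageId}`;
}

// Fetch the full page object, including the heavy fields
// (html, markdown, extractedJson) that the list endpoint omits.
async function fetchFullPage(jobId, pageId, key) {
  const res = await fetch(pageUrl(jobId, pageId), {
    headers: { Authorization: `Bearer ${key}` },
  });
  if (!res.ok) throw new Error(`Page fetch failed: ${res.status}`);
  return res.json();
}
```

When hydrating many pages this way, consider limiting concurrency (for example, a few requests at a time) rather than firing one fetch per page at once.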