Crawling is a long-running task. To get the content of a crawl, you first create a crawl job and then check the results of that job.
Request
import { Supadata, JobId } from '@supadata/js';

// Initialize the client
const supadata = new Supadata({
  apiKey: 'YOUR_API_KEY',
});

// Crawl website
const crawl: JobId = await supadata.web.crawl({
  url: 'https://supadata.ai',
  limit: 10,
});

console.log(`Started crawl job: ${crawl.jobId}`);
The crawler will follow only child links. For example, if you crawl https://supadata.ai/blog, the crawler will follow links like https://supadata.ai/blog/article-1, but not https://supadata.ai/about. To crawl the whole website, provide the top-level URL (i.e. https://supadata.ai) as the URL to crawl.
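For instance, reusing the client from the example above, the two calls below scope the crawl differently (a sketch; the URLs and limits are illustrative):

// Crawl only pages under /blog (child links of the starting URL)
const blogCrawl: JobId = await supadata.web.crawl({
  url: 'https://supadata.ai/blog',
  limit: 10,
});

// Crawl the whole website by starting from the top-level URL
const siteCrawl: JobId = await supadata.web.crawl({
  url: 'https://supadata.ai',
  limit: 100,
});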
Parameters
Parameter | Type | Required | Description |
---|---|---|---|
url | string | Yes | URL of the website to crawl |
limit | number | No | Maximum number of pages to crawl. Defaults to 100. |
Response
{
  "jobId": "string" // The ID of the crawl job
}
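In TypeScript terms, the response maps to a small shape like the following sketch (the example above annotates this value with the SDK's JobId type):

// Sketch of the crawl response shape described above
interface CrawlJobResponse {
  jobId: string; // The ID of the crawl job
}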
Results
After starting a crawl job, you can check its status. Once the job is completed, you can fetch the results of the crawl. Results can be paginated for large crawls; in that case the response contains a next field which you can use to get the next page of results.
Crawl Job
import { Supadata } from "@supadata/js";

// Initialize the client
const supadata = new Supadata({
  apiKey: "YOUR_API_KEY",
});

// Get crawl job results using the jobId returned when the crawl was started
// This automatically handles pagination and returns all pages
const crawlResult = await supadata.web.getCrawlResults(jobId);

if (crawlResult.status === "completed") {
  console.log("Crawl job completed successfully!");
  console.log(`Total pages crawled: ${crawlResult.pages.length}`);

  // Process each page
  crawlResult.pages.forEach((page, index) => {
    console.log(`Page ${index + 1}: ${page.name}`);
    console.log(`URL: ${page.url}`);
    console.log(`Description: ${page.description}`);
    console.log(`Content preview: ${page.content.substring(0, 100)}...`);
    console.log("---");
  });
} else if (crawlResult.status === "failed") {
  console.error("Crawl job failed:", crawlResult.error);
} else {
  console.log("Job status:", crawlResult.status);
}
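The example above inspects the job once. Because crawling is long-running, you may want to poll until the job leaves the scraping state. Here is a minimal sketch; the waitForCrawl helper, polling interval, and attempt cap are illustrative choices, not part of the SDK:

import { Supadata } from "@supadata/js";

const supadata = new Supadata({ apiKey: "YOUR_API_KEY" });

// Poll the crawl job until it is no longer 'scraping'
async function waitForCrawl(jobId: string, intervalMs = 5000, maxAttempts = 60) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await supadata.web.getCrawlResults(jobId);
    if (result.status !== "scraping") {
      return result; // 'completed', 'failed' or 'cancelled'
    }
    // Wait before checking again
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Crawl job ${jobId} is still running after ${maxAttempts} checks`);
}

// Usage: const crawlResult = await waitForCrawl(crawl.jobId);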
Crawl Results
{
  "status": "string", // The status of the crawl job: 'scraping', 'completed', 'failed' or 'cancelled'
  "pages": [
    // If the job is completed, contains the list of pages that were crawled
    {
      "url": "string", // The URL that was scraped
      "content": "string", // The markdown content extracted from the URL
      "name": "string", // The title of the webpage
      "description": "string" // A description of the webpage
    }
  ],
  "next": "string" // Large crawls will be paginated. Call this endpoint to get the next page of results
}
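For typed access in TypeScript, the same shape can be written as interfaces like these (a sketch derived from the schema above; the field optionality reflects the status-dependent fields and is an assumption):

// Sketch of the crawl results shape returned by getCrawlResults
interface CrawlPage {
  url: string; // The URL that was scraped
  content: string; // The markdown content extracted from the URL
  name: string; // The title of the webpage
  description: string; // A description of the webpage
}

interface CrawlResults {
  status: "scraping" | "completed" | "failed" | "cancelled";
  pages?: CrawlPage[]; // Present once the job is completed
  next?: string; // Present for paginated (large) crawls
  error?: string; // Present when the job failed (used in the example above)
}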
Error Codes
The API returns HTTP status codes and error codes. See this page for more details.
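For example, a failed request can be handled around the SDK call like this (a sketch that assumes the SDK surfaces API errors as thrown exceptions; check the error codes reference for the exact error shapes):

try {
  const crawl = await supadata.web.crawl({ url: "https://supadata.ai", limit: 10 });
  console.log(`Started crawl job: ${crawl.jobId}`);
} catch (error) {
  // Assumption: the thrown error carries the HTTP status and error code returned by the API
  console.error("Crawl request failed:", error);
}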
Respect robots.txt and website terms of service when scraping web content.
Pricing
- 1 crawl request = 1 credit
- 1 crawled page = 1 credit
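Based on these rates, a crawl job that returns 10 pages would cost 11 credits: 1 for the crawl request plus 1 per crawled page.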