Web Archive API - Bright Data Docs

The Web Archive API allows you to access and retrieve Data Snapshots from Bright Data’s cached data collections in a seamless and efficient method.

To access this API, you will need a Bright Data API token

Run a Search

To initiate a search of our Web Archive, use the following /search endpoint. Endpoint: POST api.brightdata.com/webarchive/search

Request
Response
Code Example
Dictionary

Request

POST api.brightdata.com/webarchive/search
{
    filters: {
        max_age?: Duration,
        min_date?: yyyy-mm-dd,
        max_date?: yyyy-mm-dd,
        domain_whitelist?: ['example.com'],
        domain_blacklist?: ['example.com'],
        domain_regex_whitelist?: ['.*example..*'],
        domain_regex_blacklist?: ['.*example..*'],
        category_whitelist?: ['Automotive'],
        category_blacklist?: ['Automotive'],
        path_regex_whitelist?: ['.*/products/.*'],
        path_regex_blacklist?: ['.*/products/.*'],
        language_whitelist?: ['eng'], // ISO 639-3 letter language codes
        language_blacklist?: ['eng'],
        ip_country_whitelist?: ['us', 'ie', 'in'],
        ip_country_blacklist?: ['mx', 'ae', 'ca'],
        captcha?: true,
        robots_block?: true,
    }
}

{search_id: <search_id>}

curl -X POST https://api.brightdata.com/webarchive/search \
  -H "Authorization: Bearer $API_TOKEN" \
  -H 'Content-Type: application/json' \
  --data '{"filters": {"max_age": "1d", "domain_whitelist": ["example.com"]}}'

Here is a brief explanation of each of the parameters you are able to use in your requests:

Parameter	Description
max_age	Limits results to records collected within a specified time range.
min_date	Returns records collected on or after the specified date.
max_date	Returns records collected on or before the specified date.
domain_whitelist	Includes results only from listed domains.
domain_blacklist	Excludes results from listed domains.
category_whitelist	Includes results only from specified categories.
category_blacklist	Excludes results from specified categories.
path_regex_whitelist	Includes results only matching the specified path regex.
path_regex_blacklist	Excludes results matching the specified path regex.
language_whitelist	Includes results only for specific language codes (ISO 639-3).
language_blacklist	Excludes results for specific language codes.
ip_country_whitelist	Includes results collected through IPs or peers only from specified countries.
ip_country_blacklist	Excludes results collected through IPs or peers from specified countries.
captcha	Return only results with captcha triggered
robots_block	Return only results with robots block

You can run up to 100 searches per day without triggering a dump. Once you trigger a dump, that search no longer count against your limit.

Get Search Status

To check the status of a specific query that was made. Endpoint: GET api.brightdata.com/webarchive/search/<search_id> When successful it will retrieve:

The number of entries for your query
The estimated size and cost of the full Data Snapshot

Request
Response
Code Example

GET api.brightdata.com/webarchive/search/<search_id>

The status code of all three following responses is 200 OK

{
    search_id: "ID",
    status: "in_progress"
}

curl https://api.brightdata.com/webarchive/search/$SEARCH_ID \
  -H "Authorization: Bearer $API_TOKEN"

Get All Search Statuses

Check the status of all current searches. Endpoint: GET api.brightdata.com/webarchive/searches

Request
Response
Code Example

GET api.brightdat.com/webarchive/searches

200 OK

[
    {
        search_id: "ID",
        status: "in_progress"
    },
    {
        search_id: "ID",
        status: "done"
    },
    // ... + rest of the searches and status
}

Curl

curl https://api.brightdata.com/webarchive/searches \
  -H "Authorization: Bearer $API_TOKEN"

How data range affects delivery time

If your query is matching data within last 72h - your snapshot will start processing/delivering immediately. If some of your matched data is older than 72h - it needs to be retrieved from a colder archive before delivery and it may take up to 72h.

We recommend using max_age = 1d for initial testing.

Deliver Snapshot to Amazon S3 Storage

To use S3 storage delivery, you will first need to do the following:

Create an AWS role which gives Bright Data access to your system.
- During this setup, you will be asked by Amazon for an “external ID” that is used with the role.
- Your external ID for S3 is your Bright Data Account ID that can be found within Account Settings
Once a role is created, you will need to allow our system delivery role to AssumeRole that role.
- Our system delivery role is: arn:aws:iam::422310177405:role/brd.ec2.zs-dca-delivery

To deliver a specific Snapshot from a specific search_id to S3 storage, use the following /dump endpoint. Endpoint: POST api.brightdata.com/webarchive/dump

Request
Response
Code Example

POST api.brightdata.com/webarchive/dump
{
    search_id: <search_id>,
    max_entries?: 1000000, // (optional) limit how many files you purchase
    delivery: {
        strategy: 's3',
	    settings: {
            bucket: <your_bucket_name>,
            assume_role: {
                role_arn: <role_you_created_above>,
            },
        },
    },
}

200 OK

{dump_id: <dump_id>}

curl -X POST https://api.brightdata.com/webarchive/dump \
  -H "Authorization: Bearer $API_TOKEN" \
  -H 'Content-Type: application/json' \
  --data @- <<EOF
{
    "search_id": "$SEARCH_ID",
    "max_entries": 1000000,
    "delivery": {
        "strategy": "s3",
	    "settings": {
            "bucket": "$YOUR_BUCKET_NAME",
            "assume_role": {
                "role_arn": "$YOUR_DELIVERY_ROLE"
            }
        }
    }
}
EOF

Collect Snapshot via Webhook

Collect a Data Snapshot via webhook from a specific search_id Endpoint: POST api.brightdata.com/webarchive/dump

Request
Response
Code Example

{
    search_id: <search_id>,
    max_entries?: 1000000,
    delivery: {
		strategy: 'webhook',
		settings: {
             url: string(),
             auth?: string(), // will be added as an Authorization header
        },
    }
}

200 OK

{"dump_id": <dump_id>}

curl -X POST https://api.brightdata.com/webarchive/dump \
  -H "Authorization: Bearer $API_TOKEN" \
  -H 'Content-Type: application/json' \
  --data @- <<EOF
{
    "search_id": "$SEARCH_ID",
    "max_entries": 1000000,
    "delivery": {
        "strategy": "webhook",
	    "settings": {
            "url": "$YOUR_WEBHOOK_URL"
        }
    }
}
EOF

Get Status of Data Snapshot

Check the status of a specific Data Snapshot (dump) using the dump_id. Endpoint: GET api.brightdata.com/webarchive/dump/<dump_id>

Request
Response
Code Example

GET api.brightdata.com/webarchive/dump/<dump_id>

The status code of all three following responses is 200 OK

{
    dump_id: <id>,
    status: 'in_progress',
    batches_total: 130,
    batches_uploaded: 29,
    files_total: 1241241251,
    estimate_finish: ISODate
}

Curl

curl https://api.brightdata.com/webarchive/dump/$DUMP_ID \
  -H "Authorization: Bearer $API_TOKEN"

Get the Status of all Data Snapshots

Endpoint: GET api.brightdata.com/webarchive/dumps

Response
Code Example

200 OK

[
    {
        dump_id: 'ID',
        status: 'in_progress',
        batches_total: 130,
        batches_uploaded: 29,
        files_total: 1241241251,
        estimate_finish: Date
    },
    {
        dump_id: 'ID',
        status: 'done',
        batches_total: 130,
        files_total: 1241241251,
        files_uploaded: 2412515,
        completed_at: Date
    }
    // ... rest of the dumps
]

Curl

curl https://api.brightdata.com/webarchive/dumps \
  -H "Authorization: Bearer $API_TOKEN"

Documentation Index

​Run a Search

​Get Search Status

​Get All Search Statuses

​How data range affects delivery time

​Deliver Snapshot to Amazon S3 Storage

​Collect Snapshot via Webhook

​Get Status of Data Snapshot

​Get the Status of all Data Snapshots

​High-level process flow diagram

Run a Search

Get Search Status

Get All Search Statuses

How data range affects delivery time

Deliver Snapshot to Amazon S3 Storage

Collect Snapshot via Webhook

Get Status of Data Snapshot

Get the Status of all Data Snapshots

High-level process flow diagram