The Web Archive API lets you search Bright Data’s cached data collections and retrieve Data Snapshots from them.
To access this API, you will need a Bright Data API token.
To initiate a search of the Web Archive, send a request to the /search endpoint. Endpoint: POST api.brightdata.com/webarchive/search
Request
POST api.brightdata.com/webarchive/search
{
    filters: {
        max_age?: Duration,
        min_date?: yyyy-mm-dd,
        max_date?: yyyy-mm-dd,
        domain_whitelist?: ['example.com'],
        domain_blacklist?: ['example.com'],
        domain_regex_whitelist?: ['.*example..*'],
        domain_regex_blacklist?: ['.*example..*'],
        category_whitelist?: ['Automotive'],
        category_blacklist?: ['Automotive'],
        path_regex_whitelist?: ['.*/products/.*'],
        path_regex_blacklist?: ['.*/products/.*'],
        language_whitelist?: ['eng'], // ISO 639-3 (three-letter) language codes
        language_blacklist?: ['eng'],
        ip_country_whitelist?: ['us', 'ie', 'in'],
        ip_country_blacklist?: ['mx', 'ae', 'ca'],
        captcha?: true,
        robots_block?: true,
    }
}
You can run up to 100 searches per day without triggering a dump. Once you trigger a dump for a search, that search no longer counts against your limit.
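The search request above can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the `https://` scheme, the `Authorization: Bearer` header format, and the `build_search_request` helper name are assumptions that do not appear in the text above. The request is only prepared here, not sent.

```python
import json
import urllib.request

# Assumption: the API is served over HTTPS at this host.
API_BASE = "https://api.brightdata.com/webarchive"


def build_search_request(api_token, filters):
    """Prepare (but do not send) a POST /search request with the given filters."""
    body = json.dumps({"filters": filters}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/search",
        data=body,
        method="POST",
        headers={
            # Assumption: the API token is passed as a Bearer token.
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


request = build_search_request(
    "YOUR_API_TOKEN",
    {
        "max_age": "1d",
        "domain_whitelist": ["example.com"],
        "language_whitelist": ["eng"],
    },
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

To actually run the search, pass the prepared request to `urllib.request.urlopen` and parse the JSON response.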

Get Search Status

Check the status of a specific search. Endpoint: GET api.brightdata.com/webarchive/search/<search_id> When successful, the response includes:
  • The number of entries for your query
  • The estimated size and cost of the full Data Snapshot
GET api.brightdata.com/webarchive/search/<search_id>

Get All Search Statuses

Check the status of all current searches. Endpoint: GET api.brightdata.com/webarchive/searches
GET api.brightdata.com/webarchive/searches
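The two status endpoints differ only in their path, so a small helper can build either URL. This is an illustrative sketch: the `search_status_url` name and the `https://` scheme are assumptions, and the search ID is percent-encoded as a precaution since its format is not described here.

```python
from urllib.parse import quote

# Assumption: the API is served over HTTPS at this host.
API_BASE = "https://api.brightdata.com/webarchive"


def search_status_url(search_id=None):
    """Return the status URL for one search, or for all searches if no ID is given."""
    if search_id is None:
        return f"{API_BASE}/searches"
    # Percent-encode the ID so it cannot alter the URL path.
    return f"{API_BASE}/search/{quote(search_id, safe='')}"


print(search_status_url())
print(search_status_url("my_search_id"))
```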

How date range affects delivery time

If your query matches only data from the last 72 hours, your snapshot starts processing and delivering immediately. If some of the matched data is older than 72 hours, it must first be retrieved from a colder archive, and delivery may take up to 72 hours.
We recommend using max_age = 1d for initial testing.
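The 72-hour rule can be expressed as a simple check on the oldest date a query may match. This is a rough sketch for planning purposes only: the `needs_cold_retrieval` helper is hypothetical, and treating a match at exactly 72 hours as still "hot" is an assumption, since the boundary case is not specified.

```python
from datetime import datetime, timedelta, timezone

# The 72-hour window described above.
HOT_WINDOW = timedelta(hours=72)


def needs_cold_retrieval(oldest_match, now=None):
    """Return True if the oldest matched data is more than 72 hours old."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - oldest_match > HOT_WINDOW


# A fixed reference time keeps the example deterministic.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
print(needs_cold_retrieval(datetime(2024, 1, 9, tzinfo=timezone.utc), now))  # False
print(needs_cold_retrieval(datetime(2024, 1, 1, tzinfo=timezone.utc), now))  # True
```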

Deliver Snapshot to Amazon S3 Storage

To use S3 storage delivery, you will first need to do the following:
  • Create an AWS role that gives Bright Data access to your system.
    • During this setup, Amazon will ask you for an “external ID” to use with the role.
    • Your external ID for S3 is your Bright Data Account ID, which can be found in Account Settings.
  • Once the role is created, allow our system delivery role to AssumeRole on it.
    • Our system delivery role is: arn:aws:iam::422310177405:role/brd.ec2.zs-dca-delivery
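The role setup above corresponds to an IAM trust policy on the role you create. The sketch below builds one in the standard AWS trust-policy format; the `build_trust_policy` helper and the placeholder account ID are illustrative. The permissions the role needs on your bucket are not specified here and are therefore not shown.

```python
import json

# The system delivery role named above.
BRIGHT_DATA_DELIVERY_ROLE = "arn:aws:iam::422310177405:role/brd.ec2.zs-dca-delivery"


def build_trust_policy(external_id):
    """Build a trust policy letting the delivery role assume your role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": BRIGHT_DATA_DELIVERY_ROLE},
                "Action": "sts:AssumeRole",
                # The external ID is your Bright Data Account ID.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder value; replace with your own Account ID.
policy = build_trust_policy("YOUR_BRIGHT_DATA_ACCOUNT_ID")
print(json.dumps(policy, indent=2))
```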
To deliver a specific Snapshot from a specific search_id to S3 storage, use the following /dump endpoint. Endpoint: POST api.brightdata.com/webarchive/dump
POST api.brightdata.com/webarchive/dump
{
    search_id: <search_id>,
    max_entries?: 1000000, // (optional) limit how many files you purchase
    delivery: {
        strategy: 's3',
        settings: {
            bucket: <your_bucket_name>,
            assume_role: {
                role_arn: <role_you_created_above>,
            },
        },
    },
}
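As a rough illustration, the S3 dump body above can be assembled as a Python dictionary and serialized to JSON. The `build_s3_dump_payload` helper and all example values (search ID, bucket name, role ARN) are placeholders, not real identifiers.

```python
import json


def build_s3_dump_payload(search_id, bucket, role_arn, max_entries=None):
    """Build the JSON body for POST /dump with S3 delivery."""
    payload = {
        "search_id": search_id,
        "delivery": {
            "strategy": "s3",
            "settings": {
                "bucket": bucket,
                "assume_role": {"role_arn": role_arn},
            },
        },
    }
    # max_entries is optional; include it only when a limit is requested.
    if max_entries is not None:
        payload["max_entries"] = max_entries
    return payload


# All values below are placeholders.
payload = build_s3_dump_payload(
    search_id="YOUR_SEARCH_ID",
    bucket="your-bucket-name",
    role_arn="arn:aws:iam::123456789012:role/your-delivery-role",
    max_entries=1000000,
)
print(json.dumps(payload, indent=4))
```

Leaving out `max_entries` omits the key from the body, matching its optional status in the schema.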

Collect Snapshot via Webhook

Collect a Data Snapshot via webhook from a specific search_id. Endpoint: POST api.brightdata.com/webarchive/dump
{
    search_id: <search_id>,
    max_entries?: 1000000,
    delivery: {
        strategy: 'webhook',
        settings: {
            url: string(),
            auth?: string(), // will be added as an Authorization header
        },
    }
}
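Because the `auth` value is added as an Authorization header, the receiving endpoint can use it to reject requests that do not come from the delivery. The sketch below shows one way to check it. The `is_authorized` helper is hypothetical, it assumes the header arrives exactly as the string you supplied in `auth`, and it treats `headers` as a plain dictionary; web frameworks usually provide a case-insensitive mapping instead.

```python
import hmac


def is_authorized(headers, expected_auth):
    """Check the incoming Authorization header against the value set in `auth`."""
    received = headers.get("Authorization", "")
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(
        received.encode("utf-8"), expected_auth.encode("utf-8")
    )


# Placeholder secret; use the same string you passed as `auth` in the dump request.
expected = "Bearer my-webhook-secret"
print(is_authorized({"Authorization": "Bearer my-webhook-secret"}, expected))
print(is_authorized({"Authorization": "Bearer wrong"}, expected))
print(is_authorized({}, expected))
```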

Get Status of Data Snapshot

Check the status of a specific Data Snapshot (dump) using the dump_id. Endpoint: GET api.brightdata.com/webarchive/dump/<dump_id>
GET api.brightdata.com/webarchive/dump/<dump_id>

Get the Status of all Data Snapshots

Check the status of all Data Snapshots (dumps). Endpoint: GET api.brightdata.com/webarchive/dumps
200 OK
[
    {
        dump_id: 'ID',
        status: 'in_progress',
        batches_total: 130,
        batches_uploaded: 29,
        files_total: 1241241251,
        estimate_finish: Date
    },
    {
        dump_id: 'ID',
        status: 'done',
        batches_total: 130,
        files_total: 1241241251,
        files_uploaded: 2412515,
        completed_at: Date
    }
    // ... rest of the dumps
]
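The response above can be summarized as a completion percentage per dump. This is a minimal sketch that relies only on the fields shown in the sample response; the `dump_progress` helper is hypothetical, and treating `done` as 100% is an assumption.

```python
def dump_progress(dump):
    """Return upload progress as a percentage, or None if it cannot be computed."""
    if dump.get("status") == "done":
        return 100.0
    total = dump.get("batches_total")
    uploaded = dump.get("batches_uploaded")
    if not total or uploaded is None:
        return None
    return round(100.0 * uploaded / total, 1)


# Sample data mirroring the response shown above.
dumps = [
    {"dump_id": "ID", "status": "in_progress", "batches_total": 130, "batches_uploaded": 29},
    {"dump_id": "ID", "status": "done", "batches_total": 130},
]
for dump in dumps:
    print(dump["status"], dump_progress(dump))
```

For the sample data, 29 of 130 batches works out to 22.3%.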

High-level process flow diagram

flow diagram