CommonCrawl.IndexAPI (CommonCrawl v0.3.1)

Functions for interacting with the Common Crawl index search API.

Further info: https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference

Functions

get(cdx_api_url, query, opts \\ [])

@spec get(String.t(), Enum.t(), keyword()) :: {:ok, list()} | {:error, any()}

Searches the Common Crawl CDX API for entries matching the given query parameters.

The cdx_api_url for a given crawl can be found in the output of CommonCrawl.collinfo(). The query must include a "url" parameter.

Parameters

  • cdx_api_url - The CDX API endpoint URL for a specific crawl
  • query - Map of query parameters (url is required)
  • opts - Request options passed to Req.get/2

Examples

# Search for a specific URL in a crawl
cdx_api_url = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"
{:ok, entries} = CommonCrawl.IndexAPI.get(cdx_api_url, %{"url" => "https://example.com"})

# Search with additional filters
{:ok, entries} = CommonCrawl.IndexAPI.get(
  cdx_api_url,
  %{
    "url" => "https://example.com",
    "filter" => "statuscode:200",
    "limit" => "1"
  }
)
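Since get/3 returns either {:ok, list} or {:error, reason}, matching directly on {:ok, entries} raises a MatchError when the request fails. The following is a minimal sketch of branching on both outcomes. ResultSketch is a hypothetical helper module, not part of this library, and it makes no assumption about the shape of individual entries. Whether an empty result arrives as {:ok, []} or as an error is not stated in these docs, so the empty-list clause is only a precaution.

```elixir
defmodule ResultSketch do
  # Hypothetical helper, not part of CommonCrawl.IndexAPI.
  # Accepts the value returned by CommonCrawl.IndexAPI.get/3.

  # Precautionary clause in case an empty list is ever returned.
  def summarize({:ok, []}), do: "no matching entries"

  # One or more entries matched; only the count is reported.
  def summarize({:ok, entries}) when is_list(entries) do
    "#{length(entries)} matching entries"
  end

  # The request failed; the reason can be any term, so inspect it.
  def summarize({:error, reason}), do: "lookup failed: #{inspect(reason)}"
end

# Usage with stand-in values instead of a live request:
IO.puts(ResultSketch.summarize({:ok, []}))
IO.puts(ResultSketch.summarize({:error, :timeout}))
```

In real use the argument would be the result of CommonCrawl.IndexAPI.get(cdx_api_url, query).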

get_latest_for_url(url, opts \\ [])

@spec get_latest_for_url(
  String.t(),
  keyword()
) :: {String.t(), non_neg_integer(), map()} | nil

Searches for the latest available version of a URL across recent crawls. Returns the most recent entry if found, nil otherwise.

Examples

# Search the 4 most recent crawls (the default)
CommonCrawl.IndexAPI.get_latest_for_url("https://example.com")

# Search the 6 most recent crawls
CommonCrawl.IndexAPI.get_latest_for_url("https://example.com", crawls_to_check: 6)
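Because the spec allows nil as well as a three-element tuple, callers will usually want to handle both cases. A minimal sketch, assuming a hypothetical helper module LatestSketch that is not part of this library; the "status" key is taken from the sample return value in these docs and is read with a fallback in case it is absent:

```elixir
defmodule LatestSketch do
  # Hypothetical helper, not part of CommonCrawl.IndexAPI.
  # Accepts the value returned by get_latest_for_url/2.

  # nil means the URL was not found in any of the crawls that were checked.
  def describe(nil), do: "not found"

  # A hit is a three-element tuple: URL key, timestamp, and a map of fields.
  def describe({key, timestamp, fields}) when is_map(fields) do
    status = Map.get(fields, "status", "unknown")
    "#{key} captured at #{timestamp} (status #{status})"
  end
end

# Usage with stand-in values instead of a live lookup:
IO.puts(LatestSketch.describe(nil))
IO.puts(LatestSketch.describe({"com,example)/", 20241214183829, %{"status" => "200"}}))
```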

Return Value

{"com,example)/", 20241214183829,
 %{
   "digest" => "JI6OR3QR4CI526JD6TMMNZNV4QPMPQCH",
   "encoding" => "UTF-8",
   "filename" => "crawl-data/CC-MAIN-2024-51/segments/1733066125982.36/warc/CC-MAIN-20241214181735-20241214211735-00433.warc.gz",
   "languages" => "eng",
   "length" => "1223",
   "mime" => "text/html",
   "mime-detected" => "text/html",
   "offset" => "36544657",
   "status" => "200",
   "url" => "http://www.example.com"
 }}
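
The fields in the returned map are strings, and the timestamp is a 14-digit integer in YYYYMMDDhhmmss form (UTC, as is conventional for CDX indexes). The sketch below shows one way to decode them. EntrySketch and its functions are hypothetical and not part of this library. The byte range assumes the common practice of fetching a single WARC record with an HTTP Range request covering offset through offset + length - 1 of the file named in "filename".

```elixir
defmodule EntrySketch do
  # Hypothetical helpers, not part of CommonCrawl.IndexAPI.

  # Converts a 14-digit capture timestamp (YYYYMMDDhhmmss) into a NaiveDateTime.
  # Raises a MatchError if the integer does not have exactly 14 digits.
  def captured_at(timestamp) when is_integer(timestamp) do
    <<y::binary-size(4), mo::binary-size(2), d::binary-size(2),
      h::binary-size(2), mi::binary-size(2), s::binary-size(2)>> =
      Integer.to_string(timestamp)

    [year, month, day, hour, minute, second] =
      Enum.map([y, mo, d, h, mi, s], &String.to_integer/1)

    NaiveDateTime.new!(year, month, day, hour, minute, second)
  end

  # Builds an HTTP Range header value covering exactly one record.
  # "offset" and "length" arrive as strings, so they are parsed first.
  def byte_range(%{"offset" => offset, "length" => len}) do
    first = String.to_integer(offset)
    last = first + String.to_integer(len) - 1
    "bytes=#{first}-#{last}"
  end
end

# Usage with the values from the sample return value above:
{_key, timestamp, fields} =
  {"com,example)/", 20241214183829, %{"offset" => "36544657", "length" => "1223"}}

IO.inspect(EntrySketch.captured_at(timestamp))
IO.puts(EntrySketch.byte_range(fields))
```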