CommonCrawl.IndexAPI (CommonCrawl v0.3.1)
Functions for interacting with the Common Crawl index search API.
Further info: https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference
Functions
Searches the Common Crawl CDX API for entries matching the given query parameters.
The cdx_api_url can be found in CommonCrawl.collinfo(). The "url" parameter is required in the query map.
Parameters
cdx_api_url - The CDX API endpoint URL for a specific crawl
query - Map of query parameters ("url" is required)
opts - Request options passed to Req.get/2
Examples
# Search for a specific URL in a crawl
cdx_api_url = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"
{:ok, entries} = CommonCrawl.IndexAPI.get(cdx_api_url, %{"url" => "https://example.com"})
# Search with additional filters
{:ok, entries} = CommonCrawl.IndexAPI.get(
  cdx_api_url,
  %{
    "url" => "https://example.com",
    "filter" => "statuscode:200",
    "limit" => "1"
  }
)
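As a rough picture of what such a query becomes on the wire, the sketch below builds a CDX request URL from the endpoint and the query map. CdxQuerySketch and build_url/2 are hypothetical names and not part of CommonCrawl.IndexAPI. The added "output" => "json" parameter is an assumption based on the pywb CDX Server API; the library is not confirmed to send it.

```elixir
defmodule CdxQuerySketch do
  # Hypothetical helper, not part of CommonCrawl.IndexAPI.
  # The function head only matches when the required "url" key is present,
  # so a query map without it raises a FunctionClauseError.
  def build_url(cdx_api_url, %{"url" => _url} = query) do
    # Assumption: request JSON output unless the caller already chose a format.
    params = Map.put_new(query, "output", "json")

    # URI.encode_query/1 percent-encodes keys and values, so ":" and "/"
    # inside the target URL or the filter are escaped safely.
    cdx_api_url <> "?" <> URI.encode_query(params)
  end
end
```

Calling CdxQuerySketch.build_url(cdx_api_url, %{"url" => "https://example.com", "limit" => "1"}) yields the endpoint followed by a query string carrying the limit, output, and url parameters.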
@spec get_latest_for_url(String.t(), keyword()) :: {String.t(), non_neg_integer(), map()} | nil
Searches for the latest available version of a URL across recent crawls. Returns the most recent entry if found, nil otherwise.
Examples
# Search in default 4 most recent crawls
get_latest_for_url("https://example.com")
# Search in 6 most recent crawls
get_latest_for_url("https://example.com", crawls_to_check: 6)
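The search across recent crawls can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the library's implementation: LatestSketch, find_latest/4, and the injected lookup function are hypothetical, the crawl list is assumed to be ordered newest first, and each entry is assumed to be a {search_key, timestamp, metadata} tuple.

```elixir
defmodule LatestSketch do
  # Hypothetical helper, not part of the CommonCrawl library.
  #
  # crawls - list of CDX API endpoint URLs, assumed newest first
  # url    - the URL to look up
  # lookup - function (cdx_api_url, url) -> {:ok, entries} | {:error, reason}
  def find_latest(crawls, url, lookup, crawls_to_check \\ 4) do
    crawls
    |> Enum.take(crawls_to_check)
    # Enum.find_value/2 stops at the first crawl that yields an entry
    # and returns nil when no crawl does.
    |> Enum.find_value(fn cdx_api_url ->
      case lookup.(cdx_api_url, url) do
        {:ok, [_ | _] = entries} ->
          # Within one crawl, keep the entry with the highest timestamp.
          Enum.max_by(entries, fn {_key, timestamp, _meta} -> timestamp end)

        _empty_or_error ->
          nil
      end
    end)
  end
end
```

Passing the lookup as a function keeps the sketch independent of any HTTP client, so it can be exercised with canned data.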
Return Value
{"com,example)/", 20241214183829,
 %{
   "digest" => "JI6OR3QR4CI526JD6TMMNZNV4QPMPQCH",
   "encoding" => "UTF-8",
   "filename" => "crawl-data/CC-MAIN-2024-51/segments/1733066125982.36/warc/CC-MAIN-20241214181735-20241214211735-00433.warc.gz",
   "languages" => "eng",
   "length" => "1223",
   "mime" => "text/html",
   "mime-detected" => "text/html",
   "offset" => "36544657",
   "status" => "200",
   "url" => "http://www.example.com"
 }}
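The "filename", "offset", and "length" fields are what a client typically needs to fetch a single WARC record with an HTTP range request. The sketch below derives the download URL and Range header from an entry shaped like the one above. WarcRangeSketch and locate/1 are hypothetical names, and the https://data.commoncrawl.org/ base URL is an assumption about where the crawl data is served, not something stated in this module's documentation.

```elixir
defmodule WarcRangeSketch do
  # Hypothetical helper, not part of the CommonCrawl library.
  # Assumption: crawl data files are served under this base URL.
  @base_url "https://data.commoncrawl.org/"

  # Takes a {search_key, timestamp, metadata} entry and returns
  # {download_url, range_header} covering exactly one record.
  def locate({_search_key, _timestamp, %{"filename" => filename} = meta}) do
    # "offset" and "length" arrive as strings, so convert before arithmetic.
    first = String.to_integer(meta["offset"])
    # HTTP byte ranges are inclusive, hence the "- 1".
    last = first + String.to_integer(meta["length"]) - 1

    {@base_url <> filename, {"range", "bytes=#{first}-#{last}"}}
  end
end
```

For the entry shown above, the header value would be "bytes=36544657-36545879". The resulting pair could then be passed to an HTTP client of choice as the request URL and one request header.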