CommonCrawl (CommonCrawl v0.3.1)
The CommonCrawl library helps you interact with Common Crawl data.
Summary
Functions
collinfo
Returns the cached collinfo from disk.
get_collinfo
Fetches the current collinfo, which lists all available crawls. Cache the result and avoid repeated requests.
get_latest_for_url
Fetches the latest available crawl data for a given URL.
Functions
@spec collinfo() :: [map()]
Returns the cached collinfo from disk.
Examples
iex> CommonCrawl.collinfo()
[%{
"cdx-api" => "https://index.commoncrawl.org/CC-MAIN-2021-43-index",
"id" => "CC-MAIN-2021-43",
"name" => "October 2021 Index",
"timegate" => "https://index.commoncrawl.org/CC-MAIN-2021-43/"
}, ...]
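Since collinfo is a plain list of string-keyed maps, the standard `Enum` functions are enough to pick out one crawl. The following is a minimal sketch, not part of the library: `CollinfoLookup` and `cdx_api/2` are hypothetical names, and the sample list stands in for the value returned by `CommonCrawl.collinfo()` so the snippet runs without the dependency.

```elixir
defmodule CollinfoLookup do
  # Hypothetical helper, not part of the CommonCrawl library.
  # Returns the "cdx-api" URL of the crawl with the given id, or nil
  # when no entry in the list has that id.
  def cdx_api(collinfo, crawl_id) when is_list(collinfo) do
    case Enum.find(collinfo, fn entry -> entry["id"] == crawl_id end) do
      nil -> nil
      entry -> entry["cdx-api"]
    end
  end
end

# Sample data in the shape shown above. With the library installed,
# CommonCrawl.collinfo() would be passed instead.
collinfo = [
  %{
    "cdx-api" => "https://index.commoncrawl.org/CC-MAIN-2021-43-index",
    "id" => "CC-MAIN-2021-43",
    "name" => "October 2021 Index",
    "timegate" => "https://index.commoncrawl.org/CC-MAIN-2021-43/"
  }
]

cdx_url = CollinfoLookup.cdx_api(collinfo, "CC-MAIN-2021-43")
IO.puts(cdx_url)
```

Looking up by `"id"` rather than by list position avoids relying on the order of the entries, which this page does not specify.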
get_collinfo
Fetches the current collinfo, which lists all available crawls. Cache the result and avoid repeated requests.
Examples
iex> CommonCrawl.get_collinfo()
{:ok, [
%{
"cdx-api" => "https://index.commoncrawl.org/CC-MAIN-2024-01-index",
"id" => "CC-MAIN-2024-01",
"name" => "January 2024 Index",
"timegate" => "https://index.commoncrawl.org/CC-MAIN-2024-01/"
},
# ...more entries
]}
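One way to follow the caching advice is to keep the fetched list in memory for the lifetime of the VM. The sketch below is an illustration under stated assumptions, not the library's own mechanism: `CollinfoCache` is a hypothetical module, it stores the list with Erlang's `:persistent_term`, and it assumes that a failed fetch returns something other than `{:ok, list}` (for example `{:error, reason}`), which is passed through without being cached. A stub function replaces the network call so the snippet runs offline.

```elixir
defmodule CollinfoCache do
  # Hypothetical helper, not part of the CommonCrawl library.
  @key {__MODULE__, :collinfo}

  # Returns the cached list if one is stored. Otherwise calls fetch_fun
  # and stores the list only when the call returns {:ok, list}.
  # Any other return value is handed back unchanged and not cached.
  def fetch(fetch_fun) when is_function(fetch_fun, 0) do
    case :persistent_term.get(@key, :none) do
      :none ->
        with {:ok, collinfo} <- fetch_fun.() do
          :persistent_term.put(@key, collinfo)
          {:ok, collinfo}
        end

      collinfo ->
        {:ok, collinfo}
    end
  end
end

# Stub standing in for &CommonCrawl.get_collinfo/0 so the sketch runs offline.
stub = fn -> {:ok, [%{"id" => "CC-MAIN-2024-01"}]} end

{:ok, first} = CollinfoCache.fetch(stub)

# The second call never invokes its function because the value is cached.
{:ok, second} = CollinfoCache.fetch(fn -> raise "unexpected fetch" end)

IO.inspect(first == second)
```

With the library installed, the call would presumably be `CollinfoCache.fetch(&CommonCrawl.get_collinfo/0)`. This sketch does not guard against two processes fetching at the same time before the first result is stored.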
get_latest_for_url
Fetches the latest available crawl data for a given URL.
Examples
iex> CommonCrawl.get_latest_for_url("https://example.com")
{:ok,
%{
warc: "WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2024-01-14...",
headers: "HTTP/1.1 200 OK\r\nContent-Type: text/html...",
response: "<!doctype html>\n<html>\n<head>\n<title>Example Domain</title>..."
}}
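The result map keeps the WARC record headers, the HTTP headers, and the response body as separate strings, so callers can pattern match on the parts they need. As a minimal sketch, the hypothetical `CrawlResult.status/1` below (not part of the library) reads the HTTP status code from the first line of `headers`. The sample tuple only imitates the shape shown above and is not real crawl output; the shape of an error result is not documented here, so anything that is not `{:ok, %{headers: ...}}` is treated as unexpected.

```elixir
defmodule CrawlResult do
  # Hypothetical helper, not part of the CommonCrawl library.
  # Extracts the HTTP status code from a result shaped like the one above.
  def status({:ok, %{headers: headers}}) when is_binary(headers) do
    # The status line is everything before the first CRLF.
    [status_line | _] = String.split(headers, "\r\n", parts: 2)

    case String.split(status_line, " ", parts: 3) do
      [_version, code | _] ->
        case Integer.parse(code) do
          {status, ""} -> {:ok, status}
          _ -> {:error, :bad_status_line}
        end

      _ ->
        {:error, :bad_status_line}
    end
  end

  # Anything else (assumed to include error results) is passed back tagged.
  def status(other), do: {:error, {:unexpected_result, other}}
end

# Sample data imitating the documented shape, not real crawl output.
result =
  {:ok,
   %{
     warc: "WARC/1.0\r\nWARC-Type: response\r\n",
     headers: "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n",
     response: "<!doctype html>\n<html></html>"
   }}

IO.inspect(CrawlResult.status(result))
```

Checking the status before using `response` helps when the latest capture of a URL turns out to be a redirect or an error page rather than the expected document.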