CommonCrawl v0.3.1

The CommonCrawl library helps you interact with Common Crawl data.

Summary

Functions

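collinfo()

Returns the cached collinfo (the list of available crawls) from disk.

get_collinfo(opts \\ [])

Fetches the current collinfo, listing all available crawls. Cache the result and avoid repeated requests.

get_latest_for_url(url, opts \\ [])

Fetches the latest available crawl data for a given URL.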

Functions

collinfo()

@spec collinfo() :: [map()]

Returns the cached collinfo (the list of available crawls) from disk.

Examples

iex> CommonCrawl.collinfo()
[%{
  "cdx-api" => "https://index.commoncrawl.org/CC-MAIN-2021-43-index",
  "id" => "CC-MAIN-2021-43",
  "name" => "October 2021 Index",
  "timegate" => "https://index.commoncrawl.org/CC-MAIN-2021-43/"
}, ...]
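
Because the result is a plain list of maps, ordinary Enum functions are enough to pick out a crawl. The sketch below is illustrative only: CollinfoExample and newest/1 are hypothetical names, not part of CommonCrawl, and it assumes every "id" follows the "CC-MAIN-YYYY-WW" pattern shown above, so that string comparison orders the entries chronologically.

```elixir
defmodule CollinfoExample do
  # Hypothetical helper, not part of CommonCrawl: returns the entry with the
  # greatest "id". Assumes ids look like "CC-MAIN-YYYY-WW", so comparing the
  # strings also compares the dates.
  def newest([]), do: nil
  def newest(entries), do: Enum.max_by(entries, & &1["id"])
end

# Sample data shaped like the entries returned by CommonCrawl.collinfo/0.
entries = [
  %{"id" => "CC-MAIN-2021-43", "name" => "October 2021 Index"},
  %{"id" => "CC-MAIN-2024-01", "name" => "January 2024 Index"}
]

latest = CollinfoExample.newest(entries)
IO.puts(latest["id"])
```

In real use, the sample list would be replaced by the return value of CommonCrawl.collinfo().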

get_collinfo(opts \\ [])

Fetches the current collinfo, listing all available crawls. Cache the result and avoid repeated requests.

Examples

iex> CommonCrawl.get_collinfo()
{:ok, [
  %{
    "cdx-api" => "https://index.commoncrawl.org/CC-MAIN-2024-01-index",
    "id" => "CC-MAIN-2024-01",
    "name" => "January 2024 Index",
    "timegate" => "https://index.commoncrawl.org/CC-MAIN-2024-01/"
  },
  # ...more entries
]}
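
One possible way to follow the caching advice is to keep the first successful result in a process and serve later calls from memory. This is a minimal sketch, not something the library provides: CollinfoCache is a hypothetical module, and the fetch function is injected so the example runs without network access. It also assumes that any return value other than {:ok, list} signals a failure and should not be cached.

```elixir
defmodule CollinfoCache do
  # Hypothetical wrapper, not part of CommonCrawl. Stores the first
  # successful fetch in an Agent so later calls make no new request.
  use Agent

  def start_link(fetch_fun) when is_function(fetch_fun, 0) do
    Agent.start_link(fn -> %{fetch: fetch_fun, value: nil} end, name: __MODULE__)
  end

  def get do
    Agent.get_and_update(__MODULE__, fn
      # Nothing cached yet: call the fetch function once.
      %{value: nil, fetch: fetch} = state ->
        case fetch.() do
          {:ok, entries} -> {{:ok, entries}, %{state | value: entries}}
          # Assumed failure shape: pass it through and cache nothing.
          other -> {other, state}
        end

      # Already cached: answer from memory.
      %{value: entries} = state ->
        {{:ok, entries}, state}
    end)
  end
end

# A stub stands in for the real request and counts how often it is called.
{:ok, counter} = Agent.start_link(fn -> 0 end)

stub = fn ->
  Agent.update(counter, &(&1 + 1))
  {:ok, [%{"id" => "CC-MAIN-2024-01"}]}
end

{:ok, _pid} = CollinfoCache.start_link(stub)
{:ok, first} = CollinfoCache.get()
{:ok, second} = CollinfoCache.get()
```

In an application, the stub would be replaced with &CommonCrawl.get_collinfo/0. Note that the fetch runs inside the Agent, so a slow request could exceed the default 5 second call timeout of Agent.get_and_update/2.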

get_latest_for_url(url, opts \\ [])

@spec get_latest_for_url(
  String.t(),
  keyword()
) :: {:ok, map()} | {:error, any()}

Fetches the latest available crawl data for a given URL.

Examples

iex> CommonCrawl.get_latest_for_url("https://example.com")
{:ok,
 %{
   warc: "WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2024-01-14...",
   headers: "HTTP/1.1 200 OK\r\nContent-Type: text/html...",
   response: "<!doctype html>\n<html>\n<head>\n<title>Example Domain</title>..."
 }}
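
The spec allows both {:ok, map} and {:error, reason}, so callers will usually want to match on both. The sketch below shows one way to do that and to read the HTTP status code from the raw headers string. CrawlResult, status/1 and describe/1 are hypothetical names, not part of CommonCrawl, and the sample map only mimics the shape of the example above with shortened values.

```elixir
defmodule CrawlResult do
  # Hypothetical helper, not part of CommonCrawl: extracts the numeric status
  # code from the first line of the raw `headers` string.
  def status(%{headers: headers}) when is_binary(headers) do
    with [status_line | _] <- String.split(headers, "\r\n", parts: 2),
         [_version, code | _] <- String.split(status_line, " ", parts: 3),
         {int, ""} <- Integer.parse(code) do
      {:ok, int}
    else
      _ -> :error
    end
  end

  # Handles both shapes allowed by the get_latest_for_url/2 spec.
  def describe({:ok, record}), do: status(record)
  def describe({:error, reason}), do: {:error, reason}
end

# Sample record shaped like the example above, with shortened values.
sample = %{
  warc: "WARC/1.0\r\nWARC-Type: response",
  headers: "HTTP/1.1 200 OK\r\nContent-Type: text/html",
  response: "<!doctype html>\n<html></html>"
}

IO.inspect(CrawlResult.describe({:ok, sample}))
```

In real use, the tuple passed to describe/1 would be the return value of CommonCrawl.get_latest_for_url/2.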