CommonCrawl (CommonCrawl v0.3.4)

The CommonCrawl library helps you interact with Common Crawl data.

Summary

Functions

collinfo()

Returns the cached collinfo read from disk.

get_collinfo(opts \\ [])

Fetches the current collinfo, which lists all available crawls. Cache the result and avoid making repeated requests.

get_latest_for_url(url, opts \\ [])

Fetches the latest available crawl data for a given URL.

Functions

collinfo()

@spec collinfo() :: [map()]

Returns the cached collinfo read from disk.

Examples

iex> CommonCrawl.collinfo()
[%{
  "cdx-api" => "https://index.commoncrawl.org/CC-MAIN-2021-43-index",
  "id" => "CC-MAIN-2021-43",
  "name" => "October 2021 Index",
  "timegate" => "https://index.commoncrawl.org/CC-MAIN-2021-43/"
}, ...]
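
Because the cached collinfo is a plain list of maps, a crawl can be looked up by its "id" with ordinary Enum functions. The following is a minimal sketch; the CollinfoLookup module and its functions are hypothetical helpers, not part of the library, and the list is inlined here so the snippet runs on its own.

```elixir
defmodule CollinfoLookup do
  # Hypothetical helper, not part of the CommonCrawl library.

  # Returns the entry whose "id" matches, or nil when there is none.
  def find_crawl(collinfo, crawl_id) do
    Enum.find(collinfo, fn entry -> entry["id"] == crawl_id end)
  end

  # Returns {:ok, url} with the CDX API endpoint for a crawl id, or :error.
  def cdx_api(collinfo, crawl_id) do
    case find_crawl(collinfo, crawl_id) do
      %{"cdx-api" => url} -> {:ok, url}
      _ -> :error
    end
  end
end

# In a project that depends on the library, this list would come from
# CommonCrawl.collinfo() instead of being written out by hand.
collinfo = [
  %{
    "cdx-api" => "https://index.commoncrawl.org/CC-MAIN-2021-43-index",
    "id" => "CC-MAIN-2021-43",
    "name" => "October 2021 Index",
    "timegate" => "https://index.commoncrawl.org/CC-MAIN-2021-43/"
  }
]

CollinfoLookup.cdx_api(collinfo, "CC-MAIN-2021-43")
```

Returning :error for an unknown id, rather than raising, lets callers decide how to handle a crawl that is missing from the cached copy.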

get_collinfo(opts \\ [])

Fetches current collinfo with all available crawls. Make sure to cache it and to not make repeated requests.

Examples

iex> CommonCrawl.get_collinfo()
{:ok, [
  %{
    "cdx-api" => "https://index.commoncrawl.org/CC-MAIN-2024-01-index",
    "id" => "CC-MAIN-2024-01",
    "name" => "January 2024 Index",
    "timegate" => "https://index.commoncrawl.org/CC-MAIN-2024-01/"
  },
  # ...more entries
]}
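
One way to follow the caching advice is to keep the first successful result in memory and reuse it afterwards. The sketch below assumes an in-memory cache built on Erlang's :persistent_term; the CollinfoCache module is hypothetical and not part of the library. The fetch function is passed in as an argument, so the example runs with a stub in place of the network call.

```elixir
defmodule CollinfoCache do
  # Hypothetical wrapper, not part of the CommonCrawl library.
  @key {__MODULE__, :collinfo}

  # `fetch_fun` is any zero-arity function returning {:ok, list} or
  # {:error, reason}. In a project that depends on the library, this could
  # be &CommonCrawl.get_collinfo/0.
  def fetch(fetch_fun) when is_function(fetch_fun, 0) do
    case :persistent_term.get(@key, :missing) do
      :missing ->
        # Only successful results are stored; errors pass through unchanged.
        with {:ok, collinfo} <- fetch_fun.() do
          :persistent_term.put(@key, collinfo)
          {:ok, collinfo}
        end

      collinfo ->
        {:ok, collinfo}
    end
  end
end

# Stub standing in for the network call; it counts how often it is invoked.
{:ok, counter} = Agent.start_link(fn -> 0 end)

stub = fn ->
  Agent.update(counter, &(&1 + 1))
  {:ok, [%{"id" => "CC-MAIN-2024-01"}]}
end

{:ok, first} = CollinfoCache.fetch(stub)
{:ok, second} = CollinfoCache.fetch(stub)
calls = Agent.get(counter, & &1)
```

After both calls, the stub has been invoked once and the second call is served from the cache. This sketch never expires the cached value, so a long-running application would need its own refresh policy.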

get_latest_for_url(url, opts \\ [])

@spec get_latest_for_url(
  String.t(),
  keyword()
) :: {:ok, map()} | {:error, any()}

Fetches the latest available crawl data for a given URL.

Examples

iex> CommonCrawl.get_latest_for_url("https://example.com")
{:ok,
 %{
   warc: "WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2024-01-14...",
   headers: "HTTP/1.1 200 OK\r\nContent-Type: text/html...",
   response: "<!doctype html>\n<html>\n<head>\n<title>Example Domain</title>..."
 }}
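
Callers need to handle both the {:ok, map} and {:error, reason} results from the spec. The sketch below shows one possible way to do that and to read the HTTP status code out of the raw headers string. The CrawlResult module is a hypothetical helper, not part of the library, and the sample map is illustrative data shaped like the example above rather than real crawl output.

```elixir
defmodule CrawlResult do
  # Hypothetical helper, not part of the CommonCrawl library.

  # Extracts the HTTP status code from the first line of the raw headers.
  def status(%{headers: headers}) do
    with [status_line | _] <- String.split(headers, "\r\n"),
         [_version, code | _] <- String.split(status_line, " "),
         {status, ""} <- Integer.parse(code) do
      {:ok, status}
    else
      _ -> :error
    end
  end

  # Turns either result tuple into a short human-readable description.
  def describe({:ok, result}) do
    case status(result) do
      {:ok, code} -> "status #{code}, #{byte_size(result.response)} bytes"
      :error -> "unparseable headers"
    end
  end

  def describe({:error, reason}), do: "failed: #{inspect(reason)}"
end

# Illustrative sample with the same keys as the documented result.
sample =
  {:ok,
   %{
     warc: "WARC/1.0\r\nWARC-Type: response",
     headers: "HTTP/1.1 200 OK\r\nContent-Type: text/html",
     response: "<!doctype html>"
   }}

CrawlResult.describe(sample)
```

In a project that depends on the library, the tuple returned by CommonCrawl.get_latest_for_url/2 could be passed to CrawlResult.describe/1 in place of the sample.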