CommonCrawl.Index (CommonCrawl v0.3.1)

Interacting with index files of Common Crawl.

Summary

Functions

all_paths_url(crawl_id)

Returns the URL of the file containing the index paths for a given crawl ID.

cluster_idx_url(crawl_id)

Returns the URL of the cluster.idx file.

get(crawl_id, filename, opts \\ [])

Fetches a gzipped index file.

get_all_paths(crawl_id, opts \\ [])

Fetches the paths of all available index files for a given crawl. The list ends with the "metadata.yaml" and "cluster.idx" entries.

get_cluster_idx(crawl_id, opts \\ [])

Fetches the cluster.idx file.

parser(line)

Parses a line of an index file into a tuple containing the search key, timestamp, and metadata map.

stream(crawl_id, opts \\ [])

Creates a stream of parsed index entries from index files.

url(crawl_id, filename)

Returns the URL of the index file.

Functions

all_paths_url(crawl_id)

@spec all_paths_url(String.t()) :: String.t()

Returns the URL of the file containing the index paths for a given crawl ID.

Examples

iex> CommonCrawl.Index.all_paths_url("CC-MAIN-2017-34")
"https://data.commoncrawl.org/crawl-data/CC-MAIN-2017-34/cc-index.paths.gz"

cluster_idx_url(crawl_id)

@spec cluster_idx_url(String.t()) :: String.t()

Returns the URL of the cluster.idx file.

Examples

iex> CommonCrawl.Index.cluster_idx_url("CC-MAIN-2017-34")
"https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2017-34/indexes/cluster.idx"

get(crawl_id, filename, opts \\ [])

Fetches a gzipped index file.

Examples

iex> CommonCrawl.Index.get("CC-MAIN-2024-51", "cdx-00000.gz")
{:ok, <<31, 139, 8, 0, 0, 0, 0, 0, 0, 3, ...>>}
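The returned binary starts with the gzip magic bytes (31, 139), so it has to be decompressed before its lines can be read. A minimal sketch of that step, assuming the payload decompresses in a single `:zlib.gunzip/1` call; `GunzipSketch` is a hypothetical helper, not part of CommonCrawl.Index, and the sample data is compressed locally so no network access is needed:

```elixir
defmodule GunzipSketch do
  # Hypothetical helper, not part of CommonCrawl.Index.
  # Decompresses a gzipped binary and splits it into non-empty lines.
  def lines(gzipped) when is_binary(gzipped) do
    gzipped
    |> :zlib.gunzip()
    |> String.split("\n", trim: true)
  end
end

# Stand-in for the binary returned inside {:ok, binary} by get/3.
sample = :zlib.gzip("first line\nsecond line\n")

sample
|> GunzipSketch.lines()
|> Enum.each(&IO.puts/1)
```

Each resulting line could then be handed to `parser/1`.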

get_all_paths(crawl_id, opts \\ [])

Fetches the paths of all available index files for a given crawl. The list ends with the "metadata.yaml" and "cluster.idx" entries.

Examples

iex> CommonCrawl.Index.get_all_paths("CC-MAIN-2024-51")
{:ok, [
  "cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00000.gz",
  "cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00001.gz",
  # ... more index files
  "cc-index/collections/CC-MAIN-2024-51/indexes/metadata.yaml",
  "cc-index/collections/CC-MAIN-2024-51/indexes/cluster.idx"
]}
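Because the list mixes index shards with the two trailing non-shard entries, callers will often want only the `cdx-*.gz` filenames. A small sketch of that filtering, using the paths shown above as sample input; `PathsSketch` and `cdx_filenames/1` are hypothetical names, not part of the library:

```elixir
defmodule PathsSketch do
  # Hypothetical helper, not part of CommonCrawl.Index.
  # Keeps only the cdx-*.gz shards and returns their bare filenames.
  def cdx_filenames(paths) do
    paths
    |> Enum.map(&Path.basename/1)
    |> Enum.filter(&(String.starts_with?(&1, "cdx-") and String.ends_with?(&1, ".gz")))
  end
end

paths = [
  "cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00000.gz",
  "cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00001.gz",
  "cc-index/collections/CC-MAIN-2024-51/indexes/metadata.yaml",
  "cc-index/collections/CC-MAIN-2024-51/indexes/cluster.idx"
]

IO.inspect(PathsSketch.cdx_filenames(paths))
```

Each filename has the shape that `get/3` and `url/2` take as their second argument.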

get_cluster_idx(crawl_id, opts \\ [])

Fetches the cluster.idx file.

Examples

iex> CommonCrawl.Index.get_cluster_idx("CC-MAIN-2024-51")
{:ok, "0,100,22,165)/ 20241209080420..."}
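The returned string is line-oriented. The sketch below splits one such line into fields. The field layout used here (search key and timestamp separated by a space, then tab-separated shard filename, byte offset, byte length, and a sequence number) is an assumption that is not stated in this documentation, and the sample line is invented for illustration; `ClusterIdxSketch` is a hypothetical helper.

```elixir
defmodule ClusterIdxSketch do
  # Hypothetical helper, not part of CommonCrawl.Index.
  # ASSUMED layout: "<key> <timestamp>\t<filename>\t<offset>\t<length>\t<seq>"
  def parse_line(line) do
    with [key_and_ts, filename, offset, size, _seq] <- String.split(line, "\t"),
         [key, timestamp] <- String.split(key_and_ts, " ", parts: 2),
         {offset, ""} <- Integer.parse(offset),
         {size, ""} <- Integer.parse(size) do
      {:ok, %{key: key, timestamp: timestamp, filename: filename, offset: offset, length: size}}
    else
      # Any line that does not fit the assumed layout ends up here.
      _ -> {:error, :unexpected_format}
    end
  end
end

# Invented sample line; the values are illustrative only.
line = "com,example)/ 20240108123456\tcdx-00000.gz\t0\t1024\t1"
IO.inspect(ClusterIdxSketch.parse_line(line))
```

Lines that do not match the assumed layout return `{:error, :unexpected_format}` rather than raising, which makes it easy to check the assumption against real data.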

parser(line)

@spec parser(Enum.t()) :: {:ok, {String.t(), integer(), map()}} | {:error, any()}

Parses a line of an index file into a tuple containing the search key, timestamp, and metadata map.

Examples

iex> line = "com,example)/ 20240108123456 {\"url\": \"http://www.example.com\"}"
iex> CommonCrawl.Index.parser(line)
{:ok, {"com,example)/", 20240108123456, %{"url" => "http://www.example.com"}}}
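The timestamp comes back as a plain integer. Reading it as 14 digits in `YYYYMMDDhhmmss` order (an assumption based on the example value above), it can be turned into a `NaiveDateTime`. `TimestampSketch` is a hypothetical helper, not part of the library:

```elixir
defmodule TimestampSketch do
  # Hypothetical helper, not part of CommonCrawl.Index.
  # ASSUMES the integer has 14 digits laid out as YYYYMMDDhhmmss.
  def to_naive_datetime(timestamp) when is_integer(timestamp) do
    case Integer.to_string(timestamp) do
      <<y::binary-size(4), mo::binary-size(2), d::binary-size(2),
        h::binary-size(2), mi::binary-size(2), s::binary-size(2)>> ->
        [y, mo, d, h, mi, s] = Enum.map([y, mo, d, h, mi, s], &String.to_integer/1)
        # Returns {:ok, naive_datetime} or {:error, reason} for impossible dates.
        NaiveDateTime.new(y, mo, d, h, mi, s)

      _other ->
        {:error, :invalid_timestamp}
    end
  end
end

IO.inspect(TimestampSketch.to_naive_datetime(20240108123456))
```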

stream(crawl_id, opts \\ [])

@spec stream(
  String.t(),
  keyword()
) :: Enumerable.t()

Creates a stream of parsed index entries from index files.

Options

  • :preprocess_fun - function applied to the stream before its entries are processed (default: & &1, which leaves the stream unchanged)
  • :dir - temporary directory for storing downloaded files (default: System.tmp_dir!())
  • :max_attempts - maximum number of retry attempts for fetching cluster.idx (default: 3)
  • :backoff - milliseconds to wait between retry attempts (default: 500)
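To illustrate how :max_attempts and :backoff could interact, here is a minimal fixed-delay retry sketch. It is one plausible reading of the two options, not the library's actual implementation; `RetrySketch` and `with_retry/3` are hypothetical names.

```elixir
defmodule RetrySketch do
  # Hypothetical helper, not part of CommonCrawl.Index.
  # Calls `fun` until it returns {:ok, _} or `max_attempts` calls have been made,
  # sleeping `backoff` milliseconds between attempts.
  def with_retry(fun, max_attempts \\ 3, backoff \\ 500)
      when is_function(fun, 0) and max_attempts >= 1 do
    do_retry(fun, 1, max_attempts, backoff)
  end

  defp do_retry(fun, attempt, max_attempts, backoff) do
    case fun.() do
      {:ok, _} = ok ->
        ok

      {:error, _} = error when attempt >= max_attempts ->
        # Out of attempts: surface the last error.
        error

      {:error, _} ->
        Process.sleep(backoff)
        do_retry(fun, attempt + 1, max_attempts, backoff)
    end
  end
end

# Example: a function that always fails is called exactly 3 times.
result = RetrySketch.with_retry(fn -> {:error, :timeout} end, 3, 10)
IO.inspect(result)
```

With the defaults above, a fetch that keeps failing would be attempted three times with a 500 ms pause between attempts before the error is returned.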

Examples

# Stream all index entries
CommonCrawl.Index.stream("CC-MAIN-2024-51")

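# Stream only entries under the .de TLD and shuffle them before processing.
# The trailing comma in "de," avoids matching other TLDs such as "dev".
# Note that Enum.shuffle/1 is eager and loads the filtered entries into memory.
CommonCrawl.Index.stream("CC-MAIN-2024-51", preprocess_fun: fn stream ->
  stream
  |> Stream.filter(&String.starts_with?(&1, "de,"))
  |> Enum.shuffle()
end)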
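A preprocessing function can also be defined once and reused. The sketch below assumes, as the filter in the example above suggests, that the function receives strings beginning with the search key; `PreprocessSketch` and `tld_sample/3` are hypothetical names, and the sample lines are invented.

```elixir
defmodule PreprocessSketch do
  # Hypothetical helper, not part of CommonCrawl.Index.
  # Keeps entries whose key starts with "<tld>," and returns at most `limit` of them.
  def tld_sample(lines, tld, limit) do
    prefix = tld <> ","

    lines
    |> Stream.filter(&String.starts_with?(&1, prefix))
    |> Enum.take(limit)
  end
end

# Invented sample lines standing in for what the preprocessing function receives.
lines = [
  "com,example)/ 20240108123456",
  "de,example)/ 20240108123456",
  "dev,example)/ 20240108123456",
  "de,example2)/ 20240108123456"
]

IO.inspect(PreprocessSketch.tld_sample(lines, "de", 10))

# Possible use with stream/2 (not executed here, requires network access):
# CommonCrawl.Index.stream("CC-MAIN-2024-51",
#   preprocess_fun: &PreprocessSketch.tld_sample(&1, "de", 100))
```

Limiting with `Enum.take/2` keeps memory use bounded, at the cost of always selecting the first matches rather than a random sample.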

url(crawl_id, filename)

@spec url(String.t(), String.t()) :: String.t()

Returns the URL of the index file.

Examples

iex> CommonCrawl.Index.url("CC-MAIN-2017-34", "cdx-00203.gz")
"https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2017-34/indexes/cdx-00203.gz"