CommonCrawl.Index (CommonCrawl v0.3.1)
Interacting with index files of Common Crawl.
Functions
Returns the URL of the file containing the index paths for a given crawl ID.
Examples
iex> CommonCrawl.Index.all_paths_url("CC-MAIN-2017-34")
"https://data.commoncrawl.org/crawl-data/CC-MAIN-2017-34/cc-index.paths.gz"
Returns the URL of the cluster.idx file.
Examples
iex> CommonCrawl.Index.cluster_idx_url("CC-MAIN-2017-34")
"https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2017-34/indexes/cluster.idx"
Fetches a gzipped index file.
Examples
iex> CommonCrawl.Index.get("CC-MAIN-2024-51", "cdx-00000.gz")
{:ok, <<31, 139, 8, 0, 0, 0, 0, 0, 0, 3, ...>>}
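The returned binary begins with the gzip magic bytes (31, 139), so it still has to be decompressed before its lines can be read. A minimal sketch of doing that in memory with Erlang's `:zlib.gunzip/1`; the `GzipLines` module is hypothetical and not part of CommonCrawl, and the sample payload stands in for a real response body.

```elixir
defmodule GzipLines do
  # Hypothetical helper, not part of the CommonCrawl library.
  # Decompresses a gzipped binary in memory and returns its non-empty lines.
  @spec lines(binary()) :: [String.t()]
  def lines(gzipped) when is_binary(gzipped) do
    gzipped
    |> :zlib.gunzip()
    |> String.split("\n", trim: true)
  end
end

# Stand-in for the body of a fetched index file: a small gzipped payload.
sample = :zlib.gzip("com,example)/ 20240108123456 {}\ncom,example)/about 20240108123500 {}\n")

GzipLines.lines(sample)
# => ["com,example)/ 20240108123456 {}", "com,example)/about 20240108123500 {}"]
```

Real index files can be large, so decompressing a whole file in memory may not be appropriate; this only illustrates the data shape.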
Fetches the paths of all available index files for a given crawl. The list ends with the "metadata.yaml" and "cluster.idx" paths.
Examples
iex> CommonCrawl.Index.get_all_paths("CC-MAIN-2024-51")
{:ok, [
"cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00000.gz",
"cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00001.gz",
# ... more index files
"cc-index/collections/CC-MAIN-2024-51/indexes/metadata.yaml",
"cc-index/collections/CC-MAIN-2024-51/indexes/cluster.idx"
]}
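Because the list mixes cdx shards with the two trailing files, callers will often want only the shard file names, or a full URL for a given path. The sketch below shows one way to do both. The `IndexPaths` module and its function names are hypothetical, and the base URL is an assumption taken from the URL examples elsewhere on this page.

```elixir
defmodule IndexPaths do
  # Hypothetical helper, not part of the CommonCrawl library.
  # ASSUMPTION: paths are relative to this host, as the URL examples suggest.
  @base_url "https://data.commoncrawl.org"

  # Keeps only the gzipped cdx shards and returns their file names.
  @spec cdx_filenames([String.t()]) :: [String.t()]
  def cdx_filenames(paths) do
    paths
    |> Enum.map(&Path.basename/1)
    |> Enum.filter(&String.match?(&1, ~r/^cdx-\d+\.gz$/))
  end

  # Turns a relative path from the list into a full URL.
  @spec to_url(String.t()) :: String.t()
  def to_url(path), do: @base_url <> "/" <> path
end

paths = [
  "cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00000.gz",
  "cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00001.gz",
  "cc-index/collections/CC-MAIN-2024-51/indexes/metadata.yaml",
  "cc-index/collections/CC-MAIN-2024-51/indexes/cluster.idx"
]

IndexPaths.cdx_filenames(paths)
# => ["cdx-00000.gz", "cdx-00001.gz"]

IndexPaths.to_url(hd(paths))
# => "https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00000.gz"
```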
Fetches the cluster.idx file.
Examples
iex> CommonCrawl.Index.get_cluster_idx("CC-MAIN-2024-51")
{:ok, "0,100,22,165)/ 20241209080420..."}
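The returned string is the raw file content, so each line still needs to be split into fields. The sketch below assumes the tab-separated layout commonly described for Common Crawl's cluster index (search key and timestamp, then cdx file name, offset, length, and a sequence number). That layout is an assumption, not something this page documents, and the `ClusterIdxLine` module and the sample line are illustrative only.

```elixir
defmodule ClusterIdxLine do
  # Hypothetical helper, not part of the CommonCrawl library.
  # ASSUMPTION: a line looks like
  #   "<search key> <timestamp>\t<cdx file>\t<offset>\t<length>\t<sequence>"
  @spec parse(String.t()) :: {:ok, map()} | {:error, :unexpected_format}
  def parse(line) do
    with [key_and_ts, file, offset, len, seq] <- String.split(String.trim(line), "\t"),
         [key, ts] <- String.split(key_and_ts, " ", parts: 2) do
      # String.to_integer/1 raises if a numeric field is malformed.
      {:ok,
       %{
         key: key,
         timestamp: String.to_integer(ts),
         file: file,
         offset: String.to_integer(offset),
         length: String.to_integer(len),
         sequence: String.to_integer(seq)
       }}
    else
      _ -> {:error, :unexpected_format}
    end
  end
end

# Made-up sample line; the numbers are illustrative, not real index data.
line = "0,100,22,165)/ 20241209080420\tcdx-00000.gz\t0\t188224\t1"
ClusterIdxLine.parse(line)
```

If the actual file uses a different field order, only the pattern in the `with` clause would need to change.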
Parses a line of an index file into a tuple containing the search key, timestamp, and metadata map.
Examples
iex> line = "com,example)/ 20240108123456 {\"url\": \"http://www.example.com\"}"
iex> CommonCrawl.Index.parser(line)
{:ok, {"com,example)/", 20240108123456, %{"url" => "http://www.example.com"}}}
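To make the line layout concrete, here is a minimal sketch of the splitting step only: search key, integer timestamp, and the remaining JSON text. It is not the library's parser. `LineSplit` is a hypothetical module, and JSON decoding is deliberately left out so the example needs no dependencies.

```elixir
defmodule LineSplit do
  # Hypothetical helper, not the library's parser.
  # Splits a line into search key, integer timestamp, and the raw JSON text.
  @spec split(String.t()) ::
          {:ok, {String.t(), integer(), String.t()}} | {:error, :invalid_line}
  def split(line) do
    # parts: 3 keeps spaces inside the JSON object intact.
    with [key, ts, json] <- String.split(line, " ", parts: 3),
         {timestamp, ""} <- Integer.parse(ts) do
      {:ok, {key, timestamp, json}}
    else
      _ -> {:error, :invalid_line}
    end
  end
end

line = ~s|com,example)/ 20240108123456 {"url": "http://www.example.com"}|
LineSplit.split(line)
# => {:ok, {"com,example)/", 20240108123456, "{\"url\": \"http://www.example.com\"}"}}
```

Limiting the split to three parts matters because the JSON portion itself contains spaces.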
@spec stream(String.t(), keyword()) :: Enumerable.t()
Creates a stream of parsed index entries from index files.
Options
- :preprocess_fun - function to preprocess the stream before processing (default: & &1)
- :dir - temporary directory for storing downloaded files (default: System.tmp_dir!())
- :max_attempts - maximum number of retry attempts for fetching cluster.idx (default: 3)
- :backoff - milliseconds to wait between retry attempts (default: 500)
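The retry options can be pictured with a small helper that repeats a call until it succeeds or the attempts run out, sleeping a fixed time in between. This is only a sketch of the general pattern; how the library retries internally is not documented here. `RetrySketch` and the flaky stand-in function are hypothetical.

```elixir
defmodule RetrySketch do
  # Hypothetical helper, not part of the CommonCrawl library.
  # Calls `fun` until it returns {:ok, value} or attempts are exhausted,
  # sleeping `backoff` milliseconds between attempts.
  def with_retry(fun, max_attempts \\ 3, backoff \\ 500)
      when is_function(fun, 0) and max_attempts >= 1 do
    case fun.() do
      {:ok, _value} = ok ->
        ok

      error when max_attempts == 1 ->
        # Last attempt failed: return the error unchanged.
        error

      _error ->
        Process.sleep(backoff)
        with_retry(fun, max_attempts - 1, backoff)
    end
  end
end

# Stand-in for a flaky fetch: fails twice, then succeeds.
counter = :counters.new(1, [])

flaky = fn ->
  :counters.add(counter, 1, 1)

  if :counters.get(counter, 1) < 3 do
    {:error, :timeout}
  else
    {:ok, "cluster.idx contents"}
  end
end

RetrySketch.with_retry(flaky, 3, 10)
# => {:ok, "cluster.idx contents"}
```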
Examples
# Stream all index entries
CommonCrawl.Index.stream("CC-MAIN-2024-51")
# Stream only German domains and shuffle them before processing
# ("de," matches the .de TLD only; a bare "de" would also match keys such as "dev,")
CommonCrawl.Index.stream("CC-MAIN-2024-51", preprocess_fun: fn stream ->
  stream
  |> Stream.filter(&String.starts_with?(&1, "de,"))
  |> Enum.shuffle()
end)
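Since the result is an ordinary Enumerable, it composes with `Stream` and `Enum` functions. The sketch below filters entries by a metadata field and extracts URLs. It assumes entries are `{key, timestamp, metadata}` tuples like the ones the parser example shows inside `{:ok, ...}`, which this page does not confirm. `EntryFilter` is a hypothetical module, and the `"status"` field and sample entries are illustrative.

```elixir
defmodule EntryFilter do
  # Hypothetical helper, not part of the CommonCrawl library.
  # ASSUMPTION: entries are {key, timestamp, metadata} tuples.
  # Lazily selects entries whose metadata has the given status and maps them to URLs.
  def urls_with_status(entries, status) do
    entries
    |> Stream.filter(fn {_key, _ts, meta} -> meta["status"] == status end)
    |> Stream.map(fn {_key, _ts, meta} -> meta["url"] end)
  end
end

# Illustrative entries standing in for the real stream.
entries = [
  {"com,example)/", 20240108123456,
   %{"url" => "http://www.example.com", "status" => "200"}},
  {"com,example)/old", 20240108123500,
   %{"url" => "http://www.example.com/old", "status" => "404"}}
]

entries
|> EntryFilter.urls_with_status("200")
|> Enum.take(10)
# => ["http://www.example.com"]
```

With a real stream in place of the list, `Enum.take/2` should pull only as many entries as it needs, which keeps exploratory queries cheap.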
Returns the URL of the index file.
Examples
iex> CommonCrawl.Index.url("CC-MAIN-2017-34", "cdx-00203.gz")
"https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2017-34/indexes/cdx-00203.gz"