CommonCrawl.WARC (CommonCrawl v0.3.1)

Common Crawl .warc file download and parsing

Functions

get_segment(filename, offset, length, opts \\ [])

@spec get_segment(String.t(), integer(), integer(), keyword()) ::
  {:ok, %{warc: String.t(), headers: String.t(), response: String.t()}}
  | {:error, any()}

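Fetches a segment of the WARC file at the path filename: length bytes, starting at byte offset. On success, returns {:ok, map}, where the map holds the WARC record headers (warc), the HTTP response headers (headers), and the response body (response). On failure, returns {:error, reason}.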
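This page does not say how the segment is retrieved. A common way to read one record from a Common Crawl archive is an HTTP Range request, so the sketch below shows how offset and length could translate into a Range header under that assumption. RangeSketch and range_header/2 are hypothetical names for illustration and are not part of CommonCrawl.WARC.

```elixir
defmodule RangeSketch do
  # Hypothetical helper, not part of CommonCrawl.WARC.
  # HTTP byte ranges are inclusive on both ends, so a segment of
  # `length` bytes starting at `offset` ends at `offset + length - 1`.
  @spec range_header(non_neg_integer(), pos_integer()) :: {String.t(), String.t()}
  def range_header(offset, length)
      when is_integer(offset) and is_integer(length) and offset >= 0 and length > 0 do
    {"range", "bytes=#{offset}-#{offset + length - 1}"}
  end
end

# Using the offset and length from the example on this page:
RangeSketch.range_header(36_544_657, 1223)
# => {"range", "bytes=36544657-36545879"}
```

The guard rejects negative offsets and zero-length segments up front, since neither can be expressed as a valid byte range.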

Examples

iex> CommonCrawl.WARC.get_segment(
...>   "crawl-data/CC-MAIN-2024-51/segments/1733066125982.36/warc/CC-MAIN-20241214181735-20241214211735-00433.warc.gz",
...>   36544657,
...>   1223
...> )

Return Value

{:ok,
 %{
   warc: "WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2024...",
   headers: "HTTP/1.1 200 OK\r\nContent-Type: text/html...",
   response: "<!doctype html>\n<html>\n<head>\n<title>Example..."
 }}
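The headers field comes back as a raw string, so callers will usually want to split it before use. The following is a minimal sketch of one way to do that, matching on the two return shapes given in the spec. HeaderSketch, parse/1, and status/1 are hypothetical helpers, not part of the library, and the sample string is made up to mirror the shape shown above.

```elixir
defmodule HeaderSketch do
  # Hypothetical helper, not part of CommonCrawl.WARC.

  # Splits a CRLF-delimited HTTP header block into the status line and
  # a map of lowercased header names to trimmed values. Lines without
  # a colon are skipped. Raises a MatchError on an empty string.
  @spec parse(String.t()) :: {String.t(), %{String.t() => String.t()}}
  def parse(raw) do
    [status_line | lines] = String.split(raw, "\r\n", trim: true)

    headers =
      for line <- lines,
          [name, value] <- [String.split(line, ":", parts: 2)],
          into: %{} do
        {String.downcase(String.trim(name)), String.trim(value)}
      end

    {status_line, headers}
  end

  # Accepts either return shape documented for get_segment/4 and
  # extracts the HTTP status line on success.
  @spec status({:ok, map()} | {:error, any()}) :: {:ok, String.t()} | {:error, any()}
  def status({:ok, %{headers: raw}}) do
    {status_line, _headers} = parse(raw)
    {:ok, status_line}
  end

  def status({:error, _reason} = error), do: error
end

# Made-up sample in the same shape as the headers field above.
raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: 42"

{status_line, headers} = HeaderSketch.parse(raw)
# status_line => "HTTP/1.1 200 OK"
# headers     => %{"content-length" => "42", "content-type" => "text/html"}
```

In real use, the result of CommonCrawl.WARC.get_segment/4 could be passed straight to HeaderSketch.status/1. Header names are lowercased because HTTP header names are case-insensitive. If a header repeats, this sketch keeps only the last value.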