TesseractJs.Models (tesseract_js v0.1.0)

Model registry — single source of truth for the languages and core WASM files that tesseract_js knows about. Both CDN mode and local mode are driven from this module.

The curated list below covers ~20 common languages with checksums and approximate sizes for the :standard tessdata tier. Any language code (e.g. "hin", "chi_tra+chi_sim") that is not in the curated list still works at runtime — it just falls through to the URL template without checksum verification.

Use the helpers to resolve URLs and paths:

TesseractJs.Models.cdn_url("eng")
TesseractJs.Models.cdn_url("jpn_vert", :best)
TesseractJs.Models.local_path("eng")
TesseractJs.Models.filename("eng")

The tessdata version pinned by this package is 4.0.0 (the tessdata bundle format used by tesseract.js 5.x). Bump it in lockstep with the JS core release.

Summary

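Functions

cdn_url(lang, tier \\ :standard)

Builds the jsDelivr URL for a language's traineddata file.

core_cdn_url(variant \\ :simd_lstm)

jsDelivr URL for the tesseract.js-core WASM bundle.

core_filename(variant \\ :simd_lstm)

Filename for a WASM core variant.

core_local_path(variant \\ :simd_lstm, base \\ "/assets/vendor/tesseract")

Local path for the WASM core.

core_version()

Returns the tesseract.js-core version this package is pinned to.

filename(lang)

Filename for a language's traineddata file.

get(lang)

Returns the registry entry for a language, or nil if not curated.

list()

Returns the curated registry as a map of lang => %{name, size_mb, sha256}.

local_path(lang, base \\ "/assets/vendor/tesseract")

Local filesystem path (relative to a Phoenix app's priv/static/) where the Mix download task will write a language file.

split_langs(lang)

Splits a +-combined lang string into individual lang codes.

tessdata_version()

Returns the tessdata version (4.0.0) this package is pinned to.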

Functions

cdn_url(lang, tier \\ :standard)

@spec cdn_url(String.t(), atom()) :: String.t()

Builds the jsDelivr URL for a language's traineddata file.

iex> TesseractJs.Models.cdn_url("eng")
"https://cdn.jsdelivr.net/npm/@tesseract.js-data/eng@1.0.0/4.0.0/eng.traineddata.gz"

For a +-combined string such as "chi_tra+chi_sim", only the URL for the first language is returned; the consumer is expected to download each language separately. (In local mode, all combined languages must be present in the same directory.)
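
Since each language in a combined string has to be fetched on its own, a caller may want one URL per language. The following is a minimal sketch of that idea, not the package's implementation: `CombinedLangSketch` is a hypothetical module, and the URL shape is an assumption inferred from the single "eng" example above.

```elixir
defmodule CombinedLangSketch do
  # Hypothetical helper, not part of tesseract_js.
  # Both values below are assumptions read off the "eng" example URL.
  @data_pkg_version "1.0.0"
  @tessdata_version "4.0.0"

  # Turns "chi_tra+chi_sim" into one {lang, url} pair per language.
  def urls(combined) when is_binary(combined) do
    combined
    |> String.split("+", trim: true)
    |> Enum.map(fn lang -> {lang, url(lang)} end)
  end

  # Builds the assumed jsDelivr URL for a single language code.
  defp url(lang) do
    "https://cdn.jsdelivr.net/npm/@tesseract.js-data/#{lang}@#{@data_pkg_version}/" <>
      "#{@tessdata_version}/#{lang}.traineddata.gz"
  end
end

CombinedLangSketch.urls("chi_tra+chi_sim")
|> Enum.each(fn {lang, url} -> IO.puts("#{lang} -> #{url}") end)
```

In a real application the per-language URL would more likely come from calling `cdn_url/2` on each code returned by `split_langs/1` than from a hand-written template.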

Tiers

  • :standard — full LSTM+legacy combined model, ~11 MB/lang gzipped.
  • :best — smaller LSTM-only model (the _best_int variant on jsDelivr), ~3 MB/lang gzipped, similar accuracy to standard for most languages.

The :fast tier (tessdata_fast) requires uncompressed .traineddata files served from a different source and isn't supported in v0.1.
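
One way to picture the tier handling is a small lookup from tier atom to the directory segment of the URL. This is a sketch under stated assumptions: `TierSketch` is hypothetical, "4.0.0" is taken from the example URL above, and "4.0.0_best_int" is a guess based on the "_best_int variant" wording, not a value confirmed by the package source.

```elixir
defmodule TierSketch do
  # Hypothetical mapping, not part of tesseract_js.
  # Directory names are assumptions; check them against the real CDN layout.
  def dir(:standard), do: {:ok, "4.0.0"}
  def dir(:best), do: {:ok, "4.0.0_best_int"}

  # Any other tier, including :fast, is reported as unsupported.
  def dir(other), do: {:error, {:unsupported_tier, other}}
end

IO.inspect(TierSketch.dir(:standard))
IO.inspect(TierSketch.dir(:fast))
```

Returning a tagged tuple instead of raising lets the caller decide whether an unsupported tier is fatal.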

core_cdn_url(variant \\ :simd_lstm)

@spec core_cdn_url(atom()) :: String.t()

jsDelivr URL for the tesseract.js-core WASM bundle.

core_filename(variant \\ :simd_lstm)

@spec core_filename(atom()) :: String.t()

Filename for a WASM core variant.

core_local_path(variant \\ :simd_lstm, base \\ "/assets/vendor/tesseract")

@spec core_local_path(atom(), Path.t()) :: String.t()

Local path for the WASM core.

core_version()

Returns the tesseract.js-core version this package is pinned to.

filename(lang)

@spec filename(String.t()) :: String.t()

Filename for a language's traineddata file.

get(lang)

@spec get(String.t()) :: map() | nil

Returns the registry entry for a language, or nil if not curated.
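
The nil case is what separates verified from unverified downloads. A minimal sketch of how a caller might use an entry to check a downloaded file follows. `ChecksumSketch` is hypothetical, and it assumes the `sha256` field holds a hex-encoded digest, which these docs do not state explicitly.

```elixir
defmodule ChecksumSketch do
  # Hypothetical helper, not part of tesseract_js.
  # `entry` is a registry-style map with a :sha256 hex string, or nil
  # when the language is not curated.
  def verify(_bytes, nil), do: :unverified

  def verify(bytes, %{sha256: expected}) when is_binary(bytes) do
    actual = :crypto.hash(:sha256, bytes) |> Base.encode16(case: :lower)

    if actual == String.downcase(expected) do
      :ok
    else
      {:error, :checksum_mismatch}
    end
  end
end

# Made-up payload and entry so the sketch runs without network access.
payload = "not a real traineddata file"
digest = :crypto.hash(:sha256, payload) |> Base.encode16(case: :lower)
entry = %{name: "Example", size_mb: 0.0, sha256: digest}

IO.inspect(ChecksumSketch.verify(payload, entry))
IO.inspect(ChecksumSketch.verify(payload <> "x", entry))
IO.inspect(ChecksumSketch.verify(payload, nil))
```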

list()

@spec list() :: %{required(String.t()) => map()}

Returns the curated registry as a map of lang => %{name, size_mb, sha256}.

local_path(lang, base \\ "/assets/vendor/tesseract")

@spec local_path(String.t(), Path.t()) :: String.t()

Local filesystem path (relative to a Phoenix app's priv/static/) where the Mix download task will write a language file.
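
Because the returned path is relative to priv/static/, locating the file on disk takes one join against the static root. The sketch below assumes a default root of "priv/static" and uses a hypothetical `LocalPathSketch` module; a release build would more likely resolve the root with `Application.app_dir/2`.

```elixir
defmodule LocalPathSketch do
  # Hypothetical helper, not part of tesseract_js.
  # Joins a URL-style path onto a static root. Path.join/2 drops the
  # leading slash of the second segment, so the result stays relative.
  def on_disk(url_path, static_root \\ "priv/static") do
    Path.join(static_root, url_path)
  end
end

disk_path = LocalPathSketch.on_disk("/assets/vendor/tesseract/eng.traineddata.gz")
IO.puts(disk_path)
IO.puts("present? #{File.exists?(disk_path)}")
```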

iex> TesseractJs.Models.local_path("eng")
"/assets/vendor/tesseract/eng.traineddata.gz"

split_langs(lang)

@spec split_langs(String.t()) :: [String.t()]

Splits a +-combined lang string into individual lang codes.

tessdata_version()

Returns the tessdata version (4.0.0) this package is pinned to.