Bit-exact port of fastText's string hash function.
fastText uses a Fowler–Noll–Vo (FNV-1a) variant with one quirk: each input
byte is reinterpreted as a signed 8-bit integer before being widened to
unsigned 32-bit. Bytes with the high bit set therefore contribute their
sign-extended value to the hash mix step, not their unsigned value. This
is documented in src/dictionary.cc (Dictionary::hash) of the fastText
source as a deliberate compatibility decision so that all already-released
models hash identically.
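To make the sign extension concrete, here is a minimal Elixir sketch of the widening step for a single byte. The module name `SignExtendSketch` and the function `widen/1` are hypothetical, introduced only for this illustration; they are not part of this library or of fastText.

```elixir
defmodule SignExtendSketch do
  import Bitwise

  # Reinterpret one byte (0..255) as a signed 8-bit integer, then widen it
  # to an unsigned 32-bit value. Hypothetical helper, for illustration only.
  def widen(byte) when byte in 0..255 do
    <<signed::signed-8>> = <<byte>>
    band(signed, 0xFFFFFFFF)
  end
end

SignExtendSketch.widen(0x61)
# => 97 (0x61): high bit clear, value unchanged

SignExtendSketch.widen(0xE9)
# => 4294967273 (0xFFFFFFE9): high bit set, upper 24 bits filled with ones
```

Only bytes in the range 0x80..0xFF are affected, which is why pure ASCII input hashes the same as it would under standard FNV-1a.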
Translating the C++ literally:
uint32_t h = 2166136261;
for (size_t i = 0; i < str.size(); i++) {
  h = h ^ uint32_t(int8_t(str[i]));
  h = h * 16777619;
}
The two constants are the canonical FNV offset basis and FNV prime for 32-bit FNV-1a.
Any deviation from the reference here will silently produce wrong subword
indices and wreck the model's predictions for non-ASCII scripts. The hash
is exercised by golden tests against fastText's own get_subwords/1
output for a large corpus of words.
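The effect of dropping the signed-byte step can be seen in a short sketch. The module `FnvSketch` and its `hash/2` function are hypothetical stand-ins written for this illustration, not this library's implementation: the `:signed` mode follows the C++ loop above, while `:unsigned` is standard FNV-1a, included only for comparison.

```elixir
defmodule FnvSketch do
  import Bitwise

  @offset_basis 2_166_136_261
  @prime 16_777_619
  @mask 0xFFFFFFFF

  # :signed mirrors the C++ loop (byte read as int8_t, then widened to 32 bits);
  # :unsigned is standard FNV-1a. Both are illustrative, not the library's code.
  def hash(binary, mode) when is_binary(binary) and mode in [:signed, :unsigned] do
    for <<byte <- binary>>, reduce: @offset_basis do
      h ->
        # Sign-extend bytes with the high bit set: 0xC3 becomes 0xFFFFFFC3.
        widened = if mode == :signed and byte >= 0x80, do: byte + 0xFFFFFF00, else: byte
        # XOR first, then multiply, truncating to 32 bits as uint32_t would.
        band(bxor(h, widened) * @prime, @mask)
    end
  end
end

# Pure ASCII input: both variants agree.
FnvSketch.hash("a", :signed) == FnvSketch.hash("a", :unsigned)
# => true

# "é" is the UTF-8 byte pair 0xC3 0xA9, both with the high bit set: the variants diverge.
FnvSketch.hash("é", :signed) == FnvSketch.hash("é", :unsigned)
# => false
```

Because the two variants agree on ASCII, a test suite limited to English words would not catch an unsigned implementation; golden tests need non-ASCII input to exercise the difference.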
Functions
@spec hash(binary()) :: non_neg_integer()
Returns the 32-bit FNV-1a-with-signed-byte hash of a binary.
Arguments
- binary is any UTF-8 string or arbitrary byte sequence. fastText operates on UTF-8 byte sequences, so passing a String.t/0 is the typical use.
Returns
- A non-negative integer in [0, 2^32 - 1].
Examples
iex> Text.Language.Classifier.Fasttext.Hash.hash("")
2166136261
iex> Text.Language.Classifier.Fasttext.Hash.hash("a")
3826002220
iex> Text.Language.Classifier.Fasttext.Hash.hash("the")
3020861980