content_indexer v0.2.4 API Reference
Modules
Documentation for ContentIndexer
Struct that stores the details of the data held in the index.
It provides a new/2 function that instantiates the struct with a generated UUID.
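A minimal sketch of a struct with a new/2 constructor that attaches a generated UUID, as described above. The module name IndexSketch, the field names, and the choice of arguments (a file name and a token list) are assumptions for illustration; the package's real struct and its UUID generator may differ. The UUID here is built from random bytes using only the standard library.

```elixir
defmodule IndexSketch do
  # Hypothetical fields: the reference only says the struct describes indexed data
  # and carries a generated UUID.
  defstruct [:uuid, :file_name, :tokens]

  # Builds the struct from a file name and its tokens, adding a random version-4 UUID.
  def new(file_name, tokens) when is_binary(file_name) and is_list(tokens) do
    %__MODULE__{uuid: uuid4(), file_name: file_name, tokens: tokens}
  end

  # Generates a version-4 UUID string from 16 random bytes.
  defp uuid4 do
    # Keep 122 random bits and overwrite the version (4) and variant (binary 10) bits.
    <<a::48, _::4, b::12, _::2, c::62>> = :crypto.strong_rand_bytes(16)
    <<u0::32, u1::16, u2::16, u3::16, u4::48>> = <<a::48, 4::4, b::12, 2::2, c::62>>

    [{u0, 8}, {u1, 4}, {u2, 4}, {u3, 4}, {u4, 12}]
    |> Enum.map_join("-", fn {n, width} ->
      n |> Integer.to_string(16) |> String.downcase() |> String.pad_leading(width, "0")
    end)
  end
end
```

Calling IndexSketch.new("notes.md", ["elixir", "index"]) returns a struct whose uuid field differs on every call.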
Indexer is a GenServer that holds the index state: a list of index structs, each containing the filename, tokens and weights.
Each time an index struct is added to the server, the weights are recalculated. Because the index is held in memory, searching it is fast.
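The behaviour described above (in-memory state, weights recalculated on every add) can be sketched with a small GenServer. Everything named here is hypothetical: IndexerSketch, add/3, all/1, and the plain maps used in place of the package's index struct. The weigh/2 helper is a deliberately simple placeholder (term count divided by the number of documents containing the term), not the package's actual weighting.

```elixir
defmodule IndexerSketch do
  use GenServer

  # Client API (hypothetical names).
  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, [], opts)
  def add(server, file_name, tokens), do: GenServer.call(server, {:add, file_name, tokens})
  def all(server), do: GenServer.call(server, :all)

  @impl true
  def init(entries), do: {:ok, entries}

  @impl true
  def handle_call({:add, file_name, tokens}, _from, entries) do
    entries = [%{file_name: file_name, tokens: tokens, weights: %{}} | entries]
    # Every entry is re-weighted, because a new document changes corpus-wide statistics.
    reweighted = Enum.map(entries, &Map.put(&1, :weights, weigh(&1.tokens, entries)))
    {:reply, :ok, reweighted}
  end

  def handle_call(:all, _from, entries), do: {:reply, entries, entries}

  # Placeholder weighting: occurrences in this document / documents containing the term.
  defp weigh(tokens, entries) do
    tokens
    |> Enum.frequencies()
    |> Map.new(fn {term, count} ->
      docs_with_term = Enum.count(entries, &(term in &1.tokens))
      {term, count / docs_with_term}
    end)
  end
end
```

Because all state lives in the server process, a search only has to read the list returned by all/1; the cost is that every add touches every stored entry.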
Calculates the content_indexer weights for a document of tokens against a corpus of tokenized documents.
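As an illustration of weighting a tokenized document against a tokenized corpus, here is a sketch of one common TF-IDF variant: term frequency normalised by document length, multiplied by the natural log of (documents in corpus / documents containing the term). The module and function names are hypothetical, and the package may use a different formula or smoothing.

```elixir
defmodule TfIdfSketch do
  # tokens: list of strings for one document.
  # corpus: list of token lists, assumed to include the document itself.
  def weights(tokens, corpus) do
    total_terms = length(tokens)
    total_docs = length(corpus)

    tokens
    |> Enum.frequencies()
    |> Map.new(fn {term, count} ->
      docs_with_term = Enum.count(corpus, &(term in &1))
      tf = count / total_terms
      # max/2 guards against division by zero when the document is not in the corpus.
      idf = :math.log(total_docs / max(docs_with_term, 1))
      {term, tf * idf}
    end)
  end
end
```

With this variant, a term that appears in every document gets a weight of 0.0, while a term unique to one document gets the highest weight for its frequency.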
ListCheckerServer is the OTP server that uses GenServer to handle the
interactions between the individual workers and the parent caller.
Each ListCheckerWorker processes one list of tokens
and checks that list for a given token. Once it is done, it sends a message
back to the ListCheckerServer.
The server in turn sends a message to the caller once the whole
list of token lists has been checked.
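The fan-out and collect flow described above can be sketched with plain processes and messages. This is not the package's GenServer implementation: the module name, the message shapes ({:checked, ...}, {:all_checked, ...}) and the timeout are all assumptions, and a bare spawned coordinator stands in for ListCheckerServer.

```elixir
defmodule ListCheckSketch do
  # Checks every token list for `token` concurrently; returns booleans in input order.
  def check(token_lists, token) do
    caller = self()
    ref = make_ref()

    # Coordinator process: stands in for the server between caller and workers.
    spawn(fn ->
      coordinator = self()

      token_lists
      |> Enum.with_index()
      |> Enum.each(fn {tokens, i} ->
        # One worker per token list; each reports its result to the coordinator.
        spawn(fn -> send(coordinator, {:checked, i, token in tokens}) end)
      end)

      results = collect(length(token_lists), %{})
      # Notify the caller only after every worker has reported.
      send(caller, {:all_checked, ref, results})
    end)

    receive do
      {:all_checked, ^ref, results} -> results
    after
      5_000 -> {:error, :timeout}
    end
  end

  # Gathers n worker replies, then orders them by their original index.
  defp collect(0, acc), do: acc |> Enum.sort() |> Enum.map(&elem(&1, 1))

  defp collect(n, acc) do
    receive do
      {:checked, i, found?} -> collect(n - 1, Map.put(acc, i, found?))
    end
  end
end
```

Tagging each worker reply with its index lets the coordinator return results in input order even though workers finish in any order.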
GenServer-based approach to the ListCheckerWorker
ListCheckerWorker is the OTP actor that handles the actual ContentIndexerService.list_contains call, which checks
whether a given word is contained in a list of tokens
Content and query pre-process functions that are passed to the SearchUtils.compile and SearchUtils.compile_query functions. Here they just do some extra work on a markdown file, i.e. removing the header
Utility functions to crawl a folder of files and extract their content. The actual processing of the content is handled by the file_pre_process_func function from the ContentIndexer.Services.PreProcess module; however, this can easily be swapped out by passing your own pre-process function
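A rough sketch of the idea of crawling a folder with a swappable pre-process function. The names CrawlSketch and crawl/2, the ".md" filter, the default of String.trim/1, and the {file name, content} return shape are all assumptions made for illustration, not the package's API.

```elixir
defmodule CrawlSketch do
  # Reads every .md file under `dir` (recursively) and applies `pre_process`
  # to each file's content. Returns {file_name, processed_content} tuples.
  def crawl(dir, pre_process \\ &String.trim/1) do
    dir
    |> Path.join("**/*.md")
    |> Path.wildcard()
    |> Enum.sort()
    |> Enum.map(fn path ->
      {Path.basename(path), path |> File.read!() |> pre_process.()}
    end)
  end
end
```

Passing a different one-argument function as the second argument, for example &String.downcase/1, replaces the default processing without changing the crawl itself.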
This module accepts a list of tuples, each containing a document id and a hash of terms and their TF_IDF weights. It also accepts query terms as a hash of terms and weights, in the same format as the hash in each tuple
Calculates the TF_IDF weights for the tokens of a given document_name
Corpus is a GenServer that simply holds the total number of docs in the index
DocCounts is a GenServer that contains a map of documents with their respective total number of terms in the document
DocTerms is a GenServer with a Map of tuples that has the document, and a count of each of the terms in the document
TermCounts is a GenServer that contains the number of documents that have a term "t". Basically a map of all unique terms and their respective counts, i.e. 1 for each document that has this term
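The counting rule above (a term counts once per document, however often it occurs there) can be sketched as a pure state update. The module and function names are hypothetical, and the GenServer wrapper around the map is omitted.

```elixir
defmodule TermCountsSketch do
  # Folds one document's tokens into the counts map.
  # Enum.uniq/1 ensures each term adds at most 1 per document.
  def add_document(counts, tokens) do
    tokens
    |> Enum.uniq()
    |> Enum.reduce(counts, fn term, acc -> Map.update(acc, term, 1, &(&1 + 1)) end)
  end
end
```

Starting from an empty map, adding ["cat", "cat", "dog"] and then ["cat"] leaves "cat" at 2 and "dog" at 1, which is the document frequency used in an IDF calculation.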
The WeightsIndex is the actual tf_idf list stored by document_name