content_indexer v0.2.4 API Reference

Modules

Documentation for ContentIndexer

Struct that stores the details of the data held in the index. It provides a new/2 function that instantiates the struct with a generated UUID.

ContentIndexer.IndexInitialiser

ContentIndexer.Indexer

Summary

Indexer is a GenServer that holds the index state: a list of index structs, each with a filename, tokens and weights.
Each time an index struct is added to the index, the weightings are recalculated. Because the structs are held in memory, searching the index is fast.

ContentIndexer.Services.Calculator

Summary

Calculates the content_indexer weights for a document of tokens against a corpus of tokenized documents.
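As a rough illustration of what weighting a document of tokens against a tokenized corpus involves, here is a minimal TF-IDF sketch in Python. It is not the library's code: the function name `tf_idf_weights` is hypothetical, and the formula (term count divided by document length, times log(N / df)) is an assumption; the variant used by Calculator may differ.

```python
import math
from collections import Counter


def tf_idf_weights(doc_tokens, corpus):
    """Return {term: weight} for doc_tokens against corpus (a list of token lists).

    Assumed formula (not taken from the library):
    tf = count / len(doc_tokens), idf = log(N / df).
    """
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    weights = {}
    for term, count in counts.items():
        # Number of corpus documents that contain the term at least once.
        df = sum(1 for tokens in corpus if term in tokens)
        if df == 0:
            continue  # term absent from the corpus: IDF is undefined in this sketch
        tf = count / len(doc_tokens)
        weights[term] = tf * math.log(n_docs / df)
    return weights


corpus = [
    ["elixir", "genserver", "index"],
    ["elixir", "search"],
    ["index", "search", "search"],
]
weights = tf_idf_weights(corpus[2], corpus)
```

Under this formula a term that appears in every document gets a weight of zero, which is why common words contribute nothing to a match.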

ContentIndexer.Services.ListCheckerServer

Summary

ListCheckerServer is the OTP server that uses GenServer to handle the
interactions between the individual workers and the parent caller.
Each ListCheckerWorker processes one list of tokens and checks that list
for a given token. Once it is done, it returns a message to the
ListCheckerServer.
The server in turn sends a message to the caller, advising it once the whole
list of token lists has been checked successfully.
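The fan-out and gather flow between the server and its workers can be sketched as follows. This is only an analogy in Python: a thread pool stands in for the OTP processes, `check_lists` is a hypothetical name, and the signature given to `list_contains` here is an assumption rather than the library's.

```python
from concurrent.futures import ThreadPoolExecutor


def list_contains(tokens, word):
    """Worker step: report whether word occurs in one list of tokens."""
    return word in tokens


def check_lists(token_lists, word, max_workers=4):
    """Server step: start one check per token list and gather the results.

    Returns one boolean per token list, in input order, and only after
    every worker has finished.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda tokens: list_contains(tokens, word), token_lists))


results = check_lists([["a", "b"], ["c"], ["b", "d"]], "b")
```

The caller sees a single result once all lists are checked, which mirrors the server notifying its parent only when the whole batch is complete.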

ContentIndexer.Services.ListCheckerWorker

GenServer-based approach to the ListCheckerWorker

Summary

ListCheckerWorker is the OTP actor that makes the actual ContentIndexerService.list_contains call to check
whether a given word is contained in a list of tokens.

ContentIndexer.Services.PreProcess

Content and query pre-process functions that are passed to the SearchUtils.compile and SearchUtils.compile_query functions. Here we just do some extra processing of a markdown file, i.e. removing the header.
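A pre-process step of this kind might look like the sketch below. The source does not say what form the header takes, so this assumes a leading block delimited by `---` lines; the function name `strip_header` and the delimiter are both hypothetical.

```python
def strip_header(content, delimiter="---"):
    """Drop a leading header block delimited by `delimiter` lines, if present.

    Assumption: the header is the first thing in the file and is closed by
    a second delimiter line. Content without such a block is returned as-is.
    """
    lines = content.split("\n")
    if lines[0].strip() != delimiter:
        return content
    for i in range(1, len(lines)):
        if lines[i].strip() == delimiter:
            # Keep everything after the closing delimiter, minus leading blank lines.
            return "\n".join(lines[i + 1:]).lstrip("\n")
    return content  # opening delimiter never closed: leave the content untouched


doc = "---\ntitle: Example\n---\n\n# Heading\nBody text\n"
body = strip_header(doc)
```

Because the function takes and returns plain text, it can be swapped for any other callable with the same shape.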

ContentIndexer.Services.SearchUtils

Utility functions that crawl a folder of files and extract their content. The actual processing of the content is handled by the file_pre_process_func function that we take from the ContentIndexer.Services.PreProcess module. This can easily be swapped out by passing your own pre-process function.

ContentIndexer.Services.Similarity

Summary

This module accepts a list of tuples, each containing a document id and a hash of terms and their TF_IDF weights. It also accepts query terms as a hash of terms and weights, in the same format as the hash in each tuple.
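To make the input shapes concrete, here is a small Python sketch that scores such tuples against a query. Cosine similarity is a common choice for comparing TF-IDF vectors, but whether this module uses exactly that measure is an assumption; the names `cosine` and `rank` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero or empty vector matches nothing
    return dot / (norm_a * norm_b)


def rank(documents, query):
    """Score each (doc_id, weights) tuple against the query, best match first."""
    scored = [(doc_id, cosine(weights, query)) for doc_id, weights in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


documents = [
    ("doc1.md", {"elixir": 0.5, "index": 0.2}),
    ("doc2.md", {"search": 0.7}),
]
ranked = rank(documents, {"elixir": 1.0})
```

Only terms shared by the document and the query contribute to the score, so documents with no query terms score zero.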

ContentIndexer.TfIdf.Calculate

Calculates the TF_IDF weights for the tokens of a given document_name.

ContentIndexer.TfIdf.Corpus

Summary

Corpus is a GenServer that simply holds the total number of docs in the index.

ContentIndexer.TfIdf.DocCounts

Summary

DocCounts is a GenServer that contains a map of documents with their respective total number of terms in the document.

ContentIndexer.TfIdf.DocTerms

Summary

DocTerms is a GenServer with a Map of tuples, each holding the document and a count of each of the terms in the document.

ContentIndexer.TfIdf.TermCounts

Summary

TermCounts is a GenServer that contains the number of documents that have a term 't'. It is basically a map of all unique terms and their respective counts, i.e. 1 for each document that has this term.
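The tallies kept by Corpus, DocCounts, DocTerms and TermCounts can be pictured with plain Python dictionaries. This is a sketch under assumptions: the function name `build_counts` and the exact data shapes are hypothetical, and the library holds each tally in its own GenServer rather than building them in one pass.

```python
from collections import Counter


def build_counts(docs):
    """Build the four tallies from docs, a {document_name: tokens} dict."""
    # Corpus: total number of documents in the index.
    corpus_size = len(docs)
    # DocCounts: total number of terms in each document.
    doc_counts = {name: len(tokens) for name, tokens in docs.items()}
    # DocTerms: count of each term within each document.
    doc_terms = {name: Counter(tokens) for name, tokens in docs.items()}
    # TermCounts: number of documents that contain each term.
    term_counts = Counter()
    for tokens in docs.values():
        term_counts.update(set(tokens))  # set(): a document counts a term once
    return corpus_size, doc_counts, doc_terms, term_counts


docs = {"a.md": ["index", "search", "search"], "b.md": ["search"]}
corpus_size, doc_counts, doc_terms, term_counts = build_counts(docs)
```

Here "search" occurs three times overall but has a term count of 2, because it is counted once per document that contains it.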

ContentIndexer.TfIdf.WeightsIndexer

Summary

The WeightsIndex is the actual tf_idf list stored by document_name.