Tutorial: High-level explanation of the Similarity class for Lucene?


Do you know where I can find a high-level explanation of the Lucene Similarity class algorithm? I would like to understand it without having to decipher all the math and terminology involved in searching and indexing.


Lucene's built-in Similarity is a fairly standard "term frequency, inverse document frequency" (TF-IDF) scoring algorithm. The Wikipedia article is brief, but covers the basics. The book Lucene in Action breaks down the Lucene formula in more detail; it doesn't mirror the current Lucene formula perfectly, but all of the main concepts are explained.

Primarily, the score varies directly with the number of times the term occurs in the current document (the term frequency), and inversely with the number of documents in which the term occurs (the document frequency). The other factors in the formula are secondary; they adjust the score in an attempt to make scores from different queries fairly comparable to each other.
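To make the two main factors concrete, here is a minimal sketch in Python. It is not Lucene's actual formula: the function name `tf_idf_score`, the toy corpus, and the exact shapes chosen for the two factors (square root of the count for term frequency, a damped logarithm for inverse document frequency) are assumptions for illustration, and all the secondary normalization factors are left out.

```python
import math

def tf_idf_score(query_terms, doc_tokens, corpus):
    """Sum a simplified tf * idf weight over the query terms (illustrative only)."""
    num_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        freq = doc_tokens.count(term)  # occurrences of the term in this document
        if freq == 0:
            continue  # a term that is absent contributes nothing
        doc_freq = sum(1 for d in corpus if term in d)  # documents containing the term
        tf = math.sqrt(freq)  # grows with term frequency (assumed shape)
        idf = 1.0 + math.log(num_docs / (doc_freq + 1))  # shrinks as the term gets more common (assumed shape)
        score += tf * idf
    return score

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the fox jumps over the lazy dog".split(),
]

rare = tf_idf_score(["fox"], corpus[0], corpus)    # "fox" appears in 2 of 3 documents
common = tf_idf_score(["the"], corpus[0], corpus)  # "the" appears in every document
print(rare > common)
```

In this sketch a match on "fox" outscores a match on "the" in the same document, because "the" appears in every document and so carries less weight.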


Think of each document and each search as a vector whose coordinates measure how important each word in the entire corpus of documents is to that particular document or search. Similarity tells you the distance between two such vectors.

Say your corpus is normalized to ignore some terms; a document consisting only of those terms would then be located at the origin of the vector space defined by your corpus. Each document that contains some other terms represents a point in that space whose coordinates are defined by the importance of each term in the document relative to its importance in the corpus. Two documents (or a document and a search) whose coordinates put their "points" closer together are more similar than those whose coordinates put their "points" further apart.
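The vector picture can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Lucene code: the helper name `cosine_similarity` is hypothetical, raw term counts stand in for the importance weights, and closeness is measured by the angle between the vectors (the cosine) rather than by straight-line distance, so a higher value means more similar.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a vector at the origin is not similar to anything
    return dot / (norm_a * norm_b)

# Raw counts are used as weights here purely for illustration.
doc = Counter("lucene scoring uses term weights".split())
query = Counter("lucene scoring".split())
unrelated = Counter("cooking recipes".split())

print(cosine_similarity(doc, query) > cosine_similarity(doc, unrelated))
```

A document made up only of ignored terms would have an empty vector, which is the "origin" case handled by the early return.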


As erickson mentioned, the similarity in Lucene is cosine similarity over Term Frequency-Inverse Document Frequency (TF-IDF) weights. Imagine that you have two bags of terms, one for the query and one for the document. This measure matches only identical terms, and then takes their semantic weights into account. Terms that occur very frequently get a smaller weight (importance), because you can find them in a lot of documents. The serious problem I see is that cosine similarity with TF-IDF is not robust on inconsistent data, where the similarity between the query and the document has to tolerate misspellings and typographical or phonetic errors, because the words must match exactly.
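The exact-match limitation is easy to see in a small Python sketch. The helper names `exact_overlap` and `fuzzy_overlap` are hypothetical, and the fuzzy variant uses Python's standard `difflib` module as a stand-in for a more tolerant matcher; it is not how Lucene itself handles misspellings.

```python
import difflib

def exact_overlap(query_terms, doc_terms):
    """Terms shared by query and document when only identical strings match."""
    return set(query_terms) & set(doc_terms)

def fuzzy_overlap(query_terms, doc_terms, cutoff=0.8):
    """Document terms that closely resemble some query term (illustrative)."""
    vocabulary = sorted(set(doc_terms))
    matched = set()
    for term in query_terms:
        # Keep at most one document term whose similarity ratio reaches the cutoff.
        matched.update(difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff))
    return matched

doc_terms = "lucene similarity scoring".split()
misspelled_query = ["lucen"]

print(exact_overlap(misspelled_query, doc_terms))  # nothing matches exactly
print(fuzzy_overlap(misspelled_query, doc_terms))  # the close term is recovered
```

With exact matching the misspelled query shares no term with the document, so any TF-IDF weight for that term is never applied; some tolerant matching step would have to run before the weights come into play.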
