Issue
As I understand it, min_term_freq=2 looks at the input text, and a term is used for searching only if it occurs at least two times there.
But what does min_doc_freq mean? The documentation says:
"The minimum document frequency below which the terms will be ignored from the input document. Defaults to 5."
But I am not able to figure out what that means. Does it look at the input document or the rest of the index?
Solution
Lucene's classic scoring formula uses TF-IDF weights to reflect how meaningful a word is to a document in a corpus.
Therefore, the terms of the input document that have the highest tf-idf are good representatives of that document, and can be combined in a disjunctive (OR) query to retrieve similar documents.
That's why the More Like This component uses this numerical statistic.
The MLT query simply extracts the text from the input document, analyzes it (usually with the same analyzer as the field), then selects the top K terms with the highest tf-idf to form a disjunctive query of these terms.
The idf represents the inverse of the number of documents in which a given term appears: a term appearing in every document is considered not pertinent (high doc frequency, and thus low idf).
That being said, a word that appears only once, in a single document, could also be a typo, a lorem ipsum excerpt, or something like that: a term without any meaning that nevertheless gets a significant tf-idf weight. Hence the need to leave some "room" to avoid issues induced by nothing more than "theoretical meaningfulness".
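To make the point about rare terms concrete, here is a minimal sketch using the textbook tf-idf formula (tf * log(N / docFreq)). This is not Lucene's exact formula, and the toy corpus and the typo "serach" are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); "serach" is a deliberate typo found in one document only.
corpus = [
    "lucene search engine index",
    "search engine ranking serach",
    "index search query",
    "search query parser",
]
docs = [text.split() for text in corpus]
num_docs = len(docs)

# docFreq: number of documents in which each term appears at least once.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tf_idf(term, doc):
    """Textbook tf-idf: term frequency in doc times log(N / docFreq)."""
    tf = doc.count(term)
    idf = math.log(num_docs / doc_freq[term])
    return tf * idf


# Weights of the terms of the second document.
weights = {term: tf_idf(term, docs[1]) for term in set(docs[1])}
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

In this toy corpus, "search" appears in every document and gets a weight of 0, while the typo "serach" ties for the highest weight simply because it is rare.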
The min_doc_freq parameter sets a threshold: any candidate term from the input document whose docFreq is less than this value is ignored. For example, with min_doc_freq=5, a term must appear in at least 5 documents, otherwise it is excluded from the MLT query. This can be useful when you want MLT to return documents similar to the given one only if the terms of the query reflect a well-addressed topic (addressed in at least 5 documents).
So, does it look at the input document or the rest of the index?
Both: the candidate terms and their term frequencies come from the input document, and for each of them it checks the docFreq, a term statistic queried against the index.
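The interplay between the two sources can be sketched roughly as follows. This is a simplified illustration, not the actual Lucene or Elasticsearch code: the function name select_mlt_terms, the index_doc_freq dictionary standing in for the index statistics, and the textbook idf formula are all assumptions. The sketch applies the frequency filters before ranking, which is one plausible ordering.

```python
import math
from collections import Counter


def select_mlt_terms(input_tokens, index_doc_freq, num_docs,
                     min_term_freq=2, min_doc_freq=5,
                     max_doc_freq=None, max_query_terms=25):
    """Pick representative terms of an input document (illustrative only)."""
    term_freq = Counter(input_tokens)  # statistics from the input document
    scored = []
    for term, tf in term_freq.items():
        if tf < min_term_freq:
            continue  # too rare in the input document
        df = index_doc_freq.get(term, 0)  # statistic from the index
        if df == 0 or df < min_doc_freq:
            continue  # too rare in the index (typo, noise, ...)
        if max_doc_freq is not None and df > max_doc_freq:
            continue  # too common in the index (stop-word-like)
        idf = math.log(num_docs / df)  # textbook idf, not Lucene's formula
        scored.append((tf * idf, term))
    scored.sort(reverse=True)  # highest tf-idf first
    return [term for _, term in scored[:max_query_terms]]


# Hypothetical input document and index statistics.
tokens = ("lucene lucene scoring scoring scoring "
          "the the the the typoo typoo once").split()
index_doc_freq = {"lucene": 40, "scoring": 12, "the": 990, "typoo": 1, "once": 300}

terms = select_mlt_terms(tokens, index_doc_freq, num_docs=1000, max_doc_freq=900)
print(terms)
```

Here "once" is dropped by min_term_freq (input document), "typoo" by min_doc_freq and "the" by max_doc_freq (both index statistics), leaving "scoring" and "lucene" as query terms.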
In the same context, you would use max_doc_freq to ignore highly frequent words such as stop words.
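As a rough illustration, a more_like_this request body combining these parameters could be built like this. The field names, the like text and the max_doc_freq value are placeholders chosen for the example, not recommendations.

```python
import json

# Request body for a more_like_this query, expressed as a Python dict.
query_body = {
    "query": {
        "more_like_this": {
            "fields": ["title", "description"],  # hypothetical field names
            "like": "Once upon a time",          # placeholder input text
            "min_term_freq": 2,       # term must occur at least twice in the input
            "min_doc_freq": 5,        # term must appear in at least 5 indexed documents
            "max_doc_freq": 1000,     # illustrative cap to skip very common terms
            "max_query_terms": 25,    # keep at most 25 terms in the query
        }
    }
}

print(json.dumps(query_body, indent=2))
```

The resulting JSON can be sent as the body of a search request against the index you want to query.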
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
Answered By - EricLavault Answer Checked By - Terry (PHPFixing Volunteer)