Issue

As I understand it, min_term_freq=2 looks at the input text, and a term is used for searching only if it occurs at least twice there.

But what does min_doc_freq mean? The documentation says

The minimum document frequency below which the terms will be ignored from the input document. Defaults to 5.

But I am not able to figure out what that means. Does it look at the input document or at the rest of the index?

Solution

Lucene's scoring formula uses TF-IDF weights to reflect how meaningful a word is to a document in a corpus.

Therefore, the terms of the input document that have the highest tf-idf are good representatives of that document, and can be combined in a disjunctive (OR) query to retrieve similar documents.

That's why the More Like This component uses this numerical statistic.

The MLT query simply extracts the text from the input document, analyzes it (usually with the same analyzer as the field), then selects the top K terms with the highest tf-idf to form a disjunctive query of these terms.
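As a rough illustration of that selection step, here is a minimal sketch in plain Python. It is not Lucene's or Elasticsearch's implementation: the whitespace tokenizer, the smoothed idf formula, the toy corpus, and the helper name `top_terms` are all assumptions made for the example.

```python
import math
from collections import Counter


def top_terms(input_text, corpus, k=3):
    """Return the k terms of input_text with the highest tf-idf (toy version)."""
    # Term frequency: how often each term occurs in the input document.
    tf = Counter(input_text.lower().split())
    n_docs = len(corpus)
    doc_tokens = [set(doc.lower().split()) for doc in corpus]

    scores = {}
    for term, freq in tf.items():
        # Document frequency: how many indexed documents contain the term.
        doc_freq = sum(1 for tokens in doc_tokens if term in tokens)
        # Smoothed idf (an assumption for this sketch): rarer terms score higher.
        idf = math.log((n_docs + 1) / (doc_freq + 1)) + 1.0
        scores[term] = freq * idf

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the bird sang",
]
selected = top_terms("the cat and the cat", corpus, k=2)
print(selected)
```

In this toy corpus, "the" appears in every document and ranks last despite occurring twice in the input, while "and" ranks high only because no indexed document contains it.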

The idf represents the inverse of the number of documents in which a given term appears: a term appearing in every document is considered not pertinent (high document frequency, and thus low idf).

That being said, a word that appears only once, in a single document, could also be a typo, a lorem ipsum excerpt, or something similar: a term without any meaning that nevertheless gets a significant tf-idf weight. Hence the need to leave some "room" to avoid issues caused by nothing more than "theoretical meaningfulness".

The min_doc_freq parameter sets a threshold: any term from the input document whose docFreq (the number of indexed documents that contain it) is less than this value is ignored. For example, with min_doc_freq=5 a term must appear in at least 5 documents, otherwise it is excluded from the MLT query. This is useful when you want MLT to return documents similar to the given one only if the query terms reflect a well-addressed topic (one addressed in at least 5 documents).
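The threshold can be pictured with a small Python sketch. The document frequencies below are made-up numbers, and `filter_by_min_doc_freq` is a hypothetical helper, not an Elasticsearch or Lucene function.

```python
def filter_by_min_doc_freq(terms, doc_freqs, min_doc_freq=5):
    """Keep only the terms that occur in at least min_doc_freq indexed documents."""
    kept = []
    for term in terms:
        # A term missing from the index has a document frequency of 0.
        if doc_freqs.get(term, 0) >= min_doc_freq:
            kept.append(term)
    return kept


# Hypothetical index statistics: term -> number of documents containing it.
doc_freqs = {"elasticsearch": 120, "lucene": 45, "lorme": 1, "ipsum": 4}
candidates = ["elasticsearch", "lucene", "lorme", "ipsum"]

kept = filter_by_min_doc_freq(candidates, doc_freqs, min_doc_freq=5)
print(kept)
```

Here the typo "lorme" (1 document) and "ipsum" (4 documents) fall below the threshold of 5 and are dropped, so they never reach the generated query.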

So, does it look at the input document or the rest of the index?
Both: the candidate terms come from the input document, and for each of them the docFreq, a term statistic looked up in the index, is checked.

In the same context, you would use max_doc_freq to ignore highly frequent words such as stop words.
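To see how these settings fit together in a request, here is a sketch that builds a more_like_this query body as a Python dict. The parameter names are those of the MLT query; the field names, the sample text, the max_doc_freq value, and the helper `build_mlt_query` are assumptions chosen for illustration.

```python
import json


def build_mlt_query(like_text, fields, min_term_freq=2, min_doc_freq=5,
                    max_doc_freq=None, max_query_terms=25):
    """Build a more_like_this request body (field names are up to the caller)."""
    mlt = {
        "fields": fields,
        "like": like_text,
        "min_term_freq": min_term_freq,      # checked against the input document
        "min_doc_freq": min_doc_freq,        # checked against the index
        "max_query_terms": max_query_terms,  # the "top K" terms kept
    }
    # Only set an upper bound when asked, e.g. to drop stop-word-like terms.
    if max_doc_freq is not None:
        mlt["max_doc_freq"] = max_doc_freq
    return {"query": {"more_like_this": mlt}}


body = build_mlt_query(
    "Once upon a time",
    fields=["title", "description"],  # hypothetical field names
    max_doc_freq=1000,                # hypothetical cutoff for very common terms
)
print(json.dumps(body, indent=2))
```

The resulting JSON can be sent as the body of a search request. A sensible max_doc_freq depends on the size of the index, so treat 1000 as a placeholder.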

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html

Answered By - EricLavault

Answer Checked By - Terry (PHPFixing Volunteer)

Monday, October 24, 2022

[FIXED] How does min_doc_freq work in More Like This query?
