# Elasticsearch: understand the theory behind the textual search engine

A few months ago I started doing a specialization on Data Mining on Coursera and the second course of this specialization did not seem very attractive to me at first, but it ended up being very useful in understanding one of the work tools we use at Extreme Digital Solutions : elasticsearch .

Much of the theoretical content that I will present are personal notes that I took from the course Text Recovery and Search Engines .

The purpose of this article is to be something more expository and does not intend to explain each subject in an exhaustive way but to show the “stones” path and references in case you want to go deeper.

In the end I hope you can answer the following questions:

• What is the problem that a textual search engine needs to solve?
• What is the “state of the art” today to solve this problem?
• How can eslasticsearch solve this problem quickly?
• What is a vector space for documents and how is it created?
• What are the possible ways to measure the relevance of a document and what is the way used by elasticsearch ?
• How to interpret and adjust the BM25 transformation function used by elasticsearch ?

I will use a lot of mathematical notation but I will try to explain the concepts to those who do not speak “mathematicians”.

I will use the Python language to exemplify some of the concepts that will be presented.

## Formal Formulation of a Textual Search Engine

Let’s start by giving “names to oxen” using mathematical notation.

Suppose you have a collection of text documents. All words, or terms, that exist within these documents form a “vocabulary” that we will call V and the words in that vocabulary we will call w (w of word).

The queries that a user types will be called q which is formed by the words “q1”, “q2”, etc. Words that belong to the V vocabulary:

A textual document “d” of index i is formed by words “d_i_j” (word j from document i) that belong to vocabulary V.

Let’s call our collection of M documents C, that is, if we had 100 documents then M = 100 and C is the set of all these 100 documents:

A textual search engine must be able to find a subset of relevant documents based on the textual query written by the user. This translated to “matematiquês” would be 𝑅 (𝑞) ⊆ 𝐶. The 𝑞 query is a “tip” of which documents will be ranked by the 𝑅 (𝑞) function.

The task of a textual search engine is to compute 𝑅 ′ (𝑞) ≈𝑅 (𝑞), that is, estimate the parameters of the supposed ideal function 𝑅 (𝑞).

We have two types of strategies for estimating 𝑅 (𝑞):

• Selection of documents : 𝑅 ′ (𝑞) = {𝑑 ∈ 𝐶 | 𝑓 (𝑑, 𝑞) = 1}, where 𝑓 (𝑑, 𝑞) ∈ {0.1} is an indicator function of the binary classifier type. The system has to decide whether the document will be listed to the user or not.
• Ranking of documents : 𝑅 ′ (𝑞) = {𝑑∈𝐶 | 𝑓 (𝑑, 𝑞) ≥ 𝜃}, where 𝑓 (𝑑, 𝑞) ∈𝑅 is a function that measures relevance and 𝜃 is a predefined threshold. The system will have to list the documents and indicate their degrees of relevance.

We will emphasize the second case because it is the path used by elasticsearch .

This function 𝑅 (𝑞) can follow probabilistic, inferential or similarity-based models.

Elasticsearch follows the model based on similarity. Some possible similarity functions: Euclidean distance, internal product and cosine of similarity.

But to calculate this similarity function we first need to build a vector space that will make it possible to obtain this result.

## Documents as Vectors of a Space

Each document can be represented as a vector in a vector space that has a dimension (axis) for each word in vocabulary V. Thus, each dimension of that vector represents the number of times the word associated with that dimension appears in the document in question. The same applies to the vector representation of the query q entered by the user.

This idea of ​​counting how many times a word appears in a document is called Term Frequency (TF). There are different ways to measure the TF of a word (see more about it here ) but the one we will use here is the approach of simply getting the number of times the word in question appears in the document.

As an example, let’s assume that our vocabulary is:

And our document is formed by the words:

And the query entered by the user is:

Then we would have the following representation of 𝑑1 and 𝑞 in the vector space:

## Domestic Product as a Similarity Measure and Ranking Function

From analytical geometry we know that the internal product between two vectors relates their modules, that is, their lengths, with the angle between them. Given 𝑞 = (𝑥1, 𝑥2,…, 𝑥𝑛) and 𝑑 = (𝑦1, 𝑦2,…, 𝑦𝑛):

Let’s play around with this idea of ​​calculating similarity between documents using the internal product:

Note that the result for the second and third documents, which speak about different things, received the same score because the word “about” had as much weight as the word “presidential”.

It would be smarter not to “give too much attention” to words that are repeated a lot in all documents.

So that this does not happen, we can weigh the vectors of the documents by the IDF (please see tf – idf ), thus, a term will only have greater weight if it is a term not so common in other documents.

This will solve the problem of equal scores for the second and third documents.

As you can see, the third and fifth documents speak of the same thing, but as the term “campaign” appears many times in the fifth document it still receives a higher score.

If a word “campaign” appears once it gives me a clue as to what the document is talking about, if it appears a second, third and fourth time it will further strengthen my confidence that the document in question is really talking about some campaign, but there comes a point that if I see the word “campaign” (50 times for example) more often it won’t add much information to the point of releasing one: “okay, I understand that you’re talking about the campaign . I don’t need that much! ”. The name of this is Term Saturation.

To solve this problem of saturation of the term resulting from the function 𝑇𝐹 (𝑤, 𝑑), we can apply some transformations. The next section deals with one of these transformations that work very well, the BM25 and that is the standard transformation used by elasticsearch .

## BM25 transformation: controlling the influence of term saturation

If we were to just take 𝑐 (𝑤, 𝑑) ⟼ 𝑇𝐹 (𝑤, 𝑑) (in this case TF (w, d) is the relative frequency of the word in the document while c (w, f) the absolute frequency) we would have a linear transformation, but it would be interesting to have a transformation that had a ceiling, a maximum value, that could be adjusted.

Among the various transformations that try to minimize the effect of words that are repeated many times in the same document, Best-Match-25 (BM25) is the one that brings the best results and it has a ceiling that is adjusted using the parameter 𝑘. The transformation follows:

If 𝑘 = 0 the ceiling will be 1, that is, I am saying that I am not interested in the repetitions of a term, just seeing it once is enough. The higher the value of the parameter 𝑘, the closer the transformation will be to the linear transformation 𝑐 (𝑤, 𝑑) ⟼ 𝑇𝐹 (𝑤, 𝑑).

With some algebraic manipulations it is easy to see why the ceiling of this transformation is (𝑘 + 1) because:

Here are some ranking functions that use the BM25:

• Okapi BM25
• BM25L
• BM25 +
• BM25T

All of these variations can be seen in detail in the article Improvements to BM25 and Language Models Examined .

The ranking function that we will deal with here is the Okapi BM25 because it is the one that is used by elasticsearch today because it is considered today as “state of the art”. But first we need to understand about normalizing document length.

## Normalization: controlling the influence of document length

The length of a document is its total number of words.

Why can document length be an issue? Very long documents are more likely to match any query, due to the large number of words. The document can be long for two reasons:

• It is verbose, that is, it uses many words to describe the same subject.
• It actually has more content, that is, different information.

We want to penalize the first case more and not so much the second case. This process of penalizing verbal documents is called Document Length Normalization.

The average length of documents in collection C is used as a reference point (pivot), that is, a document will only be normalized if it is greater or less than the average.

As you can see, parameter b assumes values ​​between 0 and 1 (included) and the higher b is, the greater the penalty applied to a document is greater than average.

## Okapi BM25: the merger

Combining the BM25 transformation, plus the compliance normalization presented above and the TF-IDF, we will have the following ranking function called BM25 Okapi:

According to the article Practical BM25 – Part 1: How Shards Affect Relevance Scoring in Elasticsearch published on the elasticsearch blog:

“In Elasticsearch 5.0, we switched to Okapi BM25 as our standard similarity algorithm, which is used to rank results related to a query.

## Reverse Indexing of Documents

In order for the search engine to be quick in ranking documents, we need to pre-calculate the values ​​of TF, DF and IDF as much as possible and store them in a data structure that supports textual search algorithms such as the BM25 Okapi.

This process of creating this data structure is called Reverse Document Indexing .

The image above shows a little of what this reverse index object would look like. First of all, it creates a dictionary that represents the vocabulary V, which indicates the number of documents have the word / term in question and its absolute frequency, which in turn is associated with another object called “Posting” which indicates the number of times that this word appears in the documents with which it is associated.

The ElasticSearch uses a Java library called Lucene to take care of the entire process of reverse indexing and searching documents.

“Apache LuceneTM is a high-performance, full-featured text search engine library, entirely written in Java. It is a technology suitable for virtually any application that requires full-text search. ” Apache Lucene Core

The elastic needs to create the object of inverse indexing associated with the text fields to advance as much as possible the calculations of the relevance scores.

The elasticsearch stores different types of data in data structures more appropriate to the data in question, the textual data will be stored in specific structures for inverse indexing.

Before performing the inverse indexing of a textual field, we can apply analyzer (see Anatomy of an analyzer ) that help to treat the text data, removing stopwords , applying stemming , removing accentuation, leaving all words in lowercase, clearing the text from html tags, etc. Such analyzers are configured in the mapping (see Mapping ).

## Customizing the BM25 Parameters

We will do some tests using different values ​​for the parameters of the BM25 Okapi.

In the following mapping we will define k1 = 0, we mean that we will not care about the saturation of the term when calculating the relevance score. But we will apply an average penalty (b = 05) for long documents.

In this other index, we will define b = 0, we mean that we will not care about the length of the document when calculating the relevance score of a document. But we will consider the word saturation ceiling as 3 (remember that the ceiling is k + 1 ).

The next index will let us penalize the maximum (b = 1) documents that are larger than average and our word saturation ceiling is 3.

So let’s create our index:

We will index some documents in each of the index for comparison.

Now we will make the following query for the index newspaper1, passing the explain parameter as true, that way you can get the explanation of each step of the relevance score calculation that elastic performed associated to the documents listed as relevant, just print the content that is in res1 [“hits”] [“hits”] [0] [“_ explanation”].

Note that all the first 3 docs containing “news” and “campaigns” were listed even though the 3rd has nothing to do with the query subject. The first 2 still have one more word that is relevant, the word “presidential” but the largest document was in 2nd place in the order of relevance, that is, it was penalized by a mean (b = 0.5) for being higher than the average.

Now let’s see the same query applied in the index Jornal2:

This time, because we did not penalize the length of (b = 0), the largest document came in first place, even with the saturation of the term “campaign” because the ceiling is approximately 3 words (k1 = 2, remember that the ceiling is equal a k1 + 1), so the first 3 words “campaign” were considered in the relevance calculation while the remaining 2 were ignored.

Note that even the 3rd most relevant document this time was the one that talks about organic foods but that has the word “campaign” repeated 4 times, as approximately 3 of them were used for the calculation of relevance.

Now let’s look at the index Jornal3:

We now have a scenario close to the newspaper Jornal1 index for the first 3, but with the difference that the 3rd most relevant document is no longer about organic foods as these documents were “bombarded” with penalties due to length, saturation of terms and unimportant words as “food” and “organic”.

I leave it as homework for you to check the step by step calculation of the relevance score that is contained in res1, res2 and res3 (res1 [“hits”] [“hits”] [] [“_ explanation” ]). Also index the same documents in an index with the default settings for the BM25 and execute the same query that we did here for comparison purposes.

That’s it folks, any hesitation I made in the explanations please leave in the comments. Hug to everyone!