An analysis of how search scores are calculated in Elasticsearch

Key terms in search score calculation

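  • TF: term frequency, the number of times the search term appears in the searched field of a document after the field has been analyzed (tokenized)

  • IDF: inverse document frequency, a measure of how rare the search term is across all documents; the fewer documents contain the term, the higher the IDF

  • TFNORM: term frequency normalization, which scales the term frequency by the length of the field relative to the average field length
  • BM25: the default scoring algorithm in Elasticsearch; its term frequency component is freq / (freq + k1 * (1 - b + b * dl / avgdl))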
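
For reference, the scoring formula that Elasticsearch prints in its explain output (shown later in this article) can be written out as below. This is only a transcription of that output into mathematical notation: f stands for freq, and the logarithm is taken to be the natural logarithm, which is consistent with the values reported there.

```latex
\begin{align*}
\mathrm{idf} &= \ln\!\left(1 + \frac{N - n + 0.5}{n + 0.5}\right) \\
\mathrm{tf} &= \frac{f}{f + k_1 \left(1 - b + b \cdot \frac{dl}{avgdl}\right)} \\
\mathrm{score} &= \mathrm{boost} \cdot \mathrm{idf} \cdot \mathrm{tf}
\end{align*}
```

Here N is the number of documents that have the field, n is the number of documents containing the term, dl is the length of the field in the scored document, and avgdl is the average length of that field.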

Consider the following two documents, shown as they are returned by a search:

{
    "_index" : "movies",
    "_type" : "_doc",
    "_id" : "321697",
    "_score" : 6.6273837,
    "_source" : {
        "title" : "Steve Jobs"
    }
}
{
    "_index" : "movies",
    "_type" : "_doc",
    "_id" : "23706",
    "_score" : 6.0948296,
    "_source" : {
        "title" : "All About Steve"
    }
}

They are returned by a match query on the title field:

GET /movies/_search
{
  "query": {
    "match": {
      "title": "steve"
    }
  }
}

The scores show that the first document ranks higher than the second. The reasons are as follows:

TF: the search term occurs once in the title field of both documents, so the raw term frequency is the same.

IDF: the inverse document frequency is the same, provided both documents are scored against the same index statistics.

TFNORM is different. In the first document the term makes up 1/2 of the title, and in the second document it makes up 1/3. The shorter title therefore gets the higher normalized term frequency, and the first document scores higher than the second. Elasticsearch applies this normalization in the denominator of the TF formula: freq + k1 * (1 - b + b * dl / avgdl).

In short, the TF component in Elasticsearch combines the term frequency with length normalization, as defined by BM25.
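
The effect of the title length can be checked with a few lines of Python. This is a minimal sketch, not Elasticsearch code: the helper name tf_norm is made up for illustration, k1, b and avgdl are taken from the explain output shown later in this article, and splitting on whitespace is only a crude stand-in for the analyzer's token count.

```python
K1 = 1.2            # term saturation parameter (value from the explain output)
B = 0.75            # length normalization parameter (value from the explain output)
AVGDL = 2.1474154   # average title length reported for the first document's shard


def tf_norm(freq: float, dl: int) -> float:
    """BM25 term frequency component: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + K1 * (1 - B + B * dl / AVGDL))


for title in ["Steve Jobs", "All About Steve"]:
    dl = len(title.split())  # assumption: one token per whitespace-separated word
    print(f"{title!r}: dl={dl}, tf={tf_norm(1, dl):.4f}")
```

With these inputs the two-word title gets a TF of about 0.468 and the three-word title about 0.391. The sketch assumes both documents share the same avgdl. The reported score of the second hit suggests that it was scored with slightly different statistics, so this illustrates the direction of the effect rather than reproducing 6.0948296.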

To see how Elasticsearch arrived at a score, add "explain": true to the search request:

GET /movies/_search
{
  // Returns the scoring details for each hit, similar to an execution plan (EXPLAIN) in MySQL
  "explain": true,
  "query": {
    "match": {
      "title": "steve"
    }
  }
}
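
The same request can also be sent from a script. The following is a minimal sketch using only the Python standard library. The address http://localhost:9200 and the helper names build_explain_request and fetch_explanations are assumptions for illustration, and the sketch assumes a cluster without authentication.

```python
import json
import urllib.request


def build_explain_request(base_url: str, index: str, field: str, text: str) -> urllib.request.Request:
    """Build a _search request for a match query with "explain" enabled."""
    body = {"explain": True, "query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_explanations(request: urllib.request.Request) -> dict:
    """Send the request and map each hit's _id to its _explanation tree."""
    with urllib.request.urlopen(request) as response:
        hits = json.load(response)["hits"]["hits"]
    return {hit["_id"]: hit["_explanation"] for hit in hits}


# Assumed local node; adjust the address for a real cluster.
request = build_explain_request("http://localhost:9200", "movies", "title", "steve")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling fetch_explanations(request) requires a running node with a movies index, so it is left out of the example run above.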

The response contains one entry per hit. Here is one of them:

{
    "_shard": "[movies][1]",
    "_node": "pqNhgutvQfqcLqLEzIDnbQ",
    "_index": "movies",
    "_type": "_doc",
    "_id": "321697",
    "_score": 6.6273837,
    "_source": {
        "overview": "Set backstage at three iconic product launches and ending in 1998 with the unveiling of the iMac, Steve Jobs takes us behind the scenes of the digital revolution to paint an intimate portrait of the brilliant man at its epicenter.",
        "voteAverage": 6.8,
        "keywords": [
            {
                "id": 5565,
                "name": "biography"
            },
            {
                "id": 6104,
                "name": "computer"
            },
            {
                "id": 15300,
                "name": "father daughter relationship"
            },
            {
                "id": 157935,
                "name": "apple computer"
            },
            {
                "id": 161160,
                "name": "steve jobs"
            },
            {
                "id": 185722,
                "name": "based on true events"
            }
        ],
        "releaseDate": "2015-01-01T00:00:00.000Z",
        "runtime": 122,
        "originalLanguage": "en",
        "title": "Steve Jobs",
        "productionCountries": [
            {
                "iso_3166_1": "US",
                "name": "United States of America"
            }
        ],
        "revenue": 34441873,
        "genres": [
            {
                "id": 18,
                "name": "Drama"
            },
            {
                "id": 36,
                "name": "History"
            }
        ],
        "originalTitle": "Steve Jobs",
        "popularity": 53.670525,
        "tagline": "Can a great man be a good man?",
        "spokenLanguages": [
            {
                "iso_639_1": "en",
                "name": "English"
            }
        ],
        "id": 321697,
        "voteCount": 1573,
        "productionCompanies": [
            {
                "name": "Universal Pictures",
                "id": 33
            },
            {
                "name": "Scott Rudin Productions",
                "id": 258
            },
            {
                "name": "Legendary Pictures",
                "id": 923
            },
            {
                "name": "The Mark Gordon Company",
                "id": 1557
            },
            {
                "name": "Management 360",
                "id": 4220
            },
            {
                "name": "Cloud Eight Films",
                "id": 6708
            }
        ],
        "budget": 30000000,
        "homepage": "http://www.stevejobsthefilm.com",
        "status": "Released"
    },
    "_explanation": { ... }
}

Compared with a search without explain, each hit now also contains the following data (the scoring explanation, comparable to an execution plan):

{
    "_explanation": {
        "value": 6.6273837,
        // Weight of the term steve in the title field for the document with internal Lucene doc ID 1526
        "description": "weight(title:steve in 1526) [PerFieldSimilarity], result of:",
        "details": [
            {
                // value = boost * idf * tf
                // 6.6273837 = 2.2 * 6.4412656 * 0.46767938
                "value": 6.6273837,
                "description": "score(freq=1.0), product of:",
                "details": [
                    {
                        "value": 2.2,
                        // Boost factor. In Elasticsearch 7.x this is the query boost (default 1.0) multiplied by (k1 + 1), which gives 2.2 with the default k1 = 1.2
                        "description": "boost",
                        "details": []
                    },
                    {
                        "value": 6.4412656,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                                "value": 2,
                                "description": "n, number of documents containing term",
                                "details": []
                            },
                            {
                                "value": 1567,
                                "description": "N, total number of documents with field",
                                "details": []
                            }
                        ]
                    },
                    {
                        "value": 0.46767938,
                        "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                        "details": [
                            {
                                "value": 1,
                                "description": "freq, occurrences of term within document",
                                "details": []
                            },
                            // k1 and b are the BM25 tuning parameters used in the denominator freq + k1 * (1 - b + b * dl / avgdl)
                            {
                                "value": 1.2,
                                "description": "k1, term saturation parameter",
                                "details": []
                            },
                            {
                                "value": 0.75,
                                "description": "b, length normalization parameter",
                                "details": []
                            },
                            // dl / avgdl is the length normalization: the length of this field relative to the average field length
                            {
                                "value": 2,
                                "description": "dl, length of field",
                                "details": []
                            },
                            {
                                "value": 2.1474154,
                                "description": "avgdl, average length of field",
                                "details": []
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
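
The numbers in this explanation can be checked by hand. The following Python sketch copies the leaf values from the output above, reads them back out of the tree, and recomputes idf, tf and the final score. The helper collect_leaves is made up for illustration, and the descriptions of the inner nodes are shortened to keep the example compact.

```python
import math

# "_explanation" for document 321697, copied from the output above.
explanation = {
    "value": 6.6273837,
    "description": "score(freq=1.0), product of:",
    "details": [
        {"value": 2.2, "description": "boost", "details": []},
        {"value": 6.4412656, "description": "idf", "details": [
            {"value": 2, "description": "n, number of documents containing term", "details": []},
            {"value": 1567, "description": "N, total number of documents with field", "details": []},
        ]},
        {"value": 0.46767938, "description": "tf", "details": [
            {"value": 1, "description": "freq, occurrences of term within document", "details": []},
            {"value": 1.2, "description": "k1, term saturation parameter", "details": []},
            {"value": 0.75, "description": "b, length normalization parameter", "details": []},
            {"value": 2, "description": "dl, length of field", "details": []},
            {"value": 2.1474154, "description": "avgdl, average length of field", "details": []},
        ]},
    ],
}


def collect_leaves(node: dict) -> dict:
    """Map each leaf's short name (the text before the first comma) to its value."""
    if not node["details"]:
        return {node["description"].split(",")[0]: node["value"]}
    leaves = {}
    for child in node["details"]:
        leaves.update(collect_leaves(child))
    return leaves


v = collect_leaves(explanation)
idf = math.log(1 + (v["N"] - v["n"] + 0.5) / (v["n"] + 0.5))
tf = v["freq"] / (v["freq"] + v["k1"] * (1 - v["b"] + v["b"] * v["dl"] / v["avgdl"]))
score = v["boost"] * idf * tf

print(f"idf={idf:.7f} tf={tf:.8f} score={score:.7f}")
print("matches reported score:", math.isclose(score, explanation["value"], rel_tol=1e-5))
```

The comparison uses a small tolerance because Elasticsearch reports single-precision values, while Python computes in double precision.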

Keywords: Web Development ElasticSearch MySQL

Added by millwardt on Fri, 21 Feb 2020 17:53:35 +0200