Term similarity algorithms in Elasticsearch

1, Term similarity

Elasticsearch supports spelling correction, and producing suggested words requires calculating the similarity between terms. In this article we look at term similarity through several distance algorithms.
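For context, Elasticsearch exposes spelling correction through its term suggester. A minimal request body might look like the sketch below; the suggestion name "spell-check" and the field name "title" are placeholders, not values from this article:

```python
import json

# Hypothetical term-suggester request body; "spell-check" is an arbitrary
# suggestion name and "title" a placeholder field name.
suggest_body = {
    "suggest": {
        "spell-check": {
            "text": "valn",
            "term": {"field": "title"}
        }
    }
}

print(json.dumps(suggest_body, indent=2))
```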

2, Data preparation

To calculate term similarity, we first need to turn terms into numeric vectors. A simple approach is character vectorization: map each character to a unique number, for which we can use the character's code point directly.

import numpy as np

def vectorize_words(words):
    # lower-case each word, then map every character to its code point
    lower_words = [word.lower() for word in words]
    return [np.array([ord(c) for c in word]) for word in lower_words]

vlan = 'vlan'
vlna = 'vlna'
vlan233 = 'vlan233'
http = 'http'

vlan_vector, vlna_vector, vlan233_vector, http_vector = vectorize_words([vlan, vlna, vlan233, http])
print(f'vlan_vector: {vlan_vector}')
print(f'vlna_vector: {vlna_vector}')
print(f'vlan233_vector: {vlan233_vector}')
print(f'http_vector: {http_vector}')

vlan_vector: [118 108  97 110]
vlna_vector: [118 108 110  97]
vlan233_vector: [118 108  97 110  50  51  51]
http_vector: [104 116 116 112]

3, Hamming distance

Hamming distance is a very popular distance measure, widely used in information theory and communications. It counts the number of positions at which two equal-length strings differ. For two terms u and v of length n, the Hamming distance is:

\[hd(u,v)=\sum_{i=1}^{n}(u_{i}\ne v_{i} ) \]

The normalized Hamming distance is obtained by dividing by the word length n:

\[norm\_hd(u,v) = \frac {\sum_{i=1}^{n}(u_{i}\ne v_{i} )} {n} \]

The following hamming_distance function calculates the Hamming distance; the parameter norm controls whether the result is normalized.

def hamming_distance(u, v, norm=True):
    if u.shape != v.shape:
        raise ValueError('the vectors must have equal lengths.')

    # mean of the boolean mask gives the normalized distance, sum gives the raw count
    return (u != v).mean() if norm else (u != v).sum()

We use the following code to calculate the Hamming distance between valn and the other three words.

From the output we can see that the smallest distance, between valn and vlan, is 2, or 0.5 after normalization.

vlan = 'vlan'
vlna = 'vlna'
http = 'http'
words = [vlan, vlna, http]


input_word = 'valn'
input_vector = vectorize_words([input_word]).pop()

word_vectors = vectorize_words(words)
for word, vector in zip(words, word_vectors):
    print(f'{input_word} and {word} hamming distance is {hamming_distance(input_vector, vector, norm=False)}')

print()
for word, vector in zip(words, word_vectors):
    print(f'{input_word} and {word} hamming distance is {hamming_distance(input_vector, vector)}')
    
    
    
valn and vlan hamming distance is 2
valn and vlna hamming distance is 3
valn and http hamming distance is 4

valn and vlan hamming distance is 0.5
valn and vlna hamming distance is 0.75
valn and http hamming distance is 1.0
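As mentioned above, in information and communication settings Hamming distance is usually computed on bit strings rather than characters. A minimal sketch using integer XOR (the helper name bit_hamming is my own):

```python
def bit_hamming(a: int, b: int) -> int:
    # differing bits survive the XOR, so the distance is the popcount of a ^ b
    return bin(a ^ b).count('1')

print(bit_hamming(0b1011, 0b1001))  # 1
print(bit_hamming(0b0000, 0b1111))  # 4
```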

4, Manhattan distance

Manhattan distance sums the absolute differences between the characters at each position of two strings. It is also known as city block distance, the L1 norm, or the taxicab metric.

For two words u and v of the same length n, the Manhattan distance is:

\[md(u,v)=\|u - v\|_{1} = \sum_{i=1}^{n}|u_{i} - v_{i}| \]

Similarly, dividing by the word length gives the normalized Manhattan distance:

\[norm\_md(u,v)=\frac {\|u - v\|_{1}} {n} = \frac {\sum_{i=1}^{n}|u_{i} - v_{i}|} {n} \]

We can use the following function to calculate the Manhattan distance, again using norm to control normalization.

def manhattan_distance(u, v, norm=True):
    if u.shape != v.shape:
        raise ValueError('the vectors must have equal lengths.')

    # mean of the absolute differences if normalized, otherwise their sum
    return abs(u - v).mean() if norm else abs(u - v).sum()

Using the same words, the following code calculates the Manhattan distance.

From the output we can see that the smallest distance, between valn and vlan, is 22, or 5.5 after normalization.

vlan = 'vlan'
vlna = 'vlna'
http = 'http'
words = [vlan, vlna, http]


input_word = 'valn'
input_vector = vectorize_words([input_word]).pop()
word_vectors = vectorize_words(words)
for word, vector in zip(words, word_vectors):
    print(f'{input_word} and {word} manhattan distance is {manhattan_distance(input_vector, vector, norm=False)}')

print()
for word, vector in zip(words, word_vectors):
    print(f'{input_word} and {word} manhattan distance is {manhattan_distance(input_vector, vector)}')
    
    
valn and vlan manhattan distance is 22
valn and vlna manhattan distance is 26
valn and http manhattan distance is 43

valn and vlan manhattan distance is 5.5
valn and vlna manhattan distance is 6.5
valn and http manhattan distance is 10.75
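As a cross-check on the results above, NumPy's built-in np.linalg.norm with ord=1 computes the same L1 (Manhattan) distance:

```python
import numpy as np

u = np.array([ord(c) for c in 'valn'])
v = np.array([ord(c) for c in 'vlan'])

# the L1 norm of the difference vector is exactly the Manhattan distance
print(np.linalg.norm(u - v, ord=1))  # 22.0
```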

5, Euclidean distance

Euclidean distance is the straight-line distance between two points; it is also known as the Euclidean norm, L2 norm, or L2 distance.

For two words u and v with the same length n, the mathematical formula of Euclidean distance is

\[ed(u,v)=\|u - v\|_{2} = \sqrt{\sum_{i=1}^{n}(u_{i} - v_{i})^2} \]

We use the following function to calculate the Euclidean distance:

def euclidean_distance(u, v):
    if u.shape != v.shape:
        raise ValueError('the vectors must have equal lengths.')

    # square root of the sum of squared differences (L2 norm)
    return np.sqrt(np.sum(np.square(u - v)))

For the same words, use the following code to calculate the Euclidean distance; again, vlan is the word closest to valn.

vlan = 'vlan'
vlna = 'vlna'
http = 'http'
words = [vlan, vlna, http]


input_word = 'valn'
input_vector = vectorize_words([input_word]).pop()
word_vectors = vectorize_words(words)
for word, vector in zip(words, word_vectors):
    print(f'{input_word} and {word} euclidean distance is {euclidean_distance(input_vector, vector)}')
    
valn and vlan euclidean distance is 15.556349186104045
valn and vlna euclidean distance is 17.146428199482248
valn and http euclidean distance is 25.0
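Putting it together, a toy suggester could simply pick the candidate with the smallest distance to the input word. The suggest helper below is a hypothetical sketch for illustration, not how Elasticsearch implements its suggesters; it uses the Euclidean distance via np.linalg.norm and only compares equal-length candidates:

```python
import numpy as np

def suggest(input_word, candidates):
    # rank equal-length candidates by Euclidean distance and return the closest
    u = np.array([ord(c) for c in input_word.lower()])
    scored = []
    for word in candidates:
        v = np.array([ord(c) for c in word.lower()])
        if v.shape == u.shape:
            scored.append((np.linalg.norm(u - v), word))
    return min(scored)[1]

print(suggest('valn', ['vlan', 'vlna', 'http']))  # vlan
```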

Added by jomama on Thu, 20 Jan 2022 18:10:51 +0200