1, Word item similarity
Elastic search supports spelling correction, and the acquisition of suggested words requires the calculation of word item similarity; Today, let's learn the word similarity algorithm through different distance algorithms;
2, Data preparation
To calculate the similarity of word items, we need to quantify the word items first; We can use the following two methods
Character vectorization, which maps each character into a unique number, we can directly use character coding;
import numpy as np def vectorize_words(words): lower_words = [word.lower() for word in words] words = [np.array([ord(c) for c in word]) for word in lower_words] return words vlan = 'valn' vlna = 'vlna' vlan233 ='vlan233' http='http' vlan_vector, vlna_vector, vlan233_vector, http_vector = vectorize_words([vlan, vlna, vlan233, http]) print(f'vlan_vector: {vlan_vector}') print(f'vlna_vector: {vlna_vector}') print(f'vlan233_vector: {vlan233_vector}') print(f'http_vector: {http_vector}') vlan_vector: [118 97 108 110] vlna_vector: [118 108 110 97] vlan233_vector: [118 108 97 110 50 51 51] http_vector: [104 116 116 112]
3, Hamming distance
Hamming distance is a very popular distance measurement method, which is widely used in the field of information and communication; It represents the number of different characters or symbols between two strings of equal length. Considering the two word items u and v with length n, the mathematical expression of Hamming distance is:
The normalized Hamming distance can also be calculated by dividing the total length of words
We use the following hamming_distance is used to calculate the Hamming editing distance, and the parameter norm is used to control whether normalization is carried out;
def hamming_distance(u, v, norm=True): if u.shape != v.shape: raise ValueError('the vector must have equal lenghts.') return (u!=v).mean() if norm else (u!=v).sum()
We use the following code to calculate the Hamming distance between valn and the other three words;
Through the output information, we can see that the minimum editing distance is 2 and 0.5 after normalization;
vlan = 'vlan' vlna = 'vlna' http='http' words = [vlan, vlna, http] input_word = 'valn' input_vector = vectorize_words([input_word]).pop() word_vectors = vectorize_words(words) for word, vector in zip(words, word_vectors): print(f'{input_word} and {word} hamming distance is {hamming_distance(input_vector, vector, norm=False)}') print() for word, vector in zip(words, word_vectors): print(f'{input_word} and {word} hamming distance is {hamming_distance(input_vector, vector)}') valn and vlan hamming distance is 2 valn and vlna hamming distance is 3 valn and http hamming distance is 4 valn and vlan hamming distance is 0.5 valn and vlna hamming distance is 0.75 valn and http hamming distance is 1.0
4, Manhattan distance
Manhattan distance mainly calculates the sum of the difference between the characters at each position of two strings; Manhattan distance is also known as city block distance, L1 norm and taxi measurement;
For two words u and v of the same length n, the mathematical formula of Manhattan distance is
Similarly, we can divide the length of the word to calculate the Manhattan planning distance
We can use the following methods to calculate Manhattan distance, and also use norm to control normalization;
def manhattan_distance(u, v, norm=True): if u.shape != v.shape: raise ValueError('the vector must have equal lenghts.') return abs(u-v).mean() if norm else abs(u-v).sum()
Using the same word, use the following code to calculate the Manhattan distance;
It can also be seen that the Manhattan distance between the smallest valn and vlan is 22, and after normalization, it is 5.5;
vlan = 'vlan' vlna = 'vlna' http='http' words = [vlan, vlna, http] input_word = 'valn' input_vector = vectorize_words([input_word]).pop() word_vectors = vectorize_words(words) for word, vector in zip(words, word_vectors): print(f'{input_word} and {word} manhattan distance is {manhattan_distance(input_vector, vector, norm=False)}') print() for word, vector in zip(words, word_vectors): print(f'{input_word} and {word} manhattan distance is {manhattan_distance(input_vector, vector)}') valn and vlan manhattan distance is 22 valn and vlna manhattan distance is 26 valn and http manhattan distance is 43 valn and vlan manhattan distance is 5.5 valn and vlna manhattan distance is 6.5 valn and http manhattan distance is 10.75
5, Euclidean distance
Euclidean distance calculates the shortest linear distance between two points, also known as Euclidean norm, L2 norm or L2 distance;
For two words u and v with the same length n, the mathematical formula of Euclidean distance is
We use the following method to calculate the Euclidean distance
def euclidean_distance(u, v): if u.shape != v.shape: raise ValueError('the vector must have equal lenghts.') return np.sqrt(np.sum(np.square(u - v)))
For the same keyword, use the following code to calculate the Euclidean distance;
vlan = 'vlan' vlna = 'vlna' http='http' words = [vlan, vlna, http] input_word = 'valn' input_vector = vectorize_words([input_word]).pop() word_vectors = vectorize_words(words) for word, vector in zip(words, word_vectors): print(f'{input_word} and {word} euclidean distance is {euclidean_distance(input_vector, vector)}') valn and vlan euclidean distance is 15.556349186104045 valn and vlna euclidean distance is 17.146428199482248 valn and http euclidean distance is 25.0