Similarity algorithm of recommendation system based on elastic search algorithm

1, Introduction to recommendation systems

A recommendation system analyzes and processes a user's historical behavior data to find content the user is likely to be interested in, and then actively recommends that content to the user.

From the perspective of the long tail theory of goods, the recommendation system discovers users' personalized needs by mining their behavior, so that long tail goods can be recommended precisely to the users who need them, helping users find items they are interested in but would otherwise struggle to discover.

Neighborhood based recommendation comes in two main flavors: the user based collaborative filtering algorithm and the item based collaborative filtering algorithm.

2, Dataset preparation

We use the MovieLens 1M dataset provided by GroupLens. These files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.

Load the ratings file ml-1m/ratings.dat with the following code:

import pandas as pd


def get_ratings():
    # ratings.dat format: UserID::MovieID::Rating::Timestamp
    ratings = pd.read_csv('ml-1m/ratings.dat',
                          sep='::',
                          engine='python',
                          names=['UserID', 'MovieID', 'Rating', 'Timestamp'],
                          nrows=100000)
    return ratings

We can view the first five rows of the data:

rating = get_ratings()
print(rating.head())

   UserID  MovieID  Rating  Timestamp
0       1     1193       5  978300760
1       1      661       3  978302109
2       1      914       3  978301968
3       1     3408       4  978300275
4       1     2355       5  978824291

To get a feel for how many movies each user has rated, plot the distribution of ratings over users:

import matplotlib.pyplot as plt

rating = get_ratings()
# Histogram of how the ratings are distributed across users
plt.hist(rating['UserID'], bins=100, edgecolor='black')
plt.show()
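The histogram gives a visual impression; to read the numbers off directly, a small alternative sketch counts the distinct movies rated per user with a groupby:

# Count how many distinct movies each user has rated
rating = get_ratings()
movies_per_user = rating.groupby('UserID')['MovieID'].nunique()
print(movies_per_user.describe())   # summary statistics of movies rated per user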

3, User based collaborative filtering algorithm

User based collaborative filtering is the oldest algorithm in recommendation systems. It first finds users whose interests are similar to those of the current user, and then recommends items those similar users are interested in, but that are not yet in the current user's interest list, to the current user.

The user based collaborative filtering algorithm consists of two main steps:

  1. Find a set of users with similar interests to the current user;

The algorithm calculates user interest similarity from users' historical positive feedback on items. Given users u and v, let N(u) denote the set of items on which user u has given positive feedback, and N(v) the set of items on which user v has given positive feedback. We can then measure the similarity of users u and v with the cosine similarity:

w_{uv} = \frac{|N(u) \cap N(v)|}{\sqrt{|N(u)|\,|N(v)|}}

For example, user U has positive feedback records on items a, b and c, and user V has positive feedback records on a and c:

U  a b c
V  a c

We can calculate the interest similarity of U and V by using cosine similarity

w_{UV} = \frac{|\{a,b,c\} \cap \{a,c\}|}{\sqrt{|\{a,b,c\}|\,|\{a,c\}|}} = \frac{2}{\sqrt{6}}
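The same calculation can be checked with a few lines of Python on the toy sets (a minimal sketch using plain Python sets):

import math

# Toy example: items with positive feedback from U and V
N_u = {'a', 'b', 'c'}
N_v = {'a', 'c'}

w_uv = len(N_u & N_v) / math.sqrt(len(N_u) * len(N_v))
print(w_uv)   # 0.8164965... == 2 / sqrt(6)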

We use the following user_similarity function to calculate the user similarity matrix:

import math

import numpy as np
import pandas as pd


def user_similarity(ratings):
    matrix = []
    rating_groups = ratings.groupby('UserID')
    for u_id, u_ratings in rating_groups:
        row = []
        matrix.append(row)
        u_movieIds = u_ratings['MovieID'].values
        for v_id, v_ratings in rating_groups:
            v_movieIds = v_ratings['MovieID'].values
            # w_uv = |N(u) ∩ N(v)| / sqrt(|N(u)| * |N(v)|)
            u_v_movieIds = np.intersect1d(u_movieIds, v_movieIds)
            similarity = len(u_v_movieIds) / math.sqrt(len(u_movieIds) * len(v_movieIds))
            row.append(similarity)

    user_ids = list(rating_groups.groups.keys())
    result = pd.DataFrame(matrix, columns=user_ids, index=user_ids)
    return result


rating = get_ratings()
similarity_matrix = user_similarity(rating)
print(similarity_matrix.head(10))

         1         2         3    ...       667       668       669
1   1.000000  0.084657  0.115406  ...  0.010504  0.068680  0.076194
2   0.084657  1.000000  0.147945  ...  0.087529  0.161416  0.048839
3   0.115406  0.147945  1.000000  ...  0.085666  0.070014  0.077674
4   0.119898  0.153704  0.152783  ...  0.083438  0.036370  0.000000
5   0.097618  0.125142  0.059708  ...  0.119562  0.142134  0.059131
6   0.163017  0.114939  0.099710  ...  0.063529  0.000000  0.032915
7   0.049341  0.284641  0.150899  ...  0.164817  0.179605  0.099627
8   0.116508  0.201633  0.083139  ...  0.090808  0.113092  0.023525
9   0.200125  0.162482  0.122407  ...  0.118842  0.178069  0.053877
10  0.240081  0.215441  0.216773  ...  0.126021  0.083229  0.096951

[10 rows x 669 columns]

In the code above, every pair of users is traversed directly to calculate their similarity. In practice, because the number of items is large and each user is interested in only a few of them, many user pairs have no items in common at all, so most of these calculations are wasted. A better approach is to first invert the data into an item-to-users table, then count the total number of items each user is interested in and the number of items shared by each pair of related users, and finally compute the cosine similarity from these counts.

import math

import pandas as pd


def change_user_ratings(rating):
    # Invert the ratings into an item -> users table
    grouped = rating.groupby('MovieID')
    result = {}
    for movieId, m_rating in grouped:
        result[movieId] = m_rating['UserID'].values

    df = pd.DataFrame({
        'MovieID': list(result.keys()),
        'UserIDs': list(result.values())
    })
    return df.set_index('MovieID')



def cal_count(product_users):
    # user_counts[u]          -> |N(u)|, the number of movies user u has rated
    # rel_user_counts[(u, v)] -> |N(u) ∩ N(v)|, the number of movies rated by both u and v
    user_counts = {}
    rel_user_counts = {}
    for movieId, row in product_users.iterrows():
        userIds = row['UserIDs']
        for uid in userIds:
            if uid not in user_counts:
                user_counts[uid] = 0
            user_counts[uid] += 1
            for vid in userIds:
                if (uid, vid) not in rel_user_counts:
                    rel_user_counts[(uid, vid)] = 0
                rel_user_counts[(uid, vid)] += 1

    user_counts = pd.DataFrame({'UserID': list(user_counts.keys()),
                                'Movie_Count': list(user_counts.values())}).set_index('UserID')
    rel_user_counts = pd.DataFrame({'Rel_UserID': list(rel_user_counts.keys()),
                                    'Movie_Count': list(rel_user_counts.values())}).set_index('Rel_UserID')
    return user_counts, rel_user_counts




def cosin_similarity(user_counts, rel_user_counts):
    result = []
    for u, u_row in user_counts.iterrows():
        row = []
        result.append(row)
        u_count = u_row['Movie_Count']
        for v, v_row in user_counts.iterrows():
            v_count = v_row['Movie_Count']
            if rel_user_counts.index.isin([(u, v)]).any():
                # w_uv = |N(u) ∩ N(v)| / sqrt(|N(u)| * |N(v)|)
                count = rel_user_counts.at[(u, v), 'Movie_Count']
                row.append(count / math.sqrt(u_count * v_count))
            else:
                # Users u and v have no movies in common
                row.append(0)

    return pd.DataFrame(result, index=user_counts.index.values, columns=user_counts.index.values)




def user_similarity(ratings):
    rating_users = change_user_ratings(ratings)
    user_counts, rel_user_counts = cal_count(rating_users)
    s = cosin_similarity(user_counts, rel_user_counts)
    return s


ratings = get_ratings()
similarity_matrix = user_similarity(ratings)
print(similarity_matrix.head(10))

         1         2         3    ...       667       668       669
1   1.000000  0.084657  0.115406  ...  0.010504  0.068680  0.076194
2   0.084657  1.000000  0.147945  ...  0.087529  0.161416  0.048839
3   0.115406  0.147945  1.000000  ...  0.085666  0.070014  0.077674
4   0.119898  0.153704  0.152783  ...  0.083438  0.036370  0.000000
5   0.097618  0.125142  0.059708  ...  0.119562  0.142134  0.059131
6   0.163017  0.114939  0.099710  ...  0.063529  0.000000  0.032915
7   0.049341  0.284641  0.150899  ...  0.164817  0.179605  0.099627
8   0.116508  0.201633  0.083139  ...  0.090808  0.113092  0.023525
9   0.200125  0.162482  0.122407  ...  0.118842  0.178069  0.053877
10  0.240081  0.215441  0.216773  ...  0.126021  0.083229  0.096951

[10 rows x 669 columns]

  2. Find the items that these similar users are interested in, but that the current user has not yet heard of, and recommend them to the current user;

Through the calculation above we obtain the user similarity matrix. Next we need to find the K users whose interests are most similar to the target user's, and recommend the items that these K users like, but that the target user has not interacted with, to the target user. For this we need a way to measure the target user's predicted interest in a specific item:

p(u,i) = \sum_{v \in S(u,K) \cap N(i)} w_{uv} r_{vi}

In this formula, S(u,K) is the set of the K users whose interests are closest to those of user u, N(i) is the set of users who have acted on item i, w_{uv} is the interest similarity between users u and v, and r_{vi} represents user v's interest in item i. Because the implicit feedback data comes from a single type of behavior, all r_{vi} = 1.
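The recommend function below scores candidates over all other users and weights by the normalized explicit rating; as a point of comparison, a minimal sketch that follows the formula literally, restricting the sum to the K most similar users and taking r_vi = 1, could look like this (predict_interest is a hypothetical helper, not part of the original code):

def predict_interest(ratings, matrix, u, K):
    # Score p(u, i) using only the K users most similar to u, with r_vi = 1
    neighbours = matrix.loc[u].drop(u).sort_values(ascending=False).head(K)
    seen = set(ratings.loc[ratings['UserID'] == u, 'MovieID'])
    scores = {}
    for v, w_uv in neighbours.items():
        for movie_id in ratings.loc[ratings['UserID'] == v, 'MovieID']:
            if movie_id not in seen:                      # skip items u already knows
                scores[movie_id] = scores.get(movie_id, 0.0) + w_uv
    return pd.Series(scores).sort_values(ascending=False)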

We implement this recommendation with the following recommend function:

def recommend(ratings, u, matrix, k):
    result = pd.Series(dtype='float64')
    grouped = dict(list(ratings.groupby('UserID')))
    u_ratings = grouped[u][['MovieID', 'Rating']]
    # Walk through the other users in descending order of similarity to u
    row = matrix.loc[u].sort_values(ascending=False)
    for v in row.index:
        if u != v:
            similarity = row.loc[v]
            v_ratings = grouped[v][['MovieID', 'Rating']]
            # Keep only the movies v has rated but u has not
            diff = pd.concat([v_ratings, u_ratings, u_ratings]).drop_duplicates(subset=['MovieID'], keep=False)
            for movieId, s_rating in diff.set_index('MovieID').iterrows():
                # Weight v's (normalized) rating by the similarity between u and v
                like = similarity * (s_rating['Rating'] / 5)
                s_movieId = str(movieId)
                if s_movieId in result:
                    result[s_movieId] += like
                else:
                    result[s_movieId] = like

    return result.sort_values(ascending=False).head(k)
    

We now compute the top 10 recommendations for user 1, i.e. the movies this user is predicted to be most interested in:

ratings = get_ratings()
similarity_matrix = user_similarity(ratings)
recommend_movies = recommend(ratings, 1, similarity_matrix, 10)
print(recommend_movies.head(10))

2049    0.240081
3292    0.212965
1067    0.204131
2559    0.193922
3620    0.168068
963     0.168068
2179    0.165928
2211    0.165928
1817    0.165928
2227    0.165928
dtype: float64
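The recommendation is returned as MovieIDs. To see which movies these are, we can join against ml-1m/movies.dat (format MovieID::Title::Genres); a small sketch, assuming that file sits next to ratings.dat:

def get_movies():
    # movies.dat format: MovieID::Title::Genres (the ml-1m files are latin-1 encoded)
    return pd.read_csv('ml-1m/movies.dat',
                       sep='::',
                       engine='python',
                       encoding='latin-1',
                       names=['MovieID', 'Title', 'Genres'])


movies = get_movies().set_index('MovieID')
recommended_ids = recommend_movies.index.astype(int)   # the scores are keyed by MovieID strings
print(movies.reindex(recommended_ids)['Title'])        # titles in recommendation order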

Note: in this ratings data, each user has rated quite a few movies and most movies have been rated by someone, so relatively few user pairs have an empty intersection; as a result, the second algorithm does not necessarily bring any performance advantage here. Which implementation to use should be decided based on the actual sparsity of your own data.
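If the pairwise Python loops become too slow on the full dataset, one common alternative (not used in the code above) is to build a sparse binary user-item matrix and obtain all co-occurrence counts |N(u) ∩ N(v)| with a single sparse matrix product. A minimal sketch, assuming scipy is available:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def user_similarity_sparse(ratings):
    # Encode UserID / MovieID as contiguous integer codes
    users = ratings['UserID'].astype('category')
    movies = ratings['MovieID'].astype('category')
    # Binary user-item matrix: 1 where the user rated the movie
    m = csr_matrix((np.ones(len(ratings)), (users.cat.codes, movies.cat.codes)))
    overlap = (m @ m.T).toarray()          # |N(u) ∩ N(v)| for every user pair
    counts = overlap.diagonal()            # |N(u)| on the diagonal
    sim = overlap / np.sqrt(np.outer(counts, counts))
    return pd.DataFrame(sim, index=users.cat.categories, columns=users.cat.categories)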

