Collaborative filtering based on a regression model
If we treat the rating as a continuous value rather than a discrete one, we can predict the target user's rating of an item using the idea of linear regression. One such implementation strategy is called Baseline (benchmark prediction).
1. Baseline: benchmark prediction
The Baseline design idea is based on the following assumptions:
- Some users generally rate higher than others, and some generally rate lower. For example, some users are naturally generous and easygoing, readily giving high marks, while others are more demanding and never give more than 3 points (out of 5);
- Some items are generally rated higher than others, and some lower. For example, an item's standing is largely determined as soon as it is released: some are widely popular, while others are widely disliked.
This tendency of a user or an item to sit above or below the overall average is called the bias.
2. Baseline objectives:
- Find the bias value $b_u$ by which each user's ratings are generally higher or lower than other users'
- Find the bias value $b_i$ by which each item's ratings are generally higher or lower than other items'
- The goal is to find the optimal $b_u$ and $b_i$
The steps for predicting a rating with the Baseline approach are as follows:
- Compute the average rating $\mu$ of all movies (i.e., the global mean rating)
- Compute each user's offset $b_u$ between that user's ratings and the average rating $\mu$
- Compute each movie's offset $b_i$ between that movie's ratings and the average rating $\mu$
- Predict the user's rating of a movie:

$$\hat{r}_{ui} = b_{ui} = \mu + b_u + b_i$$
An example:
Suppose we want to use Baseline to predict user A's rating of the movie "Forrest Gump". First, the average rating of the whole dataset is computed: $\mu = 3.5$. User A is a harsh user whose ratings run about 0.5 points below the average, so user A's offset is $b_u = -0.5$. "Forrest Gump" is a popular and highly praised film whose ratings run about 1.2 points above the average, so its offset is $b_i = +1.2$. The predicted rating of user A for "Forrest Gump" is therefore $3.5 + (-0.5) + 1.2 = 4.2$ points.
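The same computation as a minimal Python sketch (the numbers are the hypothetical values from this example):

```python
mu = 3.5    # global mean rating (hypothetical value from the example)
b_u = -0.5  # user A's bias: rates about 0.5 below average (hypothetical)
b_i = 1.2   # bias of "Forrest Gump": rated about 1.2 above average (hypothetical)

r_hat = mu + b_u + b_i  # baseline prediction: 3.5 - 0.5 + 1.2
print(r_hat)            # 4.2
```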
The global mean $\mu$ over all movies can be computed directly, so the problem becomes finding the value of $b_u$ for each user and $b_i$ for each movie. As with linear regression, we can use the squared error to construct a loss function:
$$Cost = \sum_{u,i\in R}(r_{ui}-\hat{r}_{ui})^2 = \sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)^2$$
Add L2 regularization:
$$Cost = \sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)^2 + \lambda\left(\sum_u b_u^2 + \sum_i b_i^2\right)$$
Formula analysis:
- The first part of the formula, $\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)^2$, is used to find the $b_u$ and $b_i$ that best fit the known rating data
- The second part, $\lambda\left(\sum_u b_u^2 + \sum_i b_i^2\right)$, is a regularization term used to avoid overfitting
To minimize this loss, we generally use either stochastic gradient descent or alternating least squares.
3. Optimization methods
Method 1: stochastic gradient descent optimization
Use the stochastic gradient descent optimization algorithm to predict the Baseline biases.
step 1: derivation of gradient descent method
Loss function:
$$\begin{aligned} J(\theta) &= Cost = f(b_u, b_i)\\ J(\theta) &= \sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)^2 + \lambda\left(\sum_u b_u^2 + \sum_i b_i^2\right) \end{aligned}$$
The general gradient descent parameter update formula:
$$\theta_j := \theta_j - \alpha\cfrac{\partial}{\partial \theta_j}J(\theta)$$
Gradient descent update for $b_u$:
Take the partial derivative of the loss function:
$$\begin{aligned} \cfrac{\partial}{\partial b_u} J(\theta) &= \cfrac{\partial}{\partial b_u} f(b_u, b_i)\\ &= 2\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)(-1) + 2\lambda b_u \\ &= -2\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i) + 2\lambda b_u \end{aligned}$$
Update of $b_u$ (since $\alpha$ is set manually, the factor of 2 can be omitted):
$$\begin{aligned} b_u &:= b_u - \alpha\left(-\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i) + \lambda b_u\right)\\ &:= b_u + \alpha\left(\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i) - \lambda b_u\right) \end{aligned}$$
Similarly, the gradient descent update for $b_i$:
$$b_i := b_i + \alpha\left(\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i) - \lambda b_i\right)$$
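The updates above are full-batch: each step sums the residuals over all known ratings before touching $b_u$ and $b_i$. As an illustrative sketch only (not part of the original listing), assuming the ratings are available as (uid, iid, rating) tuples and the biases as dicts keyed by id:

```python
def full_batch_step(ratings, mu, bu, bi, alpha=0.01, reg=0.1):
    """One full-batch gradient step for the Baseline biases (illustrative sketch)."""
    # Accumulate the residual sum for every user and every item
    grad_u = {u: 0.0 for u in bu}
    grad_i = {i: 0.0 for i in bi}
    for uid, iid, r in ratings:
        err = r - mu - bu[uid] - bi[iid]
        grad_u[uid] += err
        grad_i[iid] += err
    # Apply b := b + alpha * (sum of residuals - lambda * b), as derived above
    for u in bu:
        bu[u] += alpha * (grad_u[u] - reg * bu[u])
    for i in bi:
        bi[i] += alpha * (grad_i[i] - reg * bi[i])
    return bu, bi
```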
step 2: stochastic gradient descent
Stochastic gradient descent essentially updates the parameters with the loss of each individual sample, rather than computing the total loss over all samples each time. When using SGD:
Single-sample error:
$$\begin{aligned} error &= r_{ui}-\hat{r}_{ui} \\ &= r_{ui}-(\mu+b_u+b_i) \\ &= r_{ui}-\mu-b_u-b_i \end{aligned}$$
Parameter update:
$$\begin{aligned} b_u &:= b_u + \alpha\left((r_{ui}-\mu-b_u-b_i) - \lambda b_u\right) \\ &:= b_u + \alpha\,(error - \lambda b_u) \\ b_i &:= b_i + \alpha\left((r_{ui}-\mu-b_u-b_i) - \lambda b_i\right)\\ &:= b_i + \alpha\,(error - \lambda b_i) \end{aligned}$$
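To make one SGD step concrete, here is a minimal numeric sketch, reusing the hypothetical values from the earlier example ($\mu = 3.5$, $b_u = -0.5$, $b_i = 1.2$) and an assumed observed rating of 4.0:

```python
alpha, reg = 0.1, 0.1          # learning rate and regularization (hypothetical)
mu, bu, bi = 3.5, -0.5, 1.2    # global mean and current biases (hypothetical)
r_ui = 4.0                     # assumed observed rating for this sample

error = r_ui - (mu + bu + bi)      # 4.0 - 4.2 = -0.2
bu += alpha * (error - reg * bu)   # -0.5 + 0.1 * (-0.2 + 0.05) = -0.515
bi += alpha * (error - reg * bi)   # 1.2 + 0.1 * (-0.2 - 0.12) = 1.168
```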
Dataset link: MovieLens Latest Datasets Small
step 3: algorithm implementation
```python
import pandas as pd
import numpy as np


class BaselineCFBySGD(object):

    def __init__(self, number_epochs, alpha, reg, columns=None):
        """
        :param number_epochs: maximum number of gradient descent iterations
        :param alpha: learning rate
        :param reg: regularization parameter
        :param columns: names of the user-item-rating fields in the dataset
        """
        if columns is None:
            columns = ["uid", "iid", "rating"]
        self.number_epochs = number_epochs
        self.alpha = alpha
        self.reg = reg
        self.columns = columns

    def fit(self, dataset):
        """
        :param dataset: user rating data
        """
        self.dataset = dataset
        # Ratings grouped by user
        self.users_ratings = dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # Ratings grouped by item
        self.items_ratings = dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # Global mean rating
        self.global_mean = self.dataset[self.columns[2]].mean()
        # Train the model parameters with SGD
        self.bu, self.bi = self.sgd()

    def sgd(self):
        """
        Optimize the values of bu and bi with stochastic gradient descent
        :return: bu, bi
        """
        # Initialize all bu and bi values to 0
        bu = dict(zip(self.users_ratings.index, np.zeros(len(self.users_ratings))))
        bi = dict(zip(self.items_ratings.index, np.zeros(len(self.items_ratings))))
        for i in range(self.number_epochs):
            print("iter%d start:" % i)
            for uid, iid, real_rating in self.dataset.itertuples(index=False):
                error = real_rating - (self.global_mean + bu[uid] + bi[iid])
                bu[uid] += self.alpha * (error - self.reg * bu[uid])
                bi[iid] += self.alpha * (error - self.reg * bi[iid])
        return bu, bi

    def predict(self, uid, iid):
        predict_rating = self.global_mean + self.bu[uid] + self.bi[iid]
        return predict_rating


if __name__ == '__main__':
    # Load only the first three columns: userId, movieId, rating
    dtype = {"userId": np.int32, "movieId": np.int32, "rating": np.float32}
    dataset = pd.read_csv("ratings.csv", usecols=range(3), dtype=dtype)

    bcf = BaselineCFBySGD(10, 0.1, 0.1, columns=["userId", "movieId", "rating"])
    bcf.fit(dataset)
    print(bcf.predict(1, 1))
```
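One caveat worth noting: sgd() unpacks each row of itertuples(index=False) positionally as (uid, iid, rating), so the DataFrame's column order must match the columns argument (here userId, movieId, rating); usecols=range(3) preserves that order when loading ratings.csv.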
step 4: accuracy metric evaluation
- Add a test method to the class, then compute the accuracy metrics with the accuracy function implemented below
```python
import pandas as pd
import numpy as np


def data_split(data_path, x=0.8, random=False):
    """
    Split the dataset. To keep the set of users unchanged, each user's
    ratings are split proportionally.
    :param data_path: path to the dataset
    :param x: proportion used for training, e.g. x=0.8 leaves 0.2 for testing
    :param random: whether to split randomly
    :return: trainset, testset
    """
    print("Start splitting dataset...")
    # Types of the data fields to load
    dtype = {'userId': np.int32, 'movieId': np.int32, 'rating': np.float32}
    # Load only the first three columns: user id, movie id, and the user's rating of the movie
    ratings = pd.read_csv(data_path, dtype=dtype, usecols=range(3))
    # Indices of the test set rows
    testset_index = []
    # Aggregate by userId so that every user has data in both the training and test sets
    for uid in ratings.groupby('userId').any().index:
        user_rating_data = ratings.where(ratings['userId'] == uid).dropna()
        if random:
            index = list(user_rating_data.index)
            np.random.shuffle(index)
            _index = round(len(user_rating_data) * x)
            testset_index += list(index[_index:])
        else:
            index = round(len(user_rating_data) * x)
            testset_index += list(user_rating_data.index.values[index:])
    testset = ratings.loc[testset_index]
    trainset = ratings.drop(testset_index)
    print("Finished splitting dataset...")
    return trainset, testset


def accuracy(predict_results, method="all"):
    """
    Accuracy metrics
    :param predict_results: iterable of (uid, iid, real_rating, pred_rating) tuples
    :param method: "rmse" or "mae"; any other value returns both rmse and mae
    :return: metric value(s)
    """
    def rmse(predict_results):
        length = 0
        _rmse_sum = 0
        for uid, iid, real_rating, pred_rating in predict_results:
            length += 1
            _rmse_sum += (pred_rating - real_rating) ** 2
        return round(np.sqrt(_rmse_sum / length), 4)

    def mae(predict_results):
        length = 0
        _mae_sum = 0
        for uid, iid, real_rating, pred_rating in predict_results:
            length += 1
            _mae_sum += abs(pred_rating - real_rating)
        return round(_mae_sum / length, 4)

    def rmse_mae(predict_results):
        length = 0
        _rmse_sum = 0
        _mae_sum = 0
        for uid, iid, real_rating, pred_rating in predict_results:
            length += 1
            _rmse_sum += (pred_rating - real_rating) ** 2
            _mae_sum += abs(pred_rating - real_rating)
        return round(np.sqrt(_rmse_sum / length), 4), round(_mae_sum / length, 4)

    if method.lower() == "rmse":
        return rmse(predict_results)
    elif method.lower() == "mae":
        return mae(predict_results)
    else:
        return rmse_mae(predict_results)


class BaselineCFBySGD(object):

    def __init__(self, number_epochs, alpha, reg, columns=None):
        """
        :param number_epochs: maximum number of gradient descent iterations
        :param alpha: learning rate
        :param reg: regularization parameter
        :param columns: names of the user-item-rating fields in the dataset
        """
        if columns is None:
            columns = ["uid", "iid", "rating"]
        self.number_epochs = number_epochs
        self.alpha = alpha
        self.reg = reg
        self.columns = columns

    def fit(self, dataset):
        """
        :param dataset: user rating data
        """
        self.dataset = dataset
        # Ratings grouped by user
        self.users_ratings = dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # Ratings grouped by item
        self.items_ratings = dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # Global mean rating
        self.global_mean = self.dataset[self.columns[2]].mean()
        # Train the model parameters with SGD
        self.bu, self.bi = self.sgd()

    def sgd(self):
        """
        Optimize the values of bu and bi with stochastic gradient descent
        :return: bu, bi
        """
        # Initialize all bu and bi values to 0
        bu = dict(zip(self.users_ratings.index, np.zeros(len(self.users_ratings))))
        bi = dict(zip(self.items_ratings.index, np.zeros(len(self.items_ratings))))
        for i in range(self.number_epochs):
            print("iter%d start:" % i)
            for uid, iid, real_rating in self.dataset.itertuples(index=False):
                error = real_rating - (self.global_mean + bu[uid] + bi[iid])
                bu[uid] += self.alpha * (error - self.reg * bu[uid])
                bi[iid] += self.alpha * (error - self.reg * bi[iid])
        return bu, bi

    def predict(self, uid, iid):
        # Rating prediction
        if iid not in self.items_ratings.index:
            raise Exception("Cannot predict the rating of user <{uid}> for movie <{iid}>: "
                            "the training set has no data for <{iid}>".format(uid=uid, iid=iid))
        predict_rating = self.global_mean + self.bu[uid] + self.bi[iid]
        return predict_rating

    def test(self, testset):
        # Predict over the test set
        for uid, iid, real_rating in testset.itertuples(index=False):
            try:
                pred_rating = self.predict(uid, iid)
            except Exception as e:
                print(e)
            else:
                yield uid, iid, real_rating, pred_rating


if __name__ == '__main__':
    trainset, testset = data_split("ratings.csv", random=True)

    bcf = BaselineCFBySGD(20, 0.1, 0.1, ["userId", "movieId", "rating"])
    bcf.fit(trainset)

    pred_results = bcf.test(testset)
    rmse, mae = accuracy(pred_results)
    print("rmse: ", rmse, "mae: ", mae)
```
Method 2: alternating least squares optimization
Use the alternating least squares optimization algorithm to predict the Baseline biases.
step 1: derivation of alternating least squares method
Like gradient descent, the least squares method can be used to find extrema.
The idea of least squares: take the partial derivative of the loss function, then set the partial derivative to 0.
Similarly, the loss function:
$$J(\theta)=\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)^2 + \lambda\left(\sum_u b_u^2 + \sum_i b_i^2\right)$$
Partial derivative of loss function:
$$\cfrac{\partial}{\partial b_u} f(b_u, b_i) = -2\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i) + 2\lambda b_u$$
Setting the partial derivative to 0 gives:
$$\begin{aligned} \sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i) &= \lambda b_u \\ \sum_{u,i\in R}(r_{ui}-\mu-b_i) &= \sum_{u,i\in R} b_u + \lambda b_u \end{aligned}$$
To simplify the formula, let $\sum_{u,i\in R} b_u \approx |R(u)| \cdot b_u$, i.e., directly assume the offset is the same in every term of the sum. We then get:
$$b_u := \cfrac{\sum_{u,i\in R}(r_{ui}-\mu-b_i)}{\lambda_1 + |R(u)|}$$
where $|R(u)|$ denotes the number of ratings user $u$ has given.
Similarly:
$$b_i := \cfrac{\sum_{u,i\in R}(r_{ui}-\mu-b_u)}{\lambda_2 + |R(i)|}$$
where $|R(i)|$ denotes the number of ratings item $i$ has received.
Since $b_u$ and $b_i$ are the offsets of users and items respectively, their regularization can use two independent parameters, $\lambda_1$ and $\lambda_2$.
step 2: application of alternating least squares method
The least squares derivation yields expressions for $b_u$ and $b_i$, but each expression contains the other. We therefore use a method called alternating least squares (ALS) to compute their values:
- When solving for one of them, first fix the other unknown parameters, i.e., treat the other unknowns as known
- When solving for $b_u$, treat $b_i$ as known; when solving for $b_i$, treat $b_u$ as known. Updating the two alternately and repeatedly yields the final values. This is alternating least squares (ALS); a sketch of one sweep follows under step 3.
step 3: algorithm implementation
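The complete program, combined with the evaluation code, appears under step 4 below. The core of its als() method is the alternation of the two closed-form updates; a minimal sketch of one sweep, assuming the per-item and per-user rating lists produced by the groupby in fit():

```python
def als_sweep(users_ratings, items_ratings, global_mean, bu, bi, reg_bu=0.1, reg_bi=0.1):
    """One alternating sweep: update every bi with bu fixed, then every bu with bi fixed."""
    # items_ratings / users_ratings are the groupby-agg([list]) frames built in fit():
    # each row holds an id plus the lists of counterpart ids and ratings.
    for iid, uids, ratings in items_ratings.itertuples(index=True):
        residual = sum(r - global_mean - bu[uid] for uid, r in zip(uids, ratings))
        bi[iid] = residual / (reg_bi + len(uids))   # closed-form bi update
    for uid, iids, ratings in users_ratings.itertuples(index=True):
        residual = sum(r - global_mean - bi[iid] for iid, r in zip(iids, ratings))
        bu[uid] = residual / (reg_bu + len(iids))   # closed-form bu update
    return bu, bi
```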
step 4: accuracy metric evaluation
```python
import pandas as pd
import numpy as np


def data_split(data_path, x=0.8, random=False):
    """
    Split the dataset. To keep the set of users unchanged, each user's
    ratings are split proportionally.
    :param data_path: path to the dataset
    :param x: proportion used for training, e.g. x=0.8 leaves 0.2 for testing
    :param random: whether to split randomly
    :return: trainset, testset
    """
    print("Start splitting dataset...")
    # Types of the data fields to load
    dtype = {'userId': np.int32, 'movieId': np.int32, 'rating': np.float32}
    # Load only the first three columns: user id, movie id, and the user's rating of the movie
    ratings = pd.read_csv(data_path, dtype=dtype, usecols=range(3))
    # Indices of the test set rows
    testset_index = []
    # Aggregate by userId so that every user has data in both the training and test sets
    for uid in ratings.groupby('userId').any().index:
        user_rating_data = ratings.where(ratings['userId'] == uid).dropna()
        if random:
            index = list(user_rating_data.index)
            np.random.shuffle(index)
            _index = round(len(user_rating_data) * x)
            testset_index += list(index[_index:])
        else:
            index = round(len(user_rating_data) * x)
            testset_index += list(user_rating_data.index.values[index:])
    testset = ratings.loc[testset_index]
    trainset = ratings.drop(testset_index)
    print("Finished splitting dataset...")
    return trainset, testset


def accuracy(predict_results, method="all"):
    """
    Accuracy metrics
    :param predict_results: iterable of (uid, iid, real_rating, pred_rating) tuples
    :param method: "rmse" or "mae"; any other value returns both rmse and mae
    :return: metric value(s)
    """
    def rmse(predict_results):
        length = 0
        _rmse_sum = 0
        for uid, iid, real_rating, pred_rating in predict_results:
            length += 1
            _rmse_sum += (pred_rating - real_rating) ** 2
        return round(np.sqrt(_rmse_sum / length), 4)

    def mae(predict_results):
        length = 0
        _mae_sum = 0
        for uid, iid, real_rating, pred_rating in predict_results:
            length += 1
            _mae_sum += abs(pred_rating - real_rating)
        return round(_mae_sum / length, 4)

    def rmse_mae(predict_results):
        length = 0
        _rmse_sum = 0
        _mae_sum = 0
        for uid, iid, real_rating, pred_rating in predict_results:
            length += 1
            _rmse_sum += (pred_rating - real_rating) ** 2
            _mae_sum += abs(pred_rating - real_rating)
        return round(np.sqrt(_rmse_sum / length), 4), round(_mae_sum / length, 4)

    if method.lower() == "rmse":
        return rmse(predict_results)
    elif method.lower() == "mae":
        return mae(predict_results)
    else:
        return rmse_mae(predict_results)


class BaselineCFByALS(object):

    def __init__(self, number_epochs, reg_bu, reg_bi, columns=None):
        """
        :param number_epochs: maximum number of iterations
        :param reg_bu: regularization parameter for the user biases
        :param reg_bi: regularization parameter for the item biases
        :param columns: names of the user-item-rating fields in the dataset
        """
        if columns is None:
            columns = ["uid", "iid", "rating"]
        self.number_epochs = number_epochs
        self.reg_bu = reg_bu
        self.reg_bi = reg_bi
        self.columns = columns

    def fit(self, dataset):
        """
        :param dataset: user rating data
        """
        self.dataset = dataset
        # Ratings grouped by user
        self.users_ratings = dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # Ratings grouped by item
        self.items_ratings = dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # Global mean rating
        self.global_mean = self.dataset[self.columns[2]].mean()
        # Train the model parameters with ALS
        self.bu, self.bi = self.als()

    def als(self):
        """
        Optimize the values of bu and bi with alternating least squares
        :return: bu, bi
        """
        # Initialize all bu and bi values to 0
        bu = dict(zip(self.users_ratings.index, np.zeros(len(self.users_ratings))))
        bi = dict(zip(self.items_ratings.index, np.zeros(len(self.items_ratings))))
        for i in range(self.number_epochs):
            print("iter%d" % i)
            # Update bi for every item, holding all bu fixed
            for iid, uids, ratings in self.items_ratings.itertuples(index=True):
                _sum = 0
                for uid, rating in zip(uids, ratings):
                    _sum += rating - self.global_mean - bu[uid]
                bi[iid] = _sum / (self.reg_bi + len(uids))
            # Update bu for every user, holding all bi fixed
            for uid, iids, ratings in self.users_ratings.itertuples(index=True):
                _sum = 0
                for iid, rating in zip(iids, ratings):
                    _sum += rating - self.global_mean - bi[iid]
                bu[uid] = _sum / (self.reg_bu + len(iids))
        return bu, bi

    def predict(self, uid, iid):
        # Rating prediction
        if iid not in self.items_ratings.index:
            raise Exception("Cannot predict the rating of user <{uid}> for movie <{iid}>: "
                            "the training set has no data for <{iid}>".format(uid=uid, iid=iid))
        predict_rating = self.global_mean + self.bu[uid] + self.bi[iid]
        return predict_rating

    def test(self, testset):
        # Predict over the test set
        for uid, iid, real_rating in testset.itertuples(index=False):
            try:
                pred_rating = self.predict(uid, iid)
            except Exception as e:
                print(e)
            else:
                yield uid, iid, real_rating, pred_rating


if __name__ == '__main__':
    trainset, testset = data_split("ratings.csv", random=True)

    bcf = BaselineCFByALS(20, 0.1, 0.1, ["userId", "movieId", "rating"])
    bcf.fit(trainset)

    pred_results = bcf.test(testset)
    rmse, mae = accuracy(pred_results)
    print("rmse: ", rmse, "mae: ", mae)
```