Hands-on | A real case of data analysis -- user behavior prediction

Produced by CDA Data Analyst | Author: CDA Teaching and Research Group | Editor: Mika

Case introduction

Background: Using the user behavior data of a large e-commerce platform as the data set, we analyze user behavior characteristics on massive data with big data processing techniques, and predict user behavior by building a logistic regression model and a random forest.

Case ideas:

  • Use big data processing technology to read the massive data set
  • Preprocess the massive data
  • Extract a subset of the data to debug the model
  • Build the model on the full data set
# Display all outputs of a cell, not only the last expression
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Data dictionary:

U_Id: the serialized ID that represents a user

T_Id: the serialized ID that represents an item

C_Id: the serialized ID that represents the category which the corresponding item belongs to

Ts: the timestamp of the behavior

Be_type: enum-type from ('pv', 'buy', 'cart', 'fav')

pv: Page view of an item's detail page, equivalent to an item click

buy: Purchase an item

cart: Add an item to the shopping cart

fav: Favor an item

Read data

The key here is to use the dask library to process the massive data set; on data of this scale, many of its operations run roughly ten times faster than conventional pandas and similar libraries.

Pandas is popular and powerful for analyzing structured data, but its biggest limitation is that scalability was not part of its design. Pandas is well suited to small structured data sets and is highly optimized to perform fast, efficient operations on data stored in memory. However, as the amount of data grows substantially, a single machine can no longer read it all, and cluster-style processing becomes the better choice. This is where the Dask DataFrame API comes in: by providing a wrapper around pandas, it intelligently splits a huge DataFrame into smaller pieces, spreads them across multiple workers, and keeps them on disk instead of in RAM.

A Dask DataFrame is divided into multiple parts, each called a partition. Each partition is a relatively small pandas DataFrame that can be assigned to any worker and, if needed, replicated in full. Operations are carried out on each partition in parallel or separately (and can also be parallelized across multiple machines), and the results are then merged. Intuitively, this is exactly what dask has to do.
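As a minimal illustration of this partitioning idea (not part of the original case code), the sketch below loads the same CSV with an explicit blocksize and counts rows partition by partition:

import dask.dataframe as dd

# Each ~64MB chunk of the CSV becomes one partition, i.e. one small pandas DataFrame
ddf = dd.read_csv('UserBehavior_all.csv', blocksize='64MB')
print(ddf.npartitions)

# map_partitions applies a function to every partition in parallel;
# compute() then gathers the per-partition lengths, which are summed into the total row count
print(ddf.map_partitions(len).compute().sum())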

# Installation Library (Tsinghua image)
# pip install dask -i https://pypi.tuna.tsinghua.edu.cn/simple
import os
import gc # Garbage collection interface
from tqdm import tqdm # Progress bar library
import dask # Parallel computing interface
from dask.diagnostics import ProgressBar
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import dask.dataframe as dd # dask data table processing library
import sys # External parameter acquisition interface

When working with massive data, you can add a gc.collect() call after running each block to reclaim memory fragments. Dask DataFrames and pandas DataFrames share the same API.

gc.collect()
42
# Load the data
# If necessary, the blocksize= parameter can be set to specify the partition size manually;
# the default is 64MB (it should be set to a multiple of the block size, otherwise reading slows down)
data = dd.read_csv('UserBehavior_all.csv')
data.head()
data

Dask DataFrame Structure :


Dask Name: read-csv, 58 tasks

Unlike pandas, we only get the structure of the DataFrame here, not the actual data. Dask has split the DataFrame into several blocks, which live on disk rather than in RAM. To display the full DataFrame, all of the blocks first have to be loaded into RAM and stitched together; .compute() forces dask to do so, and nothing is materialized without it. In other words, dask uses a lazy loading mechanism, similar to Python's iterators: the data is only actually loaded when it is needed.

# Actually load the data
data.compute()
# Visualize the task graph: 58 partition tasks
data.visualize()

Data preprocessing

Data compression

# View the current data types
data.dtypes
U_Id        int64
T_Id        int64
C_Id        int64
Be_type    object
Ts          int64
dtype: object
# Compress the ID columns to 32-bit unsigned integers, since the transaction data has no negative values
dtypes = {
    'U_Id': 'uint32',
    'T_Id': 'uint32',
    'C_Id': 'uint32',
    'Be_type': 'object',
    'Ts': 'int64'
}
data = data.astype(dtypes)
data.dtypes
U_Id       uint32
T_Id       uint32
C_Id       uint32
Be_type    object
Ts          int64
dtype: object

Missing values

# Data read through the dask interface cannot be screened for missing values directly
# with the usual pandas pattern; isnull() below only returns a lazy structure
data.isnull()

Dask DataFrame Structure :

columns1 = [ 'U_Id', 'T_Id', 'C_Id', 'Be_type', 'Ts']
tmpDf1 = pd.DataFrame(columns=columns1)
tmpDf1
s = data["U_Id"].isna()
s.loc[s == True]
Dask Series Structure:
npartitions=58
bool ...
... ...
     ...
Name: U_Id, dtype: bool
Dask Name: loc-series, 348 tasks
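The per-column counts printed below were presumably produced by a loop over the columns; the original loop is not shown in the source, but a minimal sketch could look like this:

# Count missing values per column; sum() stays lazy until compute() is called
for col in columns1:
    n_missing = data[col].isna().sum().compute()
    print(col, 'The number of column missing values is', n_missing)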
U_Id The number of column missing values is 0 
T_Id The number of column missing values is 0 
C_Id The number of column missing values is 0 
Be_type The number of column missing values is 0 
Ts The number of column missing values is 0 

No missing values

Data exploration and visualization

Here we use the pyecharts library. Pyecharts is a data visualization tool that combines Python with Baidu's open-source ECharts. The code conventions of the new 1.x versions differ substantially from the older 0.5.x versions; see the official documentation for the new version: https://gallery.pyecharts.org/#/README

# pip install pyecharts -i https://pypi.tuna.tsinghua.edu.cn/simple
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: pyecharts in d:\anaconda\lib\site-packages (0.1.9.4)
Requirement already satisfied: jinja2 in d:\anaconda\lib\site-packages (from pyecharts)
(3.0.2)
Requirement already satisfied: future in d:\anaconda\lib\site-packages (from pyecharts)
(0.18.2)
Requirement already satisfied: pillow in d:\anaconda\lib\site-packages (from pyecharts)
(8.3.2)
Requirement already satisfied: MarkupSafe>=2.0 in d:\anaconda\lib\site-packages (from
jinja2->pyecharts) (2.0.1)
Note: you may need to restart the kernel to use updated packages.

Pie chart

# For example, we want to draw a pie chart to see the proportion of each user behavior
data["Be_type"]
# With dask, supported pandas functions need a final compute() to actually execute
Be_counts = data["Be_type"].value_counts().compute()
Be_counts
pv      89716264
cart     5530446
fav      2888258
buy      2015839
Name: Be_type, dtype: int64
Be_index = Be_counts.index.tolist() # Extract label
Be_index
['pv', 'cart', 'fav', 'buy']
Be_values = Be_counts.values.tolist() # Extract value 
Be_values
[89716264, 5530446, 2888258, 2015839]
from pyecharts import options as opts
from pyecharts.charts import Pie
#The data in the pie package must be passed into a list of tuples
c = Pie()
c.add("", [list(z) for z in zip(Be_index, Be_values)]) # zip The iteratable object() function packages iteratable objects into tuples and returns a list of tuples c.set_global_opts(title_opts=opts.TitleOpts(title="User behavior")) # Global parameter (Figure naming) c.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
c.render_notebook() # Output to current notebook environment
# c.render("pie_base.html") # If necessary, you can output the diagram to this machine

Funnel diagram

from pyecharts.charts import Funnel # older pyecharts versions do not need "from pyecharts import options as opts"
from IPython.display import Image as IMG
from pyecharts import options as opts
from pyecharts.charts import Pie
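The funnel itself can be built from the Be_counts computed for the pie chart; the original chart code is not shown, but a sketch with the pyecharts 1.x API could be:

# Conversion funnel over the behavior counts: pv -> cart -> fav -> buy
funnel = Funnel()
funnel.add("User behavior", [list(z) for z in zip(Be_index, Be_values)])
funnel.set_global_opts(title_opts=opts.TitleOpts(title="Conversion funnel"))
funnel.render_notebook()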

Data analysis

Timestamp conversion

Dask's support for timestamps is rather limited.

type(data)
dask.dataframe.core.DataFrame
data['Ts1']=data['Ts'].apply(lambda x: time.strftime("%Y-%m-%d %H:%M:%S",
time.localtime(x)))
data['Ts2']=data['Ts'].apply(lambda x: time.strftime("%Y-%m-%d", time.localtime(x)))
data['Ts3']=data['Ts'].apply(lambda x: time.strftime("%H:%M:%S", time.localtime(x)))
D:\anaconda\lib\site-packages\dask\dataframe\core.py:3701: UserWarning:
You did not provide metadata, so Dask is running your function on a small dataset to
guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the
`meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('Ts', 'object'))
  warnings.warn(meta_warning(meta))
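As the warning suggests, the output type can be declared up front with the meta= keyword; a sketch for the first of the three conversions:

# Passing meta= tells dask the name and dtype of the result, so it does not
# have to sample the data to guess them (and the warning disappears)
data['Ts1'] = data['Ts'].apply(
    lambda x: time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(x)),
    meta=('Ts', 'object'))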
data.head(1)
data.dtypes
U_Id       uint32
T_Id       uint32
C_Id       uint32
Be_type    object
Ts          int64
Ts1        object
Ts2        object
Ts3        object
dtype: object

Extract some data to debug the code

df = data.head(1000000)
df.head(1)

Analysis of user traffic and purchase time

User behavior statistics

describe = df.loc[:,["U_Id","Be_type"]]
ids = pd.DataFrame(np.zeros(len(set(list(df["U_Id"])))),index=set(list(df["U_Id"])))
pv_class=describe[describe["Be_type"]=="pv"].groupby("U_Id").count()
pv_class.columns  = ["pv"]
buy_class=describe[describe["Be_type"]=="buy"].groupby("U_Id").count()
buy_class.columns  = ["buy"]
fav_class=describe[describe["Be_type"]=="fav"].groupby("U_Id").count()
fav_class.columns  = ["fav"]
cart_class=describe[describe["Be_type"]=="cart"].groupby("U_Id").count()
cart_class.columns  = ["cart"]
user_behavior_counts = ids.join(pv_class).join(fav_class).join(cart_class).join(buy_class).iloc[:,1:]
user_behavior_counts.head()

Changes in total traffic and trading volume over time (daily)

from matplotlib import font_manager
# Fix the minus sign '-' being rendered as a square on axis ticks
plt.rcParams['axes.unicode_minus'] = False
# Fix garbled Chinese characters in plot labels
plt.rcParams['font.sans-serif'] = ['SimHei']
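The daily figures discussed below are not produced by the code shown above; a sketch of one possible version, counting page views and purchases per day on the Ts2 date column, could be:

# Daily page views and purchases, counted on the date column Ts2
daily_pv = df[df["Be_type"] == "pv"].groupby("Ts2")["Be_type"].count()
daily_buy = df[df["Be_type"] == "buy"].groupby("Ts2")["Be_type"].count()
plt.figure(figsize=(20, 6), dpi=70)
plt.plot(daily_pv.index, daily_pv.values, label="Page views")
plt.plot(daily_buy.index, daily_buy.values, label="Purchases")
plt.title("Daily traffic and trading volume")
plt.xticks(rotation=45)
plt.legend()
plt.grid(alpha=0.4);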

Looking at how total traffic and trading volume change over time, both fluctuated only slightly from November 25 to December 1, 2017, then increased significantly on December 2 and stayed high on December 2 and 3. One reason is that December 2 and 3 fall on a weekend; there may also have been promotional activities on those days, which should be checked against the actual business context. (In the figure, the number of visitors rose on Friday while trading volume fell; a possible explanation is that purchases were postponed from Friday to the weekend activities.)

Changes in total traffic and trading volume over time (hourly)

# Data preparation
df_pv_timestamp = df[df["Be_type"]=="pv"][["Be_type","Ts1"]]
df_pv_timestamp["Ts1"] = pd.to_datetime(df_pv_timestamp["Ts1"])
df_pv_timestamp=df_pv_timestamp.set_index("Ts1")
df_pv_timestamp=df_pv_timestamp.resample("H").count()["Be_type"]
df_pv_timestamp
df_buy_timestamp=df[df["Be_type"]=="buy"][["Be_type","Ts1"]]
df_buy_timestamp["Ts1"] = pd.to_datetime(df_buy_timestamp["Ts1"])
df_buy_timestamp=df_buy_timestamp.set_index("Ts1")
df_buy_timestamp=df_buy_timestamp.resample("H").count()["Be_type"]
df_buy_timestamp
Ts1
2017-09-11 16:00:00        1
2017-09-11 17:00:00        0
2017-09-11 18:00:00        0
2017-09-11 19:00:00        0
2017-09-11 20:00:00        0
                       ...
2017-12-03 20:00:00     8587
2017-12-03 21:00:00    10413
2017-12-03 22:00:00     9862
2017-12-03 23:00:00     7226
2017-12-04 00:00:00        1
Freq: H, Name: Be_type, Length: 2001, dtype: int64
Ts1
2017-11-25 00:00:00     64
2017-11-25 01:00:00     29
2017-11-25 02:00:00     18
2017-11-25 03:00:00      8
2017-11-25 04:00:00      3
                      ...
2017-12-03 19:00:00    141
2017-12-03 20:00:00    159
2017-12-03 21:00:00    154
2017-12-03 22:00:00    154
2017-12-03 23:00:00    123
Freq: H, Name: Be_type, Length: 216, dtype: int64
# Plotting
plt.figure(figsize=(20,6),dpi =70)
x2 = df_buy_timestamp.index
plt.plot(range(len(x2)), df_buy_timestamp.values, label="Turnover", color="blue", linewidth=2)
plt.title("Line chart of hourly trading volume")
x2 = [i.strftime("%Y-%m-%d %H:%M") for i in x2]
plt.xticks(range(len(x2))[::4],x2[::4],rotation=90)
plt.xlabel("Ts2")
plt.ylabel("Ts3")
plt.grid(alpha=0.4);

Feature engineering

Idea: ignore the time window and use only each user's click, favorite and add-to-cart behavior to predict whether they buy. Procedure: group by user ID (U_Id) and, for each user, count the click, favorite and add-to-cart behaviors respectively:

Whether the user clicked and how many times; whether the user favorited and how many times; whether the user added to the cart and how many times.

These features are then used to predict whether the user ultimately buys.

# Remove the timestamp columns
df = df[["U_Id", "T_Id", "C_Id", "Be_type"]]
df

Behavior type
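The grouped lists shown below, and the df_Be series used in all of the following feature-engineering steps, are not constructed by the code shown above. A sketch of how they were presumably built, encoding pv/cart/fav/buy as '1'/'2'/'3'/'4' to match the Counter lookups that follow (the mapping-dictionary name is illustrative):

from collections import Counter  # used below to count the encoded behaviors

# Encode each behavior as a single character and collect each user's behaviors
# into one list, indexed by U_Id
behavior_map = {'pv': '1', 'cart': '2', 'fav': '3', 'buy': '4'}
df['Be_type1'] = df['Be_type'].map(behavior_map)
df_Be = df.groupby('U_Id')['Be_type1'].apply(list)
df_Be.head()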

U_Id
1 [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,... 
100 [1,1,1,1,1,1,1,1,1,3,1,1,3,1,3,... 
115 [1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,...
117 [4,1,1,1,1,1,1,4,1,1,1,1,1,1,1,...
118 [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...
Name: Be_type1, dtype: object

Finally, create a DataFrame to store the calculated user behavior.

df_new = pd.DataFrame()

Number of clicks

df_new['pv_much'] = df_Be.apply(lambda x: Counter(x)['1'])
df_new

Add-to-cart behavior

# Whether the user added items to the cart
df_new['is_cart'] = df_Be.apply(lambda x: 1 if '2' in x else 0)
df_new
# How many times the user added items to the cart
df_new['cart_much'] = df_Be.apply(lambda x: 0 if '2' not in x else Counter(x)['2'])
df_new

Favorite behavior

# Whether the user favorited items
df_new['is_fav'] = df_Be.apply(lambda x: 1 if '3' in x else 0)
df_new
# How many times the user favorited items
df_new['fav_much'] = df_Be.apply(lambda x: 0 if '3' not in x else Counter(x)['3'])
df_new

Correlation analysis

# Spearman correlation on this subset of the data
df_new.corr('spearman')

There is some correlation between whether items were added to the cart and how many times they were added, and between whether items were favorited and how many times they were favorited. However, testing showed that dropping one variable of each pair gives essentially the same result as keeping all of them, so all variables are used in the modeling below.
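To make these relationships easier to see, the correlation matrix can also be drawn as a heatmap; a small sketch (not part of the original code):

import seaborn as sns

# Spearman correlation matrix of the engineered features as a heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(df_new.corr('spearman'), annot=True, cmap='Blues');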

Data label

import seaborn as sns
# Whether the user ultimately bought
df_new['is_buy'] = df_Be.apply(lambda x: 1 if '4' in x else 0)
df_new
df_new.is_buy.value_counts()
1    6689
0    3050
Name: is_buy, dtype: int64
df_new['label'] = df_new['is_buy']
del df_new['is_buy']
df_new.head()
f,ax=plt.subplots(1,2,figsize=(12,5)) 
sns.set_palette(["#9b59b6","#3498db",]) # Set the color palette used for the plots
sns.distplot(df_new['fav_much'],bins=30,kde=True,label='123',ax=ax[0]); 
sns.distplot(df_new['cart_much'],bins=30,kde=True,label='12',ax=ax[1]);
C:\Users\CDA\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning:
`distplot` is a deprecated function and will be removed in a future version. Please adapt
your code to use either `displot` (a figure-level function with similar flexibility) or
`histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
C:\Users\CDA\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning:
`distplot` is a deprecated function and will be removed in a future version. Please adapt
your code to use either `displot` (a figure-level function with similar flexibility) or
`histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)

Model building

Partition dataset

from sklearn.model_selection import train_test_split
X = df_new.iloc[:,:-1]
Y = df_new.iloc[:,-1]
X.head()
Y.head()
U_Id
1      0
100    1
115    0
117    1
118    0
Name: label, dtype: int64
Xtrain,Xtest,Ytrain,Ytest = train_test_split(X,Y,test_size= 0.3,random_state= 42)

Logistic regression

Model establishment

from sklearn.linear_model import LogisticRegression
LR_1 = LogisticRegression().fit(Xtrain,Ytrain)
# Quick accuracy check on the test set
LR_1.score(Xtest,Ytest)
0.6741957563312799

Model evaluation

from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import auc,roc_curve
#Confusion matrix
print(metrics.confusion_matrix(Ytest, LR_1.predict(Xtest)))
[[   0  952]
 [   0 1970]]
print(classification_report(Ytest,LR_1.predict(Xtest)))
D:\anaconda\lib\site-packages\sklearn\metrics\_classification.py:1308:
UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in
labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
D:\anaconda\lib\site-packages\sklearn\metrics\_classification.py:1308:
UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in
labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
D:\anaconda\lib\site-packages\sklearn\metrics\_classification.py:1308:
UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in
labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
fpr,tpr,threshold = roc_curve(Ytest,LR_1.predict_proba(Xtest)[:,1])
roc_auc = auc(fpr,tpr)
print(roc_auc)
0.6379193682549162
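The fpr and tpr arrays computed above can also be plotted as a ROC curve; a minimal sketch (not part of the original code):

# ROC curve for the logistic regression, using the rates computed above
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, label='LogisticRegression (AUC = %.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance diagonal
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.grid(alpha=0.4);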

Random forest

Model establishment

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200, max_depth=1)
rfc.fit(Xtrain, Ytrain)
RandomForestClassifier(max_depth=1, n_estimators=200)

Model evaluation

#Confusion matrix
print(metrics.confusion_matrix(Ytest, rfc.predict(Xtest)))
[[   0  952]
 [   0 1970]]
#Classification Report
print(metrics.classification_report(Ytest, rfc.predict(Xtest)))
D:\anaconda\lib\site-packages\sklearn\metrics\_classification.py:1308:
UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in
labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
D:\anaconda\lib\site-packages\sklearn\metrics\_classification.py:1308:
UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in
labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
D:\anaconda\lib\site-packages\sklearn\metrics\_classification.py:1308:
UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in
labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
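For a comparison with the logistic regression, the same AUC computation can be applied to the random forest; a short sketch:

# AUC for the random forest, analogous to the logistic regression evaluation above
fpr_rf, tpr_rf, threshold_rf = roc_curve(Ytest, rfc.predict_proba(Xtest)[:,1])
print(auc(fpr_rf, tpr_rf))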
