Analysis of e-commerce user behavior

Project background

Using the behavior data of e-commerce users as the data set, this report analyzes the behavior of Taobao users through standard industry indicators in order to uncover their behavior patterns. The specific indicators include daily PV and daily UV analysis, payment rate analysis, funnel loss analysis and user value RFM analysis.

Understanding the data

This data set contains the user behavior data of an e-commerce platform from November 18, 2014 to December 18, 2014, with a total of 6 columns:

user_id: user identity (anonymized)

item_id: commodity ID (anonymized)

behavior_type: user behavior type (clicking, collecting, adding to shopping cart and paying, represented by the numbers 1, 2, 3 and 4 respectively; a small mapping sketch follows this list)

user_geohash: geographic location (geohash)

item_category: category ID (the category to which the commodity belongs)

time: the time at which the user behavior occurred
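For readability in the later analysis, the numeric codes can be mapped to names. The dictionary below is a convenience added for illustration; it is not a column of the data set itself:

# Hypothetical convenience mapping from behavior_type codes to names
behavior_labels = {1: 'click', 2: 'collect', 3: 'add-to-cart', 4: 'pay'}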

Data cleaning

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')
# Register converters so pandas datetimes plot correctly
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
%matplotlib inline
# Fix display problems such as garbled Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
%config InlineBackend.figure_format = 'svg'
data = pd.read_csv(r"D:\study\Data analysis project\Taobao user behavior analysis\tianchi_mobile_recommend_train_user\tianchi_mobile_recommend_train_user.csv")
data.head()
    user_id    item_id  behavior_type user_geohash  item_category           time
0  98047837  232431562              1          NaN           4245  2014-12-06 02
1  97726136  383583590              1          NaN           5894  2014-12-09 20
2  98607707   64749712              1          NaN           2883  2014-12-18 11
3  98662432  320593836              1      96nn52n           6562  2014-12-06 10
4  98145908  290208520              1          NaN          13926  2014-12-16 21

Missing value processing

data.isnull().sum()
user_id                0
item_id                0
behavior_type          0
user_geohash     8334824
item_category          0
time                   0
dtype: int64
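Only user_geohash contains missing values: 8334824 of 12256906 rows, roughly 68%. A quick sanity check of that share, run before the fill below:

# Share of rows with a missing geohash (8334824 / 12256906 ≈ 0.68)
data['user_geohash'].isnull().mean()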
# Fill missing geohash values with the placeholder 'miss'
data.fillna('miss', inplace=True)

Data standardization

# Split the time column into separate date and hour columns
data['date'] = data['time'].map(lambda s: s.split(' ')[0])
data['hour'] = data['time'].map(lambda s: s.split(' ')[1])
data.head()
    user_id    item_id  behavior_type user_geohash  item_category           time        date hour
0  98047837  232431562              1         miss           4245  2014-12-06 02  2014-12-06   02
1  97726136  383583590              1         miss           5894  2014-12-09 20  2014-12-09   20
2  98607707   64749712              1         miss           2883  2014-12-18 11  2014-12-18   11
3  98662432  320593836              1      96nn52n           6562  2014-12-06 10  2014-12-06   10
4  98145908  290208520              1         miss          13926  2014-12-16 21  2014-12-16   21
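The same split can also be done with pandas' vectorized string methods, avoiding a Python-level lambda. A minimal equivalent sketch, assuming time is still a string column at this point:

# Vectorized alternative: split once into two new columns
parts = data['time'].str.split(' ', n=1, expand=True)
data['date'], data['hour'] = parts[0], parts[1]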
data.dtypes
user_id           int64
item_id           int64
behavior_type     int64
user_geohash     object
item_category     int64
time             object
date             object
hour             object
dtype: object
# Data type conversion: time and date become datetime type, hour becomes integer type
data['date']=pd.to_datetime(data['date'])
data['time']=pd.to_datetime(data['time'])
data['hour']=data['hour'].astype('int64')
data.dtypes
user_id                   int64
item_id                   int64
behavior_type             int64
user_geohash             object
item_category             int64
time             datetime64[ns]
date             datetime64[ns]
hour                      int64
dtype: object
data = data.sort_values(by='time', ascending=True)  # Sort by time
data = data.reset_index(drop=True)  # Rebuild the index
data
            user_id    item_id  behavior_type user_geohash  item_category                 time        date  hour
0          73462715  378485233              1         miss           9130  2014-11-18 00:00:00  2014-11-18     0
1          36090137  236748115              1         miss          10523  2014-11-18 00:00:00  2014-11-18     0
2          40459733  155218177              1         miss           8561  2014-11-18 00:00:00  2014-11-18     0
3            814199  149808524              1         miss           9053  2014-11-18 00:00:00  2014-11-18     0
4           1133099  825730861              1         miss           3783  2014-11-18 00:00:00  2014-11-18     0
...             ...        ...            ...          ...            ...                  ...         ...   ...
12256901  132653097  119946062              2         miss           6054  2014-12-18 23:00:00  2014-12-18    23
12256902  130082553  296196819              1         miss          11532  2014-12-18 23:00:00  2014-12-18    23
12256903   43592945  350594832              1      9rhhgph           9541  2014-12-18 23:00:00  2014-12-18    23
12256904   12833799  186993938              1      954g37v           3798  2014-12-18 23:00:00  2014-12-18    23
12256905    7752255  269292191              1         miss            889  2014-12-18 23:00:00  2014-12-18    23

12256906 rows × 8 columns

User behavior analysis

PV (page views): Page View, the number of page views or clicks on the site; each page refresh counts once.

UV (unique visitors): Unique Visitor, the number of distinct visitors; each client that accesses the site counts as one visitor.
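In pandas terms, these two indicators reduce to a row count and a distinct-user count over the cleaned frame; a minimal sketch:

# PV = number of records; UV = number of distinct users
pv_total = len(data)
uv_total = data['user_id'].nunique()
print(pv_total, uv_total)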

Daily traffic analysis

# pv_daily: number of user operations per day; uv_daily: number of distinct users per day
pv_daily = data.groupby('date')['user_id'].count().reset_index().rename(columns={'user_id':'pv'})
uv_daily = data.groupby('date')['user_id'].nunique().reset_index().rename(columns={'user_id':'uv'})
fig,axes=plt.subplots(2,1)
pv_daily.plot(x='date',y='pv',ax=axes[0])
uv_daily.plot(x='date',y='uv',ax=axes[1])
axes[0].set_title('pv_daily')
axes[1].set_title('uv_daily')

plt.show()

As the figure above shows, both pv and uv peak around Double 12 (December 12), and there is a consistently large gap between the pv and uv values.
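To verify the peak date rather than reading it off the chart, the busiest day can be looked up directly; a small check against the pv_daily frame above:

# Day with the highest PV (expected to be 2014-12-12, Double 12)
pv_daily.loc[pv_daily['pv'].idxmax()]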

Hourly traffic analysis

# pv_hour: number of user operations per hour; uv_hour: number of distinct users per hour
pv_hour = data.groupby('hour')['user_id'].count().reset_index().rename(columns={'user_id':'pv'})
uv_hour = data.groupby('hour')['user_id'].nunique().reset_index().rename(columns={'user_id':'uv'})
fig,axes = plt.subplots(2,1,sharex=True)
pv_hour.plot(x='hour',y='pv',ax=axes[0])
uv_hour.plot(x='hour',y='uv',ax=axes[1])
axes[0].set_title('pv_hour')
axes[1].set_title('uv_hour')
plt.show()

Between 0 and 5 a.m., pv and uv decline together, both bottoming out at around 5 a.m. From about 18:00 onward, pv fluctuates sharply while uv barely changes, which indicates that the evening hours after 18:00 are the most active period for Taobao users on the app.
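The trough and peak hours can be confirmed numerically; a quick check against pv_hour:

# Hours ranked by PV: head shows the early-morning trough, tail the evening peak
ranked = pv_hour.sort_values(by='pv')
print(ranked.head(3))
print(ranked.tail(3))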

PV analysis by user behavior type

pv_detail = data.groupby(['behavior_type','hour'])['user_id'].count().reset_index().rename(columns={'user_id':'total_pv'})
fig,axes = plt.subplots(2,1,sharex=True)
sns.pointplot(x='hour',y='total_pv',hue='behavior_type',data=pv_detail,ax=axes[0])
sns.pointplot(x='hour',y='total_pv',hue='behavior_type',data=pv_detail[pv_detail.behavior_type!=1],ax=axes[1])

axes[0].set_title('pv_different_behavior_type')
plt.show()

The chart shows that clicks generate far more pv than the other three behavior types, while all four behaviors follow essentially the same hourly pattern: whatever the behavior, activity peaks in the evening.
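The dominance of clicks can be quantified directly; from the funnel counts later in this report, clicks make up about 94% of all records:

# Share of each behavior type across all records (clicks ≈ 0.94)
data['behavior_type'].value_counts(normalize=True)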

Consumer behavior analysis

Analysis of user purchase times

# Distribution of total purchases per user over the period
data_buy = data[data.behavior_type==4].groupby('user_id')['behavior_type'].count()
sns.displot(data_buy)
plt.title('user_buy_count')
plt.show()

data_buy=data[data.behavior_type==4].groupby('user_id')['behavior_type'].count().reset_index().rename(columns={'behavior_type':'buy_count'})
data_buy.describe()
            user_id    buy_count
count  8.886000e+03  8886.000000
mean   7.152087e+07    13.527459
std    4.120719e+07    19.698786
min    4.913000e+03     1.000000
25%    3.567731e+07     4.000000
50%    7.238800e+07     8.000000
75%    1.071945e+08    17.000000
max    1.424559e+08   809.000000
bins = [1,10,20,50]
data_buy['buy_count_cut'] = pd.cut(data_buy['buy_count'],bins,labels = ['1-10','10-20','20-50'])
buy_count_cut = data_buy['buy_count_cut'].value_counts()
buy_count_cut
1-10     4472
10-20    1972
20-50    1417
Name: buy_count_cut, dtype: int64

Most Taobao users purchased no more than 10 times during the month, so the consumer groups who buy more than 10 times deserve focused attention.
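Note that the bins above only cover 1 < buy_count ≤ 50, so the 4472 + 1972 + 1417 = 7861 bucketed users exclude one-time buyers and users with more than 50 purchases (the max is 809), out of 8886 buyers in total. A variant that keeps every buyer, as a sketch:

# Inclusive binning that also counts single-purchase and 50+ buyers
bins = [0, 10, 20, 50, np.inf]
labels = ['1-10', '10-20', '20-50', '50+']
pd.cut(data_buy['buy_count'], bins, labels=labels).value_counts()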

Payment rate

Payment rate = number of consumers / number of active users

# Payment rate = distinct paying users / distinct active users, per day
pay_rate = data.groupby('date').apply(lambda x: x[x.behavior_type==4]['user_id'].nunique() / x['user_id'].nunique())
pay_rate.plot()
plt.title('Payment rate')
plt.show()

On a typical day, roughly 6% of active users make a purchase, with the peak around Double 12.
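Averaging the daily series gives that headline figure, assuming the pay_rate series computed above:

# Mean daily payment rate over the month (expected around 0.06)
pay_rate.mean()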

Funnel loss analysis

Browse the home page - click on products - add to shopping cart & collection - start payment - complete purchase

data_user = data.groupby(['behavior_type']).count()
data_user.head()
                user_id   item_id  user_geohash  item_category      time      date      hour
behavior_type
1              11550581  11550581      11550581       11550581  11550581  11550581  11550581
2                242556    242556        242556         242556    242556    242556    242556
3                343564    343564        343564         343564    343564    343564    343564
4                120205    120205        120205         120205    120205    120205    120205
pv_all=data['user_id'].count()
print(pv_all)
12256906

Total views (PV): 12256906
Clicks: 11550581
Collections + additions to cart: 242556 + 343564 = 586120
Purchases: 120205

Loss rate from clicks to purchase intention: 1 - 586120 / 11550581 ≈ 94.93%

Loss rate from purchase intention to purchases: 1 - 120205 / 586120 ≈ 79.49%
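The two loss rates follow directly from the counts above; a quick arithmetic check:

# Reproduce the funnel loss rates from the raw counts
clicks = 11550581
intent = 242556 + 343564      # collect + add-to-cart = 586120
buys = 120205
print('click -> intent loss: {:.2%}'.format(1 - intent / clicks))   # 94.93%
print('intent -> purchase loss: {:.2%}'.format(1 - buys / intent))  # 79.49%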

This data set lacks an explicit "place order" behavior, so the conversion from order placement to completed payment cannot be analyzed.
Also, since collecting and adding to cart have no strict ordering between them, they are combined into a single "purchase intention" stage for this analysis.

User value analysis: the RFM model

Meaning of RFM:

R (Recency): time since the customer's last transaction. The larger R is, the longer ago the customer last bought; the smaller, the more recent.
F (Frequency): number of transactions the customer has made in the recent period. The higher F is, the more frequently the customer buys; the lower, the less active.
M (Monetary): amount the customer has spent in the recent period. The higher M is, the more valuable the customer; the lower, the less valuable.
from datetime import datetime
datenow = datetime(2014, 12, 20)
# Days since each user's latest purchase
recent_buy_time = data[data.behavior_type==4].groupby('user_id').date.apply(lambda x: datenow - x.sort_values().iloc[-1]).reset_index().rename(columns={'date':'recent'})
# Purchase frequency per user
buy_freq = data[data.behavior_type==4].groupby('user_id').date.count().reset_index().rename(columns={'date':'freq'})
rfm = pd.merge(recent_buy_time, buy_freq, on='user_id', how='outer')
# Median-split score for each dimension: '2' = more recent / more frequent
rfm['recent_value'] = pd.qcut(rfm.recent, 2, labels=['2','1'])
rfm['freq_value'] = pd.qcut(rfm.freq, 2, labels=['1','2'])
rfm['rfm'] = rfm['recent_value'].str.cat(rfm['freq_value'])
rfm.head()
   user_id  recent  freq recent_value freq_value rfm
0     4913  4 days     6            2          1  21
1     6118  3 days     1            2          1  21
2     7528  7 days     6            1          1  11
3     7591  7 days    21            1          2  12
4    12645  6 days     8            2          1  21

Because this data set does not provide purchase amounts, only R and F are used for the user value analysis.
From the RF segmentation: users in segment 22 (recent and frequent) are key users and deserve the most attention; segment 21 (recent but infrequent) can be encouraged to buy more often through coupons and similar promotions; segment 12 (frequent but not recent) is at risk of lapsing and should be won back; and segment 11 (neither recent nor frequent) is low-value, best served by studying shopping habits, targeted push and precision marketing.
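One way to make those strategies operational is to attach them to the frame; the mapping below is a hypothetical illustration of the reading above, not part of the original analysis:

# Hypothetical segment -> strategy mapping (illustrative only)
segment_strategy = {
    '22': 'key users: retain and prioritize',
    '21': 'recent but infrequent: coupons to raise purchase frequency',
    '12': 'frequent but lapsed: win-back campaigns',
    '11': 'low recency and frequency: targeted push, precision marketing',
}
rfm['strategy'] = rfm['rfm'].map(segment_strategy)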

rfm.groupby(['rfm']).count()
     user_id  recent  freq  recent_value  freq_value
rfm
11      2767    2767  2767          2767        2767
12      1219    1219  1219          1219        1219
21      1721    1721  1721          1721        1721
22      3179    3179  3179          3179        3179
rfm_count = rfm['rfm'].value_counts()
plt.figure(figsize=(8,8))
plt.pie(rfm_count.values,labels=rfm_count.index,autopct='%.2f%%',
       wedgeprops={'linewidth':0.5,'edgecolor':'green'},
       textprops={'fontsize':30,'color':'#003371'}
       ) 
plt.title('RFM',size=30)
plt.show()
