Catalogue
1. Monthly analysis of consumption
2. User distribution analysis
3. User segmentation analysis
4. User life cycle analysis
5. Analysis of repurchase rate and buy-back rate
6. Commodity Association Rule Mining
1, Project background
The dataset contains detailed consumer goods sales records obtained by scanning the bar codes of individual products at the electronic points of sale of a retail store. The data provide detailed information about the quantity, characteristics, value and price of the goods sold.
2, Data source
https://www.kaggle.com/marian447/retail-store-sales-transactions
3, Questions
- Consumption trends and user purchase patterns
- RFM and CLV analysis
- Association rule mining across commodity categories
4, Understanding data
- Date: purchase date
- Customer_ID: user ID
- Transaction_ID: transaction ID
- SKU_Category: commodity category code
- SKU: the unique SKU code of the product
- Quantity: purchase quantity
- Sales_Amount: purchase amount
5, Data cleaning
1. Import data
In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

# Change design style
plt.style.use('ggplot')
plt.rcParams['font.sans-serif'] = ['SimHei']
In [2]:
df = pd.read_csv('E:/googledownload/archive (5)/scanner_data.csv')
df.head()
Out[2]:
| | Unnamed: 0 | Date | Customer_ID | Transaction_ID | SKU_Category | SKU | Quantity | Sales_Amount |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017-01-02 | 2547 | 1 | X52 | 0EM7L | 1.0 | 3.13 |
| 1 | 2 | 2017-01-02 | 822 | 2 | 2ML | 68BRQ | 1.0 | 5.46 |
| 2 | 3 | 2017-01-02 | 3686 | 3 | 0H2 | CZUZX | 1.0 | 6.35 |
| 3 | 4 | 2017-01-02 | 3719 | 4 | 0H2 | 549KK | 1.0 | 5.59 |
| 4 | 5 | 2017-01-02 | 9200 | 5 | 0H2 | K8EHH | 1.0 | 6.88 |
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131706 entries, 0 to 131705
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   Unnamed: 0      131706 non-null  int64
 1   Date            131706 non-null  object
 2   Customer_ID     131706 non-null  int64
 3   Transaction_ID  131706 non-null  int64
 4   SKU_Category    131706 non-null  object
 5   SKU             131706 non-null  object
 6   Quantity        131706 non-null  float64
 7   Sales_Amount    131706 non-null  float64
dtypes: float64(2), int64(3), object(3)
memory usage: 8.0+ MB
2. Select subset
The first column is just a row number that duplicates the DataFrame index, so it is dropped.
In [4]:
df.drop(columns='Unnamed: 0', inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131706 entries, 0 to 131705
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   Date            131706 non-null  object
 1   Customer_ID     131706 non-null  int64
 2   Transaction_ID  131706 non-null  int64
 3   SKU_Category    131706 non-null  object
 4   SKU             131706 non-null  object
 5   Quantity        131706 non-null  float64
 6   Sales_Amount    131706 non-null  float64
dtypes: float64(2), int64(2), object(3)
memory usage: 7.0+ MB
3. Delete duplicate values
In [5]:
df.duplicated().sum()
Out[5]:
0
The data contain no duplicate rows.
4. Missing value processing
In [6]:
df.isnull().sum()
Out[6]:
Date              0
Customer_ID       0
Transaction_ID    0
SKU_Category      0
SKU               0
Quantity          0
Sales_Amount      0
dtype: int64
There are no missing values in the data.
5. Data type standardization
In [7]:
df.dtypes
Out[7]:
Date               object
Customer_ID         int64
Transaction_ID      int64
SKU_Category       object
SKU                object
Quantity          float64
Sales_Amount      float64
dtype: object
Date is stored as an object (string) type and needs to be converted to a datetime type.
In [8]:
df.Date = pd.to_datetime(df.Date, format='%Y-%m-%d')
df.dtypes
Out[8]:
Date              datetime64[ns]
Customer_ID                int64
Transaction_ID             int64
SKU_Category              object
SKU                       object
Quantity                 float64
Sales_Amount             float64
dtype: object
6. Outlier handling
In [9]:
df[['Quantity','Sales_Amount']].describe()
Out[9]:
| | Quantity | Sales_Amount |
|---|---|---|
| count | 131706.000000 | 131706.000000 |
| mean | 1.485311 | 11.981524 |
| std | 3.872667 | 19.359699 |
| min | 0.010000 | 0.020000 |
| 25% | 1.000000 | 4.230000 |
| 50% | 1.000000 | 6.920000 |
| 75% | 1.000000 | 12.330000 |
| max | 400.000000 | 707.730000 |
Purchase quantities below 1 occur because some goods are sold by weight, so these values are not treated as outliers.
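As a quick sanity check (a minimal sketch using the columns defined above), the fractional-quantity rows can be inspected directly:

```python
# Inspect a few of the fractional-quantity rows (likely weighed goods)
print(df[df.Quantity < 1].head())
# Share of rows sold in fractional quantities
print((df.Quantity < 1).mean())
```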
6, Analysis content
1. Monthly analysis of consumption
(1) Trend analysis of total monthly consumption
In [10]:
df['Month'] = df.Date.astype('datetime64[M]')
df.head()
Out[10]:
| | Date | Customer_ID | Transaction_ID | SKU_Category | SKU | Quantity | Sales_Amount | Month |
|---|---|---|---|---|---|---|---|---|
| 0 | 2017-01-02 | 2547 | 1 | X52 | 0EM7L | 1.0 | 3.13 | 2017-01-01 |
| 1 | 2017-01-02 | 822 | 2 | 2ML | 68BRQ | 1.0 | 5.46 | 2017-01-01 |
| 2 | 2017-01-02 | 3686 | 3 | 0H2 | CZUZX | 1.0 | 6.35 | 2017-01-01 |
| 3 | 2017-01-02 | 3719 | 4 | 0H2 | 549KK | 1.0 | 5.59 | 2017-01-01 |
| 4 | 2017-01-02 | 9200 | 5 | 0H2 | K8EHH | 1.0 | 6.88 | 2017-01-01 |
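Note: `astype('datetime64[M]')` truncates dates to the month start on older pandas releases, but pandas 2.x no longer accepts the month unit in `astype`. Assuming `Date` has already been converted to datetime as above, a sketch of an equivalent month-start column that works on current versions:

```python
# Month-start timestamp for each row; equivalent to astype('datetime64[M]') on older pandas
df['Month'] = df.Date.dt.to_period('M').dt.to_timestamp()
```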
In [11]:
grouped_month = df.groupby('Month')
In [12]:
grouped_month.Sales_Amount.sum()
Out[12]:
Month
2017-01-01    111200.28
2017-02-01    118323.24
2017-03-01    126770.11
2017-04-01    140114.18
2017-05-01    141536.95
2017-06-01    130384.85
2017-07-01    124357.15
2017-08-01    114295.16
2017-09-01    139665.01
2017-10-01    141692.79
2017-11-01    131676.89
2017-12-01    156308.81
2018-01-01      1713.20
Name: Sales_Amount, dtype: float64
The January 2018 data appear to be incomplete, so that month is excluded from the trend analysis.
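A quick check of how much of January 2018 the data actually cover (a sketch using the `Month` column created above):

```python
# Latest recorded date and number of distinct days covered in January 2018
jan_2018 = df[df.Month == '2018-01-01']
print(jan_2018.Date.max(), jan_2018.Date.nunique())
```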
In [13]:
grouped_month.Sales_Amount.sum().head(12).plot()
[Figure: line plot of total monthly sales, January-December 2017]
- As the figure above shows, monthly sales fluctuate considerably: they rise steadily through the first quarter, fluctuate afterwards, and trend upward overall
(2) Trend analysis of monthly number of transactions
In [14]:
grouped_month.Transaction_ID.nunique().head(12).plot()
[Figure: line plot of monthly number of transactions, January-December 2017]
- As the figure above shows, the number of transactions fluctuates considerably: it trends upward early in the year, declines after May to its lowest point in August, then recovers with fluctuations and returns to a peak in December
(3) Trend analysis of monthly commodity purchase quantity
In [15]:
grouped_month.Quantity.sum().head(12).plot()
[Figure: line plot of monthly quantity of goods purchased, January-December 2017]
- It can be seen from the above figure that the quantity of goods purchased fluctuates greatly, and the overall trend is consistent with the number of transactions
(4) Trend analysis of monthly number of customers
In [16]:
grouped_month.Customer_ID.nunique().head(12).plot()
[Figure: line plot of monthly number of unique customers, January-December 2017]
- As the figure above shows, the monthly number of buyers can be divided into three stages: January to May, June to August, and September to December, with a fluctuating upward trend overall
2. User distribution analysis
(1) New user distribution
In [17]:
grouped_customer = df.groupby('Customer_ID')
grouped_customer.Date.min().value_counts().plot()
[Figure: line plot of the number of new users by first purchase date]
- It can be seen from the above figure that the acquisition of new users is unstable and fluctuates greatly, with a slight downward trend as a whole
In [18]:
grouped_customer.Month.min().value_counts().plot()
[Figure: line plot of the number of new users by first purchase month]
- Aggregated by month, the number of new users shows a clear downward trend; new-user acquisition is falling sharply and needs attention. Marketing activities could be stepped up to improve acquisition. (A chronologically sorted version of this plot is sketched below.)
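One caveat: `value_counts()` returns the series ordered by count rather than by date, so the two plots above may not run chronologically. A minimal sketch of a chronologically sorted version of the monthly plot:

```python
# Sort by month before plotting so the x-axis runs in calendar order
grouped_customer.Month.min().value_counts().sort_index().plot()
```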
(2) Proportion of one-time vs. repeat customers
In [19]:
# Proportion of users who consume only once
(grouped_customer.Transaction_ID.nunique() == 1).sum() / df.Customer_ID.nunique()
Out[19]:
0.5098342541436464
- Roughly half of all users (about 51%) purchased only once
In [20]:
grouped_month_customer = df.groupby(['Month', 'Customer_ID'])
In [21]:
# The first purchase time of each user in each month
data_month_min_date = grouped_month_customer.Date.min().reset_index()
# First purchase time of each user overall
data_min_date = grouped_customer.Date.min().reset_index()
In [22]:
# Merge the two tables on Customer_ID
merged_date = pd.merge(data_month_min_date, data_min_date, on='Customer_ID')
merged_date.head()
Out[22]:
| | Month | Customer_ID | Date_x | Date_y |
|---|---|---|---|---|
| 0 | 2017-01-01 | 1 | 2017-01-22 | 2017-01-22 |
| 1 | 2017-01-01 | 3 | 2017-01-02 | 2017-01-02 |
| 2 | 2017-01-01 | 11 | 2017-01-29 | 2017-01-29 |
| 3 | 2017-01-01 | 12 | 2017-01-07 | 2017-01-07 |
| 4 | 2017-01-01 | 13 | 2017-01-11 | 2017-01-11 |
In [23]:
# Rows where Date_x equals Date_y correspond to that month's new users
((merged_date.query('Date_x == Date_y')).groupby('Month').Customer_ID.count()
 / merged_date.groupby('Month').Customer_ID.count()).plot()
[Figure: line plot of the monthly proportion of new users]
- As the figure above shows, the proportion of new users per month trends downward overall. Combined with the upward consumption trend in the fourth quarter, this indicates that repeat purchases increased over that period
3. User segmentation analysis
(1) RFM segmentation
In [24]:
pivot_rfm = df.pivot_table(index='Customer_ID',
                           values=['Date', 'Transaction_ID', 'Sales_Amount'],
                           aggfunc={'Date': 'max', 'Transaction_ID': 'nunique', 'Sales_Amount': 'sum'})
In [25]:
# R: days since the most recent purchase
pivot_rfm['R'] = (pivot_rfm.Date.max() - pivot_rfm.Date)/np.timedelta64(1, 'D')
# F: number of transactions, M: total spend
pivot_rfm.rename(columns={'Transaction_ID':'F', 'Sales_Amount':'M'}, inplace=True)
In [26]:
def label_func(data):
    # '1' if the customer is above the mean on a dimension, '0' otherwise
    label = data.apply(lambda x: '1' if x > 0 else '0')
    label = label.R + label.F + label.M
    labels = {
        '111': 'Important value customers',
        '011': 'Important to keep customers',
        '101': 'Key development customers',
        '001': 'Important retention customers',
        '110': 'General value customers',
        '010': 'General customer retention',
        '100': 'General development customers',
        '000': 'General retention of customers'
    }
    return labels[label]

# Centre R, F and M on their means, then label each customer
pivot_rfm['label'] = pivot_rfm[['R','F','M']].apply(lambda x: x - x.mean()).apply(label_func, axis=1)
In [27]:
pivot_rfm.label.value_counts().plot.barh()
[Figure: horizontal bar chart of customer counts per RFM segment]
In [28]:
pivot_rfm.groupby('label').M.sum().plot.pie(figsize=(6,6), autopct='%3.2f%%')
[Figure: pie chart of total sales share by RFM segment]
In [29]:
pivot_rfm.groupby('label').agg(['sum', 'count'])
Out[29]:
| label | M (sum) | M (count) | F (sum) | F (count) | R (sum) | R (count) |
|---|---|---|---|---|---|---|
| General value customers | 38702.46 | 991 | 3653 | 991 | 258396.0 | 991 |
| General customer retention | 75475.62 | 1802 | 7161 | 1802 | 97961.0 | 1802 |
| General development customers | 144065.48 | 8329 | 10089 | 8329 | 2256955.0 | 8329 |
| General retention of customers | 119097.65 | 6313 | 8158 | 6313 | 443616.0 | 6313 |
| Important value customers | 192090.11 | 1028 | 5475 | 1028 | 270431.0 | 1028 |
| Important to keep customers | 860862.51 | 3069 | 28458 | 3069 | 120487.0 | 3069 |
| Key development customers | 81377.67 | 600 | 892 | 600 | 162986.0 | 600 |
| Important retention customers | 66367.12 | 493 | 796 | 493 | 33989.0 | 493 |
According to the table and figures above:

- Sales come mainly from the "important" segments, while general development customers are the largest group by headcount
- Important to keep customers: the main source of sales, with recent and high-value purchases but insufficient purchase frequency; targeted marketing activities could raise their purchase frequency
- Important value customers: the second-largest source of sales, with recent, frequent and high-value purchases; the goal is to keep these customers as they are
- Key development customers: high spend and purchase frequency but no recent purchases; appropriate recall strategies can bring these users back to purchasing
- Important retention customers: high spend but no recent purchases; they are on the verge of churning, and suitable win-back measures can prevent the loss of these high-value customers
- General value customers: recent and frequent purchases but low spend; coupons and similar activities can stimulate this group to spend more
- General development customers: the largest group by headcount, with recent purchases but low spend and frequency; given their numbers, activities to raise purchase frequency and amount are worthwhile
- General customer retention: address as appropriate, subject to cost and resource constraints
- General retention of customers: address as appropriate, subject to cost and resource constraints

(The mean cut points used for this segmentation are printed in the sketch below.)
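For reference, the segmentation above splits each customer at the mean of R, F and M; printing the means shows the actual cut points (a minimal sketch):

```python
# Mean R (days since last purchase), F (transactions) and M (total spend) used as cut points
print(pivot_rfm[['R', 'F', 'M']].mean())
```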
(2) User status analysis
In [30]:
pivoted_status = df.pivot_table(index='Customer_ID', columns='Month', values='Date', aggfunc='count').fillna(0)
In [31]:
def active_status(data):
    status = []
    for i in range(len(data)):
        # If there is no consumption this month
        if data[i] == 0:
            if len(status) > 0:
                if status[i-1] == 'unreg':
                    status.append('unreg')
                else:
                    status.append('unactive')
            else:
                status.append('unreg')
        # If there is consumption this month
        else:
            if len(status) > 0:
                if status[i-1] == 'unreg':
                    status.append('new')
                elif status[i-1] == 'unactive':
                    status.append('return')
                else:
                    status.append('active')
            else:
                status.append('new')
    status = pd.Series(status, index=data.index)
    return status
In [32]:
active_status = pivoted_status.apply(active_status, axis=1)
In [33]:
active_status.replace('unreg', np.nan).apply(lambda x:x.value_counts()).fillna(0).T.apply(lambda x: x/x.sum(),axis=1).plot.area()
[Figure: stacked area chart of monthly user-status proportions (active, new, return, unactive)]
As can be seen from the above figure:

- New users: the proportion of new users shows a clear downward trend, indicating that customer acquisition is weak
- Active users: the proportion peaked in February and has declined slowly since, suggesting that repeat-purchase operations are weakening
- Inactive users: the proportion rises markedly, so customer churn is becoming more pronounced
- Returning users: a slow upward trend, indicating that recall and win-back efforts are working
4. User life cycle analysis
(1) User lifecycle distribution
In [34]:
# The life cycle analysis only uses users with at least 2 purchases
clv = (grouped_customer[['Sales_Amount']].sum())[grouped_customer.Transaction_ID.nunique() > 1]
In [35]:
clv['lifetime'] = (grouped_customer.Date.max() - grouped_customer.Date.min())/np.timedelta64(1,'D')
In [36]:
clv.describe()
Out[36]:
| | Sales_Amount | lifetime |
|---|---|---|
| count | 11090.000000 | 11090.000000 |
| mean | 121.473811 | 116.468260 |
| std | 202.733651 | 85.985488 |
| min | 2.240000 | 0.000000 |
| 25% | 27.462500 | 42.000000 |
| 50% | 55.635000 | 96.000000 |
| 75% | 126.507500 | 190.000000 |
| max | 3985.940000 | 364.000000 |
- The table shows that users who purchased more than once have an average life cycle of about 116 days and spend an average of 121.47 over that life cycle
In [37]:
clv['lifetime'].plot.hist(bins = 50)
[Figure: histogram of repeat-customer life cycles in days, 50 bins]
As can be seen from the above figure:

- Many users have a life cycle of 0-90 days, i.e. customers with short life cycles make up a large share and churn within 90 days is high; these users are a natural focus for operations aimed at extending the life cycle (quantified in the sketch below);
- Life cycles between 90 and 250 days are fairly evenly distributed and cover most users; stimulating consumption in this group can raise the amount spent over the life cycle;
- Few users have a life cycle longer than 250 days, so the share of long-lived, loyal customers is not high.
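The three bands described above can be quantified with a quick sketch (the bin edges are chosen to match the description, not taken from the original analysis):

```python
# Share of repeat customers whose life cycle falls in each band (days)
lifetime_bands = pd.cut(clv.lifetime, bins=[-1, 90, 250, 365])
print(lifetime_bands.value_counts(normalize=True).sort_index())
```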
(2) User life cycle value distribution
In [38]:
clv['Sales_Amount'].plot.hist(bins = 50)
[Figure: histogram of total spend per repeat customer, 50 bins]
As can be seen from the above figure:

- Most users spend less than 500 over their life cycle, and the majority less than 100. A few large extreme values pull the mean up, so the distribution is right-skewed.
(3) User life cycle and its value relationship
In [39]:
plt.scatter(x='lifetime', y='Sales_Amount', data=clv)
[Figure: scatter plot of life cycle (days) vs. total spend per repeat customer]
As can be seen from the above figure:

- There is no clear linear relationship between a user's life cycle and their value; within 300 days, some users with longer life cycles contribute more value than users with shorter ones (see the correlation check below);
- Beyond 300 days, some users contribute relatively little value; with limited data, this observation is only indicative
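A simple way to check the "no linear relationship" claim is the Pearson correlation between the two columns (a sketch; the exact value is not reported in the original):

```python
# Correlation between life cycle length and total spend of repeat customers
print(clv[['lifetime', 'Sales_Amount']].corr())
```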
5. Analysis of repurchase rate and buy-back rate
(1) Repurchase rate analysis
In [40]:
# Number of users whose monthly consumption times are greater than 1
customer_month_again = grouped_month_customer.nunique().query('Transaction_ID > 1').reset_index().groupby('Month').count().Customer_ID
# Monthly consuming users
customer_month = grouped_month.Customer_ID.nunique()
# Monthly repurchase rate
(customer_month_again/customer_month).plot()
[Figure: line plot of the monthly repurchase rate]
- As the figure above shows, the repurchase rate fluctuates around 25%, meaning about a quarter of each month's buyers purchase more than once in that month. It declined over the first three months, then recovered and shows an overall upward trend. Whether to push the repurchase rate higher or to focus on acquiring new users should be decided in light of the business model. Because the last month's data are incomplete, the final data point should be disregarded (a version restricted to the 12 full months of 2017 is sketched below).
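Since January 2018 is incomplete, restricting the plot to the 12 full months of 2017, as was done for the earlier monthly plots, avoids the artificial drop at the end; a minimal sketch:

```python
# Monthly repurchase rate over the 12 full months of 2017 only
(customer_month_again / customer_month).head(12).plot()
```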
(2) Buy-back rate analysis
In [41]:
# 1: purchased in the previous 90 days and purchased again this month
# 0: purchased in the previous 90 days but not this month
# NaN: no purchase in the previous 90 days
def buy_back(data):
    status = [np.nan, np.nan, np.nan]
    for i in range(3, len(data)):
        # Purchased this month
        if data[i] == 1:
            # Purchased in the previous 90 days
            if (data[i-1] == 1 or data[i-2] == 1 or data[i-3] == 1):
                status.append(1)
            # Not purchased in the previous 90 days
            else:
                status.append(np.nan)
        # Not purchased this month
        else:
            # Purchased in the previous 90 days
            if (data[i-1] == 1 or data[i-2] == 1 or data[i-3] == 1):
                status.append(0)
            # Not purchased in the previous 90 days
            else:
                status.append(np.nan)
    status = pd.Series(status, index=data.index)
    return status
In [42]:
back_status = pivoted_status.apply(buy_back, axis=1)
back_status.head()
Out[42]:
| Customer_ID | 2017-01-01 | 2017-02-01 | 2017-03-01 | 2017-04-01 | 2017-05-01 | 2017-06-01 | 2017-07-01 | 2017-08-01 | 2017-09-01 | 2017-10-01 | 2017-11-01 | 2017-12-01 | 2018-01-01 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN |
In [43]:
(back_status.sum()/back_status.count()).plot()
[Figure: line plot of the monthly buy-back rate (repeat purchase within 90 days)]
- As the figure above shows, the buy-back rate, i.e. the share of customers who purchased in the previous 90 days and purchase again in the current month, stays below 10%. This suggests the store operates in an acquisition-driven mode; however, the earlier analysis showed that new user acquisition is declining, so the current situation is not healthy and the focus at this stage should be on acquiring new users.
6. Commodity Association Rule Mining
(1) Analysis of hot-selling categories
In [44]:
# Take out the top 10 product categories
hot_category = df.groupby('SKU_Category').count().Sales_Amount.sort_values(ascending=False)[:10].reset_index()
plt.barh(hot_category.SKU_Category, hot_category.Sales_Amount)
[Figure: horizontal bar chart of the top 10 SKU categories by number of sales records]
In [45]:
# Proportion of best-selling categories
hot_category['percent'] = hot_category.Sales_Amount.apply(lambda x: x/hot_category.Sales_Amount.sum())
plt.figure(figsize=(6,6))
plt.pie(hot_category.percent, labels=hot_category.SKU_Category, autopct='%1.2f%%')
plt.show()
In [46]:
category_list = df.groupby('Transaction_ID').SKU_Category.apply(list).values.tolist()
In [47]:
from apyori import apriori
In [48]:
min_support_value = 0.02
min_confidence_value = 0.3
result = list(apriori(transactions=category_list,
                      min_support=min_support_value,
                      min_confidence=min_confidence_value,
                      min_lift=0))
In [49]:
result
Out[49]:
[RelationRecord(items=frozenset({'FU5', 'LPF'}), support=0.02067035651340404,
   ordered_statistics=[OrderedStatistic(items_base=frozenset({'FU5'}), items_add=frozenset({'LPF'}),
     confidence=0.4946355900850906, lift=6.8819142262602355)]),
 RelationRecord(items=frozenset({'LPF', 'IEV'}), support=0.031152407161188583,
   ordered_statistics=[OrderedStatistic(items_base=frozenset({'IEV'}), items_add=frozenset({'LPF'}),
     confidence=0.4889589905362776, lift=6.802935131397615),
    OrderedStatistic(items_base=frozenset({'LPF'}), items_add=frozenset({'IEV'}),
     confidence=0.43342654334265435, lift=6.802935131397614)]),
 RelationRecord(items=frozenset({'OXH', 'LPF'}), support=0.020067406697381034,
   ordered_statistics=[OrderedStatistic(items_base=frozenset({'OXH'}), items_add=frozenset({'LPF'}),
     confidence=0.4810971089696071, lift=6.693551990185444)])]
From the above results:

- 'FU5' -> 'LPF': support about 2.1%, confidence about 49.5%. Roughly 2.1% of transactions contain both categories; when a transaction contains FU5 products, there is about a 49.5% chance it also contains LPF products
- 'IEV' -> 'LPF': support about 3.1%, confidence about 48.9%. Roughly 3.1% of transactions contain both categories; when a transaction contains IEV products, there is about a 48.9% chance it also contains LPF products
- 'LPF' -> 'IEV': support about 3.1%, confidence about 43.3%. Roughly 3.1% of transactions contain both categories; when a transaction contains LPF products, there is about a 43.3% chance it also contains IEV products
- 'OXH' -> 'LPF': support about 2.0%, confidence about 48.1%. Roughly 2.0% of transactions contain both categories; when a transaction contains OXH products, there is about a 48.1% chance it also contains LPF products

(A tabulated version of these rules is sketched below.)
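For easier reading, the apyori `RelationRecord` objects above can be flattened into one row per rule; a minimal sketch assuming the `result` list produced earlier:

```python
# Flatten apyori output into a rule table: antecedent -> consequent with support, confidence, lift
rules = []
for record in result:
    for stat in record.ordered_statistics:
        rules.append({
            'antecedent': ', '.join(stat.items_base),
            'consequent': ', '.join(stat.items_add),
            'support': record.support,
            'confidence': stat.confidence,
            'lift': stat.lift,
        })
rules_df = pd.DataFrame(rules).sort_values('lift', ascending=False)
print(rules_df)
```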