Practical cases of common statistical methods such as cumulative, year-on-year and month on month in Pandas

In the statistical table, the statistics are usually accumulated in the current year, completed in the same period of the previous year (accumulated), completed in the current period (such as the current month) and completed in the last month, and the year-on-year and month on month analysis are carried out. The following example is shown in the monthly report statistics table. This article will use Python Pandas tool for statistics.

Of which:

  • (current year) cumulative: refers to the total amount from January to the end of the current year
  • (last year) same period (cumulative): refers to the total amount from January of last year to the end month corresponding to the cumulative amount of this year
  • Year on year (growth rate) = (amount in current period - amount in the same period) / amount in the same period * 100%
  • Month on month (growth rate) = (current period amount - previous period amount) / previous period amount * 100%

Note: the current period here refers to the completion of the current month or the current month, and the number of the previous period refers to the completion of the previous month.

Sample data:

Note: for the convenience of demonstration, the data source of this case only uses the data of 2 years and 5 months a year.

1. (this year) cumulative

In statistical analysis and development, it is a common requirement to accumulate some statistical data by year and month. For data, it is to accumulate data line by line according to rules.

The cumsum() function in Pandas can accumulate demand according to a certain time dimension.

# Take the cumulative value of this year
import pandas as pd

df = pd.read_csv('data2021.csv')
cum_columns_name = ['cum_churncount','cum_newcount']
df[cum_columns_name] = df[['years','churncount','newcount']].groupby(['years']).cumsum()

Note: the grouping 'years' refers to the cumulative time dimension of the year.

The calculation results are as follows:

2. (last year) cumulative in the same period

For the cumulative value of the same period (last year), the data of the same month of the cumulative value of the previous year will be directly obtained. pandas DataFrame. The shift () function can move the data by a specified number of rows.

Continue the above column and read the data of the same period. First, move 'yearmonth' up five rows to get a new DataFrame, as shown in the figure above. Use 'yearmonth' to associate the data of two tables (left Association: the original table is on the left and the new table is on the right), so as to achieve the effect of de synchronization data.

cum_columns_dict = {'cum_churncount':'cum_same_period_churncount',
df_cum_same_period = df[['cum_churncount','cum_newcount','yearmonth']].copy()
df_cum_same_period = df_cum_same_period.rename(columns=cum_columns_dict)
#df_cum_same_period.loc[:,'yearmonth'] = df_cum_same_period['yearmonth'].shift(-12) # 12 months a year
df_cum_same_period.loc[:,'yearmonth'] = df_cum_same_period['yearmonth'].shift(-5)   # Because only five months of data are taken

df = pd.merge(left=df,right=df_cum_same_period,on='yearmonth',how='left')

3. Last month (completed)

Take the data of last month and use pandas dataframe The shift() function moves the data by a specified number of rows.

Continue the above column and read the data of the previous period. (same as the synchronization principle, omitted)

last_mnoth_columns_dict = {'churncount':'last_month_churncount',
df_last_month = df[['churncount','newcount','yearmonth']].copy()
df_last_month = df_last_month.rename(columns=last_mnoth_columns_dict)

df_last_month.loc[:,'yearmonth'] = df_last_month['yearmonth'].shift(-1)  # Move one line
df = pd.merge(left=df,right=df_last_month,on='yearmonth',how='left')

4. Year on year (growth rate)

The year-on-year calculation involves division, and the data with divisor of zero needs to be eliminated.

df.fillna(0,inplace=True) # Null values are filled with 0 
# Calculate year-on-year
df.loc[df['cum_same_period_churncount']!=0,'cum_churncount_rat'] = (df['cum_churncount']-df['cum_same_period_churncount'])/df['cum_same_period_churncount'] # Divisor cannot be zero
df.loc[df['cum_same_period_newcount']!=0,'cum_newcount_rat'] =  (df['cum_newcount']-df['cum_same_period_newcount'])/df['cum_same_period_newcount'] # Divisor cannot be zero

5. Month on month (growth rate)

# Calculate month on month ratio
df.loc[df['last_month_churncount']!=0,'churncount_rat'] = (df['churncount']-df['last_month_churncount'])/df['last_month_churncount'] # Divisor cannot be zero
df.loc[df['last_month_newcount']!=0,'newcount_rat'] =  (df['newcount']-df['last_month_newcount'])/df['last_month_newcount'] # Divisor cannot be zero

6. Summary

pandas has many statistical calculation functions and methods. The techniques summarized here include cumulative cumsum() function, mobile data shift() function, table merge Association merge() function, and modifying data through loc conditions.

Keywords: Python Data Analysis pandas

Added by jamiethehill on Fri, 18 Feb 2022 05:54:32 +0200