Long format dataframe and wide format dataframe are two forms of data frames, which appear frequently in data analysis. In the process of data processing,
It is often necessary to switch between the two. Based on pandas, this paper introduces the mutual conversion operation of long data and wide data.
environment
- python3.9
- win10 64bit
- pandas==1.2.1
Width to length
In pandas, the data from wide type to long type includes melt and wide_to_long two methods.
melt
melt method is called data fusion. It is a method owned by dataFrame and is frequently used. The parameters are explained as follows:
DataFrame.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)
- id_vars:[tuple, list, ndarray], identifier variable in column, not involved in fusion.
- value_vars:[tuple, list, ndarray], variables in the column are fused. All variables are fused by default.
- var_name:[scalar], the name of the variable after fusion. The default is variable.
- value_name:[scalar], the name of the value after fusion, default value.
- col_level:[int, str], select columns for multiple column indexes.
- ignore_index:[bool], whether to reorder the index after fusion. The default is True.
import pandas as pd pd.set_option('display.notebook_repr_html',False) # Wide data w_df = pd.DataFrame({'A': [1,2,3], 'B': [4,5,6], 'C': [7,8,9]}) w_df
A B C 0 1 4 7 1 2 5 8 2 3 6 9
- When no parameters are passed in, all columns will be merged by default.
# All fusion w_df.melt()
variable value 0 A 1 1 A 2 2 A 3 3 B 4 4 B 5 5 B 6 6 C 7 7 C 8 8 C 9
- Set id_vars parameter, select some columns as identifiers and do not participate in the fusion, and all the remaining columns will be fused.
# A identification, B, C fusion w_df.melt(id_vars=['A'])
A variable value 0 1 B 4 1 2 B 5 2 3 B 6 3 1 C 7 4 2 C 8 5 3 C 9
# A. B identification, C fusion w_df.melt(id_vars=['A','B'])
A B variable value 0 1 4 C 7 1 2 5 C 8 2 3 6 C 9
- Set value_vars parameter, select some columns as fusion columns.
Note that the remaining columns are not automatically treated as identifier columns.
# Fusion only A w_df.melt(value_vars=['A'])
variable value 0 A 1 1 A 2 2 A 3
# Fusion only A,B w_df.melt(value_vars=['A','B'])
variable value 0 A 1 1 A 2 2 A 3 3 B 4 4 B 5 5 B 6
- Set var_ Name (default variable), value_ The name (default value) parameter sets the name of the fused variable and value.
# Set the variable name and value name after fusion w_df.melt(var_name='code',value_name='count')
code count 0 A 1 1 A 2 2 A 3 3 B 4 4 B 5 5 B 6 6 C 7 7 C 8 8 C 9
- Set ignore_index=False can retain the index of the original data.
w_df.melt(ignore_index=False)
variable value 0 A 1 1 A 2 2 A 3 0 B 4 1 B 5 2 B 6 0 C 7 1 C 8 2 C 9
- Set col_level parameter, you can select multiple column index data to fuse data.
# Column multiple index data mi_w_df=w_df.copy() mi_w_df.columns=[list('ABC'),list('DEF')] mi_w_df
A B C D E F 0 1 4 7 1 2 5 8 2 3 6 9
# Merge first index column mi_w_df.melt(col_level=0)
variable value 0 A 1 1 A 2 2 A 3 3 B 4 4 B 5 5 B 6 6 C 7 7 C 8 8 C 9
# Merge second index column mi_w_df.melt(col_level=1)
variable value 0 D 1 1 D 2 2 D 3 3 E 4 4 E 5 5 E 6 6 F 7 7 F 8 8 F 9
wide_to_long
wide_ to_ The long function comes with pandas and is a supplement to melt. It is more applicable in the case of special width to length conversion.
pandas.wide_to_long(df, stubnames, i, j, sep='', suffix='\d+')
- df:[pd.dataframe], wide data frame
- Stub names: [STR, list like], stub name in column name
- i: [STR, list like], index variable in column
- j:[str], rename of suffix
- sep:[str,default ''], separator between stub name and suffix
- suffix:[str,default "\ d +"], suffix
# Wide data s_df = pd.DataFrame({"A1970" : [1,33,3], "B1980" : [3,5,7], "A1980" : [13,15,17], "B1970" : [6,8,14], "x" : [1,2,3], "y" : [4,5,6]}) s_df
A1970 B1980 A1980 B1970 x y 0 1 3 13 6 1 4 1 33 5 15 8 2 5 2 3 7 17 14 3 6
In the data, the names of a1970, b1980, a1980 and b1970 have the same structure. If you need to separate them, you can use long_to_wide function.
# Width to length of a specific column pd.wide_to_long(s_df,stubnames=['A','B'],j='year',i='x')
y A B x year 1 1970 4 1 6 1980 4 13 3 2 1970 5 33 8 1980 5 15 5 3 1970 6 3 14 1980 6 17 7
- Set stubnames, and the function will match the target column in the data column according to the set characters, and then convert it into long data
# Convert only columns containing A pd.wide_to_long(s_df,stubnames=['A',],j='year',i='x')
B1970 y B1980 A x year 1 1970 6 4 3 1 2 1970 8 5 5 33 3 1970 14 6 7 3 1 1980 6 4 3 13 2 1980 8 5 5 15 3 1980 14 6 7 17
If the character set by the stubnames parameter cannot be found in the column of the original data frame, an empty data frame is returned.
# If there is no C character in the column name, an empty data frame is returned pd.wide_to_long(s_df,stubnames=['C',],j='year',i='x')
Empty DataFrame Columns: [B1970, y, A1980, B1980, A1970, C] Index: []
- Parameter i can be set to multiple columns to return multiple indexes.
# Set multiple indexes pd.wide_to_long(s_df,stubnames=['A','B'],j='year',i=['x','y'])
A B x y year 1 4 1970 1 6 1980 13 3 2 5 1970 33 8 1980 15 5 3 6 1970 3 14 1980 17 7
- The parameter sep represents the separator. The default is "", which can be set according to the actual situation.
# Wide data (- separator) sep_df = pd.DataFrame({"A-1970" : [1,33,3], "B-1980" : [3,5,7], "A-1980" : [13,15,17], "B-1970" : [6,8,14], "x" : [1,2,3], "y" : [4,5,6]}) sep_df
A-1970 B-1980 A-1980 B-1970 x y 0 1 3 13 6 1 4 1 33 5 15 8 2 5 2 3 7 17 14 3 6
If you want to convert the column name to sep = -, you need to set the separator in the column name.
# Set sep parameters pd.wide_to_long(sep_df,stubnames=['A','B'],j='year',i='x',sep='-')
y A B x year 1 1970 4 1 6 1980 4 13 3 2 1970 5 33 8 1980 5 15 5 3 1970 6 3 14 1980 6 17 7
- The parameter suffix refers to the suffix, which is "\ d +" by default. It is a regular expression and represents a matching number, which can be replaced according to the actual situation.
# Wide data suf_df = pd.DataFrame({"Aone" : [1,33,3], "Btwo" : [3,5,7], "Atwo" : [13,15,17], "Bone" : [6,8,14], "x" : [1,2,3], "y" : [4,5,6]}) suf_df
Aone Btwo Atwo Bone x y 0 1 3 13 6 1 4 1 33 5 15 8 2 5 2 3 7 17 14 3 6
# Specify suffix pd.wide_to_long(suf_df,stubnames=['A','B'],j='year',i='x',suffix='(one|two)')
y A B x year 1 one 4 1 6 two 4 13 3 2 one 5 33 8 two 5 15 5 3 one 6 3 14 two 6 17 7
Length to width
The conversion of long data to wide data can be realized through the function of pivot, which is similar to the function of pivot table in excel. It is implemented by pivot method in pandas.
DataFrame.pivot(index=None, columns=None, values=None)
- Index: [STR, object, a list of STR], index of perspective
- Columns: [STR, object, a list of STR], perspective columns
- Values: [STR, object, a list of the previous]
# Long data l_df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two','two'], 'bar': ['A', 'B', 'C', 'A', 'B', 'C'], 'cat':['alpha','alpha','alpha','beta','beta','beta'], 'baz': [1, 2, 3, 4, 5, 6], 'zoo': [4, 6, 8, 1, 2, 9]}) l_df
foo bar cat baz zoo 0 one A alpha 1 4 1 one B alpha 2 6 2 one C alpha 3 8 3 two A beta 4 1 4 two B beta 5 2 5 two C beta 6 9
Select foo column as the index after perspective, bar column as the column of perspective, the elements inside will be expanded into the column of the new data frame, and baz as the value of perspective and filled in the new data frame.
# Pivot data l_df.pivot(index='foo',columns='bar',values='baz')
bar A B C foo one 1 2 3 two 4 5 6
- Set index to multiple column names, and the PivotTable will have multiple row indexes.
# Multi index Perspective l_df.pivot(index=['foo','bar'],columns='cat',values='baz')
cat alpha beta foo bar one A 1.0 NaN B 2.0 NaN C 3.0 NaN two A NaN 4.0 B NaN 5.0 C NaN 6.0
- Set columns to multiple column names, and the PivotTable will have multiple column indexes.
# Multi column Perspective l_df.pivot(index='foo',columns=['bar','cat'],values='baz')
bar A B C A B C cat alpha alpha alpha beta beta beta foo one 1.0 2.0 3.0 NaN NaN NaN two NaN NaN NaN 4.0 5.0 6.0
- Set values to multiple column names.
l_df.pivot(index='foo',columns='bar',values=['baz','zoo'])
baz zoo bar A B C A B C foo one 1 2 3 4 6 8 two 4 5 6 1 2 9