python--pandas length width data conversion

Long format dataframe and wide format dataframe are two forms of data frames, which appear frequently in data analysis. In the process of data processing,
It is often necessary to switch between the two. Based on pandas, this paper introduces the mutual conversion operation of long data and wide data.

environment

  • python3.9
  • win10 64bit
  • pandas==1.2.1

Width to length

In pandas, the data from wide type to long type includes melt and wide_to_long two methods.

melt

melt method is called data fusion. It is a method owned by dataFrame and is frequently used. The parameters are explained as follows:

DataFrame.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)

  • id_vars:[tuple, list, ndarray], identifier variable in column, not involved in fusion.
  • value_vars:[tuple, list, ndarray], variables in the column are fused. All variables are fused by default.
  • var_name:[scalar], the name of the variable after fusion. The default is variable.
  • value_name:[scalar], the name of the value after fusion, default value.
  • col_level:[int, str], select columns for multiple column indexes.
  • ignore_index:[bool], whether to reorder the index after fusion. The default is True.
import pandas as pd
pd.set_option('display.notebook_repr_html',False)
# Wide data
w_df = pd.DataFrame({'A': [1,2,3],
                   'B': [4,5,6],
                   'C': [7,8,9]})
w_df
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
  • When no parameters are passed in, all columns will be merged by default.
# All fusion
w_df.melt()
  variable  value
0        A      1
1        A      2
2        A      3
3        B      4
4        B      5
5        B      6
6        C      7
7        C      8
8        C      9
  • Set id_vars parameter, select some columns as identifiers and do not participate in the fusion, and all the remaining columns will be fused.
# A identification, B, C fusion
w_df.melt(id_vars=['A'])
   A variable  value
0  1        B      4
1  2        B      5
2  3        B      6
3  1        C      7
4  2        C      8
5  3        C      9
# A. B identification, C fusion
w_df.melt(id_vars=['A','B'])
   A  B variable  value
0  1  4        C      7
1  2  5        C      8
2  3  6        C      9
  • Set value_vars parameter, select some columns as fusion columns.

Note that the remaining columns are not automatically treated as identifier columns.

# Fusion only A
w_df.melt(value_vars=['A'])
  variable  value
0        A      1
1        A      2
2        A      3
# Fusion only A,B
w_df.melt(value_vars=['A','B'])
  variable  value
0        A      1
1        A      2
2        A      3
3        B      4
4        B      5
5        B      6
  • Set var_ Name (default variable), value_ The name (default value) parameter sets the name of the fused variable and value.
# Set the variable name and value name after fusion
w_df.melt(var_name='code',value_name='count')
  code  count
0    A      1
1    A      2
2    A      3
3    B      4
4    B      5
5    B      6
6    C      7
7    C      8
8    C      9
  • Set ignore_index=False can retain the index of the original data.
w_df.melt(ignore_index=False)
  variable  value
0        A      1
1        A      2
2        A      3
0        B      4
1        B      5
2        B      6
0        C      7
1        C      8
2        C      9
  • Set col_level parameter, you can select multiple column index data to fuse data.
# Column multiple index data
mi_w_df=w_df.copy()
mi_w_df.columns=[list('ABC'),list('DEF')]
mi_w_df
   A  B  C
   D  E  F
0  1  4  7
1  2  5  8
2  3  6  9
# Merge first index column
mi_w_df.melt(col_level=0)
  variable  value
0        A      1
1        A      2
2        A      3
3        B      4
4        B      5
5        B      6
6        C      7
7        C      8
8        C      9
# Merge second index column
mi_w_df.melt(col_level=1)
  variable  value
0        D      1
1        D      2
2        D      3
3        E      4
4        E      5
5        E      6
6        F      7
7        F      8
8        F      9

wide_to_long

wide_ to_ The long function comes with pandas and is a supplement to melt. It is more applicable in the case of special width to length conversion.

pandas.wide_to_long(df, stubnames, i, j, sep='', suffix='\d+')

  • df:[pd.dataframe], wide data frame
  • Stub names: [STR, list like], stub name in column name
  • i: [STR, list like], index variable in column
  • j:[str], rename of suffix
  • sep:[str,default ''], separator between stub name and suffix
  • suffix:[str,default "\ d +"], suffix
# Wide data
s_df = pd.DataFrame({"A1970" : [1,33,3],
                   "B1980" : [3,5,7],
                   "A1980" : [13,15,17],
                   "B1970" : [6,8,14],
                   "x"     : [1,2,3],
                   "y"     : [4,5,6]})
s_df
   A1970  B1980  A1980  B1970  x  y
0      1      3     13      6  1  4
1     33      5     15      8  2  5
2      3      7     17     14  3  6

In the data, the names of a1970, b1980, a1980 and b1970 have the same structure. If you need to separate them, you can use long_to_wide function.

# Width to length of a specific column
pd.wide_to_long(s_df,stubnames=['A','B'],j='year',i='x')
        y   A   B
x year           
1 1970  4   1   6
  1980  4  13   3
2 1970  5  33   8
  1980  5  15   5
3 1970  6   3  14
  1980  6  17   7
  • Set stubnames, and the function will match the target column in the data column according to the set characters, and then convert it into long data
# Convert only columns containing A
pd.wide_to_long(s_df,stubnames=['A',],j='year',i='x')
        B1970  y  B1980   A
x year                     
1 1970      6  4      3   1
2 1970      8  5      5  33
3 1970     14  6      7   3
1 1980      6  4      3  13
2 1980      8  5      5  15
3 1980     14  6      7  17

If the character set by the stubnames parameter cannot be found in the column of the original data frame, an empty data frame is returned.

# If there is no C character in the column name, an empty data frame is returned
pd.wide_to_long(s_df,stubnames=['C',],j='year',i='x')
Empty DataFrame
Columns: [B1970, y, A1980, B1980, A1970, C]
Index: []
  • Parameter i can be set to multiple columns to return multiple indexes.
# Set multiple indexes
pd.wide_to_long(s_df,stubnames=['A','B'],j='year',i=['x','y'])
           A   B
x y year        
1 4 1970   1   6
    1980  13   3
2 5 1970  33   8
    1980  15   5
3 6 1970   3  14
    1980  17   7
  • The parameter sep represents the separator. The default is "", which can be set according to the actual situation.
# Wide data (- separator)
sep_df = pd.DataFrame({"A-1970" : [1,33,3],
                   "B-1980" : [3,5,7],
                   "A-1980" : [13,15,17],
                   "B-1970" : [6,8,14],
                   "x"     : [1,2,3],
                   "y"     : [4,5,6]})
sep_df
   A-1970  B-1980  A-1980  B-1970  x  y
0       1       3      13       6  1  4
1      33       5      15       8  2  5
2       3       7      17      14  3  6

If you want to convert the column name to sep = -, you need to set the separator in the column name.

# Set sep parameters
pd.wide_to_long(sep_df,stubnames=['A','B'],j='year',i='x',sep='-')
        y   A   B
x year           
1 1970  4   1   6
  1980  4  13   3
2 1970  5  33   8
  1980  5  15   5
3 1970  6   3  14
  1980  6  17   7
  • The parameter suffix refers to the suffix, which is "\ d +" by default. It is a regular expression and represents a matching number, which can be replaced according to the actual situation.
# Wide data
suf_df = pd.DataFrame({"Aone" : [1,33,3],
                   "Btwo" : [3,5,7],
                   "Atwo" : [13,15,17],
                   "Bone" : [6,8,14],
                   "x"     : [1,2,3],
                   "y"     : [4,5,6]})
suf_df
   Aone  Btwo  Atwo  Bone  x  y
0     1     3    13     6  1  4
1    33     5    15     8  2  5
2     3     7    17    14  3  6
# Specify suffix
pd.wide_to_long(suf_df,stubnames=['A','B'],j='year',i='x',suffix='(one|two)')
        y   A   B
x year           
1 one   4   1   6
  two   4  13   3
2 one   5  33   8
  two   5  15   5
3 one   6   3  14
  two   6  17   7

Length to width

The conversion of long data to wide data can be realized through the function of pivot, which is similar to the function of pivot table in excel. It is implemented by pivot method in pandas.

DataFrame.pivot(index=None, columns=None, values=None)

  • Index: [STR, object, a list of STR], index of perspective
  • Columns: [STR, object, a list of STR], perspective columns
  • Values: [STR, object, a list of the previous]
# Long data
l_df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two','two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'cat':['alpha','alpha','alpha','beta','beta','beta'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': [4, 6, 8, 1, 2, 9]})
l_df
   foo bar    cat  baz  zoo
0  one   A  alpha    1    4
1  one   B  alpha    2    6
2  one   C  alpha    3    8
3  two   A   beta    4    1
4  two   B   beta    5    2
5  two   C   beta    6    9

Select foo column as the index after perspective, bar column as the column of perspective, the elements inside will be expanded into the column of the new data frame, and baz as the value of perspective and filled in the new data frame.

# Pivot data
l_df.pivot(index='foo',columns='bar',values='baz')
bar  A  B  C
foo         
one  1  2  3
two  4  5  6
  • Set index to multiple column names, and the PivotTable will have multiple row indexes.
# Multi index Perspective
l_df.pivot(index=['foo','bar'],columns='cat',values='baz')
cat      alpha  beta
foo bar             
one A      1.0   NaN
    B      2.0   NaN
    C      3.0   NaN
two A      NaN   4.0
    B      NaN   5.0
    C      NaN   6.0
  • Set columns to multiple column names, and the PivotTable will have multiple column indexes.
# Multi column Perspective
l_df.pivot(index='foo',columns=['bar','cat'],values='baz')
bar     A     B     C    A    B    C
cat alpha alpha alpha beta beta beta
foo                                 
one   1.0   2.0   3.0  NaN  NaN  NaN
two   NaN   NaN   NaN  4.0  5.0  6.0
  • Set values to multiple column names.
l_df.pivot(index='foo',columns='bar',values=['baz','zoo'])

    baz       zoo      
bar   A  B  C   A  B  C
foo                    
one   1  2  3   4  6  8
two   4  5  6   1  2  9

Keywords: Python Data Analysis

Added by sunshine66 on Thu, 03 Mar 2022 20:26:40 +0200