Pandas data processing: an inventory of commonly used functions

This article sorts out some high-frequency Pandas usage patterns from practical work (time series processing will be covered in a separate article). One point worth emphasizing: beginners should not try to chew through every method in Pandas. That is inefficient, many methods are rarely needed in everyday work, and they are easy to forget. A better approach is to learn the common methods first, dive straight into a project, and consult the official documentation whenever the methods you know aren't enough.

Using "artificial intelligence" (that is, picking them by hand), I selected some common methods from the official documentation, more than 20 in total. Beginners can start by getting familiar with these. To keep the reading manageable, this article introduces the first 10.

The data used for the demonstration are as follows:

In [15]: data
Out[15]:
  company  salary  age
0     NaN      43   21
1       A       8   41
2       A      28   26
3       C      42   28
4       A      33   26
5       C      20   18
6       A      48   43
7       B      25   23
8       B      39   18

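If you want to follow along, the demo DataFrame can be rebuilt from the table above. The article does not show the original construction code, so this is just one possible reconstruction:

import numpy as np
import pandas as pd

# Rebuild the demo data; np.nan marks the missing company in row 0
data = pd.DataFrame({
    'company': [np.nan, 'A', 'A', 'C', 'A', 'C', 'A', 'B', 'B'],
    'salary': [43, 8, 28, 42, 33, 20, 48, 25, 39],
    'age': [21, 41, 26, 28, 26, 18, 43, 23, 18],
})
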
.head()

Applies to: Series and DataFrame

Main purpose: return the first N rows of a DataFrame. When a dataset is large, .head() gives you a quick first impression of the data.

Usage:

# The first 5 rows are returned by default; pass n to choose how many
In [16]: data.head()
Out[16]:
  company  salary  age
0     NaN      43   21
1       A       8   41
2       A      28   26
3       C      42   28
4       A      33   26

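N can be any positive integer; for example, a quick sketch of fetching only the first three rows:

# Pass n explicitly to control how many rows come back
data.head(3)
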
.info()

Applies to: Series and DataFrame

Main purpose: print basic information about the data, including the index and column dtypes, non-null counts, and memory usage.

Usage:

In [17]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
company    8 non-null object
salary     9 non-null int32
age        9 non-null int32
dtypes: int32(2), object(1)
memory usage: 224.0+ bytes

.describe()

Applies to: Series and DataFrame

Main purpose: generate a descriptive statistical summary, including counts and percentiles, which helps you get a feel for the approximate distribution of the data.

Usage:

# By default, descriptive statistics are generated for numeric columns only
# Pass include='all' to cover every column
In [18]: data.describe()
Out[18]:
          salary        age
count   9.000000   9.000000
mean   31.777778  27.111111
std    12.804079   9.143911
min     8.000000  18.000000
25%    25.000000  21.000000
50%    33.000000  26.000000
75%    42.000000  28.000000
max    48.000000  43.000000

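As the comment above notes, include='all' extends the summary beyond numeric columns; a minimal sketch:

# include='all' also summarizes object columns such as company,
# reporting count/unique/top/freq for them
data.describe(include='all')
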
.value_counts()

Applies to: Series

Main purpose: count the occurrences of each category in a categorical variable, e.g., the number of employees in each company.

Main parameters:

  • normalize (boolean, default False)
    Return the proportion of each category instead of the count
  • sort (boolean, default True)
    Sort the result by count
  • ascending (boolean, default False)
    Sort in ascending order

Usage:

In [19]: data['company'].value_counts()
Out[19]:
A    4
B    2
C    2
Name: company, dtype: int64

# Return the proportion of each category
In [20]: data['company'].value_counts(normalize=True)
Out[20]:
A    0.50
B    0.25
C    0.25
Name: company, dtype: float64

# Sort the result in ascending order
In [21]: data['company'].value_counts(ascending=True)
Out[21]:
C    2
B    2
A    4
Name: company, dtype: int64

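One more parameter worth knowing about (not listed above): dropna. By default value_counts() ignores NaN; setting dropna=False counts missing values as a category of their own:

# Also count NaN as a category (dropna defaults to True)
data['company'].value_counts(dropna=False)
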
.isna()

Applies to: Series and DataFrame

Main purpose: check whether each value is missing, returning True for missing values and False otherwise.

Usage:

In [22]: data.isna()
Out[22]:
   company  salary    age
0     True   False  False
1    False   False  False
2    False   False  False
3    False   False  False
4    False   False  False
5    False   False  False
6    False   False  False
7    False   False  False
8    False   False  False

.any()

Applies to: Series and DataFrame

Main purpose: with a large dataset it is impractical to eyeball the full output of isna(), so .any() is usually chained after it to check whether each column contains any missing value at all.

Usage:

In [23]: data.isna().any()
Out[23]:
company     True
salary     False
age        False
dtype: bool

.dropna()

Applies to: Series and DataFrame

Main purpose: drop rows (by default) that contain missing values.

Usage:

In [24]: data.dropna()
Out[24]:
  company  salary  age
1       A       8   41
2       A      28   26
3       C      42   28
4       A      33   26
5       C      20   18
6       A      48   43
7       B      25   23
8       B      39   18

.fillna()

Applies to: Series and DataFrame

Main purpose: fill in missing values.

Main parameters:

  • value (scalar, dict, Series, or DataFrame)
    The value used to fill the missing entries
  • method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None)
    'bfill'/'backfill' fills a missing value with the next valid value; 'ffill'/'pad' fills it with the previous one
  • inplace (boolean, default False)
    Whether to modify the original object in place

Usage:

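# Fill every missing value with the constant 'B'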
In [26]: data.fillna('B')
Out[26]:
  company  salary  age
0       B      43   21
1       A       8   41
2       A      28   26
3       C      42   28
4       A      33   26
5       C      20   18
6       A      48   43
7       B      25   23
8       B      39   18

# Fill each missing value with the value after it (here the NaN is followed by 'A')
In [25]: data.fillna(method='bfill')
Out[25]:
  company  salary  age
0       A      43   21
1       A       8   41
2       A      28   26
3       C      42   28
4       A      33   26
5       C      20   18
6       A      48   43
7       B      25   23
8       B      39   18

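The forward direction works the same way. Note that the NaN here sits in row 0, which has no preceding value, so ffill would leave it unfilled. Also, recent pandas versions deprecate fillna(method=...) in favor of the dedicated .ffill()/.bfill() methods:

# Fill each missing value with the value before it;
# the NaN in row 0 has nothing before it and stays NaN
data.fillna(method='ffill')

# Equivalent, and preferred in newer pandas versions
data.ffill()
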
.sort_index()

Applies to: Series and DataFrame

Main purpose: sort the data by its index.

Main parameters:

  • ascending (boolean, default True)
    Sort in ascending order
  • inplace (boolean, default False)
    Whether to modify the original object in place

Usage:

# Sort by index in descending order
In [27]: data.sort_index(ascending=False)
Out[27]:
  company  salary  age
8       B      39   18
7       B      25   23
6       A      48   43
5       C      20   18
4       A      33   26
3       C      42   28
2       A      28   26
1       A       8   41
0     NaN      43   21

.sort_values()

Applies to: Series and DataFrame

Main purpose: sort a DataFrame by one or more of its columns (specified with the by parameter); a Series is sorted by its values.

Main parameters:

  • by (str or list of str)
    The column(s) to sort by; required when acting on a DataFrame
  • ascending (boolean, default True)
    Sort in ascending order

Usage:

In [28]: data.sort_values(by='salary')
Out[28]:
  company  salary  age
1       A       8   41
5       C      20   18
7       B      25   23
2       A      28   26
4       A      33   26
8       B      39   18
3       C      42   28
0     NaN      43   21
6       A      48   43

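Since by also accepts a list of column names, you can sort on several keys at once; a quick sketch:

# Sort by company first, then by salary in descending order within each company
data.sort_values(by=['company', 'salary'], ascending=[True, False])
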
Next come more common Pandas methods. They are a bit more advanced than the ones above, but still fairly easy to understand. Without further ado, let's get straight to the subject.

The data used for the demonstration are as follows:

In [11]: data
Out[11]:
  company  gender  salary   age
0       B  female      30  40.0
1       A  female      36  31.0
2       B  female      35  28.0
3       B  female       9  18.0
4       B  female      16  43.0
5       A    male      46  22.0
6       B  female      15  28.0
7       B  female      33  40.0
8       C    male      19  32.0

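As before, the article does not show the construction code for this frame; one possible reconstruction if you want to follow along:

import pandas as pd

data = pd.DataFrame({
    'company': ['B', 'A', 'B', 'B', 'B', 'A', 'B', 'B', 'C'],
    'gender': ['female', 'female', 'female', 'female', 'female',
               'male', 'female', 'female', 'male'],
    'salary': [30, 36, 35, 9, 16, 46, 15, 33, 19],
    'age': [40.0, 31.0, 28.0, 18.0, 43.0, 22.0, 28.0, 40.0, 32.0],
})
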
.astype()

Applies to: Series and DataFrame

Main purpose: change a field's data type. With large datasets this can reduce the memory the data occupies, e.g., by downcasting to a smaller dtype. It is mostly used on a Series.

Usage:

# Convert the age field to int
In [12]: data["age"] = data["age"].astype(int)

In [13]: data
Out[13]:
  company  gender  salary  age
0       B  female      30   40
1       A  female      36   31
2       B  female      35   28
3       B  female       9   18
4       B  female      16   43
5       A    male      46   22
6       B  female      15   28
7       B  female      33   40
8       C    male      19   32

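To see the memory effect mentioned above, compare memory_usage() before and after downcasting. A sketch; int8 is safe here only because every age fits in the range -128 to 127:

# memory_usage() also counts the index unless index=False is passed
data['age'].memory_usage(index=False)                 # 8 bytes per value if int64
data['age'].astype('int8').memory_usage(index=False)  # 1 byte per value
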
.rename()

Applies to: Series and DataFrame (in practice, mostly DataFrame)

Main purpose: mainly used to rename the columns of a DataFrame.

Main parameters:

  • columns (dict-like or function)
    A mapping from old column names to new ones, usually passed as a dictionary
  • inplace (boolean, default False)
    Whether to modify the original object in place

Usage:

# Rename 'age' to 'number' (employee number), modifying the original object in place
In [15]: data.rename(columns={'age':'number'},inplace=True)

In [16]: data
Out[16]:
  company  gender  salary  number
0       B  female      30      40
1       A  female      36      31
2       B  female      35      28
3       B  female       9      18
4       B  female      16      43
5       A    male      46      22
6       B  female      15      28
7       B  female      33      40
8       C    male      19      32

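Since columns also accepts a function, you can transform every column name at once; for example:

# Apply a function to every column label, e.g. uppercase all names
data.rename(columns=str.upper)
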
.set_index()

Applies to: DataFrame

Main purpose: set one or more DataFrame columns as the index.

Usage:

In [19]: data.set_index('number',inplace=True)

In [20]: data
Out[20]:
       company  gender  salary
number
40           B  female      30
31           A  female      36
28           B  female      35
18           B  female       9
43           B  female      16
22           A    male      46
28           B  female      15
40           B  female      33
32           C    male      19

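Passing a list of columns instead produces a hierarchical MultiIndex. Note that set_index() discards the existing index by default; its append=True parameter keeps it. A sketch:

# A list of columns creates a MultiIndex; append=True would keep the current index
data.set_index(['company', 'gender'])
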
.reset_index()

Applies to: Series and DataFrame

Main purpose: reset the index; after resetting, the default index runs from 0 to len(df)-1.

Main parameters:

  • drop (boolean, default False)
    Whether to discard the original index (see the demonstration below)
  • inplace (boolean, default False)
    Whether to modify the original object in place

Usage:

# drop=True: reset the index and discard the original index
In [22]: data.reset_index(drop=True)
Out[22]:
  company  gender  salary
0       B  female      30
1       A  female      36
2       B  female      35
3       B  female       9
4       B  female      16
5       A    male      46
6       B  female      15
7       B  female      33
8       C    male      19

# drop=False: reset the index and keep the original index
# The original index column 'number' re-enters the DataFrame as a regular column
In [23]: data.reset_index(drop=False,inplace=True)

In [24]: data
Out[24]:
   number company  gender  salary
0      40       B  female      30
1      31       A  female      36
2      28       B  female      35
3      18       B  female       9
4      43       B  female      16
5      22       A    male      46
6      28       B  female      15
7      40       B  female      33
8      32       C    male      19

.drop_duplicates()

Applies to: Series and DataFrame

Main purpose: remove duplicate values, similar to DISTINCT in SQL.

Usage:

In [26]: data['company'].drop_duplicates()
Out[26]:
0    B
1    A
8    C
Name: company, dtype: object

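On a DataFrame, rows are compared across all columns by default; the subset parameter (a real parameter, just not shown above) restricts the comparison to specific columns:

# Keep only the first row encountered for each company
data.drop_duplicates(subset=['company'])
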
.drop()

Applies to: Series and DataFrame

Main purpose: often used to delete certain fields (columns) from a DataFrame.

Main parameters:

  • columns (single label or list-like)
    Specify the fields to delete

Usage:

# Delete the 'gender' column
In [27]: data.drop(columns = ['gender'])
Out[27]:
   number company  salary
0      40       B      30
1      31       A      36
2      28       B      35
3      18       B       9
4      43       B      16
5      22       A      46
6      28       B      15
7      40       B      33
8      32       C      19

.isin()

Applies to: Series and DataFrame

Main purpose: often used to build a Boolean mask for filtering the rows of a DataFrame.

Usage:

# Filter out the employee records of company A and company C
In [29]: data.loc[data['company'].isin(['A','C'])]
Out[29]:
   number company  gender  salary
1      31       A  female      36
5      22       A    male      46
8      32       C    male      19

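The same mask can be inverted with ~ to exclude rather than include; a quick sketch:

# Everyone who is NOT at company A or C
data.loc[~data['company'].isin(['A', 'C'])]
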
pd.cut()

Main purpose: discretize a continuous variable, e.g., dividing ages into intervals.

Main parameters:

  • x (array-like)
    The one-dimensional data to discretize
  • bins (int, sequence of scalars, or IntervalIndex)
    The intervals to split into; either a number of bins or explicit breakpoints
  • labels (array or bool, optional)
    Labels for the resulting intervals

Usage:

# Divide the salary into five ranges
In [33]: pd.cut(data.salary,bins = 5)
Out[33]:
0     (23.8, 31.2]
1     (31.2, 38.6]
2     (31.2, 38.6]
3    (8.963, 16.4]
4    (8.963, 16.4]
5     (38.6, 46.0]
6    (8.963, 16.4]
7     (31.2, 38.6]
8     (16.4, 23.8]
Name: salary, dtype: category
Categories (5, interval[float64]): [(8.963, 16.4] < (16.4, 23.8] < (23.8, 31.2] < (31.2, 38.6] < (38.6, 46.0]]

# Specify your own breakpoints
In [32]: pd.cut(data.salary,bins = [0,10,20,30,40,50])
Out[32]:
0    (20, 30]
1    (30, 40]
2    (30, 40]
3     (0, 10]
4    (10, 20]
5    (40, 50]
6    (10, 20]
7    (30, 40]
8    (10, 20]
Name: salary, dtype: category
Categories (5, interval[int64]): [(0, 10] < (10, 20] < (20, 30] < (30, 40] < (40, 50]]

# Specify labels for the intervals
In [34]: pd.cut(data.salary,bins = [0,10,20,30,40,50],labels = ['low','lower-middle','in','so-so','high'])
Out[34]:
0              in
1           so-so
2           so-so
3             low
4    lower-middle
5            high
6    lower-middle
7           so-so
8    lower-middle
Name: salary, dtype: category
Categories (5, object): [low < lower-middle < in < so-so < high]

pd.qcut()

Main purpose: discretize a continuous variable. The difference from pd.cut() is that pd.cut() splits by explicit values, while pd.qcut() splits by quantiles.

Main parameters:

  • x (array-like)
    The one-dimensional data to discretize
  • q (integer or array of quantiles)
    The number of quantiles to split into, or an array of quantile breakpoints
  • labels (array or boolean, default None)
    Labels for the resulting intervals

Usage:

# Split at the 0-33.33%, 33.33%-66.67% and 66.67%-100% percentiles
In [35]: pd.qcut(data.salary,q = 3)
Out[35]:
0    (18.0, 33.667]
1    (33.667, 46.0]
2    (33.667, 46.0]
3     (8.999, 18.0]
4     (8.999, 18.0]
5    (33.667, 46.0]
6     (8.999, 18.0]
7    (18.0, 33.667]
8    (18.0, 33.667]
Name: salary, dtype: category
Categories (3, interval[float64]): [(8.999, 18.0] < (18.0, 33.667] < (33.667, 46.0]]

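As with pd.cut(), the labels parameter names the resulting buckets; for example:

# Name the three quantile buckets
pd.qcut(data.salary, q=3, labels=['low', 'medium', 'high'])
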
.where()

Applies to: Series and DataFrame

Main purpose: replace values that fail a condition with a specified value, roughly equivalent to an if-else.

Main parameters:

  • cond (boolean Series/DataFrame, array-like, or callable)
    The filtering condition
  • other (scalar, Series/DataFrame, or callable)
    Values for which cond is False are replaced with other

Usage:

# How to read this statement:
# where salary <= 40, the original value is kept
# where salary > 40, it is replaced with 40
In [38]: data['salary'].where(data.salary<=40,40)
Out[38]:
0    30
1    36
2    35
3     9
4    16
5    40
6    15
7    33
8    19
Name: salary, dtype: int32

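As the parameter list notes, cond (and other) may also be callables that receive the Series; the same cap written with a lambda:

# cond as a callable evaluated on the Series
data['salary'].where(lambda s: s <= 40, 40)
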
pd.concat()

Main purpose: concatenate multiple Series or DataFrames, either vertically or horizontally.

Main parameters:

  • objs (a sequence or mapping of Series or DataFrame objects)
    The Series or DataFrames to concatenate, usually passed as a list
  • axis (0/'index', 1/'columns')
    Controls whether concatenation is vertical (axis=0, the default) or horizontal (axis=1)
  • ignore_index (bool, default False)
    Whether to keep the indexes of the original Series or DataFrames. If True, a new 0 to n-1 index is generated for the result

Usage:

# Take the first three and last three rows of data as data1 and data2
In [41]: data1 = data.head(3)
In [42]: data1
Out[42]:
   number company  gender  salary
0      40       B  female      30
1      31       A  female      36
2      28       B  female      35

In [43]: data2 = data.tail(3)
In [44]: data2
Out[44]:
   number company  gender  salary
6      28       B  female      15
7      40       B  female      33
8      32       C    male      19

# Concatenate vertically (the default), keeping the original indexes
In [45]: pd.concat([data1,data2],ignore_index = False)
Out[45]:
   number company  gender  salary
0      40       B  female      30
1      31       A  female      36
2      28       B  female      35
6      28       B  female      15
7      40       B  female      33
8      32       C    male      19

# Concatenate and generate a fresh index
In [46]: pd.concat([data1,data2],ignore_index=True)
Out[46]:
   number company  gender  salary
0      40       B  female      30
1      31       A  female      36
2      28       B  female      35
3      28       B  female      15
4      40       B  female      33
5      32       C    male      19

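The axis parameter mentioned above also allows horizontal splicing. With axis=1, columns are placed side by side and rows are aligned on the index, so data1 (indexes 0-2) and data2 (indexes 6-8) would mostly produce NaN; a sketch:

# Horizontal concatenation: rows aligned by index,
# non-matching indexes yield NaN in the other block's columns
pd.concat([data1, data2], axis=1)
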
.pivot_table()

Applies to: DataFrame

Main purpose: pivot a DataFrame, equivalent to a PivotTable in Excel.

Main parameters:

  • values (column to aggregate, optional)
    The field to aggregate (the target variable of the pivot table)
  • index (column, Grouper, array, or list of the previous)
    Similar to row labels in a PivotTable report
  • columns (column, Grouper, array, or list of the previous)
    Similar to column labels in a PivotTable report
  • aggfunc (function, list of functions, dict, default numpy.mean)
    The aggregation to apply to values

Usage:

# Pivot salary across the company and gender dimensions
# to see the average salary for each combination
# (assumes numpy has been imported as np)
In [47]: data.pivot_table(values = 'salary',index = 'company',
                          columns = 'gender',aggfunc=np.mean)
Out[47]:
gender   female  male
company
A          36.0  46.0
B          23.0   NaN
C           NaN  19.0

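aggfunc is not limited to the mean; as the parameter list notes, it also accepts a list of aggregations (or a dict). For example:

# Headcount and maximum salary per company/gender cell
data.pivot_table(values='salary', index='company',
                 columns='gender', aggfunc=['count', 'max'])
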
That wraps up the commonly used Pandas functions. map and apply are not covered here; see the earlier articles for a detailed introduction. I hope these functions prove helpful to you!
