This article sorts out some of the most frequently used Pandas methods in practice. (Time series processing will get an article of its own.) One point worth emphasizing: beginners should not try to chew through every method in Pandas. That is far too inefficient; many methods are rarely needed day to day and are easily forgotten. The right approach is to understand the common methods first, then dive straight into a project, and consult the official documentation whenever the methods you already know can't handle the task.
Through a bit of "artificial intelligence" (in other words, by hand), I selected some common methods from the official documentation, more than 20 in total. Beginners can try to get familiar with them first. To keep the reading load manageable, this article will introduce the first 10.
The data used for the demonstration are as follows:
In [15]: data
Out[15]:
  company  salary  age
0     NaN      43   21
1       A       8   41
2       A      28   26
3       C      42   28
4       A      33   26
5       C      20   18
6       A      48   43
7       B      25   23
8       B      39   18
.head()
Scope objects: Series and DataFrame
Main purpose: return the first N rows of a DataFrame. When the dataset is large, use .head() to quickly get a general sense of the data.
Usage:
# The first 5 rows are returned by default; N can be set as needed
In [16]: data.head()
Out[16]:
  company  salary  age
0     NaN      43   21
1       A       8   41
2       A      28   26
3       C      42   28
4       A      33   26
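As the comment notes, the number of rows can be passed directly; a quick sketch on the same data:

# Return only the first 3 rows
data.head(3)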
.info()
Scope objects: Series and DataFrame
Main purpose: print basic information about the data, including the data types of the index and columns and the memory usage.
Usage:
In [17]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
company    8 non-null object
salary     9 non-null int32
age        9 non-null int32
dtypes: int32(2), object(1)
memory usage: 224.0+ bytes
.describe()
Scope objects: Series and DataFrame
Main purpose: generate a descriptive statistical summary, including counts and percentiles, which helps in understanding the approximate distribution of the data.
Usage:
# Descriptive statistics for numeric columns are generated by default
# Use include='all' to generate statistics for all columns
In [18]: data.describe()
Out[18]:
          salary        age
count   9.000000   9.000000
mean   31.777778  27.111111
std    12.804079   9.143911
min     8.000000  18.000000
25%    25.000000  21.000000
50%    33.000000  26.000000
75%    42.000000  28.000000
max    48.000000  43.000000
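As the comment above mentions, non-numeric columns can be included as well; a quick sketch (for object columns the summary adds rows such as unique, top, and freq):

# Describe all columns, including the 'company' object column
data.describe(include='all')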
.value_counts()
Scope object: Series
Main purpose: count the occurrences of each category of a categorical variable, such as the number of employees in each company
Main parameters:
- normalize (boolean, default False)
  Return the proportion of each category instead of the count
- sort (boolean, default True)
  Sort the results
- ascending (boolean, default False)
  Sort in ascending order
Usage:
In [19]: data['company'].value_counts()
Out[19]:
A    4
B    2
C    2
Name: company, dtype: int64

# Return the proportion of each category
In [20]: data['company'].value_counts(normalize=True)
Out[20]:
A    0.50
B    0.25
C    0.25
Name: company, dtype: float64

# Sort in ascending order
In [21]: data['company'].value_counts(ascending=True)
Out[21]:
C    2
B    2
A    4
Name: company, dtype: int64
.isna()
Scope objects: Series and DataFrame
Main purpose: determine whether each value is missing, returning True if it is and False otherwise
Usage:
In [22]: data.isna()
Out[22]:
   company  salary    age
0     True   False  False
1    False   False  False
2    False   False  False
3    False   False  False
4    False   False  False
5    False   False  False
6    False   False  False
7    False   False  False
8    False   False  False
.any()
Scope objects: Series and DataFrame
Main purpose: in most cases there is too much data to inspect the output of isna() directly; .any() can be chained after isna() to determine whether each column contains any missing value.
Usage:
In [23]: data.isna().any()
Out[23]:
company     True
salary     False
age        False
dtype: bool
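.any() also takes an axis argument, so the same combination can locate the rows (rather than the columns) that contain missing values; a quick sketch:

# True for each row that contains at least one missing value
rows_with_na = data.isna().any(axis=1)

# Use it as a Boolean index to inspect those rows
data[rows_with_na]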
.dropna()
Scope objects: Series and DataFrame
Main purpose: drop entries (rows, by default) that contain missing values
Usage:
In [24]: data.dropna()
Out[24]:
  company  salary  age
1       A       8   41
2       A      28   26
3       C      42   28
4       A      33   26
5       C      20   18
6       A      48   43
7       B      25   23
8       B      39   18
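dropna() also accepts the standard how and subset parameters, which help when only certain columns matter; a quick sketch:

# Drop a row only when all of its values are missing
data.dropna(how='all')

# Drop rows whose 'company' field is missing, ignoring other columns
data.dropna(subset=['company'])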
.fillna()
Scope objects: Series and DataFrame
Main purpose: fill in missing data
Main parameters:
- value (scalar, dict, Series, or DataFrame)
  The value used to fill in the missing entries
- method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None)
  Fill strategy: 'bfill'/'backfill' fills a missing value with the next valid value, while 'ffill'/'pad' fills it with the previous one
- inplace (boolean, default False)
  Whether to act on the original object
Usage:
In [26]: data.fillna('B')
Out[26]:
  company  salary  age
0       B      43   21
1       A       8   41
2       A      28   26
3       C      42   28
4       A      33   26
5       C      20   18
6       A      48   43
7       B      25   23
8       B      39   18

# Fill with the next valid value (here the NaN is followed by 'A')
In [25]: data.fillna(method='bfill')
Out[25]:
  company  salary  age
0       A      43   21
1       A       8   41
2       A      28   26
3       C      42   28
4       A      33   26
5       C      20   18
6       A      48   43
7       B      25   23
8       B      39   18
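A common pattern is to fill a column with a statistic computed from that column; a minimal sketch (here the most frequent company, since 'company' is categorical):

# Fill missing companies with the most frequent company
data['company'].fillna(data['company'].mode()[0])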
.sort_index()
Scope objects: Series and DataFrame
Main purpose: sort data by index
Main parameters:
- ascending (boolean, default True)
Sort in ascending order - inplace (boolean, default False)
Whether to act on the original object
Usage:
# Sort by index in descending order
In [27]: data.sort_index(ascending=False)
Out[27]:
  company  salary  age
8       B      39   18
7       B      25   23
6       A      48   43
5       C      20   18
4       A      33   26
3       C      42   28
2       A      28   26
1       A       8   41
0     NaN      43   21
.sort_values()
Scope objects: Series and DataFrame
Main purpose: for a DataFrame, sort by the values of one or more columns (controlled by the by parameter); a Series is sorted by its values.
Main parameters:
- by (str or list of str)
  The column(s) to sort by; must be specified when acting on a DataFrame
- ascending (boolean, default True)
  Sort in ascending order
In [28]: data.sort_values(by='salary')
Out[28]:
  company  salary  age
1       A       8   41
5       C      20   18
7       B      25   23
2       A      28   26
4       A      33   26
8       B      39   18
3       C      42   28
0     NaN      43   21
6       A      48   43
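by also accepts a list of columns, with a matching list of ascending flags; a quick sketch:

# Sort by company ascending, then by salary descending within each company
data.sort_values(by=['company', 'salary'], ascending=[True, False])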
Here are some more common Pandas methods. Overall they are slightly harder than those above, but still fairly easy to understand. Without further ado, let's get straight to it.
The data used for the demonstration are as follows:
In [11]: data
Out[11]:
  company  gender  salary   age
0       B  female      30  40.0
1       A  female      36  31.0
2       B  female      35  28.0
3       B  female       9  18.0
4       B  female      16  43.0
5       A    male      46  22.0
6       B  female      15  28.0
7       B  female      33  40.0
8       C    male      19  32.0
.astype()
Scope objects: Series and DataFrame
Main purpose: change the data type of a field. When the dataset is large, this can be used to reduce the memory the data occupies. Mostly used on Series.
Usage:
# Change the age field to type int
In [12]: data["age"] = data["age"].astype(int)

In [13]: data
Out[13]:
  company  gender  salary  age
0       B  female      30   40
1       A  female      36   31
2       B  female      35   28
3       B  female       9   18
4       B  female      16   43
5       A    male      46   22
6       B  female      15   28
7       B  female      33   40
8       C    male      19   32
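To see the memory effect mentioned above, compare Series.memory_usage() before and after downcasting; a small sketch (int8 is an assumption that is safe here only because every age fits in -128..127):

# Memory used by the column at its current integer type vs. int8
data['age'].memory_usage(deep=True)
data['age'].astype('int8').memory_usage(deep=True)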
.rename()
Scope objects: Series and DataFrame (mostly used with DataFrame)
Main purpose: mainly used to rename the columns of a DataFrame
Main parameters:
- columns (dict-like or function)
  Maps the old column names to the new ones; generally passed in as a dictionary
- inplace (boolean, default False)
  Whether to act on the original object
Usage:
# Change 'age' to employee number 'number' and apply to the original object
In [15]: data.rename(columns={'age': 'number'}, inplace=True)

In [16]: data
Out[16]:
  company  gender  salary  number
0       B  female      30      40
1       A  female      36      31
2       B  female      35      28
3       B  female       9      18
4       B  female      16      43
5       A    male      46      22
6       B  female      15      28
7       B  female      33      40
8       C    male      19      32
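Since columns can also be a function, every column name can be transformed at once; a quick sketch (returns a copy, data is unchanged):

# Upper-case every column name
data.rename(columns=str.upper)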
.set_index()
Scope object: DataFrame
Main purpose: set one or more fields of a DataFrame as its index
Usage:
In [19]: data.set_index('number', inplace=True)

In [20]: data
Out[20]:
       company  gender  salary
number
40           B  female      30
31           A  female      36
28           B  female      35
18           B  female       9
43           B  female      16
22           A    male      46
28           B  female      15
40           B  female      33
32           C    male      19
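Passing a list of fields creates a MultiIndex; a quick sketch (reset_index() first so 'number' is a column again, without modifying data):

# Index by company first, then number
data.reset_index().set_index(['company', 'number'])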
.reset_index()
Scope objects: Series and DataFrame
Main purpose: reset the index. The default index after reset is 0~len(df)-1
Main parameters:
- drop (boolean, default False)
  Whether to discard the original index (see the demonstration below)
- inplace (boolean, default False)
  Whether to act on the original object
Usage:
# drop=True: reset the index and discard the original index
In [22]: data.reset_index(drop=True)
Out[22]:
  company  gender  salary
0       B  female      30
1       A  female      36
2       B  female      35
3       B  female       9
4       B  female      16
5       A    male      46
6       B  female      15
7       B  female      33
8       C    male      19

# drop=False: reset the index;
# the original index column 'number' enters the DataFrame as a new field
In [23]: data.reset_index(drop=False, inplace=True)

In [24]: data
Out[24]:
   number company  gender  salary
0      40       B  female      30
1      31       A  female      36
2      28       B  female      35
3      18       B  female       9
4      43       B  female      16
5      22       A    male      46
6      28       B  female      15
7      40       B  female      33
8      32       C    male      19
.drop_duplicates()
Scope objects: Series and DataFrame
Main purpose: remove duplicate values, similar to DISTINCT in SQL
Usage:
In [26]: data['company'].drop_duplicates()
Out[26]:
0    B
1    A
8    C
Name: company, dtype: object
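On a DataFrame, the subset and keep parameters control which columns define a duplicate and which occurrence survives; a quick sketch:

# Keep the last record for each (company, gender) combination
data.drop_duplicates(subset=['company', 'gender'], keep='last')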
.drop()
Scope objects: Series and DataFrame
Main purpose: it is often used to delete some fields in DataFrame
Main parameters:
- columns (single label or list-like)
Specify the fields to delete
Usage:
# Delete the 'gender' column
In [27]: data.drop(columns=['gender'])
Out[27]:
   number company  salary
0      40       B      30
1      31       A      36
2      28       B      35
3      18       B       9
4      43       B      16
5      22       A      46
6      28       B      15
7      40       B      33
8      32       C      19
.isin()
Scope objects: Series and DataFrame
Main purpose: it is often used to build Boolean indexes and filter the data of DataFrame
Usage:
# Filter out the employee records of company A and company C
In [29]: data.loc[data['company'].isin(['A', 'C'])]
Out[29]:
   number company  gender  salary
1      31       A  female      36
5      22       A    male      46
8      32       C    male      19
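Combined with the ~ operator, the same Boolean index selects the complement; a small sketch:

# Employees who are NOT in company A or C
data.loc[~data['company'].isin(['A', 'C'])]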
pd.cut()
Main purpose: discretize continuous variables, such as dividing people's age into various intervals
Main parameters:
- x (array-like)
  The one-dimensional data to be discretized
- bins (int, sequence of scalars, or IntervalIndex)
  The intervals to divide into; you can specify either the number of intervals or the breakpoints
- labels (array or bool, optional)
  Labels for the intervals
Usage:
# Divide the salary into five equal-width intervals
In [33]: pd.cut(data.salary, bins=5)
Out[33]:
0     (23.8, 31.2]
1     (31.2, 38.6]
2     (31.2, 38.6]
3    (8.963, 16.4]
4    (8.963, 16.4]
5     (38.6, 46.0]
6    (8.963, 16.4]
7     (31.2, 38.6]
8     (16.4, 23.8]
Name: salary, dtype: category
Categories (5, interval[float64]): [(8.963, 16.4] < (16.4, 23.8] < (23.8, 31.2] < (31.2, 38.6] < (38.6, 46.0]]

# Specify your own breakpoints
In [32]: pd.cut(data.salary, bins=[0, 10, 20, 30, 40, 50])
Out[32]:
0    (20, 30]
1    (30, 40]
2    (30, 40]
3     (0, 10]
4    (10, 20]
5    (40, 50]
6    (10, 20]
7    (30, 40]
8    (10, 20]
Name: salary, dtype: category
Categories (5, interval[int64]): [(0, 10] < (10, 20] < (20, 30] < (30, 40] < (40, 50]]

# Specify labels for the intervals
In [34]: pd.cut(data.salary, bins=[0, 10, 20, 30, 40, 50],
                labels=['low', 'lower-middle', 'in', 'so-so', 'high'])
Out[34]:
0              in
1           so-so
2           so-so
3             low
4    lower-middle
5            high
6    lower-middle
7           so-so
8    lower-middle
Name: salary, dtype: category
Categories (5, object): [low < lower-middle < in < so-so < high]
pd.qcut()
Main purpose: discretize a continuous variable. The difference is that pd.cut() splits by specific values, while pd.qcut() splits by quantiles.
Main parameters:
- x (array-like)
  The one-dimensional data to be discretized
- q (integer or array of quantiles)
  The quantile bins to divide into; you can specify either the number of quantiles or a list of quantile breakpoints
- labels (array or boolean, default None)
  Labels for the intervals
Usage:
# Split at the 0-33.33%, 33.33%-66.67%, and 66.67%-100% quantiles
In [35]: pd.qcut(data.salary, q=3)
Out[35]:
0     (18.0, 33.667]
1     (33.667, 46.0]
2     (33.667, 46.0]
3      (8.999, 18.0]
4      (8.999, 18.0]
5     (33.667, 46.0]
6      (8.999, 18.0]
7     (18.0, 33.667]
8     (18.0, 33.667]
Name: salary, dtype: category
Categories (3, interval[float64]): [(8.999, 18.0] < (18.0, 33.667] < (33.667, 46.0]]
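As with pd.cut(), the quantile bins can be given readable labels; a quick sketch (the label names are arbitrary):

# Three quantile bins with readable labels
pd.qcut(data.salary, q=3, labels=['low', 'middle', 'high'])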
.where()
Scope objects: Series and DataFrame
Main purpose: replace the values that fail a condition with a specified value, equivalent to executing an if-else
Main parameters:
- cond (boolean Series/DataFrame, array-like, or callable)
  The condition used for filtering
- other (scalar, Series/DataFrame, or callable)
  Values that do not satisfy cond (i.e. where the result is False) are replaced with the corresponding value of other
Usage:
# Statement parsing:
# if salary <= 40, the original value remains unchanged;
# if salary is greater than 40, it is set to 40
In [38]: data['salary'].where(data.salary <= 40, 40)
Out[38]:
0    30
1    36
2    35
3     9
4    16
5    40
6    15
7    33
8    19
Name: salary, dtype: int32
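.mask() is the mirror image of .where(): it replaces the values where the condition is True; a sketch equivalent to the call above:

# Cap salaries at 40: replace wherever the condition holds
data['salary'].mask(data.salary > 40, 40)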
pd.concat()
Main purpose: concatenate multiple Series or DataFrames, either horizontally or vertically
Main parameters:
- objs (a sequence or mapping of Series or DataFrame objects)
  The Series or DataFrames to concatenate, generally passed in as a list
- axis (0/'index', 1/'columns')
  Controls whether the data is concatenated vertically or horizontally; the default is vertical
- ignore_index (bool, default False)
  Whether to keep the indexes of the original Series or DataFrames; if True, a new 0~n-1 index is generated for the concatenated result
Usage:
# Take the first three and last three rows of data as data1 and data2
In [41]: data1 = data.head(3)

In [42]: data1
Out[42]:
   number company  gender  salary
0      40       B  female      30
1      31       A  female      36
2      28       B  female      35

In [43]: data2 = data.tail(3)

In [44]: data2
Out[44]:
   number company  gender  salary
6      28       B  female      15
7      40       B  female      33
8      32       C    male      19

# Concatenate the data
In [45]: pd.concat([data1, data2], ignore_index=False)
Out[45]:
   number company  gender  salary
0      40       B  female      30
1      31       A  female      36
2      28       B  female      35
6      28       B  female      15
7      40       B  female      33
8      32       C    male      19

# Concatenate the data and reset the index
In [46]: pd.concat([data1, data2], ignore_index=True)
Out[46]:
   number company  gender  salary
0      40       B  female      30
1      31       A  female      36
2      28       B  female      35
3      28       B  female      15
4      40       B  female      33
5      32       C    male      19
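For horizontal concatenation, note that axis=1 aligns on the index, so mismatched indexes produce NaN; a quick sketch that resets both indexes first:

# Put data1 and data2 side by side (the column names are duplicated)
pd.concat([data1.reset_index(drop=True), data2.reset_index(drop=True)], axis=1)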
.pivot_table()
Scope object: DataFrame
Main purpose: perform pivoting on DataFrame, which is equivalent to PivotTable in Excel
Main parameters:
- values (column to aggregate, optional)
  The field to aggregate (the target variable of the pivot table)
- index (column, Grouper, array, or list of the previous)
  Similar to the row labels of a pivot table
- columns (column, Grouper, array, or list of the previous)
  Similar to the column labels of a pivot table
- aggfunc (function, list of functions, dict, default numpy.mean)
  The aggregation operation to perform on values
Usage:
# Pivot salary along the two dimensions of company and gender
# to see the average salary level in each group
# (assumes numpy has been imported as np)
In [47]: data.pivot_table(values='salary', index='company',
                          columns='gender', aggfunc=np.mean)
Out[47]:
gender   female  male
company
A          36.0  46.0
B          23.0   NaN
C           NaN  19.0
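pivot_table() also has a margins parameter that adds row and column totals under the chosen aggregation; a quick sketch:

# Add an 'All' row and column with the overall means
data.pivot_table(values='salary', index='company',
                 columns='gender', aggfunc='mean', margins=True)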
That wraps up the commonly used Pandas functions. map and apply are not covered here; see the earlier articles for a detailed introduction. I hope these functions are helpful to you!