Official account: Special House
Author: Peter
Editor: Peter
Hello, I'm Peter~
This will be the last article on DataFrame data filtering, focusing on the use of three pairs of functions:
- iloc and loc, the most important and frequently used pair of functions
- at and iat
- any and all
Important learning materials: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html , various examples on the official website of pandas take you to learn.
Extended reading
Pandas has a variety of data retrieval methods. There are many functions and skills that we need to master and accumulate. The previous two articles are:
Analog data
Two data are simulated in this paper:
- The index of the first copy is of character type
- The second index uses the default numeric type
import pandas as pd import numpy as np
# First analog data df0 df0 = pd.DataFrame( [[one hundred and one, one hundred and two, one hundred and forty], [114, 95, 67], [eighty-seven, 128, 117]], index=['language', 'mathematics', 'English'], columns=['Xiao Ming', 'Xiao Hong',"Xiao Sun"]) df0
# Second analog number df df = pd.DataFrame({ "name":['Xiao Ming','Xiao Wang','Zhang Fei','Guan Yu','Xiao Xiao Sun','Wang Jianguo','Liu Bei'], "sex":['male','female','female','male','female','male','female'], "age":[20,23,18,21,25,21,24], "score":[np.nan,600,550,np.nan,610,580,634], # Two pieces of data are missing "address":[ "Nanshan District, Shenzhen City, Guangdong Province", np.nan, # Missing data "Yuhua District, Changsha City, Hunan Province", "Dongcheng District ", "Baiyun District, Guangzhou City, Guangdong Province", "Jiangxia District, Wuhan City, Hubei Province", "Longhua District, Shenzhen City, Guangdong Province" ] }) df
iloc and loc
iloc filters by numeric value, while loc filters by attribute or row index name
iloc
Directly specify the value and take out the single line record
# 1. Use value df1 = df.iloc[1] # Row record of single value extraction df1 # result name Xiao Wang sex female age 23 score 600.0 address NaN Name: 1, dtype: object
Use colons to indicate all
df1 = df.iloc[1,:] # : colon indicates all df1 # result name Xiao Wang sex female age 23 score 600.0 address NaN Name: 1, dtype: object
You can also use slices to retrieve data:
df1 = df.iloc[:3] # Take out the first 3 lines of records df1
Take out discontinuous multi line records:
df2 = df.iloc[[1,2,4]] # Take out multiple lines of records df2
name | sex | age | score | address | |
---|---|---|---|---|---|
1 | Xiao Wang | female | 23 | 600.0 | NaN |
2 | Zhang Fei | female | 18 | 550.0 | Yuhua District, Changsha City, Hunan Province |
4 | Xiao Xiao Sun | female | 25 | 610.0 | Baiyun District, Guangzhou City, Guangdong Province |
# 2. Extract some column attributes of row records df3 = df.iloc[2,0:2] df3 # result name Zhang Fei sex female Name: 2, dtype: object
# Use slices in the column direction in steps of 2 df4 = df.iloc[2,0:5:2] df4 # result name Zhang Fei age 18 address Yuhua District, Changsha City, Hunan Province Name: 2, dtype: object
# The row index is 2 and the column index numbers are 1 and 3 df5 = df.iloc[2,[1,3]] df5 # result sex female score 550.0 Name: 2, dtype: object
# 3. Take out the specific value df6 = df.iloc[2,4] df6 # result 'Yuhua District, Changsha City, Hunan Province'
Using slices in both row and column directions, you can also specify the step size:
# 4. Use slices in both row and column directions df7 = df.iloc[0:4,0:6:2] df7
Compare with the original data:
!!! A very useful method: NP r_, Help us extract discontinuous column attributes
# 5. Take out the discontinuous row and column data and use NP r_ df8 = df.iloc[:, np.r_[0,2:4]] df8
name | age | score | |
---|---|---|---|
0 | Xiao Ming | 20 | NaN |
1 | Xiao Wang | 23 | 600.0 |
2 | Zhang Fei | 18 | 550.0 |
3 | Guan Yu | 21 | NaN |
4 | Xiao Xiao Sun | 25 | 610.0 |
5 | Wang Jianguo | 21 | 580.0 |
6 | Liu Bei | 24 | 634.0 |
df9 = df.iloc[np.r_[0,2:4],:] df9
name | sex | age | score | address | |
---|---|---|---|---|---|
0 | Xiao Ming | male | 20 | NaN | Nanshan District, Shenzhen City, Guangdong Province |
2 | Zhang Fei | female | 18 | 550.0 | Yuhua District, Changsha City, Hunan Province |
3 | Guan Yu | male | 21 | NaN | Dongcheng District |
loc
Use the row index name or column attribute to retrieve data directly
# 1. Take out a single column df10 = df.loc[:,"name"] df10
0 Xiao Ming 1 Xiao Wang 2 Zhang Fei 3 Guan Yu 4 Xiao Xiao Sun 5 Wang Jianguo 6 Liu Bei Name: name, dtype: object
# 2. Fetch multiple columns df11 = df.loc[:,["name","age"]] df11
# 3. Using numeric values, take out the first row with index 0 df12 = df.loc[0] df12 name Xiao Ming sex male age 20 score NaN address Nanshan District, Shenzhen City, Guangdong Province Name: 0, dtype: object
# 4. Take out the row records with indexes of 0, 1 and 3. At this time, all column fields are reserved df13 = df.loc[[0,1,3]] df13
# Use colon:, to indicate all columns. The effect is the same as above df14 = df.loc[[0,1,3],:] df14
# 5. Take out some rows and some columns df15 = df.loc[[0,1,3],["name","sex","score"]] df15
# 6,!!! Use index slice: both start and end positions are included df16 = df.loc[0:3] df16
df.loc[:] # Represents all data
# 7. When filtering columns, there must be row elements # name and score columns of all rows df17 = df.loc[:,["name","score"]] df17
name | score | |
---|---|---|
0 | Xiao Ming | NaN |
1 | Xiao Wang | 600.0 |
2 | Zhang Fei | 550.0 |
3 | Guan Yu | NaN |
4 | Xiao Xiao Sun | 610.0 |
5 | Wang Jianguo | 580.0 |
6 | Liu Bei | 634.0 |
# age of all rows and all subsequent columns df18 = df.loc[:,"age":] df18
age | score | address | |
---|---|---|---|
0 | 20 | NaN | Nanshan District, Shenzhen City, Guangdong Province |
1 | 23 | 600.0 | NaN |
2 | 18 | 550.0 | Yuhua District, Changsha City, Hunan Province |
3 | 21 | NaN | Dongcheng District |
4 | 25 | 610.0 | Baiyun District, Guangzhou City, Guangdong Province |
5 | 21 | 580.0 | Jiangxia District, Wuhan City, Hubei Province |
6 | 24 | 634.0 | Longhua District, Shenzhen City, Guangdong Province |
# 8. Partial row, age and all subsequent columns # Remember: the start and end positions are included, which is different from python slicing df19 = df.loc[1:3,"age":] df19
age | score | address | |
---|---|---|---|
1 | 23 | 600.0 | NaN |
2 | 18 | 550.0 | Yuhua District, Changsha City, Hunan Province |
3 | 21 | NaN | Dongcheng District |
# 9. Fetching for non numeric row index df20 = df0.loc["language"] df20
Xiao Ming one hundred and one Xiao Hong 102 Xiao Sun 140 Name: language, dtype: int64
# 10. Note that the two square brackets take out DataFrame data, and the single bracket is Series data df0.loc[["language"]]
Xiao Ming | Xiao Hong | Xiao Sun | |
---|---|---|---|
language | 101 | 102 | 140 |
df0.loc[["language","English"]]
Xiao Ming | Xiao Hong | Xiao Sun | |
---|---|---|---|
language | 101 | 102 | 140 |
English | 87 | 128 | 117 |
# 11. Extract some row and column data df21 = df0.loc[["language","English"],"Xiao Ming"] df21 language 101 English 87 Name: Xiao Ming, dtype: int64
df0.loc[["language","English"],["Xiao Ming","Xiao Sun"]]
Xiao Ming | Xiao Sun | |
---|---|---|
language | 101 | 140 |
English | 87 | 117 |
# 12. Directly use the row index name to retrieve data df0.loc[["language","English"]]
Xiao Ming | Xiao Hong | Xiao Sun | |
---|---|---|---|
language | 101 | 102 | 140 |
English | 87 | 128 | 117 |
Comparison between the two
df.loc[[1,2]]
name | sex | age | score | address | |
---|---|---|---|---|---|
1 | Xiao Wang | female | 23 | 600.0 | NaN |
2 | Zhang Fei | female | 18 | 550.0 | Yuhua District, Changsha City, Hunan Province |
df.iloc[[1,2]]
name | sex | age | score | address | |
---|---|---|---|---|---|
1 | Xiao Wang | female | 23 | 600.0 | NaN |
2 | Zhang Fei | female | 18 | 550.0 | Yuhua District, Changsha City, Hunan Province |
# Specify the column property name we need df.loc[[1,2],["name","score"]]
name | score | |
---|---|---|
1 | Xiao Wang | 600.0 |
2 | Zhang Fei | 550.0 |
# Take out rows 1 and 2, columns 0 and 3 df.iloc[[1,2],np.r_[0,3]]
name | score | |
---|---|---|
1 | Xiao Wang | 600.0 |
2 | Zhang Fei | 550.0 |
at and iat
at
The at function is similar to loc, but what the at function takes out is only a value
df22 = df.at[4,"sex"] df22 'female'
df.at[2,"name"] 'Zhang Fei'
df0
Xiao Ming | Xiao Hong | Xiao Sun | |
---|---|---|---|
language | 101 | 102 | 140 |
mathematics | 114 | 95 | 67 |
English | 87 | 128 | 117 |
# Specify both index and column names df23 = df0.at['language','Xiao Sun'] df23 140
# at and loc df.loc[1].at['age'] 23
df
name | sex | age | score | address | |
---|---|---|---|---|---|
0 | Xiao Ming | male | 20 | NaN | Nanshan District, Shenzhen City, Guangdong Province |
1 | Xiao Wang | female | 23 | 600.0 | NaN |
2 | Zhang Fei | female | 18 | 550.0 | Yuhua District, Changsha City, Hunan Province |
3 | Guan Yu | male | 21 | NaN | Dongcheng District |
4 | Xiao Xiao Sun | female | 25 | 610.0 | Baiyun District, Guangzhou City, Guangdong Province |
5 | Wang Jianguo | male | 21 | 580.0 | Jiangxia District, Wuhan City, Hubei Province |
6 | Liu Bei | female | 24 | 634.0 | Longhua District, Shenzhen City, Guangdong Province |
# The fourth element with the column name name df.name.at[4] 'Xiao Xiao Sun'
iat
Like iloc, only numeric indexing is supported
df24 = df.iat[2,4] df24 'Yuhua District, Changsha City, Hunan Province'
df.loc[2].iat[4] 'Yuhua District, Changsha City, Hunan Province'
df.iloc[2].iat[4] 'Yuhua District, Changsha City, Hunan Province'
any and all
- any: True if at least one is True
- All: all results are required to be True
When the incoming axis=1, the query will be performed according to the row; axis=0 means query by column
Comparison of data in Series
# Two False pass any and the result is False pd.Series([False, False]).any() # False
pd.Series([True, False]).any() # True
pd.Series([True, False]).all() # False
# any: skip null value pd.Series([np.nan]).any() # False
pd.Series([np.nan]).any(skipna=False) # True
# all: skip null value pd.Series([np.nan]).all() # True
pd.Series([np.nan]).all(skipna=False) #True
Comparison in DataFrame
df0
Xiao Ming | Xiao Hong | Xiao Sun | |
---|---|---|---|
language | 101 | 102 | 140 |
mathematics | 114 | 95 | 67 |
English | 87 | 128 | 117 |
# 1. Fetch the data to be queried df0.loc[:,["Xiao Ming","Xiao Hong"]]
Xiao Ming | Xiao Hong | |
---|---|---|
language | 101 | 102 |
mathematics | 114 | 95 |
English | 87 | 128 |
# 2. Compare df0.loc[:,["Xiao Ming","Xiao Hong"]] >= 100
Xiao Ming | Xiao Hong | |
---|---|---|
language | True | True |
mathematics | True | False |
English | False | True |
any
# 3. Filter with any function df0[(df0.loc[:,["Xiao Ming","Xiao Hong"]] >= 100).any(1)]
all
Only language can meet the needs of three people at the same time, and all of them are more than 100
# 4. Use the all function to filter: only the language meets 3 people and is greater than 100 at the same time df0[(df0.loc[:,["Xiao Ming","Xiao Hong"]] >= 100).all(1)]
summary
This paper introduces the use of three pairs of functions of pandas through simulated data. Among them, loc and iloc functions are very common and practical functions, which they often use. So far, the data filtering part of pandas has been fully introduced.
Of course, the methods introduced are only part of pandas's rich retrieval skills. There are many functions and methods that readers need to learn and accumulate by themselves. I hope the methods introduced will be helpful to you.
Starting with the next article, we will introduce various operation skills in Pandas.