Play pandas access_ lower

Official account: Special House
Author: Peter
Editor: Peter

Hello, I'm Peter~

This will be the last article on DataFrame data filtering, focusing on the use of three pairs of functions:

  • iloc and loc, the most important and frequently used pair of functions
  • at and iat
  • any and all

Important learning materials: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html , various examples on the official website of pandas take you to learn.

Extended reading

Pandas has a variety of data retrieval methods. There are many functions and skills that we need to master and accumulate. The previous two articles are:

Analog data

Two data are simulated in this paper:

  • The index of the first copy is of character type
  • The second index uses the default numeric type
import pandas as pd
import numpy as np
# First analog data df0

df0 = pd.DataFrame(
    [[one hundred and one, one hundred and two, one hundred and forty], [114, 95, 67], [eighty-seven, 128, 117]],
    index=['language', 'mathematics', 'English'],
    columns=['Xiao Ming', 'Xiao Hong',"Xiao Sun"])

df0

#  Second analog number df

df = pd.DataFrame({
    "name":['Xiao Ming','Xiao Wang','Zhang Fei','Guan Yu','Xiao Xiao Sun','Wang Jianguo','Liu Bei'],
    "sex":['male','female','female','male','female','male','female'],
    "age":[20,23,18,21,25,21,24],
    "score":[np.nan,600,550,np.nan,610,580,634],  # Two pieces of data are missing
    "address":[
        "Nanshan District, Shenzhen City, Guangdong Province",
        np.nan,  # Missing data
        "Yuhua District, Changsha City, Hunan Province",
        "Dongcheng District ",
        "Baiyun District, Guangzhou City, Guangdong Province",
        "Jiangxia District, Wuhan City, Hubei Province",
        "Longhua District, Shenzhen City, Guangdong Province"
        ]
})

df

iloc and loc

iloc filters by numeric value, while loc filters by attribute or row index name

iloc

Directly specify the value and take out the single line record

# 1. Use value

df1 = df.iloc[1]  # Row record of single value extraction
df1

# result
name          Xiao Wang
sex            female
age           23
score      600.0
address      NaN
Name: 1, dtype: object

Use colons to indicate all

df1 = df.iloc[1,:]  # : colon indicates all
df1

# result
name          Xiao Wang
sex            female
age           23
score      600.0
address      NaN
Name: 1, dtype: object

You can also use slices to retrieve data:

df1 = df.iloc[:3]  # Take out the first 3 lines of records
df1

Take out discontinuous multi line records:

df2 = df.iloc[[1,2,4]]  # Take out multiple lines of records
df2
namesexagescoreaddress
1Xiao Wangfemale23600.0NaN
2Zhang Feifemale18550.0Yuhua District, Changsha City, Hunan Province
4Xiao Xiao Sunfemale25610.0Baiyun District, Guangzhou City, Guangdong Province
# 2. Extract some column attributes of row records

df3 = df.iloc[2,0:2]
df3

# result
name    Zhang Fei
sex      female
Name: 2, dtype: object
# Use slices in the column direction in steps of 2

df4 = df.iloc[2,0:5:2]  
df4

# result
name              Zhang Fei
age               18
address    Yuhua District, Changsha City, Hunan Province
Name: 2, dtype: object
# The row index is 2 and the column index numbers are 1 and 3

df5 = df.iloc[2,[1,3]]  
df5

# result
sex          female
score    550.0
Name: 2, dtype: object
# 3. Take out the specific value

df6 = df.iloc[2,4]
df6

# result
'Yuhua District, Changsha City, Hunan Province'

Using slices in both row and column directions, you can also specify the step size:

# 4. Use slices in both row and column directions

df7 = df.iloc[0:4,0:6:2]
df7

Compare with the original data:

!!! A very useful method: NP r_, Help us extract discontinuous column attributes

# 5. Take out the discontinuous row and column data and use NP r_

df8 = df.iloc[:, np.r_[0,2:4]]
df8
nameagescore
0Xiao Ming20NaN
1Xiao Wang23600.0
2Zhang Fei18550.0
3Guan Yu21NaN
4Xiao Xiao Sun25610.0
5Wang Jianguo21580.0
6Liu Bei24634.0
df9 = df.iloc[np.r_[0,2:4],:]
df9
namesexagescoreaddress
0Xiao Mingmale20NaNNanshan District, Shenzhen City, Guangdong Province
2Zhang Feifemale18550.0Yuhua District, Changsha City, Hunan Province
3Guan Yumale21NaNDongcheng District

loc

Use the row index name or column attribute to retrieve data directly

# 1. Take out a single column

df10 = df.loc[:,"name"]  
df10
0     Xiao Ming
1     Xiao Wang
2     Zhang Fei
3     Guan Yu
4    Xiao Xiao Sun
5    Wang Jianguo
6     Liu Bei
Name: name, dtype: object
# 2. Fetch multiple columns

df11 = df.loc[:,["name","age"]]  
df11

#  3. Using numeric values, take out the first row with index 0

df12 = df.loc[0]
df12

name              Xiao Ming
sex                male
age               20
score            NaN
address    Nanshan District, Shenzhen City, Guangdong Province
Name: 0, dtype: object
# 4. Take out the row records with indexes of 0, 1 and 3. At this time, all column fields are reserved

df13 = df.loc[[0,1,3]]
df13

# Use colon:, to indicate all columns. The effect is the same as above

df14 = df.loc[[0,1,3],:]  
df14

# 5. Take out some rows and some columns

df15 = df.loc[[0,1,3],["name","sex","score"]]
df15

# 6,!!! Use index slice: both start and end positions are included

df16 = df.loc[0:3]
df16

df.loc[:]  # Represents all data

# 7. When filtering columns, there must be row elements

# name and score columns of all rows
df17 = df.loc[:,["name","score"]]
df17
namescore
0Xiao MingNaN
1Xiao Wang600.0
2Zhang Fei550.0
3Guan YuNaN
4Xiao Xiao Sun610.0
5Wang Jianguo580.0
6Liu Bei634.0
# age of all rows and all subsequent columns

df18 = df.loc[:,"age":]
df18
agescoreaddress
020NaNNanshan District, Shenzhen City, Guangdong Province
123600.0NaN
218550.0Yuhua District, Changsha City, Hunan Province
321NaNDongcheng District
425610.0Baiyun District, Guangzhou City, Guangdong Province
521580.0Jiangxia District, Wuhan City, Hubei Province
624634.0Longhua District, Shenzhen City, Guangdong Province
# 8. Partial row, age and all subsequent columns

# Remember: the start and end positions are included, which is different from python slicing

df19 = df.loc[1:3,"age":]
df19
agescoreaddress
123600.0NaN
218550.0Yuhua District, Changsha City, Hunan Province
321NaNDongcheng District
# 9. Fetching for non numeric row index

df20 = df0.loc["language"]
df20
Xiao Ming    one hundred and one
 Xiao Hong    102
 Xiao Sun    140
Name: language, dtype: int64
# 10. Note that the two square brackets take out DataFrame data, and the single bracket is Series data

df0.loc[["language"]]
Xiao MingXiao HongXiao Sun
language101102140
df0.loc[["language","English"]]
Xiao MingXiao HongXiao Sun
language101102140
English87128117
# 11. Extract some row and column data

df21 = df0.loc[["language","English"],"Xiao Ming"]  
df21

language    101
 English     87
Name: Xiao Ming, dtype: int64
df0.loc[["language","English"],["Xiao Ming","Xiao Sun"]]  
Xiao MingXiao Sun
language101140
English87117
# 12. Directly use the row index name to retrieve data

df0.loc[["language","English"]]
Xiao MingXiao HongXiao Sun
language101102140
English87128117

Comparison between the two

df.loc[[1,2]]
namesexagescoreaddress
1Xiao Wangfemale23600.0NaN
2Zhang Feifemale18550.0Yuhua District, Changsha City, Hunan Province
df.iloc[[1,2]]
namesexagescoreaddress
1Xiao Wangfemale23600.0NaN
2Zhang Feifemale18550.0Yuhua District, Changsha City, Hunan Province
# Specify the column property name we need

df.loc[[1,2],["name","score"]]  
namescore
1Xiao Wang600.0
2Zhang Fei550.0
# Take out rows 1 and 2, columns 0 and 3

df.iloc[[1,2],np.r_[0,3]]
namescore
1Xiao Wang600.0
2Zhang Fei550.0

at and iat

at

The at function is similar to loc, but what the at function takes out is only a value

df22 = df.at[4,"sex"]
df22

'female'
df.at[2,"name"]

'Zhang Fei'
df0
Xiao MingXiao HongXiao Sun
language101102140
mathematics1149567
English87128117
# Specify both index and column names

df23 = df0.at['language','Xiao Sun']
df23

140
# at and loc

df.loc[1].at['age']

23
df
namesexagescoreaddress
0Xiao Mingmale20NaNNanshan District, Shenzhen City, Guangdong Province
1Xiao Wangfemale23600.0NaN
2Zhang Feifemale18550.0Yuhua District, Changsha City, Hunan Province
3Guan Yumale21NaNDongcheng District
4Xiao Xiao Sunfemale25610.0Baiyun District, Guangzhou City, Guangdong Province
5Wang Jianguomale21580.0Jiangxia District, Wuhan City, Hubei Province
6Liu Beifemale24634.0Longhua District, Shenzhen City, Guangdong Province
# The fourth element with the column name name
df.name.at[4]   

'Xiao Xiao Sun'

iat

Like iloc, only numeric indexing is supported

df24 = df.iat[2,4]
df24

'Yuhua District, Changsha City, Hunan Province'
df.loc[2].iat[4]

'Yuhua District, Changsha City, Hunan Province'
df.iloc[2].iat[4]

'Yuhua District, Changsha City, Hunan Province'

any and all

  • any: True if at least one is True
  • All: all results are required to be True

When the incoming axis=1, the query will be performed according to the row; axis=0 means query by column

Comparison of data in Series

# Two False pass any and the result is False

pd.Series([False, False]).any()   # False
pd.Series([True, False]).any()  # True
pd.Series([True, False]).all()  # False
# any: skip null value

pd.Series([np.nan]).any()  # False
pd.Series([np.nan]).any(skipna=False)  # True
# all: skip null value

pd.Series([np.nan]).all()  # True
pd.Series([np.nan]).all(skipna=False)  #True  

Comparison in DataFrame

df0
Xiao MingXiao HongXiao Sun
language101102140
mathematics1149567
English87128117
# 1. Fetch the data to be queried
df0.loc[:,["Xiao Ming","Xiao Hong"]]
Xiao MingXiao Hong
language101102
mathematics11495
English87128
# 2. Compare
df0.loc[:,["Xiao Ming","Xiao Hong"]] >= 100
Xiao MingXiao Hong
languageTrueTrue
mathematicsTrueFalse
EnglishFalseTrue

any

# 3. Filter with any function

df0[(df0.loc[:,["Xiao Ming","Xiao Hong"]] >= 100).any(1)]

all

Only language can meet the needs of three people at the same time, and all of them are more than 100

# 4. Use the all function to filter: only the language meets 3 people and is greater than 100 at the same time

df0[(df0.loc[:,["Xiao Ming","Xiao Hong"]] >= 100).all(1)]

summary

This paper introduces the use of three pairs of functions of pandas through simulated data. Among them, loc and iloc functions are very common and practical functions, which they often use. So far, the data filtering part of pandas has been fully introduced.

Of course, the methods introduced are only part of pandas's rich retrieval skills. There are many functions and methods that readers need to learn and accumulate by themselves. I hope the methods introduced will be helpful to you.

Starting with the next article, we will introduce various operation skills in Pandas.

Keywords: Python Data Analysis pandas

Added by BLottman on Mon, 31 Jan 2022 17:41:53 +0200