[Python enhancement] pandas processing excel data

1, Installation environment

python It is generally necessary to operate the addition, deletion, reading and writing of excel table files Third party Library There are xlwt,xlrd. Xlrd is responsible for reading excel and xlwt is responsible for writing Excel files. This operation method is cumbersome, efficient and easy to understand.

Pandas is the encapsulation Library of xlwt and xlrd libraries. It has more comprehensive operation objects, such as csv,excel,dataframe and so on. On the basis of reading and writing libraries such as xlwt, one library can operate files of different formats.

pandas relies on the xlrd module for processing Excel, so we need to install this module in advance. The installation command is:

pip install xlrd

two, Reading and writing excel

Table content:

number	name	age	score	sex
1	lhh	two 4	100	male
2	wcy	28	99	female
3	lfg	56	98	male
4	zzy	52	97	female

1. Read file

#encoding=utf-8
import pandas as pd

# Read text content
data= pd.read_excel("demo.xlsx")
print(data)
#Solve the problem of incomplete data display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
#Gets the value of the specified cell
one = data.iloc[3,3]
two = data.iloc[0,2]
print(one,two)

Output:

   number name  age  score sex
0       1  lhh   24    100   male
1       2  wcy   28     99   female
2       3  lfg   56     98   male
3       4  zzy   52     97   female
97 24

Get the first few pieces of data

# get data
data = data.head()
print(data)

# sheet_name represents the number of workbooks, which can be entered in a list composed of names
data = pd.read_excel("demo.xlsx",sheet_name=0)
print(data)

Output:

   number name  age  score sex
0       1  lhh   24    100   male
1       2  wcy   28     99   female
2       3  lfg   56     98   male
3       4  zzy   52     97   female
   number name  age  score sex
0       1  lhh   24    100   male
1       2  wcy   28     99   female
2       3  lfg   56     98   male
3       4  zzy   52     97   female

Get all the data in the table and return a list

# Get all the data and return a list
value = data.values
print(value)

Output:

[[1 'lhh' 24 100 'male']
 [2 'wcy' 28 99 'female']
 [3 'lfg' 56 98 'male']
 [4 'zzy' 52 97 'female']]

Output data for the specified row

df = pd.read_excel("demo.xlsx")
#Indicates the first row, excluding the header
data = df.iloc[0].values
print(data)

Output:

[1 'lhh' 24 100 'male']

2. Detailed explanation of LOC and iloc

LOC [row, column] first and then column: all rows or columns. Generally, multiple rows can use brackets, and continuous rows can use a:c, etc
iloc[index,columns] row index, column index and index start from 0. The usage is the same

3. Multi row and multi column reading

# Multiline output
df = pd.read_excel("demo.xlsx")
#If you want to read a specified number of rows, you need to nest a list to specify the number of rows in ix[] it
data = df.loc[1:2]
print(data)

Output:

   number name  age  score sex
1       2  wcy   28     99   female
2       3  lfg   56     98   male

Read the value of the first row and the second column

df = pd.read_excel("demo.xlsx")
data = df.iloc[1,2]
print(data)

Output:

Read the values of the second row, the third row, the third column and the fourth column

#Read the data of the third column and the fourth column of the first row and the second row. Nested lists are required here
df = pd.read_excel("demo.xlsx")
data = df.iloc[[1,2],[2,3]].values
print(data)

Output:

[[28 99]
 [56 98]]

Read the data in the second and third rows specifying the "score" and "age" columns

df = pd.read_excel("demo.xlsx")
data = df.loc[[1,2],["score","age"]].values
print(data)

Output:

[[99 28]
 [98 56]]

Reads data from the specified column

#Reads data from the specified column
df = pd.read_excel("demo.xlsx")
data = df.loc[:,["score","age"]].values
print(data)

Output:

[[100  24]
 [ 99  28]
 [ 98  56]
 [ 97  52]]

4. Output line number and column number

#Output row and column numbers
df=pd.read_excel('demo.xlsx')
print("Output line number list",df.index.values)
print("Output list",df.columns.values)
# The df.sample method is similar to the df.head() and df.values methods. The data selection is to randomly select three rows
print("output:",df.sample(3).values)

output

Output line number list [0 1 2 3]
Output list ['number' 'name' 'age' 'score' 'sex']
output: [[3 'lfg' 56 98 'male']
 [4 'zzy' 52 97 'female']
 [2 'wcy' 28 99 'female']]

5. Get the specified value

#Gets the specified value for the specified column
df=pd.read_excel('demo.xlsx')
print("Output value",df['score'].values)

Output:

Output value [100  99  98  97]

excel data to dictionary

#excel data to dictionary
df=pd.read_excel('demo.xlsx')
test_data=[]
for i in df.index.values: 
    #Be careful not to write loc as iloc
    row_data=df.loc[i,['number','name','score','age','sex']].to_dict()
    test_data.append(row_data)
print("output".format(test_data))

Output:

output: [{'number': 1, 'name': 'lhh', 'score': 100, 'age': 24, 'sex': 'male'}, {'number': 2, 'name': 'wcy', 'score': 99, 'age': 28, 'sex': 'female'}, {'number': 3, 'name': 'lfg', 'score': 98, 'age': 56, 'sex': 'male'}, {'number': 4, 'name': 'zzy', 'score': 97, 'age': 52, 'sex': 'female'}]

6. Data cleaning operation

Remove all rows with null values

df = pd.read_excel('demo.xlsx')
print(df)
data = df.dropna()
print(data)

Output:

   number name   age  score   sex
0       1  lhh  24.0  100.0     male
1       2  wcy  28.0   99.0     female
2       3  lfg  56.0   98.0     male
3       4  zzy  52.0   97.0     female
4       5   dw   NaN    NaN  dwdw
   number name   age  score sex
0       1  lhh  24.0  100.0   male
1       2  wcy  28.0   99.0   female
2       3  lfg  56.0   98.0   male
3       4  zzy  52.0   97.0   female

Fill in null values

#Fill in null values
df = pd.read_excel('demo.xlsx')
print(df)
df["age"].fillna(0,inplace=True)
print(df)

Output:

   number name   age  score   sex
0       1  lhh  24.0  100.0     male
1       2  wcy  28.0   99.0     female
2       3  lfg  56.0   98.0     male
3       4  zzy  52.0   97.0     female
4       5   dw   NaN    NaN  dwdw
   number name   age  score   sex
0       1  lhh  24.0  100.0     male
1       2  wcy  28.0   99.0     female
2       3  lfg  56.0   98.0     male
3       4  zzy  52.0   97.0     female
4       5   dw   0.0    NaN  dwdw

Remove spaces from string

df['name'] = df['name'].map(str.strip)
print(df)

String case conversion

df['name'] = df['name'].map(str.lower)
print(df)

Change data format

df['name'].fillna(0).astype("int")

Change the name of the column

df.rename(columns={"name":"username"},inplace=True)
print(df)

Delete duplicate data

df['age'].drop_duplicates(inplace=True)
df['age'].drop_duplicates(inplace=True,kepp="last")
print(df)

List the previous data

# Modification and replacement
data = df.tail(3)
print(data)

What line of data is printed

# What line of data is printed
print(data.loc[3])

Print the data in the row and the data in the column

#Print out the columns in the eighth row [column_1]
print(data.loc[3,column_1])

# The data name of the printed line is... Data
print(data.loc[2:4,"name":"sex"])

Count the number of occurrences

# Count the number of occurrences. Name is the name of the column
data = df.name.value_counts()
print(data)

Apply a function apply() usage to each column

# Apply a function to each column
f = lambda x : x-20
df["age"] = df["age"].apply(f)
print(df)

Apply a function applymap() to each element

# Apply a function to each element
df = pd.read_excel("dw.xlsx")
f = lambda x:x+100
df = df.applymap(f)
print(df)

Traverse the data of rows and columns

# Traverse rows and columns
for i,row in df.iterrows():
    print(i,row)

map function usage

df["name"] = df["name"].map(lambda name:name+"wo")
print(df)

Select the specified column for re output

df = pd.read_excel("demo.xlsx")
df.loc[:,"name":"score"].to_excel("3 Column output.xlsx")

Add line header

df = pd.read_excel("demo.xlsx",header=None,names=["Serial number","full name","Age","fraction","Gender"])
df.to_excel("title.xlsx",index=False)
print(df)

7. Multi table consolidation

Detailed explanation of concat parameter:

Objs (required parameter): the list or dictionary of pandas objects participating in the connection
axis: indicates the axial direction of the connection. The default value is 0
join: select inner or outer (default), and other axial indexes will be merged according to the intersection (inner) or Union (outer)
join_axes: indicates the index used for other N-1 axes. Union / intersection operation is not performed
keys: the value related to the connection object, which is used to form a hierarchical index on the connection axis
verify_integrity: de duplication
ignore_index: ignore index

frames = [df1,df2,df3]
result = pd.concat(frames)
result = pd.concat(frames,keys=["x","y","z"])  #Define each table

df = pd.read_excel("demo.xlsx")
df.iloc[2:4,:].to_excel("demo01.xlsx")
data = pd.read_excel("demo01.xlsx")
frames = [df,data]
content = pd.concat(frames,keys=["one","two"],ignore_index=True,verify_integrity=True)
print(content)

Output:

   number name   age  score   sex  Unnamed: 0
0       1  lhh  24.0  100.0     male         NaN
1       2  wcy  28.0   99.0     female         NaN
2       3  lfg  56.0   98.0     male         NaN
3       4  zzy  52.0   97.0     female         NaN
4       5   dw   NaN    NaN  dwdw         NaN
5       3  lfg  56.0   98.0     male         2.0
6       4  zzy  52.0   97.0     female         3.0

Add a row of tables to connect:

# Create a new data and merge by row
df2 = pd.DataFrame({"number":10,"name":"www","age":30,"score":1000},index=[1])
print(df2)
result = pd.concat([df,df2],axis=0,ignore_index=True)
print(result)

Output:

   number name   age  score   sex
0       1  lhh  24.0  100.0     male
1       2  wcy  28.0   99.0     female
2       3  lfg  56.0   98.0     male
3       4  zzy  52.0   97.0     female
4       5   dw   NaN    NaN  dwdw
   number name  age  score
1      10  www   30   1000
   number name   age   score   sex
0       1  lhh  24.0   100.0     male
1       2  wcy  28.0    99.0     female
2       3  lfg  56.0    98.0     male
3       4  zzy  52.0    97.0     female
4       5   dw   NaN     NaN  dwdw
5      10  www  30.0  1000.0   NaN

Add a column of data to connect

\#Create a new column to merge
df3 = pd.DataFrame({"index":[2,3,4,5,6]})
result = pd.concat([df,df3],axis=1)
print(result)

Output:

   number name   age  score   sex
0       1  lhh  24.0  100.0     male
1       2  wcy  28.0   99.0     female
2       3  lfg  56.0   98.0     male
3       4  zzy  52.0   97.0     female
4       5   dw   NaN    NaN  dwdw
   number name   age  score   sex  index
0       1  lhh  24.0  100.0     male      2
1       2  wcy  28.0   99.0     female      3
2       3  lfg  56.0   98.0     male      4
3       4  zzy  52.0   97.0     female      5
4       5   dw   NaN    NaN  dwdw      6

Common centralized usage

#Intersect and merge df1 and df4 horizontally
result = pd.concat([df1,df4],axis=1,join="inner")   Columns are added,Rows are intersections`

#The horizontal indexes of df1 table and df4 table are carried out according to the index of df1
pd.concat([df1,df4],axis=1,join_axes=[df1.index])  Columns are added,Line to df1 Subject to,Empty is NaN

#Connect tables through the append() method
result = df.append(result,ignore_index=True)
print(result)

#Add a column s1 table and merge horizontally with df
s1 = pd.Series(["a","a","a","a","a"],name="x")
result = pd.concat([df,s1],axis=1)
print(result)

Output:

   number name   age  score   sex
0       1  lhh  24.0  100.0     male
1       2  wcy  28.0   99.0     female
2       3  lfg  56.0   98.0     male
3       4  zzy  52.0   97.0     female
4       5   dw   NaN    NaN  dwdw
   number name   age  score   sex  x
0       1  lhh  24.0  100.0     male  a
1       2  wcy  28.0   99.0     female  a
2       3  lfg  56.0   98.0     male  a
3       4  zzy  52.0   97.0     female  a
4       5   dw   NaN    NaN  dwdw  a

Data filtering

#Only data aged 24 and 28 are displayed
print(df)
df = df[df["age"].isin([24,28])]
print(df)

   number name   age  score sex
0       1  lhh  24.0  100.0   male
1       2  wcy  28.0   99.0   female

#Only age data except 24 and 28 are displayed
print(df)
df = df[df["age"].isin([24,28])]
print(df)

   number name   age  score   sex
2       3  lfg  56.0   98.0     male
3       4  zzy  52.0   97.0     female
4       5   dw   NaN    NaN  dwdw

# Delete unwanted columns
result = df.iloc[:, :len(df.columns) - 1]
print(result)

# Take out the contained data and save it
one = result[result["age"].isin([24])]
two = result[result["age"].isin([28])]
three = result[result["age"].isin([52])]
four = result[result["age"].isin([50])]
writer = pd.ExcelWriter("Data filtering.xlsx")
result.to_excel(writer, sheet_name="all", index=False)
one.to_excel(writer, sheet_name="one", index=False)
two.to_excel(writer, sheet_name="two", index=False)
three.to_excel(writer, sheet_name="three", index=False)
four.to_excel(writer, sheet_name="four", index=False)
writer.save()

Keywords: Python Back-end

Added by qistoph on Wed, 03 Nov 2021 21:02:09 +0200

Programming VIP

[Python enhancement] pandas processing excel data

two, Reading and writing excel

1. Read file

2. Detailed explanation of LOC and iloc

3. Multi row and multi column reading

4. Output line number and column number

5. Get the specified value

6. Data cleaning operation

7. Multi table consolidation

Detailed explanation of concat parameter:

Data filtering

Popular Keywords