python It is generally necessary to operate the addition, deletion, reading and writing of excel table files Third party Library There are xlwt,xlrd. Xlrd is responsible for reading excel and xlwt is responsible for writing Excel files. This operation method is cumbersome, efficient and easy to understand.
Pandas is the encapsulation Library of xlwt and xlrd libraries. It has more comprehensive operation objects, such as csv,excel,dataframe and so on. On the basis of reading and writing libraries such as xlwt, one library can operate files of different formats.
pandas relies on the xlrd module for processing Excel, so we need to install this module in advance. The installation command is:
pip install xlrd
two, Reading and writing excel
Table content:
number | name | age | score | sex | |
---|---|---|---|---|---|
1 | lhh | two 4 | 100 | male | |
2 | wcy | 28 | 99 | female | |
3 | lfg | 56 | 98 | male | |
4 | zzy | 52 | 97 | female |
1. Read file
#encoding=utf-8 import pandas as pd # Read text content data= pd.read_excel("demo.xlsx") print(data) #Solve the problem of incomplete data display pd.set_option('display.max_columns', None) pd.set_option('display.max_rows', None) #Gets the value of the specified cell one = data.iloc[3,3] two = data.iloc[0,2] print(one,two)
Output:
number name age score sex 0 1 lhh 24 100 male 1 2 wcy 28 99 female 2 3 lfg 56 98 male 3 4 zzy 52 97 female 97 24
- Get the first few pieces of data
# get data data = data.head() print(data) # sheet_name represents the number of workbooks, which can be entered in a list composed of names data = pd.read_excel("demo.xlsx",sheet_name=0) print(data)
Output:
number name age score sex 0 1 lhh 24 100 male 1 2 wcy 28 99 female 2 3 lfg 56 98 male 3 4 zzy 52 97 female number name age score sex 0 1 lhh 24 100 male 1 2 wcy 28 99 female 2 3 lfg 56 98 male 3 4 zzy 52 97 female
- Get all the data in the table and return a list
# Get all the data and return a list value = data.values print(value)
Output:
[[1 'lhh' 24 100 'male'] [2 'wcy' 28 99 'female'] [3 'lfg' 56 98 'male'] [4 'zzy' 52 97 'female']]
- Output data for the specified row
df = pd.read_excel("demo.xlsx") #Indicates the first row, excluding the header data = df.iloc[0].values print(data)
Output:
[1 'lhh' 24 100 'male']
2. Detailed explanation of LOC and iloc
- LOC [row, column] first and then column: all rows or columns. Generally, multiple rows can use brackets, and continuous rows can use a:c, etc
- iloc[index,columns] row index, column index and index start from 0. The usage is the same
3. Multi row and multi column reading
# Multiline output df = pd.read_excel("demo.xlsx") #If you want to read a specified number of rows, you need to nest a list to specify the number of rows in ix[] it data = df.loc[1:2] print(data)
Output:
number name age score sex 1 2 wcy 28 99 female 2 3 lfg 56 98 male
- Read the value of the first row and the second column
df = pd.read_excel("demo.xlsx") data = df.iloc[1,2] print(data)
Output:
28
- Read the values of the second row, the third row, the third column and the fourth column
#Read the data of the third column and the fourth column of the first row and the second row. Nested lists are required here df = pd.read_excel("demo.xlsx") data = df.iloc[[1,2],[2,3]].values print(data)
Output:
[[28 99] [56 98]]
- Read the data in the second and third rows specifying the "score" and "age" columns
df = pd.read_excel("demo.xlsx") data = df.loc[[1,2],["score","age"]].values print(data)
Output:
[[99 28] [98 56]]
- Reads data from the specified column
#Reads data from the specified column df = pd.read_excel("demo.xlsx") data = df.loc[:,["score","age"]].values print(data)
Output:
[[100 24] [ 99 28] [ 98 56] [ 97 52]]
4. Output line number and column number
#Output row and column numbers df=pd.read_excel('demo.xlsx') print("Output line number list",df.index.values) print("Output list",df.columns.values) # The df.sample method is similar to the df.head() and df.values methods. The data selection is to randomly select three rows print("output:",df.sample(3).values)
output
Output line number list [0 1 2 3] Output list ['number' 'name' 'age' 'score' 'sex'] output: [[3 'lfg' 56 98 'male'] [4 'zzy' 52 97 'female'] [2 'wcy' 28 99 'female']]
5. Get the specified value
#Gets the specified value for the specified column df=pd.read_excel('demo.xlsx') print("Output value",df['score'].values)
Output:
Output value [100 99 98 97]
- excel data to dictionary
#excel data to dictionary df=pd.read_excel('demo.xlsx') test_data=[] for i in df.index.values: #Be careful not to write loc as iloc row_data=df.loc[i,['number','name','score','age','sex']].to_dict() test_data.append(row_data) print("output".format(test_data))
Output:
output: [{'number': 1, 'name': 'lhh', 'score': 100, 'age': 24, 'sex': 'male'}, {'number': 2, 'name': 'wcy', 'score': 99, 'age': 28, 'sex': 'female'}, {'number': 3, 'name': 'lfg', 'score': 98, 'age': 56, 'sex': 'male'}, {'number': 4, 'name': 'zzy', 'score': 97, 'age': 52, 'sex': 'female'}]
6. Data cleaning operation
- Remove all rows with null values
df = pd.read_excel('demo.xlsx') print(df) data = df.dropna() print(data)
Output:
number name age score sex 0 1 lhh 24.0 100.0 male 1 2 wcy 28.0 99.0 female 2 3 lfg 56.0 98.0 male 3 4 zzy 52.0 97.0 female 4 5 dw NaN NaN dwdw number name age score sex 0 1 lhh 24.0 100.0 male 1 2 wcy 28.0 99.0 female 2 3 lfg 56.0 98.0 male 3 4 zzy 52.0 97.0 female
- Fill in null values
#Fill in null values df = pd.read_excel('demo.xlsx') print(df) df["age"].fillna(0,inplace=True) print(df)
Output:
number name age score sex 0 1 lhh 24.0 100.0 male 1 2 wcy 28.0 99.0 female 2 3 lfg 56.0 98.0 male 3 4 zzy 52.0 97.0 female 4 5 dw NaN NaN dwdw number name age score sex 0 1 lhh 24.0 100.0 male 1 2 wcy 28.0 99.0 female 2 3 lfg 56.0 98.0 male 3 4 zzy 52.0 97.0 female 4 5 dw 0.0 NaN dwdw
- Remove spaces from string
df['name'] = df['name'].map(str.strip) print(df)
- String case conversion
df['name'] = df['name'].map(str.lower) print(df)
- Change data format
df['name'].fillna(0).astype("int")
- Change the name of the column
df.rename(columns={"name":"username"},inplace=True) print(df)
- Delete duplicate data
df['age'].drop_duplicates(inplace=True) df['age'].drop_duplicates(inplace=True,kepp="last") print(df)
- List the previous data
# Modification and replacement data = df.tail(3) print(data)
- What line of data is printed
# What line of data is printed print(data.loc[3])
- Print the data in the row and the data in the column
#Print out the columns in the eighth row [column_1] print(data.loc[3,column_1])
# The data name of the printed line is... Data print(data.loc[2:4,"name":"sex"])
- Count the number of occurrences
# Count the number of occurrences. Name is the name of the column data = df.name.value_counts() print(data)
- Apply a function apply() usage to each column
# Apply a function to each column f = lambda x : x-20 df["age"] = df["age"].apply(f) print(df)
- Apply a function applymap() to each element
# Apply a function to each element df = pd.read_excel("dw.xlsx") f = lambda x:x+100 df = df.applymap(f) print(df)
- Traverse the data of rows and columns
# Traverse rows and columns for i,row in df.iterrows(): print(i,row)
- map function usage
df["name"] = df["name"].map(lambda name:name+"wo") print(df)
- Select the specified column for re output
df = pd.read_excel("demo.xlsx") df.loc[:,"name":"score"].to_excel("3 Column output.xlsx")
- Add line header
df = pd.read_excel("demo.xlsx",header=None,names=["Serial number","full name","Age","fraction","Gender"]) df.to_excel("title.xlsx",index=False) print(df)
7. Multi table consolidation
Detailed explanation of concat parameter:
- Objs (required parameter): the list or dictionary of pandas objects participating in the connection
- axis: indicates the axial direction of the connection. The default value is 0
- join: select inner or outer (default), and other axial indexes will be merged according to the intersection (inner) or Union (outer)
- join_axes: indicates the index used for other N-1 axes. Union / intersection operation is not performed
- keys: the value related to the connection object, which is used to form a hierarchical index on the connection axis
- verify_integrity: de duplication
- ignore_index: ignore index
frames = [df1,df2,df3] result = pd.concat(frames) result = pd.concat(frames,keys=["x","y","z"]) #Define each table
df = pd.read_excel("demo.xlsx") df.iloc[2:4,:].to_excel("demo01.xlsx") data = pd.read_excel("demo01.xlsx") frames = [df,data] content = pd.concat(frames,keys=["one","two"],ignore_index=True,verify_integrity=True) print(content)
Output:
number name age score sex Unnamed: 0 0 1 lhh 24.0 100.0 male NaN 1 2 wcy 28.0 99.0 female NaN 2 3 lfg 56.0 98.0 male NaN 3 4 zzy 52.0 97.0 female NaN 4 5 dw NaN NaN dwdw NaN 5 3 lfg 56.0 98.0 male 2.0 6 4 zzy 52.0 97.0 female 3.0
Add a row of tables to connect:
# Create a new data and merge by row df2 = pd.DataFrame({"number":10,"name":"www","age":30,"score":1000},index=[1]) print(df2) result = pd.concat([df,df2],axis=0,ignore_index=True) print(result)
Output:
number name age score sex 0 1 lhh 24.0 100.0 male 1 2 wcy 28.0 99.0 female 2 3 lfg 56.0 98.0 male 3 4 zzy 52.0 97.0 female 4 5 dw NaN NaN dwdw number name age score 1 10 www 30 1000 number name age score sex 0 1 lhh 24.0 100.0 male 1 2 wcy 28.0 99.0 female 2 3 lfg 56.0 98.0 male 3 4 zzy 52.0 97.0 female 4 5 dw NaN NaN dwdw 5 10 www 30.0 1000.0 NaN
Add a column of data to connect
\#Create a new column to merge df3 = pd.DataFrame({"index":[2,3,4,5,6]}) result = pd.concat([df,df3],axis=1) print(result)
Output:
number name age score sex 0 1 lhh 24.0 100.0 male 1 2 wcy 28.0 99.0 female 2 3 lfg 56.0 98.0 male 3 4 zzy 52.0 97.0 female 4 5 dw NaN NaN dwdw number name age score sex index 0 1 lhh 24.0 100.0 male 2 1 2 wcy 28.0 99.0 female 3 2 3 lfg 56.0 98.0 male 4 3 4 zzy 52.0 97.0 female 5 4 5 dw NaN NaN dwdw 6
Common centralized usage
#Intersect and merge df1 and df4 horizontally result = pd.concat([df1,df4],axis=1,join="inner") Columns are added,Rows are intersections`
#The horizontal indexes of df1 table and df4 table are carried out according to the index of df1 pd.concat([df1,df4],axis=1,join_axes=[df1.index]) Columns are added,Line to df1 Subject to,Empty is NaN
#Connect tables through the append() method result = df.append(result,ignore_index=True) print(result)
#Add a column s1 table and merge horizontally with df s1 = pd.Series(["a","a","a","a","a"],name="x") result = pd.concat([df,s1],axis=1) print(result)
Output:
number name age score sex 0 1 lhh 24.0 100.0 male 1 2 wcy 28.0 99.0 female 2 3 lfg 56.0 98.0 male 3 4 zzy 52.0 97.0 female 4 5 dw NaN NaN dwdw number name age score sex x 0 1 lhh 24.0 100.0 male a 1 2 wcy 28.0 99.0 female a 2 3 lfg 56.0 98.0 male a 3 4 zzy 52.0 97.0 female a 4 5 dw NaN NaN dwdw a
Data filtering
#Only data aged 24 and 28 are displayed print(df) df = df[df["age"].isin([24,28])] print(df)
number name age score sex 0 1 lhh 24.0 100.0 male 1 2 wcy 28.0 99.0 female
#Only age data except 24 and 28 are displayed print(df) df = df[df["age"].isin([24,28])] print(df)
number name age score sex 2 3 lfg 56.0 98.0 male 3 4 zzy 52.0 97.0 female 4 5 dw NaN NaN dwdw
# Delete unwanted columns result = df.iloc[:, :len(df.columns) - 1] print(result)
# Take out the contained data and save it one = result[result["age"].isin([24])] two = result[result["age"].isin([28])] three = result[result["age"].isin([52])] four = result[result["age"].isin([50])] writer = pd.ExcelWriter("Data filtering.xlsx") result.to_excel(writer, sheet_name="all", index=False) one.to_excel(writer, sheet_name="one", index=False) two.to_excel(writer, sheet_name="two", index=False) three.to_excel(writer, sheet_name="three", index=False) four.to_excel(writer, sheet_name="four", index=False) writer.save()