[Python enhancement] pandas processing excel data

1, Installation environment

python It is generally necessary to operate the addition, deletion, reading and writing of excel table files Third party Library There are xlwt,xlrd. Xlrd is responsible for reading excel and xlwt is responsible for writing Excel files. This operation method is cumbersome, efficient and easy to understand.

Pandas is the encapsulation Library of xlwt and xlrd libraries. It has more comprehensive operation objects, such as csv,excel,dataframe and so on. On the basis of reading and writing libraries such as xlwt, one library can operate files of different formats.

pandas relies on the xlrd module for processing Excel, so we need to install this module in advance. The installation command is:

pip install xlrd

two, Reading and writing excel

Table content:

numbernameagescoresex
1lhhtwo 4100male
2wcy2899female
3lfg5698male
4zzy5297female

1. Read file

#encoding=utf-8
import pandas as pd

# Read text content
data= pd.read_excel("demo.xlsx")
print(data)
#Solve the problem of incomplete data display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
#Gets the value of the specified cell
one = data.iloc[3,3]
two = data.iloc[0,2]
print(one,two)

Output:

   number name  age  score sex
0       1  lhh   24    100   male
1       2  wcy   28     99   female
2       3  lfg   56     98   male
3       4  zzy   52     97   female
97 24
  • Get the first few pieces of data
# get data
data = data.head()
print(data)

# sheet_name represents the number of workbooks, which can be entered in a list composed of names
data = pd.read_excel("demo.xlsx",sheet_name=0)
print(data)

Output:

   number name  age  score sex
0       1  lhh   24    100   male
1       2  wcy   28     99   female
2       3  lfg   56     98   male
3       4  zzy   52     97   female
   number name  age  score sex
0       1  lhh   24    100   male
1       2  wcy   28     99   female
2       3  lfg   56     98   male
3       4  zzy   52     97   female
  • Get all the data in the table and return a list
# Get all the data and return a list
value = data.values
print(value)

Output:

[[1 'lhh' 24 100 'male']
 [2 'wcy' 28 99 'female']
 [3 'lfg' 56 98 'male']
 [4 'zzy' 52 97 'female']]
  • Output data for the specified row
df = pd.read_excel("demo.xlsx")
#Indicates the first row, excluding the header
data = df.iloc[0].values
print(data)

Output:

[1 'lhh' 24 100 'male']

2. Detailed explanation of LOC and iloc

  • LOC [row, column] first and then column: all rows or columns. Generally, multiple rows can use brackets, and continuous rows can use a:c, etc
  • iloc[index,columns] row index, column index and index start from 0. The usage is the same

3. Multi row and multi column reading

# Multiline output
df = pd.read_excel("demo.xlsx")
#If you want to read a specified number of rows, you need to nest a list to specify the number of rows in ix[] it
data = df.loc[1:2]
print(data)

Output:

   number name  age  score sex
1       2  wcy   28     99   female
2       3  lfg   56     98   male
  • Read the value of the first row and the second column
df = pd.read_excel("demo.xlsx")
data = df.iloc[1,2]
print(data)

Output:

28
  • Read the values of the second row, the third row, the third column and the fourth column
#Read the data of the third column and the fourth column of the first row and the second row. Nested lists are required here
df = pd.read_excel("demo.xlsx")
data = df.iloc[[1,2],[2,3]].values
print(data)

Output:

[[28 99]
 [56 98]]
  • Read the data in the second and third rows specifying the "score" and "age" columns
df = pd.read_excel("demo.xlsx")
data = df.loc[[1,2],["score","age"]].values
print(data)

Output:

[[99 28]
 [98 56]]
  • Reads data from the specified column
#Reads data from the specified column
df = pd.read_excel("demo.xlsx")
data = df.loc[:,["score","age"]].values
print(data)

Output:

[[100  24]
 [ 99  28]
 [ 98  56]
 [ 97  52]]

4. Output line number and column number

#Output row and column numbers
df=pd.read_excel('demo.xlsx')
print("Output line number list",df.index.values)
print("Output list",df.columns.values)
# The df.sample method is similar to the df.head() and df.values methods. The data selection is to randomly select three rows
print("output:",df.sample(3).values)

output

Output line number list [0 1 2 3]
Output list ['number' 'name' 'age' 'score' 'sex']
output: [[3 'lfg' 56 98 'male']
 [4 'zzy' 52 97 'female']
 [2 'wcy' 28 99 'female']]

5. Get the specified value

#Gets the specified value for the specified column
df=pd.read_excel('demo.xlsx')
print("Output value",df['score'].values)

Output:

Output value [100  99  98  97]
  • excel data to dictionary
#excel data to dictionary
df=pd.read_excel('demo.xlsx')
test_data=[]
for i in df.index.values: 
    #Be careful not to write loc as iloc
    row_data=df.loc[i,['number','name','score','age','sex']].to_dict()
    test_data.append(row_data)
print("output".format(test_data))

Output:

output: [{'number': 1, 'name': 'lhh', 'score': 100, 'age': 24, 'sex': 'male'}, {'number': 2, 'name': 'wcy', 'score': 99, 'age': 28, 'sex': 'female'}, {'number': 3, 'name': 'lfg', 'score': 98, 'age': 56, 'sex': 'male'}, {'number': 4, 'name': 'zzy', 'score': 97, 'age': 52, 'sex': 'female'}]

6. Data cleaning operation

  • Remove all rows with null values
df = pd.read_excel('demo.xlsx')
print(df)
data = df.dropna()
print(data)

Output:

   number name   age  score   sex
0       1  lhh  24.0  100.0     male
1       2  wcy  28.0   99.0     female
2       3  lfg  56.0   98.0     male
3       4  zzy  52.0   97.0     female
4       5   dw   NaN    NaN  dwdw
   number name   age  score sex
0       1  lhh  24.0  100.0   male
1       2  wcy  28.0   99.0   female
2       3  lfg  56.0   98.0   male
3       4  zzy  52.0   97.0   female
  • Fill in null values
#Fill in null values
df = pd.read_excel('demo.xlsx')
print(df)
df["age"].fillna(0,inplace=True)
print(df)

Output:

   number name   age  score   sex
0       1  lhh  24.0  100.0     male
1       2  wcy  28.0   99.0     female
2       3  lfg  56.0   98.0     male
3       4  zzy  52.0   97.0     female
4       5   dw   NaN    NaN  dwdw
   number name   age  score   sex
0       1  lhh  24.0  100.0     male
1       2  wcy  28.0   99.0     female
2       3  lfg  56.0   98.0     male
3       4  zzy  52.0   97.0     female
4       5   dw   0.0    NaN  dwdw
  • Remove spaces from string
df['name'] = df['name'].map(str.strip)
print(df)
  • String case conversion
df['name'] = df['name'].map(str.lower)
print(df)
  • Change data format
df['name'].fillna(0).astype("int")
  • Change the name of the column
df.rename(columns={"name":"username"},inplace=True)
print(df)
  • Delete duplicate data
df['age'].drop_duplicates(inplace=True)
df['age'].drop_duplicates(inplace=True,kepp="last")
print(df)
  • List the previous data
# Modification and replacement
data = df.tail(3)
print(data)
  • What line of data is printed
# What line of data is printed
print(data.loc[3])
  • Print the data in the row and the data in the column
#Print out the columns in the eighth row [column_1]
print(data.loc[3,column_1])
# The data name of the printed line is... Data
print(data.loc[2:4,"name":"sex"])
  • Count the number of occurrences
# Count the number of occurrences. Name is the name of the column
data = df.name.value_counts()
print(data)
  • Apply a function apply() usage to each column
# Apply a function to each column
f = lambda x : x-20
df["age"] = df["age"].apply(f)
print(df)
  • Apply a function applymap() to each element
# Apply a function to each element
df = pd.read_excel("dw.xlsx")
f = lambda x:x+100
df = df.applymap(f)
print(df)
  • Traverse the data of rows and columns
# Traverse rows and columns
for i,row in df.iterrows():
    print(i,row)
  • map function usage
df["name"] = df["name"].map(lambda name:name+"wo")
print(df)
  • Select the specified column for re output
df = pd.read_excel("demo.xlsx")
df.loc[:,"name":"score"].to_excel("3 Column output.xlsx")
  • Add line header
df = pd.read_excel("demo.xlsx",header=None,names=["Serial number","full name","Age","fraction","Gender"])
df.to_excel("title.xlsx",index=False)
print(df)

7. Multi table consolidation

Detailed explanation of concat parameter:

  • Objs (required parameter): the list or dictionary of pandas objects participating in the connection
  • axis: indicates the axial direction of the connection. The default value is 0
  • join: select inner or outer (default), and other axial indexes will be merged according to the intersection (inner) or Union (outer)
  • join_axes: indicates the index used for other N-1 axes. Union / intersection operation is not performed
  • keys: the value related to the connection object, which is used to form a hierarchical index on the connection axis
  • verify_integrity: de duplication
  • ignore_index: ignore index
frames = [df1,df2,df3]
result = pd.concat(frames)
result = pd.concat(frames,keys=["x","y","z"])  #Define each table
df = pd.read_excel("demo.xlsx")
df.iloc[2:4,:].to_excel("demo01.xlsx")
data = pd.read_excel("demo01.xlsx")
frames = [df,data]
content = pd.concat(frames,keys=["one","two"],ignore_index=True,verify_integrity=True)
print(content)

Output:

   number name   age  score   sex  Unnamed: 0
0       1  lhh  24.0  100.0     male         NaN
1       2  wcy  28.0   99.0     female         NaN
2       3  lfg  56.0   98.0     male         NaN
3       4  zzy  52.0   97.0     female         NaN
4       5   dw   NaN    NaN  dwdw         NaN
5       3  lfg  56.0   98.0     male         2.0
6       4  zzy  52.0   97.0     female         3.0

Add a row of tables to connect:

# Create a new data and merge by row
df2 = pd.DataFrame({"number":10,"name":"www","age":30,"score":1000},index=[1])
print(df2)
result = pd.concat([df,df2],axis=0,ignore_index=True)
print(result)

Output:

   number name   age  score   sex
0       1  lhh  24.0  100.0     male
1       2  wcy  28.0   99.0     female
2       3  lfg  56.0   98.0     male
3       4  zzy  52.0   97.0     female
4       5   dw   NaN    NaN  dwdw
   number name  age  score
1      10  www   30   1000
   number name   age   score   sex
0       1  lhh  24.0   100.0     male
1       2  wcy  28.0    99.0     female
2       3  lfg  56.0    98.0     male
3       4  zzy  52.0    97.0     female
4       5   dw   NaN     NaN  dwdw
5      10  www  30.0  1000.0   NaN

Add a column of data to connect

\#Create a new column to merge
df3 = pd.DataFrame({"index":[2,3,4,5,6]})
result = pd.concat([df,df3],axis=1)
print(result)

Output:

   number name   age  score   sex
0       1  lhh  24.0  100.0     male
1       2  wcy  28.0   99.0     female
2       3  lfg  56.0   98.0     male
3       4  zzy  52.0   97.0     female
4       5   dw   NaN    NaN  dwdw
   number name   age  score   sex  index
0       1  lhh  24.0  100.0     male      2
1       2  wcy  28.0   99.0     female      3
2       3  lfg  56.0   98.0     male      4
3       4  zzy  52.0   97.0     female      5
4       5   dw   NaN    NaN  dwdw      6

Common centralized usage

#Intersect and merge df1 and df4 horizontally
result = pd.concat([df1,df4],axis=1,join="inner")   Columns are added,Rows are intersections`
#The horizontal indexes of df1 table and df4 table are carried out according to the index of df1
pd.concat([df1,df4],axis=1,join_axes=[df1.index])  Columns are added,Line to df1 Subject to,Empty is NaN
#Connect tables through the append() method
result = df.append(result,ignore_index=True)
print(result)
#Add a column s1 table and merge horizontally with df
s1 = pd.Series(["a","a","a","a","a"],name="x")
result = pd.concat([df,s1],axis=1)
print(result)

Output:

   number name   age  score   sex
0       1  lhh  24.0  100.0     male
1       2  wcy  28.0   99.0     female
2       3  lfg  56.0   98.0     male
3       4  zzy  52.0   97.0     female
4       5   dw   NaN    NaN  dwdw
   number name   age  score   sex  x
0       1  lhh  24.0  100.0     male  a
1       2  wcy  28.0   99.0     female  a
2       3  lfg  56.0   98.0     male  a
3       4  zzy  52.0   97.0     female  a
4       5   dw   NaN    NaN  dwdw  a

Data filtering

#Only data aged 24 and 28 are displayed
print(df)
df = df[df["age"].isin([24,28])]
print(df)
   number name   age  score sex
0       1  lhh  24.0  100.0   male
1       2  wcy  28.0   99.0   female
#Only age data except 24 and 28 are displayed
print(df)
df = df[df["age"].isin([24,28])]
print(df)
   number name   age  score   sex
2       3  lfg  56.0   98.0     male
3       4  zzy  52.0   97.0     female
4       5   dw   NaN    NaN  dwdw
# Delete unwanted columns
result = df.iloc[:, :len(df.columns) - 1]
print(result)
# Take out the contained data and save it
one = result[result["age"].isin([24])]
two = result[result["age"].isin([28])]
three = result[result["age"].isin([52])]
four = result[result["age"].isin([50])]
writer = pd.ExcelWriter("Data filtering.xlsx")
result.to_excel(writer, sheet_name="all", index=False)
one.to_excel(writer, sheet_name="one", index=False)
two.to_excel(writer, sheet_name="two", index=False)
three.to_excel(writer, sheet_name="three", index=False)
four.to_excel(writer, sheet_name="four", index=False)
writer.save()

Keywords: Python Back-end

Added by qistoph on Wed, 03 Nov 2021 21:02:09 +0200