Python crawler (IV) -- data saving

1. Read and write Excel

To read and write Excel files we need an extra module, openpyxl, which can be installed directly with pip:

pip install openpyxl

To write data, we only need to import the Workbook class and create a workbook:

from openpyxl import Workbook
# Create a workbook
wb = Workbook()

Similarly, before writing data to a table we need a worksheet. If no title is specified, worksheets are named Sheet, Sheet1, Sheet2, ... by default, and once a worksheet has a name we can retrieve it by that name:

# Rename the active (default) worksheet
sheet = wb.active
sheet.title = 'Table 1'
# Create a new worksheet
sheet2 = wb.create_sheet('Table 2')
# Get a worksheet by name
sheet1 = wb['Table 1']

Now that we know how to get a worksheet, we can operate on it by addressing a cell directly and assigning a value to it:

# Write a value
sheet['A1'] = 20

The worksheet's cell() method is even more convenient: you can address a cell by row and column and assign a value to it, either through the value attribute or the value keyword argument:

sheet.cell(row=4, column=5).value = 30
sheet.cell(row=2, column=9, value=87)

If the data to be written comes record by record, we can build each record into a list or tuple and insert it as a row. For example, the data we obtained earlier through parsing was just printed out; now we can build it into a list and append it to the worksheet we are operating on.

# append() adds a row below the existing data in the sheet, starting from the first row
sheet.append([title, info, score, follow])
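Putting the pieces together, appending several parsed records in a loop might look like this (the records here are illustrative stand-ins for real parsed data):

```python
from openpyxl import Workbook

wb = Workbook()
sheet = wb.active
sheet.title = 'Movies'

# Hypothetical parsed results; in a real crawler these come from the parsing step
records = [
    ('Movie A', 'some info', 9.1, 1000),
    ('Movie B', 'other info', 8.7, 800),
]

# Write a header row first, then one append() call per record
sheet.append(['title', 'info', 'score', 'follow'])
for title, info, score, follow in records:
    sheet.append([title, info, score, follow])

print(sheet.max_row)  # header row plus two data rows
```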

That covers writing; next is reading. Suppose you are given an Excel file with some data in it: before you can operate on the data, you need to read it. With openpyxl, you must first open the corresponding workbook. We have just seen that a specific cell can be accessed through cell(), so here we can use a loop to traverse a row or a column of the table and store the values in Python variables.

# Open the file
from openpyxl import load_workbook
wb = load_workbook('test.xlsx')
# Get a sheet by name
table = wb['sheet2']
# Get the number of rows and columns
rows = table.max_row
cols = table.max_column
print(rows, cols)
# Get a cell value: rows and columns are numbered from 1; don't forget .value
data = table.cell(row=1, column=1).value
print(data)
# Get all sheet names
sheet_names = wb.sheetnames
print(sheet_names[0])
ws = wb[wb.sheetnames[0]]  # index 0 is the first sheet
# Name of the active sheet
print(wb.active.title)
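The row-by-row traversal mentioned above can be done with the worksheet's iter_rows() method. A minimal self-contained sketch (building the workbook in memory instead of loading a file):

```python
from openpyxl import Workbook

wb = Workbook()
sheet = wb.active
sheet.append(['name', 'score'])
sheet.append(['A', 9.1])
sheet.append(['B', 8.7])

# values_only=True yields plain cell values instead of Cell objects;
# min_row=2 skips the header row
for name, score in sheet.iter_rows(min_row=2, values_only=True):
    print(name, score)
```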

Finally, after operating on the workbook you need to save it:

wb.save('test.xlsx')

2. Write CSV format

The CSV file format is a general-purpose format for importing and exporting spreadsheets and databases. Recently, while calling RPC services to process server data, I often needed to archive the data, so I used this convenient format.

Python has a built-in package for reading and writing CSV files; just import csv. With this package you can easily operate on CSV files. Some simple usages follow.

import csv

rows = [[1, 2, 3, 4], [1, 2, 3, 4], [5, 6, 7, 8], [5, 6, 7, 8]]
with open('example1.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    for li in rows:
        spamwriter.writerow(li)
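As a side note, csv.writer also has a writerows() method that writes all the rows in a single call, which replaces the loop above:

```python
import csv

rows = [[1, 2, 3, 4], [1, 2, 3, 4], [5, 6, 7, 8], [5, 6, 7, 8]]

with open('example1.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(rows)  # one call instead of a per-row loop

# Read it back to check what was written
with open('example1.csv') as f:
    for line in f:
        print(line.strip())
```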

To read a file with the csv module, open it and wrap it in a reader:

import csv

with open('example.csv', encoding='utf-8') as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        print(row)

Because CSV files are comma-separated by default, when another symbol is used as the separator you can specify it when reading.

with open('example.csv', encoding='utf-8') as f:
    # The separator can be specified with the delimiter parameter (a single character)
    reader = csv.reader(f, delimiter=';')
    for row in reader:
        print(row)

The csv module also supports reading and writing rows as dictionaries. Let's see how it is used:

# Write
import csv

with open('names.csv', 'w', newline='') as csvfile:
    fieldnames = ['first_name', 'last_name']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'first_name': 'Baked', 'last_name': 'Beans'})
    writer.writerow({'first_name': 'Lovely', 'last_name': 'Spam'})
    writer.writerow({'first_name': 'Wonderful', 'last_name': 'Spam'})

# Read
import csv

with open('names.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['first_name'], row['last_name'])

3. Read / write json format

If the response body obtained from the server with the requests module is JSON data, you can call response.json() to parse it directly. The JSON format is very similar to Python's dictionary format, but JSON is primarily a data format for storing and exchanging data. Once parsed, we can access its contents with ordinary dictionary indexing.
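When the body is JSON text, response.json() is essentially running json.loads() on the response text. A minimal stdlib-only illustration (the body string here is a made-up sample of what a server might return):

```python
import json

# A sample JSON response body, as it might arrive from a server
body = '{"name": "example", "score": 9.1}'

# Deserialize the JSON string into a dict, as response.json() would
data = json.loads(body)
print(data['name'])
print(data['score'])
```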

If the server returns a plain string, converting Python data into a JSON string is called serialization, and the reverse process, turning JSON data into a Python dictionary, is called deserialization.

import json

# Read JSON data
with open('data.json', 'r', encoding='utf-8') as f:
    # Deserialize: turn the JSON data into a dictionary
    data = json.load(f)
# Print the data
print(data)
print(data['name'])

For serialization, you only need the json.dumps() function. If Chinese characters come out garbled when saving JSON data, that is, escaped into sequences of letters and digits, similar to:

# filename: data.json
{"name": "\u9743\u756f", "shares": 100, "price": 542.233}

you only need to add a parameter when serializing the JSON data:

json_str = json.dumps(data,ensure_ascii=False)
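To write the data straight to a file, json.dump() serializes into an open file object; with ensure_ascii=False the Chinese characters stay readable. The dictionary below is sample data for illustration:

```python
import json

data = {'name': '数据', 'shares': 100, 'price': 542.23}

# Serialize directly into the file, keeping non-ASCII characters as-is
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)

with open('data.json', 'r', encoding='utf-8') as f:
    print(f.read())  # the Chinese text is stored unescaped
```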

4. Summary

There are many kinds of data, and the format to use depends on our specific purpose. Large amounts of data can be saved in xlsx or csv format, both of which are very convenient for later processing. If there are other requirements, we can also save the data in a database, and connecting to a database from Python is very convenient too. That wraps up the crawler basics; topics such as JS decryption, Scrapy and Selenium may be covered in future updates.

Keywords: Python crawler

Added by rockroka on Sun, 19 Dec 2021 06:23:55 +0200