Contents of this lesson:
Use Python to collect the data content of a recruitment website and save the detail-page information as PDF files
Development environment used this time:
- Python 3.8
- PyCharm 2021.2 Professional Edition
- The wkhtmltopdf installation package is required to save the PDF
Module usage:
Modules to install:
- requests (data request module): pip install requests
- parsel (data parsing module): pip install parsel
- pdfkit (PDF module): pip install pdfkit (a minimal usage sketch follows the built-in module list below)
Built-in modules (no installation required):
- re: regular expression module, built-in
- json: converts JSON strings to Python data, built-in
- csv: module for saving CSV data, built-in
- time: time module, built-in
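As a small aside before the installation steps: pdfkit itself only wraps the wkhtmltopdf program, which is why the wkhtmltopdf installation package is listed in the environment above. A minimal sketch of how the two fit together (the wkhtmltopdf path here is an assumed example; point it at your own installation):

import pdfkit

# Assumed wkhtmltopdf path, not the real install location; replace it with your own
config = pdfkit.configuration(wkhtmltopdf=r'C:\wkhtmltopdf\bin\wkhtmltopdf.exe')

# Convert a small HTML string into a PDF file via the wkhtmltopdf executable
pdfkit.from_string('<h1>Hello, PDF</h1>', 'hello.pdf', configuration=config)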
How to install modules:
- Press Win + R, type cmd and click OK, then enter the installation command pip install module-name (for example, pip install requests) and press Enter
- Or click Terminal in PyCharm and enter the installation command there
Case ideas for this lesson (the most basic idea and process of a crawler):
I. Data source analysis
- Determine what data content we want: the recruitment information and its detail pages
- Carry out packet-capture analysis through the browser developer tools to find the data source >>> where the job listing data and the detail-page addresses come from
II. Code implementation steps, the basic crawler workflow: send request >>> get data >>> parse data >>> save data (a minimal sketch of these four steps follows this list)
- Send a request: decide which URL to request, and carry headers to disguise the request as a normal browser visit (a GET request to the website)
- Get the data: receive the response data returned by the server
- Parse the data: extract the position-related information we want
- Save the data: to text / a database / a table... here, CSV table data
- Collect data across multiple pages
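A minimal sketch of these four steps under assumed conditions; the URL, headers, and CSS selector below are placeholders for illustration only, not the real target site:

import requests
import parsel
import csv

# Hypothetical listing-page URL and selector, used only to show how the steps connect
url = 'https://example.com/jobs?page=1'
headers = {'User-Agent': 'Mozilla/5.0'}  # disguise the request as a normal browser visit

# 1. Send the request
response = requests.get(url=url, headers=headers)

# 2. Get the data: the raw HTML text returned by the server
html_text = response.text

# 3. Parse the data with a CSS selector (placeholder selector, not the real page structure)
selector = parsel.Selector(html_text)
titles = selector.css('div.job h2::text').getall()

# 4. Save the data: one CSV row per extracted title
with open('demo.csv', mode='w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['title'])
    for t in titles:
        writer.writerow([t])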
Code display
First, import the modules
import requests   # Data request module: pip install requests
import parsel     # Data parsing module: pip install parsel
import pdfkit     # PDF module: pip install pdfkit
import re         # Regular expression module (built-in)
import json       # JSON module (built-in)
import pprint     # Formatted-output module (built-in)
import csv        # CSV module (built-in)
import time       # Time module (built-in)
1. Send request
def get_job_content(title, html_url):
    # html_url = 'Copy the details page URL yourself~'  # recruitment details page
    html_str = """
    <!doctype html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Document</title>
    </head>
    <body>
    {article}
    </body>
    </html>
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
    }
    response = requests.get(url=html_url, headers=headers)
    response.encoding = 'gbk'
2. Obtain data
# print(response.text)
3. Parse the data. The CSS selector extracts the data content according to tag attributes
    # (still inside get_job_content)
    selectors = parsel.Selector(response.text)  # Convert the obtained html string data into a Selector object
    content = selectors.css('body > div.tCompanyPage > div.tCompany_center.clearfix > div.tCompany_main').get()
    print(content)
    html_data = html_str.format(article=content)
    # File name = company name + position name, e.g. '1.html' (the html/ and pdf/ folders must already exist)
    html_path = 'html\\' + title + '.html'
    pdf_path = 'pdf\\' + title + '.pdf'
    with open(html_path, mode='w', encoding='utf-8') as f:
        f.write(html_data)
    config = pdfkit.configuration(wkhtmltopdf=r'C:\01-Software-installation\wkhtmltopdf\bin\wkhtmltopdf.exe')
    pdfkit.from_file(html_path, pdf_path, configuration=config)


# (back at the top level of the script)
# mode: save mode / read mode; 'a' appends and will not overwrite, 'w' writes and will overwrite
f = open('recruit_1.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'title',
    'Company name',
    'salary',
    'city',
    'education',
    'experience',
    'Company type',
    'Company attributes',
    'company size',
    'fringe benefits',
    'Release date',
    'Detail page',
])
csv_writer.writeheader()  # Write the header row
for page in range(1, 11):
1. Send request; f'{page}' uses the string format method (an f-string) to insert the page number
    # (body of the for loop)
    print(f'=============================== Collecting the data content of page {page} ===============================')
    time.sleep(2)
    url = f'Search results page URL, copy it yourself~'  # the page number {page} belongs inside this URL
    # headers: a dictionary data type in key-value pair form
    # Quick batch replacement: select the content to be replaced, press Ctrl + R and enter the regex syntax
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
    }
    # Send a GET request to the url address through the requests module, carry the headers request header,
    # and receive the returned data content in the response variable
    response = requests.get(url=url, headers=headers)
2. Get the data: the response data returned by the server
# print(response.text)
3. Analyze data
    # From response.text, find window.__SEARCH_RESULT__ = (.*?)</script>, i.e. everything
    # between "window.__SEARCH_RESULT__ = " at the start and "</script>" at the end
    html_data = re.findall('window.__SEARCH_RESULT__ = (.*?)</script>', response.text)[0]
    # print(html_data)
    # print(type(html_data))  # type() shows the data type
    # A dictionary is much more convenient for taking values, so convert the string into dictionary data
    json_data = json.loads(html_data)  # Convert to the dictionary data type
    # Dictionary values are taken through key-value pairs: the [value] on the right of the colon
    # is extracted through the [key] on the left of the colon
    # pprint.pprint(json_data['engine_jds'])  # formatted, expanded output; print() prints everything on one line
    # lis = [1, 2, 3, 4, 5, 6, 7, 9]; for i in lis: ...  a for loop traversal extracts the list elements one by one
    for index in json_data['engine_jds']:
        dit = {
            'title': index['job_name'],
            'Company name': index['company_name'],
            'salary': index['providesalary_text'],
            'city': index['workarea_text'],
            'education': index['attribute_text'][2],
            'experience': index['attribute_text'][1],
            'Company type': index['companytype_text'],
            'Company attributes': index['companyind_text'],
            'company size': index['companysize_text'],
            'fringe benefits': index['jobwelf'],
            'Release date': index['updatedate'],
            'Detail page': index['job_href'],
        }
        title = index['job_name'] + index['company_name']
        title = re.sub(r'[\\/:?*"<>|]', '', title)  # strip characters that are illegal in file names
        get_job_content(title, index['job_href'])
        csv_writer.writerow(dit)
        print(dit)
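To make the regular-expression and json.loads() part easier to follow on its own, here is a small self-contained illustration; the HTML fragment is made up and only imitates how the real page embeds the search results:

import re
import json
import pprint

# Made-up HTML fragment that mimics data embedded in a <script> tag
html = '<script>window.__SEARCH_RESULT__ = {"engine_jds": [{"job_name": "Python dev"}]}</script>'

# re.findall() returns a list of every match of the (.*?) group; [0] takes the first one
json_str = re.findall('window.__SEARCH_RESULT__ = (.*?)</script>', html)[0]
print(type(json_str))   # <class 'str'>  -- still a plain string

# json.loads() turns the JSON string into a Python dictionary, so values can be taken by key
data = json.loads(json_str)
print(type(data))       # <class 'dict'>
pprint.pprint(data['engine_jds'])  # expanded, formatted output

The key point: re.findall() only gives back strings; json.loads() is what turns the string into a dictionary so the values can be taken through their keys.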
Some small knowledge points
Whenever XPath, CSS selectors, or regular expressions return an empty data list, check (see the small sketch after these points):
- Whether the extraction syntax is incorrect
- What the server actually returned (whether the request has been anti-crawled)
- Whether you have found the right data source
XPath Helper matches against the Elements panel (the rendered page)
The crawler, however, only sees the data returned by the server
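A small sketch of that checking order, with a placeholder URL and selector (not the real site):

import requests
import parsel

# Placeholder URL and selector, used only to show the order of checks
url = 'https://example.com/jobs'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url=url, headers=headers)

# 1. First look at what the server actually returned: an anti-crawl response is often a
#    verification page or an empty shell instead of the data you saw in the browser
with open('debug.html', mode='w', encoding='utf-8') as f:
    f.write(response.text)

# 2. Only after confirming the data really is in response.text, test the extraction rule
selector = parsel.Selector(response.text)
result = selector.css('div.job h2::text').getall()
print(result)  # if this is still empty, the problem is the selector syntax, not the data source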
Python application fields
- Crawler programs
- Data analysis >>> data analysis, Power BI
- Website development >>> build a website
- Game development >>> pygame
- Game assistance >>> simulated clicks, image recognition
- AI >>> at present, mostly calling API interfaces for algorithms written by others
- Image processing >>> for example, if a phone takes a photo with location enabled and the photo is sent to someone else, the recipient can work out where it was taken from that photo
- Automated scripts
- Automated testing / operations and maintenance
- GUI desktop application development >>> tkinter, PyQt