Contents of this lesson:
Use Python to collect the data content of a recruitment website and save the detail-page information as PDF files
Development environment used this time:
- Python 3.8
- PyCharm 2021.2 Professional Edition
- The wkhtmltopdf installation package is required to save the PDF
Module usage:
Modules to install:
- requests (data request module): pip install requests
- parsel (data parsing module): pip install parsel
- pdfkit (PDF module): pip install pdfkit (a minimal usage sketch follows the built-in module list below)
Built-in modules (no installation required):
- re: regular expression module, built-in
- json: converts JSON strings to Python data, built-in
- csv: module for saving CSV data, built-in
- time: time module, built-in
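As a small aside before the installation steps: pdfkit itself only wraps the wkhtmltopdf program, which is why the wkhtmltopdf installation package is listed in the environment above. A minimal sketch of how the two fit together (the wkhtmltopdf path here is an assumed example; point it at your own installation):

import pdfkit

# Assumed wkhtmltopdf path, not the real install location; replace it with your own
config = pdfkit.configuration(wkhtmltopdf=r'C:\wkhtmltopdf\bin\wkhtmltopdf.exe')

# Convert a small HTML string into a PDF file via the wkhtmltopdf executable
pdfkit.from_string('<h1>Hello, PDF</h1>', 'hello.pdf', configuration=config)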
How to install modules:
- Press Win + R, type cmd and click OK, then enter the installation command pip install module-name (for example, pip install requests) and press Enter
- Or click Terminal in PyCharm and enter the installation command there
Case ideas for this lesson (the most basic idea and process of a crawler):
I. Data source analysis
- Determine what data content we want: the recruitment information and its detail pages
- Carry out packet-capture analysis through the browser developer tools to find the data source >>> where the job listing data and the detail-page addresses come from
II. Code implementation steps, the basic crawler workflow: send request >>> get data >>> parse data >>> save data (a minimal sketch of these four steps follows this list)
- Send a request: decide which URL to request, and carry headers to disguise the request as a normal browser visit (a GET request to the website)
- Get the data: receive the response data returned by the server
- Parse the data: extract the position-related information we want
- Save the data: to text / a database / a table... here, CSV table data
- Collect data across multiple pages
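A minimal sketch of these four steps under assumed conditions; the URL, headers, and CSS selector below are placeholders for illustration only, not the real target site:

import requests
import parsel
import csv

# Hypothetical listing-page URL and selector, used only to show how the steps connect
url = 'https://example.com/jobs?page=1'
headers = {'User-Agent': 'Mozilla/5.0'}  # disguise the request as a normal browser visit

# 1. Send the request
response = requests.get(url=url, headers=headers)

# 2. Get the data: the raw HTML text returned by the server
html_text = response.text

# 3. Parse the data with a CSS selector (placeholder selector, not the real page structure)
selector = parsel.Selector(html_text)
titles = selector.css('div.job h2::text').getall()

# 4. Save the data: one CSV row per extracted title
with open('demo.csv', mode='w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['title'])
    for t in titles:
        writer.writerow([t])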
Code display
First, import the modules
import requests   # Data request module: pip install requests
import parsel     # Data parsing module: pip install parsel
import pdfkit     # PDF module: pip install pdfkit
import re         # Regular expression module (built-in)
import json       # JSON module (built-in)
import pprint     # Formatted-output module (built-in)
import csv        # CSV module (built-in)
import time       # Time module (built-in)
1. Send request
def get_job_content(title, html_url):
    # html_url = 'Copy the details page URL yourself~'  # recruitment details page
    html_str = """
    <!doctype html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Document</title>
    </head>
    <body>
    {article}
    </body>
    </html>
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
    }
    response = requests.get(url=html_url, headers=headers)
    response.encoding = 'gbk'
2. Obtain data
# print(response.text)
3. Parse the data. The CSS selector extracts the data content according to tag attributes
    # (still inside get_job_content)
    selectors = parsel.Selector(response.text)  # Convert the obtained html string data into a Selector object
    content = selectors.css('body > div.tCompanyPage > div.tCompany_center.clearfix > div.tCompany_main').get()
    print(content)
    html_data = html_str.format(article=content)
    # File name = company name + position name, e.g. '1.html' (the html/ and pdf/ folders must already exist)
    html_path = 'html\\' + title + '.html'
    pdf_path = 'pdf\\' + title + '.pdf'
    with open(html_path, mode='w', encoding='utf-8') as f:
        f.write(html_data)
    config = pdfkit.configuration(wkhtmltopdf=r'C:\01-Software-installation\wkhtmltopdf\bin\wkhtmltopdf.exe')
    pdfkit.from_file(html_path, pdf_path, configuration=config)


# (back at the top level of the script)
# mode: save mode / read mode; 'a' appends and will not overwrite, 'w' writes and will overwrite
f = open('recruit_1.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'title',
    'Company name',
    'salary',
    'city',
    'education',
    'experience',
    'Company type',
    'Company attributes',
    'company size',
    'fringe benefits',
    'Release date',
    'Detail page',
])
csv_writer.writeheader()  # Write the header row
for page in range(1, 11):
1. Send request; f'{page}' uses the string format method (an f-string) to insert the page number
    # (body of the for loop)
    print(f'=============================== Collecting the data content of page {page} ===============================')
    time.sleep(2)
    url = f'Search results page URL, copy it yourself~'  # the page number {page} belongs inside this URL
    # headers: a dictionary data type in key-value pair form
    # Quick batch replacement: select the content to be replaced, press Ctrl + R and enter the regex syntax
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
    }
    # Send a GET request to the url address through the requests module, carry the headers request header,
    # and receive the returned data content in the response variable
    response = requests.get(url=url, headers=headers)
2. Get the data: the response data returned by the server
# print(response.text)
3. Analyze data
    # From response.text, find window.__SEARCH_RESULT__ = (.*?)</script>, i.e. everything
    # between "window.__SEARCH_RESULT__ = " at the start and "</script>" at the end
    html_data = re.findall('window.__SEARCH_RESULT__ = (.*?)</script>', response.text)[0]
    # print(html_data)
    # print(type(html_data))  # type() shows the data type
    # A dictionary is much more convenient for taking values, so convert the string into dictionary data
    json_data = json.loads(html_data)  # Convert to the dictionary data type
    # Dictionary values are taken through key-value pairs: the [value] on the right of the colon
    # is extracted through the [key] on the left of the colon
    # pprint.pprint(json_data['engine_jds'])  # formatted, expanded output; print() prints everything on one line
    # lis = [1, 2, 3, 4, 5, 6, 7, 9]; for i in lis: ...  a for loop traversal extracts the list elements one by one
    for index in json_data['engine_jds']:
        dit = {
            'title': index['job_name'],
            'Company name': index['company_name'],
            'salary': index['providesalary_text'],
            'city': index['workarea_text'],
            'education': index['attribute_text'][2],
            'experience': index['attribute_text'][1],
            'Company type': index['companytype_text'],
            'Company attributes': index['companyind_text'],
            'company size': index['companysize_text'],
            'fringe benefits': index['jobwelf'],
            'Release date': index['updatedate'],
            'Detail page': index['job_href'],
        }
        title = index['job_name'] + index['company_name']
        title = re.sub(r'[\\/:?*"<>|]', '', title)  # strip characters that are illegal in file names
        get_job_content(title, index['job_href'])
        csv_writer.writerow(dit)
        print(dit)
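To make the regular-expression and json.loads() part easier to follow on its own, here is a small self-contained illustration; the HTML fragment is made up and only imitates how the real page embeds the search results:

import re
import json
import pprint

# Made-up HTML fragment that mimics data embedded in a <script> tag
html = '<script>window.__SEARCH_RESULT__ = {"engine_jds": [{"job_name": "Python dev"}]}</script>'

# re.findall() returns a list of every match of the (.*?) group; [0] takes the first one
json_str = re.findall('window.__SEARCH_RESULT__ = (.*?)</script>', html)[0]
print(type(json_str))   # <class 'str'>  -- still a plain string

# json.loads() turns the JSON string into a Python dictionary, so values can be taken by key
data = json.loads(json_str)
print(type(data))       # <class 'dict'>
pprint.pprint(data['engine_jds'])  # expanded, formatted output

The key point: re.findall() only gives back strings; json.loads() is what turns the string into a dictionary so the values can be taken through their keys.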
Some small knowledge points
Whenever XPath, CSS selectors, or regular expressions return an empty data list, check (see the small sketch after these points):
- Whether the extraction syntax is incorrect
- What the server actually returned (whether the request has been anti-crawled)
- Whether you have found the right data source
XPath Helper matches against the Elements panel (the rendered page)
The crawler, however, only sees the data returned by the server
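A small sketch of that checking order, with a placeholder URL and selector (not the real site):

import requests
import parsel

# Placeholder URL and selector, used only to show the order of checks
url = 'https://example.com/jobs'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url=url, headers=headers)

# 1. First look at what the server actually returned: an anti-crawl response is often a
#    verification page or an empty shell instead of the data you saw in the browser
with open('debug.html', mode='w', encoding='utf-8') as f:
    f.write(response.text)

# 2. Only after confirming the data really is in response.text, test the extraction rule
selector = parsel.Selector(response.text)
result = selector.css('div.job h2::text').getall()
print(result)  # if this is still empty, the problem is the selector syntax, not the data source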
Python application fields
- Crawler programs
- Data analysis >>> data analysis, Power BI
- Website development >>> build a website
- Game development >>> pygame
- Game assistance >>> simulated clicks, image recognition
- AI >>> at present, mostly calling API interfaces for algorithms written by others
- Image processing >>> for example, if a phone takes a photo with location enabled and the photo is sent to someone else, the recipient can work out where it was taken from that photo
- Automated scripts
- Automated testing / operations and maintenance
- GUI desktop application development >>> tkinter, PyQt