Using Python to collect website data and save the detail pages as PDF

Contents of this lesson:

Using Python to collect website data and save the detail pages as PDF

Development environment used for this lesson:

Module usage:

Modules to be installed

  • requests (data request module)
    Installation method: pip install requests
  • parsel (data parsing module) pip install parsel
  • pdfkit (HTML-to-PDF module) pip install pdfkit

Built-in modules (no installation required)

  • re regular expressions, built-in module
  • json converts strings to JSON data, built-in module
  • csv saving data to CSV, built-in module
  • time time handling, built-in module

How to install modules

  1. Press Win + R, type cmd and click OK, then enter the installation command pip install module name (e.g. pip install requests) and press Enter
  2. Click Terminal in PyCharm and enter the same installation command

Case idea for this lesson (the most basic idea and workflow of a crawler):

I Data source analysis

  1. Determine what data content we want: the job listing data
  2. Use the browser developer tools to capture packets and analyze the data source >>> find out where the job listing data actually comes from

II Code implementation steps, the basic crawler workflow: send request >>> get data >>> parse data >>> save data (a minimal sketch follows the list below)

  1. Send a request: decide which URL to send it to and carry headers to disguise the request
    Website URL
    Send a GET request
  2. Get the data: the response data returned by the server
  3. Parse the data: extract the job-related information we want
  4. Save the data: text / database / table... here, CSV table data
  5. Multi-page data collection
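
A minimal sketch of this workflow (the URL, CSS selector, and output file here are placeholders, not the ones used in this case):

import csv

import requests  # pip install requests
import parsel    # pip install parsel

url = 'https://example.com/list?page=1'            # placeholder URL
headers = {'User-Agent': 'Mozilla/5.0'}            # disguise the request

response = requests.get(url=url, headers=headers)  # 1. send a request
html_text = response.text                          # 2. get the data
selector = parsel.Selector(html_text)              # 3. parse the data
titles = selector.css('h2.title::text').getall()   # placeholder CSS selector

with open('demo.csv', mode='a', encoding='utf-8', newline='') as f:  # 4. save the data
    writer = csv.writer(f)
    for t in titles:
        writer.writerow([t])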

Code walkthrough


First, import the modules

import requests  # Data request module  pip install requests
import parsel  # Data parsing module  pip install parsel
import pdfkit  # HTML-to-PDF module  pip install pdfkit
# Import the regular expression module
import re  # Built-in module
# Import the json module
import json  # Built-in module
# Import the pretty-print (formatted output) module
import pprint  # Built-in module
# Import the csv module
import csv  # Built-in module
# Import the time module
import time  # Built-in module

1. Send request

def get_job_content(title, html_url):
    # url = 'Copy the details page URL yourself~'  # Recruitment details page
    html_str = """
    <!doctype html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Document</title>
    </head>
    <body>
    {article}
    </body>
    </html>
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
    }
    response = requests.get(url=html_url, headers=headers)
    response.encoding = 'gbk'
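    # The pages here are GBK-encoded, hence encoding = 'gbk' above; if you are unsure of a
    # page's encoding, requests can also guess it: response.encoding = response.apparent_encoding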

2. Obtain data

    # print(response.text)

3. Parse the data. The CSS selector extracts the data content according to tag attributes

    selectors = parsel.Selector(response.text)  # Convert the obtained html string data into a selector object
    content = selectors.css('body > div.tCompanyPage > div.tCompany_center.clearfix > div.tCompany_main').get()
    print(content)
    html_data = html_str.format(article=content)
    # File name like '1.html'; here it is company name + position name (the title parameter)
    html_path = 'html\\' + title + '.html'
    pdf_path = 'pdf\\' + title + '.pdf'
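    # Note (assumption): the 'html' and 'pdf' folders must already exist in the working directory;
    # they could be created up front with os.makedirs('html', exist_ok=True) (requires import os)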
    with open(html_path, mode='w', encoding='utf-8') as f:
        f.write(html_data)

    config = pdfkit.configuration(wkhtmltopdf=r'C:\01-Software-installation\wkhtmltopdf\bin\wkhtmltopdf.exe')
    pdfkit.from_file(html_path, pdf_path, configuration=config)
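
If you do not need to keep the intermediate HTML file, pdfkit can also convert the HTML string directly with from_string (an optional variant, using the same config object as above):

    pdfkit.from_string(html_data, pdf_path, configuration=config)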



# mode: file open mode. 'a' appends and will not overwrite; 'w' writes and will overwrite
f = open('recruit_1.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'title',
    'Company name',
    'salary',
    'city',
    'education',
    'experience',
    'Company type',
    'Company attributes',
    'Company size',
    'fringe benefits',
    'Release date',
    'Detail page',
])
csv_writer.writeheader()  # Write header
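
An optional variant (a sketch, not used in this case): write the header only when the file does not exist yet, so that re-running the script in append mode does not add a second header row.

import csv
import os

file_exists = os.path.exists('recruit_1.csv')  # check before opening, since 'a' mode creates the file
f = open('recruit_1.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=['title', 'Company name'])  # use the full field list shown above
if not file_exists:
    csv_writer.writeheader()  # write the header only on the first run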

for page in range(1, 11):

1. Send a request. f'{page}' uses the f-string format method to insert the page number into the URL

    print(f'=============================== Collecting data from page {page} ===============================')
    time.sleep(2)
    url = f'Details page URL, copy it yourself~'
    # headers: dictionary data type, in key-value pair form
    # Quick batch replacement: select the content to replace, press Ctrl + R and enter the regex syntax
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
    }

    # Send a request to the url address with the get method of the requests module, carry the headers request header, and receive the returned data content in the response variable
    response = requests.get(url=url, headers=headers)

2. Get the data, i.e. the response data returned by the server

    # print(response.text)

3. Parse the data

    # From response.text, find: window.__SEARCH_RESULT__ = (.*?)</script>
    # i.e. everything between 'window.__SEARCH_RESULT__ = ' and '</script>'
    html_data = re.findall('window.__SEARCH_RESULT__ = (.*?)</script>', response.text)[0]  # findall() returns a list of matches; [0] takes the first one
    # print(html_data)
    # type() to view the data type
    # print(type(html_data))
    # A dictionary is much more convenient for taking values, so convert the string into dictionary data
    json_data = json.loads(html_data)  # Convert to dictionary data type
    # Dictionary values are obtained through key-value pairs: the key on the left of the colon extracts the value on the right
    # pprint.pprint(json_data['engine_jds']) pretty-prints the dictionary in expanded form; print() prints it on one line
    # e.g. lis = [1, 2, 3, 4]; for i in lis: ... (a for loop extracts the elements of a list one by one)
    for index in json_data['engine_jds']:
        dit = {
            'title': index['job_name'],
            'Company name': index['company_name'],
            'salary': index['providesalary_text'],
            'city': index['workarea_text'],
            'education': index['attribute_text'][2],
            'experience': index['attribute_text'][1],
            'Company type': index['companytype_text'],
            'Company attributes': index['companyind_text'],
            'Company size': index['companysize_text'],
            'fringe benefits': index['jobwelf'],
            'Release date': index['updatedate'],
            'Detail page': index['job_href'],
        }
        title = index['job_name'] + index['company_name']
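        # Strip characters that are illegal in Windows file names (the title is used as the file name)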
        title = re.sub(r'[/\:?*"<>|]', '', title)
        get_job_content(title, index['job_href'])
        csv_writer.writerow(dit)
        print(dit)

A few small knowledge points

If XPath, CSS selectors, or regular expressions return an empty data list, check the following (a quick check is sketched after the list)

  1. The syntax is incorrect
  2. Whether the server actually returned that data (whether you have been blocked by anti-crawling measures)
  3. Whether you found the right data source
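
A quick check (a minimal sketch, with a placeholder URL and keyword): print the status code and confirm that the target string is actually in the raw response before blaming the selector syntax.

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://example.com/list?page=1', headers=headers)  # placeholder URL
print(response.status_code)                   # 200 alone does not guarantee real data
print('__SEARCH_RESULT__' in response.text)   # is the data we want in this response at all?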

XPath Helper (it matches against the Elements panel)

A crawler works with the data returned by the server, not with what the browser renders
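
A related sketch (response is the object returned by requests.get above): dump the raw response to a file and open it to see exactly what the crawler received, which is often different from the page rendered in the browser.

# Save the raw response so you can inspect what the server actually returned
with open('response_debug.html', mode='w', encoding='utf-8') as f:
    f.write(response.text)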

Python application areas

  1. Crawler programs
  2. Data analysis >>> data analysis, Power BI
  3. Website development >>> developing websites
  4. Game development >>> pygame
  5. Game assistance >>> simulated clicks, image recognition
  6. AI >>> at present, mostly calling API interfaces for algorithms written by others
  7. Image processing >>> for example, locating by photo: a phone takes a photo with location enabled and sends it to someone, and that person can be located through the photo
  8. Automation scripts
  9. Automated testing / operations and maintenance
  10. GUI desktop application development >>> tkinter, PyQt

Keywords: Python, Programming, PyCharm, crawler, Data Analysis
