Reptile small case: suitable for Python zero foundation, reptile data collection interested students!

Preface

The text and pictures of this article are from the Internet, only for learning and communication, not for any commercial purpose. The copyright belongs to the original author. If you have any questions, please contact us in time for handling.

When I was a child, I always had 100000 questions about why. Today, I take you to climb a question and answer website. In this lesson, I use regular expression to extract data of text class. Regular expression is a general method of data extraction.

Suitable for:

Python zero foundation, interested in reptile data collection students!

Environment introduction:

python 3.6
pycharm
requests
re
json

General thinking of reptiles

1. Determine the url path to crawl, and the headers parameter

2. Send request -- requests simulate browser to send request and get response data

3. Parsing data -- re module: provides all regular expression functions

4. Save data -- save data in json format

1. Determine the url path to crawl, and the headers parameter

    base_url = 'https://www.guokr.com/ask/highlight/?page={}'.format(str(page))
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}

2. Send request -- requests simulate browser to send request and get response data

   response = requests.get(base_url, headers=headers)
    data = response.text
    # print(data)

3, Parsing data -- re module: provides all regular expression functions

Compiling regular expression precompiled code objects is faster than using strings directly, because the interpreter recommends compiling strings into code objects before executing the code in the form of strings

pattern = re.compile('<h2><a target="_blank" href="(.*?)">(.*?)</a></h2>', re.S)
    pattern_list = pattern.findall(data)  # -->list
    # print(pattern_list)
    # json [{[]}]{}
    # structure json data format
    for i in pattern_list:
        data_dict = {}
        data_dict['title'] = i[1]
        data_dict['href'] = i[0]
        data_list.append(data_dict)
    # convert to json format
    # json.dumps Default for Chinese in serialization ascii Code.To output real Chinese, you need to specify ensure_ascii=False: 
    json_data_list = json.dumps(data_list, ensure_ascii=False)
    # print(json_data_list)
    
with open("guoke02.json", 'w', encoding='utf-8') as f:
    f.write(json_data_list)

4. Save file in json format

20 pieces of data per page, 100 pages in total, 2000 pieces of data~

If you want to learn Python or are learning python, there are many Python tutorials, but are they up to date? Maybe you have learned something that someone else probably learned two years ago. Share a wave of the latest Python tutorials in 2020 in this editor. Access to the way, private letter small "information", you can get free Oh!

The complete code is as follows:

# requests
# re
# json
# General thinking of reptiles
# 1,Determine the climbing url route, headers parameter
# 2,Send request -- requests Simulate browser to send request and get response data
# 3,Parse data -- re Module: provides all regular expression functions
# 4,Save data -- Preservation json Data in format
import requests  # pip install requests
import re
import json
data_list = []
for page in range(1, 101):
    print("====Climbing to the top{}Industry data====\n".format(page))
    # 1,Determined to crawl url route, headers parameter
    base_url = 'https://www.guokr.com/ask/highlight/?page={}'.format(str(page))
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
    # 2,Send request -- requests Simulate browser to send request and get response data
    response = requests.get(base_url, headers=headers)
    data = response.text
    # print(data)
    # 3,Parse data -- re Module: provides all regular expression functions
    # <h2><a target="_blank" href="https://Www.guokr.com/question/669761/ "> Indians call men's genitals Linga and women's genitals Yuni. Yoga is the combination of Linga and Yuni. Is this true or not</a></h2>
    # 3,1 Compiling regular expression precompiled code objects is faster than using strings directly, because the interpreter recommends that you compile strings into code objects before executing the code in the form of strings
    pattern = re.compile('<h2><a target="_blank" href="(.*?)">(.*?)</a></h2>', re.S)
    pattern_list = pattern.findall(data)  # -->list
    # print(pattern_list)
    # json [{[]}]{}
    # structure json data format
    for i in pattern_list:
        data_dict = {}
        data_dict['title'] = i[1]
        data_dict['href'] = i[0]
        data_list.append(data_dict)
    # convert to json format
    # json.dumps Default for Chinese in serialization ascii Code.To output real Chinese, you need to specify ensure_ascii=False: 
    json_data_list = json.dumps(data_list, ensure_ascii=False)
    # print(json_data_list)
    # Preservation json Format file
    with open("guoke02.json", 'w', encoding='utf-8') as f:
        f.write(json_data_list)

Keywords: Python JSON Windows ascii

Added by javamint on Wed, 06 May 2020 10:44:25 +0300

Programming VIP

Reptile small case: suitable for Python zero foundation, reptile data collection interested students!

Popular Keywords