Crawlers - GET and POST requests

urllib.parse.urlencode() and urllib.parse.unquote()

  • Encoding uses the urlencode() function of urllib.parse, which helps us convert key-value pairs into URL-encoded strings like "key=value". Decoding uses urllib.parse's unquote() function.
# Test results in a Python 3.5 console
>>> import urllib.parse
>>> word = {"wd": "爬虫"}  # "爬虫" means "crawler"
# The urllib.parse.urlencode() method converts dictionary key-value pairs into a URL-encoded string that the server can accept.
>>> urllib.parse.urlencode(word)
'wd=%E7%88%AC%E8%99%AB'
# The urllib.parse.unquote() method converts the URL-encoded string back to the original string.
>>> urllib.parse.unquote('wd=%E7%88%AC%E8%99%AB')
'wd=爬虫'
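
For a single value rather than a dictionary of key-value pairs, urllib.parse.quote() is the encoding counterpart of unquote():

>>> urllib.parse.quote("爬虫")
'%E7%88%AC%E8%99%AB'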

Typically, when an HTTP request submits data, the data must be URL-encoded and then either appended to the URL (for GET requests) or passed to the Request object through its data parameter (for POST requests), as sketched below.
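
A minimal sketch of the two options (nothing is sent here; it only shows how the same encoded string travels in each case, and the POST target is purely illustrative):

import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"wd": "爬虫"})
# GET: the encoded string becomes part of the URL itself
get_url = "http://www.baidu.com/s?" + params
# POST: the encoded string is converted to bytes and passed via the data parameter
# (a Request constructed with data defaults to the POST method)
post_request = urllib.request.Request("http://www.baidu.com/s", data=params.encode("utf-8"))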

GET method

GET requests are typically used to fetch data from the server. For example, searching Baidu for 爬虫 (crawler) requests https://www.baidu.com/s?wd=爬虫, which the browser URL-encodes as https://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB.

In the request we can see that http://www.baidu.com/s? is followed by a long encoded string containing the keyword "爬虫" we want to query, so we can try sending this request using the default GET method.

#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'mayi'

# Import libraries
import urllib.parse
import urllib.request

url = "http://www.baidu.com/s?"
word = {"wd": "爬虫"}  # "爬虫" means "crawler"
# Convert to URL-encoded format
word = urllib.parse.urlencode(word)
# Join into a complete URL
full_url = url + word
# A Chrome User-Agent, included in the headers
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
# Build a Request from the URL and headers so the request carries the Chrome browser's User-Agent
request = urllib.request.Request(full_url, headers=header)
# Send the request to the server
response = urllib.request.urlopen(request)

# Read the response body and save it to a local file
html = response.read()
with open("baidu.html", "wb") as fo:
    fo.write(html)
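
Before saving, it can also be useful to inspect the response object; these accessors all exist on the object returned by urlopen():

print(response.getcode())  # HTTP status code, e.g. 200
print(response.geturl())   # the final URL, after any redirects
print(response.info())     # the response headers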

Batch-crawling Baidu Tieba page data

First we create a Python file, tiebaSpider.py. We want to be able to enter the address of a Baidu Tieba forum, for example the LOL bar:

First page: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=0

Page 2: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=50

Page 3: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=100

......
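
The only part that changes from page to page is the final pn value, with pn = (page - 1) * 50. A quick sketch to verify the pattern before writing the full crawler:

base = "http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn="
for page in range(1, 4):
    print(base + str((page - 1) * 50))  # prints the three URLs listed above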

Now let's crawl the contents of the pages above.

#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'mayi'

"""
Function: Batch-crawl Baidu Tieba page data
Target address: the LOL bar on Baidu Tieba
Analysis:
    Page 1: https://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=0
    Page 2: https://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=50
    Page 3: https://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=100
    ......
Pattern:
    The only difference between the page URLs is the final pn value; everything else is identical, and pn = (page - 1) * 50
    url = "https://tieba.baidu.com/f?kw=lol&ie=utf-8&pn="
    pn = (page - 1) * 50
    full_url = url + str(pn)
"""

# Import libraries
import urllib.parse
import urllib.request

# Get the server response based on a url address
def loadPage(url):
    """
    Function: Get the server response based on a url address
    :param url: url address
    :return: The server's response content
    """
    # A Chrome User-Agent, included in the headers
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
    # Build a Request from the url and headers so the request carries the Chrome browser's User-Agent
    request = urllib.request.Request(url, headers=header)
    # Send the request to the server
    response = urllib.request.urlopen(request)
    # Read the entire response body
    html = response.read()

    return html

# Save the file
def writeFile(html, file_name):
    """
    Function: Save the server response content to a local disk file
    :param html: Server response content
    :param file_name: Local disk file name
    :return: None
    """
    with open(file_name, "wb") as f:
        f.write(html)


# Tieba crawler function
def tiebaSpider(url, begin_page, end_page):
    """
    Function: Process url pages from begin_page to end_page
    :param url: url address
    :param begin_page: First page to crawl
    :param end_page: Last page to crawl
    :return: None
    """
    for page in range(begin_page, end_page + 1):
        pn = (page - 1) * 50
        full_url = url + str(pn)
        file_name = "page" + str(page) + ".html"
        print("Crawling " + file_name)
        # Get the html content of full_url
        html = loadPage(full_url)
        print("Storing " + file_name)
        # Store the html content corresponding to full_url
        writeFile(html, file_name)

# Main function
if __name__ == '__main__':
    url = "https://tieba.baidu.com/f?"
    # Enter the bar to crawl
    kw = input("Please enter the name of the bar to crawl:")
    # Enter the start and end pages to crawl
    begin_page = int(input("Please enter the start page:"))
    end_page = int(input("Please enter the end page:"))
    key = urllib.parse.urlencode({"kw": kw})
    # Example of the combined url: https://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=
    url = url + key + "&ie=utf-8&pn="
    # Call the Tieba crawler function to crawl the data
    tiebaSpider(url, begin_page, end_page)
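
The script above has no error handling: a single network failure raises urllib.error.URLError and aborts the whole batch. A minimal sketch of a retrying wrapper (loadPageSafe is a hypothetical helper, not part of the original script):

import time
import urllib.error

def loadPageSafe(url, retries=3):
    """Hypothetical variant of loadPage() that retries on network errors."""
    for attempt in range(retries):
        try:
            return loadPage(url)
        except urllib.error.URLError as e:
            print("Request failed (%s), retrying..." % e.reason)
            time.sleep(2)  # brief pause before retrying; also polite to the server
    return None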

POST method

The Request object has a data parameter, which is used for POST requests: the data we want to send is passed through this parameter, built from a dictionary of matching key-value pairs that is URL-encoded and converted to bytes.

The following example simulates a POST request using the Youdao dictionary translation site.

#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'mayi'

"""
POST method: using the Youdao dictionary translation site as an example
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null"
"""

# Import libraries
import urllib.parse
import urllib.request
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null"
# A Chrome User-Agent, included in the headers
header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}

word = input("Enter the terms you need to translate:")

form_data = {
    "i":word,
    "from":"AUTO",
    "to":"AUTO",
    "smartresult":"dict",
    "doctype":"json",
    "version":"2.1",
    "keyfrom":"fanyi.wed"
}
data = urllib.parse.urlencode(form_data)
data = data.encode("utf-8")  # convert str to bytes

request = urllib.request.Request(url, data=data, headers=header)

response = urllib.request.urlopen(request)

html = response.read().decode("utf-8").strip()

print(html)
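
Because doctype is set to json in the form data, html is a JSON string rather than an HTML page. A sketch of extracting the translated text with the standard json module; the translateResult and tgt field names are assumptions based on Youdao's historical response format, so inspect the printed output to confirm them:

import json

result = json.loads(html)
# Field names below are assumptions; verify them against the printed response
translation = result["translateResult"][0][0]["tgt"]
print(translation)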

Get the content loaded by AJAX

Some web page content is loaded via AJAX, and AJAX responses are generally JSON. Sending a POST or GET request directly to the AJAX address returns the JSON data.

#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'mayi'

"""
Get data loaded by AJAX
Some web page content is loaded via AJAX. AJAX responses are generally JSON, so sending a POST or GET request directly to the AJAX address returns the JSON data.
Take Douban movies as an example:
url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100:90&action&start=0&limit=100"
"""

# Import libraries
import urllib.parse
import urllib.request

url = "https://movie.douban.com/j/chart/top_list?"
# A Chrome User-Agent, included in the headers
header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
form_data = {
    'type':'11',
    'interval_id':'100:90',
    'action':'',
    'start':'0',
    'limit':'100'
}
data = urllib.parse.urlencode(form_data)
data = data.encode("utf-8")  # convert str to bytes

request = urllib.request.Request(url, data=data, headers=header)

response = urllib.request.urlopen(request)

html = response.read().decode("utf-8")

print(html)
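
Here too the response is JSON, in this case a list of movie records. A sketch of parsing it (the "title" and "score" field names are assumptions; check the printed output to confirm):

import json

movies = json.loads(html)  # the endpoint returns a JSON array of movie records
for movie in movies[:10]:
    # Field names are assumptions; inspect the raw response to confirm
    print(movie.get("title"), movie.get("score"))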
