urllib.parse.urlencode() and urllib.parse.unquote()
- Encoding uses the urlencode() function of urllib.parse, which converts key:value pairs into strings like "key=value". Decoding uses urllib.parse's unquote() function.
# Test results in a Python 3.5 console
>>> import urllib.parse
>>> word = {"wd": "爬虫"}   # "爬虫" means "crawler"
# urllib.parse.urlencode() converts dictionary key-value pairs into
# URL-encoded form so they can be accepted by the web server.
>>> urllib.parse.urlencode(word)
'wd=%E7%88%AC%E8%99%AB'
# urllib.parse.unquote() converts the URL-encoded string back to the original string.
>>> urllib.parse.unquote('wd=%E7%88%AC%E8%99%AB')
'wd=爬虫'
When an HTTP request submits data, the data usually needs to be URL-encoded first, and then either appended to the URL itself (the GET method) or passed to the Request object through its data parameter (the POST method).
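A minimal sketch of the two approaches (the URL and parameter names below are purely illustrative):

import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"q": "python"})   # -> 'q=python'

# GET: append the encoded string to the URL itself
get_request = urllib.request.Request("http://example.com/search?" + params)

# POST: pass the encoded string, converted to bytes, via the data parameter
post_request = urllib.request.Request("http://example.com/search",
                                      data=params.encode("utf-8"))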
GET method
GET requests are typically used to retrieve data from the server. For example, searching Baidu for "爬虫" (crawler): https://www.baidu.com/s?wd=爬虫, which the browser actually sends as https://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB.
We can see that in the request, http://www.baidu.com/s? is followed by a long string, which is the URL-encoded form of the keyword "爬虫" that we want to query, so we can try sending the request with the default GET method.
#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'mayi'

# Import libraries
import urllib.request
import urllib.parse

url = "http://www.baidu.com/s?"
word = {"wd": "爬虫"}   # "爬虫" means "crawler"
# Convert to URL-encoded format
word = urllib.parse.urlencode(word)
# Concatenate into a complete url
full_url = url + word
# Chrome's User-Agent, placed in the header
header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
# The url, together with the headers, constructs a Request that carries the Chrome browser's User-Agent
request = urllib.request.Request(full_url, headers=header)
# Send this request to the server
response = urllib.request.urlopen(request)
html = response.read()
fo = open("baidu.html", "wb")
fo.write(html)
fo.close()
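In practice urlopen() can fail for network or HTTP reasons; a minimal sketch of guarding the request above with the standard urllib.error module (reusing the request object from the script):

import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen(request, timeout=10)
    print(response.getcode())   # HTTP status code, e.g. 200
    html = response.read()
except urllib.error.HTTPError as e:
    print("The server returned an error:", e.code)
except urllib.error.URLError as e:
    print("Failed to reach the server:", e.reason)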
Batch-Crawling Baidu Tieba Page Data
First we create a Python file, tiebaSpider.py. What we want to do is enter the address of a Baidu Tieba bar, for example the LOL bar on Baidu Tieba:
First page: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=0
Page 2: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=50
Page 3: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=100
......
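Comparing these URLs, only the trailing pn value changes, following pn = (page - 1) * 50. A quick loop confirms the pattern:

base = "http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn="
for page in range(1, 4):
    print(base + str((page - 1) * 50))   # prints the three URLs above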
The script below crawls the contents of the pages above.
#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'mayi'
"""
Function: batch-crawl Tieba page data
Target address: Baidu Tieba, LOL bar
Analysis:
    Page 1: https://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=0
    Page 2: https://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=50
    Page 3: https://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=100
    ......
Pattern: the page URLs differ only in the trailing pn value; the rest is
the same, and pn = (page - 1) * 50.
    url = "https://tieba.baidu.com/f?kw=lol&ie=utf-8&pn="
    pn = (page - 1) * 50
    full_url = url + str(pn)
"""
# Import libraries
import urllib.parse
import urllib.request

# Get the server response based on the url address
def loadPage(url):
    """
    Function: get the server response based on the url address
    :param url: url address
    :return: the server response content
    """
    # Chrome's User-Agent header
    header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
    # The url, together with the headers, constructs a Request that carries the Chrome browser's User-Agent
    request = urllib.request.Request(url, headers=header)
    # Send this request to the server
    response = urllib.request.urlopen(request)
    # Read everything in the response
    html = response.read()
    return html

# Store the file
def writeFile(html, file_name):
    """
    Function: save the server response to a local file
    :param html: server response content
    :param file_name: local file name
    :return: None
    """
    with open(file_name, "wb") as f:
        f.write(html)

# Tieba crawler function
def tiebaSpider(url, begin_page, end_page):
    """
    Function: process the url pages from begin_page to end_page
    :param url: url address
    :param begin_page: first page to crawl
    :param end_page: last page to crawl
    :return: None
    """
    for page in range(begin_page, end_page + 1):
        pn = (page - 1) * 50
        full_url = url + str(pn)
        file_name = "page" + str(page) + ".html"
        print("Crawling " + file_name)
        # Get the html content of full_url
        html = loadPage(full_url)
        print("Storing " + file_name)
        # Store the html content corresponding to full_url
        writeFile(html, file_name)

# Main function
if __name__ == '__main__':
    url = "https://tieba.baidu.com/f?"
    # Enter the bar to crawl
    kw = input("Please enter the bar to crawl:")
    # Enter the start and end pages to crawl
    begin_page = int(input("Please enter the start page:"))
    end_page = int(input("Please enter the end page:"))
    key = urllib.parse.urlencode({"kw": kw})
    # Combined example url: https://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=
    url = url + key + "&ie=utf-8&pn="
    # Call the Tieba crawler function to crawl the data
    tiebaSpider(url, begin_page, end_page)
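When fetching many pages in a row, it can be polite to pause between requests. A minimal variation of the loop, assuming the same loadPage() and writeFile() helpers (tiebaSpiderPolite is a hypothetical name for this sketch):

import time

def tiebaSpiderPolite(url, begin_page, end_page, delay=1.0):
    """Like tiebaSpider(), but pauses `delay` seconds between requests."""
    for page in range(begin_page, end_page + 1):
        pn = (page - 1) * 50
        full_url = url + str(pn)
        file_name = "page" + str(page) + ".html"
        print("Crawling " + file_name)
        writeFile(loadPage(full_url), file_name)   # same helpers as above
        time.sleep(delay)                          # be gentle on the server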
POST method
The Request object has a data parameter, which is what POST uses. The data we want to transfer is passed through this parameter; it is a dictionary of matching key-value pairs, which must be URL-encoded and converted to bytes before sending, as the sketch below shows.
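A minimal sketch of the core steps (the URL and form fields here are illustrative): URL-encode the dictionary, convert the resulting string to bytes, and attach it as data.

import urllib.parse
import urllib.request

form = {"key": "value"}                 # the key-value pairs to submit
data = urllib.parse.urlencode(form)     # -> 'key=value'
data = data.encode("utf-8")             # urlopen() requires bytes, not str
request = urllib.request.Request("http://example.com/api", data=data)
# The presence of a data argument is what makes this a POST instead of a GET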
The following example simulates a POST request using the Youdao translation site.
#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'mayi'
"""
POST method: using the Youdao translation site as an example
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null"
"""
# Import libraries
import urllib.request
import urllib.parse

url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null"
# Chrome's User-Agent, placed in the header
header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
word = input("Enter the term you need to translate:")
from_data = {
    "i": word,
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.wed"
}
data = urllib.parse.urlencode(from_data)
data = data.encode(encoding="utf-8")  # str to bytes
request = urllib.request.Request(url, data=data, headers=header)
response = urllib.request.urlopen(request)
html = response.read().decode(encoding="utf-8").strip()
print(html)
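Since the form sets doctype to json, the response body is a JSON string; a minimal sketch of decoding it with the standard json module (the exact field layout of the response is undocumented, so inspect it before extracting values):

import json

result = json.loads(html)   # parse the JSON string into Python objects
print(result)               # inspect the structure before extracting fields
# The exact nesting of the translation result is a detail of the site's
# response format and may change, so explore `result` interactively first.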
Get the content loaded by AJAX
Some web page content is loaded via AJAX. AJAX endpoints generally return JSON, so sending a POST or GET request directly to the AJAX address returns the JSON data.
#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'mayi'
"""
Get data loaded by AJAX
Some web page content is loaded via AJAX. AJAX endpoints generally return JSON,
so sending a POST or GET request directly to the AJAX address returns JSON data.
Using Douban as an example:
url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100:90&action&start=0&limit=100"
"""
# Import libraries
import urllib.parse
import urllib.request

url = "https://movie.douban.com/j/chart/top_list?"
# Chrome's User-Agent, placed in the header
header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
from_data = {
    'type': '11',
    'interval_id': '100:90',
    'action': '',
    'start': '0',
    'limit': '100'
}
data = urllib.parse.urlencode(from_data)
data = data.encode(encoding="utf-8")  # str to bytes
request = urllib.request.Request(url, data=data, headers=header)
response = urllib.request.urlopen(request)
html = response.read().decode(encoding="utf-8")
print(html)
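Because this endpoint returns a JSON array, json.loads() turns it into a Python list of dictionaries; a minimal sketch (the "title" field is an assumption based on typical responses from this endpoint and may change):

import json

movies = json.loads(html)          # JSON array -> list of dicts
print(len(movies), "movies returned")
for movie in movies[:5]:
    # "title" is assumed from typical responses; verify against real output
    print(movie.get("title"))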