requests High Order & BS4
Yesterday's review:
requests:
get(url, headers, params, proxies)
post(url, headers, data, proxies)
xpath:
/
//
nodename
nodename[@attribute="..."]
text()
@attribute
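To make the XPath part of the review concrete, here is a minimal sketch of each listed expression with lxml; the HTML snippet and values are made up for the demo:

from lxml import etree

html = etree.HTML('<div><ul><li class="item"><a href="/a">First</a></li></ul></div>')

print(html.xpath('/html/body/div'))        # /  : walk from the root, level by level
print(html.xpath('//li'))                  # // : match the node name anywhere in the document
print(html.xpath('//li[@class="item"]'))   # nodename[@attribute="..."] : filter by attribute
print(html.xpath('//a/text()'))            # text() : extract node text -> ['First']
print(html.xpath('//a/@href'))             # @attribute : extract attribute value -> ['/a']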
1. High-order usage of requests
1. requests file upload operation
2. Session maintenance: the Session object (important)
3. Setting a timeout: timeout. If no response is returned within 5 seconds of the request, an exception is thrown.
4. Prepared Request: build a request object that can be put into a queue to implement crawl-queue scheduling (see the sketch after the code below)
1. requests file upload operation:

files = {'file': open('filename', 'rb')}
res = requests.post(url=url, files=files)

2. Session maintenance: the Session object

from requests import Session

session = Session()
res = session.get(url=url, headers=headers)

3. Setting a timeout: timeout. If no response is returned within 5 seconds of the request, an exception is thrown.

res = requests.get(url=url, headers=headers, timeout=5)

4. Prepared Request: build a request object; crawl-queue scheduling can be implemented by putting it into a queue.

from requests import Request, Session

url = '....'
data = {
    'wd': 'spiderman'
}
headers = {
    'User-Agent': '...'
}

# 1. Instantiate a session object
session = Session()

# 2. Build the request object, passing in the required parameters
#    (use data= for POST; for GET, pass params= instead)
req = Request('POST', url, data=data, headers=headers)

# 3. Call session.prepare_request() to turn the Request into a PreparedRequest object
prepared = session.prepare_request(req)

# 4. Send the request with session.send()
res = session.send(prepared)
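As a minimal sketch of the crawl-queue scheduling mentioned above: Request objects go into a queue, and a worker loop prepares and sends them one by one. The seed URLs and the plain queue.Queue here are assumptions for illustration, not part of the original notes.

from queue import Queue
from requests import Request, Session

session = Session()
crawl_queue = Queue()

# Seed the queue with Request objects (URLs are hypothetical)
for url in ['http://example.com/page1', 'http://example.com/page2']:
    req = Request('GET', url, headers={'User-Agent': '...'})
    crawl_queue.put(req)

# Worker loop: prepare and send each queued request
while not crawl_queue.empty():
    req = crawl_queue.get()
    prepared = session.prepare_request(req)
    res = session.send(prepared)
    print(res.status_code, res.url)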
2. BeautifulSoup library usage
# Beautiful Soup library introduction: Beautiful Soup is another parsing library. BS relies on a parser to process the document; the parsers it supports include html.parser, lxml, xml, html5lib, etc. Among them, the lxml parser is fast and highly fault-tolerant, so at this stage most BS work uses lxml.
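As a small illustration of that fault tolerance (the broken markup below is made up for the demo), the lxml parser repairs unclosed tags instead of failing:

from bs4 import BeautifulSoup

# Deliberately broken HTML: neither the <li> tags nor the <ul> are closed
broken_html = '<ul><li>first<li>second'

soup = BeautifulSoup(broken_html, 'lxml')
print(soup.prettify())                               # lxml closes the tags and rebuilds a valid tree
print([li.get_text() for li in soup.select('li')])   # ['first', 'second']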
# Beautiful Soup usage steps:

from bs4 import BeautifulSoup

soup = BeautifulSoup(res.text, 'lxml')
tag = soup.select("CSS selector expression")  # returns a list
# CSS selectors:

1. Locate tags by node name and hierarchy: tag selector & hierarchy selectors

soup.select('title')
soup.select('div > ul > li')   # single-level (child) selector
soup.select('div li')          # multi-level (descendant) selector

2. Locate tags by the class attribute: class selector

soup.select('.panel')

3. Locate tags by the id attribute: id selector

soup.select('#item')

4. Nested selection:

ul_list = soup.select('ul')
for ul in ul_list:
    print(ul.select('li'))

# Getting a node's text or attributes:
tag_obj.string: gets the node's direct text; if other nodes sit alongside the direct text, this returns None
tag_obj.get_text(): gets all the text of the node's descendants
tag_obj['attribute']: gets a node attribute
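The difference between tag_obj.string and tag_obj.get_text() is easy to trip over, so here is a minimal sketch (the HTML snippet is made up for the demo):

from bs4 import BeautifulSoup

html = '<div><a href="/home">home</a> trailing text</div><p>only text</p>'
soup = BeautifulSoup(html, 'lxml')

div = soup.select('div')[0]
p = soup.select('p')[0]

print(p.string)                     # 'only text' (one direct text child, so .string works)
print(div.string)                   # None (the <a> node sits alongside the direct text)
print(div.get_text())               # 'home trailing text' (all descendant text)
print(div.select('a')[0]['href'])   # '/home' (attribute access)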
# Exercise examples:

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>BeautifulSoup Practice</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">First li tag</li>
            <li class="element">Second li tag</li>
            <li class="element">Third li tag</li>
        </ul>
        <ul class="list list-small">
            <li class="element">one</li>
            <li class="element">two</li>
        </ul>
        <li class="element">Testing the multi-level selector</li>
    </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# 1. Locate a node by its tag name and get its text
h4 = soup.select('h4')  # tag selector
print(h4[0].get_text())

# 2. Locate nodes by class attribute
panel = soup.select('.panel-heading')
print(panel)

# 3. Locate nodes by id attribute
ul = soup.select('#list-1')
print(ul)

# 4. Nested selection
ul_list = soup.select('ul')
for ul in ul_list:
    li = ul.select('li')
    print(li)

# 5. Single-level vs multi-level selectors
li_list_single = soup.select(".panel-body > ul > li")  # misses the stray <li> outside the <ul>s
li_list_multi = soup.select(".panel-body li")          # also matches the stray <li>
# Homework: crawl the whole of Romance of the Three Kingdoms and write it to txt files: 'http://www.shicimingju.com/book/sanguoyanyi.html'

import requests
from bs4 import BeautifulSoup

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}

res = requests.get(url=url, headers=headers)
soup = BeautifulSoup(res.text, 'lxml')

a_list = soup.select(".book-mulu ul li a")
for item in a_list:
    name = item.string    # chapter title
    href = item["href"]   # relative link to the chapter page
    full_url = 'http://www.shicimingju.com' + href
    detail_page = requests.get(url=full_url, headers=headers).text
    soup_detail = BeautifulSoup(detail_page, 'lxml')
    div = soup_detail.select(".chapter_content")[0]
    # One txt file per chapter, named after the chapter title
    with open('%s.txt' % name, 'w', encoding="utf-8") as f:
        f.write(div.get_text())

# Dictation practice (write from memory):

# Session maintenance: the Session object
from requests import Session

session = Session()
res = session.get(url=url, headers=headers)

# Beautiful Soup usage steps:
from bs4 import BeautifulSoup

soup = BeautifulSoup(res.text, 'lxml')
tag = soup.select("CSS selector expression")  # returns a list