requests High Order & BS4

Yesterday's review:

requests:

get(url, headers, params, proxies)

post(url, headers, data, proxies)

xpath:

/

//

nodename

nodename[@attribute="..."]

text()

@attribute
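A quick, runnable recap of how those XPath pieces combine, using lxml's etree (the html string below is made up for illustration):

from lxml import etree

html = '<div><ul><li class="item">one</li><li>two</li></ul></div>'
tree = etree.HTML(html)
print(tree.xpath('//li[@class="item"]/text()'))   # ['one']  -- nodename[@attribute="..."] + text()
print(tree.xpath('//li/@class'))                  # ['item'] -- @attribute
print(tree.xpath('/html/body/div/ul/li'))         # absolute path with /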

1. High-order usage of requests

1. requests file upload operation
2. Session maintenance: the Session object (emphasis)
3. Setting a timeout: timeout. If no response is returned within the given number of seconds, an exception is thrown.
4. Prepared Request: build a request object that can be put into a queue to implement crawl-queue scheduling
1. requests file upload operation
        # 'rb' (binary read mode) is required for file uploads
        files = {'file': open('filename', 'rb')}
        res = requests.post(url=url, files=files)
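A slightly fuller sketch of the same upload. The URL and file name here are placeholders, and the tuple form (filename, file object, MIME type) is optional:

        import requests

        # A hypothetical upload endpoint; replace with a real one.
        url = 'http://example.com/upload'
        # The context manager ensures the file handle is closed after the upload.
        with open('favicon.ico', 'rb') as f:
            files = {'file': ('favicon.ico', f, 'image/x-icon')}
            res = requests.post(url=url, files=files)
        print(res.status_code)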

2. Session maintenance: the Session object
        from requests import Session
        session = Session()
        res = session.get(url=url, headers=headers)
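Why the Session object matters: it carries cookies across requests automatically. A minimal sketch against httpbin.org (assuming it is reachable):

        import requests

        session = requests.Session()
        # The first request asks the server to set a cookie...
        session.get('http://httpbin.org/cookies/set/number/123456')
        # ...and the same Session sends that cookie back on the next request.
        res = session.get('http://httpbin.org/cookies')
        print(res.text)   # the 'number' cookie shows up here

        # A bare requests.get() has no memory of the earlier cookie:
        print(requests.get('http://httpbin.org/cookies').text)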

3. Setting the timeout: timeout. If no response is returned within 5 seconds of the request, an exception is thrown.
        res = requests.get(url=url, headers=headers, timeout=5)
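What "an exception is thrown" looks like in practice, in a small sketch that deliberately triggers the timeout via httpbin.org's delay endpoint (assuming it is reachable):

        import requests
        from requests.exceptions import Timeout

        try:
            # /delay/10 answers after 10 seconds, so the 5-second timeout fires.
            res = requests.get('http://httpbin.org/delay/10', timeout=5)
            print(res.status_code)
        except Timeout:
            print('No response within 5 seconds')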

4. Prepared Request: build a request object; putting such objects in a queue enables crawl-queue scheduling (a sketch follows the code below)
        from requests import Request, Session
        url = '....'
        data = {
            'wd': 'spiderman'
        }
        headers = {
            'User-Agent': '...'
        }
        # 1. Create a Session object
        session = Session()
        # 2. Build the Request object, passing in the necessary parameters
        req = Request('POST', url, data=data, headers=headers)
        # For a GET request, pass the query string via params instead:
        # req = Request('GET', url, params=data, headers=headers)
        # 3. Use the prepare_request method to turn the Request into a PreparedRequest object
        prepared = session.prepare_request(req)
        # 4. Send the prepared request with session.send()
        res = session.send(prepared)
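The queue-scheduling idea from point 4, as a minimal sketch with the standard-library queue (the URLs are placeholders):

        from queue import Queue
        from requests import Request, Session

        session = Session()
        headers = {'User-Agent': '...'}
        q = Queue()

        # Build PreparedRequest objects up front and queue them for later sending.
        for url in ['http://example.com/page1', 'http://example.com/page2']:
            q.put(session.prepare_request(Request('GET', url, headers=headers)))

        # A worker (here, the main thread) drains the queue and sends each request.
        while not q.empty():
            prepared = q.get()
            res = session.send(prepared)
            print(prepared.url, res.status_code)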

2. BeautifulSoup library usage

# Beautiful Soup library introduction:
Beautiful Soup is also a parsing library.
BS parses data by relying on a parser. The parsers BS supports include html.parser, lxml, xml, html5lib, etc. Among them, the lxml parser is fast and highly fault-tolerant.
At this stage most BS usage goes through the lxml parser.
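That fault tolerance is easy to see: lxml completes broken markup on its own. A tiny demonstration with a deliberately malformed snippet:

from bs4 import BeautifulSoup

# The unclosed <li> and <ul> tags get auto-completed by the lxml parser.
broken = '<ul><li>one<li>two'
print(BeautifulSoup(broken, 'lxml').prettify())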
# Beautiful Soup usage steps:
from bs4 import BeautifulSoup

soup = BeautifulSoup(res.text, 'lxml')
tag = soup.select("CSS selector expression")   # Returns a list
# CSS selector:
1. Locate tags by node name and hierarchy: tag selector & hierarchy selector
soup.select('title')
soup.select('div > ul > li')   # single-level (child) selector
soup.select('div li')  # multi-level (descendant) selector

2. Locate tags by the node's class attribute: class selector
soup.select('.panel')

3. Locate tags by the id attribute: id selector
soup.select('#item')

4. Nested selection:
ul_list = soup.select('ul')
for ul in ul_list:
    print(ul.select('li'))

# Getting a node's text or attributes:
tag_obj.string: gets the node's direct text --> if the node contains child nodes alongside the direct text, this returns None
tag_obj.get_text(): gets all the text of the node's descendants
tag_obj['attribute']: gets a node attribute
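The difference between .string and .get_text() trips people up, so here is a small self-contained demonstration:

from bs4 import BeautifulSoup

snippet = '<div id="box">direct text<span>child text</span></div>'
div = BeautifulSoup(snippet, 'lxml').div

print(div.string)       # None -- the div mixes direct text with a child node
print(div.get_text())   # 'direct textchild text' -- all descendant text
print(div.span.string)  # 'child text' -- the span holds only direct text
print(div['id'])        # 'box' -- attribute access with tag_obj['attribute']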
# Exercise examples:
html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>BeautifulSoup Practice</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">First li Label</li>
                <li class="element">The second li Label</li>
                <li class="element">Third li Label</li>
            </ul>
            <ul class="list list-small">
                <li class="element">one</li>
                <li class="element">two</li>
            </ul>
            <li class="element">Testing multilevel selector</li>
        </div>
    </div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# 1. Locate the node according to its name and get its text
h4 = soup.select('h4')   # tag selector
print(h4[0].get_text())

# 2. Locate nodes according to class attributes
panel = soup.select('.panel-heading')
print(panel)

# 3. Locate nodes according to id attributes
ul = soup.select('#list-1')
print(ul)

# 4. Nested selection
ul_list = soup.select('ul')
for ul in ul_list:
    li = ul.select('li')
    print(li)
    
# 5. Single-level selector and multi-level selector
li_list_single = soup.select(".panel-body > ul > li")
li_list_multi = soup.select(".panel-body li")
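Run against the sample html above, the two selectors differ by exactly the stray li tag:

print(len(li_list_single))   # 5 -- only <li> tags that are direct children of a <ul>
print(len(li_list_multi))    # 6 -- also matches the stray <li> outside any <ul>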


# Homework: Crawl the full text of Romance of the Three Kingdoms and write it to txt files: 'http://www.shicimingju.com/book/sanguoyanyi.html'

import requests
from bs4 import BeautifulSoup

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
res = requests.get(url=url, headers=headers)
soup = BeautifulSoup(res.text, 'lxml')
a_list = soup.select(".book-mulu ul li a")
for item in a_list:
    name = item.string
    href = item["href"]
    # print(href)
    full_url = 'http://www.shicimingju.com' + href
    detail_page = requests.get(url=full_url, headers=headers).text
    soup_detail = BeautifulSoup(detail_page, 'lxml')
    div = soup_detail.select(".chapter_content")[0]
    print('Saving %s' % name)   # progress output
    with open('%s.txt' % name, 'w', encoding="utf-8") as f:
        f.write(div.get_text())
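If the assignment means one single txt file rather than one file per chapter, the same loop can write to a shared handle instead, a small variation on the script above (reusing a_list and headers):

with open('sanguoyanyi.txt', 'w', encoding='utf-8') as f:
    for item in a_list:
        name = item.string
        full_url = 'http://www.shicimingju.com' + item['href']
        detail_page = requests.get(url=full_url, headers=headers).text
        soup_detail = BeautifulSoup(detail_page, 'lxml')
        div = soup_detail.select('.chapter_content')[0]
        # One chapter title followed by its text, appended in order.
        f.write(name + '\n' + div.get_text() + '\n')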
        
        
# Write from memory:
# Session maintenance: the Session object
	from requests import Session
	session = Session()
	res = session.get(url=url, headers=headers)
    
# Beautiful Soup usage steps:
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, 'lxml')
tag = soup.select("CSS selector expression")   # Returns a list
