requests-html is a relatively new crawling library, written by the same author as the requests library.
I. Installation and dependencies
pip install requests-html
During installation you can see that lxml, requests, bs4 and other packages are pulled in as dependencies: the parsing and crawling libraries we usually install separately are all bundled with it.
II. Initiating requests
from requests_html import HTMLSession

session = HTMLSession()
# Usage is exactly the same as an instantiated requests.Session object,
# and it automatically saves the returned session information.
# Compared with requests, the response gains an extra response.html attribute.
Note: requests themselves are sent without a browser by default; the headless browser kernel is only launched when you call render().
1. Controlling the headless browser (an anti-scraping countermeasure; it doesn't matter if the site has no anti-scraping measures)
Modify the source code:
Ctrl+click HTMLSession to jump to its definition; you can see that it inherits from BaseSession.
Ctrl+click BaseSession to jump to that class.
Original source code
class BaseSession(requests.Session):
    def __init__(self, mock_browser: bool = True, verify: bool = True,
                 browser_args: list = ['--no-sandbox']):
        super().__init__()
        # Mock a web browser's user agent.
        if mock_browser:
            self.headers['User-Agent'] = user_agent()
        self.hooks['response'].append(self.response_hook)
        self.verify = verify
        self.__browser_args = browser_args

    @property
    async def browser(self):
        if not hasattr(self, "_browser"):
            self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify),
                                                   headless=True,
                                                   args=self.__browser_args)
        return self._browser
Modified source code
class BaseSession(requests.Session):
    """ A consumable session, for cookie persistence and connection pooling,
    amongst other things. """

    def __init__(self, mock_browser: bool = True, verify: bool = True,
                 browser_args: list = ['--no-sandbox'], headless=True):
        super().__init__()
        # Mock a web browser's user agent.
        if mock_browser:
            self.headers['User-Agent'] = user_agent()
        self.hooks['response'].append(self.response_hook)
        self.verify = verify
        self.__browser_args = browser_args
        self.__headless = headless  # store the new flag; don't delete this line

    @property
    async def browser(self):
        if not hasattr(self, "_browser"):
            self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify),
                                                   headless=self.__headless,
                                                   args=self.__browser_args)
        return self._browser
In effect, all I did was add a headless parameter and pass it through to pyppeteer.launch().
Then pass the arguments when instantiating the session:
from requests_html import HTMLSession

session = HTMLSession(
    browser_args=[
        '--no-sandbox',
        '--user-agent=xxxxx',
    ]
)
# This lets you define directly what kind of browser is launched to send requests.
2. Hiding the browser kernel's webdriver flag (an anti-scraping countermeasure; it doesn't matter if the site has no anti-scraping measures)
# JS injection when rendering with the browser kernel
from requests_html import HTMLSession

session = HTMLSession(.....)
response = session.get('https://www.baidu.com')

script = '''
() => {
    Object.defineProperties(navigator, {
        webdriver: {
            get: () => undefined
        }
    })
}'''

print(response.html.render(script=script))
III. response.html related attributes
The response object referred to below is:
from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://www.baidu.com')
# For the following sections, this is the response being discussed.
1.absolute_links
All links on the page are converted to absolute URLs before being returned.
2.links
Returns the links exactly as written in the page (relative paths stay relative).
3.base_url
The URL from the page's <base> tag; if there is no <base> tag, it is the current URL.
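To see the difference between links and absolute_links, here is a small sketch of the URL resolution involved, using only the standard library (the base URL and link set below are made-up examples; requests-html performs this resolution for you):

```python
from urllib.parse import urljoin

# Hypothetical base_url and raw links, as .links might return them.
base_url = 'https://www.example.com/docs/'
links = {'page1.html', '/about', 'https://other.site/x'}

# .absolute_links resolves each one against the base URL, roughly like this:
absolute_links = {urljoin(base_url, link) for link in links}
print(sorted(absolute_links))
```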
4.html
Returns the page as a string, tags included.
5.text
Returns the page as a string with the tags stripped. Extremely handy for scraping novels, news articles and the like!
6.encoding
The decoding format. Note that this is the encoding of response.html; setting only response.encoding has no effect on this encoding.
7.raw_html
Equivalent to the raw bytes returned by r.content.
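The relationship between html, raw_html and encoding can be sketched like this (an assumption by analogy with requests' content/text pair: html is raw_html decoded with encoding):

```python
# raw_html holds the raw bytes; html is those bytes decoded with `encoding`.
raw_html = '<p>你好</p>'.encode('utf-8')  # stand-in for response.html.raw_html
encoding = 'utf-8'                        # stand-in for response.html.encoding
html = raw_html.decode(encoding)          # stand-in for response.html.html
print(html)
```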
8.pq
Returns a PyQuery object. I personally don't use this library much, so I won't draw any conclusions about it.
IV. response.html related methods
The following response object is abbreviated to r.
1.find
Finds objects using a CSS selector.
Get all:
Syntax: r.html.find('css selector')
Return value: [Element object 1, ...]
Get only the first:
Syntax: r.html.find('css selector', first=True)
Return value: Element object
2.xpath
Finds objects using an XPath selector.
Get all:
Syntax: r.html.xpath('xpath selector')
Return value: [Element object 1, ...]
Get only the first:
Syntax: r.html.xpath('xpath selector', first=True)
Return value: Element object
3. search (gets only the first match)
Similar to regular-expression matching: think of it as replacing every (.*?) in a regex with {}.
Syntax: r.html.search('template')
Template 1: 'xx{}xxx{}'
Getting results: the first capture is r.html.search('template')[0], and so on.
Template 2: 'xxx{name}yyy{pwd}'
Getting results: a named capture is r.html.search('template')['name'], and so on.
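Internally search() is powered by the parse library, but the {} idea maps directly onto regex capture groups. A minimal stdlib sketch of the same behaviour (the search helper below is hypothetical, not requests-html's actual API):

```python
import re

def search(template, text):
    """Toy version of search(): '{}' and '{name}' behave like the
    non-greedy (.*?) capture groups of a regular expression."""
    pattern = re.escape(template)
    # Named fields: {name} -> (?P<name>.*?)
    pattern = re.sub(r'\\\{(\w+)\\\}', r'(?P<\1>.*?)', pattern)
    # Anonymous fields: {} -> (.*?)
    pattern = pattern.replace(re.escape('{}'), '(.*?)')
    return re.search(pattern, text)

m = search('Hello {} world', 'Hello big world')
print(m.group(1))                         # -> big
m2 = search('user={name}&pw={pwd};', 'user=tom&pw=123;')
print(m2.group('name'), m2.group('pwd'))  # -> tom 123
```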
4. search_all (gets all matches)
Same usage as search.
Return value: [Result object, Result object, ...]
5. render (I will write a separate summary for this later; there is a bit more to it)
Under the hood it wraps pyppeteer. If you don't know pyppeteer, think of it like Selenium: a simulated browser visiting the page.
V. Element object methods and properties
- absolute_links: absolute URLs
- links: URLs as written in the page (possibly relative)
- text: displays text only
- html: displays tags as well
- attrs: the element's attributes
- find('css selector')
- xpath('xpath path')
- search('template')
- search_all('template')