Summary of the newer crawler library requests-html

requests-html is a relatively new crawler library, written by the same author as requests.

I. Installation and dependencies

pip install requests-html

During installation you can see that lxml, requests, bs4, and the other common parsing and crawling libraries are all pulled in with it.

II. Request initiation

from requests_html import HTMLSession
session = HTMLSession()

#The usage is exactly the same as with an instantiated requests.Session object, and it automatically saves the returned state.
#Compared with requests, the response gains an extra response.html attribute.

Note: by default the browser runs headless, and it is render() that actually invokes the browser kernel.

1. Dealing with the headless browser (this matters for anti-scraping; if the site does no anti-scraping, you can ignore it)

Approach 1: modify the source code

  • Ctrl + left-click HTMLSession to jump to its definition

  • You can see that it inherits from BaseSession.

  • Ctrl + left-click BaseSession to jump into it

    Original source code

    class BaseSession(requests.Session):
        def __init__(self, mock_browser : bool = True, verify : bool = True,
                     browser_args : list = ['--no-sandbox']):
            super().__init__()

            # Mock a web browser's user agent.
            if mock_browser:
                self.headers['User-Agent'] = user_agent()

            self.hooks['response'].append(self.response_hook)
            self.verify = verify

            self.__browser_args = browser_args

        # ... (code in between that is not relevant here is omitted, not deleted) ...
        @property
        async def browser(self):
            if not hasattr(self, "_browser"):
                self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, args=self.__browser_args)

            return self._browser

    Modified source code

    class BaseSession(requests.Session):
        """ A consumable session, for cookie persistence and connection pooling,
        amongst other things.
        """

        def __init__(self, mock_browser : bool = True, verify : bool = True,
                     browser_args : list = ['--no-sandbox'], headless=True):
            super().__init__()

            # Mock a web browser's user agent.
            if mock_browser:
                self.headers['User-Agent'] = user_agent()

            self.hooks['response'].append(self.response_hook)
            self.verify = verify

            self.__browser_args = browser_args
            self.__headless = headless

        # ... (code in between that is not relevant here is omitted, not deleted) ...
        @property
        async def browser(self):
            if not hasattr(self, "_browser"):
                self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=self.__headless, args=self.__browser_args)

            return self._browser

    In effect, all this does is add a headless parameter and pass it through to pyppeteer.launch.
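With that change in place, the session can be created with the new parameter. A minimal sketch, assuming you patched your installed copy of requests_html as above:

from requests_html import HTMLSession

# headless=False only works after the patch above; the stock library
# always launches Chromium with headless=True.
session = HTMLSession(headless=False)
response = session.get('https://www.baidu.com')
response.html.render()  # a visible Chromium window opens instead of a headless one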

Approach 2: re-configure the session when instantiating it

from requests_html import HTMLSession
session = HTMLSession(
    browser_args=['--no-sandbox',
                  '--user-agent=xxxxx'
                 ]
)
#This way you can define directly what kind of browser sends the requests.

2. Dealing with the browser kernel (this matters for anti-scraping; if the site does no anti-scraping, you can ignore it)

#JS injection via the module
from requests_html import HTMLSession

session = HTMLSession()  # pass the browser_args shown above here if needed
response = session.get('https://www.baidu.com')

# Hide the navigator.webdriver flag that headless Chromium exposes.
script = '''
() => {
    Object.defineProperties(navigator, {
        webdriver: {
            get: () => undefined
        }
    })
}'''
print(response.html.render(script=script))
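render() returns whatever the injected JS function returns, so you can verify the effect directly. A minimal sketch continuing from the code above (the first render() call downloads a Chromium build for pyppeteer, which takes a while):

# Without the injection, headless Chromium typically reports true here.
print(response.html.render(script='() => navigator.webdriver'))

# The same injection, but returning the flag so the result is visible:
script = '''
() => {
    Object.defineProperties(navigator, {
        webdriver: {
            get: () => undefined
        }
    });
    return navigator.webdriver;
}'''
print(response.html.render(script=script))  # prints None (undefined)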

III. response.html related attributes

The response object referred to here is:

from requests_html import HTMLSession
session = HTMLSession()
response = session.get('https://www.baidu.com')
#For clarity, this is the response object discussed below.

1.absolute_links

All links in the page are converted to absolute URLs and returned.

2.links

Returns the links as they appear in the page (relative paths stay relative).

3.base_url

The URL from the <base> tag; if there is no <base> tag, it is the current URL.

4.html

Returns the page as a string, tags included.

5.text

Returns the page as a string with all tags stripped. Super handy for crawling novels, news, and other plain-text content!

6.encoding

The decoding format. Note that this is the encoding of response.html; setting response.encoding alone has no effect on it.

7.raw_html

Equivalent to the binary content returned by r.content.

8.pq

Returns a PyQuery object. Personally I don't use that library much, so I won't draw any conclusions about it here.
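To make these concrete, a quick sketch exercising the attributes above (the output naturally depends on the target page):

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://www.baidu.com')

print(response.html.base_url)             # <base> URL, or the current URL if no <base> tag
print(len(response.html.links))           # links as written in the page (may be relative)
print(len(response.html.absolute_links))  # the same links, all resolved to absolute URLs
print(response.html.encoding)             # the encoding used to decode response.html
print(type(response.html.raw_html))       # bytes, like response.content
print(response.html.text[:100])           # visible text only, no tags
print(response.html.html[:100])           # full markup as a string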

IV. response.html related methods

The following response object is abbreviated to r.

1.find

Finds objects using a CSS selector.

Get all

Syntax: r.html.find('css selector')

Return value: [Element object 1, ...]

Get only the first

Syntax: r.html.find('css selector', first=True)

Return value: Element object (or None if nothing matches)
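For example, a small sketch (the 'a' selector is illustrative; adjust it to the target page):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.baidu.com')

links = r.html.find('a')              # all matches -> list of Element objects
first = r.html.find('a', first=True)  # first match only -> Element or None
print(len(links), first.text if first else None)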

2.xpath

Finds objects using an XPath selector.

Get all

Syntax: r.html.xpath('xpath selector')

Return value: [Element object 1, ...]

Get only the first

Syntax: r.html.xpath('xpath selector', first=True)

Return value: Element object (or None if nothing matches)
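The same example with XPath (the '//a' selector is illustrative):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.baidu.com')

links = r.html.xpath('//a')              # all matches -> list of Element objects
first = r.html.xpath('//a', first=True)  # first match only -> Element or None
print(len(links), first.attrs if first else None)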

3. search (only gets the first match)

Similar to regex matching: replace the (.*?) of a regular expression with {}.

Syntax: r.html.search('template')

Template 1: ('xx{}xxx{}')

Access: r.html.search('template')[0] gets the first capture, and so on

Template 2: ('xxx{name}yyy{pwd}')

Access: r.html.search('template')['name'] gets the name capture, and so on
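For instance, a sketch with both template styles (this assumes the page has a <title> tag; the template must match the surrounding HTML exactly):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.baidu.com')

# Positional template: {} behaves like (.*?) in a regex.
result = r.html.search('<title>{}</title>')
print(result[0])

# Named template: access captures by name instead of index.
result = r.html.search('<title>{title}</title>')
print(result['title'])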

4. search_all

Same usage as search.

Return value: [Result object, Result object, ...]
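A short sketch (the href template is illustrative):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.baidu.com')

# Every match in the page, returned as a list of Result objects.
for result in r.html.search_all('href="{}"'):
    print(result[0])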

5. render (I will write a separate summary for this later; there is a bit more to it)

Under the hood it wraps pyppeteer. If you don't know pyppeteer, think of it as Selenium-style simulated browser access.
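A minimal render() sketch (the first call downloads a Chromium build for pyppeteer, which takes a while):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.baidu.com')

# Load the page in Chromium and run its JavaScript; afterwards r.html
# reflects the rendered DOM instead of the raw response body.
r.html.render()
print(r.html.text[:200])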

V. Element object methods and properties

  • absolute_links: absolute URLs
  • links: relative URLs
  • text: display only the text
  • html: display the tags as well
  • attrs: the element's attributes (a dict)
  • find('css selector')
  • xpath('xpath selector')
  • search('template')
  • search_all('template')
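Putting it together, a sketch that walks the Element objects (the 'a' selector is illustrative):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.baidu.com')

for a in r.html.find('a'):
    print(a.text)               # visible text of the element
    print(a.attrs.get('href'))  # attributes come back as a dict
    print(a.absolute_links)     # absolute URLs contained in this element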
