Summary of the newer crawler library requests-html

requests-html is a relatively new crawler library, written by the same author as requests.

I. Installation and dependencies

pip install requests-html

During installation you can see that lxml, requests, bs4, and the other common parsing and crawling libraries are all pulled in with it.

II. Request initiation

from requests_html import HTMLSession
session = HTMLSession()

#The usage is exactly the same as with an instantiated requests.Session object, and it automatically saves the returned state.
#Compared with requests, the response gains an extra response.html attribute.

Note: by default the browser runs headless, and it is render() that actually invokes the browser kernel.

1. Dealing with the headless browser (this matters for anti-scraping; if the site does no anti-scraping, you can ignore it)

Approach 1: modify the source code

  • Ctrl + left-click HTMLSession to jump to its definition

  • You can see that it inherits from BaseSession.

  • Ctrl + left-click BaseSession to jump into it

    Original source code

    class BaseSession(requests.Session):
        def __init__(self, mock_browser : bool = True, verify : bool = True,
                     browser_args : list = ['--no-sandbox']):
            super().__init__()

            # Mock a web browser's user agent.
            if mock_browser:
                self.headers['User-Agent'] = user_agent()

            self.hooks['response'].append(self.response_hook)
            self.verify = verify

            self.__browser_args = browser_args

        # ... (code in between that is not relevant here is omitted, not deleted) ...
        @property
        async def browser(self):
            if not hasattr(self, "_browser"):
                self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, args=self.__browser_args)

            return self._browser

    Modified source code

    class BaseSession(requests.Session):
        """ A consumable session, for cookie persistence and connection pooling,
        amongst other things.
        """

        def __init__(self, mock_browser : bool = True, verify : bool = True,
                     browser_args : list = ['--no-sandbox'], headless=True):
            super().__init__()

            # Mock a web browser's user agent.
            if mock_browser:
                self.headers['User-Agent'] = user_agent()

            self.hooks['response'].append(self.response_hook)
            self.verify = verify

            self.__browser_args = browser_args
            self.__headless = headless

        # ... (code in between that is not relevant here is omitted, not deleted) ...
        @property
        async def browser(self):
            if not hasattr(self, "_browser"):
                self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=self.__headless, args=self.__browser_args)

            return self._browser

    In effect, all this does is add a headless parameter and pass it through to pyppeteer.launch.
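With that change in place, the session can be created with the new parameter. A minimal sketch, assuming you patched your installed copy of requests_html as above:

from requests_html import HTMLSession

# headless=False only works after the patch above; the stock library
# always launches Chromium with headless=True.
session = HTMLSession(headless=False)
response = session.get('https://www.baidu.com')
response.html.render()  # a visible Chromium window opens instead of a headless one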

Approach 2: re-configure the session when instantiating it

from requests_html import HTMLSession
session = HTMLSession(
    browser_args=['--no-sandbox',
                  '--user-agent=xxxxx'
                 ]
)
#This way you can define directly what kind of browser sends the requests.

2. Dealing with the browser kernel (this matters for anti-scraping; if the site does no anti-scraping, you can ignore it)

#JS injection via the module
from requests_html import HTMLSession

session = HTMLSession()  # pass the browser_args shown above here if needed
response = session.get('https://www.baidu.com')

# Hide the navigator.webdriver flag that headless Chromium exposes.
script = '''
() => {
    Object.defineProperties(navigator, {
        webdriver: {
            get: () => undefined
        }
    })
}'''
print(response.html.render(script=script))
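render() returns whatever the injected JS function returns, so you can verify the effect directly. A minimal sketch continuing from the code above (the first render() call downloads a Chromium build for pyppeteer, which takes a while):

# Without the injection, headless Chromium typically reports true here.
print(response.html.render(script='() => navigator.webdriver'))

# The same injection, but returning the flag so the result is visible:
script = '''
() => {
    Object.defineProperties(navigator, {
        webdriver: {
            get: () => undefined
        }
    });
    return navigator.webdriver;
}'''
print(response.html.render(script=script))  # prints None (undefined)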

III. response.html related attributes

The response object referred to here is:

from requests_html import HTMLSession
session = HTMLSession()
response = session.get('https://www.baidu.com')
#For clarity, this is the response object discussed below.

1.absolute_links

All links in the page are converted to absolute URLs and returned.

2.links

Returns the links as they appear in the page (relative paths stay relative).

3.base_url

The URL from the <base> tag; if there is no <base> tag, it is the current URL.

4.html

Returns the page as a string, tags included.

5.text

Returns the page as a string with all tags stripped. Super handy for crawling novels, news, and other plain-text content!

6.encoding

The decoding format. Note that this is the encoding of response.html; setting response.encoding alone has no effect on it.

7.raw_html

Equivalent to the binary content returned by r.content.

8.pq

Returns a PyQuery object. Personally I don't use that library much, so I won't draw any conclusions about it here.
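To make these concrete, a quick sketch exercising the attributes above (the output naturally depends on the target page):

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://www.baidu.com')

print(response.html.base_url)             # <base> URL, or the current URL if no <base> tag
print(len(response.html.links))           # links as written in the page (may be relative)
print(len(response.html.absolute_links))  # the same links, all resolved to absolute URLs
print(response.html.encoding)             # the encoding used to decode response.html
print(type(response.html.raw_html))       # bytes, like response.content
print(response.html.text[:100])           # visible text only, no tags
print(response.html.html[:100])           # full markup as a string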

IV. response.html related methods

The following response object is abbreviated to r.

1.find

Finds objects using a CSS selector.

Get all

Syntax: r.html.find('css selector')

Return value: [Element object 1, ...]

Get only the first

Syntax: r.html.find('css selector', first=True)

Return value: Element object (or None if nothing matches)
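For example, a small sketch (the 'a' selector is illustrative; adjust it to the target page):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.baidu.com')

links = r.html.find('a')              # all matches -> list of Element objects
first = r.html.find('a', first=True)  # first match only -> Element or None
print(len(links), first.text if first else None)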

2.xpath

Finds objects using an XPath selector.

Get all

Syntax: r.html.xpath('xpath selector')

Return value: [Element object 1, ...]

Get only the first

Syntax: r.html.xpath('xpath selector', first=True)

Return value: Element object (or None if nothing matches)
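The same example with XPath (the '//a' selector is illustrative):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.baidu.com')

links = r.html.xpath('//a')              # all matches -> list of Element objects
first = r.html.xpath('//a', first=True)  # first match only -> Element or None
print(len(links), first.attrs if first else None)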

3. search (only gets the first match)

Similar to regex matching: replace the (.*?) of a regular expression with {}.

Syntax: r.html.search('template')

Template 1: ('xx{}xxx{}')

Access: r.html.search('template')[0] gets the first capture, and so on

Template 2: ('xxx{name}yyy{pwd}')

Access: r.html.search('template')['name'] gets the name capture, and so on
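For instance, a sketch with both template styles (this assumes the page has a <title> tag; the template must match the surrounding HTML exactly):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.baidu.com')

# Positional template: {} behaves like (.*?) in a regex.
result = r.html.search('<title>{}</title>')
print(result[0])

# Named template: access captures by name instead of index.
result = r.html.search('<title>{title}</title>')
print(result['title'])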

4. search_all

Same usage as search.

Return value: [Result object, Result object, ...]
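A short sketch (the href template is illustrative):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.baidu.com')

# Every match in the page, returned as a list of Result objects.
for result in r.html.search_all('href="{}"'):
    print(result[0])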

5. render (I will write a separate summary for this later; there is a bit more to it)

Under the hood it wraps pyppeteer. If you don't know pyppeteer, think of it as Selenium-style simulated browser access.
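A minimal render() sketch (the first call downloads a Chromium build for pyppeteer, which takes a while):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.baidu.com')

# Load the page in Chromium and run its JavaScript; afterwards r.html
# reflects the rendered DOM instead of the raw response body.
r.html.render()
print(r.html.text[:200])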

V. Element object methods and properties

  • absolute_links: absolute URLs
  • links: relative URLs
  • text: display only the text
  • html: display the tags as well
  • attrs: the element's attributes (a dict)
  • find('css selector')
  • xpath('xpath selector')
  • search('template')
  • search_all('template')
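Putting it together, a sketch that walks the Element objects (the 'a' selector is illustrative):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.baidu.com')

for a in r.html.find('a'):
    print(a.text)               # visible text of the element
    print(a.attrs.get('href'))  # attributes come back as a dict
    print(a.absolute_links)     # absolute URLs contained in this element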
