Summary of the latest crawler library requests-html Library

requests-html is a relatively new library, written by the same author as requests.

1. Installation dependency

pip install requests-html

Installing it pulls in lxml, requests, bs4 and the rest - all the libraries we usually use for crawling and parsing come bundled with it.

2. Initiate Request

from requests_html import HTMLSession
session = HTMLSession()
#The response is saved automatically, just like an object instantiated from requests.Session
#response.html carries more information than a plain requests response
Note: the browser defaults to headless mode, and calling render() launches the browser kernel

1. Dealing with the headless browser (useful against anti-crawling; skip it if you don't need it)

Modify source code

Ctrl + left-click into HTMLSession

We can see that it inherits from BaseSession

Ctrl + left-click into BaseSession

Original Source

class BaseSession(requests.Session):
    def __init__(self, mock_browser : bool = True, verify : bool = True,
                 browser_args : list = ['--no-sandbox']):
        super().__init__()

        # Mock a web browser's user agent.
        if mock_browser:
            self.headers['User-Agent'] = user_agent()

        self.hooks['response'].append(self.response_hook)
        self.verify = verify

        self.__browser_args = browser_args

    # ...(irrelevant code in between is omitted, not deleted)
    @property
    async def browser(self):
        if not hasattr(self, "_browser"):
            self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, args=self.__browser_args)

        return self._browser

Modified Source

class BaseSession(requests.Session):
    """ A consumable session, for cookie persistence and connection pooling,
    amongst other things.
    """

    def __init__(self, mock_browser : bool = True, verify : bool = True,
                 browser_args : list = ['--no-sandbox'], headless=False):       # If set to True the browser runs headless, and render() will not pop up a browser window
        super().__init__()

        # Mock a web browser's user agent.
        if mock_browser:
            self.headers['User-Agent'] = user_agent()

        self.hooks['response'].append(self.response_hook)
        self.verify = verify

        self.__browser_args = browser_args
        self.__headless = headless
          # ...(irrelevant code in between is omitted, not deleted)
    @property
    async def browser(self):
        if not hasattr(self, "_browser"):
            self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=self.__headless, args=self.__browser_args)

        return self._browser

In effect, all the change does is add a headless parameter and pass it through to pyppeteer.launch().

Configuring the session

from requests_html import HTMLSession
session = HTMLSession(
    browser_args=['--no-sandbox',
                  '--user-agent=xxxxx'
                 ]
)
#This way you can directly define which browser identity sends the requests

2. Dealing with the browser kernel (useful against anti-crawling; skip it if you don't need it)

#Using the module's js injection
from requests_html import HTMLSession

session = HTMLSession()   # pass browser_args here as above if needed
response = session.get('https://www.baidu.com')
script = '''
() => {
    Object.defineProperties(navigator, {
        webdriver: {
            get: () => undefined
        }
    })
}'''
print(response.html.render(script=script))

3. response.html Related Properties

The response object referred to here is:

from requests_html import HTMLSession
session = HTMLSession()
response = session.get('https://www.baidu.com')
#This is the response used in the examples below

1.absolute_links

Returns all links in the page as absolute URLs

2.links

Returns the links exactly as written in the page (relative paths stay relative)

3.base_url

The URL in the <base> tag; if there is no <base> tag, it is the current page URL
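absolute_links is essentially each raw link resolved against base_url. The standard library's urljoin applies the same resolution rule, so it can illustrate the behaviour without a real request (the URLs below are made up):

```python
from urllib.parse import urljoin

# base_url: the <base> tag URL if present, otherwise the page URL
base_url = 'https://example.com/docs/'

# relative links resolve under base_url's directory
assert urljoin(base_url, 'page.html') == 'https://example.com/docs/page.html'
# root-relative links resolve from the site root
assert urljoin(base_url, '/top.html') == 'https://example.com/top.html'
```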

4.html

Returns the page as a string, tags included

5.text

Returns the page text as a string with no tags - super useful for crawling things like novels and news!

6.encoding

The decoding format. Note that this is response.html.encoding; setting only response.encoding has no effect on this one

7.raw_html

Equivalent to r.content: returns the raw binary
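A small stdlib illustration of how raw_html and encoding relate: the raw bytes only turn into readable text when decoded with the right codec (the sample string is made up):

```python
# r.html.raw_html would hold bytes like these
raw = '百度一下'.encode('utf-8')

print(raw.decode('utf-8'))                  # right codec: 百度一下
print(raw.decode('gbk', errors='replace'))  # wrong codec: mojibake
```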

8.pq

Returns a PyQuery object. I don't use it myself, so I won't draw any conclusions about it

4. response.html Related Methods

The following response object is abbreviated as r

1.find

Find elements with a CSS selector

Get All

Syntax: r.html.find('css selector')

Return value: [Element object 1, ...] - a list

Get only the first

Syntax: r.html.find('css selector', first=True)

Return value: element object

2.xpath

Find elements with an XPath selector

Get All

Syntax: r.html.xpath('xpath selector')

Return value: [Element object 1, ...] - a list

Get only the first

Syntax: r.html.xpath('xpath selector', first=True)

Return value: Element object

3.search (get only the first)

Similar to regular-expression matching: wherever you would write (.*?) in a regex, write {} in the template

Syntax: r.html.search('template')

Template one: ('xx{}xxx{}')

Access: r.html.search('template')[0] gets the first capture, and so on

Template 2: ('xxx{name}yyy{pwd}')

Access: r.html.search('template')['name'] gets the named capture, and so on

4.search_all (get all)

Same usage as search

Return value: [Result object, Result object, ...]
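In plain-re terms, {} in a search template plays the role of (.*?) and {name} plays the role of a named group, while search_all is the finditer counterpart. A stdlib sketch of the same idea (the sample text is made up):

```python
import re

text = 'name=alice pwd=secret name=bob pwd=hunter2'

# search('name={} pwd={} ') is roughly re.search with each {} as (.*?)
m = re.search(r'name=(.*?) pwd=(.*?) ', text)
print(m.group(1))   # alice

# search('name={user}') is roughly a named group
m = re.search(r'name=(?P<user>\w+)', text)
print(m.group('user'))   # alice

# search_all corresponds to re.finditer: every match, not just the first
users = [hit.group('user') for hit in re.finditer(r'name=(?P<user>\w+)', text)]
print(users)   # ['alice', 'bob']
```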

5.render (there is a bit too much to cover here; I'll write a separate summary later)

It actually wraps pyppeteer. If you don't know pyppeteer, think of it as Selenium-style simulated browser access

5. Element Object Methods and Properties

.absolute_links: absolute urls
.links: relative urls
.text: show text only
.html: tags are also shown
.attrs: attribute dict
.find('css selector')
.xpath('xpath path')
.search('template')
.search_all('template')

Keywords: crawler

Added by tinker on Sat, 27 Nov 2021 19:42:57 +0200