Summary of the requests-html Crawler Library
requests-html is a newer library, written by the same author as requests.
1. Installation

```shell
pip install requests-html
```

During installation you can see that it pulls in lxml, requests, bs4 and so on: the libraries we use for parsing and crawling are all installed as its dependencies.
2. Initiate Request

```python
from requests_html import HTMLSession

session = HTMLSession()
# State is saved automatically, just like an object instantiated
# via requests.Session().
# On top of that, responses gain a response.html attribute that
# plain requests does not have.
```

Points of attention: the browser is headless by default, and the browser kernel is only launched when you call render().
1. Dealing with the headless browser (helps against anti-crawl checks; fine to skip)
Modify the source code:
Ctrl + left-click into HTMLSession; we can see that it inherits from BaseSession.
Ctrl + left-click into BaseSession.
Original source:
```python
class BaseSession(requests.Session):

    def __init__(self, mock_browser: bool = True, verify: bool = True,
                 browser_args: list = ['--no-sandbox']):
        super().__init__()
        if mock_browser:
            self.headers['User-Agent'] = user_agent()
        self.hooks['response'].append(self.response_hook)
        self.verify = verify
        self.__browser_args = browser_args

    # ... (unrelated code omitted here, not deleted) ...

    @property
    async def browser(self):
        if not hasattr(self, "_browser"):
            self._browser = await pyppeteer.launch(
                ignoreHTTPSErrors=not(self.verify),
                headless=True,
                args=self.__browser_args)
        return self._browser
```
Modified Source
```python
class BaseSession(requests.Session):
    """ A consumable session, for cookie persistence and connection
    pooling, amongst other things. """

    def __init__(self, mock_browser: bool = True, verify: bool = True,
                 browser_args: list = ['--no-sandbox'],
                 headless=False):
        # If you set headless=True, the browser has no window,
        # and running render() will not pop up a browser.
        super().__init__()

        # Mock a web browser's user agent.
        if mock_browser:
            self.headers['User-Agent'] = user_agent()

        self.hooks['response'].append(self.response_hook)
        self.verify = verify
        self.__browser_args = browser_args
        self.__headless = headless

    # ... (unrelated code omitted here, not deleted) ...

    @property
    async def browser(self):
        if not hasattr(self, "_browser"):
            self._browser = await pyppeteer.launch(
                ignoreHTTPSErrors=not(self.verify),
                headless=self.__headless,
                args=self.__browser_args)
        return self._browser
```
In short, all the change does is accept a headless argument and pass it through to pyppeteer.launch().
Re-configuring the session:

```python
from requests_html import HTMLSession

session = HTMLSession(
    browser_args=['--no-sandbox',
                  '--user-agent=xxxxx']
)
# This lets you define directly what browser (and user agent)
# is used to send requests.
```
2. Dealing with the browser kernel (helps against anti-crawl checks; fine to skip)
JS injection through the module:

```python
from requests_html import HTMLSession

session = HTMLSession(.....)
response = session.get('https://www.baidu.com')

script = '''
() => {
    Object.defineProperties(navigator, {
        webdriver: {
            get: () => undefined
        }
    })
}'''

print(response.html.render(script=script))
```
3. response.html Related Properties
The response object here is:

```python
from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://www.baidu.com')
# the following properties are read from this response object
```
1.absolute_links
All links on the page, returned as absolute URLs.
2.links
Links returned exactly as they appear in the page.
3.base_url
The URL from the <base> tag; if there is no <base> tag, it is the current URL.
4.html
Returns the page as a string, tags included.
5.text
Returns the page as a string with tags stripped; super useful for scraping things like novels and news!
6.encoding
The decoding format. Note that this is response.html.encoding; setting only response.encoding has no effect on this encoding.
7.raw_html
Equivalent to r.content: returns the raw bytes.
8.pq
Returns a PyQuery object. I don't use it myself, so I won't draw any conclusions about it here.
4. response.html Related Methods
In the following, the response object is abbreviated as r.
1.find
Find elements with a CSS selector.
Get All
Syntax: r.html.find('css selector')
Return value: [Element object 1, ...] (a list)
Get only the first
Syntax: r.html.find('css selector', first=True)
Return value: element object
2.xpath
Find elements with an XPath selector.
Get All
Syntax: r.html.xpath('xpath selector')
Return value: [Element object 1, ...] (a list)
Get only the first
Syntax: r.html.xpath('xpath selector', first=True)
Return value: Element object
3.search (get only the first)
Similar to regular-expression matching, except that where a regex uses (.*?) you write {}.
Syntax: r.html.search('template')
Template one: ('xx{}xxx{}')
Getting results: r.html.search('template')[0] gives the first capture, and so on.
Template two: ('xxx{name}yyy{pwd}')
Getting results: r.html.search('template')['name'] gives the named capture, and so on.
4.search_all (get all)
Same usage as search
Return value: [Result object, Result object, ...]
5.render (I'll write a separate summary for it later; it is a bit too much for here)
It actually wraps pyppeteer. If you don't know pyppeteer, think of it as Selenium-style simulated browser access.
5. Element Object Methods and Properties
.absolute_links: absolute URLs
.links: relative URLs, as written in the page
.text: shows text only
.html: tags are shown as well
.attrs: attributes
.find('css selector')
.xpath('xpath path')
.search('template')
.search_all('template')