1. Background
We usually reach for three libraries when crawling web pages: requests, Scrapy and selenium. requests is generally used for small crawlers, Scrapy is used to build large crawler projects, and selenium is mainly used for complex pages (pages rendered by heavy JS, where the underlying requests are hard to construct or the construction method changes frequently).
When we face a large crawler project, we will naturally choose the Scrapy framework, but parsing complex JS-rendered pages with it is troublesome. Grabbing such pages by letting selenium render them in a browser is convenient: we do not need to care what requests happen in the background of the page, nor analyse how the whole page is rendered; we only care about the final result, so whatever can be seen can be crawled. The problem is that selenium on its own is far too slow.
Therefore, if selenium can be integrated into Scrapy and made responsible only for the complex pages, such a crawler becomes very powerful and can handle almost any website.
2. Environment
python 3.6.1
System: win7
IDE: pycharm
Installed chrome browser
Configure chromedriver (add it to the PATH environment variable); a quick check is shown after this list
selenium 3.7.0
scrapy 1.4.0
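To confirm that chrome and chromedriver are set up correctly before wiring anything into Scrapy, a throwaway check like the one below (not part of the project code) should open a chrome window, print the page title and exit cleanly. It assumes chromedriver is on the PATH, as configured above.

```python
# Quick sanity check for the selenium + chromedriver setup (hypothetical snippet)
from selenium import webdriver

browser = webdriver.Chrome()        # relies on chromedriver being on the PATH
browser.get("https://www.amazon.com/")
print(browser.title)                # should print the page title without raising
browser.quit()                      # close the browser when done
```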
3. Principle analysis
3.1. Analyzing the request flow
First, let's take a look at the latest Scrapy architecture diagram:
The main steps:
First: the crawler engine generates requests and sends them to the scheduler module, where they enter a queue and wait to be scheduled.
Second: the scheduler module schedules these requests, takes them out of the queue and hands them back to the crawler engine.
Third: the crawler engine sends these requests through the downloader middlewares (there can be several, e.g. for adding headers, proxies, or other custom processing).
Fourth: after that processing, the requests are sent to the Downloader module to be downloaded.
Looking at this flow, the breakthrough point is the downloader middleware: that is where we can process the request with selenium directly.
3.2. Source code analysis of the request/response middleware processing
Relevant code location:
Source code analysis:
```python
# File: e:\miniconda\lib\site-packages\scrapy\core\downloader\middleware.py
"""
Downloader Middleware manager

See documentation in docs/topics/downloader-middleware.rst
"""
import six

from twisted.internet import defer

from scrapy.http import Request, Response
from scrapy.middleware import MiddlewareManager
from scrapy.utils.defer import mustbe_deferred
from scrapy.utils.conf import build_component_list


class DownloaderMiddlewareManager(MiddlewareManager):

    component_name = 'downloader middleware'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Get the custom middlewares configured in settings.py or custom_settings
        '''
        'DOWNLOADER_MIDDLEWARES': {
            'mySpider.middlewares.ProxiesMiddleware': 400,
            # SeleniumMiddleware
            'mySpider.middlewares.SeleniumMiddleware': 543,
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        },
        '''
        return build_component_list(
            settings.getwithbase('DOWNLOADER_MIDDLEWARES'))

    # Add the processing methods of every custom middleware to the corresponding methods list
    def _add_middleware(self, mw):
        if hasattr(mw, 'process_request'):
            self.methods['process_request'].append(mw.process_request)
        if hasattr(mw, 'process_response'):
            self.methods['process_response'].insert(0, mw.process_response)
        if hasattr(mw, 'process_exception'):
            self.methods['process_exception'].insert(0, mw.process_exception)

    # The whole download process
    def download(self, download_func, request, spider):
        @defer.inlineCallbacks
        def process_request(request):
            # The request goes through the process_request method of every custom middleware in turn
            for method in self.methods['process_request']:
                response = yield method(request=request, spider=spider)
                assert response is None or isinstance(response, (Response, Request)), \
                        'Middleware %s.process_request must return None, Response or Request, got %s' % \
                        (six.get_method_self(method).__class__.__name__, response.__class__.__name__)
                # This is the key point:
                # if a middleware's process_request produces a Response object,
                # that Response is returned directly, the loop is broken, and the
                # remaining process_request methods are skipped.
                # Our earlier header and proxy middlewares only added a user agent
                # and a proxy, without returning anything.
                # Note that the return value must be a Response object
                # (the HtmlResponse we construct later is a subclass of Response).
                if response:
                    defer.returnValue(response)
            # If none of the process_request methods above returns a Response,
            # the processed Request is finally handed to download_func, which
            # downloads it and returns a Response object. That Response then
            # passes through the process_response method of every middleware in turn.
            defer.returnValue((yield download_func(request=request, spider=spider)))

        @defer.inlineCallbacks
        def process_response(response):
            assert response is not None, 'Received None in process_response'
            if isinstance(response, Request):
                defer.returnValue(response)

            for method in self.methods['process_response']:
                response = yield method(request=request, response=response,
                                        spider=spider)
                assert isinstance(response, (Response, Request)), \
                    'Middleware %s.process_response must return Response or Request, got %s' % \
                    (six.get_method_self(method).__class__.__name__, type(response))
                if isinstance(response, Request):
                    defer.returnValue(response)
            defer.returnValue(response)

        @defer.inlineCallbacks
        def process_exception(_failure):
            exception = _failure.value
            for method in self.methods['process_exception']:
                response = yield method(request=request, exception=exception,
                                        spider=spider)
                assert response is None or isinstance(response, (Response, Request)), \
                    'Middleware %s.process_exception must return None, Response or Request, got %s' % \
                    (six.get_method_self(method).__class__.__name__, type(response))
                if response:
                    defer.returnValue(response)
            defer.returnValue(_failure)

        deferred = mustbe_deferred(process_request, request)
        deferred.addErrback(process_exception)
        deferred.addCallback(process_response)
        return deferred
```
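To make the key takeaway concrete, the short-circuit can be demonstrated with a toy middleware (the name FakeDownloadMiddleware and its canned body are hypothetical, not part of the project): because its process_request returns a Response, Scrapy treats that as the download result and never calls the real Downloader for the request.

```python
# Hypothetical illustration of the short-circuit analysed above
from scrapy.http import HtmlResponse

class FakeDownloadMiddleware(object):
    def process_request(self, request, spider):
        # Returning a Response ends the process_request chain immediately;
        # returning None instead would let the request continue to the Downloader.
        return HtmlResponse(url=request.url,
                            body=b"<html><body>served from middleware</body></html>",
                            encoding='utf-8',
                            request=request,
                            status=200)
```

The SeleniumMiddleware below exploits exactly this behaviour: it builds the page with chrome and returns it as an HtmlResponse.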
4. Code
In settings.py, configure the selenium parameters:
```python
# File: settings.py
# ----------- selenium parameter configuration -------------
SELENIUM_TIMEOUT = 25    # Timeout of the selenium browser, in seconds
LOAD_IMAGE = True        # Whether to download images
WINDOW_HEIGHT = 900      # Browser window size
WINDOW_WIDTH = 900
```
In the spider, when generating a request, mark which requests need to be downloaded through selenium:
```python
# File: myspider.py
class mySpider(CrawlSpider):
    name = "mySpiderAmazon"
    allowed_domains = ['amazon.com']
    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'DOWNLOAD_DELAY': 0,
        'COOKIES_ENABLED': False,  # enabled by default
        'DOWNLOADER_MIDDLEWARES': {
            # Proxy middleware
            'mySpider.middlewares.ProxiesMiddleware': 400,
            # Selenium middleware
            'mySpider.middlewares.SeleniumMiddleware': 543,
            # Turn off Scrapy's default user agent middleware
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        },
    }

    #..................... Gorgeous dividing line .....................
    # When generating a request (inside one of the spider's parse/start methods,
    # elided here), put the flag of whether to download with selenium into meta
    yield Request(
        url = "https://www.amazon.com/",
        meta = {'usedSelenium': True, 'dont_redirect': True},
        callback = self.parseIndexPage,
        errback = self.error
    )
```

In the download middleware middlewares.py, use selenium to grab the page (core part):

```python
# File: middlewares.py
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from scrapy.http import HtmlResponse
from logging import getLogger
import time

class SeleniumMiddleware():
    # Settings are often needed in a pipeline or middleware;
    # they can be obtained through the crawler.settings attribute
    @classmethod
    def from_crawler(cls, crawler):
        # Extract the selenium parameters from settings.py and initialize the class
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'),
                   isLoadImage=crawler.settings.get('LOAD_IMAGE'),
                   windowHeight=crawler.settings.get('WINDOW_HEIGHT'),
                   windowWidth=crawler.settings.get('WINDOW_WIDTH'))

    def __init__(self, timeout=30, isLoadImage=True, windowHeight=None, windowWidth=None):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        self.isLoadImage = isLoadImage
        # Keep one browser for the whole class, so a new chrome window is not
        # opened for every requested page; all requests handled by this
        # middleware share this single browser
        self.browser = webdriver.Chrome()
        if windowHeight and windowWidth:
            self.browser.set_window_size(900, 900)
        self.browser.set_page_load_timeout(self.timeout)   # Page load timeout
        self.wait = WebDriverWait(self.browser, 25)         # Element load timeout

    def process_request(self, request, spider):
        '''
        Use chrome to grab the page
        :param request: Request object
        :param spider: Spider object
        :return: HtmlResponse response
        '''
        # self.logger.debug('chrome is getting page')
        print(f"chrome is getting page")
        # The flag in meta decides whether this request needs selenium
        usedSelenium = request.meta.get('usedSelenium', False)
        if usedSelenium:
            try:
                self.browser.get(request.url)
                # Wait for the search box to appear
                input = self.wait.until(
                    EC.presence_of_element_located((By.XPATH, "//div[@class='nav-search-field ']/input"))
                )
                time.sleep(2)
                input.clear()
                input.send_keys("iphone 7s")
                # Press enter to search
                input.send_keys(Keys.RETURN)
                # Wait for the search results to appear
                searchRes = self.wait.until(
                    EC.presence_of_element_located((By.XPATH, "//div[@id='resultsCol']"))
                )
            except Exception as e:
                # self.logger.debug(f'chrome getting page error, Exception = {e}')
                print(f"chrome getting page error, Exception = {e}")
                return HtmlResponse(url=request.url, status=500, request=request)
            else:
                time.sleep(3)
                return HtmlResponse(url=request.url,
                                    body=self.browser.page_source,
                                    request=request,
                                    # Best to match the page's actual encoding
                                    encoding='utf-8',
                                    status=200)
```
5. Execution results
6. Existing problems
6.1. The spider closed but chrome did not exit.
2018-04-04 09:26:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 2092766,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 4, 4, 1, 26, 16, 763602),
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 4, 4, 1, 25, 48, 301602)}
2018-04-04 09:26:18 [scrapy.core.engine] INFO: Spider closed (finished)
In the version above, the browser object lives inside the middleware, where we only implement process_request and process_response; Scrapy does not give the middleware an obvious hook for calling a close method when the crawl ends.
Solution: use Scrapy signals. When the spider_closed signal is received, call browser.quit().
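For reference, one way to apply that fix while keeping chrome inside the middleware is to register the handler in from_crawler via crawler.signals. This is only a sketch: the class body is trimmed down and assumes the same SeleniumMiddleware structure as above.

```python
# Minimal sketch: close chrome from inside the middleware on spider_closed
from scrapy import signals
from selenium import webdriver

class SeleniumMiddleware(object):
    def __init__(self):
        # one shared browser for all requests handled by this middleware
        self.browser = webdriver.Chrome()

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # call middleware.spider_closed when the spider_closed signal fires
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        self.browser.quit()
```

The improved code below takes the other route and moves chrome into the spider instead.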
6.2. When a project starts multiple spiders at the same time, they share the one selenium instance in the middleware, which hurts concurrency.
In a Scrapy + selenium setup only some pages, often very few, need chrome. Since keeping chrome in the middleware brings so many restrictions, why not put chrome into the spider instead? The advantage is that each spider then owns its own chrome, so starting multiple spiders gives multiple chrome instances rather than all spiders sharing one, which is much better for concurrency.
Solution: move the chrome initialization into the spider, so that each spider has its own chrome instance.
7. Improved code
In settings.py, configure the selenium parameters:
```python
# File: settings.py
# ----------- selenium parameter configuration -------------
SELENIUM_TIMEOUT = 25    # Timeout of the selenium browser, in seconds
LOAD_IMAGE = True        # Whether to download images
WINDOW_HEIGHT = 900      # Browser window size
WINDOW_WIDTH = 900
```
In the spider, when generating a request, mark which requests need to be downloaded through selenium:
```python
# File: myspider.py
# selenium related libraries
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from scrapy.utils.project import get_project_settings
# Scrapy signal related libraries
# The following import is being deprecated, so it is not used
# from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
# The approach now recommended by Scrapy
from pydispatch import dispatcher

class mySpider(CrawlSpider):
    name = "mySpiderAmazon"
    allowed_domains = ['amazon.com']
    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'DOWNLOAD_DELAY': 0,
        'COOKIES_ENABLED': False,  # enabled by default
        'DOWNLOADER_MIDDLEWARES': {
            # Proxy middleware
            'mySpider.middlewares.ProxiesMiddleware': 400,
            # Selenium middleware
            'mySpider.middlewares.SeleniumMiddleware': 543,
            # Turn off Scrapy's default user agent middleware
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        },
    }

    # Move the chrome initialization into the spider, so it becomes an attribute of the spider
    def __init__(self, timeout=30, isLoadImage=True, windowHeight=None, windowWidth=None):
        # Read the parameters from settings.py
        self.mySetting = get_project_settings()
        self.timeout = self.mySetting['SELENIUM_TIMEOUT']
        self.isLoadImage = self.mySetting['LOAD_IMAGE']
        self.windowHeight = self.mySetting['WINDOW_HEIGHT']
        self.windowWidth = self.mySetting['WINDOW_WIDTH']
        # Initialize the chrome object
        self.browser = webdriver.Chrome()
        if self.windowHeight and self.windowWidth:
            self.browser.set_window_size(900, 900)
        self.browser.set_page_load_timeout(self.timeout)   # Page load timeout
        self.wait = WebDriverWait(self.browser, 25)         # Element load timeout
        super(mySpider, self).__init__()
        # Register the signal handler: when the spider_closed signal is received,
        # call the mySpiderCloseHandle method to close chrome
        dispatcher.connect(receiver = self.mySpiderCloseHandle,
                           signal = signals.spider_closed)

    # Signal handler: close the chrome browser
    def mySpiderCloseHandle(self, spider):
        print(f"mySpiderCloseHandle: enter ")
        self.browser.quit()

    #..................... Gorgeous dividing line .....................
    # When generating a request (inside one of the spider's parse/start methods,
    # elided here), put the flag of whether to download with selenium into meta
    yield Request(
        url = "https://www.amazon.com/",
        meta = {'usedSelenium': True, 'dont_redirect': True},
        callback = self.parseIndexPage,
        errback = self.error
    )
```
In the download middleware middlewares.py, use selenium to grab the page:
```python
# File: middlewares.py
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from scrapy.http import HtmlResponse
from logging import getLogger
import time

class SeleniumMiddleware():
    # The middleware receives the spider object, from which we can use
    # the chrome related attributes created in its __init__
    def process_request(self, request, spider):
        '''
        Use chrome to grab the page
        :param request: Request object
        :param spider: Spider object
        :return: HtmlResponse response
        '''
        print(f"chrome is getting page")
        # The flag in meta decides whether this request needs selenium
        usedSelenium = request.meta.get('usedSelenium', False)
        if usedSelenium:
            try:
                spider.browser.get(request.url)
                # Wait for the search box to appear
                input = spider.wait.until(
                    EC.presence_of_element_located((By.XPATH, "//div[@class='nav-search-field ']/input"))
                )
                time.sleep(2)
                input.clear()
                input.send_keys("iphone 7s")
                # Press enter to search
                input.send_keys(Keys.RETURN)
                # Wait for the search results to appear
                searchRes = spider.wait.until(
                    EC.presence_of_element_located((By.XPATH, "//div[@id='resultsCol']"))
                )
            except Exception as e:
                print(f"chrome getting page error, Exception = {e}")
                return HtmlResponse(url=request.url, status=500, request=request)
            else:
                time.sleep(3)
                # The page was crawled successfully; build a successful Response
                # (HtmlResponse is a subclass of Response)
                return HtmlResponse(url=request.url,
                                    body=spider.browser.page_source,
                                    request=request,
                                    # Best to match the page's actual encoding
                                    encoding='utf-8',
                                    status=200)
```
Running result (after the spider finishes, mySpiderCloseHandle is executed and the chrome browser is closed):
['categorySelectorAmazon1.pipelines.MongoPipeline']
2018-04-04 11:56:21 [scrapy.core.engine] INFO: Spider opened
2018-04-04 11:56:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
chrome is getting page
parseProductDetail url = https://www.amazon.com/, status = 200, meta = {'usedSelenium': True, 'dont_redirect': True, 'download_timeout': 25.0, 'proxy': 'http://H37XPSB6V57VU96D:CAB31DAEB9313CE5@proxy.abuyun.com:9020', 'depth': 0}
chrome is getting page
2018-04-04 11:56:54 [scrapy.core.engine] INFO: Closing spider (finished)
mySpiderCloseHandle: enter
2018-04-04 11:56:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 1938619,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 4, 4, 3, 56, 54, 301602),
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 4, 4, 3, 56, 21, 642602)}
2018-04-04 11:56:59 [scrapy.core.engine] INFO: Spider closed (finished)