How to integrate Selenium into Scrapy to crawl web pages

1. Background

We usually use three libraries when crawling web pages: requests, Scrapy and Selenium. requests is generally used for small crawlers, Scrapy is used to build large crawler projects, and Selenium is mainly used to deal with troublesome pages (complex JS-rendered pages where the underlying requests are very difficult to construct, or where the construction method changes frequently).

When we face a large crawler project, we will certainly choose the Scrapy framework for development, but parsing complex JS-rendered pages with it is troublesome. Capturing such a page with Selenium's browser rendering is convenient: we do not need to care about what requests happen in the background of the page, nor analyze the rendering process of the whole page; we only care about the final result, and whatever can be seen can be crawled. The problem is that Selenium on its own is far too inefficient.

Therefore, if Selenium can be integrated into Scrapy, with Selenium responsible only for crawling the complex pages, such a crawler is close to invincible and can crawl almost any website.

2. Environment

python 3.6.1

System: win7

IDE: pycharm

Installed chrome browser

Configure the chromedriver (set the environment variable; see the quick check after this list)

selenium 3.7.0

scrapy 1.4.0
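
Before wiring selenium into Scrapy, it is worth confirming that chromedriver is configured correctly. A minimal standalone check (a sketch only; the URL is just an example) could look like this:

  # Sanity check: if chromedriver is on the PATH and matches the installed Chrome,
  # this opens a window, prints the page title, and quits.
  from selenium import webdriver

  browser = webdriver.Chrome()
  try:
      browser.get("https://www.amazon.com/")
      print(browser.title)
  finally:
      browser.quit()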

3. Principle analysis

3.1. Analysis of the request flow

First, let's take a look at the latest Scrapy architecture diagram:
The relevant steps:

First: the crawler engine generates requests and sends them to the scheduler module, where they enter the waiting queue and wait to be scheduled.

Second: the scheduler module schedules these requests, pops them off the queue and sends them back to the crawler engine.

Third: the crawler engine sends these requests through the downloader middlewares (there can be several, e.g. for adding headers, proxies, or other custom processing).

Fourth: after this processing, the requests are sent to the Downloader module to be downloaded.

Looking at this flow, the breakthrough point is the downloader middleware: handle the request there directly with selenium, as sketched below.
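
To make this concrete, here is a minimal sketch of such a downloader middleware (the class name and the 'usedSelenium' meta flag are illustrative, borrowed from the code shown later): when process_request returns an HtmlResponse, Scrapy skips the real download for that request and hands the response straight to the response-processing chain.

  # Minimal sketch: returning a Response from process_request short-circuits the download
  from scrapy.http import HtmlResponse

  class RenderSketchMiddleware(object):
      def process_request(self, request, spider):
          if not request.meta.get('usedSelenium', False):
              return None   # fall through to the normal downloader
          # ... drive the browser here and collect page_source ...
          html = "<html><body>rendered by selenium</body></html>"
          return HtmlResponse(url=request.url, body=html,
                              encoding='utf-8', request=request, status=200)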

3.2. Source code analysis of the request and response middleware processing

Relevant code location:
  
Source code analysis:

 # File: E:\Miniconda\Lib\site-packages\scrapy\core\downloader\middleware.py
  """
  Downloader Middleware manager
  See documentation in docs/topics/downloader-middleware.rst
  """
  import six
  from twisted.internet import defer
  from scrapy.http import Request, Response
  from scrapy.middleware import MiddlewareManager
  from scrapy.utils.defer import mustbe_deferred
  from scrapy.utils.conf import build_component_list
  class DownloaderMiddlewareManager(MiddlewareManager):
      component_name = 'downloader middleware'
      @classmethod
      def _get_mwlist_from_settings(cls, settings):
          # Get the customized middlewares from settings.py or custom_settings
          '''
          'DOWNLOADER_MIDDLEWARES': {
              'mySpider.middlewares.ProxiesMiddleware': 400,
              # SeleniumMiddleware
              'mySpider.middlewares.SeleniumMiddleware': 543,
              'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
          },
          '''
          return build_component_list(
              settings.getwithbase('DOWNLOADER_MIDDLEWARES'))
      # Add the processing functions of all custom middlewares to the corresponding methods lists
      def _add_middleware(self, mw):
          if hasattr(mw, 'process_request'):
              self.methods['process_request'].append(mw.process_request)
          if hasattr(mw, 'process_response'):
              self.methods['process_response'].insert(0, mw.process_response)
          if hasattr(mw, 'process_exception'):
              self.methods['process_exception'].insert(0, mw.process_exception)
      # The whole download process
      def download(self, download_func, request, spider):
          @defer.inlineCallbacks
          def process_request(request):
              # The request is processed in turn by the process_request method of each custom middleware added to the list
              for method in self.methods['process_request']:
                  response = yield method(request=request, spider=spider)
                  assert response is None or isinstance(response, (Response, Request)), \
                          'Middleware %s.process_request must return None, Response or Request, got %s' % \
                          (six.get_method_self(method).__class__.__name__, response.__class__.__name__)
                  # This is the key
                  # If a middleware's process_request produces a Response object after handling the request,
                  # that Response is returned directly, the loop is exited, and the remaining process_request methods are skipped
                  # Previously, our header and proxy middlewares only added a user agent and a proxy, without returning anything
                  # Note that the return value must be a Response object
                  # The HtmlResponse constructed later is a subclass of Response
                  if response:
                      defer.returnValue(response)
              # If none of the process_request methods above returned a Response object,
              # the processed Request is finally sent to download_func, which downloads it and returns a Response object
              # which is then processed in turn by each middleware's process_response method
              defer.returnValue((yield download_func(request=request,spider=spider)))
          @defer.inlineCallbacks
          def process_response(response):
              assert response is not None, 'Received None in process_response'
              if isinstance(response, Request):
                  defer.returnValue(response)
              for method in self.methods['process_response']:
                  response = yield method(request=request, response=response,
                                          spider=spider)
                  assert isinstance(response, (Response, Request)), \
                      'Middleware %s.process_response must return Response or Request, got %s' % \
                      (six.get_method_self(method).__class__.__name__, type(response))
                  if isinstance(response, Request):
                      defer.returnValue(response)
              defer.returnValue(response)
          @defer.inlineCallbacks
          def process_exception(_failure):
              exception = _failure.value
              for method in self.methods['process_exception']:
                  response = yield method(request=request, exception=exception,
                                          spider=spider)
                  assert response is None or isinstance(response, (Response, Request)), \
                      'Middleware %s.process_exception must return None, Response or Request, got %s' % \
                      (six.get_method_self(method).__class__.__name__, type(response))
                  if response:
                      defer.returnValue(response)
              defer.returnValue(_failure)
          deferred = mustbe_deferred(process_request, request)
          deferred.addErrback(process_exception)
          deferred.addCallback(process_response)
          return deferred
4. Code

In settings.py, configure the selenium parameters:

# File: settings.py
# --------- selenium parameter configuration-------------
  SELENIUM_TIMEOUT = 25           # Timeout of selenium browser, in seconds
  LOAD_IMAGE = True               # Whether to download images
  WINDOW_HEIGHT = 900             # Browser window size
  WINDOW_WIDTH = 900

In the spider, when generating a request, mark which requests need to be downloaded through selenium:

 # File: mySpider.py
  class mySpider(CrawlSpider):
      name = "mySpiderAmazon"
      allowed_domains = ['amazon.com']
      custom_settings = {
          'LOG_LEVEL':'INFO',
          'DOWNLOAD_DELAY': 0,
          'COOKIES_ENABLED': False,  # enabled by default
          'DOWNLOADER_MIDDLEWARES': {
              # Agent middleware
              'mySpider.middlewares.ProxiesMiddleware': 400,
              # Selenium Middleware
              'mySpider.middlewares.SeleniumMiddleware': 543,
              # Turn off Scrapy's default user agent middleware
              'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
          },
  #..................... Gorgeous dividing line
  # When generating a request, put the flag of whether to use selenium download into meta
  yield Request(
      url = "https://www.amazon.com/",
      meta = {'usedSelenium': True, 'dont_redirect': True},
      callback = self.parseIndexPage,
      errback = self.error
  )

In the downloader middleware middlewares.py, use selenium to grab the page (core part):

  # -*- coding: utf-8 -*-
  from selenium import webdriver
  from selenium.common.exceptions import TimeoutException
  from selenium.webdriver.common.by import By
  from selenium.webdriver.support.ui import WebDriverWait
  from selenium.webdriver.support import expected_conditions as EC
  from selenium.webdriver.common.keys import Keys
  from scrapy.http import HtmlResponse
  from logging import getLogger
  import time
  class SeleniumMiddleware():
      # It is often necessary to read settings in a pipeline or middleware; they can be obtained through the crawler.settings attribute
      @classmethod
      def from_crawler(cls, crawler):
          # Extract the selenium parameters from settings.py and initialize the class
          return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'),
                     isLoadImage=crawler.settings.get('LOAD_IMAGE'),
                     windowHeight=crawler.settings.get('WINDOW_HEIGHT'),
                     windowWidth=crawler.settings.get('WINDOW_WIDTH')
                     )
      def __init__(self, timeout=30, isLoadImage=True, windowHeight=None, windowWidth=None):
          self.logger = getLogger(__name__)
          self.timeout = timeout
          self.isLoadImage = isLoadImage
          # Define a browser belonging to this class to prevent a new chrome browser from opening every time a page is requested
          # In this way, all requests processed by this class can use only this browser
          self.browser = webdriver.Chrome()
          if windowHeight and windowWidth:
              self.browser.set_window_size(windowWidth, windowHeight)
          self.browser.set_page_load_timeout(self.timeout)        # Page load timeout
          self.wait = WebDriverWait(self.browser, self.timeout)   # Specifies the element load timeout
      def process_request(self, request, spider):
          '''
          Use chrome to grab the page
          :param request: Request object
          :param spider: Spider object
          :return: HtmlResponse response
          '''
          # self.logger.debug('chrome is getting page')
          print(f"chrome is getting page")
          # Check the flag in meta to decide whether selenium is needed for this request
          usedSelenium = request.meta.get('usedSelenium', False)
          if usedSelenium:
              try:
                  self.browser.get(request.url)
                  # Does the search box appear
                  input = self.wait.until(
                      EC.presence_of_element_located((By.XPATH, "//div[@class='nav-search-field ']/input"))
                  )
                  time.sleep(2)
                  input.clear()
                  input.send_keys("iphone 7s")
                  # Press enter to search
                  input.send_keys(Keys.RETURN)
                  # See if the search results appear
                  searchRes = self.wait.until(
                      EC.presence_of_element_located((By.XPATH, "//div[@id='resultsCol']"))
                  )
              except Exception as e:
                  # self.logger.debug(f'chrome getting page error, Exception = {e}')
                  print(f"chrome getting page error, Exception = {e}")
                  return HtmlResponse(url=request.url, status=500, request=request)
              else:
                  time.sleep(3)
                  return HtmlResponse(url=request.url,
                                      body=self.browser.page_source,
                                      request=request,
                                      # Best set according to the actual encoding of the web page
                                      encoding='utf-8',
                                      status=200)
5. Execution results

6. Existing problems

6.1. The spider closed but chrome did not exit.

2018-04-04 09:26:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
  {'downloader/response_bytes': 2092766,
   'downloader/response_count': 2,
   'downloader/response_status_count/200': 2,
   'finish_reason': 'finished',
   'finish_time': datetime.datetime(2018, 4, 4, 1, 26, 16, 763602),
   'log_count/INFO': 7,
   'request_depth_max': 1,
   'response_received_count': 2,
   'scheduler/dequeued': 2,
   'scheduler/dequeued/memory': 2,
   'scheduler/enqueued': 2,
   'scheduler/enqueued/memory': 2,
   'start_time': datetime.datetime(2018, 4, 4, 1, 25, 48, 301602)}
  2018-04-04 09:26:18 [scrapy.core.engine] INFO: Spider closed (finished)

Above, we put the browser object into the middleware, where we can only implement process_request and process_response; the middleware gives us no obvious place to hook the spider's shutdown and close the browser.

Solution: use Scrapy's signal mechanism; when the spider_closed signal is received, call browser.quit().
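
If you prefer to keep the browser inside the middleware, a sketch of that signal-based fix might look like the following (assuming the same SELENIUM_TIMEOUT setting; this is an alternative to moving chrome into the spider, which is what the improved code below does):

  # Sketch: connect spider_closed inside the middleware so chrome quits with the spider
  from scrapy import signals
  from selenium import webdriver

  class SeleniumMiddleware(object):
      @classmethod
      def from_crawler(cls, crawler):
          mw = cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'))
          # Call mw.spider_closed when the spider_closed signal fires
          crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
          return mw

      def __init__(self, timeout=30):
          self.timeout = timeout
          self.browser = webdriver.Chrome()

      def spider_closed(self, spider):
          self.browser.quit()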

6.2. When a project starts multiple spiders at the same time, they all share the one selenium browser in the middleware, which is bad for concurrency.

With Scrapy + Selenium, only some, often very few, pages actually need chrome. Since putting chrome into the middleware brings so many restrictions, why not put chrome into the spider instead? The advantage is that each spider then has its own chrome, so when multiple spiders are started there are multiple chromes; the spiders no longer all share one chrome, which is good for concurrency.

Solution: move the initialization of chrome into the spider, so that each spider has its own chrome.

7. Improved code

In settings.py, configure the selenium parameters:

# File: settings.py
  # -----------selenium parameter configuration-------------
  SELENIUM_TIMEOUT = 25           # Timeout of selenium browser, in seconds
  LOAD_IMAGE = True               # Whether to download images
  WINDOW_HEIGHT = 900             # Browser window size
  WINDOW_WIDTH = 900

In the spider, when generating a request, mark which requests need to be downloaded through selenium:

 # File: mySpider.py
  # selenium related Library
  from selenium import webdriver
  from selenium.webdriver.support.ui import WebDriverWait
  # Scrapy signal related libraries
  from scrapy.utils.project import get_project_settings
  # The following import is deprecated, so it is not needed
  # from scrapy.xlib.pydispatch import dispatcher
  from scrapy import signals
  # The current approach: import dispatcher from pydispatch directly
  from pydispatch import dispatcher
  class mySpider(CrawlSpider):
      name = "mySpiderAmazon"
      allowed_domains = ['amazon.com']
      custom_settings = {
          'LOG_LEVEL':'INFO',
          'DOWNLOAD_DELAY': 0,
          'COOKIES_ENABLED': False,  # enabled by default
          'DOWNLOADER_MIDDLEWARES': {
              # Agent middleware
              'mySpider.middlewares.ProxiesMiddleware': 400,
              # Selenium Middleware
              'mySpider.middlewares.SeleniumMiddleware': 543,
              # Turn off Scrapy's default user agent middleware
              'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
          },
      # Put the chrome initialization into the spider, making the browser an attribute of the spider
      def __init__(self, timeout=30, isLoadImage=True, windowHeight=None, windowWidth=None):
          # Read the configuration from settings.py
          self.mySetting = get_project_settings()
          self.timeout = self.mySetting['SELENIUM_TIMEOUT']
          self.isLoadImage = self.mySetting['LOAD_IMAGE']
          self.windowHeight = self.mySetting['WINDOW_HEIGHT']
          self.windowWidth = self.mySetting['WINDOW_WIDTH']
          # Initializing chrome objects
          self.browser = webdriver.Chrome()
          if self.windowHeight and self.windowWidth:
              self.browser.set_window_size(self.windowWidth, self.windowHeight)
          self.browser.set_page_load_timeout(self.timeout)        # Page load timeout
          self.wait = WebDriverWait(self.browser, self.timeout)   # Specifies the element load timeout
          super(mySpider, self).__init__()
          # Register the signal handler: when the spider_closed signal is received, call the mySpiderCloseHandle method to close chrome
          dispatcher.connect(receiver = self.mySpiderCloseHandle,
                             signal = signals.spider_closed
                             )
      # Signal handler: close the chrome browser
      def mySpiderCloseHandle(self, spider):
          print(f"mySpiderCloseHandle: enter ")
          self.browser.quit()
  #..................... Gorgeous dividing line
  # When generating a request, put the flag of whether to use selenium download into meta
  yield Request(
      url = "https://www.amazon.com/",
      meta = {'usedSelenium': True, 'dont_redirect': True},
      callback = self.parseIndexPage,
      errback = self.error
  )

In the downloader middleware middlewares.py, use selenium to grab the page:

 # -*- coding: utf-8 -*-
  from selenium import webdriver
  from selenium.common.exceptions import TimeoutException
  from selenium.webdriver.common.by import By
  from selenium.webdriver.support.ui import WebDriverWait
  from selenium.webdriver.support import expected_conditions as EC
  from selenium.webdriver.common.keys import Keys
  from scrapy.http import HtmlResponse
  from logging import getLogger
  import time
  class SeleniumMiddleware():
      # The middleware is passed the spider object, from which we can access the chrome-related attributes initialized in __init__
      def process_request(self, request, spider):
          '''
          Use chrome to grab the page
          :param request: Request object
          :param spider: Spider object
          :return: HtmlResponse response
          '''
          print(f"chrome is getting page")
          # Check the flag in meta to decide whether selenium is needed for this request
          usedSelenium = request.meta.get('usedSelenium', False)
          if usedSelenium:
              try:
                  spider.browser.get(request.url)
                  # Does the search box appear
                  input = spider.wait.until(
                      EC.presence_of_element_located((By.XPATH, "//div[@class='nav-search-field ']/input"))
                  )
                  time.sleep(2)
                  input.clear()
                  input.send_keys("iphone 7s")
                  # Press enter to search
                  input.send_keys(Keys.RETURN)
                  # See if the search results appear
                  searchRes = spider.wait.until(
                      EC.presence_of_element_located((By.XPATH, "//div[@id='resultsCol']"))
                  )
              except Exception as e:
                  print(f"chrome getting page error, Exception = {e}")
                  return HtmlResponse(url=request.url, status=500, request=request)
              else:
                  time.sleep(3)
                  # Page crawling succeeds, and a successful Response object is constructed (HtmlResponse is its subclass)
                  return HtmlResponse(url=request.url,
                                      body=spider.browser.page_source,
                                      request=request,
                                      # Best set according to the actual encoding of the web page
                                      encoding='utf-8',
                                      status=200)

Running result (after the spider ends, mySpiderCloseHandle is executed and the chrome browser is closed):

['categorySelectorAmazon1.pipelines.MongoPipeline']
  2018-04-04 11:56:21 [scrapy.core.engine] INFO: Spider opened
  2018-04-04 11:56:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
  chrome is getting page
  parseProductDetail url = https://www.amazon.com/, status = 200, meta = {'usedSelenium': True, 'dont_redirect': True, 'download_timeout': 25.0, 'proxy': 'http://H37XPSB6V57VU96D:CAB31DAEB9313CE5@proxy.abuyun.com:9020', 'depth': 0}
  chrome is getting page
  2018-04-04 11:56:54 [scrapy.core.engine] INFO: Closing spider (finished)
  mySpiderCloseHandle: enter 
  2018-04-04 11:56:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
  {'downloader/response_bytes': 1938619,
   'downloader/response_count': 2,
   'downloader/response_status_count/200': 2,
   'finish_reason': 'finished',
   'finish_time': datetime.datetime(2018, 4, 4, 3, 56, 54, 301602),
   'log_count/INFO': 7,
   'request_depth_max': 1,
   'response_received_count': 2,
   'scheduler/dequeued': 2,
   'scheduler/dequeued/memory': 2,
   'scheduler/enqueued': 2,
   'scheduler/enqueued/memory': 2,
   'start_time': datetime.datetime(2018, 4, 4, 3, 56, 21, 642602)}
  2018-04-04 11:56:59 [scrapy.core.engine] INFO: Spider closed (finished)
