Python 3 Web Crawler in Practice - 11: Installing the Crawler Framework Extensions Scrapy-Splash and Scrapy-Redis

Installation of Scrapy-Splash

Scrapy-Splash is a Scrapy plugin that adds support for JavaScript rendering. This section describes how it is installed.
The installation of Scrapy-Splash is divided into two parts. One is the Splash service itself, which is installed via Docker; once it is running, we can render JavaScript pages through its HTTP interface. The other is the scrapy-splash Python library, which lets Scrapy use the Splash service.

1. Related links

2. Install Splash

Scrapy-Splash uses Splash's HTTP API for page rendering, so we need to install Splash to provide the rendering service. Splash is installed via Docker, so make sure Docker is installed correctly before proceeding.
The installation command is as follows:

docker run -p 8050:8050 scrapinghub/splash

After installation, output similar to the following will appear:

2017-07-03 08:53:28+0000 [-] Log opened.
2017-07-03 08:53:28.447291 [-] Splash version: 3.0
2017-07-03 08:53:28.452698 [-] Qt 5.9.1, PyQt 5.9, WebKit 602.1, sip 4.19.3, Twisted 16.1.1, Lua 5.2
2017-07-03 08:53:28.453120 [-] Python 3.5.2 (default, Nov 17 2016, 17:05:23) [GCC 5.4.0 20160609]
2017-07-03 08:53:28.453676 [-] Open files limit: 1048576
2017-07-03 08:53:28.454258 [-] Can't bump open files limit
2017-07-03 08:53:28.571306 [-] Xvfb is started: ['Xvfb', ':1599197258', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
2017-07-03 08:53:29.041973 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2017-07-03 08:53:29.315445 [-] verbosity=1
2017-07-03 08:53:29.315629 [-] slots=50
2017-07-03 08:53:29.315712 [-] argument_cache_max_entries=500
2017-07-03 08:53:29.316564 [-] Web UI: enabled, Lua: enabled (sandbox: enabled)
2017-07-03 08:53:29.317614 [-] Site starting on 8050
2017-07-03 08:53:29.317801 [-] Starting factory <twisted.web.server.Site object at 0x7ffaa4a98cf8>

This proves that Splash is already running on port 8050.
Then open http://localhost:8050 in a browser and Splash's home page appears, as shown in Figure 1-81:

Figure 1-81 Running Page
Of course, Splash can also be installed directly on a remote server, where we can run it in daemon mode. The command is as follows:

docker run -d -p 8050:8050 scrapinghub/splash

The extra -d parameter runs the Docker container as a daemon, so the Splash service will not be terminated when the connection to the remote server is interrupted.
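
Besides opening the Web UI, we can also verify the service by calling Splash's render.html HTTP endpoint directly, for example with the requests library. This is only a quick sanity check, assuming Splash is listening on localhost:8050; the target URL is just a placeholder:

import requests

# Ask Splash to render a page and return the resulting HTML.
# 'wait' gives the page time to execute its JavaScript before the snapshot is taken.
response = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'https://example.com', 'wait': 1},
)
print(response.status_code)   # 200 means Splash rendered the page successfully
print(response.text[:200])    # first 200 characters of the rendered HTML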

3. Install Scrapy-Splash

After successfully installing Splash, we next install the Scrapy-Splash Python library. The installation command is as follows:

pip3 install scrapy-splash

Once the command completes, the library is installed. We will introduce its detailed usage later.
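
As a preview of that usage, here is a minimal sketch of how the library is typically wired into a Scrapy project. The middleware names and priorities follow the scrapy-splash documentation; the spider name and URL are placeholders:

# settings.py: point Scrapy at the Splash service and enable the middlewares
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# spider: use SplashRequest so pages are rendered by Splash before parsing
from scrapy import Spider
from scrapy_splash import SplashRequest

class ExampleSpider(Spider):
    name = 'example'

    def start_requests(self):
        # 'wait' lets the page's JavaScript finish before the HTML is returned
        yield SplashRequest('https://example.com', self.parse, args={'wait': 1})

    def parse(self, response):
        print(response.text[:200])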

Installation of Scrapy-Redis

Scrapy-Redis is a distributed extension for Scrapy. With it, we can easily build a distributed Scrapy crawler. This section describes how Scrapy-Redis is installed.

1. Related links

2. Pip Installation

Installation via pip is recommended. The command is as follows:

pip3 install scrapy-redis

3. Test Installation

After the installation is complete, you can test it at the Python command line.

$ python3
>>> import scrapy_redis

If no error is reported, the library has been installed successfully.
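
As a preview, enabling the distributed components usually takes only a few settings in a Scrapy project's settings.py. The setting names follow the scrapy-redis documentation; the Redis address is a placeholder for your own server:

# settings.py: hand request scheduling and de-duplication over to Redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
# Keep the request queue in Redis so crawls can be paused, resumed, and shared across nodes
SCHEDULER_PERSIST = True
# Address of the Redis server shared by all crawler nodes
REDIS_URL = 'redis://localhost:6379'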
