A simple distributed crawler with scrapy


Scrapy can do a lot, but on its own it is hard to use for large-scale distributed crawling. Someone capable changed scrapy's queue scheduling (the scrapy-redis package installed below): the start addresses are separated from start_urls and read from redis instead, so multiple clients can read from the same redis at the same time, which yields a distributed crawler. Even on a single computer the crawler can run as multiple processes, which is very effective for large-scale crawling.
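The idea of several clients draining one shared queue can be illustrated without redis at all. The sketch below is only a simulation under stated assumptions: Python's `queue.Queue` stands in for the redis request queue, threads stand in for the separate crawler clients, and `run_workers` and the example.com URLs are hypothetical names that do not come from scrapy or scrapy-redis.

```python
import queue
import threading


def run_workers(start_urls, worker_count=3):
    """Simulate several crawler clients pulling URLs from one shared queue."""
    shared_queue = queue.Queue()  # stand-in for the queue kept in redis
    for url in start_urls:
        shared_queue.put(url)

    handled = {}  # url -> name of the worker that took it
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                url = shared_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to crawl, this worker stops
            with lock:
                handled[url] = name

    threads = [
        threading.Thread(target=worker, args=("worker-%d" % i,))
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled


# Hypothetical URLs, used only for the demonstration
urls = ["https://example.com/page/%d" % n for n in range(10)]
result = run_workers(urls)
```

Every URL is taken by exactly one worker, whichever gets to it first. That is the property the shared redis queue gives the real crawler clients.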

Preparation:

1. windows (slave: scrapy)

2. linux (master: scrapy, redis, mongo)



Configuration steps of scrapy under linux:

1. Install Python 3.6

Running yum install openssl-devel -y before compiling Python solves the problem that pip3 cannot be used (pip is configured with locations that require TLS/SSL, but the SSL module in Python is not available)

Download the Python package, Python-3.6.1.tar.xz. After decompressing it, run the following in the extracted directory:

      ./configure --prefix=/python3

      make

      make install

Add the environment variable:

      PATH=/python3/bin:$PATH
      export PATH

pip3 is also installed by default (gcc must be installed with yum before compiling Python)

2. Installing Twisted

Download Twisted-17.9.0.tar.bz2, decompress it, cd Twisted-17.9.0, then run python3 setup.py install

3. Installation of scrapy

    pip3 install scrapy

    pip3 install scrapy-redis
4. Installing redis

See the blog post "redis Installation and Simple Use".

If make test fails with "You need tcl 8.5 or newer in order to run the Redis test", install tcl first:

      1. Download tcl8.6.1-src.tar.gz
      2. tar -xvf tcl8.6.1-src.tar.gz
      3. cd tcl8.6.1/unix ; ./configure; make; make install

Copy the configuration file and start redis:

      cp /root/redis-3.2.11/redis.conf /etc/
      /root/redis-3.2.11/src/redis-server /etc/redis.conf &

5. pip3 install redis

6. Installation of mongodb

Start: mongod --bind_ip

7. pip3 install pymongo
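Because the windows client has to reach redis and mongodb on the linux machine, it may help to confirm that both ports are reachable before starting the crawler. The following is a minimal sketch using only the standard library. It assumes the default ports (6379 for redis, 27017 for mongodb); `port_is_open`, `check_services` and `DEFAULT_PORTS` are hypothetical helper names, not part of any of the packages installed above.

```python
import socket

# Default ports; adjust if redis.conf or mongod was configured differently
DEFAULT_PORTS = {"redis": 6379, "mongodb": 27017}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out
        return False


def check_services(host, ports=DEFAULT_PORTS):
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling `check_services("192.0.2.10")` (a placeholder address) from the windows machine returns a dictionary such as `{"redis": ..., "mongodb": ...}`. A `False` entry usually points to a firewall rule or to a service bound only to localhost.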

The deployment steps of scrapy on windows:

1. Install wheel
        pip install wheel
2. Install lxml
3. Install pyopenssl
4. Install Twisted
5. Install pywin32
6. Install scrapy
        pip install scrapy

Deployment code:

As a simple example, I crawl the movie listings of American TV Paradise and use it to explain the distributed implementation. One copy of the code is deployed on linux and one on windows, with the same configuration, and both can crawl at the same time.

Only the parts that need to be modified are listed.


Settings: configure mongodb, and use redis for the fingerprints and the queue

ROBOTSTXT_OBEY = False  # Do not obey robots.txt
CONCURRENT_REQUESTS = 1  # Maximum number of concurrent requests; lowered for debugging (default 16)
ITEM_PIPELINES = {
   'meiju.pipelines.MongoPipeline': 300,
}
MONGO_URI = ''  # mongodb connection information
SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # Use the scrapy_redis scheduler
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # Deduplicate requests (URLs) in redis
# REDIS_URL = 'redis://root:kongzhagen@localhost:6379'  # If redis has a password, use this configuration
REDIS_HOST = ''  # redis connection information
SCHEDULER_PERSIST = True  # Do not clear the fingerprints and the queue when the spider closes
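To make the role of the fingerprints more concrete, here is a simplified sketch of fingerprint-based deduplication. It is an illustration under stated assumptions, not the code scrapy or scrapy_redis actually runs: the real fingerprint also canonicalizes the URL and takes the request body into account, and the real set lives in redis. `fingerprint`, `SeenSet` and `should_schedule` are hypothetical names.

```python
import hashlib


def fingerprint(method, url):
    """Simplified request fingerprint: SHA-1 over the HTTP method and the URL."""
    h = hashlib.sha1()
    h.update(method.upper().encode("utf-8"))
    h.update(url.encode("utf-8"))
    return h.hexdigest()


class SeenSet:
    """In-memory stand-in for the redis set that stores fingerprints."""

    def __init__(self):
        self._seen = set()

    def add(self, fp):
        """Mimic redis SADD: return 1 if fp was new, 0 if it was already stored."""
        if fp in self._seen:
            return 0
        self._seen.add(fp)
        return 1


def should_schedule(seen, method, url):
    """A request is scheduled only the first time its fingerprint is seen."""
    return seen.add(fingerprint(method, url)) == 1
```

Because every client checks the same set, a URL already queued by the linux machine is skipped when the windows machine discovers it again. With `SCHEDULER_PERSIST = True` that set also survives between runs.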


Code for storing items in MongoDB (meiju/pipelines.py)

import pymongo

class MeijuPipeline(object):
    def process_item(self, item, spider):
        return item

class MongoPipeline(object):

    collection_name = 'movies'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

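    @classmethod
    def from_crawler(cls, crawler):
        # Build the pipeline from the project settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )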

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

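    def close_spider(self, spider):
        # Close the MongoDB connection when the spider finishes
        self.client.close()

    def process_item(self, item, spider):
        # Store the item as a document in the 'movies' collection
        self.db[self.collection_name].insert_one(dict(item))
        return item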


Data structure (meiju/items.py)

import scrapy

class MeijuItem(scrapy.Item):
    movieName = scrapy.Field()
    status = scrapy.Field()
    english = scrapy.Field()
    alias = scrapy.Field()
    tv = scrapy.Field()
    year = scrapy.Field()
    type = scrapy.Field()

Crawler script

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request

from meiju.items import MeijuItem


class MjSpider(scrapy.Spider):
    name = 'mj'
    allowed_domains = ['']
    # start_urls = ['']
    def start_requests(self):
        yield Request(url='', callback=self.parse)

    def parse(self, response):
        movies = response.xpath('//div[@class="cn_box2"]')
        for movie in movies:
            item = MeijuItem()
            item['movieName'] = movie.xpath('./ul[@class="list_20"]/li[1]/a/text()').extract_first()
            item['status'] = movie.xpath('./ul[@class="list_20"]/li[2]/span/font/text()').extract_first()
            item['english'] = movie.xpath('./ul[@class="list_20"]/li[3]/font[2]/text()').extract_first()
            item['alias'] = movie.xpath('./ul[@class="list_20"]/li[4]/font[2]/text()').extract_first()
            item['tv'] = movie.xpath('./ul[@class="list_20"]/li[5]/font[2]/text()').extract_first()
            item['year'] = movie.xpath('./ul[@class="list_20"]/li[6]/font[2]/text()').extract_first()
            item['type'] = movie.xpath('./ul[@class="list_20"]/li[7]/font[2]/text()').extract_first()
            yield item
        # Follow each movie's link; the callback defaults to self.parse
        for i in response.xpath('//div[@class="cn_box2"]/ul[@class="list_20"]/li[1]/a/@href').extract():
            yield Request(url='' + i)
        # next = '' + response.xpath("//a[contains(text(), 'next page')]/@href")[1].extract()
        # print(next)
        # yield Request(url=next, callback=self.parse)
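The spider builds absolute URLs by concatenating a site prefix with each extracted href. A slightly more robust alternative is to resolve hrefs against the page URL. The sketch below uses `urllib.parse.urljoin` from the standard library; `absolute_links` is a hypothetical helper and the example.com addresses are placeholders, not the site crawled above.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve extracted hrefs against base_url, skipping empty values."""
    return [urljoin(base_url, href) for href in hrefs if href]


# Placeholder page URL and hrefs, used only for the demonstration
links = absolute_links(
    "https://example.com/list/",
    ["/content/meiju1.html", "page2.html", None],
)
```

Inside a spider callback, `response.urljoin(href)` does the same resolution against the URL of the current response, so the prefix does not have to be hard-coded.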

Take a look at redis:
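When inspecting redis it helps to know which keys to look for. As far as I know, scrapy_redis names its keys after the spider by default, using the templates `%(spider)s:requests` for the queue and `%(spider)s:dupefilter` for the fingerprints. Treat these templates as an assumption about the defaults, since they can be overridden in the settings. The helper `scrapy_redis_keys` below is hypothetical and only formats the names.

```python
# Assumed scrapy_redis default key templates (may be overridden in settings)
QUEUE_KEY_TEMPLATE = "%(spider)s:requests"
DUPEFILTER_KEY_TEMPLATE = "%(spider)s:dupefilter"


def scrapy_redis_keys(spider_name):
    """Return the redis key names expected for the given spider name."""
    values = {"spider": spider_name}
    return {
        "requests": QUEUE_KEY_TEMPLATE % values,
        "dupefilter": DUPEFILTER_KEY_TEMPLATE % values,
    }


keys = scrapy_redis_keys("mj")  # 'mj' is the spider name used in this article
```

For the spider above this gives `mj:requests` and `mj:dupefilter`, which can then be looked up with `keys mj:*` in redis-cli.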

Look at the data in mongodb:

Keywords: Redis Python MongoDB Windows

Added by Arc on Wed, 14 Aug 2019 11:01:18 +0300