Crawler Series (Introductory Level): The Scrapy Crawler Framework

I. Detailed Framework


Scrapy is a popular Python crawling framework built on Twisted, an event-driven networking framework, and it uses non-blocking (asynchronous) processing.

[1] The role of internal components

**Scrapy Engine (engine):** controls the data-processing flow of the whole system and triggers transactions.
**Scheduler (scheduler):** receives requests from the engine, pushes them into a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs: it decides what to crawl next and removes duplicate URLs.
**Downloader (downloader):** downloads web content and returns it to the Spiders. [The Scrapy downloader is built on Twisted, an efficient asynchronous model.]
**Spiders (spiders):** do the main crawling work: they extract the information they need, the so-called entities (items), from specific web pages, and also extract URLs from those pages so that Scrapy can go on to crawl the next ones.
**Item Pipeline (project pipeline):** handles the entities extracted from web pages by the spiders. Its main jobs are to persist the entities, verify their validity, and strip out unwanted information. Once a page has been parsed by a spider, the resulting items are sent to the pipeline and processed by several components in a defined order.
**Downloader Middlewares (downloader middleware):** the layer between the Scrapy engine and the downloader; it mainly processes the requests and responses passing between them. A proxy IP or a custom User-Agent can be set here (see the sketch after this list).

**Spider Middlewares (spider middleware):** the layer between the Scrapy engine and the spiders; its main task is to process the spiders' response input and request output.

**Scheduler Middlewares (scheduler middleware):** the layer between the Scrapy engine and the scheduler; it mainly processes the requests and responses sent from the engine to the scheduler.
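
As an illustration of the downloader-middleware point above, here is a minimal sketch (not part of the project code) of a middleware that sets a proxy and a User-Agent on every request. The proxy address is a placeholder, and the class would still have to be enabled in DOWNLOADER_MIDDLEWARES.

class ProxyUserAgentMiddleware(object):
    """Sketch of a downloader middleware: set a proxy and a User-Agent per request."""

    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8080'  # placeholder proxy address
        request.headers['User-Agent'] = 'Mozilla/5.0'    # example User-Agent string
        return None  # returning None lets the request continue through the chain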

[2] Scrapy Running Process

  1. The engine fetches a URL from the scheduler for the next crawl.
  2. The engine wraps the URL in a Request and passes it to the downloader.
  3. The downloader downloads the resource and wraps it in a Response.
  4. The spider parses the Response.
  5. Parsed entities (items) are handed to the Item Pipeline for further processing.
  6. Parsed links (URLs) are handed back to the scheduler to wait for the next crawl (a minimal sketch of steps 5 and 6 follows this list).
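
A minimal sketch (not the project code, assuming a reasonably recent Scrapy version) of how steps 5 and 6 look from inside a spider: yielding an item sends it to the Item Pipeline, while yielding a Request sends the new URL back to the scheduler. The spider name and URL below are placeholders.

import scrapy


class FlowSketchSpider(scrapy.Spider):
    name = 'flow_sketch'                  # placeholder name
    start_urls = ['http://example.com/']  # placeholder URL

    def parse(self, response):
        # Step 5: parsed entities are yielded as items and go to the Item Pipeline
        yield {'title': response.xpath('//title/text()').extract_first()}
        # Step 6: parsed links are yielded as Requests and go back to the scheduler
        for href in response.xpath('//a/@href').extract():
            yield response.follow(href, callback=self.parse)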

II. Project Cases

Film and TV Information Collection and Analysis Based on the Scrapy Framework

Project introduction

In order to make full use of the big-data resources on the Internet and let users consume film and television information conveniently, this project uses crawler technology based on the Scrapy framework to develop a search engine for retrieving movie information. The Douban website is crawled for film and television information so that users can conveniently and accurately access the latest movies.

Project code

Take "Douban Movie" as the crawling target, crawl the movie and television information in the website. Mainly includes the website ranking "Top250" and comedy, action movies
Film title, film rating, film director, film release time and film reviews.

[1] Creating the project

scrapy startproject Douban

[2] Create a crawler program

cd Douban/
scrapy genspider douban 'douban.com'

Scrapy automatically creates the project directories and files.
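
Depending on the Scrapy version, the generated layout looks roughly like this (douban.py is added by the genspider command):

Douban/
    scrapy.cfg
    Douban/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            douban.py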

[3] Define the crawling targets in "items.py"

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()         # film title
    score = scrapy.Field()         # film rating
    quote = scrapy.Field()         # featured review
    director = scrapy.Field()      # film director
    release_date = scrapy.Field()  # release date
    comment_num = scrapy.Field()   # number of comments
    image_url = scrapy.Field()     # URL of the movie poster
    detail_url = scrapy.Field()    # URL of the film's detail page
    image_path = scrapy.Field()    # local path of the downloaded poster
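
Scrapy items behave like dictionaries; a quick usage sketch (the values are made up, and DoubanItem is the class defined above):

from Douban.items import DoubanItem

item = DoubanItem()
item['title'] = 'The Shawshank Redemption'
item['score'] = '9.7'
print(dict(item))  # {'title': 'The Shawshank Redemption', 'score': '9.7'}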

[4] Modify the configuration file "settings.py".
1. Add a random User-Agent and disable obeying the robots.txt protocol

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'Douban (+http://www.yourdomain.com)'
# Set a random User-Agent
from fake_useragent import UserAgent
ua = UserAgent()
USER_AGENT = ua.random
# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    # 'Douban.pipelines.DoubanPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
    'scrapy.pipelines.files.FilesPipeline': 2,
    'Douban.pipelines.MyImagesPipeline': 2,
    'Douban.pipelines.DoubanPipeline': 300,
    'Douban.pipelines.JsonWriterPipeline': 200,  # the smaller the number, the earlier the pipeline runs
    'Douban.pipelines.AddScoreNum': 100,         # post-process the crawled data before it is saved
    'Douban.pipelines.MysqlPipeline': 200,       # save the processed data to MySQL
}
FILES_STORE = '/tmp/files'
IMAGES_STORE = '/tmp/images'
IMAGES_EXPIRES = 30
# Thumbnail sizes
IMAGES_THUMBS = {
    'small': (250, 250),
    'big': (270, 270),
}
# Image filter: minimum height and width
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
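
The MysqlPipeline shown later hardcodes its connection parameters. An alternative, sketched below under the assumption that hypothetical MYSQL_HOST / MYSQL_PASSWD keys are added to settings.py, is to read them from the settings through Scrapy's from_crawler hook:

class SettingsAwarePipeline(object):
    """Sketch of a pipeline that reads its configuration from settings.py."""

    def __init__(self, mysql_host, mysql_passwd):
        self.mysql_host = mysql_host
        self.mysql_passwd = mysql_passwd

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes everything defined in settings.py
        return cls(
            mysql_host=crawler.settings.get('MYSQL_HOST', '127.0.0.1'),
            mysql_passwd=crawler.settings.get('MYSQL_PASSWD', ''),
        )

    def process_item(self, item, spider):
        return item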

2. Set up the pipelines.
When an Item has been collected by a Spider, it is passed to the Item Pipeline, where the configured pipeline components process it in the defined order.
Each Item Pipeline component is a Python class that implements a few simple methods, such as deciding whether an Item is dropped or stored. Typical uses of an item pipeline are:
1. Validating crawled data (checking that the item contains certain fields, such as a name field)
2. Checking for duplicates (and dropping them); see the sketch after this list
3. Storing the crawl results in files or a database
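
For point 2, a minimal duplicate-filtering sketch (not part of the project code, keyed on the title field used in this project) could look like this:

from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    """Drop items whose title has already been seen during this crawl."""

    def __init__(self):
        self.seen_titles = set()

    def process_item(self, item, spider):
        if item['title'] in self.seen_titles:
            raise DropItem("Duplicate item found: %s" % item['title'])
        self.seen_titles.add(item['title'])
        return item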

Here I process the crawled data (add 1 to the score), store it in a MySQL database, and fetch and download the poster images ("pipelines.py"):

import json

import pymysql
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class DoubanPipeline(object):
    def process_item(self, item, spider):
        return item


class AddScoreNum(object):
    """Add one to the original score."""

    def process_item(self, item, spider):
        if item['score']:
            score = float(item['score'])
            item['score'] = str(score + 1)
            return item
        else:
            raise Exception("No score was crawled")


class JsonWriterPipeline(object):
    """Open the file before the crawl starts, close it when the crawl ends."""

    def open_spider(self, spider):
        self.file = open('douban.json', 'w')

    def process_item(self, item, spider):
        # dict(item): convert the item object into a dictionary
        # json.dumps: serialize the dictionary into a JSON string
        # indent=4: indent the stored JSON by 4 spaces
        # ensure_ascii=False: avoid garbling Chinese characters
        line = json.dumps(dict(item), indent=4, ensure_ascii=False)
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()


class MysqlPipeline(object):
    """MySQL storage pipeline."""

    def open_spider(self, spider):
        # Connect to the database
        self.connect = pymysql.connect(
            host='127.0.0.1',    # database host
            port=3306,           # database port
            db='DoubanProject',  # database name
            user='root',         # database user
            passwd='westos',     # database password
            charset='utf8',      # encoding
            use_unicode=True,
            autocommit=True
        )
        # All inserts, deletes and updates go through the cursor
        self.cursor = self.connect.cursor()
        self.cursor.execute("create table if not exists douBanTop("
                            "title varchar(50) unique, "
                            "score float, "
                            "quote varchar(100), "
                            "director varchar(100), "
                            "comment_num int, "
                            "release_date varchar(10));")

    def process_item(self, item, spider):
        # Parameterized query, so quotes inside the data cannot break the SQL
        insert_sqli = "insert into douBanTop(title, score, quote, director) values (%s, %s, %s, %s)"
        try:
            self.cursor.execute(insert_sqli,
                                (item['title'], item['score'], item['quote'], item['director']))
            # Commit the SQL statement
            self.connect.commit()
        except Exception as e:
            self.connect.rollback()
        return item  # process_item must return the item

    def close_spider(self, spider):
        self.connect.commit()
        self.cursor.close()
        self.connect.close()


class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Called once per item; yield a request for each image to download
        print("item: ", item)
        yield scrapy.Request(item['image_url'])

    def item_completed(self, results, item, info):
        """
        :param results: e.g.
            [(True,  {'url': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p1454261925.jpg',
                      'path': 'full/e9cc62a6d6a0165314b832b1f31a74ca2487547a.jpg',
                      'checksum': '5d77f59d4d634b795780b2138c1bf572'})]
        :param item: the item being processed
        :param info: pipeline info
        :return: the item, with the local image path filled in
        """
        image_paths = [x['path'] for isok, x in results if isok]
        if not image_paths:
            raise DropItem("Item contains no images")

        item['image_path'] = image_paths[0]
        return item

[5] Write the spider "douban.py"

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request

from Douban.items import DoubanItem


class DoubanSpider(scrapy.Spider):
    # Spider name; must be unique within the project
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['http://douban.com/',
                  'https://movie.douban.com/top250']

    url = 'https://movie.douban.com/top250'

    def parse(self, response):
        item = DoubanItem()
        # <ol class="grid_view">
        movies = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movies:
            # Film title, e.g. <span class="title">The Shawshank Redemption</span>
            item['title'] = movie.xpath('.//span[@class="title"]/text()').extract()[0]
            # Film rating
            item['score'] = movie.xpath('.//span[@class="rating_num"]/text()').extract()[0]
            # Featured review
            quote = movie.xpath('.//span[@class="inq"]/text()').extract()
            item['quote'] = quote[0] if quote else ''
            # Film director: the info line looks like "导演: ... 主演: ...",
            # so split on "主演" (starring) and keep the director part
            info = movie.xpath('.//div[@class="bd"]/p/text()').extract()
            director = info[0].split('主演')[0].strip()
            item['director'] = director
            # URL of the movie poster
            item['image_url'] = movie.xpath('.//div[@class="pic"]/a/img/@src').extract()[0]
            # URL of the film's detail page
            item['detail_url'] = movie.xpath('.//div[@class="hd"]//a/@href').extract()[0]

            yield item

        # Next-page link, e.g. <a href="?start=25&filter=">后页&gt;</a>
        nextLink = response.xpath('.//span[@class="next"]/link/@href').extract()  # returns a list
        if nextLink:
            nextLink = nextLink[0]
            print('Next Link: ', nextLink)
            yield Request(self.url + nextLink, callback=self.parse)
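
Finally, run the spider from the project root; the optional -o flag additionally exports the items to a file (the file name here is just an example):

scrapy crawl douban
scrapy crawl douban -o douban_top250.json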


