Scrapy Crawler Framework

1. Scrapy Crawler Framework

1. Scrapy is not a function library, but a crawler framework

A crawler framework is a collection of software structures and components that implement crawling functionality

A crawler framework is a semi-finished product that helps users build professional web crawlers

2. The '5+2' structure

(1) Engine (no user modification required)

Controls the data flow among all modules

Triggers events according to conditions

(2) Downloader (no user modification required)

Downloads web pages according to requests

(3) Scheduler (no user modification required)

Schedules all crawl requests

(4) Downloader Middleware

Purpose: to implement user-configurable control between the Engine, Scheduler, and Downloader

Functions: modify, discard, or add requests or responses

Users can write configuration code

(5) Spider

Parses the responses (Response) returned by the Downloader

Generates scraped items (Item)

Generates additional crawl requests (Request)

Requires the user to write configuration code

(6) Item Pipelines

Processes the items generated by the Spider in a pipelined fashion

Consists of a sequence of operations, similar to a pipeline; each operation is an Item Pipeline type

Possible operations include cleaning, validating, and inspecting the HTML data in crawled items, and storing the data in a database (a minimal pipeline sketch follows this list)

(7) Spider Middleware

Purpose: to reprocess requests and scraped items

Functions: modify, discard, or add requests or scraped items

Users can write configuration code
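
As a rough illustration of the Item Pipelines module, here is a minimal sketch of a pipeline class that cleans, validates, and stores items. The class name, the 'title' field, and the output file name are assumptions for illustration only; they are not part of the example project built later in this article.

import json

from scrapy.exceptions import DropItem


class CleanAndStorePipeline:
    # A hypothetical Item Pipeline: the method names (open_spider,
    # process_item, close_spider) are Scrapy's pipeline interface,
    # but the cleaning/validation logic here is only illustrative.

    def open_spider(self, spider):
        # open an output file when the spider starts (assumed file name)
        self.file = open('items.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # clean: strip whitespace from an assumed 'title' field
        if item.get('title'):
            item['title'] = item['title'].strip()
        # validate: discard items that still have no title
        if not item.get('title'):
            raise DropItem('missing title')
        # store: write the item as one JSON line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

For such a class to take effect, it must be registered in the project's settings.py under the ITEM_PIPELINES setting.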

2. Comparison of the Requests library and Scrapy crawlers

1. Similarities

Both can make page requests and crawl content; they are two important technical routes for Python crawlers

Both are usable, well documented, and easy to get started with

Neither handles JavaScript, submits forms, or copes with CAPTCHAs out of the box (though both are extensible)

2. Differences

Requests                                                Scrapy
Page-level crawler                                      Site-level crawler
Functional library                                      Framework
Little consideration of concurrency, poor performance   Good concurrency, high performance
Focuses on page download                                Focuses on crawler structure
Flexible customization                                  General customization is flexible, deep customization is difficult
Easy to get started                                     Somewhat harder to get started

3. Scrapy command line

Scrapy is a professional crawler framework designed for continuous operation, and it provides a Scrapy command line for operating crawlers

1. Command Line Format

scrapy <command> [options] [args]

2. Common Scrapy Commands

Command        Description                                Format
startproject   Create a new project                       scrapy startproject <name> [dir]
genspider      Create a crawler                           scrapy genspider [options] <name> <domain>
settings       Get crawler configuration information      scrapy settings [options]
crawl          Run a crawler                              scrapy crawl <spider>
list           List all crawlers in the project           scrapy list
shell          Launch an interactive shell for a URL      scrapy shell [url]

4. Examples

1. Create a Scrapy crawler project

Command Line Statements

scrapy startproject python123demo   # create a new crawler project named "python123demo"

The working directory is D:\Codes\Python:

D:\Codes\Python>scrapy startproject python123demo
New Scrapy project 'python123demo', using template directory 'd:\codes\python\venv\lib\site-packages\scrapy\templates\project', created in:
    D:\Codes\Python\python123demo

You can start your first spider with:
    cd python123demo
    scrapy genspider example example.com

python123demo/                 Outer directory
    scrapy.cfg                 Deployment configuration file for the Scrapy crawler
    python123demo/             User-defined Python code for the Scrapy framework
        __init__.py            Initialization script
        items.py               Items code template (inherited class)
        middlewares.py         Middlewares code template (inherited class)
        pipelines.py           Pipelines code template (inherited class)
        settings.py            Scrapy crawler configuration file
        spiders/               Spiders code template directory (inherited classes)
            __init__.py        Initial file, no modification required
            __pycache__/       Cache directory, no modification required

2. Create a Scrapy crawler in the project

Command Line Statements

cd python123demo                      # enter the python123demo folder
scrapy genspider demo python123.io    # create a crawler named demo

Change into the project directory first, then create the crawler:

D:\Codes\Python>cd python123demo

(venv) D:\Codes\Python\python123demo>scrapy genspider demo python123.io
Created spider 'demo' using template 'basic' in module:
  python123demo.spiders.demo

A demo.py file is generated in the spiders folder

Its content is:

# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/']

    def parse(self, response):
        pass

parse() is used to process the response: it parses the content into a dictionary (item) and discovers new URLs to generate further crawl requests
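
For instance, a parse() method along the following lines yields both a dictionary item and new crawl requests. This is only a sketch, not the generated template: the spider name, the selectors, and the idea of following every link are assumptions.

import scrapy


class LinksSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = 'links_demo'
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        # parse the content into a dictionary (a scraped item)
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        # discover new URLs and generate further crawl requests
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)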

3. Configure the generated spider

Modify the content of demo.py:

# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    # allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)  # save the returned content as a file
        self.log('Saved file %s.' % fname)

4. Run the crawler to fetch the web page

Command Line Statement

scrapy crawl demo

Output of the run:

D:\Codes\Python\python123demo>scrapy crawl demo
2020-03-19 11:25:40 [scrapy.utils.log] INFO: Scrapy 2.0.0 started (bot: python123demo)
2020-03-19 11:25:40 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 22:39:24) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2020-03-19 11:25:40 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-03-19 11:25:41 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'python123demo',
 'NEWSPIDER_MODULE': 'python123demo.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['python123demo.spiders']}
2020-03-19 11:25:41 [scrapy.extensions.telnet] INFO: Telnet Password: dbe958957137573b
2020-03-19 11:25:41 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-03-19 11:25:42 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-03-19 11:25:42 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-03-19 11:25:42 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-03-19 11:25:42 [scrapy.core.engine] INFO: Spider opened
2020-03-19 11:25:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-03-19 11:25:42 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-03-19 11:25:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://python123.io/robots.txt> from <GET http://python123.io/robots.txt>
2020-03-19 11:25:42 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://python123.io/robots.txt> (referer: None)
2020-03-19 11:25:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://python123.io/ws/demo.html> from <GET http://python123.io/ws/demo.html>
2020-03-19 11:25:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://python123.io/ws/demo.html> (referer: None)
2020-03-19 11:25:42 [demo] DEBUG: Saved file demo.html.
2020-03-19 11:25:42 [scrapy.core.engine] INFO: Closing spider (finished)
2020-03-19 11:25:42 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 892,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 1901,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 2,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 0.644698,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 3, 19, 3, 25, 42, 983695),
 'log_count/DEBUG': 5,
 'log_count/INFO': 10,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2020, 3, 19, 3, 25, 42, 338997)}
2020-03-19 11:25:42 [scrapy.core.engine] INFO: Spider closed (finished)

The captured page is stored in demo.html

5. Code understanding

1. Full version of the demo.py code

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        urls = [
            'http://python123.io/ws/demo.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s.' % fname)

The start_requests(self) function is a generator: it yields crawl requests one at a time, which gives better resource utilization when there are very many URLs
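
As a sketch of why this matters, suppose the spider had to cover a very large number of pages; start_requests() could then yield the requests one at a time instead of building them all up front. The spider name and the paginated URL pattern below are assumptions used only to illustrate the idea.

import scrapy


class ManyPagesSpider(scrapy.Spider):
    # Hypothetical spider covering many paginated URLs (assumed pattern).
    name = 'many_pages'

    def start_requests(self):
        # Yield requests one by one; the full list of Request objects
        # never has to exist in memory at once.
        for page in range(1, 100001):
            url = 'http://example.com/list?page=%d' % page  # assumed URL pattern
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log('Fetched %s' % response.url)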

2. yield keyword

yield <---> generator

A generator is a function that continually produces values

A function containing a yield statement is a generator

A generator produces one value at a time: execution is frozen at the yield statement, and the next value is produced after the generator is woken up again

Writing with a generator:

def gen(n):  # define the gen() function
    for i in range(n):
        yield i**2

for i in gen(5):
    print(i, " ", end="")

# output: 0  1  4  9  16

Compared with listing everything at once, a generator saves storage space, responds faster, and is more flexible to use

Ordinary writing (all values at once):

def square(n):  # define the square() function
    ls = [i**2 for i in range(n)]
    return ls

for i in square(5):
    print(i, " ", end="")

# output: 0  1  4  9  16

The ordinary writing puts all results into a list at once, which takes up a lot of space and time and is unfriendly to the running program.

6. Summary of Scrapy Crawler Framework

1. Steps for using Scrapy Crawlers

(1) Create a project and a Spider template

(2) Write the Spider

(3) Write the Item Pipeline

(4) Optimize the configuration strategy

2. Data types of Scrapy Crawlers

(1) Request class

class scrapy.http.Request()

The Request object represents an HTTP request

Generated by Spider and executed by Downloader

Property or method   Description
.url                 The URL of the request
.method              The request method, 'GET', 'POST', etc.
.headers             Dictionary-style request headers
.body                The request body, string type
.meta                User-added extended information, used to pass information between Scrapy modules
.copy()              Copy the request
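
A short sketch of how these attributes are typically used when constructing a Request by hand; the header value and the meta key are assumptions:

import scrapy

# Build a Request, attach user information via meta, then copy it.
req = scrapy.Request(
    url='http://python123.io/ws/demo.html',
    method='GET',
    headers={'User-Agent': 'Mozilla/5.0'},  # assumed header value
    meta={'page_type': 'demo'},             # assumed key for passing info between modules
)
print(req.url, req.method)  # http://python123.io/ws/demo.html GET
req2 = req.copy()           # an independent copy of the same request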

(2) Response class

class scrapy.http.Response()

The Response object represents an HTTP response

Generated by Downloader and processed by Spider

Property or method   Description
.url                 The URL of the response
.status              The HTTP status code, 200 by default
.headers             The header information of the response
.body                The content of the response, string type
.flags               A set of flags
.request             The Request object corresponding to this Response
.copy()              Copy the response
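
Inside a spider, these attributes are read directly from the response passed to parse(); a minimal sketch with a hypothetical spider that only logs them:

import scrapy


class InspectSpider(scrapy.Spider):
    # Hypothetical spider that logs Response attributes, for illustration only.
    name = 'inspect_demo'
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        self.log('URL: %s' % response.url)        # URL of the response
        self.log('Status: %d' % response.status)  # HTTP status code
        self.log('Content-Type: %s' % response.headers.get('Content-Type', b'').decode())
        self.log('Body length: %d bytes' % len(response.body))
        self.log('From request: %s' % response.request.url)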

(3) Item class

class scrapy.item.Item()

Item objects represent information content extracted from an HTML page

Generated by Spider and processed by Item Pipeline

The Item type is dictionary-like and can be operated on like a dictionary
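
A minimal sketch of defining an Item and operating on it like a dictionary; the class name and field names are assumptions:

import scrapy


# Declare an Item with two (assumed) fields.
class DemoItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()


# An Item instance can be used like a dictionary.
item = DemoItem()
item['title'] = 'This is a python demo page'
item['url'] = 'http://python123.io/ws/demo.html'
print(item['title'])
print(dict(item))  # convert to a plain dictionary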

3. Information extraction methods of Scrapy crawlers

Scrapy crawlers support a variety of HTML information extraction methods:

Beautiful Soup

lxml

re

XPath Selector

CSS Selector

Basic use of CSS Selector

<HTML>.css('a::attr(href)').extract()

CSS Selector is maintained and standardized by the W3C organization
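
Put together in a spider, the CSS Selector is used through response.css(); a sketch applied inside parse(). The spider name and the specific selectors are assumptions about the page's markup.

import scrapy


class CssDemoSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = 'css_demo'
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        links = response.css('a::attr(href)').extract()      # all href attributes of <a> tags
        title = response.css('title::text').extract_first()  # first match, or None
        yield {'title': title, 'links': links}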

Source: MOOC by Song Tian, Beijing Institute of Technology
