A simple understanding of scrapy

Scrapy

Scrapy is an asynchronous crawler (web scraping) framework.

Synchronous vs. asynchronous

Synchronous: each method depends on the previous one. If the previous method has not finished, the next one will not run.

Asynchronous: the next method does not depend on the previous one. Even if the previous method has not finished, the next one still runs.
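As a rough illustration (this is not Scrapy code, just a minimal asyncio sketch), two tasks below run concurrently, and the shorter one finishes first instead of waiting for the other:

import asyncio

async def task(name, seconds):
    # Simulate a slow operation such as a network request
    await asyncio.sleep(seconds)
    print(f"{name} finished after {seconds}s")

async def main():
    # Asynchronous: the second task does not wait for the first to finish
    await asyncio.gather(task("first", 2), task("second", 1))

asyncio.run(main())   # prints "second finished after 1s" before "first finished after 2s"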


Components: Item Pipeline, Scheduler, Downloader, Spiders, and the Scrapy Engine (which manages the request queue)

Middleware: Spider Middlewares, Downloader Middlewares

The Item Pipeline mainly handles I/O and storage, writing the scraped data to local storage (files, databases, etc.)

The Scheduler sends URLs to the Downloader and deduplicates URLs before putting them into the queue

The Downloader handles requests and returns responses to the Spiders

Spiders are our crawler files

The Scrapy Engine controls the whole operation

Spider middlewares are generally not used; they process requests as the spider hands them over to the scheduler.

Downloader middlewares are used to add request headers and proxy IPs

Using Scrapy for the first time

Install scrapy

pip install scrapy

Create Project

#Create a project
 > scrapy startproject <project name>
 > cd <project name>
 #Create a crawler file
 > scrapy genspider <crawler name> "<host address>"
#Run the crawler
 > scrapy crawl <crawler name>
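For example, using the project and crawler names that appear later in this article (scrapy01 and s1):

#Concrete example (names match the rest of this article)
 > scrapy startproject scrapy01
 > cd scrapy01
 > scrapy genspider s1 "blog.csdn.net"
 > scrapy crawl s1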

Common Configurations

Common configurations in settings.py (a concrete example follows the list): 
	USER_AGENT = ""                  # User-Agent 
	ROBOTSTXT_OBEY = True|False      # Whether to obey the robots.txt protocol 
	DEFAULT_REQUEST_HEADERS = {}     # Default request headers 
	CONCURRENT_REQUESTS = 16         # Maximum number of concurrent requests handled by the Downloader 
	DOWNLOAD_DELAY = 3               # Download delay (seconds) 
	SPIDER_MIDDLEWARES               # Spider middlewares 
	DOWNLOADER_MIDDLEWARES           # Downloader middlewares 
	ITEM_PIPELINES                   # Item pipelines
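A minimal settings.py sketch with example values (the User-Agent string and header values here are only illustrative):

# settings.py (example values)
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"   # example User-Agent
ROBOTSTXT_OBEY = False                                     # usually turned off when crawling
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}
CONCURRENT_REQUESTS = 16     # maximum concurrent requests handled by the Downloader
DOWNLOAD_DELAY = 3           # wait 3 seconds between requests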

Create a crawler file

>> scrapy genspider s1 "blog.csdn.net"

The generated crawler file

import scrapy


class S1Spider(scrapy.Spider):
    # Crawler name
    name = 's1'
    # If the host of a URL does not belong to allowed_domains, the request is filtered out
    allowed_domains = ['blog.csdn.net']
    # URLs requested when the crawler starts
    start_urls = ['http://blog.csdn.net/']

    # Called with the response after each URL in start_urls has been downloaded
    def parse(self, response):  # response is the response object
        pass
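As an illustration of what usually goes into parse(), the sketch below extracts data from the response and yields it; the CSS selector is hypothetical, not taken from blog.csdn.net:

def parse(self, response):
    # Hypothetical selectors, just to show the usual pattern
    for link in response.css('a.article-title'):
        yield {
            'title': link.css('::text').get(),
            'url': response.urljoin(link.attrib.get('href', '')),
        }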


Defining the following start_requests() method is equivalent to using start_urls, but it lets you build the initial requests yourself (custom request headers, callbacks, and so on).

def start_requests(self):
    yield scrapy.Request(                # Send a Request object to the scheduler
        url='http://edu.csdn.net',       # Request address; GET method by default
        callback=self.parse2             # Function called when the response is received
    )

def parse2(self, response):              # Function called when the response is received
    print(response.body)                 # response.body is the raw bytes of the page

When crawling real websites, ROBOTSTXT_OBEY is usually set to False in settings.py.

Adding request headers

# Add them in the downloader middleware (middlewares.py)

from scrapy.http import Headers
import user_agent   # helper module (assumed) that returns random User-Agent strings


class Scrapy01DownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Replace the request headers with a randomly generated User-Agent
        request.headers = Headers(
            {
                'User-Agent': user_agent.get_user_agent_pc()
            }
        )

        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

Enable this middleware

# Uncomment this middleware in settings.py

DOWNLOADER_MIDDLEWARES = {
   'scrapy01.middlewares.Scrapy01DownloaderMiddleware': 543,  # the number is the priority; the smaller it is, the higher the priority
}

Adding a proxy IP

# Still inside the downloader middleware; urllib is used to read a proxy address from an API
import urllib.request as ur

def process_request(self, request, spider):
    request.headers = Headers(
        {
            'User-Agent': user_agent.get_user_agent_pc()
        }
    )
    # The proxy API is assumed to return a string like "IP:PORT"
    request.meta['proxy'] = 'http://' + ur.urlopen('proxy IP API URL').read().decode('utf-8').strip()

For convenience, we can add a start.py to run the crawler directly from the IDE:

from scrapy import cmdline

cmdline.execute('scrapy crawl s1'.split())

To save the scraped data, modify parse2 to yield an item:

def parse2(self, response):
    # print("--" * 30)
    # print(response.body.decode('utf-8'))
    # print("--" * 30)
    data = response.body.decode('utf-8')
    item = {}
    item['data'] = data
    yield item    # yielded items are passed to the item pipelines

In settings.py

# Enable the item pipelines in settings.py
ITEM_PIPELINES = {
   'scrapy01.pipelines.Scrapy01Pipeline': 300,      # the smaller the number, the higher the priority
   # 'scrapy01.pipelines.AnotherPipeline': 400,     # a second (hypothetical) pipeline would receive the item next
}
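A minimal sketch of what Scrapy01Pipeline could look like, writing each item to a local JSON-lines file (the file name items.json is an assumption):

# pipelines.py
import json

class Scrapy01Pipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open('items.json', 'w', encoding='utf-8')   # assumed file name

    def process_item(self, item, spider):
        # Called for every item yielded by the spider
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item   # return the item so the next pipeline (if any) receives it

    def close_spider(self, spider):
        # Called once when the spider closes
        self.file.close()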

Logging

LOG_FILE = "log file path"
LOG_LEVEL = "log level"

# Log levels:
	CRITICAL   critical errors
	ERROR      regular errors
	WARNING    warning messages
	INFO       informational messages
	DEBUG      debugging messages
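For example (the file name and level below are just examples):

# settings.py
LOG_FILE = 'crawl.log'   # example log file name
LOG_LEVEL = 'INFO'       # only messages at INFO level and above are recorded

Inside a spider, messages can be written through self.logger, for example self.logger.info('Crawled %s', response.url).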

pymysql

import pymysql

mysql_conn = pymysql.connect(
    host='localhost',        # host address
    port=3306,               # port number
    user='root',             # login user
    password='',             # login password
    database='database name',
    charset='utf8',          # character set
)
# Create a cursor object
cs = mysql_conn.cursor()
# Execute the SQL statement
cs.execute('SQL')
mysql_conn.commit()
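Combining this with Scrapy, a sketch of a MySQL pipeline might look like the following (the database name spider_db, table pages, and column data are assumptions; the pipeline would also need to be registered in ITEM_PIPELINES):

# pipelines.py -- sketch of saving items to MySQL
import pymysql

class MysqlPipeline(object):
    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host='localhost', port=3306, user='root', password='',
            database='spider_db', charset='utf8')      # assumed database name
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Parameterized query; table and column names are assumptions
        self.cursor.execute('INSERT INTO pages (data) VALUES (%s)', (item['data'],))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()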

Redis's database structure

Redis has 16 databases (numbered 0-15)

select [index] switches the current database

Type                                            Redis name
String (numbers are a special kind of string)   String
Hash (dictionary)                               Hash
List (ordered, like a Python list)              List
Unordered set                                   Set
Ordered set                                     Zset

Key operations:

Operation               Command
Look up keys            keys [pattern]
Delete a key            del [key]
Check for existence     exists [key]
View a key's type       type [key]
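The same operations can be performed from Python with the redis package (pip install redis); the key name greeting below is just an example:

import redis

r = redis.Redis(host='127.0.0.1', port=6379, db=0)   # db=0..15, like `select 0`
r.set('greeting', 'hello')
print(r.keys('*'))             # look up keys:        keys [pattern]
print(r.exists('greeting'))    # check for existence: exists [key]
print(r.type('greeting'))      # view the key type:   type [key]  -> b'string'
r.delete('greeting')           # delete:              del [key]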

scrapy-redis principle

scrapy-redis replaces Scrapy's Scheduler and dupefilter with Redis-backed versions, so the request queue and the deduplication fingerprints live in Redis and can be shared by multiple crawler instances.

# Enable the scrapy-redis dupefilter and disable Scrapy's built-in deduplication 
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" 
# Enable the scrapy-redis scheduler and disable Scrapy's scheduler 
SCHEDULER = "scrapy_redis.scheduler.Scheduler" 
# Persist the queue so crawling can resume after an interruption 
SCHEDULER_PERSIST = True 
# Connection to the Redis database 
REDIS_URL = 'redis://127.0.0.1:6379'
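With these settings, the spider itself can also inherit from RedisSpider so that its start URLs are read from a Redis list instead of start_urls (a sketch; the redis_key name below is an assumption):

# s1.py -- sketch of a scrapy-redis spider
from scrapy_redis.spiders import RedisSpider

class S1Spider(RedisSpider):
    name = 's1'
    # URLs are pushed into this Redis list, e.g.:
    #   lpush s1:start_urls http://blog.csdn.net/
    redis_key = 's1:start_urls'   # assumed key name

    def parse(self, response):
        yield {'data': response.body.decode('utf-8')}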

