Crawler Framework Scrapy, Part 1: Stock Data Crawling

Brief Introduction

Objective: To obtain the names and trading information of all stocks on the Shanghai Stock Exchange and Shenzhen Stock Exchange.
Output: Save to file.
Technical Route: Scrapy Crawler Framework
Language: Python 3.5
Since the principles of stock information crawling were described in the previous blog post, they are not repeated here; for more information, refer to that post: Link Description. In this article, we focus on how the project is implemented within the Scrapy framework.

Principle Analysis

The Scrapy framework is shown below:
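
In outline, the framework consists of the Engine, the Scheduler, the Downloader, Spiders, and Item Pipelines, connected by downloader and spider middlewares: the Engine routes requests from the Spiders through the Scheduler to the Downloader, hands the downloaded responses back to the Spiders, and passes the items they yield on to the Item Pipelines.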

We mainly need to do two things:
(1) Write a crawler Spider in the framework to crawl links and parse pages;
(2) Write a pipeline to process the parsed stock data and store it in a file.

Coding

Steps:
(1) Create a project and generate the Spider template
Open the cmd command line, navigate to the path where the project should live, and type: scrapy startproject BaiduStocks. A new project named BaiduStocks is created in that directory. Then enter: cd BaiduStocks to move into the project directory, and enter: scrapy genspider stocks baidu.com to generate a spider. After that, a stocks.py file appears in the spiders/ directory, as shown in the following figure:
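
For reference, the project layout produced by these two commands typically looks like the sketch below (exact file names may vary slightly with the Scrapy version):

BaiduStocks/
    scrapy.cfg                # deployment configuration
    BaiduStocks/              # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            stocks.py         # generated by "scrapy genspider stocks baidu.com"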

(2) Write the Spider: configure the stocks.py file, modifying how returned pages are parsed and how new URL crawl requests are generated
Open the stocks.py file; it contains the following code:

# -*- coding: utf-8 -*-
import scrapy


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        pass

Modify the above code as follows:

# -*- coding: utf-8 -*-
import scrapy
import re


class StocksSpider(scrapy.Spider):
    name = "stocks"
    start_urls = ['http://quote.eastmoney.com/stocklist.html']

    def parse(self, response):
        # Scan every link on the Eastmoney stock list page and keep only hrefs
        # that contain a stock code of the form sh/sz followed by six digits.
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = 'https://gupiao.baidu.com/stock/' + stock + '.html'
                # Hand the individual stock page over to parse_stock().
                yield scrapy.Request(url, callback=self.parse_stock)
            except IndexError:
                # No stock code in this href, skip it.
                continue

    def parse_stock(self, response):
        # Collect the key/value pairs listed in the page's .stock-bets block.
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            # Strip the surrounding <dt>...</dt> / <dd>...</dd> markup with regexes.
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except IndexError:
                val = '--'
            infoDict[key] = val

        # Combine the stock's name and code into a single 'Stock Name' field.
        infoDict.update(
            {'Stock Name': re.findall(r'\s.*\(', name)[0].split()[0] + \
             re.findall(r'\>.*\<', name)[0][1:-1]})
        yield infoDict
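
To see in isolation what the link-filtering step in parse() does, here is a small standalone sketch; the example hrefs are made-up values in the style of the Eastmoney list page:

import re

# Illustrative hrefs; only those containing an "sh"/"sz" code plus six digits are kept.
hrefs = [
    'http://quote.eastmoney.com/sh600000.html',
    'http://quote.eastmoney.com/sz000001.html',
    'http://quote.eastmoney.com/center',
]

for href in hrefs:
    codes = re.findall(r"[s][hz]\d{6}", href)
    if codes:
        # Same URL pattern the spider requests for each stock.
        print('https://gupiao.baidu.com/stock/' + codes[0] + '.html')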

(3) Configure the pipelines.py file to define the processing class for the scraped items
Open the pipelines.py file; it looks like this:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class BaidustocksPipeline(object):
    def process_item(self, item, spider):
        return item

Modify the above code as follows:

# -*- coding: utf-8 -*-
 
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
 
 
class BaidustocksPipeline(object):
    def process_item(self, item, spider):
        return item

# A pipeline class can define these three methods.
class BaidustocksInfoPipeline(object):
    # Called when the spider is opened; open the output file here.
    def open_spider(self, spider):
        self.f = open('BaiduStockInfo.txt', 'w')
    # Called when the spider is closed; close the output file here.
    def close_spider(self, spider):
        self.f.close()
    # Called for every item the spider yields; this is the core method of a pipeline.
    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except:
            pass
        return item
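
The pipeline above writes the Python str() form of each dict, one item per line. As a variation (my own sketch, not part of the original project), each item could instead be serialized as a JSON line, which is easier to load later:

# -*- coding: utf-8 -*-
import json


# Illustrative alternative pipeline: writes each item as one JSON object per line.
class BaidustocksJsonPipeline(object):
    def open_spider(self, spider):
        self.f = open('BaiduStockInfo.jsonl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese field names readable in the file.
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

If this class were used, it would also need to be registered in ITEM_PIPELINES (next step), just like BaidustocksInfoPipeline.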

(4) Modify settings.py so that the framework can find the pipeline class we wrote in pipelines.py
Add the following to settings.py:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'BaiduStocks.pipelines.BaidustocksInfoPipeline': 300,
}
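
The integer 300 is the pipeline's priority: items pass through the enabled pipelines in ascending order of this value (conventionally chosen in the 0-1000 range), so the exact number only matters when several pipelines are enabled at once.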

At this point, the program is complete.

(5) Run the crawler
On the command line, type: scrapy crawl stocks
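
As an aside (not part of the original write-up), Scrapy's built-in feed exporter can also save the yielded items directly, without going through the custom pipeline:

scrapy crawl stocks -o BaiduStockInfo.json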
