Description
Crawl data from http://fundact.eastmoney.com/banner/pg.html#ln.
Requirements: for all funds (27 pages), crawl the fund code, fund name, unit net value, date, daily growth rate, last week, last 1 month, last 3 months, last 6 months, last year, last 2 years, last 3 years, year to date, since inception, and the service charge. Store the crawled data in a MariaDB database.
Environment description
python 3.6.3
scrapy 1.4.0
Step record
Create a scrapy project
Go to the directory where the code should live (F:\mypychar_ws) and create the project funds by executing:
```
scrapy startproject funds
```
After the project is created, you will find that a number of files have been generated. The relevant ones are:
- scrapy.cfg — the project's configuration file; it mainly provides basic settings for the scrapy command-line tool (the crawler-related configuration lives in settings.py)
- items.py — defines the storage template for structured data, similar to Django's Model
- pipelines.py — data processing behavior, e.g. persisting structured data
- settings.py — configuration file, e.g. recursion depth, concurrency, download delay, etc.
- spiders — the spider directory, where the files containing the crawling rules are created
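For reference, a freshly generated scrapy 1.4 project typically has a layout like this:
```
funds/
    scrapy.cfg
    funds/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```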
Next, open the project in PyCharm for development.
Set up the scrapy project to run under PyCharm
step 1: create a .py file in the funds project (anywhere in the project):
```python
from scrapy import cmdline

cmdline.execute("scrapy crawl fundsList".split())
```
step 2: configure Run --> Edit Configurations to point at this file.
To run the crawler, simply run this .py file.
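Running that file from PyCharm should be equivalent to launching the crawler from a terminal in the project root, which is a handy sanity check (assuming the project was created under F:\mypychar_ws):
```
cd F:\mypychar_ws\funds
scrapy crawl fundsList
```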
Analyze how to obtain data
Since all the structured data for the list can be fetched through an ajax request, I decided to find the URL of that request by analyzing the page in Chrome's developer tools:
Looking at the data returned by the interface, we find that it is not valid JSON; it looks roughly like the snippet below, so we just need to extract the datas item.
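The original post showed the raw response here as a screenshot; judging from the parsing code later on, it is a variable assignment wrapping a brace-delimited object with unquoted key names, roughly like this (all values below are placeholders):
```
var x = {datas:["000001|some fund name|...|2017-12-08|1.0410|...", "..."],allRecords:2827,pageIndex:1,pageNum:3000,allPages:1}
```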
OK, then I can simply request this interface for the data.
Write code
step 1: define the item
```python
import scrapy


class FundsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    code = scrapy.Field()            # fund code
    name = scrapy.Field()            # fund name
    unitNetWorth = scrapy.Field()    # unit net value
    day = scrapy.Field()             # date
    dayOfGrowth = scrapy.Field()     # daily growth rate
    recent1Week = scrapy.Field()     # last week
    recent1Month = scrapy.Field()    # last 1 month
    recent3Month = scrapy.Field()    # last 3 months
    recent6Month = scrapy.Field()    # last 6 months
    recent1Year = scrapy.Field()     # last year
    recent2Year = scrapy.Field()     # last 2 years
    recent3Year = scrapy.Field()     # last 3 years
    fromThisYear = scrapy.Field()    # year to date
    fromBuild = scrapy.Field()       # since inception
    serviceCharge = scrapy.Field()   # service charge
    upEnoughAmount = scrapy.Field()  # minimum purchase amount
```
step 2: write the spider
```python
import json

import scrapy

from funds.items import FundsItem


class FundsSpider(scrapy.Spider):
    name = 'fundsList'  # unique; this name is used when running the crawler
    allowed_domains = ['fund.eastmoney.com']  # allowed domains

    # Initial URLs. Crawling starts from start_urls automatically, and the
    # responses returned by the server are passed to parse(self, response).
    # Note: the API url used below can return the data of all funds directly.
    # start_urls = ['http://fundact.eastmoney.com/banner/pg.html#ln']

    # custom_settings can be used to customize the settings of each spider,
    # while everything in settings.py is global. When there are multiple
    # spiders in a scrapy project, custom_settings is very useful.
    # custom_settings = {
    # }

    # A spider's initial Requests come from start_requests(), which by default
    # reads start_urls and generates Requests with parse as the callback.
    # Overriding start_requests() means start_urls no longer generates Requests.
    def start_requests(self):
        url = 'https://fundapi.eastmoney.com/fundtradenew.aspx?ft=pg&sc=1n&st=desc&pi=1&pn=3000&cp=&ct=&cd=&ms=&fr=&plevel=&fst=&ftype=&fr1=&fl=0&isab='
        requests = []
        request = scrapy.Request(url, callback=self.parse_funds_list)
        requests.append(request)
        return requests

    def parse_funds_list(self, response):
        datas = response.body.decode('UTF-8')

        # Strip the non-JSON wrapper: keep from the first '{' up to the '}'
        datas = datas[datas.find('{'):datas.find('}') + 1]

        # Add double quotes around each bare field name so json.loads accepts it
        datas = datas.replace('datas', '"datas"')
        datas = datas.replace('allRecords', '"allRecords"')
        datas = datas.replace('pageIndex', '"pageIndex"')
        datas = datas.replace('pageNum', '"pageNum"')
        datas = datas.replace('allPages', '"allPages"')

        jsonBody = json.loads(datas)
        jsonDatas = jsonBody['datas']

        fundsItems = []
        for data in jsonDatas:
            fundsItem = FundsItem()
            fundsArray = data.split('|')
            fundsItem['code'] = fundsArray[0]
            fundsItem['name'] = fundsArray[1]
            fundsItem['day'] = fundsArray[3]
            fundsItem['unitNetWorth'] = fundsArray[4]
            fundsItem['dayOfGrowth'] = fundsArray[5]
            fundsItem['recent1Week'] = fundsArray[6]
            fundsItem['recent1Month'] = fundsArray[7]
            fundsItem['recent3Month'] = fundsArray[8]
            fundsItem['recent6Month'] = fundsArray[9]
            fundsItem['recent1Year'] = fundsArray[10]
            fundsItem['recent2Year'] = fundsArray[11]
            fundsItem['recent3Year'] = fundsArray[12]
            fundsItem['fromThisYear'] = fundsArray[13]
            fundsItem['fromBuild'] = fundsArray[14]
            fundsItem['serviceCharge'] = fundsArray[18]
            fundsItem['upEnoughAmount'] = fundsArray[24]
            fundsItems.append(fundsItem)
        return fundsItems
```
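As a side note, the chained replace() calls would corrupt the payload if a fund name ever contained one of those key strings. A slightly more robust sketch (my own variant, not from the original post) quotes any bare key with a regex instead:
```python
import json
import re


def parse_pseudo_json(body: str) -> dict:
    """Extract the brace-delimited object and quote its bare keys
    (datas, allRecords, ...) so json.loads can parse it."""
    text = body[body.find('{'):body.rfind('}') + 1]
    # quote identifiers that appear as keys, i.e. right after '{' or ','
    text = re.sub(r'([{,])\s*([A-Za-z_]\w*)\s*:', r'\1"\2":', text)
    return json.loads(text)
```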
step 3: configure settings.py
custom_settings can be used to customize the settings of each spider individually, while everything in settings.py is global; when a scrapy project contains multiple spiders, custom_settings is very useful.
However, there is only one spider in my current project, so for the time being I configure it through settings.py.
Set DEFAULT_REQUEST_HEADERS (since this crawler requests an API interface directly, this can also be left unconfigured):
```python
DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Cookie': 'st_pvi=72856792768813; UM_distinctid=1604442b00777b-07f0a512f81594-5e183017-100200-1604442b008b52; qgqp_b_id=f10107e9d27d5fe2099a361a52fcb296; st_si=08923516920112; ASP.NET_SessionId=s3mypeza3w34uq2zsnxl5azj',
    'Host': 'fundapi.eastmoney.com',
    'Referer': 'http://fundact.eastmoney.com/banner/pg.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
}
```
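A couple of other settings may also matter for an API crawl like this one; these are illustrative additions, not from the original setup:
```python
ROBOTSTXT_OBEY = False  # the generated default is True, which could block the API request
DOWNLOAD_DELAY = 1      # optional politeness; this spider only issues one request anyway
```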
Set ITEM_PIPELINES
```python
ITEM_PIPELINES = {
    'funds.pipelines.FundsPipeline': 300,
}
```
In pipelines.py, write the data into my local database:
```python
import pymysql.cursors


class FundsPipeline(object):
    def process_item(self, item, spider):
        # Connect to the database
        connection = pymysql.connect(host='localhost',
                                     user='root',
                                     password='123',
                                     db='test',
                                     charset='utf8mb4',
                                     cursorclass=pymysql.cursors.DictCursor)
        # A parameterized query lets the driver handle quoting/escaping
        sql = """INSERT INTO funds(code, name, unitNetWorth, day, dayOfGrowth,
                     recent1Week, recent1Month, recent3Month, recent6Month,
                     recent1Year, recent2Year, recent3Year, fromThisYear,
                     fromBuild, serviceCharge, upEnoughAmount)
                 VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        params = (item['code'], item['name'], item['unitNetWorth'], item['day'],
                  item['dayOfGrowth'], item['recent1Week'], item['recent1Month'],
                  item['recent3Month'], item['recent6Month'], item['recent1Year'],
                  item['recent2Year'], item['recent3Year'], item['fromThisYear'],
                  item['fromBuild'], item['serviceCharge'], item['upEnoughAmount'])
        try:
            with connection.cursor() as cursor:
                cursor.execute(sql, params)  # execute the INSERT
            connection.commit()  # commit the transaction
        finally:
            connection.close()
        return item
```
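The pipeline assumes a funds table already exists in the test database. The original post does not show its DDL; a plausible schema, storing every field as a string just as the spider emits it, might look like this:
```sql
-- Hypothetical schema; adjust types and lengths to taste.
-- utf8mb4 avoids the charset error described below.
CREATE TABLE funds (
    id INT AUTO_INCREMENT PRIMARY KEY,
    code VARCHAR(10),
    name VARCHAR(100),
    unitNetWorth VARCHAR(20),
    day VARCHAR(20),
    dayOfGrowth VARCHAR(20),
    recent1Week VARCHAR(20),
    recent1Month VARCHAR(20),
    recent3Month VARCHAR(20),
    recent6Month VARCHAR(20),
    recent1Year VARCHAR(20),
    recent2Year VARCHAR(20),
    recent3Year VARCHAR(20),
    fromThisYear VARCHAR(20),
    fromBuild VARCHAR(20),
    serviceCharge VARCHAR(20),
    upEnoughAmount VARCHAR(20)
) DEFAULT CHARSET = utf8mb4;
```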
Error handling
```
ModuleNotFoundError: No module named 'pymysql'
```
Fix: install the driver:
```
pip install PyMySQL
```
1366, "Incorrect string value: '\xE6\x99\xAF\xE9\xA1\xBA...' for column 'name' at row 1"
Chinese is garbled
Check the database code, which is latin1.
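The character set can be verified with:
```sql
SHOW CREATE TABLE funds;
```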
Solution:
```sql
ALTER TABLE funds CONVERT TO CHARACTER SET utf8;
```
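Note that the pipeline connects with charset='utf8mb4'; converting the table to utf8mb4 instead would keep the table and the connection consistent and also cover 4-byte characters:
```sql
ALTER TABLE funds CONVERT TO CHARACTER SET utf8mb4;
```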