1. Prerequisite environment
Install PyCharm, and install the Scrapy framework in PyCharm.
2. Create a Scrapy project
Find Terminal in the lower left corner of PyCharm, as shown below:
Click Terminal to enter the command line window, as shown below:
In this window, execute the following command to create a Scrapy project:
scrapy startproject <project name>
Example: scrapy startproject menu
Enter the project directory, find the spiders folder, and create a spider in it with the following command:
scrapy genspider <spider name> <domain>
Example: scrapy genspider getmenu https://www.meishij.net/
At this point, our project directory looks like this:
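Since the screenshot is not reproduced here, a rough sketch of the layout that a project created this way typically has (project name menu, spider getmenu; all files except getmenu.py are generated by Scrapy itself) is:

menu/
    scrapy.cfg            # deployment configuration
    menu/
        __init__.py
        items.py          # item definitions
        middlewares.py    # downloader / spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            getmenu.py    # the spider created by genspider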
To make the crawler easy to run, I created a start.py file to start it.
start.py
from scrapy import cmdline

cmdline.execute('scrapy crawl getmenu'.split())  # replace getmenu with your spider's name; the spider created above is named getmenu
3. Modify basic configuration
3.1 configure the simulated request (User-Agent)
In the configuration file settings.py, find the line containing USER_AGENT and change its value to your browser's user agent:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
How to find your user agent: open the browser, visit any web page, press F12 or right-click - Inspect Element, and in the window that appears click "Network"; the window looks like this:
Then refresh the page and a list of requests will be displayed, as shown in the following figure:
Click any request and its detailed request information will be displayed; the user agent can be found in that detailed information.
3.2 configure the crawl interval
In the configuration file settings.py, find DOWNLOAD_DELAY and set the interval you want, in seconds (s):
DOWNLOAD_DELAY = 0.1  # wait 0.1 s between requests
Of course, there are many other parameters in settings.py that can be set; if you do not need them for now, you can leave them unconfigured.
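For reference, a few other commonly adjusted settings look like this (a minimal sketch; the values below are illustrative assumptions, not necessarily what this project uses):

# settings.py -- illustrative values, adjust to your needs
ROBOTSTXT_OBEY = False        # whether to respect robots.txt (True by default in new projects)
CONCURRENT_REQUESTS = 16      # maximum number of concurrent requests
COOKIES_ENABLED = False       # disable cookies if the site does not need them
LOG_LEVEL = 'INFO'            # reduce console noise (default is 'DEBUG')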
4. Write the code of the crawler
When crawling web page data, we often need to simply analyze the structure of the web page. Only in this way can we write an efficient crawler program.
4.1 determine the target web address of the crawler
My target website is: https://www.meishij.net/fenlei/chaofan/ (the fried-rice category of meishij.net, as shown in the figure below)
4.2 determine the data items to be crawled
The data items I want to crawl include: recipe name, recipe picture, preparation steps, main ingredients, auxiliary ingredients and other recipe details. The target page above only displays the recipe picture and name, which is not detailed enough for my requirements, so a second (or further) URL request is needed to reach each recipe's details page.
After determining the data items to be crawled, write the items.py file to make it easier to store and manipulate the crawled data (of course, if you just want to see what you crawl, you can simply print each item inside the spider). items.py is as follows:
import scrapy


class MenuItem(scrapy.Item):
    name = scrapy.Field()       # recipe name
    img = scrapy.Field()        # recipe picture
    step = scrapy.Field()       # preparation steps
    material1 = scrapy.Field()  # main ingredients
    material2 = scrapy.Field()  # auxiliary ingredients
    energy = scrapy.Field()     # calories
    sugar = scrapy.Field()      # sugar content
    fat = scrapy.Field()        # fat content
    needtime = scrapy.Field()   # time required
    uptime = scrapy.Field()     # upload time (not crawled; insert the system time when each record is generated)
    level = scrapy.Field()      # difficulty level
4.3 writing Crawlers
Above, we have planned what data to crawl from what website, and the next step is to write a crawler.
First of all, we need to make sure we can crawl the entire page from the target URL. The following code verifies whether the page https://www.meishij.net/fenlei/chaofan/ can be crawled.
from datetime import datetime

import scrapy

from ..items import MenuItem


class GetmenuSpider(scrapy.Spider):
    name = 'getmenu'
    allowed_domains = ['meishij.net']
    start_urls = ['https://www.meishij.net/fenlei/chaofan/']  # target web address to crawl

    def parse(self, response):
        print(response.text)  # print the crawled page
Then run start.py to execute the crawler; the results are as follows:
From the print results in the figure, we can see that the content of this website can be crawled successfully.
Next, check the data items defined in items.py against this website to see whether it can provide the data you need.
On inspection, only the recipe names and pictures can be crawled from this page (https://www.meishij.net/fenlei/chaofan/); data items such as the preparation steps, main ingredients, calories and sugar content are not stored on this page at all. So we click into a recipe at random and look at its details page (taking https://www.meishij.net/zuofa/huangjindanchaofan_22.html as an example), as follows:
To my pleasant surprise, all the data items I want to crawl are on this page, so all I need to do is jump from the target website where the crawler starts (https://www.meishij.net/fenlei/chaofan/) to the corresponding recipe details page, and crawl there.
So, how do we jump from the target website to the corresponding recipe details page? We need to analyze the source code of the target website properly.
On the target page, press F12 or right-click - Inspect Element to open the following window:
Right-click on a recipe - Inspect Element
Click as follows:
As can be seen from the above figure, under the div of class = "imgw", there is an a tag of class = "list_s2_item_img". Its href attribute is the link of recipe details. Therefore, if you want to enter the details of each recipe, you should enter it through this link.
At this time, the crawler code is as follows:
import scrapy


class GetmenuSpider(scrapy.Spider):
    name = 'getmenu'
    allowed_domains = ['meishij.net']
    start_urls = ['https://www.meishij.net/fenlei/chaofan/']

    def parse(self, response):
        # links to the recipe details pages
        urls = response.xpath('//div[@class="imgw"]/a[@class="list_s2_item_img"]/@href').extract()
        # each crawled link is handed to the next function, which crawls the recipe details
        for url in urls:
            yield scrapy.Request(url, callback=self.parse2, dont_filter=True)  # submit url to the next level

    def parse2(self, response):
        print(response.text)
In the code above, the parse function handles the start page: it only extracts the link of each recipe on the page and submits each link to the parse2 function, which crawls the details of a single recipe.
With this approach, starting from the target page (https://www.meishij.net/fenlei/chaofan/) we can enter the details pages of all 21 recipes listed on it. Below, the parse2 function is written to crawl the recipe details page.
4.3.1 crawl the recipe name, difficulty, time required, main ingredients and auxiliary ingredients
Open a recipe details page, move the mouse to the recipe name, and then right-click - Inspect Element, as shown in the figure below:
As can be seen from the figure above, the recipe name is stored in the h1 tag under the div with class="recipe_title", which in turn is contained in the div with class="recipe_header_info". Moving the mouse to the div with class="recipe_header_info" shows that it stores the following information:
As can be seen from the figure, the recipe name, difficulty, required time, main ingredients and auxiliary ingredients are all in this div, and these are exactly the data we need to crawl. At this point, the crawler code is as follows:
import scrapy

from ..items import MenuItem


class GetmenuSpider(scrapy.Spider):
    name = 'getmenu'
    allowed_domains = ['meishij.net']
    start_urls = ['https://www.meishij.net/fenlei/chaofan/']

    def parse(self, response):
        urls = response.xpath('//div[@class="imgw"]/a[@class="list_s2_item_img"]/@href').extract()
        for url in urls:
            yield scrapy.Request(url, callback=self.parse2, dont_filter=True)  # submit url to the next level

    def parse2(self, response):
        menuitem = MenuItem()
        # recipe name
        menuitem['name'] = response.xpath('//div[@class="recipe_header_c"]/div[2]/h1/text()').extract()
        # main ingredients, joined into one string
        mat1 = ""
        material1list = response.xpath(
            '//div[@class="recipe_ingredientsw"]/div[1]/div[2]/strong/a/text()').extract()
        for i in range(0, len(material1list)):
            mat1 = mat1 + material1list[i]
        menuitem['material1'] = mat1
        # auxiliary ingredients, joined into one string
        mat2 = ""
        material2list = response.xpath(
            '//div[@class="recipe_ingredientsw"]/div[2]/div[2]/strong/a/text()').extract()
        for i in range(0, len(material2list)):
            mat2 = mat2 + material2list[i]
        menuitem['material2'] = mat2
        # difficulty level
        menuitem['level'] = response.xpath('//div[@class="info2"]/div[4]/strong/text()').extract()
        # time required
        menuitem['needtime'] = response.xpath('//div[@class="info2"]/div[3]/strong/text()').extract()
        print(menuitem)
Run start.py to start the crawler; the results are shown in the figure below:
As can be seen from the figure, we have obtained the recipe name, difficulty, required time, main ingredients and auxiliary ingredients. However, except for the main and auxiliary ingredients, every other data item is a list, which is inconvenient to work with (for example, list-type data cannot be saved directly into a MySQL database), so we convert the list-type data into strings. The crawler code is now as follows:
import scrapy

from ..items import MenuItem


class GetmenuSpider(scrapy.Spider):
    name = 'getmenu'
    allowed_domains = ['meishij.net']
    start_urls = ['https://www.meishij.net/fenlei/chaofan/']

    def parse(self, response):
        urls = response.xpath('//div[@class="imgw"]/a[@class="list_s2_item_img"]/@href').extract()
        for url in urls:
            yield scrapy.Request(url, callback=self.parse2, dont_filter=True)  # submit url to the next level

    def parse2(self, response):
        menuitem = MenuItem()
        # recipe name
        menuitem['name'] = response.xpath('//div[@class="recipe_header_c"]/div[2]/h1/text()').extract()
        # main ingredients, joined into one string
        mat1 = ""
        material1list = response.xpath(
            '//div[@class="recipe_ingredientsw"]/div[1]/div[2]/strong/a/text()').extract()
        for i in range(0, len(material1list)):
            mat1 = mat1 + material1list[i]
        menuitem['material1'] = mat1
        # auxiliary ingredients, joined into one string
        mat2 = ""
        material2list = response.xpath(
            '//div[@class="recipe_ingredientsw"]/div[2]/div[2]/strong/a/text()').extract()
        for i in range(0, len(material2list)):
            mat2 = mat2 + material2list[i]
        menuitem['material2'] = mat2
        # difficulty level
        menuitem['level'] = response.xpath('//div[@class="info2"]/div[4]/strong/text()').extract()
        # time required
        menuitem['needtime'] = response.xpath('//div[@class="info2"]/div[3]/strong/text()').extract()
        # format conversion -- list to string
        menuitem['name'] = ''.join(menuitem['name'])
        menuitem['level'] = ''.join(menuitem['level'])
        menuitem['needtime'] = ''.join(menuitem['needtime'])
        print(menuitem)
Run start.py to start the crawler; the results are shown in the figure below:
So far, we have successfully crawled the recipe name, difficulty, required time, main ingredients and auxiliary ingredients. The data still to be crawled include:
4.3.2 crawl the recipe picture link
Similarly to the method above, it is not difficult to find that the picture link is the src attribute of the img tag under the first div inside the div with class="recipe_header_c".
Therefore, the following code should be added to parse2 of the crawler code
# recipe picture
menuitem['img'] = response.xpath('//div[@class="recipe_header_c"]/div[1]/img/@src').extract()
4.3.3 crawl the recipe's preparation steps
Similarly to the methods above, it is not difficult to find that there are many divs with class="recipe_step"; after verification, these are the divs that store the steps, one step per div. To get all the steps, we therefore take every such div and extract the text of the p tag inside its div with class="step_content", which is the content of each step.
Therefore, in order to crawl the recipe steps, the following code should be added to parse2 of the crawler Code:
# steps, joined into one string
steplist = response.xpath('//div[@class="recipe_step"]/div[@class="step_content"]/p/text()').extract()  # all steps of one recipe
step = ""
for i in range(0, len(steplist)):  # traverse the step list and build one string as the recipe steps
    step = step + steplist[i]
menuitem['step'] = step
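As a small aside, since steplist is a list of strings, the loop above can equivalently be written with str.join, which is the more idiomatic form:

# equivalent to the loop above
menuitem['step'] = ''.join(steplist)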
4.3.4 crawl calories, sugar content and fat content
Similarly to the above method, it is not difficult to find that calories, sugar content and fat content are included in the div of class="jztbox", as shown in the figure below
On inspection, the calorie value sits under the third div inside the div with class="jztbox", the sugar content under the second div, and the fat content under the fourth div (in each case the text is in the first div inside it), so the following code should be added to parse2 of the crawler code:
menuitem['energy'] = response.xpath('//div[@class="jztbox"]/div[3]/div[1]/text()').extract()
menuitem['sugar'] = response.xpath('//div[@class="jztbox"]/div[2]/div[1]/text()').extract()
menuitem['fat'] = response.xpath('//div[@class="jztbox"]/div[4]/div[1]/text()').extract()
At this time, the crawler code is as follows:
import scrapy

from ..items import MenuItem


class GetmenuSpider(scrapy.Spider):
    name = 'getmenu'
    allowed_domains = ['meishij.net']
    start_urls = ['https://www.meishij.net/fenlei/chaofan/']

    def parse(self, response):
        urls = response.xpath('//div[@class="imgw"]/a[@class="list_s2_item_img"]/@href').extract()
        for url in urls:
            yield scrapy.Request(url, callback=self.parse2, dont_filter=True)  # submit url to the next level

    def parse2(self, response):
        menuitem = MenuItem()
        # recipe name
        menuitem['name'] = response.xpath('//div[@class="recipe_header_c"]/div[2]/h1/text()').extract()
        # main ingredients, joined into one string
        mat1 = ""
        material1list = response.xpath(
            '//div[@class="recipe_ingredientsw"]/div[1]/div[2]/strong/a/text()').extract()
        for i in range(0, len(material1list)):
            mat1 = mat1 + material1list[i]
        menuitem['material1'] = mat1
        # auxiliary ingredients, joined into one string
        mat2 = ""
        material2list = response.xpath(
            '//div[@class="recipe_ingredientsw"]/div[2]/div[2]/strong/a/text()').extract()
        for i in range(0, len(material2list)):
            mat2 = mat2 + material2list[i]
        menuitem['material2'] = mat2
        # difficulty level
        menuitem['level'] = response.xpath('//div[@class="info2"]/div[4]/strong/text()').extract()
        # time required
        menuitem['needtime'] = response.xpath('//div[@class="info2"]/div[3]/strong/text()').extract()
        # recipe picture
        menuitem['img'] = response.xpath('//div[@class="recipe_header_c"]/div[1]/img/@src').extract()
        # steps, joined into one string
        steplist = response.xpath('//div[@class="recipe_step"]/div[@class="step_content"]/p/text()').extract()
        step = ""
        for i in range(0, len(steplist)):
            step = step + steplist[i]
        menuitem['step'] = step
        # calories, sugar content and fat content
        menuitem['energy'] = response.xpath('//div[@class="jztbox"]/div[3]/div[1]/text()').extract()
        menuitem['sugar'] = response.xpath('//div[@class="jztbox"]/div[2]/div[1]/text()').extract()
        menuitem['fat'] = response.xpath('//div[@class="jztbox"]/div[4]/div[1]/text()').extract()
        # format conversion -- list to string
        menuitem['name'] = ''.join(menuitem['name'])
        menuitem['level'] = ''.join(menuitem['level'])
        menuitem['needtime'] = ''.join(menuitem['needtime'])
        menuitem['img'] = ''.join(menuitem['img'])
        menuitem['energy'] = ''.join(menuitem['energy'])
        menuitem['sugar'] = ''.join(menuitem['sugar'])
        menuitem['fat'] = ''.join(menuitem['fat'])
        print(menuitem)
5. Store crawler data to database (MySQL)
5.1 install the MySQLdb package
MySQLdb is used to connect to the database from Python. Download the file mysqlclient-2.0.1-cp37-cp37m-win_amd64.whl (extraction code: 6666), copy it into the PyCharm workspace (my workspace location is customized), then in the Terminal command window enter the folder containing the file, execute pip install mysqlclient-2.0.1-cp37-cp37m-win_amd64.whl, and wait for the installation to complete.
To verify that the installation succeeded, you only need to run a .py file containing:
import MySQLdb
If no error is reported, the installation was successful.
5.2 preliminary preparation of database
Create a new database with the name of menu
Create a user with username user2 and password 123
Create a table named cookbook2
The structure of cookbook2 is as follows (pay attention to the character set: choose one that can store Chinese, otherwise you may get garbled text or insertion failures):
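The figure with the exact table structure is not reproduced here; as a minimal sketch of the preparation done from Python with MySQLdb, the following can be used. The column names follow the INSERT statement used by the pipeline below, but the column types and lengths, the id column, and the root credentials are my assumptions, not necessarily what the figure shows:

import MySQLdb

# connect as an account allowed to create databases and users (root password here is an assumption)
conn = MySQLdb.connect(host='127.0.0.1', user='root', passwd='root', port=3306, charset='utf8')
cur = conn.cursor()

# database and table; utf8mb4 so that Chinese text is stored correctly
cur.execute("CREATE DATABASE IF NOT EXISTS menu DEFAULT CHARACTER SET utf8mb4")
cur.execute("""
    CREATE TABLE IF NOT EXISTS menu.cookbook2 (
        id INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255),
        step TEXT,
        sugar VARCHAR(64),
        energy VARCHAR(64),
        fat VARCHAR(64),
        material1 VARCHAR(255),
        material2 VARCHAR(255),
        needtime VARCHAR(64),
        img VARCHAR(512),
        level VARCHAR(64)
    ) DEFAULT CHARACTER SET utf8mb4
""")

# the user the pipeline will connect with (requires MySQL 5.7+ for IF NOT EXISTS;
# use '%' instead of 'localhost' if you connect from another host)
cur.execute("CREATE USER IF NOT EXISTS 'user2'@'localhost' IDENTIFIED BY '123'")
cur.execute("GRANT ALL PRIVILEGES ON menu.* TO 'user2'@'localhost'")
cur.execute("FLUSH PRIVILEGES")

conn.commit()
cur.close()
conn.close()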
5.3 database connection configuration
In the settings.py file, add the following code:
mysql_host = '127.0.0.1'   # address
mysql_user = 'user2'       # authorized user name
mysql_passwd = '123'       # authorized password
mysql_db = 'menu'          # database name
mysql_tb = 'cookbook2'     # table name
mysql_port = 3306          # port, 3306 by default
In the settings.py file, find ITEM_PIPELINES and add the following line:
'menu.pipelines.MenusqlPipeline': 2,
My ITEM_PIPELINES are as follows:
ITEM_PIPELINES = {
    # the smaller the number, the higher the priority
    # format: 'project.pipelines.ClassName', where the class is defined in pipelines.py,
    # e.g. menu.pipelines.MenusqlPipeline
    'menu.pipelines.MenuPipeline': 300,
    'menu.pipelines.MenusqlPipeline': 2,  # submit items to the database
}
In pipelines.py, add the following code:
import MySQLdb

from .settings import mysql_host, mysql_db, mysql_user, mysql_passwd, mysql_port


class MenusqlPipeline(object):
    def __init__(self):
        self.connection = MySQLdb.connect(host=mysql_host, user=mysql_user, passwd=mysql_passwd,
                                          db=mysql_db, port=mysql_port, charset='utf8')
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        sql = ("insert into cookbook2(name,step,sugar,energy,fat,material1,needtime,img,level,material2) "
               "values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")
        # note: params must be appended in the same order as the columns in the sql statement,
        # otherwise values end up in the wrong columns or the insert fails
        params = list()
        params.append(item['name'])
        params.append(item['step'])
        params.append(item['sugar'])
        params.append(item['energy'])
        params.append(item['fat'])
        params.append(item['material1'])
        params.append(item['needtime'])
        params.append(item['img'])
        params.append(item['level'])
        params.append(item['material2'])
        self.cursor.execute(sql, tuple(params))
        self.connection.commit()
        return item

    def __del__(self):
        self.cursor.close()
        self.connection.close()
5.4 start crawler
Add return menuitem at the end of parse2 in the crawler, as follows:
import scrapy

from ..items import MenuItem


class GetmenuSpider(scrapy.Spider):
    name = 'getmenu'
    allowed_domains = ['meishij.net']
    start_urls = ['https://www.meishij.net/fenlei/chaofan/']

    def parse(self, response):
        urls = response.xpath('//div[@class="imgw"]/a[@class="list_s2_item_img"]/@href').extract()
        for url in urls:
            yield scrapy.Request(url, callback=self.parse2, dont_filter=True)  # submit url to the next level

    def parse2(self, response):
        menuitem = MenuItem()
        # recipe name
        menuitem['name'] = response.xpath('//div[@class="recipe_header_c"]/div[2]/h1/text()').extract()
        # main ingredients, joined into one string
        mat1 = ""
        material1list = response.xpath(
            '//div[@class="recipe_ingredientsw"]/div[1]/div[2]/strong/a/text()').extract()
        for i in range(0, len(material1list)):
            mat1 = mat1 + material1list[i]
        menuitem['material1'] = mat1
        # auxiliary ingredients, joined into one string
        mat2 = ""
        material2list = response.xpath(
            '//div[@class="recipe_ingredientsw"]/div[2]/div[2]/strong/a/text()').extract()
        for i in range(0, len(material2list)):
            mat2 = mat2 + material2list[i]
        menuitem['material2'] = mat2
        # difficulty level
        menuitem['level'] = response.xpath('//div[@class="info2"]/div[4]/strong/text()').extract()
        # time required
        menuitem['needtime'] = response.xpath('//div[@class="info2"]/div[3]/strong/text()').extract()
        # recipe picture
        menuitem['img'] = response.xpath('//div[@class="recipe_header_c"]/div[1]/img/@src').extract()
        # steps, joined into one string
        steplist = response.xpath('//div[@class="recipe_step"]/div[@class="step_content"]/p/text()').extract()
        step = ""
        for i in range(0, len(steplist)):
            step = step + steplist[i]
        menuitem['step'] = step
        # calories, sugar content and fat content
        menuitem['energy'] = response.xpath('//div[@class="jztbox"]/div[3]/div[1]/text()').extract()
        menuitem['sugar'] = response.xpath('//div[@class="jztbox"]/div[2]/div[1]/text()').extract()
        menuitem['fat'] = response.xpath('//div[@class="jztbox"]/div[4]/div[1]/text()').extract()
        # format conversion -- list to string
        menuitem['name'] = ''.join(menuitem['name'])
        menuitem['level'] = ''.join(menuitem['level'])
        menuitem['needtime'] = ''.join(menuitem['needtime'])
        menuitem['img'] = ''.join(menuitem['img'])
        menuitem['energy'] = ''.join(menuitem['energy'])
        menuitem['sugar'] = ''.join(menuitem['sugar'])
        menuitem['fat'] = ''.join(menuitem['fat'])
        # print(menuitem)
        return menuitem
Run start.py to start the crawler, then check whether the data was inserted into the database: the insert succeeded!
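If you prefer to check from code rather than from a MySQL client, a small verification script along these lines will do (a sketch using the database, table and user configured above):

import MySQLdb

conn = MySQLdb.connect(host='127.0.0.1', user='user2', passwd='123',
                       db='menu', port=3306, charset='utf8')
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM cookbook2")
print("rows in cookbook2:", cur.fetchone()[0])  # should match the number of recipes crawled
cur.execute("SELECT name, needtime, level FROM cookbook2 LIMIT 3")
for row in cur.fetchall():
    print(row)  # a few sample rows
cur.close()
conn.close()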