Scrapy framework -- crawling images with ImagesPipeline

Requirement

Crawl the images from the Webmaster Material site (sc.chinaz.com)

Crawl the image data with ImagesPipeline

Manual image crawling

Use xpath to parse the src attribute of each image, then send a separate request to that image address to obtain the image's binary data.
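
A minimal sketch of this manual approach, using plain requests outside of Scrapy (the folder ./photo and the helper name download_images are just for illustration):

import os
import requests

def download_images(src_list, folder="./photo"):
    # save each image's binary data under `folder`, named by the last URL segment
    os.makedirs(folder, exist_ok=True)
    for src in src_list:
        img_data = requests.get(src).content  # request the image address to get the bytes
        file_name = src.split("/")[-1]
        with open(os.path.join(folder, file_name), "wb") as fp:
            fp.write(img_data)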

ImagesPipeline

You only need to parse the src attribute of the img tags and submit it to the pipeline; the pipeline then requests each image address itself, receives the image binary, and stores it persistently.

Usage workflow

  1. Parse the data (get the image addresses)
  2. Store the image address in an item and submit it to the specified pipeline class
  3. Write a custom pipeline class based on ImagesPipeline in the pipelines file (a minimal sketch follows this list)
    - the get_media_requests method sends the request for the image
    - the file_path method customizes the image file name
    - the item_completed method passes the item to the next pipeline class to be executed
  4. In the configuration file
    - specify where images are stored: IMAGES_STORE = "./photo"
    - enable the custom pipeline class in ITEM_PIPELINES
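
A minimal sketch of what step 3 looks like (the field name src matches the item defined later in this post; the full version appears in the code implementation section below):

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class ImgPipeline(ImagesPipeline):
    # step 2 handed us an item holding the image address; request it here
    def get_media_requests(self, item, info):
        yield scrapy.Request(url=item["src"])

    # name the saved file after the last segment of the URL
    def file_path(self, request, response=None, info=None):
        return request.url.split("/")[-1]

    # pass the item on to the next enabled pipeline class
    def item_completed(self, results, item, info):
        return item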

Points to note

When we parse the image addresses with xpath, we find that the addresses we crawl back are empty.

The main spider file is as follows:

import scrapy

class PhotoSpider(scrapy.Spider):
    name = 'photo'

    # allowed_domains = ['www.xxx.com']

    start_urls = ['https://sc.chinaz.com/tupian/']

    def parse(self, response):
        src_list = response.xpath('//div[@id="container"]/div/div/a/img/@src').extract()
        print(src_list)
        for src in src_list:
            src = "https:" + src
            print(src)


Why?

When we open the web page and inspect its source, we find that these image addresses use two different attributes: src and src2.



So what's the difference between the two?

When we scroll down the page with the inspector open, we find that once an image enters the visible area, its attribute changes from src2 to src (the page lazy-loads its images).

Therefore, when parsing with xpath during crawling, the attribute we select should be src2.

Main spider file

import scrapy
from photopro.items import PhotoproItem
import re

class PhotoSpider(scrapy.Spider):
    name = 'photo'

    # allowed_domains = ['www.xxx.com']

    start_urls = ['https://sc.chinaz.com/tupian/']

    def parse(self, response):
        src_list = response.xpath('//div[@id="container"]/div/div/a/img/@src2').extract()
        for src in src_list:
            src = "https:" + src
            print(src)


At the same time, by opening an image's detail page, we find that the crawled address is only a thumbnail: the original image address differs from the crawled one only by an _s suffix.

Crawled address: https://scpic3.chinaz.net/Files/pic/pic9/202107/bpic23822_s.jpg
HD address: https://scpic3.chinaz.net/Files/pic/pic9/202107/bpic23822.jpg
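
Dropping the _s suffix converts a crawled thumbnail address into the original-size address; for example:

import re

thumb = "https://scpic3.chinaz.net/Files/pic/pic9/202107/bpic23822_s.jpg"
hd = re.sub("_s", "", thumb)  # remove the thumbnail marker
print(hd)  # https://scpic3.chinaz.net/Files/pic/pic9/202107/bpic23822.jpg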

🆗, knowing these two points, we can write the code.

Code implementation

Custom pipeline class: pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


# class PhotoproPipeline:
#     def process_item(self, item, spider):
#         return item

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class ImgPipeline(ImagesPipeline):
    # Request the image address stored in the item
    def get_media_requests(self, item, info):
        yield scrapy.Request(url=item["src"])

    # Customize the name of the saved image file
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split("/")[-1]
        return file_name

    def item_completed(self, results, item, info):
        return item  # the return value is passed to the next pipeline class to be executed

items.py code

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class PhotoproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    src = scrapy.Field()
    pass

Main spider file code

import scrapy
from photopro.items import PhotoproItem
import re

class PhotoSpider(scrapy.Spider):
    name = 'photo'

    # allowed_domains = ['www.xxx.com']

    start_urls = ['https://sc.chinaz.com/tupian/']

    def parse(self, response):
        src_list = response.xpath('//div[@id="container"]/div/div/a/img/@src2').extract()
        for src in src_list:
            src = "https:" + src  # the src2 value has no protocol prefix
            src = re.sub("_s", "", src)  # drop the _s suffix to get the original-size image
            item = PhotoproItem()
            item["src"] = src
            # print(src)
            yield item

Configuration file (settings.py)


Enable the custom pipeline class in ITEM_PIPELINES and set IMAGES_STORE; the class path you register (photopro.pipelines.ImgPipeline) must be the custom pipeline class defined above in pipelines.py.
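
A sketch of the relevant part of settings.py (the dotted path must point at the ImgPipeline class written in pipelines.py; the priority value 300 is just a typical choice):

# settings.py (excerpt)

# folder where the downloaded images are stored
IMAGES_STORE = "./photo"

# enable the self-defined pipeline class
ITEM_PIPELINES = {
    "photopro.pipelines.ImgPipeline": 300,
}

Note that ImagesPipeline relies on the Pillow library, so it must be installed. With the settings in place, running scrapy crawl photo should download the full-size images into the ./photo folder.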

Run results

