[Data Acquisition and Fusion Technology] Assignment 3
Assignment ①
- Requirements: specify a website and crawl all the pictures on it, e.g. the China Meteorological Network (http://www.weather.com.cn). Use single-threaded and multi-threaded crawling respectively. (The number of pictures to crawl is limited to the last 3 digits of the student number.)
- Output information:
Output the downloaded URL information on the console, store the downloaded images in the images subfolder, and give a screenshot.
1) Completion process:
- General idea: after visiting the specified web page, crawl all the picture URLs and download the pictures. If the number is not enough, crawl all the links on the page, visit them in turn, and download the pictures on each new page. Repeat this process until enough pictures have been downloaded.
1. Check the required information in the HTML document of the web page through the browser
- Need the picture URLs and the new links
- Picture URL (src attribute under the img tag):
- New link (href attribute under any tag):
2. Crawl the information according to the results of the previous step (a sketch follows this list)
- href attribute crawling
- Image URL crawling
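The extraction code itself is not reproduced above, so here is a minimal sketch of what this step could look like, assuming requests plus regular expressions; the helper name `extract_imgs_and_links` and the exact patterns are illustrative rather than taken from the submitted code.

```python
# A minimal sketch of step 2 (illustrative, not the submitted code)
import re
import requests

def extract_imgs_and_links(page_url):
    html = requests.get(page_url).text
    imgs = re.findall(r'<img[^>]*?src=["\'](.*?)["\']', html)   # src under img tags
    links = re.findall(r'href=["\'](.*?)["\']', html)           # href under any tag
    return imgs, links  # relative links may still need urljoin before visiting
```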
3. Image download (single-threaded and multi-threaded)
- Get the picture data through requests.get(url) and write it to a local file
```python
import requests

def img_download(pic_url, count):  # Download the picture according to its url
    if len(pic_url) > 4 and pic_url[-4] == '.':  # Choose the file suffix according to the url suffix
        end_with = pic_url[-4:]
    else:
        end_with = ''
    path = './EX3_1_images/' + 'img' + str(count) + end_with
    try:  # Download the picture
        resp = requests.get(pic_url)
        resp = resp.content
        f = open(path, 'wb')
        f.write(resp)
        f.close()
        print('Download img', str(count), ' successfully!')  # Report success
    except Exception as err:
        print('Fail to download img', str(count), ' with error as ', err, '!')  # Report failure
```
- Single-threaded download: execute the above image download function for each crawled image URL in turn (the overall flow is sketched below)
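Putting the pieces together, the single-threaded flow described in the general idea could look roughly like the sketch below. The limit of 115 pictures matches the last three digits of the student number; the page queue and the helper names are assumptions for illustration, not the submitted code.

```python
# Illustrative single-threaded crawl loop: visit pages one by one, download
# their pictures, and follow links until LIMIT pictures have been downloaded.
LIMIT = 115  # last 3 digits of the student number

def crawl_single_thread(start_url):
    count = 0
    to_visit = [start_url]              # pages waiting to be visited
    while to_visit and count < LIMIT:
        page = to_visit.pop(0)
        try:
            imgs, links = extract_imgs_and_links(page)  # sketch from step 2
        except Exception:
            continue                    # skip pages that fail to load
        for img in imgs:
            if count >= LIMIT:
                break
            img_download(img, count)    # the download function shown above
            count += 1
        to_visit.extend(links)          # follow links if pictures are not enough
```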
- Multi-threaded download: add the download tasks to new threads one by one (when the parameter t is True the download uses multithreading; when it is False the download uses a single thread)
```python
if t:  # t == True -> download with multiple threads
    each_page_threads = []
    for img in imgs:
        # one thread per picture; cc is the running picture counter
        T = threading.Thread(target=img_download, args=(img, cc))
        cc += 1
        T.setDaemon(False)   # non-daemon threads so the downloads can finish normally
        T.start()
        each_page_threads.append(T)
    for each_thread in each_page_threads:
        each_thread.join()   # wait for all downloads of this page to finish
```
4. Result display
2) Experience:
Most of the content is similar to the previous assignment, so it was relatively simple to implement. The picture download and page-jump logic could still be improved to avoid downloading duplicate pictures; a possible approach is sketched below.
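For the duplicate problem mentioned above, one straightforward option (not implemented in the submitted code) is to remember which image URLs and pages have already been handled:

```python
# Possible de-duplication helpers (illustrative only): skip pictures and pages
# that have already been handled by keeping them in sets.
downloaded = set()
visited = set()

def is_new_picture(pic_url):
    if pic_url in downloaded:
        return False
    downloaded.add(pic_url)
    return True

def is_new_page(page_url):
    if page_url in visited:
        return False
    visited.add(page_url)
    return True
```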
Assignment ②
- Requirements: use the Scrapy framework to reproduce Assignment ①
- Output information: the same as Assignment ①
1) Completion process
- General idea: the implementation steps are the same as in Assignment ①; the code just needs to be moved into the Scrapy framework. The spider does the image URL crawling and the page jumps, encapsulates each image URL into an item, and submits it to the pipeline for console output and image download.
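The items.py file is not reproduced below, but based on the fields the spider fills in (item['url'] and item['count']) it would look roughly like this:

```python
# items.py (sketch inferred from the spider's item['url'] / item['count'] usage)
import scrapy

class Ex32Item(scrapy.Item):
    url = scrapy.Field()    # picture url
    count = scrapy.Field()  # running number of the picture
```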
1. spider
```python
import scrapy
from scrapy.selector import Selector as selector
from ..items import Ex32Item


class Ex32Spider(scrapy.Spider):
    name = 'ex3_2'
    allowed_domains = ['www.weather.com.cn']
    start_urls = ['http://www.weather.com.cn/']
    count = 0  # Count the generated items, i.e. the crawled image URLs

    def parse(self, response):
        try:
            data = response.body.decode()
        except:
            return
        s = selector(text=data)
        pics = s.xpath('//img/@src').extract()  # Crawl the image URLs
        for pic in pics:  # Package the items one by one and pass them to the pipeline
            if self.count >= 115:  # Stop when the quantity is enough
                break
            if len(pic) < 5:  # Exclude malformed picture URLs
                continue
            item = Ex32Item()
            item['url'] = pic
            item['count'] = self.count
            self.count += 1
            yield item
        links = s.xpath('//@href').extract()  # Crawl the page links
        for link in links:
            if self.count >= 115:  # Exit when there are enough pictures
                break
            try:
                url = response.urljoin(link)
                # Request the new link with the same callback and repeat the process
                # on the new page until enough pictures are crawled
                yield scrapy.Request(url=url, callback=self.parse)
            except Exception:
                continue  # Try the next link when this one fails
```
2. pipelines
```python
import requests
from itemadapter import ItemAdapter


def img_download(pic_url, count):  # Download the picture according to its url
    if len(pic_url) > 4 and pic_url[-4] == '.':  # Choose the file suffix according to the url suffix
        end_with = pic_url[-4:]
    else:
        end_with = ''
    path = 'D:\\031904115\\Data Collecting EX\\EX3\\EX3_2_images/' \
           + 'img' + str(count) + end_with
    try:  # Download the picture
        resp = requests.get(pic_url)
        resp = resp.content
        f = open(path, 'wb')
        f.write(resp)
        f.close()
        print('Download img', str(count), ' successfully!')  # Report success
    except Exception as err:
        print('Fail to download img', str(count), ' with error as ', err, '!')  # Report failure


class Ex32Pipeline:
    def process_item(self, item, spider):
        print(str(item['count']), ':', item['url'])  # Show the picture information
        img_download(item['url'], item['count'])     # Download the picture
        return item
```
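For the pipeline above to receive items it also has to be enabled in settings.py. A minimal excerpt could look like the following; the package name EX3_2 is an assumption about the project layout:

```python
# settings.py (excerpt, package name EX3_2 assumed)
ROBOTSTXT_OBEY = False   # otherwise requests disallowed by robots.txt would be dropped
ITEM_PIPELINES = {
    'EX3_2.pipelines.Ex32Pipeline': 300,  # route every item to the pipeline above
}
```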
3. Result display
2) Experience:
Assignment ② mainly converts the crawler of Assignment ① into the Scrapy framework. Going through the process made me more familiar with the framework and taught me more about how its components work together.
Assignment ③
- Requirements: crawl the Douban movie data using Scrapy and XPath, store the content in a database, and store the pictures in the imgs path
- Candidate site: https://movie.douban.com/top250
- Output information:
Serial number | Movie title | Director | Performer | Brief introduction | Film rating | Film cover
---|---|---|---|---|---|---
1 | The Shawshank Redemption | Frank Darabont | Tim Robbins | Hope sets people free | 9.7 | ./imgs/xsk.jpg
2 | ... | | | | |
1) Completion process:
- General idea: handle page turning in the spider, use XPath to obtain the information, and package it item by item. The pipeline then downloads the pictures and saves the data into the database.
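As in Assignment ②, items.py is not reproduced here; inferred from the fields the spider and pipeline use, it would look roughly like this:

```python
# items.py (sketch inferred from the fields the spider fills in)
import scrapy

class Ex33Item(scrapy.Item):
    rank = scrapy.Field()      # serial number
    title = scrapy.Field()     # movie title
    director = scrapy.Field()  # director
    actor = scrapy.Field()     # performer
    quote = scrapy.Field()     # brief introduction
    score = scrapy.Field()     # film rating
    img = scrapy.Field()       # film cover url
```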
1. Open the target web page and inspect the tag structure
- It is easy to see that all the information about one movie sits inside a single li tag; a quick check confirms that every required field is there
2. Try information crawling
- HTML document acquisition
- Movie information crawling
- Since the director and the actors are in the same text node, they need to be separated further
- In practice, the crawling method above turned out to be inaccurate for the actors and the introductions
- The reason is that some films on the page have no introduction or no actor information, so the following films get matched with the wrong introduction
- No introduction:
- No actors:
- Solution
- Replace the missing actor information with 'None'
- Replace the missing introduction with 'None' to prevent the subsequent fields from being misaligned
3. Page turning operation
- By observation and trial, modifying the start value in the URL controls the page number: page_num = int(start / 25) + 1, so start is best chosen as a multiple of 25 (illustrated below)
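In code form, this page-turning rule can be illustrated as follows (the spider below generates its start_urls in the same way):

```python
# Illustration of the page-turning rule: start = 0, 25, 50, ... covers pages 1-10
base = 'https://movie.douban.com/top250'
page_urls = [base + '?start=' + str(i * 25) for i in range(10)]
page_num = int(50 / 25) + 1   # e.g. start = 50 corresponds to page 3
```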
4. Code implementation
- spider
```python
import scrapy
from scrapy.selector import Selector as selector
from ..items import Ex33Item


class Ex33Spider(scrapy.Spider):
    name = 'ex3_3'
    allowed_domains = ['movie.douban.com/top250']
    start_urls = ['https://movie.douban.com/top250/']
    url = start_urls[0][:-1]
    for i in range(1, 10):
        start_urls.append(url + '?start=' + str(i * 25))  # Generate and load all page URLs

    def parse(self, response):
        try:
            data = response.body.decode()
        except:
            return
        s = selector(text=data)
        movies = s.xpath('//ol/li')  # Crawl the li tag of each movie
        ranks = movies.xpath('.//em/text()').extract()  # Crawl the ranking
        director_and_actor = movies.xpath('.//div[@class="bd"]/p/text()[position()=1]').extract()  # Crawl director and actor
        quotes = []
        scores = movies.xpath('.//div[@class="bd"]//span[@class="rating_num"]/text()').extract()  # Crawl the rating
        titles = movies.xpath('.//img/@alt').extract()  # Crawl the movie title
        imgs = movies.xpath('.//img/@src').extract()  # Crawl the movie cover url
        for movie in movies:  # Crawl the introduction movie by movie
            a = movie.xpath('.//div[@class="bd"]/p[@class="quote"]/span/text()').extract_first()
            quote = a if a else 'None'
            quotes.append(quote)
        directors = []
        actors = []
        for m in director_and_actor:  # Split director and actor, keeping only the first of each
            m = m.replace('\n', '')
            m = m.split(':')
            if len(m) > 1:
                director = m[1].split()[0]
                try:
                    actor = m[2].split()[0]
                except IndexError:
                    actor = 'None'
                directors.append(director)
                actors.append(actor)
        for i in range(len(movies)):  # Encapsulate the data into items
            item = Ex33Item()
            item['rank'] = ranks[i]
            item['title'] = titles[i]
            item['director'] = directors[i]
            item['actor'] = actors[i]
            item['quote'] = quotes[i]
            item['score'] = scores[i]
            item['img'] = imgs[i]
            yield item
```
- pipelines (the picture download and database operations were described in the previous assignments)
```python
import requests
import sqlite3
from itemadapter import ItemAdapter


class MovieDB:
    def openDB(self):
        self.con = sqlite3.connect("movies.db")
        self.cursor = self.con.cursor()
        try:  # Create the table (column names with spaces are quoted for SQLite)
            self.cursor.execute(
                'create table movies ("Serial number" varchar(16), "Movie title" varchar(16),'
                ' "Director" varchar(64), "Performer" varchar(32), "Brief introduction" varchar(100),'
                ' "Film rating" varchar(16), "Film cover" varchar(100),'
                ' constraint pk_movies primary key ("Serial number"))')
        except Exception:  # If the table already exists, delete the old data and insert again
            self.cursor.execute("delete from movies")

    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insert(self, item):
        try:  # Insert one movie record
            self.cursor.execute(
                'insert into movies ("Serial number", "Movie title", "Director", "Performer",'
                ' "Brief introduction", "Film rating", "Film cover") values (?,?,?,?,?,?,?)',
                (item['rank'], item['title'], item['director'], item['actor'],
                 item['quote'], item['score'], item['img']))
            print('film:', item['title'], ' Saved successfully!')
        except Exception as err:
            print(err)


def img_download(pic_url, title):  # Download the picture according to its url
    if len(pic_url) > 4 and pic_url[-4] == '.':  # Choose the file suffix according to the url suffix
        end_with = pic_url[-4:]
    else:
        end_with = ''
    path = 'D:\\031904115\\Data Collecting EX\\EX3\\EX3_3_images/' \
           + 'img' + title + end_with
    try:  # Download the picture
        resp = requests.get(pic_url)
        resp = resp.content
        f = open(path, 'wb')
        f.write(resp)
        f.close()
        print('Download img', title, ' successfully!')  # Report success
    except Exception as err:
        print('Fail to download img', title, ' with error as:', err)  # Report failure


class Ex33Pipeline:
    def __init__(self):  # Create the database helper object
        self.db = MovieDB()

    def open_spider(self, spider):  # Open the database when the spider starts
        self.db.openDB()

    def process_item(self, item, spider):  # Save the data and download the picture
        self.db.insert(item)
        img_download(item['img'], item['title'])
        return item

    def close_spider(self, spider):  # Close the database when the spider finishes
        self.db.closeDB()
```
5. Result display
- Console:
- Database content:
- Save results:
2) Experience:
Assignment ③ basically integrates all the crawler knowledge learned so far, and finishing it gives a real sense of achievement. Constantly hunting for mistakes along the way was a headache, but most of them were eventually fixed. There are certainly still shortcomings, and I hope to complete this kind of task more proficiently in the future.
Code
- Gitee repository: https://gitee.com/Poootato/crawl_project