Seriously good stuff! This lightweight crawler framework is on fire

1. Preface

As we all know, Python's most popular crawler framework is Scrapy, which is mainly used to scrape structured data from websites

Today, I recommend a simpler, more lightweight, yet powerful crawler framework: feapder

Project address:

https://github.com/Boris-code/feapder

2. Introduction and installation

Similar to Scrapy, feapder supports lightweight crawlers, distributed crawlers, batch crawlers, a crawler alarm mechanism, and more

The three built-in crawlers are as follows:

  • AirSpider: a lightweight crawler, suitable for simple scenarios with small amounts of data
  • Spider: a distributed crawler based on Redis, suitable for massive amounts of data; supports resumable crawling, automatic data storage, and more
  • BatchSpider: a distributed batch crawler, mainly used for crawls that need to run periodically

Before getting hands-on, install the dependency in a virtual environment

# Install the dependency
pip3 install feapder

3. Practice

We will use the simplest crawler, AirSpider, to scrape some simple data

Target website: aHR0cHM6Ly90b3BodWIudG9kYXkvIA==

The detailed implementation involves the following 5 steps

3-1 Create a crawler project

First, we use the "feapder create -p" command to create a crawler project

# Create a crawler project
feapder create -p tophub_demo
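
After the command runs, a project skeleton is generated; it looks roughly like this (file names may vary slightly between feapder versions):

# Generated project structure (approximate)
tophub_demo
├── items
│   └── __init__.py
├── spiders
│   └── __init__.py
├── main.py
└── settings.py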

3-2 Create an AirSpider crawler

From the command line, go into the spiders folder and use the "feapder create -s" command to create a crawler

cd spiders

# Create a lightweight crawler
feapder create -s tophub_spider 1

where:

  • 1 is the default and creates a lightweight crawler (AirSpider)
  • 2 creates a distributed crawler (Spider)
  • 3 creates a distributed batch crawler (BatchSpider)
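
The generated tophub_spider.py contains a runnable skeleton roughly like the following (the template details may differ across feapder versions); running the file with python starts the crawler:

# Generated crawler skeleton (approximate)
import feapder


class TophubSpider(feapder.AirSpider):
    def start_requests(self):
        # Seed request; we will point this at the real target below
        yield feapder.Request("https://tophub.today/")

    def parse(self, request, response):
        # Parsing logic goes here
        print(response)


if __name__ == "__main__":
    TophubSpider().start()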

3-3 Configure the database, create the data table, and create a mapping Item

Take MySQL as an example. First, create a data table in the database

# Create a data table
create table topic
(
    id         int auto_increment
        primary key,
    title      varchar(100)  null comment 'Article title',
    auth       varchar(20)   null comment 'Author',
    like_count int default 0 null comment 'Number of likes',
    collection int default 0 null comment 'Number of collections',
    comment    int default 0 null comment 'Number of comments'
);

In the project root directory, open the settings.py file and configure the database connection information

# settings.py

MYSQL_IP = "localhost"
MYSQL_PORT = 3306
MYSQL_DB = "xag"
MYSQL_USER_NAME = "root"
MYSQL_USER_PASS = "root"

Finally, create a mapping Item (optional)

Enter the items folder and use the "feapder create -i" command to create an Item file that maps to a database table

PS: since AirSpider does not support automatic data storage, this step is optional
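
For example, to generate the Item for the topic table created above (this assumes the table already exists and the database connection in settings.py is configured, since feapder reads the table schema to build the Item):

# Create a mapping Item for the "topic" table (optional for AirSpider)
feapder create -i topic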

3-4 Write the crawler and parse the data

Step 1: use "MysqlDB" to initialize a database connection object

import feapder
from feapder.db.mysqldb import MysqlDB


class TophubSpider(feapder.AirSpider):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # MysqlDB reads the connection info from settings.py
        self.db = MysqlDB()

Step 2: in the start_requests method, specify the main URL to crawl, and configure a random User-Agent through the "download_midware" keyword

import feapder
from fake_useragent import UserAgent

def start_requests(self):
    yield feapder.Request("https://tophub.today/", download_midware=self.download_midware)

def download_midware(self, request):
    # Random User-Agent
    # Dependency: pip3 install fake_useragent
    ua = UserAgent().random
    request.headers = {'User-Agent': ua}
    return request

Step 3: crawl the titles and link addresses on the home page

Use feapder's built-in xpath method to parse the data

def parse(self, request, response):
    # print(response.text)
    card_elements = response.xpath('//div[@class="cc-cd"]')

    # Filter for the target card element by its title text
    # (match the card title exactly as it appears on the site)
    buy_good_element = [card_element for card_element in card_elements if
                        card_element.xpath('.//div[@class="cc-cd-is"]//span/text()').extract_first() == "What's Worth Buying"][0]

    # Get the article titles and links inside the card
    a_elements = buy_good_element.xpath('.//div[@class="cc-cd-cb nano"]//a')

    for a_element in a_elements:
        # Title and link
        title = a_element.xpath('.//span[@class="t"]/text()').extract_first()
        href = a_element.xpath('.//@href').extract_first()

        # Issue a new task for each article, carrying the title along
        yield feapder.Request(href, download_midware=self.download_midware, callback=self.parser_detail_page,
                              title=title)

Step 4: crawl the data on the detail page

The previous step issued new tasks and specified a callback function through the "callback" keyword; the detail page data is then parsed in parser_detail_page

import re


def parser_detail_page(self, request, response):
    """
    Parse the article detail page
    :param request:
    :param response:
    :return:
    """
    title = request.title

    url = request.url

    # Parse the detail page for the author name and the numbers of likes, collections, and comments
    author = response.xpath('//a[@class="author-title"]/text()').extract_first().strip()

    print("Author:", author, "Title:", title, "URL:", url)

    desc_elements = response.xpath('//span[@class="xilie"]/span')

    print("desc count:", len(desc_elements))

    # Likes
    like_count = int(re.findall(r'\d+', desc_elements[1].xpath('./text()').extract_first())[0])
    # Collections
    collection_count = int(re.findall(r'\d+', desc_elements[2].xpath('./text()').extract_first())[0])
    # Comments
    comment_count = int(re.findall(r'\d+', desc_elements[3].xpath('./text()').extract_first())[0])

    print("Likes:", like_count, "Collections:", collection_count, "Comments:", comment_count)

3-5 Store the data in the database

Use the database object instantiated earlier to execute the SQL and insert the data into the database

# Insert into the database
sql = "INSERT INTO topic(title, auth, like_count, collection, comment) values('%s', '%s', %d, %d, %d)" % (
    title, author, like_count, collection_count, comment_count)

# Execute the SQL
self.db.execute(sql)
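
One caveat: building SQL with % string formatting breaks as soon as a title contains a quote character, and it is open to SQL injection. A safer sketch below uses a dict-based insert helper; add_smart is an assumption based on recent versions of feapder's MysqlDB, so verify it against your installed release:

# Hypothetical safer insert: let the library build and escape the statement
# NOTE: add_smart is assumed from recent feapder versions; verify locally
self.db.add_smart("topic", {
    "title": title,
    "auth": author,
    "like_count": like_count,
    "collection": collection_count,
    "comment": comment_count,
})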

4. Finally

This article introduced AirSpider, the simplest crawler in feapder, through a small hands-on example

feapder's more advanced features will be covered in detail in a later series of examples.

If you found this article helpful, please like, save, and share it. That is the strongest motivation for me to keep producing high-quality articles!


