1. Preface
As we all know, Python's most popular crawler framework is Scrapy, which is mainly used to crawl structured data from websites.
Today, we recommend a simpler, more lightweight, yet still powerful crawler framework: feapder
Project address:
https://github.com/Boris-code/feapder
2. Introduction and installation
Similar to Scrapy, feapder supports lightweight crawlers, distributed crawlers, batch crawlers, a crawler alarm mechanism, and other features
The three built-in crawlers are as follows:
- AirSpider: a lightweight crawler, suitable for simple scenarios with small amounts of data
- Spider: a distributed crawler based on Redis, suitable for massive amounts of data; supports resuming interrupted crawls, automatic data insertion into the database, and other features
- BatchSpider: a distributed batch crawler, mainly used for data that needs to be collected periodically
Before the hands-on practice, we install the required dependency library in a virtual environment:
```bash
# Install the dependency library
pip3 install feapder
```
3. Practice
We use the simplest crawler, AirSpider, to crawl some simple data
Target website: aHR0cHM6Ly90b3BodWIudG9kYXkvIA==
The detailed implementation consists of the following 5 steps:
3-1 Creating a crawler project
First, we use the "feapder create -p" command to create a crawler project:

```bash
# Create a crawler project
feapder create -p tophub_demo
```
3-2 creating a crawler AirSpider
From the command line, go to the spiders folder directory and use the "federader create - s" command to create a crawler
cd spiders #Create a lightweight crawler feapder create -s tophub_spider 1
where the trailing argument selects the crawler type (the generated template is shown below):

- 1 is the default and creates a lightweight crawler AirSpider
- 2 creates a distributed crawler Spider
- 3 creates a distributed batch crawler BatchSpider
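For reference, the generated tophub_spider.py is a minimal template roughly like the sketch below. This is based on feapder's scaffold; the exact template may vary by version. We will flesh it out in the next steps:

```python
import feapder


class TophubSpider(feapder.AirSpider):
    def start_requests(self):
        # The scaffold ships with a placeholder request to replace
        yield feapder.Request("https://www.baidu.com")

    def parse(self, request, response):
        print(response)


if __name__ == "__main__":
    TophubSpider().start()
```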
3-3 Configuring the database, creating the data table and the mapping Item
Take MySQL as an example. First, we create a data table in the database:

```sql
# Create a data table
create table topic
(
    id         int auto_increment primary key,
    title      varchar(100)  null comment 'Article title',
    auth       varchar(20)   null comment 'Author',
    like_count int default 0 null comment 'Number of likes',
    collection int default 0 null comment 'Number of collections',
    comment    int default 0 null comment 'Number of comments'
);
```
Open the settings.py file in the project root directory and configure the database connection information:

```python
# settings.py

MYSQL_IP = "localhost"
MYSQL_PORT = 3306
MYSQL_DB = "xag"
MYSQL_USER_NAME = "root"
MYSQL_USER_PASS = "root"
```
Finally, create a mapping Item (optional)
Enter the items folder and use the "feapder create -i" command to create a file that maps to the database table
PS: since AirSpider does not support automatic data insertion into the database, this step is optional
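For reference, the Item generated for the topic table above would look roughly like the sketch below. It is based on feapder's Item conventions; the exact generated file may differ by version:

```python
from feapder import Item


class TopicItem(Item):
    """
    A sketch of the Item that "feapder create -i topic" would generate:
    one attribute per column of the topic table
    """

    def __init__(self, *args, **kwargs):
        super().__init__(**kwargs)
        self.title = None       # Article title
        self.auth = None        # Author
        self.like_count = None  # Number of likes
        self.collection = None  # Number of collections
        self.comment = None     # Number of comments
```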
3-4 Writing the crawler and parsing the data
Step 1: use MysqlDB to initialize the database connection:

```python
import feapder
from feapder.db.mysqldb import MysqlDB


class TophubSpider(feapder.AirSpider):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.db = MysqlDB()
```
Step 2: in the start_requests method, specify the main URL to crawl, and use the keyword argument "download_midware" to configure a random UA:

```python
import feapder
from fake_useragent import UserAgent


def start_requests(self):
    yield feapder.Request("https://tophub.today/", download_midware=self.download_midware)

def download_midware(self, request):
    # Random UA
    # Dependency: pip3 install fake_useragent
    ua = UserAgent().random
    request.headers = {'User-Agent': ua}
    return request
```
Step 3: crawl the article titles and link addresses on the home page.
Use feapder's built-in xpath method to parse the data:

```python
def parse(self, request, response):
    # print(response.text)
    card_elements = response.xpath('//div[@class="cc-cd"]')

    # Filter out the card element for 什么值得买 ("What's worth buying")
    buy_good_element = [
        card_element
        for card_element in card_elements
        if card_element.xpath('.//div[@class="cc-cd-is"]//span/text()').extract_first() == '什么值得买'
    ][0]

    # Get the article titles and addresses inside the card
    a_elements = buy_good_element.xpath('.//div[@class="cc-cd-cb nano"]//a')

    for a_element in a_elements:
        # Title and link
        title = a_element.xpath('.//span[@class="t"]/text()').extract_first()
        href = a_element.xpath('.//@href').extract_first()

        # Issue a new task, carrying the article title along
        yield feapder.Request(href, download_midware=self.download_midware,
                              callback=self.parser_detail_page, title=title)
```
Step 4: crawl the detail page data.
The previous step issues a new task and specifies the callback function through the keyword argument "callback"; finally, the detail page data is parsed in parser_detail_page:

```python
import re


def parser_detail_page(self, request, response):
    """
    Parse the article detail data
    :param request:
    :param response:
    :return:
    """
    title = request.title
    url = request.url

    # Parse the article detail page to get the number of likes,
    # collections, comments, and the author's name
    author = response.xpath('//a[@class="author-title"]/text()').extract_first().strip()

    print("Author:", author, "Article title:", title, "Address:", url)

    desc_elements = response.xpath('//span[@class="xilie"]/span')

    print("Number of desc elements:", len(desc_elements))

    # Likes
    like_count = int(re.findall(r'\d+', desc_elements[1].xpath('./text()').extract_first())[0])

    # Collections
    collection_count = int(re.findall(r'\d+', desc_elements[2].xpath('./text()').extract_first())[0])

    # Comments
    comment_count = int(re.findall(r'\d+', desc_elements[3].xpath('./text()').extract_first())[0])

    print("Likes:", like_count, "Collections:", collection_count, "Comments:", comment_count)
```
3-5 Inserting the data into the database
Use the database object instantiated above to execute the SQL and insert the data into the database:
```python
# Insert into the database (string interpolation is fine for a demo;
# parameterized queries are safer in practice)
sql = "INSERT INTO topic(title, auth, like_count, collection, comment) VALUES('%s', '%s', %d, %d, %d)" % (
    title, author, like_count, collection_count, comment_count)

# Execute
self.db.execute(sql)
```
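Finally, to actually run the crawler, instantiate the spider class and call start() in the script's entry point. A minimal sketch, assuming the TophubSpider class from step 3-4:

```python
if __name__ == "__main__":
    # Start the AirSpider; it schedules start_requests and runs the callbacks
    TophubSpider().start()
```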
4. Finally
This article introduced AirSpider, the simplest crawler in feapder, through a simple example
The use of advanced functions of feapder will be described in detail later through a series of examples.
If you think the article is good, please like, bookmark, and share it, as this is the strongest motivation for me to keep producing high-quality articles!