This chapter is a hands-on project, so instead of a long introduction we will get straight into the project description:

After entering the official website, you can see the address in the browser's address bar. Since the address we need is the job-listing page itself, that is the URL our crawler will start from.
To create a Scrapy project:
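Assuming the project is named tencent_recruit (the name used in the rest of this chapter), the command, run from the directory where the project should live, would be:

```
scrapy startproject tencent_recruit
```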

Find the spiders folder under the tencent_recruit folder, open a cmd window there, and enter the command scrapy genspider catch_position tencent.com to create a crawler file named "catch_position".

Define the crawl target
We open the "tencent_recruit" project just created in PyCharm and locate the items.py file.

According to the target web page, we determine the crawl targets as:
- "Position name"
- "Position detail link"
- "Position type"
- "Number of recruits"
- "Place of work"
- "Publish time"

Accordingly, we write items.py to define the crawl targets.

```python
import scrapy


class TencentRecruitItem(scrapy.Item):
    # define the fields for your item here like:
    position_name = scrapy.Field()         # position name
    position_detail_link = scrapy.Field()  # position detail link
    position_type = scrapy.Field()         # position type
    recruit_num = scrapy.Field()           # number of recruits
    work_location = scrapy.Field()         # place of work
    publish_time = scrapy.Field()          # publish time
```
Write the crawler file
Double-click the catch_position.py file we created to start writing the crawler.

First, change the "start_urls" field value to our target web address.
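A sketch of what catch_position.py looks like at this stage; the class skeleton follows the genspider template, and the start URL shown here is an assumption standing in for the address copied from the browser:

```python
import scrapy


class CatchPositionSpider(scrapy.Spider):
    name = 'catch_position'
    allowed_domains = ['tencent.com']
    # assumed list-page address -- replace with the URL from your browser
    start_urls = ['https://hr.tencent.com/position.php']

    def parse(self, response):
        pass
```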

"In" settings.py "Add" # "before the" ROBOTSTXT_OBEY "protocol in line 22 (line 22 in pycharm, and the number of lines may be different in different editors).

Remove the "#" comment before "USER_AGENT" in line 19 (line 19 in pycharm, and the number of lines may be different in different editors), and change its value to the value seen in F12 in the browser.


Then write our crawler file catch_position.py.

Change the content of parse to:

```python
def parse(self, response):
    # extract the data rows using XPath
    node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
    for node in node_list:
        print(node.xpath('./td/a/text()'))
```
Enter scrapy crawl catch_position on the cmd command line to run the crawler for testing.

It can be seen that each row of the extracted data list holds only one element, so we use extract_first() to take the first element.

Note: "extract()[0]" and "extract_first()" can get the first element. Once there is no data, "extract()[0]" will report an error. The small mark range overflow will terminate the program, while "extract_first()" will directly return "null" to indicate a null value and will not interrupt the program. Therefore, we often use "extract_first()" when getting the first element.
Next, we import the item class defined in items.py.
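Assuming the default project layout, where the module path follows the project name, the import at the top of catch_position.py would be:

```python
from tencent_recruit.items import TencentRecruitItem
```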

Instantiate the item class and assign the corresponding data to each item field.

```python
def parse(self, response):
    # extract the data rows using XPath
    node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
    for node in node_list:
        item = TencentRecruitItem()
        item['position_name'] = node.xpath('./td[1]/a/text()').extract_first()
        item['position_detail_link'] = node.xpath('./td[1]/a/@href').extract_first()
        item['position_type'] = node.xpath('./td[2]/text()').extract_first()
        item['recruit_num'] = node.xpath('./td[3]/text()').extract_first()
        item['work_location'] = node.xpath('./td[4]/text()').extract_first()
        item['publish_time'] = node.xpath('./td[5]/text()').extract_first()
        yield item
```
We have successfully extracted the data on the first page of "Tencent Recruitment". Next, let's analyze the web page and crawl all the recruitment information. Press F12, click the element selector, and select "next page"; the browser automatically locates the corresponding web page code for us.

Clicking the corresponding a-tag link in that code takes us directly to the second page. Following this pattern, we can work out how to crawl all the recruitment information, as sketched below.
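A sketch of the idea, appended at the end of parse; the XPath for the "next page" link is an assumption, so adjust it to match the a tag found above:

```python
# at the end of parse, after yielding the items on the current page
next_url = response.xpath('//a[@id="next"]/@href').extract_first()  # assumed selector
if next_url:
    # follow the relative link to the next list page and parse it the same way
    yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
```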


Write the pipeline file and store the data
Double-click pipelines.py to open the pipeline file and write it.

```python
import json


class TencentRecruitPipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open('tencent_recruit.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # serialize each item as one JSON line
        data = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(data)
        return item
```
Define the close_spider function to close the file at the end of the crawler.
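A minimal sketch of that method, added inside the same TencentRecruitPipeline class:

```python
    def close_spider(self, spider):
        # called once when the spider finishes; release the file handle
        self.file.close()
```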

Finally, register the pipeline in settings.py. Find line 69 (line 69 in PyCharm; the line number may differ in other editors) and remove the "#" comments from the ITEM_PIPELINES section.
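After uncommenting, the section matches the default Scrapy template for this project name:

```python
ITEM_PIPELINES = {
    'tencent_recruit.pipelines.TencentRecruitPipeline': 300,
}
```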

At this point, run the crawler file again and you can successfully obtain the Tencent recruitment information.
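Using the same command as before, the results are written to tencent_recruit.json in the project directory, one JSON object per line:

```
scrapy crawl catch_position
```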
