I've introduced an emoticon repository on GitHub before, but it doesn't hold nearly enough emoticons for frequent meme battles, so I decided to mine emoticon resources from the ever-resourceful Internet.
The target of this crawl is a certain Q&A site where everyone supposedly earns a million a year. It has plenty of questions about emoticon packs, and almost every answer drops a big pile of them; netizens everywhere really are making friends over emoticons. So I'll just sweep them all into my own collection~
First, expand all the answers so that every one of them is loaded. Open the browser's developer tools, find the request that fetches the answers, and copy everything under Request Headers into our own headers. Then use requests to fetch the page.
The offset in params changes as you page through the answers, increasing by 5 with every page turned; the other parameters stay the same.
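To make that concrete, here is a rough sketch of a single request; the headers and the API URL are whatever you copied out of the developer tools (both are placeholders here), and the long include parameter shown further down is omitted for brevity:

import requests

headers = {"User-Agent": "..."}    # every field copied from Requests Headers in the developer tools
api_url = "https://..."            # the answers request URL captured from the Network panel
params = {"limit": 5, "offset": 0, "sort_by": "default", "platform": "desktop"}   # offset = 5 * page index

resp = requests.get(api_url, headers=headers, params=params)
answers = resp.json()["data"]      # each item carries one answer's HTML in its 'content' field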
Next, looking at the page source, we can clearly see that the link to each picture sits in the data-actualsrc attribute.
These links can be extracted from the page source with a regular expression.
pic_urls=re.findall(r'data-actualsrc="(.*?\.(gif|jpg|png))',content)
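Since the pattern has two capture groups, re.findall hands back (url, extension) tuples, which is why only the first element of each tuple is kept later. A quick check against a made-up snippet of answer HTML:

import re

content = '<img data-actualsrc="https://example.com/funny.gif">'   # hypothetical answer HTML
pic_urls = re.findall(r'data-actualsrc="(.*?\.(gif|jpg|png))', content)
print(pic_urls)   # [('https://example.com/funny.gif', 'gif')]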
With these image URLs in hand, we can download every picture simply by issuing another round of requests. The intuitive approach is to visit the source pages first, save all the picture URLs in a list, and then walk the list, fetching and downloading each one in turn.
That consumes a lot of memory and downloads slowly, so I took a different strategy: a simple requests + Redis distributed crawler.
Get Picture URL
As described above, we crawl for the URL links, except that now we store them in Redis, adding every URL to the same set.
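The surrounding class setup isn't shown in this post, so here is a minimal sketch of what self.url, self.headers and self.r could look like; the class name, Redis host and header values are all placeholders:

import re
import requests
import redis

class UrlSpider:
    def __init__(self):
        self.url = "https://..."               # the answers API URL copied from the developer tools
        self.headers = {"User-Agent": "..."}   # Request Headers copied from the developer tools
        # decode_responses=True makes spop() return str instead of bytes later on
        self.r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)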
def get_urls(self, offset, urls):
    params = {
        'include': 'data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,is_labeled,is_recognized,paid_info,paid_info_content;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics',
        'limit': 5,
        'offset': offset,
        'platform': 'desktop',
        'sort_by': 'default'
    }
    r = requests.get(self.url, headers=self.headers, params=params)
    data = r.json()['data']
    for i in data:
        content = i['content']
        # each match is a (url, extension) tuple because the pattern has two groups
        pic_urls = re.findall(r'data-actualsrc="(.*?\.(gif|jpg|png))', content)
        for j in range(len(pic_urls)):
            self.r.sadd("urls", pic_urls[j][0])   # add the full URL to the shared Redis set
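Driving the crawl is then just a matter of stepping offset by 5 per page. A rough usage sketch, with the page count as an assumption:

spider = UrlSpider()
for page in range(200):                            # assumed upper bound; stop once 'data' comes back empty
    spider.get_urls(offset=page * 5, urls=None)    # urls is unused inside get_urls, kept to match the signature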
Picture Download
Create a new .py file for downloading the pictures. Since collecting the links is much faster than downloading the images themselves, the download side uses multithreaded requests.
def download(self):
    if "urls" in self.r.keys():
        while True:
            try:
                url = self.r.spop("urls")   # pop one URL from the shared Redis set
                r = requests.get(url, headers=self.headers)
                # name each file with a running counter plus the original extension; img_path is the download directory defined elsewhere
                with open(img_path + os.path.sep + '{}{}'.format(self.count, url[-4:]), 'wb') as f:
                    f.write(r.content)
                print("Successfully downloaded emoticon pack {}!".format(self.count))
                self.count += 1
            except:
                if "urls" not in self.r.keys():
                    print("All emoticon packs have been downloaded")
                    break
                else:
                    print("Request for {} failed!".format(url))
                    continue
    else:
        self.download()   # the set does not exist yet, so try again
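The loop above is single-threaded on its own; one way to fan it out over a few worker threads (the DownloadSpider name and the thread count are my own placeholders, and self.count would need a lock to stay accurate when shared):

import threading

spider = DownloadSpider()   # hypothetical class holding download(), self.r, self.headers and self.count
threads = [threading.Thread(target=spider.download) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()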
The two programs run at the same time: one keeps pushing URLs into Redis while the other pops URLs and downloads the pictures. This speeds up the download considerably and keeps memory usage low.
Result Display
More than 50,000 emoticons were crawled in total, sassy GIFs included. Ha-ha! Dare I ask: anyone up for a meme battle?
I've put all the emoticon packs on a cloud drive; reply in the backend and grab them right away ~