I previously wrote an interface function for downloading a single article; combined with this post, I can download all of my articles.
- There is nothing technically difficult here, just simple XPath processing. What's interesting is that a CSDN employee wrote his own blog address into the page source as a hidden div, but I filter it out in the code.
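The hidden-div filter works through an XPath predicate. Here is a minimal sketch with a made-up HTML snippet (the class names match the real CSDN markup, but the content is invented for illustration):

```python
from lxml import etree

# Made-up page fragment: one real article box and one hidden promotional box.
html = '''
<div>
  <div class="article-item-box csdn-tracking-statistics"><h4><a>Real post</a></h4></div>
  <div class="article-item-box csdn-tracking-statistics" style="display: none;"><h4><a>Hidden promo</a></h4></div>
</div>
'''
tree = etree.HTML(html)
# The [not(@style="display: none;")] predicate drops the hidden div.
visible = tree.xpath('//div[@class="article-item-box csdn-tracking-statistics"]'
                     '[not(@style="display: none;")]')
titles = [d.xpath('h4/a/text()')[0] for d in visible]
print(titles)  # ['Real post']
```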
- The response time is optimized. Since the script has to crawl page by page, without multithreading the total time would inevitably grow with the number of pages.
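The idea is one thread per page, so the page fetches overlap instead of running one after another. A minimal sketch, with `time.sleep` standing in for the network request:

```python
import threading
import time

def fetch_page(page, results):
    # Stand-in for the real HTTP request and parsing work.
    time.sleep(0.1)
    results.append(page)

results = []
threads = [threading.Thread(target=fetch_page, args=(p, results))
           for p in range(1, 6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All five "pages" complete in roughly 0.1s total, not 5 * 0.1s serially.
print(len(results))  # 5
```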
- A single list is passed to every thread, so multiple threads share it, and I don't lock the list when accessing it. Since no thread deletes elements, I believe no data is lost; only the insertion order changes (if this reasoning is wrong, please leave a comment and explain). Finally, the articles are re-sorted by publication date: the result is a list containing multiple dictionaries, and the dictionaries must be sorted by a specified key, using the operator library.
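Sorting a list of dictionaries by one of their keys is exactly what `operator.itemgetter` is for. A small sketch with made-up dates:

```python
import operator

articles = [
    {'title': 'B', 'pubdata': '2019-05-02'},
    {'title': 'A', 'pubdata': '2019-04-30'},
    {'title': 'C', 'pubdata': '2019-05-10'},
]
# itemgetter('pubdata') extracts article['pubdata'] for each dict,
# so sort() orders the dicts by publication date.
articles.sort(key=operator.itemgetter('pubdata'))
print([a['title'] for a in articles])  # ['A', 'B', 'C']
```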
```python
import requests
from lxml import etree
import re
import threading
import operator


def get_page(url):
    """Read listTotal from the page source and compute the page count (20 per page)."""
    response = requests.get(url)
    # re.findall returns a list, so take the first match before converting
    list_total = int(re.findall('var listTotal = (.*?) ;', response.text)[0])
    all_page = list_total // 20 + 1
    return all_page


def parse_article(url, article_list):
    response = requests.get(url).text
    x = etree.HTML(response)
    # Skip the hidden promotional div that is injected into the page source
    items = x.xpath('//div[@class="article-item-box csdn-tracking-statistics"]'
                    '[not(@style="display: none;")]')
    for item in items:
        title = ''.join(item.xpath('h4/a/text()')).strip()
        link = item.xpath('h4/a/@href')[0]
        pubdata = item.xpath('div[@class="info-box d-flex align-content-center"]'
                             '/p/span[@class="date"]/text()')[0]
        # This xpath matches both counters: views first, then comments
        counts = item.xpath('div[@class="info-box d-flex align-content-center"]'
                            '/p/span/span/text()')
        article = dict(
            title=title,
            url=link,
            pubdata=pubdata,
            pageviews=counts[0],
            comments=counts[1],
        )
        article_list.append(article)


def main(url):
    main_url = url
    all_page = get_page(url)
    thread_list = []
    data = []  # shared by all threads; append only, no deletions
    for page in range(1, all_page + 1):
        url = main_url + '/article/list/' + str(page)
        t = threading.Thread(target=parse_article, args=(url, data))
        t.start()
        thread_list.append(t)
    for t in thread_list:
        t.join()
    # Re-sort the collected dicts by publication date
    data.sort(key=operator.itemgetter('pubdata'))
    print(data, len(data))


if __name__ == '__main__':
    url = 'https://blog.csdn.net/chouzhou9701'
    main(url)
```