Python 3: crawling the full article list of a CSDN user

Preface

I previously wrote an interface function that downloads a single article. Combined with the code in this post, it can download all of a user's articles.

Implementation notes

  1. Nothing technically deep here; it's just simple XPath processing. One amusing detail: a CSDN employee wrote his own blog address into the page source as a hidden div, but the code filters it out.
  2. Response time is optimized with multithreading. Every list page has to be crawled, so without multithreading the total time would grow with the number of pages.
  3. Each thread is passed the same list, so multiple threads share one list, and I don't lock it on access. Since no elements are ever deleted, I believe no data is lost and only the insertion order changes (if this reasoning is wrong, please point it out in a comment). Finally, the articles are re-sorted by publication date: what we end up with is a list of dictionaries that needs to be sorted by a specified key in each dictionary, which the operator library handles (a minimal sketch follows this list).
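
As a quick illustration of that last point, here is a minimal, self-contained sketch of sorting a list of dictionaries by a key with operator.itemgetter (toy data, not the crawler's output):

import operator

articles = [
    {'title': 'second', 'pubdata': '2019-11-02 10:00:00'},
    {'title': 'first', 'pubdata': '2019-10-01 09:30:00'},
]
# itemgetter('pubdata') extracts the value under 'pubdata' from each dict;
# date strings in this format compare correctly as plain strings
articles.sort(key=operator.itemgetter('pubdata'))
print([a['title'] for a in articles])  # ['first', 'second']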

Code implementation

import requests
from lxml import etree
import re
import threading
import operator


def get_page(url):
    response = requests.get(url)
    # listTotal in the page source is the total article count; with 20
    # articles per list page, round up to get the number of pages
    list_total = int(re.findall('var listTotal = (.*?) ;', response.text)[0])
    all_page = (list_total + 19) // 20
    return all_page


def parse_article(url, article_list):
    response = requests.get(url).text
    html = etree.HTML(response)
    # the [not(@style="display: none;")] predicate filters out the hidden
    # div (the CSDN employee's blog address) injected into the page source
    items = html.xpath('//div[@class="article-item-box csdn-tracking-statistics"][not(@style="display: none;")]')

    for item in items:
        # text node [0] is just whitespace around the article-type badge,
        # so [1] holds the actual title
        title = item.xpath('h4/a/text()')[1].strip()
        url = item.xpath('h4/a/@href')[0]
        pubdata = item.xpath('div[@class="info-box d-flex align-content-center"]/p/span[@class="date"]/text()')[0]
        pageviews = item.xpath('div[@class="info-box d-flex align-content-center"]/p[3]/span/span/text()')[0]
        comments = item.xpath('div[@class="info-box d-flex align-content-center"]/p[5]/span/span/text()')[0]
        article = dict(
            title=title,
            url=url,
            pubdata=pubdata,
            pageviews=pageviews,
            comments=comments,
        )
        # list.append is atomic under CPython's GIL, so concurrent appends
        # from several threads are safe without an explicit lock
        article_list.append(article)

def main(url):
    main_url = url
    all_page = get_page(url)
    thread_list = []
    data = []
    # one thread per list page; every thread appends into the shared `data` list
    for page in range(1, all_page + 1):
        url = main_url + '/article/list/' + str(page)
        t = threading.Thread(target=parse_article, args=(url, data))
        t.start()
        thread_list.append(t)

    # wait for all page threads to finish before sorting
    for t in thread_list:
        t.join()

    # threads finish in arbitrary order, so restore chronological order
    # by the publication-date string
    data.sort(key=operator.itemgetter('pubdata'))
    print(data, len(data))

if __name__ == '__main__':
    url = 'https://blog.csdn.net/chouzhou9701'
    main(url)
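
One design note: spawning one thread per page is fine for a small blog, but for many pages it can hammer the server. As an alternative (not the original approach above), a thread pool from the standard library's concurrent.futures caps concurrency; a minimal sketch, reusing parse_article from above and a hypothetical helper name:

from concurrent.futures import ThreadPoolExecutor

def crawl_with_pool(main_url, all_page, data, max_workers=8):
    # build one list-page URL per page, the same way main() does
    urls = [main_url + '/article/list/' + str(page) for page in range(1, all_page + 1)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # at most max_workers pages are fetched at once;
        # the with-block waits for every submitted task to finish
        for page_url in urls:
            pool.submit(parse_article, page_url, data)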
