Python crawler: saving CSDN blog posts as HTML/PDF documents

Preface

After writing blog posts on CSDN, have you ever worried that one day the blog might "disappear" (you know...) and your efforts would be wasted? Or do you want to save your favorite articles locally before they vanish or end up behind a paywall one day? This article shows how to use a Python script to automatically save CSDN blog articles locally in both HTML and PDF formats.

Saving a single article

Let's first look at how to save a single specified article locally in the desired formats.

Scripting

1. The script needs to import the following modules (the third-party packages requests, parsel and pdfkit can be installed with pip if they are not already present):

import pdfkit
import requests
import parsel

2. The machine also needs the wkhtmltopdf tool, which converts HTML documents into PDF; it can be downloaded from the wkhtmltopdf official website. After installing it, note the path to the executable, since pdfkit has to be pointed at it; a quick way to check the configuration from Python is sketched below.
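
A minimal check, assuming wkhtmltopdf was installed to the same path used later in this article's script (replace the path with your own installation location):

import pdfkit

# Hypothetical install path; point this at wherever wkhtmltopdf.exe actually lives.
WKHTMLTOPDF_PATH = r'D:\SoftWare\htmltopdf\wkhtmltopdf\bin\wkhtmltopdf.exe'

# pdfkit raises an error here if no executable is found at the given path,
# so this line alone confirms whether the tool is wired up correctly.
config = pdfkit.configuration(wkhtmltopdf=WKHTMLTOPDF_PATH)
print('wkhtmltopdf found at:', config.wkhtmltopdf)
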
3. Open the article in the browser, right-click and choose Inspect to analyze the HTML elements that hold the article content (excluding irrelevant parts). A quick look shows that the content we need sits under the article tag:
4. Because the elements extracted from the article tag do not form a complete HTML document on their own, they need to be wrapped and assembled. The following is a standard HTML structure:

<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
    Related content
</body>
</html>

With the above analysis done, let's look at the complete script:

import pdfkit
import requests
import parsel

url = 'https://bwshen.blog.csdn.net/article/details/119778471'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
html = response.text
selector = parsel.Selector(html)
# Extract the title of the article
title = selector.css('.title-article::text').get()
# Extract the content labeled article
article = selector.css('article').get()
# Define a standard html structure for assembling the content of the article tag extracted above
src_html = '''
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Document</title>

</head>
<body>
    {content}
</body>
</html>
'''
# Assemble the article content into a standard html format document and save it locally
with open(title + '.html', mode='w+', encoding='utf-8') as f:
    f.write(src_html.format(content=article))
    print('%s.html Saved successfully' % title)
# Call wkhtmltopdf tool to convert html document to pdf format document
config = pdfkit.configuration(wkhtmltopdf=r'D:\SoftWare\htmltopdf\wkhtmltopdf\bin\wkhtmltopdf.exe')
pdfkit.from_file(title + '.html', title + '.pdf', configuration=config)
print(title + '.pdf', 'Saved successfully')
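
One caveat: the script uses the article title directly as the file name, and some titles contain characters (such as ?, :, or /) that are not allowed in Windows file names, which makes open() fail. A small helper along the lines below (the replacement policy is just an assumption) can be used to clean the title before saving:

import re

def safe_filename(title):
    # Replace characters that are illegal in Windows file names with underscores.
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

# For example, build the output file names from the sanitized title:
# with open(safe_filename(title) + '.html', mode='w', encoding='utf-8') as f: ...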

Effect demonstration

The code runs as follows:
1. The generated HTML document is as follows:
2. The generated PDF file is as follows (with navigation label):
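
The navigation panel in the generated PDF comes from wkhtmltopdf's outline feature, which is enabled by default. If you want to control it explicitly (or set the input encoding), pdfkit accepts an options dictionary whose keys map to wkhtmltopdf command-line flags; a small sketch using the --outline, --outline-depth and --encoding flags:

options = {
    'outline': None,        # build a PDF outline (the navigation panel) from heading tags
    'outline-depth': 3,     # how many heading levels to include in the outline
    'encoding': 'UTF-8',    # input text encoding
}
pdfkit.from_file(title + '.html', title + '.pdf', configuration=config, options=options)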

Batch save

How can we save and convert all of a blogger's articles automatically? Handling them one by one is clearly not practical... Let's improve the script so that it batch-saves and converts all of a blogger's posts automatically.

Scripting

Visit my blog home page and inspect it: the article list element contains all the article links on the current page.
Take a look at the complete code:

import pdfkit
import requests
import parsel
import time


src_html = '''
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Document</title>

</head>
<body>
    {content}
</body>
</html>
'''
headers = {
    'user-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}

def download_one_page(page_url):
    response = requests.get(url=page_url, headers=headers)
    html = response.text
    selector = parsel.Selector(html)
    title = selector.css('.title-article::text').get()
    # Extract the content labeled article
    article = selector.css('article').get()
    # Save HTML document
    with open(title+'.html', mode='w+', encoding='utf-8') as f:
        f.write(src_html.format(content=article))
    print('%s.html Saved successfully' % title)
    # Convert HTML to PDF
    config = pdfkit.configuration(wkhtmltopdf=r'D:\SoftWare\htmltopdf\wkhtmltopdf\bin\wkhtmltopdf.exe')
    pdfkit.from_file(title+'.html', title+'.pdf', configuration=config)
    print('%s.pdf Saved successfully' % title)


def down_all_url(index_url):
    index_response = requests.get(url=index_url,headers=headers)
    index_selector = parsel.Selector(index_response.text)
    urls = index_selector.css('.article-list h4 a::attr(href)').getall()
    for url in urls:
        download_one_page(url)
        time.sleep(2.5)

if __name__ == '__main__':
    down_all_url('https://bwshen.blog.csdn.net/')
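
Note that the script above only collects the article links shown on the blog home page, i.e. the first page of the article list. If a blogger's posts span several list pages, those pages would have to be walked as well. A minimal sketch, assuming CSDN exposes paginated list pages under an /article/list/<n> style URL (verify the URL pattern and the CSS selector against the actual pages before relying on it):

def down_all_pages(blog_url, max_pages=10):
    # Walk the paginated article-list pages and reuse down_all_url for each one.
    # The URL pattern below is an assumption; check it against the blogger's site.
    for page in range(1, max_pages + 1):
        page_url = '{}/article/list/{}'.format(blog_url.rstrip('/'), page)
        down_all_url(page_url)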

Effect demonstration

The code runs as follows:

Summary

In fact, if you use Firefox, there is a ready-made plug-in, PDF Saver For CSDN Blog, that saves a CSDN blog post as a local PDF file. However, the PDF it generates has no navigation outline and it cannot save posts in batch, so the Python script is still the better option.
