Scrape the entire novel Jianlai in Python in 5 minutes (complete code attached)

Approach:

step 1: request the Jianlai novel site and get two things

  • Novel name -- used to create a folder named after the novel in the same directory as the Python program
  • The link address of each chapter of the novel

step 1 corresponds to the get_url() function in the code, which also contains the multithreaded code used to speed up the crawler. After repeated tests, it takes about 5 minutes to grab all 645 chapters of the novel
(total time of the last test: 288.94552659988403 s; the exact time depends on network conditions).
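The one-thread-per-chapter pattern that get_url() uses to speed things up boils down to the standard start/join idiom, sketched here with a dummy worker in place of the real download(url, name):

```python
import threading

results = []
lock = threading.Lock()

def worker(n):
    # Stand-in for download(url, name): do some work, record the result
    with lock:                 # a lock keeps the shared list consistent
        results.append(n * n)

# One thread per task, mirroring one thread per chapter link
threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()                  # launch every worker
for t in threads:
    t.join()                   # wait until every worker has finished
print(sorted(results))         # [0, 1, 4, 9, 16]
```

Note that with 645 threads this launches one thread per chapter at once; it works, but a `concurrent.futures.ThreadPoolExecutor` with a bounded pool would be gentler on the target site.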

step 2: traverse (with a for loop) all the links from step 1, and for each link get two things

  • Novel chapter name -- used to create a .txt file named after the chapter inside the folder created in step 1
  • Novel chapter content -- saved chapter by chapter for easy reading

step 2 corresponds to the download() function in the code, which takes two parameters: the URL of a chapter link obtained in step 1, and the name of the folder created in step 1.
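The url parameter passed to download() is built by joining each relative href from the chapter list to the site root; the code below does this with plain string concatenation, but the standard library's urllib.parse.urljoin performs the same join more robustly. A quick sketch with hypothetical hrefs (the real paths on the site may differ):

```python
from urllib.parse import urljoin

base = 'http://www.jianlaixiaoshuo.com/'
# Hypothetical relative hrefs as they might appear in the chapter list
hrefs = ['/book_1/1.html', '/book_1/2.html']

# urljoin handles leading/trailing slashes correctly in either operand
urls = [urljoin(base, h) for h in hrefs]
print(urls)
```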

Tools: PyCharm. Third-party libraries: requests and pyquery. Standard-library modules: threading, os, and time (optional, used to record the program's total running time).

# -*- coding: UTF-8 -*-
import requests
from pyquery import PyQuery as pq
import threading
import time
import os


# Simple anti-scraping measure: set the request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
           'Cookie': 'Hm_lvt_acb6f2bc2dcbdaa4ceeed2289f1d0283=1562677085,1562680745,1562916025; Hm_lpvt_acb6f2bc2dcbdaa4ceeed2289f1d0283=1562922723',
           'Host': 'www.jianlaixiaoshuo.com',
           'Referer': 'http://www.jianlaixiaoshuo.com/',
           }


def get_url():
    response = requests.get("http://www.jianlaixiaoshuo.com/", headers=headers)
    response.encoding = response.apparent_encoding  # Prevent garbled characters in the response body
    html = pq(response.text)
    name = html('body > div:nth-child(2) > div > div > div > h1 > a').text()  # Get novel name
    if not os.path.exists(name):                   # In the program's directory, create a folder named after the novel
        os.mkdir(name)
    links = html('body > div:nth-child(2) > div > dl > dd > a')  # Get the link address of each chapter of the novel
    # Build one thread per chapter link to speed up the crawler
    threads = []   # Multithreaded list
    for link in links.items():
        url = 'http://www.jianlaixiaoshuo.com' + link.attr.href  # Splice the complete link
        t = threading.Thread(target=download, args=(url, name))  # Set one thread per link
        threads.append(t)                                        # Add the created threads to the multithreading list in turn
    for i in threads:
        i.start()  # Start the thread
    for i in threads:
        i.join()   # Block the calling (main) thread until this thread terminates


def download(url, name):
    #  Get the title and content of each chapter
    response = requests.get(url, headers=headers)
    response.encoding = response.apparent_encoding  # Prevent garbled characters in the response body
    html = pq(response.text)
    title = html('#BookCon > h1').text()  # Chapter name
    content = html('#BookText').text()    # Chapter content
    file_name = os.path.join(name, title + '.txt')  # Create a .txt file named after the chapter; the novel is saved chapter by chapter
    print('Saving novel file: ' + file_name)  # Show crawling progress
    with open(file_name, "w", encoding='utf-8') as f:
        f.write(content)


def main():    # Main function
    start_time = time.time()  # Time the program started running
    get_url()
    print("Total time: %ss" % (time.time() - start_time))


if __name__ == '__main__':
    main()
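The per-chapter save inside download() reduces to building a path and writing text. A minimal, self-contained sketch with hypothetical names (a temporary directory stands in for the novel-name folder):

```python
import os
import tempfile

folder = tempfile.mkdtemp()          # stand-in for the folder named after the novel
title = 'Chapter 1'                  # hypothetical chapter title
content = 'Chapter text goes here.'  # hypothetical chapter body

# os.path.join picks the right separator for the current OS,
# so the script is not tied to Windows-style '\\' paths
file_name = os.path.join(folder, title + '.txt')
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(content)

print(open(file_name, encoding='utf-8').read())  # Chapter text goes here.
```

One caveat: real chapter titles may contain characters that are invalid in filenames (e.g. `?` or `:` on Windows), so sanitizing the title before joining is a sensible extra step.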

Keywords: encoding Python network Pycharm

Added by loquaci on Mon, 28 Oct 2019 22:50:49 +0200