Train of thought:
step 1: request the "Jian Lai" ("Sword Come") novel site and get two things (see the sketch after this list):
- the novel name -- used to create a folder with that name in the same directory as the Python program
- the link address of each chapter of the novel
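To make the extraction concrete, here is a minimal pyquery sketch of step 1 run against a made-up HTML fragment; the selectors are simplified stand-ins for the longer nth-child selectors the full program below uses against the live page:

from pyquery import PyQuery as pq

# Made-up stand-in for the site's index page, just to show the extraction shape
sample_html = '''
<div><h1><a>Sample Novel</a></h1>
  <dl>
    <dd><a href="/chapter-1.html">Chapter 1</a></dd>
    <dd><a href="/chapter-2.html">Chapter 2</a></dd>
  </dl>
</div>
'''

doc = pq(sample_html)
name = doc('h1 > a').text()          # novel name -> folder name
for link in doc('dd > a').items():   # one <a> element per chapter
    print(name, link.text(), link.attr.href)  # chapter title + relative link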
step 1 corresponds to the get_url() function in the code, which also contains the multithreaded code used to speed up the crawler. After many tests, it takes about 5 minutes to fetch all 645 chapters of the novel.
(total time of the last test: 288.94552659988403s; the exact time depends on network conditions)
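Note that the code below starts one thread per chapter link, i.e. roughly 645 threads at once. If you would rather bound the concurrency, a ThreadPoolExecutor is a common alternative; a minimal sketch, assuming the same download(url, name) function from the full listing below:

from concurrent.futures import ThreadPoolExecutor

def crawl_all(urls, name, workers=16):  # 16 concurrent requests is an arbitrary cap, tune to taste
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url in urls:
            pool.submit(download, url, name)  # download() is the function from the full listing
    # leaving the with-block waits for every submitted task, like the join() loop below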
step 2: traverse (with a for loop) all the links from step 1, and likewise get two things:
- the chapter name -- used to create a .txt file with that name inside the folder from step 1
- the chapter content -- saved chapter by chapter for easy reading
step 2 corresponds to the download() function in the code, which takes two parameters: the url of a chapter link obtained in step 1, and the name of the folder created in step 1.
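One caveat: the chapter title goes straight into the file name, so a title containing characters Windows forbids in file names (\ / : * ? " < > |) would make open() fail. A small helper like this (my addition, not part of the original code) could strip them first:

import re

def safe_filename(title):
    # drop the characters Windows does not allow in file names
    # (assumption: titles are otherwise plain text)
    return re.sub(r'[\\/:*?"<>|]', '', title)

It would slot into download() as name + '\\' + safe_filename(title) + '.txt'.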
Tools: the code (written in PyCharm) uses two third-party libraries, requests and pyquery, plus the standard-library modules threading, os, and time (time is optional, used only to record the program's total running time).
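The two third-party packages can be installed with pip (the exact command may differ per environment):

pip install requests pyquery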
# -*- coding: UTF-8 -*-
import requests
from pyquery import PyQuery as pq
import threading
import time
import os

# Simple anti-crawler workaround: send browser-like request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Cookie': 'Hm_lvt_acb6f2bc2dcbdaa4ceeed2289f1d0283=1562677085,1562680745,1562916025; Hm_lpvt_acb6f2bc2dcbdaa4ceeed2289f1d0283=1562922723',
    'Host': 'www.jianlaixiaoshuo.com',
    'Referer': 'http://www.jianlaixiaoshuo.com/',
}

def get_url():
    response = requests.get("http://www.jianlaixiaoshuo.com/", headers=headers)
    response.encoding = response.apparent_encoding  # prevent garbled characters in the response body
    html = pq(response.text)
    name = html('body > div:nth-child(2) > div > div > div > h1 > a').text()  # novel name
    if not os.path.exists(name):  # create a folder named after the novel next to the Python program
        os.mkdir(name)
    links = html('body > div:nth-child(2) > div > dl > dd > a')  # link address of every chapter
    # Build the threads: one per chapter link, to speed up the crawl
    threads = []
    for link in links.items():
        url = 'http://www.jianlaixiaoshuo.com' + link.attr.href  # splice the complete link
        t = threading.Thread(target=download, args=(url, name))  # one thread per link
        threads.append(t)
    for i in threads:
        i.start()  # start every thread
    for i in threads:
        i.join()  # block until every thread has terminated

def download(url, name):  # fetch the title and content of one chapter
    response = requests.get(url, headers=headers)
    response.encoding = response.apparent_encoding  # prevent garbled characters in the response body
    html = pq(response.text)
    title = html('#BookCon > h1').text()  # chapter name
    content = html('#BookText').text()  # chapter content
    file_name = name + '\\' + title + '.txt'  # one .txt file per chapter, named after it
    print('Saving novel file: ' + file_name)  # show crawling progress
    with open(file_name, "w", encoding='utf-8') as f:
        f.write(content)

def main():
    start_time = time.time()  # when the program started running
    get_url()
    print("Total time: %s" % (time.time() - start_time))

if __name__ == '__main__':
    main()