Continuing from the previous post: a record of learning Python crawlers
In the last article we successfully wrote a simple crawler, but I want to crawl the whole novel, so let's keep going.
This post is divided into three parts:
1. Approach analysis
2. Single-threaded crawling of the whole novel
3. Multi-threaded crawling of the novel
1. Approach analysis
```python
import requests as req
from bs4 import BeautifulSoup as bs

html = 'https://www.bqkan8.com/1_1496/450365.html'   # the chapter URL
txt = req.get(url=html)                              # fetch the page
txt = bs(txt.text, 'html.parser')
txt = txt.find_all('div', id='content')
txt = str(txt).replace('<br/><br/>', '')
print(txt.replace(' ', '\n\n'))
```
This is the code we wrote in the last article. Next, I want to crawl the whole novel. To crawl each chapter's content we first need its URL, which means we should collect all the chapter URLs, store them in a list, and pull them out one by one when needed.
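In outline, the plan looks roughly like this (a minimal sketch reusing the site and selectors from this series; the helper function names here are just placeholders, not the final code):

```python
import requests as req
from bs4 import BeautifulSoup as bs

def get_chapter_urls(directory_url):
    # Fetch the table of contents and collect every chapter link into a list
    page = bs(req.get(url=directory_url).content, 'html.parser')
    links = page.find_all('div', class_='listmain')[0].find_all('a')
    return ['https://www.bqkan8.com' + a.get('href') for a in links]

def get_chapter_text(chapter_url):
    # Fetch one chapter page and return its body text
    page = bs(req.get(url=chapter_url).content, 'html.parser')
    return page.find_all('div', id='content')[0].text

# Call the URLs out of the list one by one when we need them
for url in get_chapter_urls('https://www.bqkan8.com/1_1496'):
    print(get_chapter_text(url)[:30])
```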
2. Single-threaded crawling of the whole novel
OK, now that we have the idea, the next step is to write the code.
First, we write a crawler to fetch the novel's table of contents, filtering the result to strip out what we don't need. The code is as follows:
```python
import requests as req
from bs4 import BeautifulSoup as ds

url1 = 'https://www.bqkan8.com/1_1496'
text1 = req.get(url=url1)
bf = ds(text1.content, 'html.parser')
text2 = bf.find_all('div', class_="listmain")
print(text2)
```
Then we get:
Let's analyze what we got. First, since we are crawling the whole novel, note that the first 12 links are the "latest chapters", which are out of order, so we should skip them. Second, we need to extract the chapter name and the URL separately.
For convenience, I use lists to store all the URLs and chapter names, reading the chapter name from the .string attribute and the URL from the get() method.
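For a single <a> tag from the directory, the extraction looks like this (a tiny illustration with a hand-made tag rather than one taken from the live page):

```python
from bs4 import BeautifulSoup as ds

# A hand-made link shaped like the ones inside the listmain div
tag = ds('<a href="/1_1496/450365.html">Chapter 1</a>', 'html.parser').a
print(tag.string)                                   # chapter name: Chapter 1
print('https://www.bqkan8.com' + tag.get('href'))   # full chapter URL
```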
Without further ado, here is the code:
```python
import requests as req
from bs4 import BeautifulSoup as ds

jsq = 0   # counter used to skip the first 12 "latest chapter" links
url1 = 'https://www.bqkan8.com/1_1496'
encoding = 'utf-8'
text1 = req.get(url=url1)
bf = ds(text1.content, 'html.parser')
text2 = bf.find_all('div', class_="listmain")
text2 = text2[0]
a = text2.find_all('a')
list1 = []
url1 = 'https://www.bqkan8.com'
for i in a:
    if jsq > 11:
        wz = url1 + i.get('href')   # join the site root with the chapter's relative href
        list1.append(wz)
    else:
        jsq = jsq + 1
print(list1)
```
The counter inside the for loop is there to drop the 12 "latest chapter" links, so the URLs we collect start from chapter 1 of the novel. The output looks like this:
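As a side note, the counter is not the only way to skip those 12 links; with the `a` list from the code above, plain slicing does the same job:

```python
# Equivalent to the jsq counter: drop the first 12 <a> tags, then build the URLs
list1 = ['https://www.bqkan8.com' + i.get('href') for i in a[12:]]
```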
With the URLs in hand, we can write the program that crawls the novel's text. We only need to modify the program above slightly:
```python
import requests as req
from bs4 import BeautifulSoup as ds

jsq = 0   # counter used to skip the first 12 "latest chapter" links
url1 = 'https://www.bqkan8.com/1_1496'
encoding = 'utf-8'
text1 = req.get(url=url1)
bf = ds(text1.content, 'html.parser')
text2 = bf.find_all('div', class_="listmain")
text2 = text2[0]
a = text2.find_all('a')
r = open('E:/Summer reptile learning/novel.txt', 'a', encoding='utf-8')
list1 = []
zjjs = 0   # total number of chapters
url1 = 'https://www.bqkan8.com'
for i in a:
    if jsq > 11:
        wz = url1 + i.get('href')
        list1.append(wz)
        zjjs += 1
    else:
        jsq = jsq + 1
jsq = 1   # reuse the counter for the download progress
for i in list1:
    nr = req.get(url=i)
    text3 = ds(nr.content, 'html.parser')
    text3 = text3.find_all('div', id='content')
    text3 = text3[0].text.replace(' ', '\n\n')
    r.write(text3)
    jd = jsq / zjjs * 100
    jd = round(jd, 2)
    print(f'Download progress {jd}%')
    jsq += 1
r.close()
print('Download complete!\nProgram exit')
```
A for loop is used here to fetch each chapter URL in turn, and open() saves the text to a local file. Then you run it and find that...
Good grief, it takes more than half an hour to download a 13 MB novel. That is extremely inefficient!
So we arrive at the heart of this post, and the most important step in my crawler learning so far: Thread + Queue, the combination of threads and queues.
3. Multi-threaded crawling of the novel
I won't say much about Python's "pseudo multithreading" (the GIL); there are plenty of summaries online. First, let's get to know the threading module we'll be using this time:
```python
import threading   # thread module
import time

def work(worker):
    # A simple task for a thread to run
    print(f"worker {worker} is working")
    time.sleep(5)

# target is the function the thread will run; args is a tuple of arguments,
# so a trailing comma is needed when there is only one argument
thread1 = threading.Thread(target=work, args=(1,))
thread2 = threading.Thread(target=work, args=(2,))
thread1.start()
thread2.start()
```
Note that even though the function sleeps for five seconds, when we run the program the two threads start working almost simultaneously, without one waiting for the other.
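If you do want the main program to wait for a thread to finish, join() is the call for that; the multi-threaded downloader below relies on it. A small sketch:

```python
import threading
import time

def work(worker):
    print(f"worker {worker} is working")
    time.sleep(5)

t = threading.Thread(target=work, args=(1,))
t.start()
t.join()                        # the main thread blocks here until work() returns
print("worker 1 has finished")  # printed only after the five-second sleep
```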
Next is the queue module. Of its three queue types I use the priority queue, queue.PriorityQueue(). In short, every task put into the queue carries an identifier; the smaller the identifier, the earlier the task comes out.
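A quick illustration of how that identifier controls the order in which tasks come out of a PriorityQueue (made-up chapter names, but the same `[identifier, value]` shape the real program uses):

```python
import queue

q = queue.PriorityQueue()
q.put([2, 'chapter 3'])
q.put([0, 'chapter 1'])
q.put([1, 'chapter 2'])
while not q.empty():
    print(q.get())   # comes out as [0, 'chapter 1'], [1, 'chapter 2'], [2, 'chapter 3']
```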
Now for the code. It is long but not complicated; read the comments and take it slowly:
```python
import requests as req
from bs4 import BeautifulSoup as bf
import threading
import queue

address = input('Please enter the save path (e.g. E:/xxx/xxx.txt): ')
url1 = input('Please enter the directory URL of the novel (find it on https://www.bqkan8.com). '
             'Note: it must be a complete address!\nPlease enter: ')

def nove(url1):
    # Fetch the directory page and return all <a> tags (chapter names and links)
    nove_directory_url = req.get(url=url1)
    nove_directory_url = bf(nove_directory_url.content, 'html.parser')
    nove_directory_url = nove_directory_url.find_all('div', class_='listmain')
    nove_directory_url = nove_directory_url[0]
    nove_directory_url = list(nove_directory_url.find_all('a'))
    return nove_directory_url

def nove_directory_url_get(nove_directory_url1):
    # Build the URL of every chapter, skipping the first 12 "latest chapter" links
    nove_directory_url_list = []
    jsq = 0
    for i in nove_directory_url1:
        if jsq > 11:
            nove_directory = 'https://www.bqkan8.com' + i.get('href')
            nove_directory_url_list.append(nove_directory)
        else:
            jsq = jsq + 1
    return nove_directory_url_list

def nove_directory_get(nove_directory):
    # Collect the name of every chapter, again skipping the first 12 links
    nove_directory_list = []
    jsq = 0
    for i in nove_directory:
        if jsq > 11:
            nove_directory1 = i.string
            nove_directory_list.append(nove_directory1)
        else:
            jsq = jsq + 1
    return nove_directory_list

def nove_text(url2):
    # Download the body text of one chapter
    nove_text = req.get(url=url2)
    nove_text = bf(nove_text.content, 'html.parser')
    nove_text = nove_text.find_all('div', id='content')
    nove_text = nove_text[0].text.replace(' ', '\n\n')
    return nove_text

nove_url_and_directory = nove(url1)                                       # chapter <a> tags
nove_directory_url_list = nove_directory_url_get(nove_url_and_directory)  # list of chapter URLs
nove_directory_list = nove_directory_get(nove_url_and_directory)          # list of chapter names

def nove_text_(nove_text_url, nove_title):
    # Download one chapter and store it in the shared dictionary text3:
    # the queue identifier becomes the key and the chapter text the value,
    # so the chapters can be sorted back into order before writing
    global text3
    text1 = nove_title + '\n\n'
    text2 = nove_text_url
    nove_text1 = '\n\n' + text1 + nove_text(text2[1])
    text3[text2[0]] = nove_text1

long = len(nove_directory_list)                    # number of chapters
nove_directory_url_queue = queue.PriorityQueue()   # queue of chapter URLs
nove_directory_list_queue = queue.PriorityQueue()  # queue of chapter names

z = 0
for i in nove_directory_list:
    nove_directory_list_queue.put([z, i])          # chapter name task with its identifier
    z += 1
x = 0
for i in nove_directory_url_list:
    nove_directory_url_queue.put([x, i])           # URL task with its identifier
    x += 1
print('Download queue loading completed')

long2 = 10
while not nove_directory_url_queue.empty():        # keep going until the URL queue is empty
    text3 = {}                                     # collects this batch of ten chapters
    threads = []
    for _ in range(10):                            # a batch of ten download threads
        threads.append(threading.Thread(target=nove_text_,
                                        args=(nove_directory_url_queue.get(),
                                              nove_directory_list_queue.get()[1])))
    for t in threads:
        t.start()
    for t in threads:
        t.join()                                   # wait here until the thread has finished
    r = open(address, 'a', encoding='utf-8')
    for i in sorted(text3):                        # sort the keys so the chapters stay in order
        r.write(text3[i])                          # write this batch to the file
    r.close()
    f = long2 / long
    f = round(f, 4)
    print('Download progress:', f * 100, '%')
    long2 = long2 + 10
print('Download complete!')
```
The key trick here is sorting the dictionary keys, so the chapter texts come back in the right order and the file is written without the chapters getting jumbled.
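All the write-out step relies on is that sorted() over a dictionary returns its keys in ascending order, so iterating the sorted keys gives the chapters back in order (a tiny illustration with made-up values):

```python
text3 = {2: 'chapter 3 text', 0: 'chapter 1 text', 1: 'chapter 2 text'}
for i in sorted(text3):    # keys come out as 0, 1, 2
    print(text3[i])        # so the chapter texts appear in the right order
```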
This program is not perfect. For example, when the number of remaining chapters is not a multiple of ten, the last batch of threads blocks on queue.get() waiting for tasks that never arrive.
But my energy is limited; after all, the college entrance examination is coming. :)
Finally, I hope everyone makes it to the end, and that the power never goes out while your program is still unsaved. 2333