Notes on learning Python crawlers

Continuing from the previous post: Notes on learning Python crawlers (Qingshen's blog - CSDN)

In the last article we successfully wrote a simple crawler, but I wanted to crawl the whole novel, so let's keep going.

This article is divided into three parts:

1. Approach analysis

2. Single-threaded crawling of the whole novel

3. Multi-threaded crawling of the novel

1, Approach analysis

import requests as req
from bs4 import BeautifulSoup as bs
html='https://www.bqkan8.com/1_1496/450365.html' #URL of a single chapter
txt=req.get(url=html) #Request the page
txt=bs(txt.text,'html.parser') #Parse the HTML
txt=txt.find_all('div',id='content') #Grab the div that holds the chapter text
txt=(str(txt).replace('<br/><br/>','')) #Strip the <br/> tags
print(txt.replace('        ','\n\n')) #Turn the indentation spaces into blank lines

This is the code we wrote in the last article. Next, I want to crawl the whole novel. To crawl every chapter we first need each chapter's URL, so the plan is to collect all of the chapter URLs, store them in a list, and pull them out of the list when needed.

2, Single-threaded crawling of the whole novel

OK, now that we have the idea, the next step is to write the code.

First, we write a crawler that fetches the novel's table of contents and filters the result to drop the parts we don't need. The code is as follows:

import requests as req
from bs4 import BeautifulSoup as ds
url1='https://www.bqkan8.com/1_1496' #URL of the novel's table of contents
text1=req.get(url=url1) #Request the directory page
bf=ds(text1.content,'html.parser') #Parse the HTML
text2=bf.find_all('div',class_="listmain") #The chapter list lives in the div with class "listmain"
print(text2)

Running it prints the raw chapter list.

Now let's analyse what we got. First, since we are crawling the whole novel: the directory page lists the 12 latest chapters at the top, and those are out of story order, so we should not use them. Second, we need to extract the chapter names and the chapter URLs separately.

For convenience, I use lists to store all of the URLs and chapter names, using the .string attribute to get the chapter name and the .get() method to get the URL.
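As a quick illustration of those two calls (a minimal sketch using a made-up <a> tag, not fetched from the site), .string gives the link text and .get('href') gives the link target:

from bs4 import BeautifulSoup as ds
tag=ds('<a href="/1_1496/450365.html">Chapter One</a>','html.parser').find('a') #Made-up example tag
print(tag.string)       #Chapter One
print(tag.get('href'))  #/1_1496/450365.html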

Without further ado, let's look at the code:

import requests as req
from bs4 import BeautifulSoup as ds
jsq=0 #Counter used to skip the 12 "latest chapter" links at the top of the directory
url1='https://www.bqkan8.com/1_1496'
text1=req.get(url=url1)
bf=ds(text1.content,'html.parser')
text2=bf.find_all('div',class_="listmain")
text2=text2[0]
a=text2.find_all('a') #Every chapter link is an <a> tag inside the listmain div
list1=[]
url1='https://www.bqkan8.com'
for i in a:
    if jsq>11:
        wz=url1+i.get('href') #Join the site root with the relative chapter link
        list1.append(wz)
    else:
        jsq=jsq+1 #Skip the first 12 links (the "latest chapters" block)
print(list1)

The if/else around the counter is what removes the 12 "latest chapter" links, so the list we end up with starts from the first chapter of the novel. The printed result is a list of chapter URLs.

Now that we have the chapter URLs, we can write the program that crawls the novel text itself. We only need to modify the program above slightly:

import requests as req
from bs4 import BeautifulSoup as ds
jsq=0 #Counter used to skip the 12 "latest chapter" links
url1='https://www.bqkan8.com/1_1496'
text1=req.get(url=url1)
bf=ds(text1.content,'html.parser')
text2=bf.find_all('div',class_="listmain")
text2=text2[0]
a=text2.find_all('a')
r=open('E:/Summer reptile learning/novel.txt','a',encoding='utf-8') #Open the output file in append mode
list1=[];zjjs=0 #zjjs counts how many chapters there are
url1='https://www.bqkan8.com'
for i in a:
    if jsq>11:
        wz=url1+i.get('href')
        list1.append(wz)
        zjjs+=1
    else:
        jsq=jsq+1
jsq=1 #Reuse the counter to track how many chapters have been downloaded
for i in list1:
    nr=req.get(url=i) #Request one chapter page
    text3=ds(nr.content,'html.parser')
    text3=text3.find_all('div',id='content')
    text3=text3[0].text.replace("        ",'\n\n') #Keep only the text and turn the indentation into blank lines
    r.write(text3)
    jd=jsq/zjjs*100;jd=round(jd,2) #Percentage of chapters downloaded so far
    print(f'Download progress: {jd}%')
    jsq+=1
r.close()
print('Download complete!\nProgram exiting')

A for loop goes through the chapter URLs one by one, and open() saves the result to a local file. Then you run it and find that...

Good grief, it takes more than half an hour to download a 13 MB novel; that is painfully inefficient!

So we come to the heart of this article, and the most important step in my crawler learning so far: Thread + queue, the combination of threads and queues.

3, Multi-threaded crawling of the novel

I won't say much about Python's "pseudo multithreading" (the GIL); there are plenty of summaries online. First, let's get to know the threading module we'll be using this time:

import threading #Import the threading module
import time
def work(worker):
    print(f'Worker {worker} is working') #A small function for the threads to run
    time.sleep(5)
thread1=threading.Thread(target=work,args=(1,)) #target is the function the thread will run; args are its arguments, passed as a tuple, hence the trailing comma
thread2=threading.Thread(target=work,args=(2,))
thread1.start()
thread2.start()

Note that work() sleeps for five seconds, yet when we run the program both workers print almost immediately, one right after the other: the two threads run concurrently instead of waiting for each other.
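To see this concretely, here is a minimal timing sketch (my own illustration, reusing the work() function and imports from the example above; it is not part of the crawler):

import time
start=time.time()
thread1=threading.Thread(target=work,args=(1,))
thread2=threading.Thread(target=work,args=(2,))
thread1.start()
thread2.start()
thread1.join() #Wait for both workers to finish before measuring
thread2.join()
print(f'Total time: {round(time.time()-start,2)} seconds') #Roughly 5 seconds, not 10, because the two sleeps overlap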

Next is the queue module. Of the three queue types it offers, I use the priority queue, queue.PriorityQueue(). In short, every task added to the queue carries an identifier, and the smaller the identifier, the earlier the task comes back out.
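Before we get to the full program, here is a tiny standalone sketch of that behaviour (my own illustration, not part of the crawler): no matter what order tasks go in with put(), get() hands them back in order of their identifiers.

import queue
q=queue.PriorityQueue()
q.put([2,'third chapter'])
q.put([0,'first chapter'])
q.put([1,'second chapter'])
while not q.empty():
    print(q.get()) #Prints [0, 'first chapter'], then [1, 'second chapter'], then [2, 'third chapter']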

Now let's look at the code. It is long but not complicated; read the comments and take it slowly:

import requests as req
from bs4 import BeautifulSoup as bf
import threading
import queue
address=input('Please enter the save path (e.g. E:/xxx/xxx.txt): ')
url1=input('Please enter the URL of the novel directory page (find the novel on https://www.bqkan8.com/); note: it must be the complete address!\nPlease enter: ')
def nove(url1):
    nove_directory_url=req.get(url=url1) #Request the directory page
    nove_directory_url=bf(nove_directory_url.content,'html.parser')
    nove_directory_url=nove_directory_url.find_all('div',class_='listmain')
    nove_directory_url=nove_directory_url[0]
    nove_directory_url=list(nove_directory_url.find_all('a'))
    return nove_directory_url       #Return the list of <a> tags holding the chapter names and URLs

def nove_directory_url_get(nove_directory_url1):
    nove_directory_url_list=[];jsq=0
    for i in nove_directory_url1:
        if jsq>11:
            nove_directory='https://www.bqkan8.com'+i.get('href')
            nove_directory_url_list.append(nove_directory)
        else:
            jsq=jsq+1
    return nove_directory_url_list     #Get the URL of each chapter of the novel

def nove_directory_get(nove_directory):
    nove_directory_list=[];jsq=0
    for i in nove_directory:
        if jsq>11:
            nove_directory1=i.string
            nove_directory_list.append(nove_directory1)
        else:
            jsq=jsq+1
    return nove_directory_list  #Acquisition of novel chapter names

def nove_text(url2):
    nove_text=req.get(url=url2)
    nove_text=bf(nove_text.content,'html.parser')
    nove_text=nove_text.find_all('div',id='content')
    nove_text=nove_text[0].text.replace('        ','\n\n')
    return nove_text       #Acquisition of novel text

nove_url_and_directory=nove(url1)#Get the list of <a> tags with the chapter names and links
nove_directory_url_list=nove_directory_url_get(nove_url_and_directory)#Get the list of URLs for each chapter
nove_directory_list=nove_directory_get(nove_url_and_directory)#Get the list of chapter names


def nove_text_(nove_text_url,nove_title):
    text1=(nove_title+'\n\n')
    text2=nove_text_url
    nove_text1='\n\n'+text1+nove_text(text2[1]) #text2[1] is the chapter URL; prepend the chapter title
    text3[text2[0]]=nove_text1 #Store the result in the shared dictionary: the queue identifier becomes the key, the chapter text becomes the value

long=len(nove_directory_list)#Get the number of chapters
nove_directory_url_queue=queue.PriorityQueue()#Create a URL queue
nove_directory_list_queue=queue.PriorityQueue()#Create a chapter name queue
z=0
for i in nove_directory_list:
    nove_directory_list_queue.put([z,i])
    z+=1 #Add each chapter name to the queue together with its identifier

x=0
for i in nove_directory_url_list:
    nove_directory_url_queue.put([x,i])
    x+=1 #Add each chapter URL to the queue together with its identifier
print('Download queue loading completed')

long2=10
while not nove_directory_url_queue.empty(): #Keep looping while the URL queue still has tasks in it
    text3={} #Shared dictionary for this batch: queue identifier -> chapter text
    threads=[]
    for _ in range(10): #Create 10 threads; each takes one chapter URL and one chapter name off the queues
        thread=threading.Thread(target=nove_text_,args=(nove_directory_url_queue.get(),nove_directory_list_queue.get()[1]))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join() #join() makes the main thread wait until that worker has finished
    t=open(address,'a',encoding='utf-8')
    for i in sorted(text3): #Sort the dictionary keys so the chapters are written in story order
        txt=text3[i]
        t.write(txt) #Write this batch to the file
    f=long2/long
    f=round(f,4)
    print('Download progress:',f*100,'%')
    long2=long2+10
    t.close()
print('Download complete!')
print('Download complete!')

The key trick here is sorting the dictionary keys, which lets us write the chapters in the right order even though the threads finish in an unpredictable one.
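As a standalone sketch of that trick (with made-up keys and contents), sorted() returns the dictionary keys in ascending order, so the values come out in the right sequence no matter which thread finished first:

text_demo={2:'chapter two text',0:'chapter zero text',1:'chapter one text'} #Made-up example contents
for i in sorted(text_demo): #sorted() gives the keys in ascending order: 0, 1, 2
    print(text_demo[i])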

This program is not perfect. For example, when fewer than ten chapters remain in the queue, the last round of get() calls has nothing left to take and simply blocks.
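One possible way to patch that (a hedged sketch of the general idea, not something the program above does) is to ask get() not to block when the queue is empty:

import queue
q=queue.PriorityQueue()
q.put([0,'only remaining task']) #Made-up example task
while True:
    try:
        task=q.get(block=False) #Raise queue.Empty instead of waiting forever
    except queue.Empty:
        break                   #No tasks left, so stop cleanly
    print(task)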

But my energy is limited; after all, the college entrance examination is coming :)

Finally, I hope everyone reads to the very end, and that the power never goes out while your program is still unsaved 2333

Keywords: Python crawler
