Before we walk through our crawler, let's first outline what a crawler actually is (after all, this is a tutorial for complete beginners).
What is a web crawler
A web crawler (also known as a web spider or web robot) is a program that simulates a browser, sends web requests, and receives the responses. It automatically grabs information from the Internet according to certain rules.
In principle, anything a browser (the client) can do, a crawler can do too.
Why do we use crawlers
The era of Internet big data has brought convenience to our lives, along with an explosive flood of data on the network.
In the past, we got information from books, newspapers, television, and radio. The amount of information was limited and it had been filtered to some degree, so it was relatively reliable, but the downside was that it covered too narrow a range. That asymmetric flow of information limited our horizons and kept us from learning more.
In the era of Internet big data, we suddenly have free access to information: we obtain enormous amounts of it, but most of it is useless junk.
For example, Sina Weibo generates hundreds of millions of status updates a day, and a casual Baidu search on weight loss turns up on the order of 100,000,000 results.
Out of such a huge mass of information fragments, how do we obtain the information that is actually useful to us?
The answer is filtering!
We collect the relevant content with the right technique, and after analysis and selection we obtain the information we really need.
This work of collecting, analyzing, and integrating information can be applied to a very wide range of fields: life services, travel, financial investment, the product and market demands of all kinds of manufacturing industries, and so on. With this technique we can obtain more accurate and more useful information.
Although "web crawler" sounds like a strange name, and the first image it conjures up is some soft, wriggling creature, it is in fact a sharp tool for making your way through the virtual world.
Crawler preparation
We usually talk about "Python crawlers", but there is a possible misunderstanding here: crawlers are not unique to Python. Many languages can be used to write crawlers, such as PHP, Java, C#, C++, and Python. Python is chosen here because it is relatively simple and its ecosystem is complete.
First, we need to download Python. I downloaded the latest official release, version 3.8.3.
Secondly, we need an environment in which to run Python. I use PyCharm.
It can also be downloaded from its official website.
We also need some libraries to support the crawler (a few of them may already ship with Python).
These are roughly the libraries we will use; being conscientious, I have already written comments for them later in the code.
(The libraries above are not the only ones you might need while running a crawler; it depends on how you write yours. In any case, whenever you need a library, you can install it directly through PyCharm's Settings.)
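As a quick sanity check (a small sketch of my own, not part of the project code), you can verify from Python itself that the needed libraries can be imported; bs4 and xlwt come from pip, while re, urllib, and sqlite3 ship with the standard library:

# Minimal sketch: check that the libraries the crawler relies on can be imported.
# bs4 and xlwt are third-party (pip install beautifulsoup4 xlwt);
# re, urllib and sqlite3 are part of the standard library.
import importlib

for name in ("bs4", "xlwt", "re", "urllib.request", "sqlite3"):
    try:
        importlib.import_module(name)
        print(name, "is available")
    except ImportError:
        print(name, "is missing - install it via pip or PyCharm's Settings")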
Explaining the crawler project
What I wrote is a crawler that scrapes Douban's Top 250 highest-rated films.
This is the website we want to crawl: https://movie.douban.com/top250
I have already finished the crawl; here is what the result looks like. I save the crawled content in an .xls file.
The content we crawl is: the movie detail link, the picture link, the Chinese title, the foreign title, the score, the number of ratings, the overview, and related information.
Code analysis
First, here is the full code; afterwards I will analyze it step by step, following the code.
Let me explain and analyze the code from top to bottom.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup         # Web page parsing and data extraction
import re                             # Regular expressions for text matching
import urllib.request, urllib.error   # Build URL requests and fetch web page data
import xlwt                           # Excel operations
# import sqlite3                      # SQLite database operations

# Compiled regular expressions: the matching rules for each piece of information
findLink = re.compile(r'<a href="(.*?)">')                  # Rule for the movie detail link
findImgSrc = re.compile(r'<img.*src="(.*?)"', re.S)         # Rule for the poster image link
findTitle = re.compile(r'<span class="title">(.*)</span>')  # Rule for the title
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')  # Rule for the score
findJudge = re.compile(r'<span>(\d*)人评价</span>')          # Rule for the number of ratings ("人评价" = "people rated")
findInq = re.compile(r'<span class="inq">(.*)</span>')      # Rule for the one-line overview
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)         # Rule for the related information


def main():
    baseurl = "https://movie.douban.com/top250?start="   # The page link to crawl
    # 1. Crawl the web pages
    datalist = getData(baseurl)
    savepath = "Douban film Top250.xls"   # The .xls file is created in the current directory
    # dbpath = "movie.db"                 # The database would be created in the current directory
    # 3. Save the data
    saveData(datalist, savepath)          # Only one of the two storage methods can be used
    # saveData2DB(datalist, dbpath)


# Crawl the web pages
def getData(baseurl):
    datalist = []                  # Used to store the crawled web page information
    for i in range(0, 10):         # Fetch the page information 10 times
        url = baseurl + str(i * 25)
        html = askURL(url)         # Save the fetched web page source code

        # 2. Parse the data item by item
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', class_="item"):   # Find the blocks that match
            data = []              # Save all the information about one movie
            item = str(item)

            link = re.findall(findLink, item)[0]            # Extract with the regular expression
            data.append(link)

            imgSrc = re.findall(findImgSrc, item)[0]
            data.append(imgSrc)

            titles = re.findall(findTitle, item)
            if len(titles) == 2:
                ctitle = titles[0]                          # Chinese title
                data.append(ctitle)
                otitle = titles[1].replace("/", "")         # Foreign title, with the separator removed
                data.append(otitle)
            else:
                data.append(titles[0])
                data.append(' ')

            rating = re.findall(findRating, item)[0]
            data.append(rating)

            judgeNum = re.findall(findJudge, item)[0]
            data.append(judgeNum)

            inq = re.findall(findInq, item)
            if len(inq) != 0:
                inq = inq[0].replace("。", "")              # Remove the trailing full stop
                data.append(inq)
            else:
                data.append(" ")

            bd = re.findall(findBd, item)[0]
            bd = re.sub('<br(\s+)?/>(\s+)?', "", bd)        # Remove <br/>
            bd = re.sub('/', "", bd)                        # Remove the separators
            data.append(bd.strip())

            datalist.append(data)
    return datalist


# Get the content of the web page at a given URL
def askURL(url):
    head = {
        # Simulate browser header information to send a request to the Douban server.
        # The User-Agent tells the server what kind of machine and browser we are
        # (essentially, what kind of content we are able to receive).
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
    }
    request = urllib.request.Request(url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html


# Save the data to an Excel table
def saveData(datalist, savepath):
    print("save.......")
    book = xlwt.Workbook(encoding="utf-8", style_compression=0)            # Create the workbook object
    sheet = book.add_sheet('Douban film Top250', cell_overwrite_ok=True)   # Create the worksheet
    col = ("Movie detail link", "Picture link", "Chinese title", "Foreign title",
           "Score", "Number of ratings", "Overview", "Related info")
    for i in range(0, 8):
        sheet.write(0, i, col[i])            # Column headers
    for i in range(0, 250):
        # print("Record %d" % (i + 1))       # Output statement for testing
        data = datalist[i]
        for j in range(0, 8):
            sheet.write(i + 1, j, data[j])   # Data
    book.save(savepath)                      # Save


# Save the data to the database (unused; choose either this or saveData)
# def saveData2DB(datalist, dbpath):
#     init_db(dbpath)
#     conn = sqlite3.connect(dbpath)
#     cur = conn.cursor()
#     for data in datalist:
#         for index in range(len(data)):
#             if index == 4 or index == 5:
#                 continue
#             data[index] = '"' + data[index] + '"'
#         sql = '''
#             insert into movie250(
#             info_link, pic_link, cname, ename, score, rated, instroduction, info)
#             values (%s)''' % ",".join(data)
#         # print(sql)   # Output the query statement for testing
#         cur.execute(sql)
#         conn.commit()
#     cur.close()
#     conn.close()


# def init_db(dbpath):
#     sql = '''
#         create table movie250(
#         id integer primary key autoincrement,
#         info_link text,
#         pic_link text,
#         cname varchar,
#         ename varchar,
#         score numeric,
#         rated numeric,
#         instroduction text,
#         info text
#         )
#     '''   # Create the data table
#     conn = sqlite3.connect(dbpath)
#     cursor = conn.cursor()
#     cursor.execute(sql)
#     conn.commit()
#     conn.close()


if __name__ == "__main__":   # When the program is executed
    # Call the main function
    main()
    # init_db("movietest.db")
    print("Crawling finished!")
The first line, # -*- coding: utf-8 -*-, sets the file encoding to UTF-8; it is written at the very top to prevent garbled characters.
Below that, the import statements pull in the libraries we need and get everything ready. (I did not use sqlite3, so I commented it out.)
The regular expressions that follow, whose names all start with find, are used to filter out the information we want.
(Regular expressions from the re library are optional here; they are not strictly required.)
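For example, here is a minimal sketch (using a made-up HTML fragment shaped like one Douban item block, not the real page) of how the same fields could be pulled out with BeautifulSoup alone, without any regular expressions:

from bs4 import BeautifulSoup

# Hypothetical fragment modeled on one <div class="item"> block of the page
snippet = '''
<div class="item">
  <a href="https://movie.douban.com/subject/1292052/"><img src="poster.jpg"></a>
  <span class="title">The Shawshank Redemption</span>
  <span class="rating_num" property="v:average">9.7</span>
</div>
'''

soup = BeautifulSoup(snippet, "html.parser")
item = soup.find("div", class_="item")
print(item.a["href"])                                 # detail link
print(item.find("span", class_="title").string)       # title
print(item.find("span", class_="rating_num").string)  # score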
The general process is divided into three steps:
1. Crawl web pages
2. Parse the data item by item
3. Save the data
Let's start with step 1, crawling the web pages. baseurl is the address of the site we want to crawl. Further down, getData(baseurl) is called.
Let's look at the getData method.
for i in range(0, 10):         # Fetch the page information 10 times
    url = baseurl + str(i * 25)
This part may not be obvious at first; in fact, it works like this:
Because we want the Top 250 films and each page only displays 25, we need to visit the page 10 times: 25 * 10 = 250.
baseurl = "https://movie.douban.com/top250?start="
As long as we append a number to the baseurl, we jump to the corresponding page; for example, when i = 1:
https://movie.douban.com/top250?start=25
I have left it as a hyperlink, so you can click it and see which page it jumps to; after all, practice is the best teacher.
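A tiny sketch makes the pattern obvious: printing the loop's URLs shows start going 0, 25, 50, ... up to 225.

baseurl = "https://movie.douban.com/top250?start="
for i in range(0, 10):
    print(baseurl + str(i * 25))
# https://movie.douban.com/top250?start=0
# https://movie.douban.com/top250?start=25
# ...
# https://movie.douban.com/top250?start=225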
Then askURL is called to request the page; this is the core method for requesting a web page.
To save you scrolling back up, I'll paste the code here again so you get an intuitive feel for it:
def askURL(url):
    head = {
        # Simulate browser header information to send a request to the Douban server.
        # The User-Agent tells the server what kind of machine and browser we are
        # (essentially, what kind of content we are able to receive).
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
    }
    request = urllib.request.Request(url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html
This askURL is what sends the request to the page. Some of you may ask: why do I write a head (request header) here?
Because if we don't, some websites will recognize us as a crawler and, when we visit them, respond with an error and an error code:
418
This is a running joke; you can Baidu it:
418 I'm a teapot
The HTTP 418 I'm a teapot client error response code indicates that the server refuses to brew coffee because it is a teapot. This error is a reference to the Hyper Text Coffee Pot Control Protocol, which was an April Fools' joke in 1998.
I am a teapot
So we need to "disguise" ourselves as a browser so that we won't be recognized; in other words, we put on a fake identity.
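If you want to see the difference for yourself, here is a rough sketch (separate from the project code; the exact status codes depend on how Douban happens to respond at the time) comparing a bare request with one that carries a User-Agent header:

import urllib.request, urllib.error

url = "https://movie.douban.com/top250"

# Bare request: Douban is likely to reject it, typically with HTTP 418.
try:
    urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    print("without a header:", e.code)

# Request with a browser-like User-Agent: should normally return 200.
head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"}
req = urllib.request.Request(url, headers=head)
print("with a header:", urllib.request.urlopen(req).getcode())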
Come on, let's keep going,
html = response.read().decode("utf-8")
This line reads the content we fetched from the page and decodes it as UTF-8 to prevent garbled characters.
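In other words, response.read() gives us raw bytes, and decode turns them into text. A toy illustration (the bytes here are made up rather than fetched):

raw = "肖申克的救赎".encode("utf-8")   # pretend these bytes came from response.read()
print(type(raw))                       # <class 'bytes'>
print(raw.decode("utf-8"))             # 肖申克的救赎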
Once the request succeeds, we come to the second step:
2. Parse the data item by item
Here we use the BeautifulSoup library to parse the data; no matter what kind of crawler you write, this library is almost indispensable.
Now we start looking for the data that meets our requirements, using BeautifulSoup's methods together with regular expressions from the re library to do the matching:
findLink = re.compile(r'<a href="(.*?)">')    # Create the regular expression object: the rule for the movie detail link
findImgSrc = re.compile(r'<img.*src="(.*?)"', re.S)
findTitle = re.compile(r'<span class="title">(.*)</span>')
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
findJudge = re.compile(r'<span>(\d*)人评价</span>')   # "人评价" = "people rated" on the Chinese page
findInq = re.compile(r'<span class="inq">(.*)</span>')
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)
The data that matches our rules is appended to datalist, so datalist ends up holding all the data we need.
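To make the matching concrete, here is a small sketch (with a hypothetical item string; the real one comes from str(item) inside getData) showing how re.findall applies two of the compiled rules:

import re

findTitle = re.compile(r'<span class="title">(.*)</span>')
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')

item = '''<div class="item">
<span class="title">The Shawshank Redemption</span>
<span class="rating_num" property="v:average">9.7</span>
</div>'''

print(re.findall(findTitle, item))    # ['The Shawshank Redemption']
print(re.findall(findRating, item))   # ['9.7']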
The last step:
3. Save the data
# 3. Save the data
saveData(datalist, savepath)      # Only one of the two storage methods can be used
# saveData2DB(datalist, dbpath)
You can choose to save the data to an .xls spreadsheet, which requires the xlwt library,
or you can save the data to a SQLite database, which requires the sqlite3 library.
Here I chose to save to the .xls spreadsheet, which is why so much of the code is commented out: the commented part is the code for saving to the SQLite database. Choose one of the two.
The main method for saving to .xls is saveData (the saveData2DB method below it saves to the SQLite database):
# Save the data to an Excel table
def saveData(datalist, savepath):
    print("save.......")
    book = xlwt.Workbook(encoding="utf-8", style_compression=0)            # Create the workbook object
    sheet = book.add_sheet('Douban film Top250', cell_overwrite_ok=True)   # Create the worksheet
    col = ("Movie detail link", "Picture link", "Chinese title", "Foreign title",
           "Score", "Number of ratings", "Overview", "Related info")
    for i in range(0, 8):
        sheet.write(0, i, col[i])            # Column headers
    for i in range(0, 250):
        # print("Record %d" % (i + 1))       # Output statement for testing
        data = datalist[i]
        for j in range(0, 8):
            sheet.write(i + 1, j, data[j])   # Data
    book.save(savepath)                      # Save
Create the worksheet and its column headers (the file will be created in the current directory):
sheet = book.add_sheet('Douban film Top250', cell_overwrite_ok=True)   # Create the worksheet
col = ("Movie detail link", "Picture link", "Chinese title", "Foreign title",
       "Score", "Number of ratings", "Overview", "Related info")
Then the data in datalist is written in, row by row.
Finally, after a successful run, a file like this is generated on the left, in the project directory.
Open it and check whether it is the result we wanted.
Yes, it is!
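If you would rather check from code than open the file by hand, here is a small verification sketch (it assumes the xlrd package is installed, which is an extra dependency not used by the crawler itself) that reads the sheet back:

import xlrd

book = xlrd.open_workbook("Douban film Top250.xls")
sheet = book.sheet_by_index(0)
print(sheet.nrows, "rows")     # 251: one header row plus 250 movies
print(sheet.row_values(0))     # the column headers
print(sheet.row_values(1))     # the first movie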