Python crawler: writing the simplest web crawler

Knowledge is like scraps of cloth: remember to "sew them together" before they can make a fine garment.

Recently, I have developed a strong interest in Python crawlers. Here I share my learning path and welcome your suggestions, so we can exchange ideas and make progress together.

1. Development tools

The tool I use is Sublime Text 3, which is small and nimble; I'm quite taken with it and recommend it to you. Of course, if your computer is well configured, PyCharm may suit you better.
For setting up a Python development environment in Sublime Text 3, see this blog post:
[Sublime builds a Python development environment] (http://www.cnblogs.com/codefish/p/4806849.html)

2. Introduction to Crawlers

As the name implies, a crawler is like a bug crawling across the Internet, fetching the content we want along the way.
To crawl the web, we need to know the URL, formally the "Uniform Resource Locator" and informally a "link". Its structure consists of three parts:
(1) Protocol: such as the HTTP protocol commonly used on the Web.
(2) Domain name or IP address: a domain name such as www.baidu.com, or the IP address that the domain name resolves to.
(3) Path: directory or file, etc.
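The three parts above can be pulled out of a URL programmatically with urllib.parse, which we will meet again in the module overview below. A minimal sketch (the example URL is just an illustration):

```python
from urllib.parse import urlparse

# split a URL into its components
parts = urlparse("http://www.baidu.com/img/logo.png")
print(parts.scheme)  # protocol: 'http'
print(parts.netloc)  # domain name: 'www.baidu.com'
print(parts.path)    # path: '/img/logo.png'
```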

3. Developing the simplest crawler with urllib

(1) Introduction to urllib

Module — Description
urllib.error — Exception classes raised by urllib.request.
urllib.parse — Parse URLs into components, or assemble them from components.
urllib.request — Extensible library for opening URLs.
urllib.response — Response classes used by urllib.
urllib.robotparser — Load a robots.txt file and answer questions about the fetchability of other URLs.
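As a quick taste of the last module in the table, urllib.robotparser answers "may I fetch this URL?" questions. This sketch feeds it a robots.txt body directly as a list of lines (the rules and URLs are illustrative), so no network access is needed:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# parse() accepts the robots.txt contents as a list of lines,
# instead of fetching them over the network with read()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))         # True
```

A polite crawler checks can_fetch() before requesting a page.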

(2) Developing the simplest crawler

Baidu's homepage is simple and clean, which makes it well suited for our first crawler.
The crawler code is as follows:

from urllib import request

def visit_baidu():
    URL = "http://www.baidu.com"
    # open the URL
    req = request.urlopen(URL)
    # read the response body
    html = req.read()
    # decode the bytes into a utf-8 string
    html = html.decode("utf-8")
    print(html)

if __name__ == '__main__':
    visit_baidu()
Running it prints the HTML source of the page.

We can verify the result by right-clicking a blank area of the Baidu homepage and choosing "Inspect Element" to compare against the page source.
Of course, request can also generate a request object, which can be opened using the urlopen method.
The code is as follows:

from urllib import request

def visit_baidu():
    # create a Request object
    req = request.Request('http://www.baidu.com')
    # open the request object
    response = request.urlopen(req)
    # read the response
    html = response.read()
    html = html.decode('utf-8')
    print(html)

if __name__ == '__main__':
    visit_baidu()
The result is the same as before.
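One practical reason to build a Request object explicitly is that you can attach request headers, such as a User-Agent, before opening it. A minimal sketch (the header value here is just an illustration, and we only inspect the object rather than sending it):

```python
from urllib import request

# a Request object can carry custom headers
req = request.Request(
    "http://www.baidu.com",
    headers={"User-Agent": "Mozilla/5.0 (simple-crawler)"},
)
# urllib stores header names with only the first letter capitalized
print(req.get_header("User-agent"))  # Mozilla/5.0 (simple-crawler)
print(req.full_url)                  # http://www.baidu.com
```

Passing the object to request.urlopen(req) then sends the request with those headers attached.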

(3) Error handling

Error handling in urllib is done through the urllib.error module, mainly with the URLError and HTTPError exceptions. HTTPError is a subclass of URLError, which means an HTTPError can also be caught as a URLError.
An HTTPError carries the HTTP status code in its code attribute.
The code for handling HTTP Error is as follows:
from urllib import request
from urllib import error

def Err():
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)

    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        print(e.code)
if __name__ == '__main__':
    Err()

Running it prints:

404

404 is the HTTP status code ("Not Found"); you can search for the details of what each code means.

URLError can be captured by its reason attribute.
The code for handling URLError is as follows:
from urllib import request
from urllib import error

def Err():
    url = "https://segmentf.com/"
    req = request.Request(url)

    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.URLError as e:
        print(e.reason)
if __name__ == '__main__':
    Err()
Running it prints the reason for the failure; here that is a name-resolution error, since the domain does not exist.

Since we are handling errors anyway, it is best to catch both exceptions in the code; the more detailed the handling, the better. Note that HTTPError is a subclass of URLError, so the HTTPError clause must come before the URLError one; otherwise every error would be reported as a URLError, e.g. a 404 would be printed as "Not Found" instead of its code.
The code is as follows:
from urllib import request
from urllib import error

# Catch both HTTPError and URLError
def Err():
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)

    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        print(e.code)
    except error.URLError as e:
        print(e.reason)

if __name__ == '__main__':
    Err()
You can change the URL to see the various forms of error output.


It isn't easy for a newcomer to get this far. If you found something worthwhile here, please don't be stingy with your appreciation.


Added by twostars on Thu, 06 Jun 2019 22:50:41 +0300