Section 14.9 Using urllib.request + BeautifulSoup in Python to obtain basic information about a URL request

After the content of a URL has been read with urllib.request and parsed with BeautifulSoup, basic information about the HTML document and about the request itself can be read from the resulting objects. As an example, the code below reads and parses the blog post "Section 14.6: Using Python urllib.request to simulate a browser accessing web pages":

>>> from bs4 import BeautifulSoup
>>> import urllib.request
>>> def getURLinf(url):
...     # Browser-like User-Agent header sent with the request
...     header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'}
...     req = urllib.request.Request(url=url, headers=header)
...     resp = urllib.request.urlopen(req, timeout=5)
...     html = resp.read().decode()  # decode() with no argument assumes UTF-8
...     # Parse the page with the lxml backend (lxml must be installed)
...     soup = BeautifulSoup(html, 'lxml')
...     return (soup, req, resp)
...
>>> soup, req, resp = getURLinf(r'https://blog.csdn.net/LaoYuanPython/article/details/100629947')

Basic information available includes:
1. Document Title

>>> soup.title
<title>Section 14.6 uses Python urllib.request to simulate the browser's implementation code for accessing web pages - Python - CSDN blog</title>

2. Whether the document is an XML document

>>> soup.is_xml
False
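The first two items can also be tried without network access. The following is a minimal sketch under stated assumptions: it parses a made-up HTML string with Python's built-in 'html.parser' backend in place of 'lxml', and `sample_html` and `title_text` are illustrative names that do not appear in the original code. It shows that `soup.title` is a tag object (or None when the page has no title), so the plain text has to be read from it separately.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used here instead of a downloaded page.
sample_html = "<html><head><title> Example page </title></head><body><p>Hi</p></body></html>"

# 'html.parser' ships with Python; the original code uses 'lxml' instead.
soup = BeautifulSoup(sample_html, "html.parser")

def title_text(soup):
    """Return the stripped title text, or None when there is no usable <title>."""
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()

print(soup.title.name)    # name of the tag object returned by soup.title
print(title_text(soup))   # text inside the tag, without surrounding spaces
print(soup.is_xml)        # whether the parser treated the input as XML
```

Checking for None first avoids an AttributeError on pages that have no title element.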

3. URL of the document

>>> req.full_url
'https://blog.csdn.net/LaoYuanPython/article/details/100629947'
>>> resp.geturl()
'https://blog.csdn.net/LaoYuanPython/article/details/100629947'
>>> resp.url
'https://blog.csdn.net/LaoYuanPython/article/details/100629947'
>>>

4. The host where the document is located

>>> req.host
'blog.csdn.net'
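The URL and host attributes belong to the Request object, so they can be inspected before anything is sent. The sketch below runs offline and only builds the request; the use of `urlsplit` is an addition that is not in the original code. Note that `req.full_url` is the URL that was requested, while `resp.geturl()` is the URL that was finally retrieved, so the two can differ when the server redirects.

```python
import urllib.request
from urllib.parse import urlsplit

url = "https://blog.csdn.net/LaoYuanPython/article/details/100629947"

# Building a Request parses the URL; nothing is sent over the network here.
req = urllib.request.Request(url=url)

print(req.full_url)   # the URL passed to the constructor
print(req.type)       # URL scheme
print(req.host)       # host part of the URL
print(req.selector)   # path part that is sent to the server

# urlsplit extracts the same pieces from any URL string,
# for example the value returned by resp.geturl().
parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path)
```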

5. Request Header Information

>>> req.header_items()
[('Host', 'blog.csdn.net'), ('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36')]
>>>
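Two details of the output above can be checked offline. The sketch below assumes CPython's urllib behavior: header names are stored in capitalized form ('User-agent'), and the 'Host' entry is added by urllib when the request is sent, so it is absent from a request that has not been opened yet. The `ua` value is a shortened placeholder, not the header string from the original code.

```python
import urllib.request

ua = "Mozilla/5.0 (example)"  # placeholder User-Agent string
req = urllib.request.Request("https://blog.csdn.net/", headers={"User-Agent": ua})

# Before the request is sent, only the headers we supplied are present.
print(req.header_items())

# Lookups use the stored, capitalized form of the header name.
print(req.has_header("User-agent"))
print(req.has_header("User-Agent"))
print(req.get_header("User-agent"))
```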

6. Response status code

>>> resp.getcode()
200
>>>
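`resp.getcode()` is only reached when the request succeeds: for 4xx and 5xx responses `urllib.request.urlopen` raises `urllib.error.HTTPError`, and the status code is then available as `err.code`. The following is a minimal sketch of reading the code in both cases; `status_of` and `phrase_for` are hypothetical helper names introduced for illustration.

```python
from http import HTTPStatus
import urllib.error
import urllib.request

def phrase_for(code):
    """Return the standard reason phrase for a status code, or '' if unknown."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        return ""

def status_of(url, timeout=5):
    """Return (code, phrase) for url, including error responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            code = resp.getcode()
    except urllib.error.HTTPError as err:
        # Error responses are raised as exceptions that carry the code.
        code = err.code
    return code, phrase_for(code)

print(phrase_for(200))
print(phrase_for(404))
```

Calling `status_of` needs network access; failures such as DNS errors or timeouts raise `urllib.error.URLError` or `OSError`, which this sketch deliberately does not catch.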

7. HTTP response header information

>>> resp.headers.items()
[('Date', 'Sun, 08 Sep 2019 15:07:12 GMT'), ('Content-Type', 'text/html; charset=UTF-8'), ('Transfer-Encoding', 'chunked'), ('Connection', 'close'), ('Set-Cookie', 'acw_tc=2760828215679552322374611eb7315abdcfe4ee6f7af5d157db5621c4267d;path=/;HttpOnly;Max-Age=2678401'), ('Server', 'openresty'), ('Vary', 'Accept-Encoding'), ('Set-Cookie', 'uuid_tt_dd=10_19729129290-1567955232238-614052; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;'), ('Set-Cookie', 'dc_session_id=10_1567955232238.557324; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;'), ('Vary', 'Accept-Encoding'), ('Strict-Transport-Security', 'max-age=86400')]
>>>
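The header list above contains repeated names ('Set-Cookie', 'Vary') and a charset inside 'Content-Type'. `resp.headers` is an `http.client.HTTPMessage`, so both can be read with its message methods. The sketch below builds such an object by hand with made-up values so that it runs offline; with a real response the same calls would be made on `resp.headers`.

```python
from http.client import HTTPMessage

# Hand-built stand-in for resp.headers, with illustrative values.
headers = HTTPMessage()
headers["Content-Type"] = "text/html; charset=UTF-8"
headers["Set-Cookie"] = "a=1; Path=/"
headers["Set-Cookie"] = "b=2; Path=/"   # assignment appends, it does not replace

print(headers["Set-Cookie"])            # indexing returns the first value only
print(headers.get_all("Set-Cookie"))    # get_all returns every value

# Use the declared charset for decoding, falling back to UTF-8.
charset = headers.get_content_charset() or "utf-8"
body = "café".encode(charset)           # stand-in for resp.read()
print(body.decode(charset))
```

Decoding with the declared charset is a small refinement over calling `decode()` with no argument, which always assumes UTF-8.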

This section showed the basic information that is readily available after reading a URL with urllib.request and parsing it with BeautifulSoup: the document title and type, the URL and host, the request headers, the response status code and the response headers. Together these give a short summary of the request.

Old Ape Python: learn Python with Old Ape!
Blog address: https://blog.csdn.net/LaoYuanPython
Old Ape Python blog article directory: https://blog.csdn.net/LaoYuanPython/article/details/98245036

Keywords: Python encoding Windows xml

Added by tonbah on Tue, 08 Oct 2019 00:53:59 +0300
