Crawler Learning: Scraping Weather Data (a hands-on crawler exercise)

1. Overall framework and fetching the web pages:

# Data visualization
from pyecharts import Bar
# For sending HTTP requests
import requests
# For parsing the fetched HTML
from bs4 import BeautifulSoup

# Holds the scraped data
data = []

def parse_data(url):
    headers = {
        'User-Agent':"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400"
    }
    rest = requests.get(url=url, headers=headers)  # fetch the page with requests.get
    # rest.text often comes back garbled here, so decode the raw bytes instead
    text = rest.content.decode('utf-8')  # decode as UTF-8 so the HTML can be parsed
    soup = BeautifulSoup(text, 'html5lib')  # BeautifulSoup needs the text and a parser


def main():
    url = "http://www.weather.com.cn/textFC/hb.shtml"
    parse_data(url)

if __name__ == '__main__':
    main()

The parse_data function fetches a page and parses the weather data out of it.

The headers (in particular User-Agent) can be copied from your browser's developer tools on any web page.

Error-prone point: after fetching a page with requests.get, the usual way to read the body is the text attribute, but here it comes back garbled, because requests falls back to ISO-8859-1 decoding when the response headers do not declare a charset.

* So read the raw bytes via the content attribute and decode them explicitly with decode('utf-8').
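
To see what is happening, you can compare the encoding requests guessed with the one it detects from the body. A minimal sketch (same hb.shtml page as below; encoding and apparent_encoding are part of the requests API):

import requests

rest = requests.get("http://www.weather.com.cn/textFC/hb.shtml")
print(rest.encoding)           # often 'ISO-8859-1' when no charset header is sent
print(rest.apparent_encoding)  # encoding detected from the body, e.g. 'utf-8'

# Option 1: decode the raw bytes yourself (what this post does)
text = rest.content.decode('utf-8')

# Option 2: correct the encoding first, then rest.text works
rest.encoding = 'utf-8'
text = rest.text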

For data parsing I use the bs4 library (lxml is the other common option, but I find it less convenient here). As the parser I pass html5lib, which is more fault-tolerant with malformed HTML.
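
A quick illustration of that fault tolerance (requires the html5lib package, pip install html5lib):

from bs4 import BeautifulSoup

broken = "<table><tr><td>Beijing<td>19"  # unclosed tags, common on real pages
soup = BeautifulSoup(broken, 'html5lib')  # html5lib repairs the tree the way a browser would
print([td.get_text() for td in soup.find_all('td')])  # ['Beijing', '19']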

2. Parsing the crawled page:

# Extract the data
    cons = soup.find('div', attrs={'class':'conMidtab'})  # first conMidtab = today's weather
    tables = cons.find_all('table')  # one table per province
    for table in tables:
        trs = table.find_all('tr')[2:]  # skip the two header rows
        for index,tr in enumerate(trs):
            if index == 0:
                # first city of a province: name in the 2nd td, max temp in the 5th
                tds = tr.find_all('td')[1]
                qiwen = tr.find_all('td')[4]
            else:
                # remaining cities: name in the 1st td, max temp in the 4th
                tds = tr.find_all('td')[0]
                qiwen = tr.find_all('td')[3]
            city = list(tds.stripped_strings)[0]  # first text fragment = city name
            wendu = list(qiwen.stripped_strings)[0]  # first text fragment = max temperature
            data.append({'City':city, 'Maximum Temperature':wendu})

The bs4 library is used mainly through the find and find_all methods (see my previous post for details).

The find method returns the first element matching the given conditions; attrs={} lets you filter on arbitrary attributes. In the code I use attrs={'class':'conMidtab'}, which can also be written class_='conMidtab'.
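
For reference, the two spellings select the same element; a small sketch:

# attrs={} is the general form; class_= is the shorthand
# (class is a Python keyword, hence the trailing underscore)
cons = soup.find('div', attrs={'class': 'conMidtab'})
cons = soup.find('div', class_='conMidtab')

# find returns only the FIRST match; find_all returns a list of all matches
all_tabs = soup.find_all('div', class_='conMidtab')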

View the source code of the web page for information

Locate the table holding the required information via 'class':'conMidtab'.

Further analysis: there are several conMidtab divs on the page; testing shows that they correspond to the weather for today, tomorrow, the day after, and so on.

We want today's weather, so we take the first conMidtab; soup.find("div", class_="conMidtab") returns exactly that first match.

Inside that conMidtab, each div with class="conMidtab2" holds the weather information for one province.

Since all of the weather data is stored in tables, we can simply collect every table: cons.find_all('table').

In each table, the city rows start at the third tr (the first two are header rows), so for each table we take trs = table.find_all("tr")[2:].

Error-prone point: for the first city of each province, the city name sits in the second td of its row (the first td is occupied by the province name), while the other cities of the province have it in the first td, hence the if/else.

The enumerate function yields an index along with each item, so while traversing trs we know that index == 0 is the first row.
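
A tiny demonstration of enumerate:

rows = ['first-city-row', 'other-row', 'other-row']
for index, row in enumerate(rows):
    print(index, row)  # prints: 0 first-city-row, 1 other-row, 2 other-row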

Next analysis: City name: for the first city in each province it sits in the second td of the row; for the other cities in the province, in the first td.

Maximum temperature: for the first city in each province it sits in the fifth td; for the other cities, in the fourth td.

So use:

if index == 0:
    tds = tr.find_all('td')[1]
    qiwen = tr.find_all('td')[4]
else:
    tds = tr.find_all('td')[0]
    qiwen = tr.find_all('td')[3]
Finally, stripped_strings extracts the text fragments, and the result is appended to the data list.
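
stripped_strings yields each piece of text inside a tag with surrounding whitespace removed, which is why [0] picks out the first fragment. A sketch on a made-up td:

from bs4 import BeautifulSoup

td = BeautifulSoup('<td> <a href="#">Beijing</a> <span>19</span> </td>',
                   'html.parser').find('td')
print(list(td.stripped_strings))     # ['Beijing', '19']
print(list(td.stripped_strings)[0])  # 'Beijing' - the first text fragment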

3. Data acquisition for all cities:
def main():
    # one page per region: North, Northeast, East, Central, South,
    # Northwest, Southwest China, and Hong Kong/Macao/Taiwan
    urls = [
        "http://www.weather.com.cn/textFC/hb.shtml",
        "http://www.weather.com.cn/textFC/db.shtml",
        "http://www.weather.com.cn/textFC/hd.shtml",
        "http://www.weather.com.cn/textFC/hz.shtml",
        "http://www.weather.com.cn/textFC/hn.shtml",
        "http://www.weather.com.cn/textFC/xb.shtml",
        "http://www.weather.com.cn/textFC/xn.shtml",
        "http://www.weather.com.cn/textFC/gat.shtml"
    ]
    for url in urls:
        parse_data(url)  

The main function is modified to loop over all eight regional pages, so the crawl covers the whole country.
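
One optional hardening step (my addition, not part of the original code): pause briefly between requests and skip a page that fails, using only requests and the standard library:

import time

for url in urls:
    try:
        parse_data(url)
    except requests.RequestException as e:
        print('skipping', url, e)  # one bad page should not kill the whole run
    time.sleep(1)                  # be polite to the server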

4. Sort the data to find the top 10 cities with the highest temperature in the country:

# Sort to find the ten cities with the highest temperatures
# Sort ascending by temperature (as an integer)
data.sort(key=lambda x:int(x['Maximum Temperature']))
# The last ten entries are the ten hottest cities
data_2 = data[-10:]

When sorting, note that the temperature must be converted to int; otherwise the values are compared as strings, which gives the wrong order.
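
A quick illustration of why the int conversion matters:

temps = ['9', '23', '15']
print(sorted(temps))           # ['15', '23', '9']  - compared as strings
print(sorted(temps, key=int))  # ['9', '15', '23']  - compared as numbers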

5. Data visualization:

citys = list(map(lambda x:x['City'], data_2))  # x-axis values
wendu = list(map(lambda x:x['Maximum Temperature'], data_2))  # y-axis values
charts = Bar('Top 10 Hottest Cities in China')
charts.add('', citys, wendu)
charts.render('Weather Network.html')

Using the Bar class:

The Bar constructor sets the chart's title.

The add method takes (series/legend name, x-axis values, y-axis values).

render saves the chart to a local HTML file.
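
Note that this is the old pyecharts 0.5.x API. In pyecharts 1.x and later, Bar lives in pyecharts.charts and the calls are different; a rough equivalent, assuming pyecharts >= 1.0:

from pyecharts import options as opts
from pyecharts.charts import Bar

bar = (
    Bar()
    .add_xaxis(citys)
    .add_yaxis('', [int(w) for w in wendu])  # convert so the axis is numeric
    .set_global_opts(title_opts=opts.TitleOpts(title='Top 10 Hottest Cities in China'))
)
bar.render('Weather Network.html')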

Results display: the rendered bar chart of the ten hottest cities (screenshot omitted).

Full code:

# Data visualization
from pyecharts import Bar
# For sending HTTP requests
import requests
# For parsing the fetched HTML
from bs4 import BeautifulSoup

# Holds the scraped data
data = []


def parse_data(url):
    headers = {
        'User-Agent':"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400"
    }
    rest = requests.get(url=url, headers=headers)  # fetch the page with requests.get
    # rest.text often comes back garbled here, so decode the raw bytes instead
    text = rest.content.decode('utf-8')  # decode as UTF-8 so the HTML can be parsed
    soup = BeautifulSoup(text, 'html5lib')  # BeautifulSoup needs the text and a parser

    # Extract the data
    cons = soup.find('div', attrs={'class':'conMidtab'})  # first conMidtab = today's weather
    tables = cons.find_all('table')  # one table per province
    for table in tables:
        trs = table.find_all('tr')[2:]  # skip the two header rows
        for index,tr in enumerate(trs):
            if index == 0:
                # first city of a province: name in the 2nd td, max temp in the 5th
                tds = tr.find_all('td')[1]
                qiwen = tr.find_all('td')[4]
            else:
                # remaining cities: name in the 1st td, max temp in the 4th
                tds = tr.find_all('td')[0]
                qiwen = tr.find_all('td')[3]
            city = list(tds.stripped_strings)[0]  # first text fragment = city name
            wendu = list(qiwen.stripped_strings)[0]  # first text fragment = max temperature
            data.append({'City':city, 'Maximum Temperature':wendu})

def main():
    # one page per region: North, Northeast, East, Central, South,
    # Northwest, Southwest China, and Hong Kong/Macao/Taiwan
    urls = [
        "http://www.weather.com.cn/textFC/hb.shtml",
        "http://www.weather.com.cn/textFC/db.shtml",
        "http://www.weather.com.cn/textFC/hd.shtml",
        "http://www.weather.com.cn/textFC/hz.shtml",
        "http://www.weather.com.cn/textFC/hn.shtml",
        "http://www.weather.com.cn/textFC/xb.shtml",
        "http://www.weather.com.cn/textFC/xn.shtml",
        "http://www.weather.com.cn/textFC/gat.shtml"
    ]
    for url in urls:
        parse_data(url)

    # Sort to find the ten cities with the highest temperatures
    # Sort ascending by temperature (as an integer)
    data.sort(key=lambda x:int(x['Maximum Temperature']))
    # The last ten entries are the ten hottest cities
    data_2 = data[-10:]

    # Data visualization
    citys = list(map(lambda x:x['City'], data_2))  # x-axis values
    wendu = list(map(lambda x:x['Maximum Temperature'], data_2))  # y-axis values
    charts = Bar('Top 10 Hottest Cities in China')
    charts.add('', citys, wendu)
    charts.render('Weather Network.html')

if __name__ == '__main__':
    main()
