Python helps Tiezhu next door find his missing underwear

The cause

It's July 10th, the time when college students head home for the break. Our major left later than most because of practical training, and I stayed even later for personal reasons. It just so happened that Tiezhu hadn't left either. Early in the morning, Tiezhu showed up at our dormitory looking glum...

Tiezhu: Everyone gone? (a tentative question, meaning: is anyone still in the dorm?)
Feixue: Nope. (sticking my head out from the bed curtain)
Feixue: Whoa! Tiezhu, what's with the pained face, got cramps?
Tiezhu: My favorite pair of underpants is gone. Have you seen it?
Feixue: Who'd go looking at your stuff? Haven't seen it! (he then leaves)
...

Then, a little after 3 p.m., suddenly:

I genuinely broke a sweat worrying for him.... Wait. A rainstorm warning, and Tiezhu's missing underpants. Could there be some hidden connection between the two?

Or do you think losing the underwear is retribution for someone's bad habit of stealing?
So I launched an investigation.

Investigation (crawler + data analysis and visualization)

The idea

Just do it

The plan: use a crawler to grab the weather for these few days in Shijiazhuang over the past ten years or so.

Let's start with the libraries we'll use:

import requests  # for sending requests and fetching data from URLs
import time  # for pausing between requests
import random  # for generating random numbers
import pandas as pd  # for exporting the data
import re  # for regular expressions

Next, let's build a disguised request header, one of the most common crawler tricks:

headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Cookie': 'widget_dz_id=54511; widget_dz_cityValues=,; timeerror=1; defaultCityID=54511; defaultCityName=%u5317%u4EAC; Hm_lvt_a3f2879f6b3620a363bec646b7a8bcdd=1516245199; Hm_lpvt_a3f2879f6b3620a363bec646b7a8bcdd=1516245199; addFavorite=clicked',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3236.0 Safari/537.36'
}

Getting these values is simple: open the page at the link above, right-click, and (if you're using Chrome) choose Inspect; the request headers appear in the Network panel.

The point of forging the request header is to make the website believe that a normal user, not a crawler, is requesting the data, so that it doesn't reject your request.
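To check that the header works before crawling everything, a minimal sketch like the one below can help. This is my own addition, not part of the original; the test URL simply reuses the pre-2016 link format shown later.

# Quick check that the disguised header is accepted (my addition; URL reuses the pre-2016 format below)
test_url = 'http://tianqi.2345.com/t/wea_history/js/58362_20127.js'
resp = requests.get(test_url, headers=headers)
print(resp.status_code)  # 200 means the site treated the request as a normal browser
print(resp.text[:200])   # peek at the start of the returned data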

Next we need the links to the files that store the historical weather data. After some analysis, the links up to 2016 take the form 'http://tianqi.2345.com/t/wea_history/js/58362_%s%s.js' % (year, month); the later links differ slightly, so Feixue just builds both forms directly:

urls = []
for year in range(2012, 2022):  # from 2012 to 2021
    for month in range(7, 8):  # only July
        # The links used up to 2016 differ slightly from those from 2017 on, so branch
        if year <= 2016:
            urls.append('http://tianqi.2345.com/t/wea_history/js/58362_%s%s.js' % (year, month))
        else:
            # From 2017 on, January to September carry a leading 0 in the month, so branch again
            if month < 10:
                urls.append('http://tianqi.2345.com/t/wea_history/js/%s0%s/53698_%s0%s.js' % (year, month, year, month))
            else:
                urls.append('http://tianqi.2345.com/t/wea_history/js/%s%s/53698_%s%s.js' % (year, month, year, month))

Sure enough, the data is in json format:

Then we use regular expressions to pull out the values we want (if you're not familiar with regex, you can learn from the website recommended at this link: link).
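Before running the full loop, here is a rough idea of what one record in the returned data looks like and how a non-greedy regex pulls a field out of it. The sample string below is reconstructed from the regex patterns used in this article, not copied from the site, so field order and values are only illustrative.

import re

# A reconstructed sample record (illustrative only, based on the regexes below)
sample = "{ymd:'2021-07-10',bWendu:'32℃',yWendu:'24℃',tianqi:'Rainstorm',fengxiang:'South wind',fengli:'Level 2',aqi:'40',aqiInfo:'Excellent',aqiLevel:'1'}"
print(re.findall("ymd:'(.*?)',", sample))      # non-greedy: ['2021-07-10']
print(re.findall("bWendu:'(.*?)℃',", sample))  # ['32']
# Why non-greedy matters: a greedy pattern swallows everything up to the last match
two_days = "ymd:'2021-07-09',ymd:'2021-07-10',"
print(re.findall("ymd:'(.*)',", two_days))     # greedy: ["2021-07-09',ymd:'2021-07-10"]
print(re.findall("ymd:'(.*?)',", two_days))    # non-greedy: ['2021-07-09', '2021-07-10']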

info = []  # collects one DataFrame per year
for url in urls:
    print(url)
    seconds = random.randint(1, 6)  # for each loop, a random integer between 1 and 6
    time.sleep(seconds)  # pause so the requests don't arrive as a flood
    response = requests.get(url, headers=headers).text  # send the request and get the response text
    ymd = re.findall("ymd:'(.*?)',", response)  # dates
    high = re.findall("bWendu:'(.*?)℃',", response)  # daily maximum temperature
    low = re.findall("yWendu:'(.*?)℃',", response)  # daily minimum temperature
    tianqi = re.findall("tianqi:'(.*?)',", response)  # weather description
    fengxiang = re.findall("fengxiang:'(.*?)',", response)  # wind direction
    fengli = re.findall(",fengli:'(.*?)'", response)  # wind force
    aqi = re.findall("aqi:'(.*?)',", response)  # air quality index
    aqiInfo = re.findall("aqiInfo:'(.*?)',", response)  # air quality description
    aqiLevel = re.findall(",aqiLevel:'(.*?)'", response)  # air quality level

A simple non-greedy regex pulls out the data I want from each URL. Each response covers the whole of July, but we only need the days around the incident, so we continue the loop as follows:

    # There is no air quality data before 2016, so that case is handled separately
    if len(aqi) == 0:
        aqi = None
        aqiInfo = None
        aqiLevel = None
        df = pd.DataFrame(
            {'ymd': ymd, 'high': high, 'low': low, 'tianqi': tianqi, 'fengxiang': fengxiang, 'fengli': fengli,
             'aqi': aqi, 'aqiInfo': aqiInfo, 'aqiLevel': aqiLevel})
        df.drop(index=[i for i in range(0, 8)], inplace=True)  # drop the days before the period we care about
        df.drop(index=[i for i in range(14, 31)], inplace=True)  # drop the days after it
        print(df)

        info.append(df)
    else:
        df = pd.DataFrame(
            {'ymd': ymd, 'high': high, 'low': low, 'tianqi': tianqi, 'fengxiang': fengxiang, 'fengli': fengli,
             'aqi': aqi, 'aqiInfo': aqiInfo, 'aqiLevel': aqiLevel})
        df.drop([i for i in range(0, 8)], inplace=True)
        try:
            df.drop([i for i in range(14, 31)], inplace=True)
        except KeyError:
            print('2021 data is not a full month\n\n')  # 2021 has fewer rows, so these labels may not exist
        print(df)
        info.append(df)

A quick introduction to the usage of pd.DataFrame.drop

pd.DataFrame itself is well known; it's worth digging into its common methods to build up your skills. I often use it for plotting or for turning dictionary-style data into row/column data. The details of drop are below:

Usage: DataFrame.drop(labels=None, axis=0, index=None, columns=None, inplace=False)
# Parameter description


labels is the name of the rows or columns to delete, given as a list
axis defaults to 0, meaning rows are deleted; to delete columns, specify axis=1
index directly specifies the rows to delete
columns directly specifies the columns to delete
inplace=False (the default): the deletion does not change the original data; a new DataFrame with the deletion applied is returned
inplace=True: the original data is modified in place, and nothing is returned
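A quick, self-contained illustration of these parameters on toy data (my own example, unrelated to the weather data):

import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(demo.drop(index=[0]))      # returns a new DataFrame without row 0; demo is unchanged
print(demo.drop(columns=['b']))  # returns a new DataFrame without column 'b'
print(demo.drop(['b'], axis=1))  # same as above, using labels + axis=1
demo.drop([2], inplace=True)     # drops row 2 in place and returns None
print(demo)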

Finally, merge and save the data

# Merge all the stored weather data into a single table
weather = pd.concat(info)
# Export the data (write an Excel file, since we read it back with xlrd below)
weather.to_excel('weather.xlsx', index=False, sheet_name='weather')

With the data in hand, let's start uncovering the truth.

Data analysis and visualization

For the visualization you can refer to: Dozens of cool visual data analysis websites you deserve.
The analysis is done on the Excel data obtained above, so we mainly use:

import xlrd

Let's draw a pie chart first

Simple data processing

data = xlrd.open_workbook('weather.xlsx')  # read the file
table = data.sheet_by_name('weather')  # the sheet named weather
tianshu = table.nrows - 1  # number of days: the first row is the header, so subtract one
xiayv = 0  # counter for days that are neither sunny nor cloudy (i.e. rainy days)

# Pie chart
for i in range(1, tianshu + 1):  # data rows start at index 1 (row 0 is the header)
    a = table.row(i)[3].value  # column 3 holds the weather description (tianqi)
    print(a)
    if a != 'Sunny' and a != 'cloudy' and a != 'Sunny~cloudy' and a != 'cloudy~Sunny':
        xiayv = xiayv + 1

print(tianshu)
print(xiayv)
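Before sending the counts to the online platform, note that you could also plot them locally. A minimal matplotlib sketch (my own addition; matplotlib is not used in the original article):

import matplotlib.pyplot as plt  # local alternative to the online platform

labels = ['Rain or other weather', 'Sunny / cloudy']
sizes = [xiayv, tianshu - xiayv]  # counts computed in the loop above
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Weather around July 10 over the last ten years')
plt.savefig('pie.png')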

Feed the counts into the platform from the link above and you get the chart shown below; the platform comes from the article "Dozens of cool visual data analysis websites you deserve".

You can see at a glance that rainy days make up a very large share. Let's keep analyzing.

Scatter plot

This format matches the data type expected by the platform above:

ri_wendu = []  # [index within week, max temperature] pairs
j = 1
for i in range(1, tianshu + 1):  # data rows start at index 1 (row 0 is the header)
    if i % 7 == 0:
        j = 1  # reset the index for the next group of seven days
        ri_wendu.append([j, table.row(i)[1].value])  # column 1 holds the max temperature
        j = j + 1
    else:
        ri_wendu.append([j, table.row(i)[1].value])
        j = j + 1

print(ri_wendu)
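As with the pie chart, you could also plot this locally instead of using the platform; a small matplotlib sketch (again my own addition, not the article's method):

import matplotlib.pyplot as plt  # local alternative to the online platform

xs = [p[0] for p in ri_wendu]         # index within the week
ys = [float(p[1]) for p in ri_wendu]  # max temperature read from the sheet
plt.scatter(xs, ys)
plt.xlabel('Day index within week')
plt.ylabel('Max temperature (℃)')
plt.savefig('scatter.png')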

This one shows the temperatures.

The temperatures over those days are normal for summer, nothing strange. Let's continue.

Word cloud (with customizable colors)

Libraries to use:

from matplotlib import colors
import jieba
import xlrd
from wordcloud import WordCloud

Using the wordcloud library is very simple, so I won't go into it; the only part you may not be familiar with is setting the colors, which you can refer to below.

f = open(r'tianqi.txt', 'r', encoding='utf-8')  # the wind direction and wind force pulled out of the data above
text = f.read()
cut_text = " ".join(jieba.cut(text))  # segment with jieba and join with spaces for WordCloud
color_list = ['#FF0000', '#9955FF', '#66FFFF']  # the list of colors to use
colormap = colors.ListedColormap(color_list)  # build a colormap from the list
# color_mask = cv2.imread("11.jpg")
word_cloud = WordCloud(
    font_path="msyh.ttc",
    background_color='black',
    mode="RGBA",
    prefer_horizontal=1,
    # mask=color_mask,
    height=200,
    width=200,
    scale=1,
    colormap=colormap,  # set the colors
    margin=5
)

word_cloud1 = word_cloud.generate(cut_text)
word_cloud1.to_file('1.png')

print("Picture saved successfully")


You can see from the picture that the wind was fairly light. But then why was the rain so heavy??? Read on.

Word cloud color palette

#FFFFFF #DDDDDD #AAAAAA #888888 #666666 #444444 #000000
#FFB7DD #FF88C2 #FF44AA #FF0088 #C10066 #A20055 #8C0044
#FFCCCC #FF8888 #FF3333 #FF0000 #CC0000 #AA0000 #880000
#FFC8B4 #FFA488 #FF7744 #FF5511 #E63F00 #C63300 #A42D00
#FFDDAA #FFBB66 #FFAA33 #FF8800 #EE7700 #CC6600 #BB5500
#FFEE99 #FFDD55 #FFCC22 #FFBB00 #DDAA00 #AA7700 #886600
#FFFFBB #FFFF77 #FFFF33 #FFFF00 #EEEE00 #BBBB00 #888800
#EEFFBB #DDFF77 #CCFF33 #BBFF00 #99DD00 #88AA00 #668800
#CCFF99 #BBFF66 #99FF33 #77FF00 #66DD00 #55AA00 #227700
#99FF99 #66FF66 #33FF33 #00FF00 #00DD00 #00AA00 #008800
#BBFFEE #77FFCC #33FFAA #00FF99 #00DD77 #00AA55 #008844
#AAFFEE #77FFEE #33FFDD #00FFCC #00DDAA #00AA88 #008866
#99FFFF #66FFFF #33FFFF #00FFFF #00DDDD #00AAAA #008888
#CCEEFF #77DDFF #33CCFF #00BBFF #009FCC #0088A8 #007799
#CCDDFF #99BBFF #5599FF #0066FF #0044BB #003C9D #003377
#CCCCFF #9999FF #5555FF #0000FF #0000CC #0000AA #000088
#CCBBFF #9F88FF #7744FF #5500FF #4400CC #2200AA #220088
#D1BBFF #B088FF #9955FF #7700FF #5500DD #4400B3 #3A0088
#E8CCFF #D28EFF #B94FFF #9900FF #7700BB #66009D #550088
#F0BBFF #E38EFF #E93EFF #CC00FF #A500CC #7A0099 #660077
#FFB3FF #FF77FF #FF3EFF #FF00FF #CC00CC #990099 #770077
Corresponding colors
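To use one of these rows as the word cloud palette, just swap it in for color_list in the code above, for example the blue row (my own pick):

color_list = ['#CCDDFF', '#99BBFF', '#5599FF', '#0066FF', '#0044BB', '#003C9D', '#003377']  # the blue row above
colormap = colors.ListedColormap(color_list)  # then pass colormap to WordCloud as before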

Conclusion

The truth gradually surfaced. The figures above show that, over the past ten years, rainy days around this date make up a very large share, so this may just be a coincidence: the underwear may not have been taken by a roommate at all, but...

I was just about to shut down the computer and go to sleep when one piece of data instantly woke me up:

Although it had rained a lot, that day turned out to have been a heavy rainstorm, and the case grew blurry again.

So I quickly called my roommate and asked whether he had brought in the laundry before he left. Sure enough, my deduction was spot on: when he left he did bring the clothes in, and he may have grabbed the wrong ones, because they had all been stuffed into his wardrobe. After searching for a while, I found not only the flowery underpants but some other unspeakable things in the cabinet, hahaha. I quickly told Tiezhu to come and look, and sure enough it was all there. Then came the news that my roommate had arrived home safely. I guess the heavens must have decided his roommate deserved some punishment for "stealing" underpants (we're not sure it was stealing, so let's assume he just grabbed the wrong ones).

Well, that's all for today's investigation. You're welcome to subscribe to the hands-on project column; let's make progress and learn together.
Finally, don't forget to give the enthusiastic Feixue a like!!! I sincerely hope my roommate never reads this article.

Keywords: Python crawler data visualization
