Crawling and data analysis of domestic GDP

(1) I. background of the topic

Gross domestic product (GDP) refers to the final result of the production activities of all resident units of a country (or region) in a certain period calculated according to the national market price. It is often recognized as the best indicator to measure the national economic situation. GDP is an important comprehensive statistical index in the accounting system; It is also the core index in China's new national economic accounting system. It reflects the economic strength and market scale of a country (or region). In recent years, with the rapid development of China's economy, GDP has also increased significantly. I want to analyze China's GDP in recent years through this climb.

(2) Theme web crawler design scheme

1. Topic crawler name

Climbing of domestic GDP

2. Content and data feature analysis of topic web crawler

The data value and year-on-year growth of the website's GDP and primary, secondary and tertiary industries.

3. Overview of thematic web crawler design scheme

Crawl the gdp data of the website, find the relevant data of the corresponding tag link, and then clean and visually analyze the data.

(3) Analysis of structural characteristics of theme pages

1. Structure and feature analysis of theme page

Crawl main page

In recent years, the GDP of each quarter, the production value of the primary, secondary and tertiary industries and their respective year-on-year growth ratios

2. HTML page parsing

3. Node (label) search method and traversal method

Review the elements of the website, and then find the data page under div class under the corresponding body tag.

(4) Web crawler program design

1. Data crawling and acquisition

 1 from bs4 import BeautifulSoup
 2 import requests
 3 import matplotlib.pyplot as plt
 4 import pandas as pd
 5 import csv
 6 def getHTMLText(url):
 7     try:
 8         headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/\
 9         537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34"}
10         r=requests.get(url,headers=headers)
11         r.raise_for_status()#If the status is not 200, the HTTPError abnormal
12         r.encoding = r.apparent_encoding
13         print(r)
14         return r.text
15     except:
16         return "no"
17 
18 url="https://market.cnal.com/historical/gdp.html"
19 a=getHTMLText(url)
20 soup=BeautifulSoup(a)

2. Clean and process the data

 1 fff=[]
 2 
 3 for i in soup.find_all("tr")[2:]:
 4     fd=[]
 5     gg=0
 6     for a in i:
 7         if gg==1:
 8             fd.append(a)
 9         elif gg==3:
10             fd.append(a)
11         elif gg==5:
12             fd.append(a)
13         elif gg==7:
14             fd.append(a)
15         elif gg==9:
16             fd.append(a)
17         elif gg==11:
18             fd.append(a)
19         elif gg==13:
20             fd.append(a)
21         elif gg==15:
22             fd.append(a)
23         elif gg==17:
24             fd.append(a)
25         gg=gg+1 
26     fff.append(fd)ght=[]
27 for gtk in fff:
28     hv=[]
29     for i in gtk:
30         hv.append(str(i)[4:-5])
31     ght.append(hv)with open("D:\\Li kunbin python Programming\\hk1.csv","w",encoding="utf_8_sig") as fi:
32     writer=csv.writer(fi)
33     writer.writerow(["quarter",
34                      "gross domestic product",
35                      "Year on year growth",
36                      "Primary industry",
37                      "Year on year growth",
38                      "Secondary industry",
39                      "Year on year growth",
40                      "Tertiary industry",
41                      "Year on year growth"])#Data column name for each column
42     for da in ght:
43         writer.writerow(da)
44 fi.close()
45 df=pd.read_csv("D:\\Li kunbin python Programming\\hk1.csv")
46 df

3 data analysis and visualization

 1 for hji in range(len(df["quarter"])):
 2     df.loc[hji,"gross domestic product"]=float("".join(str(df.loc[hji,"gross domestic product"]).split(",")))
 3     df.loc[hji,"Primary industry"]=float("".join(str(df.loc[hji,"Primary industry"]).split(",")))
 4     df.loc[hji,"Secondary industry"]=float("".join(str(df.loc[hji,"Secondary industry"]).split(",")))
 5     df.loc[hji,"Tertiary industry"]=float("".join(str(df.loc[hji,"Tertiary industry"]).split(",")))
 6     df.loc[hji,"Year on year growth"]=float(str(df.loc[hji,"Year on year growth"])[0:-2])/100
 7     df.loc[hji,"Year on year growth.1"]=float(str(df.loc[hji,"Year on year growth.1"])[0:-2])/100
 8     df.loc[hji,"Year on year growth.2"]=float(str(df.loc[hji,"Year on year growth.2"])[0:-2])/100
 9     df.loc[hji,"Year on year growth.3"]=float(str(df.loc[hji,"Year on year growth.3"])[0:-2])/100import seaborn as sns
10 plt.scatter(df["gross domestic product"],
11            df["Primary industry"])
12 
13 plt.show()

 1 plt.scatter(df["Primary industry"],
 2                           df["Tertiary industry"],
 3                           s=None,
 4                           c="k",
 5                           marker="+",
 6                           cmap=None,
 7                           norm=0.7,
 8                           vmin=None,
 9                           vmax=None,
10                           alpha=0.5,
11                           linewidths=None,
12                           verts=None,
13                           edgecolors=None,
14                           data=None)
15 plt.show()

 1 #Understand the position and quantity distribution of the average production value through the box diagram
 2 plt.subplot(2,2,1)
 3 plt.boxplot(df["gross domestic product"],       
 4             notch=True,
 5             sym=None,
 6             vert=None,
 7             whis=None,
 8             positions=None,
 9             widths=None,
10             patch_artist=True,
11             meanline=None,
12             showmeans=None,
13             showcaps=None,
14             showbox=None,
15             showfliers=None,
16             boxprops=None,
17             labels=None,
18             flierprops=None,
19             medianprops=None,
20             meanprops=None,
21             capprops=None,
22             whiskerprops=None)
23 plt.title("gross domestic product")
24 plt.ylabel("quantity")
25 plt.subplot(2,2,2)
26 plt.boxplot(df["Primary industry"],       
27             notch=True,
28             sym=None,
29             vert=None,
30             whis=None,
31             positions=None,
32             widths=None,
33             patch_artist=True,
34             meanline=None,
35             showmeans=None,
36             showcaps=None,
37             showbox=None,
38             showfliers=None,
39             boxprops=None,
40             labels=None,
41             flierprops=None,
42             medianprops=None,
43             meanprops=None,
44             capprops=None,
45             whiskerprops=None)
46 plt.title("Primary industry")
47 plt.ylabel("quantity")
48 plt.subplot(2,2,3)
49 
50 plt.boxplot(df["Secondary industry"],       
51             notch=True,
52             sym=None,
53             vert=None,
54             whis=None,
55             positions=None,
56             widths=None,
57             patch_artist=True,
58             meanline=None,
59             showmeans=None,
60             showcaps=None,
61             showbox=None,
62             showfliers=None,
63             boxprops=None,
64             labels=None,
65             flierprops=None,
66             medianprops=None,
67             meanprops=None,
68             capprops=None,
69             whiskerprops=None)
70 plt.title("Secondary industry")
71 plt.ylabel("quantity")
72 plt.subplot(2,2,4)
73 plt.boxplot(df["Tertiary industry"],       
74             notch=True,
75             sym=None,
76             vert=None,
77             whis=None,
78             positions=None,
79             widths=None,
80             patch_artist=True,
81             meanline=None,
82             showmeans=None,
83             showcaps=None,
84             showbox=None,
85             showfliers=None,
86             boxprops=None,
87             labels=None,
88             flierprops=None,
89             medianprops=None,
90             meanprops=None,
91             capprops=None,
92             whiskerprops=None)
93 plt.title("Tertiary industry")
94 plt.ylabel("quantity")
95 plt.show()

 1 import requests
 2 from bs4 import BeautifulSoup
 3 import matplotlib.pyplot as plt
 4 import seaborn as sns
 5 import pandas as pd
 6 #df=pd.read_csv("C:\\Users\\wei\\data.csv")
 7 ggf=df.sort_values(by="gross domestic product",
 8                         axis=0,
 9                         ascending=False,)
10 bk=ggf["Primary industry"][0:6]
11 zk=ggf["Secondary industry"][0:6]
12 city_1=ggf["Tertiary industry"][0:6]
13 #Show Chinese labels,Dealing with Chinese garbled code
14 plt.rcParams['font.sans-serif']=['Microsoft YaHei']
15 plt.rcParams['axes.unicode_minus']=False 
16 plt.figure(figsize=(10,6))
17 x=list(range(len(zk)))
18 #Set spacing for pictures
19 total_width=0.8
20 n=3
21 width=total_width/n
22 for i in range(len(x)):
23     x[i]-=width
24 plt.bar(x,
25         bk,
26         width=width,
27         label="Primary industry",
28         fc="teal"
29        )
30 for aa,ab in zip(x,bk):
31     plt.text(aa,
32              ab,
33              ab,
34              ha="center",
35              va='bottom',
36              fontsize=10)
37 for i in range(len(x)):
38     x[i]+=width
39 plt.bar(x,
40         zk,
41         width=width,#width
42         label="Secondary industry",
43         tick_label=city_1,
44         color="b"
45        )
46 for aa,ab in zip(x,zk):
47     plt.text(aa,
48              ab,
49              ab,
50              ha="center",
51              va='bottom',
52              fontsize=10)
53 
54 for i in range(len(x)):
55     x[i]+=width
56 plt.bar(x,
57         city_1,
58         width=width,
59         label="Tertiary industry",
60         color="r"
61        )
62 for aa,ab in zip(x,city_1):
63     plt.text(aa,
64              ab,
65              ab,
66              ha="center",
67              va='bottom',
68              fontsize=10)
69 plt.legend()
70 plt.xlabel("industry")
71 plt.ylabel("gdp")
72 plt.title("industry gdp Contribution comparison")
73 plt.grid()
74 plt.show()

 1 #Organize drawing data
 2 hi=df.sort_values(by="gross domestic product",
 3                         axis=0,
 4                         ascending=False,)
 5 for ikl in range(len(df["gross domestic product"])):
 6     if ikl==29:
 7         fa=hi.loc[ikl,"gross domestic product"]
 8     elif ikl==60:
 9         fb=hi.loc[ikl,"gross domestic product"]
10     elif ikl==90:
11         fc=hi.loc[ikl,"gross domestic product"]
12 a_25=0
13 a_50=0
14 a_75=0
15 a_100=0
16 DF=len(hi["gross domestic product"])
17 for gh in range(DF):
18     if hi.loc[gh,"gross domestic product"]>fa:
19         a_100=a_100+1
20     elif hi.loc[gh,"gross domestic product"]>fb:
21         a_75=a_75+1
22     elif hi.loc[gh,"gross domestic product"]>fc:
23         a_50=a_50+1
24     else:
25         a_25=a_25+1
26 a_data=[a_25,a_50,a_75,a_100]
27 plt.rcParams['font.sans-serif']=['Microsoft YaHei']  #Show Chinese labels,Dealing with Chinese garbled code
28 plt.rcParams['axes.unicode_minus']=False 
29 #Construction data
30 explode = [0, 0, 0, 0]
31 labels = ["0-25%", "25-50%", "50-75%", "75-100%"]
32 colors = ['#9ACD32', 'mistyrose', '#DDA0DD', '#FF0000']
33 plt.pie(
34     a_data,  #Drawing data
35     explode=explode, #Specifies that some parts of the pie chart are highlighted, that is, they appear explosive
36     labels=labels,
37     colors=colors,
38     autopct='%.2f%%',
39     pctdistance=0.8,
40     labeldistance=1.1,
41     startangle=180,
42     radius=1.2,
43     counterclock=False,
44     wedgeprops={'linewidth':1.5,'edgecolor':'r'},
45     textprops={'fontsize':10,'color':'black'},
46     )
47 #Add diagram title
48 plt.title('Quantity distribution')
49 #display graphics
50 plt.show()

4.4 data persistence

5. Summarize the codes of the above parts and attach the complete program code

  1 from bs4 import BeautifulSoup
  2 import requests
  3 import matplotlib.pyplot as plt
  4 import pandas as pd
  5 import csv
  6 def getHTMLText(url):
  7     try:
  8         headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/\
  9         537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34"}
 10         r=requests.get(url,headers=headers)
 11         r.raise_for_status()#If the status is not 200, the HTTPError abnormal
 12         r.encoding = r.apparent_encoding
 13         print(r)
 14         return r.text
 15     except:
 16         return "no"
 17 
 18 url="https://market.cnal.com/historical/gdp.html"
 19 a=getHTMLText(url)
 20 soup=BeautifulSoup(a)fff=[]
 21 
 22 for i in soup.find_all("tr")[2:]:
 23     fd=[]
 24     gg=0
 25     for a in i:
 26         if gg==1:
 27             fd.append(a)
 28         elif gg==3:
 29             fd.append(a)
 30         elif gg==5:
 31             fd.append(a)
 32         elif gg==7:
 33             fd.append(a)
 34         elif gg==9:
 35             fd.append(a)
 36         elif gg==11:
 37             fd.append(a)
 38         elif gg==13:
 39             fd.append(a)
 40         elif gg==15:
 41             fd.append(a)
 42         elif gg==17:
 43             fd.append(a)
 44         gg=gg+1 
 45     fff.append(fd)ght=[]
 46 for gtk in fff:
 47     hv=[]
 48     for i in gtk:
 49         hv.append(str(i)[4:-5])
 50     ght.append(hv)
 51 with open("D:\\Li kunbin python Programming\\hk1.csv","w",encoding="utf_8_sig") as fi:
 52     writer=csv.writer(fi)
 53     writer.writerow(["quarter",
 54                      "gross domestic product",
 55                      "Year on year growth",
 56                      "Primary industry",
 57                      "Year on year growth",
 58                      "Secondary industry",
 59                      "Year on year growth",
 60                      "Tertiary industry",
 61                      "Year on year growth"])#Data column name for each column
 62     for da in ght:
 63         writer.writerow(da)
 64 fi.close()
 65 df=pd.read_csv("D:\\Li kunbin python Programming\\hk1.csv")
 66 df
 67 for hji in range(len(df["quarter"])):
 68     df.loc[hji,"gross domestic product"]=float("".join(str(df.loc[hji,"gross domestic product"]).split(",")))
 69     df.loc[hji,"Primary industry"]=float("".join(str(df.loc[hji,"Primary industry"]).split(",")))
 70     df.loc[hji,"Secondary industry"]=float("".join(str(df.loc[hji,"Secondary industry"]).split(",")))
 71     df.loc[hji,"Tertiary industry"]=float("".join(str(df.loc[hji,"Tertiary industry"]).split(",")))
 72     df.loc[hji,"Year on year growth"]=float(str(df.loc[hji,"Year on year growth"])[0:-2])/100
 73     df.loc[hji,"Year on year growth.1"]=float(str(df.loc[hji,"Year on year growth.1"])[0:-2])/100
 74     df.loc[hji,"Year on year growth.2"]=float(str(df.loc[hji,"Year on year growth.2"])[0:-2])/100
 75     df.loc[hji,"Year on year growth.3"]=float(str(df.loc[hji,"Year on year growth.3"])[0:-2])/100
 76 mport seaborn as sns
 77 plt.scatter(df["gross domestic product"],
 78            df["Primary industry"])
 79 
 80 plt.show()
 81 plt.scatter(df["Primary industry"],
 82                           df["Tertiary industry"],
 83                           s=None,
 84                           c="k",
 85                           marker="+",
 86                           cmap=None,
 87                           norm=0.7,
 88                           vmin=None,
 89                           vmax=None,
 90                           alpha=0.5,
 91                           linewidths=None,
 92                           verts=None,
 93                           edgecolors=None,
 94                           data=None)
 95 plt.show()
 96 #Understand the position and quantity distribution of the average production value through the box diagram
 97 plt.subplot(2,2,1)
 98 plt.boxplot(df["gross domestic product"],       
 99             notch=True,
100             sym=None,
101             vert=None,
102             whis=None,
103             positions=None,
104             widths=None,
105             patch_artist=True,
106             meanline=None,
107             showmeans=None,
108             showcaps=None,
109             showbox=None,
110             showfliers=None,
111             boxprops=None,
112             labels=None,
113             flierprops=None,
114             medianprops=None,
115             meanprops=None,
116             capprops=None,
117             whiskerprops=None)
118 plt.title("gross domestic product")
119 plt.ylabel("quantity")
120 plt.subplot(2,2,2)
121 plt.boxplot(df["Primary industry"],       
122             notch=True,
123             sym=None,
124             vert=None,
125             whis=None,
126             positions=None,
127             widths=None,
128             patch_artist=True,
129             meanline=None,
130             showmeans=None,
131             showcaps=None,
132             showbox=None,
133             showfliers=None,
134             boxprops=None,
135             labels=None,
136             flierprops=None,
137             medianprops=None,
138             meanprops=None,
139             capprops=None,
140             whiskerprops=None)
141 plt.title("Primary industry")
142 plt.ylabel("quantity")
143 plt.subplot(2,2,3)
144 
145 plt.boxplot(df["Secondary industry"],       
146             notch=True,
147             sym=None,
148             vert=None,
149             whis=None,
150             positions=None,
151             widths=None,
152             patch_artist=True,
153             meanline=None,
154             showmeans=None,
155             showcaps=None,
156             showbox=None,
157             showfliers=None,
158             boxprops=None,
159             labels=None,
160             flierprops=None,
161             medianprops=None,
162             meanprops=None,
163             capprops=None,
164             whiskerprops=None)
165 plt.title("Secondary industry")
166 plt.ylabel("quantity")
167 plt.subplot(2,2,4)
168 plt.boxplot(df["Tertiary industry"],       
169             notch=True,
170             sym=None,
171             vert=None,
172             whis=None,
173             positions=None,
174             widths=None,
175             patch_artist=True,
176             meanline=None,
177             showmeans=None,
178             showcaps=None,
179             showbox=None,
180             showfliers=None,
181             boxprops=None,
182             labels=None,
183             flierprops=None,
184             medianprops=None,
185             meanprops=None,
186             capprops=None,
187             whiskerprops=None)
188 plt.title("Tertiary industry")
189 plt.ylabel("quantity")
190 plt.show()
191 #Organize drawing data
192 hi=df.sort_values(by="gross domestic product",
193                         axis=0,
194                         ascending=False,)
195 for ikl in range(len(df["gross domestic product"])):
196     if ikl==29:
197         fa=hi.loc[ikl,"gross domestic product"]
198     elif ikl==60:
199         fb=hi.loc[ikl,"gross domestic product"]
200     elif ikl==90:
201         fc=hi.loc[ikl,"gross domestic product"]
202 a_25=0
203 a_50=0
204 a_75=0
205 a_100=0
206 DF=len(hi["gross domestic product"])
207 for gh in range(DF):
208     if hi.loc[gh,"gross domestic product"]>fa:
209         a_100=a_100+1
210     elif hi.loc[gh,"gross domestic product"]>fb:
211         a_75=a_75+1
212     elif hi.loc[gh,"gross domestic product"]>fc:
213         a_50=a_50+1
214     else:
215         a_25=a_25+1
216 a_data=[a_25,a_50,a_75,a_100]
217 plt.rcParams['font.sans-serif']=['Microsoft YaHei']  #Show Chinese labels,Dealing with Chinese garbled code
218 plt.rcParams['axes.unicode_minus']=False 
219 #Construction data
220 explode = [0, 0, 0, 0]
221 labels = ["0-25%", "25-50%", "50-75%", "75-100%"]
222 colors = ['#9ACD32', 'mistyrose', '#DDA0DD', '#FF0000']
223 plt.pie(
224     a_data,  #Drawing data
225     explode=explode, #Specifies that some parts of the pie chart are highlighted, that is, they appear explosive
226     labels=labels,
227     colors=colors,
228     autopct='%.2f%%',
229     pctdistance=0.8,
230     labeldistance=1.1,
231     startangle=180,
232     radius=1.2,
233     counterclock=False,
234     wedgeprops={'linewidth':1.5,'edgecolor':'r'},
235     textprops={'fontsize':10,'color':'black'},
236     )
237 #Add diagram title
238 plt.title('Quantity distribution')
239 #display graphics
240 plt.show()
241 import requests
242 from bs4 import BeautifulSoup
243 import matplotlib.pyplot as plt
244 import seaborn as sns
245 import pandas as pd
246 #df=pd.read_csv("C:\\Users\\wei\\data.csv")
247 ggf=df.sort_values(by="gross domestic product",
248                         axis=0,
249                         ascending=False,)
250 bk=ggf["Primary industry"][0:6]
251 zk=ggf["Secondary industry"][0:6]
252 city_1=ggf["Tertiary industry"][0:6]
253 #Show Chinese labels,Dealing with Chinese garbled code
254 plt.rcParams['font.sans-serif']=['Microsoft YaHei']
255 plt.rcParams['axes.unicode_minus']=False 
256 plt.figure(figsize=(10,6))
257 x=list(range(len(zk)))
258 #Set spacing for pictures
259 total_width=0.8
260 n=3
261 width=total_width/n
262 for i in range(len(x)):
263     x[i]-=width
264 plt.bar(x,
265         bk,
266         width=width,
267         label="Primary industry",
268         fc="teal"
269        )
270 for aa,ab in zip(x,bk):
271     plt.text(aa,
272              ab,
273              ab,
274              ha="center",
275              va='bottom',
276              fontsize=10)
277 for i in range(len(x)):
278     x[i]+=width
279 plt.bar(x,
280         zk,
281         width=width,#width
282         label="Secondary industry",
283         tick_label=city_1,
284         color="b"
285        )
286 for aa,ab in zip(x,zk):
287     plt.text(aa,
288              ab,
289              ab,
290              ha="center",
291              va='bottom',
292              fontsize=10)
293 
294 for i in range(len(x)):
295     x[i]+=width
296 plt.bar(x,
297         city_1,
298         width=width,
299         label="Tertiary industry",
300         color="r"
301        )
302 for aa,ab in zip(x,city_1):
303     plt.text(aa,
304              ab,
305              ab,
306              ha="center",
307              va='bottom',
308              fontsize=10)
309 plt.legend()
310 plt.xlabel("industry")
311 plt.ylabel("gdp")
312 plt.title("industry gdp Contribution comparison")
313 plt.grid()
314 plt.show()

(5) . summary

1. What conclusions can be drawn from the analysis and visualization of subject data? Is the expected goal achieved?

Through this course design, I have a clear understanding of the growth and decline of China's GDP and the situation of their respective industries in recent years, and have a deeper understanding of China's economic situation.

2. What are the gains in the process of completing this design? And suggestions for improvement?

In this course, I have mastered deeper knowledge of crawlers and stimulated my interest in crawlers and python. Many difficulties were encountered in the design process,

Through online information search and online video teaching, I am very helpful in this course design,

In short, this design has benefited me a lot and made me understand that I still have a lot of knowledge to learn. I will continue to refuel and study hard.

Added by patryn on Wed, 29 Dec 2021 11:33:16 +0200

Programming VIP

Crawling and data analysis of domestic GDP

Popular Keywords