China University Comprehensive Ranking Web Crawler

1. Background of topic selection

Why choose this topic? What are the expected goals of the data analysis? (10 points) Describe the social, economic, and technical background and the data sources (under 200 words).

As examinees moving through the college entrance examination and on to postgraduate study, it is important for us to know how Chinese universities rank. Understanding the universities sharpens our view of school selection and helps us enter the best university our scores allow. This crawler collects the rankings of Chinese undergraduate colleges and universities, the provinces they are located in, and their star-rated comprehensive strength. I hope to understand Chinese universities better after this crawl. The data comes from the Gaosan website.

 

2. Thematic Web Crawler Design (10 points)

1. Thematic Web Crawler Name

The name is: Python web crawler - comprehensive ranking of Chinese universities.

 

2. Analysis of Content and Data Features of Thematic Web Crawler Crawling

The contents include rank, school name, location, comprehensive score, star ranking, and level of school running. Together these fields show the level and location of each university; a sample row is sketched below.
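After parsing, each university is represented as a list of six strings in that order. A sketch of one such row (the values here are hypothetical, for illustration only):

# One parsed row: [rank, name, location, score, stars, level]
# Hypothetical values, for illustration only
row = ['1', 'Example University', 'Beijing', '100', '8★', 'World-class university']
rank, name, location, score, stars, level = row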

3. Overview of Thematic Web Crawler Design (including implementation ideas and technical difficulties)

Find the web address and inspect its source code: see where the data sits in the source, which tags wrap it, and what features those wrapping tags have, then think about whether and how it can be crawled. The technical difficulty is that the tags carry no distinguishing features, neither within a tag nor between tags, so how do we extract data from tags without features? A positional-parsing sketch follows.
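One way to handle tags without distinguishing features is to rely on position rather than attributes: every row has the same column order, so the meaning of a cell comes from its index. A minimal sketch on a toy table (the HTML below is hypothetical, not the actual Gaosan markup):

from bs4 import BeautifulSoup

# Toy table whose td tags carry no class or id (hypothetical markup)
html = '''
<table><tbody>
  <tr><td>1</td><td>Example University</td><td>Beijing</td></tr>
  <tr><td>2</td><td>Sample University</td><td>Shanghai</td></tr>
</tbody></table>
'''
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all('tr'):
    # The cells are featureless, so identify them by column order
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    rank, name, location = cells  # position, not attributes, gives meaning
    print(rank, name, location)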

 

3. Analysis of the Structural Characteristics of Theme Pages (10 points)

1. Theme page structure and feature analysis
2. HTML page parsing
3. Node (tag) lookup and traversal methods (draw the node tree structure if necessary)

 

 

The structure: the data is wrapped by a tbody tag, each row is wrapped by a tr tag, and each cell by a td tag. So we can look for tbody in the web page, then the tr tags inside it, then the td tags inside each row, reaching all the data step by step. The node tree is sketched below.
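For reference, a sketch of the node tree (simplified, based on the structure described above):

table
└── tbody
    ├── tr                      ← one row per university
    │   ├── td  rank
    │   ├── td  school name
    │   ├── td  location
    │   ├── td  comprehensive score
    │   ├── td  star ranking
    │   └── td  level of school running
    ├── tr
    │   └── ...
    └── ...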

4. Web Crawler Programming (60 points)

Crawl Data

import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import pandas

# Fetch the web page
def gethtml():
    # Exception handling
    try:
        # URL of the Gaosan ranking page
        url = 'http://www.gaosan.com/gaokao/299171.html'
        # Request the page
        resp = requests.get(url)
        # Raise an HTTPError if the status code is not 200
        resp.raise_for_status()
        # Unify the encoding
        resp.encoding = resp.apparent_encoding
        return resp.text
    except requests.RequestException:
        # Report the failure
        return 'Get Failed'

# Accumulates the raw cell texts
college = []

# Parse the web page with BeautifulSoup
def htmlsoup(txt):
    # Parse the page according to HTML rules
    soup = BeautifulSoup(txt, 'html.parser')
    # Find all tbody tags
    soupf = soup.find_all('tbody')
    # Traverse the parsed page
    for i in soupf:
        # Find all tr tags (one per table row)
        tr = i.find_all('tr')
        for j in tr:
            # Find all td tags (one per cell); find_all skips the
            # whitespace text nodes between tags
            for td in j.find_all('td'):
                # .string returns the text between the tags as str
                strtr = str(td.string)
                # Append the cell text to the list
                college.append(strtr)
    return college

# Group the flat cell list into rows
def dealhtml(college):
    # Counter used for slicing
    dealmark = 0
    # Holds the current group of six cells
    dealc = []
    for c in college:
        dealmark = dealmark + 1
        # Collect cells in groups of six
        if dealmark % 6 != 0:
            dealc.append(c)
        # The sixth cell completes a row
        else:
            dealc.append(c)
            # Add the finished row to the total
            dealcollege.append(dealc)
            # Reinitialize for the next row
            dealc = []
    return dealcollege

# Handle missing scores
def dealcolleges(dealcollege):
    for i in dealcollege:
        # A full-width space means the score is missing
        if i[3] == '\u3000':
            i[3] = 'No Score'
    return dealcollege

# Output function
def printcollege(dealcollege):
    for i in dealcollege:
        print("%-4s%-16s%-6s%-8s%-10s%-12s" % (i[0], i[1], i[2], i[3], i[4], i[5]))

# Processed rows, one list of six fields per university
dealcollege = []

# Driver function
def run():
    txt = gethtml()
    collegesoup = htmlsoup(txt)
    d = dealhtml(collegesoup)
    dd = dealcolleges(d)
    printcollege(dd)

if __name__ == '__main__':
    run()

 

 

The word cloud code is as follows

from wordcloud import WordCloud
# Provinces, taken from column 2 of each row (skip the header row)
province = []
for d in dealcollege[1:]:
    province.append(d[2])
# Distinct provinces
everyprovince = set(province)
listp = list(everyprovince)
# Total number of distinct provinces
len_listp = len(listp)
number = [0 for i in range(len_listp)]
# Count the universities per province
for p in province:
    for i in range(len_listp):
        if listp[i] == p:
            number[i] += 1
            break

strprovince = ''
# Repeat each province name by its count
for i in range(len_listp):
    intnumber = int(number[i])
    # The '\t' separator lets WordCloud split the words
    strp = (listp[i] + '\t') * intnumber
    strprovince += strp
# Path to a Chinese font file (SimHei)
path = 'e:\\SimHei.ttf'
w = WordCloud(font_path=path,
              background_color='white',
              width=800,
              height=600,
              )
# Feed the text to the word cloud
w.generate(strprovince)
plt.imshow(w)
# Hide the axes
plt.axis('off')
# Display
plt.show()
# Save the word cloud ('address' is a placeholder file name)
w.to_file('address')

 

 

Visualization

# Bar chart
import matplotlib.pyplot as plt
import seaborn as sns
# Chinese font support
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
x = listp
y = number
# Draw a bar chart with seaborn
data = pandas.DataFrame({
    'Number': y,
    'Province': x})
sns.barplot(
    x="Province",
    y="Number",
    data=data
)
plt.show()

# Scatter plot
# Universities per star level
xx = [0 for i in range(8)]
# Universities per score range
score = [0 for i in range(8)]
for i in dealcollege[1:]:
    if i[3] != 'No Score':
        # Parse the score; float() handles values such as '76.5'
        inti = int(float(i[3]))
        if 0 <= inti < 60:
            score[0] += 1
        elif 60 <= inti < 65:
            score[1] += 1
        elif 65 <= inti < 70:
            score[2] += 1
        elif 70 <= inti < 75:
            score[3] += 1
        elif 75 <= inti < 80:
            score[4] += 1
        elif 80 <= inti < 85:
            score[5] += 1
        elif 85 <= inti < 90:
            score[6] += 1
        elif 90 <= inti <= 100:  # include the full-score case
            score[7] += 1
    # i[-2] is the star-ranking field; dropping its last character
    # leaves the star number
    for h in range(8):
        if int(i[-2][:-1]) == h:
            xx[h] += 1
plt.scatter(score, xx, marker='x', c="red", alpha=0.8)
plt.show()

 

 

Saving the Data

# Save to CSV
import csv
# utf-8-sig so the Chinese text opens correctly in Excel
f = open('e:\\daxue.csv', 'w', newline='', encoding='utf-8-sig')
w = csv.writer(f)
for i in dealcollege:
    w.writerow(i)
f.close()
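To check the saved file, it can be read back with pandas (a usage sketch, assuming the file was written as above):

import pandas
# Read the CSV back; utf-8-sig matches the encoding used when writing
df = pandas.read_csv('e:\\daxue.csv', encoding='utf-8-sig')
print(df.head())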

 

 

All Code

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import pandas
import seaborn as sns
from wordcloud import WordCloud
import csv


# Fetch the web page
def gethtml():
    # Exception handling
    try:
        # URL of the Gaosan ranking page
        url = 'http://www.gaosan.com/gaokao/299171.html'
        # Request the page
        resp = requests.get(url)
        # Raise an HTTPError if the status code is not 200
        resp.raise_for_status()
        # Unify the encoding
        resp.encoding = resp.apparent_encoding
        return resp.text
    except requests.RequestException:
        # Report the failure
        return 'Get Failed'

# Accumulates the raw cell texts
college = []

# Parse the web page with BeautifulSoup
def htmlsoup(txt):
    # Parse the page according to HTML rules
    soup = BeautifulSoup(txt, 'html.parser')
    # Find all tbody tags
    soupf = soup.find_all('tbody')
    # Traverse the parsed page
    for i in soupf:
        # Find all tr tags (one per table row)
        tr = i.find_all('tr')
        for j in tr:
            # Find all td tags (one per cell); find_all skips the
            # whitespace text nodes between tags
            for td in j.find_all('td'):
                # .string returns the text between the tags as str
                strtr = str(td.string)
                # Append the cell text to the list
                college.append(strtr)
    return college

# Group the flat cell list into rows
def dealhtml(college):
    # Counter used for slicing
    dealmark = 0
    # Holds the current group of six cells
    dealc = []
    for c in college:
        dealmark = dealmark + 1
        # Collect cells in groups of six
        if dealmark % 6 != 0:
            dealc.append(c)
        # The sixth cell completes a row
        else:
            dealc.append(c)
            # Add the finished row to the total
            dealcollege.append(dealc)
            # Reinitialize for the next row
            dealc = []
    return dealcollege

# Handle missing scores
def dealcolleges(dealcollege):
    for i in dealcollege:
        # A full-width space means the score is missing
        if i[3] == '\u3000':
            i[3] = 'No Score'
    return dealcollege

# Output function
def printcollege(dealcollege):
    for i in dealcollege:
        print("%-4s%-16s%-6s%-8s%-10s%-12s" % (i[0], i[1], i[2], i[3], i[4], i[5]))

# Processed rows, one list of six fields per university
dealcollege = []

# Driver function
def run():
    txt = gethtml()
    collegesoup = htmlsoup(txt)
    d = dealhtml(collegesoup)
    dd = dealcolleges(d)
    printcollege(dd)

if __name__ == '__main__':
    run()

# Word cloud
# Provinces, taken from column 2 of each row (skip the header row)
province = []
for d in dealcollege[1:]:
    province.append(d[2])
# Distinct provinces
everyprovince = set(province)
listp = list(everyprovince)
# Total number of distinct provinces
len_listp = len(listp)
number = [0 for i in range(len_listp)]
# Count the universities per province
for p in province:
    for i in range(len_listp):
        if listp[i] == p:
            number[i] += 1
            break

strprovince = ''
# Repeat each province name by its count
for i in range(len_listp):
    intnumber = int(number[i])
    # The '\t' separator lets WordCloud split the words
    strp = (listp[i] + '\t') * intnumber
    strprovince += strp
# Path to a Chinese font file (SimHei)
path = 'e:\\SimHei.ttf'
w = WordCloud(font_path=path,
              background_color='white',
              width=800,
              height=600,
              )
# Feed the text to the word cloud
w.generate(strprovince)
plt.imshow(w)
# Hide the axes
plt.axis('off')
# Display
plt.show()
# Save the word cloud ('address' is a placeholder file name)
w.to_file('address')

# Bar chart
# Chinese font support
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
x = listp
y = number
# Draw a bar chart with seaborn
data = pandas.DataFrame({
    'Number': y,
    'Province': x})
sns.barplot(
    x="Province",
    y="Number",
    data=data
)
plt.show()

# Scatter plot
# Universities per star level
xx = [0 for i in range(8)]
# Universities per score range
score = [0 for i in range(8)]
for i in dealcollege[1:]:
    if i[3] != 'No Score':
        # Parse the score; float() handles values such as '76.5'
        inti = int(float(i[3]))
        if 0 <= inti < 60:
            score[0] += 1
        elif 60 <= inti < 65:
            score[1] += 1
        elif 65 <= inti < 70:
            score[2] += 1
        elif 70 <= inti < 75:
            score[3] += 1
        elif 75 <= inti < 80:
            score[4] += 1
        elif 80 <= inti < 85:
            score[5] += 1
        elif 85 <= inti < 90:
            score[6] += 1
        elif 90 <= inti <= 100:  # include the full-score case
            score[7] += 1
    # i[-2] is the star-ranking field; dropping its last character
    # leaves the star number
    for h in range(8):
        if int(i[-2][:-1]) == h:
            xx[h] += 1
plt.scatter(score, xx, marker='x', c="red", alpha=0.8)
plt.show()

# Save to CSV
# utf-8-sig so the Chinese text opens correctly in Excel
f = open('e:\\daxue.csv', 'w', newline='', encoding='utf-8-sig')
w = csv.writer(f)
for i in dealcollege:
    w.writerow(i)
f.close()

 

5. Summary (10 points)

1. What conclusions can you draw from the analysis and visualization of the subject data? Did you meet your expected goals?

Universities are concentrated in Beijing, Jiangsu, Shandong, and Henan, and Beijing has both the most universities and the highest-quality ones. With the nationwide rankings and star ratings, examinees can work out their best choice of school, so the expected goal is reached.

2. What did you gain from completing this design? What are your suggestions for improvement?

You can find a suitable university according to its ranking and level. The information on this web page is limited; if the best majors of each university could also be crawled and analyzed, the analysis would be more targeted.
