preface
Using Python to realize the visualization of China Metro data. No more nonsense.
Let's start happily~
development tool
Python version: 3.6.4
Related modules:
requests module;
wordcloud module;
pandas module;
numpy module;
jieba module;
Pyecarts module;
matplotlib module;
And some Python built-in modules.
Environment construction
Many people learn Python and don't know where to start. Many people learn to find python,After mastering the basic grammar, I don't know where to start. Many people who may already know the case do not learn more advanced knowledge. For these three types of people, I provide you with a good learning platform, free access to video tutorials, e-books, and the source code of the course! QQ Group:101677771 Welcome to join us and discuss and study together
Install Python and add it to the environment variable. pip can install the relevant modules required.
This time, through the acquisition of subway line data, the urban distribution data are visually analyzed.
Analysis acquisition
Metro information is obtained from Gaode map.
The above mainly obtains the "id", "cityname" and "name" of the city.
It is used to splice the request website to obtain the specific information of the subway line.
Find the request information and get the details of subway lines and stations in the lines in each city.
get data
Specific code
import json import requests from bs4 import BeautifulSoup headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'} def get_message(ID, cityname, name): """ Metro line information acquisition """ url = 'http://map.amap.com/service/subway?_1555502190153&srhdata=' + ID + '_drw_' + cityname + '.json' response = requests.get(url=url, headers=headers) html = response.text result = json.loads(html) for i in result['l']: for j in i['st']: # Judge whether the subway line is included if len(i['la']) > 0: print(name, i['ln'] + '(' + i['la'] + ')', j['n']) with open('subway.csv', 'a+', encoding='gbk') as f: f.write(name + ',' + i['ln'] + '(' + i['la'] + ')' + ',' + j['n'] + '\n') else: print(name, i['ln'], j['n']) with open('subway.csv', 'a+', encoding='gbk') as f: f.write(name + ',' + i['ln'] + ',' + j['n'] + '\n') def get_city(): """ Urban information acquisition """ url = 'http://map.amap.com/subway/index.html?&1100' response = requests.get(url=url, headers=headers) html = response.text # code html = html.encode('ISO-8859-1') html = html.decode('utf-8') soup = BeautifulSoup(html, 'lxml') # City list res1 = soup.find_all(class_="city-list fl")[0] res2 = soup.find_all(class_="more-city-list")[0] for i in res1.find_all('a'): # City ID value ID = i['id'] # City Pinyin name cityname = i['cityname'] # City name name = i.get_text() get_message(ID, cityname, name) for i in res2.find_all('a'): # City ID value ID = i['id'] # City Pinyin name cityname = i['cityname'] # City name name = i.get_text() get_message(ID, cityname, name) if __name__ == '__main__': get_city()
Display of data acquisition results
3541 subway stations
Data visualization
Firstly, clean the data to remove the duplicate transfer station information.
from wordcloud import WordCloud, ImageColorGenerator from pyecharts import Line, Bar import matplotlib.pyplot as plt import pandas as pd import numpy as np import jieba # Set column name to align with data pd.set_option('display.unicode.ambiguous_as_wide', True) pd.set_option('display.unicode.east_asian_width', True) # Show 10 lines pd.set_option('display.max_rows', 10) # Read data df = pd.read_csv('subway.csv', header=None, names=['city', 'line', 'station'], encoding='gbk') # Subway lines in various cities df_line = df.groupby(['city', 'line']).count().reset_index() print(df_line)
By grouping cities and subway lines, the total number of subway lines in China is obtained.
183 subway lines
def create_map(df): # draw a map value = [i for i in df['line']] attr = [i for i in df['city']] geo = Geo("Distribution of opened metro cities", title_pos='center', title_top='0', width=800, height=400, title_color="#fff", background_color="#404a59", ) geo.add("", attr, value, is_visualmap=True, visual_range=[0, 25], visual_text_color="#fff", symbol_size=15) geo.render("Distribution of opened metro cities.html") def create_line(df): """ Number and distribution of generated urban subway lines """ title_len = df['line'] bins = [0, 5, 10, 15, 20, 25] level = ['0-5', '5-10', '10-15', '15-20', '20 above'] len_stage = pd.cut(title_len, bins=bins, labels=level).value_counts().sort_index() # Generate histogram attr = len_stage.index v1 = len_stage.values bar = Bar("Number and distribution of subway lines in each city", title_pos='center', title_top='18', width=800, height=400) bar.add("", attr, v1, is_stack=True, is_label_show=True) bar.render("Number and distribution of subway lines in each city.html") # Number of subway lines in each city df_city = df_line.groupby(['city']).count().reset_index().sort_values(by='line', ascending=False) print(df_city) create_map(df_city) create_line(df_city)
Data of cities that have opened subway, as well as the number of subway lines in each city.
Subways opened in 32 cities
Urban distribution
Most of them are provincial capitals, as well as some cities with strong economic strength.
Number and distribution of lines
It can be seen that most of them are still in the "0-5" stage, of course, at least 1 line.
# Which line has the most subway stations in which city print(df_line.sort_values(by='station', ascending=False))
Which line has the most subway stations in which city
Beijing line 10 is the first and Chongqing line 3 is the second
Remove data from duplicate transfer stations
# Remove subway data from duplicate transfer stations df_station = df.groupby(['city', 'station']).count().reset_index() print(df_station)
Including 3034 subway stations
Nearly 400 subway stations have been reduced
Next, let's see which city has the most subway stations
# Count the number of subway stations included in each city (duplicate transfer stations have been removed) print(df_station.groupby(['city']).count().reset_index().sort_values(by='station', ascending=False))
There are so many subway stations in Wuhan
Realize the operation in the new weekly to generate the subway noun cloud
def create_wordcloud(df): """ Generate Metro noun cloud """ # participle text = '' for line in df['station']: text += ' '.join(jieba.cut(line, cut_all=False)) text += ' ' backgroud_Image = plt.imread('rocket.jpg') wc = WordCloud( background_color='white', mask=backgroud_Image, font_path='C:\Windows\Fonts\Huakangli Gold Black W8.TTF', max_words=1000, max_font_size=150, min_font_size=15, prefer_horizontal=1, random_state=50, ) wc.generate_from_text(text) img_colors = ImageColorGenerator(backgroud_Image) wc.recolor(color_func=img_colors) # Look at those with high word frequency process_word = WordCloud.process_text(wc, text) sort = sorted(process_word.items(), key=lambda e: e[1], reverse=True) print(sort[:50]) plt.imshow(wc) plt.axis('off') wc.to_file("Subway noun cloud.jpg") print('Word cloud generated successfully!') create_wordcloud(df_station)
Show word cloud