Python crawler in practice: visualizing China metro data with the pyecharts module

Preface

This article uses Python to scrape subway data for Chinese cities and visualize it.

Let's get started.

Development tools

Python version: 3.6.4

Related modules:

requests module;

wordcloud module;

pandas module;

numpy module;

jieba module;

pyecharts module;

matplotlib module;

bs4 and lxml modules (used by the scraper below);

and some Python built-in modules.

Environment setup

Install Python and add it to your PATH environment variable, then use pip to install the required modules.
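
Before running the scripts, it can help to check which of the modules listed above are still missing. The following is a minimal sketch: the helper name missing_packages and the mapping of import names to pip package names are my own additions, not part of the original code. The charts later in this article use the pyecharts 0.5.x import style, so presumably a 0.5.x release of pyecharts is needed.

```python
import importlib.util

# Import name -> package name to pass to "pip install" (assumed mapping)
REQUIRED = {
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "lxml": "lxml",
    "pandas": "pandas",
    "numpy": "numpy",
    "jieba": "jieba",
    "wordcloud": "wordcloud",
    "matplotlib": "matplotlib",
    "pyecharts": "pyecharts",
}


def missing_packages(required):
    """Return the pip names of modules that cannot be found on this system."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    # Print a ready-to-copy install command for whatever is missing
    print("pip install " + " ".join(missing))
else:
    print("All required modules are installed.")
```

The check only looks the modules up; it does not import them, so it runs quickly even when heavy packages such as pandas are installed.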

This time we collect subway line data and use it to visualize and analyze how subways are distributed across cities.

Analyzing the data source

Subway information is obtained from Gaode Map (AMap).

The city list page provides each city's "id", "cityname" and "name".

These values are spliced into the request URL that returns the subway line details for that city.

The response to that request contains every subway line in the city and the stations on each line.
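
The two steps above (splicing the URL, then walking the response) can be sketched offline as follows. The key names 'l', 'st', 'ln', 'la' and 'n' are the ones used by the code in this article; the helper names build_url and iter_stations, the sample payload, and the example values "1100" and "beijing" are invented for illustration and are not real scraped data.

```python
def build_url(city_id, cityname):
    """Splice the city id and pinyin name into the subway data URL."""
    return (
        "http://map.amap.com/service/subway?_1555502190153&srhdata="
        + city_id + "_drw_" + cityname + ".json"
    )


def iter_stations(payload):
    """Yield (line label, station name) pairs from a parsed response."""
    for line in payload["l"]:
        # Append the extra label in parentheses only when it is non-empty
        label = line["ln"]
        if line["la"]:
            label = f"{line['ln']}({line['la']})"
        for station in line["st"]:
            yield label, station["n"]


# Invented sample with the same shape as the fields read by the scraper
SAMPLE = {
    "l": [
        {"ln": "Line 1", "la": "", "st": [{"n": "Central"}, {"n": "North"}]},
        {"ln": "Line 2", "la": "East Branch", "st": [{"n": "Central"}]},
    ]
}

url = build_url("1100", "beijing")
pairs = list(iter_stations(SAMPLE))
print(url)
print(pairs)
```

Separating URL construction from parsing makes it possible to test the parsing logic without sending any request.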

Getting the data

The full code:

import json
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

def get_message(ID, cityname, name):
    """
    Fetch the lines and stations of one city and append them to subway.csv
    """
    url = 'http://map.amap.com/service/subway?_1555502190153&srhdata=' + ID + '_drw_' + cityname + '.json'
    response = requests.get(url=url, headers=headers)
    result = json.loads(response.text)
    # Open the output file once per city instead of once per station
    with open('subway.csv', 'a+', encoding='gbk') as f:
        for i in result['l']:
            # When the line has an extra label in 'la', append it in parentheses
            line = (i['ln'] + '(' + i['la'] + ')') if len(i['la']) > 0 else i['ln']
            for j in i['st']:
                print(name, line, j['n'])
                f.write(name + ',' + line + ',' + j['n'] + '\n')

def get_city():
    """
    Fetch the list of cities that have a subway
    """
    url = 'http://map.amap.com/subway/index.html?&1100'
    response = requests.get(url=url, headers=headers)
    # The page is UTF-8; set the encoding explicitly instead of
    # re-encoding the text through ISO-8859-1
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'lxml')
    # The city list is split across two containers on the page
    res1 = soup.find_all(class_="city-list fl")[0]
    res2 = soup.find_all(class_="more-city-list")[0]
    for i in res1.find_all('a') + res2.find_all('a'):
        # City ID value
        ID = i['id']
        # City Pinyin name
        cityname = i['cityname']
        # City name
        name = i.get_text()
        get_message(ID, cityname, name)

if __name__ == '__main__':
    get_city()
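
A note on the decoding step in get_city(): requests can fall back to ISO-8859-1 for text responses that do not declare a charset, which garbles UTF-8 Chinese text. The sketch below uses an invented sample string rather than the real page. It shows the garbling and why re-encoding as ISO-8859-1 and then decoding as UTF-8 recovers the original text.

```python
# Invented sample text standing in for a city name on the page
original = "北京市"

# The server sends UTF-8 bytes
raw_bytes = original.encode("utf-8")

# Decoding those bytes as ISO-8859-1 never fails, but yields mojibake
garbled = raw_bytes.decode("ISO-8859-1")

# ISO-8859-1 maps every byte to one character, so the round trip is lossless:
# encoding back gives the original bytes, which can then be decoded correctly
repaired = garbled.encode("ISO-8859-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

The round trip only works when the text really was decoded as ISO-8859-1; if the text was already decoded correctly, encoding Chinese characters as ISO-8859-1 raises UnicodeEncodeError.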

Results of data acquisition

3541 station records were collected (a transfer station appears once for each line it is on)
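
The scraper builds each CSV row by joining fields with commas, which would break if a line or station name ever contained a comma. As a hedged alternative, the standard csv module quotes such fields automatically. This is a minimal sketch with invented rows and a hypothetical helper name write_rows; it writes to an in-memory buffer instead of subway.csv.

```python
import csv
import io


def write_rows(fileobj, rows):
    """Write (city, line, station) rows, quoting fields where needed."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerows(rows)


# Invented rows; the second line name deliberately contains a comma
rows = [
    ("CityA", "Line 1", "Central"),
    ("CityA", "Line 2(East, Branch)", "North"),
]

buffer = io.StringIO()
write_rows(buffer, rows)
csv_text = buffer.getvalue()

# Reading the text back restores the original three fields per row
parsed = list(csv.reader(io.StringIO(csv_text)))
print(csv_text)
print(parsed)
```

pandas.read_csv understands the same quoting, so a file written this way can still be loaded with the read_csv call used later.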

Data visualization

First, load and clean the data; the duplicate transfer station records are removed further below.

from wordcloud import WordCloud, ImageColorGenerator
# pyecharts 0.5.x style imports; Geo is needed by create_map() below
from pyecharts import Geo, Bar
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import jieba

# Align column headers with the data when cells contain wide East Asian characters
pd.set_option('display.unicode.ambiguous_as_wide', True)
pd.set_option('display.unicode.east_asian_width', True)
# Show at most 10 rows
pd.set_option('display.max_rows', 10)
# Read data
df = pd.read_csv('subway.csv', header=None, names=['city', 'line', 'station'], encoding='gbk')
# Count the stations on each line of each city (one row per line)
df_line = df.groupby(['city', 'line']).count().reset_index()
print(df_line)

Grouping by city and line gives one row per subway line, so the number of rows is the total number of subway lines in China.

183 subway lines in total
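
To make the grouping step concrete, here is a minimal sketch on invented toy data (the city, line and station names are made up, not taken from the scraped file). It uses the same groupby(...).count() pattern as the code above.

```python
import pandas as pd

# Invented records in the same (city, line, station) layout as subway.csv
rows = [
    ("CityA", "Line 1", "Central"),
    ("CityA", "Line 1", "North"),
    ("CityA", "Line 2", "Central"),
    ("CityA", "Line 2", "East"),
    ("CityB", "Line 1", "Harbor"),
]
toy = pd.DataFrame(rows, columns=["city", "line", "station"])

# One row per (city, line); the 'station' column becomes a station count
toy_line = toy.groupby(["city", "line"]).count().reset_index()
total_lines = len(toy_line)

# Grouping again by city turns the 'line' column into a line count per city
toy_city = toy_line.groupby("city").count().reset_index()
lines_per_city = toy_city.set_index("city")["line"].to_dict()

print(toy_line)
print(total_lines)
print(lines_per_city)
```

Note that count() replaces the non-grouped columns with counts, which is why the 'line' and 'station' columns are reused as counters in the later steps.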

def create_map(df):
    """
    Draw a map of the cities that have opened a subway
    """
    value = list(df['line'])
    attr = list(df['city'])
    geo = Geo("Distribution of opened metro cities", title_pos='center', title_top='0', width=800, height=400,
              title_color="#fff", background_color="#404a59")
    geo.add("", attr, value, is_visualmap=True, visual_range=[0, 25], visual_text_color="#fff", symbol_size=15)
    geo.render("Distribution of opened metro cities.html")

def create_line(df):
    """
    Draw a bar chart of how many cities fall into each range of line counts
    """
    title_len = df['line']
    bins = [0, 5, 10, 15, 20, 25]
    level = ['0-5', '5-10', '10-15', '15-20', '20-25']
    len_stage = pd.cut(title_len, bins=bins, labels=level).value_counts().sort_index()
    # Generate the bar chart
    attr = len_stage.index
    v1 = len_stage.values
    bar = Bar("Number and distribution of subway lines in each city", title_pos='center', title_top='18', width=800, height=400)
    bar.add("", attr, v1, is_stack=True, is_label_show=True)
    bar.render("Number and distribution of subway lines in each city.html")

# Number of subway lines in each city
df_city = df_line.groupby(['city']).count().reset_index().sort_values(by='line', ascending=False)
print(df_city)
create_map(df_city)
create_line(df_city)

This gives the list of cities that have opened a subway, along with the number of subway lines in each city.

32 cities have opened a subway

Distribution of cities

Most of them are provincial capitals, along with some economically strong cities.

Distribution of line counts

Most cities are still in the "0-5" range; every city on the list has, of course, at least 1 line.
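
The bar chart relies on pd.cut to place each city's line count into a range. The sketch below uses invented counts and illustrative labels. With the default settings each bin includes its right edge, so a city with exactly 5 lines falls into "0-5", not "5-10".

```python
import pandas as pd

# Invented per-city line counts
line_counts = pd.Series([1, 2, 5, 6, 10, 22], name="line")

bins = [0, 5, 10, 15, 20, 25]
labels = ["0-5", "5-10", "10-15", "15-20", "20-25"]

# Default right=True: the bins are (0, 5], (5, 10], (10, 15], (15, 20], (20, 25]
stage = pd.cut(line_counts, bins=bins, labels=labels)

# Count cities per range and keep the ranges in their natural order
counts = stage.value_counts().sort_index()
print(counts)
```

Values outside the outermost edges (for example a count above 25) are not assigned to any bin and become NaN, so the last edge should stay above the largest count in the data.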

# Sort the lines by station count to find the line with the most stations
print(df_line.sort_values(by='station', ascending=False))

The lines with the most stations

Beijing Line 10 ranks first and Chongqing Line 3 second

Removing duplicate transfer stations

# Remove duplicate transfer station records: one row per (city, station)
df_station = df.groupby(['city', 'station']).count().reset_index()
print(df_station)

3034 subway stations remain

Compared with the 3541 raw records, about 500 duplicate transfer station entries are removed (3541 - 3034 = 507)
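
The deduplication works because a transfer station appears once per line in the raw data, but only once per (city, station) pair after grouping. Here is a minimal sketch on invented rows; drop_duplicates is shown as an alternative I am suggesting, not something the original code uses.

```python
import pandas as pd

toy = pd.DataFrame(
    [
        ("CityA", "Line 1", "Central"),
        ("CityA", "Line 2", "Central"),  # transfer station listed a second time
        ("CityA", "Line 2", "East"),
        ("CityB", "Line 1", "Central"),  # same name in another city: kept
    ],
    columns=["city", "line", "station"],
)

# Approach used in the article: one row per (city, station);
# the 'line' column becomes the number of lines serving that station
grouped = toy.groupby(["city", "station"]).count().reset_index()

# Alternative: drop repeated (city, station) pairs directly
unique_stations = toy.drop_duplicates(subset=["city", "station"])[["city", "station"]]

# Stations per city after deduplication
stations_per_city = grouped.groupby("city").count()["station"].to_dict()

print(grouped)
print(stations_per_city)
```

The groupby version has a side benefit: stations where the 'line' count is greater than 1 are exactly the transfer stations.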

Next, let's see which city has the most subway stations

# Count the number of subway stations included in each city (duplicate transfer stations have been removed)
print(df_station.groupby(['city']).count().reset_index().sort_values(by='station', ascending=False))

There are so many subway stations in Wuhan

Next, reproduce the approach used in New Weekly and generate a word cloud of subway station names

def create_wordcloud(df):
    """
    Generate a word cloud of subway station names
    """
    # Word segmentation
    text = ''
    for line in df['station']:
        text += ' '.join(jieba.cut(line, cut_all=False))
        text += ' '
    background_image = plt.imread('rocket.jpg')
    wc = WordCloud(
        background_color='white',
        mask=background_image,
        # Raw string so the backslashes are not treated as escape sequences
        font_path=r'C:\Windows\Fonts\Huakangli Gold Black W8.TTF',
        max_words=1000,
        max_font_size=150,
        min_font_size=15,
        prefer_horizontal=1,
        random_state=50,
    )
    wc.generate_from_text(text)
    # Recolor the words using the colors of the mask image
    img_colors = ImageColorGenerator(background_image)
    wc.recolor(color_func=img_colors)
    # Print the 50 most frequent words
    process_word = wc.process_text(text)
    sort = sorted(process_word.items(), key=lambda e: e[1], reverse=True)
    print(sort[:50])
    plt.imshow(wc)
    plt.axis('off')
    wc.to_file("Subway noun cloud.jpg")
    print('Word cloud generated successfully!')

create_wordcloud(df_station)

Show word cloud
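
The word cloud sizes each word by how often it appears in the segmented station names. The counting idea can be sketched with the standard library alone. The token lists below are invented and pre-segmented by hand (in the real script jieba produces them), and top_words is a hypothetical helper name.

```python
from collections import Counter


def top_words(segmented_names, n=3):
    """Count tokens across all station names and return the n most common."""
    counts = Counter()
    for tokens in segmented_names:
        counts.update(tokens)
    return counts.most_common(n)


# Invented station names, already split into words
sample = [
    ["人民", "广场"],
    ["中山", "公园"],
    ["人民", "公园"],
    ["火车站"],
]

top = top_words(sample, n=2)
print(top)
```

Words that recur across many station names, such as common place-name suffixes, rise to the top and are the ones drawn largest in the cloud.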

 

Keywords: Python

Added by JohnMike on Wed, 19 Jan 2022 17:27:17 +0200