OD identification and community discovery of Python shared bicycle data (TransBigData+igraph)

Jupyter notebook of this case: click here .

For the travel of sharing bicycles, each travel can be regarded as a travel process from the beginning to the end. When we regard the starting point and the ending point as nodes and the travel between them as edges, we can build a network. By analyzing this network, we can get information about the spatial structure of the city and the macro travel characteristics of shared bicycle demand.

Community discovery, also known as graph segmentation, helps us reveal the hidden relationship between nodes in the network. In this example, we will introduce how to integrate TransBigData into the community discovery analysis process of sharing bicycle data.

import pandas as pd
import numpy as np
import geopandas as gpd
import transbigdata as tbd

Data preprocessing

Before community discovery, we first need to preprocess the data. Extract the travel OD from the shared single vehicle order, eliminate the abnormal travel, and take the cleaned data as the research object.

#Read shared vehicle data
bikedata = pd.read_csv(r'data/bikedata-sample.csv')
bikedata.head(5)
BIKE_IDDATA_TIMELOCK_STATUSLONGITUDELATITUDE
052018-09-01 0:00:361121.36356631.259615
162018-09-01 0:00:500121.40622631.214436
262018-09-01 0:03:011121.40940231.215259
362018-09-01 0:24:530121.40922831.214427
462018-09-01 0:26:381121.40977131.214406

Read the boundary of the study area and use TBD clean_ The outofshape method eliminates data outside the study area

#Read the boundary of Shanghai Administrative Division
shanghai_admin = gpd.read_file(r'data/shanghai.json')
#Exclude data outside the scope of the study
bikedata = tbd.clean_outofshape(bikedata, shanghai_admin, col=['LONGITUDE', 'LATITUDE'], accuracy=500)

Use TBD bikedata_ to_ OD method identifies travel OD information from single vehicle data

#Identify single vehicle travel OD
move_data,stop_data = tbd.bikedata_to_od(bikedata,
                   col = ['BIKE_ID','DATA_TIME','LONGITUDE','LATITUDE','LOCK_STATUS'])
move_data.head(5)
BIKE_IDstimeslonslatetimeelonelat
9662018-09-01 0:00:50121.40622631.2144362018-09-01 0:03:01121.40940231.215259
56162018-09-01 0:24:53121.40922831.2144272018-09-01 0:26:38121.40977131.214406
56462018-09-01 0:50:16121.40972731.2144032018-09-01 0:52:14121.41261031.214905
78462018-09-01 0:53:38121.41333331.2149512018-09-01 0:55:38121.41265631.217051
102862018-09-01 11:35:01121.41926131.2134142018-09-01 11:35:13121.41951831.213657

We need to eliminate long and short bike sharing trips. Use TBD Getdistance obtains the straight-line distance between the start and end points, and filters and deletes trips with a straight-line distance of less than 100m and more than 10km

#Calculate the riding straight distance
move_data['distance'] = tbd.getdistance(move_data['slon'],move_data['slat'],move_data['elon'],move_data['elat'])
#Clean the riding data and delete long and short trips
move_data = move_data[(move_data['distance']>100)&(move_data['distance']<10000)]

Next, we take 500 meters × The 500m grid is the smallest analysis unit, using TBD grid_ The params method obtains the grid division parameters, and then inputs the parameters into TBD odagg_ Grid method for grid aggregation of OD

# Get grid partition parameters
bounds = (120.85, 30.67, 122.24, 31.87)
params = tbd.grid_params(bounds,accuracy = 500)
#Aggregate OD
od_gdf = tbd.odagg_grid(move_data, params, col=['slon', 'slat', 'elon', 'elat'])
od_gdf.head(5)
SLONCOLSLATCOLELONCOLELATCOLcountSHBLONSHBLATEHBLONEHBLATgeometry
0269526961120.98678231.097177120.98678231.101674LINESTRING (120.98678 31.09718, 120.98678 31.1...
408031171291161271121.46551931.250062121.46025831.241069LINESTRING (121.46552 31.25006, 121.46026 31.2...
408071171291171281121.46551931.250062121.46551931.245565LINESTRING (121.46552 31.25006, 121.46552 31.2...
408101171291171311121.46551931.250062121.46551931.259055LINESTRING (121.46552 31.25006, 121.46552 31.2...
408111171291181261121.46551931.250062121.47078031.236572LINESTRING (121.46552 31.25006, 121.47078 31.2...

The results of OD aggregation are visualized on the map with TBD plot_ Map load the map basemap and use TBD Plot scale add scale bar and north arrow

#Plot Frame 
import matplotlib.pyplot as plt
import plot_map
fig =plt.figure(1,(8,8),dpi=300)
ax =plt.subplot(111)
plt.sca(ax)
#Add map basemap
tbd.plot_map(plt,bounds,zoom = 11,style = 8)
#Draw colorbar
cax = plt.axes([0.05, 0.33, 0.02, 0.3])
plt.title('Data count')
plt.sca(ax)
#Draw OD
od_gdf.plot(ax = ax,column = 'count',cmap = 'Blues_r',linewidth = 0.5,vmax = 10,cax = cax,legend = True)
#Add scale bar and north arrow
tbd.plotscale(ax,bounds = bounds,textsize = 10,compasssize = 1,textcolor = 'white',accuracy = 2000,rect = [0.06,0.03],zorder = 10)
plt.axis('off')
plt.xlim(bounds[0],bounds[2])
plt.ylim(bounds[1],bounds[3])
plt.show()

Extract node information

Use the igraph package to build the network. In Python, the function of igraph is similar to that of networkx. They both provide the function of network analysis, but only differ in the support of some algorithms. When building a network, we need to provide information about the nodes and edges of the network to igraph. Take each grid appearing in OD data as a node. When building node information, you need to create a numeric number starting from 0 for the node. The code is as follows

#Change the longitude and latitude grid number of the start and end points into a field
od_gdf['S'] = od_gdf['SLONCOL'].astype(str) + ',' + od_gdf['SLATCOL'].astype(str)
od_gdf['E'] = od_gdf['ELONCOL'].astype(str) + ',' + od_gdf['ELATCOL'].astype(str)
#Extract node set
node = set(od_gdf['S'])|set(od_gdf['E'])
#Turn node collection into DataFrame
node = pd.DataFrame(node)
#Renumber nodes
node['id'] = range(len(node))
node
0id
0118,1340
1109,1021
259,712
393,783
496,174
.........
980694,979806
9807106,1529807
9808124,1349808
980998,1589809
9810152,869810

9811 rows × 2 columns

Extract edge information

Connect the new number to the OD information table to extract the travel volume between the new ID S

#Connect the new number to the OD data
node.columns = ['S','S_id']
od_gdf = pd.merge(od_gdf,node,on = ['S'])
node.columns = ['E','E_id']
od_gdf = pd.merge(od_gdf,node,on = ['E'])
#Extract edge information
edge = od_gdf[['S_id','E_id','count']]
edge
S_idE_idcount
0826171051
1951325091
211825093
334825091
4168425091
............
68468802444902
68469421638022
68470478666542
6847164846023
68472786782703

68473 rows × 3 columns

Build network

Import the igraph package, create the network, add nodes, and input the edge data into the network. At the same time, add corresponding weights for each edge

import igraph
#Create network
g = igraph.Graph()
#Add nodes to the network.
g.add_vertices(len(node))
#Add edges to the network.
g.add_edges(edge[['S_id','E_id']].values)
#Extract the weights of the edges.
edge_weights = edge[['count']].values
#Add weights to edges.
for i in range(len(edge_weights)):
    g.es[i]['weight'] = edge_weights[i]

Community discovery

The community discovery algorithm is applied to the constructed network. Among them, we use the g.community that comes with the igraph package_ The multi-level method implements the Fast unfolding community discovery algorithm. As we introduced earlier, the Fast unfolding algorithm combines communities iteratively layer by layer until the modularity is optimal_ Return can be set in the multilevel method_ Levels returns the intermediate result of the iteration. Here we set return_levels is False, and only the final result is returned for analysis

#Community discovery
g_clustered = g.community_multilevel(weights = edge_weights, return_levels=False)

The results of community discovery are stored in G_ In the clustered variable, you can use the built-in method to directly calculate the modularity

#Modularity
g_clustered.modularity
0.8496561130926571

Generally speaking, the modularity above 0.5 already belongs to a high value. The modularity of this result reaches 0.84, indicating that the community structure of the network is very obvious, and the community division result can also divide the network well. Next, we assign the community division results to the node information table to prepare for the following visualization. The code is as follows

#Assign the result to the node
node['group'] = g_clustered.membership
#Heavy life list
node.columns = ['grid','node_id','group']
node
gridnode_idgroup
0118,13400
1109,10211
259,7122
393,7833
496,1744
............
980694,9798068
9807106,152980736
9808124,134980837
980998,15898099
9810152,86981026

9811 rows × 3 columns

Community visualization

In the results of community discovery, there may be only a small number of nodes in some communities, which can not form a large-scale community. These communities are outliers and should be deleted before visualization. Here we keep communities with more than 10 grids

#Count the number of grids per community
group = node['group'].value_counts()
#Extract communities larger than 10 grids
group = group[group>10]
#Keep only the grids of these communities
node = node[node['group'].apply(lambda r:r in group.index)]

Restore the grid number and use TBD gridid_ to_ The polygon method generates the geographic geometry of the grid from the grid number

#Get grid number by segmentation
node['LONCOL'] = node['grid'].apply(lambda r:r.split(',')[0]).astype(int)
node['LATCOL'] = node['grid'].apply(lambda r:r.split(',')[1]).astype(int)
#Generate Grid Geographic graphics
node['geometry'] = tbd.gridid_to_polygon(node['LONCOL'],node['LATCOL'],params)
#Convert to GeoDataFrame
import geopandas as gpd
node = gpd.GeoDataFrame(node)
node
gridnode_idgroupLONCOLLATCOLgeometry
0118,13400118134POLYGON ((121.46815 31.27030, 121.47341 31.270...
1109,10211109102POLYGON ((121.42080 31.12641, 121.42606 31.126...
393,78339378POLYGON ((121.33663 31.01849, 121.34189 31.018...
496,17449617POLYGON ((121.35241 30.74419, 121.35767 30.744...
5156,11755156117POLYGON ((121.66806 31.19385, 121.67332 31.193...
.....................
980694,97980689497POLYGON ((121.34189 31.10392, 121.34715 31.103...
9807106,152980736106152POLYGON ((121.40502 31.35124, 121.41028 31.351...
9808124,134980837124134POLYGON ((121.49971 31.27030, 121.50498 31.270...
980998,1589809998158POLYGON ((121.36293 31.37822, 121.36819 31.378...
9810152,8698102615286POLYGON ((121.64702 31.05446, 121.65228 31.054...

8527 rows × 6 columns

In this step, we restore each node to a grid, mark the community number to which the node belongs, generate the geographic information of each grid, and convert it into a GeoDataFrame. We can draw the grid with the following code to test whether the generation is successful

node.plot('group')

Here, we map the group number of the group field to the color for preliminary visualization. The colors of different groups are different. It can be seen from the graph of the result that most grids of the same color are clustered together in geospatial space, indicating that the spatial connection of sharing bicycles can closely connect the close areas in geospatial space

The visualization effect of the previous results is not obvious. We can't clearly see the partition from the figure. Next, we can adjust and visualize the partition results. The adjustment of visualization mainly includes the following ideas:

  • A more appropriate zoning result should be that each region is spatially continuous. In the preliminary visualization results, many grids exist in isolation in space, and these points should be eliminated.

  • In the visualization results, we can merge the grids of the same group to form face features for each partition, so that the boundary of the partition can be drawn in the next visualization.

  • In the zoning results, there may be "enclaves" of other regions in some regions, that is, they belong to this partition but are surrounded by other partitions. They can only "fly" through the possessions of other partitions to reach their own enclaves. This kind of zoning is also difficult to manage in the actual operation of sharing bicycles, which should be avoided.

To solve the above problems, we can use two GIS processing methods provided by TransBigData, TBD merge_ Polygon and TBD polyon_ exterior. TBD merge_ Polygon can merge face features of the same group, while TBD polyon_ Exterior can take the outer boundary of the polygon and then form a new polygon to eliminate the enclave. At the same time, you can also set the minimum area to eliminate face features smaller than this area. The code is as follows

#Combine the face features of the same group by grouping the group field
node_community = tbd.merge_polygon(node,'group')
#Input polygon GeoDataFrame data and take the outer boundary of the polygon to form a new polygon
#Set the minimum area minarea to eliminate all faces smaller than this area to avoid a large number of outliers
node_community = tbd.polyon_exterior(node_community,minarea = 0.000100)

After processing the face features of the community, you need to visualize the face features. We want to give different colors to different faces. In the visualization chapter, we mentioned that when there is no numerical difference between the displayed elements, the color selection needs to keep their respective colors with the same brightness and saturation. seaborn's palette method can quickly generate multiple colors under the same brightness and saturation

#Generate palette
import seaborn as sns
## l: Brightness
## s: Saturation
cmap = sns.hls_palette(n_colors=len(node_community), l=.7, s=0.8)
sns.palplot(cmap)

Visualize community results

#Plot Frame 
import matplotlib.pyplot as plt
import plot_map
fig =plt.figure(1,(8,8),dpi=300)
ax =plt.subplot(111)
plt.sca(ax)
#Add map basemap
tbd.plot_map(plt,bounds,zoom = 10,style = 6)
#Set colormap
from matplotlib.colors import ListedColormap
#Disrupt the order of the community
node_community = node_community.sample(frac=1)
#Draw community
node_community.plot(cmap = ListedColormap(cmap),ax = ax,edgecolor = '#333',alpha = 0.8)
#Add scale bar and north arrow
tbd.plotscale(ax,bounds = bounds,textsize = 10,compasssize = 1,textcolor = 'k'
              ,accuracy = 2000,rect = [0.06,0.03],zorder = 10)
plt.axis('off')
plt.xlim(bounds[0],bounds[2])
plt.ylim(bounds[1],bounds[3])
plt.show()

So far, we have successfully visualized the shared bicycle community and mapped the boundary of each community. When the community discovery model is used for zoning, no geospatial information is input into the model, and the segmentation of the study area only depends on the network connection composed of shared single vehicle travel demand.

Keywords: Python Back-end

Added by La Parka on Sat, 11 Dec 2021 03:39:55 +0200