OD identification and community discovery of Python shared bicycle data (TransBigData+igraph)

Jupyter notebook of this case: click here .

For the travel of sharing bicycles, each travel can be regarded as a travel process from the beginning to the end. When we regard the starting point and the ending point as nodes and the travel between them as edges, we can build a network. By analyzing this network, we can get information about the spatial structure of the city and the macro travel characteristics of shared bicycle demand.

Community discovery, also known as graph segmentation, helps us reveal the hidden relationship between nodes in the network. In this example, we will introduce how to integrate TransBigData into the community discovery analysis process of sharing bicycle data.

import pandas as pd
import numpy as np
import geopandas as gpd
import transbigdata as tbd

Data preprocessing 

Before community discovery, we first need to preprocess the data. Extract the travel OD from the shared single vehicle order, eliminate the abnormal travel, and take the cleaned data as the research object.

#Read shared vehicle data
bikedata = pd.read_csv(r'data/bikedata-sample.csv')
bikedata.head(5)

	BIKE_ID	DATA_TIME	LOCK_STATUS	LONGITUDE	LATITUDE
0	5	2018-09-01 0:00:36	1	121.363566	31.259615
1	6	2018-09-01 0:00:50	0	121.406226	31.214436
2	6	2018-09-01 0:03:01	1	121.409402	31.215259
3	6	2018-09-01 0:24:53	0	121.409228	31.214427
4	6	2018-09-01 0:26:38	1	121.409771	31.214406

Read the boundary of the study area and use TBD clean_ The outofshape method eliminates data outside the study area

#Read the boundary of Shanghai Administrative Division
shanghai_admin = gpd.read_file(r'data/shanghai.json')
#Exclude data outside the scope of the study
bikedata = tbd.clean_outofshape(bikedata, shanghai_admin, col=['LONGITUDE', 'LATITUDE'], accuracy=500)

Use TBD bikedata_ to_ OD method identifies travel OD information from single vehicle data

#Identify single vehicle travel OD
move_data,stop_data = tbd.bikedata_to_od(bikedata,
                   col = ['BIKE_ID','DATA_TIME','LONGITUDE','LATITUDE','LOCK_STATUS'])
move_data.head(5)

	BIKE_ID	stime	slon	slat	etime	elon	elat
96	6	2018-09-01 0:00:50	121.406226	31.214436	2018-09-01 0:03:01	121.409402	31.215259
561	6	2018-09-01 0:24:53	121.409228	31.214427	2018-09-01 0:26:38	121.409771	31.214406
564	6	2018-09-01 0:50:16	121.409727	31.214403	2018-09-01 0:52:14	121.412610	31.214905
784	6	2018-09-01 0:53:38	121.413333	31.214951	2018-09-01 0:55:38	121.412656	31.217051
1028	6	2018-09-01 11:35:01	121.419261	31.213414	2018-09-01 11:35:13	121.419518	31.213657

We need to eliminate long and short bike sharing trips. Use TBD Getdistance obtains the straight-line distance between the start and end points, and filters and deletes trips with a straight-line distance of less than 100m and more than 10km

#Calculate the riding straight distance
move_data['distance'] = tbd.getdistance(move_data['slon'],move_data['slat'],move_data['elon'],move_data['elat'])
#Clean the riding data and delete long and short trips
move_data = move_data[(move_data['distance']>100)&(move_data['distance']<10000)]

Next, we take 500 meters × The 500m grid is the smallest analysis unit, using TBD grid_ The params method obtains the grid division parameters, and then inputs the parameters into TBD odagg_ Grid method for grid aggregation of OD

# Get grid partition parameters
bounds = (120.85, 30.67, 122.24, 31.87)
params = tbd.grid_params(bounds,accuracy = 500)
#Aggregate OD
od_gdf = tbd.odagg_grid(move_data, params, col=['slon', 'slat', 'elon', 'elat'])
od_gdf.head(5)

	SLONCOL	SLATCOL	ELONCOL	ELATCOL	count	SHBLON	SHBLAT	EHBLON	EHBLAT	geometry
0	26	95	26	96	1	120.986782	31.097177	120.986782	31.101674	LINESTRING (120.98678 31.09718, 120.98678 31.1...
40803	117	129	116	127	1	121.465519	31.250062	121.460258	31.241069	LINESTRING (121.46552 31.25006, 121.46026 31.2...
40807	117	129	117	128	1	121.465519	31.250062	121.465519	31.245565	LINESTRING (121.46552 31.25006, 121.46552 31.2...
40810	117	129	117	131	1	121.465519	31.250062	121.465519	31.259055	LINESTRING (121.46552 31.25006, 121.46552 31.2...
40811	117	129	118	126	1	121.465519	31.250062	121.470780	31.236572	LINESTRING (121.46552 31.25006, 121.47078 31.2...

The results of OD aggregation are visualized on the map with TBD plot_ Map load the map basemap and use TBD Plot scale add scale bar and north arrow

#Plot Frame 
import matplotlib.pyplot as plt
import plot_map
fig =plt.figure(1,(8,8),dpi=300)
ax =plt.subplot(111)
plt.sca(ax)
#Add map basemap
tbd.plot_map(plt,bounds,zoom = 11,style = 8)
#Draw colorbar
cax = plt.axes([0.05, 0.33, 0.02, 0.3])
plt.title('Data count')
plt.sca(ax)
#Draw OD
od_gdf.plot(ax = ax,column = 'count',cmap = 'Blues_r',linewidth = 0.5,vmax = 10,cax = cax,legend = True)
#Add scale bar and north arrow
tbd.plotscale(ax,bounds = bounds,textsize = 10,compasssize = 1,textcolor = 'white',accuracy = 2000,rect = [0.06,0.03],zorder = 10)
plt.axis('off')
plt.xlim(bounds[0],bounds[2])
plt.ylim(bounds[1],bounds[3])
plt.show()

Extract node information 

Use the igraph package to build the network. In Python, the function of igraph is similar to that of networkx. They both provide the function of network analysis, but only differ in the support of some algorithms. When building a network, we need to provide information about the nodes and edges of the network to igraph. Take each grid appearing in OD data as a node. When building node information, you need to create a numeric number starting from 0 for the node. The code is as follows

#Change the longitude and latitude grid number of the start and end points into a field
od_gdf['S'] = od_gdf['SLONCOL'].astype(str) + ',' + od_gdf['SLATCOL'].astype(str)
od_gdf['E'] = od_gdf['ELONCOL'].astype(str) + ',' + od_gdf['ELATCOL'].astype(str)
#Extract node set
node = set(od_gdf['S'])|set(od_gdf['E'])
#Turn node collection into DataFrame
node = pd.DataFrame(node)
#Renumber nodes
node['id'] = range(len(node))
node

	0	id
0	118,134	0
1	109,102	1
2	59,71	2
3	93,78	3
4	96,17	4
...	...	...
9806	94,97	9806
9807	106,152	9807
9808	124,134	9808
9809	98,158	9809
9810	152,86	9810

9811 rows × 2 columns

Extract edge information 

Connect the new number to the OD information table to extract the travel volume between the new ID S

#Connect the new number to the OD data
node.columns = ['S','S_id']
od_gdf = pd.merge(od_gdf,node,on = ['S'])
node.columns = ['E','E_id']
od_gdf = pd.merge(od_gdf,node,on = ['E'])
#Extract edge information
edge = od_gdf[['S_id','E_id','count']]
edge

	S_id	E_id	count
0	8261	7105	1
1	9513	2509	1
2	118	2509	3
3	348	2509	1
4	1684	2509	1
...	...	...	...
68468	8024	4490	2
68469	4216	3802	2
68470	4786	6654	2
68471	6484	602	3
68472	7867	8270	3

68473 rows × 3 columns

Build network 

Import the igraph package, create the network, add nodes, and input the edge data into the network. At the same time, add corresponding weights for each edge

import igraph
#Create network
g = igraph.Graph()
#Add nodes to the network.
g.add_vertices(len(node))
#Add edges to the network.
g.add_edges(edge[['S_id','E_id']].values)
#Extract the weights of the edges.
edge_weights = edge[['count']].values
#Add weights to edges.
for i in range(len(edge_weights)):
    g.es[i]['weight'] = edge_weights[i]

Community discovery 

The community discovery algorithm is applied to the constructed network. Among them, we use the g.community that comes with the igraph package_ The multi-level method implements the Fast unfolding community discovery algorithm. As we introduced earlier, the Fast unfolding algorithm combines communities iteratively layer by layer until the modularity is optimal_ Return can be set in the multilevel method_ Levels returns the intermediate result of the iteration. Here we set return_levels is False, and only the final result is returned for analysis

#Community discovery
g_clustered = g.community_multilevel(weights = edge_weights, return_levels=False)

The results of community discovery are stored in G_ In the clustered variable, you can use the built-in method to directly calculate the modularity

#Modularity
g_clustered.modularity

0.8496561130926571

Generally speaking, the modularity above 0.5 already belongs to a high value. The modularity of this result reaches 0.84, indicating that the community structure of the network is very obvious, and the community division result can also divide the network well. Next, we assign the community division results to the node information table to prepare for the following visualization. The code is as follows

#Assign the result to the node
node['group'] = g_clustered.membership
#Heavy life list
node.columns = ['grid','node_id','group']
node

	grid	node_id	group
0	118,134	0	0
1	109,102	1	1
2	59,71	2	2
3	93,78	3	3
4	96,17	4	4
...	...	...	...
9806	94,97	9806	8
9807	106,152	9807	36
9808	124,134	9808	37
9809	98,158	9809	9
9810	152,86	9810	26

9811 rows × 3 columns

Community visualization 

In the results of community discovery, there may be only a small number of nodes in some communities, which can not form a large-scale community. These communities are outliers and should be deleted before visualization. Here we keep communities with more than 10 grids

#Count the number of grids per community
group = node['group'].value_counts()
#Extract communities larger than 10 grids
group = group[group>10]
#Keep only the grids of these communities
node = node[node['group'].apply(lambda r:r in group.index)]

Restore the grid number and use TBD gridid_ to_ The polygon method generates the geographic geometry of the grid from the grid number

#Get grid number by segmentation
node['LONCOL'] = node['grid'].apply(lambda r:r.split(',')[0]).astype(int)
node['LATCOL'] = node['grid'].apply(lambda r:r.split(',')[1]).astype(int)
#Generate Grid Geographic graphics
node['geometry'] = tbd.gridid_to_polygon(node['LONCOL'],node['LATCOL'],params)
#Convert to GeoDataFrame
import geopandas as gpd
node = gpd.GeoDataFrame(node)
node

	grid	node_id	group	LONCOL	LATCOL	geometry
0	118,134	0	0	118	134	POLYGON ((121.46815 31.27030, 121.47341 31.270...
1	109,102	1	1	109	102	POLYGON ((121.42080 31.12641, 121.42606 31.126...
3	93,78	3	3	93	78	POLYGON ((121.33663 31.01849, 121.34189 31.018...
4	96,17	4	4	96	17	POLYGON ((121.35241 30.74419, 121.35767 30.744...
5	156,117	5	5	156	117	POLYGON ((121.66806 31.19385, 121.67332 31.193...
...	...	...	...	...	...	...
9806	94,97	9806	8	94	97	POLYGON ((121.34189 31.10392, 121.34715 31.103...
9807	106,152	9807	36	106	152	POLYGON ((121.40502 31.35124, 121.41028 31.351...
9808	124,134	9808	37	124	134	POLYGON ((121.49971 31.27030, 121.50498 31.270...
9809	98,158	9809	9	98	158	POLYGON ((121.36293 31.37822, 121.36819 31.378...
9810	152,86	9810	26	152	86	POLYGON ((121.64702 31.05446, 121.65228 31.054...

8527 rows × 6 columns

In this step, we restore each node to a grid, mark the community number to which the node belongs, generate the geographic information of each grid, and convert it into a GeoDataFrame. We can draw the grid with the following code to test whether the generation is successful

node.plot('group')

Here, we map the group number of the group field to the color for preliminary visualization. The colors of different groups are different. It can be seen from the graph of the result that most grids of the same color are clustered together in geospatial space, indicating that the spatial connection of sharing bicycles can closely connect the close areas in geospatial space

The visualization effect of the previous results is not obvious. We can't clearly see the partition from the figure. Next, we can adjust and visualize the partition results. The adjustment of visualization mainly includes the following ideas:

A more appropriate zoning result should be that each region is spatially continuous. In the preliminary visualization results, many grids exist in isolation in space, and these points should be eliminated.
In the visualization results, we can merge the grids of the same group to form face features for each partition, so that the boundary of the partition can be drawn in the next visualization.
In the zoning results, there may be "enclaves" of other regions in some regions, that is, they belong to this partition but are surrounded by other partitions. They can only "fly" through the possessions of other partitions to reach their own enclaves. This kind of zoning is also difficult to manage in the actual operation of sharing bicycles, which should be avoided.

To solve the above problems, we can use two GIS processing methods provided by TransBigData, TBD merge_ Polygon and TBD polyon_ exterior. TBD merge_ Polygon can merge face features of the same group, while TBD polyon_ Exterior can take the outer boundary of the polygon and then form a new polygon to eliminate the enclave. At the same time, you can also set the minimum area to eliminate face features smaller than this area. The code is as follows

#Combine the face features of the same group by grouping the group field
node_community = tbd.merge_polygon(node,'group')
#Input polygon GeoDataFrame data and take the outer boundary of the polygon to form a new polygon
#Set the minimum area minarea to eliminate all faces smaller than this area to avoid a large number of outliers
node_community = tbd.polyon_exterior(node_community,minarea = 0.000100)

After processing the face features of the community, you need to visualize the face features. We want to give different colors to different faces. In the visualization chapter, we mentioned that when there is no numerical difference between the displayed elements, the color selection needs to keep their respective colors with the same brightness and saturation. seaborn's palette method can quickly generate multiple colors under the same brightness and saturation

#Generate palette
import seaborn as sns
## l: Brightness
## s: Saturation
cmap = sns.hls_palette(n_colors=len(node_community), l=.7, s=0.8)
sns.palplot(cmap)

Visualize community results

#Plot Frame 
import matplotlib.pyplot as plt
import plot_map
fig =plt.figure(1,(8,8),dpi=300)
ax =plt.subplot(111)
plt.sca(ax)
#Add map basemap
tbd.plot_map(plt,bounds,zoom = 10,style = 6)
#Set colormap
from matplotlib.colors import ListedColormap
#Disrupt the order of the community
node_community = node_community.sample(frac=1)
#Draw community
node_community.plot(cmap = ListedColormap(cmap),ax = ax,edgecolor = '#333',alpha = 0.8)
#Add scale bar and north arrow
tbd.plotscale(ax,bounds = bounds,textsize = 10,compasssize = 1,textcolor = 'k'
              ,accuracy = 2000,rect = [0.06,0.03],zorder = 10)
plt.axis('off')
plt.xlim(bounds[0],bounds[2])
plt.ylim(bounds[1],bounds[3])
plt.show()

So far, we have successfully visualized the shared bicycle community and mapped the boundary of each community. When the community discovery model is used for zoning, no geospatial information is input into the model, and the segmentation of the study area only depends on the network connection composed of shared single vehicle travel demand.

Keywords: Python Back-end

Added by La Parka on Sat, 11 Dec 2021 03:39:55 +0200

Programming VIP