Creating a dataset class for very large datasets
Previously, we only dealt with datasets small enough to fit entirely in memory: the corresponding dataset classes load all samples into memory when the object is created. If the dataset is very large, however, there may not be enough memory to hold all the data at once. We therefore need a dataset class that loads samples into memory on demand.
The Dataset base class
In PyG, we define a dataset class that loads samples into memory on demand by inheriting from the torch_geometric.data.Dataset base class. All the methods required by the torch_geometric.data.InMemoryDataset base class (raw_file_names(), processed_file_names(), download() and process()) still need to be implemented. In addition, the following methods need to be implemented (a minimal sketch of such a class follows the list):
- len(): returns the number of samples in the dataset
- get(): implements the loading of a single graph. Internally, __getitem__() returns the Data object obtained by calling get() and optionally transforms it according to the transform parameter.
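The following is a minimal sketch of such a class, adapted from the pattern in the PyG documentation; the raw/processed file names and the body of process() are placeholders:

```python
import os.path as osp

import torch
from torch_geometric.data import Dataset


class MyOwnDataset(Dataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyOwnDataset, self).__init__(root, transform, pre_transform)

    @property
    def raw_file_names(self):
        return ['some_file_1', 'some_file_2']  # placeholder names

    @property
    def processed_file_names(self):
        return ['data_0.pt', 'data_1.pt']  # one file per sample

    def download(self):
        # Download raw data into self.raw_dir.
        pass

    def process(self):
        # Save each sample as its own file so it can be loaded on demand.
        for idx, raw_path in enumerate(self.raw_paths):
            data = ...  # read a single graph from raw_path
            torch.save(data, osp.join(self.processed_dir, f'data_{idx}.pt'))

    def len(self):
        return len(self.processed_file_names)

    def get(self, idx):
        # Load exactly one sample from disk.
        return torch.load(osp.join(self.processed_dir, f'data_{idx}.pt'))
```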
Instead of defining a Dataset class, we can also pass a list of Data objects directly to a DataLoader for training:
```python
from torch_geometric.data import Data, DataLoader

data_list = [Data(...), ..., Data(...)]
loader = DataLoader(data_list, batch_size=32)
```
We can also form a batch from a list of Data objects in the following way:
```python
from torch_geometric.data import Data, Batch

data_list = [Data(...), ..., Data(...)]
# Note: Batch.from_data_list() takes no batch_size argument;
# it merges the whole list into a single batch.
batch = Batch.from_data_list(data_list)
```
Batching graph samples with the Batch and DataLoader classes
Merging small graphs into one large graph
A graph can have any number of nodes and edges; it is not a regular data structure. Batching graph data therefore works differently from batching image or sequence data. PyTorch Geometric batches multiple graphs by merging the small graphs, as mutually disconnected components, into one large graph, so the adjacency matrices of the small graphs lie on the diagonal of the large graph's adjacency matrix. The adjacency matrix, attribute matrix and prediction target matrix of the large graph are, respectively:

$$
\mathbf{A} = \begin{bmatrix} \mathbf{A}_1 & & \\ & \ddots & \\ & & \mathbf{A}_n \end{bmatrix}, \qquad
\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \vdots \\ \mathbf{X}_n \end{bmatrix}, \qquad
\mathbf{Y} = \begin{bmatrix} \mathbf{Y}_1 \\ \vdots \\ \mathbf{Y}_n \end{bmatrix}
$$
This approach has the following key advantages:
- GNN operators that rely on the message passing scheme do not need to be modified, since no messages are exchanged between nodes belonging to different graphs.
- There is no additional computational or memory overhead.
Incrementing and concatenating small-graph attributes
When a small graph is stored into the large graph, its attributes have to be adjusted; the most prominent example is incrementing its node indices. In the most general form, the DataLoader class of PyTorch Geometric automatically increments the edge_index tensor by the cumulative number of nodes of the graphs batched before the current one. PyTorch Geometric allows us to override the torch_geometric.data.Data.__inc__() and torch_geometric.data.Data.__cat_dim__() functions to achieve the desired behavior.
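As a quick illustration (a small sketch, not part of the original tutorial), batching two triangle graphs of three nodes each increments the edge indices of the second graph by 3:

```python
import torch
from torch_geometric.data import Data, DataLoader

# Two identical triangle graphs, 3 nodes each.
edge_index = torch.tensor([[0, 1, 2],
                           [1, 2, 0]])
data_list = [Data(edge_index=edge_index, num_nodes=3) for _ in range(2)]

batch = next(iter(DataLoader(data_list, batch_size=2)))
print(batch.edge_index)
# tensor([[0, 1, 2, 3, 4, 5],
#         [1, 2, 0, 4, 5, 3]])
print(batch.batch)  # node-to-graph assignment vector
# tensor([0, 0, 0, 1, 1, 1])
```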
Graph matching
If you want to store multiple graphs in a single Data object, e.g. for applications such as graph matching, you have to ensure that all of these graphs are batched correctly. For example, consider storing two graphs, a source graph Gs and a target graph Gt, in one Data class:
```python
class PairData(Data):
    def __init__(self, edge_index_s, x_s, edge_index_t, x_t):
        super(PairData, self).__init__()
        self.edge_index_s = edge_index_s
        self.x_s = x_s
        self.edge_index_t = edge_index_t
        self.x_t = x_t

    # Increment each edge_index by the node count of its own graph;
    # without this override the example below would not produce the
    # shown output.
    def __inc__(self, key, value):
        if key == 'edge_index_s':
            return self.x_s.size(0)
        if key == 'edge_index_t':
            return self.x_t.size(0)
        return super(PairData, self).__inc__(key, value)
```
In this case, edge_index_s should be incremented by the number of nodes in the source graph Gs, i.e. x_s.size(0), while edge_index_t should be incremented by the number of nodes in the target graph Gt, i.e. x_t.size(0); this is exactly what the __inc__() override above implements.
Let's look at node index incrementing through an example:
```python
import torch
from torch_geometric.data import DataLoader

edge_index_s = torch.tensor([
    [0, 0, 0, 0],
    [1, 2, 3, 4],
])
x_s = torch.randn(5, 16)  # 5 nodes.
edge_index_t = torch.tensor([
    [0, 0, 0],
    [1, 2, 3],
])
x_t = torch.randn(4, 16)  # 4 nodes.

data = PairData(edge_index_s, x_s, edge_index_t, x_t)
data_list = [data, data]
loader = DataLoader(data_list, batch_size=2)
batch = next(iter(loader))

print(batch)
>>> Batch(edge_index_s=[2, 8], x_s=[10, 16], edge_index_t=[2, 6], x_t=[8, 16])

print(batch.edge_index_s)
>>> tensor([[0, 0, 0, 0, 5, 5, 5, 5],
            [1, 2, 3, 4, 6, 7, 8, 9]])

print(batch.edge_index_t)
>>> tensor([[0, 0, 0, 4, 4, 4],
            [1, 2, 3, 5, 6, 7]])
```
We can use the follow_batch parameter of DataLoader to maintain the batch assignment vectors of particular attributes.
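For example (following the pairs-of-graphs example in the PyG documentation; the exact repr formatting may differ across versions), x_s_batch and x_t_batch map every node to the graph it belongs to within the batch:

```python
loader = DataLoader(data_list, batch_size=2, follow_batch=['x_s', 'x_t'])
batch = next(iter(loader))

print(batch)
>>> Batch(edge_index_s=[2, 8], x_s=[10, 16], x_s_batch=[10],
          edge_index_t=[2, 6], x_t=[8, 16], x_t_batch=[8])

print(batch.x_s_batch)
>>> tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
```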
Bipartite graph
The adjacency matrix of a bipartite graph defines the connections between two types of nodes. The numbers of nodes of the two types need not be equal, so the source and target node indices of an edge have to be incremented by different amounts. We need to tell PyTorch Geometric to perform the increment operations on the source and target node indices of edge_index independently:
```python
# Defined on a Data subclass that holds x_s, x_t and a bipartite edge_index.
def __inc__(self, key, value):
    if key == 'edge_index':
        return torch.tensor([[self.x_s.size(0)], [self.x_t.size(0)]])
    else:
        return super().__inc__(key, value)
```
Here, edge_index[0] (the source nodes of the edges) is incremented by x_s.size(0), while edge_index[1] (the target nodes of the edges) is incremented by x_t.size(0).
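A complete sketch, following the bipartite-graph example in the PyG documentation (the BipartiteData class name and the sample tensors are illustrative):

```python
import torch
from torch_geometric.data import Data, DataLoader


class BipartiteData(Data):
    def __init__(self, edge_index, x_s, x_t):
        super(BipartiteData, self).__init__()
        self.edge_index = edge_index
        self.x_s = x_s
        self.x_t = x_t

    def __inc__(self, key, value):
        if key == 'edge_index':
            # Increment source rows by the source-node count and
            # target rows by the target-node count.
            return torch.tensor([[self.x_s.size(0)], [self.x_t.size(0)]])
        else:
            return super(BipartiteData, self).__inc__(key, value)


edge_index = torch.tensor([
    [0, 0, 1, 1],
    [0, 1, 1, 2],
])
x_s = torch.randn(2, 16)  # 2 source nodes.
x_t = torch.randn(3, 16)  # 3 target nodes.

data = BipartiteData(edge_index, x_s, x_t)
batch = next(iter(DataLoader([data, data], batch_size=2)))

print(batch.edge_index)
>>> tensor([[0, 0, 1, 1, 2, 2, 3, 3],
            [0, 1, 1, 2, 3, 4, 4, 5]])
```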
Concatenating along a new dimension
Sometimes, attributes of Data objects should be concatenated along a new dimension (as in classical mini-batching), e.g. graph-level attributes or prediction targets. Specifically, a list of attributes of shape [num_features] should be batched into a tensor of shape [num_examples, num_features] rather than [num_examples * num_features]. PyTorch Geometric achieves this by letting __cat_dim__() return a concatenation dimension of None:
```python
class MyData(Data):
    def __cat_dim__(self, key, item):
        if key == 'foo':
            return None
        else:
            return super().__cat_dim__(key, item)
```
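A brief usage sketch (following the example in the PyG documentation; the attribute name foo is arbitrary):

```python
import torch
from torch_geometric.data import DataLoader

edge_index = torch.tensor([
    [0, 1, 1, 2],
    [1, 0, 2, 1],
])
foo = torch.randn(16)  # a graph-level attribute of shape [num_features]

data = MyData(edge_index=edge_index, foo=foo)
batch = next(iter(DataLoader([data, data], batch_size=2)))

print(batch)
>>> Batch(edge_index=[2, 8], foo=[2, 16])
```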
Graph prediction task practice
Notes on running:
(1) About 128 GB of virtual memory is required.
(2) With the parameters given in the tutorial (49 epochs, num_workers=16), each epoch takes about 3-4 minutes, so the whole run takes at least 5 hours.
(3) After the run starts, the program creates a folder in the saves directory, named by the task_name parameter, to record the run. If a folder with the same name already exists in the saves directory, the program appends a suffix to task_name and uses that as the folder name. During the run, all print output is written to the output file inside this folder, and the information recorded by the TensorBoard SummaryWriter is stored in files under this folder as well.