Creating a dataset class for very large datasets
Previously, we only dealt with datasets small enough to fit entirely in memory: the corresponding dataset classes load all samples into memory when the object is created. If the dataset is very large, however, there may not be enough memory to hold all the data at once. We therefore need a dataset class that loads samples into memory on demand.
The Dataset base class
In PyG, we define a dataset class that loads samples into memory on demand by inheriting from the torch_geometric.data.Dataset base class. All the methods required by the torch_geometric.data.InMemoryDataset base class (raw_file_names(), processed_file_names(), download() and process()) still need to be implemented. In addition, the following methods need to be implemented (a minimal sketch of such a class follows the list):
- len(): returns the number of samples in the dataset
- get(): implements the loading of a single graph. Internally, __getitem__() returns the Data object obtained by calling get() and optionally transforms it according to the transform parameter.
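The following is a minimal sketch of such a class, adapted from the pattern in the PyG documentation; the raw/processed file names and the body of process() are placeholders:

```python
import os.path as osp

import torch
from torch_geometric.data import Dataset


class MyOwnDataset(Dataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyOwnDataset, self).__init__(root, transform, pre_transform)

    @property
    def raw_file_names(self):
        return ['some_file_1', 'some_file_2']  # placeholder names

    @property
    def processed_file_names(self):
        return ['data_0.pt', 'data_1.pt']  # one file per sample

    def download(self):
        # Download raw data into self.raw_dir.
        pass

    def process(self):
        # Save each sample as its own file so it can be loaded on demand.
        for idx, raw_path in enumerate(self.raw_paths):
            data = ...  # read a single graph from raw_path
            torch.save(data, osp.join(self.processed_dir, f'data_{idx}.pt'))

    def len(self):
        return len(self.processed_file_names)

    def get(self, idx):
        # Load exactly one sample from disk.
        return torch.load(osp.join(self.processed_dir, f'data_{idx}.pt'))
```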
Instead of defining a Dataset class, we can also pass a list of Data objects directly to a DataLoader for training:
```python
from torch_geometric.data import Data, DataLoader

data_list = [Data(...), ..., Data(...)]
loader = DataLoader(data_list, batch_size=32)
```
We can also form a batch from a list of Data objects in the following way:
```python
from torch_geometric.data import Data, Batch

data_list = [Data(...), ..., Data(...)]
# Note: Batch.from_data_list() takes no batch_size argument;
# it merges the whole list into a single batch.
batch = Batch.from_data_list(data_list)
```
Batching graph samples with the Batch and DataLoader classes
Merging small graphs into one large graph
A graph can have any number of nodes and edges; it is not a regular data structure. Batching graph data therefore works differently from batching image or sequence data. PyTorch Geometric batches multiple graphs by merging the small graphs, as mutually disconnected components, into one large graph, so the adjacency matrices of the small graphs lie on the diagonal of the large graph's adjacency matrix. The adjacency matrix, attribute matrix and prediction target matrix of the large graph are, respectively:

$$
\mathbf{A} = \begin{bmatrix} \mathbf{A}_1 & & \\ & \ddots & \\ & & \mathbf{A}_n \end{bmatrix}, \qquad
\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \vdots \\ \mathbf{X}_n \end{bmatrix}, \qquad
\mathbf{Y} = \begin{bmatrix} \mathbf{Y}_1 \\ \vdots \\ \mathbf{Y}_n \end{bmatrix}
$$
This approach has the following key advantages:
- GNN operators that rely on the message passing scheme do not need to be modified, since no messages are exchanged between nodes belonging to different graphs.
- There is no additional computational or memory overhead.
Incrementing and concatenating small-graph attributes
When a small graph is stored into the large graph, its attributes have to be adjusted; the most prominent example is incrementing its node indices. In the most general form, the DataLoader class of PyTorch Geometric automatically increments the edge_index tensor by the cumulative number of nodes of the graphs batched before the current one. PyTorch Geometric allows us to override the torch_geometric.data.Data.__inc__() and torch_geometric.data.Data.__cat_dim__() functions to achieve the desired behavior.
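As a quick illustration (a small sketch, not part of the original tutorial), batching two triangle graphs of three nodes each increments the edge indices of the second graph by 3:

```python
import torch
from torch_geometric.data import Data, DataLoader

# Two identical triangle graphs, 3 nodes each.
edge_index = torch.tensor([[0, 1, 2],
                           [1, 2, 0]])
data_list = [Data(edge_index=edge_index, num_nodes=3) for _ in range(2)]

batch = next(iter(DataLoader(data_list, batch_size=2)))
print(batch.edge_index)
# tensor([[0, 1, 2, 3, 4, 5],
#         [1, 2, 0, 4, 5, 3]])
print(batch.batch)  # node-to-graph assignment vector
# tensor([0, 0, 0, 1, 1, 1])
```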
Graph matching
If you want to store multiple graphs in a single Data object, e.g. for applications such as graph matching, you have to ensure that all of these graphs are batched correctly. For example, consider storing two graphs, a source graph Gs and a target graph Gt, in one Data class:
```python
class PairData(Data):
    def __init__(self, edge_index_s, x_s, edge_index_t, x_t):
        super(PairData, self).__init__()
        self.edge_index_s = edge_index_s
        self.x_s = x_s
        self.edge_index_t = edge_index_t
        self.x_t = x_t

    # Increment each edge_index by the node count of its own graph;
    # without this override the example below would not produce the
    # shown output.
    def __inc__(self, key, value):
        if key == 'edge_index_s':
            return self.x_s.size(0)
        if key == 'edge_index_t':
            return self.x_t.size(0)
        return super(PairData, self).__inc__(key, value)
```
In this case, edge_index_s should be incremented by the number of nodes in the source graph Gs, i.e. x_s.size(0), while edge_index_t should be incremented by the number of nodes in the target graph Gt, i.e. x_t.size(0); this is exactly what the __inc__() override above implements.
Let's look at node index incrementing through an example:
```python
import torch
from torch_geometric.data import DataLoader

edge_index_s = torch.tensor([
    [0, 0, 0, 0],
    [1, 2, 3, 4],
])
x_s = torch.randn(5, 16)  # 5 nodes.
edge_index_t = torch.tensor([
    [0, 0, 0],
    [1, 2, 3],
])
x_t = torch.randn(4, 16)  # 4 nodes.

data = PairData(edge_index_s, x_s, edge_index_t, x_t)
data_list = [data, data]
loader = DataLoader(data_list, batch_size=2)
batch = next(iter(loader))

print(batch)
>>> Batch(edge_index_s=[2, 8], x_s=[10, 16], edge_index_t=[2, 6], x_t=[8, 16])

print(batch.edge_index_s)
>>> tensor([[0, 0, 0, 0, 5, 5, 5, 5],
            [1, 2, 3, 4, 6, 7, 8, 9]])

print(batch.edge_index_t)
>>> tensor([[0, 0, 0, 4, 4, 4],
            [1, 2, 3, 5, 6, 7]])
```
We can use the follow_batch parameter of DataLoader to maintain the batch assignment vectors of particular attributes.
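For example (following the pairs-of-graphs example in the PyG documentation; the exact repr formatting may differ across versions), x_s_batch and x_t_batch map every node to the graph it belongs to within the batch:

```python
loader = DataLoader(data_list, batch_size=2, follow_batch=['x_s', 'x_t'])
batch = next(iter(loader))

print(batch)
>>> Batch(edge_index_s=[2, 8], x_s=[10, 16], x_s_batch=[10],
          edge_index_t=[2, 6], x_t=[8, 16], x_t_batch=[8])

print(batch.x_s_batch)
>>> tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
```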
Bipartite graph
The adjacency matrix of a bipartite graph defines the connections between two types of nodes. The numbers of nodes of the two types need not be equal, so the source and target node indices of an edge have to be incremented by different amounts. We need to tell PyTorch Geometric to perform the increment operations on the source and target node indices of edge_index independently:
```python
# Defined on a Data subclass that holds x_s, x_t and a bipartite edge_index.
def __inc__(self, key, value):
    if key == 'edge_index':
        return torch.tensor([[self.x_s.size(0)], [self.x_t.size(0)]])
    else:
        return super().__inc__(key, value)
```
Here, edge_index[0] (the source nodes of the edges) is incremented by x_s.size(0), while edge_index[1] (the target nodes of the edges) is incremented by x_t.size(0).
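A complete sketch, following the bipartite-graph example in the PyG documentation (the BipartiteData class name and the sample tensors are illustrative):

```python
import torch
from torch_geometric.data import Data, DataLoader


class BipartiteData(Data):
    def __init__(self, edge_index, x_s, x_t):
        super(BipartiteData, self).__init__()
        self.edge_index = edge_index
        self.x_s = x_s
        self.x_t = x_t

    def __inc__(self, key, value):
        if key == 'edge_index':
            # Increment source rows by the source-node count and
            # target rows by the target-node count.
            return torch.tensor([[self.x_s.size(0)], [self.x_t.size(0)]])
        else:
            return super(BipartiteData, self).__inc__(key, value)


edge_index = torch.tensor([
    [0, 0, 1, 1],
    [0, 1, 1, 2],
])
x_s = torch.randn(2, 16)  # 2 source nodes.
x_t = torch.randn(3, 16)  # 3 target nodes.

data = BipartiteData(edge_index, x_s, x_t)
batch = next(iter(DataLoader([data, data], batch_size=2)))

print(batch.edge_index)
>>> tensor([[0, 0, 1, 1, 2, 2, 3, 3],
            [0, 1, 1, 2, 3, 4, 4, 5]])
```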
Concatenating along a new dimension
Sometimes, attributes of Data objects should be concatenated along a new dimension (as in classical mini-batching), e.g. graph-level attributes or prediction targets. Specifically, a list of attributes of shape [num_features] should be batched into a tensor of shape [num_examples, num_features] rather than [num_examples * num_features]. PyTorch Geometric achieves this by letting __cat_dim__() return a concatenation dimension of None:
```python
class MyData(Data):
    def __cat_dim__(self, key, item):
        if key == 'foo':
            return None
        else:
            return super().__cat_dim__(key, item)
```
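A brief usage sketch (following the example in the PyG documentation; the attribute name foo is arbitrary):

```python
import torch
from torch_geometric.data import DataLoader

edge_index = torch.tensor([
    [0, 1, 1, 2],
    [1, 0, 2, 1],
])
foo = torch.randn(16)  # a graph-level attribute of shape [num_features]

data = MyData(edge_index=edge_index, foo=foo)
batch = next(iter(DataLoader([data, data], batch_size=2)))

print(batch)
>>> Batch(edge_index=[2, 8], foo=[2, 16])
```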
Graph prediction task practice
Notes on running:
(1) About 128 GB of virtual memory is required.
(2) With the parameters given in the tutorial (49 epochs, num_workers=16), each epoch takes about 3-4 minutes, so the whole run takes at least 5 hours.
(3) After the run starts, the program creates a folder in the saves directory, named by the task_name parameter, to record the run. If a folder with the same name already exists in the saves directory, the program appends a suffix to task_name and uses that as the folder name. During the run, all print output is written to the output file inside this folder, and the information recorded by the TensorBoard SummaryWriter is stored in files under this folder as well.