[PYG] summary of common and mining pits

PyG (PyTorch geometry) is a graph neural network framework based on PyTorch. It is recommended to understand the use of PyTorch before learning PyG. Otherwise, you can't understand it. You can see my experience about the use of PyTorch

Use of PyTorch 12.0 review articles

PyG includes data set processing in graph neural network training, multi GPU training, multiple classical graph neural network models, multiple commonly used graph neural network training data sets, and supports self built data sets, mainly including the following modules

torch_geometric: main module
torch_geometric.nn: build neural network layer
torch_geometric.data: representation of graph structure data
torch_geometric.loader: load dataset
torch_geometric.datasets: common graph neural network datasets
torch_geometric.transforms: data transformation
torch_geometric.utils: common tools
torch_ geometric. Graph Gym: a common graph neural network model
torch_geometric.profile: training of supervision model

1. Overall introduction

Introduce the use of PyG through some examples, and have some understanding first.

(

You can see mine

How to implement a graph convolution neural network with PyG and train it on Cora data set

Let's have a general understanding first

)

1.1 processing of drawing data

Torch for PyG_ geometric. data. Data saves the data of the graph structure. The imported data (this data refers to the specific data you imported, not the previous torch_geometric.data) will contain the following attributes in PyG

data.x: The attribute information of the graph node. For example, each user in the social network is a node. This x can represent the attribute information of the user. The dimension is [num_nodes,num_node_features]
data. edge_ Index: node connection information in coo format. The type is torch Long, the dimension is [2,num_edges] (it specifically contains two lists, and the number at the corresponding position of each list indicates that there is an edge connection between the corresponding nodes)
data.edge_attr: attribute information of edges in the graph, dimension [num_edges,num_edge_features]
data.y: For label information, the dimensions are different according to the specific task. If it is a classification task on the node, the dimension is [num_edges, number of categories]. If it is a classification task on the whole graph, the dimension is [1, number of categories]
data.pos: node location information (generally used for visualization of graph structure data)

In addition to the above properties, we can also use data Face custom attributes.

Let's see how to use PyG to represent the following figure

import torch
from torch_geometric.data import Data

# Edge connection information
# Note that the edges of an undirected graph are defined twice
edge_index = torch.tensor(
    [
        # This indicates that nodes 0 and 1 are connected because they are undirected graphs
        # Then 1 and 0 are also connected
        # Look up and down
        [0, 1, 1, 2],
        [1, 0, 2, 1],
    ],
    # Specify data type
    dtype=torch.long
)
# Attribute information of node
x = torch.tensor(
    [
        # Three nodes
        # The attribute vector dimension of each node is 1
        [-1],
        [0],
        [1],
    ]
)
# Data instantiated as a graph structure
data = Data(x=x, edge_index=edge_index)
# View graph data
print(data)
# What information does the graph data contain
print(data.keys)
# View node attribute information
print(data['x'])
# Number of nodes
print(data.num_nodes)
# Number of sides
print(data.num_edges)
# Dimension of node attribute vector
print(data.num_node_features)
# Are there isolated nodes in the diagram
print(data.has_isolated_nodes())
# Is there a ring in the figure
print(data.has_self_loops())
# Is it a directed graph
print(data.is_directed())

1.2 common graph neural network data sets

PyG contains some common graph depth learning public data sets, such as

Planetoid dataset (Cora, Citeseer, Pubmed)
Some come from http://graphkernels.cs.tu-dortmund.de Common graph neural network classification data set
QM7,QM9
3D point cloud dataset, such as FAUST, ModelNet10, etc

Next, take the ENZYMES dataset (including 600 graphs, each graph is divided into 6 categories and the classification of graph level) as an example of how to use PyG's public dataset

from torch_geometric.datasets import TUDataset

# Import dataset
dataset = TUDataset(
    # Specifies the storage location of the dataset
    # If there is no corresponding dataset at the specified location
    # PyG will download automatically
    root='../data/ENZYMES',
    # Dataset to use
    name='ENZYMES',
)
# Length of data set
print(len(dataset))
# Number of categories in the dataset
print(dataset.num_classes)
# Dimension of node attribute vector in dataset
print(dataset.num_node_features)
# 600 graphs. We can choose which graph to use according to the index
data = dataset[0]
print(data)
# Randomly disrupt data sets
dataset = dataset.shuffle()

1.3 how to load a dataset

In real graph neural network training, we usually load part of the dataset into memory to train graph neural network, which is called a batch. Then how does PyG load a batch? PyG will divide it into the batch size we specify according to our dataset

for instance

from torch_geometric.loader import DataLoader
from torch_geometric.datasets import TUDataset


# data set
dataset = TUDataset(
    root='../data/ENZYMES',
    name='ENZYMES',
    use_node_attr=True,
)
# Create dataset loader
# Load 32 data into memory at a time
loader = DataLoader(
    # Dataset to load
    dataset=dataset,
    # ENZYMES contains 600 diagrams
    # Load 32 at a time
    batch_size=32,
    # Whether to randomly disrupt the data after each addition (which can increase the generalization of the model)
    shuffle=True
)
for batch in loader:
    print(batch)
    print(batch.num_graphs)

2. Establishment of spatial map convolution neural network

Spatial graph convolution (note that the word "convolution" in graph neural network is obtained in the generalized sense of "feature extraction", which is different from the convolution kernel calculation in convolution neural network) can be regarded as a process of information transmission and fusion between adjacent nodes, and the calculation formula can be generalized as

PyG uses MessagePassing to implement the above calculation process. Next, take two classical graph neural network papers as examples to introduce the use of MessagePassing.

https://arxiv.org/abs/1609.02907arxiv.org/abs/1609.02907

Dynamic Graph CNN for Learning on Point Cloudsarxiv.org/abs/1801.07829

2.1 implementation of GCN

In the first paper, the convolution formula proposed by the author is

from abc import ABC

import torch
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops, degree


# Define GCN spatial graph convolution neural network
class GCNConv(MessagePassing, ABC):
    # Network initialization
    def __init__(self, in_channels, out_channels):
        """
        :param in_channels: Dimension of node attribute vector
        :param out_channels: After graph convolution, the feature of the node represents the dimension
        """
        # Define gamma function as summation function, aggr='add '
        super(GCNConv, self).__init__(aggr='add')
        # Define the innermost linear transformation
        # The implementation is a linear layer
        self.linear_change = torch.nn.Linear(in_channels, out_channels)

    # Define information aggregation function
    def message(self, x_j, norm):
        # Regularization
        # norm.view(-1,1) turns norm into a column vector
        # x_j is the characteristic representation matrix of nodes
        return norm.view(-1, 1) * x_j

    # Forward transfer, graph convolution
    def forward(self, x, edge_index):
        """
        :param x:The dimension of the node in the figure is[Number of nodes,Number of adjacent dimensions of node attribute]
        :param edge_index: Connection information of edges in the graph,Dimension is[2,Number of sides]
        :return:
        """
        # Add node to its own ring
        # Because the node contains itself when gathering the information of adjacent nodes at the back
        # add_self_loops will be in the edge_ In the connection information table on the index side,
        # Add information like [i,i]
        # Represents a ring from a node to itself
        # The function returns [connection information of the edge, attribute information on the edge]
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
        # Linear transformation
        x = self.linear_change(x)
        # Computational regularization
        row, col = edge_index
        # Gets the degree of the node
        deg = degree(col, x.size(0), dtype=x.dtype)
        # Regularization formula
        deg_inv_sqrt = deg.pow(-0.5)
        # Set the unknown value to 0 to avoid errors in the following calculation
        deg_inv_sqrt[deg_inv_sqrt == float('inf')] = 0
        # Regularization part
        norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]
        # Information transmission and integration
        # propagate will automatically call self Message function and pass parameters to it
        return self.propagate(edge_index, x=x, norm=norm)


# Test the graph convolution neural network we just defined
if __name__ == '__main__':
    # Instantiate a graph convolution neural network
    # It is assumed that the dimension of the node attribute vector of the graph is 16 and the dimension of the node feature representation vector accumulated by the graph volume is 32
    conv = GCNConv(16, 32)
    # Randomly generate a node attribute vector
    # 5 nodes, and the attribute vector is 16 dimensions
    x = torch.randn(5, 16)
    # Randomly generated edge connection information
    # Suppose there are 3 sides
    edge_index = [
        [0, 1, 1, 2, 1, 3],
        [1, 0, 2, 1, 3, 1]
    ]
    edge_index = torch.tensor(edge_index, dtype=torch.long)
    # Perform graph convolution
    output = conv(x, edge_index)
    # Output the characteristic representation matrix after convolution
    print(output.data)

2.2 implementation of edge revolution

In the second paper, the convolution formula proposed by the author is

import torch
from torch.nn import Sequential as Seq
from torch.nn import Linear, ReLU
from torch_geometric.nn import MessagePassing


# Define EdgeConv graph convolution neural network
class EdgeConv(MessagePassing):
    # Initialization graph convolution neural network
    def __init__(self, in_channels, out_channels):
        # The gamma function is defined as the maximum function
        super().__init__(aggr='max')
        # A feedforward neural network is defined
        self.mlp = Seq(
            # In the linear layer, the input after the information aggregation function is 2*in_channels
            Linear(2 * in_channels, out_channels),
            # Activation function
            ReLU(),
            # Output layer
            Linear(out_channels, out_channels)
        )

    # Define information aggregation function
    def message(self, x_i, x_j):
        tmp = torch.cat([x_i, x_j - x_i], dim=1)
        # The dimension of tmp after cat is [number of sides, 2*in_channels]
        return self.mlp(tmp)

    # Forward transfer, graph convolution
    def forward(self, x, edge_index):
        # x is the node attribute vector matrix
        # edge_index is the connection information of the edge
        # Carry out information transmission and fusion
        return self.propagate(edge_index, x=x)

3. Self mapping neural network data set

PyG divides the self built dataset into two folders - raw_dir,processed_dir. row_dir is the original data set, processed_dir is the data set after PyG processing

There are three filtering methods for data set PyG -- transform and pre_transform,pre_filter.

Transform: read data and transform it
pre_transform: transform the entire data set, and then store the transformed data. pre_filter is the same

PyG divides data sets into two types

torch_geometric.data.InMemoryDataset: a data set that can be completely put into memory
torch_geometric.data.Dataset: cannot be completely put into memory

3.1 create a graph dataset that can be completely put into memory

Do four things:

Implement torch_geometric.data.InMemoryDataset.raw_file_names(): Tell PyG where to put the dataset
Implement torch_geometric.data.InMemoryDataset.processed_file_names(): Tell PyG where to put the data set after processing
Implement torch_geometric.data.InMemoryDataset.download(): Tell PyG where to get the data set
Implement torch_geometric.data.InMemoryDataset.process(): Tell PyG how to process your dataset

A common template is like this

import torch
from torch_geometric.data import InMemoryDataset, download_url


# General template for implementing In Memory Dataset
class MyDataset(InMemoryDataset):
    # initialization
    def __init__(self, root, transfrom=None, pre_transform=None):
        # Root is the root directory of the dataset
        super(MyDataset, self).__init__(root, transfrom, pre_transform)
        # Load dataset
        self.data, self.slices = torch.load(self.processed_paths[0])

    def raw_file_names(self) -> Union[str, List[str], Tuple]:
        return ['file_1', 'file_2', ...]

    def processed_file_names(self) -> Union[str, List[str], Tuple]:
        return ['data.pt']

    def download(self):
        # Download dataset to raw_dir folder
        download_url(url, self.raw_dir)

    def process(self):
        data_list = [...]
        # Data filtering
        if self.pre_filter is not None:
            data_list = [data for data in data_list if self.pre_filter(data)]
        if self.pre_transform is not None:
            data_list = [self.pre_transform(data) for data in data_list]
        # self.collate combines all data to speed up storage
        # Data is the combined data
        # slices is a segmentation method that tells PyG how to restore data to the original data
        data, slices = self.collate(data_list)
        # Save data
        torch.save((data, slices), self.processed_paths[0])

3.2 creating data sets that cannot be completely put into memory

This is similar to the Dataset in PyTorch. On the basis of the above things, you need to

Implement torch_geometric.data.Dataset.len(): Tell PyG how big the dataset is
Implement torch_geometric.data.Dataset.get(): tells PyG how to get a data from the dataset

The general template is

import os.path as osp
import torch
from torch_geometric.data import Dataset, download_url


class MyDataset(Dataset):
    # initialization
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyDataset, self).__init__(root, transform, pre_transform)

    def raw_file_names(self) -> Union[str, List[str], Tuple]:
        return ['file_1', 'file_2', ...]

    def processed_file_names(self) -> Union[str, List[str], Tuple]:
        return ['data_1.pt', ...]

    def download(self):
        path = download_url(url, self.raw_dir)

    def process(self):
        i = 0
        for raw_path in self.raw_paths:
            # Read data
            data = Data(...)
            # Filter dataset
            if self.pre_filter is not None and not self.pre_filter(data):
                pass
            if self.pre_transform is not None:
                data = self.pre_transform(data)
            # Save data
            torch.save(data, osp.join(self.processed_dir, 'data_{}.pt'.format(i)))
            i += 1

    def len(self):
        return len(self.processed_file_names)

    def get(self,idx):
        data = torch.load(osp.join(self.processed_dir, 'data_{}.pt'.format(idx)))
        return data

4. Batch processing

It comes from the idea of batch processing in traditional deep learning --- batch data, then combine each batch of data into a group, and then train them in groups. The amount of data in each group is called batch_size. PyG divides the graph data set into multiple groups for training

PyG will automatically help us set the graph dataset according to the batch defined by us_ Size segmentation, and then merge the data in each batch.

If we want to control how PyG combines data in a batch, we need to rewrite the torch ourselves_ geometric. data. Data.__ inc__ ()

Give two concrete examples

Suppose that each data in our dataset (note that each data) contains two graphs, and each data is like this

For this kind of data set, how to control PyG to merge multiple data into one batch_ For example, batch_size=2 means that every two data in the data set form a group to form a graph. The data in each batch is like this

from typing import Any

import torch
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader


# Define graph data
class PairData(Data):
    def __init__(self, edge_index_s=None, x_s=None, edge_index_t=None, x_t=None):
        # Each data contains two graphs s,t
        """
        :param edge_index_s: chart s Connection relationship of
        :param x_s: chart s Node attribute matrix
        :param edge_index_t: chart t Connection relationship of
        :param x_t: chart t Node attribute matrix
        """
        super(PairData, self).__init__()
        self.edge_index_s = edge_index_s
        self.x_s = x_s
        self.edge_index_t = edge_index_t
        self.x_t = x_t

    def __inc__(self, key: str, value: Any, *args, **kwargs) -> Any:
        # If you want to merge figure s
        # Then tell PyG the number of nodes of graph s
        if key == 'edge_index_s':
            return self.x_s.size(0)
        # If you are merging a graph t
        # Then tell PyG the number of nodes of graph t
        if key == 'edge_index_t':
            return self.x_t.size(0)
        # Default in other cases
        else:
            return super().__inc__(key, value, *args, **kwargs)


# Let's verify the merge method we defined above
# Definition diagram s
edge_index_s = torch.tensor([
    [0, 0, 0, 0],
    [1, 2, 3, 4],
])
x_s = torch.randn(5, 16)
# Definition diagram t
edge_index_t = torch.tensor([
    [0, 0, 0],
    [1, 2, 3],
])
x_t = torch.randn(4, 16)  # 4 nodes.
# Verify that the simple definition dataset contains two data sets
data = PairData(edge_index_s, x_s, edge_index_t, x_t)
data_list = [data, data]
# batch_size=2
# follow_batch description node information
loader = DataLoader(data_list, batch_size=2, follow_batch=['x_s', 'x_t'])
# Verify whether PyG effectively combines the data of a batch in the way we define
batch = next(iter(loader))
# View data merged into one batch
print(batch)
# View s in batch (this is the combination of s in two original data as one)
print(batch.edge_index_s)
# View t in batch
print(batch.edge_index_t)

Take another example of a bipartite graph. Suppose that each data in our dataset is a bipartite graph, like this

Or batch_size=2, we want to control PyG so that the data becomes

import torch
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader


# Define bipartite graph structure
class BipartiteData(Data):
    def __init__(self, edge_index=None, x_s=None, x_t=None):
        super().__init__()
        # Contains a set of edges
        # Two groups of nodes
        self.edge_index = edge_index
        self.x_s = x_s
        self.x_t = x_t

    # Define how each batch is merged
    def __inc__(self, key, value, *args, **kwargs):
        # If you want to merge the edge connection information of two graphs
        if key == 'edge_index':
            # The left side (the first row of edge connection information) is merged according to the number of nodes in the first group
            # The right side (the second row of edge connection information) is merged according to the number of nodes in the second group
            return torch.tensor([[self.x_s.size(0)], [self.x_t.size(0)]])
        else:
            return super().__inc__(key, value, *args, **kwargs)


edge_index = torch.tensor([
    [0, 0, 1, 1],
    [0, 1, 1, 2],
])
x_s = torch.randn(2, 16)
x_t = torch.randn(3, 16)
data = BipartiteData(edge_index, x_s, x_t)
data_list = [data, data]
loader = DataLoader(data_list, batch_size=2)
batch = next(iter(loader))
print(batch)
print(batch.edge_index)

5. Establishment of heterogeneous map

The graph discussed earlier can be classified as a simple graph - containing only one type of node and one type of edge.

However, in reality, we need to deal with many types of nodes and many types of edges between these nodes, which requires the concept of heterogeneous graph. In heterogeneous graph, different types of edges describe different relationships between different types of nodes. The task of heterogeneous graph neural network is to learn the characteristic representation of nodes or the whole heterogeneous graph on this graph structure. Heterogeneity map is accurately defined as follows:

Next, a movie scoring dataset MovieLens is used to illustrate how to construct a heterogeneous graph.

MovieLens contains the ratings of 600 users for movies. We use this data set to build a bipartite graph, including two types of nodes: movies and users, and one type of edge (containing multiple types of nodes, so it can be regarded as a heterogeneous graph)

Movies in MovieLens The CSV file describes the information of the movie, including the unique ID of the movie in the dataset, the movie name, and the type of movie

ratings.csv contains user ratings for movies

Next, the bipartite graph data set is established according to the two csv

import os.path as osp

import torch
import pandas as pd
from sentence_transformers import SentenceTransformer

from torch_geometric.data import HeteroData, download_url, extract_zip
from torch_geometric.transforms import ToUndirected, RandomLinkSplit

# Data set download address
url = 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
# Data set storage path
root = osp.join(osp.dirname(osp.realpath(__file__)), '../data/MovieLens')
# Download and unzip the dataset
extract_zip(download_url(url, root), root)
# Get movies csv，ratings.csv file
movie_path = osp.join(root, 'ml-latest-small', 'movies.csv')
rating_path = osp.join(root, 'ml-latest-small', 'ratings.csv')
# Viewing datasets with pandas
print(pd.read_csv(movie_path).head())
print(pd.read_csv(rating_path).head())


# Put the movie name in that column
# Using the embedding model, each movie name is represented by a vector (Embedding)
class SequenceEncoder(object):
    # initialization
    # Specify the embedded model we use
    # And equipment used
    def __init__(self, model_name='all-MiniLM-L6-v2', device=None):
        # Equipment used
        self.device = device
        # Embedded model name used
        self.model = SentenceTransformer(model_name, device=device)

    # The embedded model does not participate in the training of subsequent graph neural networks
    @torch.no_grad()
    def __call__(self, df):
        x = self.model.encode(
            # Value to embed
            df.values,
            # Display processing progress
            show_progress_bar=True,
            # Tensor converted to PyTorch
            convert_to_tensor=True,
            # Equipment used
            device=self.device
        )
        return x.cpu()


# Embed the movie type column
class GenresEncoder(object):

    # Separator
    def __init__(self, sep='|'):
        self.sep = sep

    def __call__(self, df):
        # Split all movie types
        # The logic of the last two for is:
        # for col in df.values takes the value of each row
        # for g in col.split(self.sep) splits the extracted value with the specified separator
        # set(g) converts the split result into a set to remove duplication
        genres = set(g for col in df.values for g in col.split(self.sep))
        # Number the movie type
        mapping = {genre: i for i, genre in enumerate(genres)}
        # The type of film is expressed in the form of multi hot
        x = torch.zeros(len(df), len(mapping))
        for i, col in enumerate(df.values):
            for genre in col.split(self.sep):
                x[i, mapping[genre]] = 1
        return x


# Read the information from the CSV file and create the node information in the bipartite graph
def load_node_csv(path, index_col, encoders=None, **kwargs):
    """
    :param path: CSV File path
    :param index_col: The index column in the file, that is, the column where the node is located
    :param encoders:Node embedder
    :param kwargs:
    :return:
    """
    df = pd.read_csv(path, index_col=index_col, **kwargs)
    # The index is represented by a number
    mapping = {index: i for i, index in enumerate(df.index.unique())}
    # Node attribute vector matrix
    x = None
    # If the embedder is not empty
    if encoders is not None:
        # Embed the corresponding columns
        # Get embedded vector representation
        xs = [encoder(df[col]) for col, encoder in encoders.items()]
        x = torch.cat(xs, dim=-1)

    return x, mapping


# Get node information
# Process movies CSV table to convert the 'movie name' and 'movie type' columns into the representation of embedded vectors
movie_x, movie_mapping = load_node_csv(
    movie_path, index_col='movieId', encoders={
        # Movie list embedder
        'title': SequenceEncoder(),
        # Embedder for movie type columns
        'genres': GenresEncoder()
    })
# Processing rates CSV table, the user ID is represented by the tensor in PyTorch
user_x, user_mapping = load_node_csv(rating_path, index_col='userId')
# Establish heterogeneous graph (here is a bipartite graph)
# HeteroData() is a built-in data structure in PyG that represents heterogeneous graphs
data = HeteroData()
# Add information of different types of nodes
# Add user information. The user has no attribute vector
# Just tell PyG how many user nodes there are
data['user'].num_nodes = len(user_mapping)
# Tell PyG the attribute vector matrix of the movie, and PyG will infer the number of movie nodes according to x
data['movie'].x = movie_x
print(data)


# Establish information between users and movies
# Convert the user's rating of the movie into a tensor in PyTorch
# Facilitate subsequent model training
class IdentityEncoder(object):

    def __init__(self, dtype=None):
        self.dtype = dtype

    def __call__(self, df):
        return torch.from_numpy(df.values).view(-1, 1).to(self.dtype)


# Establish the connection information of bipartite graph edges
def load_edge_csv(path, src_index_col, src_mapping, dst_index_col, dst_mapping,
                  encoders=None, **kwargs):
    """
    :param path: CSV Path to table
    :param src_index_col: The left node of bipartite graph comes from CSV Which column of the table, such as'user_id'This column
    :param src_mapping:take user_id Map to node number, as we defined earlier user_mapping
    :param dst_index_col:Similarly, the movie node on the right of the bipartite graph
    :param dst_mapping:
    :param encoders:Edge inserter
    :param kwargs:
    :return:
    """
    df = pd.read_csv(path, **kwargs)
    # Establish connection information
    src = [src_mapping[index] for index in df[src_index_col]]
    dst = [dst_mapping[index] for index in df[dst_index_col]]
    # Note here, edge_index dimension is [2, number of sides]
    edge_index = torch.tensor([src, dst])
    # Attribute information of edges
    edge_attr = None
    # If the embedder is not empty
    if encoders is not None:
        edge_attrs = [encoder(df[col]) for col, encoder in encoders.items()]
        edge_attr = torch.cat(edge_attrs, dim=-1)

    return edge_index, edge_attr


# Get the edge information of bipartite graph
edge_index, edge_label = load_edge_csv(
    rating_path,
    # On the left of the bipartite graph is the user
    src_index_col='userId',
    src_mapping=user_mapping,
    # On the right is the movie
    dst_index_col='movieId',
    dst_mapping=movie_mapping,
    encoders={'rating': IdentityEncoder(dtype=torch.long)},
)
# Name the edges in the bipartite graph ('user ',' rates', 'movie')
data['user', 'rates', 'movie'].edge_index = edge_index
data['user', 'rates', 'movie'].edge_label = edge_label
print(data)

# At this point, our heterogeneous graph (here is a bipartite graph) data set is constructed
# Next, it is further transformed into a data set that can be trained
# Convert to undirected graph
data = ToUndirected()(data)
# Delete the attribute information of the opposite side because there is no scoring data for the user
del data['movie', 'rev_rates', 'user'].edge_label

# The data set is divided into training set, test set and verification set according to a certain proportion
transform = RandomLinkSplit(
    num_val=0.05,
    num_test=0.1,
    # Negative sampling ratio
    # No negative sampling, all input for training
    neg_sampling_ratio=0.0,
    # Tell PyG the connection of edges
    edge_types=[('user', 'rates', 'movie')],
    rev_edge_types=[('movie', 'rev_rates', 'user')],
)
# Split dataset
train_data, val_data, test_data = transform(data)
print(train_data)
print(val_data)
print(test_data)

6. Establishment of heterogeneous graph neural network

Take OGB dataset as an example

There are four types of nodes in the OGB dataset

author
paper
institution
field of study

4 types of edges

Writes: the connection between author and paper
Affiliated with: the connection between author and institution
Cites: relationship between paper and paper
Has topic: relationship between paper and field of study

The task on the OGB dataset is to predict the location of the paper in the whole relationship network

Let's see how to represent this heterogeneous graph

from torch_geometric.data import HeteroData

# HeteroData is a heterograph data structure of PyG
data = HeteroData()
# Add node information
data['paper'].x = ...
data['author'].x = ...
data['institution'].x = ...
data['field_of_study'].x = ...
# Add edge connection information
data['paper', 'cites', 'paper'].edge_index = ...  
data['author', 'writes', 'paper'].edge_index = ...  
data['author', 'affiliated_with', 'institution'].edge_index = ...  
data['author', 'has_topic', 'institution'].edge_index = ...  
# Add attribute information for edges
data['paper', 'cites', 'paper'].edge_attr = ...  
data['author', 'writes', 'paper'].edge_attr = ...  
data['author', 'affiliated_with', 'institution'].edge_attr = ...  
data['paper', 'has_topic', 'field_of_study'].edge_attr = ...

In this way, the above heterograph is established, and we can input it into a heterograph neural network

# Heterogeneous graph neural network
model = HeteroGNN(...)
# Obtain the output of heterogeneous graph neural network
# Note that the input of heterogeneous graph neural network is_ dict
output = model(data.x_dict, data.edge_index_dict, data.edge_attr_dict)

If PyG contains the heterogeneous map you want to use, you can import it directly in this way

from torch_geometric.datasets import OGB_MAG

# Import dataset
dataset = OGB_MAG(
    root='../data',
    # Pretreatment mode
    # Convert to vector
    preprocess='metapath2vec',
)
print(dataset[0])

The following describes the functions commonly used in HeteroData

#Gets a node or edge in a heterogeneous graph
paper_node_data=data['paper']
cites_edge_data=data['paper','cites','paper']
#If the connection node set of edges or the naming of edges is unique, it can also be written like this
#Get edges using connection endpoints
cites_edge_data=data['paper','paper']
#Get using the name of the edge
cites_edge_data=data['cites']
#Add a new attribute to the node
data['paper'].year=...
#Delete some attributes of the node
def data['field_of_study']
#Get all types of information in heterogeneous graph through metadata
node_types,edge_types=data.metadata()
#All types of nodes
print(node_types)
#All types of edges
print(edge_types)
#Judging some properties of heterogeneous graphs
print(data.has_isolated_nodes())
#If the dimensions of different types of information match, the heterogeneous graph can also be fused into a simple graph
homogeneous_data=data.to_homogeneous()
import torch_geometric.transforms as T
#Transform heterogeneous graphs
#Become an undirected graph
data=T.ToUndirected()(data)
#Ring added to itself
data=T.AddSelfLoops()(data)

The following describes how to establish heterogeneous graph neural network

6.1 transform simple graph neural network into heterogeneous graph neural network

PyG can be accessed via torch_geometric.nn.to_hetero(), or torch_geometric.nn.to_hetero_with_bases() transforms a simple graph neural network into a heterogeneous graph

import torch
import torch_geometric.transforms as T
from torch_geometric.datasets import OGB_MAG
from torch_geometric.nn import SAGEConv, to_hetero

#Import dataset
data = OGB_MAG(
    root='./data', 
    preprocess='metapath2vec', 
    transform=T.ToUndirected())[0]

#Define an ordinary graph neural network
class GNN(torch.nn.Module):
    def __init__(self, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = SAGEConv((-1, -1), hidden_channels)
        self.conv2 = SAGEConv((-1, -1), out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x

#Instantiate our defined graph neural network
model = GNN(hidden_channels=64, out_channels=dataset.num_classes)
#Convert it to heterogeneous graph form
model = to_hetero(model, data.metadata(), aggr='sum')

PyG to_ Here's how hetero works

According to our heterogeneous graph data structure, it automatically copies the layer structure in the simple graph neural network structure defined by us, and adds the information transmission path.

torch_geometric.nn.conv.HeteroConv convolution layer also plays a similar function

from torch_geometric.nn import HeteroConv, GCNConv, SAGEConv, GATConv, Linear

class HeteroGNN(torch.nn.Module):
    def __init__(self, hidden_channels, out_channels, num_layers):
        super().__init__()

        self.convs = torch.nn.ModuleList()
        #Define map volume layer
        for _ in range(num_layers):
            #The outermost uses HeteroConv to convert the inner convolution layer into a heterogeneous graph version
            conv = HeteroConv(
                #Convolution layer to convert
                {
                ('paper', 'cites', 'paper'): GCNConv(-1, hidden_channels),
                ('author', 'writes', 'paper'): GATConv((-1, -1), hidden_channels),
                ('author', 'affiliated_with', 'institution'): SAGEConv((-1, -1), hidden_channels),
                }, 
                aggr='sum')
            self.convs.append(conv)

        self.lin = Linear(hidden_channels, out_channels)

    def forward(self, x_dict, edge_index_dict):
        for conv in self.convs:
            x_dict = conv(x_dict, edge_index_dict)
            x_dict = {key: x.relu() for key, x in x_dict.items()}
        return self.lin(x_dict['author'])

model = HeteroGNN(hidden_channels=64, out_channels=dataset.num_classes,
                  num_layers=2)

6.2 using heterogeneous graph neural network in PyG

7. Use of graphgym

GraphGym is further encapsulated on the basis of PyG. The experiment of graph neural network can be carried out in a parametric way. See the details

https://pytorch-geometric.readthedocs.io/en/latest/modules/graphgym.htmlpytorch-geometric.readthedocs.io/en/latest/modules/graphgym.html

(I think I'd better build it myself without packaging)

8. Common convolution layers included in pyg

PyG contains multiple convolution layers in classical graph neural network papers

Part of the paper and source code of the convolution layer can be seen in my

Figure papers and source codes of some graph convolution layers in the neural network framework pytorch geometry (pyg) 13.3 review articles

II PyG stepping pit

1. Problems such as connection timeout occur when downloading datasets using Planetoid

: due to the slow connection of github, click the source file of Planetoid, find the first url attribute and set it to

url='https://gitee.com/jiajiewu/planetoid/raw/master/data'

Change to Chinese website

2. When building data, 'OMP:...' appears Question of

Add at the beginning of the code:

import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'

(problems encountered when using PyG will be continuously updated)

Reprint

Figure neural network framework - use of pytorch geometry (pyg) and stepping on pits - knowledge

Keywords: Pytorch pyg

Added by abushahin on Thu, 20 Jan 2022 01:54:57 +0200

Programming VIP