In the following, we focus our description on the parts of the NN that explicitly or implicitly learn the latent hypergraph representation, giving less importance to the readout layers.
5.3.1 Hypergraph Convolutional Networks.
Graph convolutional networks are by far the most popular technique for learning graph embeddings. The convolution operator on (hyper)graphs generalizes the operation of convolution from grid data to graph data [
161] and can be elegantly motivated using the theory of (hyper)graph signal processing as a generalization of Euclidean convolutions to non-Euclidean data.
In the literature, (hyper)graph convolutions are usually classified as spatial or spectral based on how the convolution operator is defined [
161,
192]. On the one hand, spectral graph convolution uses the Fourier transform to transform the graph signal to the spectral domain, where it carries out the convolution operation. On the other hand, spatial graph convolution aggregates the node features from the perspective of the spatial domain [
187]. However, as thoroughly shown in [
13,
14], this distinction is becoming less and less clear, since many graph convolutions can be defined in both the spectral and spatial domains. Based on this consideration, we avoid such a blurry categorization in our survey in favor of a more general framework encompassing almost all convolutional methods, while still devoting some space to distinctly spectral convolutions.
As Hamilton et al. [
65] show, graph convolutions are a particular case of the more general and reader-friendly Message-passing Framework (MPF), whose underlying intuition is straightforward. Given an initial hypergraph representation (e.g., nodes’ feature matrix
\(\mathbf {X}^{(0)}\)), the MPF iteratively updates it according to the following process. At each iteration, every node aggregates information from its local neighborhood, and, as the iterations progress, each node embedding contains more and more information from further reaches of the hypergraph. Hence, node embeddings encode two-fold knowledge: structural and feature-based information, both deriving from iteratively gathering neighbors’ representations according to the hypergraph structure. Each message-passing step is represented as a layer of a (deep) hypergraph convolutional NN (convHNN), whose output may act as the input to the next layer.
In contrast to graphs, where every edge simply connects two nodes, hyperedges in hypergraphs encode many-to-many interactions among sets of two or more nodes. Thus, to properly harness the high-order nature of hypergraphs, the propagation of information over nodes should account for such richer relations. This intuition can be accomplished using a two-stage (a.k.a. two-level or two-step) update/aggregation process [
7]:
\[
\mathbf {x}_e^{(l+1)} = f_{\mathcal {V}\rightarrow \mathcal {E}}\big (\lbrace \mathbf {x}_v^{(l)}\boldsymbol {\Theta }^{(l)} : v \in e\rbrace \big), \qquad \mathbf {x}_v^{(l+1)} = f_{\mathcal {E}\rightarrow \mathcal {V}}\big (\lbrace \mathbf {x}_e^{(l+1)}\boldsymbol {\Theta }^{\prime (l)} : e \ni v\rbrace \big), \tag{4}
\]
where
\(\mathbf {x}_e^{(l)}, \mathbf {x}_v^{(l)}\) are the latent representations of hyperedges and nodes at layer
l, respectively;
\(f_{\mathcal {V}\rightarrow \mathcal {E}}, f_{\mathcal {E}\rightarrow \mathcal {V}}\) are two (non-linear) parametric permutation-invariant aggregation functions known as
node-to-hyperedge and
hyperedge-to-node aggregation, respectively; and
\(\boldsymbol {\Theta }^{(l)} \in \mathbb {R}^{d_l \times d_{l+1}}\) and
\(\boldsymbol {\Theta }^{\prime (l)} \in \mathbb {R}^{d^{\prime }_l \times d^{\prime }_{l+1}}\) are the learnable convolutional parameters. As for standard NNs, parameters are learned via the optimization of objective functions specifically designed for the given task. The optimization is performed using backpropagation and gradient-descent-based algorithms like Stochastic Gradient Descent (SGD) and Adam [
90]. For simplicity, we omit the input argument
H and the potential set of hyper-parameters from Equation (
4). Equation (
4) is very similar to the one proposed by Chien et al. [
35] (
AllSet [
35]), and it generalizes many existing works (
UNIGNN [
79],
HyperSaR [
145],
HyperGAT [
40],
HNHN [
43],
HGC-RNN [
177], [
7]). Peculiar two-stage MP processes are the ones proposed in [
60] and [
63]. In [
60], Guan et al. define
\(f_{\mathcal {V}\rightarrow \mathcal {E}}\) as a
hyperedge shrinking function that selects the most informative nodes (similarly to an attention mechanism) to update the hyperedge representation. In [
63], instead, one of the stages is fixed. Specifically, node embeddings are learned beforehand using a Graph Convolutional NN and are not updated during the MP phase; their weighted mean is used as the initial hyperedge representation.
A slightly different two-stage MPF is proposed by Srinivasan et al. [
136], where the nodes and the hyperedges are updated simultaneously (in Equation (
4), the updates happen sequentially) and where the aggregation functions also consider the embeddings of the second-order neighbors.
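As an illustration of the two-stage scheme in Equation (4), the following Python sketch implements a single message-passing layer with sum aggregation and a ReLU non-linearity for both stages; the function names, the choice of aggregator, and the toy incidence matrix are our own illustrative assumptions and do not correspond to any specific model cited above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def two_stage_mp_layer(H, Xv, Theta, Theta_p):
    """One two-stage message-passing layer in the spirit of Equation (4).

    H       : (n, m) incidence matrix, H[v, e] = 1 iff node v belongs to hyperedge e
    Xv      : (n, d) node representations at layer l
    Theta   : (d, d1) node-to-hyperedge parameters
    Theta_p : (d1, d2) hyperedge-to-node parameters
    """
    # Stage 1 (node-to-hyperedge): each hyperedge aggregates its member nodes.
    Xe = relu(H.T @ (Xv @ Theta))        # sum is a permutation-invariant aggregator
    # Stage 2 (hyperedge-to-node): each node aggregates its incident hyperedges,
    # using the hyperedge representations just computed in stage 1 (sequential update).
    Xv_new = relu(H @ (Xe @ Theta_p))
    return Xv_new, Xe

# Toy hypergraph: 4 nodes, 2 hyperedges e1 = {v1, v2, v3}, e2 = {v3, v4}.
H = np.array([[1, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
rng = np.random.default_rng(0)
Xv = rng.normal(size=(4, 8))
Xv, Xe = two_stage_mp_layer(H, Xv, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
```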
While only the few works cited above rely on the two-stage MPF, most hypergraph convolutional NNs resort to the hypergraph’s (weighted) clique expansion, thus reducing Equation (
4) to a single-stage MPF (as for graphs) first introduced by Feng et al. [
48] (
HGNN). In this case, the
lth (single-stage) hypergraph convolutional layer can be elegantly defined as
\[
\mathbf {X}^{(l+1)} = \sigma \big (\mathbf {R}\, \mathbf {X}^{(l)} \boldsymbol {\Theta }^{(l)}\big), \tag{5}
\]
where
\(\mathbf {R}\in \mathbb {R}^{n \times n}\) is the so-called
reference operator [
57], usually instantiated as a “surrogate” of the adjacency matrix;
\(\boldsymbol {\Theta }^{(l)} \in \mathbb {R}^{d_l \times d_{l+1}}\) are the learnable convolutional parameters (with
\(d_l\) the dimension of the latent representation at layer
l); and
\(\sigma\) is a non-linear activation function (e.g., sigmoid). It is worth noting that although the MPF is flexible enough to allow sending different messages to different neighbors, the convolutional layer in Equation (
5) assumes that every node sends the same message to each of its neighbors. Moreover, since the update of the node embeddings happens directly, hyperedge embeddings are not explicitly learned in the process. However, one can learn hyperedge embeddings by resorting to the dual transformation and applying the MPF (
EHGNN [
88]). To reduce the two-section graph connectivity and speed up the learning process, Yadati et al. [
171] (
(Fast)HyperGCN) proposed replacing each hyperedge with an incomplete clique.
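For concreteness, a minimal sketch of the single-stage layer in Equation (5) follows; the tanh activation and the two-layer stacking are illustrative choices, and the reference operator R is assumed to be precomputed (e.g., one of the normalized clique-expansion operators discussed below).

```python
import numpy as np

def hypergraph_conv_layer(R, X, Theta, sigma=np.tanh):
    """Single-stage convolutional layer of Equation (5): X_{l+1} = sigma(R X_l Theta_l).

    R     : (n, n) reference operator (e.g., a normalized clique-expansion adjacency)
    X     : (n, d_l) node representations at layer l
    Theta : (d_l, d_{l+1}) learnable parameters
    """
    return sigma(R @ X @ Theta)

# Stacking two layers: the receptive field of every node grows with the depth.
# X1 = hypergraph_conv_layer(R, X0, Theta0)
# X2 = hypergraph_conv_layer(R, X1, Theta1)
```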
Equation (
5) covers most of the hypergraph convolutional models (discussed later); however, there are exceptions that may only fit in Equation (
4). For instance,
H\(^2\)SeqRec [
95] defines an ad hoc convolutional operation in the hyperbolic space to alleviate the sparsity issue in hypergraphs for recommendation tasks, while
DHGNN [
86] and
KHNN [
99] use a 1D convolution and a parametric aggregation function to better model discriminative information among nodes. In [
170], Yadati proposed
G-MPNN [
170], a generalized MPF able to handle multi-relational ordered hypergraphs, and
MPNN-R to handle recursive hypergraphs.
There are also methods that use a non-parametric aggregation function just to propagate information in the hypergraph and to initialize node representations, such as
DHCN [
166] and
MHCN [
181].
In the following, we discuss five convolutional design factors (i.e., reference operator, skip-connections, attention mechanism, gated update, and spectral convolution) that impact how the information is propagated and aggregated during the learning process. These design choices may be coupled together in the same network architecture.
Reference operator. In Equation (
5), the matrix
\(\mathbf {R}\) describes how embeddings are aggregated. In its simplest form,
\(\mathbf {R}\) is exactly the adjacency matrix of the hypergraph clique expansion, defined as
\(\mathbf {H}\mathbf {H}^\top -\mathbf {D}_v\) (
NHP [
172],
SHCN [
33],
HHNE [
18],
DHCN [
166],
HGNN\(^+\) [
53]). In this case, the aggregation function is the sum of the representations of the neighbor nodes.
However, an unnormalized adjacency matrix may hinder the optimization process, as node features may span very different ranges and, further, feature values may grow indefinitely in deep networks. For these reasons, the adjacency matrix is commonly normalized. The row-normalized adjacency matrix becomes
\(\mathbf {R}_\text{row} = \mathbf {D}_v^{-1}\mathbf {H}\mathbf {D}_e^{-1} \mathbf {W}\mathbf {H}^\top\) (
HCHA [
10],
HHGR [
185],
GC-HGNN [
121], and
IHGNN [
34] without the hyperedge-related matrices), while it has the form
\(\mathbf {R}_\text{sym} = \mathbf {D}_v^{-\frac{1}{2}}\mathbf {H}\mathbf {D}_e^{-1} \mathbf {W}\mathbf {H}^\top \mathbf {D}_v^{-\frac{1}{2}}\) if symmetrically normalized (
HGNN [
48],
GC-HGNN [
121],
DH-HGCN [
68],
HyperCTR [
70],
DualHGNN [
159],
DualHGCN [
169],
MHGNN [
9],
STHAN-SR [
134],
MultiHGNN [
76],
HGDD [
119],
STHGCN [
155],
F\(^2\)HNN [
110],
S\(^2\)HCN [
109],
AdaHGNN [
160],
HybridHGCN [
77],
HyperINF [
87],
MT-HGCN [
154],
HGNN\(^+\) [
53],
HGNNA [
41]).
This latter formulation is usually preferred since row normalization does not consider the connectivity of the neighbor nodes and, as such, gives the same relative importance to highly and sparsely connected nodes. It is worth noticing that both normalized versions also consider self-loops (i.e.,
\(\mathbf {R}\) has a nonzero diagonal); thus, the representation of a given node at layer
l is transformed and included in its representation at layer
\(l+1\) (i.e., skip-connection). Node-level normalization is also possible, and in such a case,
\(\mathbf {W}\) (
DHCF [
84],
HAIN [
15]) and/or
\(\mathbf {D}_e^{-1}\) (
HHNE [
18],
MGCN [
32],
HyperRec [
152], [
164]) can be removed from the computation of
\(\mathbf {R}_\text{sym/row}\). The set of learnable parameters is shared among all nodes.
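As a reference, the two normalized operators above can be assembled from the incidence matrix as in the sketch below; the dense NumPy representation, the default unit hyperedge weights, and the assumption that no node or hyperedge is isolated are our own simplifications.

```python
import numpy as np

def reference_operators(H, w=None):
    """Build R_row and R_sym from the incidence matrix H (n x m) and the
    hyperedge weights w (defaults to 1); assumes no isolated nodes/hyperedges."""
    n, m = H.shape
    w = np.ones(m) if w is None else w
    W = np.diag(w)
    dv = H @ w                        # (weighted) node degrees
    de = H.sum(axis=0)                # hyperedge degrees
    Dv_inv = np.diag(1.0 / dv)
    Dv_isqrt = np.diag(1.0 / np.sqrt(dv))
    De_inv = np.diag(1.0 / de)
    R_row = Dv_inv @ H @ De_inv @ W @ H.T                  # row-normalized operator
    R_sym = Dv_isqrt @ H @ De_inv @ W @ H.T @ Dv_isqrt     # symmetrically normalized operator
    return R_row, R_sym
```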
An interesting approach is devised by Liu et al. in
HGCELM [
100]. Specifically, the authors defined a single-layered random hypergraph convolution based on Equation (
5) with
\(\mathbf {R}_\text{sym}\), where the parameters are initialized randomly and used to perform a random projection (inspired by extreme learning machines [
73]). Such representations are then fed to a hypergraph convolutional regression layer to predict node labels.
In both
HCCF [
162] and
SHT [
163], the hypergraph topology is not fixed but jointly learned with the embeddings. To do so, the reference operator is parametric, and it also includes a nonlinear activation function (i.e., LeakyReLU [
167]).
Skip-connections. Over-smoothing is a well-known problem of deep (hyper)graph convolutional networks [
30]. Over-smoothing occurs when the information aggregated from the neighbors during message passing starts to dominate the updated node representations. This phenomenon happens because the receptive field of the convolution grows exponentially with the model depth. A possible solution to alleviate this issue is to use skip-connections (a.k.a. residual connections/layers), which try to directly preserve the node information from the previous round of message passing in the update. This concept can also be generalized by including all (or some of the) previous node representations. Hypergraph convolution can be integrated with generalized skip-connections. In this case, Equation (
5) becomes
\[
\mathbf {X}^{(l+1)} = \sigma \Big (\mathbf {R}\, \mathbf {X}^{(l)} \boldsymbol {\Theta }^{(l)} + \sum _{k=0}^{l} \mathbf {X}^{(k)} \boldsymbol {\Phi }^{(k)}\Big),
\]
where
\(\boldsymbol {\Phi }^{(l)} \in \mathbb {R}^{d_l \times d_{l+1}}\) are the skip-connection parameters.
If the skip-connection is non-parametric and involves only the last layer, i.e.,
\(\boldsymbol {\Phi }^{(l)} \equiv \mathbf {I}_{d_l}\) and
\(\boldsymbol {\Phi }^{(\lt l)} \equiv \mathbf {0}\), it is called identity mapping, and it simply “pushes forward” the representation of the nodes at layer
l (
DHCF [
84],
HCHA [
10]). If parametric, the skip-connection can either have a different set of parameters w.r.t. the convolution (
DualHGCN [
169],
HAIN [
15]) or share the same set of parameters, i.e.,
\(\boldsymbol {\Theta }^{(l)} \equiv \boldsymbol {\Phi }^{(l)}\) with
\(\boldsymbol {\Phi }^{(\lt l)} \equiv \mathbf {0}\) (
SHCN [
33],
HHNE [
18],
MGCN [
32],
HyperRec [
152]). Note that if
\(\mathbf {R}\) has a non-zero diagonal (i.e., it considers self-loops), the skip-connection of the previous layer with shared parameters is already included in Equation (
5) (e.g.,
HGNN [
48]). The identity mapping can also be performed on the aggregated messages rather than the previous node representation(s), and it can be combined with the “standard” skip-connection as in
ResHGNN and
UniGCNII [
79].
A particular case of the skip-connection is the initial residual connection that pushes forward only the initial representation instead of the representation of the previous layer, i.e.,
\(\boldsymbol {\Phi }^{(\gt 0)} \equiv \mathbf {0}\). For instance, in
ResHGNN [
76] and
HyperINF [
87], the residual connection is not parameterized but weighted, specifically,
\(\boldsymbol {\Phi }^{(0)} \equiv \alpha _l \mathbf {I}\) and
\(\boldsymbol {\Theta }^{(l)} \equiv (1-\alpha _l) \mathbf {I}\), where
\(\alpha _l\) is a hyperparameter that balances the contributions of the previous representations. An alternative to the additive skip-connection is the multiplicative skip-connection, which learns a nonlinear gate to modulate the embeddings at a feature-wise granularity through dimension re-weighting (
MHCN [
181]).
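To fix ideas, a sketch of a convolutional layer with generalized skip-connections is given below; the special cases discussed above correspond to particular choices of the \(\boldsymbol {\Phi }\) matrices (e.g., identity mapping or initial residual), while the implementation details are our own illustrative assumptions.

```python
import numpy as np

def conv_layer_with_skips(R, X_hist, Theta, Phi, sigma=np.tanh):
    """Convolution with generalized skip-connections.

    X_hist : [X_0, ..., X_l], all node representations computed so far
    Theta  : (d_l, d_{l+1}) convolution parameters of layer l
    Phi    : list of skip parameters, Phi[k] for layer k (None means no contribution)
    """
    out = R @ X_hist[-1] @ Theta                 # standard convolution term
    for Xk, Phik in zip(X_hist, Phi):
        if Phik is not None:
            out = out + Xk @ Phik                # skip-connection contributions
    return sigma(out)

# Identity mapping : Phi = [None] * l + [np.eye(d_l)]  (only the last layer is pushed forward).
# Initial residual : Phi = [Phi_0] + [None] * l        (only the input representation is reused).
```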
Attention mechanism. The aggregation function in Equation (
5) represents the (normalized) average over all neighbor nodes. Even though straightforward, this choice might be sub-optimal as it gives all neighbors the same importance. A natural way to overcome this limitation is to weigh each neighbor differently. This solution is the basic idea behind the attention mechanism, i.e., a neural network that learns how to weight the neighbor nodes, first introduced for graphs by Velickovic et al. [
150]. In a nutshell, (hyper)graph attention is a particular case of (H)GNN in which the messages sent to the neighbors are not all the same.
The literature offers different types of attention mechanisms, and the most popular in hypergraphs are additive attention (
DHCN [
166],
DHGNN [
86],
DualHGNN [
159],
HNN [
139],
HGC-RNN [
177]) and multiplicative attention (
HyperGAT [
40],
SHARE [
153], [
164],
HGNNA [
41],
HGAT [
29],
DH-HGCN [
68], [
140]). Although less popular on hypergraphs, it is also possible to add multiple attention “heads” (
HCHA [
10],
DHAT [
105],
SHT [
163]), where each one computes
k different attention weights through independently parameterized attention layers. Eventually, the messages aggregated using the
k attention heads are linearly combined. Attention mechanisms can also help combine node embeddings in tasks like group recommendation (
HHGR [
185],
HCR [
85]) or handling temporal sequences (e.g.,
STHAN-(S)R [
134],
(D)STHGCN [
155]). Recently, thanks to the success of the transformer architecture [
149], self-attention mechanisms are starting to appear in convHNNs (
DHGNN [
86],
HOT [
89],
SSF [
72],
Hyper-SAGNN [
191],
HyperRec [
152]), and they are also widely used to combine sequential node/hyperedge embeddings for task-specific purposes (
Higashi [
190],
DualHGNN [
159],
DHAT [
105],
HyperTeNet [
151],
H\(^2\)SeqRec [
95]).
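The sketch below illustrates how an additive attention mechanism can replace the uniform aggregation in the node-to-hyperedge stage; the specific parameterization (a linear projection followed by a tanh scoring function) is a generic choice of ours and does not reproduce the exact formulation of any cited model.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attentive_node_to_hyperedge(H, Xv, W, a):
    """Node-to-hyperedge aggregation with additive attention.

    W : (d, d') projection parameters; a : (d',) attention vector.
    """
    n, m = H.shape
    Z = Xv @ W                                  # projected node representations
    Xe = np.zeros((m, Z.shape[1]))
    for e in range(m):
        members = np.flatnonzero(H[:, e])       # nodes incident to hyperedge e
        scores = np.tanh(Z[members]) @ a        # unnormalized importance scores
        alpha = softmax(scores)                 # attention weights over member nodes
        Xe[e] = alpha @ Z[members]              # weighted (non-uniform) aggregation
    return Xe
```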
Gated updates. Before the advent of transformer architectures, recurrent neural networks were the standard tool to deal with sequential inputs. In hypergraphs, the sequential inputs can refer to (1) node level, i.e., node features change over time, or (2) structural level, i.e., the hypergraph topology changes over time. In the node-level case, the neighbors’ information contains not only their current hidden state but also a completely new observation. In this setting, the main mechanisms used are Gated Recurrent Units (GRUs) ([
164],
HGC-RNN [
177]), Gated GNN [
96] ([
156]) and Gated Linear Units (GLUs) stacked on top of convolutional layers (
MT-HGCN [
154],
(D)STHGCN [
155]). As far as we know, there are no relevant hypergraph embedding techniques dealing with structural sequential input.
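A minimal sketch of a gated (GRU-style) update is given below, in which the convolved neighborhood message plays the role of the new observation and the previous node representations act as the recurrent hidden state; the exact gating used by the cited models may differ, so this should be read as a generic illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_conv_update(R, X_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """GRU-style gated update of node representations (all parameters are (d, d))."""
    M = R @ X_prev                               # aggregated neighborhood message
    z = sigmoid(M @ Wz + X_prev @ Uz)            # update gate
    r = sigmoid(M @ Wr + X_prev @ Ur)            # reset gate
    h = np.tanh(M @ Wh + (r * X_prev) @ Uh)      # candidate representation
    return (1.0 - z) * X_prev + z * h            # gated interpolation
```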
Spectral convolution. Spectral hypergraph convolution relies on spectral (hyper)graph theory, and it is the basis of the very first convHNN method proposed by Feng et al. [
48] (
HGNN). As previously discussed,
HGNN can also be interpreted as a special case of the MPF (see Equation (
5)); however, it may not be straightforward to directly map a particular spectral hypergraph convolution method into the MPF, though it can be argued that this is possible [
13].
The main concept behind spectral hypergraph convolution is the connection between the Fourier transform and the Laplace operator. We can define hypergraph Fourier transforms of a hypergraph signal through the eigendecomposition of the hypergraph Laplacian, where the eigenvalues represent hypergraph frequencies (i.e., the spectrum of the hypergraph), and eigenvectors denote frequency components (i.e., hypergraph Fourier basis). Then the convolution in the spectral domain is defined via point-wise products in the transformed Fourier space (
HpLapGCN [
49],
pLapHGNN [
108]).
Instead of the Fourier basis, it is possible to employ a wavelet basis, as in
HGWNN [
117] and
HWNN [
138], where the hypergraph wavelet transform projects the hypergraph signal from the vertex domain onto the spectral domain.
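The following sketch makes the spectral pipeline explicit: eigendecompose a (symmetric) hypergraph Laplacian to obtain the Fourier basis, transform the signal, apply a point-wise filter on the spectrum, and transform back; the low-pass filter in the example is an arbitrary illustrative choice.

```python
import numpy as np

def spectral_hypergraph_filter(L, X, g):
    """Filter a hypergraph signal X in the spectral domain.

    L : (n, n) symmetric hypergraph Laplacian (e.g., I - R_sym)
    g : filter applied point-wise to the eigenvalues (the hypergraph frequencies)
    """
    lam, U = np.linalg.eigh(L)        # spectrum and hypergraph Fourier basis
    X_hat = U.T @ X                   # hypergraph Fourier transform of the signal
    Y_hat = g(lam)[:, None] * X_hat   # point-wise product in the spectral domain
    return U @ Y_hat                  # inverse transform back to the vertex domain

# Example of a low-pass filter that attenuates high frequencies:
# Y = spectral_hypergraph_filter(L, X, lambda lam: 1.0 / (1.0 + lam))
```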
5.3.3 Encoder-based Approaches.
When it comes to learning low-dimensional embeddings, the encoder-decoder architecture is the go-to approach in machine learning. However, this type of architecture may fail to capture the structural information encoded in hypergraph-structured data. For this reason, these networks are usually included as a sub-module of larger networks. For instance,
Hyper-SAGNN [
191] uses this approach to initialize node representations, while
DHNE [
147] uses an autoencoder to learn a first latent representation of the nodes, which is then fine-tuned using a neural network specifically designed to preserve second-order proximity. More “sophisticated” approaches make use of Variational Autoencoders (
HeteHG-VAE [
45]) and DeepSets (
DHE [
120]). In general, when coupled with specific decoders, these methods are mostly used in task-specific hypergraph embeddings (e.g.,
HeteHG-VAE [
45],
Event2Vec [
37],
HNN-HM [
97]).
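As a simple illustration of how an (auto)encoder can provide initial node representations, the sketch below compresses each node’s incidence row into a low-dimensional code and measures the reconstruction error; it is a generic, untrained autoencoder pass of our own design, not the architecture of DHNE, Hyper-SAGNN, or any other cited method.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def autoencode_nodes(H, W_enc, W_dec):
    """Encode each node (described by its incidence row) and reconstruct it.

    H : (n, m) incidence matrix; W_enc : (m, d); W_dec : (d, m).
    """
    Z = relu(H @ W_enc)                   # encoder: low-dimensional node codes
    H_rec = sigmoid(Z @ W_dec)            # decoder: reconstructed incidence rows
    loss = np.mean((H - H_rec) ** 2)      # reconstruction objective to minimize
    return Z, loss
```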
Remarks. Thanks to their versatility, HNNs can be easily adapted to a broad spectrum of tasks, such as (multi-class) classification (e.g.,
HpLapGCN [
49],
HyperGCN [
171],
DualHGNN [
159],
DHE [
120]), regression (e.g., [
164],
HGC-RNN [
177]), link prediction (e.g.,
HeteHG-VAE [
45],
DHCF [
84],
DHNE [
147]), and recommendation (e.g.,
HHGR [
185],
HyperRec [
152],
SHARE [
153]). The modularity of NN methods also eases the fusion of different input channels. For instance, using the dual of a hypergraph makes it possible to reuse the same architecture designed for hypergraph node embeddings to learn hyperedge embeddings (e.g.,
NHNE [
74]). In general, multi-channel input can be helpful in domain-specific tasks, for example, when dealing with heterogeneous hypergraphs (e.g.,
MHCN [
181],
DHCF [
84]). HNNs can also manage dynamic hypergraphs [
164] or, in general, tasks that require modeling sequential information (e.g.,
H\(^2\)SeqRec [
95]).
Moreover, HNNs allow performing transfer learning and reusing well-established architectures on various tasks by simply designing the readout layer(s) or a task-specific loss function. The typical learning setting of HNNs is (semi-)supervised transductive learning, with only a few exceptions, like
DHE [
120],
G-MPNN [
170],
AllSet [
35], and
HGAT [
29] that can be used in inductive settings.
As for (deep) NNs in general, HNNs require a large amount of data to be adequately trained, and hyper-parameter tuning can be costly. Another drawback of HNNs, especially for very deep architectures, is that the training phase may require high computational power (e.g., GPUs, TPUs), hampering their applicability on edge devices.
Among all HNN techniques, convHNNs are nowadays predominant in the literature thanks to their superior performance. This trend is evident from Table 3 in the online Supplemental Material (Appendix F): both encoder-based and random-walk-based methods have almost disappeared since 2020. The success of this methodology is mainly due to the reuse of knowledge from the graph representation learning literature. Akin to GNNs, these approaches learn node representations that can then be aggregated to learn the latent representations of hyperedges (e.g.,
NHP [
172]) or the whole hypergraph (e.g., [
164]). Recently the trend seems to be drifting toward convolutional methods specifically designed for HGs (e.g., [
136],
AllSet [
35]), where the hyperedge embeddings are explicitly learned.
However, convHNNs are generally more demanding in terms of computational requirements than encoder-based and random-walk-based methods; thus, the latter should be preferred when computational capabilities are limited. Further, the clique-expansion transformation used by many convolutional approaches is a possible limitation, since the high-order nature of the relations is only indirectly exploited in the learning process. An important strength of random-walk-based methods is their universality: they can be applied (almost) seamlessly to any hypergraph. Encoder-based methods, instead, may require some effort in the design of the encoder/decoder architecture. Still, when combined, random-walk- and encoder-based methods can achieve state-of-the-art performance, e.g.,
Hyper-SAGNN [
191] and
DHE [
120].
Task-wise, we do not notice any particular trend: all (deep) neural network approaches have been applied successfully to classification, link prediction, and recommendation tasks. Nonetheless, when dealing with dynamic or multi-channel hypergraphs, convHNNs are preferable since they can easily be designed to handle recurrent or multi-channel inputs.