new

Get trending papers in your email inbox!

Subscribe

byAK and the research community

Mar 14

A Topological Perspective on Demystifying GNN-Based Link Prediction Performance

Graph Neural Networks (GNNs) have shown great promise in learning node embeddings for link prediction (LP). While numerous studies aim to improve the overall LP performance of GNNs, none have explored its varying performance across different nodes and its underlying reasons. To this end, we aim to demystify which nodes will perform better from the perspective of their local topology. Despite the widespread belief that low-degree nodes exhibit poorer LP performance, our empirical findings provide nuances to this viewpoint and prompt us to propose a better metric, Topological Concentration (TC), based on the intersection of the local subgraph of each node with the ones of its neighbors. We empirically demonstrate that TC has a higher correlation with LP performance than other node-level topological metrics like degree and subgraph density, offering a better way to identify low-performing nodes than using cold-start. With TC, we discover a novel topological distribution shift issue in which newly joined neighbors of a node tend to become less interactive with that node's existing neighbors, compromising the generalizability of node embeddings for LP at testing time. To make the computation of TC scalable, We further propose Approximated Topological Concentration (ATC) and theoretically/empirically justify its efficacy in approximating TC and reducing the computation complexity. Given the positive correlation between node TC and its LP performance, we explore the potential of boosting LP performance via enhancing TC by re-weighting edges in the message-passing and discuss its effectiveness with limitations. Our code is publicly available at https://github.com/YuWVandy/Topo_LP_GNN.

A Retrieve-and-Read Framework for Knowledge Graph Link Prediction

Knowledge graph (KG) link prediction aims to infer new facts based on existing facts in the KG. Recent studies have shown that using the graph neighborhood of a node via graph neural networks (GNNs) provides more useful information compared to just using the query information. Conventional GNNs for KG link prediction follow the standard message-passing paradigm on the entire KG, which leads to superfluous computation, over-smoothing of node representations, and also limits their expressive power. On a large scale, it becomes computationally expensive to aggregate useful information from the entire KG for inference. To address the limitations of existing KG link prediction frameworks, we propose a novel retrieve-and-read framework, which first retrieves a relevant subgraph context for the query and then jointly reasons over the context and the query with a high-capacity reader. As part of our exemplar instantiation for the new framework, we propose a novel Transformer-based GNN as the reader, which incorporates graph-based attention structure and cross-attention between query and context for deep fusion. This simple yet effective design enables the model to focus on salient context information relevant to the query. Empirical results on two standard KG link prediction datasets demonstrate the competitive performance of the proposed method. Furthermore, our analysis yields valuable insights for designing improved retrievers within the framework.

A Survey on Machine Learning Solutions for Graph Pattern Extraction

A subgraph is constructed by using a subset of vertices and edges of a given graph. There exist many graph properties that are hereditary for subgraphs. Hence, researchers from different communities have paid a great deal of attention in studying numerous subgraph problems, on top of the ordinary graph problems. Many algorithms are proposed in studying subgraph problems, where one common approach is by extracting the patterns and structures of a given graph. Due to the complex structures of certain types of graphs and to improve overall performances of the existing frameworks, machine learning techniques have recently been employed in dealing with various subgraph problems. In this article, we present a comprehensive review on five well known subgraph problems that have been tackled by using machine learning methods. They are subgraph isomorphism (both counting and matching), maximum common subgraph, community detection and community search problems. We provide an outline of each proposed method, and examine its designs and performances. We also explore non-learning-based algorithms for each problem and a brief discussion is given. We then suggest some promising research directions in this area, hoping that relevant subgraph problems can be tackled by using a similar strategy. Since there is a huge growth in employing machine learning techniques in recent years, we believe that this survey will serve as a good reference point to relevant research communities.

Revisiting Graph Neural Networks on Graph-level Tasks: Comprehensive Experiments, Analysis, and Improvements

Graphs are essential data structures for modeling complex interactions in domains such as social networks, molecular structures, and biological systems. Graph-level tasks, which predict properties or classes for the entire graph, are critical for applications, such as molecular property prediction and subgraph counting. Graph Neural Networks (GNNs) have shown promise in these tasks, but their evaluations are often limited to narrow datasets, tasks, and inconsistent experimental setups, restricting their generalizability. To address these limitations, we propose a unified evaluation framework for graph-level GNNs. This framework provides a standardized setting to evaluate GNNs across diverse datasets, various graph tasks (e.g., graph classification and regression), and challenging scenarios, including noisy, imbalanced, and few-shot graphs. Additionally, we propose a novel GNN model with enhanced expressivity and generalization capabilities. Specifically, we enhance the expressivity of GNNs through a k-path rooted subgraph approach, enabling the model to effectively count subgraphs (e.g., paths and cycles). Moreover, we introduce a unified graph contrastive learning algorithm for graphs across diverse domains, which adaptively removes unimportant edges to augment graphs, thereby significantly improving generalization performance. Extensive experiments demonstrate that our model achieves superior performance against fourteen effective baselines across twenty-seven graph datasets, establishing it as a robust and generalizable model for graph-level tasks.

Personalized Subgraph Federated Learning

Subgraphs of a larger global graph may be distributed across multiple devices, and only locally accessible due to privacy restrictions, although there may be links between subgraphs. Recently proposed subgraph Federated Learning (FL) methods deal with those missing links across local subgraphs while distributively training Graph Neural Networks (GNNs) on them. However, they have overlooked the inevitable heterogeneity between subgraphs comprising different communities of a global graph, consequently collapsing the incompatible knowledge from local GNN models. To this end, we introduce a new subgraph FL problem, personalized subgraph FL, which focuses on the joint improvement of the interrelated local GNNs rather than learning a single global model, and propose a novel framework, FEDerated Personalized sUBgraph learning (FED-PUB), to tackle it. Since the server cannot access the subgraph in each client, FED-PUB utilizes functional embeddings of the local GNNs using random graphs as inputs to compute similarities between them, and use the similarities to perform weighted averaging for server-side aggregation. Further, it learns a personalized sparse mask at each client to select and update only the subgraph-relevant subset of the aggregated parameters. We validate our FED-PUB for its subgraph FL performance on six datasets, considering both non-overlapping and overlapping subgraphs, on which it significantly outperforms relevant baselines. Our code is available at https://github.com/JinheonBaek/FED-PUB.

Peregrine: A Pattern-Aware Graph Mining System

Graph mining workloads aim to extract structural properties of a graph by exploring its subgraph structures. General purpose graph mining systems provide a generic runtime to explore subgraph structures of interest with the help of user-defined functions that guide the overall exploration process. However, the state-of-the-art graph mining systems remain largely oblivious to the shape (or pattern) of the subgraphs that they mine. This causes them to: (a) explore unnecessary subgraphs; (b) perform expensive computations on the explored subgraphs; and, (c) hold intermediate partial subgraphs in memory; all of which affect their overall performance. Furthermore, their programming models are often tied to their underlying exploration strategies, which makes it difficult for domain users to express complex mining tasks. In this paper, we develop Peregrine, a pattern-aware graph mining system that directly explores the subgraphs of interest while avoiding exploration of unnecessary subgraphs, and simultaneously bypassing expensive computations throughout the mining process. We design a pattern-based programming model that treats "graph patterns" as first class constructs and enables Peregrine to extract the semantics of patterns, which it uses to guide its exploration. Our evaluation shows that Peregrine outperforms state-of-the-art distributed and single machine graph mining systems, and scales to complex mining tasks on larger graphs, while retaining simplicity and expressivity with its "pattern-first" programming approach.

Integrating Knowledge Graph embedding and pretrained Language Models in Hypercomplex Spaces

Knowledge Graphs, such as Wikidata, comprise structural and textual knowledge in order to represent knowledge. For each of the two modalities dedicated approaches for graph embedding and language models learn patterns that allow for predicting novel structural knowledge. Few approaches have integrated learning and inference with both modalities and these existing ones could only partially exploit the interaction of structural and textual knowledge. In our approach, we build on existing strong representations of single modalities and we use hypercomplex algebra to represent both, (i), single-modality embedding as well as, (ii), the interaction between different modalities and their complementary means of knowledge representation. More specifically, we suggest Dihedron and Quaternion representations of 4D hypercomplex numbers to integrate four modalities namely structural knowledge graph embedding, word-level representations (e.g.\ Word2vec, Fasttext), sentence-level representations (Sentence transformer), and document-level representations (sentence transformer, Doc2vec). Our unified vector representation scores the plausibility of labelled edges via Hamilton and Dihedron products, thus modeling pairwise interactions between different modalities. Extensive experimental evaluation on standard benchmark datasets shows the superiority of our two new models using abundant textual information besides sparse structural knowledge to enhance performance in link prediction tasks.

Can Large Language Models Analyze Graphs like Professionals? A Benchmark, Datasets and Models

The need to analyze graphs is ubiquitous across various fields, from social networks to biological research and recommendation systems. Therefore, enabling the ability of large language models (LLMs) to process graphs is an important step toward more advanced general intelligence. However, current LLM benchmarks on graph analysis require models to directly reason over the prompts describing graph topology, and are thus limited to small graphs with only a few dozens of nodes. In contrast, human experts typically write programs based on popular libraries for task solving, and can thus handle graphs with different scales. To this end, a question naturally arises: can LLMs analyze graphs like professionals? In this paper, we introduce ProGraph, a manually crafted benchmark containing 3 categories of graph tasks. The benchmark expects solutions based on programming instead of directly reasoning over raw inputs. Our findings reveal that the performance of current LLMs is unsatisfactory, with the best model achieving only 36% accuracy. To bridge this gap, we propose LLM4Graph datasets, which include crawled documents and auto-generated codes based on 6 widely used graph libraries. By augmenting closed-source LLMs with document retrieval and fine-tuning open-source ones on the codes, we show 11-32% absolute improvements in their accuracies. Our results underscore that the capabilities of LLMs in handling structured data are still under-explored, and show the effectiveness of LLM4Graph in enhancing LLMs' proficiency of graph analysis. The benchmark, datasets and enhanced open-source models are available at https://github.com/BUPT-GAMMA/ProGraph.

Heterogeneous Graph Representation Learning with Relation Awareness

Representation learning on heterogeneous graphs aims to obtain meaningful node representations to facilitate various downstream tasks, such as node classification and link prediction. Existing heterogeneous graph learning methods are primarily developed by following the propagation mechanism of node representations. There are few efforts on studying the role of relations for improving the learning of more fine-grained node representations. Indeed, it is important to collaboratively learn the semantic representations of relations and discern node representations with respect to different relation types. To this end, in this paper, we propose a novel Relation-aware Heterogeneous Graph Neural Network, namely R-HGNN, to learn node representations on heterogeneous graphs at a fine-grained level by considering relation-aware characteristics. Specifically, a dedicated graph convolution component is first designed to learn unique node representations from each relation-specific graph separately. Then, a cross-relation message passing module is developed to improve the interactions of node representations across different relations. Also, the relation representations are learned in a layer-wise manner to capture relation semantics, which are used to guide the node representation learning process. Moreover, a semantic fusing module is presented to aggregate relation-aware node representations into a compact representation with the learned relation representations. Finally, we conduct extensive experiments on a variety of graph learning tasks, and experimental results demonstrate that our approach consistently outperforms existing methods among all the tasks.

From Graphs to Hypergraphs: Hypergraph Projection and its Remediation

We study the implications of the modeling choice to use a graph, instead of a hypergraph, to represent real-world interconnected systems whose constituent relationships are of higher order by nature. Such a modeling choice typically involves an underlying projection process that maps the original hypergraph onto a graph, and is common in graph-based analysis. While hypergraph projection can potentially lead to loss of higher-order relations, there exists very limited studies on the consequences of doing so, as well as its remediation. This work fills this gap by doing two things: (1) we develop analysis based on graph and set theory, showing two ubiquitous patterns of hyperedges that are root to structural information loss in all hypergraph projections; we also quantify the combinatorial impossibility of recovering the lost higher-order structures if no extra help is provided; (2) we still seek to recover the lost higher-order structures in hypergraph projection, and in light of (1)'s findings we propose to relax the problem into a learning-based setting. Under this setting, we develop a learning-based hypergraph reconstruction method based on an important statistic of hyperedge distributions that we find. Our reconstruction method is evaluated on 8 real-world datasets under different settings, and exhibits consistently good performance. We also demonstrate benefits of the reconstructed hypergraphs via use cases of protein rankings and link predictions.

LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration

GraphRAG addresses significant challenges in Retrieval-Augmented Generation (RAG) by leveraging graphs with embedded knowledge to enhance the reasoning capabilities of Large Language Models (LLMs). Despite its promising potential, the GraphRAG community currently lacks a unified framework for fine-grained decomposition of the graph-based knowledge retrieval process. Furthermore, there is no systematic categorization or evaluation of existing solutions within the retrieval process. In this paper, we present LEGO-GraphRAG, a modular framework that decomposes the retrieval process of GraphRAG into three interconnected modules: subgraph-extraction, path-filtering, and path-refinement. We systematically summarize and classify the algorithms and neural network (NN) models relevant to each module, providing a clearer understanding of the design space for GraphRAG instances. Additionally, we identify key design factors, such as Graph Coupling and Computational Cost, that influence the effectiveness of GraphRAG implementations. Through extensive empirical studies, we construct high-quality GraphRAG instances using a representative selection of solutions and analyze their impact on retrieval and reasoning performance. Our findings offer critical insights into optimizing GraphRAG instance design, ultimately contributing to the advancement of more accurate and contextually relevant LLM applications.

One for All: Towards Training One Graph Model for All Classification Tasks

Designing a single model to address multiple tasks has been a long-standing objective in artificial intelligence. Recently, large language models have demonstrated exceptional capability in solving different tasks within the language domain. However, a unified model for various graph tasks remains underexplored, primarily due to the challenges unique to the graph learning domain. First, graph data from different areas carry distinct attributes and follow different distributions. Such discrepancy makes it hard to represent graphs in a single representation space. Second, tasks on graphs diversify into node, link, and graph tasks, requiring distinct embedding strategies. Finally, an appropriate graph prompting paradigm for in-context learning is unclear. We propose One for All (OFA), the first general framework that can use a single graph model to address the above challenges. Specifically, OFA proposes text-attributed graphs to unify different graph data by describing nodes and edges with natural language and uses language models to encode the diverse and possibly cross-domain text attributes to feature vectors in the same embedding space. Furthermore, OFA introduces the concept of nodes-of-interest to standardize different tasks with a single task representation. For in-context learning on graphs, OFA introduces a novel graph prompting paradigm that appends prompting substructures to the input graph, which enables it to address varied tasks without fine-tuning. We train the OFA model using graph data from multiple domains (including citation networks, molecular graphs, knowledge graphs, etc.) simultaneously and evaluate its ability in supervised, few-shot, and zero-shot learning scenarios. OFA performs well across different tasks, making it the first general-purpose across-domains classification model on graphs.

Breaking the Entanglement of Homophily and Heterophily in Semi-supervised Node Classification

Recently, graph neural networks (GNNs) have shown prominent performance in semi-supervised node classification by leveraging knowledge from the graph database. However, most existing GNNs follow the homophily assumption, where connected nodes are more likely to exhibit similar feature distributions and the same labels, and such an assumption has proven to be vulnerable in a growing number of practical applications. As a supplement, heterophily reflects dissimilarity in connected nodes, which has gained significant attention in graph learning. To this end, data engineers aim to develop a powerful GNN model that can ensure performance under both homophily and heterophily. Despite numerous attempts, most existing GNNs struggle to achieve optimal node representations due to the constraints of undirected graphs. The neglect of directed edges results in sub-optimal graph representations, thereby hindering the capacity of GNNs. To address this issue, we introduce AMUD, which quantifies the relationship between node profiles and topology from a statistical perspective, offering valuable insights for Adaptively Modeling the natural directed graphs as the Undirected or Directed graph to maximize the benefits from subsequent graph learning. Furthermore, we propose Adaptive Directed Pattern Aggregation (ADPA) as a new directed graph learning paradigm for AMUD. Empirical studies have demonstrated that AMUD guides efficient graph learning. Meanwhile, extensive experiments on 14 benchmark datasets substantiate the impressive performance of ADPA, outperforming baselines by significant margins of 3.96\%.

Conditional Graph Information Bottleneck for Molecular Relational Learning

Molecular relational learning, whose goal is to learn the interaction behavior between molecular pairs, got a surge of interest in molecular sciences due to its wide range of applications. Recently, graph neural networks have recently shown great success in molecular relational learning by modeling a molecule as a graph structure, and considering atom-level interactions between two molecules. Despite their success, existing molecular relational learning methods tend to overlook the nature of chemistry, i.e., a chemical compound is composed of multiple substructures such as functional groups that cause distinctive chemical reactions. In this work, we propose a novel relational learning framework, called CGIB, that predicts the interaction behavior between a pair of graphs by detecting core subgraphs therein. The main idea is, given a pair of graphs, to find a subgraph from a graph that contains the minimal sufficient information regarding the task at hand conditioned on the paired graph based on the principle of conditional graph information bottleneck. We argue that our proposed method mimics the nature of chemical reactions, i.e., the core substructure of a molecule varies depending on which other molecule it interacts with. Extensive experiments on various tasks with real-world datasets demonstrate the superiority of CGIB over state-of-the-art baselines. Our code is available at https://github.com/Namkyeong/CGIB.

Inductive Entity Representations from Text via Link Prediction

Knowledge Graphs (KG) are of vital importance for multiple applications on the web, including information retrieval, recommender systems, and metadata annotation. Regardless of whether they are built manually by domain experts or with automatic pipelines, KGs are often incomplete. Recent work has begun to explore the use of textual descriptions available in knowledge graphs to learn vector representations of entities in order to preform link prediction. However, the extent to which these representations learned for link prediction generalize to other tasks is unclear. This is important given the cost of learning such representations. Ideally, we would prefer representations that do not need to be trained again when transferring to a different task, while retaining reasonable performance. In this work, we propose a holistic evaluation protocol for entity representations learned via a link prediction objective. We consider the inductive link prediction and entity classification tasks, which involve entities not seen during training. We also consider an information retrieval task for entity-oriented search. We evaluate an architecture based on a pretrained language model, that exhibits strong generalization to entities not observed during training, and outperforms related state-of-the-art methods (22% MRR improvement in link prediction on average). We further provide evidence that the learned representations transfer well to other tasks without fine-tuning. In the entity classification task we obtain an average improvement of 16% in accuracy compared with baselines that also employ pre-trained models. In the information retrieval task, we obtain significant improvements of up to 8.8% in NDCG@10 for natural language queries. We thus show that the learned representations are not limited KG-specific tasks, and have greater generalization properties than evaluated in previous work.

Simplicial Closure and higher-order link prediction

Networks provide a powerful formalism for modeling complex systems by using a model of pairwise interactions. But much of the structure within these systems involves interactions that take place among more than two nodes at once; for example, communication within a group rather than person-to person, collaboration among a team rather than a pair of coauthors, or biological interaction between a set of molecules rather than just two. Such higher-order interactions are ubiquitous, but their empirical study has received limited attention, and little is known about possible organizational principles of such structures. Here we study the temporal evolution of 19 datasets with explicit accounting for higher-order interactions. We show that there is a rich variety of structure in our datasets but datasets from the same system types have consistent patterns of higher-order structure. Furthermore, we find that tie strength and edge density are competing positive indicators of higher-order organization, and these trends are consistent across interactions involving differing numbers of nodes. To systematically further the study of theories for such higher-order structures, we propose higher-order link prediction as a benchmark problem to assess models and algorithms that predict higher-order structure. We find a fundamental differences from traditional pairwise link prediction, with a greater role for local rather than long-range information in predicting the appearance of new interactions.

Neural Common Neighbor with Completion for Link Prediction

Despite its outstanding performance in various graph tasks, vanilla Message Passing Neural Network (MPNN) usually fails in link prediction tasks, as it only uses representations of two individual target nodes and ignores the pairwise relation between them. To capture the pairwise relations, some models add manual features to the input graph and use the output of MPNN to produce pairwise representations. In contrast, others directly use manual features as pairwise representations. Though this simplification avoids applying a GNN to each link individually and thus improves scalability, these models still have much room for performance improvement due to the hand-crafted and unlearnable pairwise features. To upgrade performance while maintaining scalability, we propose Neural Common Neighbor (NCN), which uses learnable pairwise representations. To further boost NCN, we study the unobserved link problem. The incompleteness of the graph is ubiquitous and leads to distribution shifts between the training and test set, loss of common neighbor information, and performance degradation of models. Therefore, we propose two intervention methods: common neighbor completion and target link removal. Combining the two methods with NCN, we propose Neural Common Neighbor with Completion (NCNC). NCN and NCNC outperform recent strong baselines by large margins. NCNC achieves state-of-the-art performance in link prediction tasks. Our code is available at https://github.com/GraphPKU/NeuralCommonNeighbor.

Neighborhood-aware Scalable Temporal Network Representation Learning

Temporal networks have been widely used to model real-world complex systems such as financial systems and e-commerce systems. In a temporal network, the joint neighborhood of a set of nodes often provides crucial structural information useful for predicting whether they may interact at a certain time. However, recent representation learning methods for temporal networks often fail to extract such information or depend on online construction of structural features, which is time-consuming. To address the issue, this work proposes Neighborhood-Aware Temporal network model (NAT). For each node in the network, NAT abandons the commonly-used one-single-vector-based representation while adopting a novel dictionary-type neighborhood representation. Such a dictionary representation records a downsampled set of the neighboring nodes as keys, and allows fast construction of structural features for a joint neighborhood of multiple nodes. We also design a dedicated data structure termed N-cache to support parallel access and update of those dictionary representations on GPUs. NAT gets evaluated over seven real-world large-scale temporal networks. NAT not only outperforms all cutting-edge baselines by averaged 1.2% and 4.2% in transductive and inductive link prediction accuracy, respectively, but also keeps scalable by achieving a speed-up of 4.1-76.7x against the baselines that adopt joint structural features and achieves a speed-up of 1.6-4.0x against the baselines that cannot adopt those features. The link to the code: https: //github.com/Graph-COM/Neighborhood-Aware-Temporal-Network.

Fast and Accurate Network Embeddings via Very Sparse Random Projection

We present FastRP, a scalable and performant algorithm for learning distributed node representations in a graph. FastRP is over 4,000 times faster than state-of-the-art methods such as DeepWalk and node2vec, while achieving comparable or even better performance as evaluated on several real-world networks on various downstream tasks. We observe that most network embedding methods consist of two components: construct a node similarity matrix and then apply dimension reduction techniques to this matrix. We show that the success of these methods should be attributed to the proper construction of this similarity matrix, rather than the dimension reduction method employed. FastRP is proposed as a scalable algorithm for network embeddings. Two key features of FastRP are: 1) it explicitly constructs a node similarity matrix that captures transitive relationships in a graph and normalizes matrix entries based on node degrees; 2) it utilizes very sparse random projection, which is a scalable optimization-free method for dimension reduction. An extra benefit from combining these two design choices is that it allows the iterative computation of node embeddings so that the similarity matrix need not be explicitly constructed, which further speeds up FastRP. FastRP is also advantageous for its ease of implementation, parallelization and hyperparameter tuning. The source code is available at https://github.com/GTmac/FastRP.

Graphlets correct for the topological information missed by random walks

Random walks are widely used for mining networks due to the computational efficiency of computing them. For instance, graph representation learning learns a d-dimensional embedding space, so that the nodes that tend to co-occur on random walks (a proxy of being in the same network neighborhood) are close in the embedding space. Specific local network topology (i.e., structure) influences the co-occurrence of nodes on random walks, so random walks of limited length capture only partial topological information, hence diminishing the performance of downstream methods. We explicitly capture all topological neighborhood information and improve performance by introducing orbit adjacencies that quantify the adjacencies of two nodes as co-occurring on a given pair of graphlet orbits, which are symmetric positions on graphlets (small, connected, non-isomorphic, induced subgraphs of a large network). Importantly, we mathematically prove that random walks on up to k nodes capture only a subset of all the possible orbit adjacencies for up to k-node graphlets. Furthermore, we enable orbit adjacency-based analysis of networks by developing an efficient GRaphlet-orbit ADjacency COunter (GRADCO), which exhaustively computes all 28 orbit adjacency matrices for up to four-node graphlets. Note that four-node graphlets suffice, because real networks are usually small-world. In large networks on around 20,000 nodes, GRADCOcomputesthe28matricesinminutes. Onsixrealnetworksfromvarious domains, we compare the performance of node-label predictors obtained by using the network embeddings based on our orbit adjacencies to those based on random walks. We find that orbit adjacencies, which include those unseen by random walks, outperform random walk-based adjacencies, demonstrating the importance of the inclusion of the topological neighborhood information that is unseen by random walks.

A Generalization of Transformer Networks to Graphs

We propose a generalization of transformer neural network architecture for arbitrary graphs. The original transformer was designed for Natural Language Processing (NLP), which operates on fully connected graphs representing all connections between the words in a sequence. Such architecture does not leverage the graph connectivity inductive bias, and can perform poorly when the graph topology is important and has not been encoded into the node features. We introduce a graph transformer with four new properties compared to the standard model. First, the attention mechanism is a function of the neighborhood connectivity for each node in the graph. Second, the positional encoding is represented by the Laplacian eigenvectors, which naturally generalize the sinusoidal positional encodings often used in NLP. Third, the layer normalization is replaced by a batch normalization layer, which provides faster training and better generalization performance. Finally, the architecture is extended to edge feature representation, which can be critical to tasks s.a. chemistry (bond type) or link prediction (entity relationship in knowledge graphs). Numerical experiments on a graph benchmark demonstrate the performance of the proposed graph transformer architecture. This work closes the gap between the original transformer, which was designed for the limited case of line graphs, and graph neural networks, that can work with arbitrary graphs. As our architecture is simple and generic, we believe it can be used as a black box for future applications that wish to consider transformer and graphs.

Graph Self-supervised Learning with Accurate Discrepancy Learning

Self-supervised learning of graph neural networks (GNNs) aims to learn an accurate representation of the graphs in an unsupervised manner, to obtain transferable representations of them for diverse downstream tasks. Predictive learning and contrastive learning are the two most prevalent approaches for graph self-supervised learning. However, they have their own drawbacks. While the predictive learning methods can learn the contextual relationships between neighboring nodes and edges, they cannot learn global graph-level similarities. Contrastive learning, while it can learn global graph-level similarities, its objective to maximize the similarity between two differently perturbed graphs may result in representations that cannot discriminate two similar graphs with different properties. To tackle such limitations, we propose a framework that aims to learn the exact discrepancy between the original and the perturbed graphs, coined as Discrepancy-based Self-supervised LeArning (D-SLA). Specifically, we create multiple perturbations of the given graph with varying degrees of similarity, and train the model to predict whether each graph is the original graph or the perturbed one. Moreover, we further aim to accurately capture the amount of discrepancy for each perturbed graph using the graph edit distance. We validate our D-SLA on various graph-related downstream tasks, including molecular property prediction, protein function prediction, and link prediction tasks, on which ours largely outperforms relevant baselines.

EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs

Graph representation learning resurges as a trending research subject owing to the widespread use of deep learning for Euclidean data, which inspire various creative designs of neural networks in the non-Euclidean domain, particularly graphs. With the success of these graph neural networks (GNN) in the static setting, we approach further practical scenarios where the graph dynamically evolves. Existing approaches typically resort to node embeddings and use a recurrent neural network (RNN, broadly speaking) to regulate the embeddings and learn the temporal dynamics. These methods require the knowledge of a node in the full time span (including both training and testing) and are less applicable to the frequent change of the node set. In some extreme scenarios, the node sets at different time steps may completely differ. To resolve this challenge, we propose EvolveGCN, which adapts the graph convolutional network (GCN) model along the temporal dimension without resorting to node embeddings. The proposed approach captures the dynamism of the graph sequence through using an RNN to evolve the GCN parameters. Two architectures are considered for the parameter evolution. We evaluate the proposed approach on tasks including link prediction, edge classification, and node classification. The experimental results indicate a generally higher performance of EvolveGCN compared with related approaches. The code is available at https://github.com/IBM/EvolveGCN.

Sheaf Neural Networks for Graph-based Recommender Systems

Recent progress in Graph Neural Networks has resulted in wide adoption by many applications, including recommendation systems. The reason for Graph Neural Networks' superiority over other approaches is that many problems in recommendation systems can be naturally modeled as graphs, where nodes can be either users or items and edges represent preference relationships. In current Graph Neural Network approaches, nodes are represented with a static vector learned at training time. This static vector might only be suitable to capture some of the nuances of users or items they define. To overcome this limitation, we propose using a recently proposed model inspired by category theory: Sheaf Neural Networks. Sheaf Neural Networks, and its connected Laplacian, can address the previous problem by associating every node (and edge) with a vector space instead than a single vector. The vector space representation is richer and allows picking the proper representation at inference time. This approach can be generalized for different related tasks on graphs and achieves state-of-the-art performance in terms of F1-Score@N in collaborative filtering and Hits@20 in link prediction. For collaborative filtering, the approach is evaluated on the MovieLens 100K with a 5.1% improvement, on MovieLens 1M with a 5.4% improvement and on Book-Crossing with a 2.8% improvement, while for link prediction on the ogbl-ddi dataset with a 1.6% refinement with respect to the respective baselines.

Leveraging Pre-trained Language Models for Time Interval Prediction in Text-Enhanced Temporal Knowledge Graphs

Most knowledge graph completion (KGC) methods learn latent representations of entities and relations of a given graph by mapping them into a vector space. Although the majority of these methods focus on static knowledge graphs, a large number of publicly available KGs contain temporal information stating the time instant/period over which a certain fact has been true. Such graphs are often known as temporal knowledge graphs. Furthermore, knowledge graphs may also contain textual descriptions of entities and relations. Both temporal information and textual descriptions are not taken into account during representation learning by static KGC methods, and only structural information of the graph is leveraged. Recently, some studies have used temporal information to improve link prediction, yet they do not exploit textual descriptions and do not support inductive inference (prediction on entities that have not been seen in training). We propose a novel framework called TEMT that exploits the power of pre-trained language models (PLMs) for text-enhanced temporal knowledge graph completion. The knowledge stored in the parameters of a PLM allows TEMT to produce rich semantic representations of facts and to generalize on previously unseen entities. TEMT leverages textual and temporal information available in a KG, treats them separately, and fuses them to get plausibility scores of facts. Unlike previous approaches, TEMT effectively captures dependencies across different time points and enables predictions on unseen entities. To assess the performance of TEMT, we carried out several experiments including time interval prediction, both in transductive and inductive settings, and triple classification. The experimental results show that TEMT is competitive with the state-of-the-art.

G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering

Given a graph with textual attributes, we enable users to `chat with their graph': that is, to ask questions about the graph using a conversational interface. In response to a user's questions, our method provides textual replies and highlights the relevant parts of the graph. While existing works integrate large language models (LLMs) and graph neural networks (GNNs) in various ways, they mostly focus on either conventional graph tasks (such as node, edge, and graph classification), or on answering simple graph queries on small or synthetic graphs. In contrast, we develop a flexible question-answering framework targeting real-world textual graphs, applicable to multiple applications including scene graph understanding, common sense reasoning, and knowledge graph reasoning. Toward this goal, we first develop a Graph Question Answering (GraphQA) benchmark with data collected from different tasks. Then, we propose our G-Retriever method, introducing the first retrieval-augmented generation (RAG) approach for general textual graphs, which can be fine-tuned to enhance graph understanding via soft prompting. To resist hallucination and to allow for textual graphs that greatly exceed the LLM's context window size, G-Retriever performs RAG over a graph by formulating this task as a Prize-Collecting Steiner Tree optimization problem. Empirical evaluations show that our method outperforms baselines on textual graph tasks from multiple domains, scales well with larger graph sizes, and mitigates hallucination.~Our codes and datasets are available at: \url{https://github.com/XiaoxinHe/G-Retriever}

IRWE: Inductive Random Walk for Joint Inference of Identity and Position Network Embedding

Network embedding, which maps graphs to distributed representations, is a unified framework for various graph inference tasks. According to the topology properties (e.g., structural roles and community memberships of nodes) to be preserved, it can be categorized into the identity and position embedding. However, existing methods can only capture one type of property. Some approaches can support the inductive inference that generalizes the embedding model to new nodes or graphs but relies on the availability of attributes. Due to the complicated correlations between topology and attributes, it is unclear for some inductive methods which type of property they can capture. In this study, we explore a unified framework for the joint inductive inference of identity and position embeddings without attributes. An inductive random walk embedding (IRWE) method is proposed, which combines multiple attention units to handle the random walk on graph topology and simultaneously derives identity and position embeddings that are jointly optimized. In particular, we demonstrate that some random walk statistics can be informative features to characterize node identities and positions while supporting the inductive embedding inference. Experiments validate the superior performance of IRWE beyond various baselines for the transductive and inductive inference of identity and position embeddings.

LEMON: LanguagE ModeL for Negative Sampling of Knowledge Graph Embeddings

Knowledge Graph Embedding models have become an important area of machine learning.Those models provide a latent representation of entities and relations in a knowledge graph which can then be used in downstream machine learning tasks such as link prediction. The learning process of such models can be performed by contrasting positive and negative triples. While all triples of a KG are considered positive, negative triples are usually not readily available. Therefore, the choice of the sampling method to obtain the negative triples play a crucial role in the performance and effectiveness of Knowledge Graph Embedding models. Most of the current methods fetch negative samples from a random distribution of entities in the underlying Knowledge Graph which also often includes meaningless triples. Other known methods use adversarial techniques or generative neural networks which consequently reduce the efficiency of the process. In this paper, we propose an approach for generating informative negative samples considering available complementary knowledge about entities. Particularly, Pre-trained Language Models are used to form neighborhood clusters by utilizing the distances between entities to obtain representations of symbolic entities via their textual information. Our comprehensive evaluations demonstrate the effectiveness of the proposed approach on benchmark Knowledge Graphs with textual information for the link prediction task.

Isomorphic-Consistent Variational Graph Auto-Encoders for Multi-Level Graph Representation Learning

Graph representation learning is a fundamental research theme and can be generalized to benefit multiple downstream tasks from the node and link levels to the higher graph level. In practice, it is desirable to develop task-agnostic general graph representation learning methods that are typically trained in an unsupervised manner. Related research reveals that the power of graph representation learning methods depends on whether they can differentiate distinct graph structures as different embeddings and map isomorphic graphs to consistent embeddings (i.e., the isomorphic consistency of graph models). However, for task-agnostic general graph representation learning, existing unsupervised graph models, represented by the variational graph auto-encoders (VGAEs), can only keep the isomorphic consistency within the subgraphs of 1-hop neighborhoods and thus usually manifest inferior performance on the more difficult higher-level tasks. To overcome the limitations of existing unsupervised methods, in this paper, we propose the Isomorphic-Consistent VGAE (IsoC-VGAE) for multi-level task-agnostic graph representation learning. We first devise a decoding scheme to provide a theoretical guarantee of keeping the isomorphic consistency under the settings of unsupervised learning. We then propose the Inverse Graph Neural Network (Inv-GNN) decoder as its intuitive realization, which trains the model via reconstructing the GNN node embeddings with multi-hop neighborhood information, so as to maintain the high-order isomorphic consistency within the VGAE framework. We conduct extensive experiments on the representative graph learning tasks at different levels, including node classification, link prediction and graph classification, and the results verify that our proposed model generally outperforms both the state-of-the-art unsupervised methods and representative supervised methods.

Disentangled Structural and Featural Representation for Task-Agnostic Graph Valuation

With the emergence of data marketplaces, the demand for methods to assess the value of data has increased significantly. While numerous techniques have been proposed for this purpose, none have specifically addressed graphs as the main data modality. Graphs are widely used across various fields, ranging from chemical molecules to social networks. In this study, we break down graphs into two main components: structural and featural, and we focus on evaluating data without relying on specific task-related metrics, making it applicable in practical scenarios where validation requirements may be lacking. We introduce a novel framework called blind message passing, which aligns the seller's and buyer's graphs using a shared node permutation based on graph matching. This allows us to utilize the graph Wasserstein distance to quantify the differences in the structural distribution of graph datasets, called the structural disparities. We then consider featural aspects of buyers' and sellers' graphs for data valuation and capture their statistical similarities and differences, referred to as relevance and diversity, respectively. Our approach ensures that buyers and sellers remain unaware of each other's datasets. Our experiments on real datasets demonstrate the effectiveness of our approach in capturing the relevance, diversity, and structural disparities of seller data for buyers, particularly in graph-based data valuation scenarios.

OpenGraph: Towards Open Graph Foundation Models

Graph learning has become indispensable for interpreting and harnessing relational data in diverse fields, ranging from recommendation systems to social network analysis. In this context, a variety of GNNs have emerged as promising methodologies for encoding the structural information of graphs. By effectively capturing the graph's underlying structure, these GNNs have shown great potential in enhancing performance in graph learning tasks, such as link prediction and node classification. However, despite their successes, a significant challenge persists: these advanced methods often face difficulties in generalizing to unseen graph data that significantly differs from the training instances. In this work, our aim is to advance the graph learning paradigm by developing a general graph foundation model. This model is designed to understand the complex topological patterns present in diverse graph data, enabling it to excel in zero-shot graph learning tasks across different downstream datasets. To achieve this goal, we address several key technical challenges in our OpenGraph model. Firstly, we propose a unified graph tokenizer to adapt our graph model to generalize well on unseen graph data, even when the underlying graph properties differ significantly from those encountered during training. Secondly, we develop a scalable graph transformer as the foundational encoder, which effectively captures node-wise dependencies within the global topological context. Thirdly, we introduce a data augmentation mechanism enhanced by a LLM to alleviate the limitations of data scarcity in real-world scenarios. Extensive experiments validate the effectiveness of our framework. By adapting our OpenGraph to new graph characteristics and comprehending the nuances of diverse graphs, our approach achieves remarkable zero-shot graph learning performance across various settings and domains.

Invariant Graph Transformer

Rationale discovery is defined as finding a subset of the input data that maximally supports the prediction of downstream tasks. In graph machine learning context, graph rationale is defined to locate the critical subgraph in the given graph topology, which fundamentally determines the prediction results. In contrast to the rationale subgraph, the remaining subgraph is named the environment subgraph. Graph rationalization can enhance the model performance as the mapping between the graph rationale and prediction label is viewed as invariant, by assumption. To ensure the discriminative power of the extracted rationale subgraphs, a key technique named "intervention" is applied. The core idea of intervention is that given any changing environment subgraphs, the semantics from the rationale subgraph is invariant, which guarantees the correct prediction result. However, most, if not all, of the existing rationalization works on graph data develop their intervention strategies on the graph level, which is coarse-grained. In this paper, we propose well-tailored intervention strategies on graph data. Our idea is driven by the development of Transformer models, whose self-attention module provides rich interactions between input nodes. Based on the self-attention module, our proposed invariant graph Transformer (IGT) can achieve fine-grained, more specifically, node-level and virtual node-level intervention. Our comprehensive experiments involve 7 real-world datasets, and the proposed IGT shows significant performance advantages compared to 13 baseline methods.

Unifying Structure and Language Semantic for Efficient Contrastive Knowledge Graph Completion with Structured Entity Anchors

The goal of knowledge graph completion (KGC) is to predict missing links in a KG using trained facts that are already known. In recent, pre-trained language model (PLM) based methods that utilize both textual and structural information are emerging, but their performances lag behind state-of-the-art (SOTA) structure-based methods or some methods lose their inductive inference capabilities in the process of fusing structure embedding to text encoder. In this paper, we propose a novel method to effectively unify structure information and language semantics without losing the power of inductive reasoning. We adopt entity anchors and these anchors and textual description of KG elements are fed together into the PLM-based encoder to learn unified representations. In addition, the proposed method utilizes additional random negative samples which can be reused in the each mini-batch during contrastive learning to learn a generalized entity representations. We verify the effectiveness of the our proposed method through various experiments and analysis. The experimental results on standard benchmark widely used in link prediction task show that the proposed model outperforms existing the SOTA KGC models. Especially, our method show the largest performance improvement on FB15K-237, which is competitive to the SOTA of structure-based KGC methods.

Large Language Models on Graphs: A Comprehensive Survey

Large language models (LLMs), such as ChatGPT and LLaMA, are creating significant advancements in natural language processing, due to their strong text encoding/decoding ability and newly found emergent capability (e.g., reasoning). While LLMs are mainly designed to process pure texts, there are many real-world scenarios where text data are associated with rich structure information in the form of graphs (e.g., academic networks, and e-commerce networks) or scenarios where graph data are paired with rich textual information (e.g., molecules with descriptions). Besides, although LLMs have shown their pure text-based reasoning ability, it is underexplored whether such ability can be generalized to graph scenarios (i.e., graph-based reasoning). In this paper, we provide a systematic review of scenarios and techniques related to large language models on graphs. We first summarize potential scenarios of adopting LLMs on graphs into three categories, namely pure graphs, text-rich graphs, and text-paired graphs. We then discuss detailed techniques for utilizing LLMs on graphs, including LLM as Predictor, LLM as Encoder, and LLM as Aligner, and compare the advantages and disadvantages of different schools of models. Furthermore, we mention the real-world applications of such methods and summarize open-source codes and benchmark datasets. Finally, we conclude with potential future research directions in this fast-growing field. The related source can be found at https://github.com/PeterGriffinJin/Awesome-Language-Model-on-Graphs.

Learning to Represent Programs with Heterogeneous Graphs

Program source code contains complex structure information, which can be represented in structured data forms like trees or graphs. To acquire the structural information in source code, most existing researches use abstract syntax trees (AST). A group of works add additional edges to ASTs to convert source code into graphs and use graph neural networks to learn representations for program graphs. Although these works provide additional control or data flow information to ASTs for downstream tasks, they neglect an important aspect of structure information in AST itself: the different types of nodes and edges. In ASTs, different nodes contain different kinds of information like variables or control flow, and the relation between a node and all its children can also be different. To address the information of node and edge types, we bring the idea of heterogeneous graphs to learning on source code and present a new formula of building heterogeneous program graphs from ASTs with additional type information for nodes and edges. We use the ASDL grammar of programming language to define the node and edge types of program graphs. Then we use heterogeneous graph neural networks to learn on these graphs. We evaluate our approach on two tasks: code comment generation and method naming. Both tasks require reasoning on the semantics of complete code snippets. Experiment results show that our approach outperforms baseline models, including homogeneous graph-based models, showing that leveraging the type information of nodes and edges in program graphs can help in learning program semantics.

Towards Deeper Graph Neural Networks

Graph neural networks have shown significant success in the field of graph representation learning. Graph convolutions perform neighborhood aggregation and represent one of the most important graph operations. Nevertheless, one layer of these neighborhood aggregation methods only consider immediate neighbors, and the performance decreases when going deeper to enable larger receptive fields. Several recent studies attribute this performance deterioration to the over-smoothing issue, which states that repeated propagation makes node representations of different classes indistinguishable. In this work, we study this observation systematically and develop new insights towards deeper graph neural networks. First, we provide a systematical analysis on this issue and argue that the key factor compromising the performance significantly is the entanglement of representation transformation and propagation in current graph convolution operations. After decoupling these two operations, deeper graph neural networks can be used to learn graph node representations from larger receptive fields. We further provide a theoretical analysis of the above observation when building very deep models, which can serve as a rigorous and gentle description of the over-smoothing issue. Based on our theoretical and empirical analysis, we propose Deep Adaptive Graph Neural Network (DAGNN) to adaptively incorporate information from large receptive fields. A set of experiments on citation, co-authorship, and co-purchase datasets have confirmed our analysis and insights and demonstrated the superiority of our proposed methods.

RESTORE: Graph Embedding Assessment Through Reconstruction

Following the success of Word2Vec embeddings, graph embeddings (GEs) have gained substantial traction. GEs are commonly generated and evaluated extrinsically on downstream applications, but intrinsic evaluations of the original graph properties in terms of topological structure and semantic information have been lacking. Understanding these will help identify the deficiency of the various families of GE methods when vectorizing graphs in terms of preserving the relevant knowledge or learning incorrect knowledge. To address this, we propose RESTORE, a framework for intrinsic GEs assessment through graph reconstruction. We show that reconstructing the original graph from the underlying GEs yields insights into the relative amount of information preserved in a given vector form. We first introduce the graph reconstruction task. We generate GEs from three GE families based on factorization methods, random walks, and deep learning (with representative algorithms from each family) on the CommonSense Knowledge Graph (CSKG). We analyze their effectiveness in preserving the (a) topological structure of node-level graph reconstruction with an increasing number of hops and (b) semantic information on various word semantic and analogy tests. Our evaluations show deep learning-based GE algorithm (SDNE) is overall better at preserving (a) with a mean average precision (mAP) of 0.54 and 0.35 for 2 and 3-hop reconstruction respectively, while the factorization-based algorithm (HOPE) is better at encapsulating (b) with an average Euclidean distance of 0.14, 0.17, and 0.11 for 1, 2, and 3-hop reconstruction respectively. The modest performance of these GEs leaves room for further research avenues on better graph representation learning.

Efficient Maximum Fair Clique Search over Large Networks

Mining cohesive subgraphs in attributed graphs is an essential problem in the domain of graph data analysis. The integration of fairness considerations significantly fuels interest in models and algorithms for mining fairness-aware cohesive subgraphs. Notably, the relative fair clique emerges as a robust model, ensuring not only comprehensive attribute coverage but also greater flexibility in distributing attribute vertices. Motivated by the strength of this model, we for the first time pioneer an investigation into the identification of the maximum relative fair clique in large-scale graphs. We introduce a novel concept of colorful support, which serves as the foundation for two innovative graph reduction techniques. These techniques effectively narrow the graph's size by iteratively removing edges that do not belong to relative fair cliques. Furthermore, a series of upper bounds of the maximum relative fair clique size is proposed by incorporating consideration of vertex attributes and colors. The pruning techniques derived from these upper bounds can significantly trim unnecessary search space during the branch-and-bound procedure. Adding to this, we present a heuristic algorithm with a linear time complexity, employing both a degree-based greedy strategy and a colored degree-based greedy strategy to identify a larger relative fair clique. This heuristic algorithm can serve a dual purpose by aiding in branch pruning, thereby enhancing overall search efficiency. Extensive experiments conducted on six real-life datasets demonstrate the efficiency, scalability, and effectiveness of our algorithms.

On the Power of the Weisfeiler-Leman Test for Graph Motif Parameters

Seminal research in the field of graph neural networks (GNNs) has revealed a direct correspondence between the expressive capabilities of GNNs and the k-dimensional Weisfeiler-Leman (kWL) test, a widely-recognized method for verifying graph isomorphism. This connection has reignited interest in comprehending the specific graph properties effectively distinguishable by the kWL test. A central focus of research in this field revolves around determining the least dimensionality k, for which kWL can discern graphs with different number of occurrences of a pattern graph P. We refer to such a least k as the WL-dimension of this pattern counting problem. This inquiry traditionally delves into two distinct counting problems related to patterns: subgraph counting and induced subgraph counting. Intriguingly, despite their initial appearance as separate challenges with seemingly divergent approaches, both of these problems are interconnected components of a more comprehensive problem: "graph motif parameters". In this paper, we provide a precise characterization of the WL-dimension of labeled graph motif parameters. As specific instances of this result, we obtain characterizations of the WL-dimension of the subgraph counting and induced subgraph counting problem for every labeled pattern P. We additionally demonstrate that in cases where the kWL test distinguishes between graphs with varying occurrences of a pattern P, the exact number of occurrences of P can be computed uniformly using only local information of the last layer of a corresponding GNN. We finally delve into the challenge of recognizing the WL-dimension of various graph parameters. We give a polynomial time algorithm for determining the WL-dimension of the subgraph counting problem for given pattern P, answering an open question from previous work.

Leveraging Large Language Models for Node Generation in Few-Shot Learning on Text-Attributed Graphs

Text-attributed graphs have recently garnered significant attention due to their wide range of applications in web domains. Existing methodologies employ word embedding models for acquiring text representations as node features, which are subsequently fed into Graph Neural Networks (GNNs) for training. Recently, the advent of Large Language Models (LLMs) has introduced their powerful capabilities in information retrieval and text generation, which can greatly enhance the text attributes of graph data. Furthermore, the acquisition and labeling of extensive datasets are both costly and time-consuming endeavors. Consequently, few-shot learning has emerged as a crucial problem in the context of graph learning tasks. In order to tackle this challenge, we propose a lightweight paradigm called LLM4NG, which adopts a plug-and-play approach to empower text-attributed graphs through node generation using LLMs. Specifically, we utilize LLMs to extract semantic information from the labels and generate samples that belong to these categories as exemplars. Subsequently, we employ an edge predictor to capture the structural information inherent in the raw dataset and integrate the newly generated samples into the original graph. This approach harnesses LLMs for enhancing class-level information and seamlessly introduces labeled nodes and edges without modifying the raw dataset, thereby facilitating the node classification task in few-shot scenarios. Extensive experiments demonstrate the outstanding performance of our proposed paradigm, particularly in low-shot scenarios. For instance, in the 1-shot setting of the ogbn-arxiv dataset, LLM4NG achieves a 76% improvement over the baseline model.

Explanation Graph Generation via Pre-trained Language Models: An Empirical Study with Contrastive Learning

Pre-trained sequence-to-sequence language models have led to widespread success in many natural language generation tasks. However, there has been relatively less work on analyzing their ability to generate structured outputs such as graphs. Unlike natural language, graphs have distinct structural and semantic properties in the context of a downstream NLP task, e.g., generating a graph that is connected and acyclic can be attributed to its structural constraints, while the semantics of a graph can refer to how meaningfully an edge represents the relation between two node concepts. In this work, we study pre-trained language models that generate explanation graphs in an end-to-end manner and analyze their ability to learn the structural constraints and semantics of such graphs. We first show that with limited supervision, pre-trained language models often generate graphs that either violate these constraints or are semantically incoherent. Since curating large amount of human-annotated graphs is expensive and tedious, we propose simple yet effective ways of graph perturbations via node and edge edit operations that lead to structurally and semantically positive and negative graphs. Next, we leverage these graphs in different contrastive learning models with Max-Margin and InfoNCE losses. Our methods lead to significant improvements in both structural and semantic accuracy of explanation graphs and also generalize to other similar graph generation tasks. Lastly, we show that human errors are the best negatives for contrastive learning and also that automatically generating more such human-like negative graphs can lead to further improvements. Our code and models are publicly available at https://github.com/swarnaHub/ExplagraphGen

GRAG: Graph Retrieval-Augmented Generation

While Retrieval-Augmented Generation (RAG) enhances the accuracy and relevance of responses by generative language models, it falls short in graph-based contexts where both textual and topological information are important. Naive RAG approaches inherently neglect the structural intricacies of textual graphs, resulting in a critical gap in the generation process. To address this challenge, we introduce Graph Retrieval-Augmented Generation (GRAG), which significantly enhances both the retrieval and generation processes by emphasizing the importance of subgraph structures. Unlike RAG approaches that focus solely on text-based entity retrieval, GRAG maintains an acute awareness of graph topology, which is crucial for generating contextually and factually coherent responses. Our GRAG approach consists of four main stages: indexing of k-hop ego-graphs, graph retrieval, soft pruning to mitigate the impact of irrelevant entities, and generation with pruned textual subgraphs. GRAG's core workflow-retrieving textual subgraphs followed by soft pruning-efficiently identifies relevant subgraph structures while avoiding the computational infeasibility typical of exhaustive subgraph searches, which are NP-hard. Moreover, we propose a novel prompting strategy that achieves lossless conversion from textual subgraphs to hierarchical text descriptions. Extensive experiments on graph multi-hop reasoning benchmarks demonstrate that in scenarios requiring multi-hop reasoning on textual graphs, our GRAG approach significantly outperforms current state-of-the-art RAG methods while effectively mitigating hallucinations.

Local Graph Clustering with Noisy Labels

The growing interest in machine learning problems over graphs with additional node information such as texts, images, or labels has popularized methods that require the costly operation of processing the entire graph. Yet, little effort has been made to the development of fast local methods (i.e. without accessing the entire graph) that extract useful information from such data. To that end, we propose a study of local graph clustering using noisy node labels as a proxy for additional node information. In this setting, nodes receive initial binary labels based on cluster affiliation: 1 if they belong to the target cluster and 0 otherwise. Subsequently, a fraction of these labels is flipped. We investigate the benefits of incorporating noisy labels for local graph clustering. By constructing a weighted graph with such labels, we study the performance of graph diffusion-based local clustering method on both the original and the weighted graphs. From a theoretical perspective, we consider recovering an unknown target cluster with a single seed node in a random graph with independent noisy node labels. We provide sufficient conditions on the label noise under which, with high probability, using diffusion in the weighted graph yields a more accurate recovery of the target cluster. This approach proves more effective than using the given labels alone or using diffusion in the label-free original graph. Empirically, we show that reliable node labels can be obtained with just a few samples from an attributed graph. Moreover, utilizing these labels via diffusion in the weighted graph leads to significantly better local clustering performance across several real-world datasets, improving F1 scores by up to 13%.

Variationally Regularized Graph-based Representation Learning for Electronic Health Records

Electronic Health Records (EHR) are high-dimensional data with implicit connections among thousands of medical concepts. These connections, for instance, the co-occurrence of diseases and lab-disease correlations can be informative when only a subset of these variables is documented by the clinician. A feasible approach to improving the representation learning of EHR data is to associate relevant medical concepts and utilize these connections. Existing medical ontologies can be the reference for EHR structures, but they place numerous constraints on the data source. Recent progress on graph neural networks (GNN) enables end-to-end learning of topological structures for non-grid or non-sequential data. However, there are problems to be addressed on how to learn the medical graph adaptively and how to understand the effect of the medical graph on representation learning. In this paper, we propose a variationally regularized encoder-decoder graph network that achieves more robustness in graph structure learning by regularizing node representations. Our model outperforms the existing graph and non-graph based methods in various EHR predictive tasks based on both public data and real-world clinical data. Besides the improvements in empirical experiment performances, we provide an interpretation of the effect of variational regularization compared to standard graph neural network, using singular value analysis.

Generating Drug Repurposing Hypotheses through the Combination of Disease-Specific Hypergraphs

The drug development pipeline for a new compound can last 10-20 years and cost over 10 billion. Drug repurposing offers a more time- and cost-effective alternative. Computational approaches based on biomedical knowledge graph representations have recently yielded new drug repurposing hypotheses. In this study, we present a novel, disease-specific hypergraph representation learning technique to derive contextual embeddings of biological pathways of various lengths but that all start at any given drug and all end at the disease of interest. Further, we extend this method to multi-disease hypergraphs. To determine the repurposing potential of each of the 1,522 drugs, we derive drug-specific distributions of cosine similarity values and ultimately consider the median for ranking. Cosine similarity values are computed between (1) all biological pathways starting at the considered drug and ending at the disease of interest and (2) all biological pathways starting at drugs currently prescribed against that disease and ending at the disease of interest. We illustrate our approach with Alzheimer's disease (AD) and two of its risk factors: hypertension (HTN) and type 2 diabetes (T2D). We compare each drug's rank across four hypergraph settings (single- or multi-disease): AD only, AD + HTN, AD + T2D, and AD + HTN + T2D. Notably, our framework led to the identification of two promising drugs whose repurposing potential was significantly higher in hypergraphs combining two diseases: dapagliflozin (antidiabetic; moved up, from top 32% to top 7%, across all considered drugs) and debrisoquine (antihypertensive; moved up, from top 76% to top 23%). Our approach serves as a hypothesis generation tool, to be paired with a validation pipeline relying on laboratory experiments and semi-automated parsing of the biomedical literature.

Rethinking the Power of Graph Canonization in Graph Representation Learning with Stability

The expressivity of Graph Neural Networks (GNNs) has been studied broadly in recent years to reveal the design principles for more powerful GNNs. Graph canonization is known as a typical approach to distinguish non-isomorphic graphs, yet rarely adopted when developing expressive GNNs. This paper proposes to maximize the expressivity of GNNs by graph canonization, then the power of such GNNs is studies from the perspective of model stability. A stable GNN will map similar graphs to close graph representations in the vectorial space, and the stability of GNNs is critical to generalize their performance to unseen graphs. We theoretically reveal the trade-off of expressivity and stability in graph-canonization-enhanced GNNs. Then we introduce a notion of universal graph canonization as the general solution to address the trade-off and characterize a widely applicable sufficient condition to solve the universal graph canonization. A comprehensive set of experiments demonstrates the effectiveness of the proposed method. In many popular graph benchmark datasets, graph canonization successfully enhances GNNs and provides highly competitive performance, indicating the capability and great potential of proposed method in general graph representation learning. In graph datasets where the sufficient condition holds, GNNs enhanced by universal graph canonization consistently outperform GNN baselines and successfully improve the SOTA performance up to 31%, providing the optimal solution to numerous challenging real-world graph analytical tasks like gene network representation learning in bioinformatics.

Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning

Representation learning on text-attributed graphs (TAGs) has become a critical research problem in recent years. A typical example of a TAG is a paper citation graph, where the text of each paper serves as node attributes. Initial graph neural network (GNN) pipelines handled these text attributes by transforming them into shallow or hand-crafted features, such as skip-gram or bag-of-words features. Recent efforts have focused on enhancing these pipelines with language models (LMs), which typically demand intricate designs and substantial computational resources. With the advent of powerful large language models (LLMs) such as GPT or Llama2, which demonstrate an ability to reason and to utilize general knowledge, there is a growing need for techniques which combine the textual modelling abilities of LLMs with the structural learning capabilities of GNNs. Hence, in this work, we focus on leveraging LLMs to capture textual information as features, which can be used to boost GNN performance on downstream tasks. A key innovation is our use of explanations as features: we prompt an LLM to perform zero-shot classification, request textual explanations for its decision-making process, and design an LLM-to-LM interpreter to translate these explanations into informative features for downstream GNNs. Our experiments demonstrate that our method achieves state-of-the-art results on well-established TAG datasets, including Cora, PubMed, ogbn-arxiv, as well as our newly introduced dataset, tape-arxiv23. Furthermore, our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv. Lastly, we believe the versatility of the proposed method extends beyond TAGs and holds the potential to enhance other tasks involving graph-text data. Our codes and datasets are available at: https://github.com/XiaoxinHe/TAPE.

Multi-Label Zero-Shot Product Attribute-Value Extraction

E-commerce platforms should provide detailed product descriptions (attribute values) for effective product search and recommendation. However, attribute value information is typically not available for new products. To predict unseen attribute values, large quantities of labeled training data are needed to train a traditional supervised learning model. Typically, it is difficult, time-consuming, and costly to manually label large quantities of new product profiles. In this paper, we propose a novel method to efficiently and effectively extract unseen attribute values from new products in the absence of labeled data (zero-shot setting). We propose HyperPAVE, a multi-label zero-shot attribute value extraction model that leverages inductive inference in heterogeneous hypergraphs. In particular, our proposed technique constructs heterogeneous hypergraphs to capture complex higher-order relations (i.e. user behavior information) to learn more accurate feature representations for graph nodes. Furthermore, our proposed HyperPAVE model uses an inductive link prediction mechanism to infer future connections between unseen nodes. This enables HyperPAVE to identify new attribute values without the need for labeled training data. We conduct extensive experiments with ablation studies on different categories of the MAVE dataset. The results demonstrate that our proposed HyperPAVE model significantly outperforms existing classification-based, generation-based large language models for attribute value extraction in the zero-shot setting.

Can LLMs be Good Graph Judger for Knowledge Graph Construction?

In real-world scenarios, most of the data obtained from information retrieval (IR) system is unstructured. Converting natural language sentences into structured Knowledge Graphs (KGs) remains a critical challenge. The quality of constructed KGs may also impact the performance of some KG-dependent domains like GraphRAG systems and recommendation systems. Recently, Large Language Models (LLMs) have demonstrated impressive capabilities in addressing a wide range of natural language processing tasks. However, there are still challenges when utilizing LLMs to address the task of generating structured KGs. And we have identified three limitations with respect to existing KG construction methods. (1)There is a large amount of information and excessive noise in real-world documents, which could result in extracting messy information. (2)Native LLMs struggle to effectively extract accuracy knowledge from some domain-specific documents. (3)Hallucinations phenomenon cannot be overlooked when utilizing LLMs directly as an unsupervised method for constructing KGs. In this paper, we propose GraphJudger, a knowledge graph construction framework to address the aforementioned challenges. We introduce three innovative modules in our method, which are entity-centric iterative text denoising, knowledge aware instruction tuning and graph judgement, respectively. We seek to utilize the capacity of LLMs to function as a graph judger, a capability superior to their role only as a predictor for KG construction problems. Experiments conducted on two general text-graph pair datasets and one domain-specific text-graph pair dataset show superior performances compared to baseline methods. The code of our proposed method is available at https://github.com/hhy-huang/GraphJudger.