FlowWalker: A Memory-Efficient and High-Performance GPU-Based Dynamic Graph Random Walk Framework
- Junyi Mei,
- Shixuan Sun,
- Chao Li,
- Cheng Xu,
- Cheng Chen,
- Yibo Liu,
- Jing Wang,
- Cheng Zhao,
- Xiaofeng Hou,
- Minyi Guo,
- Bingsheng He,
- Xiaoliang Cong
Dynamic graph random walk (DGRW) emerges as a practical tool for capturing structural relations within a graph. Effectively executing DGRW on GPU presents certain challenges. First, existing sampling methods demand a pre-processing buffer, causing ...
Accelerating String-Key Learned Index Structures via Memoization-Based Incremental Training
Learned indexes use machine learning models to learn the mappings between keys and their corresponding positions in key-value indexes. These indexes use the mapping information as training data. Learned indexes require frequent retrainings of their ...
Truss-Based Community Search over Streaming Directed Graphs
Community search aims to retrieve dense subgraphs that contain the query vertices. While many effective community models and algorithms have been proposed in the literature, none of them address the unique challenges posed by streaming graphs, where ...
InferDB: In-Database Machine Learning Inference Using Indexes
The performance of inference with machine learning (ML) models and its integration with analytical query processing have become critical bottlenecks for data analysis in many organizations. An ML inference pipeline typically consists of a preprocessing ...
AAA: An Adaptive Mechanism for Locally Differentially Private Mean Estimation
Local differential privacy (LDP) is a strong privacy standard that has been adopted by popular software systems, including Chrome, iOS, macOS, and Windows. The main idea is that each individual perturbs their own data locally, and only submits the ...
Accelerating Merkle Patricia Trie with GPU
Merkle Patricia Trie (MPT) is a type of trie structure that offers efficient lookup and insert operators for immutable data systems that require multi-version access and tamper-evident controls, such as blockchains and verifiable databases. The ...
Privacy Amplification via Shuffling: Unified, Simplified, and Tightened
The shuffle model of differential privacy provides promising privacy-utility balances in decentralized, privacy-preserving data analysis. However, the current analyses of privacy amplification via shuffling lack both tightness and generality. To address ...
Detecting Metadata-Related Logic Bugs in Database Systems via Raw Database Construction
Database Management Systems (DBMSs) are widely used to efficiently store and retrieve data. DBMSs usually support various metadata, e.g., integrity constraints for ensuring data integrity and indexes for locating data. DBMSs can further utilize these ...
From Zero to Hero: Detecting Leaked Data through Synthetic Data Injection and Model Querying
Safeguarding the Intellectual Property (IP) of data has become critically important as machine learning applications continue to proliferate, and their success heavily relies on the quality of training data. While various mechanisms exist to secure data ...
Oasis: An Optimal Disjoint Segmented Learned Range Filter
The learning-enhanced data structure has inspired the development of the range filter, bringing significantly better false positive rate (FPR) than traditional non-learned range filters. Its core idea is to employ piece-wise linear functions that ...
LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes
- Yuhao Deng,
- Chengliang Chai,
- Lei Cao,
- Qin Yuan,
- Siyuan Chen,
- Yanrui Yu,
- Zhaoze Sun,
- Junyi Wang,
- Jiajun Li,
- Ziqi Cao,
- Kaisen Jin,
- Chi Zhang,
- Yuqing Jiang,
- Yuanfang Zhang,
- Yuping Wang,
- Ye Yuan,
- Guoren Wang,
- Nan Tang
Discovering tables from poorly maintained data lakes is a significant challenge in data management. Two key tasks are identifying joinable and unionable tables, crucial for data integration, analysis, and machine learning. However, there's a lack of a ...
GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization
- Jiale Lao,
- Yibo Wang,
- Yufei Li,
- Jianping Wang,
- Yunjia Zhang,
- Zhiyuan Cheng,
- Wanghu Chen,
- Mingjie Tang,
- Jianguo Wang
Modern database management systems (DBMS) expose hundreds of configurable knobs to control system behaviours. Determining the appropriate values for these knobs to improve DBMS performance is a long-standing problem in the database community. As there is ...
Raising the ClaSS of Streaming Time Series Segmentation
Ubiquitous sensors today emit high frequency streams of numerical measurements that reflect properties of human, animal, industrial, commercial, and natural processes. Shifts in such processes, e.g. caused by external events or internal state changes, ...
Fast Local Subgraph Counting
We study local subgraph counting queries, Q = (p, o), to count how many times a given k-node pattern graph p appears around every node v in a data graph G when the given center node o in p maps to v. Such local subgraph counting becomes important in GNNs ...
ReAcTable: Enhancing ReAct for Table Question Answering
Table Question Answering (TQA) presents a substantial challenge at the intersection of natural language processing and data analytics. This task involves answering natural language (NL) questions on top of tabular data, demanding proficiency in logical ...
NeutronOrch: Rethinking Sample-Based GNN Training under CPU-GPU Heterogeneous Environments
Graph Neural Networks (GNNs) have shown exceptional performance across a wide range of applications. Current frameworks leverage CPU-GPU heterogeneous environments for GNN model training, incorporating mini-batch and sampling techniques to mitigate GPU ...
Rapidash: Efficient Detection of Constraint Violations
Denial Constraint (DC) is a well-established formalism that captures a wide range of integrity constraints commonly encountered, including candidate keys, functional dependencies, and ordering constraints, among others. Given their significance, there ...
Differentially Private Data Generation with Missing Data
Despite several works that succeed in generating synthetic data with differential privacy (DP) guarantees, they are inadequate for generating high-quality synthetic data when the input data has missing values. In this work, we formalize the problems of ...
Everything You Always Wanted to Know About Storage Compressibility of Pre-Trained ML Models but Were Afraid to Ask
As the number of pre-trained machine learning (ML) models is growing exponentially, data reduction tools are not catching up. Existing data reduction techniques are not specifically designed for pre-trained model (PTM) dataset files. This is largely due ...
Fight Fire with Fire: Towards Robust Graph Neural Networks on Dynamic Graphs via Actively Defense
Graph neural networks (GNNs) have achieved great success on various graph tasks. However, recent studies have revealed that GNNs are vulnerable to injective attacks. Due to the openness of platforms, attackers can inject malicious nodes with carefully ...
SeLeP: Learning Based Semantic Prefetching for Exploratory Database Workloads
Prefetching is a crucial technique employed in traditional databases to enhance interactivity, particularly in the context of data exploration. Data exploration is a query processing paradigm in which users search for insights buried in the data, often ...
Contributions Estimation in Federated Learning: A Comprehensive Experimental Evaluation
Federated Learning (FL) provides a privacy-preserving and decentralized approach to collaborative machine learning for multiple FL clients. The contribution estimation mechanism in FL is extensively studied within the database community, which aims to ...
Visualization-Aware Time Series Min-Max Caching with Error Bound Guarantees
This paper addresses the challenges in interactive visual exploration of large multi-variate time series data. Traditional data reduction techniques may improve latency but can distort visualizations. State-of-the-art methods aimed at 100% accurate ...
Chorus: Foundation Models for Unified Data Discovery and Exploration
We apply foundation models to data discovery and exploration tasks. Foundation models are large language models (LLMs) that show promising performance on a range of diverse tasks unrelated to their training. We show that these models are highly ...
Cloud-Native Database Systems and Unikernels: Reimagining OS Abstractions for Modern Hardware
This paper explores the intersection of operating systems and database systems, focusing on the potential of specialized kernels for cloud-native database systems. Although the idea of custom, DBMS-optimized OS kernels is old, it is largely unrealized ...