research-article

FlowWalker: A Memory-Efficient and High-Performance GPU-Based Dynamic Graph Random Walk Framework

Authors:

Xiaoliang CongAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 17, Issue 8

Pages 1788 - 1801

https://doi.org/10.14778/3659437.3659438

Published: 31 May 2024 Publication History

Abstract

Dynamic graph random walk (DGRW) emerges as a practical tool for capturing structural relations within a graph. Effectively executing DGRW on GPU presents certain challenges. First, existing sampling methods demand a pre-processing buffer, causing substantial space complexity. Moreover, the power-law distribution of graph vertex degrees introduces workload imbalance issues, rendering DGRW embarrassed to parallelize. In this paper, we propose FlowWalker, a GPU-based dynamic graph random walk framework. FlowWalker implements an efficient parallel sampling method to fully exploit the GPU parallelism and reduce space complexity. Moreover, it employs a sampler-centric paradigm alongside a dynamic scheduling strategy to handle the huge amounts of walking queries. FlowWalker stands as a memory-efficient framework that requires no auxiliary data structures in GPU global memory. We examine the performance of FlowWalker extensively on ten datasets, and experiment results show that FlowWalker achieves up to 752.2×, 72.1×, and 16.4× speedup compared with existing CPU, GPU, and FPGA random walk frameworks, respectively. Case study shows that FlowWalker diminishes random walk time from 35% to 3% in a pipeline of ByteDance friend recommendation GNN training.

References

[1]

Paolo Boldi and Marco Rosa. 2012. Arc-community detection via triangular random walks. In 2012 Eighth Latin American Web Congress. IEEE, 48--56.

Digital Library

[2]

Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. 2011. Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks. In Proceedings of the 20th international conference on World Wide Web, Sadagopan Srinivasan, Krithi Ramamritham, Arun Kumar, M. P. Ravindra, Elisa Bertino, and Ravi Kumar (Eds.). ACM Press, 587--596.

Digital Library

[3]

Paolo Boldi and Sebastiano Vigna. 2004. The WebGraph Framework I: Compression Techniques. In Proc. of the Thirteenth International World Wide Web Conference (WWW 2004). ACM Press, Manhattan, USA, 595--601.

Digital Library

[4]

Min-Te Chao. 1982. A general purpose unequal probability sampling plan. Biometrika 69, 3 (1982), 653--656.

[5]

Thomas H Cormen and Michael T Goodrich. 1996. A bridging model for parallel computation, communication, and I/O. ACM Computing Surveys (CSUR) 28, 4es (1996), 208--es.

Digital Library

[6]

Xiaoheng Deng, Genghao Li, Mianxiong Dong, and Kaoru Ota. 2017. Finding overlapping communities based on Markov chain and link clustering. Peer-to-peer Networking and Applications 10 (2017), 411--420.

[7]

Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 135--144.

Digital Library

[8]

Dániel Fogaras, Balázs Rácz, Károly Csalogány, and Tamás Sarlós. 2005. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics 2, 3 (2005), 333--358.

[9]

Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 855--864.

Digital Library

[10]

Yu He, Yangqiu Song, Jianxin Li, Cheng Ji, Jian Peng, and Hao Peng. 2019. HeteSpaceyWalk: A Heterogeneous Spacey Random Walk for Heterogeneous Information Network Embedding. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (Beijing, China) (CIKM '19). Association for Computing Machinery, New York, NY, USA, 639--648.

Digital Library

[11]

W. Daniel Hillis and Guy L. Steele. 1986. Data Parallel Algorithms. Commun. ACM 29, 12 (dec 1986), 1170--1183.

Digital Library

[12]

Lorenz Hübschle-Schneider and Peter Sanders. 2022. Parallel weighted random sampling. ACM Transactions on Mathematical Software (TOMS) 48, 3 (2022), 1--40.

Digital Library

[13]

Abhinav Jangda, Sandeep Polisetty, Arjun Guha, and Marco Serafini. 2021. Accelerating graph sampling for graph machine learning using GPUs. In Proceedings of the Sixteenth European Conference on Computer Systems. 311--326.

Digital Library

[14]

Yong-Yeon Jo, Myung-Hwan Jang, Hyungsoo Jung, and Sang-Wook Kim. 2018. A High-Performance Graph Engine for Efficient Social Network Analysis. In Companion of the The Web Conference 2018 on The Web Conference 2018, WWW 2018, Lyon, France, April 23--27, 2018, Pierre-Antoine Champin, Fabien Gandon, Mounia Lalmas, and Panagiotis G. Ipeirotis (Eds.). ACM, 61--62.

Digital Library

[15]

Peter M Kogge and Harold S Stone. 1973. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE transactions on computers 100, 8 (1973), 786--793.

Digital Library

[16]

Aapo Kyrola. 2013. DrunkardMob: billions of random walks on just a PC. Proceedings of the 7th ACM conference on Recommender systems (2013).

Digital Library

[17]

Aapo Kyrola. 2013. Drunkardmob: billions of random walks on just a pc. In Proceedings of the 7th ACM conference on Recommender systems. 257--264.

Digital Library

[18]

Aapo Kyrola, Guy E. Blelloch, and Carlos Guestrin. 2012. GraphChi: Large-Scale Graph Computation on Just a PC. In 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012, Hollywood, CA, USA, October 8--10, 2012. 31--46. https://www.usenix.org/conference/osdi12/technicalsessions/presentation/kyrola

[19]

Ni Lao and William W Cohen. 2010. Relational retrieval using a combination of path-constrained random walks. Machine learning 81 (2010), 53--67.

[20]

Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data.

[21]

Xueting Liao, Yubao Wu, and Xiaojun Cao. 2019. Second-Order CoSimRank for Similarity Measures in Social Networks. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC). 1--6.

[22]

Xin Lv, Yuxian Gu, Xu Han, Lei Hou, Juanzi Li, and Zhiyuan Liu. 2019. Adapting Meta Knowledge Graph Information for Multi-Hop Reasoning over Few-Shot Relations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3--7, 2019, Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, 3374--3379.

[23]

Pedro J. Martín, Luis F. Ayuso, Roberto Torres, and Antonio Gavilanes. 2012. Algorithmic strategies for optimizing the parallel reduction primitive in CUDA. In 2012 International Conference on High Performance Computing & Simulation (HPCS). 511--519.

[24]

Junyi Mei, Shixuan Sun, Chao Li, Cheng Xu, Cheng Chen, Yibo Liu, Jing Wang, Cheng Zhao, Xiaofeng Hou, Minyi Guo, Bingsheng He, and Xiaoliang Cong. 2024. FlowWalker: A Memory-efficient and High-performance GPU-based Dynamic Graph Random Walk Framework. arXiv:2404.08364 [cs.DC]

[25]

NVIDIA. 2022. CUB Documentation. https://nvlabs.github.io/cub/index.html, Last accessed on 2023-6-25.

[26]

NVIDIA. 2023. CUDA Toolkit Documentation, cuRAND. https://docs.nvidia.com/cuda/curand/index.html, Last accessed on 2023-6-25.

[27]

S. Olver and A. Townsend. 2013. Fast inverse transform sampling in one and two dimensions. arXiv: Numerical Analysis (2013).

[28]

Santosh Pandey, Lingda Li, Adolfy Hoisie, Xiaoye S Li, and Hang Liu. 2020. CSAW: A framework for graph sampling and random walk on GPUs. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1--15.

[29]

Serafeim Papadias, Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, and Volker Markl. 2022. Space-efficient random walks on streaming graphs. Proceedings of the VLDB Endowment 16, 2 (2022), 356--368.

Digital Library

[30]

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 701--710.

Digital Library

[31]

Christian Robert and George Casella. 2013. Monte Carlo statistical methods. In Springer Science & Business Media.

[32]

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. The Graph Neural Network Model. IEEE Transactions on Neural Networks 20, 1 (2009), 61--80.

Digital Library

[33]

Shubhabrata Sengupta, Aaron Lefohn, and John D Owens. 2006. A work-efficient step-efficient prefix sum algorithm. (2006).

[34]

Yingxia Shao, Shiyue Huang, Xupeng Miao, Bin Cui, and Lei Chen. 2020. Memory-aware framework for efficient second-order random walk on large graphs. In Proceedings of the 2020 ACM SIGMOD international conference on management of data. 1797--1812.

Digital Library

[35]

Baoxu Shi and Tim Weninger. 2016. Discriminative predicate path mining for fact checking in knowledge graphs. Knowledge-based systems 104 (2016), 123--133.

[36]

Shixuan Sun, Yuhang Chen, Shengliang Lu, Bingsheng He, and Yuchen Li. 2021. ThunderRW: An In-Memory Graph Random Walk Engine. Proc. VLDB Endow. 14, 11 (2021), 1992--2005. http://www.vldb.org/pvldb/vol14/p1992-sun.pdf

Digital Library

[37]

Yizhou Sun and Jiawei Han. 2012. Mining heterogeneous information networks: a structural analysis approach. SIGKDD Explor. 14, 2 (2012), 20--28.

Digital Library

[38]

Hongshi Tan, Xinyu Chen, Yao Chen, Bingsheng He, and Weng-Fai Wong. 2023. LightRW: FPGA Accelerated Graph Dynamic Random Walks. Proc. ACM Manag. Data 1, 1, Article 90 (may 2023), 27 pages.

Digital Library

[39]

Alok Tripathy, Katherine Yelick, and Aydin Buluc. 2023. Distributed Matrix-Based Sampling for Graph Neural Network Training. arXiv preprint arXiv:2311.02909 (2023).

[40]

Jeffrey S Vitter. 1985. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS) 11, 1 (1985), 37--57.

Digital Library

[41]

Alastair J. Walker. 1977. An Efficient Method for Generating Discrete Random Variables with General Distributions. ACM Trans. Math. Softw. 3, 3 (1977), 253--256.

Digital Library

[42]

Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, B. Zhao, and D. Lee. 2018. Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018).

Digital Library

[43]

Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, et al. 2019. Deep graph library: Towards efficient and scalable deep learning on graphs. arXiv preprint arXiv:1909.01315 (2019).

[44]

Pengyu Wang, Chao Li, Jing Wang, Taolei Wang, Lu Zhang, Jingwen Leng, Quan Chen, and Minyi Guo. 2021. Skywalker: Efficient Alias-Method-Based Graph Sampling and Random Walk on GPUs. In 30th International Conference on Parallel Architectures and Compilation Techniques, PACT 2021, Atlanta, GA, USA, September 26--29, 2021. IEEE, 304--317.

[45]

Pengyu Wang, Cheng Xu, Chao Li, Jing Wang, Taolei Wang, Lu Zhang, Xiaofeng Hou, and Minyi Guo. 2023. Optimizing GPU-based Graph Sampling and Random Walk for Efficiency and Scalability. IEEE Trans. Comput. (2023).

Digital Library

[46]

Rui Wang, Yongkun Li, Hong Xie, Yinlong Xu, and John CS Lui. 2020. {GraphWalker}: An {I/O-Efficient} and {Resource-Friendly} Graph Analytic System for Fast and Scalable Random Walks. In 2020 USENIX Annual Technical Conference (USENIX ATC 20). 559--571.

[47]

Shuke Wang, Mingxing Zhang, Ke Yang, Kang Chen, Shaonan Ma, Jinlei Jiang, and Yongwei Wu. 2023. NosWalker: A Decoupled Architecture for Out-of-Core Random Walk Processing. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 466--482.

Digital Library

[48]

Yubao Wu, Yuchen Bian, and Xiang Zhang. 2016. Remember where you came from: on the second-order random walk based proximity measures. Proceedings of the VLDB Endowment 10, 1 (2016), 13--24.

Digital Library

[49]

Ke Yang, Xiaosong Ma, Saravanan Thirumuruganathan, Kang Chen, and Yongwei Wu. 2021. Random walks on huge graphs at cache efficiency. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 311--326.

Digital Library

[50]

Ke Yang, Mingxing Zhang, Kang Chen, Xiaosong Ma, Yang Bai, and Yong Jiang. 2019. KnightKing: a fast distributed graph random walk engine. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019, Huntsville, ON, Canada, October 27--30, 2019, Tim Brecht and Carey Williamson (Eds.). ACM, 524--537.

Digital Library

[51]

Hongbo Yin, Yingxia Shao, Xupeng Miao, Yawen Li, and Bin Cui. 2022. Scalable Graph Sampling on GPUs with Compressed Graph. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management (Atlanta, GA, USA) (CIKM '22). Association for Computing Machinery, New York, NY, USA, 2383--2392.

Digital Library

[52]

Dalong Zhang, Xin Huang, Ziqi Liu, Jun Zhou, Zhiyang Hu, Xianzheng Song, Zhibang Ge, Lin Wang, Zhiqiang Zhang, and Yuan Qi. 2020. AGL: A Scalable System for Industrial-Purpose Graph Machine Learning. Proc. VLDB Endow. 13, 12 (aug 2020), 3125--3137.

Digital Library

[53]

Rong Zhu, Kun Zhao, Hongxia Yang, Wei Lin, Chang Zhou, Baole Ai, Yong Li, and Jingren Zhou. 2019. AliGraph: a comprehensive graph neural network platform. Proceedings of the VLDB Endowment 12, 12 (2019), 2094--2105.

Digital Library

Recommendations

Accelerated high-performance computing through efficient multi-process GPU resource sharing
CF '12: Proceedings of the 9th conference on Computing Frontiers

The HPC field is witnessing a widespread adoption of GPUs as accelerators for traditional homogeneous HPC systems. One of the prevalent parallel programming models is the SPMD paradigm, which has been adapted for GPU-based parallel processing. Since ...
Thread Batching for High-performance Energy-efficient GPU Memory Design
Special Issue on HALO for Energy-Constrained On-Chip Machine Learning, Part 2 and Regular Papers

Massive multi-threading in GPU imposes tremendous pressure on memory subsystems. Due to rapid growth in thread-level parallelism of GPU and slowly improved peak memory bandwidth, memory becomes a bottleneck of GPU’s performance and energy efficiency. In ...
Multi-GPU DGEMM and High Performance Linpack on Highly Energy-Efficient Clusters

High Performance Linpack can maximize requirements throughout a computer system. An efficient multi-GPU double-precision general matrix multiply (DGEMM), together with adjustments to the HPL, is required to utilize a heterogeneous computer to its full ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 17, Issue 8

April 2024

335 pages

Editors:
Meihui Zhang
Beijing Institute of Technology
,
Cyrus Shahabi
University of Southern California

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 31 May 2024

Published in PVLDB Volume 17, Issue 8

Check for updates

Badges

Artifacts Available / v1.1

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
42
Total Downloads

Downloads (Last 12 months)42
Downloads (Last 6 weeks)7

Reflects downloads up to 12 Sep 2024

Other Metrics

View Author Metrics

Citations

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents