research-article

SwitchTx: scalable in-network coordination for distributed transaction processing

Published: 01 July 2022

Abstract

Online transaction processing (OLTP) applications require the underlying storage system to guarantee consistency and serializability for distributed transactions that involve large numbers of servers, which tends to introduce high coordination cost and lower system performance. In-network coordination, which leverages programmable switches to move part of the coordination functionality into the network, is a promising approach to alleviating this problem. This paper presents SwitchTx, a fast and scalable transaction processing system. At the core of SwitchTx is a decentralized multi-switch in-network coordination mechanism, which leverages the programmability of modern switches to reduce coordination cost while avoiding the problems caused by the central switch in the state-of-the-art Eris transaction processing system. SwitchTx abstracts various coordination tasks (e.g., locking, validating, and replicating) as in-switch gather-and-scatter (GaS) operations, and offloads the coordination of each transaction to a tree of switches (instead of to a central switch for all transactions), where the client and the participants connect to the leaves. Moreover, to control transaction traffic intelligently, SwitchTx reorders coordination messages according to their semantics and redesigns congestion control in combination with admission control. Evaluation shows that SwitchTx outperforms current transaction processing systems across various workloads by up to 2.16× in throughput, 40.4% in latency, and 41.5% in lock time.
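
To make the gather-and-scatter idea concrete, the following is a minimal host-side sketch in Python, not the paper's switch implementation. It assumes a commit-style vote as the coordination task; the `Switch` class, the function names, and the participant labels are all hypothetical and chosen only for illustration. Each switch combines the votes beneath it into a single value on the way up (gather), and the root's decision is then fanned out to every participant on the way down (scatter).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Switch:
    """Hypothetical switch node in a per-transaction coordination tree."""
    name: str
    children: List["Switch"] = field(default_factory=list)
    participants: List[str] = field(default_factory=list)  # servers attached to this switch


def gather(switch: Switch, votes: Dict[str, bool]) -> bool:
    """Combine votes bottom-up so each switch forwards one aggregated vote."""
    local = all(votes[p] for p in switch.participants)
    # Build a list first so every child subtree is visited before combining.
    below = all([gather(child, votes) for child in switch.children])
    return local and below


def scatter(switch: Switch, decision: bool, out: Dict[str, bool]) -> None:
    """Fan the decision out top-down to every participant under this switch."""
    for p in switch.participants:
        out[p] = decision
    for child in switch.children:
        scatter(child, decision, out)


def gather_and_scatter(root: Switch, votes: Dict[str, bool]) -> Dict[str, bool]:
    """Run one gather phase followed by one scatter phase over the tree."""
    decision = gather(root, votes)
    delivered: Dict[str, bool] = {}
    scatter(root, decision, delivered)
    return delivered


if __name__ == "__main__":
    # Assumed topology: two leaf switches under one root switch.
    leaf_a = Switch("leaf-a", participants=["p1", "p2"])
    leaf_b = Switch("leaf-b", participants=["p3"])
    root = Switch("root", children=[leaf_a, leaf_b])
    print(gather_and_scatter(root, {"p1": True, "p2": True, "p3": True}))
    print(gather_and_scatter(root, {"p1": True, "p2": False, "p3": True}))
```

In this sketch a single negative vote anywhere in the tree turns the scattered decision negative for all participants, and the root receives one aggregated message per child switch rather than one per participant.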

Published In

Proceedings of the VLDB Endowment, Volume 15, Issue 11
July 2022
980 pages
ISSN: 2150-8097

Publisher

VLDB Endowment