DOI: 10.1145/3394486.3403299
research-article
Open access

To Tune or Not to Tune?: In Search of Optimal Configurations for Data Analytics

Published: 20 August 2020
  • Abstract

    This experimental study examines several issues that make configuration tuning difficult to deploy in practice for data analytics frameworks: 1) the assumption of a static workload or environment, which ignores the dynamic characteristics of the analytics environment (e.g., growth in input data size, changes in resource allocation); 2) the amortization of tuning costs, which determines which workloads can be tuned cost-effectively in practice; and 3) the need for a comprehensive incremental tuning solution that covers a diverse set of workloads. We adapt several ML techniques to obtain efficient incremental tuning in this problem domain and propose Tuneful, a configuration tuning framework. We show how its design addresses these issues and illustrate its applicability through a wide array of experiments in cloud environments from two different service providers.
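The amortization issue in point 2 can be illustrated with a simple break-even calculation. The sketch below is not taken from the paper or from Tuneful; the function name `break_even_runs`, its parameters, and the dollar figures are hypothetical, and it assumes that the per-run saving stays constant after tuning.

```python
import math


def break_even_runs(tuning_cost, baseline_cost_per_run, tuned_cost_per_run):
    """Number of post-tuning runs needed to recoup the tuning cost.

    All arguments are in the same currency unit. Returns None when the
    tuned configuration is no cheaper per run, since the tuning cost is
    then never recovered.
    """
    saving_per_run = baseline_cost_per_run - tuned_cost_per_run
    if saving_per_run <= 0:
        return None
    # Round up: a partial run cannot pay back the remainder.
    return math.ceil(tuning_cost / saving_per_run)


# Hypothetical figures: tuning costs 50.0, each run drops from 2.0 to 1.5.
runs_needed = break_even_runs(50.0, 2.0, 1.5)
print(runs_needed)
```

Under this simplified model, a workload that is executed fewer times than the break-even count would cost more to tune than the tuning saves, which is one way to read the abstract's question of which workloads can be tuned cost-effectively.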

    Published In

    KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
    August 2020
    3664 pages
    ISBN: 9781450379984
    DOI: 10.1145/3394486
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. bayesian optimization
    2. configuration tuning
    3. cost amortization
    4. data analytics

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Article Metrics

    • Downloads (last 12 months): 178
    • Downloads (last 6 weeks): 25
    Reflects downloads up to 14 Aug 2024

    Cited By

    • (2024) An Efficient Transfer Learning Based Configuration Adviser for Database Tuning. Proceedings of the VLDB Endowment, 17(3):539-552. DOI: 10.14778/3632093.3632114. Online publication date: 20-Jan-2024
    • (2024) ShrinkHPO: Towards Explainable Parallel Hyperparameter Optimization. 2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 4897-4910. DOI: 10.1109/ICDE60146.2024.00371. Online publication date: 13-May-2024
    • (2024) QHB+: Accelerated Configuration Optimization for Automated Performance Tuning of Spark SQL Applications. IEEE Access, 12:60138-60148. DOI: 10.1109/ACCESS.2024.3391333. Online publication date: 2024
    • (2023) A Sample-Aware Database Tuning System With Deep Reinforcement Learning. Journal of Database Management, 35(1):1-25. DOI: 10.4018/JDM.333519. Online publication date: 9-Nov-2023
    • (2023) FASTune: Towards Fast and Stable Database Tuning System with Reinforcement Learning. Electronics, 12(10):2168. DOI: 10.3390/electronics12102168. Online publication date: 10-May-2023
    • (2023) Towards General and Efficient Online Tuning for Spark. Proceedings of the VLDB Endowment, 16(12):3570-3583. DOI: 10.14778/3611540.3611548. Online publication date: 1-Aug-2023
    • (2023) EFTuner: A Bi-Objective Configuration Parameter Auto-Tuning Method Towards Energy-Efficient Big Data Processing. Proceedings of the 14th Asia-Pacific Symposium on Internetware, pages 292-301. DOI: 10.1145/3609437.3609443. Online publication date: 4-Aug-2023
    • (2023) A Unified and Efficient Coordinating Framework for Autonomous DBMS Tuning. Proceedings of the ACM on Management of Data, 1(2):1-26. DOI: 10.1145/3589331. Online publication date: 20-Jun-2023
    • (2023) DBPA: A Benchmark for Transactional Database Performance Anomalies. Proceedings of the ACM on Management of Data, 1(1):1-26. DOI: 10.1145/3588926. Online publication date: 30-May-2023
    • (2023) Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics. 2023 IEEE International Performance, Computing, and Communications Conference (IPCCC), pages 403-412. DOI: 10.1109/IPCCC59175.2023.10253884. Online publication date: 17-Nov-2023
