DOI: 10.1109/ICSE-SEIP52600.2021.00043
research-article

MicroHECL: high-efficient root cause localization in large-scale microservice systems

Published: 17 December 2021

Abstract

Availability issues of industrial microservice systems (e.g., a drop in successfully placed orders or processed transactions) directly affect the running of the business. These issues are usually caused by various types of service anomalies that propagate along service dependencies. Accurate and highly efficient root cause localization is thus a critical challenge for large-scale industrial microservice systems. Existing approaches use analysis techniques based on the service dependency graph to locate root causes automatically. However, these approaches are limited by their inaccurate detection of service anomalies and their inefficient traversal of the service dependency graph. In this paper, we propose MicroHECL, a highly efficient root cause localization approach for availability issues of microservice systems. Based on a dynamically constructed service call graph, MicroHECL analyzes possible anomaly propagation chains and ranks candidate root causes based on correlation analysis. We combine machine learning and statistical methods and design customized models for detecting different types of service anomalies (i.e., performance, reliability, and traffic anomalies). To improve efficiency, we adopt a pruning strategy that eliminates irrelevant service calls during anomaly propagation chain analysis. Experimental studies show that MicroHECL significantly outperforms two state-of-the-art baseline approaches in terms of both accuracy and efficiency. MicroHECL has been used at Alibaba, where it achieves a top-3 hit ratio of 68% and reduces root cause localization time from 30 minutes to 5 minutes.
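The abstract names three steps: analyzing anomaly propagation chains over a service call graph, pruning irrelevant service calls, and ranking candidates by correlation. The following is a minimal sketch of that general idea only, not the authors' algorithm. The abstract gives no algorithmic detail, so everything here is an assumption: the function names (`pearson`, `localize`), the precomputed `anomalous_edges` set that stands in for the paper's anomaly detection models, the rule that a chain's last service is a candidate, and the use of Pearson correlation against a business metric for ranking.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def localize(call_graph, anomalous_edges, entry, business_metric,
             service_metrics, top_k=3):
    """Hypothetical sketch: follow only anomalous calls from the entry service,
    treat the end of each chain as a candidate, rank by |correlation|."""
    seen = {entry}
    queue = deque([entry])
    candidates = []
    while queue:
        svc = queue.popleft()
        # Pruning (assumed form): calls not flagged anomalous are never explored.
        nxt = [t for t in call_graph.get(svc, []) if (svc, t) in anomalous_edges]
        if not nxt:
            # No anomalous call leaves this service: the chain ends here.
            candidates.append(svc)
        for t in nxt:
            if t not in seen:
                seen.add(t)
                queue.append(t)
    ranked = sorted(
        candidates,
        key=lambda s: abs(pearson(service_metrics[s], business_metric)),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data, invented for illustration only.
call_graph = {
    "gateway": ["order", "search"],
    "order": ["payment", "inventory"],
}
anomalous_edges = {("gateway", "order"), ("order", "payment"), ("order", "inventory")}
business_metric = [100, 98, 60, 55, 97]          # e.g. orders placed per minute
service_metrics = {
    "payment": [20, 21, 90, 95, 22],             # e.g. response time
    "inventory": [30, 31, 33, 30, 32],
}
result = localize(call_graph, anomalous_edges, "gateway",
                  business_metric, service_metrics)
print(result)  # ['payment', 'inventory']
```

In this toy run, `search` is never visited because the call from `gateway` to `search` is not flagged as anomalous, which is the effect a pruning strategy aims for. `payment` ranks first because its metric moves almost exactly opposite to the business metric.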

Published In

ICSE-SEIP '21: Proceedings of the 43rd International Conference on Software Engineering: Software Engineering in Practice
May 2021
405 pages
ISBN:9780738146690

Sponsors

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. anomaly detection
  2. availability
  3. microservice
  4. root cause localization
  5. service call graph

Qualifiers

  • Research-article

Conference

ICSE '21
Cited By

  • (2024) ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 1183-1194. DOI: 10.1145/3691620.3695495. Online publication date: 27-Oct-2024
  • (2024) Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 1107-1119. DOI: 10.1145/3691620.3695489. Online publication date: 27-Oct-2024
  • (2024) MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 1057-1068. DOI: 10.1145/3691620.3695485. Online publication date: 27-Oct-2024
  • (2024) SLIM: a Scalable and Interpretable Light-weight Fault Localization Algorithm for Imbalanced Data in Microservice. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 27-39. DOI: 10.1145/3691620.3694984. Online publication date: 27-Oct-2024
  • (2024) Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pp. 126-137. DOI: 10.1145/3663529.3663834. Online publication date: 10-Jul-2024
  • (2024) Fault Diagnosis for Test Alarms in Microservices through Multi-source Data. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pp. 115-125. DOI: 10.1145/3663529.3663833. Online publication date: 10-Jul-2024
  • (2024) Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pp. 50-61. DOI: 10.1145/3663529.3663827. Online publication date: 10-Jul-2024
  • (2024) BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection. Proceedings of the ACM on Software Engineering 1(FSE), pp. 2214-2237. DOI: 10.1145/3660805. Online publication date: 12-Jul-2024
  • (2024) TraStrainer: Adaptive Sampling for Distributed Traces with System Runtime State. Proceedings of the ACM on Software Engineering 1(FSE), pp. 473-493. DOI: 10.1145/3643748. Online publication date: 12-Jul-2024
  • (2024) FaultInsight: Interpreting Hyperscale Data Center Host Faults. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 141-152. DOI: 10.1145/3637528.3672051. Online publication date: 25-Aug-2024
