research-article

SemParser: A Semantic Parser for Log Analytics

Authors:

Michael R. Lyu

ICSE '23: Proceedings of the 45th International Conference on Software Engineering

Pages 881 - 893

https://doi.org/10.1109/ICSE48619.2023.00082

Published: 26 July 2023

Abstract

Logs are run-time information automatically generated by software; they record system events and activities with their timestamps. Before deeper insights into the run-time status of the software can be obtained, a fundamental step of log analysis, called log parsing, extracts structured templates and parameters from the semi-structured raw log messages. However, current log parsers are all syntax-based: they regard each message as a character string and ignore the semantic information contained in parameters and templates.
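To make the notion of syntax-based log parsing concrete, here is a minimal sketch in Python. It is not any of the parsers discussed in the paper: the masking rules in `PARAM_PATTERNS`, the function name `parse_log`, and the sample message are all hypothetical, chosen only to show how a raw message is split into a template and a parameter list.

```python
import re

# Hypothetical masking rules (not from the paper): a token matching any of
# these patterns is treated as a parameter, everything else as template text.
PARAM_PATTERNS = [
    re.compile(r"^\d+\.\d+\.\d+\.\d+(:\d+)?$"),  # IPv4 address, optional port
    re.compile(r"^0x[0-9a-fA-F]+$"),             # hexadecimal value
    re.compile(r"^-?\d+(\.\d+)?$"),              # integer or decimal number
]


def parse_log(message):
    """Split a raw log message into a template and its parameter list."""
    template_tokens = []
    parameters = []
    for token in message.split():
        if any(pattern.match(token) for pattern in PARAM_PATTERNS):
            template_tokens.append("<*>")  # placeholder for a variable part
            parameters.append(token)
        else:
            template_tokens.append(token)
    return " ".join(template_tokens), parameters


template, params = parse_log("Received block 4096 from 10.0.0.7:50010")
print(template)  # Received block <*> from <*>
print(params)    # ['4096', '10.0.0.7:50010']
```

The output treats `4096` as an opaque string. Nothing in it records that the value is a block identifier, which is the kind of semantic information the abstract says syntax-based parsers ignore.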

Thus, we propose SemParser, the first semantic-based parser, to address the critical bottleneck of mining semantics from log messages. It consists of two steps: an end-to-end semantics miner and a joint parser. The first step identifies explicit semantics inside a single log, and the second step jointly infers implicit semantics and computes structured outputs according to the contextual knowledge base of the logs. To analyze the effectiveness of our semantic parser, we first demonstrate that it derives rich semantics from log messages collected from six widely used systems with an average F1 score of 0.985. Then, we conduct two representative downstream tasks, showing that current downstream models improve their performance with appropriately extracted semantics by 1.2%-11.7% on two anomaly detection datasets and by 8.65% on a failure identification dataset. We believe these findings provide insights into semantically understanding log messages for the log analysis community.
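The two-step idea can be illustrated with a toy sketch. The abstract does not describe data structures or models, so everything here is an assumption: semantics are represented as value-to-concept pairs, the concept vocabulary `CONCEPTS` is hypothetical, a regular expression stands in for the learned semantics miner, and the "contextual knowledge base" is a plain dictionary. It shows only the flow of information, not SemParser itself.

```python
import re

# Hypothetical concept vocabulary; a real miner would learn this from data.
CONCEPTS = {"block", "instance", "port"}

# Assumption: a word directly followed by a number is a candidate pair.
CANDIDATE_PAIR = re.compile(r"\b([A-Za-z]+)\s+(\d+)\b")
BARE_VALUE = re.compile(r"\b\d+\b")


def mine_explicit(message):
    """Step 1 (stand-in): find value -> concept pairs stated in one message."""
    pairs = {}
    for word, value in CANDIDATE_PAIR.findall(message):
        if word.lower() in CONCEPTS:
            pairs[value] = word.lower()
    return pairs


def joint_parse(messages):
    """Step 2 (stand-in): label every value, reusing concepts from earlier logs."""
    knowledge = {}  # hypothetical knowledge base: value -> concept
    results = []
    for message in messages:
        explicit = mine_explicit(message)
        knowledge.update(explicit)
        annotated = {}
        for value in BARE_VALUE.findall(message):
            # Prefer a concept stated in this message, else one seen before.
            annotated[value] = explicit.get(value) or knowledge.get(value)
        results.append(annotated)
    return results


logs = ["Allocated block 4096", "Deleting 4096 done"]
for annotated in joint_parse(logs):
    print(annotated)
# {'4096': 'block'}
# {'4096': 'block'}
```

The second message never says what `4096` is. The label is recovered only because an earlier message stated it explicitly, which is the sense in which implicit semantics depend on context across logs.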

Cited By

Yu S, Wu Y, Li Y, He P, Filkov V, Ray B, Zhou M (2024). Unlocking the Power of Numbers: Log Compression via Numeric Token Parsing. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 919-930. DOI: 10.1145/3691620.3695474. Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695474
Jiang Z, Liu J, Chen Z, Li Y, Huang J, Huo Y, He P, Gu J, Lyu M (2024). LILAC: Log Parsing using LLMs with Adaptive Parsing Cache. Proceedings of the ACM on Software Engineering, 1:FSE, pp. 137-160. DOI: 10.1145/3643733. Online publication date: 12-Jul-2024.
https://dl.acm.org/doi/10.1145/3643733
Zhou Y, Su Y (2023). ForestZip: An Effective Parallel Parser for Log Compression. Proceedings of the 2023 3rd Guangdong-Hong Kong-Macao Greater Bay Area Artificial Intelligence and Big Data Forum, pp. 274-278. DOI: 10.1145/3660395.3660443. Online publication date: 22-Sep-2023.
https://dl.acm.org/doi/10.1145/3660395.3660443

Published In

ICSE '23: Proceedings of the 45th International Conference on Software Engineering

May 2023

2713 pages

ISBN: 9781665457019

General Chair: John Grundy (Department of Software Systems and Cybersecurity, Faculty of IT, Monash University, Australia)

Program Co-chairs: Lori Pollock (University of Delaware, DE, USA), Massimiliano Di Penta (University of Sannio, Italy)

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

In-Cooperation

IEEE CS

Publisher

IEEE Press

Qualifiers

Research-article

Conference

ICSE '23: 45th International Conference on Software Engineering

Sponsor: SIGSOFT

May 14 - 20, 2023

Melbourne, Victoria, Australia

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Bibliometrics

Total Citations: 3

Total Downloads: 75

Downloads (Last 12 months): 49

Downloads (Last 6 weeks): 6

Reflects downloads up to 01 Nov 2024
