Understanding performance problems in deep learning systems

J Cao, B Chen, C Sun, L Hu, S Wu, X Peng - Proceedings of the 30th …, 2022 - dl.acm.org
Proceedings of the 30th ACM Joint European Software Engineering Conference …, 2022 - dl.acm.org
Deep learning (DL) has been widely applied to many domains. The shift in programming paradigm from traditional systems to DL systems poses unique engineering challenges, and performance is one of them. Performance problems (PPs) in DL systems can cause severe consequences such as excessive resource consumption and financial loss. While bugs in DL systems have been extensively investigated, PPs in DL systems have hardly been explored. To bridge this gap, we present the first comprehensive study to i) characterize the symptoms, root causes, and introducing and exposing stages of PPs in DL systems developed in TensorFlow and Keras, with 224 PPs collected from 210 Stack Overflow posts, and to ii) assess the capability of existing performance analysis approaches in tackling PPs, with a constructed benchmark of 58 PPs in DL systems. Our findings shed light on the implications for developing high-performance DL systems and for detecting and localizing PPs in DL systems. To demonstrate the usefulness of our findings, we develop a static checker, DeepPerf, to detect three types of PPs. It has detected 488 new PPs in 130 GitHub projects, of which 105 and 27 have been confirmed and fixed, respectively.