DOI: 10.1109/ICSE43902.2021.00144

Restoring Execution Environments of Jupyter Notebooks

Published: 05 November 2021

Abstract

More than ninety percent of published Jupyter notebooks do not state their dependencies on external packages. This makes them non-executable and thus hinders the reproducibility of scientific results. We present SnifferDog, an approach that (1) collects the APIs of Python packages and versions, creating a database of APIs; (2) analyzes notebooks to determine candidates for required packages and versions; and (3) checks which packages are required to make the notebook executable (and ideally, reproduce its stored results). In our evaluation, we show that SnifferDog precisely restores execution environments for the large majority of notebooks, making them immediately executable for end users.
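To make step (2) concrete, the following is a minimal sketch of how the modules imported by a notebook could be collected. It is not SnifferDog's implementation: the function name `imported_modules` and the example notebook are hypothetical, and the only assumptions are that a notebook is a JSON document with a `cells` list and that code cells carry their text in `source`.

```python
import ast
import json


def imported_modules(notebook_json):
    """Return the top-level module names imported by a notebook's code cells.

    Hypothetical helper for illustration only; it inspects import statements
    and does not resolve packages or versions.
    """
    notebook = json.loads(notebook_json)
    modules = set()
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format allows "source" to be a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        # Drop IPython magics and shell escapes, which are not valid Python.
        lines = [
            line for line in source.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        try:
            tree = ast.parse("\n".join(lines))
        except SyntaxError:
            # Skip cells that still do not parse (e.g. Python 2 syntax).
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


# A tiny, made-up notebook with one markdown cell and one code cell.
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {
            "cell_type": "code",
            "source": [
                "%matplotlib inline\n",
                "import numpy as np\n",
                "from sklearn.linear_model import LinearRegression\n",
            ],
        },
    ]
})

print(sorted(imported_modules(example)))  # prints ['numpy', 'sklearn']
```

Note that the result contains import names, not installable package names: `sklearn` is distributed as `scikit-learn`. Import statements alone therefore do not identify a package or a version, which is the gap a database mapping APIs to packages and versions, as in step (1), is meant to close.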

Published In

ICSE '21: Proceedings of the 43rd International Conference on Software Engineering
May 2021
1768 pages
ISBN: 9781450390859

Publisher

IEEE Press

Author Tags

  1. API
  2. Environment
  3. Jupyter Notebook
  4. Python

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICSE '21
Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%
