DOI: 10.1109/MSR.2017.10
research-article

Extracting code segments and their descriptions from research articles

Published: 20 May 2017

Abstract

The availability of large corpora of online software-related documents presents an opportunity to use machine learning to improve integrated development environments, starting with the automatic collection of code examples along with their associated descriptions. Digital libraries of computer science research and education articles, from both conferences and journals, can be a rich source of code examples that are used to motivate or explain particular concepts or issues. Because they serve as examples in an article, these code segments are accompanied by natural language descriptions of their functionality, properties, or other associated information. Identifying code segments in these documents is relatively straightforward, so this paper tackles the problem of extracting the natural language text associated with each code segment in an article. We present and evaluate a set of heuristics that address the challenges that arise because, unlike in developer communications such as online forums, the text is often not colocated with the code segment.

Cited By

  • (2022) Enriching Compiler Testing with Real Program from Bug Report. Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pp. 1-12. DOI: 10.1145/3551349.3556894. Online publication date: 10-Oct-2022
  • (2019) A Survey on Research of Code Comment. Proceedings of the 2019 3rd International Conference on Management Engineering, Software Engineering and Service Sciences, pp. 45-51. DOI: 10.1145/3312662.3312710. Online publication date: 12-Jan-2019
  • (2019) Recommending comprehensive solutions for programming tasks by mining crowd knowledge. Proceedings of the 27th International Conference on Program Comprehension, pp. 358-368. DOI: 10.1109/ICPC.2019.00054. Online publication date: 25-May-2019

Published In

MSR '17: Proceedings of the 14th International Conference on Mining Software Repositories
May 2017
567 pages
ISBN: 9781538615447

Publisher

IEEE Press

Author Tags

  1. code snippet description
  2. information extraction
  3. mining software repositories
  4. text analysis

Qualifiers

  • Research-article

Conference

ICSE '17