On the naturalness of software

Authors:

Premkumar Devanbu

ICSE '12: Proceedings of the 34th International Conference on Software Engineering

Pages 837 - 847

Published: 02 June 2012

Abstract

Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations, and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether (a) code can be usefully modeled by statistical language models and (b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's completion capability. We conclude the paper by laying out a vision for future research in this area.
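
To make the n-gram idea in the abstract concrete, here is a minimal sketch in Python. It is not the paper's implementation. The class name `NGramModel`, the trigram order, the add-one smoothing, and the three-line toy corpus are all assumptions chosen for brevity. The sketch estimates the probability of each token from the two tokens before it, reports cross-entropy in bits per token (lower means more predictable), and ranks likely next tokens the way a simple completion engine might.

```python
import math
from collections import Counter, defaultdict


class NGramModel:
    """Toy n-gram model over code tokens (hypothetical helper, add-one smoothing)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()         # (context..., token) -> count
        self.context_counts = Counter()       # (context...) -> count
        self.followers = defaultdict(Counter) # context -> Counter of next tokens
        self.vocab = set()                    # tokens that can be predicted

    def _padded(self, tokens):
        # Pad with start markers so the first tokens also have a full context.
        return ["<s>"] * (self.n - 1) + list(tokens) + ["</s>"]

    def train(self, sequences):
        for tokens in sequences:
            padded = self._padded(tokens)
            self.vocab.update(padded[self.n - 1:])
            for i in range(self.n - 1, len(padded)):
                context = tuple(padded[i - self.n + 1:i])
                token = padded[i]
                self.ngram_counts[context + (token,)] += 1
                self.context_counts[context] += 1
                self.followers[context][token] += 1

    def prob(self, context, token):
        # Add-one smoothing; the extra +1 reserves mass for unseen tokens.
        v = len(self.vocab) + 1
        context = tuple(context)
        return (self.ngram_counts[context + (token,)] + 1) / (
            self.context_counts[context] + v
        )

    def cross_entropy(self, tokens):
        # Average negative log2 probability per token, in bits.
        padded = self._padded(tokens)
        total = 0.0
        count = 0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            total += math.log2(self.prob(context, padded[i]))
            count += 1
        return -total / count

    def suggest(self, prefix, k=3):
        # Rank the tokens most often seen after the last n-1 tokens of prefix.
        padded = ["<s>"] * (self.n - 1) + list(prefix)
        context = tuple(padded[len(padded) - (self.n - 1):])
        seen = self.followers.get(context, Counter())
        return [token for token, _ in seen.most_common(k)]


# Toy corpus: three whitespace-tokenized Java loop headers (illustrative only).
corpus = [
    "for ( int i = 0 ; i < n ; i ++ ) {".split(),
    "for ( int j = 0 ; j < m ; j ++ ) {".split(),
    "for ( int i = 0 ; i < len ; i ++ ) {".split(),
]

model = NGramModel(n=3)
model.train(corpus)

familiar = corpus[0]
unusual = list(reversed(corpus[0]))

print(model.suggest(["for", "("]))  # ['int']
print(model.cross_entropy(familiar) < model.cross_entropy(unusual))  # True
```

On this toy corpus the familiar loop header scores a lower cross-entropy than the same tokens in reverse order, which is the sense in which repetitive code is "predictable". A realistic model would need a much larger corpus, a proper Java lexer, and a stronger smoothing method than add-one.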

Published In

ICSE '12: Proceedings of the 34th International Conference on Software Engineering

June 2012

1657 pages

ISBN: 9781467310673

General Chair: Martin Glinz
Program Chairs: Gail Murphy, Mauro Pezzè

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

IEEE Press

Qualifiers

Article

Conference

ICSE '12: 34th International Conference on Software Engineering

June 2 - 9, 2012

Zurich, Switzerland

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Bibliometrics

Total Citations: 232
Total Downloads: 2,393
Downloads (Last 12 months): 73
Downloads (Last 6 weeks): 8

Reflects downloads up to 04 Oct 2024

Cited By

Dinella E, Lahiri S, Naik M, d'Amorim M (2024). Inferring Natural Preconditions via Program Transformation. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pp. 657-658. Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3663529.3663865
Xia C, Zhang L, Christakis M, Pradel M (2024). Automated Program Repair via Conversation: Fixing 162 out of 337 Bugs for $0.42 Each using ChatGPT. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 819-831. Online publication date: 11-Sep-2024
https://dl.acm.org/doi/10.1145/3650212.3680323
Zhu T, Liu Z, Xu T, Tang Z, Zhang T, Pan M, Xia X, Baysal O, Linares-Vasquez M, Moran K, Steinmacher I (2024). Exploring and Improving Code Completion for Test Code. Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension, pp. 137-148. Online publication date: 15-Apr-2024
https://dl.acm.org/doi/10.1145/3643916.3644421
Pârțachi P, Sugiyama M, Roychoudhury A, Paiva A, Abreu R, Storey M (2024). Bringing Structure to Naturalness: On the Naturalness of ASTs. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, pp. 378-379. Online publication date: 14-Apr-2024
https://dl.acm.org/doi/10.1145/3639478.3643517
Deng Y, Xia C, Yang C, Zhang S, Yang S, Zhang L, Roychoudhury A, Paiva A, Abreu R, Storey M (2024). Large Language Models are Edge-Case Generators: Crafting Unusual Programs for Fuzzing Deep Learning Libraries. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1-13. Online publication date: 20-May-2024
https://dl.acm.org/doi/10.1145/3597503.3623343
Cao S, Sun X, Bo L, Wu R, Li B, Wu X, Tao C, Zhang T, Liu W (2023). Learning to Detect Memory-related Vulnerabilities. ACM Transactions on Software Engineering and Methodology 33:2, pp. 1-35. Online publication date: 23-Dec-2023
https://dl.acm.org/doi/10.1145/3624744
Ding Z, Tang Y, Cheng X, Li H, Shang W (2023). LoGenText-Plus: Improving Neural Machine Translation Based Logging Texts Generation with Syntactic Templates. ACM Transactions on Software Engineering and Methodology 33:2, pp. 1-45. Online publication date: 22-Dec-2023
https://dl.acm.org/doi/10.1145/3624740
Yang J, Wang Y, Lou Y, Wen M, Zhang L, Chandra S, Blincoe K, Tonella P (2023). A Large-Scale Empirical Review of Patch Correctness Checking Approaches. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1203-1215. Online publication date: 30-Nov-2023
https://dl.acm.org/doi/10.1145/3611643.3616331
Wang C, Hu J, Gao C, Jin Y, Xie T, Huang H, Lei Z, Deng Y, Chandra S, Blincoe K, Tonella P (2023). How Practitioners Expect Code Completion? Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1294-1306. Online publication date: 30-Nov-2023
https://dl.acm.org/doi/10.1145/3611643.3616280
Zhao Y, Dong Y, Li G (2023). Seq2Seq or Seq2Tree: Generating Code Using Both Paradigms via Mutual Learning. Proceedings of the 14th Asia-Pacific Symposium on Internetware, pp. 238-248. Online publication date: 4-Aug-2023
https://dl.acm.org/doi/10.1145/3609437.3609465
