DOI: 10.5555/2337223.2337322

On the naturalness of software

Published: 02 June 2012

Abstract

Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations, and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's completion capability. We conclude the paper by laying out a vision for future research in this area.
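To make the n-gram idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It trains a token-level n-gram model on a toy corpus, scores token sequences by cross-entropy (lower means more predictable), and ranks next-token suggestions in the spirit of a completion engine. The class name `NGramModel`, the padding token `<s>`, the toy corpus, and the use of add-one smoothing are all assumptions made for brevity; a real study would likely use a larger corpus, a proper lexer, and a stronger smoothing method.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical padding token used to give early tokens a context


class NGramModel:
    """Token-level n-gram model with add-one smoothing (an assumption of this sketch)."""

    def __init__(self, n=3):
        self.n = n
        # Maps a context tuple of n-1 tokens to counts of the token that followed it.
        self.context_counts = defaultdict(Counter)
        self.vocab = set()

    def _pad(self, tokens):
        return [START] * (self.n - 1) + list(tokens)

    def train(self, sequences):
        for tokens in sequences:
            padded = self._pad(tokens)
            self.vocab.update(tokens)
            for i in range(self.n - 1, len(padded)):
                context = tuple(padded[i - self.n + 1:i])
                self.context_counts[context][padded[i]] += 1

    def prob(self, context, token):
        counts = self.context_counts.get(tuple(context), Counter())
        total = sum(counts.values())
        # The extra +1 in the denominator reserves one slot for unseen tokens.
        return (counts[token] + 1) / (total + len(self.vocab) + 1)

    def cross_entropy(self, tokens):
        """Average negative log2 probability per token; lower is more predictable."""
        if not tokens:
            raise ValueError("cannot score an empty sequence")
        padded = self._pad(tokens)
        log_sum = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            log_sum += math.log2(self.prob(context, padded[i]))
        return -log_sum / len(tokens)

    def suggest(self, prefix, k=3):
        """Return up to k tokens most often seen after the last n-1 tokens of prefix."""
        padded = self._pad(prefix)
        context = tuple(padded[len(padded) - (self.n - 1):])
        counts = self.context_counts.get(context, Counter())
        return [tok for tok, _ in counts.most_common(k)]


# Toy corpus of pre-tokenized Java-like loop headers (illustrative only).
corpus = [
    "for ( int i = 0 ; i < n ; i ++ )".split(),
    "for ( int j = 0 ; j < m ; j ++ )".split(),
    "for ( int k = 0 ; k < n ; k ++ )".split(),
]
model = NGramModel(n=3)
model.train(corpus)

print(model.suggest(["for", "("]))
familiar = model.cross_entropy("for ( int i = 0 ; i < n ; i ++ )".split())
unusual = model.cross_entropy("n ++ ) ( for ; int".split())
print(familiar < unusual)
```

In this sketch, a sequence that follows patterns seen in training receives a lower cross-entropy than a scrambled one, which is the sense in which a model can measure how repetitive and predictable code is. The `suggest` method only looks up raw counts for the current context, so it returns nothing for contexts it has never seen; a practical completion engine would need a fallback for that case.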

Published In

ICSE '12: Proceedings of the 34th International Conference on Software Engineering
June 2012
1657 pages
ISBN: 9781467310673

Publisher

IEEE Press

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%
