Boa: Ultra-Large-Scale Software Repository and Source-Code Mining

Published: 02 December 2015

Abstract

In today's software-centric world, ultra-large-scale software repositories, such as SourceForge, GitHub, and Google Code, are the new library of Alexandria. They contain an enormous corpus of software and related information, and scientists and engineers alike are interested in analyzing it. However, systematically extracting and analyzing relevant data from these repositories to test hypotheses is hard, and is usually left to mining software repositories (MSR) experts. In particular, mining source code yields significant insights into software development artifacts and processes, yet mining source code at a large scale remains a difficult task. Previous approaches had to limit the scope of the projects studied, limit the mining task to a coarser granularity, or sacrifice studying the history of the code. In this article we address mining source code: (a) at a very large scale; (b) at a fine-grained level of detail; and (c) with full history information. To address these challenges, we present domain-specific language features for source-code mining in our language and infrastructure called Boa. The goal of Boa is to ease testing MSR-related hypotheses. Our evaluation demonstrates that Boa substantially reduces programming effort, thus lowering the barrier to entry. We also show drastic improvements in scalability.
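To make "fine-grained" and "with full history" concrete, the following is a minimal sketch in Python, not Boa code and not the authors' implementation. It assumes a hypothetical toy history (`HISTORY`) of three revisions of a single file and uses the standard-library `ast.NodeVisitor` to count function definitions in every revision. The names `HISTORY`, `FunctionCounter`, and `functions_per_revision` are illustrative only.

```python
import ast

# Hypothetical toy history: (revision id, source text) pairs for one file.
# A real mining task would read these snapshots from a version control system.
HISTORY = [
    ("r1", "def f():\n    return 1\n"),
    ("r2", "def f():\n    return 1\n\ndef g(x):\n    return x\n"),
    ("r3", "def g(x):\n    return x * 2\n"),
]


class FunctionCounter(ast.NodeVisitor):
    """Counts function definitions in a single snapshot (AST-level detail)."""

    def __init__(self):
        self.count = 0

    def visit_FunctionDef(self, node):
        self.count += 1
        # Keep descending so nested function definitions are counted too.
        self.generic_visit(node)


def functions_per_revision(history):
    """Run the visitor over every revision, not only the latest one."""
    result = {}
    for revision, source in history:
        counter = FunctionCounter()
        counter.visit(ast.parse(source))
        result[revision] = counter.count
    return result


if __name__ == "__main__":
    print(functions_per_revision(HISTORY))
```

The sketch handles one file on one machine. The abstract's point is that doing the same across entire repositories and their complete histories is where the difficulty lies.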

Published In

ACM Transactions on Software Engineering and Methodology, Volume 25, Issue 1
December 2015
339 pages
ISSN: 1049-331X
EISSN: 1557-7392
DOI: 10.1145/2852270
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 December 2015
Accepted: 01 July 2015
Revised: 01 June 2014
Received: 01 January 2014
Published in TOSEM Volume 25, Issue 1

Author Tags

  1. Boa
  2. domain-specific language
  3. ease of use
  4. lower barrier to entry
  5. mining software repositories
  6. scalable

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • US National Science Foundation (NSF)

Cited By

  • (2024) Boidae: Your Personal Mining Platform. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, 40-43. DOI: 10.1145/3639478.3640026. Online publication date: 14-Apr-2024
  • (2024) Code search engines for the next generation. Journal of Systems and Software 215 (112065). DOI: 10.1016/j.jss.2024.112065. Online publication date: Sep-2024
  • (2024) Promoting open science in test-driven software experiments. Journal of Systems and Software 212:C. DOI: 10.1016/j.jss.2024.111971. Online publication date: 1-Jun-2024
  • (2024) Sahand 1.0: A new model for extracting information from source code in object-oriented projects. Computer Standards & Interfaces 88 (103797). DOI: 10.1016/j.csi.2023.103797. Online publication date: Mar-2024
  • (2024) Can instability variations warn developers when open-source projects boost? Empirical Software Engineering 29:4. DOI: 10.1007/s10664-024-10482-4. Online publication date: 14-Jun-2024
  • (2023) Big Code Search: A Bibliography. ACM Computing Surveys 56:1 (1-49). DOI: 10.1145/3604905. Online publication date: 26-Aug-2023
  • (2023) Fingerprinting and Building Large Reproducible Datasets. Proceedings of the 2023 ACM Conference on Reproducibility and Replicability, 27-36. DOI: 10.1145/3589806.3600043. Online publication date: 27-Jun-2023
  • (2023) Learning the Relation Between Code Features and Code Transforms With Structured Prediction. IEEE Transactions on Software Engineering 49:7 (3872-3900). DOI: 10.1109/TSE.2023.3275380. Online publication date: 1-Jul-2023
  • (2023) A Statistical Method for API Usage Learning and API Misuse Violation Finding. 2023 IEEE/ACIS 21st International Conference on Software Engineering Research, Management and Applications (SERA), 358-365. DOI: 10.1109/SERA57763.2023.10197708. Online publication date: 23-May-2023
  • (2023) Test-driven development, engagement in activity, and maintainability. IET Software 17:4 (509-525). DOI: 10.1049/sfw2.12135. Online publication date: 27-Jul-2023
