Automatic query reformulations for text retrieval in software engineering

Authors:

Gabriele Bavota,

Andrian Marcus,

Andrea De Lucia,

Tim Menzies

ICSE '13: Proceedings of the 2013 International Conference on Software Engineering

Pages 842 - 851

Published: 18 May 2013

Abstract

More than twenty distinct software engineering tasks are addressed with text retrieval (TR) techniques, including traceability link recovery, feature location, refactoring, and reuse. A common issue across all TR applications is that the retrieval results depend largely on the quality of the query. When a query performs poorly, it has to be reformulated, which is a difficult task for someone who had trouble writing a good query in the first place.

We propose a recommender (called Refoqus) based on machine learning, which is trained with a sample of queries and relevant results. Then, for a given query, it automatically recommends a reformulation strategy that should improve its performance, based on the properties of the query. We evaluated Refoqus empirically against four baseline approaches that are used in natural language document retrieval. The evaluation data corresponds to changes from five open source systems written in Java and C++, and it is used in the context of TR-based concept location in source code. Refoqus outperformed the baselines, and its recommendations led to query performance improvement or preservation in 84% of the cases on average.
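The general idea of recommending a reformulation strategy from query properties can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy corpus, the two query properties (term count and mean inverse document frequency), the strategy labels `expand`, `reduce`, and `keep`, and the 1-nearest-neighbour learner that stands in for whatever model Refoqus actually uses.

```python
from math import dist, log

# Hypothetical toy corpus: each document is a set of terms.
CORPUS = [
    {"open", "file", "dialog"},
    {"save", "file", "buffer"},
    {"parse", "xml", "config"},
    {"render", "dialog", "window"},
]

# Hypothetical training sample: (query, strategy assumed to work best).
# In a real setting the label would come from comparing each strategy's
# retrieval results against the known relevant documents.
TRAINING = [
    ("file", "expand"),
    ("dialog", "expand"),
    ("open save file buffer dialog window error when user clicks", "reduce"),
    ("parse xml config", "keep"),
]


def query_properties(query):
    """Return (number of terms, mean smoothed inverse document frequency)."""
    terms = query.lower().split()
    n_docs = len(CORPUS)
    idfs = []
    for term in terms:
        doc_freq = sum(1 for doc in CORPUS if term in doc)
        idfs.append(log((n_docs + 1) / (doc_freq + 1)))
    mean_idf = sum(idfs) / len(idfs) if idfs else 0.0
    return (float(len(terms)), mean_idf)


def train(samples):
    """Store each training query as a (properties, strategy) pair."""
    return [(query_properties(query), strategy) for query, strategy in samples]


def recommend(model, query):
    """Recommend the strategy of the training query with the closest properties."""
    props = query_properties(query)
    nearest = min(model, key=lambda item: dist(item[0], props))
    return nearest[1]


model = train(TRAINING)

if __name__ == "__main__":
    print(recommend(model, "window"))
    print(recommend(model, "save the file buffer when the dialog window closes now"))
```

The properties are left unscaled here, so term count dominates the distance; a more careful version would normalize the features and use a richer set of query properties.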

Published In

ICSE '13: Proceedings of the 2013 International Conference on Software Engineering

May 2013

1561 pages

ISBN:9781467330763

General Chair: David Notkin

Program Chairs: Betty H. C. Cheng, Klaus Pohl

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

IEEE Press

Conference

ICSE '13

Sponsor:

SIGSOFT

ICSE '13: 35th International Conference on Software Engineering

May 18 - 26, 2013

San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%
