CCFinder: a multilinguistic token-based code clone detection system for large scale source code

Published: 01 July 2002

Abstract

A code clone is a code portion in source files that is identical or similar to another. Since code clones are believed to reduce the maintainability of software, several code clone detection techniques and tools have been proposed. This paper proposes a new clone detection technique, which consists of the transformation of input source text and a token-by-token comparison. To implement it with several useful optimization techniques, we have developed a tool, named CCFinder, which extracts code clones in C, C++, Java, COBOL, and other source files. Metrics for code clones have also been developed. To evaluate the usefulness of CCFinder and the metrics, we conducted several case studies in which we applied the new tool to the source code of JDK, FreeBSD, NetBSD, Linux, and many other systems. As a result, CCFinder effectively found clones, and the metrics effectively identified the characteristics of the systems. In addition, we have compared the proposed technique with other clone detection techniques.
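The two steps named in the abstract, transformation of the source text followed by token-by-token comparison, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the regular-expression lexer, the small keyword set, the `$p` placeholder, and the function names `normalize` and `clone_windows` are hypothetical, and the naive window index below stands in for whatever optimized matching CCFinder itself uses.

```python
import re
from collections import defaultdict

# Hypothetical, simplified lexer: identifiers, integers, or one punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"if", "else", "for", "while", "return", "int"}  # illustrative subset only


def normalize(source):
    """Tokenize and replace every identifier or number with the placeholder '$p'."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)          # keywords are kept as they are
        elif tok[0].isalnum() or tok[0] == "_":
            tokens.append("$p")         # names and literals become a parameter token
        else:
            tokens.append(tok)          # punctuation is kept as it is
    return tokens


def clone_windows(tokens_a, tokens_b, min_len):
    """Return (i, j) start positions where min_len consecutive tokens are equal."""
    index = defaultdict(list)
    for i in range(len(tokens_a) - min_len + 1):
        index[tuple(tokens_a[i:i + min_len])].append(i)
    pairs = []
    for j in range(len(tokens_b) - min_len + 1):
        for i in index.get(tuple(tokens_b[j:j + min_len]), []):
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    a = normalize("int total = 0; for (i = 0; i < n; i++) total = total + x[i];")
    b = normalize("int sum = 0;\nfor (k = 0; k < len; k++)\n  sum = sum + data[k];")
    # Renamed variables and different line breaks no longer matter after normalization.
    print(a == b, clone_windows(a, b, min_len=10)[:3])
```

The two fragments differ in variable names and layout, yet they normalize to the same token sequence, which is why a comparison at the token level can report them as a clone pair.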

Reviews

Andrew Brooks

A token-based code clone detection system called CCFinder is described in this paper. A clone pair is a pair of identical or similar code portions that could be merged into a single routine to reduce the maintenance burden. The paper describes a token-by-token matching algorithm that employs several optimization techniques, making analysis of industrial-strength software practical. Language dependency is restricted: developing the Java subcomponent took only two person-days.

Several metrics are developed, and results are presented for the source code of the Java Development Kit 1.3.0, FreeBSD 4.0, NetBSD 1.5, and Linux 2.4.0. Many very similar source files are reported in javax/swing/*.java. The paper contains a convincing visualization of strong similarities between FreeBSD and NetBSD (over 25,000 clone pairs). Between FreeBSD and Linux, 252 of 1,091 clone pairs (23 percent) were detected across line breaks, indicating how many clones line-by-line matching algorithms can miss.

The transformation, optimization, and other implementation techniques employed by CCFinder implicitly define similarity and what a clone pair is. The paper shows the dramatic effects of disabling various techniques on the number of clone pairs detected. However, these results are not related to the metric values for clones, so it remains unclear which set of techniques is optimal. The paper does not report applying the tool to itself. Considerable insights might have emerged had such an investigation been undertaken, and had the CCFinder code been refactored to merge clone pairs. Nevertheless, this paper represents a major contribution to code clone detection, and is highly recommended to specialists in software maintenance.

Online Computing Reviews Service
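The reviewer's point about line breaks can be made concrete with a small sketch. It assumes a deliberately crude lexer and two made-up code fragments; the helper names `lines_of` and `tokens_of` are hypothetical and are not taken from the paper. The last line simply checks the arithmetic behind the 23 percent figure quoted in the review.

```python
import re


def lines_of(src):
    """Trimmed, non-empty lines: the unit a line-by-line matcher compares."""
    return [ln.strip() for ln in src.splitlines() if ln.strip()]


def tokens_of(src):
    """Crude lexer (an assumption): runs of word characters or one punctuation character."""
    return re.findall(r"\w+|\S", src)


# The same statement, written on one line and reflowed over four lines.
one_line = "if (n > 0) { total = total + n; }"
reflowed = "if (n > 0)\n{\n    total = total + n;\n}"

shared_lines = set(lines_of(one_line)) & set(lines_of(reflowed))
same_tokens = tokens_of(one_line) == tokens_of(reflowed)

# Share of FreeBSD/Linux clone pairs that the review says span line breaks.
share = 252 / 1091

if __name__ == "__main__":
    print(len(shared_lines), same_tokens, round(share * 100))
```

The two fragments have no line in common, so a matcher that compares whole lines sees nothing shared, while their token sequences are identical.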

Published In

IEEE Transactions on Software Engineering, Volume 28, Issue 7
July 2002
96 pages

Publisher

IEEE Press

Author Tags

  1. CASE tool
  2. code clone
  3. duplicated code
  4. maintenance
  5. metrics

Qualifiers

  • Research-article

Cited By

  • (2024) Cross-language Source Code Clone Detection Based On Graph Neural Network. Proceedings of the 2024 3rd International Conference on Cryptography, Network Security and Communication Technology, pp. 189-194. DOI: 10.1145/3673277.3673310. Online publication date: 19-Jan-2024
  • (2024) CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection. Proceedings of the ACM on Software Engineering 1:FSE, pp. 1564-1584. DOI: 10.1145/3660777. Online publication date: 12-Jul-2024
  • (2024) Evaluating the Effectiveness of Deep Learning Models for Foundational Program Analysis Tasks. Proceedings of the ACM on Programming Languages 8:OOPSLA1, pp. 500-528. DOI: 10.1145/3649829. Online publication date: 29-Apr-2024
  • (2024) Deep Is Better? An Empirical Comparison of Information Retrieval and Deep Learning Approaches to Code Summarization. ACM Transactions on Software Engineering and Methodology 33:3, pp. 1-37. DOI: 10.1145/3631975. Online publication date: 15-Mar-2024
  • (2024) On Detecting and Measuring Exploitable JavaScript Functions in Real-world Applications. ACM Transactions on Privacy and Security 27:1, pp. 1-37. DOI: 10.1145/3630253. Online publication date: 5-Feb-2024
  • (2024) Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1-13. DOI: 10.1145/3597503.3639216. Online publication date: 20-May-2024
  • (2024) DSFM: Enhancing Functional Code Clone Detection with Deep Subtree Interactions. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1-12. DOI: 10.1145/3597503.3639215. Online publication date: 20-May-2024
  • (2024) Machine Learning is All You Need: A Simple Token-based Approach for Effective Code Clone Detection. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1-13. DOI: 10.1145/3597503.3639114. Online publication date: 20-May-2024
  • (2024) VarGAN: Adversarial Learning of Variable Semantic Representations. IEEE Transactions on Software Engineering 50:6, pp. 1505-1517. DOI: 10.1109/TSE.2024.3391730. Online publication date: 25-Apr-2024
  • (2024) Federated Learning for Software Engineering: A Case Study of Code Clone Detection and Defect Prediction. IEEE Transactions on Software Engineering 50:2, pp. 296-321. DOI: 10.1109/TSE.2023.3347898. Online publication date: 1-Feb-2024
