DOI: 10.1145/3524842.3527959

BotHunter: an approach to detect software bots in GitHub

Published: 17 October 2022

    Abstract

    Bots have become popular in software projects because they play critical roles, from running tests to fixing bugs and vulnerabilities. However, the large number of software bots makes it harder for practitioners and researchers to distinguish human accounts from bot accounts, which is necessary to avoid bias in data-driven studies. Researchers have developed several approaches to identify bots at specific activity levels (issue/pull request or commit), but these approaches consider a single repository and disregard features that have proven effective in other domains. To address this gap, we propose a machine learning-based approach that identifies bot accounts regardless of their activity level. We selected and extracted 19 features related to the account's profile information, activities, and comment similarity. Then, we evaluated the performance of five machine learning classifiers on a dataset of more than 5,000 GitHub accounts. Our results show that the Random Forest classifier performs best, with an F1-score of 92.4% and an AUC of 98.7%. Furthermore, the account profile information (e.g., account login) contains the most relevant features for identifying the account type. Finally, we compare our Random Forest classifier to the state-of-the-art approaches, and our results show that our model outperforms them in identifying the account type regardless of activity level.
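
    For readers unfamiliar with the two metrics reported in the abstract, the following is a minimal sketch of how an F1-score and an AUC can be computed for a binary bot/human classifier. It is not the authors' evaluation code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, and bots are arbitrarily treated as the positive class.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (1 = bot): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random bot outscores a random human.

    Ties between a bot score and a human score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data invented for illustration: 1 = bot, 0 = human.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
# Hypothetical classifier scores (estimated probability of being a bot).
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
# Assumed decision threshold of 0.5.
preds = [1 if s >= 0.5 else 0 for s in scores]

print(f"F1 = {f1_score(labels, preds):.3f}, AUC = {auc(labels, scores):.3f}")
```

    The F1-score depends on the chosen threshold, while the AUC summarizes ranking quality across all thresholds, which is presumably why studies like this one report both.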

    Published In

    MSR '22: Proceedings of the 19th International Conference on Mining Software Repositories
    May 2022
    815 pages
    ISBN: 9781450393034
    DOI: 10.1145/3524842
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article

    Funding Sources

    • National Science Foundation
    • CNPq

    Conference

    MSR '22

    Article Metrics

    • Downloads (last 12 months): 104
    • Downloads (last 6 weeks): 7
    Reflects downloads up to 14 Aug 2024

    Cited By

    • (2023) BDGOA: A bot detection approach for GitHub OAuth Apps. Intelligent and Converged Networks 4:3 (181-197). DOI: 10.23919/ICN.2023.0006. Online publication date: Sep-2023
    • (2023) A Dataset of Bot and Human Activities in GitHub. 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR) (465-469). DOI: 10.1109/MSR59073.2023.00070. Online publication date: May-2023
    • (2023) Towards Better Online Communication for Future Software Development in Industry. 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC) (1619-1624). DOI: 10.1109/COMPSAC57700.2023.00250. Online publication date: Jun-2023
    • (2023) From Commit Message Generation to History-Aware Commit Message Completion. 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) (723-735). DOI: 10.1109/ASE56229.2023.00078. Online publication date: 11-Sep-2023
    • (2023) GitHub Actions: The Impact on the Pull Request Process. Empirical Software Engineering 28:6. DOI: 10.1007/s10664-023-10369-w. Online publication date: 26-Sep-2023
    • (2023) The GitHub Development Workflow Automation Ecosystems. Software Ecosystems (183-214). DOI: 10.1007/978-3-031-36060-2_8. Online publication date: 26-May-2023
    • (2023) An Introduction to Software Ecosystems. Software Ecosystems (1-29). DOI: 10.1007/978-3-031-36060-2_1. Online publication date: 26-May-2023
