To understand the role of online communities in helping developers build trust with AI code generation tools, we conducted a semi-structured interview study with 17 developers. To answer RQ1, our findings reveal that online communities shape developers’ trust in AI-powered code generation tools through community-curated experiences and community-generated evaluation signals. In Sections 3.2 and 3.3, we delve into what these features of online communities entail, how they help developers build trust with AI code generation tools, and what challenges developers face when engaging with these aspects. We also synthesize these findings into two pathways that extend the MATCH model, illustrating how user communities can support appropriate user trust in AI (Section 3.4).
3.2 Finding 1: Building Trust with Community-curated Experiences
For developers, an important aspect of trust in AI code generation tools is whether, and how comfortably, they can integrate the tool into their programming workflow. Developers leverage online platforms such as X, Reddit, and GitHub to first discover and learn about specific AI tools. Beyond initial discovery, developers also continue to share their own tool experiences and explore others’ experiences on these platforms. We refer to such collections of shared accounts of community members’ interactions with AI as community-curated experiences, which allow developers to form a mental model of the capacity of AI code generation tools and of how they can use those tools, building trust in the process. In this section, we discuss (1) what types of community-curated experiences developers like to see, (2) how community-curated experiences help developers build trust with AI tools, and (3) what challenges developers face in sharing and consuming experiences in communities.
3.2.1 What Types of Community-curated Experiences Are Helpful.
Developers seek specific features in community-curated experiences that can effectively help them build and calibrate trust. We identified four important features of an ideal experience-sharing post: vivid description and explanation, realistic programming tasks, inclusion of diverse use cases, and details on project setup and dependencies.
Vivid description and explanation. One of the most important features that developers would like to see in the community is others’ specific experiences interacting with an AI—in P13’s words, “concrete examples of how someone’s using a tool and how it helps them.” These posts typically contain details such as the programming task, the prompt that elicited the AI suggestions, and the outcomes of those suggestions, often supported by screenshots or video recordings. For example, P2 shared that they were convinced to try out Copilot by a YouTube video because they were able to see “somebody actively working with it and I can be like, that’s how it would work for me too.” (P2) Similarly, P13 stated that a video demo could vividly display how the AI assisted users and help them recognize the benefit it could bring: “when you watch it, parts of it can translate on whatever you do and you can make your workflow better.” In addition to a vivid, detailed description of the interaction, developers would also like to see the users’ explanations, opinions, and reflections on the interaction and the AI suggestion. For instance, P3 explained that an effective demonstration of a vivid experience with AI should explain that “I wrote this suggestion—look what it was able to come up with, and why you should use it as well, because it could help you with something like this.” With the explanation, developers can achieve a more accurate understanding of how the AI behaved in the user’s example.
Realistic programming tasks. It is also important that the shared experience is situated in a programming task that represents the typical, realistic workflow of a developer. Instead of watching AI solve predefined programming challenges, developers want to see the user showcase an “actual project” that they are “deeply nested into,” working with “lots of files, lots of context all around.” (P3) This type of demonstration can help developers anticipate the performance of AI in the complex situations they face in their day-to-day work. Specifically, developers would like to see both cases where the AI makes helpful suggestions and cases where it fails to help. For example, P16 preferred to see posts without “a pre-biased judgment” and stated that “it’s better to see how people who are actually using it in real world face issues.” It was important to see incidents where the AI gave wrongful suggestions so that they could understand its limitations. Working with AI on a realistic project gives such incidents opportunities to occur:
I think the intention shouldn’t be to use Copilot or not. They can just keep it running in the back and look at its suggestions and just work [how] they would normally work... Then you can see if it helps, it helps, but otherwise, you’re just dismissing the suggestions. Then we would know whether it is helping. (P10)
Inclusion of diverse use cases. Furthermore, developers search for diverse use cases of an AI tool—different programming languages, tasks, scenarios, and so on—to learn what is possible within its capacity. For example, P13 shared their experience of going down a long “YouTube hole” when they first learned about Copilot and got exposed to the breadth of its use cases: “there’s so much capability it has! I had a lot of fun just watching the way people use Copilot, front-end development, back-end development, machine learning development...” When introducing Copilot to their friends, P13 always demonstrated it in a variety of programming languages and tasks, including how it successfully suggested a complex function in OCaml but got stuck with the “<div>”s in HTML: “Because that’s what I experienced, I’m going to show them the full picture of what I’ve done.” (P13) Diverse use cases also help developers become aware of the boundary of the AI’s capacity. Similarly, P2 shared that they once posted about how they used Copilot in a multicomponent, full-stack web development project to demonstrate its ability. P2 was proud of this post, as it showed how Copilot behaved in a variety of programming languages and tasks, providing a comprehensive picture of its capacity:
I think just like one specific code example isn’t really enough... If one person says, here’s a screenshot of a cool thing co-pilot recommended, I don’t know if that be enough to convince me... That doesn’t prove that like co-pilot is good overall. (P2)
Details on project setup and dependencies. In order to make use of others’ experiences with AI code generation tools, developers favor posts that include details on the setup and dependencies of the project, for example, “the tool, what [the user] was planning to program, the language, the purpose, everything.” (P1) Developers value this information because they want to reproduce the interaction with AI in their own use cases. For example, P1 preferred reviews that are “something someone can follow and you’ll get the expected results.” In another example, P6 once saw a video about a problem similar to one they had with Copilot and would like to “see someone doing it, just follow it step by step and to see.” Replicability is crucial to trust building, as our participants explained that they needed to get hands-on experience with an AI tool before deciding how much to trust it. Developers would like to replicate others’ successful interactions with AI to help with their particular use cases, while sometimes also examining wrongful suggestions that they saw online before making their own judgment: “If I see something negative I don’t like to say that, okay, this is all useless let’s stop using Copilot... I try it with caution, I try to reproduce it.” (P4)
3.2.2 How Community-curated Experiences Help Developers Build Trust with AI Tools.
Our findings suggest that developers can build appropriate trust with AI code generation tools as a result of engaging with community-curated experiences. Specifically, developers can build a mental model of the AI’s capacity from others’ experiences, which includes setting reasonable expectations, learning strategies on when to trust the AI, forming an empirical understanding of how the AI generates suggestions, and developing awareness of the broader implications of AI-generated code.
Setting reasonable expectations on the capability of the AI. Engaging with community-generated content about experiences with AI code generation tools helps developers form appropriate expectations of those tools. Before trying out a tool, our participants shared that they had little knowledge about its ability and how it would perform in their particular use cases; as P7 put it, “I’ve not tried anything like that before so I didn’t really know what I was expecting.” Consuming other users’ “anecdotes” with the tools allows developers to form an initial mental model of the AI and decide if it is trustworthy enough to give it a try: “I read a bunch of what people think of the eventual outcome... [It] helps me make my own perception of whether it is something that is useful for me or not.” (P16)
Forming reasonable expectations also means learning about important constraints of the AI. An unrealistically high expectation could undermine developers’ trust, as they may not be prepared for those constraints. For instance, P14 shared that they initially thought Tabnine could largely automate their programming tasks because of an online post claiming that Tabnine wrote 20–30 percent of the code in a file, but then found that this was not the case in their actual usage: “I thought it was going to give me more detailed prediction on my code... But once I tried it out, I just saw that it’s mostly just complete some lines.” (P14) Out of disappointment, P14 stopped using Tabnine. In contrast, a reasonable expectation allows developers to anticipate certain failures, so that their trust is not broken by surprising negative experiences. For instance, P13 shared that a video demonstrating a variety of successful and failed use cases of Kite helped them realistically estimate how much it could improve productivity: “I set my expectations lower. I wasn’t like, ‘wow, I thought Kite would write my entire code file for me.’ ” Such realistic expectations allow developers to understand the boundary of the AI’s capacity and plan how they will use the AI accordingly. For example, P4 anticipated that Copilot would “fail in some cases” and did not expect Copilot “to work all the time,” but still decided to try it out with caution and validation.
Learning strategies of when to trust the AI. Community-generated content can also help developers develop strategies for when to trust and how to effectively use AI in their specific use cases. When developers feel unsure whether they can trust AI for particular tasks in their workflow, they go to online communities to seek relevant examples. For instance, P16 shared that they looked into community discussions to learn which tasks they could trust Copilot to do:
The public discussion has definitely helped with the trust: this is a language translation that I can trust. This is a module transmission I can trust... It’s like all I need to do is find already if someone has tried to use the auto-complete. (P16)
Developers can also learn from others’ experiences about areas where AI should not be trusted, and consequently form strategies to minimize any harm. For example, P1 shared an experience in which they tried to get Kite to make high-quality suggestions in a project that involved multiple programming languages. After browsing online communities, they learned from other users that Kite generally did not perform well in some languages and decided that they should instead “stick to a particular programming language.” In particular, developers would like to see examples relevant to their specific use cases, as these offer previews of how the AI would function in their own context, allowing them to decide how much to trust it and whether to adopt it at all:
When I found out the reviews do not affect the areas of function that I want to use the app for, I will not really be bothered about checking out the app. But when I found out the negative reviews is [in] the area where I want to use the application for, I tend to not have enough trust for the application and mostly I end up not checking out at all. (P5)
Forming empirical understandings of how the AI generates particular suggestions. Online communities can also help developers understand how AI code generation tools work. Participants shared that they contemplated why the AI gave certain suggestions in particular scenarios, for example, “why does it make a certain choice over another? Why is the naming convention of this variable something else?” (P8) Knowing the inner decision processes and rationales of the AI is crucial to trust, as developers can then make decisions based on the reliability of the decision process. In online communities, members share their prompts and the AI’s suggestions, and collectively build an empirical understanding of the otherwise black box: “it’s a super black box, so we’re collectively trying to figure out how to best use them.” (P10) In particular, participants expressed appreciation for discussions around wrongful AI suggestions. In those discussions, users post bizarre code written by the AI along with the surrounding context, speculate about potential causes of those suggestions, and sometimes exchange ideas for experiments to test their theories. For example, P4 shared an experience where they engaged deeply in a discussion thread and tried to reverse engineer the prompt that resulted in a wrongful suggestion by Copilot: “I’d like to see more of this type of discussions. [We] see examples where it is going wrong and trying to figure out what went wrong.” (P4) Similarly, P16 enjoyed reading posts in which users shared experiments with GPT-3 prompts, as those discussions helped them empirically summarize the factors that affect AI suggestions:
If you see the problem that they wrote and you see the paragraph that they get then it’s very easy to figure out the breaking points that this word is contributing for this sentence to be generated, and basically bring a bit of interpretability to the machine learning model. (P16)
Developing awareness of broader implications of AI-generated code. Developers also learn about the implications of using AI models beyond their immediate use cases by participating in online communities. These implications include the legal and ethical impact of the code generated by AI, as well as potential security concerns. For example, P3, a young programmer, only learned about the controversial impact of Copilot on FLOSS after following a debate on GitHub repositories:
If I had not seen the discussion, I don’t think I ever would have thought about it... But once I did see a lot of people talking about the licensing issues... I feel like I should be a little more careful with what I do. (P3)
Although those legal and ethical implications did not directly affect P3’s current use cases of Copilot, P3 was glad to be aware of those issues to prepare for wider use cases in the future. In terms of security implications, while experienced developers may already know about the vulnerabilities that AI can introduce, beginner programmers rely on community-generated resources to learn about these concerns. For example, participants reported that they had seen posts about Copilot suggesting others’ private keys. These implications are crucial for trust, as they can help developers stay alert and avoid serious consequences:
People learned that they should be maybe careful against this in critical software. That’s a good thing that people were discussing it because there’s probably less probability of the worst-case happening. (P12)
Seeing security issues introduced by AI reminds developers of the importance of manually reviewing AI-suggested code and of their role as the supervisor of the AI. For example, P8 recalled that a discussion about a security breach caused by code written by Copilot made them invest more time in verification:
I would be more careful while using the application, just so I would not just write all my code and just publish it that way without taking time to actually verify. (P8)
3.2.3 What Challenges Developers Face in Sharing and Consuming Experience in Communities.
Despite these benefits, developers face a variety of challenges in engaging with community-curated experiences. Some participants found it hard to reproduce others’ experiences because vivid descriptions and dependency details were often missing, and others complained about the lack of diverse use cases or realistic programming tasks. We summarize three significant challenges that developers face in effectively sharing or consuming experiences with AI tools: lack of channels dedicated to experience sharing, high cost of describing and sharing experience, and missing details for reproducibility.
Lack of channels dedicated to experience sharing. A challenge developers face is the lack of platforms dedicated to sharing experience with AI code generation tools. On general platforms like X and Reddit, sharing and discussions about specific experience with AI code generation tools can be buried in more general discussions about AI. For example, P16 shared that they had been apprehensive about engaging in high-level discussions about AI, since those discussions could often become controversial and distant from specific user experience:
Most of the discussion start with some random Atlanta or New York post headline saying something like, AI is going to take your job or anything like that... the headline is so sensational that people have really strong opinions over it and I do not really like engaging with. (P16)
The participants expressed a desire for a channel dedicated to end-user experience sharing and discussions about AI code generation tools. As P10 imagined, such a channel could be “a common forum just specific for Copilot or something, like crowd sourcing of ideas.” Similarly, P13 suggested a “centralized place” for Copilot-related experience, where people share “code snippets and they’ll be, ‘here once you have access to Copilot, try this code snippet in your repository, see what happens.’ ” In addition, P7 proposed a search function for “the feature you want and bring out posts about related AI co-generation tools,” so that users can easily locate examples of particular use cases.
High cost of describing and sharing experience. For developers, writing a post about their experience with AI code generation tools and sharing it online is a time-consuming and complex task. Given the interactive nature of AI tools, it is difficult to effectively describe the interaction process. For example, P2 shared that it is hard to describe AI suggestions when making a text-based social media post: “You can’t really have a code block that differentiates between I wrote this code and then this is the code that was recommended.” (P2) For developers who choose to share a video of their interaction with the AI, making the video consumable by other developers requires a lot of planning about what to record and how to explain it. For example, P2, who once prepared an elaborate video post about how Copilot helped them in a complicated project (see Section 3.2.1), complained about the tedious process of preparing the video:
I spent a lot of time learning the intricacies of, like, for that specific project, what would Copilot recommend and all that stuff. I was up until like 7 a.m. (P2)
In addition, posting on a public online platform can be intimidating. Developers may feel self-conscious when sharing their experiences and opinions with a large audience about whom they know little, including how they would react. As a result, developers may spend a lot of time polishing the post they are trying to share, adding further investment to this task. P11 contrasted sharing their experience with Copilot among a group of friends with making a public post:
[With friends,] I don’t really care about going deep into discussion in order to appear perfect. But when it come to posting online, I’m most mindful of how my writing will look. (P11)
Missing details for reproducibility. Another challenge is the lack of reproducibility in the experiences shared by developers online. As we show in Section 3.2.1, developers build trust with AI by trying out prompts from online sources in their own workflow. However, when sharing their experiences, developers can overlook the setup of their project and the dependencies of their environment: “people don’t exactly share their VSCode settings, [which] could be very much catered to the way they code or their programming preferences.” (P9) Developers are disappointed when they are unable to replicate, in their own environment, AI interactions that they saw online. For example, P13 complained about the lack of functionality to “copy and paste text from the video” (where the other user interacts with Copilot) and wished for a “testing environment” where they could directly replicate the interaction. The lack of reproducibility can leave developers unaware of diverse use cases and possible limitations of the AI, resulting in biased views. Some of our participants even deemed particular use cases that they saw online but had not experienced themselves as “misinformation,” because they were not able to confirm for themselves that such an interaction is possible: “if I see that people are contrary on something that can’t be verified, I would be like, this is a misinformation.” (P7)
3.3 Finding 2: Building Trust with Community-generated Evaluation Signals
Another important aspect of trust in AI code generation tools for developers is whether and how they can leverage specific AI suggestions. Developers can draw on signals from code solutions posted by community members to evaluate AI-suggested code and make decisions. We refer to these contributions of community members as community-generated evaluation signals. In the following sections, we explain how and why communities can offer effective support for the evaluation of AI output. Specifically, we unpack (1) what evaluation signals communities can offer, (2) how evaluation signals help developers build trust with AI tools, and (3) the tradeoff between evaluation signals and productivity.
3.3.1 What Evaluation Signals Communities Can Offer.
We identified three major evaluation signals that developers’ online communities can offer: direct indicators of code quality, context and generation process of a code solution, and identity signals.
Direct indicators of code quality. Unlike AI code generation tools, online communities offer direct metrics that can help developers evaluate the quality of code solutions. Many online communities, such as Stack Overflow, allow users to vote on or rate solutions. Participants trusted a solution selected by such a voting mechanism because it had been reviewed by many others: “if other programmers have used that solution and it worked in their code, they’ll upvote the solution and so that does give you a little bit more faith that this is a good solution.” (P3) The voting mechanism also implies that multiple real developers have used the code, indicating its safety and trustworthiness. For instance, for P10, a solution selected by the question owner on Stack Overflow indicated that “the person who originally asked the question has tried it out and it worked for them,” providing “some guarantee that, that code compiles or it’s very close to what I want, which does not exist with Copilot.”
Another explicit indicator of code quality is how much engagement a solution receives. In online communities, developers can see how many other members viewed, rated, shared, commented on, or otherwise engaged with a post. High and positive engagement with a solution usually means that it “has been cross-verified by a lot of people,” (P4) indicating high quality. As P6 summarized, “when it’s coming from plenty of different people and all bring positive reply, I’ll be like, ‘yeah, it works for everyone and not just for a single person.’ ” With high engagement, developers can see verification triangulated from multiple sources, which informs their evaluation: “I trust that because it is not dealing with just from one person’s perspective—different developers with different ideas, coming together to make their review.” (P1) When a solution does not get much engagement, developers tend to be more skeptical about it, since it has not been verified by other users. For example, P14 explained why they did not have a high degree of trust in Tabnine after seeing a single post about it: “because of the fact that just a random post from a random user with no really much engagements.” (P14)
Context and generation process of a code solution. Developers also appreciate that online communities offer a broader context beyond the code solution itself, which helps with evaluation. Platforms such as Stack Overflow support discussions around user-contributed solutions. Participants expressed appreciation for these discussions because, compared to the solutions suggested by AI, they are more “detailed” and “interactive” (P12) and provide “a lot more transparency” into how a solution is generated (P3). Following a discussion allows developers to understand why a solution was chosen as the optimal one:
They will go through the process. What happened? How did it happen? ... then people will go through different solutions. There is a lot more storytelling involved. (P9)
You can see this back-and-forth between people until one answer becomes the accepted answer for that question. Somebody is suggesting some code and then in the comments people point out, oh, you have a point here or like there’s an issue with this thing. (P12)
Understanding how a code solution is generated helps developers better judge its quality. Because users cannot see how an AI suggestion is decided, they have trouble judging its validity, as P3 described: “in some cases you have no idea why something is working, although it works.” Sometimes, the solution generated by AI may not be optimal. For example, P3 shared an experience where Copilot suggested a bug-free but inefficient implementation of a function. While inexperienced programmers often have trouble identifying issues in examples like this on their own, they can leverage online discussions where multiple, more experienced members scrutinize the solution, spot issues, and even propose more effective alternatives, as P3 reflected: “if [I] had gone with Stack Overflow for something like that, they would have said, ‘here’s the inefficient solution, this is why it’s inefficient, and here’s how you can make it more efficient.’ ”
Identity signals. Participants shared that knowing the identity of a code’s author allows them to evaluate the code based on the author’s background and experience level. In communities like Stack Overflow and GitHub, users can view members’ contribution history, which indicates what and how much code they have contributed and how well that code has been received in the community. Common gamification mechanisms, such as levels and badges, also signal a user’s expertise. All these factors can help developers decide to what extent they can trust the code written by a particular user:
I’m copying this code of this guy which has all the badges and everything—he probably knows what he’s doing. (P12)
On Stack Overflow they have a reputation. You see they answered 3000 questions and they’re always posting high quality responses, and then that’s a pretty big proponent that you can trust that person... then you can most likely trust their code. (P3)
Especially when a developer is unfamiliar with an area or undecided among multiple options, an expert’s input can facilitate their decision process. For instance, P15 described a scenario where signals of expertise could help them recognize important considerations:
It may be the case that thousand people voted for it, but two or three people may have commented saying, ‘Hey, it has a threat.’ And those two people who commented are actually threat analysis specialists. It’s not always 1,000 people’s opinion are correct. (P15)
3.3.2 How Evaluation Signals Help Developers Build Trust with AI Tools.
Currently, AI code generation tools lack an evaluation mechanism for the quality and correctness of AI-suggested code, as P3 observed with the example of Copilot: “GitHub Copilot just hands you the code, and it’s up to you to know whether or not that’s good code.” While developers can evaluate the suggestions themselves by reading or executing the code, external support for evaluation is often needed when the cost of a potential error is too high in terms of computing resources and time. Especially for less experienced developers who may be unfamiliar with the syntax or logic of the suggested code, effective support for evaluation is essential to ensure the quality and safety of the code, as P7 summarized: “there are times that it’s giving me suggestion I’m not very familiar with and I would have to look it up and see.” Online communities can also provide multiple perspectives to triangulate evaluation: “I trust that more because it is not dealing with just from one person’s perspective. It deals with different programmers coming together with different ideas.” (P1) In addition, knowing that online communities attract users with a variety of expertise and experience levels can boost developers’ confidence in their evaluation. As P7 pointed out, “online community has lots of people with more knowledge and who has worked on more projects probably than I have, so they’re familiar with many codes that I’m not.” Developers can thus rely on online communities to learn more about the code and make appropriate judgments.
3.3.3 Tradeoff between Evaluation Signals and Productivity.
Participants pointed out an important tradeoff between the effective evaluation that a community can provide and the amount of effort that a user needs to invest in the community. A great advantage of AI code generation tools is that they increase productivity—developers get AI suggestions seamlessly in their programming workflow, reducing the time spent typing and looking up syntax and documentation online. As P16 framed it, Copilot is “doing the filtering out process that I need in Stack Overflow to do myself for me... It helps me not open 50 Google Chrome tabs and still have the answer in an efficient amount of time possible.” In online communities like Stack Overflow, by contrast, developers need to “look through multiple questions and look for multiple answers and figure out whether this person wants to do the exact thing that I want to do.” (P16) In other words, it can take an enormous amount of time and energy to search, filter, and collect community-generated resources and to use credibility signals to validate that user-shared content. This can break a developer’s workflow, since their tasks are often very targeted and their time very constrained. As P15 stated, “software engineers, they’re not go and search for online contents, they work on a need to do basis.” There is thus a need to effectively incorporate community evaluation into developers’ programming workflows.
3.4 Synthesis: The Two Pathways Through Which Communities Can Foster Appropriate User Trust in AI Code Generation Tools
Echoing previous literature arguing that trust is situated in sociotechnical systems [30, 36, 39, 72], our findings elaborate on the important role of online communities in helping developers build appropriate trust with AI tools. In response to the call for social and organizational mechanisms beyond the tool itself to support trust building [44], we surfaced two pathways through which online communities can help developers build appropriate trust in AI: (1) the pathway of collective sensemaking and (2) the pathway of community heuristics. We explain these pathways in Figure 3 using the framework of the MATCH model [44] that we discussed in Section 2.1.
The first pathway, collective sensemaking, describes the process by which developers learn from others’ experiences with the AI model and improve their understanding of the trust affordances of the system. As suggested in previous literature [52], due to the highly versatile, context-dependent nature of code generation AI, individual developers likely only get exposed to a subset of trust affordances and tend to understand the AI’s capacities from their limited perspectives. For example, developers may not be aware of cases where the AI can fail, granting too much trust to the AI and making mistakes as a result, similar to what we see in the literature on overreliance on and misuse of AI systems [21, 43, 51]; or developers may deem the AI completely useless based only on their own few negative experiences. With community-curated experiences, developers are exposed to diverse examples of how AI can work or fail in different use cases, which complements their individual understanding of the trust affordances. As a result, developers set reasonable expectations, learn strategies on when to trust the AI, develop an empirical understanding of how the AI generates suggestions, and develop awareness of the broader implications of AI-generated code. All of this can help them make informed decisions about when and how much to trust AI in their programming workflow, helping them build responsible trust [34].
The other pathway, community heuristics, refers to the process in which users leverage a variety of community evaluation signals as heuristics to make trust judgments. Because individual developers rely on their own knowledge, expertise, and intuitions as heuristics, they may not be able to effectively assess suggested code when they are unfamiliar with the programming project or language [52, 60]. Echoing a range of literature on technical Q&A forums [26, 54, 58], our findings suggest that communities provide crowd-sourced evaluation signals, such as user voting, coupled with supplementary discussion context and credibility signals about user identity. These signals can serve as heuristics in addition to developers’ own expertise, scaffolding them to effectively evaluate an AI suggestion and decide whether to trust it.