To understand the role of online communities in helping developers build trust with AI code generation tools, we conducted a semi-structured interview study with 17 developers. To answer RQ1, our findings reveal that online communities shape developers’ trust in AI-powered code generation tools through community-curated experiences and community-generated evaluation signals. In Sections 3.2 and 3.3, we delve into what these features of online communities entail, how they help developers build trust with AI code generation tools, and what challenges developers face when engaging with these aspects. We also synthesize these findings into two pathways that extend the MATCH model, illustrating how user communities can support appropriate user trust in AI (Section 3.4).
3.2 Finding 1: Building Trust with Community-curated Experiences
For developers, an important aspect of trust in AI code generation tools is whether, and how comfortably, they can integrate the tool into their programming workflow. Developers leverage online platforms such as X, Reddit, and GitHub to first discover and learn about specific AI tools. Beyond initial discovery, developers also continue to share their own tool experiences and explore others’ experiences on these platforms. We refer to such collections of shared accounts of community members’ interactions with AI as community-curated experiences, which allow developers to form a mental model of the capacity of AI code generation tools and of how they can use those tools, building trust in the process. In this section, we discuss (1) what types of community-curated experiences developers like to see, (2) how community-curated experiences help developers build trust with AI tools, and (3) what challenges developers face in sharing and consuming experiences in communities.
3.2.1 What Types of Community-curated Experiences Are Helpful.
Developers seek specific features in community-curated experiences that can effectively help them build and calibrate trust. We identified four important features of an ideal experience-sharing post: vivid description and explanation, realistic programming tasks, inclusion of diverse use cases, and details on project setup and dependencies.
Vivid description and explanation. One of the most important features that developers would like to see in the community is others’ specific experiences interacting with an AI—in P13’s words, “concrete examples of how someone’s using a tool and how it helps them.” These posts typically contain details such as the programming task, the prompt that elicited the AI suggestions, and the outcomes of those suggestions, often supported by screenshots or video recordings. For example, P2 shared that they were convinced to try out Copilot by a YouTube video because they were able to see “somebody actively working with it and I can be like, that’s how it would work for me too.” (P2) Similarly, P13 stated that a video demo could vividly display how the AI assisted users and help them recognize the benefit it could bring: “when you watch it, parts of it can translate on whatever you do and you can make your workflow better.” In addition to a vivid, detailed description of the interaction, developers would also like to see the users’ explanations, opinions, and reflections on the interaction and the AI suggestion. For instance, P3 explained that an effective demonstration of a vivid experience with AI should explain that “I wrote this suggestion—look what it was able to come up with, and why you should use it as well, because it could help you with something like this.” With the explanation, developers can achieve a more accurate understanding of how the AI behaved in the user’s example.
Realistic programming tasks. It is also important that the shared experience is situated in a programming task that represents the typical, realistic workflow of a developer. Instead of watching AI solve predefined programming challenges, developers want to see the user showcase an “actual project” that they are “deeply nested into,” working with “lots of files, lots of context all around.” (P3) This type of demonstration can help developers anticipate the performance of AI in the complex situations they face in their day-to-day work. Specifically, developers would like to see both cases where the AI makes helpful suggestions and cases where it fails to help. For example, P16 preferred to see posts without “a pre-biased judgment” and stated that “it’s better to see how people who are actually using it in real world face issues.” It was important to see incidents where the AI gave wrongful suggestions so that they could understand its limitations. Working with AI on a realistic project gives such incidents opportunities to occur:
I think the intention shouldn’t be to use Copilot or not. They can just keep it running in the back and look at its suggestions and just work [how] they would normally work... Then you can see if it helps, it helps, but otherwise, you’re just dismissing the suggestions. Then we would know whether it is helping. (P10)
Inclusion of diverse use cases. Furthermore, developers search for diverse use cases of an AI tool—different programming languages, tasks, scenarios, and so on—to learn what is possible within its capacity. For example, P13 shared their experience of going down a long “YouTube hole” when they first learned about Copilot and got exposed to the breadth of its use cases: “there’s so much capability it has! I had a lot of fun just watching the way people use Copilot, front-end development, back-end development, machine learning development...” When introducing Copilot to their friends, P13 always demonstrated it in a variety of programming languages and tasks, including how it successfully suggested a complex function in OCaml but got stuck with the “<div>”s in HTML: “Because that’s what I experienced, I’m going to show them the full picture of what I’ve done.” (P13) Diverse use cases also help developers become aware of the boundary of the AI’s capacity. Similarly, P2 shared that they once posted about how they used Copilot in a multicomponent, full-stack web development project to demonstrate its ability. P2 was proud of this post, as it showed how Copilot behaved in a variety of programming languages and tasks, providing a comprehensive picture of its capacity:
I think just like one specific code example isn’t really enough... If one person says, here’s a screenshot of a cool thing co-pilot recommended, I don’t know if that be enough to convince me... That doesn’t prove that like co-pilot is good overall. (P2)
Details on project setup and dependencies. In order to make use of others’ experiences with AI code generation tools, developers favor posts that include details on the setup and dependencies of the project, for example, “the tool, what [the user] was planning to program, the language, the purpose, everything.” (P1) Developers value this information because they want to reproduce the interaction with AI in their own use cases. For example, P1 preferred reviews that are “something someone can follow and you’ll get the expected results.” In another example, P6 once saw a video about a problem similar to one they had with Copilot and would like to “see someone doing it, just follow it step by step and to see.” Replicability is crucial to trust building, as our participants explained that they needed to get hands-on experience with an AI tool before deciding how much to trust it. Developers would like to replicate others’ successful interactions with AI to help with their particular use cases, while sometimes also examining wrongful suggestions that they saw online before making their own judgment: “If I see something negative I don’t like to say that, okay, this is all useless let’s stop using Copilot... I try it with caution, I try to reproduce it.” (P4)
3.2.2 How Community-curated Experiences Help Developers Build Trust with AI Tools.
Our findings suggest that developers can build appropriate trust with AI code generation tools as a result of engaging with community-curated experiences. Specifically, developers can build a mental model of the AI’s capacity from others’ experiences, which includes setting reasonable expectations, learning strategies on when to trust the AI, forming an empirical understanding of how the AI generates suggestions, and developing awareness of the broader implications of AI-generated code.
Setting reasonable expectations on the capability of the AI. Engaging with community-generated content about experiences with AI code generation tools helps developers form appropriate expectations of those tools. Before trying out a tool, our participants shared that they had little knowledge about its ability and how it would perform in their particular use cases; as P7 put it, “I’ve not tried anything like that before so I didn’t really know what I was expecting.” Consuming other users’ “anecdotes” with the tools allows developers to form an initial mental model of the AI and decide if it is trustworthy enough to give it a try: “I read a bunch of what people think of the eventual outcome... [It] helps me make my own perception of whether it is something that is useful for me or not.” (P16)
Forming reasonable expectations also means learning about important constraints of the AI. An unrealistically high expectation could undermine developers’ trust, as they may not be prepared for those constraints. For instance, P14 shared that they initially thought Tabnine could largely automate their programming tasks because of an online post claiming that Tabnine wrote 20–30 percent of the code in a file, but then found that this was not the case in their actual usage: “I thought it was going to give me more detailed prediction on my code... But once I tried it out, I just saw that it’s mostly just complete some lines.” (P14) Out of disappointment, P14 stopped using Tabnine. In contrast, a reasonable expectation allows developers to anticipate certain failures, so that their trust is not broken by surprising negative experiences. For instance, P13 shared that a video demonstrating a variety of successful and failed use cases of Kite helped them realistically estimate how much it could improve productivity: “I set my expectations lower. I wasn’t like, ‘wow, I thought Kite would write my entire code file for me.’ ” Such realistic expectations allow developers to understand the boundary of the AI’s capacity and plan how they will use the AI accordingly. For example, P4 anticipated that Copilot would “fail in some cases” and did not expect Copilot “to work all the time,” but still decided to try it out with caution and validation.
Learning strategies of when to trust the AI. Community-generated content can also help developers develop strategies for when to trust and how to effectively use AI in their specific use cases. When developers feel unsure whether they can trust AI for particular tasks in their workflow, they go to online communities to seek relevant examples. For instance, P16 shared that they looked into community discussions to learn which tasks they could trust Copilot to do:
The public discussion has definitely helped with the trust: this is a language translation that I can trust. This is a module transmission I can trust... It’s like all I need to do is find already if someone has tried to use the auto-complete. (P16)
Developers can also learn from others’ experiences about areas where AI should not be trusted, and consequently form strategies to minimize any harm. For example, P1 shared an experience in which they tried to get Kite to make high-quality suggestions in a project that involved multiple programming languages. After browsing online communities, they learned from other users that Kite generally did not perform well in some languages and decided that they should instead “stick to a particular programming language.” In particular, developers would like to see examples relevant to their specific use cases, as these offer previews of how the AI would function in their own context, allowing them to decide how much to trust it and whether to adopt it at all:
When I found out the reviews do not affect the areas of function that I want to use the app for, I will not really be bothered about checking out the app. But when I found out the negative reviews is [in] the area where I want to use the application for, I tend to not have enough trust for the application and mostly I end up not checking out at all. (P5)
Forming empirical understandings of how the AI generates particular suggestions. Online communities can also help developers understand how AI code generation tools work. Participants shared that they contemplated why the AI gave certain suggestions in particular scenarios, for example, “why does it make a certain choice over another? Why is the naming convention of this variable something else?” (P8) Knowing the inner decision processes and rationales of the AI is crucial to trust, as developers can then make decisions based on the reliability of the decision process. In online communities, members share their prompts and the AI’s suggestions, and collectively build an empirical understanding of the otherwise black box: “it’s a super black box, so we’re collectively trying to figure out how to best use them.” (P10) In particular, participants expressed appreciation for discussions around wrongful AI suggestions. In those discussions, users post bizarre code written by the AI along with the surrounding context, speculate about potential causes of those suggestions, and sometimes exchange ideas for experiments to test their theories. For example, P4 shared an experience where they engaged deeply in a discussion thread and tried to reverse engineer the prompt that resulted in a wrongful suggestion by Copilot: “I’d like to see more of this type of discussions. [We] see examples where it is going wrong and trying to figure out what went wrong.” (P4) Similarly, P16 enjoyed reading posts in which users shared experiments with GPT-3 prompts, as those discussions helped them empirically summarize the factors that affect AI suggestions:
If you see the problem that they wrote and you see the paragraph that they get then it’s very easy to figure out the breaking points that this word is contributing for this sentence to be generated, and basically bring a bit of interpretability to the machine learning model. (P16)
Developing awareness of broader implications of AI-generated code. Developers also learn about the implications of using AI models beyond their immediate use cases by participating in online communities. These implications include the legal and ethical impact of the code generated by AI, as well as potential security concerns. For example, P3, a young programmer, only learned about the controversial impact of Copilot on FLOSS after following a debate on GitHub repositories:
If I had not seen the discussion, I don’t think I ever would have thought about it... But once I did see a lot of people talking about the licensing issues... I feel like I should be a little more careful with what I do. (P3)
Although those legal and ethical implications did not directly affect P3’s current use cases of Copilot, P3 was glad to be aware of those issues to prepare for wider use cases in the future. In terms of security implications, while experienced developers may already know about the vulnerabilities that AI can introduce, beginner programmers rely on community-generated resources to learn about these concerns. For example, participants reported that they had seen posts about Copilot suggesting others’ private keys. These implications are crucial for trust, as they can help developers stay alert and avoid serious consequences:
People learned that they should be maybe careful against this in critical software. That’s a good thing that people were discussing it because there’s probably less probability of the worst-case happening. (P12)
Seeing security issues introduced by AI reminds developers of the importance of manually reviewing AI-suggested code and of their role as the supervisor of the AI. For example, P8 recalled that a discussion about a security breach caused by code written by Copilot made them invest more time in verification:
I would be more careful while using the application, just so I would not just write all my code and just publish it that way without taking time to actually verify. (P8)
3.2.3 What Challenges Developers Face in Sharing and Consuming Experience in Communities.
Despite these benefits, developers face a variety of challenges in engaging with community-curated experiences. Some participants found it hard to reproduce others’ experiences because vivid descriptions and dependency details were often missing, and others complained about the lack of diverse use cases or realistic programming tasks. We summarize three significant challenges that developers face in effectively sharing or consuming experiences with AI tools: lack of channels dedicated to experience sharing, high cost of describing and sharing experience, and missing details for reproducibility.
Lack of channels dedicated to experience sharing. A challenge developers face is the lack of platforms dedicated to sharing experience with AI code generation tools. On general platforms like X and Reddit, sharing and discussions about specific experience with AI code generation tools can be buried in more general discussions about AI. For example, P16 shared that they had been apprehensive about engaging in high-level discussions about AI, since those discussions could often become controversial and distant from specific user experience:
Most of the discussion start with some random Atlanta or New York post headline saying something like, AI is going to take your job or anything like that... the headline is so sensational that people have really strong opinions over it and I do not really like engaging with. (P16)
The participants expressed a desire for a channel dedicated to end-user experience sharing and discussions about AI code generation tools. As P10 imagined, such a channel could be “a common forum just specific for Copilot or something, like crowd sourcing of ideas.” Similarly, P13 suggested a “centralized place” for Copilot-related experience, where people share “code snippets and they’ll be, ‘here once you have access to Copilot, try this code snippet in your repository, see what happens.’ ” In addition, P7 proposed a search function for “the feature you want and bring out posts about related AI co-generation tools,” so that users can easily locate examples of particular use cases.
High cost of describing and sharing experience. For developers, writing a post about their experience with AI code generation tools and sharing it online is a time-consuming and complex task. Given the interactive nature of AI tools, it is difficult to effectively describe the interaction process. For example, P2 shared that it is hard to describe AI suggestions when making a text-based social media post: “You can’t really have a code block that differentiates between I wrote this code and then this is the code that was recommended.” (P2) For developers who choose to share a video of their interaction with the AI, making the video consumable by other developers requires a lot of planning about what to record and how to explain it. For example, P2, who once prepared an elaborate video post about how Copilot helped them in a complicated project (see Section 3.2.1), complained about the tedious process of preparing the video:
I spent a lot of time learning the intricacies of, like, for that specific project, what would Copilot recommend and all that stuff. I was up until like 7 a.m. (P2)
In addition, posting on a public online platform can be intimidating. Developers may feel self-conscious when sharing their experiences and opinions with a large audience about whom they know little, including how they would react. As a result, developers may spend a lot of time polishing the post they are trying to share, adding further investment to this task. P11 contrasted sharing their experience with Copilot among a group of friends with making a public post:
[With friends,] I don’t really care about going deep into discussion in order to appear perfect. But when it come to posting online, I’m most mindful of how my writing will look. (P11)
Missing details for reproducibility. Another challenge is the lack of reproducibility in the experiences shared by developers online. As we show in Section 3.2.1, developers build trust with AI by trying out prompts from online sources in their own workflow. However, when sharing their experiences, developers can overlook the setup of their project and the dependencies of their environment: “people don’t exactly share their VSCode settings, [which] could be very much catered to the way they code or their programming preferences.” (P9) Developers are disappointed when they are unable to replicate, in their own environment, AI interactions that they saw online. For example, P13 complained about the lack of functionality to “copy and paste text from the video” (where the other user interacts with Copilot) and wished for a “testing environment” where they could directly replicate the interaction. The lack of reproducibility can leave developers unaware of diverse use cases and possible limitations of the AI, resulting in biased views. Some of our participants even deemed particular use cases that they saw online but had not experienced themselves as “misinformation,” because they were not able to confirm for themselves that such an interaction is possible: “if I see that people are contrary on something that can’t be verified, I would be like, this is a misinformation.” (P7)
3.3 Finding 2: Building Trust with Community-generated Evaluation Signals
Another important aspect of trust in AI code generation tools for developers is whether and how they can leverage specific AI suggestions. Developers can draw on signals from code solutions posted by community members to evaluate AI-suggested code and make decisions. We refer to these contributions of community members as community-generated evaluation signals. In the following sections, we explain how and why communities can offer effective support for the evaluation of AI output. Specifically, we unpack (1) what evaluation signals communities can offer, (2) how evaluation signals help developers build trust with AI tools, and (3) the tradeoff between evaluation signals and productivity.
3.3.1 What Evaluation Signals Communities Can Offer.
We identified three major evaluation signals that developers’ online communities can offer: direct indicators of code quality, context and generation process of a code solution, and identity signals.
Direct indicators of code quality. Unlike AI code generation tools, online communities offer direct metrics that can help developers evaluate the quality of code solutions. Many online communities, such as Stack Overflow, allow users to vote on or rate solutions. Participants trusted a solution selected by such a voting mechanism because it had been reviewed by many others: “if other programmers have used that solution and it worked in their code, they’ll upvote the solution and so that does give you a little bit more faith that this is a good solution.” (P3) The voting mechanism also implies that multiple real developers have used the code, indicating its safety and trustworthiness. For instance, for P10, a solution selected by the question owner on Stack Overflow indicated that “the person who originally asked the question has tried it out and it worked for them,” providing “some guarantee that, that code compiles or it’s very close to what I want, which does not exist with Copilot.”
Another explicit indicator of code quality is how much engagement a solution receives. In online communities, developers can see how many other members viewed, rated, shared, commented on, or otherwise engaged with a post. High and positive engagement with a solution usually means that it “has been cross-verified by a lot of people,” (P4) indicating high quality. As P6 summarized, “when it’s coming from plenty of different people and all bring positive reply, I’ll be like, ‘yeah, it works for everyone and not just for a single person.’ ” With high engagement, developers can see verification triangulated from multiple sources, which informs their evaluation: “I trust that because it is not dealing with just from one person’s perspective—different developers with different ideas, coming together to make their review.” (P1) When a solution does not get much engagement, developers tend to be more skeptical about it, since it has not been verified by other users. For example, P14 explained why they did not have a high degree of trust in Tabnine after seeing a single post about it: “because of the fact that just a random post from a random user with no really much engagements.” (P14)
Context and generation process of a code solution. Developers also appreciate that online communities offer a broader context beyond the code solution itself, which helps with evaluation. Platforms such as Stack Overflow support discussions around user-contributed solutions. Participants expressed appreciation for these discussions because, compared to the solutions suggested by AI, they are more “detailed” and “interactive” (P12) and provide “a lot more transparency” into how a solution is generated (P3). Following a discussion allows developers to understand why a solution was chosen as the optimal one:
They will go through the process. What happened? How did it happen? ... then people will go through different solutions. There is a lot more storytelling involved. (P9)
You can see this back-and-forth between people until one answer becomes the accepted answer for that question. Somebody is suggesting some code and then in the comments people point out, oh, you have a point here or like there’s an issue with this thing. (P12)
Understanding how a code solution is generated helps developers better judge its quality. Because users cannot see how an AI suggestion is decided, they have trouble judging its validity, as P3 described: “in some cases you have no idea why something is working, although it works.” Sometimes, the solution generated by AI may not be optimal. For example, P3 shared an experience where Copilot suggested a bug-free but inefficient implementation of a function. While inexperienced programmers often have trouble identifying issues in examples like this on their own, they can leverage online discussions where multiple, more experienced members scrutinize the solution, spot issues, and even propose more effective alternatives, as P3 reflected: “if [I] had gone with Stack Overflow for something like that, they would have said, ‘here’s the inefficient solution, this is why it’s inefficient, and here’s how you can make it more efficient.’ ”
Identity signals. Participants shared that knowing the identity of a code’s author allows them to evaluate the code based on the author’s background and experience level. In communities like Stack Overflow and GitHub, users can view members’ contribution history, which indicates what and how much code they have contributed and how well that code has been received in the community. Common gamification mechanisms, such as levels and badges, also signal a user’s expertise. All these factors can help developers decide to what extent they can trust the code written by a particular user:
I’m copying this code of this guy which has all the badges and everything—he probably knows what he’s doing. (P12)
On Stack Overflow they have a reputation. You see they answered 3000 questions and they’re always posting high quality responses, and then that’s a pretty big proponent that you can trust that person... then you can most likely trust their code. (P3)
Especially when a developer is unfamiliar with an area or undecided among multiple options, an expert’s input can facilitate their decision process. For instance, P15 described a scenario where signals of expertise could help them recognize important considerations:
It may be the case that thousand people voted for it, but two or three people may have commented saying, ‘Hey, it has a threat.’ And those two people who commented are actually threat analysis specialists. It’s not always 1,000 people’s opinion are correct. (P15)
3.3.2 How Evaluation Signals Help Developers Build Trust with AI Tools.
Currently, AI code generation tools lack an evaluation mechanism for the quality and correctness of AI-suggested code, as P3 observed with the example of Copilot: “GitHub Copilot just hands you the code, and it’s up to you to know whether or not that’s good code.” While developers can evaluate the suggestions themselves by reading or executing the code, external support for evaluation is often needed when the cost of a potential error is too high in terms of computing resources and time. Especially for less experienced developers who may be unfamiliar with the syntax or logic of the suggested code, effective support for evaluation is essential to ensure the quality and safety of the code, as P7 summarized: “there are times that it’s giving me suggestion I’m not very familiar with and I would have to look it up and see.” Online communities can also provide multiple perspectives to triangulate evaluation: “I trust that more because it is not dealing with just from one person’s perspective. It deals with different programmers coming together with different ideas.” (P1) In addition, knowing that online communities attract users with a variety of expertise and experience levels can boost developers’ confidence in their evaluation. As P7 pointed out, “online community has lots of people with more knowledge and who has worked on more projects probably than I have, so they’re familiar with many codes that I’m not.” Developers can thus rely on online communities to learn more about the code and make appropriate judgments.
3.3.3 Tradeoff between Evaluation Signals and Productivity.
Participants pointed out an important tradeoff between the effective evaluation that a community can provide and the amount of effort that a user needs to invest in the community. A great advantage of AI code generation tools is that they increase productivity—developers get AI suggestions seamlessly in their programming workflow, reducing the time spent typing and looking up syntax and documentation online. As P16 framed it, Copilot is “doing the filtering out process that I need in Stack Overflow to do myself for me... It helps me not open 50 Google Chrome tabs and still have the answer in an efficient amount of time possible.” In online communities like Stack Overflow, by contrast, developers need to “look through multiple questions and look for multiple answers and figure out whether this person wants to do the exact thing that I want to do.” (P16) In other words, it can take an enormous amount of time and energy to search, filter, and collect community-generated resources and to use credibility signals to validate that user-shared content. This can break a developer’s workflow, since their tasks are often very targeted and their time very constrained. As P15 stated, “software engineers, they’re not go and search for online contents, they work on a need to do basis.” There is thus a need to effectively incorporate community evaluation into developers’ programming workflows.
3.4 Synthesis: The Two Pathways Through Which Communities Can Foster Appropriate User Trust in AI Code Generation Tools
Echoing previous literature arguing that trust is situated in sociotechnical systems [30, 36, 39, 72], our findings elaborate on the important role of online communities in helping developers build appropriate trust with AI tools. In response to the call for social and organizational mechanisms beyond the tool itself to support trust building [44], we surfaced two pathways through which online communities can help developers build appropriate trust in AI: (1) the pathway of collective sensemaking and (2) the pathway of community heuristics. We explain these pathways in Figure 3 using the framework of the MATCH model [44] that we discussed in Section 2.1.
The first pathway, collective sensemaking, describes the process by which developers learn from others’ experiences with the AI model and improve their understanding of the trust affordances of the system. As suggested in previous literature [52], due to the highly versatile, context-dependent nature of code generation AI, individual developers likely only get exposed to a subset of trust affordances and tend to understand the AI’s capacities from their limited perspectives. For example, developers may not be aware of cases where the AI can fail, granting too much trust to the AI and making mistakes as a result, similar to what we see in the literature on overreliance on and misuse of AI systems [21, 43, 51]; or developers may deem the AI completely useless based only on their own few negative experiences. With community-curated experiences, developers are exposed to diverse examples of how AI can work or fail in different use cases, which complements their individual understanding of the trust affordances. As a result, developers set reasonable expectations, learn strategies on when to trust the AI, develop an empirical understanding of how the AI generates suggestions, and develop awareness of the broader implications of AI-generated code. All of this can help them make informed decisions about when and how much to trust AI in their programming workflow, helping them build responsible trust [34].
The other pathway, community heuristics, refers to the process in which users leverage a variety of community evaluation signals as heuristics to make trust judgments. Because individual developers rely on their own knowledge, expertise, and intuitions as heuristics, they may not be able to effectively assess suggested code when they are unfamiliar with the programming project or language [52, 60]. Echoing a range of literature on technical Q&A forums [26, 54, 58], our findings suggest that communities provide crowd-sourced evaluation signals, such as user voting, coupled with supplementary discussion context and credibility signals about user identity. These signals can serve as heuristics in addition to developers’ own expertise, scaffolding them to effectively evaluate an AI suggestion and decide whether to trust it.