We finally have quantitative data about the usability of generative AI systems like ChatGPT for real business tasks: three new studies tested very different types of users in different domains but arrived at the same conclusions. Productivity increased significantly, with the biggest gains for the least-skilled users. Some of the studies also found improvements in the quality of the work products.

There have been endless discussions of AI in recent months, but almost all are speculative, reflecting the authors’ personal opinions. Bah, humbug. If the dot-com bubble taught us anything, it’s that such speculations are worthless for assessing business use. We need to know which deployments will be profitable and which will flop. Opinion-based guesses are often wrong and lead to massive waste when companies launch products that don’t work for real users. That’s why empirical data from hands-on use while users perform actual tasks (as opposed to watching demos) is so valuable.

Here, I discuss the findings from three studies, which are described in detail in separate articles:

  • Study 1: Customer service agents resolving customer inquiries in an enterprise software company.
  • Study 2: Experienced business professionals (e.g., marketers, HR professionals) writing routine business documents (such as press releases) that take about half an hour to write.
  • Study 3: Programmers coding a small software project that took about three hours to complete without AI assistance.

In all three cases, users were measured while they completed the tasks: always for task time and sometimes for quality. About half the users performed the tasks the old-fashioned way, without AI assistance, whereas the other half had the help of an AI tool.

Productivity Findings

The most dramatic result from the research is that AI works for real business use. Users were much more efficient at performing their job with AI assistance than without AI tools.

Productivity measures how many tasks a user can perform within a given time — for example, a day or a week. If an employee can get twice as much work done, their productivity has increased by 100%. 
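This definition can be sketched in a few lines of code (the numbers here are made up purely for illustration):

```python
def productivity_gain(tasks_before, tasks_after):
    """Percentage increase in tasks completed within the same time period."""
    return (tasks_after - tasks_before) / tasks_before * 100

# A hypothetical employee who used to finish 10 tasks per day now finishes 20:
print(productivity_gain(10, 20))  # 100.0, i.e., a 100% productivity increase
```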

Here are the results:

  • Study 1: Support agents who used AI could handle 13.8% more customer inquiries per hour.
  • Study 2: Business professionals who used AI could write 59% more business documents per hour.
  • Study 3: Programmers who used AI could code 126% more projects per week.

The following chart summarizes the findings from the three research studies:

The measured increase in users’ task performance when using generative AI tools (relative to control groups of users who did not use AI), according to the three research studies discussed here

In all three cases, the difference with the control group (which performed the work the traditional way, without AI tools) was statistically significant at p < 0.01 or better.

It’s clear from the chart that the change in task productivity differs substantially across the three domains studied. More cognitively demanding tasks (e.g., writing code vs. answering a customer query) appear to have benefited the most from AI’s assistance.

Is the AI-Caused Productivity Lift a Big Deal?

On average, across the three studies, generative AI tools increased business users’ throughput by 66% when performing realistic tasks. How should we judge this number?

A number in itself is meaningless. We can only draw conclusions when we compare it to other numbers.

For comparison, average labor productivity growth in the United States was 1.4% per year during the 12 years before the COVID-19 pandemic (2007–2019), according to the Bureau of Labor Statistics. In the European Union, average labor productivity growth was 0.8% per year during the same period, according to Eurostat.

Both numbers measure the average value created by a worker per hour worked. If employees put in more hours or additional warm bodies join the workforce, the overall volume of economic output will increase, but that doesn’t mean workers have become more productive in the sense discussed here. This article discusses how much value employees create for each work hour. If this value increases, standards of living will improve.

Now we have something to compare against! The 66% productivity gains from AI equate to 47 years of natural productivity gains in the United States. And AI corresponds to 88 years of growth in the European Union, which is a third more time than the 66 years that have passed since the formation of the European Community (the precursor to the EU) in 1957.
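The comparison is simple division of the AI lift by the annual growth rate; a quick sketch of the arithmetic for the United States figure (note this rough comparison ignores compounding):

```python
ai_gain = 66.0          # percent productivity lift from AI (average of the 3 studies)
us_annual_growth = 1.4  # percent per year, BLS figure for 2007-2019

# Years of pre-pandemic productivity growth that the AI lift equals,
# using straight division (compounding is deliberately ignored here)
years_equivalent = ai_gain / us_annual_growth
print(round(years_equivalent))  # 47
```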

AI is a big deal, indeed!

Caveats

There are three caveats to these findings.

First, the 66% productivity gains come from the previous generation of generative AI (in use when the data was collected), as represented by ChatGPT 3.5. We’re already on the next release, which is substantially better. I expect future AI systems to improve even more, especially if they are informed by user-experience insights instead of being driven purely by engineering. (Current AI tools have big usability weaknesses.)

Second, only study 1 (customer support) followed workers across multiple months. Studies 2 and 3 (writing business documents and programming) measured participants during a single use of the AI tool — often the first time that person used AI. There’s always a learning curve in using any design, where users get better after repeated exposure to the user interface. Thus, I expect that the (already very high) gains measured in studies 2 and 3 will become much larger in real-world settings where employees keep using tools that make them substantially more effective at their job.

Third, the productivity gains accrue only while workers are performing those tasks that receive AI support. In some professions, like UX design, many tasks may be unsuitable for AI support, and, thus, employees in those fields will realize only smaller gains when viewed across their entire workday.

These factors point in opposite directions. Which will prove stronger remains to be seen. For now, since I have no data, I will assume that they will be about equally strong. Thus, my early estimate is that deploying generative AI across all business users can potentially increase productivity by around 66%.

Productivity of UX Professionals

We currently have only minimal data on the potential benefits of AI used by UX professionals. One study suggests that ChatGPT can help with faster thematic analysis of questionnaire responses. There will likely be many other examples where we can productively employ these new tools for various UX processes.

How big can the improvements be expected to be for UX work? We can get an early, rough estimate from the data presented in this article. More complicated tasks lead to bigger AI gains. UX design is not quite as cognitively demanding as programming (which saw a 126% improvement), but it’s up there in complexity. So, my guess is 100% productivity gains from AI-supported UX work.

While we await actual data, the current expectation is that UX professionals can double their throughput with AI tools. To be more precise, productivity would double for those UX tasks that lend themselves to AI support. Which tasks those are may be a topic for a future post, but not all UX work will gain equally from AI tools.

For example, in discovery studies or usability testing, observational user research still needs to be conducted by humans. The AI can’t guess what users will do without watching real people performing actual tasks and truly understanding their context. Some of this “watching” might eventually be automated, but I suspect that a human UX expert will still need to sit with users for many of the highest-stakes studies.

Unfortunately, conducting a one-hour customer visit consumes an hour, so there’s no productivity speed-up. (But summarizing and comparing notes from each visit might be expedited by AI.)

Again, I'm guessing at this early stage, but assume that half of UX work will benefit from AI tools. If this half of our work can get 100% increased throughput, but the rest remains at its current pedestrian pace, UX work will see only a 33% productivity lift overall. Still good, but not as revolutionary — due to the human-focused nature of our work.
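This blended estimate follows the same logic as Amdahl’s law: when only part of the work speeds up, the overall gain is smaller than the gain on the sped-up part. A minimal sketch, using my guessed numbers (50% of UX work getting a 100% throughput boost):

```python
def overall_lift(fraction_aided, lift_on_aided):
    """Overall throughput lift when only part of the work is sped up.

    fraction_aided: share of the work that benefits from AI (0-1)
    lift_on_aided:  throughput increase on that share (e.g., 1.0 = +100%)
    """
    # Time per unit of work shrinks only on the AI-aided fraction
    new_time = (1 - fraction_aided) + fraction_aided / (1 + lift_on_aided)
    return 1 / new_time - 1  # resulting overall throughput gain

print(round(overall_lift(0.5, 1.0) * 100))  # 33 (percent)
```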

Quality Improvements from the Use of AI

Efficiency is nice, but it does us no good in the bigger picture if AI use produces vastly more bad outputs. Quality is just as important as quantity for the value of any innovation.

Luckily, the quality of the output produced with AI assistance is better than that produced without it, at least in studies 1 and 2. (Study 3, of programmers, didn’t judge the code quality produced under the two experimental conditions.)

In study 1 (customer support), the AI-using agents’ work quality improved by 1.3% compared to that of the agents not using AI, as measured by the share of customer problems that were successfully resolved. On the other hand, the customers gave identical subjective satisfaction scores for the problem resolutions under the two conditions. The small 1.3% lift was only marginally significant, at p = 0.1.

So, in study 1, we can conclude that AI assistance didn’t hurt work quality and likely improved it slightly.

In contrast, in study 2 (writing business documents), work quality shot through the roof when the business professionals composed their documents with help from ChatGPT. On a 1–7 scale, the documents' average quality rating was 4.5 with AI versus 3.8 without AI. This difference was statistically significant at p < 0.001.

Based on self-reported data from the participants, the quality improvement is likely to stem from the fact that the AI-using business professionals spent much less time producing the first draft of the document text (which was generated by ChatGPT) and much more time editing this text, producing more-polished deliverables.

Human-Computer Symbiosis

In 1960, computer pioneer J.C.R. Licklider wrote an influential paper titled “Man-Computer Symbiosis.” Licklider envisioned a future when people and computers would supplement each other in “an expected development in cooperative interaction between men and electronic computers” (my bolding added to Licklider’s quote). 

According to this early research on AI usability, that day seems to have arrived. Excellent results in task throughput and work-product quality come from such a symbiosis.

AI will not replace humans. The best results come when AI and humans work together — for example, by expediting the production of draft text, leaving human professionals to focus on editing and polishing.

Narrowing Skill Gaps

The exciting findings from the research don’t stop with increased productivity and work quality.

Generative AI has a third effect: narrowing the gap between the most talented and the least talented employees.

Of course, individual differences will always exist, and some people will perform better than others. But the magnitude of these differences can be reduced with AI. 

In study 1 (customer support), the lowest-performing 20% of the agents (the bottom quintile) improved their task throughput by 35% — two and a half times as much as the average agent. In contrast, the best-performing 20% of the agents (top quintile) only improved their task throughput by a few percent.

In study 2 (writing business documents), the professionals who scored the lowest when writing a document without AI help improved their scores much more than high-scoring participants when they received support from ChatGPT. The difference between good and bad writers was around 2–3 points (on the 1–7 quality rating scale) without AI; this difference narrowed to roughly a single point when using ChatGPT. (This assessment is my own, from eyeballing the figures in the original paper. The paper itself describes the narrowing of the skills gap as a statistically significant reduction in the correlation between participants’ with-AI and without-AI quality scores in the treatment group, compared with the control group, which used a non-AI tool. However, that framing can be hard to interpret for readers without strong statistical skills.)

In study 3 (programming), programmers with fewer years of experience benefited more from the AI tool, though the effect was only marginally significant, at p = 0.06. Also, programmers who spent fewer hours per day coding benefited more from the AI tool than participants who coded for more hours per day. This second effect was significant, at p = 0.02. Taken together, these two findings suggest that less-skilled programmers benefit more from AI than more-skilled programmers do.

While the detailed findings and statistics differ among the three studies, the conclusion is the same in all three: using AI narrowed the gap between the worst and the best performers.

Narrowing Skills Gap, but Biggest Productivity Gains in Cognitively Demanding Tasks

At first, I thought it was a paradox that the worst performers within a domain are helped the most by AI, but that, between domains, the biggest gains are for the more cognitively demanding tasks. In other words, AI helps on the low end in one analysis and on the high end in the other analysis. How can that be?

As always, I would like to see more research, in more domains and with broader target audiences. But given the three case studies at hand, I have a tentative explanation for these two seemingly opposite results.

My hypothesis is that generative AI takes over some heavy lifting in manipulating large amounts of data. That is, it reduces working-memory load. This helps:

  • When the task is more complex and puts higher demands on people’s working memories
  • When humans have smaller working-memory capacity and cannot hold as many chunks of information in their brain at any given time

(Note that working-memory capacity is a well-known individual difference, and it does vary with things like age or education. Also, compared to experts in a domain, low-skill workers or novices will tend to use more of their working memory to do a task — because the task is not yet completely familiar, and they have to remember how to do it.)

By lifting some of the heavy working-memory burden, AI tools free up users to sprinkle on the unique human stardust of creativity, as exemplified by the editing component of the business-document task.

Creativity matters more in complex tasks than routine tasks — another reason AI helps more in advanced domains. Without AI, less-skilled users’ creativity is repressed by their need to devote most of their working memory to data handling. But, with AI, their brains are freed up to be more creative, and thus, the skill gap between less and more skilled workers is narrowed.

Steve Jobs famously called computers “bicycles for the mind” because they allow users to do things with less effort. Similarly, AI tools are forklifts for the mind because they do the heavy lifting. In an actual warehouse, the forklift driver still needs to decide how to stack the pallets most efficiently, applying their human insights. But, because the forklift does the lifting, humans no longer need to be musclebound beasts to move heavy pallets around. The same is true when using AI.

Faster Learning

A final finding comes from study 1 (customer support), the only longitudinal research study in the current set. In this study, the support agents were followed over several months; the results showed that, when supported by AI, agents achieve expertise faster than agents without AI support.

On average, new agents can complete 2.0 customer inquiries per hour. An experienced agent can complete 2.5 inquiries per hour; this level of productivity is normally reached after 8 months of work (without using the AI tool). In contrast, the agents who started using the AI tool right off the bat reached this level of performance in only two months. In other words, AI use expedited learning (to this level of performance) by a factor of 4.

Research Weaknesses

It may seem petty of me to complain of weaknesses in these pioneering research studies that are giving us sorely needed empirical data at a time when most analysts are blathering based on their personal opinions. I am very grateful for this research, and I enthusiastically praise the authors of these three papers. Well done, folks! If anybody deserves more research funding, it’s you, and I look forward to reading your future, more in-depth findings.

That said, there are always areas for improvement in any research, especially in pioneering studies conducted early on tight timelines. My main complaint is that the research was carried out without applying well-known user-research methods, such as think-aloud protocols. While economists love quantitative research like that reported here, UX researchers know that qual eats quant for breakfast when it comes to insights into user behaviors and reasons why some designs work better than others.

The Three Research Studies

You can stop reading here if you don’t care about the details but only want the conclusions, which I have already presented. But I have written up an analysis of each of the studies if you want more information: 

  • Study 1: Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond (2023): Generative AI at Work.
  • Study 2: Shakked Noy and Whitney Zhang (2023): Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence.
  • Study 3: Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer (2023): The Impact of AI on Developer Productivity: Evidence from GitHub Copilot.

Understanding the specifics of the research is valuable for interpreting the results. The detailed discussion is essential for realizing how much we still don’t know. There are unsolved questions enough for many master’s and Ph.D. theses and even some undergraduate papers. If you want to do this work, let me know, and I might be able to mentor you.

The three studies were very different and yet arrived at the same results. This vastly increases my faith in the conclusions. Any single study can be wrong or misleading for several reasons. But credibility shoots through the roof when different people find the same thing in different domains with different research methods. The lead authors of these three case studies were from Stanford, MIT, and Microsoft Research, respectively. They studied customer-support agents resolving customer inquiries, business professionals writing routine documents, and programmers coding an HTTP server. As you’ll see if you read on, each research team used different study protocols and different metrics.

Despite all these differences, the three studies still all arrived at roughly the same conclusions. Impressive!

 

References

[Study 1] Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond (2023): Generative AI at Work. National Bureau of Economic Research working paper 31161. https://www.nber.org/papers/w31161 (for a detailed analysis of this study, see AI Tools Raise the Productivity of Customer-Support Agents)

[Study 2] Shakked Noy and Whitney Zhang (2023): Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence. Available at SSRN: https://ssrn.com/abstract=4375283 or http://dx.doi.org/10.2139/ssrn.4375283 (for a detailed analysis of this study, see https://www.nngroup.com/articles/chatgpt-productivity/)

[Study 3] Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer (2023): The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. Available at arXiv: https://arxiv.org/abs/2302.06590 or https://doi.org/10.48550/arXiv.2302.06590 (for a detailed analysis of this study, see AI Tools Make Programmers More Productive)