From Provenance to Aberrations: Image Creator and Screen Reader User Perspectives on Alt Text for AI-Generated Images
Abstract
1 Introduction
2 Related Work
2.1 Background: AI-Generated Images
2.2 Alt Text Research in HCI
3 Methods
3.1 Participants
3.2 Stimuli: Four Alt Text Versions
3.3 Creators’ Study Procedure
3.3.1 Asynchronous Pre-Work.
3.3.2 Alt Text Evaluation Session.
3.4 Screen Reader Users’ Study Procedure
3.5 Evaluation Tasks and Questions
3.6 Data Analysis
4 Findings
4.1 RQ1: What are the characteristics of alt texts for AI images prepared from different sources (creator-written, expert-written, the T2I prompt, and a V2L model)?
4.1.1 Alt Text Characteristics.
Values are mean (SD).
Alt text version | Character count | % Nouns | % Verbs | % Adjectives | % Adverbs | % Other |
Prompt | 98.7 (98.2) | 48.2 (15.5) | 7.2 (7.5) | 11.7 (10.7) | 0.4 (1.4) | 32.5 (15.1) |
V2L | 46.7 (6.7) | 36.1 (7.4) | 7.2 (6.8) | 7.5 (8.2) | 0.4 (2.5) | 48.7 (7.7) |
Expert | 402.8 (156.4) | 32.1 (3.7) | 12.5 (3.1) | 18.1 (4.1) | 1.2 (1.6) | 36.1 (3.3) |
Creator-Original | 215.8 (157.8) | 34.5 (8.4) | 11.1 (5.6) | 11.7 (5.4) | 1.5 (2.4) | 41.2 (6.7) |
Creator-Ideal | 356.0 (199.3) | 33.0 (5.0) | 11.8 (4.2) | 16.5 (5.4) | 1.0 (1.4) | 37.8 (4.8) |
SRU-Ideal | 270.8 (143.8) | 36.3 (6.2) | 11.5 (4.4) | 13.8 (6.3) | 1.1 (1.6) | 37.4 (5.6) |
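The composition measures in the table above can be computed with a short script. The sketch below is a hypothetical reconstruction (the paper does not publish its analysis code); a real pipeline would obtain part-of-speech tags from a tagger such as NLTK or spaCy, but tokens are hand-tagged here to keep the example dependency-free.

```python
from collections import Counter

def composition_stats(alt_text, tagged_tokens):
    """Character count plus part-of-speech percentages for one alt text.

    `tagged_tokens` is a list of (token, tag) pairs using a coarse scheme:
    NOUN, VERB, ADJ, ADV, or OTHER (an assumed scheme for illustration).
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())

    def pct(tag):
        return 100.0 * counts[tag] / total if total else 0.0

    return {
        "chars": len(alt_text),
        "%nouns": pct("NOUN"),
        "%verbs": pct("VERB"),
        "%adjectives": pct("ADJ"),
        "%adverbs": pct("ADV"),
        "%other": pct("OTHER"),
    }

# Hand-tagged toy alt text (hypothetical, not from the study's corpus).
alt = "A smiling monkey waves happily."
tags = [("A", "OTHER"), ("smiling", "ADJ"), ("monkey", "NOUN"),
        ("waves", "VERB"), ("happily", "ADV"), (".", "OTHER")]
stats = composition_stats(alt, tags)
```

Averaging these per-alt-text dictionaries across the 32 images would yield one row of the table.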
4.1.2 Creation of ‘Ideal’ Alt Text.
Ideal version written by | # Closest to Expert | # Closest to Creator-Original | # Closest to V2L | # Closest to Prompt |
Creators | 17 | 9 | 4 | 2 |
SRUs | 27 | 20 | 8 | 9 |
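Matching an 'ideal' alt text to its closest existing version, as counted in the table above, can be sketched as a string-similarity comparison. Both the metric (difflib's normalized matching-block ratio) and the sample texts below are illustrative assumptions, not the paper's actual method.

```python
from difflib import SequenceMatcher

def closest_version(ideal, candidates):
    """Return the name of the candidate alt text most similar to `ideal`.

    Similarity is difflib's ratio: twice the total matched characters
    divided by the combined length, in [0, 1]. Case-folded so that
    capitalization differences do not count against a match.
    """
    def sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return max(candidates, key=lambda name: sim(ideal, candidates[name]))

# Hypothetical alt texts for one image (not from the study's corpus).
candidates = {
    "Prompt": "happy monkey, photorealistic, 4k",
    "V2L": "a monkey",
    "Expert": "A photorealistic close-up of a smiling monkey raising both arms.",
}
best = closest_version(
    "A photorealistic closeup of a happy monkey with arms raised.",
    candidates)
```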
4.2 RQ2: How do creators and SRUs evaluate alt texts from different sources?
4.2.1 Preferred Alt Text Versions.
Mean (SD) preference rank; lower values indicate stronger preference.
Alt text version | Rank (creator) | Rank (SRU) | Rank (all) |
Prompt | 3.18 (0.98) | 3.03 (1.02) | 3.08 (0.98) |
V2L | 3.22 (0.97) | 2.97 (0.92) | 3.05 (0.94) |
Expert | 1.59 (0.67) | 1.91 (1.03) | 1.80 (0.94) |
Creator-Original | 2.00 (0.92) | 2.09 (1.03) | 2.06 (0.99) |
4.2.2 Compositions of Preferred Alt Text.
Correlations between preference rank and composition features; asterisks mark statistically significant coefficients. Because rank 1 is the most preferred, a negative value means more of that feature is associated with stronger preference.
Rater | Char count | % Nouns | % Verbs | % Adjectives | % Adverbs | % Other |
Creators | -0.59** | 0.32** | -0.25* | -0.33** | -0.35** | 0.23* |
SRUs | -0.37** | 0.19** | -0.29** | -0.08 | -0.15* | 0.06 |
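To make the signs in this table concrete, here is a minimal correlation sketch. It assumes a Pearson-style coefficient, which the paper does not restate in this excerpt, so treat it as an illustrative stand-in rather than the study's exact statistic.

```python
from math import sqrt

def pearson(ranks, feature):
    """Pearson correlation between preference ranks and one composition
    feature (e.g., character count)."""
    n = len(ranks)
    mr, mf = sum(ranks) / n, sum(feature) / n
    cov = sum((r - mr) * (f - mf) for r, f in zip(ranks, feature))
    vr = sum((r - mr) ** 2 for r in ranks)
    vf = sum((f - mf) ** 2 for f in feature)
    return cov / sqrt(vr * vf)

# Toy data, not the study's: longer alt texts receiving better (lower)
# ranks produce a negative coefficient, matching the sign of the -0.59
# reported for creators on character count.
r = pearson([1, 2, 3, 4], [400, 300, 100, 50])
```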
4.3 RQ3: Can text prompts be a good source of alt text for AI images?
4.3.1 Irrelevant Content.
4.3.2 Generic Overview.
4.3.3 Undepicted Phrases.
4.3.4 Jargon.
4.4 RQ4: What should be described in alt texts for AI images? How, if at all, does this differ from alt text considerations for traditional images?
4.4.1 Provenance.
S22: “I think they [AI images] have a lot of potential for danger if they’re not labeled correctly. I saw an image of Obama and Angela Merkel dancing on the beach together and it was made by Midjourney and it looked very very realistic. It can lead to fake information out there that can have a really bad impact on our society.”
4.4.2 Aberrations and Uncanny Content.
“These are things that you might not see normally. And because you can come up with anything with T2I, it may require more description about relative positions, placements, colors, to give you a feel of what this actually looks like as opposed to an image of a well-known movie poster — it would be very easy to describe because the viewer may have a sense already of what this is like through their background.”
4.4.3 Visual Uncertainty and Creator Intent.
4.4.4 Image Medium, Style, and Ambience.
“It got sort of the details but is kind of missing the point of the image… When I look at images, it’s a kind of a depth-first process where you take in the dominant features, first the vibe, the warmth, the colors, maybe the style, and then you start to drill down into the arrangement of things, and then maybe you finally end up with things like words or figures, or what’s happening in scenes… And the impressive majesty of this image, it’s kind of like Yosemite Valley. You don’t start with the color of the leaves or something on a particular tree, you really start at that whole picture.”
5 Discussion
5.1 Accessible Provenance
5.2 Alt Text for AI-Generated Images
5.3 Limitations
6 Conclusion
A Appendix
Category | Subgroup | Creators (count) | SRUs (count)
Gender | Male | 11 | 8
Gender | Female | 5 | 8
Age (years) | 18–24 | 0 | 2
Age (years) | 25–34 | 6 | 4
Age (years) | 35–44 | 8 | 5
Age (years) | 45–54 | 2 | 5
Factor | Prompt | V2L | Expert | Creator | SRU-ideal |
Provenance | Not described, but jargon (e.g., cybernetic, --upbeta) and unusual sentence structure implied AI generation. | Not described | Described watermark if present, i.e., in 27/32 images. Example: “Model_name is written near the bottom right corner.” | Described watermark in 5 ‘original’ and 16 ‘ideal’ versions out of 32 images (27 with watermark). Example: For I9, C9 added, “At the bottom right, there are 5 colored squares, the signature for Model_name.” | Described AI generation in plain language in 23/32 images; all 23 had a visible watermark described by other alt texts. Example: S31 described I7, I17, and I25 as “Image made by model_name.”
Aberrant and uncanny content | Not described | Not described | Described in 12/32 images, among which 5 explicitly called those out as possible “aberrations of the AI model.” Example: I13’s expert alt text said, “a birthday-style multi-colored banner that says ‘KIRRY ARIHOA’ with some illegible letters, a possible aberration of AI models.” | Described in 10/32 images (both ‘original’ and ‘ideal’ versions); only 4 ‘ideal’ versions called these out as aberrations. Example: In I13’s ‘original’ alt text, C13 said, “Some of the edges of the drawing are smudged or slightly stretched out of shape… A letter banner above the mouse reads ‘DIRRY ARIHIOA’.” In the ‘ideal’ version, they added, “a possible aberration of AI model used to generate this image.” Creators sometimes considered aberration information as unnecessary and irrelevant to the images’ key point. | Described in 3 alt texts, although none called them out as aberrations. Example: S29’s alt text for I13 mirrors the description from C13’s ‘original’ version, not the ideal version which called out aberrations explicitly. SRUs felt that aberrations and uncanny content are difficult to visualize, require a longer description, and are unnecessary information except in certain contexts (e.g., if the image is created or shared by the SRU for their own work, or used as a funny example of T2I hallucinations).
Visual uncertainty | Not described | Not described | Used qualifying language to describe visually uncertain content. Example: Experts described I25’s animal to have an “otter-like body” and “parrot-like short curved gray beak.” | Sometimes used qualifying language for visually uncertain content. Example: C11 described I25’s animal to be a “hamster-like creature with sharp beak.” | Preferred accurate description (even if generic) over specific but potentially inaccurate interpretations. Example: In I29, S23 chose ‘weapon’ instead of ‘flamethrower’ or ‘bat’ to describe an object, because they preferred “to say that it’s a weapon if it’s not explicitly sure that it’s a flamethrower.” SRUs also sometimes used qualifying language. Example: S25 described I27 as showing “sign language-like gestures” to imply that “this kind of looks like sign language but I don’t know if they’re actually saying something in sign language.” |
Creator intent | Indicated creator intent. Example: I9 prompt says, “Draw a professional black and white graphic logo of a llama standing with a unicorn horn.” | Not described | Not described | Indicated creator intent, which sometimes was a determining factor in choosing between multiple interpretations of visually uncertain content. Example: In I9, C9 considered the experts’ interpretation of alpaca to be incorrect, explaining that “I asked for a llama. So I’m thinking that this is supposed to be a llama, not an alpaca.” | Not described. Some SRUs valued knowing creator intent if available. Example: S23 said, “I would lean more toward the intent of what the author wanted to do.” |
Factor | Prompt | V2L | Expert | Creator | SRU-ideal |
Image medium | Mentioned in 16/32 images. Example: photo, infographic, poster, pixel art, oil painting, pencil sketch, charcoal drawing. | Mentioned in 18/32 images. Example: painting, drawing, photos, pictures. | Mentioned in 21/32 images. Example: painting, pixel art, clip art, charcoal/graphite pencil drawing, photograph, studio shot, illustration. | Mentioned in 17 ‘original’ versions and 24 ‘ideal’ versions out of 32 images. Example: drawing, painting, pencil sketch, illustration, photo, pixel art. | Mentioned in 19/32 images. Example phrases: drawing, painting, pencil sketch, photograph, illustration, studio shot, pixel art, charcoal/graphite pencil drawing, oil painting.
Style | Mentioned in 20/32 images. Example: arcane style, psychedelic style, Van Gogh style, cinematic, photorealistic, anthropomorphic, renaissance, graphic logo, black-and-white, high resolution, high quality scan, highly detailed, isometric render, octane render, insular art, Christian iconography | Mentioned in 5/32 images. Example: cartoon image, black-and-white. | Mentioned in 9/32 images. Example: black-and-white, book-style, abstract, cartoon style, closeup. | Mentioned in 9 ‘original’ and 11 ‘ideal’ versions out of 32 images. Example: style of anime, movie poster-style, renaissance style, style of the 13th century, style similar to Rembrandt, black-and-white, abstract, photorealistic, graphic logo, closeup, impressionist. | Mentioned in 11/32 images. Example phrases: detailed, closeup, Van Gogh style, abstract, black-and-white, style similar to Rembrandt, renaissance portrait, movie poster-style, anime style, book-style, photorealistic. Some SRUs were initially against image medium and style descriptors but appreciated this information upon learning more about T2I variations. |
Ambience | Conveyed ambience and subjective emotions in several images. Example: In I19 “sad person”, in I4 “happy monkey”, in I18 “happy and friendly sloth”, and in I22 “beautiful landing page.” | Did not describe ambience or emotions, but on a few occasions identified the subjects’ facial expressions that conveyed emotions. Example: “a smiling monkey” in I4. | Described overall ambience on a few occasions. Example: In I46 “relaxed vibe” and in I33 “dystopian” cityscape. In several cases, described the subjects’ facial expressions or body language that indicated emotions. Example: In I4 “laughing monkey” or in I18 “arms raised skywards as if in celebration.” | Described overall ambience and emotion in several images, in addition to describing explicit facial or bodily expressions. Example: In I3, C3 described, “the entire bird seems to glow with fiery energy. The bird appears very regal and majestic.” In I17, C3 added, “The image gives off the feeling that it is a scary movie.” In I19, C5’s ideal version described the person’s “posture suggests sadness or grieving.” In I4, C4 said, “a happy monkey”. | Generally preferred more objective descriptions instead of interpretive ambience. Example: In I17, both S17 and S31 added that the letters were “shaky” instead of explicitly mentioning that the image gives a “scary” vibe. SRUs also included explicit facial or bodily expressions. Example: In I14, S30 added, “smiling” turtles. In I4, S20 added, “hands up in a celebratory position.” On a few rare occasions, SRUs included subjective expressions. Example: In I3, S18 mirrored the creator’s emotive expression, “the entire bird seems to glow with fiery energy. The bird appears very regal and majestic.” In I4, S28 said, “happy monkey”.
Footnotes
Supplemental Material
References
Publisher: Association for Computing Machinery, New York, NY, United States