Accessible AI: Why LLM Outputs Fail Users with Disabilities (Prompt Engineering the System, Not the User)

Executive Summary
Over 1.3 billion people rely on AI systems that default to failing them - a responsible AI problem hiding in plain sight. When I tested Claude's alt text generation across four blog images, context-free prompts produced descriptions that named visual elements but conveyed no meaning. Adding purpose and audience context transformed outputs from technically valid to functionally useful - same AI, different prompts, dramatically different accessibility outcomes.
Three prompt engineering patterns consistently improved results: including purpose context ("this is a hero image for a post about accessibility metrics"), specifying audience ("describe this for someone who cannot see it"), and referencing WCAG's "equivalent purpose" standard. Research validates this approach - University of Washington studies found LLMs handle simple accessibility tasks better than complex ones, while Microsoft Research confirms that understanding what blind users actually need produces more useful descriptions.
The intervention is context, but the burden belongs with system designers, not individual users. Disabled users shouldn't need to engineer better prompts to get adequate access. WCAG compliance provides the foundation; building trustworthy AI requires encoding accessibility through inclusive design and centering disabled voices in design decisions.
Introduction
Over 1.3 billion people worldwide live with disabilities. As AI proliferates across every digital touchpoint, the accessibility of LLM outputs becomes a responsible AI imperative, not a feature request. Yet most AI systems ship with defaults that fail these users - offering illusions of access rather than the real thing.
This post examines where LLM-generated accessibility content falls short, why this matters ethically, what standards can tell us, and practical prompt engineering patterns that improve outcomes. I write as an accessibility practitioner who can evaluate outputs against WCAG criteria - but I cite research for claims about user experience outside my direct domain knowledge.
The core insight: context is the intervention. Prompts that include purpose and audience produce meaningfully better accessibility content than "describe this image." The burden should fall on system designers to provide that context, not on disabled users to compensate for inadequate defaults.
The Accountability Gap
The American Foundation for the Blind articulated the problem in their 2024 analysis: AI-generated image descriptions give users "the illusion of access but without the reliability or fidelity that meaningful access demands." This phrase captures something that should trouble anyone building AI systems. We're shipping appearance without substance.
The scale is significant. Fewer than 6% of English-language Wikipedia images have adequate alt text. When AI attempts to fill this gap, it often produces descriptions that are technically valid but functionally useless.
In my own testing, I generated alt text for four blog images using Claude, varying the prompting approach. The results were instructive.
For a hero image combining an accessibility symbol with performance metrics (graphs, gauges), the prompt "describe this image" yielded: "Abstract illustration with accessibility symbol, line graph trending upward, bar chart, and speedometer gauge on blue geometric background."

This is accurate. It's also useless. A screen reader user learns what visual elements are present but not why they matter. The description provides mechanics without semantics.
When I added context - "this is a hero image for a post about accessibility metrics tools" - the output improved dramatically: "Hero illustration combining accessibility symbol with performance metrics - trending line graph, bar chart, and gauge - representing the intersection of accessibility and measurable improvement."
Same image. Same AI. The difference was entirely in what I, as the system designer, provided. The LLM had no way to know the purpose without being told.
This gap between technically compliant and functionally useful defines the accountability problem. An alt attribute exists, so checkboxes get ticked. But the user isn't served.
Why This Is a Systems Design Problem
The alt text example reveals a broader pattern: compliance without substance. WCAG permits empty alt text (alt="") for images that are purely decorative. But this permission is routinely misapplied. The "decorative" label becomes a cop-out - an excuse to avoid the harder work of describing why content exists.
Disabled users correctly suspect they're being denied context their peers receive. A sighted user sees a hero image that establishes mood, signals topic, creates visual hierarchy. A screen reader user gets silence. The standard permits this; the outcome fails them.
This connects to a broader pattern I explored in Compliance without Accommodation: meeting the letter of WCAG criteria while failing its spirit. Alt text is another instance. The criterion exists to ensure equivalent access to information. Empty alt text on images that convey meaning - even subtle meaning - produces unequal outcomes.
Inadequate defaults manifest differently across domains. University of Washington researchers found that ChatGPT and GPT-4 consistently ranked resumes with disability-related credentials lower than identical resumes without them. Disability honors were penalized. An autism leadership award was deemed to show "less emphasis on leadership roles." This is disability bias embedded in defaults.
Custom GPTs trained on bias-mitigation principles showed improvement - but the improvement was uneven. Autism-related bias proved more resistant than others. The problem isn't a simple bug to fix. It's structural.
This requires systems thinking. Individual prompts can improve individual outputs. But we need AI systems that default to better behavior, not systems that require users to compensate for inadequate design. This is what AI governance should address: encoding ethical defaults at the system level.
What Standards Tell Us
But what makes a default "better"? Without measurable criteria, accessibility claims become subjective assertions. This is where standards matter.
WCAG provides the measurable framework for evaluating AI accessibility outputs. For alt text, the relevant criterion is 1.1.1 (Non-text Content): all non-text content must have a text alternative that serves the equivalent purpose.
"Equivalent purpose" is the key phrase. A description isn't automatically equivalent just because it names what's visible. Equivalence requires conveying the same information or function the visual provides to sighted users.
Research on LLM-generated code accessibility found that LLMs handle basic accessibility issues (color contrast, simple alt text) better than complex ones (ARIA attributes, keyboard navigation). Default code frequently fails WCAG standards; structured feedback and screenshot-based prompts improve fixes.
This maps to my alt text testing. For UI screenshots where information is explicit - buttons, labels, text - LLM defaults are reasonable. For abstract or conceptual images where meaning isn't explicit, defaults fail.
Standards provide measurement criteria but can't guarantee outcomes. WCAG 3.0, currently in draft, will explicitly address emerging technologies. Accessibility Standards Canada's AI Technical Guide emphasizes inclusive design with people with disabilities. But without the ethical commitment to serve users - the foundation of trustworthy AI - standards become another checkbox.
Prompting for Better Defaults
Standards define the criteria. The question is how to meet them in practice.
Context is the intervention. My evaluation identified three prompt engineering patterns that consistently improved alt text quality:
| Pattern | Example Prompt | Effect |
|---|---|---|
| Add purpose context | "This is a hero image for a post about accessibility metrics. Describe it." | Shifts from describing mechanics to conveying meaning |
| Specify audience | "Describe this image for someone who cannot see it" | Shifts from visual inventory to equivalent information |
| Reference WCAG | "Write alt text that provides equivalent information per WCAG 1.1.1" | Anchors output on the "equivalent purpose" standard |
The phrase "for someone who cannot see it" proved particularly effective. It reframes the task from "describe visually" to "provide equivalent information." The prompt makes the user present, not abstract.
Some approaches failed. "Write accessible alt text" produced verbose, overly detailed output - LLMs conflate "accessible" with "comprehensive." For abstract images, providing article context caused over-interpretation: the LLM projected symbolism that might not match creator intent. Given context about an ethics post, an interconnected network of shapes was described as "representing the interconnection of ethical principles" - which may or may not be accurate.
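For teams assembling prompts programmatically, the three patterns can be sketched as a small helper. This is an illustrative sketch, not a real library - the function and parameter names are hypothetical:

```python
def build_alt_text_prompt(purpose: str = "",
                          audience_framing: bool = True,
                          cite_wcag: bool = True) -> str:
    """Assemble an alt-text prompt from the three context patterns.

    purpose          -- why the image exists, e.g. "hero image for a
                        post about accessibility metrics" (pattern 1)
    audience_framing -- add "for someone who cannot see it" (pattern 2)
    cite_wcag        -- anchor on WCAG 1.1.1 "equivalent purpose"
                        language (pattern 3)
    """
    parts = []
    if purpose:
        parts.append(f"This image is a {purpose}.")
    task = "Describe this image"
    if audience_framing:
        task += " for someone who cannot see it"
    parts.append(task + ".")
    if cite_wcag:
        parts.append("Write alt text that provides equivalent "
                     "information per WCAG 1.1.1.")
    return " ".join(parts)
```

Calling `build_alt_text_prompt("hero image for a post about accessibility metrics")` combines all three patterns in one prompt; each flag can be dropped independently to test which pattern carries the improvement.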

Research validates this user-centered approach. The Microsoft Research alt text initiative, in collaboration with UT Austin, focuses on increasing the utility of automatically generated descriptions by understanding what blind and low vision users actually need - not just what's technically describable.
The ethical framing matters. These prompting patterns shouldn't be user burden. A blind user shouldn't need to engineer better prompts to get adequate alt text. The burden belongs on content creators, platform designers, and AI developers.
This means encoding accessibility intent at the system level - a core principle of inclusive design and responsible AI development - in platform defaults, tool configurations, and organizational guidelines. System interventions scale where individual prompts don't.
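As a minimal sketch of what a system-level default could look like - all names here are hypothetical, assuming a CMS or publishing platform that already holds page metadata - the platform, not the user, supplies the context:

```python
from dataclasses import dataclass


@dataclass
class PageContext:
    """Metadata a CMS or publishing platform already knows."""
    post_title: str
    image_role: str  # e.g. "hero image", "inline screenshot"


def default_alt_text_prompt(ctx: PageContext) -> str:
    """Inject purpose and audience context into every alt-text request.

    The disabled user never writes or sees this prompt; the platform
    builds it for each image before calling the model, so adequate
    context is the default rather than a user's burden.
    """
    return (
        f"This is a {ctx.image_role} for a post titled "
        f"'{ctx.post_title}'. Describe it for someone who cannot "
        f"see it, providing equivalent information per WCAG 1.1.1."
    )
```

The design point is where the context lives: the purpose and audience framing come from metadata the system already has, so every generated description starts from a better default.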
Beyond Compliance: What Standards Don't Capture
These patterns improve measurable outcomes: WCAG compliance, information equivalence, technical validity. But meeting standards is the floor, not the ceiling. User research reveals needs that technical evaluation alone can't address.
Stanford HAI research found that blind and low vision users often prefer descriptions that are subjective and context-appropriate rather than objective and neutral. "If the dog is cute or the sunny day is beautiful, depending on context, the description might need to say so." This preference for contextual judgment goes beyond what WCAG requires - and informs better prompting strategies.
Neurodivergent users have documented format preferences: bullet points, numbered lists, segmented steps, clear structure supporting executive functioning. These preferences suggest that verbose prose - even technically accurate prose - may not serve everyone. Designing with these needs in mind produces outputs that work for broader audiences.
Deaf and hard of hearing users prefer verbatim captions over summarized ones. Accuracy in high-stakes contexts (legal, medical, employment) requires standards current AI often fails to meet. These are real constraints that shape when and how LLMs should be deployed for accessibility content.
Significant research gaps remain. Studies focus heavily on visual impairments; non-visual disabilities are underexamined. Compounding factors - disability combined with other marginalized identities - receive limited attention. Global South perspectives are largely absent from Western-centric research.
This is why participatory design - a cornerstone of universal design methodology - matters. "Nothing about us without us" isn't a slogan - it's a design imperative. Standards-based evaluation provides a foundation. Building on that foundation requires the voices and experiences of the users these systems are meant to serve.
Conclusion
LLM outputs give the illusion of access without the substance. Technically valid alt text that describes mechanics without semantics. Default behaviors shaped by biased training data. Standards met while users are failed.
The intervention is context. Prompts that include purpose, audience, and accessibility framing produce better outputs. System designers should encode this context by default, not expect users to compensate.
Technical evaluation provides the foundation: measuring outputs against WCAG criteria, identifying which prompting patterns improve compliance, documenting where defaults fail specific standards. But meeting standards is necessary, not sufficient. Building systems that truly serve disabled users requires their voices - not just in user research, but in the rooms where these systems are designed.
Build AI that defaults to serving everyone. Don't ship illusions of access.