Safe on Paper: What Accessibility Standards Teach AI Safety Governance

Executive Summary
I built a color contrast tool that spread to federal agencies and became the de facto standard for accessible visualization in my part of the federal government. That reach made a different problem visible: the underlying accessibility standard the tool was built on produces documented false passes for people with low vision. Better guidance existed. Getting it into practice required years of navigating institutional inertia that had nothing to do with whether the science was sound.
That pattern has three parts: standards that produce wrong answers once the problem space advances, revision that moves slower than the science does, and affected populations excluded from the working groups that actually move decisions. It's not a WCAG failure specifically. It's what happens to any standard once it embeds in procurement requirements, legal frameworks, and organizational workflows.
AI safety governance is early enough in that process to build differently. This post traces the failure pattern through a concrete case and proposes specific structural interventions: built-in revision mechanisms, outcome-based evidence requirements, and affected populations in working groups rather than comment periods. The window to act on those lessons is narrower than it looks.
Introduction
By 2022, Contrast Chaser had spread far enough that accessible visualization was widely considered a solved problem in my part of the federal government. That's the kind of statement that sounds like success. From inside the tool's reach, it looked like something else.
Practitioners were learning which colors passed — not why they worked or failed. Specialists who understood the underlying logic produced better work. They were the minority. The tool had optimized for speed, and speed trained users toward compliance thinking rather than judgment. A tool that trains practitioners to trust the compliance score makes it harder, not easier, to see when the score is wrong.
That's the view from inside a standards problem before you've named it as one. What follows is the anatomy of that problem.
What the Standard Gets Wrong
WCAG's contrast algorithm uses a luminance ratio. It doesn't account for spatial frequency — text size and weight — which is a known variable in how the human visual system perceives legibility. It doesn't reflect current display technology. And it treats all hues as equally malleable. They aren't.
Pick yellow: its identity is in its lightness. You can't darken it significantly without it becoming something else. Pick a saturated blue: it's strongest as a dark tone. The hue dictates the options. The standard doesn't make that visible — so practitioners kept handing designers answers without explanations.
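The algorithm's blind spots are visible in its own signature. Here is a direct Python implementation of the published WCAG 2.x formula (the constants come from the spec; the example colors are mine):

```python
# The published WCAG 2.x contrast calculation, implemented directly.
# Note the signature: no text size, no font weight, no per-hue handling.

def _linearize(channel_8bit: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb: tuple) -> float:
    """Weighted sum of linearized channels: the only perceptual model used."""
    r, g, b = (_linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg: tuple, bg: tuple) -> float:
    """WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05)."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    return (max(l1, l2) + 0.05) / (min(l1, l2) + 0.05)

# #767676 gray on white just clears the 4.5:1 AA threshold for body text.
print(round(contrast_ratio((118, 118, 118), (255, 255, 255)), 2))  # 4.54
# Pure yellow on white fails badly; the only WCAG-visible fix is darkening
# the hue until it stops being yellow.
print(round(contrast_ratio((255, 255, 0), (255, 255, 255)), 2))    # 1.07
```

The same ratio comes back for 9px thin text and 24px bold text; size and weight never enter the computation.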
When Better Guidance Can't Get Through
That same algorithm covered both text contrast and the non-text contrast WCAG 2.1 added for elements like charts, maps, and visualizations. APCA, grounded in more recent psychophysical research, does a better job of modeling what the visual system actually does for both. I wanted to build it into Contrast Chaser.
Here's the catch: building APCA in meant any team doing right by its users would find itself technically out of compliance with existing accessibility regulation. Not because APCA was worse; it was better. The regulation simply hadn't caught up with the science. That's not a team failure. It's a standards failure.
Safe on paper. Less safe in practice.
Three groups needed to sign off: the people worried about whether it would slow the work, the people worried about liability, and the people whose job was to keep the regulatory floor consistent across the organization.
The approach: strip other unofficial guidance out of the UI first, to make clear the one official text standard was untouched. Frame new guidance as a tool for continuing the conversation, not rewriting a rule. Build relationships with the regulatory tier over time so they had enough context to tolerate the conflict rather than reject it reflexively. Wait for the right window at the leadership level.
It worked. It shouldn't have had to be that hard. The science was better. The users would benefit. The path was that complicated because the standard had worked its way into contracting language, legal citations, and day-to-day operational practice — and anything that touched those layers required managing everyone invested in keeping them stable.
This is what standards calcification looks like from the inside.
Why This Matters for AI Safety
The accessibility case shows the full failure pattern, start to finish. AI safety governance is early enough that it hasn't run to completion yet. Three things happen, reliably, when standards calcify:
The standard starts producing wrong answers. WCAG's false passes are the predictable outcome of a static algorithm in a dynamic problem space. The criteria were designed for what we could measure when they were written. The problem space doesn't hold still.
A model that clears red-teaming evaluations and alignment benchmarks is "safe" the same way a WCAG-compliant interface is "accessible." Teams that optimize for benchmarks learn benchmark thinking. They develop judgment about passing scores, not about the failure modes the scores were meant to detect.
Standards revision moves slower than the science does. APCA — evaluated by the W3C's Visual Contrast of Text Subgroup for inclusion in WCAG 3 before that subgroup went inactive — still isn't the standard. WCAG 3 itself has been in working draft since 2021 and isn't final. The slow pace isn't necessarily bad faith; updating legal frameworks, automated testing tools, and procurement language simultaneously is genuinely complex. But the practical effect is the same bind: practitioners who want to do right by their users end up caught between better evidence and entrenched compliance requirements, with no clean path through.
The AI safety version of this is early enough to interrupt. Evaluation frameworks and benchmark suites are accumulating institutional weight — referenced in agency guidance, cited in policy documents, becoming shorthand for "we checked." That's the mechanism. Once procurement requirements attach to specific evaluation methods, updating them means moving every layer that built on top of them simultaneously. The science of what current benchmarks actually measure isn't settled — but the governance machinery is building anyway.
The people most affected aren't in the room. Users with low vision who encounter false-passing color combinations daily had no structured role in the WCAG revision process. The people best positioned to identify where the standard was failing — because they lived with the failures — had no reliable mechanism to flag when the standard was wrong.
The same structural gap exists in AI safety governance. The populations most likely to encounter AI agent failures in high-stakes contexts — people with disabilities who depend on agents as accommodation tools, people who interact with government services by necessity, people for whom a miscalibrated system means a denied benefit or a missed medical alert — are not the ones defining what alignment looks like. Their failure modes tend to surface as edge cases rather than primary design criteria, if they surface at all.
What to Do Differently
Standards are necessary. They're also machinery, and machinery has inertia. The question isn't whether to have AI safety standards — it's how to build them so they can stay true to their purpose as the problem space evolves.
The first structural gap the accessibility case reveals is the absence of built-in revision mechanisms. WCAG has no native process for incorporating better evidence once the standard embeds — APCA still isn't the standard. The Contrast Chaser experience shows how that gap operates at the organizational level: adopting APCA within a single organization meant navigating contracting language, legal citations, and day-to-day operational practice simultaneously, none of which were designed with revision in mind. At the standards level, those layers multiply. AI safety evaluation frameworks are early enough that revision mechanisms can be designed in rather than retrofitted later. Sunset clauses. Named processes for when a benchmark suite is superseded by better evidence. They're currently absent.
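A revision mechanism can be as mundane as metadata. Purely as an illustration (the type and field names are hypothetical, not from any real evaluation framework), a standard that lapses by default looks like this:

```python
# Illustrative sketch only: what "revision designed in" could look like as
# data. Field names are hypothetical, not from any real framework.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class EvalStandard:
    name: str
    adopted: date
    sunset: date                   # expires unless actively renewed
    superseded_by: Optional[str]   # named successor, set when evidence warrants
    evidence_review_due: date      # scheduled check against newer science

    def in_force(self, today: date) -> bool:
        """The standard lapses by default; staying current is the active step."""
        return self.superseded_by is None and today < self.sunset
```

The design choice is the inversion: under a sunset clause, inertia retires the standard instead of preserving it, so the burden of proof sits with keeping it, not with replacing it.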
The second gap is the difference between process compliance and outcome evidence. WCAG compliance is process-based: run the algorithm, report the ratio, clear the threshold. A WCAG-compliant interface can still fail real users — the algorithm reports whether the math cleared the bar, not whether anyone with low vision can actually read the result. The AI safety equivalent is a safety evaluation that reports whether a model cleared its benchmarks — not whether those benchmarks would catch a harmful model. The standard to hold instead: evaluation methods must demonstrate that they catch the failure modes they're designed to detect, and that they work for the specific populations at risk, not just the median case.
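One way to make that standard operational is to test the evaluator itself against seeded, known-harmful cases — the same idea mutation testing applies to test suites. A hypothetical sketch (the function names and the shape of the failure catalog are my assumptions, not any existing tool's API):

```python
# Hypothetical sketch: validate an evaluator against seeded known failures,
# per affected population, before trusting its pass/fail verdicts.
from typing import Callable, Dict, List, Tuple

# (case, population) pairs known to be harmful, e.g. from incident reports.
KnownFailure = Tuple[str, str]

def detection_rates(
    evaluator: Callable[[str], bool],   # True = case flagged as unsafe
    known_failures: List[KnownFailure],
) -> Dict[str, float]:
    """Fraction of seeded failures the evaluator catches, per population."""
    caught: Dict[str, List[bool]] = {}
    for case, population in known_failures:
        caught.setdefault(population, []).append(evaluator(case))
    return {pop: sum(hits) / len(hits) for pop, hits in caught.items()}

def outcome_valid(rates: Dict[str, float], floor: float = 0.95) -> bool:
    """Process compliance asks "did the model pass?". Outcome evidence asks
    "does the check catch what it claims to, for every population at risk?"
    Validity is gated on the worst-served population, not the average."""
    return bool(rates) and min(rates.values()) >= floor
```

Taking the minimum rather than the mean is the point: an evaluator that catches everything for the median case and nothing for one at-risk population fails outright.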
The third gap is structural inclusion. Disability advocates have formally objected to WCAG over failures the standard missed — one objection about cognitive and learning disability gaps drew signatures from over 40 organizations. But comment periods accept input; they don't determine who sits on the working groups that act on it. Affected populations need structural roles in working groups, not submissions in comment periods. Those are not equivalent forms of participation, and treating them as equivalent is how affected populations end up with a channel into the process but no seat in it. This is what inclusive design looks like applied to the governance layer itself: not accommodating the margin after the standard is set, but ensuring the margin shapes the standard from the start.
The accessibility community spent 25 years learning these lessons reactively. AI safety gets them in advance. The playbook exists. Use it.