The Problems of Artificial Intelligence for Digital Accessibility

In this article we will think about artificial intelligence – not from the perspective of its potential, but rather with a view to the risks that are already emerging or have already manifested themselves, particularly in the context of digital accessibility. In previous posts, I've discussed the opportunities of AI in detail. It seems all the more important, therefore, to now also examine the problematic aspects. First, I will outline the key challenges and then discuss possible solutions.

Summary

This post examines the risks of AI in the area of digital accessibility. While automatic code generation offers democratizing potential (e.g., through tools like GitHub Copilot), it is currently unable to reliably create fully accessible applications. AI can provide valuable support to developers, but it does not replace expert review.

Key problems include:

  • Error-prone or non-compliant code (e.g., regarding WCAG),
  • Training data that predominantly contains inaccessible content and reproduces existing deficiencies,
  • Hallucinations in summaries and simplifications,
  • Bias towards people with disabilities,
  • Insufficiently accessible AI tools themselves,
  • Poor-quality automatic image descriptions, subtitles, or sign language avatars.

It is particularly critical that affected target groups often cannot check content themselves and are therefore heavily reliant on accurate AI output.

Possible solutions include:

  • AI as an assistant with multi-stage quality assurance instead of full automation,
  • Curated and quality-assured training data,
  • Clearly defined accessible code patterns,
  • Regulatory frameworks such as the EU AI Act,
  • Above all, the systematic, early, and compensated involvement of people with disabilities in development and quality assurance.

Key message: AI can promote inclusion, but without targeted quality assurance and genuine participation, it creates new risks of exclusion. Accessibility is not an add-on feature, but a fundamental requirement.

Automated Code Generation by AI

One topic currently under particularly intense discussion is automated code generation by AI systems. This field is both fascinating and controversial. While some are already predicting the end of traditional software development, others warn of serious quality and security problems when program code is generated automatically on a massive scale. As is so often the case, the truth lies in a nuanced perspective.

Significantly more code is being generated automatically today – often by people without formal programming training who use AI to create simple solutions. This also applies to blind and visually impaired people. Here, a real opportunity for the democratization of software development opens up: Even minor adjustments or simple macros currently cost several hundred to a thousand euros, as professional developers must be compensated accordingly for their time. Through so-called "vibe coding"—that is, the dialog-based generation of code using AI tools—even non-experts can now develop their own solutions without relying on external service providers. In this sense, the technology has emancipatory potential.

At the same time, the risks should not be underestimated. Automatically generated code can contain security vulnerabilities, be inefficient or error-prone, and is often inadequately documented. These are serious shortcomings for professional software development.

In the context of digital accessibility, another aspect comes into play: There are already too few developers with in-depth expertise in accessible implementation. This bottleneck means that many digital products do not meet accessibility requirements, or only do so inadequately. The hope now is that AI-supported code generation could at least partially remedy this situation. However, an important distinction must be made between two scenarios:

Complete generation of entire applications with the aim of accessibility

Currently, in practice, it is not possible to instruct a tool to create a complete, accessible app. The requirements for semantically correct HTML, ARIA implementation, keyboard navigation, screen reader compatibility, and contrast ratios are complex and context-dependent. While AI systems generate formally plausible code, they do not guarantee a standards-compliant implementation, for example according to WCAG or EN 301 549.

Targeted Support for Experienced Developers

Using AI as an assistance system seems significantly more sensible: AI can generate boilerplate code, take over repetitive tasks, or provide information about potential barriers. However, the prerequisite remains that the responsible person can professionally assess the quality and accessibility.

Coding assistants such as GitHub Copilot, which can be integrated into development environments like Visual Studio Code and offer context-sensitive code suggestions during programming, present a promising opportunity. I see considerably more potential here than in the complete, automated generation of entire applications. An experienced developer can professionally evaluate the suggestions of such an assistant without being deeply immersed in accessible development.

In an ideal case, a multi-stage quality assurance system is created:

  1. An AI tool generates or supplements code.
  2. Another tool—ideally from a different provider—automatically checks this code for semantic and logical correctness.
  3. In addition, traditional testing tools are used, such as static analyses or accessibility checkers.
  4. Finally, a manual validation is performed by a qualified person.

Such a setup can significantly accelerate development work. It's even conceivable that a highly specialized expert in accessible development isn't required; solid programming skills are sufficient if AI-supported assistance systems provide structured support for accessibility aspects. Today, the problem isn't so much training a developer in accessibility. The difficulty is that the knowledge fades if the person only occasionally programs with accessibility in mind. They can't build up experience continuously, and it's not worthwhile for them to constantly stay up to date.

A key problem lies in the generation process itself. Many AI-supported code tools don't work incrementally in the classic sense, but rather regenerate large parts of the application when changes are made. Every new prompt can lead to existing structures being overwritten or unintentionally altered. Improvements in one area are then often accompanied by deteriorations in another. The probabilistic way Large Language Models work is ill-suited to generating accessible code for an entire application.

Accessibility, however, is not an add-on, but a systemic property of an application. It requires coherent architectural decisions, consistent semantic structuring, and stable interaction logic. A generation approach that destructively rewrites large parts of the code undermines these prerequisites.

Another fundamental risk lies in the training data of the models. For web applications, the underlying HTML, CSS, and JavaScript code is publicly accessible and therefore easily usable as training material. However, most existing websites are not implemented with accessibility in mind. If a model is trained predominantly with faulty or non-standard code, it will statistically reproduce these deficiencies. Even explicit instructions such as "Generate accessible code according to WCAG" often fail to result in consistent implementation in practice. Semantic depth and normative requirements are either incompletely considered or misinterpreted. Experience from the field—for example, when using AI functions in design or development tools—shows that corresponding prompts do not guarantee truly accessible results. The intention is understood linguistically, but not implemented technically in a standards-compliant manner.

Against this backdrop, it seems unrealistic in the foreseeable future that AI systems will independently generate complete, standards-compliant, and robust accessible applications.

A major risk lies in the uncritical use of code generation tools by those inexperienced in accessibility. This is where AI automation bias comes into play. In other words, those using the tools mistakenly believe that simply instructing them to generate accessible code is enough to guarantee an accessible result. If a formal testing routine is integrated into the tool, the application can pass automated tests even if it is not actually accessible in practice. Unfortunately, many decision-makers treat accessibility as a mere compliance requirement and a minor feature, thus increasing the risk of poor AI-generated applications.

A structural risk arises particularly when AI systems begin using their own output as a training basis. We are already observing that more and more websites are being generated using AI—often without any consideration of accessibility. If this very code then serves as training material for new models, a self-referential cycle is created.

  1. Inaccessible code is generated.
  2. This code becomes publicly available.
  3. It is incorporated into new training datasets.
  4. The model reproduces and amplifies existing deficiencies.

This phenomenon is not limited to code. It is also being discussed that AI-generated content is increasingly being used as training data for text-based models. The danger lies in the fact that errors, simplifications, or ambiguities in content can accumulate—while maintaining high linguistic quality. This is precisely the strength, but also the risk, of these systems: They produce seemingly plausible results, even if the underlying content is lacking.

In the context of code, this means: formally correct structures that are, however, semantically incorrect, insecure, or not implemented in an accessible way. Models lack an intrinsic awareness of quality; they do not recognize that an ARIA attribute is logically inconsistent or that an interaction structure is inaccessible to screen readers—unless this has been explicitly and validly defined in the training material.

To break this cycle is not trivial. Possible approaches include:

  • Curated, quality-assured training datasets,
  • Explicit weighting of standards-compliant implementations,
  • Model-internal validation mechanisms,
  • Combining generative systems with formal rule checks.

Whether and to what extent work is being done on this is difficult to assess from the outside. However, one thing is clear: the problem is real and structural.

What I do consider a possible solution, though, is the automated accessibility testing of applications. So far, axe-core and similar systems are entirely rule-based and produce many false positives. However, some providers, such as Accessibility Cloud, are working on integrating AI into the process. The ability to analyze the rendered visual interface can reduce common false positives, such as those related to contrast or keyboard focus.

Accessibility of AI Tools

Another often underestimated aspect concerns the accessibility of the AI tools themselves. Many of these tools appear relatively easy to use in their basic interface. However, as soon as more complex functions are used, such as advanced configuration options, debugging interfaces, or integrations into development environments, barriers emerge.

For people with disabilities, this can effectively lead to exclusion—especially when their professional lives increasingly require the use of such tools. This creates a paradox: technologies that could theoretically promote inclusion generate new risks of exclusion.

This problem is often underestimated because the tools appear "simple" at first glance. However, the complexity becomes apparent in the depth of their functionality.

Problems with Summarizing and Comprehensible Translations

Another point that remains relevant in practice concerns so-called hallucinations, that is, factually incorrect but plausibly presented output. They often arise when AI is used for text optimization, summarization, or conversion into more understandable formats.

For my newsletter, I had several links summarized. Some of them were texts I wrote myself. The summaries contained a lot of nonsense. They included content that sounded relevant but wasn't in the original text.

Something similar happened to me with Gemini. I copied the entire article into the input field and then asked for a summary of that exact text. Despite this, the summary contained statements that weren't in the original. This content was clearly incorrect.

This is difficult. Many people use such tools to simplify texts. For example, they read an article from a government agency and ask for a more understandable summary. If the tool then adds incorrect information, a serious problem arises.

These tools are supposed to help people who have difficulty reading or understanding complex texts. This target group, in particular, often can't verify the information themselves. If they could read and understand the original text, they wouldn't need a summary. Therefore, it's especially critical if the summary contains errors.

I've observed this behavior with several tools, both Gemini and ChatGPT. It was less pronounced with ChatGPT, but I also didn't test it extensively.

These experiences have shaken my trust in such systems. For many years, there has been talk of so-called hallucinations. If this problem remains unresolved, the question arises whether it can be solved at all. I cannot make a definitive technical assessment. I don't know the architecture of such systems well enough. However, my impression is that it still doesn't work reliably.

Bias in Data

A major problem is the data basis of AI systems. The topic of disability is only inadequately represented there.

Disability is a complex issue. Even with physical disabilities, there are many different forms. Nevertheless, these differences are hardly reflected in image databases and text collections. The lived reality of disabled people is often not sufficiently considered.

This has concrete consequences. In automated application processes, disabled people can be disadvantaged. A system might, for example, evaluate photos or resumes. If someone doesn't conform to the expected norm in appearance or in their resume, the person can be automatically rejected. This happens even if they are professionally qualified.

There can also be risks in road traffic. Autonomous or semi-autonomous vehicles must correctly interpret human behavior. If a system misinterprets the behavior of a blind or visually impaired person, dangerous situations can arise.

For example, the vehicle might assume that a person stops because they see or hear the car. In reality, the person might not perceive the vehicle and continue walking. This could lead to an accident.

Such scenarios are particularly relevant in countries where autonomous vehicles are already in use, such as the USA or China. In Germany, fully autonomous systems are currently only permitted in very limited circumstances on public roads.

Another problem concerns information processing. Blind people often require different information than sighted people when interpreting images or graphics.

The problem can be exacerbated by the use of AI-based monitoring systems. If such systems fail to adequately consider certain groups, misinterpretations and discrimination can increase.

Overall, it's clear that incomplete data directly impacts the systems' results. This particularly affects groups that are already structurally disadvantaged.

Inaccuracies in Descriptions and Conversions

An interesting issue is the inaccuracy in the production of accessible content. By this, I mean imprecise or inaccurate descriptions.

Alternative text for images is one example. I see great potential here. Many images and PDFs on social media lack descriptions, making them inaccessible. Automatic image descriptions could solve this problem if they function reliably.

Some platforms have already integrated such features, for example, Facebook, Instagram, and Microsoft in its Office products. However, the quality of these automatic descriptions is often inadequate. Sometimes the texts are garbled or factually incorrect. A regular bus is described as a cable car. A person is vaguely described as "possibly blond with glasses." Such descriptions are hardly helpful. These systems have been in use for many years but have only undergone limited development. They are resource-efficient but lack detail.

Modern, large-scale language and image models are significantly more powerful. They can describe images in greater detail. However, they require considerably more computing power. Therefore, it is currently unrealistic to automatically analyze every single image on large platforms with a complex model. This would result in high costs and high energy consumption.

In addition, there is a structural problem: Blind people represent a comparatively small user group. From a purely economic perspective, their needs are often irrelevant to large platforms.

The problem of inaccurate content also affects subtitles. Automatically generated subtitles usually focus on spoken language. Sounds, moods, or relevant background information are often not adequately described. However, such information is important for deaf or hard-of-hearing people.

Another sensitive issue is sign language. So-called sign language avatars are being intensively discussed in the deaf community. Many affected individuals report that these avatars appear incomprehensible or unnatural. Furthermore, it is criticized that the community is not sufficiently involved in the development process.

If such systems are widely deployed without carefully testing their comprehensibility and practicality, a serious problem arises. Accessibility must not only be implemented formally; it must function in actual use.

When AI systems generate alternative text on a large scale, a quality problem arises.

For example, let's assume an image description is 70% accurate. This means that almost a third of the information is inaccurate or incorrect. It's not completely unusable, but it's also unreliable.

In academic contexts, this can have serious consequences. Illustrations play a central role in natural sciences, computer science, and business and economics. Diagrams, models, and code examples are often relevant for exams.

If the description of a graphic is incomplete or inaccurate, a blind person may misunderstand the content. In university, this can lead to failing an exam. In the workplace, it can result in incorrect decisions.

The same applies to subtitles. If important information is missing or misrepresented, an information deficit arises.

The situation becomes even more critical when human sign language interpreters are replaced by avatars. If these avatars are incomprehensible or inaccurate, the situation worsens. Deaf people have to rely on content they cannot independently verify. The risk of misinformation increases.

Possible Solutions

One approach lies in the area of coding support. AI-powered coding assistants can guide developers. They can suggest ways to make code accessible and propose automated corrections. However, it's crucial that these suggestions are reviewed, both by other tools and by the responsible person.

It seems counterproductive to feed large amounts of existing websites or code unfiltered into models and hope that accessible code will result. Past experience contradicts this approach. Quantity alone does not generate quality.

A more promising approach is working with clearly defined patterns.

For example:

  • An input field requires specific attributes, such as correct labels and ARIA assignments.
  • A toggle button requires clearly defined states and semantic identifiers.

Such accessible patterns can be systematically stored. AI systems could then specifically check whether these requirements are met or suggest appropriate structures.
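To illustrate how such stored patterns could be checked mechanically, here is a minimal sketch using Python's standard-library HTML parser. The two rules shown (inputs need an accessible label, toggles need an `aria-pressed` state) are simplified versions of the real WCAG/ARIA requirements, and the `data-pattern="toggle"` attribute is a hypothetical convention for marking toggle buttons, not a real standard:

```python
# Minimal sketch of rule-based checks for the two patterns named above.
# Real checkers (such as axe-core) are far more nuanced; this only
# illustrates the idea of machine-checkable accessible patterns.
from html.parser import HTMLParser

class PatternChecker(HTMLParser):
    """Checks two simplified accessible patterns:
    1. every visible <input> needs a <label for=...> or an aria-label;
    2. every element marked as a toggle (via the hypothetical
       data-pattern="toggle" attribute) needs an aria-pressed state."""

    def __init__(self):
        super().__init__()
        self.labeled_ids = set()
        self.inputs = []      # (id, has_aria_label) per input field
        self.problems = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "label" and "for" in a:
            self.labeled_ids.add(a["for"])
        elif tag == "input" and a.get("type") != "hidden":
            self.inputs.append((a.get("id"), "aria-label" in a))
        if a.get("data-pattern") == "toggle" and "aria-pressed" not in a:
            self.problems.append(f"<{tag}> toggle without aria-pressed state")

    def report(self):
        # Labels may appear after their inputs, so match them at the end.
        for input_id, has_aria in self.inputs:
            if not has_aria and input_id not in self.labeled_ids:
                self.problems.append(f"<input id={input_id!r}> has no label")
        return self.problems

checker = PatternChecker()
checker.feed("""
  <label for="email">E-mail</label><input id="email" type="text">
  <input id="phone" type="text">
  <button data-pattern="toggle">Dark mode</button>
""")
print(checker.report())
# flags the unlabeled phone input and the stateless toggle button
```

An AI assistant could use exactly this kind of stored rule to verify its own suggestions, or to propose the missing attributes instead of merely flagging them.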

Progress is also possible in the area of analysis tools. Visual checks, such as color contrast analysis, can become more precise. False positives during contrast checks could be reduced.
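Contrast is one of the few accessibility rules that is fully mechanical: WCAG 2.x defines an exact formula for the contrast ratio between two colors. A minimal implementation looks like this; what rule-based tools typically get wrong is not this math but the context (text hidden behind overlays, gradients, background images), which is where visual AI analysis could help:

```python
# WCAG 2.x contrast ratio between two sRGB colors.
# Relative luminance and the (L1 + 0.05) / (L2 + 0.05) ratio follow the
# definitions in the WCAG specification; 4.5:1 is the AA threshold for
# normal-size text.

def relative_luminance(rgb: tuple[int, int, int]) -> float:
    def channel(c: int) -> float:
        c = c / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

# Black on white: the maximum possible contrast of 21:1.
print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 1))  # 21.0
# Mid-grey on white fails the 4.5:1 WCAG AA threshold for normal text.
print(contrast_ratio((150, 150, 150), (255, 255, 255)) >= 4.5)  # False
```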

What currently seems unrealistic is the fully autonomous creation of accessible applications by AI. We are still a long way from that technically, if indeed this goal is even achievable.

The problem of hallucinations during summaries or translations into understandable language remains unresolved. Intuitively, one would expect a system to work exclusively with the input text. In practice, however, models draw on their trained knowledge and supplement content. How this tension between language modeling and actual text fidelity can be reliably resolved is currently not fully understood.

When it comes to inaccessible interfaces, the solution is clear: Interfaces must be designed to be accessible from the outset. Then the problem won't even arise.

Nevertheless, this discussion is constantly being revisited. Even companies that position themselves as particularly responsible don't consistently consider accessibility. Anthropic is one example. The company emphasizes ethical principles, but accessibility isn't implemented throughout.

Even large mainstream tools often function well technically. However, "functioning" is not synonymous with "accessible." A system can run stably and still be inaccessible to certain user groups.

The key measure regarding bias is also obvious: people with disabilities must be systematically included. Many models suffer from a lack of data. Texts, images, and videos that realistically depict different forms of disability are underrepresented. This also applies to specialized assistive apps. Without training data, systems cannot reliably recognize or represent certain needs.

However, data can be collected in a targeted manner. In cities with a high proportion of blind people or in institutions such as schools for people with physical disabilities, real-world usage situations can be analyzed. Such data could be structured and integrated into training corpora. Of course, an ethically sound and data protection-compliant approach is essential.

In addition to data expansion, algorithmic adjustments are conceivable. Large language models are known as foundation models. Specialized systems are built upon them. It would be possible to develop additional models or fine-tuning processes that better account for the typical behavioral patterns of people with disabilities, for example, in road traffic or when using digital interfaces.

Whether and how this can be implemented optimally from a technical perspective depends heavily on the architecture, training strategy, and target system. However, one thing is clear: without targeted optimization, these aspects will remain underrepresented.

The central problem is the lack of inclusion of experts with disabilities in development teams.

Many AI systems are primarily developed for young, healthy, Western, and affluent user groups. These groups determine requirements, test environments, and priorities. This is understandable from a market economy perspective. However, it leads to systematic exclusion.

If AI systems are to be widely used in society, minorities must also be included. Democratic societies are based on participation. Technological infrastructure must therefore not be geared solely towards the needs of the majority.

Accessibility is not an add-on feature. It is a fundamental prerequisite for equal access.

The EU AI Act contains approaches to regulating AI systems. These include provisions for risk analysis and fundamental rights. Nevertheless, the inclusion of people with disabilities and other minorities is not yet sufficiently defined.

This is not just about obvious disabilities. Older people must also be taken into account. An older person often crosses a street more slowly. They might trip over a tram track or stop unexpectedly. Such behaviors are part of everyday life.

When AI systems make safety-relevant decisions, for example in traffic or surveillance situations, these realities must be considered. This should be a given. In practice, however, it is frequently overlooked.

When it comes to inaccuracies and quality issues, the involvement of target groups is crucial. This applies particularly to sign language, but also to image descriptions and subtitles. When a sign language generation system is developed, it should be thoroughly tested by people with a high level of sign language proficiency. Ideally, they should be involved in the development process from the outset.

This is not only ethically sound, but also economically advantageous. If a provider can demonstrate to a municipality that numerous qualified individuals from the community have tested and validated the system, it increases credibility. Quality assurance thus becomes a clear selling point.

The systematic inclusion of affected groups also opens up employment opportunities. Many people with disabilities are unemployed. Often, this is not due to a lack of qualifications, but rather to structural barriers in the labor market.

The AI industry has considerable financial resources. Billions of euros flow through private investments and government funding programs. Against this backdrop, it would be appropriate to allocate funds specifically for inclusive development processes. This is already happening in some projects. However, it is not yet widespread.

A key problem is the form of participation. Mere symbolic participation is insufficient. If affected individuals are only allowed to provide occasional feedback without genuine decision-making authority or structural influence, the impact remains minimal.

Contemporary participation means:

  • Early involvement
  • Appropriate compensation
  • Real opportunities for co-creation
  • Transparent consideration of feedback

Only in this way can systems be created that truly reflect societal diversity.

This brings us full circle: Technical quality, ethical responsibility, and economic sustainability are inextricably linked. Without serious participation from the target groups, AI systems remain incomplete – regardless of how technically powerful they are.

More on Artificial Intelligence and Digital Accessibility