How Does Speech Output for Blind People Work?

A quick clarification for newcomers to the topic of screen readers: the terms screen reader and speech output are often used synonymously, but they are different programs. The screen reader is the interface to the PC and gathers the information to be read out. The speech output is purely an output medium; another such medium would be a Braille display. In principle, the speech output is a module that is independent of the screen reader.


The Speech Synthesizer

All screen readers come with a speech synthesizer: eSpeak for NVDA, Eloquence for JAWS. To many ears these voices sound more artificial than the ones used in smartphones or navigation systems, for example. Synthesizers work with either synthetic or recorded phonemes, i.e. parts of words. Synthetic voices were developed at a time when memory was still measured in kilobytes and megabytes. They are usually lean and performant, while recorded, natural-sounding phonemes take up a lot of storage space and can be sluggish. Most blind power users stick with synthetic voices.

Who controls the Speech Output?

All screen readers have dictionaries that regulate the pronunciation of many words. To understand why, you need a rough idea of how pronunciation works in general.

Basically, everything could be so beautifully simple: we set a few rules for how certain phonemes are assembled and pronounced, and we stick to them. In reality, there are thousands of exceptions to these rules that have nothing to do with logic. In German, for example, the noun "Weg" (path) is pronounced differently from the adverb "weg" (away) in "Ich bin dann mal weg" ("I'm off then"), even though both are spelled the same.

There are many such exceptions, and some of them are anchored in the dictionaries of the speech outputs. So there are rules for how phonemes are put together, which apply in many cases, as well as dictionaries to catch exceptions to those rules. It gets complicated with the long compound words that are so popular in German. Many screen readers, for example, are unable to pronounce a compound like "Chancengleichheit" (equal opportunity) correctly: two words are joined together, and the screen reader cannot figure out where word 1 ends and word 2 begins. The problem also arises when the exception rules developed for such cases do not apply, for example because the word in question is misspelled or has been manually hyphenated in an unfortunate place.
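The interplay of general rules and exception dictionaries can be sketched roughly as follows. This is a deliberately simplified illustration in Python: the rule set and the dictionary entries are invented for the example, and real synthesizers are far more sophisticated.

```python
# Simplified sketch: exception dictionary first, letter-to-sound rules second.
# All rules and entries are invented for illustration.

EXCEPTIONS = {          # the dictionary catches words the rules get wrong
    "weg": "vɛk",       # German adverb "weg" (away), short vowel
    "Weg": "veːk",      # German noun "Weg" (path), same letters, long vowel
}

RULES = [               # tiny letter-to-sound rule set (grossly simplified)
    ("sch", "ʃ"), ("ei", "aɪ"), ("w", "v"), ("g", "ɡ"), ("e", "ɛ"),
]

def pronounce(word: str) -> str:
    # 1. The exception dictionary wins over the general rules.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # 2. Otherwise apply the letter-to-sound rules left to right,
    #    always taking the first rule that matches at the current position.
    result, i = "", 0
    while i < len(word):
        for graph, phon in RULES:
            if word[i:].startswith(graph):
                result += phon
                i += len(graph)
                break
        else:
            result += word[i]   # no rule matched: keep the letter as-is
            i += 1
    return result

print(pronounce("Weg"))   # dictionary hit: "veːk"
print(pronounce("weg"))   # dictionary hit: "vɛk"
```

The point of the sketch is the lookup order: the rules cover the regular cases, and the dictionary overrides them wherever the rules would produce the wrong result.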

Character pronunciation is also a complex issue. Common screen readers have predefined modes for character pronunciation such as "all", "some" or "none", with each screen reader defining for itself which characters fall under which mode. In NVDA, the euro sign € was for a while not spoken at the "some" level, a rather nonsensical decision.

And how could it be otherwise: here, too, the user can define which characters should be pronounced in which contexts.
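The idea of punctuation verbosity levels can be illustrated with a small sketch. The level names are borrowed from the modes mentioned above, but which character lands in which set is invented here; real screen readers ship their own, configurable mappings.

```python
# Sketch of punctuation verbosity levels ("none" / "some" / "all").
# The character sets and spoken names below are invented for illustration.

LEVELS = {
    "none": set(),
    "some": {"€", "$", "%", "&", "@"},          # only "important" symbols
    "all":  {"€", "$", "%", "&", "@", ".", ",", "!", "?", "-", "(", ")"},
}

NAMES = {"€": "euro", "$": "dollar", "%": "percent", "&": "and",
         "@": "at", ".": "dot", ",": "comma", "!": "exclamation mark",
         "?": "question mark", "-": "dash", "(": "left paren", ")": "right paren"}

def spoken_text(text: str, level: str = "some") -> str:
    """Replace characters the chosen level announces with their spoken names;
    silently drop the punctuation the level does not announce."""
    out = []
    for ch in text:
        if ch in NAMES:
            if ch in LEVELS[level]:
                out.append(" " + NAMES[ch] + " ")
            # otherwise: silent at this level, skip the character
        else:
            out.append(ch)
    return " ".join("".join(out).split())   # normalize whitespace

print(spoken_text("Price: 20 €!", level="some"))  # -> "Price: 20 euro"
```

At "some", the euro sign is announced but the exclamation mark is swallowed; at "all", both are spoken. This is exactly the kind of mapping the user can override per character.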

Pronunciation rules can be specified both by the screen reader and by the speech synthesizer. Last but not least, users can define pronunciation rules themselves. They can even use different dictionaries depending on the context, for example if they want a word pronounced differently in the word processor than in the web browser. Professionals can exercise even finer control with regular expressions.
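Such regex-based, per-context dictionaries could look roughly like this. The sketch only shows the principle; the patterns and context names are invented, not taken from any actual screen reader.

```python
import re

# Sketch of a regex-based pronunciation dictionary, similar in spirit to the
# regular-expression entries screen readers allow. All entries are invented.

# Per-context dictionaries: different rules in the browser vs. word processor.
DICTIONARIES = {
    "browser": [
        (re.compile(r"\bteam\b", re.IGNORECASE), "teem"),   # force English vowel
        (re.compile(r"(\d+)x(\d+)"), r"\1 by \2"),          # "1920x1080" -> "1920 by 1080"
    ],
    "word_processor": [
        (re.compile(r"\bi\.e\.", re.IGNORECASE), "that is"),
    ],
}

def apply_dictionary(text: str, context: str) -> str:
    """Rewrite the text before handing it to the synthesizer."""
    for pattern, replacement in DICTIONARIES.get(context, []):
        text = pattern.sub(replacement, text)
    return text

print(apply_dictionary("Our team uses 1920x1080 screens", "browser"))
# -> "Our teem uses 1920 by 1080 screens"
```

Note that the replacement is a respelling in the listener's own language, not a phonetic transcription: the synthesizer simply reads the rewritten text.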

Why is this important?

For the reasons mentioned above, language markup for individual terms does not make sense. Most people don't even know which words are foreign words. Restaurant, smartphone, foyer, lobby, team... what needs to be marked up and what doesn't?

Screen readers are inconsistent with one another, and many blind people today use more than one. I have NVDA, JAWS and VoiceOver up and running, and I can't even remember which one says what and how. JAWS pronounces the word "team" correctly, NVDA does not. I couldn't keep track of that, and sighted developers certainly can't be expected to either. The time saved can be invested more sensibly.

The second problem is that language markup doesn't guarantee that I understand the word. Unlike screen reader dictionaries, it doesn't specify how a word is pronounced. Rather, it instructs the screen reader to, say, switch to French for that word and pronounce it as if the listener were French. But I don't even know what "restaurant" sounds like in French, because I don't speak the language. At the speed I have set my screen reader to, I wouldn't catch a single word in French mode and would at best get a rough sense in English. Unfortunately, that doesn't interest the sighted people behind the BITV test.
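The difference can be made concrete with a sketch: a dictionary entry respells a word in the listener's own phonology, while language markup hands the whole word over to a different voice. Everything here, from the markup table to the returned segments, is invented for illustration.

```python
# Two ways to influence pronunciation, sketched side by side (all invented).

# A speech dictionary replaces the spelling but keeps the listener's language:
RESPELLINGS = {"restaurant": "restorant"}   # still spoken with German phonology

# Language markup instead tells the synthesizer to switch voices entirely:
def speak(text: str, lang_markup: dict) -> list:
    """Return (language, word) pairs as a synthesizer would receive them."""
    segments = []
    for word in text.split():
        bare = word.strip(".,!?").lower()
        if bare in lang_markup:
            segments.append((lang_markup[bare], word))  # foreign voice takes over
        else:
            segments.append(("de", word))               # default document language
    return segments

print(speak("Wir gehen ins Restaurant.", {"restaurant": "fr"}))
# -> [('de', 'Wir'), ('de', 'gehen'), ('de', 'ins'), ('fr', 'Restaurant.')]
```

The marked-up word is spoken by the French voice with French phonology, which is precisely what a listener who does not speak French cannot follow.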
