How Does ASR Work? Let's Get Technical

You may have watched our video "What is Speech-to-Text and How Does it All Work" and wondered, "Yes, I get it. But how does THAT work?" Whether or not you've watched that basic video, this article takes a more technical look at how the systems involved in Automatic Speech Recognition (ASR) work. We'll also try to help you understand why distorted audio can produce a less-than-optimal transcript when using AutoScript Web (ASW). To reach that understanding, we'll start with the basics of how sound works.

Sound Theory

If you think hard, you might remember from science classes that sound travels in waves. Although those waves aren't quite the same as waves on the water, they have a lot of similarities that might help us understand. Let's look at a common sound you've probably heard before and examine the parts of a sound waveform. If you've ever watched a show where something was censored, you know the sound of the 1,000 Hertz tone.

We can see the waveform using programs like Audacity. Let's look at the 1,000 Hertz tone waveform.

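The distance between the waves (highlighted in red) reflects the frequency, which we hear as the pitch of the sound: the closer together the waves, the higher the frequency. Frequency is measured in hertz, commonly abbreviated as Hz, and it's the number of waves per second. The height of the waves (highlighted in yellow), or amplitude, is what we recognize as the volume of the sound. Volume is usually expressed in decibels, abbreviated as dB, although Audacity's default waveform view shows amplitude on a linear scale from -1.0 to 1.0, where 1.0 is the loudest level the file can hold. So this example shows a 1 kHz waveform with a peak amplitude of 0.8 on that scale, which is roughly -2 dB relative to full scale.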
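To make frequency and amplitude concrete, here is a minimal Python sketch that builds a 1 kHz tone like the one above and measures it. The sample rate and the function names are assumptions chosen for illustration, and the 0.8 reading is assumed to be on Audacity's linear -1.0 to 1.0 scale. This is not how Audacity or AutoScript Web does it internally.

```python
import math

SAMPLE_RATE = 44_100  # samples per second (assumed; a common recording rate)


def make_tone(frequency_hz, amplitude, seconds, sample_rate=SAMPLE_RATE):
    """Return a list of samples for a pure sine tone."""
    count = int(seconds * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * i / sample_rate)
        for i in range(count)
    ]


def estimate_frequency(samples, sample_rate=SAMPLE_RATE):
    """Count upward zero crossings, i.e. how many waves pass per second."""
    crossings = sum(
        1 for prev, cur in zip(samples, samples[1:]) if prev < 0 <= cur
    )
    return crossings / (len(samples) / sample_rate)


def peak_amplitude(samples):
    """Height of the tallest wave on the linear -1.0 to 1.0 scale."""
    return max(abs(s) for s in samples)


def to_dbfs(amplitude):
    """Convert a linear amplitude to decibels relative to full scale."""
    return 20 * math.log10(amplitude)


tone = make_tone(1000, 0.8, seconds=1.0)
print(estimate_frequency(tone))          # close to 1000 waves per second
print(round(peak_amplitude(tone), 2))    # 0.8
print(round(to_dbfs(0.8), 2))            # -1.94
```

Counting zero crossings only works for a simple tone like this one. Speech needs more capable analysis.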

Speak and it Shall Appear

Now let's look at a more complex sound waveform. You might recognize this word.

These waves look much more pointy, don't they? Let's zoom in quite a bit to see some individual waves in the pattern; we'll zoom in on the highlighted middle part of "voice". 

Zoomed in...

Now we can see that the waves have nice rounded tops like in the 1kHz example. This is what human speech looks like in a visual form. There's a whole range of frequencies and volumes combined. These are the patterns that an AI learns to recognize when it identifies phonemes as part of the ASR process.
Pho·neme - any of the perceptually distinct units of sound in a specified language that distinguishes one word from another, for example, p, b, d, and t in the English words pad, pat, bad, and bat.
These patterns are consistent no matter who's speaking them. Even though every speaker will sound different, and even the same speaker will never say words perfectly the same twice, the patterns are always close enough to be recognized. We can see that by comparing two speakers' versions of "VoiceScript".

The two speakers' renditions of "VoiceScript" might look different to your eye, but to a trained eye, and especially to a trained AI, these two waveforms are the same word.
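One way to see why two different-looking waveforms can count as the same pattern is to compare the frequencies they contain instead of their raw shapes. The toy sketch below is only an illustration under heavy assumptions: the two "speakers" are synthetic mixes of a 200 Hz and a 700 Hz tone (made-up values), and the naive Fourier transform stands in for the far more sophisticated analysis a real ASR system performs. It is not a description of how the VoiceScript AI works.

```python
import cmath
import math

SAMPLE_RATE = 8_000   # samples per second (assumed for this toy example)
NUM_SAMPLES = 80      # 10 ms of audio, giving 100 Hz per frequency bin


def two_tone(amp_low, amp_high, phase_low, phase_high):
    """Hypothetical 'speaker': a 200 Hz and a 700 Hz component mixed together."""
    return [
        amp_low * math.sin(2 * math.pi * 200 * i / SAMPLE_RATE + phase_low)
        + amp_high * math.sin(2 * math.pi * 700 * i / SAMPLE_RATE + phase_high)
        for i in range(NUM_SAMPLES)
    ]


def spectrum(samples):
    """Naive discrete Fourier transform: one magnitude per frequency bin."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                for i, s in enumerate(samples)))
        for k in range(n // 2)
    ]


def strong_frequencies(samples, threshold=0.1):
    """Frequencies at least `threshold` times as strong as the strongest bin."""
    mags = spectrum(samples)
    top = max(mags)
    bin_width = SAMPLE_RATE / len(samples)
    return [k * bin_width for k, m in enumerate(mags) if m >= threshold * top]


speaker_a = two_tone(0.6, 0.3, 0.0, 0.0)    # louder
speaker_b = two_tone(0.3, 0.15, 1.0, 2.0)   # quieter, with shifted timing

print(speaker_a == speaker_b)               # False: the raw waveforms differ
print(strong_frequencies(speaker_a))        # [200.0, 700.0]
print(strong_frequencies(speaker_b))        # [200.0, 700.0]
```

The sample-by-sample values differ, yet both signals are built from the same frequencies in the same proportions, which is the kind of consistency a recognizer can latch onto.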

That's Neat. So What?

So you might say, "This is pretty neat, but why should I care?" Since you're here on the AutoScript Web support site, I assume you have an interest in ASR. Specifically, I assume you care about the quality of your ASR-generated transcripts. Just like most things computer related, if you put garbage in you'll receive garbage output. Likewise, if you feed the VoiceScript AI an audio file that is distorted, you will get distorted results in your transcript.

Now that you have a basic understanding of what speech looks like in a visual form, let's take a look at some really bad audio. See if you can spot a problem with this recording of the word "VoiceScript".

Don't see the problem? Let's zoom in on the middle of "voice" again and compare it to our previous example which was not distorted.

Notice how jagged the edges of the waves are. Notice, too, that the tops of the waves weren't recorded properly, so they look square. Those square tops are referred to as clipping. Clipping occurs when the signal is louder than the recording equipment, at its current settings, can capture, so the tops of the waves get cut off.
Let's look again at the earlier recording with the nice rounded waves.

Quite a difference, isn't it? These samples were recorded in a studio. Imagine a busy courtroom where there is also background noise and maybe even poor equipment. You can probably still recognize the patterns a bit. And so can the AI. We regularly process many files that are very distorted and deliver satisfactory results to our clients. But if you want to ensure the best possible results, it is imperative to feed the AI the best audio you can.
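The clipping described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the "recorder" simply cuts off anything beyond a ceiling of 1.0, and the function names are hypothetical. Real equipment distorts in messier ways.

```python
import math


def record(samples, ceiling=1.0):
    """Simulate a recorder: anything beyond the ceiling is cut off (clipped)."""
    return [max(-ceiling, min(ceiling, s)) for s in samples]


def clipped_fraction(samples, ceiling=1.0, tolerance=1e-6):
    """Fraction of samples stuck at the ceiling, i.e. the flat 'square tops'."""
    at_ceiling = sum(1 for s in samples if abs(s) >= ceiling - tolerance)
    return at_ceiling / len(samples)


def sine(amplitude, count=1000, cycles=5):
    """A plain sine wave used as stand-in input for the recorder."""
    return [
        amplitude * math.sin(2 * math.pi * cycles * i / count)
        for i in range(count)
    ]


clean = record(sine(0.5))   # peaks well below the ceiling
hot = record(sine(1.5))     # gain too high: peaks exceed the ceiling

print(clipped_fraction(clean))   # 0.0
print(clipped_fraction(hot))     # roughly half of the samples are flattened
```

Once the tops are cut off, the original shape of the wave is gone from the file, which is why lowering the level at recording time matters more than any fix applied afterward.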

What is Good Audio?

That question might sound like it has an easy answer. The truth is that there are tons of easy answers to that question. Let's focus on what's important to the ASR process. For human speech, our main goal is to keep the audio volume level from clipping or distorting as we saw in the example above. It's even better if we can keep the peaks close to or under 0.5 on the amplitude scale (about -6 dB relative to full scale) when the file is viewed in a program such as Audacity. There are two ways to accomplish this.

The first and best way to improve the recording is to reduce the gain setting when the recording is taken. It's always a good idea to do some audio tests as you prepare for a proceeding. You want to start the gain at 0 (if you're using a microphone controlled by Windows, start at 50 on the slider volume control).

Have your speakers do a few quick tests, or even test it yourself. If the meter on your equipment or software goes into the yellow and red range at all, you're clipping or very close to it. Turn the gain down a bit and try again. Some meters even have a handy CLIP indicator.
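If you can export a short test recording, the same check a meter performs can be approximated in code. The sketch below assumes samples on a linear -1.0 to 1.0 scale and a target peak of 0.5 on that scale. The sample values and function names are made up for illustration.

```python
import math

TARGET_PEAK = 0.5  # suggested peak level on a linear -1.0 to 1.0 scale (assumed)


def peak_dbfs(samples):
    """Peak level in dB relative to full scale (0 dBFS is the clipping point)."""
    peak = max(abs(s) for s in samples)
    return 20 * math.log10(peak) if peak > 0 else float("-inf")


def gain_change_db(samples, target_peak=TARGET_PEAK):
    """dB of gain to add (negative means turn down) so the peak hits the target."""
    return 20 * math.log10(target_peak) - peak_dbfs(samples)


test_take = [0.0, 0.4, -0.95, 0.7, -0.2]   # made-up samples from a sound check

print(round(peak_dbfs(test_take), 1))       # -0.4
print(round(gain_change_db(test_take), 1))  # -5.6
```

Here the loudest sample sits just under the clipping point, so the sketch suggests turning the gain down by about 5.6 dB before recording for real.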

If you don't have meters, you can approximate a good level by turning your monitor earphones to about 50% volume. The goal is to just hear everything, but not super loud. This may take some trial and error.

The second way to improve the recording is to move the microphone back from the speaker a bit. Some folks tend to talk right next to the microphone. This might be helpful sometimes to those in the gallery listening live, but it will wreak havoc on your recording levels. Have the speaker move back or move the mic back to at least 6 inches from the speaker. This is not the best method, but it may be the only option you have.
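As a rough guide to how much distance helps, the level of a sound source in open space drops by about 6 dB each time the distance doubles. The sketch below applies that rule of thumb. It assumes an idealized setting with no room reflections, so a real courtroom will usually show a smaller change. The function name is hypothetical.

```python
import math


def level_change_db(old_distance, new_distance):
    """Level change when the mic moves, assuming no room reflections."""
    return 20 * math.log10(old_distance / new_distance)


# Moving the microphone from 2 inches to 6 inches away from the speaker.
print(round(level_change_db(2, 6), 1))   # -9.5
```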

If you don't have any control over the recording process, there still might be an option or two to help produce better recordings. But at that level, it tends to be case-by-case. Please contact us using the Request Help link above and we'll be glad to provide advice based on your unique circumstances.

Time to Wave Goodbye (Summary)

We're at everyone's favorite part of any long-winded article, the end. This is the part where many of you skip ahead to read the high points first. Even if you read the whole article, it may be helpful to revisit those high points as well. So here we go.
  1. Sound travels in waves.
  2. Waves are measured in frequency (Hertz; Hz) and volume (Decibels; dB).
  3. The unique combinations of frequency and volume form phonemes that can be recognized by an AI to perform ASR.
  4. The ASR process can still be successful, but it will be negatively affected by distorted audio. 
  5. Distorted audio can be prevented by adjusting gain and placing microphones properly.
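  6. The general goal is a file with peak levels at 0.5 on Audacity's amplitude scale (about -6 dB relative to full scale) or slightly lower.
Not concise enough? How about this one-line bottom-line?
  1. Due to how sound works, ASR is less than optimal with distorted input, so recording equipment should be adjusted with the goal of peaks around 0.5 on Audacity's amplitude scale (about -6 dB relative to full scale) for the best results.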
