Voice-to-text transcription works by collecting voice data with a microphone and transmitting it to a server or local processing unit. There, an application breaks the audio into small sounds and compares them to a database of words and phrases. Candidate matches are evaluated for context and syntax to determine what the user most likely meant to say. Once the best match has been found, the result is returned to the user as text.
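The lookup step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PRONUNCIATIONS table and the match_words function are hypothetical names invented for this example, the phoneme symbols are ARPAbet-style, and a real recognizer scores many competing candidates probabilistically rather than matching exactly.

```python
# Hypothetical toy pronunciation table: phoneme sequence -> candidate words.
# The entries are invented for illustration and cover only three sequences.
PRONUNCIATIONS = {
    ("HH", "AH", "L", "OW"): ["hello"],
    ("W", "ER", "L", "D"): ["world"],
    ("T", "UW"): ["two", "too", "to"],
}


def match_words(phonemes):
    """Greedily split a phoneme list into the longest known sequences.

    Returns one list of candidate words per matched sequence.
    """
    results = []
    i = 0
    while i < len(phonemes):
        # Try the longest remaining slice first, then shrink it.
        for j in range(len(phonemes), i, -1):
            key = tuple(phonemes[i:j])
            if key in PRONUNCIATIONS:
                results.append(PRONUNCIATIONS[key])
                i = j
                break
        else:
            raise ValueError(f"no known word starts at phoneme {i}")
    return results


print(match_words(["HH", "AH", "L", "OW", "W", "ER", "L", "D"]))
```

With this toy table, the call prints [['hello'], ['world']]. The sequence T UW returns three candidates, which is why a later step must use context to choose among them.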
The small sounds that are evaluated by voice-to-text applications are known as phonemes. English is commonly described as having about 44 phonemes, although the exact count varies by dialect. Software that processes phonemes must be able to distinguish slight differences in sounds in order to reach the correct voice-to-text result. It must also be able to tell apart words that sound the same but differ in meaning or spelling, known as homophones, such as "to," "two," and "too."
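One simple way to choose among words that sound alike is to look at the neighboring words. The sketch below assumes a tiny, made-up table of word-pair counts (BIGRAM_COUNTS) and a hypothetical helper, pick_homophone; production systems use far larger statistical or neural language models, so this shows the idea rather than any particular product's method.

```python
# Hypothetical word-pair (bigram) counts. A real system would estimate
# these from a large text corpus; the numbers here are invented.
BIGRAM_COUNTS = {
    ("want", "to"): 50,
    ("want", "two"): 5,
    ("want", "too"): 1,
    ("to", "go"): 40,
    ("two", "apples"): 30,
    ("ate", "too"): 10,
    ("too", "much"): 20,
}


def pick_homophone(previous_word, candidates, next_word):
    """Return the candidate that fits best between its two neighbors."""

    def score(word):
        # Add 1 to each count so an unseen pair does not zero out the score.
        left = BIGRAM_COUNTS.get((previous_word, word), 0) + 1
        right = BIGRAM_COUNTS.get((word, next_word), 0) + 1
        return left * right

    return max(candidates, key=score)


candidates = ["to", "two", "too"]
print(pick_homophone("want", candidates, "go"))      # to
print(pick_homophone("want", candidates, "apples"))  # two
print(pick_homophone("ate", candidates, "much"))     # too
```

Multiplying the two smoothed counts rewards a candidate only when it fits both neighbors, which is why "two" wins before "apples" even though "want to" is the more frequent pair on its own.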
As of 2016, voice-to-text transcription services are up to 99 percent accurate for the English language. This accuracy has been achieved through spectrogram analysis, in which voice-to-text processors not only evaluate phonemes but also compare voice data to known patterns, which helps them predict new and evolving speech patterns. The same approach is used in the image recognition software that powers image search tools on the Internet.
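A spectrogram records how strongly each frequency is present in successive short slices of audio, which turns sound into a picture-like grid that pattern-matching software can work with. The following is a minimal sketch using only the Python standard library; the function name, frame size, hop size, and the 800 Hz sample rate in the usage lines are assumptions chosen for illustration, and real systems use optimized FFT libraries rather than this direct transform.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames, each a list of magnitudes per frequency bin."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window tapers the frame edges to reduce spectral leakage.
        windowed = [
            x * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, x in enumerate(frame)
        ]
        magnitudes = []
        # Direct discrete Fourier transform over the non-negative frequencies.
        for k in range(frame_size // 2 + 1):
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


# Usage: a pure 100 Hz tone sampled at an assumed rate of 800 Hz.
rate = 800
tone = [math.sin(2 * math.pi * 100 * n / rate) for n in range(256)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=spec[0].__getitem__)
print(len(spec), "frames; strongest bin:", peak_bin)
```

With these settings each bin spans 800 / 64 = 12.5 Hz, so the 100 Hz tone peaks in bin 8 of every frame. Speech produces a much richer grid, and it is those time-frequency patterns that a recognizer compares against known examples.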