Speech segmentation is an important subproblem of speech recognition, and it cannot be solved adequately in isolation. As in most natural language processing problems, one must take context, grammar, and semantics into account, and even then the result is often a probabilistic division rather than a categorical one.
The notion that speech is produced like writing, as a sequence of distinct vowels and consonants, is a relic of our alphabetic heritage. In fact, the way we produce vowels depends on the surrounding consonants, and the way we produce consonants depends on the surrounding vowels. For example, when we say 'kit', the [k] is articulated farther forward than when we say 'caught'. Likewise, the vowel in 'kick' is phonetically different from the vowel in 'kit', though we normally do not hear the difference. In addition, language-specific changes occur in casual speech that make it quite different from its spelling. For example, in English, the phrase 'hit you' could often be more appropriately spelled 'hitcha'. Therefore, even with the best algorithms, the result of phonetic segmentation will usually be very distant from the standard written language. For this reason, the lexical and syntactic parsing of spoken language normally requires specialized algorithms, distinct from those used for parsing written text.
Statistical models can be used to segment recorded speech and align it to words or phones. Applications include automatic lip-sync timing for cartoon animation, follow-the-bouncing-ball video subtitling, and linguistic research. Automatic segmentation and alignment software is commercially available.
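To make the idea of aligning speech to phones concrete, here is a minimal sketch of a Viterbi-style dynamic program. It assumes that the phone sequence of the utterance is already known and that some acoustic model has produced a log-score for every (frame, phone) pair. The function names `align` and `to_segments`, the phone labels, and the score values are all hypothetical and chosen for illustration; this is not a description of how any particular commercial tool works.

```python
from typing import List, Tuple

def align(scores: List[List[float]]) -> List[int]:
    """Assign each frame to one phone of a known transcript.

    scores[t][j] is an assumed log-score of frame t under the j-th phone.
    The alignment is monotonic: it starts at phone 0, ends at the last
    phone, and moves forward by at most one phone per frame.
    """
    n_frames, n_phones = len(scores), len(scores[0])
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phone")
    neg_inf = float("-inf")
    best = [[neg_inf] * n_phones for _ in range(n_frames)]
    back = [[0] * n_phones for _ in range(n_frames)]
    best[0][0] = scores[0][0]
    for t in range(1, n_frames):
        for j in range(n_phones):
            stay = best[t - 1][j]                              # remain in phone j
            move = best[t - 1][j - 1] if j > 0 else neg_inf    # advance from phone j-1
            if move > stay:
                best[t][j], back[t][j] = move + scores[t][j], j - 1
            else:
                best[t][j], back[t][j] = stay + scores[t][j], j
    # Trace back from the last phone at the last frame.
    path = [n_phones - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

def to_segments(path: List[int], phones: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse a frame-level path into (phone, first_frame, last_frame) triples."""
    segments = []
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((phones[path[start]], start, t - 1))
            start = t
    return segments

# Made-up scores for 5 frames and 2 phones.
toy_scores = [[0, -5], [-1, -4], [-4, -1], [-5, 0], [-6, 0]]
toy_path = align(toy_scores)
print(to_segments(toy_path, ["k", "ih"]))
```

With these toy scores, the first two frames are assigned to the first phone and the remaining three to the second. In a real system the scores would come from a trained statistical model rather than a hand-written table.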
For most spoken languages, the boundaries between lexical units are surprisingly difficult to identify. One might expect that the inter-word spaces used by many written languages, like English or Spanish, would correspond to pauses in their spoken version; but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses between them, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word.
Moreover, an utterance can have different meanings depending on how it is split into words. A popular example, often quoted in the field, is the phrase 'How to wreck a nice beach', which sounds very similar to 'How to recognize speech'. As this example shows, proper lexical segmentation depends on context and semantics, which draw on the whole of human knowledge and experience; implementing it on a computer would thus require advanced pattern recognition and artificial intelligence technologies.
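The ambiguity can be illustrated with a small sketch that enumerates every way of splitting a phone sequence into words from a pronunciation lexicon. The lexicon, the ARPAbet-like phone symbols, and the function name `segmentations` are assumptions made for this illustration, and the sketch uses the shorter textbook pair 'ice cream' / 'I scream' rather than the phrase above, because that pair shares exactly the same phone sequence in this toy transcription.

```python
from typing import Dict, List, Tuple

# Toy pronunciation lexicon (hypothetical, ARPAbet-like symbols).
LEXICON: Dict[str, Tuple[str, ...]] = {
    "I": ("AY",),
    "ice": ("AY", "S"),
    "scream": ("S", "K", "R", "IY", "M"),
    "cream": ("K", "R", "IY", "M"),
}

def segmentations(phones: Tuple[str, ...]) -> List[List[str]]:
    """Return every way to split `phones` into words found in LEXICON."""
    if not phones:
        return [[]]  # one way to segment nothing: the empty word list
    results = []
    for word, pron in LEXICON.items():
        # Does this word's pronunciation match the start of the sequence?
        if phones[:len(pron)] == pron:
            for rest in segmentations(phones[len(pron):]):
                results.append([word] + rest)
    return results

utterance = ("AY", "S", "K", "R", "IY", "M")
for words in segmentations(utterance):
    print(" ".join(words))
```

Both word sequences cover the phones equally well, so nothing in the sound sequence alone selects one of them; choosing between them requires information from context, such as a language model or semantic knowledge.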
This problem overlaps to some extent with the problem of text segmentation that occurs in some languages which are traditionally written without inter-word spaces, like Chinese and Japanese. However, even for those languages, text segmentation is often much easier than speech segmentation, because the written language usually has little interference between adjacent words, and often contains additional clues not present in speech (such as the use of Chinese characters for word stems in Japanese).
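For comparison, a simple baseline for segmenting text written without spaces is greedy forward longest-match against a dictionary. The sketch below assumes a tiny hypothetical dictionary and a hypothetical function name `longest_match`; it is meant only to show why the text case is more tractable, since each written symbol is discrete and is not altered by its neighbours.

```python
from typing import List, Set

# Tiny hypothetical dictionary for illustration.
DICTIONARY: Set[str] = {"我", "喜欢", "北京", "北", "京"}

def longest_match(text: str, dictionary: Set[str]) -> List[str]:
    """Greedy forward segmentation: at each position take the longest dictionary word."""
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)  # unknown single characters are kept as-is
                i += length
                break
    return words

print(longest_match("我喜欢北京", DICTIONARY))
```

A greedy strategy like this can still choose the wrong split when dictionary words overlap, so it should be read as a baseline rather than a complete solution.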