Alexandre Trilla, PhD - Research Engineer |

Blog

-- Thoughts on data analysis, software development and innovation management. Comments are welcome

Information Retrieval techniques in ASR

28-Nov-2010

I also wanted to blog about Dr. Alex Acero's speech in the FALA 2010 conference. His talk was entitled "New Machine Learning approaches to Speech Recognition", and in brief (quoting his own description), he described some new approaches to Automatic Speech Recognition (ASR) that leverage large amounts of data using techniques from Information Retrieval (IR) and Machine Learning.

The "large amounts of data" detail of the description was in fact the gist of his work. He recalled that Hidden Markov Model based ASR in the late 60's and early 70's needed to compress a lot the acoustic features because otherwise they could not succeed, due to computation capabilities. But today, some of those assumptions may be challenged. In this regard, he presented a novel ASR approach where the linguistic models were replaced by an IR engine based on a Vector Space Model (VSM):

Dr. Acero conjectured that this approach works well if sufficient repetitions per word are available. Hence, by dealing with a huge amount of information, the system is supposed to deliver to a good performance, with an argued robustness to deal with disfluencies. Nevertheless, this novel approach (1) does not still solve the problems with homophony, and (2) gets confused with phoneme/syllable permutations. On the one hand, the classic indetermination with homophones persists at the phrase-level (e.g. the mondegreens) but also at the word-level (have some fun testing an ASR system with a list of words with similar Soundex indexes). On the other hand, the approach is weak toward disambiguating words with rearranged phonemes, e.g., "stop" and "spot" (regarding phonotactic rules in the rearrangement). And I know I'm being fussy here, because in a limited-domain scenario, this approach has actually yielded excellent results. So, although the linguistic knowledge of a speech-enabled application like ASR may not be directly replaceable in general, I find it is a most interesting work to approach different scientific disciplines, avoiding the idealisation of some particular method.