AUTOMATIC SPEECH RECOGNITION (ASR). The concept of a machine that can recognize the human voice has long been an accepted feature in Science Fiction. From ‘Star Trek’ to George Orwell’s ‘1984’ - “Actually he was not used to writing by hand. Apart from very short notes, it was usual to dictate everything into the speakwriter.” - it has been commonly assumed that one day it will be possible to converse naturally with an advanced computer-based system. Indeed, in his book ‘The Road Ahead’, Bill Gates (co-founder of Microsoft Corp.) hails ASR as one of the most important innovations for future computer operating systems.
From a technological perspective it is possible to ...
ASR products have existed in the marketplace since the 1970s. However, early systems were expensive hardware devices that could only recognize a few isolated words (i.e. words with pauses between them), and needed to be trained by users repeating each of the vocabulary words several times. The 1980s and 1990s witnessed a substantial improvement in ASR algorithms and products, and the technology developed to the point where, in the late 1990s, software for desktop dictation became available ‘off-the-shelf’ for only a few tens of dollars. As a consequence, the markets for ASR systems have now grown to include:
• large vocabulary dictation - for RSI sufferers and quadriplegics, and for formal document preparation in legal or medical services
• interactive voice response - for callers who do not have tone pads, for the automation of call centers, and for access to information services such as stock market quotes
• telecom assistants - for repertory dialing and personal management systems
• process and factory management - for stocktaking, measurement and quality control
The progress in ASR has been fueled by a number of key developments, not least the relentless increase in the power of desktop computing. R&D has also been greatly stimulated by the introduction of competitive public system evaluations, particularly those sponsored by the US Defense Advanced Research Projects Agency (DARPA). Scientifically, however, the key step has been the introduction of statistical techniques for modeling speech patterns, coupled with the availability of vast quantities of recorded speech data for training the models.
The main breakthrough in ASR has been the discovery that recognition can be viewed as an integrated search process. This idea first appeared in the 1970s with the introduction of a powerful mathematical search technique known as ‘dynamic programming’ (DP) or ‘Viterbi search’. Initially, DP was used to implement non-linear time alignment in a whole-word template-based approach, which became known as ‘dynamic time warping’ (DTW).
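To make the idea of non-linear time alignment concrete, the following is a minimal sketch of DTW-style template matching, not a description of any particular product. It assumes each word is reduced to a short sequence of single numbers (real recognizers compare multi-dimensional acoustic feature vectors), and the names dtw_distance and recognize, as well as the toy templates, are purely illustrative.

```python
def dtw_distance(template, utterance):
    """Minimum cumulative cost of aligning two feature sequences.

    Assumption: features are plain numbers and the local distance is the
    absolute difference; a real system would use spectral feature vectors.
    """
    n, m = len(template), len(utterance)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning template[:i] with utterance[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(template[i - 1] - utterance[j - 1])
            # Dynamic programming step: extend the cheapest of the three
            # allowed predecessor alignments (stretch, compress, or match).
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


def recognize(templates, utterance):
    """Return the word whose stored template aligns most cheaply."""
    return min(templates, key=lambda word: dtw_distance(templates[word], utterance))


# Hypothetical whole-word templates (toy values, not real speech data).
templates = {"yes": [1, 3, 4, 3, 1], "no": [5, 5, 2, 2, 5]}
# The same shape as "yes", but spoken more slowly (some frames repeated).
slow_yes = [1, 1, 3, 3, 4, 3, 1]
print(recognize(templates, slow_yes))  # prints "yes"
```

Because the alignment may repeat or skip frames, the slower utterance still matches its template with zero cost, which is exactly what a rigid frame-by-frame comparison would fail to do.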
DTW-based systems were quite successful, and could even be configured to recognize connected words. However, another significant step came in the late 1980s, when pattern matching was replaced by ‘hidden Markov modeling’. This not only allowed systems to be configured for large numbers of users – providing so-called ‘speaker independent’ systems – but ‘sub-word HMMs’ also enabled the recognition of words that had not been encountered in the training material.
A hidden Markov model (HMM) is a stochastic generative process that is particularly well suited to modeling time-varying patterns...
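As a rough illustration of both ideas, the sketch below first uses a tiny HMM generatively (sampling a hidden state sequence and the observations it emits) and then applies Viterbi search to recover the most likely hidden states from observations alone. The two-state model, its probabilities, and the function names sample and viterbi are assumptions made up for this example; they are not taken from any real recognizer, which would model sub-word units with continuous acoustic features.

```python
import math
import random

# Hypothetical toy model: two hidden states, two observable symbols.
STATES = ["quiet", "loud"]
START_P = {"quiet": 0.5, "loud": 0.5}
TRANS_P = {"quiet": {"quiet": 0.8, "loud": 0.2},
           "loud": {"quiet": 0.2, "loud": 0.8}}
EMIT_P = {"quiet": {"low": 0.9, "high": 0.1},
          "loud": {"low": 0.1, "high": 0.9}}


def sample(length, start_p, trans_p, emit_p, rng):
    """Run the HMM as a generative process: return (hidden, observed)."""
    def draw(dist):
        return rng.choices(list(dist), weights=list(dist.values()))[0]

    hidden, observed = [], []
    state = draw(start_p)
    for _ in range(length):
        hidden.append(state)
        observed.append(draw(emit_p[state]))
        state = draw(trans_p[state])
    return hidden, observed


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence, found by dynamic programming.

    Log probabilities avoid numerical underflow; this sketch assumes all
    probabilities are strictly positive.
    """
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Best predecessor state for arriving in s at time t.
            prev = max(states, key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][observations[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


hidden, observed = sample(6, START_P, TRANS_P, EMIT_P, random.Random(0))
print(observed)
print(viterbi(observed, STATES, START_P, TRANS_P, EMIT_P))
```

For the observation sequence low, low, high, high, this toy model decodes to quiet, quiet, loud, loud: the search prefers a single state change over treating two of the observations as unlikely emissions.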