Controlling a computer with spoken commands was not too long ago in the realm of science fiction. In the last few years the ability to interact with your desktop PC using your voice has become a reality. Voice control is now here for your PDA or smartphone. In creating this capability developers have overcome the limitations created by the comparatively meagre processing and memory capacity of these small devices as well as the challenges created by the noisy environments in which PDAs and smartphones are used.
In this article we look at the solutions available or under development that will enable voice control of a PDA, smartphone, microwave or car near you. In particular we look at the UK start-up NeuVoice, which has just released a voice-activated Dialler for the Nokia 9210.
Speech Recognition Technology
The main market for speech recognition technology is currently in server applications servicing inquiries made by phone. Companies such as Nuance, Speechworks and Temic (for the German market only) have deployed applications where the caller can navigate with spoken commands to obtain information. A typical example of this type of application is Temic’s railway timetable information system for “Deutsche Bahn” (which, if you speak German, can be reached by calling +49 1805 996622). This system recognises the names of 5000 railway stations and has processed over a million calls since 1999.
Speech recognition technology generally uses one of two techniques: Hidden Markov Models (HMM) or Dynamic Time Warping (DTW). HMM is a statistical method that essentially tries to predict whether a sequence, represented numerically, is the ancestor of or matches another sequence. It is used in several disciplines, with applications ranging from determining the likelihood of an international crisis resulting in conflict to recognising facial expressions.
An HMM-based recogniser works by sampling the voice signal in blocks of 10 milliseconds; each block is then characterised by the frequencies it contains. The HMM technique is then used to compare the pattern of frequencies with a database of phonemes, the basic sounds that make up speech, and determine which one the sound is most likely to be. Once the likely phonemes are known they are compared to a database of phoneme patterns and the likely words are identified. This information is then processed by the application to determine what action to take.
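The first step described above, cutting the signal into 10 millisecond blocks and characterising each block by its frequency content, can be sketched in a few lines of Python. This is only an illustration of the front end, not of the HMM matching itself. The 8000 Hz sample rate, the function names and the use of a single dominant frequency as the "characterisation" are all assumptions made for the example; a real recogniser would keep a much richer description of each block.

```python
import cmath
import math

def frame_signal(samples, sample_rate, frame_ms=10):
    """Split samples into consecutive, non-overlapping blocks of frame_ms milliseconds."""
    frame_len = sample_rate * frame_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def dominant_frequency(frame, sample_rate):
    """Return the frequency (Hz) of the strongest bin of a plain DFT of one block."""
    n = len(frame)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n

SAMPLE_RATE = 8000  # assumed telephone-quality sample rate, not from the article

# 50 ms of a pure 1000 Hz tone standing in for a voice signal.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(400)]

frames = frame_signal(tone, SAMPLE_RATE)
features = [dominant_frequency(f, SAMPLE_RATE) for f in frames]
```

Here `features` holds one number per 10 millisecond block; it is this kind of per-block sequence that the statistical matching stage would then compare against its phoneme database.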
The strength of the HMM technique is that it allows predefined words to be matched and therefore it is used predominantly in speaker independent solutions. However, it relies on the availability of extensive statistics on speech sounds. Advanced Recognition Technologies (ART), which specialises in voice recognition for mobile phones, has a team of 4 people who each year will record speech from up to 4000 people in a range of environments.
Several providers also use proprietary techniques. Speechworks, for example, uses a technique called OSR that identifies phonemes more discretely. This is done by identifying acoustic boundaries, places where a relatively large change in frequency occurs as the word is spoken. This may result in a word like ‘instructions’ being divided into 11 segments, whereas the HMM technique might detect upwards of 80 frames. The reduced number of elements to match results in a more efficient process requiring less processing to reach an answer.
Dynamic Time Warping (DTW) is a method of matching sounds where the sound is similar but the time taken to produce it is variable, for example whether a word is said quickly or slowly. This method is generally used in speaker dependent systems where the user trains the recognition system to identify specific words, such as names in a telephone directory.
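The idea behind DTW can be shown with a short Python sketch. It uses the textbook dynamic programming form of the algorithm on plain lists of numbers; the templates, the word names and the `recognise` helper are made up for the example and do not describe any vendor's implementation.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # stretch b
                                    cost[i][j - 1],      # stretch a
                                    cost[i - 1][j - 1])  # advance both
    return cost[len(a)][len(b)]

def recognise(utterance, templates):
    """Return the trained word whose template is closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))

# Hypothetical templates recorded when the user trained the system.
TEMPLATES = {"home": [1, 3, 5, 3, 1], "work": [5, 4, 3, 2, 1]}

# The same 'home' pattern spoken at half speed.
slow_home = [1, 1, 3, 3, 5, 5, 3, 3, 1, 1]
best = recognise(slow_home, TEMPLATES)
```

Because the warping lets one item of the template line up with several items of the utterance, the slowly spoken word still matches its template exactly, which is why the technique suits user-trained voice tags.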
In implementing voice technology for small devices the developers have had to work very hard on optimising their code. They have also had to develop new techniques to overcome limitations that don’t exist in the PC world. For example, ART has developed a technique that allows recognition to be performed on a compressed voice signal. Most mobile phones are basically run by two chips: a DSP (Digital Signal Processor), which digitises and compresses the user’s voice and decodes the incoming voice data, and a microprocessor, which deals with the phone’s user interface, features and the protocols that enable calls to be made. These two units are joined by a relatively narrow link capable of limited data transfer, which means that the voice signal is compressed before it crosses the link. Because the recognition is performed on the microprocessor, only the compressed voice signal is available to it.
The newest method to emerge uses neural networks. A neural network is a model of the way in which the human brain works. Such networks are ideally suited to all forms of pattern recognition and have the extraordinary ability to learn. They have another strength in that the same technique can be applied to both speaker dependent and independent systems.
NeuVoice is a recent entrant into the speech market. Formed in early 2000, NeuVoice exploits the results of research work done at the University of Plymouth’s Centre for Neural and Adaptive Systems (CNAS) in the UK. Its technique, for which the patent is currently pending, is particularly efficient and in the words of CEO Mark Denham: “Allows device manufacturers to create phones with voice recognition capability rather than voice recognition devices with a phone attached.”
The work undertaken at CNAS first looked at the way in which the brain separates and differentiates sounds, often known as the ‘cocktail party effect’: our ability to listen to one speaker despite there being many other voices and sounds competing for our attention. Following on from this work they investigated how the brain recognises and processes sounds, particularly speech. All these processes were modelled in neural networks that formed the basis of the commercial applications developed by NeuVoice.
While the neural networks developed are complex, NeuVoice believes they are more efficient than the statistical approaches used by most other recognition techniques. This means that they can be implemented with minimal computer code and run using very small amounts of memory, two assets at a premium in mobile and wireless devices, due to space, and in other applications, such as household appliances, because of cost.
The other significant feature of NeuVoice’s technique is that it processes the whole sound looking for speech, which makes it highly resilient to noise. Conventional subtraction techniques, which sample the background noise and look to subtract it from the signal to be processed, can be effective but are particularly unreliable where the background noise is variable, exactly the type of noise which affects mobile devices and car applications. NeuVoice has successfully demonstrated its tool in a diesel car travelling at 140 kph (90 mph) as well as in environments with high levels of background music, achieving better than 95% recognition accuracy.
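The weakness of the conventional subtraction approach (not NeuVoice's method, whose details are not given here) can be seen in a toy Python sketch. It assumes, purely for illustration, that the signal has already been reduced to a handful of per-band magnitudes and that noise simply adds to speech in each band; the band values and the `spectral_subtract` name are invented for the example.

```python
def spectral_subtract(signal_bands, noise_estimate):
    """Subtract a previously sampled noise level from each band, never going below zero."""
    return [max(s - n, 0.0) for s, n in zip(signal_bands, noise_estimate)]

speech = [0.0, 4.0, 6.0, 1.0]          # made-up speech energy in four bands
noise_estimate = [2.0, 2.0, 2.0, 2.0]  # background noise sampled before speech began

# Case 1: the noise during speech is the same as the sampled estimate.
steady = [s + n for s, n in zip(speech, noise_estimate)]
cleaned = spectral_subtract(steady, noise_estimate)

# Case 2: the noise has changed since it was sampled (a passing lorry, say).
changed_noise = [5.0, 1.0, 2.0, 6.0]
variable = [s + n for s, n in zip(speech, changed_noise)]
residual = spectral_subtract(variable, noise_estimate)
```

With steady noise, `cleaned` equals the original speech bands. Once the noise has drifted away from the sampled estimate, `residual` no longer resembles the speech at all, which is the failure mode that matters in cars and on the street.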
NeuVoice’s command and control applications also employ a hierarchical approach to recognition. This technique is similar to a PC menu structure and it means that once the user has asked for ‘File’ there are a limited number of words they could then use for the next action, ‘Open’, ‘Save’ or ‘Print’ for example. This method also assists in minimising the resources required to run the application.
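A minimal Python sketch of this hierarchical idea follows. The menu tree is an assumption built from the ‘File’ example above, with a hypothetical ‘Edit’ branch added so that the top level has more than one choice; it is not taken from NeuVoice's product.

```python
# Nested dictionaries stand in for the menu hierarchy.
MENU = {
    "file": {"open": {}, "save": {}, "print": {}},
    "edit": {"copy": {}, "paste": {}},  # hypothetical branch, for illustration only
}

def active_vocabulary(menu, spoken_so_far):
    """Return the only words the recogniser needs to listen for next."""
    node = menu
    for word in spoken_so_far:
        node = node[word]
    return sorted(node)

top_level = active_vocabulary(MENU, [])
after_file = active_vocabulary(MENU, ["file"])
```

At each step the recogniser only has to discriminate between the few words in the current list rather than the whole vocabulary, which keeps both the matching work and the memory footprint small.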
NeuVoice has also developed a continuous dialling application, which provides the ability to correct the number as it is spoken. For example, if you are reading a number from the Yellow Pages and you make a mistake just say ‘back’ to remove the incorrect digit and continue to complete the number.
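The correction behaviour is easy to picture as a small piece of Python operating on the stream of recognised words. The word list and the `build_number` function are assumptions for the example; they show the logic only, not how NeuVoice implements it.

```python
# Spoken digit words mapped to the characters to dial (an assumed vocabulary).
DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def build_number(words):
    """Assemble a phone number from recognised words, honouring 'back' as undo."""
    digits = []
    for word in words:
        if word == "back":
            if digits:          # nothing to remove if no digit has been spoken yet
                digits.pop()
        elif word in DIGITS:
            digits.append(DIGITS[word])
        # any other word is ignored in this sketch
    return "".join(digits)

number = build_number(["five", "five", "six", "back", "five"])
```

In this run the mistaken ‘six’ is removed by ‘back’ and replaced by the next digit spoken, so `number` ends up as the three fives the caller intended.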
In the future NeuVoice plans to add continuous speech recognition of short sentences to allow, for example, the dictation of SMS messages. Other developments in progress include voice identification, which could be used in mobile devices, in conjunction with existing security features such as PIN, to help reduce fraudulent use of stolen mobile phones. For service providers they are also looking at techniques to allow new vocabularies to be downloaded and used without any training. They are also working on Voice Synthesis to be able to close the loop on voice communication between the device and user.
NeuVoice’s applications are available in both retail and OEM forms. Applications like the continuous digit dialling require a DSP (Digital Signal Processor) chip and would, at least for the time being, be delivered as part of a phone’s native applications. Command and control applications, such as the one for the Nokia 9210 which we review next, are available as retail products. NeuVoice will also be providing SDKs for integration of voice recognition into other third party applications. These will initially be available for the Symbian-based Nokia Series 60 and UIQ, but NeuVoice has also implemented the voice engine for Windows Mobile platforms.
The NeuVoice Dialler provides the user with two functions. The first is the ability to search for a name within their contact database or phone SIM card. The second is to determine how that person will be contacted. The Dialler has two modes, fast action and normal. Fast action initiates contact immediately after the name has been matched, while in normal mode after the contact has been found the action command can be spoken. The actions available are to:
- Dial the home, mobile or work number.
- Compose an e-mail, fax or text (SMS) message.
- Browse the contact’s internet page.
Dialler provides the ability to define the phone numbers or other contact addresses from those already entered in the Nokia’s contact database. Before any of this will work, however, Dialler needs to be trained. In a quiet environment you first train the commands (home, text, etc.) and then train each contact’s voice tag and, if necessary, adjust the contact’s fast action and default addresses. Once this is done NeuVoice is ready for use.
So how accurate is it? I have been using NeuVoice in a number of environments over the last week. Obviously the less background noise the higher the rate of correct recognition. As previously noted NeuVoice has tested the technology in a fast moving diesel car. My Land Rover Defender, not a vehicle renowned for its whisper-quiet cabin, probably provides a far harsher environment and at 100 kph (60 mph) recognition is achievable but slightly unreliable. In more realistic environments, shopping centres, city streets, airports and cafes (without overly intrusive background music), recognition is very reliable. Having tried a number of environments I have concluded that if the background noise is low enough that you could comfortably hold a phone conversation, the NeuVoice Dialler will be able to match a voice tag to initiate the call.
It is also worth noting that the more information provided to the Dialler, that is the longer the name given as a tag, the more reliable the recognition is. So using the voice tag ‘Tom’ for Thomas Smith will not be as reliable as using his full name.
Other Speech and Voice Developments
Voice technology extends beyond recognition of speech. Three other important technologies are biometric recognition systems, speech synthesis and distributed speech processing.
Biometric recognition systems are aimed at being able to determine who a particular speaker is. In addition to the usual security applications for activities such as access control, these systems are being developed to help counter the problem of mobile phone theft. These systems would monitor the call being made on the phone and if the user’s voice did not match the stored voice the application could terminate the call or require that the phone’s PIN be re-entered before continuing. Obviously the technology would allow any user to make a call to the emergency services!
Speech synthesis closes the circle in terms of speech based interaction with a device. This technology would allow you to listen to your e-mail or the content of a web page.
Distributed speech processing is a technology which addresses two problems: the limited processing capacity of a mobile or smartphone and the quality of sound transmitted over a standard telephone line. Together these issues mean that for complex services the phone is unable to do all the speech recognition, while the server does not receive input of sufficient quality to guarantee good matching rates. These problems are overcome in a distributed solution by analysing the sound on the device, coding the result and sending the codified data to the server, which then performs the matching and interpretation.
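The split of work between device and server can be sketched in Python as follows. Everything here is a deliberate simplification and an assumption for the example: the device reduces each 80-sample block to a single byte (its average loudness), whereas a real system would send a far richer set of features, and the server compares the bytes with made-up templates. The function names are hypothetical.

```python
def device_encode(samples, frame_len=80):
    """Device side: reduce each block of samples (values in -1.0..1.0) to one byte."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_magnitude = sum(abs(x) for x in frame) / frame_len
        features.append(min(255, int(mean_magnitude * 255)))
    return bytes(features)

def server_match(payload, templates):
    """Server side: decode the bytes and return the closest stored template."""
    features = list(payload)

    def distance(template):
        if len(template) != len(features):
            return float("inf")
        return sum(abs(f - t) for f, t in zip(features, template))

    return min(templates, key=lambda word: distance(templates[word]))

# A quiet block followed by a loud block, standing in for a short utterance.
samples = [0.0] * 80 + [1.0] * 80
payload = device_encode(samples)

TEMPLATES = {"yes": [0, 255], "no": [255, 0]}  # hypothetical server vocabulary
word = server_match(payload, TEMPLATES)
```

The point of the design is visible in the sizes involved: 160 samples captured on the device become a two-byte payload, and the server works from features computed on the original sound rather than from audio degraded by the telephone channel.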
Much of the voice technology we have looked at will also start appearing in domestic electronics. As these devices become more sophisticated the ability to control them with small displays and limited keys becomes more challenging. Voice technology would assist in overcoming these limitations. The other main area where voice technology is likely to have a significant market is in cars, where the ability to adjust controls by voice has obvious safety advantages.
Speech recognition on small mobile devices is rapidly becoming a reality. Around the world there is increasing regulation on the use of cell phones in cars and this will certainly be a significant driver for adoption of this technology in virtually every new phone over the next few years. Technologies like neural networks and the optimisation of other techniques make voice recognition on small devices practical.
At the end of all this however I’m left wondering, will people be happy talking to their mobile phone? Domestic appliances probably won’t be an issue, after all we probably already talk to them, particularly when our microwave has managed to blow up that ‘heat and serve’ meal we were so looking forward to! In public or surrounded by work colleagues it may be a different matter. But it was not too long ago that if you passed someone in the street talking to themselves you would have been wary, now you simply assume they are on their mobile phone.
The NeuVoice Dialler for the 9210 is available at the Nokia Software Market for EUR 29.99 and Handango for USD 25.99.
Written by Richard Bloor for PMN Mobile Industry Intelligence.