Wednesday, April 27, 2011

Speech Recognition from Google - New possibilities for Open Home Automation


Although this does not seem to be related to the panStamp project, I've always wanted to integrate voice commands into my home automation software without relying on the existing commercial solutions, most of which are platform dependent.

Last week a friend pointed me to Google Translate, one of the new Google apps for mobile platforms, which translates spoken statements into a good number of languages. The application recognized our very different voices on the fly, without any prior training. Even complex statements were recognized with a high degree of precision, and all that from a simple, small application installed on an iPhone!

Speech recognition has proven to be one of the most specialized areas of software development. Only a few companies develop voice recognition engines, and most of these products rely on complex voice models and ultra-secret recognition algorithms. Some of them, like Windows' Voice Recognition Engine, are mostly specialized in recognizing short commands. Others, like Dragon NaturallySpeaking, seem to perform better for dictation.

Over the past few years I've been studying the possibility of adding speech recognition capabilities to my custom software applications, mainly those related to Home Automation. Some years ago I decided to play with HomeSeer's integrated speech recognition scripting engine. HomeSeer was using Windows' engine at the time, but it did offer a good scripting interface for controlling Home Automation devices from voice commands; something really useful and well documented. Unfortunately, Windows provided speech recognition only in English. As you may guess, I needed something closer to my mother tongue, so I ended up giving up.

Now Google arrives with a very interesting solution, and I mean interesting for the following reasons:

  • Google's Speech Recognition API would seem to be free and open source. This feature would let us integrate these Speech Recognition capabilities into any existing software, including my panTastic software.
  • It's client-server based. This means that the client just has to generate an audio file and send it to Google's remote voice recognition servers. A few seconds later, Google responds with a formatted string containing the text recognized from the provided audio file. The whole speech recognition engine, including voice models and dynamic algorithms, resides on Google's servers. Thus, we can do speech recognition from almost any lightweight device capable of recording audio.
  • The API is HTTP-based, so it's truly open-platform. We Linux users would finally be able to run speech-to-text routines in a decent way. Indeed, this HTTP API opens new possibilities for open control software like opn-max and panTastic (soon...), since we'll be able to keep the small form factor and low power consumption and still provide Voice Recognition capabilities.
  • We'd take advantage of the growing voice models being added to Google's servers, which are presumably improved dynamically through use by the wider community.

On the other hand, there are some drawbacks to consider too:
  • Any device wanting to do Voice Recognition through Google's engine will have to be online. This should not be an issue for most Home Automation applications though.
  • The whole recording-transmission-processing-reception procedure takes a few seconds. Some users may expect a faster response to their voice commands.
  • Google Speech Recognition seems to be still under test. As a result, Google may move their servers or change the API URL without prior notice.

OK, but how can Google's Speech Recognition engine be used from custom software? I've searched the net thoroughly over the last few days and found very little information about Google's API. It would seem that most of Google's efforts are focused on providing speech recognition capabilities for Android and some of Google's most popular applications, including Google Chrome. Finally, I found this great source of information:

Sairon, the maintainer of the above website, provides some interesting scripts that would let us integrate the speech-to-text engine into any application with IP connectivity. Since I was mainly interested in testing the Speech Recognition Engine in Spanish, I modified the suggested scripts to suit my needs.

First of all, we need to create a FLAC file containing the voice command to be processed. The sampling rate must be 16 kHz:

sox -r 16000 -t alsa default recording.flac silence 1 0.1 1% 1 1.5 1%

Once the FLAC file has been created, it's time to transmit it to Google's Speech Recognition server through a wget command:

wget -q -U "rate=16000" -O - "http://www.google.com/speech-api/v1/recognize?lang=es&client=Mozilla/5.0" --post-file recording.flac --header="Content-Type: audio/x-flac; rate=16000"

Sairon suggests a more complex script that identifies the HTTP request as coming from both Mozilla and Chrome, maybe as a way to offer some redundancy. For my tests, I just used the Mozilla portion of the command.
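
For repeated tests it may help to wrap the request in a couple of shell functions. This is only a sketch, assuming the endpoint and parameters of the command above still apply; the names API_URL, recognize_url and recognize are my own and not part of Google's API or Sairon's scripts. Keeping the address in a single variable should make it easier to adjust if Google ever moves the service.

```shell
#!/bin/sh
# Hypothetical wrapper around the wget request shown above.
# API_URL, recognize_url and recognize are illustrative names.

# Kept in one place because the address may change without notice.
API_URL="http://www.google.com/speech-api/v1/recognize"

# Build the request URL for a given language code (e.g. "es").
recognize_url() {
    printf '%s?lang=%s&client=Mozilla/5.0\n' "$API_URL" "$1"
}

# Post a 16 kHz FLAC file and print the server's raw response.
# $1 = path to the FLAC file, $2 = language code (defaults to "es")
recognize() {
    wget -q -U "rate=16000" -O - "$(recognize_url "${2:-es}")" \
        --post-file "$1" \
        --header="Content-Type: audio/x-flac; rate=16000"
}

# Show the URL that would be requested for Spanish.
recognize_url es
```

With these functions loaded, something like `recognize recording.flac es` would send the recording made with sox and print whatever the server answers.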

Less than three seconds after running the above wget command I got the following response from Google:

{"status":0,"id":"b62308e814a4287240f68d23258914d2-1","hypotheses":[{"utterance":"estamos probando el motor de reconocimiento vocal de google","confidence":0.8816453}]}

Indeed, “we are testing Google's voice recognition engine”, a sentence spoken in a very natural way, without adding pauses between words. Quite awesome given my previous experiences with other VR software. At the end of Google's response, 0.8816453 seems to indicate the confidence of the result; more than 88% does not seem bad at all.
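
To act on a response like this from a script, the recognized text and the confidence value have to be pulled out of the string. Below is a rough sketch that assumes the response always has the shape shown above; extract_utterance, extract_confidence and is_confident are hypothetical helper names, and the 0.8 threshold is an arbitrary value chosen for illustration. The sed patterns are fragile (with several hypotheses they would pick the last one), so a proper JSON parser would be a safer choice for anything serious.

```shell
#!/bin/sh
# Hypothetical helpers for a response shaped like the one shown above.
# Plain sed matching is fragile; it is used here only to keep the sketch
# free of extra dependencies.

# Print the text of the "utterance" field.
extract_utterance() {
    printf '%s\n' "$1" | sed -n 's/.*"utterance":"\([^"]*\)".*/\1/p'
}

# Print the numeric value of the "confidence" field.
extract_confidence() {
    printf '%s\n' "$1" | sed -n 's/.*"confidence":\([0-9.]*\).*/\1/p'
}

# Succeed if confidence $1 reaches threshold $2 (default 0.8, arbitrary).
is_confident() {
    awk -v c="$1" -v t="${2:-0.8}" 'BEGIN { exit !(c + 0 >= t + 0) }'
}

response='{"status":0,"id":"b62308e814a4287240f68d23258914d2-1","hypotheses":[{"utterance":"estamos probando el motor de reconocimiento vocal de google","confidence":0.8816453}]}'

text=$(extract_utterance "$response")
conf=$(extract_confidence "$response")

if is_confident "$conf"; then
    echo "accepted: $text"
else
    echo "rejected, confidence too low: $conf"
fi
```

A home automation script could then compare the accepted text against its list of known commands and ignore anything below the threshold.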

Finally, Sairon also explains how to use Google's text-to-speech engine:

wget -q -U Mozilla -O output.mp3 "http://translate.google.com/translate_tts?tl=es&q=esto+parece+funcionar+a+la+perfeccion"
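
Note that the phrase in the q parameter has its spaces replaced by "+" signs. A small helper can build that URL from a plain sentence. This is just a sketch: tts_url is a name I made up, and it only converts spaces, so accented or reserved characters would still need proper percent-encoding.

```shell
#!/bin/sh
# Hypothetical helper: build the translate_tts URL for a phrase.
# $1 = language code, $2 = phrase to speak
# Only spaces are converted (to "+"); accented or reserved characters
# would need proper percent-encoding, which is not handled here.
tts_url() {
    query=$(printf '%s' "$2" | tr ' ' '+')
    printf 'http://translate.google.com/translate_tts?tl=%s&q=%s\n' "$1" "$query"
}

# Print the URL for the phrase used in the command above.
tts_url es "esto parece funcionar a la perfeccion"
```

The result could then be passed to the same wget call, for example `wget -q -U Mozilla -O output.mp3 "$(tts_url es "esto parece funcionar a la perfeccion")"`.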
