In the movie 2001: A Space Odyssey, HAL 9000 — the neurotic computer — had a birthday in 1992 (for some reason, in the book it is 1997). In the late 1960s, that date sounded impossibly far away, but now it seems like a distant memory. The only thing is, we are only now starting to get computers with voice I/O that are practical and even they are a far cry from HAL.

[GeraldF6] built an Arduino-based clock. That’s nothing new, but thanks to a MOVI board (OK, shield), this clock has voice input and output, as you can see in the video below. Unlike most modern speech-enabled devices, the MOVI board (and, thus, the clock) does not use an external server in the cloud or any remote processing at all. On the other hand, the speech quality isn’t what you might expect from any of the modern smartphone assistants that talk. We estimate it might be about 1/9 the power of the HAL 9000.

You might wonder what you have to say to a clock. You’ll see in the video you can do things like set and query timers. Unlike HAL, the device works like a Star Trek computer. You address it as Arduino. Then it beeps and you can speak a command. There’s also a real-time clock module.

Setting up the MOVI is simple:

 recognizer.init(); // Initialize MOVI (waits for it to boot)
 recognizer.callSign("Arduino"); // Train callsign Arduino (may take 20 seconds)
 recognizer.addSentence(F("What time is it ?")); // Add sentence 1
 recognizer.addSentence(F("What is the time ?")); // Add sentence 2
 recognizer.addSentence(F("What is the date ?")); // Add sentence 3

Then a call to recognizer.poll will return a numeric code for anything it hears. Here is a snippet:

 // Get result from MOVI: 0 denotes nothing happened, negative values denote events (see docs)
 signed int res = recognizer.poll();

 // Tell current time
 if (res == 1 || res == 2) { // Sentence 1 or 2
   if (now.hour() > 12) {
     recognizer.say("It's " + String(now.hour() - 12) + " " + (now.minute() < 10 ? "O" : "") +
         String(now.minute()) + "P M"); // Speak the time
   }
   // ... remaining cases are not shown in this excerpt
 }

Fairly easy.
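The excerpt above only shows the branch for hours after 12, so here is a rough sketch of how the full 12-hour phrase might be built. It is plain C++ so it can be tried on a desktop, and spokenTime is a hypothetical helper of our own, not something from the MOVI library or from [GeraldF6]'s sketch. The "O" prefix for single-digit minutes follows the snippet; the "A M" case and the handling of noon and midnight are our assumptions.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of the MOVI library): builds the phrase
// to speak for a 24-hour clock reading.
std::string spokenTime(int hour24, int minute) {
    assert(hour24 >= 0 && hour24 < 24);
    assert(minute >= 0 && minute < 60);

    // Map 0..23 onto a 12-hour dial: 0 and 12 are both spoken as "12".
    int hour12 = hour24 % 12;
    if (hour12 == 0) {
        hour12 = 12;
    }

    // Single-digit minutes get a leading "O", as in the original snippet.
    std::string phrase = "It's " + std::to_string(hour12) + " ";
    phrase += (minute < 10 ? "O" : "");
    phrase += std::to_string(minute);

    // Assumption: spell the suffix as separate letters for the synthesizer.
    phrase += (hour24 < 12 ? " A M" : " P M");
    return phrase;
}
```

On the Arduino itself you would presumably use the String class instead of std::string and hand the result to recognizer.say(), as the snippet does.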

HAL being a NASA project (USSC, not NASA, and HAL was a product of a lab at University of Illinois Urbana-Champaign – ed.) probably cost millions, but the MOVI board is $70-$90. It also isn’t likely to go crazy and try to kill you, so that’s another bonus. Maybe we’ll build one in a different casing. We recently talked about neural networks improving speech recognition and synthesis. This is a long way from that.

Retrotechtacular: The Incredible Machine

They just don’t write promotional film scripts like they used to: “These men are design engineers. They are about to engage a new breed of computer, called Graphic 1, in a dialogue that will test the ingenuity of both men and machine.”

This video (embedded below) from Bell Labs in 1968 demonstrates the state of the art in “computer graphics” as the narrator calls it, with obvious quotation marks in his inflection. The movie ranges from circuit layout, to animations, to voice synthesis, hitting the high points of the technology at the time. The soundtrack, produced on their computers, naturally, is pure Jetsons.

Highlights include the computer singing “Daisy Bell” at 9:05, which inspired Stanley Kubrick to play a glitchy version of the track as Dave is pulling HAL 9000’s brains out, symbolically regressing through a history of computer voice synthesis which, at that point in time, was the present. (Whoa!)

Anyway, we think it’s great to look back at these things and realize how simultaneously similar and different the early days of our modern technology were. One thread they got wrong was thinking that physically modelling the inner ear would help with speech synthesis; all you have to do is make the right sounds. But one thing they got right was the all-in-one drag-and-drop circuit simulation application shown in the beginning. They had some really functional GUIs back then, considering the tech.

And that reminds us that we wanted to work on integrating SPICE modelling into our circuit design flow. You know, to catch up with the late 1960s.

Thanks [Simon] for the trip down (someone else’s) memory lane.

Talking Neural Nets

Speech synthesis is nothing new, but it has gotten better lately. It is about to get even better thanks to DeepMind’s WaveNet project. The Alphabet (or is it Google?) project uses neural networks to analyze audio data and it learns to speak by example. Unlike other text-to-speech systems, WaveNet creates sound one sample at a time and affords surprisingly human-sounding results.
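To get a feel for what "one sample at a time" means, here is a toy sketch of an autoregressive generator in C++. Be warned that it is only an illustration of the loop structure: predictNext is a stand-in of our own that uses a fixed two-tap recurrence (which happens to produce a sine tone), where WaveNet would run a trained neural network over a much longer history. The function names and parameters are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for the neural network (an assumption for illustration only):
// predicts the next sample from the two previous ones. This particular
// recurrence generates a pure sine wave.
double predictNext(double prev1, double prev2, double omega) {
    return 2.0 * std::cos(omega) * prev1 - prev2;
}

// Generates audio one sample at a time; every new sample depends only on
// samples that were already produced.
std::vector<double> generate(std::size_t count, double freqHz, double sampleRateHz) {
    assert(sampleRateHz > 0.0);
    const double pi = std::acos(-1.0);
    const double omega = 2.0 * pi * freqHz / sampleRateHz;

    std::vector<double> samples;
    samples.reserve(count);
    for (std::size_t n = 0; n < count; ++n) {
        if (n == 0) {
            samples.push_back(0.0);              // seed: sin(0)
        } else if (n == 1) {
            samples.push_back(std::sin(omega));  // seed: sin(omega)
        } else {
            samples.push_back(predictNext(samples[n - 1], samples[n - 2], omega));
        }
    }
    return samples;
}
```

Calling generate(16000, 440.0, 16000.0) would give one second of a 440 Hz tone at a 16 kHz sample rate. The point is that each output is fed back in as input for the next step, which is also why this style of generation tends to be slow.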

Before you rush to comment “Not a hack!” you should know we are seeing projects pop up on GitHub that use the technology. For example, there is a concrete implementation by [ibab]. [Tomlepaine] has an optimized version. In addition to learning English, they successfully trained it for Mandarin and even to generate music. If you don’t want to build a system out yourself, the original paper has audio files (about midway down) comparing traditional parametric and concatenative voices with the WaveNet voices.

Another interesting project is the reverse path: teaching WaveNet to convert speech to text. Before you get too excited, though, you might want to note this quote from the README file:

“We’ve trained this model on a single Titan X GPU during 30 hours until 20 epochs and the model stopped at 13.4 ctc loss. If you don’t have a Titan X GPU, reduce batch_size in the file from 16 to 4.”

Last time we checked, you could get a Titan X for a little less than $2,000.

There is a multi-part lecture series on reinforcement learning (the foundation for DeepMind). If you want to tackle a project yourself, that might be a good starting point (the first part appears below).

We’ve seen DeepMind playing Go before. We have to admit, though, we get the practical side of speech analysis over playing with stones. We are waiting to cover the first hacker project that uses this technology.

