In only 80 MB, Google has brought AI-powered speech recognition offline. It is described as an end-to-end, all-neural, on-device speech recognizer. This allows a user to dictate notes, emails, text messages and voice searches faster and more reliably. The new recognizer works at the character level. As you dictate, the speech recognizer outputs words in real time, character by character, similar to someone typing out what you are saying.
This image compares the production, server-side speech recognizer (left panel) to the new on-device recognizer (right panel). Image credit: Akshay Kannan and Elnaz Sarbar
Françoise Beaufays, a research scientist and team lead at Google’s speech recognition and mobile input group, stated, “Imagine if you had a keyboard where you couldn’t click on the keys whenever the connectivity was lousy. You just wouldn’t use that keyboard.”
“By taking the system offline, dictation will become a more natural choice.”
Previously, dictation was achieved by recording the user’s voice input, sending it to the cloud, comparing the entire input against a huge online database or decoder, then finally combining the results to provide a text output. In my experience, dictating to my phone consisted of speaking slowly and only a few words at a time, then staring at a loading bar, then finally reading the written output and checking for errors. If my internet connection was ever spotty, waiting for voice dictation felt like:
At that point, I just physically typed out what I had dictated and questioned why I even tried to use voice dictation.
So what does this have to do with data science? And how are cats involved?
Statistical models and machine learning! HMMs, RNN-Ts, CTC, DNNs, LSTMs, CNNs, a brief history and fun with letters!
Traditionally, voice dictation has used hidden Markov models (HMMs) as a basis to predict output. A hidden Markov model is a statistical model with two kinds of states: observations and hidden states. Hidden states cannot be seen directly and can only be inferred from the observations. In speech recognition, the observations are the sound signals that are heard, and the hidden states are the text output, which is predicted from that input.
Bob is happy when it is sunny out. Luis Serrano, Udacity.
Here is another example: Alice and Bob are long-distance friends. When it is sunny out, Bob is happy, and when it is raining, Bob is grumpy. When Bob and Alice talk on the phone, if Bob tells Alice that he is happy, she can infer that it is sunny out, whereas if Bob is grumpy, Alice can infer that it is raining. The hidden state is the weather, since Alice cannot see the weather and can only make inferences from Bob’s mood.
A hidden Markov model by Luis Serrano of Udacity.
It gets a little more complicated than that. Let’s say Bob is happy when it is sunny 80% of the time. That means he is unhappy, even though it is sunny, 20% of the time. When it is raining, Bob is unhappy 60% of the time, and happy 40% of the time. If it is a sunny day, there is an 80% chance the following day will also be sunny. Therefore, there is a 20% chance that the following day will be rainy. When it is raining, there is a 60% chance the following day will also be rainy, and a 40% chance that the following day will be sunny. All of these probabilities are estimated from previously observed data. With these probabilities, Alice can infer the most likely weather for each day if Bob tells her his mood was happy, grumpy, happy, grumpy, grumpy throughout the week.
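To make Alice's inference concrete, here is a minimal sketch of the Viterbi algorithm, a standard way to find the single most likely sequence of hidden states in a hidden Markov model. The transition and mood probabilities come straight from the example above. The starting probabilities are not given in the example, so this sketch assumes the long-run weather mix implied by the transition probabilities (2/3 sunny, 1/3 rainy). The names `viterbi`, `START`, `TRANS` and `EMIT` are made up for illustration.

```python
STATES = ("sunny", "rainy")

# Assumption: starting probabilities are the long-run weather mix implied by
# the transition probabilities below. The example itself does not state them.
START = {"sunny": 2 / 3, "rainy": 1 / 3}

# Chance of tomorrow's weather given today's weather (from the example).
TRANS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

# Chance of Bob's mood given the weather (from the example).
EMIT = {
    "sunny": {"happy": 0.8, "grumpy": 0.2},
    "rainy": {"happy": 0.4, "grumpy": 0.6},
}


def viterbi(moods):
    """Return the most likely weather sequence for a list of observed moods."""
    # Probability of the best path ending in each state after the first day.
    best = {s: START[s] * EMIT[s][moods[0]] for s in STATES}
    back = []  # one dict per later day: state -> best previous state

    for mood in moods[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            # Pick the previous state that makes arriving in s most likely.
            source = max(STATES, key=lambda p: prev[p] * TRANS[p][s])
            best[s] = prev[source] * TRANS[source][s] * EMIT[s][mood]
            pointers[s] = source
        back.append(pointers)

    # Start from the most likely final state and walk the pointers backwards.
    state = max(STATES, key=best.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


week = ["happy", "grumpy", "happy", "grumpy", "grumpy"]
print(viterbi(week))
```

Under these assumptions, the most likely weather for Bob's week is sunny, sunny, sunny, rainy, rainy. Notice that the single grumpy day early in the week is not enough to flip the forecast to rain, because sunny days tend to follow sunny days.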
Now, let me take you back in time. It’s 2012: “Gangnam Style” and “Call Me Maybe” are playing for the millionth time on the radio, The Avengers are assembling in a theater near you, and the Curiosity Mars rover has checked into Foursquare. In Google’s “X Lab,” researchers created one of the biggest neural networks for machine learning by connecting 16,000 computer processors. They presented the neural network with ten million digital images taken from YouTube videos and found that the neural network taught itself to discover patterns in large datasets. Since the internet is full of them,
the neural network taught itself to recognize cats.
Did you know that the internet is full of cat pictures and gifs?
The key takeaway, other than that cats are cute and are all over the internet, is that when you combine large amounts of data, large amounts of computing power and machine learning algorithms, you can apply the result to other practical problems. 2012 was considered a breakthrough year for machine learning. Since then, Google has cited new architectures that further increased the quality of machine learning. A few examples follow.
Deep neural networks (DNNs), or deep learning, also known as deep structured learning, are part of a broader family of machine learning methods based on learning data representations.
Recurrent neural networks (RNNs) are a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows them to exhibit temporal dynamic behavior.
Convolutional neural networks (CNNs) are a class of deep neural networks most commonly applied to analyzing visual imagery.
Long short-term memory (LSTM) networks are an artificial recurrent neural network architecture used in deep learning. LSTMs have feedback connections, so they can process not only single data points, such as images, but also entire sequences of data, such as audio.
So how did they make an offline, end-to-end, all-neural, on-device speech recognizer?
Traditionally, speech recognition consisted of several components: an acoustic model that maps segments of audio to phonemes, a pronunciation model that connects phonemes together to form words, and a language model that expresses the likelihood of given phrases.
2014 brought the development of “attention-based” models, which review the entire input at once and therefore cannot accept additional input or produce output until they are finished. Recurrent Neural Network Transducers (RNN-Ts), by contrast, continuously process input and stream output at the same time. The RNN-T recognizer outputs characters one by one as you speak. Training an RNN-T was considered difficult and computationally intensive. Google used TensorFlow, an open source library for numerical computation and large-scale machine learning, to develop a parallel implementation and increase efficiency. Their first decoder/search graph weighed in at a massive 2 GB and thus required online connectivity to work properly.
RNN-T
Google then eliminated the need for an online connection by decoding with a beam search through a single neural network, which also decreased the size to 450 MB. Finally, they used “parameter quantization and hybrid kernel techniques” through the model optimization toolkit in the TensorFlow Lite library and, with compression, reduced the final size to 80 MB.
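If “parameter quantization” sounds mysterious, the basic idea is simple: store each model weight as a small integer instead of a 32-bit floating point number. The sketch below is a generic illustration in plain Python, not Google's or TensorFlow Lite's actual implementation. The function names `quantize` and `dequantize` and the sample weights are made up for the example.

```python
def quantize(weights, num_bits=8):
    """Map floats onto integer codes in [0, 2**num_bits - 1]."""
    qmax = 2 ** num_bits - 1
    lo, hi = min(weights), max(weights)
    # Width of one integer step; fall back to 1.0 if all weights are equal.
    scale = (hi - lo) / qmax or 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]


# Hypothetical weights, purely for illustration.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]

codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)

# Worst-case difference between an original weight and its restored value.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(max_error)
```

Each code fits in one byte instead of the four bytes a 32-bit float needs, so the quantized weights take roughly a quarter of the space. The price is a small rounding error in every weight, at most half of one integer step.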
So what do you do with character-by-character offline AI dictation?
Tongue twisters!
Google’s offline dictation system is available on Google Pixel phones through the Gboard (Google Keyboard) app found in the Play Store.