Voicemail: The Next Frontier of Machine Learning


Companies are working to improve technology that allows mobile users to "read" voicemail via text or email.


Google and other firms are trying to perfect the science of decoding speech.

"First learn the meaning of what you say, and then speak." What was true for the Stoic philosopher Epictetus is equally valid for today's machine learning algorithms. Software engineers and artificial intelligence (AI) experts are working frantically to build software capable of harnessing smart machines that understand and utter words. As The Economist explains, this gets to the heart of where AI is today:

By and large, tasks that are hard for humans are easy for computers, and vice versa. The simplest computer can run rings around the brightest person when it comes to wading through complicated mathematical equations. At the same time, the most powerful computers have, in the past, struggled with things that people find trivial, such as recognizing faces, decoding speech and identifying objects in images. [emphasis added]

Take voicemail as an example. Voicemails are verbal signals of missed conversations. The challenge, should the caller choose to accept it, is to rapidly convey a handful of basic facts to a recipient. Without the benefit of visual cues, callers often employ verbal shortcuts couched in a cultural context. For instance, saying "I'll hit you up later" does not mean the caller intends to inflict physical harm. It simply means the caller will call back at some point in the future. I get that. You get that. But computers, which are basically turbocharged calculators, struggle to grasp this meaning and to intelligently voice this understanding back to us.

Anyone who's taken a foreign language course knows the struggle to learn what is, in essence, human code. Software must start with the same basic building blocks of language, learning the rules that govern how we order our words. It then must chew through massive amounts of written and spoken language to find the often unspoken rules governing casual speech, not to mention the regional dialects that proliferate in every language. Some tech companies are taking this challenge very seriously.

Soon after Google began offering an automated, voice-activated 411 service to callers in search of a telephone number, the search giant realized it had on its hands a vast spoken language database. It was a moment of serendipitous learning, and one that soon led to the development of an advanced speech recognition engine that became Google Voice's signature offering.

Launched in 2009 as a PC-to-PC phone service, Google Voice quickly became known for its ability to transcribe voicemails. Audio messages became text messages in seconds. Users could then search audio files like they did links on Google. And once Voice integrated with Gmail, users once and for all unified their life's inbox. Spoken words became more useful as simple data points.

Except the execution wasn't so easy. As one Google software engineer admitted in a recent blog post, "Open your voicemail transcriptions in Google Voice to find that at times they aren’t completely intelligible. Or, they are humorously intelligible. Either way, they might not have been the message the caller meant to leave you."

But the prospect of cheap, automated transcription remained useful to many, and what was out of reach in 2010 is much easier today. Just a few days ago, after using a "long short-term memory deep recurrent neural network" (obviously) to study patterns in voicemail messages, Google announced it had cut its rate of transcription errors in half.

Machine learning is boring like this. We were promised Terminators and instead got really smart answering machines, which upon reflection sounds like a good deal. The upshot is that regular people are finding their lives improved in tangible ways from artificial intelligence. We also avoid Skynet. 

Google is not alone in realizing recent gains in machine learning for language processing. Apple is testing a way for Siri, its virtual personal assistant, to transcribe voicemails left on iPhones. And every major consumer-facing tech company is looking for ways to incorporate our physical and digital lives into a mobile ecosystem that's seamless, predictive, and secure.

What machine learning portends is universal communication. No matter the medium or language, you are understood. There is only meaning and speech, just as Epictetus once saw.