Today, however, that is no longer the case. Thanks to a team of Microsoft researchers and engineers, machine recognition of conversational speech has just reached human level: human parity in speech recognition has finally been achieved!
Photo Credit: Cornell University Library
“In a paper published Monday, a team of researchers and engineers in Microsoft Artificial Intelligence and Research reported a speech recognition system that makes the same or fewer errors than professional transcriptionists.”
– Allison Linn, Microsoft
Microsoft researchers from the Speech & Dialog research group. (Photo Credit: Microsoft/Dan DeLong)
The breakthrough came when the team began using Microsoft’s Computational Network Toolkit (CNTK), an open-source deep learning system currently available on GitHub. Although the resulting system couldn’t perfectly transcribe every single word, keep in mind that neither can we humans. Unlike us, however, it has plenty of room to keep improving its conversational skills over time.
Next up: conversational speech understanding!
“The next frontier is to move from recognition to understanding,” according to Geoffrey Zweig, who manages Microsoft’s Speech & Dialog research group. While machines have now matched us in recognizing speech, we still rule the land when it comes to actually understanding what someone is saying. Remember, human speech isn’t perfect, and it is usually accompanied by subtle nuances conveyed through facial expressions, body language, and the surrounding environment.
If someone were to run up to you, their face filled with fright, pointing towards an approaching man and screaming, “Help, he’s after me!”, you’d quickly understand what was going on and what that person was asking of you. If a machine today were to encounter a similar scenario, it would easily transcribe the words spoken to it, but it wouldn’t be able to understand what exactly was being asked of it. All of that will eventually change, however, as researchers continue advancing deep learning techniques. Once that time comes, will we even know whether we’re having a conversation with a human or a machine?