Research and Markets reported that the speech recognition market will be worth $18 billion by 2023. Thanks to machine learning, accuracy has improved to the point where AI speech recognition is comparable with human recognition rates. It seems we have finally cracked the code of natural language, and AI is soon to acquire a voice of its own. Or is it?
First, a caveat: the much-quoted 95% accuracy has been achieved for English. Less widespread languages aren't likely to receive the same attention any time soon, so the rarer the language, the more it will depend on humans to work with it.
Also, according to transcription specialists, the typical file received for transcription is kryptonite for any voice recognition system, however advanced. The high accuracy rates that AI recognition systems show in tests often reflect the perfect conditions of the experiment or a narrow context of use. In real-life situations, that 95% quickly melts down to something quite unimpressive. Noisy environments with distant microphones, accented speech, multiple speakers making ad hoc remarks, different speaking styles, and languages for which only limited training data is available – all of it presents quite a challenge. So it is not some advanced technology but the humongous accumulated data set that powers this success.
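For the curious, "95% accuracy" usually refers to word error rate (WER) of 5% – the standard metric in speech recognition. Here is a minimal sketch of how WER is computed, using edit distance over words; the example transcripts are hypothetical, not from any real system:

```python
# A minimal sketch of word error rate (WER), the metric behind claims
# like "95% accuracy" (i.e. a WER of 5%). Example strings are hypothetical.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with standard edit-distance dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of ten -> 10% WER (90% accuracy)
print(wer("the quick brown fox jumps over the lazy dog today",
          "the quick brown fox jumps over the lazy dog tonight"))  # 0.1
```

Note how quickly the numbers degrade: a handful of substitutions and deletions per sentence – exactly what noise and accents produce – pushes the metric far from the advertised figure.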
However, we are easily tricked by the illusion of advancement. Siri, for example, is often lauded for her conversational interface, as if she almost had a mind of her own. All of that comes down to knowing which gaps she needs to fill (she asks for more details whenever there is room for a mistake). A couple of interactions with the assistant can save you the five minutes or so the task would otherwise take.
Yet Siri does all this brilliantly because she gets to know you – your voice, the things you most often ask for, your friends, the places you frequent, your age, etc. – all data accumulated over time. The assistant is also a masterful interpreter of context: location and time are cues that help her give you suggestions based on a couple of keywords she managed to elicit from your garbled and muffled speech.
The speech recognition itself is limited and imperfect, but combined with a service delegation approach – Siri merges information from a multitude of sources, such as local business directories, geospatial databases, online guides, review sites, and online reservation services, with the user's own data – it produces impressive results. However, lend your phone to someone with different habits and the illusion falls apart like a house of cards.
Also, to see if Siri’s speech recognition is that great, try using it when you’ve caught a cold – or just let someone with a different accent give Siri a couple of commands.
Transcription specialists aren't going to be out of work any time soon, as you can see. What about writers, then? Some four years ago, the notion that all journalism would be automated was gaining traction. A properly trained AI can fill in the blanks of a carefully human-tailored template and produce a generic and boring, albeit accurate, report on earthquakes, stock fluctuations or sporting events, but it still sucks at storytelling (and probably will for years to come). Of course, it can take some grunt work off journalists' shoulders and churn out news on local elections galore. Still, you need a human to create the template and curate the results.
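To see why the human template-writer remains indispensable, here is a hypothetical sketch of how template-driven news generation works (the template, names and numbers are invented for illustration): the machine performs nothing more than slot-filling over structured data.

```python
# A hypothetical sketch of template-driven "robot journalism": a human
# authors the template once; the machine only fills slots from structured data.
ELECTION_TEMPLATE = (
    "{winner} won the {district} council seat with {pct:.1f}% of the vote, "
    "defeating {runner_up} by {margin} ballots. Turnout was {turnout:,}."
)

def render_report(result: dict) -> str:
    # All the "writing" happened when a human wrote the template;
    # the machine just substitutes values into the slots.
    return ELECTION_TEMPLATE.format(**result)

print(render_report({
    "winner": "Jane Doe", "district": "5th", "pct": 52.4,
    "runner_up": "John Roe", "margin": 312, "turnout": 6518,
}))
```

The output reads like news because a journalist pre-wrote every sentence structure; the system would produce nonsense the moment the data stopped matching the template's assumptions – which is why curation stays a human job.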
You may argue that AI is still in its infancy and its capabilities are growing. Yet last year's experiment with a Harry Potter chapter cooked up by AI proves we aren't in for a big surprise any time soon. Although the machine was trained on the complete set of original works by J. K. Rowling, it produced something that makes sense only loosely, and only grammatically. In fact, the only reason it makes sense at all is that multiple human writers were involved, constructing the sentences from the AI-suggested word predictions. Otherwise, it would look like the usual jumbled word salad of text prediction and wouldn't even be grammatically plausible.
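The "word salad" problem is easy to demonstrate with a toy next-word predictor. The sketch below uses a simple bigram model (trained on an invented stand-in corpus, not the Rowling texts): each word is drawn only from words that followed the previous one in training, so the output is locally plausible but has no global plan or meaning.

```python
import random
from collections import defaultdict

# A toy bigram text predictor illustrating the "jumbled word salad" effect:
# every word is chosen only from what followed the previous word in training.
# (The training text is a hypothetical stand-in, not an actual corpus.)

def train_bigrams(text: str) -> dict:
    words = text.split()
    model = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        model[prev].append(nxt)
    return model

def babble(model: dict, start: str, length: int, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = model.get(out[-1])
        if not followers:
            break  # dead end: the last word never had a successor in training
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = ("the wizard raised the wand and the wand glowed and "
          "the castle shook as the wizard spoke")
model = train_bigrams(corpus)
print(babble(model, "the", 8))
```

Every adjacent word pair in the output occurred somewhere in the training text, so each phrase looks fine in isolation – yet nothing steers the sentence toward a point, which is exactly the gap the human co-writers had to fill.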
If you think that faking an entire chapter of fiction is a long shot, you are quite right. The thing is, one cannot outsource even the easiest creative writing task to a machine. If an AI could master at least pieces such as "How I Spent My Summer", essay mills would be out of business already.
There have been projects to automate college essay grading, and they failed miserably. Such systems can check spelling and grammar, but they can do next to nothing when it comes to meaning and fact-checking. Les Perelman, a retired former director of writing at MIT, has long opposed the idea of using machines to grade essays. To prove his point, he created the Babel Generator (short for Basic Automatic BS Essay Language Generator). It produces a completely nonsensical yet grammatically correct essay in under a second from just three keywords. This gibberish managed not only to fool automated graders like MY Access! but to earn high scores: 5.4 out of 6.
Again, this does not prove the efficiency of automated text generators; it proves that automated assessment systems are bad at reading. Although both can be used to help people, neither can substitute for humans entirely.
To sum up: for the time being, speech recognition and text generation are not on par with the human ability to discern words, understand context and guess intent. AI cannot fully substitute for the human capacity to interpret speech or to produce it, nor will it be able to for some years. As for creative writing – I would not hold my breath for a robotic masterpiece in my lifetime.
Resources cited in this blog: