Unsexy Science: February 2022

The promise of automatic speech recognition (ASR) providing a hands-free experience is fulfilled in countless specialist PC software products, for example dictation software. Once started, dictation software does not need a wake-word to do its job, even after minutes of silence. Nor does military speech recognition: imagine a fighter pilot having to speak a wake-word in a high-risk situation. So why does Alexa (or, for that matter, Google Assistant and Siri) have a wake-word?

But it's not only the wake-word. If you wait too long before replying to a question from Alexa, or before giving a second command that follows on from a previous one, Alexa stops listening with a blip sound and requires you to say the wake-word again before it listens again. The time Alexa listens to your silence is fixed at eight seconds (same with Google). Isn't this annoying? The Follow-Up mode implemented in Alexa in 2021 does not remove this restriction; it just removes the requirement for a wake-word within those eight seconds.

The differences between PC or military ASR and virtual assistants are manifold:

The former run on a PC or embedded system and are used in a workplace setting, while most assistants are installed at home. Dictation software is also used by disabled people to control a PC at home, but this is a minor share of the overall usage.
Speech-to-text conversion in specialist ASR happens entirely on the PC or device. In contrast, virtual assistants send speech to the cloud for processing. This is a legal nightmare for privacy reasons. But apparently central processing and the reduced need for software updates on devices make this system design attractive despite the legal minefield.
Alexa in particular allows external developers to distribute skills in the Amazon Skill Store, much like apps for mobile devices. However, any mobile app using the microphone has to be given explicit permission by the user, while Alexa uses the microphone by default unless it is explicitly turned off. So Amazon just assumes the worst case and avoids legal problems by making Alexa listen for strictly eight seconds in all skills.

You can probably see now why you will never play a game of chess using Alexa, sitting on your sofa in front of a big TV showing a chess board, and leisurely moving pieces by saying "move pawn to e4". After you see your opponent's move, you ponder for minutes and say "take the pawn". And you can say "pause" or "quit game" after minutes, and Alexa will know what you mean.

No. It will always be like this: The opponent moves. You ponder. After eight seconds you hear a faint BLIP. Now you need to say "Alexa, open chess game and take pawn". For every one of your moves that takes longer than eight seconds to ponder. Because the device needs to hear the wake-word to start listening again, and because the skill context has been lost.
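To make the pattern concrete, here is a toy model of the behaviour described above. It is only a sketch under my own assumptions, not Amazon's API: the function names, the skill name "chess game" and the fixed constant are made up for illustration, and I assume a pause of exactly eight seconds still counts as inside the window.

```python
# Toy model of the eight-second listening window; not Amazon's API.
LISTEN_WINDOW_S = 8.0  # window length as described in this post (assumed fixed)


def required_utterance(command, think_time_s,
                       wake_word="Alexa", skill="chess game"):
    """Return what the player has to say after pondering for think_time_s seconds."""
    if think_time_s <= LISTEN_WINDOW_S:
        # Still inside the window: the skill is listening, the bare command is enough.
        return command
    # Window expired: the device stopped listening and the skill context is lost,
    # so the player needs the wake-word and has to re-open the skill.
    return f"{wake_word}, open {skill} and {command}"


def wake_words_needed(think_times_s):
    """Count the moves in a game that force the player to repeat the wake-word."""
    return sum(1 for t in think_times_s if t > LISTEN_WINDOW_S)


if __name__ == "__main__":
    # Hypothetical game: (spoken command, seconds spent pondering before speaking).
    game = [("move pawn to e4", 3.0), ("take pawn", 120.0), ("castle", 5.0)]
    for command, think_time in game:
        print(required_utterance(command, think_time))
    print(wake_words_needed([t for _, t in game]))
```

In a relaxed game almost every think time is above the window, so almost every move ends up in the long form.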

And that is very unsexy for any developer trying to provide services to users. I wanted to write a chess skill for Alexa to help me and other physically disabled people. But now I'm starting to think about PC-centered solutions again.

2022-02-12

Why you won't play a relaxed game of chess with Alexa in the foreseeable future