We’ve already established that I grew up in a Trekkie household – so even though my fervor largely developed in adulthood, I did still have strong opinions about parts of that universe.
Specifically, my favorite Trek movie was always Star Trek IV: The Voyage Home. Of course, the juxtaposition of those future characters in the present day was delightful and a large part of what I liked about the film. (I’ve always been drawn more to comedy.)
But the scene I’ve always remembered best involves Scotty attempting to conduct some research via an ’80s-era PC. He walks up to the giant CRT, lifts the mouse to his mouth, and says “Hello, computer.”
I thought this was HILARIOUS. The look on his face when handed the mouse! In retrospect, maybe it was foreshadowing my future fascination with speech technology.
Pikachu’s Microphone
Those who have met me in person are deeply familiar with my Pokemon fanaticism. I’ve been collecting Pikachus since the late ’90s, early on in the craze. But I’m also a gamer, so when my brother told me about a new Nintendo 64 game where you could SPEAK to Pikachu, I knew I had to have it.
The game came with a microphone on a wire, split in half by a plastic box that presumably housed the speech-processing hardware. (The N64 really had no cycles to spare – remember how some games required you to buy extra memory?)
In the story, you play a child who is befriended by a wild Pikachu. Each gameplay segment is a “day” in the life of those characters – essentially, you choose a minigame outing and go adventuring with Pikachu.
The gameplay was incredibly simple, yet maddening, since you had little direct control over the universe. Instead, you had to get Pikachu to perform tasks by directing him with your voice. He recognized his name, and specific commands based on the scenario (like “reel it in”, “pull harder”, or “throw it back” in a fishing game).
But rather than turning me off the game, this conceit worked perfectly – of course Pikachu didn’t recognize you all the time. He’s a mischievous wild creature. Every time he misrecognized me, my mind simply justified it as a personality trait come to life. It was amazing to me how much more connected I felt to this character after speaking to it, which is saying quite a bit given my already-noticeable fervor.
But I think that game’s influence went deeper – it arrived right at the beginning of my interaction design coursework, and that mode of interaction has always spoken to me (figuratively and literally) as an enjoyable and potentially frictionless way to connect with the systems we use. It was, in fact, further foreshadowing.
Aloha Stitch
A few short months later, I was working as an intern at Disney World when a little movie called Lilo & Stitch came out and blew my mind. Also mind-blowing: the Aloha Stitch doll I found in the parks. It was an incredibly responsive voice-controlled toy.
The doll had a simple state-based model: you could provoke him into “naughty” mode with specific verbal taunts, after which he’d reply differently to your commands. That’s an impressive range of behavior for a toy – remember, this was released in 2002!
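To make that concrete, here’s a rough sketch of how a state-based voice toy like that might work. The phrases and replies below are my own inventions for illustration – not the doll’s actual vocabulary:

```python
# A minimal sketch of a state-based voice toy. All phrases and replies
# here are hypothetical stand-ins, not Aloha Stitch's real data.

RESPONSES = {
    "nice":    {"hello": "Hiii!",    "sing": "La la laaa!"},
    "naughty": {"hello": "Go away!", "sing": "NO sing!"},
}
TAUNTS = {"bad stitch"}   # taunts flip him into naughty mode
SOOTHERS = {"good boy"}   # kind words calm him back down

class VoiceToy:
    def __init__(self):
        self.mode = "nice"

    def hear(self, phrase: str) -> str:
        phrase = phrase.lower()
        if phrase in TAUNTS:
            self.mode = "naughty"
            return "Grrrr!"
        if phrase in SOOTHERS:
            self.mode = "nice"
            return "Heheh."
        # The same command gets a different reply depending on state.
        return RESPONSES[self.mode].get(phrase, "...?")

toy = VoiceToy()
toy.hear("sing")        # "La la laaa!"
toy.hear("bad stitch")  # "Grrrr!" (now in naughty mode)
toy.hear("sing")        # "NO sing!"
```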
Joining the Ohana
Fast forward several years into my game development career. I had a Sophie’s choice to make – Disney liked working with me, so I could join either the team pitching a Muppets game or the team pitching a new Nintendogs-inspired product featuring Disney characters.
Back and forth I went, but I eventually chose the game team that became the Disney Friends team. Since we shipped early in the original Nintendo DS’s lifecycle, ours was one of the first games to use the Nintendo DS speech recognition engine. I suspect it was built directly on the technology from Hey You, Pikachu! – it was a grammar-based system, where you define a small, fixed dictionary of recognizable terms.
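If you haven’t worked with grammar-based systems, here’s a toy sketch of the core idea: the engine can only ever return one of the terms you registered, or nothing at all. The commands below are invented, and string similarity is a crude stand-in for what is really acoustic model scoring:

```python
import difflib

# Toy illustration of grammar-based recognition. The grammar is invented,
# and difflib string similarity is a crude proxy for acoustic scoring.

GRAMMAR = ["sit", "sing", "dance", "i love you"]

def recognize(heard: str, cutoff: float = 0.6):
    """Return the closest in-grammar command, or None on a miss."""
    matches = difflib.get_close_matches(heard.lower(), GRAMMAR, n=1, cutoff=cutoff)
    return matches[0] if matches else None

recognize("i luv you")              # "i love you"
recognize("open the pod bay door")  # None: not in the grammar
```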
I drew inspiration from the Aloha Stitch doll – especially since he was our primary character! We combined that state-based concept with some simulation game concepts to arrive at our emotional AI model, where each character had 3 or 4 core emotional pendulums, and we could peg moods as combinations of those pendulum values. Of course, Stitch had a pretty wide mood window for “naughty”. It was pretty great.
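Here’s roughly what I mean, sketched in Python – the axis names, ranges, and mood thresholds are invented for illustration, not our actual tuning values:

```python
from dataclasses import dataclass

# Sketch of the emotional pendulum idea. Axis names, ranges, and mood
# thresholds are illustrative inventions, not Disney Friends' real tuning.

@dataclass
class EmotionalState:
    # Each pendulum swings between -1.0 and 1.0.
    happy_sad: float = 0.0
    calm_rowdy: float = 0.0
    social_shy: float = 0.0

    def nudge(self, axis: str, amount: float) -> None:
        """Player actions (petting, taunting, feeding) push a pendulum."""
        value = max(-1.0, min(1.0, getattr(self, axis) + amount))
        setattr(self, axis, value)

    def mood(self) -> str:
        # Moods are "pegged" as regions of the combined pendulum space;
        # Stitch's naughty window was deliberately wide.
        if self.calm_rowdy > 0.3:
            return "naughty"
        if self.happy_sad > 0.5:
            return "happy"
        if self.happy_sad < -0.5:
            return "sad"
        return "content"

stitch = EmotionalState()
stitch.nudge("calm_rowdy", 0.6)
stitch.mood()  # "naughty"
```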
I learned a great deal about the art and science of speech design on that project. Acoustic confusability of commands had never really occurred to me, but the issue rears its head quickly if, for example, you have two words that start with “s” in a 20-word grammar. Thankfully, Disney had a surprisingly thorough approach to the process, and their QA department helped us find the underperforming commands. But changing even one term in these small grammars shifted the performance of ALL the others. The analog world of sound is a complex one.
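These days I’d want to screen a grammar for those collisions before it ever reached QA. Here’s a crude sketch of the idea – a real pipeline would compare phoneme sequences, so plain string similarity is only a stand-in, and the example grammar is invented:

```python
import difflib
from itertools import combinations

def confusable_pairs(grammar, threshold=0.7):
    """Flag command pairs that may be too similar to recognize reliably.
    String similarity is a crude proxy for acoustic/phonetic distance."""
    risky = []
    for a, b in combinations(grammar, 2):
        score = difflib.SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            risky.append((a, b, round(score, 2)))
    return risky

confusable_pairs(["sing", "spin", "dance", "fetch"])
# [("sing", "spin", 0.75)] – one of these two would get redesigned
```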
Even more complicated was localizing those commands into other languages. We had very carefully chosen our terms in English to maximize performance – but once translated to German or French, you might end up with three or four similar-sounding terms. In some cases it took 3 or 4 rounds of iteration with translators in each language to arrive at a single word that worked correctly. The fact that we shipped first in Europe was quite an accomplishment given those challenges.
One day, a (male) member of my team was verifying some of the functionality on our first character, Stitch. As I walked past his office, I saw him lean into the DS and whisper something, then lean back, frustrated. I paused outside to see him repeat himself a bit louder, mumbling “…oveuu stitch”. Still nothing. Finally, frustrated and a bit embarrassed, he burst out, “I LOVE YOU STITCH!” The extra volume produced the desired response, and I realized then that the 50% of our players who were little boys probably wouldn’t ever want to tell their DS “I love you”, even if Stitch IS their favorite. (We added “friend” as a synonym for those folks who couldn’t bear to have their image besmirched by professions of love to fictional characters trapped in a portable game console.)
Driving Forward
Long story short, I decided to leave the game industry for some new challenges after successfully shipping Disney Friends. I spent several years working on the opposite kind of challenge – large-scale server software.
Still, I was following speech design indirectly – I was extremely fascinated by the Kinect, and as an internal Xbox beta tester, I got to try it the summer before it launched. At the time, it was more about gesture control than speech control. But I’ll tell you this – the magical ability to blurt out “Xbox, pause” has given me unreasonable expectations for the rest of my life.
But a series of circumstances saw me joining the Windows Automotive team in 2012 – and it quickly became clear that we needed a designer to work on the voice components of the software. With my experience on Disney Friends, it seemed like a natural next step, so I volunteered to take on that role.
Over the course of the next two years, I had the opportunity to learn more about natural language speech systems and how they differed from grammar-based systems. We got to envision the future of voice software in the car, and did some really methodical, quantitative work to prove out our thinking. Unfortunately, we didn’t get to ship that work due to organizational changes. The silver lining in that sadness was a brief chance to work with the Cortana team before leaving Microsoft. Throughout the car/Cortana journey, I learned a great deal. I will always be grateful to folks like Stefanie and Lisa, who shared their expertise with me along the way. (If you’re reading, thank you.)
Adventures with Alexa
Around the same time, a series of events led to a job offer from Amazon (in late September 2014). It was a tough offer to accept, because it was blind – they couldn’t tell me what I’d be working on until after I’d signed on as a full-time employee.
I still can’t share any details about the project I was originally brought in to work on. But three months in, I was floored to find out about a product called the Echo. A speech-only home appliance? What? I signed up for the beta immediately, but employees didn’t get any special treatment… I was lucky enough to pop off the waitlist in February, and happily dove into life with our new home companion. From the beginning, there was something extremely compelling about Alexa. At first, it didn’t seem like we used her that much – but soon enough I was at others’ homes, blurting out “Alexa, is it going to rain this weekend?”
As luck would have it, my work over the past year gave me opportunities to collaborate with the Alexa team – and eventually, circumstances led to an interview and a job offer to join the voice design team on Alexa. It’s been an honor and the biggest challenge of my career thus far. We are solving problems no one has really solved yet, and it’s maddening and exciting. I’m grateful to Sumedha, a coworker of mine, for being generous with her time and expertise when I was still a full-stack designer from a team just curious about the budding Echo technology and what it might mean for us.
And Here We Are
What in one way seems like an almost random turn of events has been building for me ever since I first told Pikachu to reel in the fishing line in the early part of the millennium. Or perhaps since I first saw how silly Scotty thought we were for using mice to control our computers.
I can’t yet talk about… well, any of my full-time work at Amazon. But a separate post is coming on my VERY FIRST public Alexa skill. Guess what I did? I taught Alexa about Pokemon type strengths and weaknesses.
FULL CIRCLE.
If there’s anything to be learned here, it’s that careers in technology almost always require learning on the job. Speech interaction was so new when I was in university that it essentially wasn’t studied except in hard-to-find corners of the deeply academic world. I encountered this technology on the job and in my free time, and owe a debt of gratitude to those who helped me understand what was going on under the hood. At times in your career, you may find yourself on either side of that exchange – extend the hand if needed, accept it if offered.
Of course, once you have voice experience, people want to paint you into the “voice designer” box, believing you’re not interested in or not qualified to do anything else. To me, this is a temporary issue, like “will you always specialize in mobile design?” was 10 years ago for some. Eventually voice will join mice, keyboards, and touch interfaces as just one of many rich forms of everyday interaction. But today, while it’s still new, it helps to have folks going deep. My VUI (voice UI) team has to understand the state of the technology today and come up with creative ways to work around its limitations while still providing an awesome experience.
What does it mean to DO voice design? Well, it’s largely an information architecture problem, with a somewhat heavier emphasis on the spoken word than the rendered one. And from the beginning I’ve specialized in multimodal speech design – the design of speech-enabled systems that feature a nontrivial graphical component. Cars, video games, Fire TV – multimodal experiences. The Star Trek computer that we all want is multimodal: speech-only when needed, and visual when you want it. In the end, all design will reflect that sensibility.
Hello, computer.