Building a Custom Mobile Voice Assistant — Technical Feasibility Study

Is there a better way to experience the convenience of a smartphone than through a mobile voice assistant? Not really. Mobile voice assistants are extremely useful. But they can also be tedious — I mean, how many times a day do you have to say “Ok Google” or “Hey Siri” to prompt the assistant?

From a business perspective, these prompting commands are also generic. Before using your product, customers have to say the name of someone else’s brand. An obvious alternative is a custom mobile voice assistant with specific prompts that lets companies create a branded experience. The key question is whether such a solution can be built and what it takes to develop a custom mobile voice assistant.


About the Research — Assumptions and Requirements

At Labs, our internal R&D department, we’ve been eager to explore the idea of creating a custom mobile voice assistant for a while. The goal was to do initial research and check whether it is possible to implement an efficient always-on wake word detection system. We'd like to share our findings and give tips for further steps.

To determine the technology needed to develop the solution, we first had to get our assumptions and requirements right. The main requirement was for the mobile voice assistant to react to a specific wake word.

We narrowed the other requirements for the system down to the following:

  • Always-on — continuous listening for the wake word and notifying when a specific phrase is detected
  • Low energy consumption — the assistant shouldn’t drain the battery
  • Low latency — the analysis needs to be done on the phone, offline
  • High accuracy
  • Android as the target operating system, with a solution that should work similarly on iOS
  • Customizable — the wake word should be customizable to another phrase
  • Working in a resource-constrained environment — mobile devices don’t have the hardware capacity of laptops or desktop computers

Taken together, these requirements translate into a system that consists of three loosely coupled subsystems:

  1. Energy detector

This system operates continuously and estimates the energy of an incoming sound. Such detection should consume very little power. The next stage activates only when the sound volume is above a certain threshold.

  2. Voice activity detection (VAD)

With a signal detected, the system should recognize whether it’s speech or just noise. VAD consumes slightly more power than an energy detector, but it’s still a relatively simple system. We would expect very high accuracy, above 98%, when classifying a signal as speech. If the signal is speech, we proceed to the third stage.

  3. Wake word recognizer

Now that the incoming signal is recognized as voice, we launch the last and most compute-expensive system: the wake word recognizer. To achieve very high accuracy and be able to adjust the system in the future, we recommend using a neural network and a dataset that have already been proven in wake word detection. This also cuts the development time.
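The three-stage design above can be sketched in a few lines of Python. This is only an illustration of the control flow, not our implementation: the function name cascade and the three detector callables are hypothetical, and a real wake word recognizer would look at a window of several frames rather than a single one.

```python
from typing import Callable, Iterable, Iterator, Sequence

Frame = Sequence[float]  # one short block of audio samples in [-1.0, 1.0]


def cascade(
    frames: Iterable[Frame],
    has_energy: Callable[[Frame], bool],
    is_speech: Callable[[Frame], bool],
    is_wake_word: Callable[[Frame], bool],
) -> Iterator[int]:
    """Yield the index of every frame in which the wake word was detected.

    The stages run cheapest-first. Python's `and` short-circuits, so a later
    (more expensive) stage is only called when the earlier ones said yes.
    """
    for index, frame in enumerate(frames):
        if has_energy(frame) and is_speech(frame) and is_wake_word(frame):
            yield index
```

The point of the ordering is power: in a quiet room only the first, cheapest check runs, and the neural network stays idle.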


Researching the Available Technology for a Mobile Voice Assistant

During the research, we implemented the first two subsystems — energy detector and voice activity detection. We also checked several options to implement the third step.

Evaluating the energy detector

Implementing the energy detector, a simple algorithm that checks the energy level (sound volume) of the incoming signal, was relatively straightforward. The detection threshold can be easily adjusted; we set it to detect any sound occurring around the device that might be a voice. The energy detector passes the recorded sound on to the next stage only after a positive detection.
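An energy check of this kind can be as small as the sketch below. It assumes samples normalized to the range -1.0 to 1.0 and measures the root-mean-square level in dBFS. The function names and the -40 dBFS default threshold are illustrative assumptions, not the values we used.

```python
import math
from typing import Sequence


def rms_dbfs(frame: Sequence[float]) -> float:
    """Return the RMS level of a frame in dBFS (0 dBFS = full scale)."""
    if not frame:
        return -math.inf
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square == 0.0:
        return -math.inf  # digital silence
    # 10 * log10(mean square) equals 20 * log10(RMS).
    return 10.0 * math.log10(mean_square)


def has_energy(frame: Sequence[float], threshold_dbfs: float = -40.0) -> bool:
    """True when the frame is louder than the (adjustable) threshold."""
    return rms_dbfs(frame) > threshold_dbfs
```

Lowering threshold_dbfs makes the detector more sensitive, at the cost of waking the next stage more often.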

To improve accuracy, the system can be augmented with a biquad filter that cuts out frequencies that are too low or too high to be speech.
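As a rough illustration, a single band-pass biquad can be built from the widely used "Audio EQ Cookbook" formulas. The class name, the 1000 Hz center frequency, and the Q value in the usage note are assumptions chosen for the example, not tuned values from our research.

```python
import math
from typing import Iterable, List


class BandPassBiquad:
    """Second-order band-pass filter (unity gain at the center frequency)."""

    def __init__(self, sample_rate: float, center_hz: float, q: float) -> None:
        w0 = 2.0 * math.pi * center_hz / sample_rate
        alpha = math.sin(w0) / (2.0 * q)
        a0 = 1.0 + alpha
        # Coefficients are normalized by a0 so the recurrence needs no division.
        self.b0 = alpha / a0
        self.b1 = 0.0
        self.b2 = -alpha / a0
        self.a1 = -2.0 * math.cos(w0) / a0
        self.a2 = (1.0 - alpha) / a0
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0  # filter memory

    def process(self, samples: Iterable[float]) -> List[float]:
        """Filter a block of samples, keeping state between calls."""
        out = []
        for x in samples:
            y = (self.b0 * x + self.b1 * self.x1 + self.b2 * self.x2
                 - self.a1 * self.y1 - self.a2 * self.y2)
            self.x2, self.x1 = self.x1, x
            self.y2, self.y1 = self.y1, y
            out.append(y)
        return out
```

For example, BandPassBiquad(16000, 1000.0, 0.5) passes a 1 kHz tone almost unchanged while strongly attenuating 50 Hz hum and content near 7 kHz. The filtered samples would then go to the energy check.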

Voice activity detector

We used Google’s open-source Voice Activity Detection library. The library is written in C, but there is an Android wrapper available that eases the use of the library.

The library itself is reportedly “one of the best available: it's fast, modern, and free. Google’s algorithm has found wide adoption and has recently become one of the gold standards for delay-sensitive scenarios like web-based interaction” (source: gkonovalov/android-vad).
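VAD libraries of this kind typically classify short, fixed-length frames of 16-bit PCM audio rather than a whole recording. The sketch below shows only that framing step. The classifier is passed in as a callable, and both helper names are hypothetical; it does not reproduce the API of the library or of the Android wrapper.

```python
from typing import Callable, Iterator, List


def split_frames(pcm: bytes, sample_rate: int, frame_ms: int = 30) -> Iterator[bytes]:
    """Yield consecutive fixed-length frames of 16-bit mono PCM.

    An incomplete frame at the end of the buffer is dropped.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]


def classify_frames(
    pcm: bytes,
    sample_rate: int,
    classifier: Callable[[bytes, int], bool],
    frame_ms: int = 30,
) -> List[bool]:
    """Run a frame-level speech classifier over a PCM buffer."""
    return [classifier(frame, sample_rate)
            for frame in split_frames(pcm, sample_rate, frame_ms)]
```

At 16 kHz, a 30 ms frame is 480 samples (960 bytes), so one second of audio yields 33 full frames.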

The algorithm implemented in the library is based on the Gaussian mixture model (GMM), which is one of the commonly used probabilistic models. However, even the best GMM algorithms can't compete with algorithms based on deep neural networks in terms of speed and error rate. The authors of the paper were able to reduce the delay by a factor of 87 and achieve a 6.7% lower error rate.
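To give an intuition for how a GMM-based detector decides, here is a toy likelihood-ratio test over a single feature (the frame level in dB). It is not the library's algorithm: the two mixtures, their parameters, and the function names are made up for illustration, and a real detector would use several features with trained parameters.

```python
import math
from typing import Sequence, Tuple

Component = Tuple[float, float, float]  # (weight, mean, standard deviation)

# Hypothetical mixtures over the frame level in dBFS.
NOISE_GMM: Sequence[Component] = [(0.7, -60.0, 5.0), (0.3, -50.0, 6.0)]
SPEECH_GMM: Sequence[Component] = [(0.6, -25.0, 6.0), (0.4, -15.0, 5.0)]


def gmm_log_likelihood(x: float, mixture: Sequence[Component]) -> float:
    """Log of the weighted sum of Gaussian densities evaluated at x."""
    density = 0.0
    for weight, mean, std in mixture:
        z = (x - mean) / std
        density += weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return math.log(density) if density > 0.0 else -math.inf


def is_speech(level_db: float, threshold: float = 0.0) -> bool:
    """Classify as speech when the speech model explains the value better."""
    ratio = (gmm_log_likelihood(level_db, SPEECH_GMM)
             - gmm_log_likelihood(level_db, NOISE_GMM))
    return ratio > threshold
```

Raising the threshold makes the detector more conservative about calling something speech.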

RNNoise, a noise suppression library built on a recurrent neural network, includes a very good and highly performant VAD. It can be compiled to WebAssembly, and it's also present as a component in WebRTC. One can also opt for a more dedicated solution.

Wake word recognizer

An offline wake word detector can be approached in two ways. One is to ask users to record a keyphrase several (~3) times, upload the recordings to a server, use an algorithm to create a model, and then use that model in the app.

Another approach would be to create a universal model that can detect a keyphrase without any user interaction.

During this research, we checked a few options that implement one of the two approaches:

Howl

Howl is an open-source wake word detection system used in Mozilla Firefox. After saying "Hey Firefox," users start interacting with the browser. Howl is written in Python and uses the PyTorch machine learning framework.

After some modifications to the source code, we were able to run the app and test it. The system works very well. It detects the phrase "Hey Firefox" quickly and with a very low error rate. The Howl repository clearly describes how to prepare a dataset and train a model.

Using Howl to implement custom phrase detection might work very well, but the model-building procedure is not obvious and requires a lot of data.

For example, for Firefox, the company used the Mozilla Common Voice dataset (~70 GB of short audio clips from users around the world) as well as 632 recordings of “Hey, Firefox” from volunteers.

Howl also requires a CUDA-enabled graphics card with at least 4 GB of VRAM (the authors used an Nvidia Titan RTX) for the training procedure. Because we wanted to develop a solution for smartphones, we had to explore several approaches to see which was viable:

  • Using a pretrained model in the PyTorch Android library — PyTorch comes with a version for mobile devices. In theory, it should be possible to run the same model on desktop and mobile. In practice, however, we found the library not very reliable, and it has several issues. Because it's a thin wrapper around the native PyTorch library, it exposes only a simple interface that is not easy to work with.
  • Converting the PyTorch model to TensorFlow — TensorFlow can be easily used on mobile devices, and it would be great to test the app with the pretrained model from the Howl application on mobile devices. However, converting models between these two technologies requires specific knowledge of both of them.
  • Creating a TensorFlow model using Howl's approach — This seems like the best option. But, again, it requires specific knowledge of the system. On the GitHub page, there is a description of how to prepare a dataset for training and testing purposes. Unfortunately, the description doesn't specify how to use it in the TensorFlow framework.

mycroft-precise

Precise is a wake word listener. The software monitors an audio stream (usually a microphone). When it recognizes a specific phrase, Precise triggers an event. It's written in Python, and it’s designed to run on Linux, especially on resource-constrained devices like Raspberry Pi.

The software is built on top of the TensorFlow framework, and the model is distributed as a .pb file. Unfortunately, Precise doesn’t provide an easy way to convert (or build) it for mobile devices.

Ideally, we would need a .tflite file. Some work has been done to convert a .pb file to .tflite, but the branch that contains these changes still isn’t merged, and using it causes some installation issues.

Still, getting the .tflite file wouldn't be enough, because the audio data has to be preprocessed before it is fed to the model. Precise provides instructions on how to train your model. You need around 12 recordings of the keyphrase to make the system work properly.

Snowboy

Snowboy is a hotword (wake word) detection framework based on deep neural networks. The tool provides an option to create two types of models:

  • Personal — The user needs to record an audio file by saying a keyphrase 3 times, and then upload it to the server. The server will create a model file (.pmdl) that is ready to use in the application.
  • Universal — The algorithm needs to be fed with 500 audio files that contain the chosen phrase. The server will then create a model file (.umdl) that is ready to use by every user.

The big downside of this framework is that the part responsible for creating a model is closed-source. At some point, the company can decide to stop supporting it, which will make generating new models impossible. In fact, Snowboy is already a deprecated framework: the website that collected user recordings (it let anyone propose a phrase, and others could record themselves saying it) has been shut down.

PocketSphinx

PocketSphinx is a speech recognition framework that works from text transcriptions: the user provides a transcription of the keyphrase that they would like to be detected. During our research, however, PocketSphinx turned out to be not very reliable. It produces a lot of false positives, which results in a poor user experience.

Picovoice AI

Picovoice AI offers an SDK that makes it easy to train wake word detection models. Commercial applications require purchasing a commercial license, and the vendor doesn’t provide any pricing guidance on its website.

Problems Detected During the Research

During the research, we ran into a few problems.

The model creation process

The model creation process requires very specific knowledge of machine learning and audio processing. Researchers in this field work with several machine learning frameworks.

The two most popular are PyTorch and TensorFlow. TensorFlow seems to be the best fit because of its presence on mobile platforms. But a lot of researchers prefer PyTorch, which brings us to the next point: it’s difficult to convert ready-to-use models between these frameworks.

Another difficulty is audio processing. For example, to decide which representation of the audio signal makes the human voice easiest to process, you need some knowledge of audio features (like the power spectrum, MFCC, or mel-scale features).
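To make these terms a little more concrete, the sketch below computes a power spectrum with a naive discrete Fourier transform and converts between hertz and the mel scale using the common 2595 * log10(1 + f / 700) formula. The helper names are our own for this example, and the DFT is written for readability; a real system would use an FFT library.

```python
import cmath
import math
from typing import List, Sequence


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Power of each DFT bin from 0 Hz up to the Nyquist frequency."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def hz_to_mel(hz: float) -> float:
    """Map a frequency in Hz onto the perceptually motivated mel scale."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands: int, low_hz: float, high_hz: float) -> List[float]:
    """Edges (in Hz) of n_bands overlapping bands, equally spaced in mel."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_bands + 1)
    return [mel_to_hz(low_mel + i * step) for i in range(n_bands + 2)]
```

Summing the power spectrum inside each mel band gives mel-band energies. MFCCs are then obtained by taking the logarithm of those energies and applying a discrete cosine transform.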

Large datasets

Creating a good, well-tested, and capable wake word detection system requires a lot of audio data containing the phrase that the system will look for. It also requires many long audio files that don't contain the wake word. These negative recordings are necessary to reduce the number of false-positive triggers during everyday usage.
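The two kinds of data map onto two error measures that are commonly reported for wake word systems: false alarms per hour on audio without the wake word, and the false reject rate on recordings that contain it. A minimal sketch, with function names chosen for this example:

```python
def false_alarms_per_hour(false_triggers: int, negative_audio_seconds: float) -> float:
    """False activations per hour of audio that does NOT contain the wake word."""
    if negative_audio_seconds <= 0:
        raise ValueError("negative_audio_seconds must be positive")
    return false_triggers * 3600.0 / negative_audio_seconds


def false_reject_rate(missed: int, total_positive: int) -> float:
    """Fraction of wake word recordings that the system failed to detect."""
    if total_positive <= 0:
        raise ValueError("total_positive must be positive")
    return missed / total_positive
```

For instance, 3 false triggers over 2 hours of negative audio is 1.5 false alarms per hour, and 5 misses out of 100 positive recordings is a 5% false reject rate. The longer the negative recordings, the more trustworthy the first number becomes.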

The hardware limitations of smartphones

Because of the volume of data needed, it’s necessary to have a highly capable machine to run the training and evaluation parts of the model creation process. A smartphone can't do this work; it only runs the finished model.

Possible Use Cases for a Mobile Voice Assistant

User experience isn’t about pursuing innovation. It’s about building solutions that deliver convenience and satisfaction. In that context, the main goal of a custom mobile voice assistant is to give businesses greater control of the user experience. The assistant should have a customizable activation phrase that can be uniquely adjusted to the needs of a company.

For example, Howl, the system used in Firefox, lets users dictate a specific website address after saying “Hey Firefox.”

Here are some of the possible use cases for a mobile voice assistant:

Increasing accessibility — Mobile voice assistants make it easy for people with disabilities to use smartphones and access more advanced features of their devices.

Using a voice command to wake the application without touching the phone — This is highly useful whenever physical access to the smartphone is limited (e.g., while driving a car). You can ask an app to perform a task, for example, a cooking app to suggest next steps when your hands are busy.

Controlling IoT devices — IoT devices seem the most promising use case for a custom mobile voice assistant, for example, robots, home appliances, or custom-built devices (devices built using Raspberry Pi or similar computers).

Recommendations for Commercial Projects

While it’s possible to create a wake word detection system that will meet all the requirements, the task isn’t easy. First, you should have a huge dataset to build the correct machine learning model. Second, considerable knowledge of the audio domain is also required. Third, you need good experience with machine learning frameworks.

As for the machine learning technology, TensorFlow Lite seems the most suitable.

How to Deploy Machine Learning Models on Mobile and Embedded Devices

Why TensorFlow Lite?

TensorFlow Lite is a version of TensorFlow optimized for mobile devices. Models built in TensorFlow Lite can be used on Android and iOS without issues. On top of that, TensorFlow Lite has low hardware requirements:

  • Android — the minimum requirements state Android SDK version 19, which means Android 4.4 (released in 2013).
  • iOS — the minimum iOS version is 8, which was released in 2014. 

Of course, the more powerful the device, the faster the analysis, but you can run models on a great majority of smartphones currently in use.
