

Building a Custom Mobile Voice Assistant — Technical Feasibility Study

Kamil Halko
November 25, 2021

Building a mobile voice assistant that reacts to a custom keyphrase is a complex process. Read about our findings from a technical feasibility study.


Is there a better way to experience the convenience of a smartphone than through a mobile voice assistant? Not really. Mobile voice assistants are extremely useful. But they can also be tedious — I mean, how many times a day do you have to say “Ok Google” or “Hey Siri” to prompt the assistant?

From a business perspective, these prompting commands are also generic. Before using your product, customers have to say the name of someone else’s brand. An obvious alternative is a custom mobile voice assistant with specific prompts that lets companies create a branded experience. The key questions are whether such a solution can be built and what it takes to develop one.

About the Research — Assumptions and Requirements

At Labs, our internal R&D department, we’ve been eager for a while to explore the idea of creating a custom mobile voice assistant. The goal was to do some initial research and check whether it is possible to implement an efficient, always-on wake word detection system. We'd like to share our findings and give tips for further steps.

To determine the technology needed to develop the solution, we first had to get our assumptions and requirements right. The main requirement was for the mobile voice assistant to react to a specific wake word.

We narrowed the remaining requirements for the system down to the following:

  • Always-on — continuous listening for the wake word and notifying when a specific phrase is detected
  • Low energy consumption — the assistant shouldn’t drain the battery
  • Low latency — the analysis needs to be done on the phone, offline
  • High accuracy
  • Android as the target operating system, with a solution that should work similarly on iOS
  • Customizable — the wake word should be customizable to another phrase
  • Working in a resource-constrained environment — mobile devices don’t have the hardware capacity of laptops or desktop computers

Taken together, these requirements translate into a system that consists of three loosely coupled subsystems:

  1. Energy detector

This system operates continuously and estimates the energy of an incoming sound. Such detection should consume very little power. The next stage activates only when the sound volume is above a certain threshold.

  2. Voice activity detection (VAD)

With a signal detected, the system should determine whether it’s speech or just noise. VAD consumes slightly more power than an energy detector, but it’s still a relatively simple system. We would expect very high accuracy, above 98%, when it comes to classifying something as speech. If the signal is speech, we proceed to the third phase.

  3. Wake word recognizer

Now that the incoming signal is recognized as voice, we launch the last and most compute-intensive system: the wake word recognizer. To achieve very high accuracy and be able to adjust the system in the future, we recommend using a neural network architecture and a data set that have already been proven in wake word detection. Doing so will also cut the development time.
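The gating between the three subsystems can be illustrated with a minimal Python sketch. It is not the code from our study: the function name `cascade` and the three callables are hypothetical placeholders, and a real recognizer would look at a window of frames rather than a single one. The point is only that each stage runs when the cheaper stage before it says yes.

```python
from typing import Callable, Iterable, Iterator

Frame = bytes  # one chunk of raw audio; the exact format is up to the caller


def cascade(
    frames: Iterable[Frame],
    has_energy: Callable[[Frame], bool],
    is_speech: Callable[[Frame], bool],
    is_wake_word: Callable[[Frame], bool],
) -> Iterator[int]:
    """Yield the index of every frame on which the wake word fires.

    All three detectors are hypothetical stand-ins supplied by the caller.
    """
    for index, frame in enumerate(frames):
        if not has_energy(frame):   # stage 1: cheapest check, always running
            continue
        if not is_speech(frame):    # stage 2: voice activity detection
            continue
        if is_wake_word(frame):     # stage 3: most expensive check
            yield index
```

Because the checks short-circuit, silent frames never reach the VAD, and non-speech frames never reach the recognizer, which is where most of the energy saving would come from.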

Researching the Available Technology for a Mobile Voice Assistant

During the research, we implemented the first two subsystems — energy detector and voice activity detection. We also checked several options to implement the third step.

Evaluating the energy detector

Implementing the energy detector, a simple algorithm that checks the energy level (sound volume) of the incoming signal, was relatively straightforward. The detection threshold can be easily adjusted; we set it to detect any sound occurring around the device that might be a voice. The energy detector passes the recorded sound on only after a positive detection.
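As an illustration, an energy check of this kind can be sketched in a few lines of Python. This is a sketch under assumptions, not our implementation: it assumes 16-bit little-endian mono PCM input, uses root-mean-square amplitude as the energy measure, and the function names and the default threshold of 0.02 are placeholders that would need tuning on a real device.

```python
import math
import struct


def rms_energy(pcm: bytes) -> float:
    """Root-mean-square amplitude of 16-bit little-endian mono PCM, scaled to 0..1."""
    count = len(pcm) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    mean_square = sum(s * s for s in samples) / count
    return math.sqrt(mean_square) / 32768.0


def exceeds_threshold(pcm: bytes, threshold: float = 0.02) -> bool:
    """True when the frame is loud enough to pass on to the next stage.

    The default threshold is an arbitrary illustrative value.
    """
    return rms_energy(pcm) >= threshold
```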

To improve the accuracy and cut out frequencies that are too low or too high to be speech, the system can be augmented with a biquad filter.
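For readers unfamiliar with biquads, here is a minimal Python sketch of a band-pass biquad. It uses the widely published "Audio EQ Cookbook" band-pass coefficients (constant 0 dB peak gain) in direct form I. The class name and the parameter values in the usage note are assumptions for illustration; the study does not prescribe a specific filter design.

```python
import math
from typing import Iterable, List


class BandPassBiquad:
    """Second-order band-pass filter (cookbook coefficients, direct form I)."""

    def __init__(self, sample_rate: float, center_hz: float, q: float) -> None:
        w0 = 2.0 * math.pi * center_hz / sample_rate
        alpha = math.sin(w0) / (2.0 * q)
        a0 = 1.0 + alpha
        # Coefficients normalized by a0.
        self.b0 = alpha / a0
        self.b1 = 0.0
        self.b2 = -alpha / a0
        self.a1 = -2.0 * math.cos(w0) / a0
        self.a2 = (1.0 - alpha) / a0
        # Filter state: the two previous inputs and outputs.
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0

    def process(self, samples: Iterable[float]) -> List[float]:
        """Filter a block of samples, keeping state between calls."""
        out = []
        for x in samples:
            y = (self.b0 * x + self.b1 * self.x1 + self.b2 * self.x2
                 - self.a1 * self.y1 - self.a2 * self.y2)
            self.x2, self.x1 = self.x1, x
            self.y2, self.y1 = self.y1, y
            out.append(y)
        return out
```

For example, `BandPassBiquad(16000.0, 1000.0, 0.5)` would pass a broad band around 1 kHz while attenuating rumble and very high frequencies; both the center frequency and the Q value here are illustrative guesses, not measured settings.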

Voice activity detector

We used Google’s open-source Voice Activity Detection library. The library is written in C, but an Android wrapper is available that makes it easier to use.

The library itself is reportedly “one of the best available: it's fast, modern, and free. Google’s algorithm has found wide adoption and has recently become one of the gold standards for delay-sensitive scenarios like web-based interaction” (source: gkonovalov/android-vad).
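The WebRTC VAD works on short, fixed-size frames of 16-bit mono PCM (10, 20, or 30 ms long), so the recorded sound has to be cut into frames before classification. The Python sketch below shows that framing step together with a simple majority-style decision. The `classify` callable is a hypothetical stand-in for whatever VAD binding is used, and `min_ratio` is an assumed tuning parameter, not something taken from the library.

```python
from typing import Callable, Iterator

BYTES_PER_SAMPLE = 2  # 16-bit mono PCM


def split_frames(pcm: bytes, sample_rate: int, frame_ms: int) -> Iterator[bytes]:
    """Cut PCM audio into whole frames of frame_ms milliseconds; drop any remainder."""
    frame_bytes = sample_rate * frame_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]


def contains_speech(
    pcm: bytes,
    sample_rate: int,
    classify: Callable[[bytes, int], bool],  # hypothetical VAD call: (frame, rate) -> bool
    frame_ms: int = 30,
    min_ratio: float = 0.5,  # assumed share of voiced frames needed to say "speech"
) -> bool:
    """Return True when at least min_ratio of the frames are classified as speech."""
    frames = list(split_frames(pcm, sample_rate, frame_ms))
    if not frames:
        return False
    voiced = sum(1 for frame in frames if classify(frame, sample_rate))
    return voiced / len(frames) >= min_ratio
```

Deciding on the share of voiced frames instead of a single frame makes the result less sensitive to one misclassified frame, at the cost of a slightly longer delay.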

The algorithm implemented in the library is based on the Gaussian mixture model (GMM), one of the most commonly used probabilistic models. However, even the best GMM algorithms can't compete with algorithms based on deep neural networks in terms of speed and error rate. The authors of one paper on a neural-network-based approach reported an 87-fold reduction in delay and a 6.7% lower error rate.
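To give a feel for how a GMM-based detector decides, here is a toy Python sketch of a likelihood-ratio test between a "speech" mixture and a "noise" mixture over a single feature (for example, log energy). It is a simplified illustration of the general technique, not the library's actual algorithm, and all mixture parameters in the usage note are made up.

```python
import math
from typing import List, Tuple

Component = Tuple[float, float, float]  # (weight, mean, standard deviation)


def gmm_log_likelihood(x: float, components: List[Component]) -> float:
    """Log of the weighted sum of 1-D Gaussian densities evaluated at x."""
    total = 0.0
    for weight, mean, std in components:
        z = (x - mean) / std
        total += weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return math.log(total) if total > 0.0 else float("-inf")


def is_speech(
    x: float,
    speech: List[Component],
    noise: List[Component],
    threshold: float = 0.0,
) -> bool:
    """Classify x as speech when the log-likelihood ratio exceeds the threshold."""
    return gmm_log_likelihood(x, speech) - gmm_log_likelihood(x, noise) > threshold
```

With made-up models such as `speech = [(0.6, -2.0, 1.0), (0.4, -1.0, 0.5)]` and `noise = [(1.0, -6.0, 1.0)]`, a feature value near -1.5 is classified as speech and a value near -6.0 as noise.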

RNNoise, for example, has a very good and highly performant VAD system. It can be compiled to WebAssembly, and it is also available as a component of WebRTC. One can also opt for a more dedicated solution.

Wake word recognizer

An offline wake word detector can be approached in two ways. The first is to ask users to record a keyphrase a few times (around three), upload the recordings to a server, use an algorithm to create a model, and then use that model in the app.

The second approach is to create a universal model that can detect a keyphrase without any user interaction.

During this research, we checked a few options that implement one of the two approaches:


Howl

Howl is an open-source wake word detection system used in Mozilla Firefox. After saying "Hey Firefox," users can start interacting with the browser. Howl is written in Python and uses the PyTorch machine learning framework.

After making some modifications to the source code, we were able to run the app and test it. The system works very well. It detects the phrase "Hey Firefox" quickly and with a very low error rate. The Howl repository clearly describes how to prepare a data set and train a model.

Using Howl to implement custom phrase detection might work very well, but the model-building procedure is not obvious and requires a lot of data.

For example, for Firefox, the company used the Mozilla Common Voice dataset (~70 GB of short audio clips from users around the world) as well as 632 recordings of “Hey, Firefox” from volunteers.

Howl also requires a CUDA-enabled graphics card with at least 4 GB of VRAM for the training procedure (the authors used an Nvidia Titan RTX). Because we wanted to develop a solution for smartphones, we had to explore several approaches to see which was viable:

  • Using a pretrained model in the PyTorch Android library — PyTorch comes with a version for mobile devices. In theory, it should be possible to run the same model on desktop and mobile. In practice, however, we found that the library is not very reliable and has several issues. Since it's just a thin wrapper around the underlying PyTorch library, it exposes a minimal interface that is not easy to work with.
  • Converting PyTorch's model to TensorFlow — TensorFlow can be easily used on mobile devices, and it would be great to test the app on mobile devices with the pretrained model from the Howl application. However, converting models between these two technologies requires specific knowledge of both of them.
  • Creating a TensorFlow model using Howl's approach — This seems like the best option. But, again, it requires specific knowledge of the system. On the GitHub page, there is a description of how to prepare a dataset for training and testing purposes. Unfortunately, the description doesn't specify how to use it in the TensorFlow framework.


Precise

Precise is a wake word listener. The software monitors an audio stream (usually a microphone) and triggers an event when it recognizes a specific phrase. It's written in Python and designed to run on Linux, especially on resource-constrained devices like the Raspberry Pi.

The software is built on top of the TensorFlow framework, and the model is distributed as a .pb file. Unfortunately, Precise doesn’t provide an easy way to convert (or build) it for mobile devices.

Ideally, we would need a .tflite file. Some work has even been done to convert a .pb file to .tflite, but the branch that contains these changes still isn’t merged, and using it causes some installation issues.

Still, getting the .tflite file wouldn't be enough, because some preprocessing calculations are required before audio data can be fed to the model. Precise provides instructions on how to train your own model. You need around 12 recordings of the keyphrase to make the system work properly.


Snowboy

Snowboy is a hotword (wake word) detection framework based on deep neural networks. The tool provides an option to create two types of models:

  • Personal — The user needs to record an audio file by saying a keyphrase 3 times, and then upload it to the server. The server will create a model file (.pmdl) that is ready to use in the application.
  • Universal — The algorithm needs to be fed with 500 audio files that contain the chosen phrase. The server will then create a model file (.umdl) that is ready to use by every user.

The big downside of this framework is that the part responsible for creating a model is closed-source: at some point, the company can decide to stop supporting it, which will make generating new models impossible. In fact, Snowboy is already deprecated. The website that collected user recordings (it let anyone propose a phrase, and others could record themselves saying it) has been shut down.


PocketSphinx

PocketSphinx is a framework that matches audio against text transcriptions. The user provides a transcription of the keyphrase that they would like to be detected. During our research, however, PocketSphinx turned out to be not very reliable. It produces a lot of false positives, which results in a poor user experience.

Picovoice AI

Picovoice AI offers an SDK for easily training wake word detection models. Commercial applications require a commercial license, and the vendor doesn’t provide any pricing guidance on its website.