
Home Assistant Voice Preview Edition: The Open-Source Game Changer We've Been Waiting For?

HA Voice Preview Edition Box and PCB
Nabu Casa's take on a voice assistant: how does it stand up to the rest of the crowd?

It's been just over a week since the Home Assistant Voice Preview Edition arrived, and I've been testing how well it performs and how it holds up against the Amazon Alexa ecosystem. As expected, the platform has a long way to go before it's fully mature and on par with Alexa, requiring not only a running instance of Home Assistant but also several APIs set up (either locally or in the cloud) to get performance that somewhat rivals the current competition.

I think it's important to note that this is not meant as a replacement for Alexa or Google Home quite yet; compared to the current commercial solutions, this initial experience will likely be buggy, with quite a few teething issues. This post is intended to give my initial thoughts, and a long-term review is still quite a way away.

Out of the Box Experience

Box contents
Box containing the sticker, start card, warranty information, and the device itself. Here, I've also removed the top cover so you can see the PCB.

As with a lot of enthusiast-grade smart home tech, the packaging is very simple: just a cardboard box with the product, some warranty information, and a small sticker. All setup instructions are linked via a QR code, which takes you to a well-documented website with good, clear instructions.

The finish of the product is very reasonable for their first commercial sample, exposing four Phillips screws under the rubber feet, which aren't glued in and can be pulled out. The Voice PE doesn't come with any cables or power bricks in the box, a trend seen across the industry over the past few years that definitely helps keep product costs lower.

Setup

Initial plug-in was about on par with other similar smart home devices I've set up, although the device appeared a couple of times in my Home Assistant and had some minor issues connecting to my IoT network (it's on a hidden SSID, which can cause issues with ESPHome devices configured to perform a scan before connecting).

Ultimately, because of this, when I performed the first reboot of the device it was unable to connect to the network, and I had to perform a factory reset. The factory reset process itself is (or at least was) also very buggy and took several attempts to take hold properly. Once reconnected, the device is initially functionally useless within HA, with Nabu Casa driving people either to sign up for their cloud service or to run their own speech-to-text and text-to-speech models. HA has some basic intent processing built in, which I'll cover a little later.

Onboard Speaker and 3.5mm Jack

Onboard Speaker
The onboard speaker module.

The onboard speaker is lacking in comparison to its proprietary competitors: a small module firing out of one side of the device. It sounds slightly tinny during playback, which just about suffices for voice responses but is definitely lacking for music.

I had to take the device apart to get to the 3.5mm jack, as it wasn't making good electrical contact and could sometimes drop one channel of the output. Whether the fault was the HA Voice PE or my 3.5mm cable I'm not too sure, but bending the pins out to make slightly better contact seemed to do the trick. One slight niggle here is that the screw holes on the PCB aren't plated, so the internal screws bite straight into the FR4, which isn't ideal.

Microphone Array

Mic Array, LED Ring, Button, and Rotary Encoder
Lots of tech, all nicely packaged on a beautiful black PCB!

The HA Voice PE has two microphones onboard, which provide some echo cancellation and are a small improvement over a single-mic system. Coming from the 2nd-generation Echo, with its 6-microphone array, detection is not quite as good, but this could also be down to the openWakeWord project being less mature and not being trained on user data.

LED Ring, Button, Rotary Encoder, and Mute Switch

The LED ring is very similar to its Alexa counterpart, although not as well diffused, so it's easier to make out the individual LEDs. It's just about as bright, has some cool animations, and can helpfully be fully controlled from within Home Assistant.

The rotary encoder, button, and mute switch all feel well built, with the encoder providing iPod-like volume control and the mute switch physically disconnecting the microphone array. The button can be programmed within Home Assistant, with onboard entity state updates for short press, long press, double click, triple click, and an Easter egg click. I've currently got mine set to switch off the lights and electric blanket on a long press, but haven't yet had time to think up what to program for the others.
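To give a flavour of how that's wired up, here's a minimal automation sketch for the long-press action. The entity IDs and state values below are assumptions for illustration only; check what your Voice PE actually exposes under Settings → Devices & services.

    # Hypothetical automation: long press turns off the lights and blanket
    automation:
      - alias: "Voice PE long press: lights and blanket off"
        trigger:
          - platform: state
            entity_id: sensor.voice_pe_button   # hypothetical button entity
            to: "long_press"                    # hypothetical state value
        action:
          - service: light.turn_off
            target:
              entity_id: light.bedroom          # hypothetical
          - service: switch.turn_off
            target:
              entity_id: switch.electric_blanket  # hypothetical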

Processing Pipeline

One major advantage for the commercial players such as Amazon, Google, and (to an extent) Apple is that they provide a fully hosted cloud service running the whole pipeline remotely, and can do so for free thanks to their economies of scale and their ability to sell or use your voice data for advertising.

For a voice assistant to work properly, all of the following services need to run, and ideally very quickly, to keep the delay before a response minimal:

  • Wake word processing to switch the device from an idle to a listening state
  • Speech-to-text processing, which takes the (possibly messy) audio and converts it into text, which is much easier for the next stage to work with
  • Intent processing, which takes the text and tries to figure out what needs to be done (e.g., "turn the light on" → light.state = on, or "when was X born" → "X was born on the Y")
  • Text-to-speech processing, which takes the textual response and converts it into a hopefully not-too-robotic-sounding reply, played back to you through the assistant

For people to be happy, wake word processing needs a very low false negative rate, and ideally quite a low false positive rate: people really don't like having to shout the wake word repeatedly, and they're normally quite annoyed when their voice assistant triggers without being called. Additionally, the whole pipeline needs to take 2-3 seconds at most to complete and start responding. This is easy in a data centre but quite tedious to achieve locally on very modest hardware.

openWakeWord, the wake word implementation built into the device, relies on thousands of samples from various people. These are collected through the Wake Word Collective project; unlike Amazon or Google, which have a much larger install base and can use voices captured during use to train their models, the project relies on people manually providing samples.

The STT and TTS parts of the pipeline are also very fast with a proprietary provider or Home Assistant Cloud, but are very difficult to set up and get running well on my host, a Raspberry Pi 4B. Faster machines running Home Assistant OS shouldn't have too much of an issue, but setup on a Container or Core installation is quite a faff, as you can't use the Whisper or Piper add-ons to get local STT and TTS working and instead have to run them in their own Docker containers.
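For the curious, here's roughly what that looks like: a minimal docker-compose sketch using the rhasspy Wyoming images (the model and voice choices are just examples, not recommendations):

    # docker-compose.yaml sketch for local STT/TTS over the Wyoming protocol
    services:
      whisper:
        image: rhasspy/wyoming-whisper
        command: --model tiny-int8 --language en   # small model for Pi-class hosts
        ports:
          - "10300:10300"   # conventional Wyoming port for Whisper
        volumes:
          - ./whisper-data:/data
      piper:
        image: rhasspy/wyoming-piper
        command: --voice en_US-lessac-medium       # example voice
        ports:
          - "10200:10200"   # conventional Wyoming port for Piper
        volumes:
          - ./piper-data:/data

Once they're up, each service can be added to HA through the Wyoming integration by pointing it at the host and port.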

Nabu Casa provide a free 31-day trial of Home Assistant Cloud before you have to pay $7.50/£6.50/mo for it, which is enough to test the device out and see if a local pipeline is for you. Unfortunately, when you're not the product, you can't have the service for free.

Intent Processing

Built-in

Currently, the only part of the pipeline that runs fully locally out of the box is the naïve intent matching flow, which takes what you said and tries to match it against sentence templates in its library. This works well if you're a robot and have been programmed to talk as such, but quickly gets annoying if you're a normal person whose phrasing doesn't quite match one of the templates.
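You can see (and extend) the template idea for yourself: Home Assistant lets you add custom sentence files that map extra phrasings onto built-in intents. A small sketch, with the phrasings themselves purely illustrative:

    # config/custom_sentences/en/lights.yaml
    # "{name}" matches an exposed entity's name; "[the]" marks an optional word
    language: "en"
    intents:
      HassTurnOn:
        data:
          - sentences:
              - "switch on [the] {name}"
              - "illuminate [the] {name}"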

The built-in processing is improving every day and is under very active development in a multitude of languages. For example, when I initially asked for the time between 12:00 pm and 1:00 pm, it would happily respond that it was 12:XX am. I went to submit a pull request, but a fix had already been merged before I'd even received my device.

The built-in processing is very fast, since it only compares what you said against a list of understood intents, and it can be the quickest and most lightweight part of the pipeline when you ask it something it understands.

Of course, this processing is never going to match what Amazon and Google have in place, as they have massive teams dedicated to updating and improving their devices with every failed intent an end user has. To bridge the gap between simple local processing and those commercial systems, we can make use of some large language models.

Local

Running a high-quality LLM on a Raspberry Pi 4 is simply not going to work well, or within any reasonable timeframe. I'm sure that if we loaded a reasonably small model, allocated enough swap space, and left it chugging along for several minutes (potentially hours), we might get something close to reasonable.

However, for "local" processing to be anywhere near fast enough, I had to use my workstation with a fairly modest Radeon 5600XT GPU, which was able to respond in anywhere between 2 seconds and a minute, depending on the request. Unfortunately, I don't leave my workstation running 24/7, so this doesn't work as a long-term solution.
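For anyone wanting to experiment with the same idea, one popular route (not necessarily the only one) is Ollama, which Home Assistant can talk to through its Ollama integration. A minimal compose sketch, with GPU passthrough left out for brevity:

    # docker-compose.yaml sketch for an Ollama server; 11434 is its default port
    services:
      ollama:
        image: ollama/ollama
        ports:
          - "11434:11434"
        volumes:
          - ollama-models:/root/.ollama   # persist downloaded models
    volumes:
      ollama-models:

After pulling a model (e.g. docker exec -it <container> ollama pull llama3.2), you point the integration at the host and port.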

Remote

Whilst Home Assistant Cloud does provide TTS and STT, it doesn't handle LLM processing, and it's easy enough to see why: LLMs are expensive to run and risky to offer under a fixed-cost model, as heavy users can run up far more cost than others and put you over budget.

Instead, you can integrate OpenAI into your processing pipeline and choose between a cheap or an expensive model. I can see an OpenAI integration being something Nabu Casa bundle or offer as an upgrade in future, but once set up with a card, spending limits, and auto top-up, the OpenAI API is very simple to use and the costs are entirely usage-based.

So far, I've been experimenting with the GPT-4o mini model, which has cost me a whopping $0.04 in the week I've been using it. Admittedly, this hasn't necessarily been a heavy week of questions for my voice assistant, but for likely less than a dollar a month, it's really cool to have an LLM I can talk to.
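For scale: at the time of writing, GPT-4o mini is priced at roughly $0.15 per million input tokens and $0.60 per million output tokens, so $0.04 works out to somewhere between ~65,000 tokens of pure output and ~265,000 tokens of pure input, far more than a week of voice commands actually generates.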

The "system prompt" needs a bit of tuning, as sometimes it likes to ramble on a bit, and it currently asks me to follow up without re-prompting. I think this is likely the way I'll stick with until (if) I get a GPU cluster up and running, and it can try to use the built in local intent processing first.

Exposing Entities

You have to manually expose the entities you want a voice assistant to be able to control, which I think is a good preventative measure to stop the assistant from messing everything up completely. It can get a bit confusing, though: sometimes my scenes don't seem to be included in what it can control, and you have to add a supported weather integration before it can tell you the current weather.

Music Playback

Finally, music! This is one of the main things I use my Echo for, and as yet I've not been able to get Music Assistant working with the same simplicity as the Echo; it seems slightly overcomplicated for what I need. Music Assistant aggregates lots of different providers and lets you play back on lots of different systems, which is great, but I only really need it to cast Spotify.

The Music Assistant UI is a much worse experience than the Spotify app, so I think I'll mainly use it as an interface for Spotify Connect. Its integration with the intent system also isn't working 100% correctly, so it struggles when I ask it to play something on Spotify.

I'm not running a supported setup, as most of my IoT devices are segregated onto a different VLAN, so it's my own fault really, but it's something I'm hoping to eventually fix.

Conclusion

So far, it's not been a smooth experience migrating to an open-source alternative voice assistant, but I expected as much. The technology is probably about on par with the initial Amazon Echo assistants, but without the massive user base and development budget that Amazon has behind it. Whilst there have been some "satellites" for Home Assistant in the past, the HA Voice PE is the biggest single step on the path to a fully local, DIY voice assistant.

I'm never going to expect a 1:1 match with a commercial competitor on features or ease of use, at least for a few years. I think Nabu Casa have done a good job marketing this version as the Preview Edition, as it is just that: a slice of what is to come, and something tinkerers can use but that probably isn't ready for the general public.

Based on how quickly the hardware sold out pretty much everywhere, there's clearly demand for this, but only for systems that come pre-built and are known to (sort of) work. I'm excited to see how the next iteration evolves from this one; personally, I'd quite like to see more microphones in the array and a better-diffused LED ring, but other than that I'm very happy with it so far. I love that the hardware1 and software are all open source, and that the whole pipeline can be customised and audited so you know your personal data isn't being sold.


  1. As of publication, the PCB schematics and KiCad files have still not been released.