I spend most of my day talking to AI agents in the terminal. Claude Code, ChatGPT, aider — you name it. And every time I have to type out a long, detailed prompt explaining what I want refactored, I think: why am I typing this when I could just say it?
WisprFlow exists. It's $15/month, it's cloud-only, and it sends your voice + screenshots to servers run by OpenAI and Meta. For someone working on proprietary code, that's a non-starter. So I built my own. It runs entirely on my local network, uses a Raspberry Pi Zero W and a Raspberry Pi Pico as a hardware dictation device, offloads inference to my Windows gaming PC's RTX 5070, and types the result into whatever computer is active on my KVM switch — no software installed on the host machines. Zero. It just looks like a keyboard.
Total hardware cost: about $40. Latency: under 700ms end-to-end. And it nails my technical jargon because I taught Whisper my vocabulary. Here's exactly how to build one.
What You'll Need
Hardware
| Item | Approx. Price | Notes |
|---|---|---|
| Raspberry Pi Zero W (or Zero 2 WH) | ~$15-19 | The brain. Get the WH variant (pre-soldered headers) to avoid soldering |
| Raspberry Pi Pico H | ~$7-9 | The USB keyboard bridge. H = pre-soldered headers |
| INMP441 I2S MEMS Microphone | ~$8-10 | Comes in 3-packs usually. You only need one |
| Micro SD Card (16-32GB) | ~$7 | For the Pi Zero's OS |
| Micro USB power supply (5V 2A+) | ~$10 | Powers the Pi Zero. Any decent phone charger works |
| Dupont jumper wires | ~$7 | Female-to-female for connecting everything |
| Push button or joystick | ~$3-8 | Push-to-talk trigger. I used a thumb joystick I had lying around |
| Micro USB to USB-A cable | ~$3 | For the Pico to KVM connection |
You'll also need access to:
- A PC with an NVIDIA GPU (RTX 20xx or newer) for running the inference server. I use a Windows gaming PC with an RTX 5070. This stays on your local network — it does all the heavy AI lifting.
- A KVM switch (any model with USB ports for keyboard/mouse). I use a KCeve KVM303.
Software / Dependencies
- Raspberry Pi OS Lite (32-bit) for the Pi Zero
- CircuitPython 9.x or 10.x for the Pico
- Python 3.x on the Pi Zero
- A FastAPI server running faster-whisper on your GPU PC
- Ollama with a small model (qwen2.5:1.5b or qwen3:1.7b) on the GPU PC
Architecture Overview
Here's the full signal path:
┌──────────────────────────────────────────────────────────────┐
│ YOUR DESK │
│ │
│ ┌──────────┐ ┌────────────────────────┐ │
│ │ INMP441 │───▶│ Raspberry Pi Zero W │ │
│ │ Mic │ │ │ │
│ └──────────┘ │ • Captures audio │ │
│ │ • Sends to server │ WiFi │
│ ┌──────────┐ │ • Receives clean text │◄──────────────┐ │
│ │ Joystick │───▶│ • Sends to Pico │ │ │
│ │ (button) │ └───────────┬────────────┘ │ │
│ └──────────┘ │ UART │ │
│ ┌───────────▼────────────┐ │ │
│ │ Raspberry Pi Pico │ │ │
│ │ │ │ │
│ │ • Receives text │ │ │
│ │ • Types as USB kbd │ │ │
│ └───────────┬────────────┘ │ │
│ │ USB │ │
│ ┌───────────▼────────────┐ │ │
│ │ KVM Switch │ │ │
│ │ │ │ │
│ ├──▶ Mac Mini │ │ │
│ ├──▶ MacBook Pro │ │ │
│ └──▶ Windows PC ─────────┼───────────────┘ │
│ (also runs the │
│ inference server) │
└──────────────────────────────────────────────────────────────┘
The flow is simple:
- You press the joystick button
- Pi Zero W starts recording from the INMP441 mic
- You release the button
- Pi Zero sends the audio over WiFi to your GPU PC
- GPU PC runs Whisper (transcription) + Ollama (text cleanup)
- Clean text comes back over WiFi
- Pi Zero sends it over UART to the Pico
- Pico types it as USB keyboard input through the KVM
- Text appears in whatever app is focused on your active machine
The host machines see nothing but a keyboard. No drivers, no software, no permissions. It just works.
Step-by-Step Setup
Part 1: Set Up the Inference Server (Windows PC)
This is your GPU-powered backend. I won't cover the full server build here, since the details depend on your hardware, but the essentials are:
- Install Ollama and pull a cleanup model:
ollama pull qwen2.5:1.5b
- Set up a FastAPI server running faster-whisper with CUDA. My server (called Murmur) exposes these endpoints:
- `POST /transcribe_raw` — accepts WAV/raw audio, returns transcribed text
- `POST /cleanup` — accepts text, returns LLM-cleaned text
- `GET /health` — returns server status
- Open Windows Firewall port 8787 (or whatever port your server runs on) for your local network.
- Note your PC's IP address (`ipconfig` in Command Prompt). You'll need it.
Part 2: Flash the Raspberry Pi Pico (USB Keyboard Bridge)
The Pico's only job is to receive text over UART and type it as a USB keyboard. This is what makes the whole thing work through a KVM switch.
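Concretely, the wire protocol is newline-terminated UTF-8 lines going down, with a single ACK (0x06) or NAK (0x15) byte coming back per line. Here's a standalone sketch of that framing logic; the function names are illustrative, not from the firmware:

```python
# Sketch of the newline-delimited UART framing used between the Zero and the Pico.
ACK, NAK = b"\x06", b"\x15"

def frame(text: str, chunk_size: int = 200) -> list[bytes]:
    """Split text into newline-terminated UTF-8 chunks sized for the Pico's buffer."""
    return [
        (text[i : i + chunk_size] + "\n").encode("utf-8")
        for i in range(0, len(text), chunk_size)
    ]

def parse_lines(buffer: bytearray) -> tuple[list[str], bytearray]:
    """Pull complete lines out of a receive buffer, keeping any partial tail."""
    lines = []
    while b"\n" in buffer:
        end = buffer.index(b"\n")
        lines.append(bytes(buffer[:end]).decode("utf-8").rstrip("\r"))
        buffer = buffer[end + 1 :]
    return lines, buffer
```

Waiting for the ACK after each chunk is what keeps the Pi from overrunning the Pico's small UART receive buffer while it's busy typing.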
Flash CircuitPython:
- Hold the BOOTSEL button on the Pico while plugging it into your Mac/PC via USB
- It mounts as a drive called RPI-RP2
- Download CircuitPython from circuitpython.org/board/raspberry_pi_pico (get the latest stable UF2)
- Copy the `.uf2` file onto the RPI-RP2 drive
- Pico reboots and remounts as CIRCUITPY
Install the HID library:
- Download the Adafruit CircuitPython Bundle matching your CircuitPython version (9.x or 10.x)
- Unzip it
- Copy the `lib/adafruit_hid/` folder to `CIRCUITPY/lib/`
Deploy the keyboard bridge firmware:
Create CIRCUITPY/code.py with the following:
"""
VoiceKey - Pi Pico HID Keyboard Bridge
=======================================
Receives text over UART from a Raspberry Pi Zero W,
and types it out as USB keyboard input to whatever computer
is connected via KVM switch.
Wiring:
Pico GP0 (pin 1) = UART TX -> Pi Zero RXD (GPIO 15, pin 10)
Pico GP1 (pin 2) = UART RX <- Pi Zero TXD (GPIO 14, pin 8)
Pico GND (pin 3) = GND -> Pi Zero GND (pin 14)
Pico USB -> KVM USB keyboard port
"""
import time
import board
import busio
import digitalio
import usb_hid
from adafruit_hid.keyboard import Keyboard
from adafruit_hid.keyboard_layout_us import KeyboardLayoutUS
from adafruit_hid.keycode import Keycode
# --- Configuration ---
UART_BAUD = 115200
KEYSTROKE_DELAY = 0.008 # 8ms between keystrokes, safe for most KVMs
MAX_BUFFER = 200
ACK = b'\x06'
NAK = b'\x15'
# --- Setup UART ---
uart = busio.UART(
tx=board.GP0,
rx=board.GP1,
baudrate=UART_BAUD,
receiver_buffer_size=256,
timeout=0.1
)
# --- Setup USB HID Keyboard ---
keyboard = Keyboard(usb_hid.devices)
layout = KeyboardLayoutUS(keyboard)
# --- Setup onboard LED ---
led = digitalio.DigitalInOut(board.LED)
led.direction = digitalio.Direction.OUTPUT
led.value = True
def blink(times=1, on_time=0.1, off_time=0.1):
for _ in range(times):
led.value = True
time.sleep(on_time)
led.value = False
time.sleep(off_time)
led.value = True
def type_text(text):
led.value = False
i = 0
while i < len(text):
char = text[i]
if i % 10 == 0:
led.value = not led.value
if char == '\n':
keyboard.press(Keycode.ENTER)
keyboard.release_all()
elif char == '\t':
keyboard.press(Keycode.TAB)
keyboard.release_all()
elif char == '\r':
i += 1
continue
else:
try:
layout.write(char)
except ValueError:
pass
time.sleep(KEYSTROKE_DELAY)
i += 1
keyboard.release_all()
led.value = True
def handle_command(cmd):
cmd = cmd.strip()
if cmd.startswith("SPEED:"):
try:
global KEYSTROKE_DELAY
ms = int(cmd.split(":")[1])
KEYSTROKE_DELAY = ms / 1000.0
uart.write(ACK)
except (ValueError, IndexError):
uart.write(NAK)
elif cmd == "PING":
uart.write(ACK)
elif cmd == "ENTER":
keyboard.press(Keycode.ENTER)
keyboard.release_all()
uart.write(ACK)
elif cmd == "TAB":
keyboard.press(Keycode.TAB)
keyboard.release_all()
uart.write(ACK)
elif cmd == "ESCAPE":
keyboard.press(Keycode.ESCAPE)
keyboard.release_all()
uart.write(ACK)
elif cmd.startswith("CTRL+"):
key_char = cmd.split("+")[1].upper()
keycode_map = {
'A': Keycode.A, 'B': Keycode.B, 'C': Keycode.C,
'D': Keycode.D, 'E': Keycode.E, 'F': Keycode.F,
'G': Keycode.G, 'H': Keycode.H, 'I': Keycode.I,
'J': Keycode.J, 'K': Keycode.K, 'L': Keycode.L,
'M': Keycode.M, 'N': Keycode.N, 'O': Keycode.O,
'P': Keycode.P, 'Q': Keycode.Q, 'R': Keycode.R,
'S': Keycode.S, 'T': Keycode.T, 'U': Keycode.U,
'V': Keycode.V, 'W': Keycode.W, 'X': Keycode.X,
'Y': Keycode.Y, 'Z': Keycode.Z,
}
if key_char in keycode_map:
keyboard.press(Keycode.CONTROL, keycode_map[key_char])
keyboard.release_all()
uart.write(ACK)
else:
uart.write(NAK)
elif cmd.startswith("CMD+"):
key_char = cmd.split("+")[1].upper()
keycode_map = {
'A': Keycode.A, 'C': Keycode.C, 'V': Keycode.V,
'X': Keycode.X, 'Z': Keycode.Z, 'S': Keycode.S,
}
if key_char in keycode_map:
keyboard.press(Keycode.GUI, keycode_map[key_char])
keyboard.release_all()
uart.write(ACK)
else:
uart.write(NAK)
else:
uart.write(NAK)
# --- Main Loop ---
print("VoiceKey Pico HID Bridge started")
print(f"UART: {UART_BAUD} baud, Keystroke delay: {KEYSTROKE_DELAY*1000:.0f}ms")
blink(3, 0.15, 0.15)
buffer = bytearray()
while True:
data = uart.read(256)
if data is not None:
buffer.extend(data)
while b'\n' in buffer:
line_end = buffer.index(b'\n')
line_bytes = bytes(buffer[:line_end])
buffer = buffer[line_end + 1:]
try:
line = line_bytes.decode('utf-8').rstrip('\r')
except UnicodeDecodeError:
uart.write(NAK)
continue
if not line:
uart.write(ACK)
continue
if line.startswith("CMD:"):
handle_command(line[4:])
else:
type_text(line)
uart.write(ACK)
if len(buffer) > MAX_BUFFER * 2:
buffer = bytearray()
uart.write(NAK)
time.sleep(0.01)
After saving code.py, the Pico restarts automatically. The onboard LED should blink 3 times then stay solid — that means it's running and waiting for UART input.
Quick test: Eject CIRCUITPY properly (diskutil eject /Volumes/CIRCUITPY on Mac), unplug from your computer, and plug the Pico into your KVM's USB port. It should be recognized as a keyboard.
Part 3: Wire Everything Up
Here's the complete wiring diagram. The Pi Zero W's GPIO header is your central hub — everything connects to it.
Pi Zero W GPIO Pinout Reference (relevant pins):
SD card side
┌───────────────────┐
│ (1) (2) │
│ ● ● 3V3 5V │ ← Pin 1: INMP441 VDD
│ (3) (4) │
│ ● ● │
│ (5) (6) │
│ ● ● GND │ ← Pin 6: INMP441 GND
│ (7) (8) │
│ ● ● TXD │ ← Pin 8: Pico GP1 (UART)
│ (9) (10) │
│ ● ● GND RXD │ ← Pin 9: INMP441 L/R
│(11) (12) │ Pin 10: Pico GP0 (UART)
│ ● ● G18 │ ← Pin 12: INMP441 SCK
│(13) (14) │
│ ● ● GND │ ← Pin 14: Pico GND
│(15) (16) │
│ ● ● G23 │ ← Pin 16: Joystick D (button)
│(17) (18) │
│ ● ● G24 │ ← Pin 18: Status LED (optional)
│(19) (20) │
│ ● ● GND │ ← Pin 20: Joystick GND
│ ... │
│(35) (36) │
│ ● ● G19 │ ← Pin 35: INMP441 WS
│(37) (38) │
│ ● ● G20 │ ← Pin 38: INMP441 SD (data)
│(39) (40) │
│ ● ● │
└───────────────────┘
USB ports side
Connection summary:
INMP441 Microphone (6 wires):
INMP441 VDD → Pi Zero Pin 1 (3.3V) ⚠️ NOT 5V — that will fry it
INMP441 GND → Pi Zero Pin 6 (GND)
INMP441 SCK → Pi Zero Pin 12 (GPIO 18)
INMP441 WS → Pi Zero Pin 35 (GPIO 19)
INMP441 SD → Pi Zero Pin 38 (GPIO 20)
INMP441 L/R → Pi Zero Pin 9 (GND — selects left channel)
Raspberry Pi Pico UART (3 wires):
Pi Zero Pin 8 (TXD) → Pico GP1 (pin 2)
Pi Zero Pin 10 (RXD) ← Pico GP0 (pin 1)
Pi Zero Pin 14 (GND) → Pico GND (pin 3)
Joystick / Push Button (2 wires):
Button terminal → Pi Zero Pin 16 (GPIO 23)
Button GND → Pi Zero Pin 20 (GND)
Pico USB output:
Pico micro USB → KVM switch USB port (keyboard input)
Part 4: Set Up the Raspberry Pi Zero W
Flash a fresh SD card using Raspberry Pi Imager:
- Device: Raspberry Pi Zero W
- OS: Raspberry Pi OS Lite (32-bit)
- Configure: hostname `voicekey`, enable SSH, set WiFi credentials, set username/password
Insert the card, power up the Pi Zero, wait 60-90 seconds, then SSH in:
ssh pi@voicekey.local
Heads up: If you've previously SSH'd to a device called `voicekey.local`, you may get a host key warning. Fix with: `ssh-keygen -R voicekey.local`
Now run through the setup. I've broken this into individual steps so you can verify each one works before moving on.
Step 1: Install system dependencies
sudo apt update && sudo apt install -y \
python3-pip python3-venv python3-dev \
libportaudio2 portaudio19-dev libopenblas-dev sox
This takes a few minutes on the Pi Zero W. The Pi Zero is slow — embrace it.
Step 2: Configure I2S audio for the INMP441 mic
sudo nano /boot/firmware/config.txt
Add at the bottom:
dtoverlay=googlevoicehat-soundcard
dtparam=i2s=on
Save (Ctrl+O, Enter, Ctrl+X).
Note: On older Raspberry Pi OS versions, the path is `/boot/config.txt` instead of `/boot/firmware/config.txt`.
Add the sound module:
echo "snd-bcm2835" | sudo tee -a /etc/modules
Step 3: Configure UART
sudo raspi-config
Navigate to Interface Options → Serial Port:
- Login shell over serial: No
- Serial hardware enabled: Yes
Disable the serial console service (this is critical — if you skip this, the UART output will be garbled with login prompts):
sudo systemctl disable serial-getty@ttyS0.service
sudo sed -i 's/console=serial0,[0-9]* //g' /boot/firmware/cmdline.txt
Step 4: Reboot and verify hardware
sudo reboot
SSH back in after ~60 seconds:
ssh pi@voicekey.local
Verify the mic is detected:
arecord -l
You should see `snd_rpi_googlevoicehat_soundcar` listed (ALSA truncates the card name). If it shows nothing, double-check your `/boot/firmware/config.txt` edits.
Verify UART:
ls -la /dev/serial0
Should show a symlink to /dev/ttyS0.
Test the mic (speak or clap during the 3-second recording):
arecord -D plughw:1,0 -f S32_LE -r 16000 -c 1 -d 3 /tmp/test.wav
ls -la /tmp/test.wav
File should be ~192KB. You can copy it to your computer to listen:
# Run on your Mac/PC, not the Pi:
scp pi@voicekey.local:/tmp/test.wav ~/Downloads/
Tip: The INMP441 is a small MEMS mic — audio will sound faint. That's normal. We apply a 20dB gain boost in the daemon, and Whisper handles quiet audio remarkably well.
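For reference, a decibel gain maps to a linear amplitude factor of 10^(dB/20), so the 20dB boost multiplies each sample's amplitude by 10:

```python
# dB gain -> linear amplitude multiplier (what the sox "gain" effect applies)
def db_to_amplitude(db: float) -> float:
    return 10 ** (db / 20)

print(db_to_amplitude(20))  # the daemon's 20 dB default = 10x amplitude
```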
Test UART → Pico → KVM (open a text editor on your active KVM machine first):
echo "hello from voicekey" > /dev/serial0
If the text appears in your editor, the full hardware chain is working.
Test server connectivity (replace with your GPU PC's IP):
curl http://10.0.0.115:8787/health
Should return {"status":"ok",...}.
All four checks must pass before continuing.
Step 5: Set up Python environment and install dependencies
python3 -m venv ~/voicekey
source ~/voicekey/bin/activate
pip install requests pyserial pyyaml numpy sounddevice gpiozero lgpio
This is the slow step — 10-15 minutes on the Pi Zero W. Go get coffee.
Verify everything installed:
python3 -c "import requests; import serial; import numpy; import sounddevice; from gpiozero import Button; print('All good')"
Step 6: Create the configuration file
cat > ~/voicekey/config.yaml << 'EOF'
server_url: "http://10.0.0.115:8787"
server_timeout_connect: 5
server_timeout_response: 30
uart_port: "/dev/serial0"
uart_baud: 115200
button_gpio: 23
led_gpio: 24
min_record_seconds: 0.3
max_record_seconds: 180
chunk_size: 200
audio_sample_rate: 16000
audio_gain_db: 20
log_file: "/home/pi/voicekey/voicekey.log"
EOF
Important: Replace the server_url IP with your actual GPU PC's IP address. Consider setting a static IP / DHCP reservation in your router so it doesn't change.
Step 7: Deploy the daemon
This is the main orchestrator — it ties audio capture, server communication, and Pico output together.
cat > ~/voicekey/daemon.py << 'DAEMONEOF'
"""
VoiceKey Daemon for Raspberry Pi Zero W
========================================
Push the joystick button to record, release to process.
Audio is sent to the Murmur server for transcription + cleanup,
then the result is typed out via the Pico USB keyboard bridge.
"""
import os
import io
import time
import wave
import struct
import logging
import subprocess
import threading
import yaml
import numpy as np
import requests
import serial
from gpiozero import Button, LED
from signal import pause
# --- Load Config ---
CONFIG_PATH = os.path.join(os.path.dirname(__file__), "config.yaml")
DEFAULT_CONFIG = {
"server_url": "http://10.0.0.115:8787",
"server_timeout_connect": 5,
"server_timeout_response": 30,
"uart_port": "/dev/serial0",
"uart_baud": 115200,
"button_gpio": 23,
"led_gpio": 24,
"min_record_seconds": 0.3,
"max_record_seconds": 180,
"chunk_size": 200,
"audio_sample_rate": 16000,
"audio_gain_db": 20,
"log_file": "/home/pi/voicekey/voicekey.log",
}
def load_config():
config = DEFAULT_CONFIG.copy()
if os.path.exists(CONFIG_PATH):
with open(CONFIG_PATH, "r") as f:
user_config = yaml.safe_load(f)
if user_config:
config.update(user_config)
return config
config = load_config()
# --- Logging ---
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
handlers=[
logging.FileHandler(config["log_file"]),
logging.StreamHandler(),
],
)
log = logging.getLogger("voicekey")
# --- State ---
class State:
IDLE = "IDLE"
LISTENING = "LISTENING"
PROCESSING = "PROCESSING"
TYPING = "TYPING"
ERROR = "ERROR"
current_state = State.IDLE
recording_process = None
recording_file = "/tmp/voicekey_recording.wav"
raw_audio_file = "/tmp/voicekey_raw.raw"
# --- Hardware Setup ---
button = Button(config["button_gpio"], pull_up=True, bounce_time=0.05)
led = None
try:
led = LED(config["led_gpio"])
except Exception as e:
log.warning(f"Could not initialize LED on GPIO {config['led_gpio']}: {e}")
# --- LED Control ---
led_blink_thread = None
led_blink_stop = threading.Event()
def set_led(state):
global led_blink_thread
if led is None:
return
led_blink_stop.set()
if led_blink_thread and led_blink_thread.is_alive():
led_blink_thread.join(timeout=1)
led_blink_stop.clear()
if state == State.IDLE:
led.off()
elif state == State.LISTENING:
led.on()
elif state == State.PROCESSING:
def blink_fast():
while not led_blink_stop.is_set():
led.toggle()
led_blink_stop.wait(0.1)
led.off()
led_blink_thread = threading.Thread(target=blink_fast, daemon=True)
led_blink_thread.start()
elif state == State.TYPING:
def blink_slow():
while not led_blink_stop.is_set():
led.toggle()
led_blink_stop.wait(0.5)
led.off()
led_blink_thread = threading.Thread(target=blink_slow, daemon=True)
led_blink_thread.start()
elif state == State.ERROR:
def blink_error():
for _ in range(3):
if led_blink_stop.is_set():
break
led.on()
led_blink_stop.wait(0.15)
led.off()
led_blink_stop.wait(0.15)
led.off()
led_blink_thread = threading.Thread(target=blink_error, daemon=True)
led_blink_thread.start()
def set_state(new_state):
global current_state
current_state = new_state
log.info(f"State: {new_state}")
set_led(new_state)
# --- UART Setup ---
def open_uart():
try:
ser = serial.Serial(config["uart_port"], config["uart_baud"], timeout=30)
log.info(f"UART opened: {config['uart_port']} @ {config['uart_baud']}")
return ser
except Exception as e:
log.error(f"Failed to open UART: {e}")
return None
uart = open_uart()
def send_to_pico(text):
global uart
if uart is None:
uart = open_uart()
if uart is None:
log.error("UART not available, cannot send to Pico")
return False
chunk_size = config["chunk_size"]
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
for i, chunk in enumerate(chunks):
try:
uart.write((chunk + "\n").encode("utf-8"))
uart.flush()
log.info(f"Sent chunk {i+1}/{len(chunks)}: {len(chunk)} chars")
ack = uart.read(1)
if ack == b'\x06':
log.info(f"ACK received for chunk {i+1}")
elif ack == b'\x15':
log.warning(f"NAK received for chunk {i+1}")
return False
else:
log.warning(f"No ACK received for chunk {i+1}, got: {ack}")
except Exception as e:
log.error(f"UART send error: {e}")
uart = None
return False
return True
# --- Audio Recording ---
def start_recording():
global recording_process, current_state
if current_state != State.IDLE:
log.warning(f"Cannot start recording in state {current_state}")
return
set_state(State.LISTENING)
try:
for f in [recording_file, raw_audio_file]:
if os.path.exists(f):
os.remove(f)
recording_process = subprocess.Popen(
[
"arecord", "-D", "plughw:1,0", "-f", "S32_LE",
"-r", str(config["audio_sample_rate"]),
"-c", "1", "-d", str(config["max_record_seconds"]),
recording_file,
],
stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
)
log.info("Recording started")
except Exception as e:
log.error(f"Failed to start recording: {e}")
set_state(State.ERROR)
time.sleep(1)
set_state(State.IDLE)
def stop_recording_and_process():
global recording_process
if current_state != State.LISTENING:
log.warning(f"Cannot stop recording in state {current_state}")
return
if recording_process:
recording_process.terminate()
recording_process.wait(timeout=5)
recording_process = None
log.info("Recording stopped")
if not os.path.exists(recording_file):
log.warning("No recording file found")
set_state(State.IDLE)
return
file_size = os.path.getsize(recording_file)
min_bytes = int(config["min_record_seconds"] * 16000 * 4) + 44
if file_size < min_bytes:
log.warning(f"Recording too short ({file_size} bytes), ignoring")
set_state(State.IDLE)
return
set_state(State.PROCESSING)
try:
sox_result = subprocess.run(
[
"sox", recording_file, "-t", "raw",
"-r", str(config["audio_sample_rate"]),
"-c", "1", "-e", "floating-point", "-b", "32",
raw_audio_file, "gain", str(config["audio_gain_db"]),
],
capture_output=True, text=True, timeout=30,
)
if sox_result.returncode != 0:
log.error(f"Sox conversion failed: {sox_result.stderr}")
set_state(State.ERROR)
time.sleep(1)
set_state(State.IDLE)
return
log.info("Audio converted to raw float32 with gain boost")
transcription = transcribe_audio(raw_audio_file)
if not transcription:
log.warning("Empty transcription, nothing to type")
set_state(State.IDLE)
return
log.info(f"Transcription: {transcription}")
cleaned = cleanup_text(transcription)
log.info(f"Cleaned: {cleaned}")
set_state(State.TYPING)
success = send_to_pico(cleaned)
if success:
log.info("Text sent to Pico successfully")
else:
log.warning("Failed to send text to Pico")
except Exception as e:
log.error(f"Processing error: {e}")
set_state(State.ERROR)
time.sleep(1)
finally:
set_state(State.IDLE)
for f in [recording_file, raw_audio_file]:
try:
if os.path.exists(f):
os.remove(f)
except:
pass
# --- Server Communication ---
def transcribe_audio(raw_file_path):
url = f"{config['server_url']}/transcribe_raw"
timeout = (config["server_timeout_connect"], config["server_timeout_response"])
try:
with open(raw_file_path, "rb") as f:
files = {"audio": ("audio.raw", f, "application/octet-stream")}
data = {"sample_rate": str(config["audio_sample_rate"])}
response = requests.post(url, files=files, data=data, timeout=timeout)
if response.status_code == 200:
result = response.json()
text = result.get("text", "").strip()
timing = result.get("timing", {})
log.info(
f"Transcription timing: {timing.get('transcription_ms', '?')}ms, "
f"RTF: {timing.get('realtime_factor', '?')}"
)
return text
else:
log.error(f"Transcription failed: HTTP {response.status_code}")
return None
except requests.exceptions.ConnectionError:
log.error(f"Cannot reach server at {config['server_url']}")
return None
except requests.exceptions.Timeout:
log.error("Server request timed out")
return None
except Exception as e:
log.error(f"Transcription error: {e}")
return None
def cleanup_text(text):
url = f"{config['server_url']}/cleanup"
timeout = (config["server_timeout_connect"], config["server_timeout_response"])
if len(text.split()) < 4:
cleaned = text.strip()
if cleaned and cleaned[0].islower():
cleaned = cleaned[0].upper() + cleaned[1:]
if cleaned and cleaned[-1] not in ".!?":
cleaned += "."
return cleaned
try:
response = requests.post(url, json={"text": text}, timeout=timeout)
if response.status_code == 200:
result = response.json()
cleaned = result.get("cleaned_text", text).strip()
timing = result.get("timing", {})
log.info(f"Cleanup timing: {timing.get('total_ms', '?')}ms")
return cleaned if cleaned else text
else:
log.warning(f"Cleanup failed: HTTP {response.status_code}, using raw text")
return text
except Exception as e:
log.warning(f"Cleanup error: {e}, using raw text")
return text
# --- Button Handlers ---
def on_button_press():
log.info("Button pressed")
start_recording()
def on_button_release():
log.info("Button released")
threading.Thread(target=stop_recording_and_process, daemon=True).start()
# --- Main ---
def main():
log.info("=" * 50)
log.info("VoiceKey Daemon Starting")
log.info(f"Server: {config['server_url']}")
log.info(f"UART: {config['uart_port']} @ {config['uart_baud']}")
log.info(f"Button: GPIO {config['button_gpio']}")
log.info(f"Audio gain: {config['audio_gain_db']}dB")
log.info("=" * 50)
try:
r = requests.get(f"{config['server_url']}/health", timeout=5)
health = r.json()
log.info(f"Server OK: whisper={health.get('whisper_model')}, "
f"device={health.get('whisper_device')}")
except Exception as e:
log.warning(f"Server not reachable: {e} -- will retry on first dictation")
if uart:
uart.write(b"CMD:PING\n")
ack = uart.read(1)
if ack == b'\x06':
log.info("Pico connected and responding")
else:
log.warning("Pico not responding to PING -- check wiring")
button.when_pressed = on_button_press
button.when_released = on_button_release
set_state(State.IDLE)
log.info("Ready! Press the joystick to dictate.")
try:
pause()
except KeyboardInterrupt:
log.info("Shutting down...")
set_led(State.IDLE)
if uart:
uart.close()
if __name__ == "__main__":
main()
DAEMONEOF
Step 8: Test the daemon
source ~/voicekey/bin/activate
cd ~/voicekey
python3 daemon.py
You should see:
VoiceKey Daemon Starting
Server: http://10.0.0.115:8787
UART: /dev/serial0 @ 115200
Button: GPIO 23
Audio gain: 20dB
Server OK: whisper=small.en, device=cuda
Pico connected and responding
State: IDLE
Ready! Press the joystick to dictate.
Open a text editor on your active KVM machine. Press the joystick, say something, release. Watch the magic happen.
Press Ctrl+C to stop the daemon after verifying it works.
Step 9: Set up auto-start on boot
sudo bash -c 'cat > /etc/systemd/system/voicekey.service << EOF
[Unit]
Description=VoiceKey Daemon
After=network-online.target
Wants=network-online.target
[Service]
ExecStart=/home/pi/voicekey/bin/python3 /home/pi/voicekey/daemon.py
WorkingDirectory=/home/pi/voicekey
User=root
Restart=on-failure
RestartSec=3
[Install]
WantedBy=multi-user.target
EOF'
sudo systemctl daemon-reload
sudo systemctl enable voicekey
sudo systemctl start voicekey
Verify:
sudo systemctl status voicekey
VoiceKey now starts automatically on every boot. Plug in power, wait 60 seconds, dictate.
It's Working — Now What?
When everything is humming, your logs (visible with journalctl -u voicekey -f) should look something like this:
State: IDLE
Ready! Press the joystick to dictate.
Button pressed
State: LISTENING
Recording started
Button released
Recording stopped
State: PROCESSING
Audio converted to raw float32 with gain boost
Transcription timing: 182.12ms, RTF: 0.04
Transcription: I'm refactoring the ViewModel to use StateFlow instead of LiveData
Cleanup timing: 342ms
Cleaned: I'm refactoring the ViewModel to use StateFlow instead of LiveData.
State: TYPING
Sent chunk 1/1: 67 chars
ACK received for chunk 1
Text sent to Pico successfully
State: IDLE
That RTF: 0.04 means transcription ran 25x faster than real-time on the RTX 5070. Total end-to-end latency from button release to text appearing: under 700ms.
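RTF is just processing time divided by audio duration, and the speedup is its reciprocal. A quick sanity check using the 182ms figure from the log (the clip length here is my assumption):

```python
# Real-time factor: processing time / audio duration
transcription_s = 0.182   # 182.12 ms transcription time from the log
audio_s = 5.0             # assumed ~5 s clip (illustrative)
rtf = transcription_s / audio_s   # ~0.036
speedup = 1 / rtf                 # ~27x faster than real-time
```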
Teach Whisper Your Vocabulary
This is the single biggest accuracy improvement you can make. Without a vocabulary prompt, Whisper will mangle your technical terms: "Hilt" becomes "hilled," "StateFlow" becomes "state flow," "gRPC" becomes "grip C."
Add a vocabulary.txt file to your inference server containing your most-used technical terms:
ViewModel, StateFlow, MutableStateFlow, LiveData, Hilt, Dagger,
Kotlin, Jetpack Compose, coroutine, LaunchedEffect, Retrofit,
OkHttp, gRPC, protobuf, Gradle, ProGuard, R8, Room, NavGraph,
RecyclerView, suspend fun, rememberSaveable, LazyColumn, composable,
NavHostController, serialization, deserialization, idempotent, Kafka
Pass this as the initial_prompt parameter to faster-whisper's transcribe() call on the server side. The difference is dramatic — after adding my vocabulary, Whisper nailed every single Android/Kotlin term I threw at it.
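If you'd rather generate the prompt from the file than hardcode it, something like this works. The `load_vocabulary_prompt` helper and the "Glossary:" framing are my own sketch; only the `initial_prompt` parameter is actual faster-whisper API:

```python
# Hypothetical helper: turn vocabulary.txt into an initial_prompt string.
def load_vocabulary_prompt(path: str, max_terms: int = 100) -> str:
    with open(path) as f:
        raw = f.read().replace("\n", ",")
    terms = [t.strip() for t in raw.split(",") if t.strip()]
    # Whisper's prompt window is small (~224 tokens), so cap the term list
    return "Glossary: " + ", ".join(terms[:max_terms]) + "."

# On the server, pass it through to faster-whisper:
# segments, info = model.transcribe(audio, initial_prompt=load_vocabulary_prompt("vocabulary.txt"))
```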
Gotchas & Tips
The serial console will eat your UART. This was my biggest debugging headache. If you don't disable serial-getty@ttyS0.service and remove console=serial0 from cmdline.txt, the Pi Zero sends login prompts over UART. The Pico dutifully types them into your text editor. Ask me how I know.
The INMP441 mic is quiet. It's a tiny MEMS microphone, not a studio condenser. The daemon applies a 20dB gain boost via sox before sending to the server. You can adjust this in config.yaml (audio_gain_db). Whisper is remarkably good at handling quiet audio, so even without the boost it often works — but the boost helps.
First transcription is slow. The first time you dictate after starting the server, Whisper needs to load the model into GPU memory. This can take several seconds. Every subsequent transcription is fast (~180ms for a 5-second clip on an RTX 5070). Just do a throwaway test dictation after starting the server.
The Pi Zero W is slow to set up, fast to run. Installing packages takes ages on the single-core ARM11 CPU. But the actual daemon runtime is minimal — it's just recording audio, making HTTP calls, and writing to UART. None of that is CPU intensive.
Set a static IP for your GPU PC. If your PC gets a new IP from DHCP, VoiceKey stops working until you update config.yaml. Set a DHCP reservation in your router for the PC.
KVM keystroke speed. Some KVM switches drop keystrokes if they come too fast. The default 8ms delay between keystrokes works for my KCeve KVM303. If you see dropped characters, increase it in config.yaml or send CMD:SPEED:15 over UART for 15ms delay.
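Note the tradeoff: the per-keystroke delay caps typing throughput, so slowing it down makes long dictations land noticeably later. A back-of-envelope estimate (plain arithmetic, using the defaults above):

```python
# Seconds the Pico spends typing a message at a given per-keystroke delay
def typing_seconds(num_chars: int, delay_s: float = 0.008) -> float:
    return num_chars * delay_s

# A full 200-char chunk: ~1.6 s at 8 ms/keystroke, ~3 s at 15 ms
```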
Don't plug the INMP441 VDD into 5V. It's a 3.3V device. Pin 1 (3.3V), not Pin 2 (5V). I didn't fry one, but only because I triple-checked before powering on.
What's Next
This is v1. It works, it's fast, and it cost less than three months of a WisprFlow subscription. But there's a lot of room to grow:
- Context-aware modes — detect which app is focused (Slack vs Cursor vs Mail) and adjust the LLM cleanup prompt accordingly. Dictating code should produce code, not English sentences.
- Voice commands — "select all and copy" → Cmd+A, Cmd+C. The Pico already supports `CMD:`-prefixed commands for key combos.
- A proper enclosure — right now it's a tangle of wires and bare boards on my desk. A 3D-printed case would make it look like a real product.
- The 1602A LCD display I found in a drawer — wire it up for status, diagnostics, and scrolling transcription preview.
I'll cover some of these in future posts. If you want to see the build process, wiring, and live demo, check out the companion video on my YouTube channel @droidchef.
If you build one of these, I want to see it. Drop a comment here or on the video!
Happy dictating.
— Ishan