I spend most of my day talking to AI agents in the terminal. Claude Code, ChatGPT, aider — you name it. And every time I have to type out a long, detailed prompt explaining what I want refactored, I think: why am I typing this when I could just say it?
WisprFlow exists. It's $15/month, it's cloud-only, and it sends your voice + screenshots to servers run by OpenAI and Meta. For someone working on proprietary code, that's a non-starter. So I built my own. It runs entirely on my local network, uses a Raspberry Pi Zero W and a Raspberry Pi Pico as a hardware dictation device, offloads inference to my Windows gaming PC's RTX 5070, and types the result into whatever computer is active on my KVM switch — no software installed on the host machines. Zero. It just looks like a keyboard.
Total hardware cost: about $40. Latency: under 700ms end-to-end. And it nails my technical jargon because I taught Whisper my vocabulary. Here's exactly how to build one.
What You'll Need
Hardware
| Item | Approx. Price | Notes |
|---|---|---|
| Raspberry Pi Zero W (or Zero 2 WH) | ~$15-19 | The brain. Get the WH variant (pre-soldered headers) to avoid soldering |
| Raspberry Pi Pico H | ~$7-9 | The USB keyboard bridge. H = pre-soldered headers |
| INMP441 I2S MEMS Microphone | ~$8-10 | Comes in 3-packs usually. You only need one |
| Micro SD Card (16-32GB) | ~$7 | For the Pi Zero's OS |
| Micro USB power supply (5V 2A+) | ~$10 | Powers the Pi Zero. Any decent phone charger works |
| Dupont jumper wires | ~$7 | Female-to-female for connecting everything |
| Push button or joystick | ~$3-8 | Push-to-talk trigger. I used a thumb joystick I had lying around |
| Micro USB to USB-A cable | ~$3 | For the Pico to KVM connection |
You'll also need access to:
- A PC with an NVIDIA GPU (RTX 20xx or newer) for running the inference server. I use a Windows gaming PC with an RTX 5070. This stays on your local network — it does all the heavy AI lifting.
- A KVM switch (any model with USB ports for keyboard/mouse). I use a KCeve KVM303.
Software / Dependencies
- Raspberry Pi OS Lite (32-bit) for the Pi Zero
- CircuitPython 9.x or 10.x for the Pico
- Python 3.x on the Pi Zero
- A FastAPI server running faster-whisper on your GPU PC
- Ollama with a small model (qwen2.5:1.5b or qwen3:1.7b) on the GPU PC
Architecture Overview
Here's the full signal path:
┌──────────────────────────────────────────────────────────────┐
│ YOUR DESK │
│ │
│ ┌──────────┐ ┌────────────────────────┐ │
│ │ INMP441 │───▶│ Raspberry Pi Zero W │ │
│ │ Mic │ │ │ │
│ └──────────┘ │ • Captures audio │ │
│ │ • Sends to server │ WiFi │
│ ┌──────────┐ │ • Receives clean text │◄──────────────┐ │
│ │ Joystick │───▶│ • Sends to Pico │ │ │
│ │ (button) │ └───────────┬────────────┘ │ │
│ └──────────┘ │ UART │ │
│ ┌───────────▼────────────┐ │ │
│ │ Raspberry Pi Pico │ │ │
│ │ │ │ │
│ │ • Receives text │ │ │
│ │ • Types as USB kbd │ │ │
│ └───────────┬────────────┘ │ │
│ │ USB │ │
│ ┌───────────▼────────────┐ │ │
│ │ KVM Switch │ │ │
│ │ │ │ │
│ ├──▶ Mac Mini │ │ │
│ ├──▶ MacBook Pro │ │ │
│ └──▶ Windows PC ─────────┼───────────────┘ │
│ (also runs the │
│ inference server) │
└──────────────────────────────────────────────────────────────┘
The flow is simple:
- You press the joystick button
- Pi Zero W starts recording from the INMP441 mic
- You release the button
- Pi Zero sends the audio over WiFi to your GPU PC
- GPU PC runs Whisper (transcription) + Ollama (text cleanup)
- Clean text comes back over WiFi
- Pi Zero sends it over UART to the Pico
- Pico types it as USB keyboard input through the KVM
- Text appears in whatever app is focused on your active machine
The host machines see nothing but a keyboard. No drivers, no software, no permissions. It just works.
Step-by-Step Setup
Part 1: Set Up the Inference Server (Windows PC)
This is your GPU-powered backend. I won't cover the full server build here, since the details depend on your hardware, but the essentials are:
- Install Ollama and pull a cleanup model:
ollama pull qwen2.5:1.5b
- Set up a FastAPI server running faster-whisper with CUDA. My server (called Murmur) exposes these endpoints:
- `POST /transcribe_raw` — accepts WAV/raw audio, returns transcribed text
- `POST /cleanup` — accepts text, returns LLM-cleaned text
- `GET /health` — returns server status
- Open Windows Firewall port 8787 (or whatever port your server runs on) for your local network.
- Note your PC's IP address (`ipconfig` in Command Prompt). You'll need it.
Part 2: Flash the Raspberry Pi Pico (USB Keyboard Bridge)
The Pico's only job is to receive text over UART and type it as a USB keyboard. This is what makes the whole thing work through a KVM switch.
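Concretely, the wire protocol is newline-terminated UTF-8 lines going down, with a single ACK (0x06) or NAK (0x15) byte coming back per line. Here's a standalone sketch of that framing logic; the function names are illustrative, not from the firmware:

```python
# Sketch of the newline-delimited UART framing used between the Zero and the Pico.
ACK, NAK = b"\x06", b"\x15"

def frame(text: str, chunk_size: int = 200) -> list[bytes]:
    """Split text into newline-terminated UTF-8 chunks sized for the Pico's buffer."""
    return [
        (text[i : i + chunk_size] + "\n").encode("utf-8")
        for i in range(0, len(text), chunk_size)
    ]

def parse_lines(buffer: bytearray) -> tuple[list[str], bytearray]:
    """Pull complete lines out of a receive buffer, keeping any partial tail."""
    lines = []
    while b"\n" in buffer:
        end = buffer.index(b"\n")
        lines.append(bytes(buffer[:end]).decode("utf-8").rstrip("\r"))
        buffer = buffer[end + 1 :]
    return lines, buffer
```

Waiting for the ACK after each chunk is what keeps the Pi from overrunning the Pico's small UART receive buffer while it's busy typing.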
Flash CircuitPython:
- Hold the BOOTSEL button on the Pico while plugging it into your Mac/PC via USB
- It mounts as a drive called RPI-RP2
- Download CircuitPython from circuitpython.org/board/raspberry_pi_pico (get the latest stable UF2)
- Copy the `.uf2` file onto the RPI-RP2 drive
- Pico reboots and remounts as CIRCUITPY
Install the HID library:
- Download the Adafruit CircuitPython Bundle matching your CircuitPython version (9.x or 10.x)
- Unzip it
- Copy the `lib/adafruit_hid/` folder to `CIRCUITPY/lib/`
Deploy the keyboard bridge firmware:
Create CIRCUITPY/code.py with the following:
"""
VoiceKey - Pi Pico HID Keyboard Bridge
=======================================
Receives text over UART from a Raspberry Pi Zero W,
and types it out as USB keyboard input to whatever computer
is connected via KVM switch.
Wiring:
Pico GP0 (pin 1) = UART TX -> Pi Zero RXD (GPIO 15, pin 10)
Pico GP1 (pin 2) = UART RX <- Pi Zero TXD (GPIO 14, pin 8)
Pico GND (pin 3) = GND -> Pi Zero GND (pin 14)
Pico USB -> KVM USB keyboard port
"""
import time
import board
import busio
import digitalio
import usb_hid
from adafruit_hid.keyboard import Keyboard
from adafruit_hid.keyboard_layout_us import KeyboardLayoutUS
from adafruit_hid.keycode import Keycode
# --- Configuration ---
UART_BAUD = 115200
KEYSTROKE_DELAY = 0.008 # 8ms between keystrokes, safe for most KVMs
MAX_BUFFER = 200
ACK = b'\x06'
NAK = b'\x15'
# --- Setup UART ---
uart = busio.UART(
tx=board.GP0,
rx=board.GP1,
baudrate=UART_BAUD,
receiver_buffer_size=256,
timeout=0.1
)
# --- Setup USB HID Keyboard ---
keyboard = Keyboard(usb_hid.devices)
layout = KeyboardLayoutUS(keyboard)
# --- Setup onboard LED ---
led = digitalio.DigitalInOut(board.LED)
led.direction = digitalio.Direction.OUTPUT
led.value = True
def blink(times=1, on_time=0.1, off_time=0.1):
for _ in range(times):
led.value = True
time.sleep(on_time)
led.value = False
time.sleep(off_time)
led.value = True
def type_text(text):
led.value = False
i = 0
while i < len(text):
char = text[i]
if i % 10 == 0:
led.value = not led.value
if char == '\n':
keyboard.press(Keycode.ENTER)
keyboard.release_all()
elif char == '\t':
keyboard.press(Keycode.TAB)
keyboard.release_all()
elif char == '\r':
i += 1
continue
else:
try:
layout.write(char)
except ValueError:
pass
time.sleep(KEYSTROKE_DELAY)
i += 1
keyboard.release_all()
led.value = True
def handle_command(cmd):
cmd = cmd.strip()
if cmd.startswith("SPEED:"):
try:
global KEYSTROKE_DELAY
ms = int(cmd.split(":")[1])
KEYSTROKE_DELAY = ms / 1000.0
uart.write(ACK)
except (ValueError, IndexError):
uart.write(NAK)
elif cmd == "PING":
uart.write(ACK)
elif cmd == "ENTER":
keyboard.press(Keycode.ENTER)
keyboard.release_all()
uart.write(ACK)
elif cmd == "TAB":
keyboard.press(Keycode.TAB)
keyboard.release_all()
uart.write(ACK)
elif cmd == "ESCAPE":
keyboard.press(Keycode.ESCAPE)
keyboard.release_all()
uart.write(ACK)
elif cmd.startswith("CTRL+"):
key_char = cmd.split("+")[1].upper()
keycode_map = {
'A': Keycode.A, 'B': Keycode.B, 'C': Keycode.C,
'D': Keycode.D, 'E': Keycode.E, 'F': Keycode.F,
'G': Keycode.G, 'H': Keycode.H, 'I': Keycode.I,
'J': Keycode.J, 'K': Keycode.K, 'L': Keycode.L,
'M': Keycode.M, 'N': Keycode.N, 'O': Keycode.O,
'P': Keycode.P, 'Q': Keycode.Q, 'R': Keycode.R,
'S': Keycode.S, 'T': Keycode.T, 'U': Keycode.U,
'V': Keycode.V, 'W': Keycode.W, 'X': Keycode.X,
'Y': Keycode.Y, 'Z': Keycode.Z,
}
if key_char in keycode_map:
keyboard.press(Keycode.CONTROL, keycode_map[key_char])
keyboard.release_all()
uart.write(ACK)
else:
uart.write(NAK)
elif cmd.startswith("CMD+"):
key_char = cmd.split("+")[1].upper()
keycode_map = {
'A': Keycode.A, 'C': Keycode.C, 'V': Keycode.V,
'X': Keycode.X, 'Z': Keycode.Z, 'S': Keycode.S,
}
if key_char in keycode_map:
keyboard.press(Keycode.GUI, keycode_map[key_char])
keyboard.release_all()
uart.write(ACK)
else:
uart.write(NAK)
else:
uart.write(NAK)
# --- Main Loop ---
print("VoiceKey Pico HID Bridge started")
print(f"UART: {UART_BAUD} baud, Keystroke delay: {KEYSTROKE_DELAY*1000:.0f}ms")
blink(3, 0.15, 0.15)
buffer = bytearray()
while True:
data = uart.read(256)
if data is not None:
buffer.extend(data)
while b'\n' in buffer:
line_end = buffer.index(b'\n')
line_bytes = bytes(buffer[:line_end])
buffer = buffer[line_end + 1:]
try:
line = line_bytes.decode('utf-8').rstrip('\r')
except UnicodeDecodeError:
uart.write(NAK)
continue
if not line:
uart.write(ACK)
continue
if line.startswith("CMD:"):
handle_command(line[4:])
else:
type_text(line)
uart.write(ACK)
if len(buffer) > MAX_BUFFER * 2:
buffer = bytearray()
uart.write(NAK)
time.sleep(0.01)
After saving code.py, the Pico restarts automatically. The onboard LED should blink 3 times then stay solid — that means it's running and waiting for UART input.
Quick test: Eject CIRCUITPY properly (diskutil eject /Volumes/CIRCUITPY on Mac), unplug from your computer, and plug the Pico into your KVM's USB port. It should be recognized as a keyboard.
Part 3: Wire Everything Up
Here's the complete wiring diagram. The Pi Zero W's GPIO header is your central hub — everything connects to it.
Pi Zero W GPIO Pinout Reference (relevant pins):
SD card side
┌───────────────────┐
│ (1) (2) │
│ ● ● 3V3 5V │ ← Pin 1: INMP441 VDD
│ (3) (4) │
│ ● ● │
│ (5) (6) │
│ ● ● GND │ ← Pin 6: INMP441 GND
│ (7) (8) │
│ ● ● TXD │ ← Pin 8: Pico GP1 (UART)
│ (9) (10) │
│ ● ● GND RXD │ ← Pin 9: INMP441 L/R
│(11) (12) │ Pin 10: Pico GP0 (UART)
│ ● ● G18 │ ← Pin 12: INMP441 SCK
│(13) (14) │
│ ● ● GND │ ← Pin 14: Pico GND
│(15) (16) │
│ ● ● G23 │ ← Pin 16: Joystick D (button)
│(17) (18) │
│ ● ● G24 │ ← Pin 18: Status LED (optional)
│(19) (20) │
│ ● ● GND │ ← Pin 20: Joystick GND
│ ... │
│(35) (36) │
│ ● ● G19 │ ← Pin 35: INMP441 WS
│(37) (38) │
│ ● ● G20 │ ← Pin 38: INMP441 SD (data)
│(39) (40) │
│ ● ● │
└───────────────────┘
USB ports side
Connection summary:
INMP441 Microphone (6 wires):
INMP441 VDD → Pi Zero Pin 1 (3.3V) ⚠️ NOT 5V — that will fry it
INMP441 GND → Pi Zero Pin 6 (GND)
INMP441 SCK → Pi Zero Pin 12 (GPIO 18)
INMP441 WS → Pi Zero Pin 35 (GPIO 19)
INMP441 SD → Pi Zero Pin 38 (GPIO 20)
INMP441 L/R → Pi Zero Pin 9 (GND — selects left channel)
Raspberry Pi Pico UART (3 wires):
Pi Zero Pin 8 (TXD) → Pico GP1 (pin 2)
Pi Zero Pin 10 (RXD) ← Pico GP0 (pin 1)
Pi Zero Pin 14 (GND) → Pico GND (pin 3)
Joystick / Push Button (2 wires):
Button terminal → Pi Zero Pin 16 (GPIO 23)
Button GND → Pi Zero Pin 20 (GND)
Pico USB output:
Pico micro USB → KVM switch USB port (keyboard input)
Part 4: Set Up the Raspberry Pi Zero W
Flash a fresh SD card using Raspberry Pi Imager:
- Device: Raspberry Pi Zero W
- OS: Raspberry Pi OS Lite (32-bit)
- Configure: hostname `voicekey`, enable SSH, set WiFi credentials, set username/password
Insert the card, power up the Pi Zero, wait 60-90 seconds, then SSH in:
ssh pi@voicekey.local
Heads up: If you've previously SSH'd to a device called `voicekey.local`, you may get a host key warning. Fix with: `ssh-keygen -R voicekey.local`
Now run through the setup. I've broken this into individual steps so you can verify each one works before moving on.
Step 1: Install system dependencies
sudo apt update && sudo apt install -y \
python3-pip python3-venv python3-dev \
libportaudio2 portaudio19-dev libopenblas-dev sox
This takes a few minutes on the Pi Zero W. The Pi Zero is slow — embrace it.
Step 2: Configure I2S audio for the INMP441 mic
sudo nano /boot/firmware/config.txt
Add at the bottom:
dtoverlay=googlevoicehat-soundcard
dtparam=i2s=on
Save (Ctrl+O, Enter, Ctrl+X).
Note: On older Raspberry Pi OS versions, the path is `/boot/config.txt` instead of `/boot/firmware/config.txt`.
Add the sound module:
echo "snd-bcm2835" | sudo tee -a /etc/modules
Step 3: Configure UART
sudo raspi-config
Navigate to Interface Options → Serial Port:
- Login shell over serial: No
- Serial hardware enabled: Yes
Disable the serial console service (this is critical — if you skip this, the UART output will be garbled with login prompts):
sudo systemctl disable serial-getty@ttyS0.service
sudo sed -i 's/console=serial0,[0-9]* //g' /boot/firmware/cmdline.txt
Step 4: Reboot and verify hardware
sudo reboot
SSH back in after ~60 seconds:
ssh pi@voicekey.local
Verify the mic is detected:
arecord -l
You should see `snd_rpi_googlevoicehat_soundcar` listed (ALSA truncates the card name). If it shows nothing, double-check your `/boot/firmware/config.txt` edits.
Verify UART:
ls -la /dev/serial0
Should show a symlink to /dev/ttyS0.
Test the mic (speak or clap during the 3-second recording):
arecord -D plughw:1,0 -f S32_LE -r 16000 -c 1 -d 3 /tmp/test.wav
ls -la /tmp/test.wav
File should be ~192KB. You can copy it to your computer to listen:
# Run on your Mac/PC, not the Pi:
scp pi@voicekey.local:/tmp/test.wav ~/Downloads/
Tip: The INMP441 is a small MEMS mic — audio will sound faint. That's normal. We apply a 20dB gain boost in the daemon, and Whisper handles quiet audio remarkably well.
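For reference, a decibel gain maps to a linear amplitude factor of 10^(dB/20), so the 20dB boost multiplies each sample's amplitude by 10:

```python
# dB gain -> linear amplitude multiplier (what the sox "gain" effect applies)
def db_to_amplitude(db: float) -> float:
    return 10 ** (db / 20)

print(db_to_amplitude(20))  # the daemon's 20 dB default = 10x amplitude
```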
Test UART → Pico → KVM (open a text editor on your active KVM machine first):
echo "hello from voicekey" > /dev/serial0
If the text appears in your editor, the full hardware chain is working.
Test server connectivity (replace with your GPU PC's IP):
curl http://10.0.0.115:8787/health
Should return {"status":"ok",...}.
All four checks must pass before continuing.
Step 5: Set up Python environment and install dependencies
python3 -m venv ~/voicekey
source ~/voicekey/bin/activate
pip install requests pyserial pyyaml numpy sounddevice gpiozero lgpio
This is the slow step — 10-15 minutes on the Pi Zero W. Go get coffee.
Verify everything installed:
python3 -c "import requests; import serial; import numpy; import sounddevice; from gpiozero import Button; print('All good')"
Step 6: Create the configuration file
cat > ~/voicekey/config.yaml << 'EOF'
server_url: "http://10.0.0.115:8787"
server_timeout_connect: 5
server_timeout_response: 30
uart_port: "/dev/serial0"
uart_baud: 115200
button_gpio: 23
led_gpio: 24
min_record_seconds: 0.3
max_record_seconds: 180
chunk_size: 200
audio_sample_rate: 16000
audio_gain_db: 20
log_file: "/home/pi/voicekey/voicekey.log"
EOF
Important: Replace the server_url IP with your actual GPU PC's IP address. Consider setting a static IP / DHCP reservation in your router so it doesn't change.
Step 7: Deploy the daemon
This is the main orchestrator — it ties audio capture, server communication, and Pico output together.
cat > ~/voicekey/daemon.py << 'DAEMONEOF'
"""
VoiceKey Daemon for Raspberry Pi Zero W
========================================
Push the joystick button to record, release to process.
Audio is sent to the Murmur server for transcription + cleanup,
then the result is typed out via the Pico USB keyboard bridge.
"""
import os
import io
import time
import wave
import struct
import logging
import subprocess
import threading
import yaml
import numpy as np
import requests
import serial
from gpiozero import Button, LED
from signal import pause
# --- Load Config ---
CONFIG_PATH = os.path.join(os.path.dirname(__file__), "config.yaml")
DEFAULT_CONFIG = {
"server_url": "http://10.0.0.115:8787",
"server_timeout_connect": 5,
"server_timeout_response": 30,
"uart_port": "/dev/serial0",
"uart_baud": 115200,
"button_gpio": 23,
"led_gpio": 24,
"min_record_seconds": 0.3,
"max_record_seconds": 180,
"chunk_size": 200,
"audio_sample_rate": 16000,
"audio_gain_db": 20,
"log_file": "/home/pi/voicekey/voicekey.log",
}
def load_config():
config = DEFAULT_CONFIG.copy()
if os.path.exists(CONFIG_PATH):
with open(CONFIG_PATH, "r") as f:
user_config = yaml.safe_load(f)
if user_config:
config.update(user_config)
return config
config = load_config()
# --- Logging ---
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
handlers=[
logging.FileHandler(config["log_file"]),
logging.StreamHandler(),
],
)
log = logging.getLogger("voicekey")
# --- State ---
class State:
IDLE = "IDLE"
LISTENING = "LISTENING"
PROCESSING = "PROCESSING"
TYPING = "TYPING"
ERROR = "ERROR"
current_state = State.IDLE
recording_process = None
recording_file = "/tmp/voicekey_recording.wav"
raw_audio_file = "/tmp/voicekey_raw.raw"
# --- Hardware Setup ---
button = Button(config["button_gpio"], pull_up=True, bounce_time=0.05)
led = None
try:
led = LED(config["led_gpio"])
except Exception as e:
log.warning(f"Could not initialize LED on GPIO {config['led_gpio']}: {e}")
# --- LED Control ---
led_blink_thread = None
led_blink_stop = threading.Event()
def set_led(state):
global led_blink_thread
if led is None:
return
led_blink_stop.set()
if led_blink_thread and led_blink_thread.is_alive():
led_blink_thread.join(timeout=1)
led_blink_stop.clear()
if state == State.IDLE:
led.off()
elif state == State.LISTENING:
led.on()
elif state == State.PROCESSING:
def blink_fast():
while not led_blink_stop.is_set():
led.toggle()
led_blink_stop.wait(0.1)
led.off()
led_blink_thread = threading.Thread(target=blink_fast, daemon=True)
led_blink_thread.start()
elif state == State.TYPING:
def blink_slow():
while not led_blink_stop.is_set():
led.toggle()
led_blink_stop.wait(0.5)
led.off()
led_blink_thread = threading.Thread(target=blink_slow, daemon=True)
led_blink_thread.start()
elif state == State.ERROR:
def blink_error():
for _ in range(3):
if led_blink_stop.is_set():
break
led.on()
led_blink_stop.wait(0.15)
led.off()
led_blink_stop.wait(0.15)
led.off()
led_blink_thread = threading.Thread(target=blink_error, daemon=True)
led_blink_thread.start()
def set_state(new_state):
global current_state
current_state = new_state
log.info(f"State: {new_state}")
set_led(new_state)
# --- UART Setup ---
def open_uart():
try:
ser = serial.Serial(config["uart_port"], config["uart_baud"], timeout=30)
log.info(f"UART opened: {config['uart_port']} @ {config['uart_baud']}")
return ser
except Exception as e:
log.error(f"Failed to open UART: {e}")
return None
uart = open_uart()
def send_to_pico(text):
global uart
if uart is None:
uart = open_uart()
if uart is None:
log.error("UART not available, cannot send to Pico")
return False
chunk_size = config["chunk_size"]
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
for i, chunk in enumerate(chunks):
try:
uart.write((chunk + "\n").encode("utf-8"))
uart.flush()
log.info(f"Sent chunk {i+1}/{len(chunks)}: {len(chunk)} chars")
ack = uart.read(1)
if ack == b'\x06':
log.info(f"ACK received for chunk {i+1}")
elif ack == b'\x15':
log.warning(f"NAK received for chunk {i+1}")
return False
else:
log.warning(f"No ACK received for chunk {i+1}, got: {ack}")
except Exception as e:
log.error(f"UART send error: {e}")
uart = None
return False
return True
# --- Audio Recording ---
def start_recording():
global recording_process, current_state
if current_state != State.IDLE:
log.warning(f"Cannot start recording in state {current_state}")
return
set_state(State.LISTENING)
try:
for f in [recording_file, raw_audio_file]:
if os.path.exists(f):
os.remove(f)
recording_process = subprocess.Popen(
[
"arecord", "-D", "plughw:1,0", "-f", "S32_LE",
"-r", str(config["audio_sample_rate"]),
"-c", "1", "-d", str(config["max_record_seconds"]),
recording_file,
],
stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
)
log.info("Recording started")
except Exception as e:
log.error(f"Failed to start recording: {e}")
set_state(State.ERROR)
time.sleep(1)
set_state(State.IDLE)
def stop_recording_and_process():
global recording_process
if current_state != State.LISTENING:
log.warning(f"Cannot stop recording in state {current_state}")
return
if recording_process:
recording_process.terminate()
recording_process.wait(timeout=5)
recording_process = None
log.info("Recording stopped")
if not os.path.exists(recording_file):
log.warning("No recording file found")
set_state(State.IDLE)
return
file_size = os.path.getsize(recording_file)
min_bytes = int(config["min_record_seconds"] * 16000 * 4) + 44
if file_size < min_bytes:
log.warning(f"Recording too short ({file_size} bytes), ignoring")
set_state(State.IDLE)
return
set_state(State.PROCESSING)
try:
sox_result = subprocess.run(
[
"sox", recording_file, "-t", "raw",
"-r", str(config["audio_sample_rate"]),
"-c", "1", "-e", "floating-point", "-b", "32",
raw_audio_file, "gain", str(config["audio_gain_db"]),
],
capture_output=True, text=True, timeout=30,
)
if sox_result.returncode != 0:
log.error(f"Sox conversion failed: {sox_result.stderr}")
set_state(State.ERROR)
time.sleep(1)
set_state(State.IDLE)
return
log.info("Audio converted to raw float32 with gain boost")
transcription = transcribe_audio(raw_audio_file)
if not transcription:
log.warning("Empty transcription, nothing to type")
set_state(State.IDLE)
return
log.info(f"Transcription: {transcription}")
cleaned = cleanup_text(transcription)
log.info(f"Cleaned: {cleaned}")
set_state(State.TYPING)
success = send_to_pico(cleaned)
if success:
log.info("Text sent to Pico successfully")
else:
log.warning("Failed to send text to Pico")
except Exception as e:
log.error(f"Processing error: {e}")
set_state(State.ERROR)
time.sleep(1)
finally:
set_state(State.IDLE)
for f in [recording_file, raw_audio_file]:
try:
if os.path.exists(f):
os.remove(f)
except:
pass
# --- Server Communication ---
def transcribe_audio(raw_file_path):
url = f"{config['server_url']}/transcribe_raw"
timeout = (config["server_timeout_connect"], config["server_timeout_response"])
try:
with open(raw_file_path, "rb") as f:
files = {"audio": ("audio.raw", f, "application/octet-stream")}
data = {"sample_rate": str(config["audio_sample_rate"])}
response = requests.post(url, files=files, data=data, timeout=timeout)
if response.status_code == 200:
result = response.json()
text = result.get("text", "").strip()
timing = result.get("timing", {})
log.info(
f"Transcription timing: {timing.get('transcription_ms', '?')}ms, "
f"RTF: {timing.get('realtime_factor', '?')}"
)
return text
else:
log.error(f"Transcription failed: HTTP {response.status_code}")
return None
except requests.exceptions.ConnectionError:
log.error(f"Cannot reach server at {config['server_url']}")
return None
except requests.exceptions.Timeout:
log.error("Server request timed out")
return None
except Exception as e:
log.error(f"Transcription error: {e}")
return None
def cleanup_text(text):
url = f"{config['server_url']}/cleanup"
timeout = (config["server_timeout_connect"], config["server_timeout_response"])
if len(text.split()) < 4:
cleaned = text.strip()
if cleaned and cleaned[0].islower():
cleaned = cleaned[0].upper() + cleaned[1:]
if cleaned and cleaned[-1] not in ".!?":
cleaned += "."
return cleaned
try:
response = requests.post(url, json={"text": text}, timeout=timeout)
if response.status_code == 200:
result = response.json()
cleaned = result.get("cleaned_text", text).strip()
timing = result.get("timing", {})
log.info(f"Cleanup timing: {timing.get('total_ms', '?')}ms")
return cleaned if cleaned else text
else:
log.warning(f"Cleanup failed: HTTP {response.status_code}, using raw text")
return text
except Exception as e:
log.warning(f"Cleanup error: {e}, using raw text")
return text
# --- Button Handlers ---
def on_button_press():
log.info("Button pressed")
start_recording()
def on_button_release():
log.info("Button released")
threading.Thread(target=stop_recording_and_process, daemon=True).start()
# --- Main ---
def main():
log.info("=" * 50)
log.info("VoiceKey Daemon Starting")
log.info(f"Server: {config['server_url']}")
log.info(f"UART: {config['uart_port']} @ {config['uart_baud']}")
log.info(f"Button: GPIO {config['button_gpio']}")
log.info(f"Audio gain: {config['audio_gain_db']}dB")
log.info("=" * 50)
try:
r = requests.get(f"{config['server_url']}/health", timeout=5)
health = r.json()
log.info(f"Server OK: whisper={health.get('whisper_model')}, "
f"device={health.get('whisper_device')}")
except Exception as e:
log.warning(f"Server not reachable: {e} -- will retry on first dictation")
if uart:
uart.write(b"CMD:PING\n")
ack = uart.read(1)
if ack == b'\x06':
log.info("Pico connected and responding")
else:
log.warning("Pico not responding to PING -- check wiring")
button.when_pressed = on_button_press
button.when_released = on_button_release
set_state(State.IDLE)
log.info("Ready! Press the joystick to dictate.")
try:
pause()
except KeyboardInterrupt:
log.info("Shutting down...")
set_led(State.IDLE)
if uart:
uart.close()
if __name__ == "__main__":
main()
DAEMONEOF
Step 8: Test the daemon
source ~/voicekey/bin/activate
cd ~/voicekey
python3 daemon.py
You should see:
VoiceKey Daemon Starting
Server: http://10.0.0.115:8787
UART: /dev/serial0 @ 115200
Button: GPIO 23
Audio gain: 20dB
Server OK: whisper=small.en, device=cuda
Pico connected and responding
State: IDLE
Ready! Press the joystick to dictate.
Open a text editor on your active KVM machine. Press the joystick, say something, release. Watch the magic happen.
Press Ctrl+C to stop the daemon after verifying it works.
Step 9: Set up auto-start on boot
sudo bash -c 'cat > /etc/systemd/system/voicekey.service << EOF
[Unit]
Description=VoiceKey Daemon
After=network-online.target
Wants=network-online.target
[Service]
ExecStart=/home/pi/voicekey/bin/python3 /home/pi/voicekey/daemon.py
WorkingDirectory=/home/pi/voicekey
User=root
Restart=on-failure
RestartSec=3
[Install]
WantedBy=multi-user.target
EOF'
sudo systemctl daemon-reload
sudo systemctl enable voicekey
sudo systemctl start voicekey
Verify:
sudo systemctl status voicekey
VoiceKey now starts automatically on every boot. Plug in power, wait 60 seconds, dictate.
It's Working — Now What?
When everything is humming, your logs (visible with journalctl -u voicekey -f) should look something like this:
State: IDLE
Ready! Press the joystick to dictate.
Button pressed
State: LISTENING
Recording started
Button released
Recording stopped
State: PROCESSING
Audio converted to raw float32 with gain boost
Transcription timing: 182.12ms, RTF: 0.04
Transcription: I'm refactoring the ViewModel to use StateFlow instead of LiveData
Cleanup timing: 342ms
Cleaned: I'm refactoring the ViewModel to use StateFlow instead of LiveData.
State: TYPING
Sent chunk 1/1: 67 chars
ACK received for chunk 1
Text sent to Pico successfully
State: IDLE
That RTF: 0.04 means transcription ran 25x faster than real-time on the RTX 5070. Total end-to-end latency from button release to text appearing: under 700ms.
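RTF is just processing time divided by audio duration, and the speedup is its reciprocal. A quick sanity check using the 182ms figure from the log (the clip length here is my assumption):

```python
# Real-time factor: processing time / audio duration
transcription_s = 0.182   # 182.12 ms transcription time from the log
audio_s = 5.0             # assumed ~5 s clip (illustrative)
rtf = transcription_s / audio_s   # ~0.036
speedup = 1 / rtf                 # ~27x faster than real-time
```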
Teach Whisper Your Vocabulary
This is the single biggest accuracy improvement you can make. Without a vocabulary prompt, Whisper will mangle your technical terms: "Hilt" becomes "hilled," "StateFlow" becomes "state flow," "gRPC" becomes "grip C."
Add a vocabulary.txt file to your inference server containing your most-used technical terms:
ViewModel, StateFlow, MutableStateFlow, LiveData, Hilt, Dagger,
Kotlin, Jetpack Compose, coroutine, LaunchedEffect, Retrofit,
OkHttp, gRPC, protobuf, Gradle, ProGuard, R8, Room, NavGraph,
RecyclerView, suspend fun, rememberSaveable, LazyColumn, composable,
NavHostController, serialization, deserialization, idempotent, Kafka
Pass this as the initial_prompt parameter to faster-whisper's transcribe() call on the server side. The difference is dramatic — after adding my vocabulary, Whisper nailed every single Android/Kotlin term I threw at it.
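If you'd rather generate the prompt from the file than hardcode it, something like this works. The `load_vocabulary_prompt` helper and the "Glossary:" framing are my own sketch; only the `initial_prompt` parameter is actual faster-whisper API:

```python
# Hypothetical helper: turn vocabulary.txt into an initial_prompt string.
def load_vocabulary_prompt(path: str, max_terms: int = 100) -> str:
    with open(path) as f:
        raw = f.read().replace("\n", ",")
    terms = [t.strip() for t in raw.split(",") if t.strip()]
    # Whisper's prompt window is small (~224 tokens), so cap the term list
    return "Glossary: " + ", ".join(terms[:max_terms]) + "."

# On the server, pass it through to faster-whisper:
# segments, info = model.transcribe(audio, initial_prompt=load_vocabulary_prompt("vocabulary.txt"))
```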
Gotchas & Tips
The serial console will eat your UART. This was my biggest debugging headache. If you don't disable serial-getty@ttyS0.service and remove console=serial0 from cmdline.txt, the Pi Zero sends login prompts over UART. The Pico dutifully types them into your text editor. Ask me how I know.
The INMP441 mic is quiet. It's a tiny MEMS microphone, not a studio condenser. The daemon applies a 20dB gain boost via sox before sending to the server. You can adjust this in config.yaml (audio_gain_db). Whisper is remarkably good at handling quiet audio, so even without the boost it often works — but the boost helps.
First transcription is slow. The first time you dictate after starting the server, Whisper needs to load the model into GPU memory. This can take several seconds. Every subsequent transcription is fast (~180ms for a 5-second clip on an RTX 5070). Just do a throwaway test dictation after starting the server.
The Pi Zero W is slow to set up, fast to run. Installing packages takes ages on the single-core ARM11 CPU. But the actual daemon runtime is minimal — it's just recording audio, making HTTP calls, and writing to UART. None of that is CPU intensive.
Set a static IP for your GPU PC. If your PC gets a new IP from DHCP, VoiceKey stops working until you update config.yaml. Set a DHCP reservation in your router for the PC.
KVM keystroke speed. Some KVM switches drop keystrokes if they come too fast. The default 8ms delay between keystrokes works for my KCeve KVM303. If you see dropped characters, increase it in config.yaml or send CMD:SPEED:15 over UART for 15ms delay.
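Note the tradeoff: the per-keystroke delay caps typing throughput, so slowing it down makes long dictations land noticeably later. A back-of-envelope estimate (plain arithmetic, using the defaults above):

```python
# Seconds the Pico spends typing a message at a given per-keystroke delay
def typing_seconds(num_chars: int, delay_s: float = 0.008) -> float:
    return num_chars * delay_s

# A full 200-char chunk: ~1.6 s at 8 ms/keystroke, ~3 s at 15 ms
```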
Don't plug the INMP441 VDD into 5V. It's a 3.3V device. Pin 1 (3.3V), not Pin 2 (5V). I didn't fry one, but only because I triple-checked before powering on.
What's Next
This is v1. It works, it's fast, and it cost less than three months of a WisprFlow subscription. But there's a lot of room to grow:
- Context-aware modes — detect which app is focused (Slack vs Cursor vs Mail) and adjust the LLM cleanup prompt accordingly. Dictating code should produce code, not English sentences.
- Voice commands — "select all and copy" → Cmd+A, Cmd+C. The Pico already supports `CMD:`-prefixed commands for key combos.
- A proper enclosure — right now it's a tangle of wires and bare boards on my desk. A 3D-printed case would make it look like a real product.
- The 1602A LCD display I found in a drawer — wire it up for status, diagnostics, and scrolling transcription preview.
I'll cover some of these in future posts. If you want to see the build process, wiring, and live demo, check out the companion video on my YouTube channel @droidchef.
If you build one of these, I want to see it. Drop a comment here or on the video!
Happy dictating.
— Ishan