A contributor's map of kesha-voice-kit: where code lives, what the boundaries
are, and where to make changes. For the why behind specific designs, see the
spec docs under docs/superpowers/specs/.
Kesha is two programs, not one:
- `kesha` CLI: a thin Bun/TypeScript wrapper (`src/`). Parses commands, formats stdout/stderr, downloads pinned assets only when explicitly asked, and owns the local cache, support bundles, and Stats.
- `kesha-engine`: a standalone Rust binary (`rust/`). Does all inference (ASR, TTS, language detection, VAD, diarization). No cloud calls, no Python, no ffmpeg.
The CLI spawns the engine as a subprocess — it is never linked in-process.
TypeScript runs directly under Bun (no build step); the engine is a precompiled
binary downloaded from GitHub Releases during kesha install. The two are
versioned independently (package.json#version vs
package.json#keshaEngine.version).
Users / agents
shell | scripts | OpenClaw | Hermes | Raycast | @drakulavich/kesha-voice-kit/core
|
v
+------------------------------- Kesha CLI -------------------------------+
| Bun + TypeScript wrapper |
| - parses commands and formats stdout/stderr |
| - installs pinned engine/model assets only when explicitly requested |
| - keeps cache, support bundles, and local Stats in the CLI |
+-----------------------------------+-------------------------------------+
|
| spawns one local process
v
+----------------------------- kesha-engine ------------------------------+
| Rust binary, no cloud calls, no Python, no ffmpeg |
| |
| Audio input Text input Diagnostics |
| WAV/MP3/OGG/FLAC/AAC/M4A plain text / SSML status/support |
| | | | |
| v v v |
| Symphonia decode TTS preprocessing runtime probes |
| | | |
| +--> optional VAD +--> voice routing |
| | + diarization | Kokoro / Vosk / macOS voices |
| | | |
| +--> audio lang ID +--> speech synthesis |
| | SpeechBrain ONNX |
| | |
| +--> ASR backend |
| CoreML on Apple Silicon |
| ONNX Runtime on Linux/Windows/fallback |
+-----------------------------------+-------------------------------------+
|
v
transcript | JSON/TOON | WAV | local diagnostics
Cache boundary: kesha install and opt-in feature installs populate the local
cache; ordinary transcription and speech commands fail fast if required assets
are missing.
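
The fail-fast side of this boundary can be pictured roughly as follows. This is a minimal sketch, not the CLI's actual code: the `RequiredAsset` shape, the `assertAssetsInstalled` helper, and the example paths are all hypothetical. Only the behavior (refuse to run and point at the install command, never download) comes from the description above.

```typescript
// Hypothetical asset description; the real CLI's data model may differ.
interface RequiredAsset {
  label: string; // human-readable name used in the error message
  path: string; // expected location inside the local cache
  installHint: string; // command that would install the asset
}

// Throws on the first missing asset instead of downloading anything.
// `exists` is injected so the check can be exercised without touching disk.
function assertAssetsInstalled(
  assets: RequiredAsset[],
  exists: (path: string) => boolean,
): void {
  for (const asset of assets) {
    if (!exists(asset.path)) {
      throw new Error(
        `${asset.label} is not installed. Run \`${asset.installHint}\` first.`,
      );
    }
  }
}

// Example with made-up paths: the engine is present, the VAD model is not.
const requiredAssets: RequiredAsset[] = [
  { label: "engine binary", path: "/cache/kesha-engine", installHint: "kesha install" },
  { label: "VAD model", path: "/cache/models/vad", installHint: "kesha install --vad" },
];
const onDisk = new Set(["/cache/kesha-engine"]);

let failFastMessage = "";
try {
  assertAssetsInstalled(requiredAssets, (p) => onDisk.has(p));
} catch (err) {
  failFastMessage = err instanceof Error ? err.message : String(err);
}
console.error(failFastMessage);
```

Injecting the `exists` predicate is a choice made for this sketch so the check stays a pure function; a real implementation would more likely consult the filesystem directly.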
| Model | Task | Size | Source |
|---|---|---|---|
| NVIDIA Parakeet TDT 0.6B v3 | Speech-to-text | ~2.5GB | HuggingFace |
| SpeechBrain ECAPA-TDNN | Audio language detection | ~86MB | HuggingFace |
| Apple NLLanguageRecognizer | Text language detection | built-in | macOS system framework |
| Silero VAD v5 (opt-in) | Voice activity detection | ~2.3MB | snakers4/silero-vad |
| Kokoro-82M / Vosk-TTS (opt-in) | Text-to-speech | ~990MB | FluidAudio Kokoro on darwin-arm64 (FluidAudio cache, not Kesha-verified); ONNX Kokoro elsewhere · Vosk-TTS |
All models run through kesha-engine — a Rust binary using
FluidAudio (CoreML) on Apple
Silicon and ort (ONNX Runtime) on other
platforms. Audio decoding via
symphonia — WAV, MP3, OGG/Opus, FLAC,
AAC, M4A. No ffmpeg.
src/ Bun/TS CLI + library
cli.ts argument parsing, --format/--json/--toon, top-level flags
cli/ subcommands: install, init, logs, doctor, completions, dispatch
engine.ts engine subprocess wrapper + getEngineCapabilities
engine-install.ts engine binary download (uses keshaEngine.version)
transcribe.ts thin forwarder to `kesha-engine transcribe`
synth.ts thin forwarder to `kesha-engine say`
voice-routing.ts omitted-`--voice` language→voice picker
lib.ts public API exported at @drakulavich/kesha-voice-kit/core
*.ts doctor, support-bundle, stats, diagnostic-log, paths, ...
rust/src/ kesha-engine (Rust)
main.rs clap CLI: transcribe / say / detect-lang / install / record / ...
capabilities.rs --capabilities-json (single source of truth for feature flags)
models.rs HF download + cache + SHA-256 pins for every model
audio.rs symphonia decode + rubato resample to 16kHz mono f32
lang_id.rs SpeechBrain ONNX audio language detection (always built)
text_lang.rs macOS NLLanguageRecognizer (macOS only)
vad.rs Silero VAD v5
backend/ ASR backends — onnx.rs, fluidaudio.rs (coreml), mod.rs (trait)
transcribe/ transcribe pipeline + diarize.rs
tts/ kokoro.rs (en), vosk.rs (ru), avspeech.rs (macos), g2p, ssml/, en/, ru/
tests/ bun tests — unit/, integration/, fixtures/, helpers/
rust/tests/ nextest integration binaries (tts_e2e, diarize_e2e, ssml_integration, ...)
.github/workflows/ ci, rust-test, build-engine, security, npm-publish, homebrew-tap, linux-packages, docker
raycast/ Raycast extension (its own package.json)
packaging/ deb/rpm nfpm config
flake.nix Nix build path (aarch64-darwin, x86_64-linux)
SKILL.md OpenClaw skill manifest (shipped in the npm package)
- A `kesha <cmd>` call is parsed in `src/cli.ts` / `src/cli/dispatch.ts`.
- Commands that need inference (`transcribe`, `say`, `detect-lang`) forward to the engine via `src/engine.ts`, which locates the binary (`KESHA_ENGINE_BIN` override → installed cache path) and spawns it with `Bun.spawn`.
- The CLI reads the engine's capability surface via `kesha-engine --capabilities-json` (`src/engine.ts::getEngineCapabilities`) and validates flags against it instead of blindly forwarding; see the "DO NOT BLINDLY FORWARD CLI FLAGS" rule in CLAUDE.md.
- stdout is the result (transcript / JSON / WAV bytes); stderr is progress + errors. This keeps stdout pipe-friendly.
- Assets are install-only. `kesha install` (and opt-in `--tts` / `--vad` / `--diarize`) populate the cache; ordinary commands fail fast with an actionable hint if an asset is missing. The engine is never auto-downloaded on first transcription.
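
Two of the steps above, locating the binary and validating flags against capabilities, can be sketched as pure functions. Treat this as an illustration under stated assumptions: the `EngineCapabilities` shape (a flat `features` list), the `FLAG_REQUIRES` table, and the flag and feature names in it are hypothetical, since the real `--capabilities-json` schema is not shown here. Only the `KESHA_ENGINE_BIN` override order and the validate-before-forwarding idea come from the text.

```typescript
// Hypothetical shape; the real --capabilities-json schema may differ.
interface EngineCapabilities {
  features: string[];
}

// Override first, then the installed cache path, as described above.
function resolveEngineBin(
  env: Record<string, string | undefined>,
  installedPath: string,
): string {
  const override = env["KESHA_ENGINE_BIN"];
  return override !== undefined && override !== "" ? override : installedPath;
}

// Hypothetical mapping from a CLI flag to the engine feature it needs.
const FLAG_REQUIRES: Record<string, string> = {
  "--vad": "vad",
  "--diarize": "diarize",
};

// Returns the flags the engine cannot honor, so the CLI can reject them
// with a clear error instead of forwarding them blindly.
function unsupportedFlags(flags: string[], caps: EngineCapabilities): string[] {
  const available = new Set(caps.features);
  return flags.filter((flag) => {
    const needed = FLAG_REQUIRES[flag];
    return needed !== undefined && !available.has(needed);
  });
}

// Example with made-up paths and a made-up capability report.
const engineBin = resolveEngineBin(
  { KESHA_ENGINE_BIN: "/opt/dev/kesha-engine" },
  "/cache/kesha-engine",
);
const rejectedFlags = unsupportedFlags(
  ["--vad", "--diarize", "--json"],
  { features: ["vad"] },
);
console.log(engineBin, rejectedFlags);
```

Flags with no entry in the table (such as `--json` here) pass through untouched, because they are handled by the CLI itself or need no optional engine feature.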
Compile-time feature gating (rust/Cargo.toml): the engine ships in
per-platform variants selected by cargo features, mirrored in every
build-engine.yml matrix row.
- ASR: exactly one backend per binary, no runtime fallback: `coreml` (FluidAudio / Apple Neural Engine, darwin-arm64) or `onnx` (ONNX Runtime, Linux/Windows/fallback). They're mutually exclusive at the module level (`backend/mod.rs` trait, `onnx.rs`, `fluidaudio.rs`). `lang_id.rs` always uses ONNX regardless of ASR backend.
- TTS (`tts` feature): routed by voice-id prefix in `tts/voices.rs::resolve_voice`: `en-*` → Kokoro (`kokoro.rs`), `ru-*` → Vosk-TTS (`vosk.rs`), `macos-*` → AVSpeech (`avspeech.rs`).
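
The prefix routing rule is small enough to show in full. The real resolver is Rust (`tts/voices.rs::resolve_voice`); the sketch below restates the same rule in TypeScript purely for illustration. The `routeVoice` name, the example voice ids, and the choice to return `undefined` for an unknown prefix are assumptions, not the engine's actual behavior.

```typescript
type TtsBackend = "kokoro" | "vosk" | "avspeech";

// Prefix table taken from the routing rule above.
const PREFIX_TO_BACKEND: ReadonlyArray<readonly [string, TtsBackend]> = [
  ["en-", "kokoro"],
  ["ru-", "vosk"],
  ["macos-", "avspeech"],
];

// Returns the backend for a voice id, or undefined when no prefix matches.
// (How the real resolver reports an unknown voice is not covered here.)
function routeVoice(voiceId: string): TtsBackend | undefined {
  for (const [prefix, backend] of PREFIX_TO_BACKEND) {
    if (voiceId.startsWith(prefix)) return backend;
  }
  return undefined;
}

// Voice ids below are placeholders, not real voice names.
const routed = ["en-example", "ru-example", "macos-example", "fr-example"].map(
  (id) => `${id} -> ${routeVoice(id) ?? "no backend"}`,
);
console.log(routed.join("\n"));
```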
Sidecars are resolved at runtime sibling-first (next to the engine binary,
then build-time $OUT_DIR): the say-avspeech Swift helper (system_tts,
darwin) and the native fluidaudio-rs CoreML path (coreml / system_diarize).
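
The sibling-first lookup order can be sketched as below. The engine does this in Rust; this TypeScript version only illustrates the ordering. The `resolveSidecar` helper, its parameters, and the example directories are hypothetical, and paths are joined with `/` for brevity.

```typescript
// Returns the first candidate that exists: next to the engine binary first,
// then the build-time output directory. `exists` is injected for testing.
function resolveSidecar(
  name: string,
  engineDir: string,
  buildOutDir: string | undefined,
  exists: (path: string) => boolean,
): string | undefined {
  const candidates = [`${engineDir}/${name}`];
  if (buildOutDir !== undefined) candidates.push(`${buildOutDir}/${name}`);
  return candidates.find((candidate) => exists(candidate));
}

// Made-up directories: the helper sits in both places, so the sibling wins.
const present = new Set(["/opt/kesha/say-avspeech", "/build/out/say-avspeech"]);
const siblingHit = resolveSidecar(
  "say-avspeech",
  "/opt/kesha",
  "/build/out",
  (p) => present.has(p),
);

// Only the build-time copy exists, so the lookup falls through to it.
const buildOnly = new Set(["/build/out/say-avspeech"]);
const fallbackHit = resolveSidecar(
  "say-avspeech",
  "/opt/kesha",
  "/build/out",
  (p) => buildOnly.has(p),
);
console.log(siblingHit, fallbackHit);
```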
- Cache lives under `~/.cache/kesha/models/` (override `KESHA_CACHE_DIR`).
- Every model file in `rust/src/models.rs` carries a pinned SHA-256; `download_verified` refuses to cache a file whose hash doesn't match. This makes `KESHA_MODEL_MIRROR` safe and turns an upstream re-publish into a deliberate pin bump (see the `verify-pin-bump` skill).
- Diarization compiles its `.mlpackage` to a stable `.mlmodelc` sidecar warmed at `install --diarize`; Apple's e5rt cache is keyed by compiled-bundle identity, so a recompile is a cold ~98 s cost (see #444).
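
The pinned-hash check is what makes a mirror safe: bytes from any source are accepted only if they hash to the pinned value. The engine implements this in Rust (`download_verified`); the TypeScript below is only a sketch of the idea using `node:crypto`, which Bun also provides. The `sha256Hex` and `verifyPinned` names are hypothetical, and the pin is the well-known SHA-256 test vector for `"abc"`, not a real model hash.

```typescript
import { createHash } from "node:crypto";

// Lowercase hex SHA-256 of a byte buffer.
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Accept the download only when its hash equals the pinned value.
function verifyPinned(bytes: Uint8Array, pinnedSha256: string): boolean {
  return sha256Hex(bytes) === pinnedSha256.toLowerCase();
}

// Stand-in payload and pin (SHA-256 of the ASCII string "abc").
const pinned = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad";
const goodPayload = new TextEncoder().encode("abc");
const tamperedPayload = new TextEncoder().encode("abd");

const goodOk = verifyPinned(goodPayload, pinned);
const tamperedOk = verifyPinned(tamperedPayload, pinned);
console.log(goodOk, tamperedOk); // true false
```

A caller would write the file into the cache only when the check returns `true`, so a re-published upstream file surfaces as a verification failure rather than a silent change.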
- TS tests: `tests/unit/` + `tests/integration/`, run with `bun test` / `make test`.
- Rust tests: `cargo nextest run --features tts` / `make rust-test`; nextest integration binaries live in `rust/tests/`. Never run plain `cargo test` (CI uses nextest), except for `cargo test --doc`.
- CI: `ci.yml` (TS units + integration + type check), `rust-test.yml` (fmt/clippy/nextest + coreml feature check; PR + lean push-to-main gate), `security.yml` (cargo-deny + bun audit), `build-engine.yml` (tag → 3 platform binaries + draft release), `npm-publish.yml`, `homebrew-tap.yml`, `linux-packages.yml`, `docker.yml`.
- Releases: CLI and engine version independently; the full procedure (lockstep bump → tag → draft validation → un-draft → npm publish) is in CLAUDE.md and the `release-engine` skill.
- Nix: `flake.nix` builds the engine + CLI on `aarch64-darwin` / `x86_64-linux`; not a CI gate.
- OpenClaw: `SKILL.md` (shipped in the npm package) documents the `tools.media.audio.models` CLI route and TTS provider config; `openclaw.plugin.json` + `openclaw-plugin.cjs` register the plugin.
- Raycast: the `raycast/` extension (its own package, own lockfile).
- Programmatic API: `@drakulavich/kesha-voice-kit/core` (`src/lib.ts`) exports `transcribe`, `transcribeWithTimestamps`, `say`, and the `downloadModel` / `downloadTts` installers.
| If you're changing… | Touch | Verify with |
|---|---|---|
| A CLI flag / output format | `src/cli.ts`, `src/cli/*`, `src/format.ts` | `bun test && bunx tsc --noEmit` |
| ASR pipeline | `rust/src/backend/`, `rust/src/transcribe/` | `make rust-test` + `cargo check --features coreml --no-default-features` |
| A TTS voice/engine | `rust/src/tts/`, `src/voice-routing.ts` | `make rust-test`; `cargo nextest run --features tts tts_` |
| A model version/pin | `rust/src/models.rs` | `verify-pin-bump` skill; `cargo test models::manifest_tests` |
| Shell completions / manpage | regenerate, don't hand-edit | `bun run generate:shell-artifacts` |
| A GitHub workflow | `.github/workflows/*` | `bun run check:workflows` + `actionlint` |
| The OpenClaw skill | `SKILL.md` | cross-check against live `kesha <cmd> --help` |
When in doubt, the agent-facing rules in CLAUDE.md capture the hard constraints (bun-only, no auto-download, male default voices, pinned hashes, isolated worktrees) that this map only summarizes.