Moderator Edit: This post was written/formatted by AI
TL;DR: Expose a local, opt-in microphone audio stream API so users can run voice assistants in any spoken language -- not just the handful Sonos and its partners support today.
Submitted by: A Sonos owner
Affected products: Arc Ultra, Era 100/300, Beam (Gen 2), and all current mic-equipped players
Category: Voice / Platform
Summary
Sonos players ship with excellent far-field microphone arrays, but voice control works only in a small set of supported languages, through first-party and partner assistants. Hundreds of millions of people who own (or would buy) a Sonos cannot talk to it in their own language.
We are asking Sonos to add a public, opt-in, locally authenticated API that streams post-wake-word microphone audio to software on the user's own network. With access to that audio, the community and third parties can pair the mic with modern, open speech-to-text engines that already handle roughly 100 languages -- and play the response back through the same Sonos.
In short: let the microphone hardware customers already paid for understand the language they actually speak.
The Problem
Voice on Sonos today is language-locked:
- First-party and partner assistants support only a limited set of languages. Sonos Voice Control and the integrated partner assistants cover major markets, but leave out most of the world's languages, regional dialects, and accents. If your household speaks Tagalog, Ukrainian, Vietnamese, Swahili, Catalan, or any of hundreds of others -- or simply has a strong accent the model wasn't tuned for -- voice control effectively does not exist for you.
- The microphone audio never leaves the device in any usable form, so users cannot route it to a speech engine that does understand their language. Audio is sent only to first-party/partner cloud endpoints over encrypted, certificate-pinned connections; there is no supported way to capture it.
- Meanwhile, the technology to solve this already exists -- modern open speech-to-text models (e.g. Whisper-class systems) transcribe roughly 100 languages and can run entirely on local hardware. The only missing piece is access to the audio from the Sonos mic.
The result: a global audience that loves the hardware but is shut out of its most natural interface, purely because of a language and access gap that Sonos could close in software.
Proposed Solution
Add a Microphone Stream API with these properties:
- Local-first. The stream is delivered over the LAN and is never required to transit Sonos's cloud. Latency stays low and audio stays on the user's own network.
- Post-wake-word by default. To preserve the existing privacy model, the default mode streams audio only after a wake event. A continuous-stream mode can exist as a separate, more heavily gated option.
- Explicitly opt-in, per-device, revocable. The owner enables it in the Sonos app per player, sees a clear privacy disclosure, and can revoke it anytime. The physical mic mute switch remains a hard kill switch.
- Standard audio format. Deliver raw PCM or Opus in a documented format (sample rate, channel count, sample encoding) so it can be fed directly into any speech-to-text engine.
- Authenticated and household-scoped, bound to the owner's existing credentials and local pairing.
This deliberately mirrors the privacy posture Sonos already ships (opt-in mic, physical mute, on-device wake detection) -- it adds a destination the owner chooses, not a new data-collection surface.
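To make the "standard audio format" point concrete, here is a minimal sketch of what a consumer on the LAN might do with a chunk of raw PCM once it arrives. No such stream exists today, so the format values below (16 kHz, mono, 16-bit) are assumptions chosen because they are common inputs for speech models, and `pcm_to_wav` and `duration_seconds` are hypothetical helper names. Only Python's standard `wave` and `io` modules are used.

```python
import io
import wave

# Assumed stream parameters. The proposal only asks for a documented format;
# these specific values are illustrative, not something Sonos has published.
SAMPLE_RATE_HZ = 16_000     # assumption: common input rate for speech models
CHANNELS = 1                # assumption: mono
SAMPLE_WIDTH_BYTES = 2      # assumption: 16-bit signed PCM


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw PCM bytes in a WAV container, entirely in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return buffer.getvalue()


def duration_seconds(pcm: bytes) -> float:
    """Length of a raw PCM buffer at the assumed format."""
    return len(pcm) / (SAMPLE_RATE_HZ * CHANNELS * SAMPLE_WIDTH_BYTES)


if __name__ == "__main__":
    # One second of silence stands in for audio received from the player.
    one_second = bytes(SAMPLE_RATE_HZ * CHANNELS * SAMPLE_WIDTH_BYTES)
    wav_bytes = pcm_to_wav(one_second)
    print(len(wav_bytes), "bytes of WAV for", duration_seconds(one_second), "s")
```

With a documented format, this wrapping step is all that stands between the stream and most speech-to-text tools, which generally accept WAV input.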
With this in place, a user whose language isn't offered by any supported assistant can route the Sonos mic audio to a local speech engine that understands it, process the request, and play the reply back through the speaker -- all in their native language.
And because the captured audio can feed any downstream logic, it isn't limited to answering questions. The same stream can drive AI agents that take action -- turning a spoken request in any language into real commands across the user's smart home.
Extensibility: Sonos as the Voice Front Door to the Agentic Smart Home
The microphone stream is not just an input for transcription -- it is the trigger surface for AI agents. Once audio leaves the Sonos in an open format, a user's agent can:
- Understand intent in any language (via local or cloud speech-to-text + LLM), then
- Act on the user's other devices -- lights, thermostats, locks, blinds, media, scenes -- through the platforms they already run (Home Assistant, Matter/Thread controllers, vendor APIs), and
- Respond and confirm by voice through the same Sonos, closing the loop.
This makes Sonos the natural-language front door to the whole home, regardless of which assistant or agent framework wins. Crucially, the speaker stays constant while the intelligence behind it can be upgraded indefinitely -- today a simple command router, tomorrow a multi-step reasoning agent that chains actions ("dim the living room, queue dinner jazz, and tell me if the garage is still open"). None of that requires new Sonos hardware or a new Sonos assistant; it only requires that the microphone audio be reachable.
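The understand, act, respond loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not a real integration: `handle_utterance`, `Reply`, and the phrase table are hypothetical names, the speech-to-text engine is passed in as a plain callable so any engine could be plugged in, and intent matching is a naive substring check where a real agent would use an LLM or intent parser.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Transcriber = Callable[[bytes], str]   # audio in, text out (any STT engine)
Action = Callable[[], str]             # performs a command, returns spoken reply


@dataclass
class Reply:
    text: str       # what the speaker should say back
    handled: bool   # whether an action actually ran


def handle_utterance(
    audio: bytes,
    transcribe: Transcriber,
    actions: Dict[str, Action],
) -> Reply:
    """Understand (transcribe), act (run a matching action), respond."""
    text = transcribe(audio).strip().lower()
    for phrase, action in actions.items():
        if phrase in text:
            return Reply(text=action(), handled=True)
    return Reply(text="Sorry, I did not understand that.", handled=False)


if __name__ == "__main__":
    home = {"lights": "on"}   # stand-in for a real smart-home platform

    def lights_off() -> str:
        home["lights"] = "off"
        return "Luces apagadas."

    # Fake transcriber: pretends the audio was a Spanish request.
    def fake_stt(_audio: bytes) -> str:
        return "Por favor, apaga las luces"

    reply = handle_utterance(b"", fake_stt, {"apaga las luces": lights_off})
    print(reply.text, home)
```

Because the transcriber and the action table are injected, the same loop works whether the intelligence behind it is a simple command router or a more capable agent; only the callables change.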
This is the extensibility that closed, single-assistant ecosystems cannot offer: the user -- not Sonos, not Amazon, not Google -- chooses the agent, and the capability grows as agent technology grows.
Why This Is Good for Sonos (the ROI case)
This is a low-cost, high-leverage growth opportunity. The microphone hardware is already in the field -- this is a software/firmware unlock of an existing asset, not a new hardware cost. And it targets what may be the largest untapped pool of Sonos demand: people who don't speak a currently supported language.
1. Unlock a massive underserved global market
The supported-language list excludes the majority of the world's ~7,000 living languages and a large share of its speakers. Every household that can't use voice today because of language is a household getting less value from its Sonos -- and a prospective buyer who sees voice as "not for me."
Illustrative model. First-party voice covers a relatively small set of languages, so even a sliver of the excluded population is a large number. Suppose opening the mic lets the community serve 50 additional languages, and just 0.1% of the speakers of those languages become new or upgrading Sonos customers. If those languages together have a few hundred million speakers (an assumption, not a measured figure), that is still hundreds of thousands of incremental buyers -- from a feature that is essentially a firmware unlock of hardware already shipped.
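The arithmetic behind the illustrative model is easy to check. In the sketch below, the 50 languages and the 0.1% adoption rate come from the post; the average number of speakers per language is purely an assumption, and the result scales linearly with it. The function name `incremental_buyers` is hypothetical. Integer arithmetic is used so the figures come out exact.

```python
ADDITIONAL_LANGUAGES = 50      # from the illustrative model
ADOPTION_PER_THOUSAND = 1      # 0.1% of speakers, expressed per 1,000


def incremental_buyers(languages: int, avg_speakers: int,
                       adoption_per_thousand: int) -> int:
    """Speakers reached times adoption rate, in whole buyers."""
    total_speakers = languages * avg_speakers
    return total_speakers * adoption_per_thousand // 1000


if __name__ == "__main__":
    # Assumed average speakers per language: low, middle, and high scenarios.
    for avg in (2_000_000, 5_000_000, 10_000_000):
        buyers = incremental_buyers(ADDITIONAL_LANGUAGES, avg,
                                    ADOPTION_PER_THOUSAND)
        print(f"{avg:>10,} speakers/language -> {buyers:,} buyers")
    # prints 100,000, then 250,000, then 500,000 buyers
```

The "hundreds of thousands" figure therefore holds whenever the 50 languages average a few million speakers each; with much smaller languages the number shrinks proportionally.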
2. Word-of-mouth and the network effect -- in every language community
This is the part that compounds. "I can finally talk to my Sonos in my own language" is an intensely shareable moment, and it spreads through tight-knit language and regional communities that mainstream tech marketing never reaches:
- Each "it works in our language now" post, video, or community thread is a free, credible, organic advertisement aimed precisely at people who share that language -- i.e., at qualified prospective buyers.
- Language communities are dense and high-trust networks. Recommendations travel fast within a diaspora, a region, or a linguistic group -- far more efficiently than paid ads.
- The household member who sets this up is typically the one who outfits the home; one enthusiast frequently drives multiple system purchases across family and friends.
- The flywheel: open mic → assistants in new languages → users sharing in their communities → new buyers → demand for still more languages. Sonos supplies the hardware base; the world's language communities supply the marketing.
3. Accessibility and inclusion narrative
Beyond languages, the same capability serves accent robustness, speech differences, and accessibility use cases the big platforms underinvest in. A Sonos that says "voice control for everyone, in any language" owns an inclusion story that is both genuinely good and highly shareable.
4. Capture the agentic smart-home shift -- as the voice layer
Voice is rapidly becoming the interface to AI agents that control the home, not just to media playback. By opening the mic, Sonos positions its speakers as the front door to the agentic smart home: the device you talk to, in your language, to make things happen across lights, climate, locks, and scenes. This pulls Sonos into the center of the smart-home conversation -- a far larger and faster-growing market than audio alone -- without Sonos having to build the agents or the integrations itself. Every new capability the agent ecosystem ships (better reasoning, more device integrations, multi-step automation) makes the Sonos in the room more valuable at little or no additional cost to Sonos. It also deepens the moat: once a household's agent talks and listens through Sonos, the speaker becomes the hard-to-replace I/O endpoint of their entire home.
5. Future-proofing against the AI/voice shift
Sonos can't build an assistant for every language and dialect -- but the global community can, if given the audio. Opening the mic lets Sonos hardware stay the preferred voice endpoint regardless of which assistant, model, or language a household uses. Sonos becomes the ears and voice for whatever AI a user prefers.
6. Differentiation no major competitor offers
Amazon, Google, and Apple all keep microphone audio locked to their own assistants and their own supported-language lists. A Sonos that says "it's your mic, your audio, your network -- and it understands your language, and drives your whole home" owns a positioning none of the incumbents match -- and that positioning is itself highly shareable.
7. Low cost, high optionality
- Cost: primarily firmware + API work on hardware already deployed. No new manufacturing.
- Upside: access to entire language markets currently written off, a marketing engine that runs on community enthusiasm, and future monetization options (premium tier, certified-assistant program).
Privacy & Trust (addressing the obvious objection)
Sonos's caution around microphone data is correct and should be preserved. This proposal strengthens the privacy story:
- Opt-in only, per-device, with a plain-language disclosure at enable time.
- Local by default -- audio goes to the owner's chosen device on their own network, not to Sonos or any third-party cloud unless the owner's own software sends it there.
- Post-wake-word default, preserving the "not always listening" guarantee.
- Physical mute switch remains a hard, hardware-level kill switch.
- Per-app authorization & revocation, auditable in the Sonos app.
This is more privacy-respecting than the status quo, in which users who need an unsupported language must bolt on a separate, uncontrolled third-party microphone because the Sonos mic is unavailable.
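The safeguards listed above combine into a simple gate that a player could evaluate before sending any audio. The sketch below is one possible reading of those rules, not a description of Sonos firmware: `PlayerMicState`, `may_stream`, and every field name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PlayerMicState:
    """Hypothetical per-player privacy state, mirroring the bullets above."""
    opted_in: bool = False                  # owner enabled streaming in the app
    hardware_muted: bool = False            # physical mute switch position
    wake_event_active: bool = False         # a wake word was just detected
    continuous_mode_granted: bool = False   # separate, extra permission
    authorized_apps: Set[str] = field(default_factory=set)


def may_stream(state: PlayerMicState, app_id: str) -> bool:
    """Return True only if every safeguard allows audio to leave the player."""
    if state.hardware_muted:
        return False    # hardware mute always wins
    if not state.opted_in:
        return False    # opt-in only, per device
    if app_id not in state.authorized_apps:
        return False    # per-app authorization, revocable
    # Default is post-wake-word; continuous mode needs its own grant.
    return state.wake_event_active or state.continuous_mode_granted


if __name__ == "__main__":
    state = PlayerMicState(opted_in=True, authorized_apps={"local-assistant"})
    print(may_stream(state, "local-assistant"))   # False: no wake event yet
    state.wake_event_active = True
    print(may_stream(state, "local-assistant"))   # True
    state.hardware_muted = True
    print(may_stream(state, "local-assistant"))   # False: mute overrides all
```

Checking the mute switch first reflects the intent that it remains a hard override, regardless of any software permission.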
Suggested Rollout
- Beta program behind the existing developer/feedback portal, mic-equipped players only.
- Spec-first: publish a documented streaming protocol so the audio can be fed into any speech engine.
- Post-wake-word mode first; gate continuous-stream mode behind an additional explicit permission.
- Reference example: ship a sample "local voice assistant" (wake word → speech-to-text → action → text-to-speech → playback) demonstrating a non–first-party language, to seed the community and model the intended privacy posture.
- Gather feedback, then graduate to GA.
What This Would Enable (concrete demand signal)
- Voice control in languages and dialects Sonos doesn't and won't natively support -- the long tail of the world's languages.
- Better handling of regional accents and multilingual households that switch languages mid-conversation.
- Accessibility experiences tailored to individual speech patterns.
- Fully local, private voice assistants for users who want voice control without cloud dependence -- in their own language.
- AI agents that control the smart home by voice -- "turn off the lights and lock the door," "set the bedroom to 20 degrees," multi-step routines -- spoken in any language and confirmed back through the Sonos, using whatever platform the user already runs (Home Assistant, Matter/Thread controllers, vendor APIs).
Every one of these gives a previously-excluded community a reason to talk publicly about Sonos -- which is exactly the growth mechanism described above.
