Where your voice actually goes when you dictate

When you speak to your computer and watch text appear, a lot happens in between. Your voice is captured, turned into a digital recording, processed by a speech-recognition model, and converted into words. The question most people never ask is the most important one: where does that recording physically go, and what happens to it afterward?

This post answers that for the different kinds of dictation tools, then describes exactly what Lispr does. The aim is not to scare you. It is to make an invisible process visible, so you can make an informed choice.

The journey of a spoken sentence

Strip away the branding and every dictation tool does the same four things:

1. Capture. Your microphone records the audio. This part is always local; it has to be.
2. Transcribe. A speech model converts the audio into text. This is the step that runs either on your machine or on a server, and that choice is everything.
3. Deliver. The text is placed where you want it: into a document, a message, a field.
4. Dispose, or not. The audio is either discarded immediately or kept. This is the step nobody advertises, and the one most worth understanding.

The interesting variation is in steps 2 and 4.
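
One way to picture the four steps is as a small data model. The sketch below is illustrative only: the names `Where`, `AudioFate`, `DictationPath`, and `describe` are made up for this post and do not come from any real dictation tool. It treats capture as always local and lets only steps 2 and 4 vary.

```python
from dataclasses import dataclass
from enum import Enum


class Where(Enum):
    ON_DEVICE = "on-device"
    CLOUD = "cloud"


class AudioFate(Enum):
    DISCARDED = "discarded"
    KEPT = "kept"


@dataclass(frozen=True)
class DictationPath:
    """Hypothetical model of a dictation tool: only steps 2 and 4 vary."""
    transcribe: Where   # step 2: where the speech model runs
    dispose: AudioFate  # step 4: what happens to the audio afterward

    def audio_leaves_device(self) -> bool:
        # Capture is always local, so audio leaves only for cloud transcription.
        return self.transcribe is Where.CLOUD


def describe(path: DictationPath) -> list[str]:
    """Return the four steps as human-readable lines."""
    return [
        "1. Capture: on-device",
        f"2. Transcribe: {path.transcribe.value}",
        "3. Deliver: text placed at the destination",
        f"4. Dispose: audio {path.dispose.value}",
    ]
```

Two tools can then be told apart by the two fields that matter, for example `DictationPath(Where.CLOUD, AudioFate.KEPT)` versus `DictationPath(Where.ON_DEVICE, AudioFate.DISCARDED)`.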

On-device tools

With on-device transcription, step 2 happens on your own Mac. The recording is captured, processed, and turned into text without leaving the machine. Apple Dictation on Macs with Apple silicon works this way for many languages, as do local apps built on the open Whisper model, like MacWhisper.

Here, the answer to "where does my voice go" is short: nowhere. It stays on your hardware. There is no transmission to secure and no server retention policy to read, because there is no server. The disposal question in step 4 is also simple — the audio lives and dies on your machine. This is the strongest privacy posture available, and it is structural rather than a promise.

Cloud tools

With cloud transcription, step 2 happens on a server. Your audio is sent over the network to a speech service, transcribed there, and the text is sent back. Many modern dictation apps work this way.

Now "where does my voice go" has a real answer with several parts, and they vary enormously between tools:

Is the connection encrypted? It should be. Audio in transit ought to be unreadable to anyone in between.
Is the audio stored after transcription? Some services discard it immediately. Others keep it, whether for "quality," for debugging, or for reasons they never state. This is the single biggest difference between two cloud tools that otherwise look identical.
Is the audio used to train a model? Some services improve their models using customer audio. That means recordings of your voice become part of a training set. Others contractually do not.
Is there an account? An account links every transcription to an identity, which turns dictation into a profile. Account-free tools cannot do that. We dig into this in our piece on dictation without an account.

Two cloud apps can sit at completely opposite ends of this spectrum. One encrypts, discards immediately, never trains, and needs no account. Another stores indefinitely, trains on your speech, and ties it all to a login. They both "transcribe in the cloud," but they are not remotely the same as far as your privacy is concerned. The category label tells you almost nothing — the specifics tell you everything.
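
To make the contrast concrete, here is a minimal sketch that records the four specifics for two imaginary cloud tools and lists what each one exposes. Everything in it is hypothetical: `CloudPolicy`, `concerns`, and the two example tools are invented for illustration and describe no real product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CloudPolicy:
    """Hypothetical summary of how a cloud dictation tool handles audio."""
    encrypted_in_transit: bool
    retention_days: Optional[int]  # 0 = discarded right away, None = kept indefinitely
    trains_on_audio: bool
    requires_account: bool


def concerns(policy: CloudPolicy) -> list[str]:
    """List the privacy concerns a policy raises; an empty list means none."""
    found = []
    if not policy.encrypted_in_transit:
        found.append("audio is readable in transit")
    if policy.retention_days is None:
        found.append("audio is stored indefinitely")
    elif policy.retention_days > 0:
        found.append(f"audio is stored for {policy.retention_days} days")
    if policy.trains_on_audio:
        found.append("audio is used for model training")
    if policy.requires_account:
        found.append("transcriptions are tied to an account")
    return found


# Both tools "transcribe in the cloud", yet the specifics differ completely.
tool_a = CloudPolicy(encrypted_in_transit=True, retention_days=0,
                     trains_on_audio=False, requires_account=False)
tool_b = CloudPolicy(encrypted_in_transit=True, retention_days=None,
                     trains_on_audio=True, requires_account=True)

for name, tool in (("Tool A", tool_a), ("Tool B", tool_b)):
    print(name, concerns(tool) or "no concerns found")
```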

The questions worth asking

For any dictation tool you are considering, you can get a clear picture with five questions:

1. Does transcription run on-device or in the cloud? This frames everything else.
2. If it runs in the cloud, is the connection encrypted?
3. Is my audio stored after it is transcribed, and if so for how long?
4. Is my audio used to train a model?
5. Do I need an account, and what is linked to it?

The answers should be findable in the privacy policy. If they are not, or if a tool is vague about storage or training, treat the vagueness itself as an answer. A tool confident in its handling tends to state it plainly. We expand this into a full checklist in our piece on whether voice dictation is private.
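
If you like to keep notes while reading a privacy policy, the "vagueness is an answer" rule is easy to encode. The sketch below is one possible way to do it; the question keys and the `unanswered` helper are assumptions made for this example, not part of any tool.

```python
from typing import Mapping, Optional

# The five questions, as short keys (names are illustrative).
QUESTIONS = (
    "on_device_or_cloud",
    "encrypted_connection",
    "audio_retention",
    "used_for_training",
    "account_required",
)


def unanswered(notes: Mapping[str, Optional[str]]) -> list[str]:
    """Return the questions the policy left blank or never addressed.

    A missing key, None, or an empty string all count as "no answer",
    and each one is treated as a red flag rather than ignored.
    """
    return [q for q in QUESTIONS if not (notes.get(q) or "").strip()]


# Example notes taken while reading an imaginary policy.
notes = {
    "on_device_or_cloud": "cloud",
    "encrypted_connection": "yes, stated explicitly",
    "audio_retention": "",       # the policy is silent on storage
    "used_for_training": None,   # the policy is silent on training
    "account_required": "no",
}

print(unanswered(notes))
```

Here the two silent topics, storage and training, are exactly the ones the function reports back.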

What Lispr does with your voice

Here is Lispr's path, step by step, with nothing left out.

When you hold the right Option key and speak, your microphone captures the audio locally. On release, that audio travels over an encrypted connection to be transcribed. The destination is a Whisper speech model running on Groq, reached through a Cloudflare edge proxy. The model converts the audio to text, and the text comes back and appears at your cursor in whatever Mac app you are using.

Then the important part: the audio is discarded. Nothing is stored on a server. Nothing is kept for "quality" or analysis. And nothing is used to train a model — your voice does not become part of any training set. There is no account and no sign-up, so there is no identity for transcriptions to be tied to in the first place. The whole round trip takes around 200 milliseconds.
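
The transcribe-then-discard pattern can be sketched in a few lines. This is not Lispr's actual code: `transcribe_then_discard` and the `transcribe` callback are hypothetical stand-ins, and Python gives no guarantee that no other copy of the data exists elsewhere in memory. The sketch only shows the control flow: the audio is handed to the model, the text comes back, and the buffer is wiped whether or not transcription succeeded.

```python
from typing import Callable


def transcribe_then_discard(
    audio: bytearray,
    transcribe: Callable[[memoryview], str],
) -> str:
    """Run a transcription callback, then wipe the audio buffer.

    `transcribe` is a placeholder for whatever speech model is used.
    Nothing here writes the audio to disk or to a log.
    """
    try:
        # Pass a view so the callback reads the buffer without copying it.
        with memoryview(audio) as view:
            return transcribe(view)
    finally:
        # Runs on success and on failure: zero the samples, then empty the buffer.
        audio[:] = bytes(len(audio))
        audio.clear()


# Usage with a fake model that only reports how much audio it received.
recording = bytearray(b"\x01\x02\x03\x04")
text = transcribe_then_discard(recording, lambda view: f"{len(view)} bytes heard")
print(text, len(recording))
```

Putting the cleanup in `finally` is the design choice that matters: a failed transcription is not a reason to keep the recording around.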

We will be candid about what this does not give you. Because Lispr transcribes in the cloud, your audio does leave your Mac during step 2. It has to, for the cloud model to do its work. We discard it and store nothing, but we cannot offer the structural guarantee of an on-device tool, where the audio never moves at all. If that structural guarantee is your requirement, an on-device option is the more honest fit, and we say so plainly in our comparison of cloud vs on-device transcription. You can also read our full privacy policy.

Closing

Dictation feels like magic because the middle of the process is invisible. But there is no magic — just audio going somewhere and being handled some way. The tools worth trusting are the ones willing to describe that path plainly: where the audio goes, whether it is kept, whether it trains anything, and what is tied to your name. Ask those questions of any tool you use. The answers, not the marketing, are what matter.