Push-to-talk vs always-listening dictation

Dictation tools come in two broad shapes, and the difference between them is not a small detail. It changes who is in control of the microphone, how the tool feels to use, and how much you have to trust it. This post lays out both models plainly so you can decide which one you actually want.

The two shapes are always-listening and push-to-talk. Lispr is push-to-talk, and this post explains the reasoning — but the goal is to be fair to both, because each one is a reasonable answer to a real question.

The two models

Always-listening dictation keeps the microphone open and waits. Usually there is a wake word — you say a trigger phrase and the tool starts paying attention to what comes next. The promise is convenience: you never have to do anything with your hands. You just talk, and the tool is already there.

Push-to-talk dictation does nothing until you tell it to. You hold a key, speak, and release. The microphone is active only during that hold, and only because you are physically holding it. The promise is control: the tool listens exactly when you decide and not one moment longer.

Both can transcribe your speech well. The difference is not accuracy. It is the relationship between you and the microphone.

Control

With push-to-talk, the boundary of "is this being recorded" is something you can feel. Your finger is on the key or it is not. There is no ambiguity, no in-between state, no wondering. The tool's attention has a clear physical edge, and you are holding it.

With an always-listening tool, that boundary is fuzzier. The microphone is open continuously, and the tool is making its own judgment about what counts as a command and what is just the room. Most of the time it judges correctly. But the line between "listening" and "acting" lives inside the software, not in your hand, and you have to take its word for where that line is.

This shows up in everyday small ways:

Accidental triggers. Always-listening tools sometimes wake up to something on a podcast, a phrase in a meeting, a word that sounded like the trigger. Push-to-talk cannot do this, because nothing happens unless you hold the key.
Half-formed thoughts. With push-to-talk you can think, pause, abandon a sentence, and start over, and none of it is captured — you simply have not pressed the key yet. An open mic does not give you that quiet space to think out loud first.
Knowing when you are "on." Holding a key is an unmistakable signal to yourself. There is no equivalent clear signal with a mic that is always on.

Privacy

This is where the two models differ most, and it is worth being precise rather than alarmist.

An always-listening tool, by design, has the microphone open whenever it is running. To detect a wake word, something has to be processing audio continuously. A well-built tool does this carefully and locally. But the architecture requires a continuously open microphone, and that is simply a larger surface to trust.

Push-to-talk inverts this. The microphone is closed by default. It opens only while you hold the key, and it closes the instant you release. The amount of audio the tool ever has access to is exactly the amount you deliberately handed it. There is no continuous capture to reason about, because there is no continuous capture.
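To make the contrast concrete, here is a minimal sketch of the two models as a toy simulation. It is not Lispr's code: the PushToTalkGate class, the frame strings, and the frames_seen_always_listening helper are all hypothetical names invented for illustration. A real push-to-talk app would typically not open the input stream at all while the key is up; the sketch models that by dropping those frames without ever buffering them.

```python
from dataclasses import dataclass, field


@dataclass
class PushToTalkGate:
    """Hypothetical model of a push-to-talk microphone gate."""

    key_held: bool = False
    captured: list = field(default_factory=list)

    def key_down(self) -> None:
        # The mic "opens" only at this moment.
        self.key_held = True

    def key_up(self) -> list:
        # The mic "closes"; hand back the audio from this hold only.
        self.key_held = False
        utterance, self.captured = self.captured, []
        return utterance

    def on_frame(self, frame: str) -> None:
        # Frames arriving while the key is up are dropped, never buffered.
        if self.key_held:
            self.captured.append(frame)


def frames_seen_always_listening(frames: list) -> int:
    """A wake-word detector has to inspect every frame to find the trigger."""
    return len(frames)


# Simulated room audio: the user holds the key for frames 2 and 3 only.
room = ["chatter-1", "chatter-2", "hello", "world", "chatter-3"]
gate = PushToTalkGate()
utterance = []
for i, frame in enumerate(room):
    if i == 2:
        gate.key_down()
    gate.on_frame(frame)
    if i == 3:
        utterance = gate.key_up()

print(utterance)                           # ['hello', 'world']
print(frames_seen_always_listening(room))  # 5
```

In this toy run the push-to-talk gate ends up holding two frames, the two spoken during the hold, while the always-listening path has had to look at all five.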

For a lot of people this is the deciding factor. Dictation often happens in offices, shared homes, and rooms where other conversations are going on. A tool that only ever hears the specific seconds you chose to give it is a much easier thing to feel comfortable with than one whose normal state is listening.

Lispr is push-to-talk for this reason. You hold the right Option key, you speak, you release. Between holds the microphone is doing nothing. The audio from each hold is sent over an encrypted connection, transcribed by the Whisper speech model, and then discarded: nothing is stored, and nothing trains a model. Push-to-talk keeps the amount of audio minimal; the rest of the design keeps what little there is from lingering.

Effort

The honest case for always-listening is that it asks nothing of your hands. For some situations that genuinely matters — across the room, hands full, accessibility needs that make holding a key difficult. If your hands cannot reliably hold a key, hands-free is not a luxury, it is the point. That is a real and good use of the model.

Push-to-talk does ask for one thing: a key held down while you speak. The question is how much that costs. With a well-placed key the answer is close to nothing. Holding a modifier key with one finger while you talk is a motion you stop noticing within a day. And in exchange for that small, steady effort you get a tool that cannot be woken by a stray phrase and never listens unasked.

There is also a quieter benefit. The act of pressing the key is a deliberate decision to dictate. That tiny moment of intention tends to produce more focused speech than an open mic, where it is easy to start narrating before you have decided what to say. We dug into how to use that to your advantage in our post on better dictation accuracy.

Which one should you choose

It comes down to one honest question: do you want the microphone open by default, or closed by default?

Choose always-listening if hands-free operation is essential for you, whether for accessibility reasons or because your hands are genuinely occupied while you need to dictate. The continuously open mic is the cost of that, and for these needs it is a fair trade.
Choose push-to-talk if you want the microphone closed unless you are actively, physically choosing to use it. You accept one small motion — holding a key — and in return you get precise control and a much smaller thing to trust.

For most people at a Mac, doing everyday dictation into emails, documents, and messages, push-to-talk is the better fit. The key is right there, the effort is trivial, and "the mic is off unless I am holding it" is a simple, reassuring thing to be able to say.

Where Lispr stands

We built Lispr as a push-to-talk tool on purpose. It is a small macOS app with no window and no account: hold the right Option key, speak, release, and the recognized text appears at your cursor in whatever app you are using. The microphone is closed every second you are not holding that key.

That design is a deliberate choice about control. You decide when the tool listens, the boundary is something you can feel under your finger, and the tool never gets more of your voice than you chose to give it. If that is the relationship you want with a dictation tool, Lispr is free in early access and quick to try.