What is speech to text? A plain guide

Speech to text does exactly what the name says: it listens to you talk and writes down the words. You speak a sentence, and a moment later that sentence is sitting there as text — in an email, a document, a chat box, a line of code.

It is not new. What is new is that it finally works well enough to use all day without fighting it. For most of its history, dictation was a niche tool: slow, error-prone, something you set up once and quietly abandoned. In the last few years that changed, and it changed fast.

This guide explains what speech to text actually is, how it works, what it is genuinely good at, and where it still falls short — so you can decide whether it belongs in your day.

How it works

Under the hood, speech to text is a chain of three steps.

First, your microphone captures sound — your voice, plus whatever else is in the room. Second, a speech recognition model analyzes that audio and predicts the most likely sequence of words. Third, that text is handed back to you, ideally with punctuation and capitalization already in place.
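For readers who think in code, the three steps above can be sketched roughly like this. It is an illustration only: `Pipeline`, `tidy`, `fake_capture`, and `fake_recognize` are made-up names, not any real library's API, and the stand-in functions exist only so the sketch runs without a microphone or a model. In real tools the model usually produces the punctuation itself; the separate `tidy` step is there only to make the third stage visible.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type for illustration: raw samples from a microphone.
Audio = List[float]


def tidy(raw: str) -> str:
    """Step 3 stand-in: capitalize the first letter and end with a period."""
    text = raw.strip()
    if not text:
        return ""
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        text += "."
    return text


@dataclass
class Pipeline:
    capture: Callable[[], Audio]        # step 1: microphone -> audio
    recognize: Callable[[Audio], str]   # step 2: model -> most likely words

    def run(self) -> str:
        audio = self.capture()
        raw_words = self.recognize(audio)
        return tidy(raw_words)          # step 3: hand back readable text


# Stand-ins (assumptions) so the sketch runs on its own.
def fake_capture() -> Audio:
    return [0.0, 0.1, -0.1]


def fake_recognize(audio: Audio) -> str:
    return "send the report by friday"


pipeline = Pipeline(capture=fake_capture, recognize=fake_recognize)
print(pipeline.run())
```

Swapping `fake_recognize` for a call into a real speech model is where all the hard work lives; the shape of the chain stays the same.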

The middle step is where the real progress happened. Older systems matched sounds to words with hand-built rules and statistics. Modern systems use large neural networks trained on enormous amounts of recorded speech. One of the best-known is OpenAI's Whisper model, which can transcribe around a hundred languages and copes with accents, background noise, and natural pauses far better than those older systems did.

That model runs in one of two places: on your device, or in the cloud. On-device transcription keeps the audio on your machine and works offline. Cloud transcription sends the audio to a server, which is usually faster and more accurate but means your voice leaves the device. Neither is automatically "better" — it is a trade-off, and a tool should be honest about which one it uses.

What it is good at

Speech to text earns its place because of one simple fact: you talk faster than you type.

Most people type at around 40 words per minute. Most people speak at 130 to 150 words per minute. That gap is real, and it compounds across a workday full of small writing: messages, replies, notes, comments, the first rough draft of anything.
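To put rough numbers on that gap, here is a small back-of-the-envelope calculation. The 40 and 130 to 150 words-per-minute figures come from the paragraph above; the daily word count of 2,000 is purely an assumption for illustration, and the sketch ignores the time spent correcting a transcript, so treat the result as an upper bound rather than a promise.

```python
def minutes_to_write(words: int, words_per_minute: float) -> float:
    """Time needed to produce `words` at a steady pace, in minutes."""
    return words / words_per_minute


TYPING_WPM = 40      # typical typing speed cited above
SPEAKING_WPM = 140   # midpoint of the 130-150 speaking range cited above

daily_words = 2000   # assumption: a day of messages, notes, and drafts

typing = minutes_to_write(daily_words, TYPING_WPM)
speaking = minutes_to_write(daily_words, SPEAKING_WPM)

print(f"typing: {typing:.0f} min, speaking: {speaking:.0f} min, "
      f"saved: {typing - speaking:.0f} min")
```

Under those assumptions, the same words take about 50 minutes to type and about 14 to say.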

It is especially good at:

First drafts. Talking out a rough version is faster than composing it letter by letter. You can tidy it up afterward.
Short, frequent writing. Chat replies, quick emails, a comment on a pull request. The overhead of "open a thing, type a thing" disappears.
Writing when typing hurts. For anyone managing RSI, carpal tunnel, or any condition that makes a keyboard painful, voice is not a productivity hack — it is the difference between working comfortably and not.
Capturing a thought before it escapes. Ideas arrive faster than fingers move. Saying it out loud catches it.

Where it still struggles

An honest guide has to cover the other side.

Speech recognition is not perfect, and no tool can honestly promise a perfect transcript. Expect the occasional wrong word, especially with unusual names, technical jargon, and acronyms. Always glance at the result before you rely on it.

It also asks something of you that typing does not: you have to speak. That is fine at a desk and awkward on a quiet train or in an open office where you would rather not narrate your email to the room. Some of this is habit — dictation feels strange for about a week and then it doesn't — but some of it is genuinely situational.

And it does not replace editing. Voice is excellent for getting words out. Shaping them — cutting, reordering, polishing — is still mostly a job for your hands and eyes.

Free vs paid

Every major operating system ships free dictation: Apple Dictation on macOS and iOS, Google's voice typing on Android, and voice typing (Win+H) on Windows. These are genuinely useful and cost nothing. For occasional use they are often enough.

Dedicated tools tend to win on three things: speed (less waiting for the text to appear), accuracy (better models, better punctuation), and friction (one gesture, works in every app, nothing to configure). Whether that is worth paying for depends entirely on how much you write. If dictation is a once-a-week thing, the built-in option is fine. If you write all day, the difference adds up.

How to choose a tool

A few questions cut through the marketing:

Where does my audio go? On-device or cloud? If cloud, is it stored, or transcribed and discarded? A tool should answer this plainly.
Does it need an account? Some do, some don't. Fewer accounts means fewer things to manage.
Does it work everywhere? The best dictation works in any app, not just one editor.
How fast is it? A second of lag per phrase does not sound like much until you feel it a hundred times a day.
How much does it ask of me? The ideal tool is one gesture and then it gets out of the way.

Where Lispr fits

Lispr is our take on that last point. It is a small macOS app: you hold a key, speak, let go, and the text lands wherever your cursor was — in any app. No window, no account, no subscription. Audio is sent over an encrypted connection purely to be transcribed, then discarded; nothing is stored, nothing trains a model.

It is not the only good option, and we say so plainly. But if "talk, and it is just there" is what you want from speech to text, that is exactly what we built it to be.