
How speech recognition actually works

April 16, 2026 · 6 min read

You hold a key, say a sentence, and a few hundred milliseconds later the words are on your screen. It feels like magic, but it is not. It is a chain of fairly understandable steps. This post walks through that chain in plain language — no math, no jargon dump — so you have a real mental model of what is happening when you dictate.

Understanding it is useful even if you never think about it again, because it explains why speech recognition makes the mistakes it makes, and why it got so much better recently.

The chain, start to finish

At the highest level, speech recognition is three steps:

  1. A microphone turns sound into numbers.
  2. A speech model turns those numbers into text.
  3. The text is placed wherever you need it.

That is the whole shape of it. Everything else is detail inside step two.
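If it helps to see that shape as code, here is a minimal sketch. The function and parameter names (`run_dictation`, `capture`, `transcribe`, `deliver`) are made up for illustration, and the stand-ins at the bottom replace a real microphone and a real model so the example runs on its own.

```python
from typing import Callable, List


def run_dictation(
    capture: Callable[[], List[float]],
    transcribe: Callable[[List[float]], str],
    deliver: Callable[[str], None],
) -> str:
    """Run the three steps in order: sound -> numbers -> text -> destination."""
    samples = capture()          # step 1: microphone turns sound into numbers
    text = transcribe(samples)   # step 2: speech model turns numbers into text
    deliver(text)                # step 3: text is placed where it is needed
    return text


# Hypothetical stand-ins so the sketch runs without a microphone or a model.
delivered: List[str] = []
result = run_dictation(
    capture=lambda: [0.0, 0.1, -0.1],
    transcribe=lambda samples: f"<{len(samples)} samples transcribed>",
    deliver=delivered.append,
)
print(result)
```

Almost all of the real work hides behind `transcribe`; the other two steps are thin by comparison.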

Step one: from sound to numbers

When you speak, you push air. That moving air vibrates a small membrane in your microphone, and those vibrations are measured thousands of times a second. Each measurement is just a number describing the air pressure at that instant. String them together and you have a digital recording — a long list of numbers that traces the shape of the sound wave.
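To make "a long list of numbers" concrete, the sketch below fakes a microphone by measuring a pure tone at regular intervals. The 16,000 measurements per second is an assumption (a common rate for speech audio), and `sample_tone` is a hypothetical helper, not part of any recording library.

```python
import math
from typing import List

SAMPLE_RATE = 16_000  # measurements per second (assumed; a common rate for speech)


def sample_tone(freq_hz: float, seconds: float, rate: int = SAMPLE_RATE) -> List[float]:
    """Measure a pure tone `rate` times per second, roughly as a microphone would."""
    count = round(rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * n / rate) for n in range(count)]


samples = sample_tone(freq_hz=440.0, seconds=0.01)
print(len(samples))   # ten milliseconds of sound becomes 160 numbers
print(samples[:3])    # the first few pressure readings
```

A one-second sentence at this rate is 16,000 numbers, none of which says anything about words yet.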

This is raw and not very meaningful yet. A list of pressure readings does not "know" anything about words. Before the model sees it, the audio is usually reshaped into something that highlights the parts that matter for speech — roughly, which pitches are present and how loud, moment by moment. You can picture it as a heat map of sound over time. This representation throws away things the model does not need and keeps the patterns that distinguish one sound from another.
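Here is a rough sketch of that reshaping, using only the Python standard library. It cuts the audio into short frames and measures how strong each pitch is in each frame. Real systems typically use overlapping frames and a scale tuned to human hearing; this simplified version, with its assumed sample rate and frame size, only shows the idea.

```python
import cmath
import math
from typing import List


def frame_spectrum(frame: List[float]) -> List[float]:
    """Strength of each frequency bin in one short slice of audio (naive DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2)
    ]


def spectrogram(samples: List[float], frame_size: int) -> List[List[float]]:
    """Cut the audio into frames; each row is a moment in time, each column a pitch."""
    return [
        frame_spectrum(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, frame_size)
    ]


RATE = 8_000   # assumed sample rate for this toy example
FRAME = 64     # assumed frame length; each bin then spans RATE / FRAME = 125 Hz

# A low tone (500 Hz) followed by a higher one (1500 Hz).
low = [math.sin(2 * math.pi * 500 * n / RATE) for n in range(FRAME)]
high = [math.sin(2 * math.pi * 1500 * n / RATE) for n in range(FRAME)]

picture = spectrogram(low + high, FRAME)
loudest_bins = [row.index(max(row)) for row in picture]
print(loudest_bins)  # [4, 12]: 500 Hz lands in bin 4, 1500 Hz in bin 12
```

The raw samples never said "the pitch went up." The picture does, which is the kind of pattern a model can work with.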

So by the end of step one, your sentence is a picture of sound, ready to be read.

Step two: from numbers to text

This is the hard part, and it is where the technology has changed the most. To appreciate today's systems, it helps to see what came before.

The old way: rules and dictionaries

Early speech recognition was built by hand. Engineers wrote out the sounds that make up each word, defined the small units of speech, and used statistical models to guess which sequence of sounds was most likely. A separate language component held a list of which words tend to follow which.
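A toy version of the hand-built dictionary shows where the brittleness comes from. The entries and sound symbols below are illustrative only, and a real system of that era scored likely sound sequences statistically rather than doing an exact lookup, so treat this as a sketch of the dictionary piece alone.

```python
from typing import Dict, List, Optional, Tuple

# Hand-written pronunciation dictionary (illustrative entries and symbols).
PRONUNCIATIONS: Dict[Tuple[str, ...], str] = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}


def lookup_word(sounds: List[str]) -> Optional[str]:
    """Return the dictionary word for a sound sequence, or None if it is not listed."""
    return PRONUNCIATIONS.get(tuple(sounds))


print(lookup_word(["HH", "AH", "L", "OW"]))  # hello
print(lookup_word(["HH", "AY"]))             # None: "hi" was never written into the dictionary
```

Every word someone might say has to be entered by a person first, which is exactly the kind of edge where such systems broke.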

It worked, sort of. But it was brittle. It needed you to speak slowly and clearly, often with pauses between words. It struggled with accents it had not been tuned for, with background noise, and with anything outside its dictionary. It was a stack of carefully assembled rules, and like any such stack, it broke at the edges.

The modern way: neural speech models

Today's systems work differently. Instead of being told the rules of speech, a neural network is shown an enormous amount of example audio paired with the correct text: many thousands of hours of people speaking, in many languages, many accents, many noise conditions. From all those examples, the model learns the patterns itself: which sound shapes correspond to which words, how words fit together, how to handle a cough or an "um."
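The core idea, learning from labeled examples instead of written rules, fits in a few lines. To be clear about the assumptions: this is not a neural network. It is a far simpler stand-in (averaging the examples for each label and picking the closest average), and the two-number "sound features" are made up.

```python
from typing import Dict, List, Tuple

Example = Tuple[List[float], str]  # (feature vector, correct label)


def learn_centroids(examples: List[Example]) -> Dict[str, List[float]]:
    """Average the feature vectors seen for each label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}


def classify(features: List[float], centroids: Dict[str, List[float]]) -> str:
    """Return the label whose learned average is closest to the input."""
    def distance(label: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=distance)


# Made-up "sound features" paired with the correct text.
training = [
    ([0.9, 0.1], "yes"), ([0.8, 0.2], "yes"),
    ([0.1, 0.9], "no"), ([0.2, 0.7], "no"),
]
model = learn_centroids(training)
print(classify([0.85, 0.15], model))  # yes
print(classify([0.15, 0.8], model))   # no
```

Nobody wrote a rule saying what "yes" sounds like. The examples did that work, and more varied examples would make the result more robust.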

The widely used open speech model Whisper, from OpenAI, is the best-known example of this approach. Because it learned from such varied real-world audio, it is far more robust than the old rule-based systems. It tolerates accents, casual fast speech, imperfect microphones, and noisy rooms much better, and it handles many languages without being separately programmed for each.

Crucially, a modern model does not transcribe in a vacuum. It uses context, meaning the surrounding words, to decide what it heard. If two words sound identical, the model leans on which one makes sense in the sentence. That is also why it can add punctuation: it has learned where sentences tend to end and where commas tend to fall. We go deeper on that in our post on how automatic punctuation works.
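A small sketch of leaning on context to choose between words that sound alike. The counts are invented, and a neural model does not keep an explicit table like this one; it learns the same kind of preference implicitly. The table just makes the decision visible.

```python
from typing import Dict, List, Tuple

# Toy counts of how often one word follows another (made-up numbers).
FOLLOW_COUNTS: Dict[Tuple[str, str], int] = {
    ("over", "there"): 50,
    ("over", "their"): 1,
    ("lost", "their"): 40,
    ("lost", "there"): 2,
}


def pick_by_context(previous_word: str, candidates: List[str]) -> str:
    """Choose the candidate that most often follows the previous word."""
    return max(candidates, key=lambda word: FOLLOW_COUNTS.get((previous_word, word), 0))


print(pick_by_context("over", ["there", "their"]))  # there
print(pick_by_context("lost", ["there", "their"]))  # their
```

The audio for both candidates is identical. Only the neighboring word tips the choice, which is also why a wrong guess tends to be a plausible-sounding one.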

If you want the longer version of this story, see our post on what speech-to-text is.

On-device or in the cloud

The model has to run somewhere, and there are two choices.

On-device means the model runs on your own computer. Nothing leaves the machine. The trade-off is that a model small enough to run smoothly on a laptop is usually a less capable one, and it competes with everything else your computer is doing.

In the cloud means your audio is sent over an encrypted connection to a server, transcribed there by a large, fast model, and the text comes back. The benefits are accuracy and speed; the consideration is that your audio briefly leaves your machine. A well-built cloud tool uses the audio only to transcribe and then discards it — it is not stored and not used to train anything.

Neither is automatically better. It is a genuine trade-off, and we lay it out fully in our post on cloud versus on-device transcription. What matters is that a tool is clear about which it does and why.

Step three: getting the text to you

Once the model has produced text, something has to put it where you want it. A dictation app inserts the words at your cursor, so they land in whatever you are writing — an email, a document, a chat box. This sounds trivial next to the modeling, but it is the part that makes speech recognition feel like typing with your voice rather than a separate tool you copy out of.
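The effect of that last step can be modeled as a plain string operation. This is only a sketch of the result: a real dictation app has to go through the operating system to reach another app's text field, and `insert_at_cursor` is a hypothetical helper rather than any app's actual code.

```python
from typing import Tuple


def insert_at_cursor(document: str, cursor: int, text: str) -> Tuple[str, int]:
    """Insert `text` at `cursor`; return the new document and the new cursor position."""
    cursor = max(0, min(cursor, len(document)))  # keep the cursor inside the document
    updated = document[:cursor] + text + document[cursor:]
    return updated, cursor + len(text)


doc, pos = insert_at_cursor("Hi Sam, . Thanks!", 8, "see you at noon")
print(doc)  # Hi Sam, see you at noon. Thanks!
print(pos)  # 23
```

The cursor ends up right after the inserted words, so the next thing you say or type continues from there.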

Why this explains the mistakes

With the chain in mind, the errors make sense:

None of these are bugs. They are the visible edge of a system that is, underneath, making informed guesses. Knowing that helps you use it well: speak clearly, cut obvious noise, and glance at the result before you rely on it.

The short version

Speech recognition is a microphone turning sound into numbers, a neural model trained on huge amounts of real speech turning those numbers into text, and an app placing that text where you need it. The leap from hand-written rules to learned neural models is why dictation went from a frustrating novelty to something genuinely fast and reliable.

A modern push-to-talk app like Lispr is this whole chain compressed into one key: hold the right Option key, speak, release, and the text appears — usually in about the time it takes to lift your finger.

Try Lispr

Voice to text in any Mac app — hold a key, talk, let go. Free, no account, ~4 MB.
