
Dictation Went Local: Speak Your Notes, Keep the File

MNMNOTE
local-first, dictation, speech-to-text, privacy, markdown

Dictation no longer means the cloud. Open speech-to-text models are now small and fast enough to run on your own device, so your voice never has to reach a vendor's servers. That solves the privacy worry most people focus on. It leaves a quieter question unanswered: once your words become text, where does the text live?

For a decade, speaking to a machine meant sending a recording somewhere. The shift began in September 2022, when OpenAI open-sourced Whisper, a "general-purpose speech recognition model," under an MIT license, with its public repository created on 2022-09-16 1. Less than four years later, the picture is unrecognizable. In February 2026, a community demonstration ran a 4B speech-to-text model with "Pure C, CPU-only inference," no GPU required 2. The capability that once lived in a data center now fits on a laptop.

What most people believe about dictation

Most people still assume dictation is a cloud service. You hold a button, your voice travels to a remote server, and text comes back. That assumption was reasonable for years, because accurate transcription genuinely required hardware most personal devices did not have. The trade was convenience for a recording you no longer controlled.

The belief was never paranoid. Voice is among the most revealing data a person produces, and the law treats it that way. Under the EU's General Data Protection Regulation, "'biometric data' means personal data resulting from specific technical processing relating to the physical, physiological or behavioural characteristics of a natural person, which allow or confirm the unique identification of that natural person" 3. A recording of your voice is a rich, sensitive source — not just words, but the body that made them.

Why sending your voice away was always a risk

The risk in cloud dictation was never hypothetical. In July 2019, The Guardian reported that Apple contractors "regularly hear confidential details" inside Siri recordings sent for quality review 4. One contractor described the contents bluntly: "There have been countless instances of recordings featuring private discussions between doctors and patients, business deals, seemingly criminal dealings, sexual encounters and so on" 4.

What makes the episode hard to dismiss is not the whistleblower's account alone but Apple's own response. Apple's statement at the time confirmed the grading data "is used to help Siri and dictation … understand you better and recognise what you say" 4. The word dictation is right there. Weeks later, on 2019-08-28, Apple said it had "immediately suspended human grading," that "by default, we will no longer retain audio recordings of Siri interactions," and that "users will be able to opt in" 5. The company added: "As a result of our review, we realize we haven't been fully living up to our high ideals, and for that we apologize" 5.

GDPR underlines why this matters. Article 9 lists "biometric data for the purpose of uniquely identifying a natural person" among the special categories whose processing is, by default, prohibited absent a specific lawful basis 6. The point is not that every voice note is a regulated special category — that depends on purpose. The point is the stakes. When a recording leaves your device, you are trusting someone else with the most identifying thing you own.

What changed: the models came home

What changed is size. Transcription that once needed a data center now runs on hardware you already have, and the models keep shrinking. Whisper's 2022 open-weights release was the inflection point 1. By early 2026, the wave was unmistakable: a steady run of on-device speech projects, from models of a few dozen megabytes to multi-billion-parameter transcribers that run without a GPU. It is the same shift covered in "local AI is becoming the default," now reaching the moment you speak rather than the moment you query.

The data points stack up across H1 2026. A Show HN post in March 2026 announced "Three new Kitten TTS models – smallest less than 25MB," a sub-25-megabyte voice model, drawing 561 points 7. Kitten is text-to-speech rather than transcription, but it shows how small on-device speech models have become. In February 2026, the "Pure C, CPU-only inference with Mistral Voxtral Realtime 4B speech to text model" demonstration showed a 4-billion-parameter transcriber running without a GPU 2. The makers of Moonshine, an on-device toolkit, report models offering "higher accuracy than Whisper Large V3 at the top end … down to tiny 26MB models," and that "Everything runs on-device, so it's fast, private, and you don't need an account, credit card, or API keys" 8. That accuracy claim is the vendor's own; treat it as such.

This is the honest caveat. Local has historically traded accuracy for privacy. The 2019 Preech study put it plainly: "Although offline and open-source ASR eliminates the privacy risks, its transcription performance is inferior to that of cloud-based ASR systems, especially for real-world use cases" 9. That was true. It is becoming less true each quarter — which is exactly why the local option is worth taking seriously now and was easy to dismiss five years ago. On-device transcription still asks something of your machine, and the smallest models trade accuracy for size. The gap is narrowing, not gone.

The question nobody asks: where does the transcript live?

Here is the move almost no one makes. Once your audio stays local, attention moves to a second destination that is just as important and far less discussed: the text itself. You can record privately and still hand the transcript to a service that keeps it, indexes it, and locks it inside its own format. Local capture without local keeping is half a solution.

A third-party project arrives at the same conclusion from a different direction. Whispering, part of the open-source Epicenter project, describes itself plainly: "Whispering is an open-source speech-to-text application. Press a keyboard shortcut, speak, and your words will transcribe, transform, then copy and paste at the cursor" 10. Its author explains the motivation in terms that name the gap exactly: "Even those claiming to be 'local' or 'on-device' were still black boxes that left me wondering where my audio really went. So I built Whispering. It's open-source, local-first, and most importantly, transparent with your data" 10. Its trust-boundaries documentation is explicit: "Audio stays on your device when you use local Whisper C++. Transcripts and settings are stored locally by the desktop app" 11.

The Epicenter project states the destination principle better than most marketing could: "Your data lives on your machine as plain Markdown and SQLite: grep it, version it, open it in Obsidian. When an app stops mattering, your files don't" 12. That is the whole argument in two sentences. Audio stays local, and the output is a plain file you can read, search, and move — long after any single app is gone.
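As a minimal sketch of what "a plain file you can read, search, and move" means in practice, the Python below appends a transcript to a dated Markdown file and searches the folder using only the standard library. The one-file-per-day naming, the heading format, and the function names `save_transcript` and `search_notes` are illustrative assumptions, not something Whispering or Epicenter prescribes.

```python
from datetime import datetime
from pathlib import Path


def save_transcript(text: str, notes_dir: Path, when: datetime) -> Path:
    """Append one transcript to a dated Markdown file and return its path."""
    notes_dir.mkdir(parents=True, exist_ok=True)
    # One file per day, e.g. 2026-06-20.md (a hypothetical naming scheme).
    note_path = notes_dir / f"{when:%Y-%m-%d}.md"
    with note_path.open("a", encoding="utf-8") as note:
        # Each entry gets a time heading followed by the transcript text.
        note.write(f"## {when:%H:%M}\n\n{text.strip()}\n\n")
    return note_path


def search_notes(term: str, notes_dir: Path) -> list[tuple[str, int, str]]:
    """Case-insensitive search; returns (file name, line number, line) hits."""
    hits = []
    for note_path in sorted(notes_dir.glob("*.md")):
        lines = note_path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if term.lower() in line.lower():
                hits.append((note_path.name, number, line))
    return hits
```

Because the output is ordinary text, nothing here depends on the tool that produced the transcript: the same folder can be searched with `grep`, tracked in version control, or opened in any Markdown editor.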

What to do tomorrow

The practical version is short. Decide where your voice goes, then decide where the words land — and own both ends.

Frequently Asked Questions

Can I transcribe voice notes without the cloud?

Yes. Since Whisper was open-sourced in September 2022, on-device speech-to-text has grown small and fast enough to run locally 1. By early 2026, a 4B model ran CPU-only with no GPU 2, and other projects shipped models as small as roughly 26 megabytes 8. With a local model, your audio never leaves your device, and the transcript can land in a plain text file you keep.

Where does my voice go with a cloud dictation app?

It goes to the vendor's servers — and, as Apple's 2019 Siri "grading" disclosure showed, sometimes to human reviewers who heard "confidential details" inside recordings 4. Apple later said it would, by default, no longer retain those recordings and would let users opt in 5. Local transcription removes the question entirely: the audio never travels.

Is my voice really considered biometric data?

Under GDPR Article 4(14), biometric data is personal data from technical processing of physical or behavioural characteristics that can uniquely identify a person 3. Article 9 treats biometric data used to identify someone as a special category, prohibited by default without a lawful basis 6. A voice recording is a rich biometric source, which is why where it travels matters.

Is local transcription as accurate as the cloud?

Not always, though the gap is closing fast. The 2019 Preech study found that offline, open-source speech recognition "eliminates the privacy risks" but historically delivered "inferior" accuracy to cloud systems 9. By 2026, on-device models had improved sharply 2 8. Local trades some accuracy for privacy and ownership — a trade many people will take, and one that gets easier each quarter.

What hardware do I need to run dictation locally?

Less than you might think, but not nothing. A community demonstration ran a 4B speech-to-text model "CPU-only" in early 2026 2, and the smallest on-device speech models now weigh in at roughly 25 to 26 megabytes 7 8. The smallest models trade accuracy for size, and the most accurate still want a capable machine. Match the model to your hardware rather than assuming every device runs the best one.
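One rough way to match a model to a machine is to estimate how much memory the weights alone occupy: parameter count times bits per parameter. The sketch below applies that arithmetic to a 4-billion-parameter model. The precisions listed are common choices assumed for illustration, since the precision used in the cited demonstration is not stated here, and real memory use adds runtime overhead on top of the weights.

```python
def weight_memory_gb(parameters: float, bits_per_parameter: int) -> float:
    """Approximate size of the model weights alone, in gigabytes (10^9 bytes)."""
    return parameters * bits_per_parameter / 8 / 1e9


PARAMETERS_4B = 4e9  # a 4-billion-parameter model

# Assumed precisions for illustration; lower precision means smaller weights.
for label, bits in [("32-bit", 32), ("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>6}: ~{weight_memory_gb(PARAMETERS_4B, bits):.0f} GB")
```

By this estimate the same 4B model ranges from about 16 GB of weights at 32-bit precision down to about 2 GB at 4-bit, which is why the precision a model ships in matters as much as its parameter count when deciding what a given laptop can run.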

Why does it matter where the transcript ends up?

Because keeping your audio local is only half the problem. A transcript handed to a closed service can still be indexed and locked into a proprietary format. As the Epicenter project puts it, data that "lives on your machine as plain Markdown" can be grepped, versioned, and opened anywhere — and "when an app stops mattering, your files don't" 12. A plain file outlives the tool that made it.

Dictation came home, and the recording finally stays where you are. The harder discipline is making sure the words do too — speak them, transcribe them on your own device, and keep them as a plain file that will still open when today's app is forgotten.

This piece builds on the open-source Whispering project's framing of where audio and transcripts actually go 10 11.


If you want a place for those words to land, mnmnote.com keeps notes as plain Markdown on your own device.

Footnotes

  1. "openai/whisper," OpenAI (GitHub). Repository created 2022-09-16, MIT-licensed; README: "Whisper is a general-purpose speech recognition model." https://github.com/openai/whisper. Accessed 2026-06-20.

  2. "Pure C, CPU-only inference with Mistral Voxtral Realtime 4B speech to text model," Hacker News (311 points, 2026-02-10). https://news.ycombinator.com/item?id=46954049. Accessed 2026-06-20.

  3. "Art. 4 GDPR – Definitions," EU General Data Protection Regulation (gdpr-info.eu), Article 4(14) ("biometric data"). GDPR in force 2018-05-25. https://gdpr-info.eu/art-4-gdpr/. Accessed 2026-06-20.

  4. Alex Hern. "Apple contractors 'regularly hear confidential details' on Siri recordings." The Guardian, 2019-07-26. https://www.theguardian.com/technology/2019/jul/26/apple-contractors-regularly-hear-confidential-details-on-siri-recordings. Accessed 2026-06-20.

  5. "Improving Siri's privacy protections." Apple Newsroom, 2019-08-28. https://www.apple.com/newsroom/2019/08/improving-siris-privacy-protections/. Accessed 2026-06-20.

  6. "Art. 9 GDPR – Processing of special categories of personal data," EU General Data Protection Regulation (gdpr-info.eu), Article 9(1). https://gdpr-info.eu/art-9-gdpr/. Accessed 2026-06-20.

  7. "Three new Kitten TTS models – smallest less than 25MB," Hacker News (561 points, 2026-03-19). https://news.ycombinator.com/item?id=47441546. Accessed 2026-06-20.

  8. "usefulsensors/moonshine," Useful Sensors (GitHub). Vendor-reported: "higher accuracy than Whisper Large V3 at the top end … down to tiny 26MB models"; "Everything runs on-device, so it's fast, private, and you don't need an account, credit card, or API keys." https://github.com/usefulsensors/moonshine. Accessed 2026-06-20.

  9. Shimaa Ahmed, Amrita Roy Chowdhury, Kassem Fawaz, Parmesh Ramanathan. "Preech: A System for Privacy-Preserving Speech Transcription." arXiv:1909.04198, September 2019. "Although offline and open-source ASR eliminates the privacy risks, its transcription performance is inferior to that of cloud-based ASR systems, especially for real-world use cases." https://arxiv.org/abs/1909.04198. Accessed 2026-06-20.

  10. "Whispering," Epicenter (GitHub), apps/whispering README. https://github.com/epicenter-so/epicenter/tree/main/apps/whispering. Accessed 2026-06-20.

  11. "epicenter-so/epicenter," Epicenter (GitHub), README "Trust Boundaries": "Audio stays on your device when you use local Whisper C++. Transcripts and settings are stored locally by the desktop app." https://github.com/epicenter-so/epicenter. Accessed 2026-06-20.

  12. "epicenter-so/epicenter," Epicenter (GitHub), README: "Your data lives on your machine as plain Markdown and SQLite: grep it, version it, open it in Obsidian. When an app stops mattering, your files don't." https://github.com/epicenter-so/epicenter. Accessed 2026-06-20.