Recovering the Flow of Writing with Voice Dictation

A Break in the Gesture of Writing

A few years ago, writing was still a fast gesture for me. I do not mean that writing was simple, or that ideas always arrived in order, without hesitation or revision. But there was a very concrete continuity between what I was thinking and what appeared on the screen. When I wrote my dissertation on a laptop, my muscle tone, dexterity, and typing speed still allowed me to work in that kind of proximity to my text. My hands more or less kept up with my mind. I could move forward, correct, move a sentence, return to a word, almost without the technology making itself felt.

Before that, handwriting had played that role. Then the keyboard replaced it, with its own constraints, but also with remarkable efficiency. A fairly simple spell checker was often enough to fix the mistakes left along the way. The essential work remained mine: composing, specifying, revising, moving things around. The errors came from me, or from the speed at which I had written.

Since my hands no longer keep up, that fluidity has broken down. Voice recognition has given me back a possibility of writing, which is already enormous. But it has not yet given me back the same relationship to writing. Dictating is not simply replacing fingers with voice. It means introducing a series of intermediaries between thought and text: a microphone, a transcription model, a punctuation system, sometimes an automatic corrector, and then a language model.

This shift changes everything. When I wrote on a keyboard, I could produce an imperfect sentence, but I knew what I had written. Today, I sometimes see a sentence appear that resembles what I said, but is not exactly what I meant. One word has been replaced by another. A sentence has been cut in the wrong place. A respiratory pause has been interpreted as the end of a sentence. Sometimes the text is grammatically plausible, but intellectually wrong. I then have to reread with suspicion a text that, at first glance, seems correct.

Real-time transcription models make this difficulty even more acute. They are essential if one wants to see text appear quickly, but by definition they are less advantaged than models that transcribe a complete recording. When a model can process a longer segment, return to the context, and recognize the coherence of a sentence or paragraph, it has better support. Real time works under constraint. It has to produce quickly, sometimes before the sentence has revealed its full logic.

I therefore added a correction stage after transcription. Specialized prompts ask a language model to repair the most frequent errors: agreement mistakes, phonetic confusions, punctuation, poor segmentation, false starts. This stage has become almost indispensable. Without it, I lose too much time cleaning up the raw output. (For a more technical discussion of this transcription-correction chain, see the article about my voice dictation correction prompt and the one about its dynamic, externalized architecture.)

But this correction has a cost that is not only financial. It has a temporal cost, and above all a cognitive one. When I dictate an idea, stop the recording, then wait several seconds for transcription and correction, my flow of thought is suspended. I already know what I want to write next, but the technical system holds me back. Ideas continue to form, sometimes very quickly, while the tool asks me to wait. Even a short delay is enough to produce a very specific kind of frustration: the frustration of a thought that is available, but prevented from landing.

A Workaround for Continuity, but Not a Solution

I tried to work around this latency by using several transcription tools in alternation. While the first was processing a segment, I could continue dictating into a second one. Then, when the first text appeared, I returned to it. On paper, this solution made it possible to recover a kind of continuity. In practice, it created other problems. The tools do not all have the same processing times, nor the same segmentation behavior. If the second recording was shorter than the first, I sometimes found myself waiting anyway. And I had to manage several tools instead of concentrating on writing itself.

Distinguishing the Times of Correction

What I am looking for today is simpler: to reduce as much as possible the delay between dictation and the appearance of a text clean enough for me to continue. The goal is not to obtain a perfect version immediately. The goal is not to break the momentum. The text must be readable, faithful, corrected enough for me to trust it a few seconds later, but it does not need to be definitively publishable after this first pass.

It is from this perspective that I began to take an interest in Mercury 2, the model developed by Inception and also available through OpenRouter. What interests me here is not only its quality, but the relationship between quality, speed, and cost. For light correction of already transcribed text, speed matters enormously. A correction that arrives almost immediately does not play the same role as a correction that imposes a pause of ten or fifteen seconds. In the first case, it accompanies writing. In the second, it becomes an additional obstacle.

The working hypothesis is therefore as follows: to use a very fast model to correct short segments dictated in succession, in order to preserve as much as possible the natural movement of writing. I could dictate a sentence, a paragraph, an idea, then quickly obtain a corrected version that strictly respects what I meant. This immediate correction would not aim to embellish the text. It should only remove errors introduced by transcription and repair the most obvious accidents of dictation.

In a second stage, once the text has already been composed, a more demanding model could intervene for a global rereading. This stage would have a different function. It would no longer serve to preserve the flow in real time, but to slightly homogenize the text after the fact: check repetitions, improve punctuation, identify a few heavy passages, smooth certain transitions without touching the voice of the text. There, latency would matter less. Waiting a few dozen seconds to reread a text that has already been written is not the same as waiting in the middle of a sentence for the tool to give me back control.

This distinction between several temporalities seems essential to me. The time of writing is not the time of revision. In the first, the tool must be almost transparent. In the second, it can become more attentive, slower, more meticulous. Confusing these moments leads to asking the same prompt and the same model to do several contradictory things: go very fast and reread very finely, change nothing and improve the text, respect orality and produce already stabilized prose.

Matching Forms of Writing with Technical Architectures

Recent tests have therefore led me to distinguish several writing regimes rather than a single universal flow. For familiar messages, personal notes, or low-stakes content, a very fast correction with Mercury 2 may be enough. It lets a few imperfections pass, but the gain in speed is decisive. The goal is not to produce an impeccable text, but to send or preserve an idea without interrupting the movement.

Between this fast writing and very careful writing, an intermediate level can make sense. A model such as Grok 4.2 seems capable of providing a somewhat more robust correction than Mercury 2, while remaining faster and less costly than a very demanding model. This level could be used for everyday messages that deserve greater cleanliness without justifying a slow rereading: simple professional exchanges, shared notes, short drafts.

When the text carries more of my responsibility, such as an official email, an important message, or a passage I want to reread and send without risk, GPT 5.4 becomes more relevant. The latency is more noticeable, but it is justified by the quality of the correction. The model understands long sentences better, restores transcription confusions more finely, and produces a text that is more immediately usable.

Finally, for long texts that I dictate quickly in order to get ideas out of my head, a two-stage architecture seems more appropriate: a first rapid correction, then a light rereading with GPT 5.4. In this case, I can accept a small amount of editorial work after the fact: reattaching isolated fragments, reconstructing paragraphs according to their argumentative logic, correcting formulations damaged by dictation. It is no longer only a cleaning operation; it is a way of making a thought dictated under imperfect conditions readable again.

This typology seems important to me because it shifts the question. The point is no longer to find the single model that would do everything, but to choose the correction architecture according to the type of writing. At low stakes, speed comes first. For important texts, reliability matters most. For long and exploratory texts, what becomes central is the ability to preserve momentum while later reconstructing readability.

This is a point I care about deeply. I am not trying to delegate my writing to a language model. I do not want a tool to transform my texts into standardized, polished, efficient prose that is foreign to my way of thinking. The promise of these systems interests me only if they help me become the author of my sentences again, not if they place themselves between me and those sentences. The aim is not to produce a text more brilliant than mine. The aim is to recover continuity in writing despite the loss of my physical means.

There is, it seems to me, a broader question here for assistive technologies. A good tool is not only a powerful tool. It is a tool that respects the rhythm of the person who uses it. In the case of voice-assisted writing, precision matters, of course. But temporality matters just as much. A system can be objectively very powerful and subjectively unusable if it imposes too much waiting, too many checks, too many ruptures in action.

Fluidity, then, is not a luxury. For me, it conditions the very possibility of writing. It determines the difference between noting an idea at the moment it forms and watching it dissipate while the machine finishes processing. It also determines the relationship to the tool: a system that accompanies gives momentum; a system that delays eventually discourages.

I will therefore continue to test this path: dictation in short segments, rapid correction with Mercury 2, then a more global rereading in a second stage. It will probably be necessary to adjust the prompts, distinguish several levels of correction, measure costs, compare models, and observe the remaining errors. But the objective will remain the same: to make voice dictation not a fallback, but a form of writing that is sufficiently fast, natural, and reliable for me to think while writing again.

What I am trying to recover is not exactly handwriting, nor even the keyboard of my dissertation years. Those gestures belong to another state of my body. What I am seeking is a functional and intimate equivalence: to follow my ideas without technology fragmenting them. If voice recognition and language models can serve that purpose, then they will not only be compensatory tools. They will become instruments of continuity.

A Break in the Gesture of Writing

A Workaround for Continuity, but Not a Solution

Distinguishing the Times of Correction

Matching Forms of Writing with Technical Architectures

Leave a ReplyCancel Reply