Why an LLM Reads Your Markdown Better Than an Export

The cleanest copy of your knowledge to hand a language model is the plain Markdown you already keep. The same note exported to styled HTML can cost roughly four times as many tokens — the extra weight is wrapper tags, classes, and inline CSS the model has to read past before it reaches a single one of your words.

That ratio is not a marketing number. It comes from a tokenizer you can run yourself, on a note you write yourself, in about a minute. The bigger point sits underneath it: the conversion tools the AI industry now ships — from Microsoft, from Jina, from Google Cloud — all convert toward Markdown, not away from it. The format you reach for to feed an AI is the format you were probably already writing in.

This piece is the export-economics companion to the broader case that your Markdown notes are already AI-ready. Here the focus is narrower: the tokens the export step quietly burns.

What most people assume about feeding notes to an AI

Most people assume the format barely matters, that you paste in whatever your app exports and the model sorts it out. That assumption is half right. A capable model will read a messy HTML dump. It will also pay for every byte of that mess in tokens, and tokens are the budget the whole interaction runs on.

The instinct to "just export and paste" treats the export as free. It is not. An export is a translation, and the proprietary formats most apps translate into — wrapped HTML, a database dump, a PDF — are optimized for storage, rendering, or display fidelity, not for being read by a language model. The model has to undo that translation before it can use what you wrote.

Why an export is the expensive way to hand over knowledge

An export is expensive in the literal, countable sense: it spends tokens on structure the model does not need. Jina AI, describing why it built models to clean web pages, put it plainly: "Inline CSS and scripts can easily balloon the code to hundreds of thousands of tokens."¹ That ballooning is the bill the export hands you.

You can measure it on your own note. Using OpenAI's o200k_base tokenizer (the one behind GPT-4o and GPT-4.1), a short kickoff note runs 90 tokens as plain Markdown, 125 tokens as minimal clean HTML, and 366 tokens as a typical styled export with wrapper divs, classes, and inline CSS.² That is the same note three ways. The styled export is about four times the Markdown, and roughly three-quarters of its tokens are paying for chrome you never read.

Here is the receipt, so you do not have to take the number on faith:

# Requires: pip install tiktoken
# Same note, three formats. o200k_base = GPT-4o / GPT-4.1 tokenizer.
python3 -c "import tiktoken; e=tiktoken.get_encoding('o200k_base'); \
print('markdown:', len(e.encode(open('note.md').read()))); \
print('minimal-html:', len(e.encode(open('note-minimal.html').read()))); \
print('styled-export:', len(e.encode(open('note-styled.html').read())))"

Be honest about the range. Against minimal, semantic HTML (just headings, lists, and links), Markdown saves 28% on this note.² Against a realistic styled export with wrapper tags and inline CSS, the saving climbs past 75%. The cleaner the export, the smaller the win; the more chrome it carries, the larger. On this note, plain Markdown lands somewhere between roughly a quarter and three-quarters cheaper. The number moves with the mess.
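The percentages follow directly from the three measured counts. Here is a minimal sketch of that arithmetic; the counts are the ones reported for the sample note, while the names `COUNTS`, `saving`, and `report` are illustrative only.

```python
# Token counts reported for the sample note (o200k_base tokenizer).
COUNTS = {"markdown": 90, "minimal-html": 125, "styled-export": 366}


def saving(markdown_tokens: int, other_tokens: int) -> float:
    """Fraction of tokens saved by sending Markdown instead of the other format."""
    return 1 - markdown_tokens / other_tokens


def report(counts: dict) -> dict:
    """Map each non-Markdown format to the percentage saved, rounded to 0.1."""
    md = counts["markdown"]
    return {
        name: round(100 * saving(md, n), 1)
        for name, n in counts.items()
        if name != "markdown"
    }


if __name__ == "__main__":
    for name, pct in report(COUNTS).items():
        ratio = COUNTS[name] / COUNTS["markdown"]
        print(f"{name}: {pct}% saved, {ratio:.2f}x the Markdown")
```

Swap in the counts from your own note to see where it falls in the range.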

Why the industry kept building tools to make Markdown

The strongest evidence that Markdown is the format to hand an AI is not an argument. It is a pattern of what large vendors actually built. They did not build tools to feed HTML to models. They built tools to strip HTML down to Markdown first, and they said why in their own documentation.

Microsoft's open-source MarkItDown — a tool whose entire job is "convert any file to Markdown for use with LLMs" — has over 155,000 GitHub stars.³ Its README states the rationale directly: large models "natively 'speak' Markdown … this suggests that they have been trained on vast amounts of Markdown-formatted text … As a side benefit, Markdown conventions are also highly token-efficient."⁴ Training distribution and token cost, named by the vendor.

Jina AI built Reader-LM, a family of small models that support "a context length of up to 256K tokens" for one purpose: turning noisy HTML into clean Markdown.⁵ The output, in their words, is "a well-structured markdown file, ready to be used by LLMs for grounding, summarizing, and reasoning."⁶ When a company builds a purpose-specific model to manufacture the format, the format has won the argument.

What the AI world standardized on

The convergence is not limited to conversion utilities. In June 2026, Google Cloud published the Open Knowledge Format, an interchange standard that "represents knowledge as a directory of markdown files with YAML frontmatter."⁷ A trillion-dollar cloud vendor, defining how knowledge moves between AI systems, chose plain Markdown in a folder.

The reason they gave is the one that matters for ownership, and it is worth reading their Open Knowledge Format announcement in full. "OKF is not tied to any specific cloud, database, model provider, or agent framework," the authors wrote. "It will never require a proprietary account or SDK to read, write, or serve."⁸

And the principle behind it: "the value of a knowledge format comes from how many parties speak it, not from who owns it."⁹ An export locks your knowledge to the tool that produced it. Markdown speaks to everything.

Simon Willison, who has fed text to these models since the early days, names the same trade-off from the practitioner's side: "I've been defaulting to asking for most things in Markdown since the GPT-4 days, when the 8,192 token limit meant that Markdown's token-efficiency over HTML was extremely worthwhile."¹⁰

His point is narrow and worth keeping narrow — he is talking about what you send a model, the input. (In the same post he argues HTML can be the better format for rich, interactive output the model produces; that is a different question, and we are not over-reading it.)

Where this argument stops being true

Markdown is not magic, and "always use Markdown" would be a dishonest place to end. The win is real for prose and notes, where Markdown's structure maps cleanly to how a model learned to read. It is not universal. For genuinely tabular data, the academic record runs the other way.

The Microsoft Research paper "Table Meets LLM" (WSDM 2024) found that model "performance varied with different input choices, including table input format," and in that study the input format was a measurable lever, not a wash, for structured-table understanding.¹¹ Markdown was not the universal winner there. So the honest claim is bounded: Markdown wins on token cost, native fluency, and ownership, not on every task. If your knowledge is mostly dense tables, test both formats yourself. If it is mostly notes and prose, the choice is already made.
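To test both formats on your own tables, it helps to render the same rows twice and send each version with the same question. The sketch below is one way to do that, using only the standard library; the sample data and the function names `to_markdown_table` and `to_html_table` are hypothetical, and it is not the setup used in the paper.

```python
from html import escape

# Hypothetical sample data; swap in a table from your own notes.
HEADERS = ["task", "owner", "due"]
ROWS = [["Draft brief", "Ana", "2026-07-01"], ["Review budget", "Li", "2026-07-08"]]


def to_markdown_table(headers, rows):
    """Render rows as a pipe table, escaping literal pipes inside cells."""
    def line(cells):
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    separator = "| " + " | ".join("---" for _ in headers) + " |"
    return "\n".join([line(headers), separator] + [line(r) for r in rows])


def to_html_table(headers, rows):
    """Render the same rows as a bare HTML table with no classes or styling."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


if __name__ == "__main__":
    print(to_markdown_table(HEADERS, ROWS))
    print(to_html_table(HEADERS, ROWS))
```

Ask the model the same lookup or aggregation question against each rendering, and count tokens for both with the tokenizer you already use. That gives you the cost and the accuracy side of the trade-off on your data rather than on a benchmark.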

There is a second honest caveat, and it cuts against complacency. Clean Markdown is only an advantage if it is clean — structured with headings, lists, and a frontmatter header the model can lean on. A wall of unstructured text saved with a .md extension is not more readable than a wall of unstructured HTML. Garbage Markdown is still garbage. The savings come from the structure, not the file extension.
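A rough way to check whether a note has the structure described here is to look for the three signals directly. The sketch below is a heuristic under stated assumptions, not a Markdown parser: it treats a leading `---` block as frontmatter, matches ATX-style `#` headings and simple list markers, and does not skip fenced code blocks. The name `structure_report` is hypothetical.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")                   # "# Title" through "###### Title"
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+\S")   # "- item", "* item", "1. item"


def structure_report(text: str) -> dict:
    """Report which structural signals a note contains (heuristic only)."""
    lines = text.splitlines()
    has_frontmatter = False
    body_start = 0
    # Frontmatter: first line is '---' and a later line closes it with '---'.
    if lines and lines[0].strip() == "---":
        for i, line in enumerate(lines[1:], start=1):
            if line.strip() == "---":
                has_frontmatter = True
                body_start = i + 1
                break
    body = lines[body_start:]
    return {
        "frontmatter": has_frontmatter,
        "headings": any(HEADING.match(line) for line in body),
        "lists": any(LIST_ITEM.match(line) for line in body),
    }


if __name__ == "__main__":
    note = "---\ntitle: Kickoff\n---\n# Kickoff\n\n- Confirm scope\n- Set dates\n"
    print(structure_report(note))
```

A note that reports `False` on all three is the wall of text this caveat is about, whatever its file extension.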

What to actually do with this

The practical takeaway is short, and most of it is about doing less. The most AI-ready copy of your knowledge is probably the one you already have, written in plain Markdown, and the export step is the part to drop. Keep the structure, skip the conversion, and convert toward Markdown only when the source leaves you no choice.

Keep your notes as plain Markdown with real structure — headings, lists, links, a frontmatter header. The structure is what the model reads.
Skip the export when you can. If your notes already live as Markdown files, hand the model the files. The conversion you were about to run is the conversion MarkItDown and Reader-LM exist to undo.
When you must convert a PDF, a web page, or a database dump, convert to Markdown first, with an open tool, before you spend prompt tokens on it.
For tables specifically, test both formats. The token argument still favors Markdown; the accuracy argument may not.
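To make the "convert to Markdown first" step concrete, here is a toy converter built on Python's standard `html.parser`. It is a sketch of the idea only, and not how MarkItDown or Reader-LM work: it keeps headings, paragraphs, list items, and links, drops `<style>` and `<script>` content, ignores every class and inline style, and renders ordered lists as plain bullets. The names `MarkdownSketch` and `html_to_markdown` are hypothetical.

```python
import re
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Toy HTML-to-Markdown pass: keeps text structure, drops styling and scripts."""

    SKIP = {"script", "style"}
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.out.append("\n\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ("ul", "ol"):
            self.out.append("\n")
        elif tag == "a":
            self.out.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        # Collapse whitespace so only structural newlines survive.
        if self.skip_depth == 0:
            self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    lines = [re.sub(r" {2,}", " ", ln).strip() for ln in "".join(parser.out).split("\n")]
    kept = []
    for ln in lines:
        # Keep a blank line only after a non-blank one (no leading or doubled blanks).
        if ln or (kept and kept[-1]):
            kept.append(ln)
    return "\n".join(kept).strip()


if __name__ == "__main__":
    styled = (
        '<div class="note" style="margin:0"><style>.note{color:red}</style>'
        '<h1>Kickoff</h1><p>Read the <a href="https://example.com">brief</a> first.</p>'
        "<ul><li>Scope</li><li>Dates</li></ul></div>"
    )
    print(html_to_markdown(styled))
```

For real PDFs, web pages, and office files, a maintained open converter will handle far more cases than this sketch; the point is only that the wrapper markup and CSS are gone before any prompt tokens are spent.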

Frequently asked questions

What is the best format to give an LLM my notes — Markdown, HTML, or an export?

For notes and prose, plain Markdown. It costs the fewest tokens (on a sample note, roughly a quarter to three-quarters cheaper than HTML, depending on how styled the HTML is)² and large models were trained on enough Markdown that they read it natively.⁴ The exception is dense tables, where HTML can score better.¹¹

Does Markdown use fewer tokens than HTML?

Yes, usually by a wide margin. With OpenAI's o200k_base tokenizer, one sample note ran 90 tokens as Markdown versus 366 as a styled HTML export, about four times more, and 125 as minimal clean HTML.² The saving grows with the amount of styling, wrapper markup, and inline CSS the export carries.¹

Do LLMs understand Markdown natively?

Yes. Microsoft's MarkItDown documentation states that mainstream models "natively 'speak' Markdown" and "often incorporate Markdown into their responses unprompted," which it attributes to training on "vast amounts of Markdown-formatted text."⁴ The format is close to the model's own default output.

Should I convert HTML to Markdown before sending it to an LLM?

For token cost and noise reduction, yes. Inline CSS and scripts can "balloon the code to hundreds of thousands of tokens,"¹ and purpose-built tools (Microsoft's MarkItDown,³ Jina's Reader-LM⁵) exist specifically to strip HTML down to clean Markdown for exactly this reason.

Is Markdown or HTML better for tables in an LLM?

It depends on the task. For most prose and notes, Markdown wins on token cost and native fluency. But "Table Meets LLM" (WSDM 2024) found that input format measurably changes accuracy on structured-table tasks, and Markdown was not the universal winner there.¹¹ For table-heavy data, test both.

Why are we still using Markdown for this?

Because the tools the AI world is built on converge on it. Microsoft converts files to Markdown for model use,³ Jina built models to produce it,⁶ and Google Cloud's Open Knowledge Format represents knowledge as "a directory of markdown files with YAML frontmatter."⁷ It is the format the most parties already speak.⁹

The export was always a translation, and translation has a cost the token meter makes visible. If the most fluent, cheapest copy of your knowledge is the plain Markdown you already keep on your own device, then the real question is not which format to convert your notes into — it is why you would convert them at all. MNMNOTE keeps your notes as open, plain Markdown stored locally on your device, so the AI-ready copy already exists, with no export step in between: mnmnote.com.

¹ "Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown," Jina AI, 2024-09-11, https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/, retrieved 2026-06-18.
² Token counts derived with OpenAI's o200k_base tokenizer (the GPT-4o / GPT-4.1 encoding) via the tiktoken library, MNMNOTE, 2026-06-18. The same kickoff note measured 90 tokens as plain Markdown, 125 as minimal semantic HTML, and 366 as a styled HTML export with wrapper divs, classes, and inline CSS; the command and inputs are reproduced in the post.
³ "microsoft/markitdown," Microsoft, GitHub, https://github.com/microsoft/markitdown, 155,209 stars as of 2026-06-18.
⁴ "Why Markdown?," MarkItDown README, Microsoft, https://raw.githubusercontent.com/microsoft/markitdown/main/README.md, retrieved 2026-06-18.
⁵ "Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown," Jina AI, 2024-09-11, https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/, retrieved 2026-06-18.
⁶ "Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown," Jina AI, 2024-09-11, https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/, retrieved 2026-06-18.
⁷ Sam McVeety and Amir Hormati, "How the Open Knowledge Format can improve data sharing," Google Cloud blog, 2026-06-13, https://cloud.google.com/blog/products/data-analytics/how-the-open-knowledge-format-can-improve-data-sharing/, retrieved 2026-06-18.
⁸ Sam McVeety and Amir Hormati, "How the Open Knowledge Format can improve data sharing," Google Cloud blog, 2026-06-13, https://cloud.google.com/blog/products/data-analytics/how-the-open-knowledge-format-can-improve-data-sharing/, retrieved 2026-06-18.
⁹ Sam McVeety and Amir Hormati, "How the Open Knowledge Format can improve data sharing," Google Cloud blog, 2026-06-13, https://cloud.google.com/blog/products/data-analytics/how-the-open-knowledge-format-can-improve-data-sharing/, retrieved 2026-06-18.
¹⁰ Simon Willison, "Using Claude Code: The Unreasonable Effectiveness of HTML," simonwillison.net, 2026-05-08, https://simonwillison.net/2026/May/8/unreasonable-effectiveness-of-html/, retrieved 2026-06-18.
¹¹ Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang, "Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study," WSDM 2024, arXiv:2305.13062, https://arxiv.org/abs/2305.13062, retrieved 2026-06-18.