
Why Your Notes Show Garbled Characters: What 'Plain Text' Actually Means at the Byte Level

MNMNOTE
Tags: plain text, utf-8, character encoding, mojibake, markdown, file formats

A note shows ’ or é or empty boxes because "plain text" is not one thing. A text file is bytes plus an encoding — a rule mapping bytes to characters. When one machine writes in one encoding and another reads in a different one, the bytes survive but their meaning garbles. Save notes as UTF-8 and the problem disappears.

That garble has a name: mojibake. Wikipedia defines it as "the garbled or gibberish text that is the result of text being decoded using an unintended character encoding."1 The word is borrowed from Japanese (文字化け, roughly "character transformation").

The good news hidden inside that definition is that your data is intact — only the interpretation is wrong. Re-read the same bytes with the correct encoding and the original text comes back. This post explains the bytes underneath, then gives you the five-minute fix.

What does "plain text" actually mean?

"Plain text" means a file with no formatting markup — just characters. But characters are not stored on disk; bytes are. The missing half of the definition is the encoding: the lookup table that says which byte (or group of bytes) stands for which character. Without that table, a string of bytes is ambiguous.

Joel Spolsky put the punchline in bold in his 2003 essay: "There Ain't No Such Thing As Plain Text."2 His single most important rule follows from it: "It does not make sense to have a string without knowing what encoding it uses."3 A .txt file does not announce its encoding inside the bytes. The reading program guesses, and a wrong guess is where garbled notes come from.
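The ambiguity is easy to demonstrate. The sketch below is a minimal illustration, not something taken from Spolsky's essay: it decodes one byte under three single-byte encodings. The byte value 0xE9 and the three codec names (`latin-1`, `cp1252`, `cp437`, as Python spells them) are assumptions chosen for the example.

```python
# One byte, three lookup tables: the byte never changes, only its meaning.
raw = bytes([0xE9])

# Decode the same byte with each (illustrative) single-byte encoding.
decoded = {name: raw.decode(name) for name in ("latin-1", "cp1252", "cp437")}

for name, char in decoded.items():
    print(f"{name:8} -> {char!r} (U+{ord(char):04X})")
```

Latin-1 and Windows-1252 happen to agree on this byte; the old DOS code page 437 maps it to a different character entirely, which is exactly the kind of disagreement a guessing reader can fall into.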

You can see the byte underneath any character. Here is the same letter encoded three ways:

>>> 'A'.encode('ascii')     # b'A'  ->  byte 0x41
>>> 'A'.encode('latin-1')   # b'A'  ->  byte 0x41
>>> 'A'.encode('utf-8')     # b'A'  ->  byte 0x41
# 'A' is byte 0x41 in ALL THREE. Plain ASCII (code points 0-127) is identical everywhere.

The letter A is byte 0x41 in every ASCII-compatible encoding, which covers all three above (UTF-16 and EBCDIC are the notable exceptions). That is not a coincidence: it is the design decision that keeps simple notes from ever garbling, and the next section is about why.

Why are the first 128 characters always safe?

The first 128 characters — the English letters, digits, and basic punctuation of ASCII — are byte-identical across ASCII, Latin-1, and UTF-8. A note containing only those characters cannot garble, no matter which of those encodings a reader assumes. This is the safe subset, and it is guaranteed by the UTF-8 specification itself.

RFC 3629, the IETF standard that defines UTF-8, states it plainly: "US-ASCII characters are encoded in one octet having the normal US-ASCII value, and any octet with such a value can only stand for a US-ASCII character," and notes that "a direct consequence is that a plain ASCII string is also a valid UTF-8 string."4

The Unicode Consortium says the same from the other direction: "UTF-8 uses the bytes in the ASCII only for ASCII characters. Therefore, it works well in any environment where ASCII characters have a significance as syntax characters."5

Spolsky compresses the whole guarantee into one line: "In UTF-8, every code point from 0-127 is stored in a single byte."6 You can prove the safety yourself by deliberately reading ASCII text with the wrong encoding:

>>> 'Hello world'.encode('utf-8').decode('latin-1')
'Hello world'
# Pure-ASCII content is byte-identical in UTF-8 and Latin-1, so it CANNOT garble.

ASCII is the floor, not the ceiling. Real notes contain accented names, em dashes, curly quotes, and emoji — none of which fit in 128 characters. So the safe subset is reassuring but incomplete; the durable answer is the superset that holds all of it, which is UTF-8.
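To find out whether a given note sits entirely inside the safe subset, it is enough to check that no byte has its high bit set. The helper name `is_ascii_safe` and the sample strings are hypothetical, introduced only for this sketch.

```python
def is_ascii_safe(data: bytes) -> bool:
    """True when every byte is in 0x00-0x7F, i.e. the note is pure ASCII."""
    return all(byte < 0x80 for byte in data)


plain = "Meeting notes, 3 items".encode("utf-8")
accented = "Caf\u00e9 notes".encode("utf-8")  # contains an e-acute

print(is_ascii_safe(plain))     # True
print(is_ascii_safe(accented))  # False
```

A note that passes this check reads the same under ASCII, Latin-1, and UTF-8; one that fails needs its encoding named.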

Why does é turn into é?

An accented character like é lives outside the safe 128, so encodings store it differently. UTF-8 stores é as two bytes; Latin-1 stores it as one. When one program writes é in UTF-8 and another reads those bytes as Latin-1, the two bytes become two separate Latin-1 characters — and you see é.

Here are the bytes, derived directly:

>>> 'é'.encode('utf-8')     # b'\xc3\xa9'   ->  bytes C3 A9   (two bytes)
>>> 'é'.encode('latin-1')   # b'\xe9'       ->  byte  E9      (one byte)
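The two-versus-one difference is part of a wider pattern: UTF-8 spends one to four bytes per character depending on the code point. The sketch below prints the byte sequence for four sample characters; the selection (a letter, an accented letter, a curly apostrophe, an emoji) is an assumption made for illustration.

```python
# Sample characters: A, e-acute, curly apostrophe, grinning-face emoji.
samples = ["A", "\u00e9", "\u2019", "\U0001F600"]

widths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    widths[ch] = len(encoded)
    # bytes.hex(' ') needs Python 3.8 or newer.
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ').upper()} ({len(encoded)} byte(s))")
```

Only the first of these fits in the single byte that a legacy one-byte-per-character reader expects.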

Now feed the UTF-8 bytes to a Latin-1 reader and watch the garble appear:

>>> 'é'.encode('utf-8').decode('latin-1')
'é'
# The bytes C3 A9 are valid in BOTH encodings — they just mean different things.

That is mojibake in a single line. The bytes C3 A9 are perfectly legal in both encodings; nothing is corrupted. The reader simply applied the wrong lookup table. This is why mojibake is recoverable: the original é is one correct decode away.
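Recovery is the same operation run backwards: turn the garbled characters back into the bytes they came from, then decode those bytes with the encoding that was actually used. The function name `undo_mojibake` and its defaults are hypothetical; this is a sketch that assumes you know (or can guess) which wrong encoding was applied.

```python
def undo_mojibake(garbled: str, wrong: str = "latin-1", right: str = "utf-8") -> str:
    """Re-encode with the encoding that was wrongly applied, then decode correctly."""
    return garbled.encode(wrong).decode(right)


# Simulate the misread: UTF-8 bytes decoded as Latin-1.
garbled = "caf\u00e9".encode("utf-8").decode("latin-1")

print(garbled)                 # cafÃ©
print(undo_mojibake(garbled))  # café
```

If the text was not produced by that particular misread, the call raises a `UnicodeError` instead of returning something plausible, which is a useful signal that the guess was wrong.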

Why does my note show ’ everywhere?

The ’ garble is the smart-apostrophe version of the same problem. A curly apostrophe (’, the character that word processors and many editors insert automatically) is three bytes in UTF-8. Read those three bytes with Windows-1252, a legacy default on older Windows tooling, and each byte renders as its own visible character, producing ’.

Watch it happen:

>>> '’'.encode('utf-8')                       # U+2019 -> bytes E2 80 99 (three bytes)
b'\xe2\x80\x99'
>>> '’'.encode('utf-8').decode('windows-1252')
'’'
# A UTF-8 curly apostrophe, read with Windows-1252, becomes three garbage characters.

Three bytes in, three garbage characters out. The chain explains nearly every â€¦ (ellipsis), â€œ (curly opening quote), and ’ (curly apostrophe) you have ever seen in an exported document, a CSV opened in a spreadsheet, or a note copied between apps. The fix is never to retype the note; it is to read it with the right encoding, which the fix-it section covers.
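The same chain can be tabulated for the punctuation that editors insert automatically. This is a small sketch; the four characters chosen and the dictionary keys are illustrative assumptions, and `cp1252` is Python's name for Windows-1252.

```python
# Common "smart" punctuation, written as escapes so the source stays ASCII.
punctuation = {
    "curly apostrophe": "\u2019",
    "opening curly quote": "\u201c",
    "ellipsis": "\u2026",
    "em dash": "\u2014",
}

garbled = {}
for name, ch in punctuation.items():
    raw = ch.encode("utf-8")              # three bytes each, all starting E2 80
    garbled[name] = raw.decode("cp1252")  # the wrong lookup table
    print(f"{name:20} {raw.hex(' ').upper()} -> {garbled[name]}")
```

The closing curly quote (U+201D) is left out on purpose: its third UTF-8 byte, 0x9D, is unassigned in Python's `cp1252` codec, so decoding it raises an error rather than producing a character. Other software may render that byte differently.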

What is the UTF-8 BOM (the EF BB BF at the start of my file)?

A BOM (byte-order mark) is an optional three-byte signature at the start of a UTF-8 file. Its bytes are EF BB BF. Some tools write it to mark a file as UTF-8; some older tools read those bytes as literal text and show ï»¿ at the top. It is a helpful signal in one world, a visible bug in another.

The Unicode Consortium is precise about what it is for: "Where a BOM is used with UTF-8, it is only used as an encoding signature to distinguish UTF-8 from other encodings — it has nothing to do with byte order."7 RFC 3629 confirms the exact bytes: "UTF-8 having a single-octet encoding unit, this last function is useless and the BOM will always appear as the octet sequence EF BB BF."8 You can see those three bytes in Python:

>>> import codecs
>>> codecs.BOM_UTF8           # b'\xef\xbb\xbf'  ->  bytes EF BB BF

And here is the difference a BOM makes at the head of a file:

EF BB BF 48 65 6C 6C 6F        # "Hello" WITH a UTF-8 BOM
         48 65 6C 6C 6F        # "Hello" WITHOUT a BOM
# Some Unix tools read EF BB BF as literal text (shows as ï»¿) instead of a signature.

A BOM is a tradeoff, not a villain. It genuinely helps some legacy Windows tools detect UTF-8; it breaks some Unix tools that read the bytes literally. The honest default for notes is UTF-8 without a BOM — but both exist for a reason, and the next section shows how to add or strip one on purpose.
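How a reader treats those three bytes depends on the codec it picks. The sketch below uses Python's built-in `utf-8` and `utf-8-sig` codecs on an in-memory example; the variable names are illustrative.

```python
import codecs

with_bom = codecs.BOM_UTF8 + "Hello".encode("utf-8")
without_bom = "Hello".encode("utf-8")

# Plain 'utf-8' keeps the BOM as an invisible U+FEFF character at the front.
print(repr(with_bom.decode("utf-8")))         # '\ufeffHello'

# 'utf-8-sig' drops a leading BOM if present and is harmless otherwise.
print(repr(with_bom.decode("utf-8-sig")))     # 'Hello'
print(repr(without_bom.decode("utf-8-sig")))  # 'Hello'

# A legacy single-byte reader turns the signature into visible garble.
print(with_bom.decode("cp1252"))              # ï»¿Hello
```

Reading with `utf-8-sig` is a reasonable choice when you do not control whether incoming files carry a BOM.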

How do I check and fix a garbled file?

The fix is a five-minute, two-step process: find out what encoding a file actually is, then convert it to UTF-8. You retype nothing — the bytes are intact, so you only change how they are read. On macOS or Linux the four commands below cover almost every case; on Windows, a text editor's encoding menu does the same.

First, ask the file what it claims to be. The flag below is the macOS spelling; GNU file on Linux uses -i, and --mime works on both:

file -I note.txt
# note.txt: text/plain; charset=utf-8       <- good
# note.txt: text/plain; charset=iso-8859-1  <- legacy; convert it
# note.txt: text/plain; charset=us-ascii    <- safe subset of UTF-8

When you need ground truth — file only guesses — look at the raw bytes:

hexdump -C note.txt | head
# 00000000  ef bb bf 48 65 6c 6c 6f      <- starts EF BB BF = has a UTF-8 BOM
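The same inspection can be scripted. The function below is a rough classifier, assuming only what this post has established: a BOM prefix, an all-ASCII body, or a successful strict UTF-8 decode. The name `sniff_encoding` and its return strings are hypothetical, and like file it cannot tell one legacy encoding from another.

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Rough classification of raw bytes; a heuristic, not a detector."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if all(byte < 0x80 for byte in data):
        return "ascii (valid utf-8)"
    try:
        data.decode("utf-8")  # strict: raises on any invalid sequence
        return "utf-8"
    except UnicodeDecodeError:
        return "not utf-8 (probably a legacy single-byte encoding)"


print(sniff_encoding(b"Hello"))
print(sniff_encoding("caf\u00e9".encode("utf-8")))
print(sniff_encoding("caf\u00e9".encode("latin-1")))
print(sniff_encoding(codecs.BOM_UTF8 + b"Hello"))
```

For a file on disk, pass it the result of reading the file in binary mode.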

To convert a legacy file to UTF-8, name the encoding it is from and the one it is going to:

iconv -f WINDOWS-1252 -t UTF-8 old.txt > new.txt
# -f = from (the wrong/legacy encoding)   -t = to (UTF-8)
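A Python equivalent is a decode followed by an encode. The sketch below adds one guard that is an assumption of this example, not a feature of iconv: if the bytes already decode as UTF-8 it leaves them alone, because converting a UTF-8 file "from Windows-1252" would bake the mojibake in. The function name `to_utf8` and the sample text are hypothetical.

```python
def to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Transcode legacy bytes to UTF-8, leaving already-valid UTF-8 alone."""
    try:
        data.decode("utf-8")
        return data  # already UTF-8 (or pure ASCII): converting again would double-encode
    except UnicodeDecodeError:
        return data.decode(source_encoding).encode("utf-8")


# A note saved by a legacy Windows tool: curly apostrophe = 0x92, e-acute = 0xE9.
legacy = "It\u2019s caf\u00e9 time".encode("cp1252")
converted = to_utf8(legacy)

print(legacy)
print(converted)
print(converted.decode("utf-8"))
```

The guard is a heuristic: a legacy file could in rare cases happen to be valid UTF-8, so check the result by eye before overwriting an original.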

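If a tool added a BOM you do not want, strip the three leading bytes. This form relies on the \x escapes of GNU sed; the BSD sed that ships with macOS does not interpret them: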

sed '1s/^\xEF\xBB\xBF//' withbom.txt > clean.txt
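Where sed is unavailable or behaves differently, the same strip is a few lines of Python working on raw bytes. This is a sketch; the function name `strip_bom` is hypothetical, and it rewrites the file in place, so try it on a copy first.

```python
import codecs
from pathlib import Path


def strip_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM in place. Returns True if one was removed."""
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do; the file is left untouched
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

Calling `strip_bom(Path("withbom.txt"))` a second time returns False, so it is safe to run over a folder of notes more than once.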

A few minutes with file, iconv, and a re-open in the correct encoding turn a wall of ’ back into the apostrophes you typed. Nothing was ever lost, only mislabeled.

What encoding should I save my notes in so they open in 20 years?

Save them as UTF-8, without a BOM. It is the encoding the modern stack expects by default, it holds every character a note can contain, and the web's own standards now mandate it. ASCII text is automatically valid UTF-8, so UTF-8 costs nothing for simple notes and protects you the moment a note picks up an accented name.

The WHATWG Encoding Standard — the normative web specification — is unambiguous: "New protocols and formats, as well as existing formats deployed in new contexts, must use the UTF-8 encoding exclusively," and "Authors must use the UTF-8 encoding and must use its (ASCII case-insensitive) 'utf-8' label to identify it."9 The same standard states the reason in one line: "The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set."10

The world has already converged. According to W3Techs, "UTF-8 is used by 99% of all the websites whose character encoding we know."11 That figure is a web-page statistic, not a measurement of private note files — but it is strong evidence of which encoding everything else now reads and writes by default. Saving a note as UTF-8 is choosing the encoding the rest of the world already assumes.

In code, that default is one line:

>>> with open('note.md', 'w', encoding='utf-8') as f:   # no BOM by default
...     f.write(text)                                   # text is the note's contents (a str)
# To force a BOM (legacy Windows interop only): encoding='utf-8-sig'

This is also the deeper reason a plain-text note outlasts the app that made it — the file is a readable UTF-8 byte stream anyone can open, not a proprietary blob locked to one program. The case for why that durability matters is in Your Notes Are Already AI-Ready; this post is the byte-level reason it works.

A small discipline worth borrowing

There is a quiet best practice that follows from all of this. When text has to survive being served, copied, or read by an unknown machine, keep it in the ASCII safe subset wherever you can, and declare UTF-8 everywhere else.

A document that is plain ASCII cannot mojibake under any ASCII-compatible encoding: it is byte-identical whether a reader assumes UTF-8, Latin-1, or Windows-1252 (RFC 3629's "a plain ASCII string is also a valid UTF-8 string").4 Anywhere your content needs accents, em dashes, or emoji, UTF-8 is the superset that carries them safely. The principle is small and dependable: pick the subset that cannot break, then label the rest.

Frequently asked questions

Why does my text file show weird or garbled characters on another computer?

Because the second computer read the bytes with a different encoding than the first one wrote them in. The bytes are unchanged; only the lookup table differs. A UTF-8 curly apostrophe read as Windows-1252 becomes ’; a UTF-8 é read as Latin-1 becomes é. Re-open the file with the correct encoding and the original text returns.

What does "mojibake" mean?

Mojibake is "the garbled or gibberish text that is the result of text being decoded using an unintended character encoding," per Wikipedia.1 The term comes from Japanese (文字化け, "character transformation"). The key fact is that it is a display problem, not data loss — the bytes are intact and recover fully when read with the right encoding.

What is the UTF-8 BOM, and do I need it?

The BOM is an optional three-byte signature (EF BB BF) at the start of a UTF-8 file that marks the file as UTF-8. Unicode says it "is only used as an encoding signature ... it has nothing to do with byte order."7 You generally do not need it for notes; it helps a few legacy Windows tools but makes some Unix tools show ï»¿. Default to UTF-8 without a BOM.

How do I save a file as UTF-8 without a BOM?

In most text editors, choose "UTF-8" (not "UTF-8 with BOM") in the Save dialog's encoding menu. On macOS or Linux, convert with iconv -f WINDOWS-1252 -t UTF-8 old.txt > new.txt, or strip an existing BOM with sed '1s/^\xEF\xBB\xBF//' withbom.txt > clean.txt. In Python, open('note.md', 'w', encoding='utf-8') writes UTF-8 with no BOM by default.

Is ASCII the same as UTF-8?

For the first 128 characters, yes — ASCII is a subset of UTF-8, and RFC 3629 confirms "a plain ASCII string is also a valid UTF-8 string."4 Every ASCII character is byte-identical in UTF-8. The difference appears beyond character 127: ASCII cannot represent accented letters, em dashes, or emoji, while UTF-8 represents all of Unicode. Use UTF-8; you lose nothing for plain English and gain everything else.

Why is ’ or é showing up in my notes?

’ is a UTF-8 curly apostrophe (bytes E2 80 99) read as Windows-1252; é is a UTF-8 é (bytes C3 A9) read as Latin-1. Both are the same root cause: a file written in UTF-8 was read with a legacy single-byte encoding. The data is fine. Re-open the file as UTF-8 and the apostrophes and accents come back; iconv is only needed when the file itself is stored in a legacy encoding.

Is mojibake permanent? Did I lose my notes?

No. Mojibake is a misreading, not corruption. The original bytes are still on disk, so the text recovers completely once you read it with the encoding it was written in. Use file -I to see what a file claims to be, hexdump -C to inspect the raw bytes, and iconv to convert it to UTF-8. Plain text degrades gracefully — that is one of its quiet strengths.

A note is bytes, and bytes only mean something once you name their encoding — so name it once, choose UTF-8, and your sentences will still open cleanly long after the app that wrote them is gone. If you want notes that stay plain, readable text files on your own device, mnmnote.com keeps them that way.

Footnotes

  1. "Mojibake," Wikipedia, https://en.wikipedia.org/wiki/Mojibake, accessed 2026-06-19.

  2. Joel Spolsky, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)," joelonsoftware.com, 2003-10-08, https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/.

  3. Joel Spolsky, "The Absolute Minimum ... About Unicode and Character Sets," joelonsoftware.com, 2003-10-08, https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/.

  4. F. Yergeau, "UTF-8, a transformation format of ISO 10646," RFC 3629, IETF, November 2003, https://datatracker.ietf.org/doc/html/rfc3629.

  5. "UTF-8, UTF-16, UTF-32 & BOM," Unicode Consortium FAQ, https://www.unicode.org/faq/utf_bom.html, accessed 2026-06-19.

  6. Joel Spolsky, "The Absolute Minimum ... About Unicode and Character Sets," joelonsoftware.com, 2003-10-08, https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/.

  7. "UTF-8, UTF-16, UTF-32 & BOM," Unicode Consortium FAQ, https://www.unicode.org/faq/utf_bom.html, accessed 2026-06-19.

  8. F. Yergeau, "UTF-8, a transformation format of ISO 10646," RFC 3629 §6, IETF, November 2003, https://datatracker.ietf.org/doc/html/rfc3629.

  9. "Encoding Standard," WHATWG (Anne van Kesteren et al., eds.), Last Updated 21 May 2026, https://encoding.spec.whatwg.org/.

  10. "Encoding Standard," WHATWG (Anne van Kesteren et al., eds.), Last Updated 21 May 2026, https://encoding.spec.whatwg.org/.

  11. "Usage statistics of character encodings for websites," W3Techs, dated 19 June 2026, https://w3techs.com/technologies/overview/character_encoding, accessed 2026-06-19.