Why Your MySQL Database Thinks “café” “café”
Tags: #mysql #database #beginners #webdev
Cover Image Concept: Split image showing two identical-looking coffee cups with “café” written on them, but one has a ❌ between them, symbolizing database comparison failure.
Meta Description: That moment when your database can’t find “café” even though it’s right there. Dive into the surprisingly romantic world of character sets and collations—where encoding meets comparison, and tiny differences break everything.
The Scene
It’s 2:47 AM. You’re debugging in production (yes, we’ve all been there). A customer in Paris reports they can’t find their favorite restaurant, “Café Rouge,” even though they just created it. You run a simple query:
SELECT * FROM restaurants WHERE name = 'Café Rouge';
Empty set. Zero rows.
But when you look at the actual data in the table, there it is, staring back at you like a mocking phantom:
| id | name |
|----|------------|
| 42 | Café Rouge |
You rub your eyes. You question your career choices. You consider becoming a goat farmer in rural Montana.
Welcome to the invisible battlefield of character encodings.
The Plot Twist: Two “é”s Walk Into a Database…
Here’s the thing that will blow your mind: there are multiple ways to encode the letter “é”.
-
Precomposed character:
é(Unicode U+00E9 – a single character) -
Decomposed characters:
e+´(Unicode U+0065 + U+0301 – base letter + combining accent)
They look identical to humans. They’re completely different to computers. And if your customer typed one while your database stored the other? Good luck with that.
User types: Café (é = U+00E9)
DB stores: Café (é = U+0065 + U+0301)
Comparison: 🔥 Everything is fine 🔥
This is where character sets and collations enter the stage like a romantic couple trying to solve compatibility issues.
Character Sets: The Dictionary of Love
Think of a character set as a language dictionary, but instead of words, it’s mapping symbols to numbers. Let’s use MySQL’s adorable example from their docs:
Imaginary Alphabet:
A = 0
B = 1
a = 2
b = 3
Every letter gets a number (called an encoding). The collection of all these mappings? That’s your character set.
In real life, we have character sets like:
- UTF-8 (utf8mb4 in MySQL): The Swiss Army knife – handles emoji, Chinese characters, ancient Egyptian hieroglyphs 🐪
- Latin1: The old reliable – English and Western European languages
- ASCII: The grandfather – English only, 7-bit encoding
The UTF-8 vs utf8mb4 Drama
MySQL had an interesting oopsie moment in history. They created utf8 which… doesn’t fully support UTF-8. It only handles 3-byte characters, missing emojis and rare characters.
-- This will haunt you
CREATE TABLE old_table (
message VARCHAR(100) CHARACTER SET utf8
);
INSERT INTO old_table VALUES ('I love coding! 😍');
-- Error: Incorrect string value: 'xF0x9Fx98x8D'
Always use utf8mb4 (UTF-8 with full 4-byte support) in modern MySQL:
CREATE TABLE modern_table (
message VARCHAR(100) CHARACTER SET utf8mb4
);
INSERT INTO modern_table VALUES ('I love coding! 😍');
-- Success! ✨
Collations: The Rules of Engagement
If character sets are the dictionary, collations are the grammar rules for comparison. They answer questions like:
- Is
Aequal toa? (case sensitivity) - Is
caféequal tocafe? (accent sensitivity) - Is
ßequal toss? (multi-character mappings) - Should
10come before2when sorting? (natural ordering)
The Collation Naming Convention Decoder Ring
MySQL collations follow a pattern: charset_language_sensitivity
utf8mb4_general_ci
│ │ │
│ │ └─ ci = Case Insensitive
│ └─────────── general sorting rules
└─────────────────── character set
utf8mb4_0900_ai_ci
│ │ │ │
│ │ │ └─ ci = Case Insensitive
│ │ └──── ai = Accent Insensitive
│ └───────── Unicode 9.0 standard
└───────────────── character set
utf8mb4_bin
│ │
│ └─ bin = Binary (byte-by-byte comparison)
└─────────── character set
The Love Triangle: Binary vs CI vs CS
Let’s see how different collations handle the same comparison:
-- Setup
SET @name1 = 'José';
SET @name2 = 'jose';
SET @name3 = 'José'; -- Different é encoding
-- Binary Collation (byte comparison)
SELECT @name1 = @name2 COLLATE utf8mb4_bin;
-- Result: 0 (false - different bytes)
-- Case Insensitive Collation
SELECT @name1 = @name2 COLLATE utf8mb4_general_ci;
-- Result: 1 (true - case doesn't matter)
-- Accent Insensitive Collation
SELECT 'José' = 'Jose' COLLATE utf8mb4_0900_ai_ci;
-- Result: 1 (true - accents don't matter either!)
Practical Impact:
-- Your username table
CREATE TABLE users (
username VARCHAR(50) COLLATE utf8mb4_0900_as_cs
-- as = accent sensitive, cs = case sensitive
);
INSERT INTO users VALUES ('José'), ('jose'), ('Jose');
-- Three distinct users! ✅
-- Your search table
CREATE TABLE search_terms (
query VARCHAR(100) COLLATE utf8mb4_0900_ai_ci
);
-- This finds all variations
SELECT * FROM products
WHERE name LIKE '%café%' COLLATE utf8mb4_0900_ai_ci;
-- Matches: café, Café, CAFÉ, cafe, CAFE ✅
The Real-World Horror Stories
Story #1: The Umlaut Incident
A German e-commerce site allowed users to search for “Müller” (a common name). But the database was set to latin1 with binary collation:
-- Database has: Müller (ü = single character)
-- User types: Müller (ü = u + ¨ combining)
SELECT * FROM users WHERE name = 'Müller';
-- No results! 😱
Fix: Use utf8mb4_unicode_ci which normalizes combining characters:
ALTER TABLE users
MODIFY name VARCHAR(100)
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
Story #2: The Emoji Apocalypse
A social media app stored posts in a utf8 table. When iOS users started adding emoji reactions (💯🔥😂), the database threw errors:
INSERT INTO posts (content) VALUES ('This is fire! 🔥');
-- ERROR 1366: Incorrect string value: 'xF0x9Fx94x25'
Prevention:
-- Check your defaults
SHOW VARIABLES LIKE 'character_set%';
-- Set server-wide defaults (my.cnf)
[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
-- Or per-connection
SET NAMES utf8mb4;
The Performance Plot Twist
Different collations have wildly different performance characteristics:
-- Speed Test (1M rows)
-- Binary collation (byte comparison)
SELECT COUNT(*) FROM users
WHERE email = 'test@example.com' COLLATE utf8mb4_bin;
-- ~ 0.05 seconds ⚡
-- Unicode collation (sophisticated rules)
SELECT COUNT(*) FROM users
WHERE email = 'test@example.com' COLLATE utf8mb4_unicode_ci;
-- ~ 0.12 seconds 🐌 (2.4x slower)
When to use binary collations:
- Exact matching (emails, API keys, tokens)
- Case-sensitive identifiers
- Performance-critical queries
When to use unicode collations:
- User-facing search
- International content
- Name matching
- Anything that humans will type
The Mixing Disaster (And How to Survive It)
MySQL lets you mix character sets and collations at different levels:
Server → Database → Table → Column → Query
This is both powerful and dangerous:
-- The Tower of Babel
CREATE DATABASE app
CHARACTER SET utf8mb4
COLLATE utf8mb4_general_ci;
CREATE TABLE app.users (
id INT PRIMARY KEY,
email VARCHAR(255) CHARACTER SET latin1, -- ⚠️
username VARCHAR(100) COLLATE utf8mb4_bin -- ⚠️
);
-- Now watch this burn:
SELECT * FROM users u1
JOIN users u2 ON u1.username = u2.email;
-- ERROR 1267: Illegal mix of collations
The Golden Rule: Pick one character set and one collation for your entire application and stick with it. Mixing is for experts and masochists.
My recommendation for 99% of applications:
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
The Debug Toolkit
When things go sideways, these queries are your forensic tools:
-- Inspect a specific string's encoding
SELECT HEX('café');
-- Returns: 636166C3A9 (UTF-8 encoded)
-- Check column character set
SELECT
TABLE_NAME,
COLUMN_NAME,
CHARACTER_SET_NAME,
COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'your_database'
AND TABLE_NAME = 'your_table';
-- See how collations compare strings
SELECT 'Müller' = 'Mueller' COLLATE utf8mb4_unicode_ci;
-- Some German collations return TRUE!
-- Find collation mismatches in your schema
SELECT DISTINCT COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'your_database';
-- Should ideally return just 1-2 results
The Migration Survival Guide
Already have a database with wrong character sets? Here’s the escape hatch:
-- Step 1: Backup (seriously, do this)
mysqldump your_database > backup.sql
-- Step 2: Convert database default
ALTER DATABASE your_database
CHARACTER SET = utf8mb4
COLLATE = utf8mb4_unicode_ci;
-- Step 3: Convert tables (this is the scary part)
ALTER TABLE your_table
CONVERT TO CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
-- Step 4: Verify
SELECT
TABLE_NAME,
TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'your_database';
⚠️ Warning: CONVERT TO changes the actual data bytes. Use MODIFY COLUMN if you just want to change metadata.
The Philosophy
Character sets and collations are ultimately about reconciling human language with digital precision. Humans write “café” a dozen different ways and expect computers to understand. Computers want bytes and rules.
The magic happens when you choose the right balance:
- Too strict (binary collations): Users can’t find anything
- Too loose (accent-insensitive + case-insensitive): Username collisions
- Just right: Depends on your use case!
Precision ←---------------→ Flexibility
↑ ↑
│ │
Email addresses User search
API tokens Product names
Password hashes City names
The Checklist
Before deploying your next database:
✅ Set character-set-server=utf8mb4 in my.cnf
✅ Use utf8mb4, never utf8
✅ Choose collation based on use case (unicode_ci for most apps)
✅ Be consistent across all tables
✅ Test with emoji: 🎉🔥💯 (if they break, you have problems)
✅ Test with accents: café, Müller, São Paulo
✅ Add character set to your connection strings
✅ Document your choices for future maintainers
The Happy Ending
Remember our café that couldn’t be found? Here’s the fix:
-- Before
CREATE TABLE restaurants (
name VARCHAR(100) -- Inherited wrong collation
);
-- After
CREATE TABLE restaurants (
name VARCHAR(100) CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci
);
-- Now this works
SELECT * FROM restaurants
WHERE name = 'Café Rouge';
-- Returns: Café Rouge ✨
-- And so does this
SELECT * FROM restaurants
WHERE name = 'cafe rouge';
-- Also returns: Café Rouge ✨✨
The customer in Paris can find their restaurant. The 2:47 AM debug session is over. The goat farm in Montana can wait.
Character sets and collations: they’re not just technical details. They’re the invisible translators between human language and digital storage, working tirelessly to ensure that “café” equals “café” in all the ways that matter.
Further Reading
- MySQL Character Set Documentation
- Unicode Normalization Forms
- The Absolute Minimum Every Software Developer Must Know About Unicode
Question for the comments: What’s the strangest character encoding bug you’ve encountered? I once spent three days tracking down why user names with ñ were failing… only to discover it was a Mac vs Windows keyboard encoding issue. 😅
Did this save you from a 2 AM production incident? Drop a ☕ in the comments and share your own encoding horror stories!