How much data can UTF-8 hold?

(This is the first article in the series for Golf Horse: Wordle)

UTF-8 has become a nearly universal format for encoding text on the web, and indeed, it’s the only encoding that the golf horse competition will accept. For the purposes of data compression, we want to figure out how many bits we can theoretically pack into a valid UTF-8 file. We’ll take a look at how the UTF-8 format works, and then use some math to figure out the theoretical data capacity of UTF-8.

The UTF-8 format

Let’s start with a quick summary of the UTF-8 format. The “8” in “UTF-8” means that the basic unit is 8 bits—better known as 1 byte. Each character in the Unicode set is assigned a unique representation consisting of 1 to 4 bytes.

128 characters with 1-byte representations
1920 characters with 2-byte representations
61440 characters with 3-byte representations
1048576 characters with 4-byte representations

A valid UTF-8 file is simply a concatenation of characters’ representations. For the purposes of this article, that’s all we need to know about UTF-8!

Calculating data capacity

What do we mean by “data capacity”? We’d like to think about how many bits the data source could encode. Since \(N\) bits have \(2^N\) possible combinations, we need at least \(2^N\) encodings to hold them all. So, the theoretical number of bits that can be encoded is \(\log_2\) of the number of valid encodings. For example, a 100-byte binary file has \(256^{100}\) possible encodings, so it can represent \(\log_2 256^{100} = 800\) bits. If you want to learn more about analyzing data capacity, information theory is the subject.

So, our next step is to enumerate the number of possible UTF-8 encodings. Let \(x_n\) be the number of valid UTF-8 files of length \(n\). We have the obvious \(x_0=1\). Because there are 128 characters with 1-byte representations, and those are the only possible 1-byte UTF-8 files, \(x_1 = 128\).

Looking at possible 2-byte UTF-8 files, there are 1920 possible files containing a 2-byte character, and \(128 \times 128\) possible files containing two 1-byte characters, for a total of \(x_2=1920+128\times 128 = 18304\).

Continuing in this manner, we can develop a recursive relationship between the \(x_n\) as follows:

\(x_n = 128x_{n-1} + 1920x_{n-2} + 61440x_{n-3} + 1048576x_{n-4}\)

This is saying that the number of files of length \(n\) is the sum of:

The number of (n-1)-byte files times the number of 1-byte characters
The number of (n-2)-byte files times the number of 2-byte characters
The number of (n-3)-byte files times the number of 3-byte characters
The number of (n-4)-byte files times the number of 4-byte characters

This is a linear recurrence, which has a well-developed mathematical theory. I’ll present an informal method to estimate \(x_n\).

We’ll start by assuming \(x_n\) is exponential, so \(x_n = ab^n\). This isn’t exactly correct, but it’s close enough to get us the right result. Let’s substitute into the recurrence relation:

\(ab^n = 128ab^{n-1} + 1920ab^{n-2} + 61440ab^{n-3} + 1048576ab^{n-4}\)

We divide through by \(ab^{n-4}\) and are left with a polynomial in \(b\):

\(b^4 = 128b^3 + 1920b^2 + 61440b + 1048576\)

Plugging this into your favorite computer algebra system, we find that its only positive solution is \(b=144.57…\).

Since \(x_0=1\), we have \(a=1\).

Therefore, we have \(x_n \approx 144.57^n\), and our data capacity is

\(\log_2 x_n \approx \log_2 144.57^n = n \log_2 144.57\)

This says that the capacity is proportional to the number of bytes, which makes sense, after all! If we divide through by the number of bytes \(n\), we have the data capacity of UTF-8 is about \(\log_2 144.57 \approx 7.1756\) bits per byte.

How is this useful?

We’ve established that UTF-8 can theoretically hold 7.1756 bits per byte, but we haven’t shown how it’s actually possible! An actual implementation is unlikely to reach the theoretical limit, so this is an upper bound which we will aspire to. In the next article, we’ll take a look at some practical options for packing data into UTF-8.

How much data can UTF-8 hold?

How much data can UTF-8 hold?

The UTF-8 format

Calculating data capacity

How is this useful?

Leave a Reply Cancel reply