Compression

The following demonstration allows you to see with your own eyes how data is being compressed.
Follow the steps below to see it and then read below how it works.

NOTE: html5 required. This is a beta version. Please send me your comments or notify me if you encounter bugs: home@zutopedia.com

Step #1: Type English text here

Or use the given example: the first verses of Edgar Ellen Poe's Annabel Lee.

Step #2: Click this button to transfer text to the compression chamber.
For simplicity, only english characters are copied, and only the first 1440.
  

Original length:
Current length:
Space savings:
Step #3: Click "Compress: Full" button to start compression

Step #4: Read explanation here

How it works

The compression above uses the Toy Compression Algorithm (TCA), an algorithm made especially for educational purposes and especially for Zutopedia. It works as follows.

Take a look at this example text:

THE FATHER OF ALBERT AND THE MOTHER OF ALBERT ARE TOGETHER AGAIN

It has 64 characters (including spaces). We marked in green all the appearances of the string "THE". As you can see, it appears five times. This is a lot, which is typical of the English language. However, the letter Q is missing altogether. So why not switch?

Q FAQR OF ALBERT AND Q MOQR OF ALBERT ARE TOGEQR AGAIN

The text is now only 54 characters long. We saved 10 characters. You might say we compressed it a little. However, in order to be able to decompress it later, we need to make a comment of what we've done:

Q FAQR OF ALBERT AND Q MOQR OF ALBERT ARE TOGEQR AGAIN|THE,Q

We appended a comment in the end, using special characters "|" and ",". The comment tells us we switched "THE" and "Q". When we want to decompress it, we simply switch back all Qs to "THE", according to the comment. The comment itself costs 6 characters, so now we're at 60 characters. This is still shorter than the original!

We can repeat the same trick again. Notice the string "OF ALBERT" appears twice, while the letter "X" doesn't appear at all. So we can make another switch, and add another comment:

Q FAQR X AND Q MOQR X ARE TOGEQR AGAIN|THE,Q|OF ALBERT,X

This shortened our text to 56 characters.

So this is how TCA works: it searches for strings that appear frequently on the one hand, and letters or short strings that are missing on the other hand. It then performs the most cost effective switch. It repeats this until no further cost effective replacements can be found.

How do real compression algorithms work?

Real algorithms use the same principle: they exploit frequent patterns in the input data. In contrast with TCA, they don't just recode the frequent patterns. They recode everything. Imagine for example that the letter Q appears once in a text. We can recode it with some long string "XYBADZDA". This frees up Q to be a code for a pattern that appears frequently in the text.

See also


A book about computers for ages 9+

Proof that computers can't do everything

A physics riddle

Sorting competitions