Everything in this site is original, copyright protected material.
The following demonstration allows you to see with your own eyes
how data is being compressed.
NOTE: html5 required. This is a beta version. Please send me your comments or notify me if you encounter bugs: firstname.lastname@example.org
Step #1: Type English text here
Or use the given example: the first verses of Edgar Ellen Poe's Annabel Lee.
Step #3: Click "Compress: Full" button to start compression
Step #4: Read explanation here
How it works
The compression above uses the Toy Compression Algorithm (TCA), an algorithm made especially for educational purposes and especially for Zutopedia. It works as follows.
Take a look at this example text:
THE FATHER OF ALBERT AND THE MOTHER OF ALBERT ARE TOGETHER AGAIN
It has 64 characters (including spaces). We marked in green all the appearances of the string "THE". As you can see, it appears five times. This is a lot, which is typical of the English language. However, the letter Q is missing altogether. So why not switch?
Q FAQR OF ALBERT AND Q MOQR OF ALBERT ARE TOGEQR AGAIN
The text is now only 54 characters long. We saved 10 characters. You might say we compressed it a little. However, in order to be able to decompress it later, we need to make a comment of what we've done:
Q FAQR OF ALBERT AND Q MOQR OF ALBERT ARE TOGEQR AGAIN|THE,Q
We appended a comment in the end, using special characters "|" and ",". The comment tells us we switched "THE" and "Q". When we want to decompress it, we simply switch back all Qs to "THE", according to the comment. The comment itself costs 6 characters, so now we're at 60 characters. This is still shorter than the original!
We can repeat the same trick again. Notice the string "OF ALBERT" appears twice, while the letter "X" doesn't appear at all. So we can make another switch, and add another comment:
Q FAQR X AND Q MOQR X ARE TOGEQR AGAIN|THE,Q|OF ALBERT,X
This shortened our text to 56 characters.
So this is how TCA works: it searches for strings that appear frequently on the one hand, and letters or short strings that are missing on the other hand. It then performs the most cost effective switch. It repeats this until no further cost effective replacements can be found.
How do real compression algorithms work?
Real algorithms use the same principle: they exploit frequent patterns in the input data. In contrast with TCA, they don't just recode the frequent patterns. They recode everything. Imagine for example that the letter Q appears once in a text. We can recode it with some long string "XYBADZDA". This frees up Q to be a code for a pattern that appears frequently in the text.