Compression

Lab setup

First, make sure you have completed the initial setup.

Open Terminal. Run the update command to make sure you have the latest code.
```
$ mwc update
```

Move to this lab's directory.

$ cd ~/Desktop/making_with_code/ny-csdf/unit2/lab_compression

Enter this lab's shell environment.
```
$ poetry shell
```

Move to your MWC directory.
```
$ cd ~/Desktop/making_with_code
```

Get a copy of this lab's materials.

git clone https://git.makingwithcode.org/mwc/lab_compression.git

In the previous lab we explored encodings, or how we can store information in bits and get it back again. This lab asks: "How can we store information using fewer bits?" This question leads to another question which is at the heart of computer science: "If we have some information, could we know the smallest possible size in which it could be encoded?"

Compression means packing information into fewer bits. We will explore two types of compression: lossless compression, where we can get back exactly the same information we stored, and lossy compression, where we accept some loss of quality in exchange for a smaller size. You might be familiar with lossy compression of images and video, which can appear grainy or pixelated.

Files

Let's talk briefly about files. A file is just a sequence of bits saved on your computer. Like any other bits, the bits in a file don't have any particular meaning. The file suffix (.txt, .pdf, .jpg, etc.) provides a hint about how to interpret the bits, but no guarantee--you can write whatever bits you want into any file.

Reading and writing files is one way a Python program can interact with the rest of your computer. When you open a file in Python, you are provided with a file handler, which you can then use to read or write the file's contents. The built-in open function requires the file's name (potentially including a path, like ~/Desktop/shopping_list.txt). You can also pass a second argument telling open to open the file in a certain mode:

"r" Read text from the file. (This is the default.)
"w" Write text to the file.
"rb" Read bytes from the file.
"wb" Write bytes to the file.

💻 Practice writing a file and then reading from it.

$ python
>>> with open("hello.txt", "w") as f:
...     f.write("Hello!")
...
>>> with open("hello.txt") as f:
...     text = f.read()
...     print(text)
...
Hello!

Compressing text

In the previous lab we discussed ASCII and UTF8, two text encodings. Let's consider some other possible encodings; then you will create your own. In Python, an encoding is implemented as a coder function (which transforms text into bits) and a decoder function (which transforms bits into text). coder-decoder is sometimes shortened to codec. Several new codecs are defined in text_codecs.

💻 Try out alphanumeric, whose encoder goes through text one character at a time and throws away any which aren't ASCII letters, numbers, or spaces. Also, sequences of multiple spaces are replaced with a single space.

>>> import text_codecs.alphanumeric
>>> "Hello".encode("alphanumeric")
b'Hello'
>>> Bits("Hello", encoding="alphanumeric")
0100100001100101011011000110110001101111
>>> Bits("Hello", encoding="alphanumeric").ascii
'Hello'
>>> "Won't you dance with me?".encode("alphanumeric")
b'Wont you dance with me'

There are a few interesting things here:

When we import the text_codecs.alphanumeric module we make the alphanumeric encoding available.
The result of encoding a str ("Hello".encode("alphanumeric")) is a bytes object. Remember: encoding is the process of representing something as bits. b'Hello' is one way of representing bytes--an ASCII character for each byte.
If you want to see zeros and ones, encode with Bits.

Now look at text_codecs/alphanumeric.py. Two functions are defined: encode (which converts text into bits) and decode (which converts bits into text). Then we register the codec so that later requests to encode or decode can use it.

register_codec(encode, decode, "alphanumeric")

Measuring the compression rate

Our goal is to encode text into as few bits as possible, without losing too much quality. We can objectively measure a codec's rate of compression by comparing the size of the original (utf8) encoded text with the size of the text encoded with the codec being tested.

text_codecs/evaluate.py measures the rate of compression for one or more codecs. By default, this script tests compression using Louisa May Alcott's Little Women, but you can provide a different text if you want.

💻 Use text_codecs/evaluate.py to compare the compression rate of utf8, alphanumeric, and ascii7, another text codec provided in this lab. The text codecs are pretty slow, but their results are saved in the texts directory, so evaluation is fast after the first time.

$ python text_codecs/evaluate.py utf8 alphanumeric ascii7
Evaluating encoding utf8
Evaluating encoding alphanumeric
Evaluating encoding ascii7
encoding        original    compressed    compression_rate
------------  ----------  ------------  ------------------
utf8             1090625       1090625            1
alphanumeric     1090625       1015472            0.931092
ascii7           1090625        953701            0.874454

As we can see, compression with alphanumeric reduces the file size to 93% of the original--not great. Another text codec, ascii7 (which you can see in text_codecs/ascii7.py), does a little better at 87%. Could you do better?

Compression quality

There are lots of ways you could compress text if you're not worried about quality loss. There's no objective way to talk about quality--it's subjective.

💻 Use text_codecs/evaluate.py to look at the result of compressing with alphanumeric. Piping the result into less, as shown in the code below, will give you a helpful scrolling window. Press q to exit.

$ python text_codecs/evaluate.py --inspect alphanumeric | less

Your turn

Now it's your turn to write a text text codec and to evaluate its quality. You can do this any way you want--the only rules are that your encode function needs to take in text and return bytes; your decode function needs to do the reverse. Here are a few ideas:

You are already provided with ascii7; could you do better and create ascii6? With six bits per character, you have 64 possibilities (from 000000 to 111111, or integers from -32 to 31). Isn't that enough? You need 26 for lower-case letters, 26 for upper-case letters, and then you still have 12 left over for spaces and punctuation.
There are only 26 letters in the alphabet. If you allow five bits per character, there are 32 possible combinations (from 00000 to 11111, or integers ranging from -16 to 15). You could convert the entire text to upper-case, and then you'd still have 6 possibilities remaining for other useful characters like spaces and punctuation. You could drop everything else.
ASCII uses one byte (8 bits) per character (or integers ranging from -128 to 127), but only defines the range from 0 to 127. Here's the ASCII character table. You could encode entire common words such as "and" or "the" using a single negative number.
You could find some way of encoding the more common letters using fewer bits. For example, you could design an algorithm like this:
- For encoding, if the character is one of the 8 most common characters, then write a 1. Then use 3 bits (8 possibilities) to specify which character it is. Otherwise, write the character's ASCII code (which always starts with 0).
- For decoding, read the next bit. If it's a 1, then read the next 3 bits and use them to determine which of 8 common characters is encoded. If it's a 0, then read the next 7 bits and decode them as a regular ASCII byte.