Compression
Lab setup
First, make sure you have completed the initial setup.
If you are part of a course
-
Open Terminal. Run the update command to make sure you have the latest code.
$ mwc update
-
Move to this lab's directory.
$ cd ~/Desktop/making_with_code/ny-csdf/unit2/lab_compression
-
Enter this lab's shell environment.
$ poetry shell
If you are working on your own
-
Move to your MWC directory.
$ cd ~/Desktop/making_with_code
-
Get a copy of this lab's materials.
git clone https://git.makingwithcode.org/mwc/lab_compression.git
In the previous lab we explored encodings, or how we can store information in bits and get it back again. This lab asks: "How can we store information using fewer bits?" This question leads to another question which is at the heart of computer science: "If we have some information, could we know the smallest possible size in which it could be encoded?"
Compression means packing information into fewer bits. We will explore two types of compression: lossless compression, where we can get back exactly the same information we stored, and lossy compression, where we accept some loss of quality in exchange for a smaller size. You might be familiar with lossy compression of images and video, which can appear grainy or pixelated.
Files
Let's talk briefly about files. A file is just a sequence of bits saved on your computer.
Like any other bits, the bits in a file don't have any particular meaning.
The file suffix (.txt
, .pdf
, .jpg
, etc.) provides a hint about how to interpret
the bits, but no guarantee--you can write whatever bits you want into any file.
Reading and writing files is one way a Python program can interact with the rest of your
computer. When you open a file in Python, you are provided with a file handler, which
you can then use to read or write the file's contents. The built-in open
function
requires the file's name (potentially including a path, like ~/Desktop/shopping_list.txt
).
You can also pass a second argument telling open
to open the file in a certain mode:
"r"
Read text from the file. (This is the default.)"w"
Write text to the file."rb"
Read bytes from the file."wb"
Write bytes to the file.
💻 Practice writing a file and then reading from it.
$ python
>>> with open("hello.txt", "w") as f:
... f.write("Hello!")
...
>>> with open("hello.txt") as f:
... text = f.read()
... print(text)
...
Hello!
Compressing text
In the previous lab we discussed ASCII and UTF8, two text encodings. Let's consider some
other possible encodings; then you will create your own. In Python, an encoding is
implemented as a coder function (which transforms text into bits) and a decoder function
(which transforms bits into text). coder-decoder is sometimes shortened to codec.
Several new codecs are defined in text_codecs
.
💻
Try out alphanumeric
, whose encoder goes through text one character
at a time and throws away any which aren't ASCII letters, numbers, or spaces. Also, sequences
of multiple spaces are replaced with a single space.
>>> import text_codecs.alphanumeric
>>> "Hello".encode("alphanumeric")
b'Hello'
>>> Bits("Hello", encoding="alphanumeric")
0100100001100101011011000110110001101111
>>> Bits("Hello", encoding="alphanumeric").ascii
'Hello'
>>> "Won't you dance with me?".encode("alphanumeric")
b'Wont you dance with me'
There are a few interesting things here:
- When we import the
text_codecs.alphanumeric
module we make the alphanumeric encoding available. - The result of encoding a
str
("Hello".encode("alphanumeric")
) is abytes
object. Remember: encoding is the process of representing something as bits.b'Hello'
is one way of representing bytes--an ASCII character for each byte. - If you want to see zeros and ones, encode with
Bits
.
Now look at text_codecs/alphanumeric.py
. Two functions are defined: encode
(which converts text
into bits) and decode
(which converts bits into text). Then we register the codec so that later requests
to encode or decode can use it.
register_codec(encode, decode, "alphanumeric")
Measuring the compression rate
Our goal is to encode text into as few bits as possible, without losing too much quality.
We can objectively measure a codec's rate of compression by comparing the size of the original (utf8
)
encoded text with the size of the text encoded with the codec being tested.
text_codecs/evaluate.py
measures the rate of compression for one or more codecs.
By default, this script tests compression using Louisa May Alcott's Little Women, but you can provide a
different text if you want.
💻
Use text_codecs/evaluate.py
to compare the compression rate of utf8
, alphanumeric
, and
ascii7
, another text codec provided in this lab. The text codecs are pretty slow, but their results
are saved in the texts
directory, so evaluation is fast after the first time.
$ python text_codecs/evaluate.py utf8 alphanumeric ascii7
Evaluating encoding utf8
Evaluating encoding alphanumeric
Evaluating encoding ascii7
encoding original compressed compression_rate
------------ ---------- ------------ ------------------
utf8 1090625 1090625 1
alphanumeric 1090625 1015472 0.931092
ascii7 1090625 953701 0.874454
As we can see, compression with alphanumeric
reduces the file size to
93% of the original--not great. Another text codec,
ascii7
(which you can see in text_codecs/ascii7.py
), does a little
better at 87%. Could you do better?
Compression quality
There are lots of ways you could compress text if you're not worried about quality loss. There's no objective way to talk about quality--it's subjective.
💻
Use text_codecs/evaluate.py
to look at the result of compressing
with alphanumeric
. Piping the result into less
, as shown in the code below,
will give you a helpful scrolling window. Press q
to exit.
$ python text_codecs/evaluate.py --inspect alphanumeric | less
Your turn
Now it's your turn to write a text text codec and to evaluate its quality.
You can do this any way you want--the only rules are that your encode
function
needs to take in text and return bytes; your decode
function needs to do the
reverse. Here are a few ideas:
- You are already provided with
ascii7
; could you do better and createascii6
? With six bits per character, you have 64 possibilities (from000000
to111111
, or integers from -32 to 31). Isn't that enough? You need 26 for lower-case letters, 26 for upper-case letters, and then you still have 12 left over for spaces and punctuation. - There are only 26 letters in the alphabet. If you allow five bits per character,
there are 32 possible combinations (from
00000
to11111
, or integers ranging from -16 to 15). You could convert the entire text to upper-case, and then you'd still have 6 possibilities remaining for other useful characters like spaces and punctuation. You could drop everything else. - ASCII uses one byte (8 bits) per character (or integers ranging from -128 to 127), but only defines the range from 0 to 127. Here's the ASCII character table. You could encode entire common words such as "and" or "the" using a single negative number.
- You could find some way of encoding the more common letters using fewer bits. For
example, you could design an algorithm like this:
- For encoding, if the character is one of the 8 most common characters, then write a 1. Then use 3 bits (8 possibilities) to specify which character it is. Otherwise, write the character's ASCII code (which always starts with 0).
- For decoding, read the next bit. If it's a 1, then read the next 3 bits and use them to determine which of 8 common characters is encoded. If it's a 0, then read the next 7 bits and decode them as a regular ASCII byte.