Encoding

Lab setup

First, make sure you have completed the initial setup.

If you are part of a course

  1. Open Terminal. Run the update command to make sure you have the latest code.
    $ mwc update
  2. Move to this lab's directory.
    $ cd ~/Desktop/making_with_code/ny-csdf/unit2/lab_encoding
    
  3. Enter this lab's shell environment.
    $ poetry shell
    

If you are working on your own

  1. Move to your MWC directory.
    $ cd ~/Desktop/making_with_code
    
  2. Get a copy of this lab's materials.
    git clone https://git.makingwithcode.org/mwc/lab_encoding.git

It's all bits

At the lowest level, computers only work with two values: 0 and 1. (This is called binary because bi- means two.) A single binary value (a 0 or a 1) is called a bit. Every file on your computer is just a sequence of bits. Every message that gets sent between computers is just a sequence of bits.

💻 Use the built-in utility xxd to look at the bits in your files.

$ xxd -b -d poem.txt
00000000: 01001100 01101001 01110110 01101001 01101110 01100111  Living
00000006: 00100000 01101001 01110011 00100000 01101110 01101111   is no
00000012: 00100000 01101100 01100001 01110101 01100111 01101000   laugh
00000018: 01101001 01101110 01100111 00100000 01101101 01100001  ing ma
00000024: 01110100 01110100 01100101 01110010 00111010 00001010  tter:.
...

The first column is an index, showing how far along you are in the file. Then, the bits are shown in groups of eight, called a byte. The first byte in this file is 01001100.

What do these bits mean? Nothing, really. Bits don't have any intrinsic meaning; they're just ones and zeros. It's up to us to interpret them. In the sections below, we will interpret bits as several different data types with which we're already famliar. We'll use a library called easybits, which was designed to make it easy to move between bits and the types they can represent. (Python is a high-level language designed to let users work with objects like integers and strings, without having to think about how they are represented as bits. When you program in lower-level languages such as C or Rust, you're working with bits all the time.)

Booleans

Bits can easily be treated as booleans: 0 is False and 1 is True.

💻 Try out creating Bits from booleans. You can create a single bit with one boolean, or a sequence of bits with a list of booleans. You can also use a string of 1 and 0, which is equivalent and easier to type.

$ python 
>>> from easybits import Bits
>>> Bits(True)
1
>>> Bits(False)
0
>>> Bits([True, False, True, False])
1010
>>> Bits("01001001")
01001001

Here are some operators you can use to manipulate bits. These correspond directly to the boolean operators you have used before (not, and, or). When working with bits, operators are applied bitwise, meaning they are applied to each bit separately.

opnameexampledescription
~not~aEach bit is flipped. 0 becomes 1, and 1 becomes 0.
&anda & bThe result is 1 when both inputs are 1.
|ora | bThe result is 1 when either input is 1.
^exclusive ora ^ bThe result is 1 when one input, but not both, is 1.
>>shift righta >> 2The bits are pushed to the right (2 places in this case), with 0 entering on the left. 2
<<shift lefta << 3The bits are pushed to the left (3 places in this case), with 0 entering on the left.

💻 Try out the bit operators.

>>> a = Bits("11110000")
>>> b = Bits("10101010")
>>> ~a
00001111
>>> a & b
10100000
>>> a | b
11111010
>>> a ^ b
01011010
>>> a >> 2
00111100
>>> a << 2
11000000

Integers

We can also think of bits as integers--use Bits.int to see the integer representation of any sequence of bits. When you create Bits from an integer, you need to specify the length of your bit sequence--we'll see why shortly. Here are a few more bit operators which treat bits like integers.

opnameexampledescription
-negation-aFlips the sign. (if a represents 4, -a will represent -4.
+additiona + bPerforms bitwise addition of a and b.
-subtractiona - bPerforms bitwise subtraction of a and b. (This is implemented as a + (-b))

💻 Create some bits from integers.

>>> Bits(1, length=8)
00000001
>>> Bits(2, length=8)
00000010
>>> Bits(3, length=8)
00000011
>>> Bits(4, length=8)
00000100
>>> Bits(5, length=8)
00000101
>>> Bits(8, length=8)
00001000
>>> Bits(10, length=8)
00001010

Do you see the pattern? The first bit is worth 1, the second bit is worth 2, the third is worth 4, the fourth is worth 8, and so on, with the value of the bit doubling each time. Starting with all the bits turned off, just turn on the bits you need to make the number you want.

Bitwise addition

Bitwise addition works in the same way you learned addition in grade school: starting from the right, add the numbers in the column and write the sum at the bottom. In normal base-ten addition, if the sum goes over 9, you "carry" the extra digit to the next column. The same rule applies here, except binary numbers only use 0 and 1, so we carry anything greater than 1.

    111   
  00011101  # 29
+ 00001110  # 14
----------------
  00101011  # 43

One detail: The result of bitwise addition will always be the same number of digits as the two numbers being added. This means that if we need to carry a 1 beyond the left-most column, it has nowhere to go, and it just gets thrown away. 11111111 + 00000001 = 00000000

💻 Play around with bitwise addition.

>>> one = Bits("0001")
>>> one + one
0010
>>> one + one + one
0011
>>> (one + one + one + one).int
4

Negative integers

How should we represent negative numbers? One option would be to simply reserve the first bit for the sign: 0 is for positive integers and 1 is for negative integers. This works fine, but it's not ideal because bitwise addition doesn't give the right answer. For example:

  010  #  2
+ 101  # -1
-----------
  111  # -3

Instead, we'll use a clever (though slightly more complex) approach, starting with the idea that any integer plus its negative should equal zero. Let's consider 29, or 00011101. What bit pattern could be added to 29 to get 00000000? Well, if we flip all the bits in 29, we get 11100010, and if we add this to 00011101, we get 11111111. Add 00000001 and the whole thing rolls over, giving 00000000. This pattern works for any sequence of bits: the bits plus their inverse equals 11111111; the bits plus their inverse plus one equals 00000000. Therefore:

What's lovely and brilliant about this system is that it works: bitwise addition on positive and negative integers gives the right answer. One note: for this system to work, the first digit of positive integers must always be 0. When the first digit is 1, the bits represent a negative integer instead.

💻 Play around with positive and negative integers.

>>> a = Bits(29, length=8)
>>> a
00011101
>>> -a
11100011
>>> -a.int
-29
>>> (a + -a)
00000000

Text

Finally, let's think about how to represent text with bits. Any system for representing text in bits (and converting bits back into text) is called an encoding. As with integers, there's no right way to do this--we could choose any system we want. Some will work better than others.

When computers were first being invented in the early 1960's, a simple encoding called ASCII was created: one byte represents one character, with pre-defined assignments. Most of the characters are letters, numbers, and punctuation, but some of the characters represent physical functions on a printer such as backspace, moving down to the next line, and even ringing a bell. Before screens existed, all the output from a computer was printed onto paper!

💻 Print out all the printable ASCII characters, along with their bit patterns.

>>> for i in range(128):
...     b = Bits(i, length=8)
...     if b.ascii.isprintable():
...         print(b, b.int, b.ascii)
...

ASCII works fine, but it's very limited: the characters listed are all you get. Using the ACSII encoding, it's impossible to write the names of poets like بدر شاكر السياب, Czesław Miłosz, or 李白. And how could we live without 😍?

This has real impacts on people's lives. In most US states, your legal name must be written in ASCII because the computer records systems have not been updated. If your real last name is Muñoz, it will be Munoz in the US. And if your real name is written in a different writing system, your official name in the US will be an ASCII translation. Should the US require everyone to use English? Or should the US update its systems to support the many different languages spoken by its people?

Since the 1980's, people all around the world have been working on a system called Unicode, whose goal is "to support the use of text in all of the world's writing systems that can be digitized." The latest version of Unicode defines over 150,000 characters, including almost every written language, some dead languages (cuneiform), some that have never existed (Klingon, Elvish), and emoji. utf8 is the main encoding used to encode Unicode into bits and back again.

When you initialize Bits with text, by default it tries to encode the text using ASCII; you'll get an error if you try to encode non-ASCII characters. But you can specify another encoding:

>>> Bits('😍', encoding='utf8')
11110000100111111001100010001101