Encoding
Lab setup
First, make sure you have completed the initial setup.
If you are part of a course
-
Open Terminal. Run the update command to make sure you have the latest code.
$ mwc update
-
Move to this lab's directory.
$ cd ~/Desktop/making_with_code/ny-csdf/unit2/lab_encoding
-
Enter this lab's shell environment.
$ poetry shell
If you are working on your own
-
Move to your MWC directory.
$ cd ~/Desktop/making_with_code
-
Get a copy of this lab's materials.
git clone https://git.makingwithcode.org/mwc/lab_encoding.git
It's all bits
At the lowest level, computers only work with two values: 0
and 1
. (This is called
binary because bi-
means two.) A single binary value (a 0
or a 1
) is called a bit.
Every file on your computer is just a sequence of bits. Every message that gets sent
between computers is just a sequence of bits.
💻
Use the built-in utility xxd
to look at the bits in your files.
$ xxd -b -d poem.txt
00000000: 01001100 01101001 01110110 01101001 01101110 01100111 Living
00000006: 00100000 01101001 01110011 00100000 01101110 01101111 is no
00000012: 00100000 01101100 01100001 01110101 01100111 01101000 laugh
00000018: 01101001 01101110 01100111 00100000 01101101 01100001 ing ma
00000024: 01110100 01110100 01100101 01110010 00111010 00001010 tter:.
...
The first column is an index, showing how far along you are in the file.
Then, the bits are shown in groups of eight, called a byte. The first byte
in this file is 01001100
.
What do these bits mean? Nothing, really. Bits don't have any intrinsic meaning;
they're just ones and zeros. It's up to us to interpret them. In the sections below,
we will interpret bits as several different data types with which we're already
famliar. We'll use a library called easybits
, which was designed to make it easy
to move between bits and the types they can represent. (Python is a high-level language
designed to let users work with objects like integers and strings, without having
to think about how they are represented as bits. When you program in lower-level languages
such as C or Rust, you're working with bits all the time.)
Booleans
Bits can easily be treated as booleans: 0
is False
and 1
is True
.
💻
Try out creating Bits
from booleans. You can create
a single bit with one boolean, or a sequence of bits with a list of booleans.
You can also use a string of 1
and 0
, which is equivalent and easier to type.
$ python
>>> from easybits import Bits
>>> Bits(True)
1
>>> Bits(False)
0
>>> Bits([True, False, True, False])
1010
>>> Bits("01001001")
01001001
Here are some operators you can use to manipulate bits. These correspond
directly to the boolean operators you have used before (not
, and
, or
).
When working with bits, operators are applied bitwise, meaning they are applied
to each bit separately.
op | name | example | description |
---|---|---|---|
~ | not | ~a | Each bit is flipped. 0 becomes 1, and 1 becomes 0. |
& | and | a & b | The result is 1 when both inputs are 1. |
| | or | a | b | The result is 1 when either input is 1. |
^ | exclusive or | a ^ b | The result is 1 when one input, but not both, is 1. |
>> | shift right | a >> 2 | The bits are pushed to the right (2 places in this case), with 0 entering on the left. 2 |
<< | shift left | a << 3 | The bits are pushed to the left (3 places in this case), with 0 entering on the left. |
💻 Try out the bit operators.
>>> a = Bits("11110000")
>>> b = Bits("10101010")
>>> ~a
00001111
>>> a & b
10100000
>>> a | b
11111010
>>> a ^ b
01011010
>>> a >> 2
00111100
>>> a << 2
11000000
Integers
We can also think of bits as integers--use Bits.int
to see the integer representation
of any sequence of bits. When you create Bits
from an integer, you need to specify the length
of your bit sequence--we'll see why shortly. Here are a few more bit operators which
treat bits like integers.
op | name | example | description |
---|---|---|---|
- | negation | -a | Flips the sign. (if a represents 4, -a will represent -4. |
+ | addition | a + b | Performs bitwise addition of a and b . |
- | subtraction | a - b | Performs bitwise subtraction of a and b . (This is implemented as a + (-b) ) |
💻 Create some bits from integers.
>>> Bits(1, length=8)
00000001
>>> Bits(2, length=8)
00000010
>>> Bits(3, length=8)
00000011
>>> Bits(4, length=8)
00000100
>>> Bits(5, length=8)
00000101
>>> Bits(8, length=8)
00001000
>>> Bits(10, length=8)
00001010
Do you see the pattern? The first bit is worth 1, the second bit is worth 2, the third is worth 4, the fourth is worth 8, and so on, with the value of the bit doubling each time. Starting with all the bits turned off, just turn on the bits you need to make the number you want.
Bitwise addition
Bitwise addition works in the same way you learned addition in grade school: starting
from the right, add the numbers in the column and write the sum at the bottom. In normal
base-ten addition, if the sum goes over 9, you "carry" the extra digit to the next column.
The same rule applies here, except binary numbers only use 0
and 1
, so we carry anything
greater than 1
.
111
00011101 # 29
+ 00001110 # 14
----------------
00101011 # 43
One detail: The result of bitwise addition will always be the same number of digits as the
two numbers being added. This means that if we need to carry a 1
beyond the left-most column,
it has nowhere to go, and it just gets thrown away. 11111111
+ 00000001
= 00000000
💻 Play around with bitwise addition.
>>> one = Bits("0001")
>>> one + one
0010
>>> one + one + one
0011
>>> (one + one + one + one).int
4
Negative integers
How should we represent negative numbers? One option would be to simply reserve the first bit
for the sign: 0
is for positive integers and 1
is for negative integers. This works fine,
but it's not ideal because bitwise addition doesn't give the right answer. For example:
010 # 2
+ 101 # -1
-----------
111 # -3
Instead, we'll use a clever (though slightly more complex) approach, starting with the idea that
any integer plus its negative should equal zero. Let's consider 29, or 00011101
. What bit pattern
could be added to 29 to get 00000000
? Well, if we flip all the bits in 29, we get 11100010
, and
if we add this to 00011101
, we get 11111111
. Add 00000001
and the whole thing rolls over, giving
00000000
. This pattern works for any sequence of bits: the bits plus their inverse equals 11111111
;
the bits plus their inverse plus one equals 00000000
. Therefore:
- To make a positive integer negative, flip all the bits and add one.
- To make a negative integer positive, undo the process by subtracting one and flipping all the bits.
What's lovely and brilliant about this system is that it works: bitwise addition on positive and negative
integers gives the right answer. One note: for this system to work, the first digit of positive integers
must always be 0
. When the first digit is 1
, the bits represent a negative integer instead.
💻 Play around with positive and negative integers.
>>> a = Bits(29, length=8)
>>> a
00011101
>>> -a
11100011
>>> -a.int
-29
>>> (a + -a)
00000000
Text
Finally, let's think about how to represent text with bits. Any system for representing text in bits (and converting bits back into text) is called an encoding. As with integers, there's no right way to do this--we could choose any system we want. Some will work better than others.
When computers were first being invented in the early 1960's, a simple encoding called ASCII was created: one byte represents one character, with pre-defined assignments. Most of the characters are letters, numbers, and punctuation, but some of the characters represent physical functions on a printer such as backspace, moving down to the next line, and even ringing a bell. Before screens existed, all the output from a computer was printed onto paper!
💻 Print out all the printable ASCII characters, along with their bit patterns.
>>> for i in range(128):
... b = Bits(i, length=8)
... if b.ascii.isprintable():
... print(b, b.int, b.ascii)
...
ASCII works fine, but it's very limited: the characters listed are all you get. Using the ACSII encoding, it's impossible to write the names of poets like بدر شاكر السياب, Czesław Miłosz, or 李白. And how could we live without 😍?
This has real impacts on people's lives. In most US states, your legal name must be written in ASCII because the computer records systems have not been updated. If your real last name is Muñoz, it will be Munoz in the US. And if your real name is written in a different writing system, your official name in the US will be an ASCII translation. Should the US require everyone to use English? Or should the US update its systems to support the many different languages spoken by its people?
Since the 1980's, people all around the world have been working on a system
called Unicode, whose goal is
"to support the use of text in all of the world's writing systems that can be digitized."
The latest version of Unicode defines over 150,000 characters, including almost
every written language, some dead languages (cuneiform), some that have never existed
(Klingon, Elvish), and emoji. utf8
is the main encoding used to encode Unicode
into bits and back again.
When you initialize Bits
with text, by default it tries to encode the
text using ASCII; you'll get an error if you try to encode non-ASCII characters.
But you can specify another encoding:
>>> Bits('😍', encoding='utf8')
11110000100111111001100010001101