Encryption

A look under the hood...

Our first question is how we're going to convert our text message into a useful (and relatively standard) numeric form. It is natural to use the encodings that already underlie our computational infrastructure. For this, it will be helpful to understand the distinction between 'strings' and 'bytes'.

Strings are made of characters, the language that people use, and are stored in your computer's memory at a higher level of abstraction. Bytes, on the other hand, are actually sets of bits, what we represent as 0's and 1's in the binary code that zips through the logic gates of your processor. The characters that those numbers refer to depend on your "encoding," the instructions you give your computer for doing that translation work from the 1's and 0's of bits to the letters, punctuation, and emoji on your screen.

The default encoding used in most contexts is 'utf-8' (Unicode Transformation Format - 8 bit). Historically speaking, ASCII (the American Standard Code for Information Interchange) came first. In ASCII, seven 1's and 0's are used to encode what is called a 7-bit byte, with an 8th bit leading as a 0 for... reasons. Without dipping too deeply into this ocean, some of the bit strings were reserved for things like controlling your printer or sending other kinds of signals. ASCII carried not just character symbols but these control signals as well, so its design was extremely concise. This being the 1960s, it was important that the encoding use as little memory as possible. There are different extensions of 7-bit ASCII into 8-bit encodings that take advantage of the other 128 numbers you can represent by using that last bit, and the history and design of these extensions are a quirky little piece of computer science history.

With only 7 bits to play with, the smallest representable number is 0000000, also known as 0. The largest representable number is 1111111, or 127 in base 10. Importantly, for an 8-bit byte, the extra digit doubles the number of possibilities, meaning each 8-bit byte represents a number between 0 and 255, for a total of 256 possibilities. As 256=16*16, this is very easy to represent in base 16, hexadecimal. In hexadecimal, as you may or may not recall, we represent numbers with the characters 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F, with A referring to what you know as "10" and F to "15." Each digit counts multiples of powers of 16 rather than 10: there is a 1's place, a 16's place, a 256's place, and so on. This means that 42 in base 10 would be written as 2A in hexadecimal (since 2*16 + 10*1 = 42). Now, if we want to refer to 256 numbers in hexadecimal, we only need two digits. Below is a list of a few different numbers. See if you can fill in the last two yourself, and see if you can explain why the binary forms are written in two halves:

Decimal : Hexadecimal : Binary

10  : 0A : 0000-1010
16  : 10 : 0001-0000
32  : 20 : 0010-0000
33  : 21 : 0010-0001
128 : 80 : 1000-0000
255 : FF : 1111-1111
99  :    : 
193 :    :

Check your answers in answers19.txt.
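If you'd like to check conversions like these yourself, Python's built-in format() function can do the work. The following is only a sketch: the helper name show is made up for illustration and is not part of ccipher.py.

```python
# Hypothetical helper (not from ccipher.py): show a number between 0 and 255
# as decimal, two-digit hexadecimal, and 8-bit binary split into two halves.
def show(n):
    hex_form = format(n, "02X")   # two hex digits, upper case, zero padded
    bits = format(n, "08b")       # eight binary digits, zero padded
    return f"{n} : {hex_form} : {bits[:4]}-{bits[4:]}"

print(show(42))    # 42 : 2A : 0010-1010
print(show(255))   # 255 : FF : 1111-1111
```

Notice that each half of the binary form lines up with exactly one hexadecimal digit.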

So, to recap, encodings built on 8-bit bytes (like UTF-8) assign each byte pattern a meaning, e.g. the 8-bit byte 0100-0001 is assigned to the capital letter "A" in ASCII. In other words, "A" is numerically the number 65. There are 256 possible bytes, which means each byte can be represented by a two-digit hexadecimal number. In general, these bytes don't refer directly to characters, but to what are called "code points." A code point may be encoded by a single byte or by a few bytes put together, and a character on your screen may be built from one code point or several. The exact mechanics of this aren't important. This nuance, however, gives UTF-8 a big advantage over ASCII: an object on the screen can be made of more than one byte, meaning we have far more than 256 possibilities. Four bytes can take 256^4 = 4,294,967,296 different values; UTF-8 reserves some of those bits for bookkeeping, but that still leaves room for all 1,114,112 Unicode code points.
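To see the "more than one byte" idea in action, here is a small sketch that encodes a few single code points and counts the bytes each one needs. The sample characters and the names samples and lengths are chosen purely for illustration.

```python
# Four single code points, written with escapes so the file itself stays ASCII:
# "A", e-acute, the euro sign, and a grinning-face emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")      # the bytes that represent this code point
    lengths.append(len(encoded))
    print(ch, len(encoded), encoded.hex())

print(lengths)   # [1, 2, 3, 4]
```

The plain letter "A" fits in one byte, just as it does in ASCII, while the emoji needs all four.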

All of this is to say that in UTF-8, characters are encoded as sequences of bytes, and each byte can be represented by eight binary bits or a two-digit hexadecimal number. Since your computer uses this encoding or a similar one for all text, it's a ready-made and extremely efficient way of encoding our own text numerically.

With story time over, let's move on to some code. Open ccipher.py and take a look. We use the bytes(object,'encoding') function to convert each character in the test string into its corresponding bytes. You can also "encode" the characters as below. Finally, we can read off the numeric value of each byte via the list() function; the built-in hex() function will display such a value in hexadecimal.

test_str = "abc"

bytes_lst = []
for item in test_str:
    bytes_lst.append(bytes(item,'utf-8'))
print("As bytes, the string is:")
print(bytes_lst)

encoded_lst = []
for item in test_str:
    encoded_lst.append(item.encode())
print("Encoding the characters in the string also gives:")
print(encoded_lst)

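int_lst = []
for item in encoded_lst:
    # list() turns a bytes object into a list of integers, one per byte
    int_lst.append(list(item)[0])
print("The integer values associated with those bytes are:")
print(int_lst)
hex_lst = [hex(n) for n in int_lst]  # e.g. 97 -> '0x61'
print("The hex values associated with those bytes are:")
print(hex_lst)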

💻 Your first coding task is to create a function that takes an arbitrary string as input and outputs the numeric values of the bytes that represent that string (the message) in the UTF-8 encoding.

def numerify(message):
    return(message)

For our first encryption, we will use the Caesarian cipher, named so because Julius Caesar (yes, that one) used it in sending his own messages. A Caesarian cipher simply shifts each letter the same number of places down the alphabet, with later letters wrapping around back to the beginning. In other words, if we associate each letter with a number between 0 and 25, with 'A' as 0 and 'Z' as 25, we might add 1 mod 26. This is our old friend, modular arithmetic, where 20+7=1 because 1 is the remainder we get when we divide 27 by 26. In other words, 27 is precisely 1 more than a multiple of 26. A cipher that just adds 1 mod 26 would turn CAT into DBU.
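The letter-shifting idea can be sketched in a few lines. This is a toy version only: it assumes capital letters with 'A' as 0 and 'Z' as 25, so the wraparound is taken mod 26, and the names ALPHABET and shift_letters are invented for this illustration.

```python
# Toy Caesarian shift on capital letters only (an illustration, not the
# byte-based cipher you will write for the tasks).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def shift_letters(word, shift):
    shifted = []
    for letter in word:
        position = ALPHABET.index(letter)        # 'A' -> 0, ..., 'Z' -> 25
        new_position = (position + shift) % 26   # wrap around past 'Z'
        shifted.append(ALPHABET[new_position])
    return "".join(shifted)

print(shift_letters("CAT", 1))   # DBU
print(shift_letters("ZOO", 1))   # APP
```

The second call shows the wraparound: 'Z' is 25, and 25+1 leaves a remainder of 0 when divided by 26, which is 'A' again.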

Of course, our cipher will have to add some number between 0 and 255 and wrap around mod 256, since each byte is a number in that range (rather than the range 0 to 25 in our simplistic example above).

We should start by choosing a secret number:

secret_number = 1

💻 Then create a function that uses the % operator to change the value of each byte. Don't forget that the output of our function above is a list of lists, and you want to add to the integer value of each number in those lists.

def encrypt(numeric_message):
    return numeric_message

💻 Now that we have this, create a function that decrypts the message. Note you'll need to subtract this time if you added last time, and vice versa. Challenge: but do you? If you added the number 52 the first time around, is there a number you can add the second time around to still get back the original message?

def decrypt(encrypted_message):
    return encrypted_message

💻 In order to turn these bytes back into characters, we use the chr(byte) function. You should then take the list of characters this outputs and concatenate them back into a single string with ''.join(list).

def wordify(decrypted_message):
    return decrypted_message
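If chr() and ''.join() are new to you, it may help to try them on their own first. The numbers below are picked just for illustration, and the variable names letters and word are not part of ccipher.py.

```python
# chr() maps an integer to the character with that code point.
letters = [chr(n) for n in [104, 105]]
print(letters)   # ['h', 'i']

# ''.join() concatenates a list of strings into a single string.
word = "".join(letters)
print(word)      # hi
```

One caution: chr() treats its argument as a code point, which matches the byte value only for plain ASCII characters (0 to 127). For values above 127, a single byte is no longer the same thing as a UTF-8 character.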

Now that you've got that working, let's work on cracking the code! In modern cryptography, we assume that the method behind our encryption algorithm is always known to would-be spies. They might not know what our secret number is, but they've guessed that we're using a Caesarian cipher. Now you get to play code-breaker.

Challenge:

💻 Write a script that can decrypt the encrypted message in xfile.txt. In this case, you aren't given the secret number, so your script should automatically identify what the secret number is without you having to check.