Pipes

Lab setup

First, make sure you have completed the initial setup.

Open Terminal. Run the update command to make sure you have the latest code.
```
$ mwc update
```

Move to this lab's directory.

$ cd ~/Desktop/making_with_code/mwc1/unit2/lab_pipes

Enter this lab's shell environment.
```
$ poetry shell
```

Move to your MWC directory.
```
$ cd ~/Desktop/making_with_code
```

Get a copy of this lab's materials.

$ git clone https://git.makingwithcode.org/mwc/lab_pipes.git

Powerful new ideas

Files: Everything on your computer is a file. Files contain data and programs. Some files, like images, contain binary data which you can't read as text. But a lot of files just contain text. Every file is located at a path on the filesystem (see the Terminal Adventure).
Streams: Every program has three streams: standard in (stdin), standard out (stdout), and standard error (stderr). Many programs work by reading data from stdin and sending their output to stdout. In Python, print() sends data to stdout, which by default shows up on the screen.
Pipes: We can redirect the input and output streams of programs. In particular, we can connect one program's output to another program's input to create a more complex program.
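
To make the streams a little more concrete, here is a minimal sketch that uses only standard shell tools rather than this lab's programs. The words "hello" and "warning" and the variable name shouted are made up for illustration. It shows that a pipe carries stdout, while stderr goes straight to the screen.

```shell
# The group in braces writes one line to stdout and one line to stderr.
# Only stdout flows through the pipe into tr, which upper-cases its input.
# $(...) captures the final stdout of the pipeline in a variable.
shouted=$( { echo "hello"; echo "warning" >&2; } | tr 'a-z' 'A-Z' )

# "warning" already appeared on the screen, unchanged, via stderr.
echo "$shouted"   # prints HELLO
```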

Reading from files

Your computer stores data in files. The cat command prints the contents of a file to the screen. This lab contains words_370k.txt, which holds about 370,000 English words--although you'll probably disagree about whether some of them are words.

💻 Let's print it out.

$ cat words_370k.txt

That's a lot of words. Sometimes you don't want to see all of a file. The head command just shows you the first few lines. tail is like head, but shows you a few lines from the end of the file. By default, head and tail show 10 lines. You can use the -n flag to choose a different number of lines.

💻 Use head to show the first ten words.

$ head words_370k.txt

You can connect programs together using |, the pipe operator. Pipes connect the output of one program to the input of another program.

💻 Pipe the result of cat into head -n 20, to show the first twenty words.

$ cat words_370k.txt | head -n 20

💻 Print out words 91 through 100. Read the words using cat, then pipe the result to head -n 100 to keep the first hundred words, and pipe the result of this to tail to show the last 10 of those 100 words.
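
If you want to check the head-then-tail idea on something easy to verify, here is a small sketch that uses numbers instead of words. It assumes the standard seq command is available (it is on most macOS and Linux systems); the variable name lines is just for illustration.

```shell
# seq 1 120 prints the numbers 1 through 120, one per line,
# standing in for a long word list.
# head -n 100 keeps lines 1-100; tail then keeps the last 10 of those.
lines=$(seq 1 120 | head -n 100 | tail)

echo "$lines"   # prints 91 through 100, one number per line
```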

Piping commands together

In this lab, we are going to work with commands that take input, manipulate it, and then output something else. We will use pipes to connect simple commands together. For example, length counts the letters in each word and puts this number at the beginning of the line.

💻 Print out the first 20 words with their length.

$ cat words_370k.txt | head -n 20 | length

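💻 You can use order to sort the lines. Run both commands below and compare the results: the first sorts every word by length and then shows the first ten lines, while the second takes the first ten lines and then sorts only those.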

$ cat words_370k.txt | length | order | head
$ cat words_370k.txt | length | head | order
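
The order of the steps in a pipeline matters. Here is a minimal sketch of the same idea on a four-word list, using the standard sort command as a stand-in for this lab's order (an assumption: the two behave alike on plain lowercase words). The word list and variable names are made up for illustration.

```shell
# A tiny word list, deliberately not in alphabetical order.
words=$(printf 'pear\napple\nfig\nbanana\n')

# Sort everything, then keep the first two lines.
sort_then_head=$(printf '%s\n' "$words" | sort | head -n 2)

# Keep the first two lines, then sort only those.
head_then_sort=$(printf '%s\n' "$words" | head -n 2 | sort)

echo "$sort_then_head"   # prints apple, then banana
echo "$head_then_sort"   # prints apple, then pear
```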

All programs

Here's a list of programs included in this lab. You'll use these, along with cat and head, to solve the exercises. Arguments in brackets are optional, and have a default value which is used if you don't provide one.

program	arguments	description
length	`[ix0=0]`	Puts the length of the word at index 0 at the beginning of each line.
frequency	`[ix0=0]`	Puts the frequency of the word at index 0 at the beginning of each line. A word's frequency is how many times it occurs per billion words of everyday English.
match	`pattern [ix0=0]`	Allows lines where the word at index 0 matches regular expression `pattern`. (`pattern` should be in quotation marks.)
put	`[number]`	Puts `number` at the beginning of each line.
equal	`[ix0=0] [ix1=1]`	Allows lines where the number at index 0 equals the number at index 1.
lessthan	`[ix0=0] [ix1=1] [-e]`	Allows lines where the number at index 0 is less than the number at index 1. When `-e`, compares numbers using "less than or equal."
pluck	`[ix0=0]`	Keeps only the word at index 0 and discards the rest of the line.
unique	`[ix0=0]`	Puts a sequence of letters, one copy of each letter in the word at index 0, at the beginning of the line.
order	`[ix=0] [-r]`	Sorts lines according to the word or number at index 0 on each line. `-r` sorts in reverse order.
count		Counts the total number of lines.

More complex piping

Piping into files

Let's get some more manageable word lists. We can use the > operator to redirect a program's output into a file. Let's create a file called words_1k.txt, containing the 1000 most common words in English in alphabetical order. We will call these the 1k words.

$ cat words_370k.txt | frequency | order -r | pluck 1 | head -n 1000 | order > words_1k.txt

What's going on here? It helps to build up these commands step-by-step. If you really want to see what's going on, first run cat words_370k.txt, then cat words_370k.txt | frequency, then keep adding one more pipe at a time.

step	explanation	example line
`cat words_370k.txt`	Start with all the words	`the`
`frequency`	Get the frequency for each word	`53700000 the`
`order -r`	Put the words in reverse order of frequency--most common words first	`53700000 the`
`pluck 1`	Select just the word at index 1 of each line	`the`
`head -n 1000`	Keep only the top 1000 words	`the`
`order`	Put those words in alphabetical order	`the`
`> words_1k.txt`	And save the result in a file	`the`
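
Here is a small sketch of what > does, using only standard tools and a throwaway directory so that nothing in your lab folder changes. The file name sorted.txt and the variable names are made up for illustration.

```shell
# Make a temporary directory to hold the example file.
workdir=$(mktemp -d)

# Nothing appears on the screen here: > sends sort's stdout into the file.
printf 'pear\napple\nfig\n' | sort > "$workdir/sorted.txt"

# Read the file back to confirm the sorted words were saved.
saved=$(cat "$workdir/sorted.txt")
echo "$saved"   # prints apple, fig, pear (one per line)

# Clean up the temporary directory.
rm -r "$workdir"
```

Note that > replaces the file's contents if the file already exists.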

💻 Create words_10k.txt, an alphabetized file of the 10,000 most common English words. We'll call these the 10k words.

💻 Create words_100k.txt, an alphabetized file of the 100,000 most common English words. We'll call these the 100k words. From now on, we'll just use the 100k words, so that we don't have so many questionable words.

Filtering words

Some of the programs above act as filters, blocking some lines and letting others through. The simplest is match, which requires one argument, pattern, and then only allows matching words to pass through.

💻 List out the words containing "spider".

$ cat words_100k.txt | match "spider"

You can also match against more complicated patterns. ^ is a symbol which means "the beginning of the word."

💻 Print out all the words which start with "oo"

$ cat words_100k.txt | match "^oo"

Make sure you understand how match "^oo" is different from match "oo". (If you're not sure, try both, or ask on Discord.) Here are a few more useful symbols.

$ means "the end of the word"
. means "any one letter"
.* means "zero or more letters"
.+ means "one or more letters"
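
To see how these symbols differ, here is a minimal sketch on a made-up five-item list. It uses the standard grep -E command as a stand-in for this lab's match, which assumes the two treat these patterns the same way. ("bt" is not a real word; it is there to show the difference between .* and .+.)

```shell
# A made-up list for trying out patterns.
words=$(printf 'bat\nboot\nbt\nzoo\ntab\n')

# $ anchors the pattern to the end of the word.
ends_with_oo=$(printf '%s\n' "$words" | grep -E 'oo$')      # zoo

# . stands for exactly one letter between b and t.
one_letter=$(printf '%s\n' "$words" | grep -E '^b.t$')      # bat

# .* allows zero or more letters, so bt matches too.
zero_or_more=$(printf '%s\n' "$words" | grep -E '^b.*t$')   # bat, boot, bt

# .+ requires at least one letter, so bt is left out.
one_or_more=$(printf '%s\n' "$words" | grep -E '^b.+t$')    # bat, boot

echo "$zero_or_more"
echo "$one_or_more"
```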

💻 Print out all the words containing at least five s's. The pattern below requires five s's, and allows any letters to be between them.

$ cat words_100k.txt | match "s.*s.*s.*s.*s"

You can also match against properties of words, like their length or their frequency. In order to do this, use put to put a number on every line, and then use equal or lessthan to compare numbers. Let's see step-by-step how to print out all the 22-letter words.

💻 Print out each word, with its length and the number 22. Actually, let's just see the first ten.

$ cat words_100k.txt | length | put 22 | head
22 1 a
22 2 aa
22 3 aaa
22 3 aah
22 6 aahing
22 4 aahs
22 3 aal
22 4 aals
22 3 aam
22 8 aardvark

You can think of each line as a list. For the last line, 22 is in position 0, 8 is in position 1, and aardvark is in position 2. (Computer scientists always start counting with 0. We'll come back to this.) Now we can use equal 0 1 to keep only the lines where the number in position 0 equals the number in position 1.

💻 Print out all the 22-letter words.

$ cat words_100k.txt | length | put 22 | equal 0 1

In fact, 0 and 1 are the default arguments for equal. If you want to compare positions 0 and 1, you can just write equal without any arguments. lessthan works in a similar way.
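
If you are curious what a filter like equal or lessthan is doing, here is a rough sketch using the standard awk command. This is only an illustration based on the descriptions above, not how the lab's programs are actually written. One difference to watch for: awk numbers the items on a line starting from 1, so the lab's position 0 is $1 and position 1 is $2. The sample lines are made up.

```shell
# Three made-up lines: a target number, a word length, and a word.
lines=$(printf '5 3 cat\n5 5 horse\n5 6 donkey\n')

# Like equal 0 1: keep lines where the first two numbers are equal.
equal_lines=$(printf '%s\n' "$lines" | awk '$1 == $2')

# Like lessthan 0 1: keep lines where the first number is less than the second.
less_lines=$(printf '%s\n' "$lines" | awk '$1 < $2')

echo "$equal_lines"   # prints 5 5 horse
echo "$less_lines"    # prints 5 6 donkey
```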

In more complicated piping, you will start to get a bunch of items on each line. You can use pluck to just keep a single item. pluck takes an argument, ix0, specifying which item to keep--the default is 0.

💻 Print out all the 22-letter words, without their lengths.

$ cat words_100k.txt | length | put 22 | equal | pluck 2

unique is another useful tool: it lists one copy of each letter in a word. (The example below uses another new command, echo, which prints whatever text you give it. This might not seem so useful until you pipe the output somewhere else instead of letting it go to the screen.)

$ echo "mississippi" | unique
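
For comparison, here is a sketch of the same idea built from standard tools. It only computes the letters themselves, and it assumes (judging by the example lines elsewhere in this lab) that the letters come out in alphabetical order. It is not how the lab's unique is implemented.

```shell
# fold -w 1 puts each letter on its own line,
# sort -u sorts the letters and drops duplicates,
# tr -d '\n' joins the remaining letters back into one string.
letters=$(echo "mississippi" | fold -w 1 | sort -u | tr -d '\n')

echo "$letters"   # prints imps
```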

💻 Print out all the words which have no repeated letters. (To do this, we'll check that the length of the word equals the length of its unique letters.)

$ cat words_100k.txt | length | unique 1 | length | equal 0 2

Let's break this down.

step	explanation	example line
`cat words_100k.txt`	Read in the 100k words.	`zucchini`
`cat words_100k.txt | length`	Add the length of each word.	`8 zucchini`
`cat words_100k.txt | length | unique 1`	Add the unique letters for the word at position 1.	`chinuz 8 zucchini`
`cat words_100k.txt | length | unique 1 | length`	Add the length of the unique letters.	`6 chinuz 8 zucchini`
`cat words_100k.txt | length | unique 1 | length | equal 0 2`	Filter the results where positions 0 and 2 (the two lengths) are equal.	`6 beimoz 6 zombie`

Organizing the results

You can put the results in order using order, which optionally takes an argument to specify which item to use for sorting (the default, as usual, is 0). Add -r for reverse order.

💻 Print out the 100k words, in order of length.

$ cat words_100k.txt | length | order

Sometimes you don't want to see all the words; you just want to count them. Use count.

💻 How many ten-letter words are there in the 100k words?

$ cat words_100k.txt | length | put 10 | equal | count

Exercises

exercises.md contains a list of questions about words. Use the tools above to answer them. Feel free to work together and/or discuss these on Discord!