Pipes

Powerful new ideas

Reading from files

Your computer stores data in files. The cat command prints the contents of a file to the screen. This lab includes words_370k.txt, which contains about 370,000 English words, although you'll probably disagree about whether some of them are words.

💻 Let's print it out.

$ cat words_370k.txt

That's a lot of words. Sometimes you don't want to see all of a file. The head command just shows you the first few lines. tail is like head, but shows you a few lines from the end of the file. By default, head and tail show 10 lines. You can use the -n flag to choose a different number of lines.

💻 Use head to show the first ten words.

$ head words_370k.txt

You can connect programs together using |, the pipe operator. Pipes connect the output of one program to the input of another program.

💻 Pipe the result of cat into head -n 20, to show the first twenty words.

$ cat words_370k.txt | head -n 20
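To see how head and tail behave when chained, here is a minimal sketch on a tiny made-up word list (the seven sample words and the variable names are assumptions for illustration, not part of the lab files). Each stage only sees what the previous stage let through.

```shell
# Build a tiny stand-in word list (hypothetical sample data, not the lab file).
sample_words=$(printf '%s\n' apple banana cherry date elderberry fig grape)

# head keeps the first 5 lines; tail then keeps the last 2 of those 5.
middle=$(printf '%s\n' "$sample_words" | head -n 5 | tail -n 2)

printf '%s\n' "$middle"
# prints:
# date
# elderberry
```

Because tail runs after head, it never sees fig or grape; it picks the last two lines of what head passed along.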

💻 Print out words 91 through 100. Read the words using cat, then pipe the result to head -n 100 to keep the first hundred words, and pipe that result to tail to show the last 10 of those 100 words.

$ 

Piping commands together

In this lab, we are going to work with commands that take input, manipulate it, and then return something else. We will use pipes to connect simple commands together. length counts the length of each word and puts this number at the beginning of the line.

💻 Print out the first 20 words with their length.

$ cat words_370k.txt | head -n 20 | length

💻 You can use order to sort the lines.

$ cat words_370k.txt | length | order | head
$ cat words_370k.txt | length | head | order
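The two commands above give different results because the stages run in a different sequence. The sketch below reproduces the effect on six sample words. It does not use the lab's programs: length_sketch and order_sketch are hypothetical stand-ins built from awk and sort, and the real length and order may behave differently in details such as how ties are ordered.

```shell
# Hypothetical stand-ins for the lab's length and order programs.
length_sketch() { awk '{ print length($1), $0 }'; }   # prepend each word's length
order_sketch()  { LC_ALL=C sort -n; }                 # sort by the leading number

sample_words=$(printf '%s\n' apple banana cherry date fig kiwi)

# Sort everything, then keep the first two lines: the shortest words overall.
sort_first=$(printf '%s\n' "$sample_words" | length_sketch | order_sketch | head -n 2)

# Keep the first two lines, then sort: only apple and banana are ever sorted.
head_first=$(printf '%s\n' "$sample_words" | length_sketch | head -n 2 | order_sketch)

printf '%s\n' "$sort_first" "---" "$head_first"
# prints:
# 3 fig
# 4 date
# ---
# 5 apple
# 6 banana
```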

All programs

Here's a list of programs included in this lab. You'll use these, along with cat and head, to solve the exercises. Arguments in brackets are optional, and have a default value which is used if you don't provide one.

| program | arguments | description |
| --- | --- | --- |
| length | [ix0=0] | Puts the length of the word at index 0 at the beginning of each line. |
| frequency | [ix0=0] | Puts the frequency of the word at index 0 at the beginning of each line. A word's frequency is how many times it occurs per billion words of everyday English. |
| match | pattern [ix0=0] | Allows lines where the word at index 0 matches regular expression pattern. (pattern should be in quotation marks.) |
| put | [number] | Puts number at the beginning of each line. |
| equal | [ix0=0] [ix1=1] | Allows lines where the number at index 0 equals the number at index 1. |
| lessthan | [ix0=0] [ix1=1] [-e] | Allows lines where the number at index 0 is less than the number at index 1. When -e, compares numbers using "less than or equal." |
| pluck | [ix0=0] | Keeps only the word at index 0 and discards the rest of the line. |
| unique | [ix0=0] | Puts a sequence of letters, one copy of each letter in the word at index 0, at the beginning of the line. |
| order | [ix=0] [-r] | Sorts lines according to the word or number at index 0 on each line. -r sorts in reverse order. |
| count | | Counts the total number of lines. |
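The bracket notation means an argument can be left out, in which case its default is used. Here is a rough sketch of how such a default could work, using a hypothetical awk-based stand-in called pluck_sketch. It is not the lab's real pluck, whose implementation is not shown in this lab.

```shell
# Hypothetical stand-in for the lab's pluck; the real program may differ.
# ${1:-0} means "use the first argument, or 0 if none was given".
pluck_sketch() {
    # awk numbers fields from 1, while the lab counts from 0, hence ix + 1.
    awk -v ix="${1:-0}" '{ print $(ix + 1) }'
}

line="22 8 aardvark"

first=$(printf '%s\n' "$line" | pluck_sketch)     # no argument: index 0
third=$(printf '%s\n' "$line" | pluck_sketch 2)   # explicit index 2

printf '%s\n' "$first" "$third"
# prints:
# 22
# aardvark
```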

More complex piping

Piping into files

Let's get some more manageable word lists. We can use the > operator to redirect a program's output into a file instead of printing it to the screen. Let's create a file called words_1k.txt, containing the 1000 most common words in English in alphabetical order. We will call these the 1k words.

$ cat words_370k.txt | frequency | order -r | pluck 1 | head -n 1000 | order > words_1k.txt

What's going on here? It helps to build up these commands step-by-step. If you really want to see what's going on, first run cat words_370k.txt, then cat words_370k.txt | frequency, then keep adding one more pipe at a time.

| step | explanation | example line |
| --- | --- | --- |
| cat words_370k.txt | Start with all the words | the |
| frequency | Get the frequency for each word | 53700000 the |
| order -r | Put the words in reverse order of frequency (most common words first) | 53700000 the |
| pluck 1 | Select just the word at index 1 of each line | the |
| head -n 1000 | Keep only the top 1000 words | the |
| order | Put those words in alphabetical order | the |
| > words_1k.txt | And save the result in a file | the |
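The last step, >, can be tried on its own. This small sketch uses only standard tools and a throwaway directory (the file name sorted_sample.txt and the three words are made up for illustration), so it cannot overwrite any of the lab's files.

```shell
# Work in a throwaway directory so no real files are overwritten.
workdir=$(mktemp -d)

# Sort three sample words and send the result to a file instead of the screen.
printf '%s\n' pear apple mango | sort > "$workdir/sorted_sample.txt"

# The command above printed nothing; its output went into the file.
saved=$(cat "$workdir/sorted_sample.txt")
printf '%s\n' "$saved"
# prints:
# apple
# mango
# pear

rm -r "$workdir"
```

Keep in mind that > replaces the contents of an existing file without asking, so double-check the file name before you press enter.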

💻 Create words_10k.txt, an alphabetized file of the 10,000 most common English words. We'll call these the 10k words.

💻 Create words_100k.txt, an alphabetized file of the 100,000 most common English words. We'll call these the 100k words. From now on, we'll just use the 100k words, so that we don't have so many questionable words.

Filtering words

Some of the programs above act as filters, blocking some lines and letting others through. The simplest is match, which requires one argument, pattern, and then only allows matching words to pass through.

💻 List out the words containing "spider".

$ cat words_100k.txt | match "spider"

You can also match against more complicated patterns. ^ is a symbol which means "the beginning of the word."

💻 Print out all the words which start with "oo".

$ cat words_100k.txt | match "^oo"

Make sure you understand how match "^oo" is different from match "oo". (If you're not sure, try both, or ask on Discord.) Here are a few more useful symbols.

💻 Print out all the words containing at least five s's. In the pattern below, . matches any single letter and * means "zero or more of the previous symbol," so .* allows any letters (or none) between the s's.

$ cat words_100k.txt | match "s.*s.*s.*s.*s"
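If you want to experiment with patterns on a small list first, the sketch below uses grep -E, a standard tool that filters lines by regular expression. Treating it as a stand-in for match is an assumption: the lab's match may differ in details, and the seven sample words are made up.

```shell
sample_words=$(printf '%s\n' book ooze spoon oops sassafras possesses stress)

# "oo" matches anywhere in the word; "^oo" matches only at the beginning.
anywhere=$(printf '%s\n' "$sample_words" | grep -E 'oo')
at_start=$(printf '%s\n' "$sample_words" | grep -E '^oo')

# "." is any single character and "*" repeats it zero or more times,
# so this pattern needs at least four s's with anything between them.
four_s=$(printf '%s\n' "$sample_words" | grep -E 's.*s.*s.*s')

printf '%s\n' "$at_start" "---" "$four_s"
# prints:
# ooze
# oops
# ---
# sassafras
# possesses
```

Here anywhere also contains book and spoon, which is exactly the difference between "oo" and "^oo". The word stress has only three s's, so it does not pass the last filter.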

You can also match against properties of words, like their length or their frequency. In order to do this, use put to put a number on every line, and then use equal or lessthan to compare numbers. Let's see step-by-step how to print out all the 22-letter words.

💻 Print out each word, with its length and the number 22. Actually, let's just see the first ten.

$ cat words_100k.txt | length | put 22 | head
22 1 a
22 2 aa
22 3 aaa
22 3 aah
22 6 aahing
22 4 aahs
22 3 aal
22 4 aals
22 3 aam
22 8 aardvark

You can think of each line as a list. For the last line, 22 is in position 0, 8 is in position 1, and aardvark is in position 2. (Computer scientists always start counting with 0. We'll come back to this.) Now we can use equal 0 1 to filter only the lines where position 0 matches position 1.

💻 Print out all the 22-letter words.

$ cat words_100k.txt | length | put 22 | equal 0 1

In fact, 0 and 1 are the default arguments for equal. If you want to compare positions 0 and 1, you can just write equal without any arguments. lessthan works in a similar way.

In more complicated piping, you will start to get a bunch of items on each line. You can use pluck to just keep a single item. pluck takes an argument, ix0, specifying which item to keep--the default is 0.

💻 Print out all the 22-letter words, without their lengths.

$ cat words_100k.txt | length | put 22 | equal | pluck 2
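To make the put / equal / pluck idea concrete, here is a minimal sketch of the same kind of pipeline on five sample words, looking for 5-letter words instead of 22-letter ones. The four functions are hypothetical awk stand-ins written for this example. They only imitate the behavior described in the table of programs and are not the lab's actual code.

```shell
# Hypothetical awk stand-ins for the lab programs; the real ones may differ.
length_sketch() { awk '{ print length($1), $0 }'; }     # prepend the word's length
put_sketch()    { awk -v n="$1" '{ print n, $0 }'; }    # prepend a fixed number
equal_sketch()  { awk '$1 == $2'; }                     # keep lines where index 0 equals index 1
pluck_sketch()  { awk -v ix="${1:-0}" '{ print $(ix + 1) }'; }  # keep one item

five_letter=$(printf '%s\n' apple fig grape kiwi lemon \
    | length_sketch | put_sketch 5 | equal_sketch | pluck_sketch 2)

printf '%s\n' "$five_letter"
# prints:
# apple
# grape
# lemon
```

Just before equal_sketch, the line for apple reads "5 5 apple": the target number at index 0, the length at index 1, and the word at index 2, which is why the final step plucks index 2.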

unique is another useful tool: it lists one copy of each letter in a word. (The example below uses another new command, echo, which simply prints back whatever text you give it. This might not seem so useful, except for when you pipe the output somewhere else instead of letting it go to the screen.)

$ echo "mississippi" | unique

💻 Print out all the words which have no repeated letters. (To do this, we'll check that the length of the word equals the length of its unique letters.)

$ cat words_100k.txt | length | unique 1 | length | equal 0 2

Let's break this down.

| step | explanation | example line |
| --- | --- | --- |
| cat words_100k.txt | Read in the 100k words. | zucchini |
| cat words_100k.txt \| length | Add the length of each word. | 8 zucchini |
| cat words_100k.txt \| length \| unique 1 | Add the unique letters for the word at position 1. | chinuz 8 zucchini |
| cat words_100k.txt \| length \| unique 1 \| length | Add the length of the unique letters. | 6 chinuz 8 zucchini |
| cat words_100k.txt \| length \| unique 1 \| length \| equal 0 2 | Filter the results where positions 0 and 2 (the two lengths) are equal. | 6 beimoz 6 zombie |
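The reasoning behind this pipeline (a word has no repeated letters exactly when it is as long as its set of unique letters) can be checked on single words with standard tools. In this sketch, unique_sketch and no_repeats are hypothetical helpers written for illustration. Unlike the lab's unique, unique_sketch handles just one word and prints only the letters, not the rest of the line.

```shell
# Hypothetical stand-in for the lab's unique (single word only): split the
# word into one letter per line, sort and de-duplicate, then rejoin.
unique_sketch() {
    fold -w 1 | LC_ALL=C sort -u | tr -d '\n'
}

# Succeeds when the word is as long as its unique letters (no repeats).
no_repeats() {
    word_letters=$(printf '%s\n' "$1" | unique_sketch)
    [ "${#1}" -eq "${#word_letters}" ]
}

letters=$(printf '%s\n' mississippi | unique_sketch)
echo "$letters"
# prints: imps

for word in zombie zucchini; do
    if no_repeats "$word"; then
        echo "$word: no repeated letters"
    else
        echo "$word: has repeated letters"
    fi
done
# prints:
# zombie: no repeated letters
# zucchini: has repeated letters
```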

Organizing the results

You can put the results in order using order, which optionally takes an argument to specify which item to use for sorting (the default, as usual, is 0). Add -r for reverse order.

💻 Print out the 100k words, in order of length.

$ cat words_100k.txt | length | order

Sometimes you don't want to see all the words; you just want to count them. Use count.

💻 How many ten-letter words are there in the 100k words?

$ cat words_100k.txt | length | put 10 | equal | count
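Counting is just one more stage at the end of a pipeline: it replaces all the lines that reach it with a single number. A minimal sketch on five sample words, where count_sketch is a hypothetical awk stand-in for the lab's count and the length filter is written directly in awk:

```shell
# Hypothetical stand-in for count: awk's NR is the number of lines read.
count_sketch() { awk 'END { print NR }'; }

# Keep only the four-letter words, then count the lines that are left.
four_letter_count=$(printf '%s\n' date fig kiwi lemon pear \
    | awk 'length($1) == 4' | count_sketch)

echo "$four_letter_count"
# prints: 3
```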

Exercises

exercises.md contains a list of questions about words. Use the tools above to answer them. Feel free to work together and/or discuss these on Discord!