Pipes
Powerful new ideas
- Files: Everything on your computer is a file. Files contain data and programs. Some files, like images, contain binary data which you can't read as text. But a lot of files just contain text. Every file is located at a path on the filesystem (see the Terminal Adventure).
- Streams: Every program has three streams: standard in, standard out, and standard error. Many programs work by reading in data from standard in (stdin) and sending their output to stdout. In Python, print() sends data to standard out (stdout), which by default shows up on the screen.
- Pipes: We can redirect the input and output streams of programs. In particular, we can connect one program's output to another program's input to create a more complex program.
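To make the idea of streams concrete, here is a minimal Python sketch of a program that reads lines from an input stream and writes them to an output stream. The function name shout is made up for this example, and in-memory streams stand in for stdin and stdout so the code runs on its own.

```python
import io
import sys

def shout(instream, outstream):
    """Read each line from instream and write an upper-cased copy to outstream."""
    for line in instream:
        outstream.write(line.upper())

# In a real program you would call shout(sys.stdin, sys.stdout).
# Here, in-memory streams play the role of stdin and stdout.
fake_stdin = io.StringIO("apple\nbanana\n")
fake_stdout = io.StringIO()
shout(fake_stdin, fake_stdout)
print(fake_stdout.getvalue(), end="")
```

Because shout only ever talks to its two streams, it doesn't care whether its input comes from the keyboard, a file, or another program. That is what makes piping possible.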
Reading from files
Data is stored on your computer in files. The cat command prints the contents of a file to the screen. This lab contains words_370k.txt, which contains about 370,000 English words, although you'll probably disagree about whether some of them are words.
💻 Let's print it out.
$ cat words_370k.txt
That's a lot of words. Sometimes you don't want to see all of a file. The head command just shows you the first few lines. tail is like head, but shows you a few lines from the end of the file. By default, head and tail show 10 lines. You can use the -n flag to choose a different number of lines.
💻 Use head to show the first ten words.
$ head words_370k.txt
You can connect programs together using |, the pipe operator. Pipes connect the output of one program to the input of another program.
💻 Pipe the result of cat into head -n 20 to show the first twenty words.
$ cat words_370k.txt | head -n 20
💻 Print out words 91 through 100. Read the words using cat, then pipe the result to head -n 100 to keep the first hundred words, and pipe the result of this to tail to show the last 10 of those words.
$
Piping commands together
In this lab, we are going to work with commands that take input, manipulate it, and then output the result. We will use pipes to connect simple commands together. For example, length counts the letters in each word and puts this number at the beginning of the line.
💻 Print out the first 20 words with their length.
$ cat words_370k.txt | head -n 20 | length
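If you're curious how a program like length might work on the inside, here's a rough Python sketch. It is a guess for illustration, not the lab's actual code, and the function name add_length is made up.

```python
def add_length(lines, ix=0):
    """Put the length of the word at position ix at the start of each line."""
    for line in lines:
        items = line.split()
        yield f"{len(items[ix])} {' '.join(items)}"

for line in add_length(["a", "aa", "aardvark"]):
    print(line)
# prints:
# 1 a
# 2 aa
# 8 aardvark
```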
💻 You can use order to sort the lines.
$ cat words_370k.txt | length | order | head
$ cat words_370k.txt | length | head | order
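The two commands above differ only in where head sits in the pipeline, and that changes the answer. Here's a small Python sketch of the same idea using a toy word list (not the real file). It assumes order sorts numbers from smallest to largest; breaking ties alphabetically is also an assumption.

```python
words = ["aardvark", "a", "aahing", "aa", "zoo"]
with_lengths = [(len(w), w) for w in words]   # like: length

# like: length | order | head -n 2  (sort everything, then keep the first two)
sort_then_head = sorted(with_lengths)[:2]

# like: length | head -n 2 | order  (keep the first two, then sort just those)
head_then_sort = sorted(with_lengths[:2])

print(sort_then_head)   # [(1, 'a'), (2, 'aa')]
print(head_then_sort)   # [(1, 'a'), (8, 'aardvark')]
```

Sorting first finds the shortest words in the whole list. Taking head first only ever looks at the lines that happened to come first.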
All programs
Here's a list of programs included in this lab. You'll use these, along with cat and head, to solve the exercises. Arguments in brackets are optional, and have a default value which is used if you don't provide one.
program | arguments | description |
---|---|---|
length | [ix0=0] | Puts the length of the word at index 0 at the beginning of each line. |
frequency | [ix0=0] | Puts the frequency of the word at index 0 at the beginning of each line. A word's frequency is how many times it occurs per billion words of everyday English. |
match | pattern [ix0=0] | Allows lines where the word at index 0 matches regular expression pattern. (pattern should be in quotation marks.) |
put | [number] | Puts number at the beginning of each line. |
equal | [ix0=0] [ix1=1] | Allows lines where the number at index 0 equals the number at index 1. |
lessthan | [ix0=0] [ix1=1] [-e] | Allows lines where the number at index 0 is less than the number at index 1. With -e, compares numbers using "less than or equal." |
pluck | [ix0=0] | Keeps only the word at index 0 and discards the rest of the line. |
unique | [ix0=0] | Puts a sequence of letters, one copy of each letter in the word at index 0, at the beginning of the line. |
order | [ix=0] [-r] | Sorts lines according to the word or number at index 0 on each line. -r sorts in reverse order. |
count | | Counts the total number of lines. |
More complex piping
Piping into files
Let's get some more manageable word lists. We can use the > operator to send a program's output into a file instead of to the screen. Let's create a file called words_1k.txt, containing the 1000 most common words in English in alphabetical order. We will call these the 1k words.
$ cat words_370k.txt | frequency | order -r | pluck 1 | head -n 1000 | order > words_1k.txt
What's going on here? It helps to build up these commands step by step. If you really want to see what's going on, first run cat words_370k.txt, then cat words_370k.txt | frequency, then keep adding one more pipe at a time.
step | explanation | example line |
---|---|---|
cat words_370k.txt | Start with all the words | the |
frequency | Get the frequency for each word | 53700000 the |
order -r | Put the words in reverse order of frequency--most common words first | 53700000 the |
pluck 1 | Select just the word at index 1 of each line | the |
head -n 1000 | Keep only the top 1000 words | the |
order | Put those words in alphabetical order | the |
> words_1k.txt | And save the result in a file | the |
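The same steps can be sketched in Python, one line per pipe stage. The frequencies here are toy values: only the number for "the" comes from the table above, and the rest are made-up placeholders. The sketch keeps the top 2 words instead of the top 1000 so the result is easy to check by hand.

```python
# Toy frequencies (per billion words). Only "the" is taken from the table;
# the other numbers are made-up placeholders for illustration.
toy_frequency = {"the": 53700000, "zebra": 3000, "apple": 40000, "of": 30000000}

words = sorted(toy_frequency)                    # cat: all the words
pairs = [(toy_frequency[w], w) for w in words]   # frequency
pairs.sort(reverse=True)                         # order -r: most common first
plucked = [w for _, w in pairs]                  # pluck 1: keep just the word
top = plucked[:2]                                # head -n 2 (the lab uses 1000)
result = sorted(top)                             # order: alphabetical
print(result)                                    # ['of', 'the']
```

Each variable holds what one stage would pass down the pipe to the next, which is a handy way to think about building up a long command.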
💻 Create words_10k.txt, an alphabetized file of the 10,000 most common English words. We'll call these the 10k words.
💻 Create words_100k.txt, an alphabetized file of the 100,000 most common English words. We'll call these the 100k words.
From now on, we'll just use the 100k words, so that we don't have so many questionable words.
Filtering words
Some of the programs above act as filters, blocking some lines and letting others through. The simplest is match, which requires one argument, pattern, and then only allows matching words to pass through.
💻 List out the words containing "spider".
$ cat words_100k.txt | match "spider"
You can also match against more complicated patterns. ^ is a symbol which means "the beginning of the word."
💻 Print out all the words which start with "oo".
$ cat words_100k.txt | match "^oo"
Make sure you understand how match "^oo" is different from match "oo". (If you're not sure, try both, or ask on Discord.)
Here are a few more useful symbols.
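- $ means "the end of the word".
- . means "any one letter".
- * means "zero or more of whatever comes just before it", so .* means "zero or more letters".
- + means "one or more of whatever comes just before it", so .+ means "one or more letters".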
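Python's re module uses the same symbols, so you can experiment with patterns there too. This sketch assumes match keeps a word when the pattern is found anywhere in it, which is how re.search behaves; the helper name matching and the word list are made up for the example.

```python
import re

words = ["oops", "moon", "zoo", "so", "success"]

def matching(pattern, words):
    """Return the words in which the pattern is found somewhere."""
    return [w for w in words if re.search(pattern, w)]

print(matching("^s", words))     # ['so', 'success']
print(matching("o$", words))     # ['zoo', 'so']
print(matching("m..n", words))   # ['moon']
print(matching("s.*s", words))   # ['success']
print(matching("zo+", words))    # ['zoo']
```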
💻 Print out all the words containing at least five s's. The pattern below requires five s's, and allows any letters to be between them.
$ cat words_100k.txt | match "s.*s.*s.*s.*s"
You can also match against properties of words, like their length or their frequency. In order to do this, use put to put a number on every line, and then use equal or lessthan to compare numbers.
Let's see step-by-step how to print out all the 22-letter words.
💻 Print out each word, with its length and the number 22. Actually, let's just see the first ten.
$ cat words_100k.txt | length | put 22 | head
22 1 a
22 2 aa
22 3 aaa
22 3 aah
22 6 aahing
22 4 aahs
22 3 aal
22 4 aals
22 3 aam
22 8 aardvark
You can think of each line as a list. For the last line, 22 is in position 0, 8 is in position 1, and aardvark is in position 2. (Computer scientists always start counting with 0. We'll come back to this.) Now we can use equal 0 1 to keep only the lines where position 0 equals position 1.
💻 Print out all the 22-letter words.
$ cat words_100k.txt | length | put 22 | equal 0 1
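Here is a rough Python sketch of what a filter like equal could be doing: split each line into a list, compare two positions, and only pass the line along if they match. This is an illustration under that assumption, not the lab's actual code, and it uses 3 instead of 22 to keep the example short.

```python
def equal(lines, ix0=0, ix1=1):
    """Let a line through only if the numbers at positions ix0 and ix1 are equal."""
    for line in lines:
        items = line.split()
        if int(items[ix0]) == int(items[ix1]):
            yield line

lines = ["3 1 a", "3 3 aah", "3 8 aardvark"]
print(list(equal(lines)))   # ['3 3 aah']
```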
In fact, 0 and 1 are the default arguments for equal. If you want to compare positions 0 and 1, you can just write equal without any arguments. lessthan works in a similar way.
In more complicated piping, you will start to get a bunch of items on each line. You can use pluck to just keep a single item. pluck takes an argument, ix0, specifying which item to keep; the default is 0.
💻 Print out all the 22-letter words, without their lengths.
$ cat words_100k.txt | length | put 22 | equal | pluck 2
unique is another useful tool: it lists one copy of each letter in a word. (The example below uses another new command, echo, which spits its input back out. This might not seem so useful, except for when you pipe the output somewhere else instead of letting it return to the screen.)
$ echo "mississippi" | unique
💻 Print out all the words which have no repeated letters. (To do this, we'll check that the length of the word equals the length of its unique letters.)
$ cat words_100k.txt | length | unique 1 | length | equal 0 2
Let's break this down.
step | explanation | example line |
---|---|---|
cat words_100k.txt | Read in the 100k words. | zucchini |
cat words_100k.txt | length | Add the length of each word. | 8 zucchini |
cat words_100k.txt | length | unique 1 | Add the unique letters for the word at position 1. | chinuz 8 zucchini |
cat words_100k.txt | length | unique 1 | length | Add the length of the unique letters. | 6 chinuz 8 zucchini |
cat words_100k.txt | length | unique 1 | length | equal 0 2 | Filter the results where positions 0 and 2 (the two lengths) are equal. | 6 beimoz 6 zombie |
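The same check can be written in a few lines of Python. Judging from the example lines in the table, unique gives the letters in alphabetical order, so this sketch sorts them too; the function names are made up for illustration.

```python
def unique_letters(word):
    """One copy of each letter in the word, in alphabetical order."""
    return "".join(sorted(set(word)))

def has_no_repeats(word):
    """True when the word is exactly as long as its set of unique letters."""
    return len(word) == len(unique_letters(word))

for word in ["zucchini", "zombie"]:
    print(unique_letters(word), word, has_no_repeats(word))
# prints:
# chinuz zucchini False
# beimoz zombie True
```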
Organizing the results
You can put the results in order using order, which optionally takes an argument to specify which item to use for sorting (the default, as usual, is 0). Add -r for reverse order.
💻 Print out the 100k words, in order of length.
$ cat words_100k.txt | length | order
Sometimes you don't want to see all the words; you just want to count them. Use count.
💻 How many ten-letter words are there in the 100k words?
$ cat words_100k.txt | length | put 10 | equal | count
Exercises
exercises.md contains a list of questions about words. Use the tools above to answer them. Feel free to work together and/or discuss these on Discord!