SED Basics using O_Reilly.pdf example, and Unexpected Mind-Bending Parsing Behaviour

This book was first published in 1990 (the PDF is the 2nd edition) and it refers to POSIX standard changes in sed etc., so my Mint versions of grep, sed and awk may behave differently from the original examples, which is what I suspected at first in the example below.

As an introduction to the ideas of iteration and recursion using SED:

“In simple terms, an iterative function is one that loops to repeat some part of the code, and a recursive function is one that calls itself again to repeat the code.”

Experiment with some examples of sed to try to understand how it works line by line as a script proceeds: any value changed by one command becomes a possible target for change by the commands that follow. It's not obvious at first how this works, or easy to visualise (especially if you are a crap programmer like me!), like much with grep, sed and awk, and it depends on which regular expression methods and metacharacters are used.

Here's the “pig, cow, horse” example in:

O_Reilly_-_sed___awk_2nd_Edition.pdf

which I think omits any mention of the global (g) flag at this point. It states:

“Let's look at an example that uses the substitute command. Suppose someone quickly wrote the following script to change "pig" to "cow" and "cow" to "horse":

s/pig/cow/

s/cow/horse/

What do you think happened? Try it on a sample file. We'll discuss what happened later, after we look at how sed works.”

At this point, I guessed the script would change all pig words to cow words, then all cows to horses (which I assume is what most people would guess?), so that only the word “horse” would exist in place of the previous pigs and cows. It's not clear from the text what the fictitious scriptwriter actually intended, though.

Did he want to substitute all "pig" and "cow" words with "horse" in the first place or just some? Can SED do that on one line or does it need two lines?

So, create the script as above:

cat scriptest.txt

s/pig/cow/

s/cow/horse/

and a suitable file to apply it to:

cat pigcowhorse.txt

pig

pig cow

pig cow horse

pig cow horse pig
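
If you'd rather not type the two files into an editor, here is one possible way to create them from the shell. The scratch directory and the work_dir variable are my own additions, not from the book; only the file names and contents match the listings above.

```shell
# Scratch directory (an assumption) so nothing in your real folder is overwritten.
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1

# The two-command sed script from the book, one command per line.
printf '%s\n' 's/pig/cow/' 's/cow/horse/' > scriptest.txt

# The four-line test file.
printf '%s\n' 'pig' 'pig cow' 'pig cow horse' 'pig cow horse pig' > pigcowhorse.txt

# Show both files to confirm they look like the listings.
cat scriptest.txt
cat pigcowhorse.txt
```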

It would seem that each “pig” would first change to “cow”, and then all the cows, new and old, would change to “horse”, if you assume the first script line acts on the whole file before the second line does. The intermediate result would be:

cow

cow cow

cow cow horse

cow cow horse cow

so that ultimately, cows become horses – by the incredible way sed may swiftly alter their genetics:

horse

horse horse

horse horse horse

horse horse horse horse

which the rushed scriptwriter possibly overlooked, and which may not be what he desired?

If you run this script, I believe it does neither that nor what the example intends to show, as some cows don't become horses and a final pig remains un-mutated:

stevee@Mint5630 ~/Documents/SEDAWK $ sed -f scriptest.txt pigcowhorse.txt

horse

horse cow

horse cow horse

horse cow horse pig

The authors intended the script to act as explained:

“As a consequence, any sed command might change the contents of the pattern space for the next command. The contents of the pattern space are dynamic and do not always match the original input line. That was the problem with the sample script at the beginning of this chapter. The first command would change "pig" to "cow" as expected. However, when the second command changed "cow" to "horse" on the same line, it also changed the "cow" that had been a "pig." So, where the input file contained pigs and cows, the output file has only horses!”

To understand what happens, go through it one command at a time, as sed does for each line it reads. First the pig-to-cow command on its own:

sed s/pig/cow/ pigcowhorse.txt (> cows.txt)

cow

cow cow

cow cow horse

cow cow horse pig

Now you see why the pig gets left. Without the g flag, sed's s command only acts on the first occurrence on each line.

You can guess now what happens when the second command runs: only the FIRST occurrence of cow on each line gets substituted by horse (s/cow/horse/) and all else is left alone, the same as the result I got above:

horse

horse cow

horse cow horse

horse cow horse pig
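
A rough way to watch the pattern space change, assuming a POSIX-style sed: suppress the automatic print with -n and add an explicit p after each substitution. The trace variable is just a name I picked for illustration.

```shell
# Feed one sample line to sed and print the pattern space after each command.
# -n turns off the automatic print, so only the two explicit p commands show.
trace=$(printf '%s\n' 'pig cow horse pig' |
  sed -n -e 's/pig/cow/' -e 'p' -e 's/cow/horse/' -e 'p')

echo "$trace"
# cow cow horse pig     <- after s/pig/cow/   (first pig only)
# horse cow horse pig   <- after s/cow/horse/ (first cow only)
```

Both commands run on the same line before sed reads the next one, so the second command sees whatever the first one left behind.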

What the authors left out, to give us all horses, was the global flag, g, so that all occurrences of the searched term get substituted on each line. I wondered whether their 1990 sed behaved differently, but replacing only the first match per line is standard for the s command without g, so that seems unlikely:

cat scriptest.txt

s/pig/cow/g

s/cow/horse/g

so that ALL pigs become cows, then ALL cows become horses. Run with the global flag, you do indeed get all horses:

sed -f scriptest.txt pigcowhorse.txt

horse

horse horse

horse horse horse

horse horse horse horse
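
For a side-by-side comparison that doesn't need the script file, here is a small sketch that passes the commands with -e instead of -f. The variable names (line, first_only, all_matches) are mine, chosen for illustration.

```shell
line='pig cow horse pig'

# Without g: each s command replaces only the first match on the line.
first_only=$(printf '%s\n' "$line" | sed -e 's/pig/cow/' -e 's/cow/horse/')

# With g: each s command replaces every match on the line.
all_matches=$(printf '%s\n' "$line" | sed -e 's/pig/cow/g' -e 's/cow/horse/g')

echo "without g: $first_only"    # without g: horse cow horse pig
echo "with g:    $all_matches"   # with g:    horse horse horse horse
```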

Again, you can run the single sed lines to see the global substitution effect of all occurrences of the searched word on each line. The first pass changes all pigs to cows, the second, all cows to horses:

sed s/pig/cow/g pigcowhorse.txt (> cows2.txt)

cow

cow cow

cow cow horse

cow cow horse cow

sed s/cow/horse/g cows2.txt

horse

horse horse

horse horse horse

horse horse horse horse

So this is what the authors were driving home; they just left out the global flag:

“As a consequence, any sed command might change the contents of the pattern space for the next command. The contents of the pattern space are dynamic and do not always match the original input line. That was the problem with the sample script at the beginning of this chapter. The first command would change "pig" to "cow" as expected. However, when the second command changed "cow" to "horse" on the same line, it also changed the "cow" that had been a "pig." So, where the input file contained pigs and cows, the output file has only horses!”

It seems that what they suggest the scriptwriter wanted could be achieved by changing the order of the commands in the script, though again without the global flag:

“This mistake is simply a problem of the order of the commands in the script. Reversing the order of the commands - changing "cow" into "horse" before changing "pig" into "cow" - does the trick.

s/cow/horse/

s/pig/cow/”

This time, if the script is edited as above (no global option) and run on the original file, you would go from:

cat pigcowhorse.txt

pig

pig cow

pig cow horse

pig cow horse pig

to

sed -f scriptest.txt pigcowhorse.txt

cow

cow horse

cow horse horse

cow horse horse pig

Again, an unchanged pig left over.

Compare that with adding the global g:

s/cow/horse/g

s/pig/cow/g

to get what I think the authors meant the scriptwriter to want all along: all original pigs ONLY get changed to cows, and all original cows ONLY get changed to horses!

cat pigcowhorse.txt

pig

pig cow

pig cow horse

pig cow horse pig

sed -f scriptest.txt pigcowhorse.txt

cow

cow horse

cow horse horse

cow horse horse cow
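
The same reversed-order swap can also be tried as a single command line, without editing scriptest.txt. This is only a sketch: the four sample lines are piped in with printf, and swapped is a name I made up.

```shell
# cow -> horse runs first, so the cows created afterwards from pigs
# are never seen by the cow -> horse command.
swapped=$(printf '%s\n' 'pig' 'pig cow' 'pig cow horse' 'pig cow horse pig' |
  sed -e 's/cow/horse/g' -e 's/pig/cow/g')

echo "$swapped"
# cow
# cow horse
# cow horse horse
# cow horse horse cow
```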

Only by working patiently through these types of examples will you discover errors like this and think about what is going on. It gives you a feel for how easy it is to err in your initial idea of how things work, compared with how they turn out.

The next sed example is a handy one liner for removing blank lines, so I inserted 4 blank lines into pigcowhorse.txt:

cat pigcowhorse.txt

pig

pig cow

pig cow horse

pig cow horse pig

-----------------

From grep, you may know that the caret (^) anchors a match to the start of a line and the dollar ($) anchors it to the end. They match positions rather than characters, so even a blank line, which holds nothing but the line feed (LF) that separates it from the next line, still has a start and an end.

You can find all empty lines in a file using these special characters together: the start and the end of a line with NOTHING between them = a blank line.

For the 4 blanks in pigcowhorse.txt you get a gap (which I numbered below) in the tty where grep finds only the 4 empty lines:

stevee@Mint5630 ~/Documents/SEDAWK $ grep ^$ pigcowhorse.txt

1

2

3

4

stevee@Mint5630 ~/Documents/SEDAWK $

You can combine this with sed's “d” (delete) command to remove those lines from a file (note the WordPress editor also adds extra lines when pasted, but there is no line space between these in the Terminal):

sed '/^$/d' pigcowhorse.txt

pig

pig cow

pig cow horse

pig cow horse pig
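
One thing worth checking for yourself is that ^$ only matches lines with nothing in them at all; a line holding just spaces is not empty as far as the regex is concerned. The sketch below uses a throwaway file of my own (not pigcowhorse.txt), and the [[:space:]] variant is my addition, not from the book example.

```shell
# Throwaway sample: text, one empty line, one line of three spaces, text.
work_file=$(mktemp)
printf 'pig\n\n   \npig cow\n' > "$work_file"

# Count lines that are truly empty.
empty_count=$(grep -c '^$' "$work_file")

# Delete only truly empty lines; the line of spaces survives.
no_empty=$(sed '/^$/d' "$work_file")

# Delete empty lines and lines that hold only whitespace.
no_blank=$(sed '/^[[:space:]]*$/d' "$work_file")

echo "empty lines: $empty_count"   # empty lines: 1
echo "$no_blank"
# pig
# pig cow
rm -f "$work_file"
```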

-----------------

Extending that regex, you can use grep like cat to view any file, using the “.” period, which denotes any single character, along with the asterisk “*”, which in a regex means zero or more repeats of whatever comes immediately before it (unlike the shell's * wildcard).

This means that from the start of any line, any number of characters, including none at all, will match, so grep displays every line as a search result, i.e. the whole file's contents!

grep ^.* pigcowhorse.txt

pig

pig cow

pig cow horse

pig cow horse pig
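
A quick way to convince yourself that “.*” matches empty lines while “.” alone does not is to count the matching lines with grep -c. This is a sketch on a small throwaway file of my own, with the patterns quoted so the shell leaves them alone.

```shell
# Throwaway sample: two text lines with one empty line between them.
work_file=$(mktemp)
printf 'pig\n\npig cow\n' > "$work_file"

# '.*' is zero or more of any character, so the empty line matches too.
star_count=$(grep -c '.*' "$work_file")

# '.' needs at least one character, so the empty line is skipped.
dot_count=$(grep -c '.' "$work_file")

echo "lines matching .* : $star_count"   # lines matching .* : 3
echo "lines matching .  : $dot_count"    # lines matching .  : 2
rm -f "$work_file"
```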

-----------------

These two one-liners are handy to check your understanding of four common special characters (^.*$) which can mean different things to the command line, depending on how they are bounded by each other. For example, if you list the attributes of Documents with the directory option “-d”, you get a single line of output from ls:

ls -ld Documents/

drwxr-xr-x 4 stevee stevee 4096 Sep 30 10:36 Documents/

If you pipe that to grep to search for any single character in that line, you get the whole line as you may expect:

ls -ld Documents/ | grep .

drwxr-xr-x 4 stevee stevee 4096 Sep 30 10:36 Documents/

You may think that grep would still only read the content stream of its input from ls -ld if you added the asterisk:

ls -ld Documents/ | grep .*

to show “none or all characters after any single character” (found by the period), but it doesn't do that by a LONG way. It prints over 400 lines taken from the hidden files in the current directory, exactly as if you had run grep .* on its own in the first place:

In other words, ls -ld Documents/ | grep .* gives the same output as grep .* alone, because grep ignores its standard input once it has been given file names to search.

Why isn't it listing just the single line that described the Documents directory, which has a first character d (from drwx etc.), followed by none or any amount of any other characters i.e. the input stream alone, from ls -ld?

It turns out the shell, not grep, is responsible. Because .* is unquoted, the shell treats it as a filename pattern (a glob) and expands it to every name in the current directory that begins with a period, before grep even starts. On my system that list begins with the current directory “.” and the parent directory “..”, as seen by using:

stevee@AMDA8 ~ $ ls -al

total 140

drwxr-xr-x 24 stevee stevee 4096 Oct 1 16:51 .

drwxr-xr-x 4 root root 4096 Sep 17 18:46 ..

drwxr-xr-x 3 stevee stevee 4096 Sep 18 13:08 .AMD

-rw------- 1 stevee stevee 3432 Sep 26 20:21 .bash_history

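so grep takes the first name in the list (“.”) as its pattern, which matches any line with at least one character, and treats all the remaining names as files to search. It ignores the piped input from ls -ld completely, complains about each hidden directory, and prints every non-empty line of the hidden regular files in /home/stevee, such as .bash_history. Quoting the pattern, as in grep '.*', stops the shell expanding it. Maybe not what you expect?

Run it yourself to see. It's only by playing with the fundamentals like this that you will start to think about and understand the underlying OS and file system workings, and the tools originally created to manipulate it.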
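
To see the expansion without grepping your real dotfiles, here is a sketch that works in a scratch directory. The directory, the two hidden files (.hidden_a, .hidden_b) and the variable names are all made up for the demonstration. Whether “.” and “..” appear in the expanded list depends on the shell and its version, so I make no claim about the exact list you will see.

```shell
# Scratch directory with two hypothetical hidden files.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'secret one\nsecret two\n' > .hidden_a
printf 'secret three\n' > .hidden_b

# Unquoted: the shell replaces .* with the matching names before grep runs.
# Printing the arguments shows what grep would really be given.
expanded=$(printf '%s ' grep .*)
echo "unquoted command becomes: $expanded"

# Unquoted for real: grep gets file names, so the piped text is never read.
unquoted=$(echo 'line from the pipe' | grep .* 2>/dev/null || true)

# Quoted: grep receives the regex .* itself and reads the pipe.
quoted=$(echo 'line from the pipe' | grep '.*')
echo "$quoted"   # line from the pipe
```

Quoting every regex you hand to grep or sed is a cheap habit that avoids this whole class of surprise.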

The Unix engineers really were smart guys, no doubt.

stevee@AMDA8 ~ $ ls -ld Documents/ | grep .* | wc -l

grep: ..: Is a directory

grep: .AMD: Is a directory

grep: .cache: Is a directory

grep: .cinnamon: Is a directory

grep: .config: Is a directory

grep: .dbus: Is a directory

grep: .gconf: Is a directory

grep: .gnome: Is a directory

grep: .gnome2: Is a directory

grep: .gnome2_private: Is a directory

grep: .linuxmint: Is a directory

grep: .local: Is a directory

grep: .mozilla: Is a directory

grep: .pki: Is a directory

grep: .ssh: Is a directory

412

stevee@AMDA8 ~ $ grep .* | wc -l

grep: ..: Is a directory

grep: .AMD: Is a directory

grep: .cache: Is a directory

grep: .cinnamon: Is a directory

grep: .config: Is a directory

grep: .dbus: Is a directory

grep: .gconf: Is a directory

grep: .gnome: Is a directory

grep: .gnome2: Is a directory

grep: .gnome2_private: Is a directory

grep: .linuxmint: Is a directory

grep: .local: Is a directory

grep: .mozilla: Is a directory

grep: .pki: Is a directory

grep: .ssh: Is a directory

412
