Beginner’s intro to awk (part 1)

Why learn awk?

In some situations, you might find that the languages you're more used to are not available (as in my case, when neither PHP nor Perl was installed), but awk rather often comes pre-installed, even on fairly exotic systems, and it is generally considered a standard, well-established tool.

This also makes awk a very portable language; especially in the case of simple scripts, little to no modification is needed to move a script from one environment to another. This, coupled with its powerful text processing capabilities, makes it a very useful utility to have under the belt.

If you regularly need to analyze datasets or log files, extract data from files, or anything of the sort, awk can often save quite a bit of time and manual work. The reader is assumed to have some familiarity with the shell and with regular expressions.

The structure of an awk script

The very basic idea of an awk script is the following

pattern { program code }

The premise here is that the pattern determines the condition under which the accompanying block of code is executed. Like many other common terminal-based tools, awk reads its input line by line. By default, if no pattern is specified for a block of awk code, that block is executed for every line. The other two most commonly seen patterns are BEGIN and END. So in summary:

The following awk script

BEGIN{print "HELLO"}{print $0}END{print "WORLD"}

would print the contents of the given input, with the string HELLO prepended before line 1 of the input and the string WORLD appended after its final line. $0 simply refers to the entire current line being read by awk, but we'll get into more detail about that in a section further down.

Executing awk

Before we get into more details about fields and variables in awk, it’s probably useful to know how to actually run awk code.

It can be used simply like any other command. All of your awk code then goes inside single quotes.

For example to use the above awk script on a file in this way,

awk 'BEGIN{print "HELLO"}{print $0}END{print "WORLD"}' file

will suffice.
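
To see the full flow without needing a file, the same one-liner can be fed input through a pipe (printf here just produces two sample lines):

```shell
printf 'a\nb\n' | awk 'BEGIN{print "HELLO"}{print $0}END{print "WORLD"}'
# HELLO
# a
# b
# WORLD
```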

If an awk script turns out to be particularly useful, and you want to re-use it on different occasions and in different places, it's possible to save awk scripts to a file and use the awk interpreter directly. The above example as a standalone awk script:

#!/usr/bin/awk -f
BEGIN{
    print "HELLO"
}
{
    print $0
}
END{
   print "WORLD"
}

This can then be executed by awk using the -f flag, which tells awk to read its program code from the file given as the flag's argument.

awk -f ./the_awk_script.awk input_file

A useful side note here is that awk generally expects input, either from files given as arguments or from standard input. However, if an awk script consists only of a BEGIN block, the input can be omitted.
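
For instance, a BEGIN-only script makes a handy calculator, and it needs no input at all:

```shell
awk 'BEGIN{print 6 * 7}'
# 42
```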

Basic awk syntax

Basic Variables

Going back to that strange little $0 variable we used in our first script, it's important to note that, in general, variables in awk do not begin with a dollar sign ($). Variables beginning with a $ have a special meaning in awk, and are often (is there anyone who often talks about awk?) referred to as field variables. What that means is that awk thinks of each line of input in terms of so-called "fields".

A field is a unit of data that is separated by a delimiter (by default, whitespace, or [ \t]+ for those of you familiar with regular expressions). For example, if a text file has multiple columns of data separated by spaces or tabs, each column is considered a field.

Reading the fstab file is a good example to understand this visually.

root@debian-test:~# awk '(/^UUID/){print $0}' /etc/fstab
UUID=ced45257-8704-4b2a-baac-3c9c1edfcf71 /               ext4    errors=remount-ro 0       1
UUID=034e0a39-a1ab-4691-b87f-6c2db3b99e68 none            swap    sw              0       0

With the above line we say: if the beginning of the line matches UUID (the pattern), print the entire line ($0). If we're only interested in the first column of this data, we could do

root@debian-test:~# awk '(/^UUID/){print $1}' /etc/fstab
UUID=ced45257-8704-4b2a-baac-3c9c1edfcf71
UUID=034e0a39-a1ab-4691-b87f-6c2db3b99e68

You can safely try this example on your own computer, as awk by default does not modify the original input file.

Otherwise, variables in awk don’t start with $, and can be declared without any keyword (no var, int, char, etc).

awk 'BEGIN{somevalue="test"; print somevalue;}'

The -v flag can be used to transfer a variable from your shell into awk. For example:

root@debian-test:~# SOMEVALUE=test
root@debian-test:~# awk -v somevalue=$SOMEVALUE 'BEGIN{print somevalue;}'
test

Built in variables

Awk comes with a number of helpful built-in variables to make certain tasks easier.

FS

For example, above, awk assumed that the delimiter (field separator) between columns is whitespace. For reading the fstab file that's okay; however, we might want to peek into something like a CSV file. In that case, awk has a built-in FS variable, which can be used to change the field separator between columns from the default to something else.

root@debian-test:~# echo "1,2,3,4" | awk '{print $1}'
1,2,3,4

Here awk printed everything, since the above input contained no whitespace, so the first column is the entire line. However, by adjusting the FS value, we can easily get the first column as desired:

root@debian-test:~# echo "1,2,3,4" | awk 'BEGIN{FS=","}{print $1}'
1

Because setting FS to different values is such a common thing to do, awk has a dedicated flag for it. The above expression can be written without the BEGIN block by using the -F flag:

root@debian-test:~# echo "1,2,3,4" | awk -F"," '{print $1}'
1
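
As a sketch of what else -F accepts: in POSIX awk, a field separator longer than one character is treated as a regular expression, so a multi-character separator or a character class both work:

```shell
echo "1::2::3" | awk -F'::' '{print $2}'
# 2

echo "1,2;3" | awk -F'[,;]' '{print $3}'
# 3
```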

OFS

There’s a similar and related built-in variable, OFS (Output Field Separator), which determines what the field separator of awk’s output should be. However, OFS has a few perhaps not-so-intuitive rules about it, so it needs to be covered in a bit more detail than FS.

If you simply print $0, you might expect to see whatever you set OFS to in your output, but in that situation awk does not make use of OFS. Example:

root@debian-test:~# echo "1,2,3,4" | awk 'BEGIN{FS=","; OFS=";"}{print $0}'
1,2,3,4

OFS needs to be used explicitly in order for it to appear in the output. The obvious way would be to type out all the fields with OFS between them

root@debian-test:~# echo "1,2,3,4" | awk 'BEGIN{FS=","; OFS=";"}{print $1OFS$2OFS$3OFS$4}'
1;2;3;4

Now this probably seems rather cumbersome, so surely there has to be a better way? Yes: one trick is to redefine one of the fields. Example:

root@debian-test:~# echo "1,2,3,4" | awk 'BEGIN{FS=","; OFS=";"}{$1=$1; print $0}'
1;2;3;4

What this trick does is get awk to deconstruct $0 and put it together again using OFS. But beware: reassigning $0 itself makes use of only FS, not OFS!

root@debian-test:~# echo "1,2,3,4" | awk 'BEGIN{FS=","; OFS=";"}{$0=$0; print $0}'
1,2,3,4
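
If the goal is simply to swap one separator for another in the output, a plain substitution on $0 works too. Note that this is text replacement rather than field handling, so neither FS nor OFS plays a part here:

```shell
echo "1,2,3,4" | awk '{gsub(/,/, ";"); print $0}'
# 1;2;3;4
```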

RS

The next useful built-in variable to know is RS (Record Separator). In awk terms, each line that awk reads is called a “record”; essentially, whatever awk would put into $0 is a record. By default, records are separated by \n (newline), but this can be set to something else.

root@debian-test:~# echo "1,2,3,4" | awk 'BEGIN{RS=","}{print $0}'
1
2
3
4
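
One special value worth knowing: setting RS to an empty string puts awk into “paragraph mode”, where records are separated by blank lines (and newlines act as additional field separators inside each record). A small sketch:

```shell
printf 'line1\nline2\n\nline3\n' | awk 'BEGIN{RS=""}{print NR": "$1}'
# 1: line1
# 2: line3
```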

ORS

Similarly to how FS has OFS as its counterpart, RS has ORS (Output Record Separator) as its counterpart. Unlike OFS, ORS does not have any special rules about its usage.

Consider the following file:

root@debian-test:~# cat numbers
1
2
3
4

The file could be printed as a single line using ORS:

root@debian-test:~# awk 'BEGIN{ORS=","}{print $0}' numbers ; echo ""
1,2,3,4,

(The echo "" is just to force the next shell prompt to appear on a new line)

NR

NR holds the number of the record (line) currently being processed by awk; by the time the END block runs, it equals the total number of records. Using the above file,

$ awk 'END{print NR}' numbers
4

we can tell that the file has exactly 4 records (this is basically the same thing as running wc -l).
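
NR is also handy inside the main block, for example to number lines much like cat -n does:

```shell
printf 'a\nb\nc\n' | awk '{print NR": "$0}'
# 1: a
# 2: b
# 3: c
```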

NF

NF holds the number of fields in the current record (this depends on what FS is set to!):

root@debian-test:~# echo "1,2,3,4" | awk '{print NF}'
1

Remember, since FS is left at its default, we have only one field here: the field separator is whitespace, and the input contains none.

root@debian-test:~# echo "1,2,3,4" | awk 'BEGIN{FS=","}{print NF}'
4

Here FS makes a clear difference.
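
Since NF is just a number, it can also be used to index fields: $NF is the last field of the record, no matter how many fields there are:

```shell
echo "1,2,3,4" | awk -F',' '{print $NF}'
# 4
```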

FNR

FNR holds the record number within the current file. This probably feels very similar to NR, and when awk reads a single file, the two do give the same results. However, awk can read multiple files, and that's where FNR differs from NR.

root@debian-test:~# cat numbers
1
2
3
4

Using the above example file

root@debian-test:~# awk '{print "NR: "NR ";FNR: " FNR}' numbers
NR: 1;FNR: 1
NR: 2;FNR: 2
NR: 3;FNR: 3
NR: 4;FNR: 4

The difference is that FNR is reset for each new file (counting from 1 again), while NR keeps incrementing across files.

If we consider TWO files:

root@debian-test:~# cat numbers
1
2
3
4
root@debian-test:~# cat numbers2
1
2
3
4

We can see the difference between NR and FNR very clearly:

root@debian-test:~# awk '{print "NR: "NR ";FNR: " FNR}' numbers numbers2
NR: 1;FNR: 1
NR: 2;FNR: 2
NR: 3;FNR: 3
NR: 4;FNR: 4
NR: 5;FNR: 1
NR: 6;FNR: 2
NR: 7;FNR: 3
NR: 8;FNR: 4

This can be very useful if we want to compare two files, or perhaps merge two files with multiple columns in some way. This aspect will certainly be revisited in a follow-up post.
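
As a small preview (a sketch, with hypothetical file names), the classic NR==FNR idiom relies on exactly this difference: while awk is reading the first file, NR and FNR are equal, so its lines can be stored in an array, and each line of the second file can then be checked against that array:

```shell
# print the lines of file2 that also appear in file1
awk 'NR==FNR {seen[$0]; next} $0 in seen' file1 file2
```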

ARGV and ARGC

Circling back to the case of awk being used with only a BEGIN{} block: it's possible to pass arguments to awk from the command line instead of using standard input. The arguments land in the ARGV array, and ARGC holds their count. For example

root@debian-test:~# awk 'BEGIN {print ARGV[2]}' arg1 arg2 arg3 arg4
arg2

root@debian-test:~# awk 'BEGIN {print ARGC}' arg1 arg2 arg3 arg4
5

Note that ARGC is 5, not 4: ARGV[0] holds the name of the awk program itself, so the count includes it along with the four arguments. This is also why ARGV[1] is arg1 and ARGV[2] printed arg2 above.
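
To see the whole array at once, ARGV can be walked with a plain for loop (the exact text of ARGV[0] may vary between awk implementations, so no output is shown for it here; it is typically just "awk"):

```shell
awk 'BEGIN{for (i = 0; i < ARGC; i++) print i": "ARGV[i]}' arg1 arg2
```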

To be clear, it's possible to use both a file and extra command-line arguments together, but doing so might be harder to understand and maintain than using the -v flag. Consider the following awk script

#!/usr/bin/awk -f

BEGIN {
    # Print the extra arguments from last to first, shrinking ARGC
    # each time so that awk does not try to open them as input files.
    for (i = ARGC; i > 2; i--) {
        print ARGV[ARGC-1];
        ARGC--;
    }
}
{
    print $0
}

and the following file

root@debian-test:~# cat file
1
2
3
4

We can process the file and also keep the arguments passed to awk from the command line:

root@debian-test:~# awk -f argc_example.awk file arg1 arg2
arg2
arg1
1
2
3
4

At this point, this post has probably grown long enough. The next part of this tutorial will start touching on some more practical examples.