Beginner’s intro to awk (part 2)

The previous post covered the basics needed to get started with awk, so it's now possible to move on to some practical examples of making things work with awk.

Picking out unique values from a list of results

This can be a common problem, for example finding all the unique ip addresses in a logfile, or finding all the unique values in some sort of report.

Let’s say we have an example file, where we’ve already done some text processing and ended up with a list of ip addresses of people logging into some system:

root@debian-test:~# cat ip_addresses 
192.168.2.11
192.168.2.12
192.168.2.16
192.168.2.16
192.168.2.10
192.168.2.16
192.168.2.12

If we are interested in merely the unique values here (remove duplicates, but keep one copy), the awk command will be surprisingly short:

root@debian-test:~# awk '!a[$0]++' ip_addresses
192.168.2.11
192.168.2.12
192.168.2.16
192.168.2.10

How does this work?

The secret ingredient behind the simplicity is that awk uses associative arrays. In fact, all arrays in awk are associative. This might be familiar to those with a PHP or Perl background (though Perl calls them hashes). An associative array is an array whose indices are strings. (It is possible to mimic traditional arrays by using integers as the indices.) The trick here is that instead of using an integer to find an entry in an array, we can use anything we want as the index.

What that means, is that we can keep track of whether we’ve seen an ip address or not, and the ip addresses themselves using a single array.

Let’s break it down:

So what a[$0] does is look up the value at index $0 in the associative array a[]. If no such element exists yet, awk automatically creates one, holding an empty string.

What a[$0]++ does, then, is perform that same lookup and increment the value. The ++ operator always forces a numeric value to be returned, so if a[$0] was a newly created empty string, it is treated as 0: 0 is returned, and a[$0] is incremented to 1.

If a[$0] was not an empty string (or in other words, not equal to 0), meaning that we have seen the string represented by $0 in a previous record already, then the value of a[$0] will be 1, which will be incremented to 2, and the value of 1 will be returned for further processing. Each subsequent time we encounter the same value of $0, a[$0] will be incremented to 3, 4, 5 and so on.

The remaining piece is the ! (logical not) operator. If the outcome of evaluating a[$0]++ as above is 0, it flips the result to 1. If the outcome is any other number, it flips the result to 0.

So the first time we encounter the string “192.168.2.12” in the file ip_addresses, a[$0] is not yet set, awk sets it to an empty string, which is converted to 0 by the ++ operator, which then gets flipped to 1 by the ! operator (meaning we should print $0).

The second time we encounter the string “192.168.2.12” in the file ip_addresses, a[$0] will be already set with the value of 1, which is evaluated as true, and flipped to 0 (false) by the ! operator (meaning we don’t print).

If the overall expression evaluates to 1 (true), awk will print the record. Essentially, we’ve only provided a pattern here, but no action; when that happens, awk defaults to print $0.

The expression could very well be rewritten as:

{if(!a[$0]++){print $0;}}
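To make the counter visible, we can print the value that a[$0]++ evaluates to next to each record. This is a small demonstration sketch (the file name demo_ips is made up for the purpose):

```shell
# Build a tiny input file: one address once, another twice
printf '%s\n' 192.168.2.11 192.168.2.12 192.168.2.12 > demo_ips

# Print each record together with what a[$0]++ returns:
# the first occurrence of a line yields 0, repeats yield 1, 2, ...
awk '{print $0, a[$0]++}' demo_ips
# 192.168.2.11 0
# 192.168.2.12 0
# 192.168.2.12 1
```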

Let’s consider a slightly more complicated example. This time, we have not only ip addresses, but also usernames of users logging into some system, and some random data we don’t care about. The task is to print all records where both the ip address and the username is unique.

root@debian-test:~# cat ip_addresses
192.168.2.11,Bob,c612951f-9dbe-43f4-9fb4-1b4b49afec6e
192.168.2.12,Tim,7ac5863b-c802-406a-9d5c-883745726b17
192.168.2.16,Jim,26810bc2-5072-4354-997f-9892d82d54f0
192.168.2.16,Jim,c173dda5-644e-40a6-b256-23c51b0adfe1
192.168.2.10,Ed,1ff49ee3-9c7f-459e-b113-622dddd743f0
192.168.2.16,Tom,d1e31ad5-8880-4867-ba15-ab3ef3f2ab4d
192.168.2.12,Jon,0193a512-0fd5-4168-a4be-524768650c96

Can we still do this? Since every record is now unique thanks to the UUIDs, we can’t simply use $0. But remember how we can put anything into the index of an array? We can simply adjust the field separator, and combine both fields $1 and $2 into the index:

root@debian-test:~# awk 'BEGIN{FS=","; OFS=","}!a[$1,$2]++{print $1,$2}' ip_addresses
192.168.2.11,Bob
192.168.2.12,Tim
192.168.2.16,Jim
192.168.2.10,Ed
192.168.2.16,Tom
192.168.2.12,Jon

and we get the desired result.

If you are interested in finding only the duplicates, the solution is quite easy. Simply remove the negation operator, so that we only print if we’ve already encountered an index in the array for the second, third, etc. time.

root@debian-test:~# awk 'BEGIN{FS=","; OFS=","}a[$1,$2]++{print $1,$2}' ip_addresses
192.168.2.16,Jim

This will print multiple copies if something exists more than twice in the input, but you now know how to solve that problem :-)
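One way to solve it, sketched here with a shortened version of the sample file (the uuid-N values are placeholders): test the counter’s value explicitly, so the pattern fires only on the second occurrence of each (ip, user) pair.

```shell
# Recreate a cut-down sample input
cat > ip_addresses <<'EOF'
192.168.2.11,Bob,uuid-1
192.168.2.16,Jim,uuid-2
192.168.2.16,Jim,uuid-3
192.168.2.16,Jim,uuid-4
EOF

# a[$1,$2]++ == 1 is true only the second time a pair is seen,
# so each duplicated pair is printed exactly once
awk 'BEGIN{FS=","; OFS=","} a[$1,$2]++ == 1 {print $1,$2}' ip_addresses
# 192.168.2.16,Jim
```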

Count occurrences of a value in a column

This also often comes in handy, e.g. you might need to know how often some users logged in to some system, or how often a certain error has happened in some log files, etc.

Using the earlier example file, if we want to know how many times each ip address appeared, we could do

root@debian-test:~# awk 'BEGIN{FS=","}{a[$1]++}END{for (x in a){print x,a[x]}}' ip_addresses
192.168.2.16 4
192.168.2.10 1
192.168.2.11 1
192.168.2.12 2

How does it work?

As established in the previous post of this tutorial, awk’s default field separator is whitespace (any run of spaces or tabs). Since csv files are separated by commas (,) and not spaces or tabs, we use the BEGIN block to set FS to ,

In the main block we have a[$1]++. This looks up the value at index $1 in the associative array a and increments it by one. This way, we know how many times we have encountered each unique value of field $1: the array ends up mapping each value we tallied to the number of times it occurred.

Now we move on to the END block. Here we just print out each index and value of the array a.

So “for (x in a){print x,a[x]}” means: for each index x in the associative array a, print the index and its value. In our case, the index is the content of column 1 of the csv file (i.e. field $1), and the value is how many times we have encountered it.
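One caveat: for (x in a) walks the indices in no guaranteed order. If a sorted report is wanted, a simple option is piping awk’s output into sort; here sorted by count, descending (a sketch using a shortened sample file):

```shell
# Cut-down sample: one address appears twice, one once
cat > ip_addresses <<'EOF'
192.168.2.16,Jim,x
192.168.2.16,Tom,y
192.168.2.10,Ed,z
EOF

# Tally field $1, then sort the report numerically on column 2, descending
awk 'BEGIN{FS=","} {a[$1]++} END{for (x in a) print x, a[x]}' ip_addresses | sort -k2,2nr
# 192.168.2.16 2
# 192.168.2.10 1
```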

Sum the values of a column

Let’s say we now wanted to get the total number of logins based on the result of our previous awk expression.

e.g.

root@debian-test:~# cat logins
192.168.2.16 4
192.168.2.10 1
192.168.2.11 1
192.168.2.12 2

The following expression will do the job:

root@debian-test:~# awk '{sum=sum+$2}END{print sum}' logins
8

This one is quite simple. For each record read by awk, add the value of field $2 to the variable sum. The main takeaway here is that there is no need to initialize the variable sum in a BEGIN block: an uninitialized variable defaults to an empty string, which is converted to 0 when a number is added to it.
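Building on the same idea, other aggregates fall out almost for free. For instance, since NR holds the number of records read, the average number of logins per ip address is just sum/NR in the END block (a sketch):

```shell
# Recreate the sample file from above
cat > logins <<'EOF'
192.168.2.16 4
192.168.2.10 1
192.168.2.11 1
192.168.2.12 2
EOF

# NR is the total record count, so sum/NR is the mean
awk '{sum += $2} END{if (NR > 0) print "total:", sum, "average:", sum/NR}' logins
# total: 8 average: 2
```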

Combining/Joining two files

While working with data files, especially csv files, the need to join/merge files together may arise in order to build a clearer picture of something. Awk is very well suited to this task.

Consider the earlier example with the ip addresses. Perhaps in another file, we have information about the operating systems of the machines behind the ip addresses, and want all the information consolidated into one single file.

root@debian-test:~# cat logins
192.168.2.16 4
192.168.2.10 1
192.168.2.11 1
192.168.2.12 2
root@debian-test:~# cat machines
192.168.2.10 Linux
192.168.2.11 Linux
192.168.2.12 BSD
192.168.2.16 Linux

As established in part 1 of this tutorial, awk can read multiple files in succession, and there is a helpful built-in variable available out of the box for keeping track of which file we are in. Circling back to the FNR built-in variable: if we load two files into awk, FNR is reset to 1 when we start processing file 2, but NR keeps on increasing. This means that while we are processing file 1, FNR and NR have the same value; as soon as we hit line 1 of file 2, their values diverge.

root@debian-test:~# awk '{print "NR: "NR", FNR: "FNR}' logins machines
NR: 1, FNR: 1
NR: 2, FNR: 2
NR: 3, FNR: 3
NR: 4, FNR: 4
NR: 5, FNR: 1
NR: 6, FNR: 2
NR: 7, FNR: 3
NR: 8, FNR: 4

With this trick, there is a simple way to execute awk code on only the first or the second input file: if we use NR==FNR as a condition, the code in that block will run only while the first file is being read.

awk '(NR==FNR){a[$0]; next}(#some other condition){#some action}' logins machines

Remembering the previous section, a[$0] means that awk builds an associative array remembering each of the lines it has seen. The “next” keyword simply skips to the next iteration of the implicit awk loop (i.e. moves on to the next line of input), which means that #some other condition is NOT evaluated, preventing “#some action” from taking place while awk is still reading file 1.

So now we need to find such a second condition, and second action, that can join two files together.

The expression we need is the following

root@debian-test:~# awk '(NR==FNR){a[$1]=$1" "$2; next} ($1 in a){print a[$1],$2}' logins machines
192.168.2.10 1 Linux
192.168.2.11 1 Linux
192.168.2.12 2 BSD
192.168.2.16 4 Linux

How does it work?

Basically, a[$1]=$1" "$2 creates an element in the associative array a, indexed by $1 and containing $1, a space character and $2, for each line of the file logins.

Now that we have such a list, in our array a, we can do something with the second file, called machines. Here’s where the second condition comes in.

What ($1 in a) roughly means, is that we check whether $1 of the file machines appears as an index in array a or not. If it does, we print the array element of this index (which remember, contains columns $1 and $2 of file 1, separated by a space) and also print $2 of the current file (which in this case, is the file called machines).

It is useful to note, though, that for this to work correctly, the two input files must not have duplicate values in the column used for merging/joining. If duplicates do exist in that column, you need to find some way to purge them before attempting to join the files like this.
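Another detail worth knowing: records of machines whose ip address has no counterpart in logins are silently skipped by ($1 in a). If they should be flagged instead, an else branch does it. A sketch, using a cut-down pair of files with one deliberately unmatched address (192.168.2.99 is made up):

```shell
cat > logins <<'EOF'
192.168.2.10 1
EOF
cat > machines <<'EOF'
192.168.2.10 Linux
192.168.2.99 BSD
EOF

# Matched addresses are joined as before; unmatched ones get a marker
awk '(NR==FNR){a[$1]=$1" "$2; next} {if ($1 in a) print a[$1], $2; else print $1, "(no logins)", $2}' logins machines
# 192.168.2.10 1 Linux
# 192.168.2.99 (no logins) BSD
```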

Group items together

Sometimes it might be useful to group relevant values together. Let’s say now, someone took a look at the machines file, and asks you if you could please put the ip addresses of each operating system into a single row because it’d be easy to read that way.

So considering the file earlier:

root@debian-test:~# cat machines
192.168.2.10 Linux
192.168.2.11 Linux
192.168.2.12 BSD
192.168.2.16 Linux

The required expression is the following:

root@debian-test:~# awk '{if(a[$2])a[$2]=a[$2]":"$1; else a[$2]=$1;}END{for (x in a){print x, a[x];}}' machines
BSD 192.168.2.12
Linux 192.168.2.10:192.168.2.11:192.168.2.16

This is a lot simpler than it looks. If there is already some value in a[$2], we append a colon and the content of field $1 to the array element indexed by $2. Otherwise, this is the first value we have encountered for $2 in the file, so we create a new element in the array with $1 as its initial value.

Given that the above expression does look pretty cumbersome, here it is as an awk script:

#!/usr/bin/awk -f
{
    if(a[$2]) {
        a[$2]=a[$2]":"$1;
    } else {
        a[$2]=$1;
    }
}
END{
    for(x in a) {
        print x, a[x];
    }
}
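As a side note, the if/else can also be collapsed into awk’s ternary operator; the behaviour is the same. A sketch (output piped through sort only to make the order deterministic, since for-in order is unspecified):

```shell
cat > machines <<'EOF'
192.168.2.10 Linux
192.168.2.12 BSD
192.168.2.16 Linux
EOF

# Prepend a[$2]":" when a[$2] already has content, otherwise start with $1
awk '{a[$2] = (a[$2] ? a[$2] ":" : "") $1} END{for (x in a) print x, a[x]}' machines | sort
# BSD 192.168.2.12
# Linux 192.168.2.10:192.168.2.16
```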

Mapping two files based on a column

Perhaps it has been decided to decommission any machines with BSD on it, so you are given a file with the following information:

root@debian-test:~# cat instructions
Linux Keep
BSD Decommission

Given the machines file earlier, we should be able to create a file which tells which ip addresses will be kept, and which ones will be gone.

The expression goes as:

root@debian-test:~# awk '(NR==FNR){a[$1]=$2; next}{$2=a[$2]; print $0}' instructions machines
192.168.2.10 Keep
192.168.2.11 Keep
192.168.2.12 Decommission
192.168.2.16 Keep

The logic here is very simple. We go through the instructions file and store each instruction (Keep or Decommission) in an associative array, where the index of each array element is the operating system the instruction refers to.

Then, as we loop through the second file (machines), we redefine $2 of each record to the value of the associative array a at the index matching the operating system found in field $2 of the machines file.
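One assumption baked into this one-liner is that every operating system in machines appears in instructions; if one doesn’t, $2 would quietly become an empty string. A guard with the in operator and a ternary handles that case. A sketch, with a deliberately incomplete instructions file:

```shell
cat > instructions <<'EOF'
Linux Keep
EOF
cat > machines <<'EOF'
192.168.2.10 Linux
192.168.2.12 BSD
EOF

# Map $2 through the array only when the index exists, else mark it Unknown
awk '(NR==FNR){a[$1]=$2; next} {$2 = ($2 in a) ? a[$2] : "Unknown"; print $0}' instructions machines
# 192.168.2.10 Keep
# 192.168.2.12 Unknown
```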

Comparing two files (practice example)

Given two files with ip addresses, ip1 and ip2, we wish to find the ip addresses that appear only in file ip2, but not in ip1.

root@debian-test:~# cat ip1
192.168.2.19
192.168.2.11
192.168.2.10
192.168.2.12
root@debian-test:~# cat ip2
192.168.2.16
192.168.2.10
192.168.2.11
192.168.2.13
192.168.2.12

This final example is left for the reader to solve. The solution is below.

Solution:

awk '(NR==FNR){a[$0]; next}!($0 in a){print $0}' ip1 ip2
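For completeness, dropping the negation gives the opposite result: the addresses present in both files. A sketch with the two sample files recreated inline:

```shell
cat > ip1 <<'EOF'
192.168.2.19
192.168.2.11
192.168.2.10
EOF
cat > ip2 <<'EOF'
192.168.2.16
192.168.2.10
192.168.2.11
EOF

# Remember every line of ip1, then print lines of ip2 that were remembered
awk '(NR==FNR){a[$0]; next} ($0 in a){print $0}' ip1 ip2
# 192.168.2.10
# 192.168.2.11
```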

With that, this post concludes; it has been a long one. The next post will go through two more examples, slightly more involved than the ones in this post.