Intermediate awk: Log Slicing and Label-Based Extraction

It’s well known that awk can search through text files with expressions like:

awk '/pattern/' file

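But sometimes we want to grab a certain slice of a file, such as a log file, where it’s common to only care about what happened during certain hours or minutes. Fortunately, awk supports range expressions of the form /start/,/end/, which select every line from the first one matching the start pattern through the first one matching the end pattern.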

$ awk '/2024-11-01 00/,/2024-11-01 01/' maxscale.log.5
2024-11-01 00:00:01.461   notice : (mxb_log_rotate): Log rotation complete
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): Host: 'linuxpc' OS: Linux@5.15.0-88-generic, #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023, x86_64 with 32 processor cores (32.00 available).
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): Total main memory: 62.73GiB (62.73GiB usable).
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): MaxScale is running in process 1361
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): MariaDB MaxScale 24.02.1 (Commit: 68459d35d45c9b6590b88f1ab603e64bd884af13)
2024-11-01 01:46:52.669   warning: [mariadbmon] (check_semisync_master_status): Failed to query semi-sync status of server 'server1': Query 'SELECT c.VARIABLE_VALUE, s.VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_VARIABLES c JOIN INFORMATION_SCHEMA.GLOBAL_STATUS s ON(c.VARIABLE_NAME = 'rpl_semi_sync_master_enabled' AND s.VARIABLE_NAME = 'rpl_semi_sync_master_status')' failed: 'Can't connect to server on '192.168.2.18' (115)'.

If the line matching the ending pattern is not wanted (for example, because we just want to grab this startup message from MaxScale), it can be excluded from the output:

$ awk '/2024-11-01 00/,/2024-11-01 01/{if (!/2024-11-01 01/)print $0}' maxscale.log.5
2024-11-01 00:00:01.461   notice : (mxb_log_rotate): Log rotation complete
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): Host: 'linuxpc' OS: Linux@5.15.0-88-generic, #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023, x86_64 with 32 processor cores (32.00 available).
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): Total main memory: 62.73GiB (62.73GiB usable).
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): MaxScale is running in process 1361
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): MariaDB MaxScale 24.02.1 (Commit: 68459d35d45c9b6590b88f1ab603e64bd884af13)
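
To see the range behaviour in isolation, here is a minimal sketch on a five-line sample. The log lines are invented for illustration and are not taken from maxscale.log.5, and the variable names are hypothetical. Both patterns are anchored with ^ so they only match a timestamp at the start of a line, an extra precaution the commands above don’t take.

```shell
# Hypothetical sample log; these lines are invented for illustration.
sample='2024-10-31 23:59:59 notice : before the range
2024-11-01 00:00:01 notice : first line in range
2024-11-01 00:30:00 notice : still in range
2024-11-01 01:46:52 warning: first line of the next hour
2024-11-01 01:50:00 notice : after the range'

# The range switches on at the first line matching the start pattern and
# switches off after the first line matching the end pattern.
with_end=$(printf '%s\n' "$sample" | awk '/^2024-11-01 00/,/^2024-11-01 01/')

# Same range, but the line matching the end pattern is filtered out.
without_end=$(printf '%s\n' "$sample" |
  awk '/^2024-11-01 00/,/^2024-11-01 01/{if (!/^2024-11-01 01/) print}')

printf '%s\n---\n%s\n' "$with_end" "$without_end"
```

The first command keeps three lines (the two from hour 00 plus the first one from hour 01); the second keeps only the two lines from hour 00.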

That’s simple enough, but let’s consider a more interesting scenario. As awk is often used for text processing, a common problem is that some process produces output of which we only want a very specific portion, without including the lines that match the search patterns themselves.

Let’s consider an example file like:

_F2_STUFF
1
2
3
4
_F1_STUFF
a
b
c
d

How would we grab only what comes after _F1_STUFF, without printing the label itself or anything that comes before it?

awk '/^_F1_/{f=1;next}/^_F2_/{f=0;next}f' file

Let’s break down what this actually does:

-> /^_F1_/{f=1;next} : if the current line begins with _F1_, set flag f to 1, then use next to skip the remaining rules for this line, so the label itself is never printed. If we were to slap a print $0 before next, this rule would print _F1_STUFF. With this little trick, we’re essentially marking where the lines we want to print begin.
-> /^_F2_/{f=0;next} : if the current line begins with _F2_, set flag f to 0 and skip to the next line. This means that we’re not interested in printing the _F2_ label or anything that comes after it (until the next _F1_).
-> f : a bare pattern with no action, so awk applies its default action and prints the line whenever f is non-zero. (For this particular file you could put f at the beginning too, because _F2_STUFF appears while f is still 0. In a file where an _F2_ label follows the _F1_ block, a leading f would print that label, so keeping it at the end is safer as well as “cleaner”.)
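
The breakdown above can be checked with a small self-contained sketch that feeds the example file to the same program through a pipe. The variable names input and after_f1 exist only for this illustration.

```shell
# The example file from above, held in a variable.
input='_F2_STUFF
1
2
3
4
_F1_STUFF
a
b
c
d'

after_f1=$(printf '%s\n' "$input" | awk '
  /^_F1_/ { f = 1; next }   # label seen: start printing from the next line
  /^_F2_/ { f = 0; next }   # other label: stop printing, skip the label
  f                         # bare pattern: print the line while f is non-zero
')

printf '%s\n' "$after_f1"
```

This prints a, b, c and d, one per line, and neither label.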

Ok, but what if we wanted to print the stuff between _F2_STUFF and _F1_STUFF without printing these labels? The process is very similar, but a little bit more involved:

awk 'f{if (/^_F1_/){printf "%s", buf; f=0; buf=""} else {buf = buf $0 ORS}}; /^_F2_/{f=1; next}' file

Here the logic needs to be somewhat the reverse of the previous case, since the stuff we want to actually print is at the beginning. So, best to follow the logic from the end:

-> /^_F2_/{f=1; next} : When we see a line that begins with _F2_, set flag f to 1, but don’t print the current line
-> else {buf = buf $0 ORS} : while f is set, every line that does not start with _F1_ is appended to a variable called buf (aka buffer), followed by the output record separator, but not printed
-> {if (/^_F1_/){printf "%s", buf; f=0; buf=""} : When a line starting with _F1_ is encountered while f is set, print everything that’s been added to the buf variable, set flag f to zero (we are no longer interested in the lines that follow), and empty buf so the next block starts fresh

So basically, extract content between the two labels, and print it at once when label _F1_ is hit.
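
For readability, the same logic can be spread over several lines with comments. This is only a sketch of the one-liner above, run against the first example file; the variable names input and between are made up for the illustration.

```shell
# The first example file, held in a variable.
input='_F2_STUFF
1
2
3
4
_F1_STUFF
a
b
c
d'

between=$(printf '%s\n' "$input" | awk '
  f {
    if (/^_F1_/) {            # closing label: flush the buffer and reset
      printf "%s", buf
      f = 0
      buf = ""
    } else {
      buf = buf $0 ORS        # inside the block: collect, do not print yet
    }
  }
  /^_F2_/ { f = 1; next }     # opening label: collect from the next line on
')

printf '%s\n' "$between"
```

This prints 1 through 4. A side effect of buffering is that a block opened by _F2_ but never closed by _F1_ is not printed at all, since nothing flushes buf at end of input.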

This also works for files that are less structured, where the labelled blocks are surrounded by other text, e.g.

Some stuff
_F2_
Line A
Line B
Line C
_F1_
Other stuff
_F2_
Line D
Line E
_F1_
Other other stuff

Running this awk expression should extract:

Line A
Line B
Line C
Line D
Line E
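
As a starting point, the following sketch writes the sample above to a temporary file and runs the one-liner against it. The use of mktemp and the variable names file and extracted are choices made for this illustration only.

```shell
# Write the less structured sample to a temporary file.
file=$(mktemp)
cat > "$file" <<'EOF'
Some stuff
_F2_
Line A
Line B
Line C
_F1_
Other stuff
_F2_
Line D
Line E
_F1_
Other other stuff
EOF

# Buffer the lines between each _F2_/_F1_ pair and print them when _F1_ is hit.
extracted=$(awk 'f{if (/^_F1_/){printf "%s", buf; f=0; buf=""} else {buf = buf $0 ORS}}; /^_F2_/{f=1; next}' "$file")
rm -f "$file"

printf '%s\n' "$extracted"
```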

Give it a try!