Intermediate awk: Log Slicing and Label-Based Extraction

It’s a well-known feature of awk that it can search through text files with expressions like:

awk '/pattern/' file

But sometimes we want to grab a particular slice of a file. With a log file, for instance, it’s common to only care about what happened during certain hours or minutes. Fortunately, awk supports range patterns, written as two patterns separated by a comma.

$ awk '/2024-11-01 00/,/2024-11-01 01/' maxscale.log.5
2024-11-01 00:00:01.461   notice : (mxb_log_rotate): Log rotation complete
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): Host: 'linuxpc' OS: Linux@5.15.0-88-generic, #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023, x86_64 with 32 processor cores (32.00 available).
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): Total main memory: 62.73GiB (62.73GiB usable).
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): MaxScale is running in process 1361
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): MariaDB MaxScale 24.02.1 (Commit: 68459d35d45c9b6590b88f1ab603e64bd884af13)
2024-11-01 01:46:52.669   warning: [mariadbmon] (check_semisync_master_status): Failed to query semi-sync status of server 'server1': Query 'SELECT c.VARIABLE_VALUE, s.VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_VARIABLES c JOIN INFORMATION_SCHEMA.GLOBAL_STATUS s ON(c.VARIABLE_NAME = 'rpl_semi_sync_master_enabled' AND s.VARIABLE_NAME = 'rpl_semi_sync_master_status')' failed: 'Can't connect to server on '192.168.2.18' (115)'.
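A range switches on at the first line that matches the start pattern and switches off after the next line that matches the end pattern, which is why the first 01:xx line shows up in the output. If the end pattern never matches, the range simply runs to the end of the file. The sketch below shows the same behaviour on a small made-up log (the file name sample.log and its contents are invented for illustration, they are not taken from maxscale.log.5). It also anchors both patterns with ^, on the assumption that every entry starts with its timestamp, so that a date quoted inside a message cannot open or close the range.

```shell
# Hypothetical sample log, invented for this illustration only.
tmp=$(mktemp -d)
cat > "$tmp/sample.log" <<'EOF'
2024-10-31 23:59:58.100   notice : before the range
2024-11-01 00:00:01.461   notice : first line of hour 00
2024-11-01 00:30:00.000   notice : still in hour 00
2024-11-01 01:46:52.669   warning: first line of hour 01
2024-11-01 01:50:00.000   notice : after the range
EOF

# ^ ties each pattern to the timestamp at the start of the line.
range_out=$(awk '/^2024-11-01 00/,/^2024-11-01 01/' "$tmp/sample.log")
printf '%s\n' "$range_out"
rm -rf "$tmp"
```

This should print the two hour-00 lines followed by the first hour-01 line, and nothing else.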

If the ending line is not wanted (for example, because we only want to grab the startup messages from MaxScale), it’s always possible to exclude the end pattern from being printed:

$ awk '/2024-11-01 00/,/2024-11-01 01/{if (!/2024-11-01 01/)print $0}' maxscale.log.5
2024-11-01 00:00:01.461   notice : (mxb_log_rotate): Log rotation complete
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): Host: 'linuxpc' OS: Linux@5.15.0-88-generic, #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023, x86_64 with 32 processor cores (32.00 available).
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): Total main memory: 62.73GiB (62.73GiB usable).
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): MaxScale is running in process 1361
2024-11-01 00:00:01.461   notice : (maxscale_log_info_blurb): MariaDB MaxScale 24.02.1 (Commit: 68459d35d45c9b6590b88f1ab603e64bd884af13)
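A different way to cut out a time window, offered here only as an alternative sketch, is to compare the timestamp fields directly instead of using a range. This assumes the date is in field 1 and the time in field 2, both zero-padded, so that plain string comparison sorts them chronologically. The sample log below is made up for the demonstration.

```shell
# Hypothetical sample log, invented for this illustration only.
tmp=$(mktemp -d)
cat > "$tmp/sample.log" <<'EOF'
2024-11-01 00:05:00.000   notice : too early
2024-11-01 00:15:00.000   notice : inside the window
2024-11-01 00:44:59.999   notice : also inside
2024-11-01 00:45:00.000   notice : too late
2024-11-02 00:20:00.000   notice : wrong day
EOF

# $1 is the date and $2 the time. Comparing against string constants
# forces string comparison, which orders fixed-width timestamps correctly.
window_out=$(awk '$1 == "2024-11-01" && $2 >= "00:15" && $2 < "00:45"' "$tmp/sample.log")
printf '%s\n' "$window_out"
rm -rf "$tmp"
```

Unlike a range, this does not depend on a line existing at either boundary, and the end of the window is excluded without a second test. The trade-off is that every line is judged on its own, so continuation lines that carry no timestamp of their own are dropped.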

That’s simple enough, but let’s consider a more interesting scenario. Since awk is so often used for text processing, a common problem is that some process produces output from which we want only a very specific portion, without including the marker lines that delimit it.

Let’s consider an example file like:

_F2_STUFF
1
2
3
4
_F1_STUFF
a
b
c
d

How would we go about grabbing only what comes after _F1_STUFF, without printing the label itself or anything that comes before it?

awk '/^_F1_/{f=1;next}/^_F2_/{f=0;next}f' file

Let’s break down what this actually does:
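One way to read the one-liner is to spread it over several lines with a comment on each rule. The sketch below does that and runs it against a copy of the example file. The temporary directory and the shell variable after_f1 are only scaffolding for the demonstration.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file" <<'EOF'
_F2_STUFF
1
2
3
4
_F1_STUFF
a
b
c
d
EOF

after_f1=$(awk '
    # Start label: raise the flag, then skip the label line itself.
    /^_F1_/ { f = 1; next }
    # Other label: lower the flag, again skipping the label line.
    /^_F2_/ { f = 0; next }
    # A bare pattern with no action prints the line when f is non-zero.
    f
' "$tmp/file")
printf '%s\n' "$after_f1"
rm -rf "$tmp"
```

This should print a, b, c and d, one per line. The flag f starts out unset, which awk treats as false, so nothing is printed until the first _F1_ label has been seen.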

OK, but what if we wanted to print the lines between _F2_STUFF and _F1_STUFF without printing the labels themselves? The process is very similar, but a little more involved:

awk 'f{if (/^_F1_/){printf "%s", buf; f=0; buf=""} else {buf = buf $0 ORS}}; /^_F2_/{f=1; next}' file

Here the logic is roughly the reverse of the previous case, since the content we actually want to print comes first. So, it’s best to follow the logic from the end:
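Laid out over several lines, with the buffer reset written as buf = "", the script can be read from the last rule upward. This is a commented sketch run against a copy of the first example file. The temporary directory and the shell variable between are scaffolding for the demonstration.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file" <<'EOF'
_F2_STUFF
1
2
3
4
_F1_STUFF
a
b
c
d
EOF

between=$(awk '
    # Rule 1 only runs while f is set, i.e. after a _F2_ label was seen.
    f {
        if (/^_F1_/) {
            # Closing label: emit everything buffered, then reset.
            printf "%s", buf
            f = 0
            buf = ""
        } else {
            # Ordinary line: append it plus the output record separator.
            buf = buf $0 ORS
        }
    }
    # Rule 2: the opening label sets the flag. Because this rule comes
    # after rule 1, the label line is seen while f is still 0 and is
    # never added to the buffer.
    /^_F2_/ { f = 1; next }
' "$tmp/file")
printf '%s\n' "$between"
rm -rf "$tmp"
```

This should print 1 through 4, one per line. Because output is held in buf until the closing label arrives, a _F2_ block that is never closed by _F1_ prints nothing.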

So basically, the script collects the content between the two labels and prints it all at once when the _F1_ label is hit.

This should also work for files that are less structured, e.g.

Some stuff
_F2_
Line A
Line B
Line C
_F1_
Other stuff
_F2_
Line D
Line E
_F1_
Other other stuff

Running the same awk command on this file should extract:

Line A
Line B
Line C
Line D
Line E
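To make that easy to reproduce, here is a small harness that writes the less structured sample to a temporary file (the name mixed.txt is arbitrary) and runs the buffering one-liner on it. For comparison it also runs a shorter variant that prints lines as they arrive instead of buffering them. That variant is not part of the approach above, just a sketch of the trade-off.

```shell
tmp=$(mktemp -d)
cat > "$tmp/mixed.txt" <<'EOF'
Some stuff
_F2_
Line A
Line B
Line C
_F1_
Other stuff
_F2_
Line D
Line E
_F1_
Other other stuff
EOF

# Buffering version: a block is printed only once its closing _F1_ is seen.
extracted=$(awk 'f{if (/^_F1_/){printf "%s", buf; f=0; buf=""} else {buf = buf $0 ORS}}; /^_F2_/{f=1; next}' "$tmp/mixed.txt")

# Streaming variant: lower the flag on _F1_, print while the flag is set,
# and raise the flag on _F2_ only after the print rule has run.
streamed=$(awk '/^_F1_/{f=0} f; /^_F2_/{f=1}' "$tmp/mixed.txt")

printf '%s\n' "$extracted"
rm -rf "$tmp"
```

On this input both versions should give the same five lines. They differ only when a _F2_ label is never followed by _F1_: the streaming variant prints that trailing block anyway, while the buffering version stays silent.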

Give it a try!