Intermediate awk: Log Slicing and Label-Based Extraction
It’s a well-known feature of awk that it can search through text files with expressions like:
awk '/pattern/' file
But sometimes we want to grab a certain slice of a file, such as a log file: it’s common to only want to look at what happened during certain hours or minutes. Fortunately, awk supports range expressions of the form /start/,/end/, which match every line from the first one matching start through the next one matching end, inclusive.
$ awk '/2024-11-01 00/,/2024-11-01 01/' maxscale.log.5
2024-11-01 00:00:01.461 notice : (mxb_log_rotate): Log rotation complete
2024-11-01 00:00:01.461 notice : (maxscale_log_info_blurb): Host: 'linuxpc' OS: Linux@5.15.0-88-generic, #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023, x86_64 with 32 processor cores (32.00 available).
2024-11-01 00:00:01.461 notice : (maxscale_log_info_blurb): Total main memory: 62.73GiB (62.73GiB usable).
2024-11-01 00:00:01.461 notice : (maxscale_log_info_blurb): MaxScale is running in process 1361
2024-11-01 00:00:01.461 notice : (maxscale_log_info_blurb): MariaDB MaxScale 24.02.1 (Commit: 68459d35d45c9b6590b88f1ab603e64bd884af13)
2024-11-01 01:46:52.669 warning: [mariadbmon] (check_semisync_master_status): Failed to query semi-sync status of server 'server1': Query 'SELECT c.VARIABLE_VALUE, s.VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_VARIABLES c JOIN INFORMATION_SCHEMA.GLOBAL_STATUS s ON(c.VARIABLE_NAME = 'rpl_semi_sync_master_enabled' AND s.VARIABLE_NAME = 'rpl_semi_sync_master_status')' failed: 'Can't connect to server on '192.168.2.18' (115)'.
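To see the inclusive behaviour in isolation, here is a minimal sketch on a made-up log. The log lines and the temporary file are invented for illustration and are not taken from MaxScale. The range opens at the first line matching the start pattern and closes at, and includes, the first line matching the end pattern; later lines from hour 01 are not printed.

```shell
# Hypothetical sample log, written to a temporary file for the demo.
log=$(mktemp)
cat > "$log" <<'EOF'
2024-10-31 23:59:58.000 notice : last line of the previous day
2024-11-01 00:00:01.000 notice : first line of hour 00
2024-11-01 00:30:00.000 notice : second line of hour 00
2024-11-01 01:46:52.000 warning: first line of hour 01
2024-11-01 01:50:00.000 warning: second line of hour 01
EOF

# Range pattern: from the first "00" line through the first "01" line.
range_out=$(awk '/2024-11-01 00/,/2024-11-01 01/' "$log")
printf '%s\n' "$range_out"
rm -f "$log"
```

Only the two hour-00 lines and the first hour-01 line come out. If no line ever matches the end pattern, the range simply runs to the end of the file.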
If the line matching the ending pattern is not desired (for example, because we just want to grab this startup message from MaxScale), it’s always possible to exclude it from being printed:
$ awk '/2024-11-01 00/,/2024-11-01 01/{if (!/2024-11-01 01/)print $0}' maxscale.log.5
2024-11-01 00:00:01.461 notice : (mxb_log_rotate): Log rotation complete
2024-11-01 00:00:01.461 notice : (maxscale_log_info_blurb): Host: 'linuxpc' OS: Linux@5.15.0-88-generic, #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023, x86_64 with 32 processor cores (32.00 available).
2024-11-01 00:00:01.461 notice : (maxscale_log_info_blurb): Total main memory: 62.73GiB (62.73GiB usable).
2024-11-01 00:00:01.461 notice : (maxscale_log_info_blurb): MaxScale is running in process 1361
2024-11-01 00:00:01.461 notice : (maxscale_log_info_blurb): MariaDB MaxScale 24.02.1 (Commit: 68459d35d45c9b6590b88f1ab603e64bd884af13)
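A different technique that sidesteps the end pattern altogether is to test the timestamp fields directly instead of using a range. This is only a sketch: it assumes every line of interest starts with the date in field 1 and the time in field 2, as in the MaxScale output above, and the sample lines are invented.

```shell
# Hypothetical sample log; field 1 is the date, field 2 the time.
log=$(mktemp)
cat > "$log" <<'EOF'
2024-10-31 23:59:58.000 notice : last line of the previous day
2024-11-01 00:00:01.000 notice : first line of hour 00
2024-11-01 00:30:00.000 notice : second line of hour 00
2024-11-01 01:46:52.000 warning: first line of hour 01
EOF

# Keep lines whose date matches and whose time starts with hour "00".
hour_out=$(awk '$1 == "2024-11-01" && substr($2, 1, 2) == "00"' "$log")
printf '%s\n' "$hour_out"
rm -f "$log"
```

The trade-off is that lines without a leading timestamp, such as the continuation lines of a multi-line message, are skipped by this test, whereas the range expression keeps them.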
That’s simple enough, but let’s consider a more interesting scenario. As awk is often used for text processing, a common problem is that some process produces output of which we only want a very specific portion, without including the search patterns themselves.
Let’s consider an example file like:
_F2_STUFF
1
2
3
4
_F1_STUFF
a
b
c
d
How would we go about grabbing only what’s after _F1_STUFF, without printing it, or anything else that comes before it?
awk '/^_F1_/{f=1;next}/^_F2_/{f=0;next}f' file
Let’s break down what this actually does:
-> /^_F1_/{f=1;next}: if the current line begins with _F1_, set flag f to 1, then call next to skip any further action on the current line. If we were to slap a print $0 before next, then this half of the expression would print _F1_STUFF. With this little trick, we’re essentially keeping track of, and marking, the lines we’re interested in printing later on.
-> /^_F2_/{f=0;next}: if the current line begins with _F2_, set flag f to 0. This means that we’re not interested in printing anything that comes after _F2_. The next skips any further action on the _F2_ line itself.
-> f: a pattern with no action, so whenever f is non-zero the default action (print the line) runs. This is what triggers the printing of lines. (Actually in this simple case, if you want, you can put f at the beginning too. It doesn’t really matter here because no _F2_ line follows the _F1_ block in this file, but it looks “cleaner” if it’s at the end.)
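To check this against the example file, here is a small sketch that recreates it in a temporary file and runs the flag one-liner twice: once as written, and once with the f test moved to the front. The variable names and the temporary file are just for the demo.

```shell
# Recreate the example file from the article in a temporary file.
sample=$(mktemp)
printf '%s\n' _F2_STUFF 1 2 3 4 _F1_STUFF a b c d > "$sample"

# Flag test last, as in the one-liner.
after_f1=$(awk '/^_F1_/{f=1;next} /^_F2_/{f=0;next} f' "$sample")

# Flag test first: the label line is seen while f is still 0,
# so it is not printed either.
after_f1_first=$(awk '
f
/^_F1_/ { f = 1; next }
/^_F2_/ { f = 0; next }
' "$sample")

printf '%s\n' "$after_f1"
rm -f "$sample"
```

Both variants print a, b, c and d on this file. They would differ if an _F2_ line appeared after the _F1_ block: with f tested first, that _F2_ line would be printed before the flag is cleared.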
Ok, but what if we wanted to print the stuff between _F2_STUFF and _F1_STUFF without printing these labels?
The process is very similar, but a little bit more involved:
awk 'f{if (/^_F1_/){printf "%s", buf; f=0; buf=""} else {buf = buf $0 ORS}}; /^_F2_/{f=1; next}' file
Here the logic needs to be somewhat the reverse of the previous case, since the stuff we want to actually print is at the beginning. So, best to follow the logic from the end:
-> /^_F2_/{f=1; next}: when we see a line that begins with _F2_, set flag f to 1, but don’t print the current line.
-> else {buf = buf $0 ORS}: while f is set, each following line is added to a variable called buf (aka buffer), followed by the output record separator ORS (a newline by default), but not printed.
-> if (/^_F1_/){printf "%s", buf; f=0; buf=""}: when a line starting with _F1_ is encountered while f is set, print everything that’s been added to the buf variable, set flag f to zero (we are no longer interested in these lines), and empty buf for the next block.
So basically, extract content between the two labels, and print it at once when label _F1_ is hit.
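The same logic can be spread over several lines with comments, which some may find easier to follow. This is a sketch run against a temporary copy of the first example file; the variable names are only for the demo.

```shell
# Recreate the first example file in a temporary file.
sample=$(mktemp)
printf '%s\n' _F2_STUFF 1 2 3 4 _F1_STUFF a b c d > "$sample"

between=$(awk '
# While inside a block (f is set), either flush or keep buffering.
f {
    if (/^_F1_/) {          # closing label: emit the buffer and reset
        printf "%s", buf
        f = 0
        buf = ""
    } else {                # ordinary line: append it plus ORS
        buf = buf $0 ORS
    }
}
# Opening label: start collecting from the next line on.
/^_F2_/ { f = 1; next }
' "$sample")

printf '%s\n' "$between"
rm -f "$sample"
```

On this file it prints 1, 2, 3 and 4, and nothing from the a to d block.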
This should also work for files that are less structured, e.g.:
Some stuff
_F2_
Line A
Line B
Line C
_F1_
Other stuff
_F2_
Line D
Line E
_F1_
Other other stuff
Running this awk expression should extract:
Line A
Line B
Line C
Line D
Line E
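One consequence of buffering is worth knowing: a block that is opened by _F2_ but never closed by _F1_ is silently dropped, because buf is only printed when the closing label is seen. The sketch below uses an invented input to show this, plus one possible way (an assumption about what you might want, not part of the original expression) to flush the leftover buffer with an END rule.

```shell
# Invented input: one complete block, then a block with no closing _F1_.
input=$(mktemp)
printf '%s\n' _F2_ 'Line A' _F1_ _F2_ 'Line B' > "$input"

# Original expression: the unterminated block never reaches the output.
strict=$(awk 'f{if (/^_F1_/){printf "%s", buf; f=0; buf=""} else {buf = buf $0 ORS}}; /^_F2_/{f=1; next}' "$input")

# Variant: an END rule prints whatever is still sitting in buf.
lenient=$(awk 'f{if (/^_F1_/){printf "%s", buf; f=0; buf=""} else {buf = buf $0 ORS}}; /^_F2_/{f=1; next} END{printf "%s", buf}' "$input")

printf 'strict:\n%s\n' "$strict"
printf 'lenient:\n%s\n' "$lenient"
rm -f "$input"
```

The first run yields only Line A, while the second also yields Line B. Which behaviour is right depends on whether a missing closing label should count as an error in your data.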
Give it a try!