Displaying a histogram with awk

This short post demonstrates how to display a histogram in a terminal-friendly way with awk, something I keep finding useful when dealing with csv files.

Context

Histograms are an excellent way to quickly spot skewness or outliers that a simple average hides. In data-driven work, they make a nice first pass to check whether a data file looks correct. A massive spike at a single value might indicate that some data is being truncated, while seemingly impossible values might indicate a logical error in the data generation itself.

What we really want is to keep the output friendly to the eye. If we simply print one character for each occurrence of a value, the bars may wrap across several lines, which defeats the purpose of a quick visualization. The solution is to fix a maximum length for the longest “bar” of the histogram and scale all other bars against it.
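
As a minimal sketch of that scaling step, the snippet below computes bar lengths for three of the counts from the sample run further down, assuming a maximum width of 50. The scale_bar helper is a hypothetical name used only for illustration. It multiplies before dividing (count * width / max), which is mathematically the same as the expression used in this post but avoids the largest bar coming up one character short through floating-point rounding.

```shell
# Hypothetical helper: print the bar length for one count.
# $1 = count, $2 = largest count, $3 = maximum bar width
scale_bar() {
    awk -v count="$1" -v max="$2" -v width="$3" 'BEGIN {
        # The largest count maps to exactly `width`;
        # smaller counts shrink in proportion and are truncated by int().
        print int(count * width / max)
    }'
}

scale_bar 332 332 50   # prints 50
scale_bar 298 332 50   # prints 44
scale_bar 47 332 50    # prints 7
```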

The Function

For brevity, the data is just a randomly generated csv file, produced with the gen_csv.awk script:

root@debian-test:~# awk -v columns=2 -v digits=0.5 -v rows=10 -f gen_csv.awk
1,1
2,1
0,2
1,2
2,0
2,2
2,0
3,2
1,2
3,0

With significantly more rows, we get a result where scaling is required:

root@debian-test:~# awk -v columns=2 -v digits=0.5 -v rows=1000 -f gen_csv.awk | awk 'BEGIN{FS=",";}{value=$2; a[value]++; if (a[value]>max) {max=a[value];}}END{printf("Value\tFrequency"); for(i in a) {printf("\n%s\t%s\t",i,a[i]); for(j=0;j<(int(a[i]*(50/max)));j++) {printf("#");}} print "";}'
Value   Frequency
0       298     ############################################
1       332     ##################################################
2       323     ################################################
3       47      #######

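This is a pretty awkward expression for a one-liner, so let’s unpack it step by step. The BEGIN block sets the field separator to a comma. For every row, the value of the second column is used as a key into the array a, its count is incremented, and max keeps track of the highest count seen so far. The END block prints the header, then for each value prints the value, its frequency, and a bar of int(a[i]*(50/max)) “#” characters, so the longest bar is capped at 50 characters and every other bar is scaled in proportion.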

The value of 50 is just something that makes sense for me, as I often work in a tiling window manager setup with split windows. You can adjust this value to your comfort.
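
One way to make that adjustment without editing the script is to pass the width in with -v, the same mechanism the data generator above uses for its own parameters. This is only a sketch: the hist wrapper and the width variable are names made up for this example, and 50 stays the default.

```shell
# Hypothetical wrapper: read csv on stdin, take the bar width as $1 (default 50).
hist() {
    awk -F, -v width="${1:-50}" '
    # Count each value of the second column and track the highest count.
    { a[$2]++; if (a[$2] > max) max = a[$2] }
    END {
        printf("Value\tFrequency")
        for (i in a) {
            printf("\n%s\t%s\t", i, a[i])
            # Scale each bar against the configurable width instead of a fixed 50.
            for (j = 0; j < int(a[i] * width / max); j++) printf("#")
        }
        print ""
    }'
}

# Example: three rows with value 0 and one with value 1, bars at most 20 wide.
printf '1,0\n2,0\n3,0\n4,1\n' | hist 20
```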

If you already know from operational experience that a certain value, or range of values, will be uninteresting, you can always wrap the counting logic (value=$2 and the lines that depend on it) inside an appropriate if statement.
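
A small sketch of such a filter follows, showing only the counting stage. The condition ($2 != 3) and the name count_filtered are arbitrary choices for illustration. Note that the increment sits inside the if as well: if only the assignment were wrapped, a skipped row would still bump the count of whatever value was seen last.

```shell
# Hypothetical example: count second-column values, ignoring rows where it is 3.
count_filtered() {
    awk 'BEGIN { FS = "," }
    {
        if ($2 != 3) {      # assumed filter; adjust the condition to your data
            value = $2
            a[value]++      # counted only for rows that pass the filter
        }
    }
    END { for (i in a) printf("%s\t%s\n", i, a[i]) }' | sort -n
}

# The row "2,3" is skipped, leaving two rows with value 0 and one with value 1.
printf '1,0\n2,3\n3,0\n4,1\n' | count_filtered
```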

For clarity, here’s the full snippet unraveled:

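#!/bin/awk -f

# Split each row on commas.
BEGIN{
    FS=",";
}
# Count how often each value of the second column occurs,
# and remember the highest count seen so far.
{
    value=$2;
    a[value]++;
    if (a[value]>max) {
        max=a[value];
    }
}
# Print one row per value: the value, its frequency, and a bar
# scaled so that the longest bar is at most 50 characters.
END{
    printf("Value\tFrequency");
    for(i in a) {
        printf("\n%s\t%s\t",i,a[i]);
        for(j=0; j<(int(a[i]*(50/max))); j++) {
            printf("#");
        }
    }
    print "";
}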
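
One caveat: awk does not guarantee any particular order for for (i in a), so the bars can come out in a different order than in the sample output above. If that matters, one possible approach is to print the header separately and pipe the rows through sort, sketched below. Here sorted_hist is a hypothetical name, and sort -n assumes the values are numeric.

```shell
# Hypothetical wrapper: same histogram, with rows sorted numerically by value.
sorted_hist() {
    # Header goes out first, so it is not mixed into the sorted rows.
    printf 'Value\tFrequency\n'
    awk -F, '
    { a[$2]++; if (a[$2] > max) max = a[$2] }
    END {
        for (i in a) {
            # Build the whole bar first so each row is a single line for sort.
            bar = ""
            for (j = 0; j < int(a[i] * 50 / max); j++) bar = bar "#"
            printf("%s\t%s\t%s\n", i, a[i], bar)
        }
    }' | sort -n
}

printf '1,2\n2,0\n3,2\n4,1\n' | sorted_hist
```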

Conclusion

This simple one-liner can be useful as a rough first pass over a newly acquired csv file, to check for (the absence of) some obvious errors. It has served me well in building and debugging data pipelines, especially in SSH-only contexts.