Displaying a histogram with awk

This short post demonstrates how to display a histogram in a terminal-friendly way, something I keep finding useful when dealing with csv files.

Context

Histograms are an excellent way to quickly visualize skewness or outliers that a simple average would hide. In data-driven work, they make a nice first pass to check whether a data file looks correct. A massive spike at a single value might indicate that some data is being truncated, and seemingly impossible values might indicate a logical error in the data generation itself.

What we really want is to keep the output friendly to the eye. If we simply print one character per occurrence in each group, the bars for large counts will wrap across several lines, which makes a quick and friendly visualization impossible. The solution is to fix a maximum length for the longest “bar” of the histogram, and scale the other bars against that maximum.
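As a rough sketch of that scaling, the snippet below takes a list of counts and prints the bar length each one would get. The names bar_lengths and width are chosen for this example only. The expression is written as count * width / max, which is the same quantity as a bar width multiplied by (count / max), just with the multiplication done first.

```sh
# Sketch: scale bar lengths so the largest count maps to "width" characters.
# bar_lengths and width are illustrative names, not part of the one-liner.
bar_lengths() {
    awk -v width=50 '
        # remember every count and track the largest one
        { count[NR] = $1; if ($1 > max) max = $1 }
        END {
            for (n = 1; n <= NR; n++)
                # scale against the maximum, then truncate to whole characters
                printf("%d -> %d\n", count[n], int(count[n] * width / max))
        }'
}

printf '298\n332\n323\n47\n' | bar_lengths
# 298 -> 44
# 332 -> 50
# 323 -> 48
# 47 -> 7
```

The largest count always maps to the full width, and every other bar keeps its proportion to it.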

The Function

For brevity, the data is just a randomly generated csv file, produced by the gen_csv.awk script:

root@debian-test:~# awk -v columns=2 -v digits=0.5 -v rows=10 -f gen_csv.awk
1,1
2,1
0,2
1,2
2,0
2,2
2,0
3,2
1,2
3,0

Increasing the number of rows significantly should produce a result where scaling is required:

root@debian-test:~# awk -v columns=2 -v digits=0.5 -v rows=1000 -f gen_csv.awk | awk 'BEGIN{FS=",";}{value=$2; a[value]++; if (a[value]>max) {max=a[value];}}END{printf("Value\tFrequency"); for(i in a) {printf("\n%s\t%s\t",i,a[i]); for(j=0;j<(int(a[i]*(50/max)));j++) {printf("#");}} print "";}'
Value   Frequency
0       298     ############################################
1       332     ##################################################
2       323     ################################################
3       47      #######

This is a pretty awkward expression for a one-liner, so let’s unpack it step by step:

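-> BEGIN{FS=",";} csv files are delimited by , so we set the field separator to this. For tsv files, use FS="\t" instead, or skip this step if the fields contain no spaces, since awk splits on whitespace by default.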
-> value=$2 the choice of column that we want to create our frequency of occurrence histogram out of
-> a[value]++; the values are put into array a, the array is indexed by the value, and the array element itself is the number of times each index was found
-> if (a[value]>max) {max=a[value];} we want to know the highest frequency that we found, since we will scale the histogram against it
-> END{printf("Value\tFrequency"); finished processing the file, print the headers
-> for (i in a) loop through the array a by index
-> printf("\n%s\t%s\t",i,a[i]); print the data we acquired: i -> the value of interest, a[i] -> frequency of the value of interest
-> for(j=0;j<(int(a[i]*(50/max)));j++) {printf("#");} print the histogram bars. Without scaling, the number of runs in this for loop would equal the frequency itself, and the highest frequency of occurrence of any of the values is 332 in this case. Printing 332 # characters would be wider than what we can comfortably display, so the factor 50/max is the solution. It scales every bar so that the longest one is 50 characters, and int() truncates the result to a whole number of characters.

The value of 50 is just something that makes sense for me, as I often work in a tiling window manager setup with split windows. You can adjust this value to your comfort.
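To avoid editing the one-liner each time, the width can be passed in with -v. This is only a sketch: the histogram function name and the width variable are assumptions made for this example, with 50 kept as the default.

```sh
# Sketch: same histogram, with the bar width as an optional first argument.
# histogram and width are names chosen for this example (default width: 50).
histogram() {
    awk -v width="${1:-50}" 'BEGIN { FS = "," }
    # count occurrences of column 2 and track the highest count
    { a[$2]++; if (a[$2] > max) max = a[$2] }
    END {
        printf("Value\tFrequency")
        for (i in a) {
            printf("\n%s\t%s\t", i, a[i])
            # longest bar is "width" characters, the rest scale against it
            for (j = 0; j < int(a[i] * (width / max)); j++) printf("#")
        }
        print ""
    }'
}

# Four rows with value a, two with value b, bars scaled to 20 characters
printf '1,a\n2,a\n3,a\n4,a\n5,b\n6,b\n' | histogram 20
```

With this input, the bar for a is 20 characters long and the bar for b is 10.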

If you already know from experience that a certain value, or range of values, will be uninteresting, you can wrap the body of the main block (the value=$2 assignment together with the counting that follows it) in an appropriate if statement.
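A minimal sketch of such a filter follows. The function name histogram_filtered and the range 1 to 3 are arbitrary choices for illustration. Note that the if wraps the counting as well, otherwise the skipped rows would still be counted.

```sh
# Sketch: count only the values of interest.
# histogram_filtered and the 1..3 range are illustrative assumptions.
histogram_filtered() {
    awk 'BEGIN { FS = "," }
    {
        value = $2
        # the filter wraps the counting, not just the assignment
        if (value >= 1 && value <= 3) {
            a[value]++
            if (a[value] > max) max = a[value]
        }
    }
    END {
        printf("Value\tFrequency")
        for (i in a) {
            printf("\n%s\t%s\t", i, a[i])
            for (j = 0; j < int(a[i] * (50 / max)); j++) printf("#")
        }
        print ""
    }'
}

# The rows with values 0 and 5 are ignored
printf 'x,0\nx,1\nx,1\nx,2\nx,5\n' | histogram_filtered
```

Since max is only updated for values that pass the filter, the bars are scaled against the largest remaining frequency.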

For clarity, here’s the full snippet unraveled:

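#!/bin/awk -f

# Split each input line on commas (csv input)
BEGIN{
    FS=",";
}
# For every line: count occurrences of the chosen column and track the highest count
{
    value=$2;
    a[value]++;
    if (a[value]>max) {
        max=a[value];
    }
}
# After the last line: print each value, its frequency, and a bar scaled to at most 50 characters
END{
    printf("Value\tFrequency");
    for(i in a) {
        printf("\n%s\t%s\t",i,a[i]);
        for(j=0; j<(int(a[i]*(50/max))); j++) {
            printf("#");
        }
    }
    print "";
}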
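One caveat: awk does not guarantee the order in which for (i in a) visits the array, so the rows may not always come out sorted. If that matters, one possible workaround is to print the header separately and pipe the rows through sort. The sketch below assumes the values are numeric. The function name sorted_histogram is made up for this example.

```sh
# Sketch: same histogram, with rows sorted numerically by value.
# sorted_histogram is an illustrative name; sort -n assumes numeric values.
sorted_histogram() {
    # header is printed outside awk so it stays on top after sorting
    printf 'Value\tFrequency\n'
    awk 'BEGIN { FS = "," }
    { a[$2]++; if (a[$2] > max) max = a[$2] }
    END {
        for (i in a) {
            printf("%s\t%s\t", i, a[i])
            for (j = 0; j < int(a[i] * (50 / max)); j++) printf("#")
            print ""   # end each row with a newline so sort sees whole rows
        }
    }' | sort -n
}

printf 'x,3\nx,1\nx,1\nx,2\nx,2\nx,2\nx,2\n' | sorted_histogram
```

The only change to the END block is that each row now ends with its own newline instead of starting with one.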

Conclusion

This simple one-liner can be useful as a rough first pass on an acquired csv file, to check for (the absence of) some obvious errors. It has served me well in building and debugging data pipelines, especially in SSH-only contexts.