Pivoting tables in MariaDB (and awk)
While raw data can be useful, it’s not uncommon for Business Analysts, Project Managers, Product Owners, etc. to request data in different formats. Pivoting tables is a common request, because often a different perspective on the data is more useful than the raw output of a SELECT *.
In this article, we’ll look at a few examples on dealing with common patterns that may appear in reporting tasks.
The Problem
Consider the following table: you already did the hard part, and counted up the number of times a product was traded:
product_id,culture,year,no_of_trades
1,UK,2022,100
2,DE,2011,50
3,UK,2013,93
4,ES,2024,204
5,ES,2024,78
6,AT,2022,34
7,NO,2022,11
8,PT,2025,60
9,ES,2025,149
10,DE,2018,182
11,IE,2016,55
12,GR,2020,11
13,NL,2014,201
14,FI,2017,5
15,DE,2020,107
Where
-- product_id = unique numerical id
-- culture = country issuing the product
-- year = year in which transaction happened
-- no_of_trades = number of trades for product per culture per year
Now, this example table has only 15 products, a handful of countries, and just a few columns, but of course, in reality, datasets will usually be much larger: orders of magnitude more products, dozens of countries, and years of history spanning decades.
The necessary SQL statements to re-create the test table will be provided in the appendix.
Summary by total trades for each culture
A simple first question would be “how many trades did we have in total, for each culture?”
Based on the table data of this problem, the following SQL statement should suffice:
SELECT
SUM(CASE WHEN culture='UK' THEN no_of_trades ELSE 0 END) AS UK,
SUM(CASE WHEN culture='DE' THEN no_of_trades ELSE 0 END) AS DE,
SUM(CASE WHEN culture='ES' THEN no_of_trades ELSE 0 END) AS ES,
SUM(CASE WHEN culture='AT' THEN no_of_trades ELSE 0 END) AS AT,
SUM(CASE WHEN culture='NO' THEN no_of_trades ELSE 0 END) AS NO,
SUM(CASE WHEN culture='PT' THEN no_of_trades ELSE 0 END) AS PT,
SUM(CASE WHEN culture='IE' THEN no_of_trades ELSE 0 END) AS IE,
SUM(CASE WHEN culture='NL' THEN no_of_trades ELSE 0 END) AS NL,
SUM(CASE WHEN culture='GR' THEN no_of_trades ELSE 0 END) AS GR,
SUM(CASE WHEN culture='FI' THEN no_of_trades ELSE 0 END) AS FI
FROM trades;
This works, but it’s not pleasant to maintain. Should a culture be added or removed, this SQL needs to be updated. If it’s part of a stored procedure, that likely means a full deployment approval and process.
However, if you are creating a report, and the user does not need to execute the queries, you can simplify the problem with a scripting language like awk and pipe the result into mailx:
awk 'BEGIN{FS=","; OFS=","; print "Culture,Sum of Trades"}(NR>1){a[$2]+=$4}END{for(x in a){print x,a[x]}}'
Output:
Culture,Sum of Trades
FI,5
NO,11
UK,193
AT,34
DE,339
PT,60
ES,431
GR,11
NL,201
IE,55
Let’s break down what happens here:
-> The BEGIN block sets the Field Separator (FS) and the Output Field Separator (OFS) to "," and also prints the header of the result
-> The (NR>1) condition tells awk to skip/take no action on record number 1, the header line (NR==1)
-> The statement a[$2]+=$4 builds an array a keyed by the unique values in the second column (culture). Each time awk sees one of those cultures as it goes through the table, it adds the corresponding value from the fourth column (the number of trades) to that culture's running total
-> The END block is a simple for loop that tells awk to print each unique culture found in array a, along with its summed-up trades
A simple one-liner that does the whole pivot in a single pass, keeping only the running totals in memory, which makes it scale well for tasks like this. This is the approach that I use for these kinds of reporting tasks. Awk is almost always already installed on unix-like systems, so you can pipe query results straight into awk and set up the report via a simple cron job.
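As a self-contained sketch, here is the one-liner run against the sample data fed in through a here-document; in a real report, the query output would be piped in instead:

```shell
# Sum the trades (column 4) per culture (column 2), skipping the header line.
# The here-document stands in for the piped-in query output.
awk 'BEGIN{FS=","; OFS=","; print "Culture,Sum of Trades"}
(NR>1){a[$2]+=$4}
END{for (x in a) print x, a[x]}' <<'EOF'
product_id,culture,year,no_of_trades
1,UK,2022,100
2,DE,2011,50
3,UK,2013,93
4,ES,2024,204
5,ES,2024,78
6,AT,2022,34
7,NO,2022,11
8,PT,2025,60
9,ES,2025,149
10,DE,2018,182
11,IE,2016,55
12,GR,2020,11
13,NL,2014,201
14,FI,2017,5
15,DE,2020,107
EOF
```

One detail worth remembering: the mariadb/mysql client in batch mode (-B) emits tab-separated columns, so when feeding the awk step directly from the client, FS would need to be "\t" instead of ",".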
Summary by total trades for each year
Now the user might ask for a simple change: “Oh, sorry, I meant year, not culture.” In this case, the SQL query needs a lot of changes, since the WHEN conditions now need to refer to the years. In SQL this is getting tedious… again, what if the years go back to 1980? You’re probably already reaching for awk to generate the SQL you were otherwise going to write anyway:
SELECT
SUM(CASE WHEN year=2025 THEN no_of_trades ELSE 0 END) AS '2025',
SUM(CASE WHEN year=2024 THEN no_of_trades ELSE 0 END) AS '2024',
SUM(CASE WHEN year=2023 THEN no_of_trades ELSE 0 END) AS '2023'
...
FROM trades;
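Generating those WHEN lines is itself a one-pass awk job. A minimal sketch, reading the distinct years from a few sample rows via a here-document (table and column names as in the appendix; \047 is the octal escape for a single quote inside an awk string):

```shell
# Emit one SUM(CASE ...) line per distinct year found in the input.
awk 'BEGIN{FS=","}
(NR>1){y[$3]++}
END{
    for (v in y)
        printf "SUM(CASE WHEN year=%s THEN no_of_trades ELSE 0 END) AS \047%s\047,\n", v, v
}' <<'EOF'
product_id,culture,year,no_of_trades
1,UK,2022,100
2,DE,2011,50
4,ES,2024,204
8,PT,2025,60
EOF
```

The trailing comma on the last generated line still needs to be trimmed (or the last line handled separately) before wrapping the output in the SELECT ... FROM trades; boilerplate.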
In awk, the only change that needs to be made, other than a title update, is to swap out which column is used as the grouping key (the third column, year, instead of the second), and that’s it.
awk 'BEGIN{FS=","; OFS=","; print "Year,Sum of Trades"}(NR>1){a[$3]+=$4}END{for(x in a){print x,a[x]}}'
Output:
Year,Sum of Trades
2011,50
2013,93
2014,201
2016,55
2017,5
2018,182
2020,118
2022,145
2024,282
2025,209
Summary of year vs culture
Now, an interesting follow-up question would be to get a summary of year vs culture. A true two-dimensional overview of everything. This, again, is something that gets fairly tedious in SQL when the number of cultures is more than the handful in this article’s example, but here is the SQL solution:
SELECT year,
SUM(CASE WHEN culture='UK' THEN no_of_trades ELSE 0 END) AS UK,
SUM(CASE WHEN culture='DE' THEN no_of_trades ELSE 0 END) AS DE,
SUM(CASE WHEN culture='ES' THEN no_of_trades ELSE 0 END) AS ES,
SUM(CASE WHEN culture='AT' THEN no_of_trades ELSE 0 END) AS AT,
SUM(CASE WHEN culture='NO' THEN no_of_trades ELSE 0 END) AS NO,
SUM(CASE WHEN culture='PT' THEN no_of_trades ELSE 0 END) AS PT,
SUM(CASE WHEN culture='IE' THEN no_of_trades ELSE 0 END) AS IE,
SUM(CASE WHEN culture='NL' THEN no_of_trades ELSE 0 END) AS NL,
SUM(CASE WHEN culture='GR' THEN no_of_trades ELSE 0 END) AS GR,
SUM(CASE WHEN culture='FI' THEN no_of_trades ELSE 0 END) AS FI
FROM trades
GROUP BY year
ORDER BY year;
Output:
+------+------+------+------+------+------+------+------+------+------+------+
| year | UK | DE | ES | AT | NO | PT | IE | NL | GR | FI |
+------+------+------+------+------+------+------+------+------+------+------+
| 2011 | 0 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2013 | 93 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2014 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 201 | 0 | 0 |
| 2016 | 0 | 0 | 0 | 0 | 0 | 0 | 55 | 0 | 0 | 0 |
| 2017 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 2018 | 0 | 182 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020 | 0 | 107 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 |
| 2022 | 100 | 0 | 0 | 34 | 11 | 0 | 0 | 0 | 0 | 0 |
| 2024 | 0 | 0 | 282 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2025 | 0 | 0 | 149 | 0 | 0 | 60 | 0 | 0 | 0 | 0 |
+------+------+------+------+------+------+------+------+------+------+------+
10 rows in set (0.000 sec)
But can we reproduce this result in awk? Is it easy? Let’s compare.
Quick recap on printing two-dimensional results with awk
This case gets slightly more complicated when it comes to scripting, since unlike before, it is now necessary to take care of a proper two-dimensional output. The concept is not very difficult, even if the syntax might look a little daunting. The way to do this is to use loops: one loop to print the column headers, and then one nested loop to print out the rows.
Consider a simple awk script to print a small 2x2 table:
BEGIN{
    FS=",";
    OFS=",";
    # here is a synthetic 2x2 "report" with 4 elements
    t["A1"]=10;
    t["A2"]=20;
    t["B1"]=30;
    t["B2"]=40;
    rows[1]="A";
    rows[2]="B";
    cols[1]="1";
    cols[2]="2";
    printf "x";
    for (i=1; i<=length(cols); i++) {
        printf "%s%s", OFS, cols[i];
    }
    print "";
    for (j=1; j<=length(rows); j++) {
        printf "%s", rows[j];
        for (i=1; i<=length(cols); i++) {
            printf "%s%s", OFS, t[rows[j] cols[i]];
        }
        print "";
    }
}
-> In the first loop, we just keep going for as many columns or fields as the data has, and print them sequentially. In this example, we have defined the “columns” synthetically, but in practice, we take a column of data from the input file, grab the unique values, and transpose them horizontally.
-> In the second, nested loop, we go row by row: the outer loop prints the first-column value of each row, and the inner loop then progresses one by one through the values of each column.
-> To know the number of rows or columns, we need a function that returns the number of elements of the array we feed into it; both length() and asorti() fulfill this purpose (note that calling length() on an array, like asorti() itself, is a GNU awk extension). asorti() is also useful if you wish to guarantee a fixed order of rows/columns.
Now going back to the problem, it should be more clear what is happening in the following snippet:
BEGIN{
    FS=",";
    OFS=",";
}
(NR>1) {
    t[$3$2]+=$4;
    yr[$3]++;
    c[$2]++;
}
END{
    n=asorti(c,c1);
    printf "x";
    for(x=1; x<=n; x++) {
        printf "%s%s", OFS, c1[x];
    }
    print "";
    n2=asorti(yr,c2);
    for(y=1; y<=n2; y++) {
        printf "%s", c2[y];
        for(z=1; z<=n; z++) {
            printf "%s%s", OFS, t[c2[y]c1[z]];
        }
        print "";
    }
}
Output:
x,AT,DE,ES,FI,GR,IE,NL,NO,PT,UK
2011,,50,,,,,,,,
2013,,,,,,,,,,93
2014,,,,,,,201,,,
2016,,,,,,55,,,,
2017,,,,5,,,,,,
2018,,182,,,,,,,,
2020,,107,,,11,,,,,
2022,34,,,,,,,11,,100
2024,,,282,,,,,,,
2025,,,149,,,,,,60,
The output will look like a big mess in the terminal, but you can redirect it to a csv and mail it off, and LibreOffice (or Excel) will import it as a spreadsheet just fine. If you wish to print a 0 where there is no value, you can force awk to do so by appending +0 to the looked-up value, in our case t[c2[y]c1[z]]+0, which coerces the empty string to a numeric 0. The output then looks like this:
x,AT,DE,ES,FI,GR,IE,NL,NO,PT,UK
2011,0,50,0,0,0,0,0,0,0,0
2013,0,0,0,0,0,0,0,0,0,93
2014,0,0,0,0,0,0,201,0,0,0
2016,0,0,0,0,0,55,0,0,0,0
2017,0,0,0,5,0,0,0,0,0,0
2018,0,182,0,0,0,0,0,0,0,0
2020,0,107,0,0,11,0,0,0,0,0
2022,34,0,0,0,0,0,0,11,0,100
2024,0,0,282,0,0,0,0,0,0,0
2025,0,0,149,0,0,0,0,0,60,0
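One thing to keep in mind is that asorti() (and length() on an array) are GNU awk extensions. As a hedged, more portable sketch of the same pivot, the script below sticks to POSIX awk: it records rows and columns in first-seen order instead of sorted order, and uses comma-subscripted keys (t[$3,$2], joined internally with SUBSEP) so that concatenated keys can never collide:

```shell
# Portable (POSIX awk) year-vs-culture pivot over the sample data.
awk 'BEGIN{FS=","; OFS=","}
(NR>1){
    if (!($2 in cseen)) { cseen[$2]=1; cols[++nc]=$2 }
    if (!($3 in rseen)) { rseen[$3]=1; rows[++nr]=$3 }
    t[$3,$2]+=$4
}
END{
    printf "x"
    for (i=1; i<=nc; i++) printf "%s%s", OFS, cols[i]
    print ""
    for (j=1; j<=nr; j++) {
        printf "%s", rows[j]
        for (i=1; i<=nc; i++) printf "%s%s", OFS, t[rows[j],cols[i]]+0
        print ""
    }
}' <<'EOF'
product_id,culture,year,no_of_trades
1,UK,2022,100
2,DE,2011,50
3,UK,2013,93
4,ES,2024,204
5,ES,2024,78
6,AT,2022,34
7,NO,2022,11
8,PT,2025,60
9,ES,2025,149
10,DE,2018,182
11,IE,2016,55
12,GR,2020,11
13,NL,2014,201
14,FI,2017,5
15,DE,2020,107
EOF
```

The row and column order here follows first appearance in the data rather than alphabetical order; if a fixed order matters, the rows/cols arrays can be sorted before printing.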
Caveats
-> Be mindful of unicode characters. Unicode characters can throw off functions like length() and asorti() in awk, so you might need to run your data through iconv first if that’s the case.
-> awk keeps all unique keys in memory: for very large pivots, you might need to consider a chunked approach.
Conclusion
While MariaDB can pivot tables using conditional aggregation, maintaining these queries becomes tedious as the distinct values that make up the columns change. A stored procedure with dynamic SQL generation might sidestep that problem, but it may be a slower and more time-consuming approach than a simple awk script once deployment approvals and testing requirements come into the picture.
Appendix
These are the queries needed to create the table and data:
CREATE TABLE trades (
product_id INT(11) NOT NULL AUTO_INCREMENT,
culture VARCHAR(255) NOT NULL,
year INT(11) NOT NULL,
no_of_trades INT(11) NOT NULL,
PRIMARY KEY (product_id)
) ENGINE=InnoDB;
INSERT INTO trades (product_id,culture,year,no_of_trades) VALUES
(1,'UK',2022,100),
(2,'DE',2011,50),
(3,'UK',2013,93),
(4,'ES',2024,204),
(5,'ES',2024,78),
(6,'AT',2022,34),
(7,'NO',2022,11),
(8,'PT',2025,60),
(9,'ES',2025,149),
(10,'DE',2018,182),
(11,'IE',2016,55),
(12,'GR',2020,11),
(13,'NL',2014,201),
(14,'FI',2017,5),
(15,'DE',2020,107);