Summary Tables in MariaDB: A Case Study

This article demonstrates a common database optimization pattern using a simplified example inspired by problems I’ve encountered in production systems, with the goal of showing when a summary table can be useful and how one might go about implementing one.

The Problem

For a user-facing application, we need to fetch the latest price of each product in a user-defined list of products. The price of a particular product on a particular date may not be known, in which case price = NULL.
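
For concreteness, here is a minimal sketch of the kind of table assumed throughout this article. The column types mirror the summary table created later on, and the index matches the key name that appears in the EXPLAIN output below, but the real production schema is of course more involved:

CREATE TABLE product_data.pricedata (
    product_id INT(11) NOT NULL,
    date DATE NOT NULL,
    price BIGINT(20) DEFAULT NULL, -- NULL when the price for that date is not known
    KEY product_id (product_id)
) ENGINE=InnoDB;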

We can assume a reasonably big table, e.g.:

[MariaDB]> SELECT COUNT('x') FROM pricedata;
+------------+
| COUNT('x') |
+------------+
| 469155449  |
+------------+
1 row in set (4 min 18.129 sec)

Given the following query as the starting point:

SELECT
    product_data.product.product_id,
    latest_price.price,
    latest_price.date
FROM product_data.product
LEFT OUTER JOIN (
    SELECT hd1.product_id, hd1.price, hd1.date
    FROM product_data.pricedata AS hd1
    WHERE hd1.date = (
        SELECT MAX(hd2.date)
        FROM product_data.pricedata AS hd2
        WHERE (
            hd2.product_id = hd1.product_id
            AND hd2.price IS NOT NULL
        )
    )
) AS latest_price ON product_data.product.product_id = latest_price.product_id
WHERE (
    product_data.product.product_id IN (<2.5k of random ids of products>) AND product_data.product.some_criteria >= 1000
);

If we inspect the EXPLAIN of this query, nothing immediately stands out:

+------+--------------------+-------------+--------+---------------------------+------------+---------+-----------------------------+------+----------------+
| id   | select_type        | table       | type   | possible_keys             | key        | key_len | ref                         | rows | Extra          |
+------+--------------------+-------------+--------+---------------------------+------------+---------+-----------------------------+------+----------------+
| 1    | PRIMARY            | <subquery4> | ALL    | distinct_key              | NULL       | NULL    | NULL                        | 2500 |                |
| 1    | PRIMARY            | product     | eq_ref | PRIMARY,some_criteria_idx | PRIMARY    | 4       | tvc_0._col_1                | 1    | Using where    |
| 1    | PRIMARY            | hd1         | ref    | product_id                | product_id | 4       | tvc_0._col_1                | 23   | Using where    |
| 4    | MATERIALIZED       | <derived5>  | ALL    | NULL                      | NULL       | NULL    | NULL                        | 2500 |                |
| 5    | DERIVED            | NULL        | NULL   | NULL                      | NULL       | NULL    | NULL                        | NULL | No tables used |
| 3    | DEPENDENT SUBQUERY | hd2         | ref    | product_id                | product_id | 4       | product_data.hd1.product_id | 23   | Using where    |
+------+--------------------+-------------+--------+---------------------------+------------+---------+-----------------------------+------+----------------+
6 rows in set (0.006 sec)

Running the query takes a fair bit of time, way too slow for anything user- or client-facing:

{actual query results omitted}
2500 rows in set (36.373 sec)

So what is the problem?

The bottleneck isn’t an obvious table scan; rather, it is the work required to find the latest non-NULL row per product in such a big table.

If recent rows are NULL, MAX(date) WHERE price IS NOT NULL can still touch many rows, and repeating this “is it NULL?” check for 2,500 products becomes expensive.

This per-product cost of backtracking is simply not evident from the EXPLAIN output.
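
To make that cost concrete, the dependent subquery effectively runs the following once for every product in the IN () list (the product_id value here is just a placeholder):

SELECT MAX(hd2.date)
FROM product_data.pricedata AS hd2
WHERE hd2.product_id = 12345
  AND hd2.price IS NOT NULL;

With only the product_id index available, answering this means walking the product’s rows and checking price IS NOT NULL on each of them; products whose most recent rows are all NULL are especially costly, and the whole exercise is repeated 2,500 times per application query.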

The Summary Table

From what we know about the underlying problem, it is clear that this table has become impractically big for this kind of lookup. You might get faster queries by splitting the big IN () into smaller lists and sending multiple parallel queries, but that would not only increase the complexity of the application, it would also consume more database resources. It seems far more reasonable to ask whether we can cut down the amount of data this query needs to go through. The way to achieve this is to create a summary table.

With this approach we only run one slow query per day to build up the summary table, and the application's queries can be redirected to the new summary table, where they should hopefully be much faster.

Quick Sketch of the Workflow
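
The daily rebuild, as implemented by the script below:

1. Ensure the temporary summary table product_data.pricedata_daily_tmp exists.
2. Run the one slow query against pricedata to extract the latest non-NULL price per product, dumping the result to a CSV file.
3. Truncate the temporary table and load the CSV back into it in chunks via LOAD DATA LOCAL INFILE.
4. Rotate the tables: drop pricedata_daily_last, rename pricedata_daily_new to pricedata_daily_last, rename pricedata_daily_tmp to pricedata_daily_new, and recreate an empty pricedata_daily_tmp for the next run.

The application then points its queries at pricedata_daily_new instead of pricedata.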

Script Example of the Workflow

#!/bin/bash

# assume that $MARIADB holds some valid login configuration to the database
# assume $RUN_DIR points to the working directory of this script
SPLIT_DIR=$RUN_DIR/split

rm -rf $SPLIT_DIR
mkdir -p $SPLIT_DIR

echo "CREATE TABLE IF NOT EXISTS product_data.pricedata_daily_tmp (product_id INT(11) NOT NULL, date DATE NOT NULL, price BIGINT(20) DEFAULT NULL, PRIMARY KEY(`product_id`,`date`,`price`)) ENGINE=InnoDB;" | $MARIADB

# this query isn't a pleasant one-liner, so for clarity: we determine the latest non-NULL date per product, then we join back to retrieve the price
echo "USE product_data; SELECT pd1.product_id,pd1.date,pd1.price FROM pricedata pd1 JOIN (SELECT product_id, MAX(date) AS max_date,price FROM pricedata USE INDEX(product_id) WHERE price IS NOT NULL GROUP BY product_id) pd2 ON pd1.product_id = pd2.product_id AND pd1.date=pd2.max_date;" | $MARIADB --skip-column-names --silent | sed "s/\t/,/g" > $RUN_DIR/res.csv

echo "TRUNCATE TABLE product_data.pricedata_daily_tmp;" | $MARIADB

split -d -l 10000 $RUN_DIR/res.csv chunk_ --additional-suffix=.csv
mv $RUN_DIR/chunk_* $SPLIT_DIR
cd $SPLIT_DIR

SQL="USE product_data; LOAD DATA LOCAL INFILE './%THE_FILE%' IGNORE INTO TABLE pricedata_daily_tmp FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (product_id,date,price);"

echo "$SQL" > template.sql
for chunk in $(ls chunk_*.csv); do cat template.sql | sed 's/%THE_FILE%/'$chunk'/g' | $MARIADB --local-infile=1 -A -vvv >> $RUN_DIR/output.txt; sleep 2; done
cd $RUN_DIR

echo "DROP TABLE IF EXISTS product_data.pricedata_daily_last;" | $MARIADB
echo "RENAME TABLE product_data.pricedata_daily_new TO product_data.pricedata_daily_last;" | $MARIADB
echo "RENAME TABLE product_data.pricedata_daily_tmp TO product_data.pricedata_daily_new;" | $MARIADB
echo "CREATE TABLE product_data.pricedata_daily_tmp (product_id INT(11) NOT NULL, date DATE NOT NULL, price BIGINT(20) DEFAULT NULL, PRIMARY KEY(`product_id`,`date`,`price`)) ENGINE=InnoDB;" | $MARIADB

The resulting table is more than a hundred times smaller than the original, comfortably clearing the common rule of thumb that a summary table should be at least an order of magnitude smaller than the table it summarizes:

[MariaDB]> SELECT COUNT('x') FROM pricedata_daily_new;
+------------+
| COUNT('x') |
+------------+
| 4140708    |
+------------+
1 row in set (1.404 sec)

Even by just swapping out pricedata for pricedata_daily_new in the original query, we can observe a dramatic improvement in execution time:

SELECT
    product_data.product.product_id,
    latest_price.price,
    latest_price.date
FROM product_data.product
LEFT OUTER JOIN (
    SELECT hd1.product_id, hd1.price, hd1.date
    FROM product_data.pricedata_daily_new AS hd1
    WHERE hd1.date = (
        SELECT MAX(hd2.date)
        FROM product_data.pricedata_daily_new AS hd2
        WHERE (
            hd2.product_id = hd1.product_id
            AND hd2.price IS NOT NULL
        )
    )
) AS latest_price ON product_data.product.product_id = latest_price.product_id
WHERE (
    product_data.product.product_id IN (<2.5k of random ids of products>) AND product_data.product.some_criteria >= 1000
);

Results in:

{actual query results omitted}
2500 rows in set (0.065 sec)

This is much more acceptable from an end-user responsiveness perspective.

Caveats

The script above does not account for the fact that the pricedata table might get updated during the day. If pricedata is updated, the summary table will not receive those updates and its values will drift. If this is a concern, it may be necessary to extend the script with a step that periodically scans for recently changed entries in pricedata and transfers the updated results into the summary table:
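
A minimal sketch of what such a refresh could look like, assuming “recent” simply means products that received a pricedata row today; the CURDATE() cutoff and the delete-then-reinsert approach are assumptions rather than part of the original workflow:

-- refresh summary rows for products that got new pricedata today
DELETE FROM product_data.pricedata_daily_new
WHERE product_id IN (
    SELECT DISTINCT product_id
    FROM product_data.pricedata
    WHERE date = CURDATE()
);

INSERT INTO product_data.pricedata_daily_new (product_id, date, price)
SELECT pd1.product_id, pd1.date, pd1.price
FROM product_data.pricedata pd1
JOIN (
    SELECT product_id, MAX(date) AS max_date
    FROM product_data.pricedata
    WHERE price IS NOT NULL
      AND product_id IN (
          SELECT DISTINCT product_id
          FROM product_data.pricedata
          WHERE date = CURDATE()
      )
    GROUP BY product_id
) pd2 ON pd1.product_id = pd2.product_id AND pd1.date = pd2.max_date;

Run from cron every hour or so, this keeps the summary table reasonably fresh without repeating the full daily rebuild.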

Conclusion

Sometimes a slow query does not have an immediate answer to “How do I speed this up?”, and insisting on finding one can feel like bashing your head against a wall. When dealing with slow queries against big tables, an alternative approach is to ask “Why are we querying this table at all?”.

The process described in this article provides a practical example of what denormalization looks like in real-world systems: reshaping data to better match the queries that are sent to the database.