Generated Columns: Dealing with Functions on Indexed Columns (MariaDB)
This article gives a quick discussion of why functions on indexed columns can be worth optimizing around, with an example based on a real problem I encountered.
Context
Some time ago, I was handed the problem of speeding up a time-sensitive data transfer and transform pipeline, where the overall script has to finish within a set amount of time.
While much work needed to be done, one issue stood out as memorable, and it is one that often gets overlooked: functions on indexed columns. The syntax is very clean, but the performance implications can be quite costly.
The Problem
As part of the problem, I encountered a small query that was used to find a specific subset of products.
select
p.code,
r.id
from ProductDictionary as p
inner join RuleDictionary as r on r.rule_set_id=p.rule_set_id and r.module_id=p.module_id
where length(p.class)<3
and length(p.code)=12
and p.type in ('X','Y','Z');
Executing:
{results omitted}
215719 rows in set (47.70 sec)
Examining the EXPLAIN:
[MariaDB]> EXPLAIN select
-> p.code,
-> r.id
-> from ProductDictionary as p
-> inner join RuleDictionary as r on r.rule_set_id=p.rule_set_id and r.module_id=p.module_id
-> where length(p.class)<3
-> and length(p.code)=12
-> and p.type in ('X','Y','Z');
+------+-------------+-------+-------+-------------------+-------------------+---------+-----------------------------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+-------------------+-------------------+---------+-----------------------------+------+------------------------------------+
| 1 | SIMPLE | r | index | module_definition | module_definition | 22 | NULL | 1402 | Using index |
| 1 | SIMPLE | p | ref | rule_set_id_index | rule_set_id_index | 19 | product_rules.r.rule_set_id | 455 | Using index condition; Using where |
+------+-------------+-------+-------+-------------------+-------------------+---------+-----------------------------+------+------------------------------------+
2 rows in set (0.00 sec)
This EXPLAIN might suggest that everything is indexed and fine, but a closer examination of the output shows the database performing a full index scan on RuleDictionary (type=index), walking through every entry in the index rather than doing a range scan or a direct lookup. As a rule of thumb, ref=NULL combined with Using index means we're not narrowing down to a specific range of values.
But can we find a different index? Notice the length(p.code)=12 comparison in the WHERE clause. Unfortunately, that is not directly indexable: even if an index on code existed, wrapping the column in length() forces the engine to evaluate the function on every row and only then apply the comparison, which prevents index use and degrades into a full scan anyway.
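To see the difference for yourself, here is a minimal sketch (the single-column index code_index is hypothetical; no such index existed on the real table) comparing a function-wrapped predicate against one on the bare column:
-- Hypothetical single-column index, purely for illustration:
ALTER TABLE ProductDictionary ADD INDEX code_index (code);

-- Not sargable: the column is wrapped in length(), so the engine cannot
-- seek into code_index and must evaluate the function row by row.
EXPLAIN SELECT code FROM ProductDictionary WHERE length(code) = 12;

-- Sargable: a predicate on the bare column can use the index directly.
EXPLAIN SELECT code FROM ProductDictionary WHERE code LIKE 'AB%';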
The Generated Column
What we can do is pre-calculate length(p.code) and store it in a persistent generated column! In a persistent generated column, the value is defined as a function of other columns, evaluated when the row is written, and stored on disk. The end user/application does not need to concern themselves with inserting or updating these values. (It is worth noting that the value is stored and maintained even if no query ever uses it, so be careful when adding a persistent stored column, and keep write overhead in mind.)
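As a minimal illustration of the semantics (the demo table is made up), the column is filled in by the engine at write time and cannot be written directly:
-- Toy table: code_length is derived from code on every INSERT/UPDATE.
CREATE TABLE demo (
    code varchar(32),
    code_length tinyint unsigned GENERATED ALWAYS AS (length(code)) PERSISTENT
);

INSERT INTO demo (code) VALUES ('ABC123');

-- code_length was computed and stored automatically:
SELECT code, code_length FROM demo;   -- returns 'ABC123', 6

-- Supplying an explicit value for a generated column raises an error:
-- INSERT INTO demo (code, code_length) VALUES ('X', 99);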
Perfect for our case here, so let’s go ahead:
[MariaDB]> ALTER TABLE ProductDictionary ADD COLUMN code_length tinyint unsigned GENERATED ALWAYS AS (length(code)) PERSISTENT, ADD INDEX lcode_instrument (code_length, type);
Query OK, 7571390 rows affected (2 min 55.04 sec)
Records: 7571390 Duplicates: 0 Warnings: 0
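As a quick sanity check after the ALTER (nothing specific to this pipeline), SHOW CREATE TABLE confirms the generated column definition and the new composite index:
SHOW CREATE TABLE ProductDictionary\G
-- The output should list the GENERATED ALWAYS AS (length(`code`)) column
-- and the key lcode_instrument (code_length, type).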
[MariaDB]> EXPLAIN select
-> p.code,
-> r.id
-> from ProductDictionary as p FORCE INDEX(lcode_instrument)
-> inner join RuleDictionary as r on r.rule_set_id=p.rule_set_id and r.module_id=p.module_id
-> where length(p.class)<3
-> and p.code_length=12
-> and p.type in ('X','Y','Z');
+------+-------------+-------+-------+--------------------+--------------------+---------+-------------------------------------------------------+---------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+--------------------+--------------------+---------+-------------------------------------------------------+---------+------------------------------------+
| 1 | SIMPLE | p | range | lcode_instrument | lcode_instrument | 13 | NULL | 5175116 | Using index condition; Using where |
| 1 | SIMPLE | r | ref | module_definition | module_definition | 22 | product_rules.p.rule_set_id,product_rules.p.module_id | 1 | Using index |
+------+-------------+-------+-------+--------------------+--------------------+---------+-------------------------------------------------------+---------+------------------------------------+
2 rows in set (0.00 sec)
Despite estimating 5.1M rows vs 1.4k rows in the original plan, this is still faster because the problem is now reduced to:
-> Range scan on lcode_instrument narrows to products where code_length=12 and type IN ('X','Y','Z')
-> then joins to RuleDictionary using the composite index, so no complicated join pattern here
vs original plan where:
-> Full index scan of RuleDictionary
-> nested loop join to ProductDictionary with rule_set_id
-> filter on length(), where the function is evaluated on the fly for the relevant rows
Overall, despite starting with more rows, the query now first filters on an indexed column, instead of computing the filter as an afterthought.
Unfortunately, without forcing the index the optimizer still prefers the old plan, because on paper 1.4k rows looks better than 5.1M. The optimizer's estimate simply doesn't account for the expensive length() calculations happening later. FORCE INDEX overrides the optimizer's decision, since we know that reading ~5M precalculated values will be faster than calculating length() on the poorly filtered intermediate result set.
select
p.code,
r.id
from ProductDictionary as p FORCE INDEX(lcode_instrument)
inner join RuleDictionary as r on r.rule_set_id=p.rule_set_id and r.module_id=p.module_id
where length(p.class)<3
and p.code_length=12
and p.type in ('X','Y','Z');
Checking the output:
{results omitted}
215719 rows in set (20.59 sec)
This is now indeed faster (47.7s -> 20.6s, cutting the runtime by 57%), while returning the same result as desired. The table isn't extremely large, so the extra disk space needed for the generated column was an acceptable tradeoff here.
In cases where the workload is write-heavy but light on reads, a VIRTUAL column might be preferred: the data is not stored on disk (so there is no write penalty), and the value is calculated at SELECT time instead. As this problem was part of a data pipeline whose underlying data doesn't receive continuous streams of changes, PERSISTENT was the better choice.
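As a sketch, the VIRTUAL variant differs only in the final keyword (the column name code_length_v is illustrative):
-- VIRTUAL: nothing is stored on disk; the value is computed when the
-- column is read, so writes pay no penalty and reads pay the length() cost.
ALTER TABLE ProductDictionary
    ADD COLUMN code_length_v tinyint unsigned
    GENERATED ALWAYS AS (length(code)) VIRTUAL;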
For MySQL readers: you may opt for creating a functional index like:
ALTER TABLE ProductDictionary ADD INDEX lcode_instrument((length(code)), type);
This should achieve a similar result.
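Assuming MySQL 8.0.13 or later (when functional key parts were introduced), the functional index is matched against the identical expression in the query, so the original predicate can stay as it was written:
-- No query rewrite needed: length(p.code)=12 matches the indexed expression.
EXPLAIN select
    p.code,
    r.id
from ProductDictionary as p
inner join RuleDictionary as r on r.rule_set_id=p.rule_set_id and r.module_id=p.module_id
where length(p.class)<3
and length(p.code)=12
and p.type in ('X','Y','Z');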
Conclusion
Using the generated column didn't end up being a massive performance improvement in this particular situation, but for time-sensitive data pipelines every second matters, and this is an easy technique to keep in mind for future use cases.
As a side note, it is interesting that a lower row estimate in EXPLAIN does not always translate into a faster query, even though it usually does. If the plan with the lower estimate runs slower, it is worth checking how and when the filtering actually happens.
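On MariaDB (10.1 and later), a convenient tool for that check is the ANALYZE statement, which actually executes the query and reports real row counts (r_rows, r_filtered) next to the optimizer's estimates; for example, against the query from this article:
-- ANALYZE runs the statement for real and adds r_rows / r_filtered columns,
-- exposing gaps between estimated and actual filtering.
ANALYZE select
    p.code,
    r.id
from ProductDictionary as p FORCE INDEX(lcode_instrument)
inner join RuleDictionary as r on r.rule_set_id=p.rule_set_id and r.module_id=p.module_id
where length(p.class)<3
and p.code_length=12
and p.type in ('X','Y','Z');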