Using FORCE INDEX: When Low Cardinality Statistics Mislead the Optimizer (MariaDB)

Not too long ago, I ran into a case where MariaDB’s optimizer picked a sub-optimal plan, one that ignored an index. I had to use FORCE INDEX() to fix it. Before diving into what happened, and why the optimizer missed a mark, let’s look at the original query that was submitted to me.

Context

The initial query was an aggregate query, trying to join three tables together in order to do a bit of filtering on some criteria. Fairly typical stuff.

[MariaDB]> select t2.other_value, t3.symbol from table_2 t2 join morestuff.table_3 t3 using (other_value) join table_1 t1 using (boring_id) where t2.id >458389399 and t1.doubtful_id in (191,192) and t3.symbol in ('A0','A1') group by t2.other_value limit 100000;
15650 rows in set (4 min 49.776 sec)

This is pretty slow, so that explains why a ticket to speed this up was opened. The first step is to check the EXPLAIN, to try and find out why is this so slow:

[MariaDB]> EXPLAIN select t2.other_value, t3.symbol from table_2 t2 join morestuff.table_3 t3 using (other_value) join table_1 t1 using (boring_id) where t2.id >458389399 and t1.doubtful_id in (191,192) and t3.symbol in ('A0','A1') group by t2.other_value limit 100000;
+------+-------------+-------+--------+----------------------------------------------------------+-------------+---------+----------------------------+----------+----------------------------------------------+
| id   | select_type | table | type   | possible_keys                                            | key         | key_len | ref                        | rows     | Extra                                        |
+------+-------------+-------+--------+----------------------------------------------------------+-------------+---------+----------------------------+----------+----------------------------------------------+
|    1 | SIMPLE      | t2    | range  | PRIMARY,<irrelevant idxs>                                | PRIMARY     | 4       | NULL                       | 98390404 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | t3    | eq_ref | <irrelevant idxs>                                        | other_value | 5       | main_schema.t2.other_value | 1        | Using where                                  |
|    1 | SIMPLE      | t1    | eq_ref | PRIMARY,doubtful_id,doubtful_id_2,doubtful_id_3          | PRIMARY     | 4       | main_schema.t2.boring_id   | 1        | Using where                                  |
+------+-------------+-------+--------+----------------------------------------------------------+-------------+---------+----------------------------+----------+----------------------------------------------+
3 rows in set (0.002 sec)

The optimizer seems to decide to begin the work starting with table t2, which in this case had ~500 million rows. Doing a full table scan and then filtering for the remaining conditions is very slow.

So here, I attempted the standard practices for optimizing this kind of query. Split the IN (191,192) on the field doubtful_id to a SELECT ... UNION SELECT statement, so that maybe optimizer would pick one of the indexes on t1. With IN() the optimizer sometimes decides to forego an index, it might simply think that the list inside the IN() would resolve to a high % of total rows in the table. A simple WHERE with just one value often resolves this issue and gets the optimizer to try an index search. If an index on t1 could be used, the optimizer should rearrange the order of execution, and not start with the really big t2 table.

The rewrite

Here I’ve also removed the WHERE clause on t3 and the GROUP BY, as these are trivial and fast to do with awk after the data is output from MariaDB. This makes it easier to get to the real problem with the query, as it removes a fair bit of noise from the EXPLAIN. The ticket reporter was fine with this approach.

So trying the rewrite:

[MariaDB]> EXPLAIN SELECT t1.boring_id ,t2.other_value,t3.symbol FROM table_1 t1 JOIN table_2 t2 ON t1.boring_id  = t2.boring_id  JOIN morestuff.table_3 t3 ON t3.other_value = t2.other_value WHERE t1.doubtful_id = 191 AND t2.id > 458389399 UNION SELECT t1.boring_id ,t2.other_value,t3.symbol FROM table_1 t1 JOIN table_2 t2 ON t1.boring_id  = t2.boring_id  JOIN morestuff.table_3 t3 ON t3.other_value = t2.other_value WHERE t1.doubtful_id = 192 AND t2.id > 458389399;
+------+--------------+------------+--------+----------------------------------------------------------+-------------+---------+----------------------------+----------+-------------+
| id   | select_type  | table      | type   | possible_keys                                            | key         | key_len | ref                        | rows     | Extra       |
+------+--------------+------------+--------+----------------------------------------------------------+-------------+---------+----------------------------+----------+-------------+
|    1 | PRIMARY      | t2         | range  | PRIMARY,<irrelevant idxs>                                | PRIMARY     | 4       | NULL                       | 98390912 | Using where |
|    1 | PRIMARY      | t3         | eq_ref | other_value                                              | other_value | 5       | main_schema.t2.other_value | 1        |             |
|    1 | PRIMARY      | t1         | eq_ref | PRIMARY,doubtful_id,doubtful_id_2,doubtful_id_3          | PRIMARY     | 4       | main_schema.t2.boring_id   | 1        | Using where |
|    2 | UNION        | t2         | range  | PRIMARY,<irrelevant idxs>                                | PRIMARY     | 4       | NULL                       | 98390912 | Using where |
|    2 | UNION        | t3         | eq_ref | other_value                                              | other_value | 5       | main_schema.t2.other_value | 1        |             |
|    2 | UNION        | t1         | eq_ref | PRIMARY,doubtful_id,doubtful_id_2,doubtful_id_3          | PRIMARY     | 4       | main_schema.t2.boring_id   | 1        | Using where |
| NULL | UNION RESULT | <union1,2> | ALL    | NULL                                                     | NULL        | NULL    | NULL                       | NULL     |             |
+------+--------------+------------+--------+----------------------------------------------------------+-------------+---------+----------------------------+----------+-------------+
7 rows in set (0.002 sec)

Not exactly the result I was hoping for. This query should take at least twice as long as the original query I started with. Looks like even with WHERE = on doubtful_id, the optimizer was still refusing to use an index on t1, the optimizer still thought that starting off with t2 and then filtering from there was better.

Just to verify the execution speed:

[MariaDB]> EXPLAIN SELECT t1.boring_id ,t2.other_value,t3.symbol FROM table_1 t1 JOIN table_2 t2 ON t1.boring_id  = t2.boring_id  JOIN morestuff.table_3 t3 ON t3.other_value = t2.other_value WHERE t1.doubtful_id = 191 AND t2.id > 458389399 UNION SELECT t1.boring_id ,t2.other_value,t3.symbol FROM table_1 t1 JOIN table_2 t2 ON t1.boring_id  = t2.boring_id  JOIN morestuff.table_3 t3 ON t3.other_value = t2.other_value WHERE t1.doubtful_id = 192 AND t2.id > 458389399;
584268 rows in set (11 min 4.386 sec)

Confirmed to be terribly slow.

This is probably a good moment to have a look at what the indexes on t1 actually look like:

[MariaDB]> SHOW INDEXES FROM table1;
+---------+------------+----------------------------------+--------------+----------------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| Table   | Non_unique | Key_name                         | Seq_in_index | Column_name                      | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Ignored |
+---------+------------+----------------------------------+--------------+----------------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| table_1 |          0 | PRIMARY                          |            1 | boring_id                        | A         |    13214619 |     NULL | NULL   |      | BTREE      |         |               | NO      |
...
| table_1 |          1 | doubtful_id                      |            1 | doubtful_id                      | A         |         104 |     NULL | NULL   |      | BTREE      |         |               | NO      |
...
| table_1 |          1 | doubtful_id_2                    |            1 | doubtful_id                      | A         |         104 |     NULL | NULL   |      | BTREE      |         |               | NO      |
...

The cardinality value of the index doubtful_id is only 104. That basically means that the optimizer considers that for each value of doubtful_id there’s ~100k rows of data, so the optimizer thought that this index is not worth using, because this could mean millions of rows of data anyway, and the table is 13 million rows. If this expectation of the optimizer is true, then using the index doubtful_id would be not worth it, and might as well do a full table scan. With the current plan, filtering by doubtful_id happens as an afterthought (Using where).

The cardinality estimate of 104 was most likely because the table’s PK is auto-increment, and doubtful_id=191/192 wasn’t a thing until years, and millions of PK’s into the table’s existence, but innodb’s statistics sampling tends to mostly read older pages, which gives a potential for that cardinality value to be badly underestimated.

The other rewrite

The way forward here is to force the doubtful_id index. Even though it’s not a “good” index, due to low cardinality, it should be better than what the optimizer assumes:

[MariaDB]> EXPLAIN SELECT t1.boring_id ,t2.other_value,t3.symbol FROM table_1 t1 FORCE INDEX(doubtful_id) JOIN table_2 t2 ON t1.boring_id  = t2.boring_id  JOIN morestuff.table_3 t3 ON t3.other_value = t2.other_value WHERE t1.doubtful_id = 191 AND t2.id > 458389399 UNION SELECT t1.boring_id ,t2.other_value,t3.symbol FROM table_1 t1 FORCE INDEX(doubtful_id) JOIN table_2 t2 ON t1.boring_id  = t2.boring_id  JOIN morestuff.table_3 t3 ON t3.other_value = t2.other_value WHERE t1.doubtful_id = 192 AND t2.id > 458389399;
+------+--------------+------------+--------+----------------------------------------------------------+---------------+---------+----------------------------+---------+-----------------------+
| id   | select_type  | table      | type   | possible_keys                                            | key           | key_len | ref                        | rows    | Extra                 |
+------+--------------+------------+--------+----------------------------------------------------------+---------------+---------+----------------------------+---------+-----------------------+
|    1 | PRIMARY      | t1         | ref    | doubtful_id                                              | doubtful_id   | 4       | const                      | 5147166 | Using index           |
|    1 | PRIMARY      | t2         | ref    | PRIMARY,<irrelevant keys>                                | boring_id     | 4       | main_schema.t1.boring_id   | 33      | Using index condition |
|    1 | PRIMARY      | t3         | eq_ref | other_value                                              | other_value   | 5       | main_schema.t2.other_value | 1       |                       |
|    2 | UNION        | t1         | ref    | doubtful_id                                              | doubtful_id   | 4       | const                      | 5047202 | Using index           |
|    2 | UNION        | t2         | ref    | PRIMARY,<irrelevant keys>                                | boring_id     | 4       | main_schema.t1.boring_id   | 33      | Using index condition |
|    2 | UNION        | t3         | eq_ref | other_value                                              | other_value   | 5       | main_schema.t2.other_value | 1       |                       |
| NULL | UNION RESULT | <union1,2> | ALL    | NULL                                                     | NULL          | NULL    | NULL                       | NULL    |                       |
+------+--------------+------------+--------+----------------------------------------------------------+---------------+---------+----------------------------+---------+-----------------------+
7 rows in set (0.002 sec)

The execution plan looks way better, and now we can actually even decipher why the optimizer did not want to try and use the doubtful_id index. It estimates ~5M rows, just under half the table, so it’s clearly not a great index. But is it better than nothing?

Let’s check the execution:

[MariaDB]> SELECT t1.boring_id ,t2.other_value,t3.symbol FROM table_1 t1 FORCE INDEX(doubtful_id) JOIN table_2 t2 ON t1.boring_id  = t2.boring_id  JOIN morestuff.table_3 t3 ON t3.other_value = t2.other_value WHERE t1.doubtful_id = 191 AND t2.id > 458389399 UNION SELECT t1.boring_id ,t2.other_value,t3.symbol FROM table_1 t1 FORCE INDEX(doubtful_id) JOIN table_2 t2 ON t1.boring_id  = t2.boring_id  JOIN morestuff.table_3 t3 ON t3.other_value = t2.other_value WHERE t1.doubtful_id = 192 AND t2.id > 458389399;
584294 rows in set (40.282 sec)

It is better. This execution plan creates a much smaller working set for the subsequent joins, so the joins become more efficient because they’re probing against fewer rows.

The previous query, before FORCE INDEX()

-> Starts with t2 (98M rows estimated range scan)
-> Joins to t3
-> Joins to t1 on PRIMARY, then filters by doubtful_id

The new query, with FORCE INDEX():

-> Starts with t1.doubtful_id (5M rows vs 98M rows, we begin with a smaller working set)
-> Joins to t2 on boring_id (ref with an index condition pushdown, this pushes the t2.id > 458389399 filter down, meaning MariaDB is able to apply that part of the WHERE clause during index traversal)
-> Joins to t3

Basically, the optimizer didn’t understand that the index would have helped to produce a small filtered subset.

Performance results

-> Original query (IN, no index): 4min 50sec
-> UNION rewrite (no index): 11min 4sec
-> UNION with FORCE INDEX: 40 sec

Overall result: 7x faster than original, 16x faster than version from first rewrite

Why not ANALYZE TABLE?

A common suggestion for optimizer issues is to run ANALYZE TABLE to refresh statistics, and this situation might even look like where the advice is relevant.

In my case, this would have been impractical to even consider:

-> Running ANALYZE TABLE can change statistics for all indexes on the table, potentially helping one query while breaking others, and t1 in this situation is a very busy, central table to a number of systems
-> ANALYZE TABLE can take significant time and resources, and hard to estimate how long it will actually take
-> The timing is difficult, because don’t want to run it during business hours due to busyness, but statistics issues are typically discovered during peak traffic
-> Even with refreshed statistics, a cardinality of 104 on ~13M rows would still look unattractive to the optimizer

In this case, FORCE INDEX provides a targeted fix without the risk of affecting other queries, and got the ticket out of my queue significantly faster.

Conclusion

Sometimes even the optimizer can be mislead, and standard practices of query optimization can fail. It’s always a good idea to dig a layer deeper, because it might just be possible to salvage that query. Overriding the optimizer is an option, though it is worth to keep in mind that littering a codebase with FORCE INDEX() statements is likely to cause push-back, as it adds another layer of future maintenance that must be kept in mind. If a table’s data changes significantly over time, a forgotten FORCE INDEX() might keep the optimizer from picking a more efficient plan.

In this case, there was also no simple way out via some easy indexing fix. doubtful_id was already part of a composite index (also named doubtful_id), but the query filtered only on that one column, so the additional columns never came into the picture. The other composite indexes (doubtful_id_2 and doubtful_id_3) also began with doubtful_id, which means that with only a single column in the WHERE clause of the query, these indexes are all effectively equivalent due to the leftmost-prefix rule.