Dealing with too many connections error in MariaDB
This is a dreaded error message: it can grind an entire production system to a halt, and on top of that it can end up locking the DBAs out as well.
At the time of writing this article, I was working as an SRE, but often got pulled into database incidents. This article therefore approaches the problem from the point of view of an SRE, focusing on basic understanding and troubleshooting options.
Understanding the Error
MariaDB has a setting called max_connections, which sets the maximum number of permitted simultaneous client connections.
If this limit is reached, MariaDB will not queue any further connection requests. Attempts to make new connections will be refused with the following error
ERROR 1040 (HY000): Too many connections
until some of the existing connections get closed.
Diagnostics
How do I know how many connections my DB server can accept?
The best place to look is the MariaDB configuration files in /etc. Exactly which file is being read can be a bit of a mess depending on your Linux distro, your MariaDB version, and your specific set-up; in my case, the relevant folder is /etc/mysql/mariadb.conf.d, which is a Debian-specific location. You might try /etc/my.cnf.d/ on other distros. Once you figure out which directory to look into, a simple grep command should reveal the answer.
root@debian-test:/etc/mysql/mariadb.conf.d# grep -rni max_connections
50-server.cnf:40:max_connections = 100
This number might vary in your case, but whatever value you find there is the maximum number of connections allowed.
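Note that the configuration file only tells you what the server was started with; if someone has raised or lowered the value at runtime, the live value can differ. If you still have a working session, you can ask the server directly:
SHOW GLOBAL VARIABLES LIKE 'max_connections';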
How do I know how many connections are being used now?
That really depends on whether you’re still able to get into the database or not. If you are, as root, query:
SELECT COUNT('x') FROM information_schema.PROCESSLIST;
To get the number of connections currently being used.
In this table, you can also see what users are connected, what queries (if any) they are executing, and other potentially useful variables.
Depending on your MariaDB version, you may be able to join the column ID to PROCESSLIST_ID in table performance_schema.session_connect_attrs to get detailed information about the different connectors and reported application names. This can be useful to e.g. single out human users, or users of a specific connector and application.
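As a rough sketch of what such a join could look like (assuming performance_schema is enabled, which it is not by default in MariaDB, and that your connectors actually report attributes such as program_name; attribute names vary by connector):
SELECT p.ID, p.USER, p.HOST, a.ATTR_NAME, a.ATTR_VALUE
FROM information_schema.PROCESSLIST p
JOIN performance_schema.session_connect_attrs a ON a.PROCESSLIST_ID = p.ID
WHERE a.ATTR_NAME IN ('program_name', '_client_name');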
Basically, your goal here is to find out which database user is using up the most connections, and then either kill those connections if it’s operationally safe, or alert the relevant process owner that their application is bringing down the database and work out with them a way to handle the situation safely.
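A quick way to see where the connections are concentrated is to aggregate the process list per user (swap the column for HOST or DB if that cuts the problem better):
SELECT USER, COUNT(*) AS connections
FROM information_schema.PROCESSLIST
GROUP BY USER
ORDER BY connections DESC;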
If you’re not even able to get in, don’t panic: you can still resort to commands like netstat on the OS level to get some information:
netstat -apn | grep ESTABLISHED | grep 3306
If you’re not using a proxy solution like proxysql or maxscale, then you’ll directly see which servers are connecting to your database, and you can use that to deduce which applications are connecting. Note, though, that even without a proxy you might not get a 100% accurate count here, because each client connection corresponds to a server thread, which doesn’t necessarily map 1:1 to the TCP connections visible on the OS level.
A quick summary of how the netstat output can be analyzed:
-> Unix socket connections won’t show up in this output, but you’re also unlikely to have many of these
-> TCP connections will appear, but if you’re using a proxy, you’ll see far fewer TCP connections than MariaDB server sessions, because an application like maxscale will combine multiple logical sessions into one connection
-> You might, in some situations, also see a lot of TCP connections in transient states like TIME_WAIT or CLOSE_WAIT, which basically means that something is flooding your DB server with short-lived connections.
This last point is perhaps the main utility of checking with netstat: it can help you isolate which server is exhausting the connections, and figure out who you need to reach out to so they can shut down the (relevant part of the) application causing this.
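As a rough sketch of that kind of analysis (assuming IPv4 and the classic net-tools netstat column layout), you can count the established connections to port 3306 per remote address:
netstat -tn | awk '$6 == "ESTABLISHED" && $4 ~ /:3306$/ {split($5, a, ":"); print a[1]}' | sort | uniq -c | sort -rn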
If you are using a proxy, you can most likely do analysis on the log files of the proxy to determine connection usage.
Quick Fixes
Whether a quick fix exists for your situation largely depends on whether you can get into the database or not. Unfortunately, if you can’t get into the database at all, then there are not a lot of options in terms of immediate remedies. The only realistic option is to figure out who or what is causing the connection exhaustion, and get them to stop.
Nonetheless, here are a few options you can try in case you can still get in.
Raise the max_connections value
You might try to temporarily raise the max_connections value:
SET GLOBAL max_connections = <number>;
This adjusts the value at runtime only (in other words, it lasts until the mysql/mariadb process is restarted and is not persisted to the configuration files). Be careful with this value, because your server has to have the resources to handle the overhead of the additional connections.
This is an often-cited solution to the problem on the internet, but in practice, a misbehaving application will happily open up more connections than your database server can even handle.
For example, an application with a retry mechanism for failed queries can keep opening new connections if a deployment included an accidentally unoptimized query that takes minutes or hours to complete. In such a scenario, increasing the max_connections value might not help, or might even make things worse, as the unoptimized queries eat up all the server resources.
Scrub queries using a script
This obviously only works if you can identify the offending queries, but if you can, then you can quickly put together a bash script to kill them as soon as they appear.
For example, a simple loop can do the trick:
# Find connections that have been running the known-bad query for over a minute,
# log their execution plan, then kill them.
for x in $(mysql -u root --skip-column-names --raw --batch -e "SELECT ID FROM information_schema.PROCESSLIST WHERE time > 60 AND INFO LIKE 'SELECT GROUP_CONCAT(... rest of the slow query...%' ORDER BY time DESC;"); do
  echo $x
  mysql -u root -e "SHOW EXPLAIN FOR $x; KILL $x;" >> out.log &
done
This gives a neat log of what has been killed by the scrubbing, and how many times. A solution like this can act as a decent band-aid until the root cause of the situation is fixed.
Preventative measures
Actively monitor and fix slow queries
Every single unoptimized query lingers in the system for longer than necessary. A query that executes faster usually translates to a connection that gets closed faster, so aggressive query optimization can go a long way towards making sure that the database server is not slowly eviscerated by a thousand cuts.
A practical way to stay ahead of this problem is to build a script that periodically runs a query checking for long-running or blocked transactions, combining:
-> active InnoDB transactions (information_schema.innodb_trx)
-> which threads they map to (performance_schema.threads)
-> current lock waits (information_schema.innodb_lock_waits)
Joining these together will often provide a snapshot of what’s actually holding up the system. This is safe to run every few minutes from cron. It won’t catch 100% of slow queries, but locking queries and slow queries often go hand in hand.
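A minimal sketch of such a snapshot query, assuming a MariaDB version where these tables exist and performance_schema is enabled (the 60-second threshold is just an example):
SELECT trx.trx_mysql_thread_id AS processlist_id, trx.trx_started, trx.trx_query, t.PROCESSLIST_USER, t.PROCESSLIST_HOST, w.blocking_trx_id
FROM information_schema.innodb_trx trx
JOIN performance_schema.threads t ON t.PROCESSLIST_ID = trx.trx_mysql_thread_id
LEFT JOIN information_schema.innodb_lock_waits w ON w.requesting_trx_id = trx.trx_id
WHERE trx.trx_started < NOW() - INTERVAL 60 SECOND;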
Implement an auto timeout if operationally safe
Many human users of GUI apps like Workbench tend not to close their connections when they’re done querying, and misconfigured applications may also leave idle connections behind. Such sleeping connections still count towards the max_connections limit, so killing them can help prevent these issues. This is obviously not an easy decision to make, because some applications may not react well to having their connections closed by the remote side.
For example, in a typical maxscale configuration for a service definition:
[Read-Write-Service-003]
type=service
router=readwritesplit
servers=server_003
user=maxscale
password=[REDACTED]
connection_timeout=15m
-> connection_timeout=15m means that any connections that have been idle for 15 minutes will be closed by maxscale.
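If you are not running a proxy in front of MariaDB, a rough server-side equivalent is to lower the idle timeout on the server itself. As a sketch (900 seconds matches the 15 minutes above; like SET GLOBAL max_connections, it only lasts until restart unless also written to the config files, and it only affects connections opened after the change):
SET GLOBAL wait_timeout = 900;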
Monitor the number of connected threads
Setting up a simple bash script to send yourself an email when the SELECT COUNT('x') of information_schema.PROCESSLIST approaches 80-90% of the maximum value can give you time to log in and reserve yourself a spot… at least during the busy/core hours.
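A minimal sketch of such a check, assuming passwordless local root access (e.g. via unix_socket authentication) and a working mail command; the 80% threshold and the recipient address are placeholders:
#!/bin/bash
# Alert when connection usage crosses a percentage of max_connections.
THRESHOLD=80
MAX=$(mysql -u root -N -B -e "SELECT @@GLOBAL.max_connections;")
CURRENT=$(mysql -u root -N -B -e "SELECT COUNT('x') FROM information_schema.PROCESSLIST;")
USAGE=$(( CURRENT * 100 / MAX ))
if [ "$USAGE" -ge "$THRESHOLD" ]; then
  echo "MariaDB connections at ${USAGE}% (${CURRENT}/${MAX})" | mail -s "MariaDB connection alert on $(hostname)" dba@example.com  # placeholder recipient
fi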
If you’re using a proxy like maxscale or proxysql, then monitoring connection usage there is going to paint a very accurate picture of the overall connection utilization.
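For example, with maxscale the maxctrl admin client (assuming a MaxScale version recent enough to ship it) can give a quick overview of per-server connection counts and of the individual client sessions it is handling:
maxctrl list servers
maxctrl list sessions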
Conclusion
While the problem itself is fairly easy to understand, a solution is not always immediately available. The most important takeaway is perhaps not to panic, not to start thinking about the post-mortem while still dealing with the issue, and to just calmly analyze the situation as it happens.