Virtual Memory + Page Faults
Memory Page
A memory page is the basic unit of data from the perspective of the Operating System (OS). All of the code and data is arranged in groups of pages. This mechanism exists because a system can have less RAM installed than the maximum address space, which means data sometimes needs to be swapped between secondary storage (HDD/SSD) and RAM/physical memory. In the case of large applications, it is entirely possible that while the application is running, not all of it is stored in RAM.
So basically, page tables are a way for the OS to keep track of where different sets of working data live, and whether they are stored on secondary storage or in physical memory.
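To make this concrete, the page size (and how much physical memory backs it) can be queried directly. A minimal sketch using sysconf; note that _SC_PHYS_PAGES is a common extension supported on Linux rather than core POSIX:

#include <stdio.h>
#include <unistd.h>

// Minimal sketch: query the page size and the amount of physical memory via sysconf.
int main(void) {
    long pagesize = sysconf(_SC_PAGESIZE);    // size of one memory page in bytes
    long physpages = sysconf(_SC_PHYS_PAGES); // number of pages of physical memory

    printf("page size      : %ld bytes\n", pagesize);
    printf("physical pages : %ld (~%ld MiB of RAM)\n",
           physpages, (physpages * (pagesize / 1024)) / 1024);
    return 0;
}

On a typical x86-64 Linux machine the reported page size is 4096 bytes.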
What is a Page Fault?
A page fault is a hardware exception raised when a process tries to access a memory page that is not currently loaded into physical memory. It is most likely to happen when a process requests data that hasn’t been used recently (and has since been pushed out of RAM), or that hasn’t been loaded into RAM at all.
What basically happens:
-> The OS determines which memory page needs to be loaded
-> It updates the page table, a data structure that maps virtual memory addresses to physical memory addresses, to reflect the new mapping.
-> If physical memory is full, the OS selects a page to be evicted (replaced) using a page replacement algorithm like Least Recently Used (LRU).
-> The missing page is loaded from disk into the newly freed physical memory page.
-> Once the page is loaded, the process that requested the data access can be resumed.
Most commonly occurs when:
-> The data requested has never existed in physical memory.
-> The data requested did exist in physical memory, but was swapped out to secondary storage (HDD/SSD).
Bringing data in from secondary storage is obviously much slower than directly accessing something in physical memory, so if a lot of page faults happen during some process, it starts to show up in the performance.
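The eviction step in the list above mentions Least Recently Used (LRU). The kernel's real page replacement is considerably more sophisticated, but a toy simulation, with a made-up frame count and page reference string, shows the basic idea:

#include <stdio.h>
#include <string.h>

#define FRAMES 3  // pretend physical memory holds only 3 pages

// Toy sketch of LRU page replacement (not how the kernel actually implements it).
// The page reference string is hypothetical.
int main(void) {
    int frames[FRAMES];                       // which page each frame currently holds
    int age[FRAMES];                          // "ticks" since each frame was last used
    int refs[] = {1, 2, 3, 1, 4, 2, 5, 1};    // pages the "process" touches, in order
    int nrefs = (int)(sizeof(refs) / sizeof(refs[0]));
    int faults = 0;

    memset(frames, -1, sizeof(frames));       // -1 means the frame is empty
    memset(age, 0, sizeof(age));

    for (int i = 0; i < nrefs; i++) {
        int page = refs[i], hit = -1, victim = 0;

        for (int f = 0; f < FRAMES; f++) {
            age[f]++;                             // every frame gets one tick older
            if (frames[f] == page) hit = f;       // page is already resident
            if (age[f] > age[victim]) victim = f; // candidate: least recently used
        }

        if (hit >= 0) {
            age[hit] = 0;                         // refresh on access
            printf("page %d: hit (frame %d)\n", page, hit);
            continue;
        }

        faults++;
        for (int f = 0; f < FRAMES; f++)          // prefer an empty frame if any
            if (frames[f] == -1) { victim = f; break; }

        if (frames[victim] != -1)
            printf("page %d: fault, evicting page %d from frame %d\n",
                   page, frames[victim], victim);
        else
            printf("page %d: fault, loading into free frame %d\n", page, victim);

        frames[victim] = page;
        age[victim] = 0;
    }

    printf("total faults: %d\n", faults);
    return 0;
}

With three frames and that reference string the simulation reports seven faults; give it more frames (more RAM) and the count drops, which is exactly the effect described above.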
Virtual Memory
-> Every process thinks it has a huge, private address space.
-> The Operating System (OS) keeps a page table for each process.
This provides the illusion of a larger memory space than what is physically available. The page table maps fixed-size virtual pages to physical memory. This virtual view of memory is called the address space, and it is made possible by the fact that the OS can swap pages between storage and physical memory/RAM.
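One way to see the gap between the address space a process has and what is actually resident in RAM is to compare the VmSize and VmRSS lines of /proc/self/status. A minimal Linux-specific sketch; the 64 MB allocation is just there to make the gap visible:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Minimal sketch (Linux-specific): print how much virtual address space this
// process has mapped (VmSize) versus how much is actually resident in RAM (VmRSS).
int main(void) {
    char *buf = malloc(64 * 1024 * 1024);     // 64 MB of virtual address space
    if (buf) buf[0] = 1;                      // typically only the touched page becomes resident

    FILE *f = fopen("/proc/self/status", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        // lines look like "VmSize:  <number> kB"
        if (strncmp(line, "VmSize:", 7) == 0 || strncmp(line, "VmRSS:", 6) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    free(buf);
    return 0;
}

VmSize counts the whole 64 MB of address space, while VmRSS stays much smaller, because only the touched page needs to be backed by physical memory.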
When an application tries to read from a page that isn’t available in physical memory, the CPU raises a page fault; if the kernel has to bring the page in from storage, it is counted as a major page fault.
How a Major Page Fault works
The best way to understand this is to see it in action, so let’s write a simple program that:
Creates a big file
Maps the file into memory
In a loop, touches each page to try and trigger a major page fault
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <string.h>
#include <errno.h>
int main() {
    const char *filename = "bigfile.bin";
    size_t size = 1024UL * 1024 * 1024; // 1 GB
    size_t pagesize = sysconf(_SC_PAGESIZE);
    struct rusage usage;

    printf("Creating %s (%zu bytes)...\n", filename, size);

    // Create or truncate the file
    int fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    // Extend file to desired size using ftruncate
    if (ftruncate(fd, size) != 0) {
        perror("ftruncate");
        close(fd);
        return 1;
    }

    printf("File created. Mapping to memory...\n");

    // Map the file into memory
    char *map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    printf("Mapped. Touching each page (~%zu bytes per step)...\n", pagesize);
    fflush(stdout);

    // Touch each page
    for (size_t i = 0; i < size; i += pagesize) {
        map[i] = 1; // write to force disk read into RAM
        getrusage(RUSAGE_SELF, &usage);
        printf("Touched page %zu : majflt=%ld minflt=%ld\n",
               i / pagesize, usage.ru_majflt, usage.ru_minflt);
    }

    printf("Done touching pages.\n");

    munmap(map, size);
    close(fd);

    printf("All done.\n");
    return 0;
}
On my system, the command to compile this is:
gcc prog_file_page_fault.c -o prog_file_page_fault
Running it should produce output similar to this (the exact counter values will vary between runs):
Creating bigfile.bin (1073741824 bytes)...
File created. Mapping to memory...
Mapped. Touching each page (~4096 bytes per step)...
Touched page 0 : majflt=1 minflt=155
Touched page 1 : majflt=2 minflt=155
Touched page 2 : majflt=3 minflt=155
Touched page 3 : majflt=4 minflt=155
...
Touched page 262142 : majflt=262143 minflt=155
Touched page 262143 : majflt=262144 minflt=155
Done touching pages.
All done.
The majflt counter increments by 1 for every touched page, which means:
-> The page being worked on in the C loop was not yet in physical memory
-> The OS had to load it from the storage device into RAM
-> The kernel saw a file-backed page fault event
-> It updated the page tables and resumed the execution of the program
The minflt counter not moving means:
-> we are never accessing something that is already in physical memory/RAM and just needs to be mapped in
-> this counter would move if we accessed data that is already in RAM but not yet mapped into this program’s process page table (a small demo of this is shown below)
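To see minflt move instead, it is enough to touch memory that never has to come from disk. A minimal sketch, assuming Linux and an anonymous (not file-backed) mapping; the 256 MB size is arbitrary:

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/resource.h>

// Minimal sketch: anonymous memory only needs zero-filled pages, so touching it
// generates minor faults, not major ones.
int main(void) {
    size_t pagesize = sysconf(_SC_PAGESIZE);
    size_t size = 256UL * 1024 * 1024;        // 256 MB of anonymous memory
    struct rusage usage;

    char *map = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    getrusage(RUSAGE_SELF, &usage);
    printf("before: majflt=%ld minflt=%ld\n", usage.ru_majflt, usage.ru_minflt);

    for (size_t i = 0; i < size; i += pagesize)
        map[i] = 1;                           // first touch: one minor fault per page

    getrusage(RUSAGE_SELF, &usage);
    printf("after : majflt=%ld minflt=%ld\n", usage.ru_majflt, usage.ru_minflt);

    munmap(map, size);
    return 0;
}

On a typical 4096-byte page size this should report roughly 65,536 more minor faults after the loop than before it, while the majflt counter barely moves.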
Usually a program wouldn’t print its internal statistics this way, but at least on my computer this executes too fast for simply monitoring /proc/$(pidof prog_file_page_fault)/stat from the outside to be practical.
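If you do want to pull the counters from /proc rather than getrusage(), a process can also parse its own /proc/self/stat; per proc(5), minflt and majflt are the 10th and 12th fields. A minimal sketch:

#include <stdio.h>
#include <string.h>

// Minimal sketch: read minflt/majflt for the current process from /proc/self/stat.
// Per proc(5) the fields after the "(comm)" entry are:
//   state ppid pgrp session tty_nr tpgid flags minflt cminflt majflt cmajflt ...
// comm may contain spaces, so we scan from the last ')'.
int main(void) {
    char buf[4096];
    FILE *f = fopen("/proc/self/stat", "r");
    if (!f) { perror("fopen"); return 1; }
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    char *p = strrchr(buf, ')');
    unsigned long minflt, majflt;
    if (!p || sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %lu %*u %lu",
                     &minflt, &majflt) != 2) {
        fprintf(stderr, "unexpected /proc/self/stat format\n");
        return 1;
    }
    printf("minflt=%lu majflt=%lu\n", minflt, majflt);
    return 0;
}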
What does this mean for MariaDB?
If a major page fault happens, the OS must read data from the storage device, which is slow.
This is a major problem for MariaDB, because InnoDB manages its own caching layer (the buffer pool, for data and indexes), and the OS should not be swapping it out: the InnoDB buffer pool is meant to stay in RAM. If Linux is forced to push parts of the buffer pool out to the storage device, accessing them again causes major page faults inside InnoDB's own caching layer, which slows queries down dramatically.
Basically, if this happens, the OS is forced to do work that MariaDB expected to avoid/control.
The root cause can boil down to many different things:
-> Concurrency exceeds what the memory can handle. Each connection keeps its own buffers and its own view of the in-flight transactions, so too many connections can end up consuming a lot of RAM this way.
-> innodb_buffer_pool_size issues: it is not trivial to set right and depends on the available resources, but aim for a high Buffer pool hit rate (SHOW ENGINE INNODB STATUS\G)
-> Full table scans of large tables; these can pull in a lot of new/cold pages, i.e. a lot of page faults
-> Range scans with bad indexes; MariaDB needs to walk large portions of the table data
-> Big IN(...) lists, which can end up selecting such a big portion of the table that the optimizer will prefer a full table scan
-> ORDER BY on a large data set, e.g. filesort will use a temporary file if sort_buffer_size is less than the size of the data being sorted
Ideally, during normal operations, there should be no major page faults, but minor page faults are OK.
It’s also important to consider that MariaDB does still rely on the OS page cache for other things, like temporary files, binary logs, etc.
innodb_buffer_pool_size
-> Too small: the database will perform more disk I/O, which leads to slower queries, and the system might just be sitting on a lot of unused physical memory (unless the server itself was provisioned with too little RAM)
-> Too big: if too much RAM is allocated for this, it might start starving other processes, which will cause the OS to swap parts of the InnoDB buffer pool between storage and RAM: this will be catastrophically slow in terms of query response times
-> Default value is 128MB: very small for realistic production workloads. For production, start at around 70% of the server's RAM (or less if the database shares the machine with the application it serves) and keep monitoring to see if this value makes sense
Configuring innodb_buffer_pool_size
Buffer Pool Usage
-> innodb_buffer_pool_pages_total: total number of pages in the buffer pool. This is analogous to how the OS allocates virtual memory; it is the total number of pages that make up the cache InnoDB uses to hold data and indexes.
-> innodb_buffer_pool_pages_free: number of free (unused) pages in the buffer pool. The ratio between this and the total tells us whether the cache is being actively used. If a huge portion of pages is consistently free, the buffer pool size might be too big. If it's consistently zero, it might be too small.
Buffer Pool Hit Ratio
-> innodb_buffer_pool_read_requests: total number of logical read requests
-> innodb_buffer_pool_reads: total number of physical reads from the storage device(s)
To calculate the Hit Ratio:
hit_ratio = (1 - (innodb_buffer_pool_reads/innodb_buffer_pool_read_requests)) * 100
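For example, with made-up numbers: if innodb_buffer_pool_reads is 20,000 and innodb_buffer_pool_read_requests is 5,000,000, then hit_ratio = (1 - 20000/5000000) * 100 = 99.6%.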
Ideally the result of this hit ratio calculation should be somewhere >95%. This by itself doesn’t mean that queries will perform well, but just that the database can access the data and indexes it needs fast.
Important: not every problem can be fixed by tweaking system-wide configuration. If, for example, too much data is indexed (lots of unnecessary, old indexes that could be eliminated), it will use up pages in the buffer pool without giving any performance boost.