MongoDB Notes

Condensed operational notes from a MySQL/MariaDB PoV.

Backup & Filesystem Snapshots

-> Take a snapshot of the data directory (Debian: /var/lib/mongodb, RHEL: /var/lib/mongo)
-> Snapshot a secondary if possible (and only if it is not lagging)
-> To recover: restore the snapshot, then replay the oplog from that point onward
-> Same idea as ZFS snapshots + binlog replay on MySQL/MariaDB

db.fsyncLock() -> zfs snapshot tank/mongodb@bkp_{{DATETIME}} -> db.fsyncUnlock()
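
The lock / snapshot / unlock sequence above can be sketched as a small shell function. This is only an illustration: the dataset name, the helper names (run, locked_snapshot), the DRY_RUN switch and a mongosh that connects to the local mongod without extra options are all assumptions.

```shell
#!/bin/sh
# Sketch: lock writes, take the ZFS snapshot, then always unlock.
# DATASET and DRY_RUN are assumptions made for this example.
DATASET="${DATASET:-tank/mongodb}"

run() {
  # With DRY_RUN=1 the command is printed instead of executed.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

locked_snapshot() {
  local snap="${DATASET}@bkp_$(date -u +%Y%m%d-%H%M%S)"
  local rc=0
  # Abort early if the lock cannot be taken.
  run mongosh --quiet --eval 'db.fsyncLock()' || return 1
  # Remember a snapshot failure, but still fall through to the unlock.
  run zfs snapshot "$snap" || rc=$?
  run mongosh --quiet --eval 'db.fsyncUnlock()'
  return "$rc"
}
```

Running `DRY_RUN=1 locked_snapshot` prints the three commands without touching anything, which is a cheap way to check the snapshot name before doing it for real.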

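Take a dump of the oplog from a member that still has it (mongodump expects the --query filter as Extended JSON, not shell syntax such as Timestamp(...)):

mongodump --db=local --collection=oplog.rs --query='{ "ts": { "$gte": { "$timestamp": { "t": SNAPSHOT_UNIX_TS, "i": 1 } }, "$lt": { "$timestamp": { "t": DISASTER_UNIX_TS, "i": 1 } } } }' --out=/tmp/oplog_dump

-> $gte: The timestamp matching your ZFS snapshot creation time
-> $lt: The timestamp just before the disaster occurred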
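
Typing the filter by hand is error-prone, so here is a minimal sketch that builds it from two epoch-second values. The helper name oplog_query is hypothetical, and the commented usage assumes GNU date for the -d option.

```shell
# Sketch: build the mongodump --query filter in Extended JSON form.
# oplog_query is a hypothetical helper name.
oplog_query() {
  # $1 = snapshot time, $2 = time just before the disaster (UNIX seconds, UTC)
  printf '{"ts":{"$gte":{"$timestamp":{"t":%d,"i":1}},"$lt":{"$timestamp":{"t":%d,"i":1}}}}' "$1" "$2"
}

# Example usage (GNU date assumed):
#   start=$(date -u -d '2024-01-01 00:00:00' +%s)
#   end=$(date -u -d '2024-01-01 06:30:00' +%s)
#   mongodump --db=local --collection=oplog.rs \
#     --query="$(oplog_query "$start" "$end")" --out=/tmp/oplog_dump
```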

Then, on the failed instance:

systemctl stop mongod
zfs rollback tank/mongodb@bkp_{{DATETIME}}

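or wipe the old dataset manually and receive the snapshot from a saved stream:

cd /tank
mkdir mongodb                                      # optional: zfs recv creates the mountpoint itself
zfs recv tank/mongodb < /tmp/mongo-backup.zfs      # add -F if tank/mongodb still exists
cd ~
zfs set mountpoint=legacy tank/mongodb             # unmount from /tank/mongodb
zfs set mountpoint=/var/lib/mongodb tank/mongodb   # remount at the MongoDB data directory
chown -R mongodb:mongodb /var/lib/mongodb          # user/group is mongod on RHEL
/sbin/restorecon -rv /var/lib/mongodb              # SELinux systems only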

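Important: if the time between the snapshot and the problem is shorter than the oplog window, the manual oplog replay may not be needed; the restored node can rejoin the replica set and catch up on its own. Otherwise, continue with the manual replay.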
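
The window check itself is a single comparison. A minimal sketch, assuming both values are available as UNIX seconds; can_catch_up is a hypothetical helper, and the second value would be read off the "oplog first event time" line that rs.printReplicationInfo() prints on the sync source.

```shell
# Sketch: can a restored node still catch up through normal replication?
# can_catch_up is a hypothetical helper; both arguments are UNIX seconds (UTC).
can_catch_up() {
  # $1 = time of the restored snapshot
  # $2 = oldest entry still present in the sync source's oplog
  [ "$1" -ge "$2" ]
}

# Example values only.
if can_catch_up 1704067200 1704060000; then
  echo "inside the oplog window: rejoin the replica set and let it sync"
else
  echo "outside the oplog window: the node cannot catch up on its own"
fi
```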

DISABLE REPLICATION (start as a standalone on a different port, as the mongodb user so file ownership stays intact):

sudo -u mongodb mongod --port 27018 --dbpath /var/lib/mongodb

Replay the Oplog Using mongorestore

Use mongorestore with the --oplogReplay flag. The exported BSON file must be renamed to oplog.bson so that the utility recognizes it as an oplog to replay:

mkdir /tmp/replay
mv /tmp/oplog_dump/local/oplog.rs.bson /tmp/replay/oplog.bson

Replay the operations into the restored standalone instance:

mongorestore --port 27018 --oplogReplay /tmp/replay/

Once mongorestore has finished applying the operations, stop the standalone process and start MongoDB normally with the standard replica set configuration (restore any replication settings you changed in /etc/mongod.conf).

Sharded Cluster

MongoDB Sharded Cluster is not as bad as MySQL NDB.

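NDB table rows are spread across ALL data nodes, and a consistent backup requires freezing writes on all nodes at once. You cannot just snapshot one node.

In MongoDB, each shard is an independent replica set, so shards can be snapshotted separately. As long as the balancer is stopped and no chunks are moving, the result is approximately consistent; writes or cross-shard transactions that land between snapshots can still differ between shards, so take the snapshots as close together as possible.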

Backup procedure (with ZFS):

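#!/bin/bash
# mongodb-sharded-backup.sh
set -euo pipefail

# Stop the balancer (must run against a mongos), then allow 30 seconds of settle time
mongosh --quiet --eval 'sh.stopBalancer()'
# Always resume the balancer on exit, even if a snapshot fails
trap 'mongosh --quiet --eval "sh.startBalancer()"' EXIT
sleep 30

# Timestamp for snapshot
TIMESTAMP=$(date +%Y%m%d-%H%M%S)

# Snapshot all nodes in parallel
pids=()
for host in shard1-sec shard2-sec shard3-sec config1 config2 config3; do
  ssh "$host" "zfs snapshot tank/mongodb@backup-$TIMESTAMP" &
  pids+=("$!")
done

# Wait per PID so that a failed snapshot fails the script (a bare `wait` hides exit codes)
for pid in "${pids[@]}"; do
  wait "$pid"
done

echo "Backup complete: backup-$TIMESTAMP"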

Sharding Keys

A good shard key needs:

-> High cardinality. If the cardinality is low, it caps the number of chunks, and therefore the number of shards that can actually hold data.
-> Good alignment with query patterns. If queries do not include the shard key, they must hit many shards to retrieve relatively little information.

Failover

In MariaDB, the database nodes themselves know very little about the overall topology, and they don’t natively keep track of who the master is without external help. So failover and routing depend on MaxScale.

In MongoDB, the nodes themselves are smart. MongoDB uses a consensus algorithm built directly into the database binary. The nodes constantly ping each other via heartbeats every 2 seconds.

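If the Primary goes down, the remaining Secondary nodes notice the missed heartbeats, hold an internal election once the election timeout expires (10 seconds by default), and promote a new Primary completely on their own, without needing an external coordinator. A majority of voting members must still be reachable for the election to succeed.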

Routing Traffic (No MaxScale Needed)

In MariaDB, the application connects to a MaxScale listener, and MaxScale figures out where to send reads and writes.

In MongoDB, the intelligence is moved to the Application Driver. When an app starts up, the connection string lists the entire cluster: mongodb://node1,node2,node3/?replicaSet=my-cluster.

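The MongoDB driver connects to the cluster, discovers who the Primary is, and routes all writes there automatically. Reads also go to the Primary by default; setting a read preference such as secondaryPreferred sends them to Secondaries, which is roughly what the readwritesplit router does in MaxScale.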

If a failover occurs, the driver automatically learns about the new Primary from the remaining nodes.

No need to manage or maintain a middleman proxy server like MaxScale.
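
Since the connection string is the only piece of "routing config" left, generating it from a host list keeps applications and scripts in sync. A minimal sketch; mongo_uri is a hypothetical helper and the host names are the ones from the example above.

```shell
# Sketch: build a replica set connection string from a host list.
# mongo_uri is a hypothetical helper: replica set name first, then host[:port] entries.
mongo_uri() {
  rs_name="$1"
  shift
  # Join the hosts with commas, then strip the trailing comma.
  hosts=$(printf '%s,' "$@")
  printf 'mongodb://%s/?replicaSet=%s\n' "${hosts%,}" "$rs_name"
}

mongo_uri my-cluster node1 node2 node3
# prints mongodb://node1,node2,node3/?replicaSet=my-cluster
```

The result can be handed straight to mongosh or to an application driver; appending &readPreference=secondaryPreferred is one way to move reads off the Primary.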

In /etc/mongod.conf on all the MongoDB servers, ensure they all share the exact same replica set name:

replication:
  replSetName: "my-cluster"

Then, using mongosh on one of the servers:

rs.initiate({
  _id: "my-cluster",
  members: [
    { _id: 0, host: "10.0.0.1:27017" },
    { _id: 1, host: "10.0.0.2:27017" },
    { _id: 2, host: "10.0.0.3:27017" }
  ]
})

Basic CRUD Operations

MariaDB:

INSERT INTO users (name, email, age) VALUES ('Bob', 'bob@example.com', 30);

SELECT * FROM users WHERE age > 25;

UPDATE users SET age = 31 WHERE name = 'Bob';

DELETE FROM users WHERE name = 'Bob';

MongoDB:

db.users.insertOne({
  name: "Bob",
  email: "bob@example.com",
  age: 30
})

db.users.find({ age: { $gt: 25 } })

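// updateMany matches SQL UPDATE semantics; updateOne changes only the first match
db.users.updateMany(
  { name: "Bob" },
  { $set: { age: 31 } }
)

// deleteMany matches SQL DELETE semantics; deleteOne removes only the first match
db.users.deleteMany({ name: "Bob" })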

Basic Index Management

MariaDB:

SHOW INDEX FROM users;

CREATE INDEX idx_email ON users(email);

CREATE INDEX idx_user_status ON users(status, created_at);

DROP INDEX idx_email ON users;

MongoDB:

db.users.getIndexes()

db.users.createIndex({ email: 1 })

db.users.createIndex({ status: 1, created_at: 1 })

db.users.dropIndex("email_1")  // or { email: 1 }

Works similarly:

-> Single-field indexes
-> Compound indexes
-> Index order matters for compound
-> Drop unused indexes to save space/write performance

Basic Monitoring & Troubleshooting

MariaDB:

-- Check current connections
SHOW PROCESSLIST;

-- Check long-running queries
SELECT * FROM information_schema.PROCESSLIST WHERE TIME > 30;

-- Kill problematic query
KILL QUERY 12345;

MongoDB equivalent:

// Check slow queries (needs the profiler enabled first: db.setProfilingLevel(1, { slowms: 100 }))
db.system.profile.find({ millis: { $gt: 100 } }).sort({ ts: -1 }).limit(10)

// Check current operations
db.currentOp()

// Check long-running operations
db.currentOp({ "secs_running": { $gt: 30 } })

// Kill operation
db.killOp(12345)

Different commands, same workflow:

-> Identify slow/problematic operations
-> Analyze why (missing index? Bad query? Lock contention?)
-> Fix (add index, kill query, optimize)

Settings

RAM

By default, the WiredTiger cache takes the larger of 50% of (RAM - 1 GB) or 256 MB, and the filesystem cache (the ARC on ZFS) wants memory on top of that, so set it manually if the default doesn’t make sense.

Open /etc/mongod.conf and configure the storage.wiredTiger.engineConfig.cacheSizeGB setting:

storage:
  dbPath: /var/lib/mongodb
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 16
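
To judge whether an explicit cacheSizeGB is needed, it helps to see what the default would be on a given box. A minimal sketch of the documented default rule (the larger of 50% of (RAM - 1 GB) or 256 MB); default_wt_cache_mb is a hypothetical helper and works in MiB.

```shell
# Sketch: what WiredTiger would pick as its cache size by default.
# default_wt_cache_mb is a hypothetical helper; $1 = total RAM in MiB.
default_wt_cache_mb() {
  half=$(( ($1 - 1024) / 2 ))
  if [ "$half" -gt 256 ]; then
    echo "$half"
  else
    echo 256
  fi
}

# On Linux, total RAM could be read with:
#   total_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
default_wt_cache_mb 65536   # 64 GiB box -> 32256
default_wt_cache_mb 1024    # 1 GiB box  -> 256
```

On a ZFS host the number is mostly useful as an upper bound: whatever goes to WiredTiger is not available to the ARC.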

Oplog Size

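oplogSizeMB = daily_write_GB * retention_days * 1024 * (1 + buffer_pct / 100)

Set it as replication.oplogSizeMB in /etc/mongod.conf before the replica set is first started; on a running member, resize with the replSetResizeOplog admin command instead.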
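
A worked example of the sizing formula, as a minimal sketch. The helper name oplog_size_mb and the input numbers (20 GB of writes per day, 3 days of retention, 25% buffer) are made up for illustration.

```shell
# Sketch: oplog sizing arithmetic with integer math.
# oplog_size_mb is a hypothetical helper.
oplog_size_mb() {
  # $1 = daily writes (GB), $2 = retention (days), $3 = buffer (%)
  echo $(( $1 * $2 * 1024 * (100 + $3) / 100 ))
}

oplog_size_mb 20 3 25   # 20 * 3 * 1024 = 61440 MB, plus 25% -> 76800
```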

File Per Table

By default, the WiredTiger engine creates a separate file for every collection and every index (the equivalent of innodb_file_per_table=1).

MongoDB can additionally place collection data files and index files in different directories. This allows them to be mounted on separate ZFS datasets with different record sizes.

In /etc/mongod.conf, enable directoryForIndexes:

storage:
  dbPath: /var/lib/mongodb
  wiredTiger:
    collectionConfig:
      blockCompressor: snappy
    engineConfig:
      directoryForIndexes: true  # <-- Splits data and indexes into /collection and /index subdirs

In ZFS, tune the record size per dataset:

-> tank/mongodb/collection: recordsize=16k as a starting point (WiredTiger writes variable-size compressed blocks, so benchmark before settling)
-> tank/mongodb/index: recordsize=8k or 16k, depending on index access behavior

Other useful ZFS settings:

zfs set atime=off tank/mongodb
zfs set sync=standard tank/mongodb
zfs set compression=lz4 tank/mongodb

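Turn compression off for the journal if it lives on its own dataset (the WiredTiger journal is already compressed, so ZFS compression adds little value).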
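
The dataset layout described in this section could be created along these lines. This is a sketch under assumptions: the parent dataset is tank/mongodb mounted at the dbPath, the helper names are hypothetical, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
# Sketch: child datasets for collections, indexes and the journal.
# PARENT and the DRY_RUN switch are assumptions made for this example.
PARENT="${PARENT:-tank/mongodb}"

run() {
  # Print by default; set DRY_RUN=0 to really execute.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

create_mongo_datasets() {
  # Children inherit the parent's mountpoint, so they show up as
  # <dbPath>/collection, <dbPath>/index and <dbPath>/journal.
  run zfs create -o recordsize=16k "$PARENT/collection"
  run zfs create -o recordsize=16k "$PARENT/index"
  run zfs create -o compression=off "$PARENT/journal"
}

create_mongo_datasets
```

Create the datasets while the dbPath is still empty and fix ownership afterwards. With child datasets in place, backups need zfs snapshot -r on the parent so that all of them are captured in one atomic snapshot.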