We initially set up MariaDB replication on less powerful servers, but as our database load grew we had to rent more powerful ones, and that's when the trouble began.
Right now replication between the master and the slave occasionally lags anywhere from 0 to 1000 seconds for reasons we've been unable to identify, roughly a dozen times a day on average. Given the nature of our company's business, we cannot afford any lag at all.
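(For reference, a lag check along these lines — purely an illustrative sketch, the monitoring user and password are placeholders, not our actual setup — would surface these spikes by sampling Seconds_Behind_Master on the slave:)

# Illustrative only: 'monitor'/'secret' are placeholder credentials.
# Prints a timestamped Seconds_Behind_Master once a minute.
while true; do
    printf '%s ' "$(date '+%F %T')"
    mysql -h slave.com -u monitor -p'secret' -e "SHOW SLAVE STATUS\G" \
        | awk '/Seconds_Behind_Master/ {print $2}'
    sleep 60
done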
Master
The master has a six-core Intel CPU, 128 GB of RAM and 1 TB SSDs in a RAID 1 configuration, with load averages slightly above 2, running MariaDB 10.4.2x on Ubuntu 20.04 LTS.
For roughly the past 18 hours there have been just two replication-related errors/warnings in the log:
Oct 21 12:00:35 master.com mysqld[808]: 2022-10-21 12:00:35 5677386 [Warning] Aborted connection 5677386 to db: 'unconnected' user: 'replication' host: '1.2.3.4' (Got an error writing communication packets)
Oct 21 12:03:46 master.com mysqld[808]: 2022-10-21 12:03:46 5780035 [Warning] Aborted connection 5780035 to db: 'unconnected' user: 'replication' host: '1.2.3.4' (Got an error writing communication packets)
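For what it's worth, my understanding is that this particular warning often points at network or timeout settings rather than at replication itself; the variables I would assume are worth checking (illustrative query, not captured output from our servers):

-- Timeouts commonly associated with
-- "Aborted connection ... (Got an error writing communication packets)"
SHOW GLOBAL VARIABLES
WHERE Variable_name IN ('net_read_timeout', 'net_write_timeout',
                        'slave_net_timeout', 'max_allowed_packet');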
Slave
The slave has exactly the same HW configuration and load averages close to 1.
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: master.com
Master_User: replication
Master_Port: 3306
Connect_Retry: 5
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 133718212
Relay_Log_Space: 133718872
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Seconds_Behind_Master: 0
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 105
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-105-9982390796
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 2226622
We have a 1 Gbit connection between the servers and it's nearly flawless; ping -f reports less than 0.002% packet loss:
root@master # ping -f slave.com
PING slave.com 56(84) bytes of data.
.^C
--- slave.com ping statistics ---
71492 packets transmitted, 71491 received, 0.00139876% packet loss, time 25747ms
rtt min/avg/max/mdev = 0.239/0.339/1.941/0.066 ms, ipg/ewma 0.360/0.335 ms
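(ping only measures loss and latency; if raw throughput turned out to matter, I assume a check of roughly this shape with iperf3 would show it — we have not captured any such output:)

root@slave # iperf3 -s             # listen on the slave
root@master # iperf3 -c slave.com  # measure bandwidth from the master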
I've searched the Internet but haven't found any satisfactory answers that would help me understand the root cause or figure out how to debug this.
Right now replication is working as intended, but earlier, while we were experiencing the lag, I saw this on the master:
MariaDB [(none)]> SELECT * FROM information_schema.processlist WHERE User="replication"\G
USER: replication
HOST: slave.com:47200
DB: NULL
COMMAND: Binlog Dump
TIME: 2041
STATE: Master has sent all binlog to slave; waiting for binlog to be updated
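In case it clarifies the question, here is the kind of snapshot I assume would show whether it is the I/O thread or the SQL thread that falls behind next time the lag occurs (illustrative commands only, nothing captured yet):

-- On the slave, while the lag is growing:
-- (compare Read_Master_Log_Pos vs Exec_Master_Log_Pos, and see what the
--  SQL/worker threads are doing)
SHOW SLAVE STATUS\G
SHOW PROCESSLIST;

-- On the master at the same moment:
-- (is the binlog position ahead of what the slave has already read?)
SHOW MASTER STATUS;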
I'm not sure what to make of it. How come it has "sent all binlog to slave"? Why is the lag growing at that very moment? What's the holdup? Why aren't the master's SQL queries being sent to the slave?
Please advise.