MariaDB replication lagging up to a dozen times a day for no obvious reasons

We first set up MariaDB replication on less powerful servers, but given our increased database load we had to rent more powerful ones, and that's when the trouble began.

Right now the replication between the master and the slave occasionally lags by anywhere from 0 to 1000 seconds for reasons we have not been able to identify. This happens on average a dozen times a day. Given the nature of my company's business, we cannot afford any lag at all.

Master

The master has a six-core Intel CPU, 128 GB of RAM, and 1 TB SSDs in a RAID 1 configuration, with a load average slightly above 2. It runs MariaDB 10.4.2x on Ubuntu 20.04 LTS.

For roughly the past 18 hours there have been just two replication-related errors/warnings in the log:

Oct 21 12:00:35 master.com mysqld[808]: 2022-10-21 12:00:35 5677386 [Warning] Aborted connection 5677386 to db: 'unconnected' user: 'replication' host: '1.2.3.4' (Got an error writing communication packets)
Oct 21 12:03:46 master.com mysqld[808]: 2022-10-21 12:03:46 5780035 [Warning] Aborted connection 5780035 to db: 'unconnected' user: 'replication' host: '1.2.3.4' (Got an error writing communication packets)
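
These warnings are the only replication-related hint the master gives. They don't point to a root cause by themselves, but since "Got an error writing communication packets" can be related to network write timeouts or to the slave's I/O thread dropping and re-establishing the dump connection, the relevant timeout settings are worth a look on both servers (just a check, not a confirmed cause):

MariaDB [(none)]> SHOW GLOBAL VARIABLES WHERE Variable_name IN ('net_read_timeout', 'net_write_timeout', 'slave_net_timeout');

slave_net_timeout applies to the slave's I/O thread (how long it waits for data before it considers the connection dead and reconnects); the net_*_timeout values apply to the master's side of the dump connection.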

Slave

The slave has exactly the same hardware configuration and a load average close to 1.

MariaDB [(none)]> show slave status\G 
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: master.com
                   Master_User: replication
                   Master_Port: 3306
                 Connect_Retry: 5
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB: 
           Replicate_Ignore_DB: 
            Replicate_Do_Table: 
        Replicate_Ignore_Table: 
       Replicate_Wild_Do_Table: 
   Replicate_Wild_Ignore_Table: 
                    Last_Errno: 0
                    Last_Error: 
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 133718212
               Relay_Log_Space: 133718872
               Until_Condition: None
                Until_Log_File: 
                 Until_Log_Pos: 0
         Seconds_Behind_Master: 0
                 Last_IO_Errno: 0
                 Last_IO_Error: 
                Last_SQL_Errno: 0
                Last_SQL_Error: 
   Replicate_Ignore_Server_Ids: 
              Master_Server_Id: 105
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-105-9982390796
       Replicate_Do_Domain_Ids: 
   Replicate_Ignore_Domain_Ids: 
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 2226622
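
For completeness, since the status above shows Parallel_Mode: conservative and GTID-based replication, these are the related settings on the slave. I'm listing this check only to rule things out, not because the output proves anything is misconfigured:

MariaDB [(none)]> SHOW GLOBAL VARIABLES WHERE Variable_name IN ('slave_parallel_mode', 'slave_parallel_threads', 'slave_parallel_max_queued', 'gtid_strict_mode');

With slave_parallel_threads = 0 the conservative mode has no effect and all transactions are applied by a single thread.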

We have a 1 Gbit connection between the servers, and it's almost perfect: ping -f reports less than 0.002% packet loss:

root@master # ping -f slave.com
PING slave.com 56(84) bytes of data.
.^C
--- slave.com ping statistics ---
71492 packets transmitted, 71491 received, 0.00139876% packet loss, time 25747ms
rtt min/avg/max/mdev = 0.239/0.339/1.941/0.066 ms, ipg/ewma 0.360/0.335 ms
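
ping -f only exercises small ICMP packets, so as an extra check (not evidence of anything yet) a sustained TCP throughput test between the two hosts can be run with iperf3, assuming it is installed on both and its default port 5201 is reachable:

root@slave  # iperf3 -s
root@master # iperf3 -c slave.com -t 30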

I've searched the Internet and haven't found any satisfactory answers in terms of understanding the root cause or how to debug the issue.

Right now it's working as intended, but earlier, while we were experiencing the issue, I saw this on the master:

MariaDB [(none)]> SELECT * FROM information_schema.processlist WHERE User="replication"\G
           USER: replication
           HOST: slave.com:47200
             DB: NULL
        COMMAND: Binlog Dump
           TIME: 2041
          STATE: Master has sent all binlog to slave; waiting for binlog to be updated

I'm not sure what to make of it. How come it has "sent all binlog to slave"? Why is the lag increasing at that point? What's the holdup? Why aren't the master's SQL queries being sent to the slave?
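
I assume (this is my own reasoning, not something the processlist output above confirms) that the way to tell which side is falling behind during the next incident is to compare three positions: what the master is currently writing, what the slave's I/O thread has fetched, and what the slave's SQL thread has applied:

-- on the master: the binlog file and position currently being written
MariaDB [(none)]> SHOW MASTER STATUS;

-- on the slave: Read_Master_Log_Pos is what the I/O thread has fetched,
-- Exec_Master_Log_Pos is what the SQL thread has applied
MariaDB [(none)]> SHOW SLAVE STATUS\G

If Read_Master_Log_Pos keeps up with the master's Position but Exec_Master_Log_Pos falls behind, the applier (SQL) thread is the bottleneck; if Read_Master_Log_Pos itself falls behind, the network or the I/O thread is.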

Please advise.

Rick James: If you can capture it in the act, provide `SHOW FULL PROCESSLIST`.

Artem S. Tashkinov: Master: `Master has sent all binlog to slave; waiting for binlog to be updated`. Slave: `Waiting for master to send event`. Nothing else.

Rick James: Is it actually lagging, or is the value of Seconds_Behind_Master bogus?

Artem S. Tashkinov: When I was writing the question, there was no lag. When a lag occurs, it grows up to 1000 seconds.

Rick James: Does it randomly bounce between 0 and 1000?

Artem S. Tashkinov: Nope, it slowly grows from 0 to 1000 (by a second or two at a time, I guess), then resets.

Rick James: If it goes from 0 to 1000 (seconds) in only 1-2 seconds, that sounds like a very old bug for which I have never found a cause or cure. It eventually quits.

Artem S. Tashkinov: @RickJames No, it grows gradually from 0 to ~1000 over roughly 2/3 of the lag time. If you could help me debug this, I'd be forever grateful. My email address is all over the web. I want to file a bug report against MariaDB because I really don't know where to look.

Rick James: When it happens again, grab `SHOW SLAVE STATUS` and `SHOW FULL PROCESSLIST` (on the Replica).

Artem S. Tashkinov: @RickJames This needs to be run on the master, right? Do you need anything from the slave?

Rick James: Run those on the Replica (aka Slave).

Artem S. Tashkinov: @RickJames I posted a new question with the data you asked for here: https://dba.stackexchange.com/questions/319051/mariadb-replication-lagging-up-to-a-dozen-times-a-day-for-no-obvious-reasons
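
A minimal sketch of how the data Rick James asked for could be captured automatically on the replica the next time the lag starts, assuming the mysql client authenticates via the default client config (e.g. root over the unix socket); the 30-second threshold, 5-second interval, and log path are arbitrary choices:

#!/bin/bash
# Run on the replica: poll every 5 seconds and dump diagnostics while the lag exceeds 30 seconds.
while true; do
    LAG=$(mysql -e 'SHOW SLAVE STATUS\G' | awk '/Seconds_Behind_Master/ {print $2}')
    if [ "$LAG" != "NULL" ] && [ "$LAG" -gt 30 ] 2>/dev/null; then
        {
            date '+%F %T'
            mysql -e 'SHOW SLAVE STATUS\G'
            mysql -e 'SHOW FULL PROCESSLIST'
        } >> /var/log/replication-lag-capture.log
    fi
    sleep 5
done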