Score:2

Server

How should I automate replication of a production MySQL db onto a read/write dev server for testing while also scrubbing the data first?

EdwinJ

4/14/24, 4:21 PM

I know there are lots of variables here, and they're highly dependent upon the applications in the environment as well as the needs of the org. I read this post first, as the question was similar. [https://serverfault.com/questions/380701/replicate-main-mysql-db-to-a-development-server-to-play-with-real-data][1]

I thought I could add some depth from my perspective to the question and hopefully get some recommendations very specific to our environment.

We have a standalone server that hosts an online Learning Management System.
The LMS has thousands of users and its db is being written to quite regularly.
The desire is to take a "snapshot" of the db, and replicate it onto the development server for testing, new builds, etc.
The snapshot needs to be relatively close (within a few days atmost) to the live data because of the dynamic nature of the db.
The dev server needs to be read/write, but never affect the prod server at all.
The data in the db will have PII and needs to be sanitized before accessed by the dev team.

I'm thinking the general steps should be:

Automate a mysqldump on some sort of schedule from prod to dev
Run a script on dev to clean/anonymize the data
Automate a restore of the clean data, on a schedule, into dev

Based on these requirements, am I overlooking anything obvious that could impact the data in either server or the load/resources on the server?

2 + 2

mysql

replication

dev

Ginnungagap

4/14/24, 9:00 PM

Voting to close this because it's mostly opinion based. However PII will normally impose measures that the dev environment doesn't meet so the data should be scrubbed before leaving the environment with the appropriate safeguards.

Rick James

4/17/24, 4:34 PM

@Ginnungagap - I disagree. They are asking for techniques and tools, not opinions. The mention of PII merely points out an important, time-consuming step in the process; they are not asking how to do PII.

Score:1

Server

Rick James

4/17/24, 4:21 PM

Plan A

Consider the following setup:

Primary-Replica setup (ordinary replication)
The Replica needs extra disk space. (Recommend 3x the size of the dataset.)
The Replica has LVM already set up.
On the Replica, take a snapshot -- this takes a minute or so, regardless of dataset size.
Anonymize the data (etc) in the snapshot.
Use the snapshot for your testing.
The snapshot is separate from the Replica, which continues to receive writes.
LVM depends on "Copy on write", so the snapshot starts out taking virtually no disk space, but grows in size over time. When finished with testing (or it is time to do a new snapshot), blow it away to recoup the disk space.

Plan A2 -- By having more than one Replica being used this way, multiple dev/test/build setups could be handled.

Plan B

Set up Galera Cluster across 3 "nodes". All 3 will have the same data.
When wanting to test, take one node out of the Cluster to use for dev.
When finished testing, but it back into the Cluster; it will automatically resync. (Give it time to resync before using it again for dev.)
Note that if anonymizing the data is too intensive, the resync will take a long time.

Plan B2 -- Plan B, but with 4 nodes; this gives you full automatic failover even when one node is being used by dev.

Plan B3 -- Plan B, but use a "garbd" to avoid needing a full 3rd node. (Failover is lost with this plan unless you have 4 nodes.)

Plan B4 -- With 5 nodes (one optionally a garbd), you could ping-pong two of the nodes for use by devs. One is busily anonymizing, while the other is being actively used. (I would not go beyond 5 nodes, for a variety or reasons.)

Notes

The advice here is 'general', not specific to LMS.
Pulling a dev server takes only minutes (plus anonymizing).
Replicas (either regular or Galera) provide convenient techniques for taking backups.

+ 0

Score:1

Server

John Mahowald

4/15/24, 4:48 PM

Developers possibly working with sensitive PII data may complicate things immensely. Ask your compliance person about what risks this introduces, and what controls and training should be in place to get dev to the same level of respect for data as operations staff.

Not doing this full copy for dev and test environments may avoid PII entirely. Have support staff build out example data, sanitized from real life examples. Import configuration and build data only, not data about people. This keep them separate approach is a lot of work to keep maintained and realistic, but keeps dev and test small size without sensitive data.

There remains a need to test full restores, if only to test the backups, per IT's business continuity procedures. Backups should be done in ways with an acceptable impact to production performance, whether export like mysqldump, replication, or block storage snapshots.

Should there be a need to transform a copy of production into a non-prod environment, in my opinion it should be a separate support or stage environment, and developers should not have access to it, only support staff. Minimizes impact of a thing in testing doing things users would notice. The full copy non-prod environment may be down for extended periods of time, because re-writing all the PII takes a long time.

+ 2

Rick James

4/17/24, 4:37 PM

Subsetting the data and Anonymizing the data each introduce the risk of missing some key factor of testing or building.

John Mahowald

4/18/24, 5:32 PM

And letting developers or analysts touch production data has its own risks. If build move plus data anonymizing tools cannot produce a minimal reproducing example of a situation in production, the tools should be improved.

Elon Musk

I sit in a Tesla and translated this thread with Ai:

EN: How should I automate replication of a production MySQL db onto a read/write dev server for testing while also scrubbing the data first?

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.