Can I run UPDATEs on a replicated MySQL slave, safely?
Problem
We would like access to accurate MySQL backups for debugging, without any personally identifiable information (PII) in the backups.
Is it possible to run UPDATEs on the replicated database (i.e. replacing text columns with Lorem Ipsum) without breaking replication?
(Note: the schema allows us to identify newly updated rows very easily, so it is trivial for us to find new PII quickly.)
Solution
Yes, with many caveats. I run literally this setup to expose the Wikipedia databases publicly, and I would not recommend it unless you have no alternative. We use triggers that update or blank private fields, replication filters to exclude entire tables or databases, and views to limit certain per-row access.
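To make the trigger/filter approach concrete, here is a minimal sketch; the table and column names (`users`, `email`, `full_name`, the `app` database) are hypothetical, not from the original setup. Note that stock MySQL does not fire triggers for row-based replication events; MariaDB 10.1+ can, via the `slave_run_triggers_for_rbr` variable.

```sql
-- Replica-side trigger that blanks PII as rows arrive.
-- Hypothetical table/column names; adapt to your schema.
CREATE TRIGGER mask_users_pii
BEFORE INSERT ON users
FOR EACH ROW
SET NEW.email     = 'lorem.ipsum@example.com',
    NEW.full_name = 'Lorem Ipsum';
```

```ini
# my.cnf on the replica: skip tables that are entirely private
# (illustrative names).
replicate-ignore-table      = app.sessions
replicate-wild-ignore-table = app.private_%
```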
Problems we faced:
- Statement-based replication is very flexible, but if certain writes depend on existing data (e.g. INSERT ... SELECT, or updates based on modified columns), it will either break replication or create data drift. You should restrict application writes to inserts, and to updates of fields you are not going to mask - but outside of simple models (insert-only workloads, for example) it is rarely possible to control every case. The breakage can be very subtle, and it will happen over time if the application is even minimally complex.
- Row-based replication is more accurate (you get the final state of each row), but it does not fire triggers on the replica. We moved to MariaDB 10.1 to get replica-side triggers for row-based events (which solves the data drift), but we still have issues because row-based replication has limited support for differing schemas on master and replica.
- Triggers and filters make things harder to manage and leave more room for error.
- If possible, I would suggest not using real-time replication, but instead filtering dumps at export time (in our case this is not possible due to the size of the database, and because some use cases require real-time data).
- Because the replica holds different data, running pt-table-checksum and similar comparison tools to check for data discrepancies is not really possible.
When we have the resources, we will likely move to a custom filtering automation that parses the row-based binary log in near real time and performs more complex filtering at the application layer.
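The core of such an application-layer filter could look like the sketch below: a pure function that masks PII columns in a decoded row event before it is applied downstream. The event-as-dict shape, function name, and PII column list are all assumptions for illustration; in practice the rows would come from a binlog client library such as python-mysql-replication.

```python
# Hypothetical masking step for decoded row-based binlog events.
LOREM = "Lorem ipsum dolor sit amet"
PII_COLUMNS = {"email", "full_name", "phone"}  # assumed schema knowledge

def mask_row(row: dict) -> dict:
    """Return a copy of the row with PII columns replaced by filler text."""
    return {col: (LOREM if col in PII_COLUMNS else val)
            for col, val in row.items()}

masked = mask_row({"id": 42, "email": "alice@real.example", "city": "Oslo"})
print(masked)  # email replaced with filler; id and city untouched
```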
Context
StackExchange Database Administrators Q#209418, answer score: 2
Revisions (0)
No revisions yet.