patternsqlMinor
Is it possible to have read/write race conditions in a percona cluster?
Viewed 0 times
perconareadwritepossibleconditionshaveracecluster
Problem
I had a discussion today regarding an incident in a percona cluster. The setup was as follows:
A three node percona cluster load-balanced by a haproxy. Both reads and writes where balanced by the proxy to all three nodes.
The application (php) using the database, was coded as such, that it could perform a write on the database, and then immediately try to read the new record. The guy told me that there were cases that, if the immediate read (after the write) was balanced by the proxy to a different node than the one used for the write operation, it would not locate the record.
Is this possible in a percona cluster? As far as I know, the node that executes a write query, it must first receive confirmation by all percona nodes prior returning success to the client issuing the query. So how is this possible, even if the read query lands on a different server than the one used to perform the write operation?
A three node percona cluster load-balanced by a haproxy. Both reads and writes where balanced by the proxy to all three nodes.
The application (php) using the database, was coded as such, that it could perform a write on the database, and then immediately try to read the new record. The guy told me that there were cases that, if the immediate read (after the write) was balanced by the proxy to a different node than the one used for the write operation, it would not locate the record.
Is this possible in a percona cluster? As far as I know, the node that executes a write query, it must first receive confirmation by all percona nodes prior returning success to the client issuing the query. So how is this possible, even if the read query lands on a different server than the one used to perform the write operation?
Solution
A replication setting determines which types of queries first perform a causality check, i.e. have the executing server confirm that it is up-to-date.
Check the
https://www.percona.com/doc/percona-xtradb-cluster/5.5/wsrep-system-index.html
Our consultant advised setting 6 (i.e.
Note: After weeks of success with this setting, the problem suddenly came back. It disappeared as soon as we (temporarily) started doing all reads and writes on the same server, so something seems to be up. I'm wondering if there is another setting that comes into play and needs to be fixed as well.
Updated...
I believe we mitigated the issue in the previous note, but I can't seem to remember what exactly it was.
Use a single write node
In any case, I highly recommend using only a single write node. Since all writes are replicated anyway, all servers are doing comparable writes, so write scaling is severely limited - unlike read scaling. Writing on a single node, however, does give us certain advantages.
Avoid conflicting writes
Using a cluster with multiple write nodes has an unfortunate effect: if two simultaneous writes onto different nodes conflict, they may initially appear to succeed, but then one of them will fail at commit time. At least for application development, this has a significant impact. Every place in the application that runs this risk must implement some catch-and-retry logic... or we could just use a single write node!
Of course, you'll want high availability. We use ProxySQL to load balance. If the write node goes (or is taken) down, another node becomes the write node. Yes, only during failover there might be the risk described above, but that is only a very small window. Ideally, when doing a manual failover, we should momentarily queue incoming requests and let the old write node finish its work, before assigning a new write node.
Auto-increment is not always increasing
Auto-increment is not guaranteed to be always increasing! I found this a particularly disturbing issue that was unknown to many. The greatest ID is not necessarily the latest row. Counterintuitive and harder to reason about. There a various configuration options.
Auto-increment can be controlled using
Alternatively, as I understand from Percona, we can use a single write node and disable auto-increment control. The values should now always be increasing. During a failover, the new write node might fail a transaction or two at commit time if it has not replicated the latest auto-increment value yet, although I expect this should be mitigated by the causality setting discussed at the top of this answer.
I would go with the latter, and, just to be sure, instruct everyone to not rely on auto-increment being strictly increasing.
Do let us know if you can achieve a single write node and high availability with HAProxy. I have seen it done with ProxySQL, but it would be great to know if it can be done without it.
Check the
wsrep_sync_wait variable description here:https://www.percona.com/doc/percona-xtradb-cluster/5.5/wsrep-system-index.html
Our consultant advised setting 6 (i.e.
2 | 4), meaning UPDATE, DELETE, INSERT, and UPDATE perform a causality check. In other words, write queries. In the end, we decided to switch to 7 (i.e. 1 | 2 | 4), performing causality checks on reads as well. Although this makes reads take longer, the problem of outdated reads that you describe goes away.Note: After weeks of success with this setting, the problem suddenly came back. It disappeared as soon as we (temporarily) started doing all reads and writes on the same server, so something seems to be up. I'm wondering if there is another setting that comes into play and needs to be fixed as well.
Updated...
I believe we mitigated the issue in the previous note, but I can't seem to remember what exactly it was.
Use a single write node
In any case, I highly recommend using only a single write node. Since all writes are replicated anyway, all servers are doing comparable writes, so write scaling is severely limited - unlike read scaling. Writing on a single node, however, does give us certain advantages.
Avoid conflicting writes
Using a cluster with multiple write nodes has an unfortunate effect: if two simultaneous writes onto different nodes conflict, they may initially appear to succeed, but then one of them will fail at commit time. At least for application development, this has a significant impact. Every place in the application that runs this risk must implement some catch-and-retry logic... or we could just use a single write node!
Of course, you'll want high availability. We use ProxySQL to load balance. If the write node goes (or is taken) down, another node becomes the write node. Yes, only during failover there might be the risk described above, but that is only a very small window. Ideally, when doing a manual failover, we should momentarily queue incoming requests and let the old write node finish its work, before assigning a new write node.
Auto-increment is not always increasing
Auto-increment is not guaranteed to be always increasing! I found this a particularly disturbing issue that was unknown to many. The greatest ID is not necessarily the latest row. Counterintuitive and harder to reason about. There a various configuration options.
Auto-increment can be controlled using
wsrep_auto_increment_control & friends, having ServerA use 1, 4, 7, ..., ServerB use 2, 5, 8, ..., and ServerC use 3, 6, 9, .... This avoids contention of IDs, yes, but it also skips IDs and really creates non-incremental sequences.Alternatively, as I understand from Percona, we can use a single write node and disable auto-increment control. The values should now always be increasing. During a failover, the new write node might fail a transaction or two at commit time if it has not replicated the latest auto-increment value yet, although I expect this should be mitigated by the causality setting discussed at the top of this answer.
I would go with the latter, and, just to be sure, instruct everyone to not rely on auto-increment being strictly increasing.
Do let us know if you can achieve a single write node and high availability with HAProxy. I have seen it done with ProxySQL, but it would be great to know if it can be done without it.
Context
StackExchange Database Administrators Q#176248, answer score: 3
Revisions (0)
No revisions yet.