Is a couchbase architecture using XDCR and no clustering viable?

DRBryan · March 28, 2014, 2:37pm

Hi, Sorry in advance for the long post, but this topic is a bit complex.

My company uses Couchbase as a data-store for logic that processes signaling transactions from 3G and 4G mobile data networks. The functionality we deliver together is roughly equivalent for all our customers at a functional level, however the network topology in place at larger carriers is significantly different than for regional ones.

Larger volume operators typically require Couchbase performance in the hundreds of thousands of couchbase transactions per second. We deploy together into unique network topologies with customer specific fault tolerant, highly available, geo-redundant architectures. Our technical teams work together to ensure each combined solution can be supported in a live tier one production mobile network.

Compared to the national carriers, regional operators have very straight forward network topologies, so they benefit greatly from more streamlined solution architectures. Regional operators also make up a respectable chunk of our revenue stream, so the entire ecosystem benefits if we provide these customers a solid solution and reasonable plan-path for growing it. A simplified architecture also has a positive impact on our margin because it is easier for us to sell and support the same solution repeatedly into this market.

A “simplified architecture” consists of a dual site, non-clustered, single node, XDCR based synchronization approach. After reading the Couchbase blogs, I understand this is not the “three node minimum” favored approach, but we don’t actually require clustering at all, just replication. For this reason, I’m hopeful this reduced approach will make sense. We would like to move forward with several pending architecture reviews, but are unable until engaging with you on this.

For regional operators any reduction in complexity and hardware footprint is a big win especially in light of their reduced performance and unique redundancy requirements. Specifcially:

Processing Requirements Are Relatively Low: like in the hundreds, or low-thousands of transactions per second. For this discussion we can assume peak rates of 4K TPS or less for any single Couchbase node.
A Reduced Hardware Footprint Is Desirable: Our solution is typically deployed in two separate data centers. This means that a three node cluster in each datacenter (with redundancy) requires six nodes, (as per the “How Many Nodes?” series of lectures). This architecture lopsides us considerably and when the customer understands that a single combined function node (running on a single laptop) more than handles twice their load. This begs an answer to obvious question, which is “Why am I putting in eight boxes when I only really need two to be redundant and one would handle all the load from the network?”
Failover and geo-redundancy complexity are drastically simplified by leveraging the network gateway’s failure mechanism (primary network interface fails to a secondary). The result for the application is the buckets we use to manage usage in Couchbase are no longer directed at sessions on the primary node. On failure those sessions stick with the secondary node until the problem can be addressed (failover is driven by the network not the application). So these network topologies are like “The HA Highlander”: one failure mechanism to rule them all. Deploying on a single Couchbase node and using XDCR to replicate avoids having to support many levels of redundant failover triggers, sharing of IP addresses and complicated recovery procedures. In the case of mobile data charging, if the gateway fails our application has no purpose anyhow, so there is no point of being able to operate in such an environment.

In the end, the perfect Couchbase solution we are advocating for these customers consists of a dual node configuration (one Primary and one Secondary) with bidirectional XDCR to keep things in sync. A network failover from Primary Control to Secondary Control “stays failed” until recovery operations can be completed. This means the Secondary Control Platform is used only in the event of a failure and is only needed long enough for the customer to troubleshoot and recover the Primary Platform. This model reduces contention issues on the database and duplicates usage reporting should the Primary Control Server (my company’s application), or the network gateway (customer’s box) exhibit “transient” availability. It also reduces the solution footprint considerably from eight nodes down to two yet still provides a roadmap for growth by adding additional nodes when/if required at either the Couchbase layer or the App.

Can anyone help with a recommendation for the best way to pursue proposing such a solution model that would be supported in production by the Couchbase organization?

Thanks in advance,

Bryan

cihangirb · March 28, 2014, 4:45pm

Thanks Bryan, thanks for the question. I am oversimplfying but I think the point you make is; I am doing XDCR already. Why do I need all this local redundancy. If I got that right, here is my concern; With your model, the frequency of failover and failback through XDCR is high. however the latency in XDCR replication compared to intra cluster replication is high. That means you may experience data loss because XDCR may not be cough up. would you be ok with that?
thanks
Cihan

DRBryan · March 28, 2014, 6:03pm

Thanks response Cihan,

Let’s stick to the oversimplification at this point. You more or less got it right and restating it below serves as a good platform to discuss your concerns…

…if XDCR based failover is already in place between two non-clustered hosts, is local clustered redundancy absolutely necessary if performance requirements are well within the bounds of the hardware in place .

Your concern about the frequency of fail-over makes complete sense. Basically it’s “never” supposed to happen as the gateway is engineered “never” to fail. In reality I’ve been doing this stuff for 15 years now and unfortunately I have seen Primary network gateways and charging interfaces fail a few times. I’ve never seen a Secondary fail in the time it takes to get the Primary interface back on line. In such an architecture the Secondary interface is basically an insurance policy to buy recovery time and ensure business continuity. I can provide more detail about the mechanism for failure (on a DIAMETER Gy interface), how we cope and the relevant design considerations, but the bottom line is if a Gateway failure ever does occur it stays failed until it’s manually failed back. That part of the equation is key because it is critical that the network remain stable from a state management perspective, even if the Primary Gateway Interface comes back. The gateway fails and then stays failed until the fault is isolated and dealt with. If the secondary interface dies --the gateway stays failed to it. Picking up the pieces of repeated automated fail-overs where this not put in place often times just isn’t tractable.

The other aspect I think is relevant to your concern is managing the risk for data loss in the event of a gateway fail-over. In particular the customer’s tolerance for potential data loss. Again oversimplifying is probably best here, but gateway failure configurations are based the DIAMETER protocol application stack. Mainly they have to do with timeout parameters and interface responsiveness between the network gateway and our solution. There is no voting, or quorum needed as the protocol is designed by nature to be binary. My understanding of XDCR is that it’s async and a bit slow. This means depending on load and network latency some transactions are “at risk”. Provided this risk could be quantified there shouldn’t be a problem. The relative values of each of these transactions isn’t so high and, the session management protocol and configuration of the gateways actually provide feature sets that address some of this at the interface layer. Bottom line here is that the tradeoff of a simplified architecture is of significantly more value. The trick is in conveying this in a way that the customer can understand. I think we are most of the way there, just need a bit of help with how to quantifying how much an XDCR queue might back up.

Does all that make sense and sound reasonable?

Thanks again.

DRBryan · April 2, 2014, 6:39pm

Hi Cihan,
Any word on this topic? Is there a better way to address this, email or call?

Thanks,

Bryan

cihangirb · April 2, 2014, 11:00pm

Apologies for the delayed response Bryan. Lets do a call.
Could you send me an email on cihan@couchbase.com so we can chat about this over the phone.
thanks
-cihan