Question About CB Clusters and Mixed OS


#1

We are going to test an idea (a POC, if you will) in our DEV environment using two separate Couchbase (CB) clusters. We have a 2-node cluster in Data Center 1 on Windows 2016 running CB v5.1, and we want to spin up a new 2-node cluster in Data Center 2 on RedHat 7.4 running CB v5.5. They will be two different clusters: no XDCR, not talking to each other, fully independent. Our application only stores keys, so it doesn’t require persistence; we can basically point it at one cluster, then pick up and move to a new cluster, and the app will keep working, or should. We are going to test our developers’ skills: they need to configure the web configs to point to the DC1 cluster and also list the IPs of the DC2 cluster, so that once the DC1 cluster is no longer available the app automatically starts using DC2. It is sort of our own HA/DR solution using two different clusters, both always on and ready to go if the other isn’t available.

Does anyone see any issues with this theory/idea/POC? The long-term plan isn’t to have a mixed OS, but it shouldn’t matter in this case since we are not using the clusters together with XDCR etc.; one is Windows and one is RH, and we point to one or the other. In PROD we’d have the mixed mode for a little while, but would eventually get to a 4-and-4 solution of 8 RH nodes.

Any glaring issues I’m not seeing?


#2

Hi @NowhereMan,

What you’re describing sounds like “multi-cluster awareness” (MCA), which is currently available for the Java SDK, and I believe will be coming to other SDKs soon enough. But basically, that’s the idea: if one cluster goes down, the clients can switch over to the other cluster. If the SDKs you’re using don’t have MCA available yet, then you may have to do a little extra work on your own. But yes, it sounds reasonable. I don’t think the OS really figures into it (though it could figure into other, related factors). More information on MCA here: https://blog.couchbase.com/couchbase-high-availability-disaster-recovery-java-multi-cluster-aware-client/


#3

Yes, what I’m describing is basically a homemade version of MCA, but because our SDK is PHP it’s not available yet; I think the PHP SDK is low on Couchbase’s list to update. It would be nice if it were available, and we’d for sure move to using it. But since it’s not there, our PHP developers are going to have to come up with some logic to do what MCA does.


#4

Actually, this isn’t like Multi-Cluster Awareness in that there is no XDCR. That feature is aligned directly to the XDCR feature.

I guess I’m not clear, from what I see, on what would happen to data written to one cluster when you switch to the other. If you’re suggesting that you’d write to both, that leaves you open to consistency issues, since you can’t undo a write if it partially fails. A partial write will also be visible to other actors unless you always do quorum reads. And if you do quorum reads, you’ll possibly need more reads to establish quorum, meaning either three clusters or more reads at a later time.

Finally, one related item: we do not support using the SDK across a wide area network. The reason for this limitation is that certain kinds of retries and topology updates depend on LAN-like timings and throughput. There is nothing in the SDK that prevents it from working over a WAN, but we don’t test those situations, and a WAN can throw off the timings, since all cluster nodes might be higher latency and inaccessible at the same time. It could be worked around with tunables, but you’ll need to test/verify for your particular environment.

I’m sure there are pieces I don’t know the background on here, but based on what I see above, the lowest risk approach might be to just have a set of app servers ready or instantiated at the time of failure that are local to each cluster.


#5

Hello, thanks for your reply. Our two DCs are not connected across a WAN; they are on the same campus, just in different buildings, so it should still be a LAN and the SDK should be fine, as the latency is nowhere near that of a WAN. We had considered trying the same design with another DC in a different city, but that latency did worry us, and now you’ve confirmed it’s neither a great design nor supported.

So what we have in mind is this: the app servers are here at our main location in DC1, and we have one CB cluster in DC1 and one in DC2 (which is still here, just in a different building). We’ll have the developers write code using the SDK to point to DC1 until it can no longer reach any nodes there, then start using the DC2 cluster. Again, it’s all on the same campus and the same LAN (different subnets), so all in all it’s just like having two different instances of CB to point to locally.
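
For readers following along, the "use DC1 until none of its nodes answer, then use DC2" rule can be sketched roughly as below. This is only an illustration in Python, not PHP SDK code: the node addresses are made up, `choose_cluster` and `tcp_reachable` are hypothetical helper names, and the probe is a bare TCP connect to port 11210 (assumed here to be the data-service port), which shows that a node accepts connections but not that the cluster is actually healthy.

```python
import socket
from typing import Callable, Optional, Sequence

# Hypothetical addresses; the real ones would come from the web configs.
DC1_NODES = ["10.0.1.11", "10.0.1.12"]
DC2_NODES = ["10.0.2.11", "10.0.2.12"]

KV_PORT = 11210  # assumption: default Couchbase data-service port


def tcp_reachable(host: str, port: int = KV_PORT, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_cluster(
    primary: Sequence[str],
    secondary: Sequence[str],
    probe: Callable[[str], bool] = tcp_reachable,
) -> Optional[Sequence[str]]:
    """Prefer the primary while any of its nodes answers; otherwise fall back.

    Returns None when neither cluster has a reachable node.
    """
    if any(probe(node) for node in primary):
        return primary
    if any(probe(node) for node in secondary):
        return secondary
    return None
```

Passing the probe in as a parameter keeps the decision logic testable without a live cluster, and lets a stronger health check be swapped in later.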


#6

No data replicated between the clusters? Is this being used mainly as a cache?


#7

Yes, I guess you can say that. They are just storing keys in Couchbase, so there would be no replication between clusters; that’s how the IT manager wants us to test it and see if it can work. We’ve kind of already proven it can work. We had an older 4.5 CB cluster that we migrated off of to our newer 5.1 CB cluster. One day, right during a heavy load time, the new cluster cascade-failed on us, losing 1, then 2, then 3 nodes. We quickly had our build release team point the web configs back to the 4.5 cluster, which was still up and running, and were able to get our application back online within 15 minutes. So he wants a similar scenario we can use if that were to happen again. I’ve tried to explain XDCR to him, but he doesn’t want any data that could possibly corrupt one cluster to get copied over to the other. Since it’s just storing/caching keys, this scenario is what he’s after: hot/hot, with hopefully some logic in the SDK or adapters that will do a health check against one cluster’s set of IPs and, if it can’t talk to them, fail over and use the second set of IPs.


#8

Thanks. I guess I’m not sure how this “write once, read never” data will be used, but what you describe should be possible. If writes to one cluster aren’t good, start writing to another. :)

Regarding the health check, note that the SDKs have two functions for that. The one most useful here is ping(), which checks all expected-to-be-active services on all nodes for an SDK instance. See the docs for details.
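
As a rough illustration of how a ping()-style result could feed a failover decision, here is a small Python sketch. The report shape used here (a service name mapped to a list of endpoint entries, each with `remote` and `state` fields) is an assumption made for the example, not the documented output of any SDK, and `healthy_endpoints` and `cluster_usable` are hypothetical helper names; check the SDK docs for the real structure.

```python
from typing import Dict, List, Mapping

# Assumed report shape, for illustration only:
# {"kv": [{"remote": "10.0.1.11:11210", "state": "ok"}, ...]}
Report = Mapping[str, List[Dict[str, str]]]


def healthy_endpoints(report: Report, service: str = "kv") -> List[str]:
    """List the endpoints of one service whose state is reported as 'ok'."""
    return [
        entry["remote"]
        for entry in report.get(service, [])
        if entry.get("state") == "ok"
    ]


def cluster_usable(report: Report, service: str = "kv", min_ok: int = 1) -> bool:
    """Treat the cluster as usable if at least min_ok endpoints answered."""
    return len(healthy_endpoints(report, service)) >= min_ok
```

Raising `min_ok` makes the check stricter, for example requiring two of four data nodes to answer before the cluster counts as available.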


#9

Hello. To the best of my understanding of our developers’ code, it’s not write once, never read; it’s used as just a cache, so they write, read, and mutate, and all documents will have a non-zero TTL. So the deal is that the data is not intended to be source data; we just need to have CB available to use. The cluster-aware feature would be nice, but it’s not ready for the PHP SDK yet; otherwise that would be our choice. I think one of our top developers has written some code that will test a few nodes in Cluster 1 and, if it can’t find them, start moving all the web servers to point to Cluster 2, then check back for Cluster 1 again later.
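
The "fail over, then check back for Cluster 1 later" behaviour could look something like the sketch below. It is a guess at the general shape in Python, not the developer's actual code: `FailoverSelector`, the `primary_ok` callback, and the 60-second recheck interval are all hypothetical, and the cluster names are just labels.

```python
import time
from typing import Callable


class FailoverSelector:
    """Tracks which cluster is active and periodically retries the primary."""

    def __init__(
        self,
        primary_ok: Callable[[], bool],
        recheck_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._primary_ok = primary_ok  # health probe for Cluster 1
        self._recheck = recheck_seconds
        self._clock = clock            # injectable so tests need not sleep
        self._active = "cluster1"
        self._next_recheck = 0.0

    def active(self) -> str:
        """Return the label of the cluster that should be used right now."""
        now = self._clock()
        if self._active == "cluster1":
            # Leave the primary as soon as its probe fails.
            if not self._primary_ok():
                self._active = "cluster2"
                self._next_recheck = now + self._recheck
        elif now >= self._next_recheck:
            # On the secondary: probe the primary only once per interval.
            if self._primary_ok():
                self._active = "cluster1"
            else:
                self._next_recheck = now + self._recheck
        return self._active
```

This version probes Cluster 1 on every call while it is active, which is simple but chatty; probing only after an operation fails is a likely refinement. Since nothing is replicated between the clusters, each switch also starts from whatever the other cluster's cache happens to hold.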

We had professional services out this week, and we are expanding our clusters too, by a lot. So we’ll have 2 clusters with 4 nodes each, in two different data centers.

Now I need to figure out if we can ditch Elasticsearch and have the devs start using the new Search features in Couchbase instead!