Sync Gateway cluster

I’ve searched this forum for “Sync Gateway cluster”, and it seems all the cluster discussions are about a Couchbase cluster (and how it’s supported by Sync Gateway), so allow me to raise the question: does Sync Gateway support a cluster of itself?

The Sync Gateway documentation sounds as if clustering is supported:
HTTP://developer.CouchBase.com/documentation/mobile/current/develop/guides/sync-gateway/deployment/index.html#story-h2-2

So we’ve set up a 2-node cluster of Sync Gateway 1.3.1, and then we noticed that some clients’ changes were not synced; we saw many log entries like:
“WARNING: Skipped Sequence … didn’t show up in MaxChannelLogMissingWaitTime, and isn’t available from the * channel view. If it’s a valid sequence, it won’t be replicated until Sync Gateway is restarted. – db.(*changeCache).CleanSkippedSequenceQueue.func1() at change_cache.go:220”

While wondering whether/how Sync Gateway cluster nodes communicate with each other about their own caches of changes/sequences, we’re not sure whether this is caused by the cluster or by some other cause(s). In order to isolate the problem(s), allow us to raise the question: does Sync Gateway officially support a cluster of itself?

BTW, we have a load balancer (AWS ELB) in front of our Sync Gateway cluster, so in theory a client’s first request can reach node 1 and its second request could reach node 2; we hope that’s supported by Sync Gateway.

Sync Gateway is a stateless server and doesn’t have to be aware of other nodes. As such, there is no concept of a Sync Gateway cluster.

In the config file of each Sync Gateway node you must specify the server IP of one of the Couchbase Server nodes, and that’s it.
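As a rough illustration (the hostname, bucket name, and the “db” database name below are just placeholders, not values from your deployment), a per-node Sync Gateway 1.x config can be as small as something like this:

    {
      "interface": ":4984",
      "adminInterface": ":4985",
      "databases": {
        "db": {
          "server": "http://cb-node1.example.com:8091",
          "bucket": "sync_gateway",
          "sync": "function(doc, oldDoc) { channel(doc.channels); }"
        }
      }
    }

Every Sync Gateway node would use the same config and point at the same bucket, so the nodes coordinate only through Couchbase Server rather than talking to each other directly.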

You can follow this part of the training, which has a few scripts to make it a bit easier to install the Sync Gateway and Couchbase Server nodes from the command line. Although the training covers NGINX instead of ELB, the overall architecture is the same.
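To sketch that architecture (this is only an assumed example with placeholder IPs and timeouts, not the exact config from the training), an NGINX front end for two Sync Gateway nodes might look roughly like this; an ELB plays the same role:

    upstream sync_gateway {
        server 192.168.1.10:4984;
        server 192.168.1.11:4984;
    }

    server {
        listen 4984;
        client_max_body_size 20m;

        location / {
            proxy_pass          http://sync_gateway;
            proxy_http_version  1.1;
            proxy_set_header    Connection "";
            # Continuous _changes feeds stay open far longer than a typical
            # HTTP request, so use a generous read timeout and no buffering.
            proxy_read_timeout  360s;
            proxy_buffering     off;
        }
    }

With an ELB the equivalent consideration is the idle-timeout setting, which would likely need raising for long-lived replication connections.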

Is there anything in particular you’ve noticed that isn’t functioning as expected?
I will investigate the log message you pasted above.

James

Thanks a lot to @jamiltz for the response, as well as for looking into the log message pasted above.

The thing in particular we’ve noticed that isn’t functioning as expected is that some clients’ changes are not synced.

Glad to see the cluster-setup training material mentioned by @jamiltz, as another hint that a Sync Gateway cluster is officially (designed to be) supported.

Speaking of Sync Gateway as a stateless server, I guess that depends on how we interpret “state”. I can see that each Sync Gateway node has no local storage and therefore starts up with no (start-up) state, so it is (start-up) “stateless”. On the other hand, I’m aware that a Sync Gateway node does build its own cache of changes/sequences, which is a kind of run-time “state”, one that prevents a “Skipped Sequence” from being “replicated until Sync Gateway is restarted”, if that’s indeed what the log message above means.

I’m just trying to contribute some brainstorming to the community, such as wondering aloud whether some run-time state/cache needs to be communicated among all the nodes. Do we have some kind of implementation document on run-time state/cache, so that the community can help examine whether each piece of run-time state/cache needs to be communicated within a cluster or not, please? Or, if that kind of exercise has already been done, have we documented that due diligence anywhere, please? I had searched this forum and didn’t find related info.

Again, really appreciated, @jamiltz.

As you have pointed out previously, Sync Gateway instances may allocate sequence numbers themselves.

This log message can be seen in an environment with multiple SG nodes and in particular scenarios. If a document gets rejected (maybe because it doesn’t adhere to the rules in the Sync Function), then the sequence number assigned to it is still tracked in Sync Gateway’s sequence cache. If no other documents get processed by that Sync Gateway instance within MaxChannelLogMissingWaitTime (which defaults to 60 minutes), then this sequence number never shows up and will appear as being skipped.
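To make the rejection case concrete, here is a hypothetical sync function (not necessarily the one in your deployment) that rejects any document without a “type” property; per the explanation above, the sequence associated with such a rejected write can then end up reported as skipped:

    // Hypothetical sync function, for illustration only.
    function (doc, oldDoc) {
        if (!doc.type) {
            // Reject the revision; the client receives a 403 Forbidden.
            throw({forbidden: "missing type property"});
        }
        // Route accepted documents into the channels they name.
        channel(doc.channels);
    }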

Could this scenario be similar to yours?

So we’ve set up a 2-node cluster of Sync Gateway 1.3.1, and then we noticed that some clients’ changes were not synced

Was that not the case with a 1-node cluster (i.e., did all the documents get synced)?

James

Thanks for the analysis.

Sorry, we’re not sure whether that hypothetical scenario is similar to ours or not. What kinds of logs should we look for, please?

As for whether all documents were synced with the Sync Gateway singleton or not, sorry, we’re not sure about the past (singleton) period before the cluster. We didn’t notice documents not being synced with the singleton; nonetheless, that doesn’t prove all documents were synced in the past (singleton) before the cluster. The thing we are sure of is that “WARNING: Skipped Sequence …” also showed up in Sync Gateway singleton logs:

  • Sync Gateway 1.2 singleton: “WARNING: Skipped Sequence … didn’t show up in MaxChannelLogMissingWaitTime, and isn’t available from the * channel view. If it’s a valid sequence, it won’t be replicated until Sync Gateway is restarted. – db.(*changeCache).CleanSkippedSequenceQueue.func1() at change_cache.go:220”

  • Sync Gateway 1.3.1 singleton (see 1st post above for the log)

What else can we look for, please?

We have reverted back to a 1.3.1 singleton; we haven’t noticed documents not being synced so far, and no “WARNING: Skipped Sequence …” yet. What else can we try in order to restore the cluster for high availability as well as scalability, please?

I suppose that by singleton you mean a single instance of Sync Gateway.

If you notice an issue that only occurs with multiple instances of SG, it would be helpful to have the logs from when the issue occurs and to narrow down the steps to reproduce it; otherwise there’s not much we can do to help.

James

Yes, by singleton I meant a single instance of Sync Gateway.

The unusual logs from when we noticed some clients’ changes not being synced were provided above. While we keep on monitoring and investigating, if you or anyone comes across specific logs or Couchbase document(s) we should look for, please let us know.

One disclaimer, though. Although the issue of some clients’ changes not being synced was noticed in a Sync Gateway cluster, and this thread is titled “Sync Gateway cluster”, that is not proof that a Sync Gateway singleton/stand-alone won’t have the issue. Therefore, any non-cluster thoughts/ideas about specific logs or Couchbase document(s) worth attention are appreciated as well.

OK, great, thanks. I will keep an eye out for similar topics discussed here or on the issue tracker.