Sync Gateway initial shadow scalability concerns

zirzirikos · May 12, 2015, 12:27pm

We are evaluating Couchbase + Sync Gateway for an application and have reached a capacity planning stage and have some concerns with our recent findings, mainly around how there’s no scalability to changing the sync function.

We have set up a CB bucket which will store about 3 million documents per month averaging about 1KB each. We have about 10000 users (~5000 concurrent) who would subscribe to one channel each (with little overlap). We have set up SG to shadow this bucket - we would like to keep documents indefinitely in the primary bucket as a data warehouse and be able to run historical analytics on it so we need CB’s full view functionality and our data loaders make extensive use of CAS and occasional use of locking, so we really need the shadowing.

The sync function evaluates certain (effectively immutable) properties of each document and assigns it to one or more channels, except if it finds a specific property value in which case it uses the throw({forbiddden:… syntax to exclude it from the SG bucket enitrely. A periodic task queries a view and flags documents with this property. Effectively about 150,000 documents at a time will not have this flag and hence be shadowed by the SG; everything else will be excluded via 403 forbidden.

We built a test harness that loaded our test cluster (2x 4-core VMs with 32GB of RAM each; 32GB for the main bucket across cluster, 4GB for the SG shadow bucket, SG instances running on same machines) with about 18,000,000 documents (about 6 months worth of expected data) with only a tiny fraction of this (about 1000) being assigned to channels by the Sync function, the rest throwing 403 forbidden in order to be excluded.

We started the test harness and it loaded the full data set into the cluster without issue (well, we had to switch to full eviction to keep it from running out of RAM). The problem started when we started the SG.

As noted in http://developer.couchbase.com/mobile/develop/guides/sync-gateway/sync-function-api-guide/intro/index.html#changing-function all SG instances in a cluster should have the same sync function and when it is changed, only one instance can be started until it has finished processing all documents.

The first problem is that when the sync function returns 403 forbidden, two log lines, including the entire document in slightly abbreviated form, are printed to the SG log regardless of what logging is enabled in the SG config. This effectively means that SG creates a log file on a par with the size of the entire bucket being shadowed, on only one machine. Until 1.1.0 arrives with support for logging to file and kill -HUP, rotating these logs is also problematic.

The second and more worrying problem is that this has taken hours, with SG taking up increasingly more RAM until it exhaust’s the server and is killed by oom-killer. This is particularly worrying since it seems that with a shadow setup, though we can horizontally scale the buckets and the sync gateway effectively indefinitely, if we ever need to change the sync function for whatever reason we will need one server with enough disk space to hold all the log files (granted, this would be addressed by the logging changes in 1.1.0) but more worryingly we would need to bring the entire cluster down until a single server could process the entire dataset, which it seems is not even possible with our current load. I assume (perhaps incorrectly) that after SG goes OOM and is restarted it picks up where it left off in processing the shadowed bucket, but we had it running for a total of about 8 hours yesterday and it’s still not done. Granted, our test VMs aren’t the beefiest out there but I am concerned that if we need to do this, say, a year after go-live, we’ll be looking at a 2-day outage at least.

Is there any way to scale the initial shadowing of a bucket in sync gateway? Would you recommend a different setup?

jamiltz · July 3, 2015, 3:17pm

Hi @zirzirikos,

This thread may help.
If you find a bug, it would be great to have the scenario in a github ticket so we can reproduce it and fix it.

James