Recommended approach for handling sync when Couchbase server has millions of documents

Hi

My Couchbase (CB) model represents multiple companies, each with multiple users. The companies typically have 10k-100k documents each. All docs are in one bucket.

I have both a web app and a mobile app. The web app talks to Couchbase Sync Gateway (CSG) through its RESTful API, and the mobile app uses Couchbase Lite, so it also goes through CSG. The only changes that might bypass CSG are those made using CB’s admin console, but they are few.

Given the large number of documents in CB, I do not want them all “tracked” in CSG. Setting “enable_shared_bucket_access=true” and “import_docs=true” replicates all CB docs in CSG, which is unnecessary. A lot of the documents will not change often, if ever, so it feels like a waste to have them bogging down the CSG’s hardware resources.

I’ve read a little about “import filters”, so I guess there might be a way to filter imports on some datetime field. But that still leaves the case of a new mobile user who expects all of his/her company’s data to be on the mobile device after signing up. My mobile app would then not be able to sync from CSG alone; I would also have to manually pull from CB the docs not covered by the configured datetime import filter.
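
To make the datetime idea concrete, here is a minimal sketch of the decision such an import filter would make. It assumes the documents carry an ISO 8601 `updatedAt` field written by the app; that field name, the `CompanyDoc` shape and the cutoff date are all hypothetical. A real Sync Gateway import filter is a plain JavaScript function in the database config that receives the document and returns true or false, so this TypeScript only illustrates the logic.

```typescript
// Hypothetical document shape; `type` and `updatedAt` are illustrative field names.
interface CompanyDoc {
  type?: string;
  updatedAt?: string; // ISO 8601 timestamp, assumed to be written by the app
}

// Illustrative cutoff: only documents modified on or after this instant are imported.
const IMPORT_CUTOFF_MS = Date.parse("2020-01-01T00:00:00Z");

// Same shape as an import filter: takes the document body and
// returns true if the document should be imported.
function importFilter(doc: CompanyDoc): boolean {
  if (!doc.updatedAt) {
    return false; // no timestamp: treat the document as static and skip it
  }
  const updatedMs = Date.parse(doc.updatedAt);
  // Reject unparseable timestamps as well as anything older than the cutoff.
  return !isNaN(updatedMs) && updatedMs >= IMPORT_CUTOFF_MS;
}
```

Anything the filter rejects never becomes visible to Couchbase Lite clients through CSG, which is exactly the gap described above for a newly signed-up mobile user.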

Currently using:

  • CB Server = v6
  • CSG = v2.6.1

With that background, a couple of questions:

  1. Various forum examples mention “import_docs=continuous” and it seems to work, yet the current documentation (https://docs.couchbase.com/sync-gateway/2.6/config-properties.html) indicates it is a boolean with only true/false. Where can I find the documentation explaining the “continuous” import_docs setting?
  2. What is the recommended approach for handling sync with CSG when your CB server has got millions of documents, some of which might not change often?

  1. Various forum examples mention “import_docs=continuous” and it seems to work, yet the current documentation (Legacy Pre-3.0 Configuration | Couchbase Docs) indicates it is a boolean with only true/false. Where can I find the documentation explaining the “continuous” import_docs setting?

Use true. It is functionally equivalent to continuous, which is deprecated in favor of true.

  2. What is the recommended approach for handling sync with CSG when your CB server has got millions of documents, some of which might not change often?

Not sure I understand the concern. After the initial sync, documents that don’t change won’t be synced again. You can use channels to separate the documents by type, so that the semi-static docs are in a separate channel that’s filtered out on Couchbase Lite (i.e. the clients can do an on-demand pull on this channel as needed).
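
As a rough sketch of the channel split described above: the routing decision can be a small function from document to channel name. The `companyId` and `type` fields, the list of "static" types and the `-static` / `-active` naming are all assumptions made up for illustration. In an actual sync function (plain JavaScript in the SG config) the result would be passed to `channel(...)`.

```typescript
// Hypothetical document shape; `companyId` and `type` are illustrative field names.
interface Doc {
  companyId: string;
  type: string;
}

// Illustrative list of document types treated as semi-static.
const STATIC_TYPES = ["archive", "reference"];

// Returns the channel a document would be routed to:
// "<companyId>-static" for semi-static types, "<companyId>-active" otherwise.
function channelFor(doc: Doc): string {
  const suffix = STATIC_TYPES.indexOf(doc.type) >= 0 ? "static" : "active";
  return doc.companyId + "-" + suffix;
}

// Inside a real sync function this would be used roughly as:
//   function (doc, oldDoc) { channel(channelFor(doc)); }
console.log(channelFor({ companyId: "acme", type: "archive" }));
console.log(channelFor({ companyId: "acme", type: "invoice" }));
```

With a split like this, a Couchbase Lite client could replicate only its company's active channel continuously and pull the static channel in a separate one-shot replication when needed.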

Hi @priya.rajagopal

My concern relates directly to making optimal use of server resources in order to reduce costs. It may well stem from my lack of understanding of CSG details, so bear with me.

I’m assuming that all Couchbase documents for a bucket have to be kept in memory by CSG and are not persisted anywhere by CSG, because it cannot function if not connected to a CB server (excluding walrus mode). If I know I’m going to have a large number of long-lived, close-to-static documents, I don’t want those documents to just sit and unnecessarily consume memory in the CSG. The cost difference between a cloud instance with 4GB vs 8GB vs 16GB is considerable over a year, and even more so if one runs multiple CSGs.

So I’m basically trying to ensure I follow best practices while also optimising for cost.

PS: Thanks for mentioning the “import_docs=continuous” deprecation

Hi, that’s not true. Sync Gateway actually persists its metadata in the Couchbase Server bucket where it is reading the documents from. So there’s no requirement to keep this stuff in memory.

What SG does keep in memory is only used for caching to serve requests for active documents. These caches are managed and items are evicted from them, so only a very small, active subset of documents is in memory on the Sync Gateway side. The caches are all fully tuneable via the SG config.
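
For a sense of what that tuning might look like, here is a small sketch that builds a per-database `cache` section and prints it as JSON. The property names (`rev_cache.size`, `channel_cache.max_number`, `channel_cache.max_length`) are given as I recall them from the 2.x configuration reference and should be checked against the docs for your version. The database name `db` is a placeholder, and the numbers are illustrative, not recommendations.

```typescript
// Assumed subset of the per-database `cache` section of a Sync Gateway 2.x config.
// Verify the property names against the config reference for your version.
interface CacheConfig {
  rev_cache: { size: number }; // number of document revisions kept in memory
  channel_cache: {
    max_number: number; // upper bound on the number of channel caches
    max_length: number; // upper bound on entries kept per channel cache
  };
}

// Illustrative values only.
const cache: CacheConfig = {
  rev_cache: { size: 5000 },
  channel_cache: { max_number: 50000, max_length: 500 },
};

// "db" is a placeholder database name.
const configFragment = JSON.stringify({ databases: { db: { cache } } }, null, 2);
console.log(configFragment);
```

Lowering these bounds trades memory for more reads against Couchbase Server, so they are worth measuring under real load before changing.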

Thanks @bbrks, the “managed caches” are reassuring.

That 2nd link is very useful. I’ve seen that doc before, but reading it again after your comment above suddenly made it clearer to me.