Should I split my data into multiple buckets?


#1

Hi folks,

I have 2 document types that are linked by a common key (customer id). They have different usage properties though, so I was wondering if I would benefit from provisioning separate buckets for them.

  • Customer Record - this will be read frequently, but updated infrequently. I will probably use strong consistency so I can read my own writes. The average doc size will probably be about 10 KB, and let’s say I expect 1 million+ records.

  • Customer Event - this will be read infrequently, but will require fast writes. I anticipate these documents being immutable, so each write is an insert, not an update. The average doc size will probably be about 1-2 KB, and let’s say we get 1 million+ records per day and retain 1 year’s worth. It is possible this data will be synchronised with Apache Spark for real-time analysis purposes. Some events will necessitate an update to the related Customer Record.
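
For a rough sense of scale, the figures above can be turned into a back-of-envelope estimate. This is only a sketch: it assumes decimal units (1 KB = 1000 bytes), takes the "1 million+" figures as exactly 1 million, and counts raw document bytes only, ignoring metadata, replicas and indexes. The variable names are mine, for illustration.

```python
# Back-of-envelope sizing from the figures quoted above.
# Assumptions: decimal units, raw document bytes only
# (no per-document metadata, replicas, or indexes).
KB = 1000
GB = 1_000_000_000

customer_records = 1_000_000      # "1 million+" taken as 1 million
customer_doc_bytes = 10 * KB      # ~10 KB average

events_per_day = 1_000_000        # "1 million+ per day" taken as 1 million
event_doc_bytes_low = 1 * KB      # lower bound of the 1-2 KB range
event_doc_bytes_high = 2 * KB     # upper bound of the 1-2 KB range
retention_days = 365              # one year's retention

customer_total_gb = customer_records * customer_doc_bytes / GB
event_total_low_gb = events_per_day * retention_days * event_doc_bytes_low / GB
event_total_high_gb = events_per_day * retention_days * event_doc_bytes_high / GB

print(f"Customer Records: ~{customer_total_gb:.0f} GB")
print(f"Customer Events:  ~{event_total_low_gb:.0f}-{event_total_high_gb:.0f} GB")
```

Under those assumptions the events outweigh the customer records by well over an order of magnitude, which is part of why I am asking about separate buckets.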

There are some use cases where I will need to join across the 2 document types. They are not performance critical, so a union (or 2 separate queries and an in-app join) would probably be OK.
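
By an in-app join I mean something like the sketch below. It assumes the results of the 2 queries have already been fetched as lists of dicts and that both carry a `customer_id` field; the function name, field names and sample data are hypothetical, and no Couchbase SDK calls are shown.

```python
from collections import defaultdict


def join_events_to_customers(customers, events):
    """Attach each customer's events to its record, matching on customer_id."""
    # Group events by the shared key in a single pass.
    events_by_customer = defaultdict(list)
    for event in events:
        events_by_customer[event["customer_id"]].append(event)

    # Build a new dict per customer; customers with no events get an empty list.
    return [
        {**customer, "events": events_by_customer.get(customer["customer_id"], [])}
        for customer in customers
    ]


# Hypothetical results of the 2 separate queries.
customers = [
    {"customer_id": "c1", "name": "Alice"},
    {"customer_id": "c2", "name": "Bob"},
]
events = [
    {"customer_id": "c1", "type": "login"},
    {"customer_id": "c1", "type": "purchase"},
]

joined = join_events_to_customers(customers, events)
```

Grouping the events first keeps the join linear in the number of documents, rather than rescanning the event list for every customer.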

What do you think? Will the different usage patterns mean that the performance of both document types is compromised if they are in the same bucket?

I’m using Enterprise Edition 4.00, but I would be interested to know if the advice would be any different for the Community Edition, since it scales differently.

thanks

Nozzer


#2

Hi Nozzer,
You probably want two buckets so that you can manage them separately as they grow. The resource quotas can be different for each bucket, which might be useful if you expect very different sizes, growth and traffic patterns.

BTW, I’m the PM for the Couchbase Spark Connector, so I would be interested to hear more about your experience with it once you start using it. Let me know if you’re willing to share feedback.
Best,
-Will


#3

Thanks Will.

I’ll let you know how I get on with the Spark Connector.

Nozzer