Data model best practices for schemaless backend analysis


#1

Hello.
We are trying to model our data structure in Couchbase.
We have two buckets for operational use that demand high availability and contain billions of docs.
My question is about a third bucket I’m looking to add, which will contain schemaless raw data for each event in the system.
Let’s say we have event A with fields date, x1, y1 and event B with fields date, x2, y2.
Both of them will be written to the same bucket for future analysis that we can’t predict at present.
This bucket will be mostly writes and will be read only through views, never by key.
The views will be: count events by date, count events by date and event type, etc. (Of course, in real life there will be many more fields.)
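
To make the two example views concrete, here is a rough Python sketch of what they would compute. It is only a simulation of the logic, not Couchbase view code: a real view has a JavaScript map function that calls emit, and it can use the built-in _count reduce with the group_level query parameter. The event documents and the `type` field are made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events; field names and values are illustrative only.
events = [
    {"type": "A", "date": "2014-11-02T10:15:00", "x1": 1, "y1": 2},
    {"type": "B", "date": "2014-11-02T11:00:00", "x2": 3, "y2": 4},
    {"type": "A", "date": "2014-11-03T09:30:00", "x1": 5, "y1": 6},
]

def map_event(doc):
    """Build the compound key (year, month, day, type) a map function would emit."""
    d = datetime.strptime(doc["date"], "%Y-%m-%dT%H:%M:%S")
    return (d.year, d.month, d.day, doc["type"])

def count_by(docs, group_level):
    """Count keys truncated to group_level, mimicking a count reduce with grouping."""
    return Counter(map_event(doc)[:group_level] for doc in docs)

by_date = count_by(events, 3)           # count events by date
by_date_and_type = count_by(events, 4)  # count events by date and event type
```

Putting the date parts first and the event type last means one index can answer both questions, just by changing the grouping level.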

As I understand it, in this case I don’t need to keep all keys in RAM, so I will configure the minimum RAM for this bucket.

Are we using Couchbase wrong here? I’m not sure it is designed for this purpose.
In this case, will we get the “Metadata overhead warning”, and is there an option to turn it off?

We are currently using Couchbase Server 2.2.
(And we are trying to figure out whether Couchbase 3 is production ready so we can upgrade.)

Thanks in advance


#2

Hi there, I don’t see anything here that jumps out as a red flag. Separate buckets are generally a good way to isolate data when availability requirements, security isolation requirements, or resource governance vary widely. Based on the brief description above, resource governance for the third bucket sounds quite different from that of buckets #1 and #2, so a separate bucket would work.
If you are worried about metadata overhead, I’d highly recommend looking at the new caching option for keys and metadata in 3.0, called ‘full ejection’. It can reduce the metadata overhead drastically in this case. More information here: http://blog.couchbase.com/all-new-30-full-ejection-tuning-memory-large-databases
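
For a sense of scale, here is a back-of-the-envelope Python sketch of how much RAM keys and metadata alone would need if they all had to stay resident, which is the case without full ejection. The numbers are assumptions, not measurements: roughly 56 bytes of metadata per item (the figure commonly used in sizing guidance), a hypothetical 40-byte average key, and one replica.

```python
# Assumed figures for illustration only, not measured on any cluster.
METADATA_BYTES_PER_ITEM = 56   # approximate per-item metadata overhead
AVG_KEY_BYTES = 40             # hypothetical average key length

def resident_metadata_gib(num_docs, replicas=1):
    """RAM (GiB) needed just for keys + metadata when all of them stay in memory."""
    copies = 1 + replicas  # active copy plus replica copies
    total_bytes = num_docs * copies * (METADATA_BYTES_PER_ITEM + AVG_KEY_BYTES)
    return total_bytes / 2**30

needed = resident_metadata_gib(1_000_000_000)
print(f"{needed:.1f} GiB for keys and metadata alone")
```

Under these assumptions, one billion docs need well over a hundred GiB before any values are cached, which is why a small-quota bucket only makes sense when keys and metadata can be ejected too.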


#3

Thanks for the answer.
That full ejection feature looks right for the analysis bucket.
My greater concern is that a business question will come up that can’t be answered by querying the analysis bucket directly.
My plan is that, when such a business question arrives, I will create a view, query it on a schedule (let’s say every 5 minutes or every day), and write the results to an aggregated bucket. This process of querying the analysis bucket and writing to the aggregated bucket should be incremental. I’m still in theory mode, so I’m not sure this is good practice.
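
The incremental part of that plan could look roughly like the Python sketch below. It is only an illustration under assumptions: the view rows are plain (period, count) pairs, the aggregated bucket is stood in for by a dict, and the function name and checkpoint scheme are hypothetical. Only periods that are already closed get rolled up, so a day that is still receiving events is never counted twice.

```python
def incremental_rollup(view_rows, aggregated, checkpoint, current_period):
    """Merge closed, not-yet-processed periods into aggregated; return the new checkpoint.

    view_rows: iterable of (period, count); periods must sort chronologically.
    checkpoint: last period already rolled up, or None on the first run.
    current_period: the period still receiving writes, which is skipped.
    """
    new_checkpoint = checkpoint
    for period, count in view_rows:
        already_done = checkpoint is not None and period <= checkpoint
        still_open = period >= current_period
        if already_done or still_open:
            continue
        aggregated[period] = aggregated.get(period, 0) + count
        if new_checkpoint is None or period > new_checkpoint:
            new_checkpoint = period
    return new_checkpoint

# Two scheduled runs over hypothetical view results.
aggregated = {}
run1 = [("2014-11-01", 10), ("2014-11-02", 4)]
cp1 = incremental_rollup(run1, aggregated, None, "2014-11-02")
run2 = [("2014-11-01", 10), ("2014-11-02", 7), ("2014-11-03", 1)]
cp2 = incremental_rollup(run2, aggregated, cp1, "2014-11-03")
```

In a real job the checkpoint would be stored alongside the aggregates, and the view query itself could start from the checkpoint (for example with startkey) rather than filtering rows on the client.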


#4

So I went ahead and created that data model, using the full eviction option in Couchbase Server 3.0.1 on the analysis bucket.
But, as I feared, I’m getting the “Metadata overhead warning” over and over again.
I’m talking about a bucket that will have billions of documents with only 100MB of RAM allocated to it (because it is write-only).
So I will ask again: am I using Couchbase the wrong way here?
And is there a way to disable that warning?
@cihangirb

Thanks


#5

The metadata overhead warning should not happen if you have full ejection enabled. That warning must be coming from another bucket.

Your setup of billions of docs and 100MB of memory may be challenging IF you cannot flush to disk fast enough, that is, if your disk cannot drain as fast as your mutations come in. However, as long as you can free up enough memory and flush to disk fast enough, you should be fine.
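
A quick way to reason about that condition is the rough Python sketch below. All rates and sizes are made-up inputs, and treating the whole 100MB quota as free headroom is a simplification; it is not how the server itself accounts for memory.

```python
def seconds_until_full(ingest_per_sec, drain_per_sec, avg_item_bytes, free_ram_bytes):
    """Time until the unflushed backlog fills free RAM, or None if the disk keeps up."""
    backlog_growth = (ingest_per_sec - drain_per_sec) * avg_item_bytes  # bytes/sec
    if backlog_growth <= 0:
        return None  # drain rate >= ingest rate, so the backlog does not grow
    return free_ram_bytes / backlog_growth

headroom = 100 * 2**20  # the 100MB bucket quota, assumed entirely available

# Hypothetical workloads with 1 KiB items.
keeps_up = seconds_until_full(5_000, 6_000, 1_024, headroom)
falls_behind = seconds_until_full(10_000, 6_000, 1_024, headroom)
```

With these example inputs the first workload never fills memory, while the second exhausts the headroom in under half a minute, so sustained ingest above the disk drain rate is the case to watch for.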

I still think it is a good use of Couchbase.
Thanks
-cihan


#6

OK. You were right.
The warning does not happen when the bucket is set to full eviction.
Good news. I will try to post an update as we make progress with this approach.

Thanks a lot, Cihan.
@cihangirb


#7

Absolutely - anytime.
-cihan