Couchbase holds a lot of documents in memory, hence its enormous speed, but this also puts a greater demand on the memory size of the server(s) it’s running on.
I’m looking for the best compromise between two opposing strategies for storing documents in a NoSQL database. These are:
- Optimise for speed
Putting all the information into one (big) document has the advantage that a single GET retrieves everything, either from memory or from disk (if the document was purged from memory before). With schema-less NoSQL databases this is almost the intended approach. But eventually the document becomes too big and eats up a lot of memory, so fewer documents can be kept in memory in total.
- Optimise for memory
Splitting each document into several smaller documents (e.g. using compound keys as described in the question “Designing record keys for document-oriented database - best practice”), especially when each of those documents holds only the information needed in a specific Read/Update operation, would allow more (transient) documents to be held in memory.
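To make the trade-off concrete, here is a rough sketch in Python that compares the serialised size of one big document with the sizes of the split documents. All field names, values and the MSISDN are made up for illustration, and the JSON size is only a proxy: it ignores the key and per-document metadata that the database keeps in addition to the value.

```python
import json

# Hypothetical customer data; field names and values are invented for illustration.
profile = {"name": "Jane Doe", "age": 30, "gender": "f"}
revenue = {"sms_count": 12, "call_count": 7, "total_revenue_cents": 4550}
optin = {"periods": [{"opt_in": "2014-01-01", "opt_out": "2014-03-01"}]}


def size(doc):
    """Return the size of the JSON-serialised document in bytes."""
    return len(json.dumps(doc).encode("utf-8"))


# Strategy 1 (speed): everything in one document, fetched with a single GET.
big_doc = {"profile": profile, "revenue": revenue, "optin": optin}

# Strategy 2 (memory): one small document per concern, under compound keys.
msisdn = "6591234567"  # hypothetical mobile number
split_docs = {
    msisdn + ":profile": profile,
    msisdn + ":revenue": revenue,
    msisdn + ":optin": optin,
}

print("one big document:", size(big_doc), "bytes")
for key, doc in split_docs.items():
    print(key, size(doc), "bytes")
```

An update that only touches the counters has to load and rewrite the whole of `big_doc` in the first strategy, but only the much smaller `:revenue` document in the second.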
The use case I’m looking at is Call Detail Records (CDRs) from telecommunication providers. These CDRs typically run into hundreds of millions per day. Yet many customers don’t generate a single record on any given day (I’m looking at the South-East Asian market with its prepaid dominance and still lower data saturation). That means a large number of documents typically see a Read/Update maybe every other day, and only a small percentage see several Read/Update cycles per day.
One solution that was suggested to me is to build two buckets, with more RAM allocated to the bucket holding the more transient documents and less RAM allocated to the second bucket holding the bigger documents. That would allow faster access to the more transient data and slower access to the bigger documents, which e.g. hold profile/user information that hardly changes at all. I do see two downsides to this proposal though: one is that you can’t build a view (Map/Reduce) across two buckets, and the second is the extra overhead of closely managing the balance of memory allocation between both buckets as the user base grows.
Has anyone else been challenged by this, and what was your solution to the problem? What would be the best strategy from your POV and why? Clearly it must be something in the middle of both strategies; having only one document, or having one big document split up into hundreds of documents, can’t be the ideal solution IMO.
EDIT 2014-9-14: OK, this comes close to answering my own question, but in the absence of any offered solution so far, and following a comment, here is a bit more background on how I now plan to organise my data, trying to hit a sweet spot between speed and memory consumption:
- profile
This holds profile information from a table, not directly from a CDR. Less transient data goes in here, like age, gender and name. The key is a compound key consisting of the mobile number (MSISDN) and the word profile, separated by a “:”.
- revenue
This holds transient information like usage counters and variables accumulating the total revenue the customer has spent. The key is again a compound key consisting of the mobile number (MSISDN) and the word revenue, separated by a “:”.
- optin
This holds semi-transient information about when a customer opted into the program and when he/she opted out of the program again. This can happen several times and is handled via an array. The key is again a compound key consisting of the mobile number (MSISDN) and the word optin, separated by a “:”.
- connection_id
This holds information about a specific A/B connection (sender/receiver) made via voice or video call or SMS/MMS. The key consists of both mobile_no values, concatenated.
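The key scheme described here can be sketched in Python as follows. The function names and the example numbers are hypothetical; only the “MSISDN, colon, word” pattern and the plain concatenation for the connection key come from the description.

```python
SEPARATOR = ":"  # separator between the MSISDN and the document type


def profile_key(msisdn: str) -> str:
    """Key of the rarely changing profile document."""
    return f"{msisdn}{SEPARATOR}profile"


def revenue_key(msisdn: str) -> str:
    """Key of the transient counters/revenue document."""
    return f"{msisdn}{SEPARATOR}revenue"


def optin_key(msisdn: str) -> str:
    """Key of the semi-transient opt-in/opt-out document."""
    return f"{msisdn}{SEPARATOR}optin"


def connection_key(sender: str, receiver: str) -> str:
    """Key of an A/B connection: both numbers concatenated, no separator."""
    return f"{sender}{receiver}"


# Hypothetical numbers, for illustration only.
print(profile_key("6591234567"))
print(connection_key("6591234567", "6598765432"))
```

One thing to watch with plain concatenation: if the numbers can have different lengths, two different sender/receiver pairs could in principle produce the same key, so a separator or fixed-length numbers may be safer.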
Before these changes to the document structure I was putting all the profile, revenue and optin information into one big document, always keeping the connection_id as a separate document. This new storage strategy hopefully gives me a better compromise between speed and memory consumption, as I split the main document into several documents so that each of them holds only the information that is read/updated in a single step of the app.
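A single step of the app could then look roughly like the sketch below. It uses a plain Python dict as a stand-in for the bucket (not a real Couchbase client), and the CDR fields `sender` and `charge_cents` as well as the counter names are assumptions made up for this example.

```python
def apply_cdr(bucket: dict, cdr: dict) -> str:
    """Update only the revenue document of the sender for one incoming CDR."""
    key = f"{cdr['sender']}:revenue"
    # Read: fetch the small revenue document, or start a fresh one.
    doc = bucket.get(key, {"cdr_count": 0, "total_revenue_cents": 0})
    # Update: bump the counter and accumulate the revenue.
    doc["cdr_count"] += 1
    doc["total_revenue_cents"] += cdr["charge_cents"]
    bucket[key] = doc
    return key


# A dict standing in for the bucket; the data is hypothetical.
bucket = {"6591234567:profile": {"name": "Jane Doe", "age": 30, "gender": "f"}}

apply_cdr(bucket, {"sender": "6591234567", "charge_cents": 25})
apply_cdr(bucket, {"sender": "6591234567", "charge_cents": 10})

print(bucket["6591234567:revenue"])
```

The profile document is never read or written in this step, so it does not need to be in memory while CDRs are being processed.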
The split also takes care of the different rates of change over time, with some data being very transient (like the counters and the cumulative revenue field that gets updated with every incoming CDR) and the profile information being mostly unchanged. I hope this gives a better understanding of what I’m trying to achieve; comments and feedback are more than welcome.