Couchbase for everything? (new CouchDB involved)

I am a bit lost these days about which tool to choose.

I really want to jump into the NoSQL era.

It is late 2013 and BigCouch has been merged into CouchDB.
Couchbase is pretty fast.

My thoughts:

  1. CouchDB is preferable for all sorts of data mining and archiving.

It leans toward disk-based storage.

  2. Couchbase is preferable for storing login credentials, calculation summaries, and all sorts of data that need to be retrieved fast.

It leans toward RAM-based storage.

Replication, clustering, and auto-healing currently work equally well in both products.

I want the simplest stack.
Could I use only Couchbase + ElasticSearch for the whole deal? Isn't that a good solution for data mining too?
What happens if the data cannot fit in memory? Say we have 15 TB of data and 4 servers with 128 GB of RAM each. How will it be handled in the long run? Is there a disk-access strategy to relieve the RAM (a real hybrid approach: some operations relying only on disk, others on memory)?

All the blogs and forums cover the former version of CouchDB (before Cloudant's BigCouch made the jump) and so are totally outdated.

Please, a bit of “light” would help me :slight_smile:

Thanks,

Couchbase is a great solution for fast data. The reason it is so fast is that it stores all the keys in memory and tries to keep as much of the data in memory as well. Let's say you set up all 4 servers with no replicas and 100 GB dedicated to one bucket on each server: with a working set of 80% of that memory, you get 320 GB. That is roughly 2% (320 GB / 15 TB) of the data in memory.
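
To make that arithmetic explicit, here is a quick back-of-the-envelope check (just a sketch; the per-server quota and the 80% working-set ratio are the numbers quoted above, not Couchbase defaults):

```python
# Back-of-the-envelope: how much of the 15 TB can live in RAM?
servers = 4
bucket_quota_gb = 100        # RAM dedicated to the bucket on each server
working_set_ratio = 0.80     # fraction of that quota holding the working set
dataset_tb = 15

resident_gb = servers * bucket_quota_gb * working_set_ratio   # 320 GB
resident_fraction = resident_gb / (dataset_tb * 1000)

print(f"{resident_gb:.0f} GB resident = {resident_fraction:.1%} of {dataset_tb} TB")
# -> 320 GB resident = 2.1% of 15 TB
```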

So, with only 2% of the data in memory, we are going nowhere, right?

What is the time needed to seek on disk? What if the requested data are never the same (because of many different users), so the working set never fits in server RAM?

Say in a few months we will have 45 TB; we cannot afford 50 TB of RAM!

So having four 3 TB HDs set up as RAID 1+0 (RAID 10) gives you 4x read / 2x write. A good spinning HD does about 120 MB/sec, so that is 480 MB/sec read and 240 MB/sec write.
Whether you use a wide-column NoSQL store (Cassandra) or a key=>value one (Couchbase), real-time querying over 15 TB of data will probably require 2-5x that size in indexes. Using ElasticSearch on 15 TB is not easy either.
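
To put those throughput figures in perspective, here is a rough calculation (a sketch using only the numbers quoted above) of what a best-case sequential pass over 15 TB would cost at that aggregate read rate:

```python
# What the RAID 10 numbers above imply for scanning 15 TB of data.
disk_mb_s = 120                    # one spinning disk, sequential throughput
read_mb_s = 4 * disk_mb_s          # RAID 10: all four disks serve reads -> 480 MB/s
write_mb_s = 2 * disk_mb_s         # writes hit both mirrors -> 240 MB/s

dataset_mb = 15 * 1000 * 1000      # 15 TB expressed in MB (decimal units)
scan_hours = dataset_mb / read_mb_s / 3600

print(f"read {read_mb_s} MB/s, write {write_mb_s} MB/s")
print(f"best-case sequential scan of 15 TB: ~{scan_hours:.1f} h")
# -> ~8.7 h, and random access with seeks would be far slower
```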

If your data is doubling in months, not years, you need to look into Hadoop.
Hortonworks, Cloudera, and MapR are the big players. You can then take the output of the map/reduce jobs, put it into HBase, Cassandra, or Couchbase, and query it there.

Nice,

thanks for the directions.

What about CouchDB + ElasticSearch?
It seems to be a simpler stack…

What do you think about it ?

CouchDB + ElasticSearch is a great combo. The problems you are going to hit are clustering CouchDB and speed.
CLUSTERING -
You can use CouchDB Lounge (http://guide.couchdb.org/draft/clustering.html) to cluster, but you are caught mid-transition while BigCouch is merged into CouchDB, so soon you will not need the Lounge at all.
SPEED - CouchDB with ElasticSearch is great because you write to any CouchDB node and it magically appears in ElasticSearch for easy querying. But getting your data back out of ElasticSearch can be very painful (i.e. slow). Why slow? ElasticSearch is a GREAT indexing engine, not a fast database. The better way to do it is Couchbase + ElasticSearch: query ElasticSearch and bring back only the KEYS of the matching documents, which is not very painful, then do a bulk get (“keys from ES here”) against Couchbase and have the data back in a millisecond or faster.
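
A minimal sketch of that keys-then-bulk-get pattern in Python, assuming the elasticsearch-py client and the Couchbase Python SDK 2.x; the index name, bucket name, field, and query text are all placeholders:

```python
from elasticsearch import Elasticsearch
from couchbase.bucket import Bucket   # Couchbase Python SDK 2.x

es = Elasticsearch(["http://localhost:9200"])
cb = Bucket("couchbase://localhost/products")   # hypothetical bucket name

# 1. Ask ElasticSearch for matching document IDs only -- no _source payload,
#    so the response stays small and cheap.
resp = es.search(
    index="products",                 # hypothetical index name
    body={"query": {"match": {"description": "wireless headphones"}}},
    _source=False,
    size=100,
)
keys = [hit["_id"] for hit in resp["hits"]["hits"]]

# 2. Fetch the full documents from Couchbase in one bulk round trip,
#    served from RAM whenever the working set is resident.
docs = cb.get_multi(keys)
for key, result in docs.items():
    print(key, result.value)
```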

Yes, it may be a better choice performance-wise, but we hit the same problem as before: RAM allocation for big data versus not enough RAM to hold the whole “package”.

Don’t we ?

It depends. My next question would be: out of the 15 TB, how many documents will you have? So 15 TB / (# of documents) = YY GB or YY KB per document.
Do all parts of the data need to be searchable?
How fast do you want a response back from a query? Ex. 300 ms.
How many queries/sec will you be doing at peak? Ex. 300/sec.
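
For illustration, a tiny sizing helper showing that division (a sketch; the document count below is a made-up figure, purely to demonstrate the arithmetic):

```python
def avg_doc_size_kb(dataset_tb: float, doc_count: int) -> float:
    """Average document size implied by total volume and document count."""
    return dataset_tb * 1e9 / doc_count   # 1 TB ~= 1e9 KB in decimal units

# For example, if the 15 TB turned out to be 100 million documents
# (an assumed count, not from the discussion above):
print(f"{avg_doc_size_kb(15, 100_000_000):.0f} KB per document")   # -> 150 KB
```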

Hmm,

Out of the 45 TB, we can say we will need approximately 15 TB. That quantity will still increase, but more and more slowly, since the other documents do not need to be retrieved quickly; for those, CouchDB applies, I think.
You can see the relation as an inverse exponential, decreasing with time: the more documents there are, the smaller the ratio of important docs to overall docs.

1500 req/sec, with latency as low as possible :-). The load is distributed over 10 servers, so it is 150 req/sec per server but 1500 req/sec overall.
Approximately, of course. And increasing.

With your requests only being 1500/sec, you can go with CouchDB + ElasticSearch. I would recommend you use Couchbase as your caching layer, as a memcached replacement. To speed up ElasticSearch, watch this video: http://www.elasticsearch.org/videos/scaling-massive-elasticsearch-clusters/. Good luck with your project.
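
A minimal cache-aside sketch of that setup, assuming the Couchbase Python SDK 2.x and CouchDB's plain HTTP document API; the host names, bucket, database name, and TTL are placeholders:

```python
import requests
from couchbase.bucket import Bucket
from couchbase.exceptions import NotFoundError

cache = Bucket("couchbase://localhost/cache")      # hypothetical cache bucket
COUCHDB = "http://localhost:5984/appdata"          # hypothetical CouchDB database

def get_document(doc_id: str) -> dict:
    """Cache-aside read: serve from Couchbase RAM, fall back to CouchDB."""
    try:
        return cache.get(doc_id).value             # cache hit, answered from memory
    except NotFoundError:
        resp = requests.get(f"{COUCHDB}/{doc_id}")  # miss: read the disk-backed store
        resp.raise_for_status()
        doc = resp.json()
        cache.upsert(doc_id, doc, ttl=300)         # keep it warm for five minutes
        return doc
```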

Thanks,

I will look into it!

Bye!