Indexing and Search on Huge Bucket

Hi Community,

We are facing a challenge with handling a huge amount of data in Couchbase.
We are going to record statistics every day in a dedicated bucket, at a rate of about 30 million documents per month. On top of these stats we are writing several views queried with stale=ok so as not to overload Couchbase. These views will summarize, count, and group the data in very specific ways.
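To make the stale=ok view approach concrete, here is a minimal Python sketch that only builds the URL for a grouped view query against the view REST endpoint on port 8092. The host, bucket name `stats`, design document `reports`, view `by_client_month`, and the `[client, year, month]` key layout are all hypothetical placeholders, not details from our setup.

```python
import json
from urllib.parse import urlencode


def view_query_url(host, bucket, design_doc, view, start_key, end_key, group_level):
    """Build a view query URL that accepts possibly stale results (stale=ok)."""
    params = {
        # stale=ok returns whatever is already indexed without triggering an update
        "stale": "ok",
        # group_level controls how many leading key elements the reduce groups by
        "group_level": group_level,
        # View keys are passed as JSON-encoded values
        "startkey": json.dumps(start_key, separators=(",", ":")),
        "endkey": json.dumps(end_key, separators=(",", ":")),
    }
    return (
        f"http://{host}:8092/{bucket}/_design/{design_doc}/_view/{view}"
        f"?{urlencode(params)}"
    )


# Hypothetical names: bucket "stats", design doc "reports", view "by_client_month",
# with the view emitting keys shaped like [client, year, month].
url = view_query_url(
    "localhost", "stats", "reports", "by_client_month",
    ["acme", 2018, 1], ["acme", 2018, 12], 2,
)
print(url)
```

With group_level set to 2, a reduce such as the built-in `_count` would return one row per client and year under this assumed key layout.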

On the other hand, we are going to need specific SQL-like searches on these documents. By that I mean being able to find a handful of documents among these millions or billions of documents based on criteria such as date range, type, name, etc.

We have been exploring approaches such as connecting Apache Spark to Couchbase to handle all this data, but we are not 100% sure this is the most appropriate way to solve it in Couchbase.

Is there another way to handle this kind of query on huge buckets like the one I am describing above?
Would it be possible to set up a couple of N1QL secondary indexes on a bucket with billions of documents without overloading Couchbase, and with everything else continuing to work well?



Couchbase Analytics, which is in Developer Preview, would be a great fit for the use case you’ve described above. It is designed to allow ad-hoc querying of data in a Couchbase cluster without impacting the operational workloads.

You can download the 5.5 beta to try the Analytics service. We have a simple tutorial to get you started.

I am happy to chat more to give you an overview and show you a demo of the Analytics service.


Is there any way this can be done on the Community version? As Juan said, we are saving 1M records per day, so ideally we need to query over 1B records.
The records come from different clients, and each query searches a subset of one client's records within a particular date range. Map-reduce works for the general stats, but looking up individual records among 1B does not look doable.

N1QL and GSI will work well for your secondary lookups and range scans, even across very large datasets. You won’t need analytics for that.
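As a rough illustration of that kind of lookup, the Python sketch below only builds the N1QL statements; it does not talk to a cluster. The bucket name `stats` and the fields `client_id`, `type`, and `event_date` are assumptions made for the example, and the index puts the equality fields first and the range field last, which is a common layout for this query shape rather than a tuned recommendation.

```python
def create_index_statement(bucket, index_name, fields):
    """Build a CREATE INDEX statement for a composite GSI index."""
    cols = ", ".join(f"`{field}`" for field in fields)
    return f"CREATE INDEX `{index_name}` ON `{bucket}`({cols}) USING GSI"


def range_query(bucket, client, doc_type, start, end, limit=100):
    """Build a parameterized N1QL range scan plus its named parameters."""
    statement = (
        f"SELECT META(s).id, s.* FROM `{bucket}` AS s "
        "WHERE s.client_id = $client AND s.type = $type "
        "AND s.event_date >= $start AND s.event_date < $end "
        f"LIMIT {int(limit)}"
    )
    # Named parameters keep the statement text stable across calls
    params = {"client": client, "type": doc_type, "start": start, "end": end}
    return statement, params


# Hypothetical bucket and field names, for illustration only.
ddl = create_index_statement(
    "stats", "idx_client_type_date", ["client_id", "type", "event_date"]
)
statement, params = range_query("stats", "acme", "click", "2018-01-01", "2018-02-01")
print(ddl)
print(statement, params)
```

The statement and parameters would then be passed to whichever SDK or query tool you use. Storing dates as ISO-8601 strings (assumed here) lets the range comparison work as a plain string comparison.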

Thank you for the response.

Does this also apply to Community Edition, which doesn't support Multidimensional Scaling? We are thinking of searching across several billion records.

