ETL: In-Database vs. External ETL Tool (Scalability)

A few data points to set the context.

  1. My data source would be a single source with multiple files, mostly CSV.
  2. Size would range from 200K to ~1 million records.
  3. Complex transformations: generating records for missing periods, grouping, windowing, etc.

My experience so far doing this within CB has not been good.

  1. It eats up the memory quota; when adding indexes, usage easily shoots up.
  2. CB has stopped responding and has even brought down the EC2 instance; we had to restart the server.
  3. I am using memory-optimized storage.

Now the question is (ETL tool vs. CB for transformations):

  1. Should I do the transformations outside of CB using an ETL tool (NiFi?), use CB only for storing the transformed data (including intermediate transformations), and load the final transformed data into the database?


If you can load JSON, the Eventing service can do transformations via JavaScript. It works on the Database Change Protocol (DCP) and requires no indexes, merely responding to mutations.

In your case, that would be the initial insert. Typically you would do all your needed transformations, write the document back to the source keyspace, and also add a flag like “etl_done”: true to prevent redoing the work if you ever redeploy the Eventing function (plus you gain the ability to ignore data in other services like N1QL until it has been transformed).
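A minimal sketch of that write-back pattern, assuming an Eventing handler with the source keyspace bound as an alias (here `src` is mocked as a plain object so the logic can run standalone, and `transform` is a placeholder for your real logic):

```javascript
// In a deployed Eventing function, `src` would be a bucket/keyspace
// binding configured on the function; mocked here as a plain object.
var src = {};

function OnUpdate(doc, meta) {
  // Writing back to the source keyspace re-triggers this handler,
  // so the "etl_done" flag is what stops the cycle (and survives a redeploy).
  if (doc.etl_done === true) return;
  var out = transform(doc);   // your complex transformation goes here
  out.etl_done = true;        // mark as processed
  src[meta.id] = out;         // write the document back to the source keyspace
}

// Placeholder transformation for illustration only.
function transform(doc) {
  return { value: doc.value * 2 };
}
```

Other services (e.g. N1QL) can then filter on `etl_done = true` to see only clean documents.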

You just leave the Eventing function running: first it transforms everything already in the keyspace, then it responds in real time to new mutations. If you wanted to keep only clean items, read from a source keyspace, transform, write to a destination keyspace, and delete the source document (no need for a flag like “etl_done”: true).
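The copy-then-delete variant can be sketched the same way (again with the two keyspace bindings mocked as plain objects, and `transform` a placeholder):

```javascript
// Sketch of the source -> destination variant: the transformed document
// goes to a destination keyspace and the source copy is deleted, so no
// "etl_done" flag is needed. Bindings mocked for standalone execution.
var srcBucket = {};
var dstBucket = {};

function OnUpdate(doc, meta) {
  dstBucket[meta.id] = transform(doc); // only clean items reach the destination
  delete srcBucket[meta.id];           // removing the source ends the cycle
}

// Placeholder transformation for illustration only.
function transform(doc) {
  return { value: doc.value * 2 };
}
```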

Take a look at the convertXMLtoJSON scriptlet in the Eventing docs.

@jon.strabala Thanks for responding. That is a good idea: for any document mutation, we can use Eventing to transform into the destination.

My larger question is still unanswered: should I move complex transformations on large data sets outside of CB using ETL tools?

Can any CB experts who have experience in this area respond, please?

Hi @Prince_Matthew

My larger question is still unanswered: should I move complex transformations on large data sets outside of CB using ETL tools?

This is a general-purpose question, and I have no idea of the number of documents stored or the ingestion rate, but I’ll go out on a limb: a smallish cluster should be able to transform 20,000 mutations/sec., resulting in a stream of documents with an average size of 3,000 bytes, using an Eventing Function. You said you have 1M records (a small dataset), so you really shouldn’t have a problem.

To get an idea of performance, you might want to play around with the blog post How (and Why) to Use Couchbase as an XML Database; the concepts in it can be used to process deeply nested XML documents up to 10M bytes in size. It starts with a simple example BUT provides a data generator to test converting 100K small XML docs to JSON; you could easily make the docs bigger and raise the number of docs to 1M or 10M.

Note that if you need to transform 1M+ documents a second, I’d definitely opt for an external ETL tool, as it is doubtful you are running a cluster large enough (with regard to both the Data service and the Eventing service).


Jon Strabala
Principal Product Manager - Server