ETL In-Database vs External ETL Tool (Scalability)

A few data points to set the context.

  1. My data source is a single source with multiple files, mostly CSV.
  2. Size ranges from 200K to ~1M records.
  3. Complex transformations: generating records for missing periods, grouping, windowing, etc.

My experience so far with doing this inside CB has not been good.

  1. It eats up the memory quota; when adding indexes, usage easily shoots up.
  2. CB has stopped responding and has even brought down the EC2 instance; we had to restart the server.
  3. I am using memory-optimized storage.

Now the question is (ETL tool vs CB for transformations):

  1. Should I do transformations outside of CB using an ETL tool (NiFi?), use CB only for storing the transformed data (including intermediate transformations), and load the final transformed data into the database?

Thanks
Prince

If you can load JSON, the Eventing service can do transformations via JavaScript. It works on the database change protocol (DCP) and needs no indexes, merely responding to mutations.

In your case that mutation is the initial insert. Typically you would do all your needed transformations and write the document back to the source keyspace, also adding a flag like “etl_done”: true to prevent redoing the work if you ever redeploy the Eventing function (it also lets other services like N1QL ignore the data until it has been transformed).
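A minimal sketch of that pattern (the binding alias src_col and the transform() helper are placeholders, not from the docs; it assumes the source keyspace is bound to the Function with read+write access):

```javascript
// Sketch only: src_col is an assumed read+write binding alias for the
// source keyspace, and transform() is a placeholder for your own logic.
function OnUpdate(doc, meta) {
    // Skip documents already processed, so redeploying the Function
    // (which replays existing documents) does not redo the work.
    if (doc.etl_done === true) return;

    doc = transform(doc);

    // Flag the document and write it back to the source keyspace;
    // other services (e.g. N1QL) can filter on etl_done.
    doc.etl_done = true;
    src_col[meta.id] = doc;
}

function transform(doc) {
    // Placeholder for the real work (grouping, windowing,
    // generating records for missing periods, etc.).
    return doc;
}
```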

You just leave the Eventing function running: first it transforms everything already in the keyspace, then it responds in real time to new mutations. If you want only clean items, read from a source keyspace, transform, write to a destination keyspace, and delete the source document (no need for a flag like “etl_done”: true).
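That second variant might look like this (again just a sketch; src_col and dst_col are assumed read+write binding aliases for the source and destination keyspaces):

```javascript
// Sketch only: dst_col and src_col are assumed read+write binding aliases.
function OnUpdate(doc, meta) {
    // Write the transformed document to the destination keyspace...
    dst_col[meta.id] = transform(doc);
    // ...then remove the original, so no "etl_done" flag is needed.
    delete src_col[meta.id];
}

function transform(doc) {
    return doc; // placeholder for the real transformation
}
```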

Take a look at the scriptlet convertXMLtoJSON in the Eventing docs.

@jon.strabala Thanks for responding. That is the idea: for any document mutation we can use Eventing to transform into the destination.

My larger question is still unanswered: should I move complex transformations on large data sets outside of CB using ETL tools?

Can any CB experts who have experience in this area respond, please?

Hi @Prince_Matthew

My larger question is still unanswered: should I move complex transformations on large data sets outside of CB using ETL tools?

This is a general-purpose question and I have no idea of the number of documents stored or the ingestion rate, but I’ll go out on a limb: “a smallish cluster” should be able to transform 20,000 mutations/sec., resulting in a stream of documents with an average size of 3,000 bytes, using an Eventing Function. You said you have 1M records (this is a small dataset), so you really shouldn’t have a problem.

To get an idea of performance, you might want to play around with the blog post How (and Why) to Use Couchbase as an XML Database; the concepts in it can be used to process deeply nested XML documents up to 10M bytes in size. It starts with a simple example BUT provides a data generator to test converting 100K small XML docs to JSON; you could easily make the docs bigger and raise the number of docs to 1M or 10M.

Note that if you need to transform 1M+ documents a second, I’d definitely opt for an external ETL tool as it would be doubtful you are running a cluster large enough (with regard to both the data service and the Eventing service).

Best

Jon Strabala
Principal Product Manager - Server‌


Thank you @jon.strabala. That’s what I thought. I am generally wary of doing complex transformations in the database; ideally it should be used for storage. But having moved to Couchbase a year ago, and given its multidimensional scaling capabilities, I wonder whether Couchbase MDS can handle such a use case if I add more data and index nodes to the cluster? I ask because you mentioned “… definitely opt for an external ETL tool as it would be doubtful you are running a cluster large enough”.
From your statement it sounds to me that if I run a large enough cluster then it is possible. Please correct me if I am wrong.

Thanks
Prince

cc: @Siju_Babu

@Prince_Matthew,

My larger question is still unanswered: should I move complex transformations on large data sets outside of CB using ETL tools?

If you are only dealing with 1M records, I don’t see why you wouldn’t use Eventing for complex transformations; if you are dealing with 1M records per second, I think you will be pushing the capability of Eventing and KV even in a very large cluster.

By all means test your transformation logic out; the v8 runners on a small cluster can stream 600K mutations a second (doing nothing, perhaps just picking out a key).

As a rule of thumb, I think a large cluster will support about 300K transformations a second (doing some analysis) up to about 1M simple transformations a second (like adding an item) via Eventing.
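For reference, the “simple transformation” end of that range is something as trivial as stamping one extra field on each mutation, e.g. (src_col again being an assumed binding alias):

```javascript
// Sketch only: adds a single field per mutation, about the cheapest
// transformation an Eventing Function can do.
function OnUpdate(doc, meta) {
    if (doc.processed_ts) return;              // avoid re-stamping on redeploy
    doc.processed_ts = new Date().toISOString(); // the "added item"
    src_col[meta.id] = doc;
}
```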

For Eventing you don’t need more than a 256 MiB quota on a node; the service is essentially compute-oriented.

Best

Jon Strabala
Principal Product Manager - Server‌
