Bulk upload using the Java API in Couchbase Server


#1

We have a requirement to load 240K records using the Java API during server startup. If I use the set method, I lose data because the process is asynchronous: only 15K records show up in total. If I add a one-millisecond sleep after each insert, it works fine but takes a huge amount of time. If I use set(key,value).get(), throughput drops to single-digit ops/sec, which is a huge performance bottleneck.

I have also tried an alternative tool, cbdocloader, which takes more than 5 minutes to load all the data. That is far too slow for us.
Please let us know if you have come up with a solution for this.


#2

Hi,

Yes, this is a common mistake caused by the async API. If you never block or orchestrate on the results, your main thread shuts down while 15K items are on the server and the rest are still in the write queue on the client.

You did the right thing in the first place by blocking on the future, but if you do nothing else you end up with a single-threaded synchronous loop, and that’s not the best way to use the available resources.

So here are the alternatives, use the one that fits your programming model best:

  1. Stick with blocking .get() calls, but fan them out into an Executor to get more concurrency. The client is thread-safe, so you don’t need to worry about synchronization.

  2. Do not block on the future; instead, use a listener to get notified once the future is complete. You can use a CountDownLatch to coordinate with your main thread so that it doesn’t shut down prematurely.

  3. Write and don’t block: put all futures in a list and, once you are done writing, iterate over the list. Only exit the main thread once all operations have completed, and potentially retry any op that failed.
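To make option 1 concrete, here is a minimal sketch using only java.util.concurrent. It does not use the Couchbase client: `fakeSet` and `FAKE_BUCKET` are hypothetical stand-ins for `client.set(key, value)` and the bucket, so that the example runs on its own. The thread count is something you would have to tune.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ExecutorBulkLoad {
    // Hypothetical in-memory stand-in for the bucket (not part of any SDK).
    static final Map<String, String> FAKE_BUCKET = new ConcurrentHashMap<>();

    // Hypothetical stand-in for client.set(key, value): the returned future
    // completes once the write has been stored.
    static Future<Boolean> fakeSet(String key, String value) {
        return CompletableFuture.supplyAsync(() -> {
            FAKE_BUCKET.put(key, value);
            return true;
        });
    }

    // Option 1: every task blocks on its own future, but the tasks are
    // spread over a fixed pool so many writes are in flight at once.
    static int loadAll(Map<String, String> docs, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (Map.Entry<String, String> e : docs.entrySet()) {
                tasks.add(() -> fakeSet(e.getKey(), e.getValue()).get());
            }
            int stored = 0;
            // invokeAll returns only after every submitted task has finished,
            // so the caller cannot exit while writes are still queued.
            for (Future<Boolean> result : pool.invokeAll(tasks)) {
                if (result.get()) {
                    stored++;
                }
            }
            return stored;
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `ExecutorBulkLoad.loadAll(docs, 8)` returns the number of writes that reported success, which you can compare against `docs.size()` before letting startup continue.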

Does that help you move forward?


#3

To insert 10 million records in batches of 100/1000/…, which approach should we go with?


#4

I think you should benchmark to see what fits your needs; it’s hard to say in general. Start with smaller batches and work your way up to a number that works.

With the old SDK, make sure you fire off a bunch of operations and use the latch to coordinate. With the new SDK, you want to use the async reactive API: build an Observable.from() the list of IDs, flatMap your insert/upsert over it, and then block at the end once done (waiting for the last one with .last()).
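As a rough illustration of the "fire off a bunch and use the latch to coordinate" pattern, here is a sketch in plain java.util.concurrent. It is not SDK code: `fakeSet` and `FAKE_BUCKET` are hypothetical stand-ins for the client's async write and the bucket, and `whenComplete` plays the role of the future listener. The batch size is a parameter precisely because it needs benchmarking.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class LatchBatchLoad {
    // Hypothetical in-memory stand-in for the bucket (not part of any SDK).
    static final Map<String, String> FAKE_BUCKET = new ConcurrentHashMap<>();

    // Hypothetical stand-in for an async set/upsert call.
    static CompletableFuture<Boolean> fakeSet(String key, String value) {
        return CompletableFuture.supplyAsync(() -> {
            FAKE_BUCKET.put(key, value);
            return true;
        });
    }

    // Fires one batch without blocking, then waits on a latch until every
    // write in that batch has completed before starting the next batch.
    // Returns the number of writes that failed.
    static int loadInBatches(List<String> keys, int batchSize) throws InterruptedException {
        AtomicInteger failures = new AtomicInteger();
        for (int start = 0; start < keys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keys.size());
            List<String> batch = keys.subList(start, end);
            CountDownLatch latch = new CountDownLatch(batch.size());
            for (String key : batch) {
                fakeSet(key, "value-of-" + key).whenComplete((ok, error) -> {
                    if (error != null || !ok) {
                        failures.incrementAndGet(); // candidate for a retry
                    }
                    latch.countDown();
                });
            }
            latch.await(); // bounds the number of in-flight ops to one batch
        }
        return failures.get();
    }
}
```

Waiting per batch keeps at most `batchSize` operations outstanding, which is one simple way to avoid flooding the client's queue; a non-zero return value tells you some keys need to be retried.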


#5

Is it better to wait after each batch, or only for the last item of the last batch?
I imagine it’s better to wait after each batch; otherwise, with millions of items we would overwhelm the Couchbase API call queue, right? In the old SDK there was setOpQueueMaxBlockTime to prevent this, but I don’t see it in the new SDK.


#6

@juliango202 this part works differently now. There is a RingBuffer sitting there, and it will immediately return a BackpressureException if you overwhelm it. You can combine it with the onError…() functions in RxJava to gracefully wait some time and retry.

Can you give me a simple example (let’s say 10 records) of what you want to store, and then we could work out some code that you can run? Is it JSON data? Do you want to insert, upsert, or replace it?
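To show the wait-and-retry idea in isolation, here is a sketch in plain Java rather than RxJava. Everything in it is an assumption made for illustration: `FakeBackpressureException` stands in for the SDK's BackpressureException, the semaphore with 64 slots models a full request buffer, and `fakeUpsert` rejects synchronously, whereas the real client reports the error through the Observable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

class BackpressureRetrySketch {
    // Hypothetical exception standing in for the SDK's BackpressureException.
    static class FakeBackpressureException extends RuntimeException {}

    // Hypothetical in-memory stand-in for the bucket (not part of any SDK).
    static final Map<String, String> FAKE_BUCKET = new ConcurrentHashMap<>();

    // Models a fixed-size request buffer: at most 64 ops in flight (arbitrary).
    static final Semaphore SLOTS = new Semaphore(64);

    // Hypothetical async upsert that rejects immediately when the buffer is full.
    static CompletableFuture<Void> fakeUpsert(String key, String value) {
        if (!SLOTS.tryAcquire()) {
            throw new FakeBackpressureException();
        }
        return CompletableFuture.runAsync(() -> {
            try {
                FAKE_BUCKET.put(key, value);
            } finally {
                SLOTS.release(); // free the slot once the write is done
            }
        });
    }

    // Submits every key; on rejection it sleeps with a capped, doubling delay
    // and tries the same key again. Returns how many rejections occurred.
    static int loadWithRetry(List<String> keys) throws InterruptedException {
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        int rejections = 0;
        for (String key : keys) {
            long delayMs = 1;
            while (true) {
                try {
                    pending.add(fakeUpsert(key, "value-of-" + key));
                    break;
                } catch (FakeBackpressureException e) {
                    rejections++;
                    Thread.sleep(delayMs);
                    delayMs = Math.min(delayMs * 2, 100);
                }
            }
        }
        // Do not return until every accepted write has completed.
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        return rejections;
    }
}
```

The point of the backoff is that a rejected write is never dropped: the producer slows down until the buffer drains, and the final `join()` keeps the caller alive until all writes have landed.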


#7

Thanks @daschl. Our use case is pretty generic: just bulk import some millions of ~2K JsonDocuments into Couchbase. We may also need to do bulk updates (get-modify-set) later.

We’ll try not to block, to wait and retry on BackpressureException, and to use a CountDownLatch to make sure every thread has finished.

If you can put such an example of bulk import in the SDK docs, it would probably be useful to many people ; )


#8

@juliango202 yes we are working on adding it. In the meantime, if you have code to show I’d be happy to take a look.


#9

I am doing the same operation. Even when using around 50 threads to insert the documents in batches of 200, I still get a BackpressureException. I have been unable to solve this for a long time.