Leveraging PySpark to write to Couchbase

varunkumthekar · April 19, 2022, 10:29pm

Hi,

I have a JSON file with around 6 million entries. I am using the Java SDK to write to Couchbase, but writing one entry at a time is taking hours. I am fairly new to Couchbase and wanted to know the best way to speed up writes to the database. Would leveraging Spark help in this case? Do you have any sample programs for PySpark and Couchbase? Any other suggestions would also be really great.

dh · April 20, 2022, 10:02am

If the source data is already valid JSON as indicated, you may be able to use the cbimport tool to more efficiently load it.

Ref: cbimport | Couchbase Docs

I’d typically use something along the lines of:

cbimport json --format list -c http://<host>:8091 -u <user> -p <pwd> -b <bucket> -g "#UUID#" -d file:///path/to/data.json

Here the document keys are generated as UUIDs, and the data is a JSON array of objects (“list” format), e.g.

[
{"field":"value","another":"value"},
{"field":"value","another":"value"},
...
]
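If the import has to run from a script, the same command can be assembled and launched from Python. The sketch below is only an illustration under a few assumptions: cbimport is on the PATH, each record carries a unique field (hypothetically called id here), and the helper names build_cbimport_args and run_import are made up for this example. It uses a field-based key generator ("%id%") instead of "#UUID#", so that re-importing a record with the same id should replace the existing document rather than create a new one.

```python
import subprocess
from typing import List


def build_cbimport_args(host: str, user: str, password: str, bucket: str,
                        json_path: str, key_field: str) -> List[str]:
    """Build the cbimport argument list for a JSON array ("list" format) file.

    key_field is a hypothetical unique field in each record; "%field%" tells
    cbimport to take the document key from that field's value.
    """
    return [
        "cbimport", "json",
        "--format", "list",
        "-c", f"http://{host}:8091",
        "-u", user,
        "-p", password,
        "-b", bucket,
        "-g", f"%{key_field}%",
        "-d", f"file://{json_path}",  # json_path is expected to be absolute
    ]


def run_import(args: List[str]) -> None:
    """Run cbimport and raise CalledProcessError if it exits non-zero."""
    subprocess.run(args, check=True)


if __name__ == "__main__":
    # Placeholder values; substitute your own host, credentials and path.
    args = build_cbimport_args("localhost", "<user>", "<pwd>", "<bucket>",
                               "/path/to/data.json", "id")
    print(" ".join(args))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with characters such as # and % in the key expression.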

HTH.

varunkumthekar · April 20, 2022, 2:43pm

Hi @dh

Thank you for the reply. My data gets updated frequently (every week), and I have some other data that gets updated daily. So I need to push the data into Couchbase as part of a script, rather than using cbimport to pull it in. The records also need to be updated/overwritten.

dh · April 21, 2022, 11:10am

Is probably a good place for you to start and attempt to ingest and submit a batch of JSON documents from your file at a time, rather than one at a time.
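As a rough sketch of the batching idea (not tied to any particular SDK), the records can be grouped into fixed-size chunks keyed by a unique field and handed to a bulk write call. Everything named here is an assumption for illustration: the key field id, the batch size of 1000, and the write_batch callable, which stands in for whatever bulk upsert operation your SDK provides.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def batched(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def write_in_batches(records: Iterable[dict], key_field: str,
                     write_batch: Callable[[Dict[str, dict]], None],
                     batch_size: int = 1000) -> int:
    """Send records to `write_batch` as {key: document} dicts, one per batch.

    `write_batch` is a placeholder for a bulk upsert call; using a stable key
    taken from the record means a later run overwrites the same documents.
    Returns the number of documents handed to `write_batch`.
    """
    total = 0
    for chunk in batched(records, batch_size):
        docs = {str(rec[key_field]): rec for rec in chunk}
        write_batch(docs)
        total += len(docs)
    return total


if __name__ == "__main__":
    # Stand-in data and a fake writer that just collects the batches.
    sample = [{"id": i, "field": "value"} for i in range(5)]
    sent: List[Dict[str, dict]] = []
    count = write_in_batches(sample, "id", sent.append, batch_size=2)
    print(count, [len(b) for b in sent])  # prints: 5 [2, 2, 1]
```

For a real file, records would come from something like json.load on the JSON array. With several million entries that loads everything into memory at once, so a streaming parser may be a better fit.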

HTH.