Persisting a Spark DataFrame to Couchbase


#1

Hi
I am trying to resolve an issue with persisting a Spark DataFrame to Couchbase. I would also like to set a TTL on the documents. We are using Scala. At present it is done as below:

val jsonDocs: Array[JsonDocument] = myDataFrame.toJSON.collect.map(json => someFunToCreateJsonDoc(json))

The above returns an Array[JsonDocument] and takes care of setting the TTL; the array is then persisted using sc.parallelize(jsonDocs).saveToCouchbase(bucketName).

The issue with this approach is that the DataFrame is huge, so calling collect pulls everything onto the driver and, as expected, fails with an OOM.

I then tweaked this a bit as below, just to see if it would work for a full load. I know this is not ideal, since there is no point in calling parallelize with a single document, but it also fails with a serialization error on Scala 2.11.8 (Spark 2.1), because the SparkContext cannot be referenced inside an executor-side closure:

myDataFrame.toJSON.foreach(json => sc.parallelize(Seq(someFunToCreateJsonDoc(json))).saveToCouchbase(bucketName))
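One pattern that avoids both collect and the nested SparkContext is to build the documents on the executors with a plain map and save the resulting RDD directly. This is only a sketch: someFunToCreateJsonDoc and bucketName are the names from my code above, and it assumes the Couchbase Spark connector's RDD implicits are in scope:

```scala
import com.couchbase.spark._  // provides saveToCouchbase on RDDs of documents

// Convert each row to JSON and build the JsonDocument (including its TTL)
// on the executors, then write the distributed RDD to Couchbase without
// ever collecting to the driver.
myDataFrame.toJSON.rdd
  .map(json => someFunToCreateJsonDoc(json))
  .saveToCouchbase(bucketName)
```

Because the TTL is set inside someFunToCreateJsonDoc, this would keep the per-document expiry while distributing the work.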

The other option I can think of is writing the DataFrame directly, or using Spark Streaming (a last resort, as I have no experience with it and the issue needs to be fixed urgently).

Writing the DF directly can be achieved as below, but I am not sure if this is the right way to do it:
dataFrame.write.mode(saveMode).couchbase(options)


#2

Hi basotia,

As you’ve found, trying to collect() a huge DataFrame generally fails with an OOM, so the recommended approach is just as you’ve written:

dataFrame.write.mode(saveMode).couchbase(options)
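Fleshed out, and assuming the connector's Spark SQL implicits are imported and the target bucket is passed via the options map (the option names here are assumptions, so check them against your connector version), the write might look like:

```scala
import com.couchbase.spark.sql._      // provides .couchbase(...) on DataFrameWriter
import org.apache.spark.sql.SaveMode

// Write the DataFrame straight to Couchbase; each partition is persisted
// by its executor, so nothing is collected on the driver.
dataFrame.write
  .mode(SaveMode.Overwrite)                // or whatever saveMode you already use
  .couchbase(Map("bucket" -> "myBucket"))  // hypothetical bucket name
```

One caveat: a per-document TTL may not be expressible through this path. If you need the expiry you set in someFunToCreateJsonDoc, the RDD-based saveToCouchbase route keeps that control.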