Couchbase Spark bulk insert


#1

Couchbase Server 5.1 and Spark connector 2.2.0.
Is there any way to do a bulk insert using the Couchbase Spark connector?


#2

Hi there. Indeed there is. Please see "Persisting spark dataframe to couchbase" for the recommended way to insert a large Spark DataFrame into Couchbase.


#3

I tried the method mentioned, but it does not work. I have an RDD[String], which I converted into a DataFrame as below:

val dfSchema = Seq("data")
val a = data.toDF(dfSchema: _*)
a.write.mode(saveMode).couchbase("temp") // temp is bucket name

But it shows the error "cannot resolve symbol saveMode". I am using Spark connector 2.2.0; the docs mention the method saveToCouchbase(options), but I could not find a method like couchbase(options).


#4

saveMode is a variable referencing a Spark SaveMode, e.g.

import org.apache.spark.sql.SaveMode

val saveMode = SaveMode.Overwrite

#5

The saveToCouchbase method is documented here https://developer.couchbase.com/documentation/server/current/connectors/spark-2.2/spark-sql.html.

Hope this helps :)
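Since the original data is an RDD[String], here is a rough sketch of the RDD route via saveToCouchbase. It is only an illustration under assumptions: the object name, the "doc::" ID prefix, and the use of zipWithIndex to generate IDs are made up for this example, and it assumes the SparkContext is already configured with a single Couchbase bucket.

```scala
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.document.json.JsonObject
import com.couchbase.spark._ // adds saveToCouchbase() to RDDs of documents
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of the connector.
object RddBulkInsertSketch {

  // Wraps one string in a JSON document {"data": value}.
  // The "doc::<index>" ID scheme is an assumption for illustration.
  def toDocument(value: String, index: Long): JsonDocument =
    JsonDocument.create(s"doc::$index", JsonObject.create().put("data", value))

  // Gives every element a distinct index, converts it to a document,
  // and writes the documents to the bucket configured on the SparkContext.
  def save(data: RDD[String]): Unit =
    data.zipWithIndex()
      .map { case (value, index) => toDocument(value, index) }
      .saveToCouchbase()
}
```

Calling RddBulkInsertSketch.save(data) would then write one document per string. The index-based IDs are only unique within one RDD, so a real job would probably want IDs derived from the data itself.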


#6

Thanks for the reply. I'm pretty close to uploading in bulk, but I am having some issues. I tried the following:

val saveMode = SaveMode.Overwrite
dataframe.write.mode(saveMode).couchbase(Map("bucket" -> "temp"))

but it shows the following error:

com.couchbase.client.core.CouchbaseException: Could not find ID field META_ID in {"data":"test"}


#7

Yep, this is as expected. All documents in Couchbase need a document ID that is unique within the bucket. When you insert Spark DataFrames directly into Couchbase, as here, you need to make sure that each row in the DataFrame has this unique document ID. By default the Spark connector looks for it in a JSON field called 'META_ID', but you can change this: search for 'idField' in these docs https://developer.couchbase.com/documentation/server/current/connectors/spark-2.2/spark-sql.html for more details.
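As a minimal sketch of what that could look like for the DataFrame above: add a META_ID column before writing. The object name, the "doc::" prefix, and the choice of monotonically_increasing_id to generate IDs are assumptions for illustration, not something the connector requires; any column with unique string values would do.

```scala
import com.couchbase.spark.sql._ // adds couchbase(...) to DataFrameWriter
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.{concat, lit, monotonically_increasing_id}

// Hypothetical helper object, not part of the connector.
object DataFrameBulkInsertSketch {

  // Adds a META_ID column with a generated ID per row.
  // The generated IDs are unique within this DataFrame only.
  def withDocId(df: DataFrame, prefix: String): DataFrame =
    df.withColumn(
      "META_ID",
      concat(lit(prefix), monotonically_increasing_id().cast("string")))

  // Writes the rows to the given bucket, using META_ID as the document ID.
  def save(df: DataFrame, bucket: String): Unit =
    withDocId(df, "doc::").write
      .mode(SaveMode.Overwrite)
      .couchbase(Map("bucket" -> bucket))
}
```

With this, DataFrameBulkInsertSketch.save(dataframe, "temp") should no longer hit the missing META_ID error. If the DataFrame already has a unique column under another name, adding "idField" -> that column name to the options map, as described above, avoids the extra column.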