Spark Streaming using DCP to replicate all data in the first batch

Hi,

I’m using the Spark Streaming Connector for Spark 2.0 and have had no issues with it so far. Now I’d like to implement zero data loss while using the couchbaseStream operator.

I’m using couchbaseStream to receive all existing data plus future mutations, and immediately apply updateStateByKey to cache the filtered bucket, partitioned across the executors. Let’s name this stream C. The states are then retrieved with stateSnapshots, and I join an incoming stream from Kafka (named K) with these state snapshots (the data received from Couchbase).
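For reference, here is a minimal sketch of the shape of the pipeline. I’m writing the state caching with mapWithState/stateSnapshots here, and the bucket name, batch interval, key/value extraction and the Kafka stream K are simplified placeholders:

```scala
import com.couchbase.spark.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cb-dcp-kafka-join")
      .set("com.couchbase.bucket.myBucket", "")       // assumed bucket name, empty password
    val ssc = new StreamingContext(conf, Seconds(10)) // assumed batch interval
    ssc.checkpoint("/tmp/state-checkpoint")           // required by the stateful operator

    // Replicate the whole bucket and keep receiving mutations over DCP
    val mutations: DStream[(String, String)] = ssc
      .couchbaseStream(from = FromBeginning, to = ToInfinity)
      .filter(_.isInstanceOf[Mutation])
      .map(_.asInstanceOf[Mutation])
      .map(m => (new String(m.key), new String(m.content))) // key/content are byte arrays, as far as I recall

    // C: cache the latest value per document id, partitioned across the executors
    val spec = StateSpec.function(
      (key: String, value: Option[String], state: State[String]) => {
        value.foreach(state.update)
        (key, state.getOption().getOrElse(""))
      })
    val c = mutations.mapWithState(spec)

    // Per-batch snapshot of everything cached so far, joined with the Kafka stream K
    val snapshots: DStream[(String, String)] = c.stateSnapshots()
    val k: DStream[(String, String)] = ???              // Kafka direct stream, keyed the same way (see below)
    k.join(snapshots).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```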

I’ve managed to save the Kafka offsets to reliable storage to bypass the default checkpointing in Spark Streaming, since code upgrades are not supported with Spark Streaming checkpointing (the offset handling is sketched below).
Unfortunately, when I restart the job, since couchbaseStream uses Receivers to gradually replicate the data via DCP, the parallel Kafka Direct Stream K cannot match any data from stream C for a long time, until all the data has been replicated and applied to C. So I either have to halt K for a long period of time (I don’t know how to achieve that in Spark Streaming at the moment), or force C not to finish its first micro-batch until all the data has been replicated (except for very recent mutations).
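For completeness, this is roughly how I create K and handle the offsets myself, assuming the spark-streaming-kafka-0-10 integration; loadOffsets/saveOffsets are placeholders for whatever reliable store is used, and the broker address and group id are made up:

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, KafkaUtils, OffsetRange}

object KafkaOffsetsSketch {
  // Placeholders: read/write offsets in whatever reliable external store is used
  def loadOffsets(): Map[TopicPartition, Long] = ???
  def saveOffsets(ranges: Array[OffsetRange]): Unit = ???

  def createK(ssc: StreamingContext) = {
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",            // assumed broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-streaming-job",                // assumed group id
      "enable.auto.commit" -> (false: java.lang.Boolean))

    // Resume exactly where the last successfully completed batch left off
    val fromOffsets = loadOffsets()

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Assign[String, String](fromOffsets.keys.toSeq, kafkaParams, fromOffsets))

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process / join this batch first ...
      saveOffsets(ranges)   // persist offsets only after the batch's output is safely written
    }

    stream.map(record => (record.key, record.value))   // K, keyed like the Couchbase state
  }
}
```

Persisting the offsets only after the batch’s output has been written is what gives the zero-data-loss behaviour on restart (records may be reprocessed, so the downstream writes need to be idempotent).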

I think I’d like to do the latter, but since couchbaseStream uses Receivers, that isn’t doable. One workaround would be to create a dummy queueStream that, in a foreachRDD, issues one big Couchbase query to read all the data into an RDD. The first batch would take quite a bit longer, but then I’d have an RDD R holding the most recent state of the Couchbase bucket. Then, right after couchbaseStream, C would be merged with R inside C.foreachRDD. After a while, the code path that uses R would be cut out (this is feasible in foreachRDD, since it runs on the driver at each batch interval), because by then C would hold the most recent state, kept up to date by DCP.
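In code, the idea would look roughly like this. I’ve simplified it so that R is built once up front with a N1QL query (via the connector’s couchbaseQuery, if I’m reading the API right) rather than through a dummy queueStream, and the catch-up check that cuts the code path out is a placeholder:

```scala
import com.couchbase.client.java.query.N1qlQuery
import com.couchbase.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

object BootstrapMergeSketch {
  // Driver-side switch: foreachRDD bodies run on the driver each batch, so they can read and flip it
  @volatile private var bootstrapping = true

  // Placeholder: decide when DCP has caught up (e.g. by comparing item counts or sequence numbers)
  def dcpHasCaughtUp(): Boolean = ???

  def wire(ssc: StreamingContext, c: DStream[(String, String)]): Unit = {
    val sc = ssc.sparkContext

    // R: one-off bulk load of the bucket's current state via N1QL (query shape is just an example)
    val r: RDD[(String, String)] = sc
      .couchbaseQuery(N1qlQuery.simple("SELECT META(b).id AS id, b.* FROM `myBucket` b"))
      .map(row => (row.value.getString("id"), row.value.toString))
      .cache()

    c.foreachRDD { rdd =>
      val merged =
        if (bootstrapping) {
          // Prefer the freshest value: DCP mutations from this batch win over the bulk-loaded snapshot
          r.fullOuterJoin(rdd).mapValues { case (fromR, fromC) => fromC.orElse(fromR).get }
        } else {
          rdd
        }

      // ... feed `merged` into the join with the Kafka stream / downstream output ...

      // Once DCP has caught up, cut the bootstrap code path out for all later batches
      if (bootstrapping && dcpHasCaughtUp()) bootstrapping = false
    }
  }
}
```

Once the flag flips, r.unpersist() could also be called to drop the cached bulk-load, since every later batch would use C alone.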

Please advise.