I’m using the Spark Streaming Connector for Spark 2.0 without any issues so far. Now I’d like to implement zero data loss: I use the
couchbaseStream to receive all existing data and future mutations, and immediately apply an
updateStateByKey to cache the filtered bucket, partitioned across the executors. Let’s name this stream
C. Then, the states are retrieved with
stateSnapshots and I join an incoming stream from Kafka (named
K) with these state snapshots (the data received from Couchbase).
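A minimal sketch of that topology, under some assumptions: `stateSnapshots` is exposed by the stream returned by `mapWithState` (the successor to `updateStateByKey`), so the sketch uses that; the bucket name, batch interval, and the Kafka stream `K` are placeholders, and the connector API shown (`couchbaseStream`, `Mutation`, `FromBeginning`/`ToInfinity`) is the couchbase-spark-connector 2.0 streaming API.

```scala
import com.couchbase.spark.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val conf = new SparkConf().setAppName("cb-kafka-join")   // names are illustrative
val ssc  = new StreamingContext(conf, Seconds(5))

// Receive every current document plus all future mutations over DCP.
val mutations = ssc.couchbaseStream(from = FromBeginning, to = ToInfinity)
  .flatMap {
    case m: Mutation => Some(new String(m.key, "UTF-8") -> new String(m.content, "UTF-8"))
    case _           => None   // deletions ignored in this sketch
  }

// Cache the latest version of each document, partitioned across the executors.
val spec = StateSpec.function(
  (key: String, value: Option[String], state: State[String]) => {
    value.foreach(state.update)
    value
  })
val C = mutations.mapWithState(spec)

// Join each Kafka micro-batch against the cached Couchbase state:
// val joined = K.join(C.stateSnapshots())
```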
I’ve managed to save Kafka offsets to reliable storage to bypass the default checkpointing in Spark Streaming, since code upgrades are not supported with Spark Streaming checkpointing.
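For reference, the usual shape of this with the direct stream (spark-streaming-kafka-0-10 API) looks roughly like the sketch below; `loadOffsets`/`saveOffsets` are hypothetical stand-ins for whatever reliable store is used (ZooKeeper, a database, a Couchbase document), and `ssc`, `topics` and `kafkaParams` are assumed to exist.

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010._

def loadOffsets(): Map[TopicPartition, Long] = Map.empty    // hypothetical store read
def saveOffsets(ranges: Array[OffsetRange]): Unit = ()      // hypothetical store write

val K = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, loadOffsets()))

K.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  saveOffsets(ranges)   // persist only after the batch succeeds
}
```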
Unfortunately, when I restart the job, since
C uses Receivers that gradually replicate the data via DCP, the parallel Kafka Direct Stream
K will not be able to match any data from stream
C for a long time, until all the data has been replicated and applied to
C. So I either have to halt
K for a long period of time (I don’t know how to achieve this in Spark Streaming at the moment), or force
C not to finish its first micro-batch until all the data has been replicated (except for very recent mutations).
I think I’d like to do the latter, but since
C is backed by Receivers, that is not doable. One workaround would be to create a dummy
queueStream that, in a
foreachRDD, issues one big Couchbase query to read all the data into an RDD. The first batch would take quite a bit longer, but afterwards I’d have an RDD
R holding the most recent state of the Couchbase bucket. Then, right after the first batch,
R would be merged with C inside
C.foreachRDD. After a while, the codepath that uses
R could be cut out (this is feasible in
foreachRDD, since it runs on the driver at each batch interval), because by then
C would hold the most recent state, kept up to date by DCP.
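The bootstrap idea above could be sketched roughly as follows, assuming the connector’s N1QL support (`couchbaseQuery`); the query text, the `toPair` row conversion, the `caughtUp` switch, and treating `C`’s batches as (key, value) pairs are all illustrative assumptions, not the connector’s prescribed usage.

```scala
import com.couchbase.client.java.query.N1qlQuery
import com.couchbase.spark._

// R: a one-off full read of the bucket, cached for the merge phase.
val R = ssc.sparkContext
  .couchbaseQuery(N1qlQuery.simple("SELECT META().id AS id, * FROM `myBucket`"))
  .map(toPair)   // hypothetical: convert each query row to (key, value)
  .cache()

@volatile var caughtUp = false   // flipped on the driver once DCP has caught up

C.foreachRDD { rdd =>
  // Until DCP has replayed the whole bucket, fall back to R for keys the
  // stream has not produced yet; keys already updated by DCP win over R.
  val merged =
    if (caughtUp) rdd
    else R.subtractByKey(rdd).union(rdd)
  // ... join `merged` with the Kafka batch here ...
}
```

Since `foreachRDD` runs its closure on the driver each batch interval, the driver can flip `caughtUp` (and unpersist `R`) once the DCP stream is current, which is exactly the “cut out the codepath” step described above.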