DCP is a great way to batch access data or to ingest updates of reference objects to a streaming application. For batch and stream access we either use Apache Spark or Apache Flink. Let’s say that the total number of any type of document (e.g. users) are sufficiently large not to retrieve all through a single query. Also let’s state that some buckets hold multiple types of documents.
We implement the following use-cases:
UC1. Incoming user-click stream is analyzed real time using Apache Spark in a streaming program. Click events are correlated with user data, that is retrieved from Couchbase using the Spark Connector’s Receiver implementation, which uses DCP. The latter user update stream is kept as a state in Spark, using for example
UC2. Incoming user ratings (explicit or implicit click events) are used for online update of a factor model, a recommendation engine that is implemented in Apache Spark. In the program, ratings are correlated with user objects and items as well. These are kept in memory the same way as in the previous use-case.
UC3. Using Apache Spark we batch read items from Couchbase using DCP [see spark.sparkContext.couchbaseQuery number of partitions] into Apache Zeppelin notebook and do simple analytics and ad-hoc queries, aggregations on Spark’s in-memory data. Data is then written to HDFS.
I could easily add 10 more critical use cases from our applications.
Due to the “get-everything” nature of DCP, all in scenarios, the whole bucket will be read, transferred, deserialized and then a decision will be made whether the document is useful (i.e. filter for users in UC1). As mutations occur in Couchbase, all DCP clients will receive all the mutations, even though most DCP use cases filter for a specific type of document (e.g. users or items). This results in an impractically high overhead - pressure on the memory/disk (server side) as well as the CPU (client side).
A simple filter push-down mechanism would mitigate most overhead off these use cases. In a filter push-down, during DCP handshake, the client would initialize the connection with arbitrary filters, thus refusing mutations of some “document.type”. In UC1, the streaming application would essentially receive mutation and deletion events only where the “document.type” is “user”.
I would be happy to start a discussion on this improvement, whether it is feasible on the server side. Also, I’m willing to contribute to the Java client library if the idea ever gets materialization.