Couchbase Spark Connector Java Streaming


#1

Hi all,
we would like to export our data from Couchbase to a relational DB in order to analyze it and create some reports.

My idea was to use the Spark Connector Java API to:

  1. Load the RDBMS with a N1QL query extracting data from the Couchbase db.
  2. Exploit the Couchbase Spark Connector's Spark Streaming functionality to propagate data changes from the Couchbase database to the relational DB.
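Independently of the connector, step 1 boils down to building a N1QL extraction query and a matching parameterized JDBC insert. A minimal sketch of that plumbing follows; the bucket name, field list, and target table are hypothetical placeholders, not anything from the Couchbase docs:

```java
import java.util.Collections;

class InitialLoad {

    /** Build the N1QL query that extracts the chosen fields from a bucket. */
    static String n1qlExtractQuery(String bucket, String... fields) {
        return "SELECT " + String.join(", ", fields) + " FROM `" + bucket + "`";
    }

    /** Build the parameterized JDBC INSERT for the relational target table. */
    static String jdbcInsert(String table, String... columns) {
        String placeholders = String.join(", ",
                Collections.nCopies(columns.length, "?"));
        return "INSERT INTO " + table + " (" + String.join(", ", columns)
                + ") VALUES (" + placeholders + ")";
    }
}
```

Each row returned by the N1QL query would then be bound to the `?` placeholders via a standard `java.sql.PreparedStatement` batch.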

I would like to know if my approach is correct.
In addition, looking at http://developer.couchbase.com/documentation/server/4.0/connectors/spark-1.0/java-api.html
I have noticed that an example covering the streaming functionality is missing.
Is it possible to use the Spark connector's streaming from Java? Does any example exist?

I have tried to write some code, but without any documentation I have not been able to achieve my goal. In particular, I have not been able to build the CouchbaseInputDStream correctly, because I am not able to provide the streamFrom and streamTo parameters to the constructor. In addition, I have no idea how to retrieve the changed Couchbase documents from the notifications.

Many thanks for your attention.

-Giovanni


#2

I would personally use the JDBC or ODBC driver for Couchbase. It will make it appear like an RDBMS and you don’t have to move data around.

I believe we do not support streaming yet for Spark. @daschl can tell you more about this.


#3

@ldoguin
As far as I understand, with the JDBC driver I am not able to move data from Couchbase to the RDBMS incrementally, and I think that this is essential on a huge data set. Am I wrong?


#4

I was saying you could use the JDBC driver to avoid moving the data and avoid using an RDBMS at all.


#5

@giovanni.casella

A few points here. First, please pick Spark connector 1.2.1, since it has the most recent features that you'll be happy with: http://developer.couchbase.com/documentation/server/current/connectors/spark-1.2/spark-intro.html (and Spark 1.6.x).

You can use Spark Streaming for the incremental changes, but this of course requires more work to “ingest” properly, and streaming support is experimental right now. A different option would be to “poll for changes” with a N1QL query that fits this criterion, like “where updated_at > …” and so on.
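The “poll for changes” alternative can be sketched without the connector at all: keep a watermark of the newest update already exported and, on each pass, query only for newer documents. This is a minimal sketch assuming your documents carry an `updated_at` field in ISO-8601 form; the bucket name and field are assumptions about your schema, not part of any Couchbase API:

```java
import java.time.Instant;

/** Tracks the last exported change and builds the next incremental N1QL query. */
class ChangePoller {
    private Instant watermark;

    ChangePoller(Instant start) {
        this.watermark = start;
    }

    /** N1QL for the next incremental batch, e.g. to feed into a Spark job. */
    String nextQuery(String bucket) {
        return "SELECT META(b).id AS id, b.* FROM `" + bucket + "` b"
             + " WHERE b.updated_at > '" + watermark + "'"
             + " ORDER BY b.updated_at";
    }

    /** After a batch is ingested into the RDBMS, advance to the newest change seen. */
    void advance(Instant newestSeen) {
        if (newestSeen.isAfter(watermark)) {
            watermark = newestSeen;
        }
    }

    Instant watermark() {
        return watermark;
    }
}
```

The trade-off versus streaming: polling is simple and restart-safe (the watermark can be persisted), but it misses deletions and needs an index on `updated_at` to stay fast.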


#6

@daschl
Dear Michael,
actually I was using version 1.2.1, but I included the wrong documentation link in my message.

In any case, even looking at the right documentation, I have not been able to achieve my goal of using streaming from Java.


#7

@ldoguin
The problem is that to create reports I have to do a select over a lot of documents, and I think that RDBMSs are more performant than Couchbase for this kind of operation. This is the reason I had the idea of using an RDBMS with a schema already optimized for reports.


#8

@giovanni.casella I don’t know if that’s really the case - did you benchmark it? N1QL scales out pretty well with the new GSI, especially when in memory. Combined with KV fetches you can get awesome performance.

And you could do the analysis directly in Spark. I’m not sure you really need to go back into an RDBMS at all - can you tell us more about your use case?


#9

Is it because the docs only have examples in Scala for Spark Streaming?


#10

The problem is, on one side, that an example is missing and, on the other side, that I was not able to instantiate and use the CouchbaseInputDStream class correctly.


#11

@daschl How can I combine N1QL with KV fetches? With Couchbase 4.2 the only way to achieve fast N1QL queries was adding covering indexes, but a Couchbase engineer told us that it is better to avoid more than 5 indexes per bucket, so we moved from N1QL to KV fetches in almost all cases (and it was painful). Now I would like to avoid adding indexes for reports.

Regarding the reports with Spark, I must admit that I am a newbie with Spark, which I met this morning for the first time, while I have some experience with tools able to retrieve data from an RDBMS.


#12

@giovanni.casella Combining N1QL with KV fetches would be a SELECT META().id AS id FROM … and then you get back the document IDs, which you can “pipe” into the KV fetch.
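The shape of that two-step pattern can be sketched in plain Java. Here `kvFetch` is a hypothetical stand-in for the SDK's get-by-key call (it is not a real SDK method); the point is only that the indexed N1QL query returns nothing but IDs, and the documents are then pulled in key-value batches:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class IdThenFetch {

    /**
     * Fetch full documents for a list of IDs returned by a
     * "SELECT META().id AS id FROM ..." query, in batches of batchSize.
     * kvFetch is a placeholder for the real key-value lookup.
     */
    static <T> List<T> fetchAll(List<String> ids, int batchSize,
                                Function<List<String>, List<T>> kvFetch) {
        List<T> docs = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            List<String> batch = ids.subList(i, Math.min(i + batchSize, ids.size()));
            docs.addAll(kvFetch.apply(batch)); // one KV round-trip per batch
        }
        return docs;
    }
}
```

The win is that only `META().id` has to be covered by an index, while the heavy document payloads come straight from the KV layer.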

I’ll see if I can get a sample together. Is using Scala not an option for you at this point?