@daschl Thanks for the reply…
The dataset with the discrepancy has about 1 million records, so running `select * from my-bucket` from the cbq console is effectively impossible (I tried, and after several minutes got an “index scan timed out” error). However, I believe the numbers returned by cbq queries are always correct.
Smaller datasets (with thousands of records) don’t show any count discrepancy between cbq and the spark-connector in my tests.
I tried sparkContext.couchbaseQuery(…) with the same query I used in cbq. couchbaseQuery(“select * from …”) returns far fewer rows than the correct count.
couchbaseQuery(“select count(*) as counts from …”) returns nothing: when I print the result RDD’s count(), I get 0, and resultRDD.map(row => row.value.getLong(“counts”)).collect foreach println also prints nothing. Am I missing something here?
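For reference, here is roughly what I’m running (a sketch — the bucket name is a placeholder, the imports are what I believe the 1.x connector expects, and of course it needs a live cluster):

```scala
import com.couchbase.client.java.query.N1qlQuery
import com.couchbase.spark._

// Full select: the RDD has far fewer rows than cbq reports for the bucket
val rows = sc.couchbaseQuery(N1qlQuery.simple("select * from `my-bucket`"))
println(rows.count())

// Count query: the resulting RDD comes back empty
val counts = sc.couchbaseQuery(N1qlQuery.simple("select count(*) as counts from `my-bucket`"))
counts.map(row => row.value.getLong("counts")).collect.foreach(println)  // prints nothing for me
```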
One thing worth mentioning: when using sqlContext.read.couchbase, the Couchbase cluster shows only about 1K reads/sec, but when using couchbaseQuery(“select count(*) …”) the Couchbase UI shows > 10K reads/sec, same as running the query directly in cbq.
Maybe sqlContext.read.couchbase is much slower for some reason (select * uses more network bandwidth, for example), eventually hit an error like “Index scan timed out”, but swallowed it and returned only whatever had been scanned by that point?
A separate issue I ran into while experimenting with the Spark connector: in a Spark app, if I create the SparkConf with Couchbase properties, I can no longer talk to Hive. I’m trying to batch-load metadata into a Hive table, and that now seems impossible, since my SparkContext can’t talk to both Couchbase and Hive at the same time.
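For context, this is roughly how I build the conf (host and bucket names are placeholders); once the Couchbase properties are set, Hive access stops working:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("myApp")
  .set("com.couchbase.nodes", "cb-host")      // placeholder Couchbase node
  .set("com.couchbase.bucket.my-bucket", "")  // placeholder bucket (empty password)
val sc = new SparkContext(conf)

// With the properties above, this HiveContext can no longer reach my Hive metastore
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
```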
Do you have any other suggestions for this use case? I know there’s a spark-hadoop connector, but I need more fine-grained processing when publishing data to Hive.