Batch reading data - controlling vbuckets


#1

I would like to read a significant portion of the data in Couchbase with Spark. I want the read to be done in a reasonable way: instead of millions of random reads, I would like to control which vBucket/server/file I’m reading the data from. I know there is a Spark connector, but there are many complaints that it lacks this kind of control, which causes bad performance.
So my question is: does Couchbase allow operations like:

  • read an entire vBucket (preferably, since I believe an entire vBucket is not only on the same server but also in a single place storage-wise).
    or:
  • any other way of using indexes that allows efficient batch reads (e.g. a clustered index in a relational database allows reading data by range efficiently).

Thanks,
Marcin


#2

Hi @marcin.szymaniuk,

I don’t think there is any interface (at least public interface) that allows you to do any vbucket-centric operation.

Are the Spark connector complaints yours? You may want to start a separate thread in the Spark connector forum to address them: https://forums.couchbase.com/c/spark or at least make @tyler.mitchell aware of them (if he isn’t already).

The closest I can think of to what you’re asking is the new index partitioning feature in Couchbase Server 5.5. This allows you to do “partition elimination” when executing N1QL queries. This means you can narrow your index down to a single partition (kinda like a vbucket, but for indexes). But it will still be gathering documents from multiple vbuckets. More information here: https://blog.couchbase.com/index-partitioning-couchbase-server/
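To make the idea concrete, here is a minimal sketch of what a hash-partitioned index and a matching query might look like. It only builds the N1QL statement strings and does not connect to a cluster. The bucket name `orders`, the index name, and the fields `customer_id` and `order_date` are made up for illustration, and the assumption is that an equality predicate on the partition key is what lets the query be narrowed to one index partition; check the linked blog post for the exact conditions.

```python
# Sketch only: builds N1QL statement strings, does not talk to Couchbase.
# All bucket, index, and field names passed in below are hypothetical.

def partitioned_index_ddl(bucket, index_name, partition_key, indexed_fields):
    """Build a CREATE INDEX statement that hash-partitions on one field."""
    fields = ", ".join(f"`{f}`" for f in indexed_fields)
    return (
        f"CREATE INDEX `{index_name}` ON `{bucket}`({fields}) "
        f"PARTITION BY HASH(`{partition_key}`)"
    )


def single_partition_query(bucket, partition_key, selected_fields):
    """Build a SELECT with an equality predicate on the partition key.

    Assumption: the equality on the partition key is what allows the
    index scan to be restricted to a single index partition. The value
    is left as the named parameter $value.
    """
    cols = ", ".join(f"`{f}`" for f in selected_fields)
    return f"SELECT {cols} FROM `{bucket}` WHERE `{partition_key}` = $value"


if __name__ == "__main__":
    print(partitioned_index_ddl(
        "orders", "idx_orders_customer", "customer_id",
        ["customer_id", "order_date"],
    ))
    print(single_partition_query(
        "orders", "customer_id", ["customer_id", "order_date"],
    ))
```

Even in this form, the documents themselves are still fetched from whichever vbuckets hold them, so this narrows the index scan, not the data reads.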


#3

Thanks @matthew.groves

The complaints are not mine. I mentioned them because I want to learn from others’ experience and not end up in a blind alley.

I will try the Spark connector forum, as you suggested.

Thanks for the interesting link!