Couchbase Bucket, vBucket, Namespace, Keyspace, Hash Key definitions plus best practice for key definitions

eldorado · August 9, 2019, 2:07pm

Dear CB team,

A lot of CB blog posts, other technical material, and some forum threads have left me very confused about all these definitions. I am new to CB and trying to get a grip on the fundamentals. Can you please help clarify them?
I read through the below, but a few things are still unclear.

This is what I assume the correct terms mean. Please correct me inline if my understanding is NOT correct.

1. A namespace is a logical group of buckets. How can the ‘default’ namespace be renamed?
2. Buckets are nothing but keyspaces. One keyspace is equivalent to one table in the relational world. A keyspace is neither a single relational record nor a single document inside a bucket.
3. Can one bucket contain multiple different keyspaces, or is this a simple one-to-one relationship? Keyspaces are not the same as keys?
4. Multiple keyspaces can be mapped to vBuckets using a hash key maintained by CB’s internal hashing algorithm.
5. vBuckets are conceptual and can’t be accessed programmatically or through any SDK?
6. vBuckets will be replicated across cluster nodes, and each vBucket must be present on every cluster node.
7. vBuckets are mapped to each server in the cluster using CB’s internal mapping algorithm and a cluster map lookup.
8. Can each bucket have at most 1024 vBuckets, no matter how big the storage of the cluster nodes is?

Also, in all the above definitions, what is logical and what is physical?

On a separate question, what is the best practice for defining keys in documents? Can we use :: as a separator if we want multiple column values to be part of the key? Or should the key be like a surrogate key in the relational world?

Thanks for help

graham.pople · August 9, 2019, 3:23pm

Hey @eldorado

You’re along the right lines

1. I’m going to skip this one, as namespace isn’t a term I’d use with Couchbase, and I don’t really understand this question. Did you find some resource mentioning it? We don’t really have the concept of a grouping of buckets.
2. I also wouldn’t really use the term “keyspace” with Couchbase. The doc you link (a very old one, now) is using it to mean “all your keys”, basically, in the sense that Couchbase, at its simplest, is a key-value store. Buckets are groupings of keys. Buckets aren’t really analogous to tables, no: you don’t have to, e.g., only put your customer keys in a bucket; you can put your order keys in as well.
3. There are some physical resources consumed for each bucket, so you are actively encouraged to minimise your bucket count, and store multiple sorts of keys in each.
4. Each bucket has 1024 logical virtual buckets (vbuckets). All keys in the bucket are distributed over the 1024 vbuckets using a Couchbase-defined hashing function.
5. In the general case yes, you won’t be accessing vbuckets directly. The SDK abstracts all of this for you. You just get and update keys, and the SDK routes each request to the right vbucket using the hashing function.
6. You configure how many replicas of each vbucket you want, and those replicas will get distributed around the cluster nodes. The active copy of each vbucket will exist on only one node. On whether each vbucket must be present on every cluster node: not necessarily. E.g. picture a scenario where you have 20 nodes and 3 vbucket replicas: each vbucket will exist on 4 of the 20 nodes (3 replicas + 1 active).
7. Yes, the 1024 vbuckets get distributed around the physical nodes. You have some control over this, e.g. in Enterprise Edition you can have the concept of ‘rack awareness’. But in general, you don’t have to worry about it.
8. In general, yes, each bucket has 1024 vbuckets. There are some exceptions and it can be configured, but that’s the usual case.
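
To make the key-to-vbucket mapping concrete, here is a minimal Python sketch of a CRC32-based hash reduced to 1024 slots. The function name `vbucket_for_key` and the exact bit manipulation are assumptions for illustration; the SDK's real implementation may differ in detail, so treat this as a model of the idea rather than the actual algorithm.

```python
import zlib

NUM_VBUCKETS = 1024  # the usual per-bucket vbucket count mentioned above


def vbucket_for_key(key: str, num_vbuckets: int = NUM_VBUCKETS) -> int:
    """Map a document key to a vbucket id (illustrative CRC32 scheme)."""
    crc = zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF
    # Assumption: keep 15 of the upper checksum bits, then reduce
    # modulo the vbucket count. The real SDK may differ in detail.
    return ((crc >> 16) & 0x7FFF) % num_vbuckets


if __name__ == "__main__":
    # The same key always lands in the same vbucket; different keys spread out.
    for key in ("customer::12387", "order::981", "customer::12388"):
        print(key, "->", vbucket_for_key(key))
```

Because the vbucket id is derived only from the key, adding more documents never creates new vbuckets; it only puts more keys into the existing 1024.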

On keys - I like the convention of e.g. “customer::12387”. You can also use a “type”:“Customer” field in the doc. I’m not sure I understand about multiple column values? Generally you will be storing data as JSON in the document’s body.
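
The key convention above can be sketched in a few lines of Python. The helper names `make_key` and `make_document` and the sample customer fields are hypothetical, chosen only to show a "type::id" key alongside a "type" field in the JSON body.

```python
import json

SEPARATOR = "::"


def make_key(doc_type: str, *parts: object) -> str:
    """Build a key such as 'customer::12387' from a type and id parts."""
    return SEPARATOR.join([doc_type.lower(), *(str(p) for p in parts)])


def make_document(doc_type: str, body: dict) -> dict:
    """Return the JSON body with a 'type' field added alongside the data."""
    return {"type": doc_type, **body}


# Hypothetical sample data, for illustration only.
key = make_key("Customer", 12387)
doc = make_document("Customer", {"name": "Jane Doe", "city": "London"})

print(key)
print(json.dumps(doc))
```

The descriptive attributes live in the document body, where they can be indexed, while the key stays short.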

Hope this helps!

eldorado · August 9, 2019, 4:46pm

Thank you @graham.pople for such a swift reply and detailed explanation.
I asked Q1 for a reason.
When I run: SELECT * FROM system:namespaces
it returns the JSON below:
[
  {
    "namespaces": {
      "datastore_id": "http://127.0.0.1:8091",
      "id": "default",
      "name": "default"
    }
  }
]

And when I run “SELECT * FROM system:keyspaces” it returns:
[
  {
    "keyspaces": {
      "datastore_id": "http://127.0.0.1:8091",
      "id": "beer-sample",
      "name": "beer-sample",
      "namespace_id": "default"
    }
  },
  {
    "keyspaces": {
      "datastore_id": "http://127.0.0.1:8091",
      "id": "gamesim-sample",
      "name": "gamesim-sample",
      "namespace_id": "default"
    }
  },
  {
    "keyspaces": {
      "datastore_id": "http://127.0.0.1:8091",
      "id": "travel-sample",
      "name": "travel-sample",
      "namespace_id": "default"
    }
  }
]

This tells me each physical bucket has a namespace_id assigned to it, which is ‘default’. I am not sure what ‘default’ means here. Is it customizable? If yes, is it possible to assign a namespace? Does this namespace have any relation to the Kubernetes cluster namespace ‘name’ configuration?

This also tells me that each namespace is a collection of keyspaces, and keyspaces are nothing but buckets with their keys.
On #4: Do you mean that if we keep adding more data / documents / rows and different kinds of keys, there is any chance we can exhaust that internal vBucket limit of 1024?
On #6: I am slightly confused, as I believe we can define replicas for buckets in the web console but not for vBuckets. Internally those relate to the 1024 vBuckets, which I think is what you mean. Since virtual means existing in appearance and not in reality, I believe virtual buckets are just a cluster map hash table and not real storage. So the replicas of a bucket or its vBuckets are not redundant storage of the same data across nodes, correct?

On #7: If I have more than one bucket, say 5, then 5 * 1024 vBuckets will be distributed across all physical nodes?

In a 3 node cluster with 3 bucket replicas, if my Python SDK call fetches a single document using get() on the opened bucket, I assume the hashing table will look up which active vBucket holds that key and retrieve the data from the particular node where the data is present? Is there any way to know programmatically which cluster node served the request from a storage perspective (assuming it’s a Couchbase bucket and not memcached)? I am trying to understand whether this is the standard flow for API calls, and whether it is the same for N1QL query requests.

The reason I asked the key question is that I want to use a unique combination of 7 column values in the key for a single key type, with the column values separated by :: (just to identify each document very quickly). I read somewhere that more characters in the key take more bytes and eventually more in-memory space in index storage. Is that true? We are planning to move a gigantic table and will eventually use different indexes, so I want to be cautious from a design perspective about whether this would take more index memory.

I hope I have put my questions the right way this time. Sorry if I made any mistakes.
thanks

vsr1 · August 9, 2019, 6:57pm

N1QL uses namespaces and keyspaces.
As of now, the namespace values are “default” and “system”; these are not configurable.

As of now, a keyspace is the same as a bucket name or a virtual bucket name (virtual buckets are used in the system namespace only).
SELECT * FROM namespace:keyspace WHERE …
When namespace: is omitted, the default namespace is used and the keyspace becomes the bucket name.

Physical buckets are created inside the default pool, which is why the namespace is derived from the pool (default).
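
As a small illustration of the namespace:keyspace form, the Python sketch below builds a qualified keyspace reference for use in a N1QL statement. The helper name `qualified_keyspace` is hypothetical. It wraps the keyspace in backticks, which N1QL requires for identifiers containing a hyphen such as beer-sample, and it only accepts the two namespaces named above.

```python
ALLOWED_NAMESPACES = ("default", "system")


def qualified_keyspace(keyspace: str, namespace: str = "default") -> str:
    """Return 'namespace:`keyspace`' for use in a N1QL FROM clause."""
    if namespace not in ALLOWED_NAMESPACES:
        raise ValueError(f"unsupported namespace: {namespace!r}")
    if "`" in keyspace:
        # Keep the sketch simple: refuse names that would need escaping.
        raise ValueError("keyspace name must not contain a backtick")
    return f"{namespace}:`{keyspace}`"


statement = f"SELECT * FROM {qualified_keyspace('beer-sample')} LIMIT 1"
print(statement)
```

Leaving the namespace out of a query has the same effect as passing "default" here.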

eldorado · August 9, 2019, 10:24pm

@vsr1 thanks … a couple of concepts around namespaces are clear now. Still waiting for @graham.pople to clarify the other questions.

vsr1 · August 10, 2019, 3:12am

The following should help with some of those questions: https://docs.couchbase.com/server/4.1/concepts/buckets-vbuckets.html

https://docs.couchbase.com/server/4.5/data-modeling/intro-data-modeling.html

https://docs.couchbase.com/server/4.5/travel-app/travel-app-data-model.html

If you are going to use N1QL, check the Indexing and optimization sections

https://blog.couchbase.com/author/keshav-murthy/

eldorado · August 10, 2019, 3:02pm

Thanks for the additional links … I follow Keshav and all his valuable blogs … so I am well versed in that.
I just need a few clarifications on my previous post to settle my final doubts. So I hope to see a reply from @graham.pople

thanks again

graham.pople · August 12, 2019, 8:46am

Hey @eldorado

So, I’m learning from this thread too about N1QL namespaces and keyspaces! (I work on the SDKs and am not super knowledgeable on N1QL.)

On #4: Nope, you can’t exhaust the 1024 vbuckets. Your data gets distributed between them using the hashing function.
On #6: When you define the number of replicas in the Admin UI, what precisely you’re setting is how many times each individual vbucket will be replicated. E.g. with replicas=2, you will have in total 1024 active vbuckets, and 2048 replica vbuckets providing redundant backups. You read and write data initially to the active vbuckets, and then the changes are seamlessly sent to their replica vbuckets in the background. (As an aside, we also have two mechanisms for you to check when the data is safely available on the replicas; I can go into that if desired.)
On #7: If you have replicas=0 then yes, 5 buckets means 5 * 1024 vbuckets. If you have 2 replicas then it means 5 * 3 * 1024 vbuckets (1 active & 2 replicas, for each vbucket, for each bucket).
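
The arithmetic above can be checked with a short Python sketch. The function name `total_vbuckets` is hypothetical, and 1024 vbuckets per bucket is the usual case described in this thread.

```python
VBUCKETS_PER_BUCKET = 1024  # usual case, as described above


def total_vbuckets(buckets: int, replicas: int,
                   per_bucket: int = VBUCKETS_PER_BUCKET) -> int:
    """Count active plus replica vbuckets across all buckets."""
    copies = 1 + replicas  # one active copy plus the configured replicas
    return buckets * copies * per_bucket


# One bucket with replicas=2: 1024 active + 2048 replica vbuckets.
print(total_vbuckets(buckets=1, replicas=2))
# Five buckets without replicas, then with two replicas each.
print(total_vbuckets(buckets=5, replicas=0))
print(total_vbuckets(buckets=5, replicas=2))
```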

Yes, if you do a get(), the SDK uses the hashing function to determine the vbucket, and sends the request to the correct node for that vbucket. The request goes only to the active node (there is another SDK method to request the doc from all replicas, and you could use this for e.g. quorum reads, but it’s not commonly used). You generally don’t need to know the actual node it came from, and while the SDK has this info, it’s not returned to the app.
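
The routing described here can be modelled with a toy cluster map in Python. Everything in this sketch is an assumption for illustration: the node names, the round-robin placement, and the CRC32-based hash stand in for what the server and SDK do internally, and none of it is the SDK's API.

```python
import zlib
from typing import Dict, List

NUM_VBUCKETS = 1024
NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def vbucket_for_key(key: str) -> int:
    """Illustrative CRC32-based hash; the real SDK may differ in detail."""
    crc = zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF
    return ((crc >> 16) & 0x7FFF) % NUM_VBUCKETS


def build_cluster_map(nodes: List[str], replicas: int) -> Dict[int, List[str]]:
    """Toy placement: vbucket id -> [active node, replica nodes...]."""
    if replicas >= len(nodes):
        raise ValueError("need at least replicas + 1 nodes")
    return {
        vb: [nodes[(vb + i) % len(nodes)] for i in range(replicas + 1)]
        for vb in range(NUM_VBUCKETS)
    }


def active_node_for_key(key: str, cluster_map: Dict[int, List[str]]) -> str:
    """A get() goes only to the node holding the active copy."""
    return cluster_map[vbucket_for_key(key)][0]


cluster_map = build_cluster_map(NODES, replicas=2)
key = "customer::12387"
print(key, "-> vbucket", vbucket_for_key(key),
      "-> active on", active_node_for_key(key, cluster_map))
```

In this model the client needs only the key and the cluster map to pick a node, which is why no lookup on the server side is required before the request is sent.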

For N1QL queries and updates, the SDK will route the request to a N1QL query service, which will then take care of querying indexes and querying/updating particular vbuckets, as required. So if you know the exact key(s) you’re trying to query or modify, it’s usually faster to do that directly with SDK get/replace calls than to go via N1QL (though of course N1QL is fantastic for multi-document queries where you don’t have the keys).

Yes, the bigger the key, the more memory it takes. I don’t know your use-case but putting data, especially 7 chunks of data, into the key, is perhaps the wrong way to go. Have you looked at just putting your data into the document, and creating indexes on it? The links that @vsr1 provided above are some great resources on this.
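
To get a rough feel for the cost of long keys, the Python sketch below compares the raw byte size of a 7-part composite key with a short surrogate-style key. The sample key parts and the document count are hypothetical, and the estimate counts only the key bytes themselves; per-document metadata and index structure overhead are deliberately ignored. The 250-byte check reflects Couchbase's maximum key length.

```python
MAX_KEY_BYTES = 250  # Couchbase document keys are limited to 250 bytes


def key_bytes(key: str) -> int:
    """Size of a key in bytes once UTF-8 encoded."""
    size = len(key.encode("utf-8"))
    if size > MAX_KEY_BYTES:
        raise ValueError(f"key is {size} bytes, over the {MAX_KEY_BYTES}-byte limit")
    return size


def extra_key_bytes(long_key: str, short_key: str, doc_count: int) -> int:
    """Raw extra bytes spent on keys when every document uses the longer form."""
    return (key_bytes(long_key) - key_bytes(short_key)) * doc_count


# Hypothetical keys: seven values joined by '::' versus type plus one id.
composite = "::".join(["order", "2019", "08", "EMEA", "retail", "000123", "A7"])
surrogate = "order::000123"

print(key_bytes(composite), key_bytes(surrogate))
print(extra_key_bytes(composite, surrogate, doc_count=1_000_000))
```

The values dropped from the key can go into the document body and be covered by a secondary index instead.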

eldorado · August 13, 2019, 1:32pm

Super, thank you, that helps again. One last question: a replica is redundant storage, correct? So if I have 1 TB of bucket data with 2 replicas, I am eventually looking at 3 TB of physical storage?
For indexes I am more concerned because they use in-memory storage, so I am looking for best practices in terms of design.
thanks

graham.pople · August 13, 2019, 1:40pm

On the redundant storage - yes, that’s correct. vsr1 can address any index questions much better than I, but those resources above will likely answer any questions you have.

vsr1 · August 13, 2019, 2:17pm

https://docs.couchbase.com/server/current/install/sizing-general.html#sizing-index-service-nodes

eldorado · August 14, 2019, 4:32am

Hi, this helps, but it doesn’t tell me the maximum number of bytes an index can consume if I have only one index on my document key, or whether the document key size or length has any impact on the index while it is in memory.
thanks

vsr1 · August 14, 2019, 4:43am

If you want one index per bucket on the document key, you can create a primary index. Queries that filter on a prefix of the document key work well, but suffix matches or leading wildcards will not be efficient.
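
Why a prefix works well and a leading wildcard does not can be seen with a toy ordered index in Python. This is only a model of an index as a sorted list of keys, not Couchbase's index implementation; the function names and sample keys are hypothetical.

```python
from bisect import bisect_left
from typing import List


def prefix_scan(sorted_keys: List[str], prefix: str) -> List[str]:
    """Range scan: jump to the first candidate, stop at the first mismatch."""
    matches = []
    i = bisect_left(sorted_keys, prefix)
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches


def suffix_scan(sorted_keys: List[str], suffix: str) -> List[str]:
    """No ordering helps here: every key has to be inspected."""
    return [k for k in sorted_keys if k.endswith(suffix)]


keys = sorted(["customer::1", "customer::2", "order::1", "order::2", "airline::1"])

print(prefix_scan(keys, "customer::"))
print(suffix_scan(keys, "::1"))
```

With a composite key, this means the parts most often used for filtering belong at the front of the key.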

eldorado · August 16, 2019, 3:47pm

Hi @vsr1 - I want to use the UPSERT command on a certain namespace. Since, as you said, only default and system can be used, how would I create a namespace such as main to use in my N1QL queries?

https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/upsert.html

You can add an optional namespace-name to the keyspace-name in this way:

namespace-name : keyspace-name .

For example, main:customer indicates the customer keyspace in the main namespace. If the namespace name is omitted, the default namespace in the current session is used.

vsr1 · August 16, 2019, 4:02pm

At present only the default and system namespaces are allowed in N1QL queries, and you can’t create any other namespace.

A few days ago the CB 6.5.0 Beta was released, and it includes a Developer Preview of Collections. You can check out https://blog.couchbase.com/get-started-with-couchbase-collections-using-the-demo-app/

eldorado · August 16, 2019, 4:05pm

Thanks, interesting and good to know … One thing I am not sure about: if you can’t use other namespaces, why does the CB N1QL documentation still reference a namespace called ‘main’? This seems to contradict what is actually possible.

vsr1 · August 16, 2019, 4:13pm

In the bottom right corner of the documentation page there is a Feedback link. You can open an issue there, or post the link here and it will get fixed.

eldorado · August 17, 2019, 7:00pm

thanks Done … DOC-5667

Harmanat_Singh · October 12, 2020, 7:19am

I am learning from this thread, but there is one question, the answer to which I am still looking for: