Keyspace not found CBAuth database is stale


#1

Hello Everyone,
We are using Couchbase 4.6.1 as a local cache for our service, in this case one Couchbase instance per service instance. The client uses the Couchbase C SDK. The OS is Windows Server 2012 x64.
The problem is that at a customer site the bucket randomly becomes unavailable. To fix it we have to delete the existing bucket and create a new one.
The bucket name is Nuance.
When searching for data we get the following error:

status: "fatal", [{"code":12003,"msg":"Keyspace not found keyspace Nuance - cause: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: ConnectEx tcp: No connection could be made because the target machine actively refused it."}] Error code 59 (HTTP Operation failed. Inspect status code for details)

The last time the issue happened was at 2018/04/24 15:31:12, and I see what seem to be related errors in error.log.1 around that time.
All query.log.x files report “CBAuth database is stale” for a number of days before I see the error in our logs on April 24.

I would appreciate advice on how this issue can be traced down and fixed. I have all the Couchbase logs.
Thanks,
Vlad

from query.log.9:

2018-04-23T17:40:17.844-07:00 [Error] common.ClusterAuthUrl(): CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: ConnectEx tcp: No connection could be made because the target machine actively refused it.
_time=2018-04-23T17:40:19.174-07:00 _level=INFO _msg= keyspace Nuance not found CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: ConnectEx tcp: No connection could be made because the target machine actively refused it. 

from error.log.1:

[ns_server:error,2018-04-24T15:26:20.303-07:00,ns_1@127.0.0.1:<0.42.5>:menelaus_web:loop:187]Server error during processing: ["web request failed",
                                 {path,"/pools/default/buckets/Nuance"},
                                 {method,'GET'},
                                 {type,exit},
                                 {what,
                                  {timeout,
                                   {gen_server,call,[ns_config,get,15000]}}},

[ns_server:error,2018-04-24T15:27:00.257-07:00,ns_1@127.0.0.1:<0.1368.5>:menelaus_web:loop:187]Server error during processing: ["web request failed",
                                 {path,"/pools/default"},
                                 {method,'GET'},
                                 {type,exit},
                                 {what,
                                  {{noproc,
                                    {gen_server,call,
                                     ['index_status_keeper-index',
                                      get_indexes_version]}},
                                   {gen_server,call,
                                    [<0.1288.5>,
                                     #Fun<menelaus_web_cache.2.70484883>,
                                     infinity]}}},

[ns_server:error,2018-04-24T15:26:46.844-07:00,ns_1@127.0.0.1:ns_doctor<0.320.0>:ns_doctor:update_status:308]The following buckets became not ready on node 'ns_1@127.0.0.1': ["Nuance"], those of them are active ["Nuance"]
[ns_server:error,2018-04-24T15:33:42.596-07:00,ns_1@127.0.0.1:capi_ddoc_replication_srv-Nuance<0.541.0>:ns_couchdb_api:wait_for_doc_manager:307]Waited 10000 ms for doc manager pid to no avail. Crash.
[ns_server:error,2018-04-24T15:33:42.616-07:00,ns_1@127.0.0.1:capi_doc_replicator-Nuance<0.540.0>:ns_couchdb_api:wait_for_doc_manager:307]Waited 10000 ms for doc manager pid to no avail. Crash.

#2

I’m a bit unclear. You’re saying it randomly becomes unavailable and you see those messages? Then later you fix it by deleting it and recreating it?

Or are you saying it randomly becomes unavailable and when deleting and recreating the bucket, you see those messages?

Note that a bucket delete/recreate propagates asynchronously through the cluster, so some messages about not being able to connect, while not great, may appear, and things should recover a little while after the bucket is created. Effectively, bucket deletion/creation isn’t expected to be instantaneous.
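
If you want to confirm the recovery rather than just waiting a fixed interval, something like the sketch below polls the bucket’s REST endpoint until the cluster manager reports it again (the path is the one from your error.log; the credentials and timings are placeholders):

// waitbucket.go - sketch: after recreating the bucket, poll its REST
// endpoint until the cluster manager reports it. The path
// /pools/default/buckets/Nuance appears in the error.log excerpt above;
// the user/password and timings are placeholders, not real values.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	url := "http://127.0.0.1:8091/pools/default/buckets/Nuance"
	client := &http.Client{Timeout: 5 * time.Second}

	for attempt := 0; attempt < 60; attempt++ { // up to roughly a minute
		req, err := http.NewRequest("GET", url, nil)
		if err != nil {
			fmt.Println("bad request:", err)
			return
		}
		req.SetBasicAuth("Administrator", "password") // placeholder credentials
		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				fmt.Println("bucket is visible again")
				return
			}
		}
		time.Sleep(1 * time.Second)
	}
	fmt.Println("bucket still not visible after waiting")
}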


#3

The bucket randomly becomes unavailable, our service’s search against Couchbase fails, and I see the error {“code”:12003,“msg”:"Keyspace not found keyspace Nuance…} in our log.
We do not monitor the Couchbase logs. I tried to examine them today hoping to find the event which triggered the failure…
Previously our support techs would reboot the server, which usually fixed the issue. At some point a reboot did not help, so they switched to stopping our service and deleting and recreating the bucket, and this has fixed the issue about 3 times already.
Our setup for this customer is pretty simple: one Couchbase server per service instance, running on the same box, so basically there is no cluster per se…
If I browse through the 10 query.log files I see hundreds of errors similar to:

2018-04-25T12:25:55.725-07:00 [Error] common.ClusterAuthUrl(): CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: ConnectEx tcp: No connection could be made because the target machine actively refused it.
2018-04-24T01:22:43.880-07:00 [Error] common.ClusterAuthUrl(): CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: ConnectEx tcp: No connection could be made because the target machine actively refused it.
2018-04-23T17:40:17.844-07:00 [Error] common.ClusterAuthUrl(): CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: ConnectEx tcp: No connection could be made because the target machine actively refused it.
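
To check the query service directly the next time this happens (independently of our client code), I put together a small sketch that asks it whether the Nuance keyspace is still visible; port 8093 is the standard query REST port, and the credentials are placeholders that may not even be needed on 4.6.x:

// checkkeyspace.go - sketch: ask the query service whether the Nuance
// keyspace is visible, which is the same view that fails with code 12003
// in our application.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
	"time"
)

func main() {
	stmt := "SELECT name FROM system:keyspaces WHERE name = 'Nuance';"
	form := url.Values{"statement": {stmt}}

	req, err := http.NewRequest("POST", "http://127.0.0.1:8093/query/service",
		strings.NewReader(form.Encode()))
	if err != nil {
		fmt.Println("bad request:", err)
		return
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	req.SetBasicAuth("Administrator", "password") // placeholder credentials

	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("query service unreachable:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	// Print the raw JSON response so any error text is visible as-is.
	fmt.Println(string(body))
}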

#4

@ingenthr Hi Matt, any suggestions on what to look for in the Couchbase logs to track down the issue? Or maybe you or your team could review the logs I have? Thanks


#5

Apologies for the long delay in replying; I’ve been traveling a bit. I’d start with the error.log and then maybe the debug.log (see the docs for the location on your platform). One possible theory: a bug in cbauth in 4.6.1 that has since been fixed?

If you have an enterprise subscription, you may want to contact Couchbase support to have a look at the logs. Log analysis is usually iterative and may require looking at a few components.

I also searched the issues and found that this behavior can be caused by defects in prepared statement handling that were fixed in 4.6.4; see MB-26075. Based on that finding, you should probably upgrade to 4.6.4 before doing any more searching.
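
If you want a quick way to see whether prepared statements are even in play before upgrading, you could list what the query service currently has cached. A rough sketch only (system:prepareds is the system keyspace that holds them; whether credentials are required depends on your setup):

// listprepareds.go - sketch: list the prepared statements the query
// service currently holds. An empty result set suggests the workload
// does not use prepared statements at all.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// On unsecured 4.6.x setups this may work without credentials; add
	// basic auth as in the earlier sketches if your cluster requires it.
	resp, err := http.PostForm("http://127.0.0.1:8093/query/service",
		url.Values{"statement": {"SELECT * FROM system:prepareds;"}})
	if err != nil {
		fmt.Println("query service unreachable:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}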