Primary N1QL Index for a bucket mysteriously disappears just after graceful failover of a node

40-rc

#1

OK, I’m really excited about the possibilities of using Couchbase 4 and N1QL, so I thought I would first get a feel for how the failover / redundancy technology handles it, and how things appear from the client side. I’m encountering a problem right off the bat, and it seems so simple that I feel I must be doing something wrong, so please, somebody slap me and tell me what I’ve messed up. :smile:

I installed a 2-node cluster of 4.0 Community Edition RC0 (build 4047). During the initialization of each node, I told it I wanted all the service checkboxes turned on, so all components should be running on both nodes. I created a Couchbase bucket, then I did a CREATE PRIMARY INDEX USING GSI. (This problem happens whether I give the primary index a name or leave it unnamed and have it called “#primary”.) I can then do a few UPSERTs and some very nice SELECTs, and it’s nice and fast, and I can N1QL query from either of the nodes and get the same results. All lovely.
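For the record, the steps above amount to roughly the following N1QL (the index name and documents are illustrative; the bucket name `throttlen1ql` is taken from the error message below):

```sql
-- Create the primary GSI index (leaving the name off makes it "#primary")
CREATE PRIMARY INDEX idx_primary ON throttlen1ql USING GSI;

-- Load a couple of test documents
UPSERT INTO throttlen1ql (KEY, VALUE)
  VALUES ("doc1", {"type": "test", "n": 1}),
         ("doc2", {"type": "test", "n": 2});

-- Query from either node; both return the same results
SELECT META().id, n FROM throttlen1ql WHERE type = "test";
```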

But if I do a controlled Graceful Failover of either node, and keep querying over and over as the progress bar fills up, it works all the way until the progress bar reaches 100% and disappears; then, instantly, it’s as if the primary index is missing, or deleted, or was not copied from the vBucket it was supposed to during the failover, or something. I get

       "code": 4000,
        "msg": "No primary index on keyspace throttlen1ql. Use CREATE PRIMARY INDEX to create one."

in my JSON result, so of course all N1QL queries fail. If I bring that node back into place with Delta Recovery, the index remains gone. If I recreate the index, then all my queries start working again, and it has not lost the actual data (the JSON documents themselves in the bucket). If I recreate the index while there’s just one node in, and then I bring the other node back in, then it continues to work fine. So it seems to be just something that happens right at the end of the graceful fail-out process. I also don’t think it happens every time, but I’ve failed out one node, then the other, and it has happened at least once for each of the two nodes, so I can’t believe it’s something corrupted on just one of the nodes.

Any ideas? I did formerly have the Beta version of 4.0 (which I believe was 4.0 Enterprise Beta) on these two servers, but I did carefully “dpkg --remove” it and I zapped the contents of /opt/couchbase before "dpkg --install"ing the release version. So I don’t think it’s any ghost data hanging around from before, but even if it is, this behavior is certainly unexpected and unwelcome. If anyone wants any logs, I can excerpt or upload anything that would help. Just looking over the “Logs” tab in the GUI, I don’t see anything I would call strange. Thanks, all!!

– Jeff Saxe
SNL Financial
Charlottesville, Virginia


#2

In 4.0, index replicas aren’t created automatically. You can, however, create a primary index on multiple nodes. The documentation has a general example of how to create an index on multiple nodes of the cluster. Let us know if you need any additional help.
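As a sketch of the pattern the documentation describes (hostnames are placeholders for your two index nodes), you create two identically-defined primary indexes under different names, pinning each to a node with the WITH clause:

```sql
-- Pin one equivalent primary index to each index node
CREATE PRIMARY INDEX idx_primary_a ON throttlen1ql
  USING GSI WITH {"nodes": ["node-a.example.com:8091"]};

CREATE PRIMARY INDEX idx_primary_b ON throttlen1ql
  USING GSI WITH {"nodes": ["node-b.example.com:8091"]};
```

With both in place, the query service can still be served by the surviving index if one node is failed over.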


#3

Aha! OK, this explains why the index was disappearing when one of the nodes was pulled out. If I follow the instructions in the page you referenced, essentially creating the index twice, with two names, pointing the nodes property at a different node each time, then it works — in the sense that pulling out just one node doesn’t kill all querying. However, the index defined to belong to that node still does persistently disappear from the GUI, and does not return after the node is brought back in.

So I guess this complicates how we would use this in production…?? I’m trying to imagine scaling this out multi-dimensionally, as you allude to on the site. We could have lots of storage-only nodes, which we could take out and put in gracefully at any time with very little expertise or preparation. Graceful failover, vBuckets move to other storage nodes, we do our patching or other server maintenance, then gracefully rebalance it in, and this would all be online with no loss. But then any nodes on which we run the indexing we would need to be more careful: Any indexes we define would have to be on at least two nodes for redundancy, and if we failed out one of those specific nodes, when we brought it back online we’d have to remember to recreate all the indexes that used to be on it, or else we’d inadvertently lose our redundancy — and we wouldn’t realize this until we failed out a different node later, and queries suddenly stopped working.
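One way to catch the "silently lost redundancy" problem described above, rather than discovering it at the next failover, would be to check the `system:indexes` catalog after every failover/recovery (field names per the N1QL system catalog; `using` needs backticks because it is a keyword):

```sql
-- Confirm both primary indexes still exist and are online
SELECT name, state, `using`
FROM system:indexes
WHERE keyspace_id = "throttlen1ql";
```

If one of the two pinned indexes is missing from the result, it needs to be recreated before the next maintenance window.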

Am I describing this correctly? Is this recreation strategy necessary? Or do I need to separate roles in some other way so that it’s easier to manage? Also, your first sentence seems to imply that you envision automatic index replicas in some future version, so perhaps I just need to be patient to get easier administration later.

Thanks.


#4

Hi @JeffSaxe, sorry for the delayed response on this.
You do have this right. In the existing 4.0 and 4.1 releases, we require that you create identical indexes (under different names) to get both HA and load balancing. In the case of cluster operations that add/remove nodes, we require that you recreate indexes after a node has been removed and added back in.
In the future this will be simplified, but for now there is some additional overhead to managing GSI indexes. You can use View indexes (CREATE INDEX … USING VIEW) with N1QL if the management of GSI indexes is an issue for your production deployment. Views live in the data service and have automated replica, failover and rebalance management.
thanks
-cihan