Create Index Timeout / Hanging

40-rc
n1ql

#1

Hi

Yesterday we tried to create a new index on one of our buckets so we can use N1QL.

Cluster and server info:
1 cluster
3 servers (Physical)
2 buckets (1 high prio and 1 low prio)
OS: Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u3 (2015-08-04) x86_64 GNU/Linux
couchbase v4.0.0-4051 Community Edition (build-4051)
451GB Memory (177GB used)

The following services are enabled on all 3 servers: Data, Index, and Query.

We used the following statement:
CREATE PRIMARY INDEX bucket_high-index ON bucket_high USING GSI;
in /opt/couchbase/bin/cbq
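One detail worth double-checking with a statement like this: in N1QL, identifiers that contain a hyphen have to be escaped with backticks, otherwise the parser can read the hyphen as a minus sign. A sketch of the same statement with escaping (index and bucket names taken from above):

```sql
-- Escape the hyphenated index name with backticks so the parser
-- does not treat "-" as an operator.
CREATE PRIMARY INDEX `bucket_high-index` ON bucket_high USING GSI;
```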

The issue we now have is that the index has been stuck in status "created" with initial build progress at 0% since yesterday. But we don't see this index at all if we run select * from system:indexes.

Additionally, we tried running the create statement multiple times with the option defer_build: true, and we now always get timeouts after 2 minutes.
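For reference, the deferred-build flow we are attempting looks roughly like this (a sketch using standard N1QL syntax; the index name is the one from above):

```sql
-- Create the primary index without starting the initial build.
CREATE PRIMARY INDEX `bucket_high-index` ON bucket_high
  USING GSI WITH {"defer_build": true};

-- Later, kick off the build explicitly.
BUILD INDEX ON bucket_high(`bucket_high-index`) USING GSI;
```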

One log entry:
/opt/couchbase/var/lib/couchbase/logs/query.log:12154:2015-11-10T10:14:09.232Z+01:00 [Info] CreateIndex 9879947322840320459 bucket_high/bucket_high-index using:forestdb exprType:N1QL partnExpr: whereExpr: secExprs:[] isPrimary:true with:[123 34 100 101 102 101 114 95 98 117 105 108 100 34 58 116 114 117 101 44 34 110 111 100 101 120 34 58 34 97 100 119 99 98 48 51 46 97 100 119 101 98 115 116 101 114 46 99 111 109 34 125] - elapsed(2m0.675478241s) err(Request Timeout)

(For reference, the with: byte array decodes to the JSON payload {"defer_build":true,"nodex":"adwcb03.adwebster.com"}.)

Another log entry:
/opt/couchbase/var/lib/couchbase/logs/query.log:12308:2015-11-10T10:27:40.232Z+01:00 [Info] CreateIndex 3641916042920834401 bucket_high/bucket_high-index using:forestdb exprType:N1QL partnExpr: whereExpr: secExprs:[] isPrimary:true with:[] - elapsed(2m0.825863341s) err(Request Timeout)

If you need more information just ask.


#2

We seem to have the same problem - any progress since then?

We solved the immediate issue (the index not responding etc…) by creating another primary index on a different node: http://developer.couchbase.com/documentation/server/4.0/indexes/gsi-for-n1ql.html
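For anyone else hitting this, the statement we used to place a new primary index on a specific node was along these lines (a sketch; the index name and hostname are placeholders, and the "nodes" option is the GSI placement option described at the link above):

```sql
-- Place the new primary index on a specific index node.
CREATE PRIMARY INDEX def_primary ON bucket_high
  USING GSI WITH {"nodes": ["node2.example.com:8091"]};
```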

But the issue I see here is that nothing told us anything was wrong with the node in question (its documents are all still obtainable; it's just the index that seemed to no longer update or respond properly). How can we detect this, and what is the best practice to solve it? (Fail over the node and re-add it? I'm about to try that.)
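The closest thing to a health check I have found is polling the state column in system:indexes (a sketch; a healthy, fully built index should report "online", so anything sitting in another state for a long time is suspect):

```sql
-- Check the lifecycle state of every index.
SELECT keyspace_id, name, state FROM system:indexes;
```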


#3

Yes, I found a workaround that solves this.

I had to reboot the node that was trying to create this index. So I did a graceful failover of that node and removed it completely from the cluster (rebalance). Then I rebooted it and added the server back to the cluster.
After this procedure the stuck index was gone. I then tried to create it again and everything worked.

But now we see a high IOPS count whenever we query this index…