Hello,
We are currently running Couchbase version 4.6.1. For the past 2 weeks, an intermittent problem has been happening 2-3 times per day where N1QL stops responding.
When the issue appears, N1QL queries never return to the client. Here is an example with curl while the problem is active:
$ curl -v http://10.50.51.145:8093/query/service -d "statement=SELECT * FROM My_Bucket WHERE field IS NOT NULL LIMIT 2"
*   Trying 10.50.51.145...
* TCP_NODELAY set
* Connected to DNSNAME (10.50.51.145) port 8093 (#0)
> POST /query/service HTTP/1.1
> Host: DNSNAME:8093
> User-Agent: curl/7.51.0
> Accept: */*
> Content-Length: 63
> Content-Type: application/x-www-form-urlencoded
* upload completely sent off: 63 out of 63 bytes  <---------------- never returns
I have to Ctrl+C the curl command and try again. It might fail 3-10 times in a row; occasionally it returns results, but it fails most of the time.
The above SELECT statement is just an example; any N1QL query fails, with the exception of a SELECT served by a covering index. So if I run:
SELECT doct FROM My_Bucket WHERE field IS NOT NULL LIMIT 2
instead of
SELECT * FROM My_Bucket WHERE field IS NOT NULL LIMIT 2
the query works and returns to the client.
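For context on why the second form hangs and the first doesn't: a query is "covered" when every field it references is stored in the index itself, so the query service never has to fetch the full documents from the data nodes. A hypothetical index that would cover the working query might look like this (the actual index name and definition on our cluster may differ):

```sql
-- Hypothetical example only; our real index definition may differ.
CREATE INDEX idx_field_doct ON My_Bucket(field, doct);

-- Covered: both `field` and `doct` come from the index, so the query
-- is answered from the index service alone.
SELECT doct FROM My_Bucket WHERE field IS NOT NULL LIMIT 2;

-- Not covered: SELECT * needs the full documents, so the query service
-- must fetch them from the data nodes -- which is where it hangs for us.
SELECT * FROM My_Bucket WHERE field IS NOT NULL LIMIT 2;
```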
Once the problem starts, it never resolves itself, and I cannot find any hint of it in the Couchbase logs.
One observation: when the issue appears, we see files piling up in /tmp:
/tmp # ll
total 22M
drwxrwxrwt 4 root root 4.0K May 16 15:02 ./
drwxr-xr-x 61 root root 4.0K May 16 13:53 ../
-rw------- 1 couchbase couchbase 1.4M May 16 09:18 scan-backfill504053237332
-rw------- 1 couchbase couchbase 1.4M May 16 08:27 scan-backfill504067426679
-rw------- 1 couchbase couchbase 1.4M May 16 09:00 scan-backfill504084483739
-rw------- 1 couchbase couchbase 1.4M May 16 08:29 scan-backfill504120555308
-rw------- 1 couchbase couchbase 1.4M May 16 09:11 scan-backfill504151485829
-rw------- 1 couchbase couchbase 1.4M May 16 09:34 scan-backfill504211081416
-rw------- 1 couchbase couchbase 1.4M May 16 09:14 scan-backfill504254016850
-rw------- 1 couchbase couchbase 1.4M May 16 09:20 scan-backfill504277482150
-rw------- 1 couchbase couchbase 1.4M May 16 09:12 scan-backfill504397440800
-rw------- 1 couchbase couchbase 1.4M May 16 09:11 scan-backfill504490357566
-rw------- 1 couchbase couchbase 1.4M May 16 09:19 scan-backfill504492734371
-rw------- 1 couchbase couchbase 1.4M May 16 09:22 scan-backfill504501588429
-rw------- 1 couchbase couchbase 1.4M May 16 09:15 scan-backfill504564230281
-rw------- 1 couchbase couchbase 1.4M May 16 08:28 scan-backfill504824889450
-rw------- 1 couchbase couchbase 1.4M May 16 08:28 scan-backfill504829276353
-rw------- 1 couchbase couchbase 1.4M May 16 09:12 scan-backfill504852423167
Those files never go away unless I apply the workaround I describe below.
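To watch whether the problem is building up again, I now keep an eye on the backfill-file count (assuming the indexer always writes these files to /tmp with the prefix shown above):

```shell
# Count leftover indexer backfill files in /tmp
# (prefix and location assumed from the listing above)
ls /tmp/scan-backfill* 2>/dev/null | wc -l
```

A steadily growing count is a good early warning that N1QL is about to stop responding.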
If I run "SELECT * FROM system:active_requests;", it returns only one row (presumably the request itself); I don't see any running or hung queries.
Another observation: normal get/set operations work properly when the issue appears; only N1QL queries with non-covering indexes are affected.
We have started a second cluster with Couchbase version 4.6.2, and it has the same issue.
The cluster was upgraded from 3 nodes with 2 CPUs and 15 GB RAM each (it is a DEV cluster) to 3 nodes with 4 CPUs and 30 GB RAM each. Performance is never an issue; we don't do much on it except run a few dev apps. The issue still appears on the upgraded cluster, but it takes longer to start: we can work on it for a couple of hours before it eventually stops working.
The only quick fix so far is to edit the bucket and change its metadata ejection setting: if it is currently set to value ejection, I switch it to full ejection, and vice versa.
When the ejection setting is changed, all the tmp files disappear and the problem vanishes. The cluster is healthy again.
But a few hours later, the issue reappears.
We have 4-5 buckets, and the workaround only works when I change the metadata ejection setting on that one particular bucket.
Yet all N1QL queries are affected, and changing the setting on that one bucket makes N1QL queries start working again on all the buckets.
The issue happens on 4.5.1, 4.6.1, and 4.6.2.
The only thing all the clusters we tried have in common is the bucket and its data, which I restored using cbbackup/cbrestore.
Need advice,
Thanks,
Steeve