N1QL Not responding

Hello,

We are currently running Couchbase version 4.6.1. For the past 2 weeks, an intermittent problem has been happening 2-3 times per day where N1QL stops responding.

When the issue appears, N1QL queries never return to the client. Here is an example, using a curl command, while the problem is active:

$ curl -v http://10.50.51.145:8093/query/service -d "statement=SELECT * FROM My_Bucket where field is not null limit 2"

* Trying 10.50.51.145...
* TCP_NODELAY set
* Connected to DNSNAME (10.50.51.145) port 8093 (#0)
> POST /query/service HTTP/1.1
> Host: DNSNAME:8093
> User-Agent: curl/7.51.0
> Accept: */*
> Content-Length: 63
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 63 out of 63 bytes <---------------- Will never return

I have to Ctrl+C the above curl command and try again. It might fail 3-10 times in a row; it might return the results randomly, but it fails most of the time.

The above SELECT statement is just an example; any N1QL query will fail, with the exception of a SELECT that is served by a covering index. So if I do:
SELECT doct FROM My_Bucket where field is not null limit 2
instead of
select * FROM My_Bucket where field is not null limit 2

the query works and returns to the client.
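(For context, a covering index for that first query would be something like the statement below; the index name idx_cover is hypothetical, and it assumes doct is a document field. Because both field and doct are index keys, the query can be answered by the index service alone, without fetching documents from the data service.)

create index idx_cover on My_Bucket(field, doct)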

When the problem starts to happen, it never resolves itself, and I cannot find any hint of the problem in the Couchbase logs.

One observation is that when the issue appears, we see files piling up in /tmp; see here:
/tmp # ll
total 22M
drwxrwxrwt 4 root root 4.0K May 16 15:02 ./
drwxr-xr-x 61 root root 4.0K May 16 13:53 ../
-rw------- 1 couchbase couchbase 1.4M May 16 09:18 scan-backfill504053237332
-rw------- 1 couchbase couchbase 1.4M May 16 08:27 scan-backfill504067426679
-rw------- 1 couchbase couchbase 1.4M May 16 09:00 scan-backfill504084483739
-rw------- 1 couchbase couchbase 1.4M May 16 08:29 scan-backfill504120555308
-rw------- 1 couchbase couchbase 1.4M May 16 09:11 scan-backfill504151485829
-rw------- 1 couchbase couchbase 1.4M May 16 09:34 scan-backfill504211081416
-rw------- 1 couchbase couchbase 1.4M May 16 09:14 scan-backfill504254016850
-rw------- 1 couchbase couchbase 1.4M May 16 09:20 scan-backfill504277482150
-rw------- 1 couchbase couchbase 1.4M May 16 09:12 scan-backfill504397440800
-rw------- 1 couchbase couchbase 1.4M May 16 09:11 scan-backfill504490357566
-rw------- 1 couchbase couchbase 1.4M May 16 09:19 scan-backfill504492734371
-rw------- 1 couchbase couchbase 1.4M May 16 09:22 scan-backfill504501588429
-rw------- 1 couchbase couchbase 1.4M May 16 09:15 scan-backfill504564230281
-rw------- 1 couchbase couchbase 1.4M May 16 08:28 scan-backfill504824889450
-rw------- 1 couchbase couchbase 1.4M May 16 08:28 scan-backfill504829276353
-rw------- 1 couchbase couchbase 1.4M May 16 09:12 scan-backfill504852423167

Those files never go away unless I apply the workaround I explain below.
If I run "SELECT * FROM system:active_requests;" it returns only one row; I don't see any running or hung queries.

Another observation is that normal get/set operations keep working properly when the issue appears; only N1QL queries with non-covering indexes are affected.

We have started a second cluster with Couchbase version 4.6.2, and we see the same issue there.

The cluster was upgraded from a 3-node cluster with 2 CPUs and 15 GB RAM per node (it is a DEV cluster) to a 3-node cluster with 4 CPUs and 30 GB RAM per node. Performance is never an issue; we don't do much on it except run a few dev apps. The issue still appears on the upgraded cluster, but it takes longer to show up, meaning we can work on it for a couple of hours before it eventually stops working.

The only quick fix we have found so far is to edit the bucket and change its cache metadata (ejection) setting:
if it is currently set to value ejection, I switch it to full ejection, and vice versa.
When the cache metadata setting is changed, all the tmp files disappear and the problem vanishes; the cluster is healthy again.
But a few hours later the issue reappears.
We have 4-5 buckets, and the above workaround only works if I change the cache metadata on that one particular bucket.

Yet all N1QL queries are affected, and if I change the metadata setting on that one particular bucket, N1QL queries start working again on all the buckets.
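(For reference, the same toggle can also be done from the command line with couchbase-cli bucket-edit; the flags below are from memory for 4.x, and the credentials and bucket name are placeholders, so please double-check against your version:)

couchbase-cli bucket-edit -c 10.50.51.145:8091 -u Administrator -p password --bucket My_Bucket --bucket-eviction-policy fullEviction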

The issue happens on 4.5.1, 4.6.1, and 4.6.2.
The only thing in common between all the clusters we tried is the bucket and its data, which I restored using cbbackup/cbrestore.

Need advice,

Thanks,

Steeve

I would say that matches MB-22677.
Try removing the LIMIT clause; does that make a difference?

Hello Marco,

thanks for the reply. All N1QL statements are affected, not just the ones with a LIMIT clause.

-Steeve

Strike MB-22677 then.
You have hanging goroutines.

On the host of a query node that hangs, could you collect the output of

http://localhost:6060/debug/pprof/goroutine?debug=2

This is going to be quite long, so best to get it to me via support.
I understand you have a CBSE - if you attach it to that, I’ll have a look.
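(For reference, one way to capture that dump to a file on each query node, assuming the pprof endpoint is reachable locally on port 6060 as above, is something like:)

curl -s "http://localhost:6060/debug/pprof/goroutine?debug=2" > goroutine-$(hostname).txt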

Hello Marco,

thanks for your feedback. Much Appreciated. I will do that and update.

Regards,

Steeve

Hello, attached are the logs generated from the goroutine debug endpoint:

goroutinecouchbase-master-1.txt.zip (138.9 KB)
goroutinecouchbase-master-3.txt.zip (138.4 KB)
goroutinecouchbase-master-2.txt.zip (155.7 KB)

Seems like your data service is not responding - can you get data straight out of the datastore?

Normal get/set operations keep working properly when the issue appears; only N1QL queries with non-covering indexes are affected.
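(For a quick sanity check straight against the data service, a tool such as libcouchbase's cbc can fetch a single document by key; the key doc::1 below is just a hypothetical example:)

cbc cat doc::1 -U couchbase://10.50.51.145/My_Bucket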

I found a way to reproduce the problem with randomly generated data.

Create a new bucket, let's say onebigjson.

-Run the following on your newly created bucket:
cbworkloadgen -n 10.50.51.11:8091 -j -b onebigjson -i 1000 -s 2000000
-Create the following index:
create index idx_onebig_name on onebigjson(name)

-Then execute the following N1QL query over and over and over (a simple loop for this is sketched at the end of this post):
select * from onebigjson where name is not null;
At some point, the cluster will stop responding to your queries and temporary files will start to pile up in /tmp. Even if you stop executing the N1QL queries, the cluster never returns to normal.
All N1QL queries on all buckets in that cluster are affected when the problem appears. The only way to successfully query a bucket via N1QL again is to "restart" the affected bucket by changing its cache metadata setting, or to restart all nodes one at a time.
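(A minimal sketch of the "over and over" step, assuming the query service is reachable on 10.50.51.11:8093 to match the cbworkloadgen command above:)

while true; do curl -s http://10.50.51.11:8093/query/service -d "statement=select * from onebigjson where name is not null" > /dev/null; done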