Couchbase Server is partially not responding

sntentos · March 30, 2017, 6:25pm

I am not sure what happened to the server, but it has partially stopped responding.
After months of operation, some of its functions have become unresponsive.

Couchbase Server 4.5 runs inside a Docker container. I have tried restarting everything, but nothing helped.

Views are working fine (I wrote a new one in-between actually, everything is returning beautifully).
The N1QL service is not responding to me, but a colleague can somehow run queries, both failing and successful ones. The weird thing is that he is using my scripts to run them.
Sync_gateway stopped accepting new documents (according to my colleague), and I have some evidence for that: after some time, the bucket activity was showing 0, and requests to port 8091 stayed in the range [0, 1].

What is even weirder about the N1QL queries is that server activity is a constant [400, 600] queries per second. I believe it showed [4k, 6k]. Also, queries run for 4h30m and then fail completely, without any results, with or without indexes.

I get these kinds of responses (error codes were shown in red):

SELECT did, ARRAY_AGG({batteryCharge, batteryStatus, bcn, stepCounterDelta, t, type}) AS entries
FROM agora_bucket USE INDEX(`#primary`)
WHERE type="bcn_scan"
GROUP BY did ORDER BY did ASC LIMIT 3;

 Connected to : http://localhost:8091/. Type Ctrl-D or \QUIT to exit.

 Path to history file for the shell : /root/.cbq_history
 ERROR 174 : N1QL: Query nodes not responding

SELECT did, ARRAY_AGG({batteryCharge, batteryStatus, bcn, stepCounterDelta, t, type}) AS entries
FROM agora_bucket
WHERE type="bcn_scan"
GROUP BY did ORDER BY did ASC LIMIT 3;

 ERROR 100 : Get http://localhost:8093/admin/clusters/default/nodes: dial tcp 127.0.0.1:8093: connection refused


 Path to history file for the shell : /root/.cbq_history
 ERROR 107 : Not connected to any cluster. Use \CONNECT command.

SELECT did, ARRAY_AGG({batteryCharge, batteryStatus, bcn, stepCounterDelta, t, type}) AS entries
FROM agora_bucket
WHERE type="bcn_scan"
GROUP BY did ORDER BY did ASC;

 ERROR 100 : Get http://localhost:8093/admin/clusters/default/nodes: dial tcp 127.0.0.1:8093: connection refused


 Path to history file for the shell : /root/.cbq_history
 ERROR 107 : Not connected to any cluster. Use \CONNECT command.

SELECT did, ARRAY_AGG({batteryCharge, batteryStatus, bcn, stepCounterDelta, t, type}) AS entries
FROM agora_bucket USE INDEX(`#primary`)
WHERE type="bcn_scan"
GROUP BY did ORDER BY did ASC;

 ERROR 100 : Get http://localhost:8093/admin/clusters/default/nodes: dial tcp 127.0.0.1:8093: connection refused


 Path to history file for the shell : /root/.cbq_history
 ERROR 107 : Not connected to any cluster. Use \CONNECT command.

The server has limited resources, but a) it’s gonna be hard to intervene, b) the need for that server is almost over. So I just need it to work for a month or two (or just dump the whole data in JSON) and the server will go out of commission.

Prompt responses are appreciated, and thank you for wasting your time with me <3

Edit: I have also been getting messages like:

Date: Mon, 27 Mar 2017 22:55:47 +0000
From: couchbase@silent.cs.abo.fi
To: ntentos@ntentos.abo.local
Subject: Couchbase Server alert: IP address changed

IP address seems to have changed. Unable to listen on 'ns_1@127.0.0.1'. (Underlaying POSIX error code: 'nxdomain')

eben · March 31, 2017, 9:39pm

It sounds like there is a problem with the query service, which runs as a process named “cbq-engine” listening on port 8093 (by default). Can you see if that process is running? Also, you could look in the log files for “query.log” which would contain stack traces for any failures of the query service.
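
A minimal sketch of that check, assuming Python 3 is available inside the container. It only tests whether something accepts TCP connections on the ports mentioned in this thread (8091 for the admin REST interface, 8093 for cbq-engine, 11210 for memcached); it does not prove the process behind the port is healthy. The names `PORTS`, `port_is_open`, and `probe` are made up for this illustration.

```python
import socket

# Default ports as mentioned in this thread: 8091 (admin/REST), 8093 (query
# service, cbq-engine), 11210 (memcached data port). Adjust if yours differ.
PORTS = {"admin": 8091, "query": 8093, "memcached": 11210}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


def probe(host: str = "127.0.0.1", ports: dict = PORTS) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, is_open in probe().items():
        print(f"{name:10s} {'open' if is_open else 'refused/unreachable'}")
```

If the query port shows as closed while the admin port is open, that would be consistent with the "connection refused" errors on 8093 quoted above.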

sntentos · March 31, 2017, 10:18pm

/opt/couchbase/bin/cbq-engine --datastore=http://127.0.0.1:8091 --http=:8093 --configstore=http://127.0.0.1:8091 --enterprise=true --https=:18093 --certfile=/opt/couchbase/var/lib/couchbase/config/ssl-cert-key.pem --keyfile=/opt/couchbase/var/lib/couchbase/config/ssl-cert-key.pem --ssl_minimum_protocol=tlsv1

query.zip (419.3 KB)

The only “out of the ordinary” lines I see are:

_time=2017-03-28T10:32:17.169+00:00 _level=INFO _msg=Pool Get returned dial tcp 127.0.0.1:11210: connection refused 
_time=2017-03-28T10:32:27.781+00:00 _level=INFO _msg=Pool Get returned MCResponse status=KEY_ENOENT, opcode=0x89, opaque=0, msg: Not found 
_time=2017-03-28T10:33:22.350+00:00 _level=INFO _msg=Pool Get returned dial tcp 127.0.0.1:11210: cannot assign requested address

However, everything is logged as “Info”.

eben · April 10, 2017, 6:17pm

Hi, sorry for the delayed reply. It may be too late, but there are some clues in query.log. It appears that cbq-engine is having trouble communicating with memcached:

2017-03-28T10:31:47.416+00:00 [Info] switched currmeta from 9 → 9
_time=2017-03-28T10:32:06.982+00:00 _level=ERROR _msg=Connection Error: EOF. Refreshing bucket
_time=2017-03-28T10:32:07.002+00:00 _level=ERROR _msg=Connection Error: read tcp 127.0.0.1:11210: connection reset by peer. Refreshing bucket
_time=2017-03-28T10:32:07.002+00:00 _level=ERROR _msg=Connection Error: EOF. Refreshing bucket
_time=2017-03-28T10:32:07.002+00:00 _level=ERROR _msg=Connection Error: read tcp 127.0.0.1:11210: connection reset by peer. Refreshing bucket
_time=2017-03-28T10:32:07.002+00:00 _level=ERROR _msg=Connection Error: read tcp 127.0.0.1:11210: connection reset by peer. Refreshing bucket
_time=2017-03-28T10:32:07.004+00:00 _level=ERROR _msg=Connection Error: EOF. Refreshing bucket
_time=2017-03-28T10:32:07.006+00:00 _level=ERROR _msg=Connection Error: read tcp 127.0.0.1:11210: connection reset by peer. Refreshing bucket
_time=2017-03-28T10:32:07.067+00:00 _level=ERROR _msg=Connection Error: read tcp 127.0.0.1:11210: connection reset by peer. Refreshing bucket
_time=2017-03-28T10:32:07.067+00:00 _level=ERROR _msg=Connection Error: read tcp 127.0.0.1:11210: connection reset by peer. Refreshing bucket
_time=2017-03-28T10:32:07.067+00:00 _level=ERROR _msg=Connection Error: read tcp 127.0.0.1:11210: connection reset by peer. Refreshing bucket
_time=2017-03-28T10:32:07.698+00:00 _level=INFO _msg=Pool Get returned dial tcp 127.0.0.1:11210: connection refused

Can you send the rest of the logs? @drigby, any ideas?
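
For anyone wanting to scan a large query.log for entries like these, here is a rough sketch. It assumes the `_time=... _level=... _msg=...` layout seen in the excerpt above and simply skips lines in any other layout (such as the `[Info]` line). The function names are hypothetical, not part of any Couchbase tool.

```python
import re
from collections import Counter

# Matches lines shaped like the query.log excerpt in this thread:
#   _time=<timestamp> _level=<LEVEL> _msg=<free text>
LINE_RE = re.compile(
    r"^_time=(?P<time>\S+)\s+_level=(?P<level>\w+)\s+_msg=(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict with time, level and msg, or None if the line differs."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Count entries per level, count distinct ERROR messages, and
    report the timestamp of the first ERROR entry (None if there is none)."""
    levels = Counter()
    error_msgs = Counter()
    first_error = None
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # different log layout; ignored in this sketch
        levels[rec["level"]] += 1
        if rec["level"] == "ERROR":
            error_msgs[rec["msg"]] += 1
            if first_error is None:
                first_error = rec["time"]
    return levels, error_msgs, first_error
```

Calling `summarize(open("query.log", errors="replace"))` returns the three values; the first ERROR timestamp gives a starting point for lining up the other log files.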

drigby · April 10, 2017, 6:33pm

Maybe check your file descriptor / socket limits? Certainly it looks like localhost connections are failing for some reason.
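
One way to look at those limits, sketched in Python 3 under the assumption of a Linux container with /proc mounted. Note that `ulimit` in a shell reports the limits of that shell, which may differ from the limits of the already running memcached or cbq-engine process, so the sketch also reads /proc/<pid>/limits. All function names here are illustrative.

```python
import os
import resource
from pathlib import Path


def nofile_limits():
    """Soft and hard open-file limits of the current process (like ulimit -n)."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count open descriptors via /proc/<pid>/fd; None if unreadable."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


def process_nofile_limit(pid):
    """Return (soft, hard) 'Max open files' strings from /proc/<pid>/limits."""
    try:
        text = Path(f"/proc/{pid}/limits").read_text()
    except OSError:
        return None
    for line in text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            return fields[3], fields[4]
    return None


def read_int(path):
    """Read a single integer from a /proc/sys style file; None on failure."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def find_pids(name):
    """PIDs whose /proc/<pid>/comm equals name (comm is at most 15 chars)."""
    pids = []
    try:
        entries = os.listdir("/proc")
    except OSError:
        return pids
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            comm = Path(f"/proc/{entry}/comm").read_text().strip()
        except OSError:
            continue  # process exited or is not readable
        if comm == name:
            pids.append(int(entry))
    return pids


if __name__ == "__main__":
    print("this process, RLIMIT_NOFILE (soft, hard):", nofile_limits())
    print("somaxconn:", read_int("/proc/sys/net/core/somaxconn"))
    for name in ("memcached", "cbq-engine"):
        for pid in find_pids(name):
            print(name, pid, "open fds:", open_fd_count(pid),
                  "limit (soft, hard):", process_nofile_limit(pid))
```

An open descriptor count close to the soft limit for either process would point toward descriptor exhaustion; a count far below it would suggest looking elsewhere.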

sntentos · April 10, 2017, 7:43pm

Fortunately, we could extract the bare minimum needed to safely decommission the server (without rebuilding it). However, getting the server operational would be blissful.

By “send the rest of the logs”, do you mean zip and send every other available log file in the folder?

root@1fa8c97bf8ca:~# cat /proc/sys/fs/file-max; cat /proc/sys/net/core/somaxconn
817472
128

Also, (although it appears useless to me)

root@1fa8c97bf8ca:~# ulimit -a
core file size          (blocks, -c) 97656
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31935
max locked memory       (kbytes, -l) 97656
max memory size         (kbytes, -m) unlimited
open files                      (-n) 40960
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1048576
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Although I must say that 128 looks terribly weird. I never needed to check those limits, so a command would be useful.

Couchbase was installed using docker and this (Docker) webpage

eben · April 10, 2017, 8:12pm

For the logs, yes, either zip up all the logs in that folder, or use the “collect info” function from the “Logs” tab in the UI, which will upload the logs to S3 (and send us the resulting link).

sntentos · April 10, 2017, 8:27pm

Unfortunately, especially considering the age of this topic, 700 MB of logs is not easy to store.

https://drive.google.com/open?id=0B5P4tz3jnyD2THFJcHpjbEFUYjA

eben · April 10, 2017, 9:52pm

The size of the logs is why we offer the “collect info” feature with uploads to Amazon S3. Feel free to use that next time.

@drigby, something seemed to happen around 2017-03-28T10:32, which is between memcached.log.000043.txt and memcached.log.000044.txt. In the memcached.log it mentions restarting logging, and there are various errors reported in error.log.