Help! Server going berserk


#1

Hi.

I have a 4-node system with three buckets and no views for now. (I did run a trial with a view and believe I deleted it, but trying to get the view list from the web GUI now just times out on this bucket.)
I started pumping data into it a few weeks ago and now have a few million documents. All was well, but insertion was too slow to keep up with our data.
In an attempt to improve things, I edited the main bucket I use (‘user_store’), changed the “I/O Priority” to “High”, clicked Save, and got the warning that this can result in some downtime.

It has now been almost 2 days since!!

All nodes are showing yellow with “Pend”. CPU jumps up and down. Expanding them sometimes shows messages such as “Starting ep-engine” or “Initializing” next to the buckets.

All buckets are showing yellow too, and sometimes show those same messages over and over (“Starting ep-engine” or “Initializing”).

Also, calling this (which works fine on my staging system):

http://localhost:8091/pools/default/buckets/user_store/ddocs

returns:

["Unexpected server error, request logged."]
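
For anyone reproducing this check from a script, here is a minimal sketch that calls the same endpoint and reports the HTTP status instead of raising on a server error. The function names are made up for illustration, and it assumes no authentication is needed on the admin port, which may not hold for a secured cluster.

```python
import json
import urllib.error
import urllib.request


def ddocs_url(host, bucket, port=8091):
    """Build the design-document listing URL quoted in the post."""
    return f"http://{host}:{port}/pools/default/buckets/{bucket}/ddocs"


def fetch_ddocs(host, bucket, timeout=10):
    """Return (status, body). HTTP error statuses are returned, not raised.

    Timeouts and refused connections still raise, so a hang like the one
    described above surfaces as an exception after `timeout` seconds.
    Assumption: no credentials are required.
    """
    try:
        with urllib.request.urlopen(ddocs_url(host, bucket), timeout=timeout) as resp:
            return resp.status, json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # The error body is kept as text; it may not be a JSON object.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Calling `fetch_ddocs("localhost", "user_store")` on a healthy node should give a success status and the parsed list. On a node in the state described here, it would presumably return an error status along with the message shown above.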

Any help will be appreciated. If I can’t resolve this ASAP, I’ll have to delete the buckets and start fresh!

The log is full of messages such as:


    [couchdb:info,2014-11-27T11:47:13.623,ns_1@couch01.colo.com:<0.17407.264>:couch_log:info:41]Started main (prod) set view group `user_store`, group `_design/dev_main`, signature `824a6eff44708e8dce37ca0071a589a1', view count 1
    active partitions:      [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
    passive partitions:     []
    cleanup partitions:     []
    unindexable partitions: []
    no replica support
    
    [couchdb:info,2014-11-27T11:47:13.624,ns_1@couch01.colo.com:<0.17407.264>:couch_log:info:41]Flow control buffer size is 20971520 bytes
    [couchdb:error,2014-11-27T11:47:13.625,ns_1@couch01.colo.com:<0.17407.264>:couch_log:error:44]couch_set_view_group error opening set view group `_design/dev_main` (prod), signature `824a6eff44708e8dce37ca0071a589a1', from set `user_store`: {error,
                                                                                                                                                       {error,
                                                                                                                                                        {dcp_socket_connect_failed,
                                                                                                                                                     econnrefused}}}

#2

Hi uris2000,

I assume this is Couchbase Server 3.0.1?

It sounds like the bucket had problems restarting and warming up after the “I/O Priority” was changed to “High”, and the memcached process keeps failing. Can you open a defect and upload the logs, please?

Please drop a link to the defect here.

Thanks,
Patrick


#3

Yes, 3.0.1 - sorry, forgot to mention.

I have logs collected last night that I can upload.
Two things:

  • When I upload the logs, do I need to upload all 4 nodes? They all show the exact same pattern in the log files I saw.
  • All 3 buckets are behaving the same, though I only modified one bucket’s priority. Even a new bucket I created for testing yesterday became yellow and stayed that way (creating and deleting buckets still works).

#4

Ideally all 4, but one would be a good start.


#5

The ZIP file is ~70MB and the upload limit is 50MB on the Dashboard site.
I tried collecting a single node, expanded the zip, and rezipped with higher compression - still at ~55MB…
Is there any file in this list that is not needed?

    -rw-r--r-- 1 root root  11038459 Nov 27 14:24 couchbase.log
    -rw-r--r-- 1 root root      2049 Nov 27 14:24 ddocs.log
    -rw-r--r-- 1 root root  35650874 Nov 27 14:24 diag.log
    -rw-r--r-- 1 root root     16150 Nov 27 14:24 ini.log
    -rw-r--r-- 1 root root       305 Nov 27 14:24 memcached.log
    -rw-r--r-- 1 root root 182965900 Nov 27 14:25 ns_server.babysitter.log
    -rw-r--r-- 1 root root 187285133 Nov 27 14:25 ns_server.couchdb.log
    -rw-r--r-- 1 root root 207348524 Nov 27 14:25 ns_server.debug.log
    -rw-r--r-- 1 root root 184903560 Nov 27 14:25 ns_server.error.log
    -rw-r--r-- 1 root root  15002860 Nov 27 14:25 ns_server.http_access.log
    -rw-r--r-- 1 root root 199982734 Nov 27 14:25 ns_server.info.log
    -rw-r--r-- 1 root root       231 Nov 27 14:25 ns_server.mapreduce_errors.log
    -rw-r--r-- 1 root root 181907071 Nov 27 14:25 ns_server.reports.log
    -rw-r--r-- 1 root root       318 Nov 27 14:25 ns_server.ssl_proxy.log
    -rw-r--r-- 1 root root 197128240 Nov 27 14:25 ns_server.stats.log
    -rw-r--r-- 1 root root 181879433 Nov 27 14:25 ns_server.views.log
    -rw-r--r-- 1 root root       221 Nov 27 14:25 ns_server.xdcr_errors.log
    -rw-r--r-- 1 root root      1493 Nov 27 14:25 ns_server.xdcr.log
    -rw-r--r-- 1 root root       219 Nov 27 14:25 ns_server.xdcr_trace.log
    -rw-r--r-- 1 root root      6956 Nov 27 14:25 stats.log

#6

OK, I just broke it down into several ZIP files.
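
For others who hit the same upload limit, here is a rough sketch of one way to do that split automatically. It is not necessarily how it was done here. It measures each file's deflated size by streaming it through `zlib`, then greedily groups files so that each archive stays under the limit. The function names, the output naming scheme, and the 1 KB per-file header allowance are assumptions for illustration.

```python
import os
import zipfile
import zlib


def compressed_size(path, level=9, chunk=1 << 20):
    """Measure a file's raw-deflate size by streaming it through zlib."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)
    total = 0
    with open(path, "rb") as fh:
        while True:
            block = fh.read(chunk)
            if not block:
                break
            total += len(comp.compress(block))
    return total + len(comp.flush())


def plan_archives(sizes, limit):
    """Greedily group (name, size) pairs so each group totals at most `limit`.

    A single item larger than `limit` still gets a group of its own; it
    would have to be split some other way before uploading.
    """
    groups, current, used = [], [], 0
    for name, size in sizes:
        if current and used + size > limit:
            groups.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        groups.append(current)
    return groups


def write_archives(paths, out_prefix, limit):
    """Write <out_prefix>-partN.zip files and return their names."""
    overhead = 1024  # assumed allowance per file for zip headers
    sizes = [(p, compressed_size(p) + overhead) for p in paths]
    written = []
    for i, group in enumerate(plan_archives(sizes, limit), start=1):
        target = f"{out_prefix}-part{i}.zip"
        with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED, compresslevel=9) as zf:
            for p in group:
                zf.write(p, arcname=os.path.basename(p))
        written.append(target)
    return written
```

With a 50MB limit, the call would look like `write_archives(log_paths, "node1-logs", 50 * 1024 * 1024)`. Every file is compressed twice (once to measure, once to write), which trades speed for not having to guess a compression ratio.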

The issue ID is MB-12796
http://www.couchbase.com/issues/browse/MB-12796

Thanks for the help.


#7

Issue resolved. Thanks, pvarley.

I wonder: if a bucket is heavier on writes than on reads, is there anything that can be done to speed things up?


#8

Let’s take a step back: what is the problem you are seeing? Is your disk queue too high?
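
If it helps, here is a minimal sketch for reading the disk queue from a script rather than the web console. It assumes the bucket stats REST endpoint returns JSON shaped like `{"op": {"samples": {...}}}` and that the stat is named `disk_write_queue`. Both should be checked against your server version. The function names and credentials are placeholders.

```python
import base64
import json
import urllib.request


def latest_sample(stats, name):
    """Return the most recent value of a named stat, or None if absent.

    Assumption: stats look like {"op": {"samples": {name: [v0, v1, ...]}}}.
    """
    samples = stats.get("op", {}).get("samples", {}).get(name) or []
    return samples[-1] if samples else None


def fetch_bucket_stats(host, bucket, user, password, timeout=10):
    """Fetch the bucket stats JSON from the admin port using basic auth."""
    url = f"http://{host}:8091/pools/default/buckets/{bucket}/stats"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Something like `latest_sample(fetch_bucket_stats("localhost", "user_store", "Administrator", "password"), "disk_write_queue")` would then give the latest queue length. If that number keeps growing while the ops rate falls, writes are arriving faster than they can be persisted.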


#9

I deployed code that queries data from a SQL-based DB and inserts (or updates) documents in Couchbase.
I deployed around Nov 11, made some improvements in the following days, and by around Nov 14 was running the code more or less as it is now.
It reached a peak of about 200 ops/sec but has been in constant decline since then.

(I have a graph to explain, but the system here won’t let me upload an image.)


#10

We should really open a new question, as this is a different problem.

You can use the “reply as Linked Topic” link on the right hand side of your last post to do it for you :smiley: