Couchbase node dying when uploading documents using Java SDK


#1

Hi,
The nodes in my Couchbase cluster seem to be constantly going into pending status, then eventually going down, and then coming back up again after some time.
This happens as I try to upload more and more documents to the nodes using the Java SDK. My bucket currently holds about 490 million documents with full eviction. It’s running on 3 nodes with about 46.8 GB of memory and about 2.1 TB of HDD space in total. Below is what I’ve got from error.log on one of the nodes.

  [stats:error,2015-06-30T8:03:28.331,ns_1@ec2-54-153-141-152.ap-southeast-2.compute.amazonaws.com:<0.925.0>:stats_collector:handle_info:124]Exception in stats collector: {exit,
                                   {noproc,
                                    {gen_server,call,
                                     ['ns_memcached-Sample',{stats,<<>>},180000]}},
                                   [{gen_server,call,3,
                                     [{file,"gen_server.erl"},{line,188}]},
                                    {ns_memcached,do_call,3,
                                     [{file,"src/ns_memcached.erl"},{line,1399}]},
                                    {stats_collector,grab_all_stats,1,
                                     [{file,"src/stats_collector.erl"},{line,84}]},
                                    {stats_collector,handle_info,2,
                                     [{file,"src/stats_collector.erl"},
                                      {line,116}]},
                                    {gen_server,handle_msg,5,
                                     [{file,"gen_server.erl"},{line,604}]},
                                    {proc_lib,init_p_do_apply,3,
                                     [{file,"proc_lib.erl"},{line,239}]}]}
    
    [stats:error,2015-06-30T8:03:28.332,ns_1@ec2-54-153-141-152.ap-southeast-2.compute.amazonaws.com:<0.925.0>:stats_collector:handle_info:124]Exception in stats collector: {exit,
                                   {noproc,
                                    {gen_server,call,
                                     ['ns_memcached-Sample',{stats,<<>>},180000]}},
                                   [{gen_server,call,3,
                                     [{file,"gen_server.erl"},{line,188}]},
                                    {ns_memcached,do_call,3,
                                     [{file,"src/ns_memcached.erl"},{line,1399}]},
                                    {stats_collector,grab_all_stats,1,
                                     [{file,"src/stats_collector.erl"},{line,84}]},
                                    {stats_collector,handle_info,2,
                                     [{file,"src/stats_collector.erl"},
                                      {line,116}]},
                                    {gen_server,handle_msg,5,
                                     [{file,"gen_server.erl"},{line,604}]},
                                    {proc_lib,init_p_do_apply,3,
                                     [{file,"proc_lib.erl"},{line,239}]}]}
    
    [ns_server:error,2015-06-30T8:03:37.090,ns_1@ec2-54-153-141-152.ap-southeast-2.compute.amazonaws.com:ns_doctor<0.329.0>:ns_doctor:update_status:229]The following buckets became not ready on node 'ns_1@ec2-54-153-141-152.ap-southeast-2.compute.amazonaws.com': ["Sample",
                                                                                                                    "office"], those of them are active ["Sample",
                                                                                                                                                         "office"]
    [ns_server:error,2015-06-30T8:04:24.298,ns_1@ec2-54-153-141-152.ap-southeast-2.compute.amazonaws.com:ns_log<0.277.0>:ns_log:handle_cast:210]unable to notify listeners because of badarg

I’m confused about what is causing this problem and would like to understand it so that I can prevent it from happening in production…
Does anyone know why? Is the server possibly not big enough?


#2

@reVrost this looks like a server issue of some sort. I’ll see if I can pull someone in from the server team to look at it. In the meantime, can you run `cbcollect_info` and upload the output?

Also, which server version are you running?
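One client-side thing worth trying while the server side is investigated: during heavy bulk loading, backing off when the cluster returns temporary failures keeps the loader from hammering a node that is already struggling. This is only a sketch under my own assumptions (the class name, retry limits, and delay values are illustrative; the Couchbase call itself is passed in as a `Callable` so the snippet compiles without the SDK, and with SDK 2.x you would typically catch `TemporaryFailureException` in place of the generic catch shown here):

```java
import java.util.concurrent.Callable;

// Sketch: retry a single upsert with exponential backoff instead of
// hammering a node that is already returning temporary failures.
public class BackoffUpsert {

    // Delay before the given attempt (0-based): 50 ms, 100 ms, 200 ms, ...
    // capped at 2 seconds so a long outage doesn't stall the loader forever.
    static long backoffMillis(int attempt) {
        long delay = 50L << Math.min(attempt, 10);
        return Math.min(delay, 2000L);
    }

    static <T> T withRetry(Callable<T> op, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();       // e.g. () -> bucket.upsert(doc)
            } catch (Exception e) {     // with SDK 2.x you would catch
                last = e;               // TemporaryFailureException here
                Thread.sleep(backoffMillis(attempt));
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Demo with a fake operation that fails twice, then succeeds.
        final int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("TMPFAIL");
            return "stored";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Capping both the per-attempt delay and the attempt count means the loader degrades gracefully under pressure instead of either spinning or stalling indefinitely.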


#3

I am running Couchbase Server version 3.0.3-1716 Enterprise Edition (build-1716).

I couldn’t upload the cbcollect_info output here due to the size restriction, so I’ve uploaded it to my Dropbox (it’s 94 MB).
Here is the link:


#4

You can use the “cluster wide diagnostics” tool to collect and upload logs - http://docs.couchbase.com/admin/admin/Misc/cluster-wide-info-intro.html


#5

Sorry for the late reply,
Here are the logs:

https://s3.amazonaws.com/cb-customers/reVrost/collectinfo-2015-07-01T030656-ns_1%40ec2-52-64-116-219.ap-southeast-2.compute.amazonaws.com.zip

https://s3.amazonaws.com/cb-customers/reVrost/collectinfo-2015-07-01T030656-ns_1%40ec2-52-64-98-58.ap-southeast-2.compute.amazonaws.com.zip

https://s3.amazonaws.com/cb-customers/reVrost/collectinfo-2015-07-01T030656-ns_1%40ec2-54-153-141-152.ap-southeast-2.compute.amazonaws.com.zip

#6

Bump. I restarted one of the servers that kept dying, and now it seems I am able to upsert documents again without the server going down every moment or so. However, I do get some errors like:

Hard Out Of Memory Error. Bucket "office" on node ec2-52-64-98-58.ap-southeast-2.compute.amazonaws.com is full. All memory allocated to this bucket is used for metadata.

Even though that error popped up, the servers seem to be working as normal (nothing has gone down or anything, so far).
Still, I’m not quite sure how or why this error occurred. As far as I know the bucket is operating under full eviction, so why would there be an out-of-memory error due to metadata? Unless I’m not fully understanding what full eviction is…


#7

You’re correct: you shouldn’t see that if you are operating with full eviction.

Can you check whether the stats show your bucket on one node using much more memory than the others? In the web UI, you’ll see a little blue arrow that lets you show details by server.
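To put rough numbers on why the eviction mode matters here: under value eviction (the default), every key’s metadata stays resident in RAM. A back-of-envelope estimate against the figures in this thread (490 million documents, ~46.8 GB RAM across 3 nodes) shows the scale of the metadata footprint; note that the 56 bytes of per-key overhead is the commonly cited figure for the 3.x line and the 20-byte average key length is purely my assumption:

```java
// Back-of-envelope: resident metadata footprint if keys are never ejected
// (i.e. value eviction). The 56 bytes/key overhead and 20-byte average key
// length are assumptions, not measured values from this cluster.
public class MetaEstimate {

    static double metadataGiB(long docs, long overheadPerKey, long avgKeyLen) {
        return docs * (overheadPerKey + avgKeyLen) / (1024.0 * 1024 * 1024);
    }

    public static void main(String[] args) {
        double metaGiB = metadataGiB(490_000_000L, 56, 20);
        System.out.printf("metadata ~= %.1f GiB (cluster RAM is ~46.8 GiB)%n",
                          metaGiB);
    }
}
```

Under those assumptions, roughly 35 GiB of metadata would have to stay resident with value eviction, far more than any per-bucket quota a 46.8 GiB cluster could offer. Full eviction allows metadata to be ejected too, which is why it is worth double-checking the bucket’s eviction setting actually took effect.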


#8

The RAM usage seems to be roughly equal across all 3 nodes, though the CPU usage varies a little.
Here are some screenshots from the web UI: