SERVER ERROR (code 10) for large number of put operations

Zygmumac · July 9, 2013, 1:17pm

Hello,

I am running Couchbase 2.0.1 Community Edition (build-170) on two Ubuntu Server 12.04.2 machines, and during some of my Couchbase stress testing I’ve come across a problem I can’t explain as of yet.

The scenario is as follows:

2 server nodes, 4 GB memory in the cluster, 1.36 TB storage space, replicas enabled, 1 replica copy, single bucket with a RAM quota of 200 MB (100 MB per node, intentional for testing), persistence enabled
Set requests are sent to the cluster in groups of <=400000 (each group’s elements sent sequentially in a loop) - the set requests are sent using PHP, via both the SDK and the Memcached library (doesn’t seem to matter), each set request has a random INT key and a random INT value (mt_rand)
For each group, I am calculating two things:

a) How many set requests fail (getResultCode() is non-zero) - this is checked for each request (failure rate)

b) After the group is sent and the disk write queue is empty - how many sets cannot be ‘verified’, as in for how many elements a get( key ) request does not return the proper value (or returns no value at all)

In all cases, the values calculated in a) and b) are identical (sets that were confirmed are always verified)
The initial group has a 13% failure rate; once the bucket runs out of RAM, all following set requests fail with code 10 (SERVER ERROR) - the Couchbase log dump does not contain any out of memory errors!
The second group has a 50% failure rate, the third 99% and finally 100%; each group is executed after a delay (I intentionally wait for the disk writes to finish), and RAM usage stays a few MB above 200 (200 is the limit, however)
Every set attempt after that (even from the web interface) fails with code 10; the delay before the set does not matter
If the cluster is restarted, I can squeeze in one more group, with a 70% failure rate
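
To make the two measurements in a) and b) concrete, here is a minimal sketch of the per-group loop. It is written in Python for brevity and is not the actual PHP test code; `FakeBucket` is a hypothetical in-memory stand-in for the cluster that simply rejects new keys once a fixed capacity is reached, and `run_group` is an illustrative name. Keys are drawn without repetition here so that the two counts stay directly comparable.

```python
import random


class FakeBucket:
    """Hypothetical stand-in for a cluster client (not a real SDK class).

    It rejects sets for new keys once `capacity` keys are stored, which
    loosely imitates a bucket that has run out of memory.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}

    def set(self, key, value):
        # True on success, False on failure (like a non-zero result code).
        if key not in self.data and len(self.data) >= self.capacity:
            return False
        self.data[key] = value
        return True

    def get(self, key):
        # Returns None when the key is missing.
        return self.data.get(key)


def run_group(client, size, rng):
    """Send `size` sets sequentially; return (failed_sets, unverified_sets)."""
    keys = rng.sample(range(2**31), size)  # unique random integer keys
    written = [(key, rng.randrange(2**31)) for key in keys]

    # a) count the set requests that report a failure
    failed = sum(1 for key, value in written if not client.set(key, value))

    # (the real test waits here until the disk write queue is empty)

    # b) count the sets whose value cannot be read back
    unverified = sum(1 for key, value in written if client.get(key) != value)
    return failed, unverified


bucket = FakeBucket(capacity=60)
first_group = run_group(bucket, 100, random.Random(1))
print(first_group)
```

With the stand-in above, the two counts always match, which is the same relationship I see against the real cluster.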

In theory, at least if I understand the documentation correctly, I can get out of memory errors at high load or when it’s running out of RAM, but once everything calms down and all data is flushed to disk I should be able to perform set operations again, plus there are no out of memory messages in the error log.

I might be doing something wrong, of course. Does anyone have any leads on what might be wrong or what I might be doing wrong? I can, of course, provide log or code fragments if needed.

Thanks in advance.

tgrall · October 2, 2013, 4:59pm

Hello,

It looks to me like you have an issue with RAM: too many sets, and the disk queue cannot be drained quickly enough. It is not clear to me what the status of the requests is once you get all the statuses back.

Do you see any ep_tmp_oom_errors (temp OOM per sec.) in cbstats or the admin console (at the bucket level)?
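
If it helps, here is a rough sketch of how the OOM-related counters could be pulled out of saved cbstats output. It assumes the output consists of `name: value` lines, as printed by something like `cbstats localhost:11210 all` (add `-b <bucket>` for a non-default bucket); the function names are only illustrative.

```python
def parse_stats(text):
    """Parse 'name: value' lines into a dict of strings.

    Assumes one stat per line, with the name before the first colon.
    Lines without a colon are ignored.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            stats[name.strip()] = value.strip()
    return stats


def oom_counters(stats):
    """Return the integer-valued stats whose name mentions 'oom'."""
    return {
        name: int(value)
        for name, value in stats.items()
        if "oom" in name and value.isdigit()
    }
```

Calling `oom_counters(parse_stats(output))` on the captured text before and after a test group shows whether ep_tmp_oom_errors grew while the sets were failing.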

Have you tried the latest release, Couchbase 2.2, which lets you configure more writers and readers?
http://docs.couchbase.com/couchbase-manual-2.2/#using-multi--readers-and-writers

This could probably help.

regards
Tug
@tgrall