Avoid cascading failures

Hi,

I’m running a tracker service based on the Play Framework (see http://www.playframework.com/) and use Couchbase to store the trackings.

I have noticed some problems with cascading errors, where a single error can block the system for many hours, probably due to contention.

What typically happens is this: in the log I see a message like

2013-11-13 00:31:08.015 WARN com.couchbase.client.vbucket.ConfigurationProviderHTTP: Connection problems with URI http://production.couchbase.node.5:8091/pools ...skipping java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method)

This means that a connection problem has occurred for one node. A little later I start to see messages like

2013-11-13 06:02:18.738 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Operation canceled because authentication or reconnection and authentication has taken more than one second to complete.

and the whole system becomes unresponsive, including request handling, even though the Couchbase nodes are running fine again.

As my track storage runs asynchronously, I believe one problem is that set/add operations just keep flowing into the system, so buffers/queues keep growing because no operations get completed. So basically a temporary problem persists because the client is overwhelmed by data.

Does the above conclusion sound valid? It currently looks like the Couchbase Java driver has a hard time recovering from a temporary issue because too many operations get queued in the system.

Has anybody experienced similar issues?

Are there any suggested methods/patterns to work around this? I’m currently considering putting a kind of queue in front of the Couchbase driver, so that we never have more than xxx requests in flight to the Couchbase nodes.
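
To make the idea concrete, here is a minimal sketch of such a limit, assuming the driver call can be wrapped so that it returns a CompletableFuture. The class BoundedInFlight and its method names are hypothetical and not part of the Couchbase SDK. When the limit is reached, the caller gets an empty result and can drop or retry the tracking instead of letting the queue grow.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper (not part of the Couchbase SDK): caps the number of
// asynchronous operations that may be in flight at the same time.
final class BoundedInFlight {
    private final Semaphore permits;

    BoundedInFlight(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Starts the operation only if a permit is free. Returns Optional.empty()
    // when the limit is reached, so the caller can shed load instead of queueing.
    <T> Optional<CompletableFuture<T>> trySubmit(Supplier<CompletableFuture<T>> operation) {
        if (!permits.tryAcquire()) {
            return Optional.empty();
        }
        CompletableFuture<T> future;
        try {
            future = operation.get();
        } catch (RuntimeException e) {
            permits.release(); // the operation never started, so give the permit back
            throw e;
        }
        // Release the permit when the operation finishes, whether it
        // succeeded, failed, or was cancelled.
        future.whenComplete((result, error) -> permits.release());
        return Optional.of(future);
    }

    // Number of operations that could still be started right now.
    int available() {
        return permits.availablePermits();
    }
}
```

The supplier passed to trySubmit would be whatever adapts the driver's set/add call to a CompletableFuture; that adapter is left out here because it depends on the driver version in use.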

Best Regards
Niels

Hello,

I am not sure that this is related to contention. I cannot be 100% positive, but I feel the issue is more related to inactivity on the network, after which the sockets are closed and cannot be reopened.

Any thoughts on a possible drop of sockets on your network?

Can you confirm that when you have activity the system is working as expected?

Regards
Tug
@tgrall

Hello,

Yes, the system is working as expected most of the time. Then I suddenly see these errors, starting as described, and the only way to get the system back is to restart the server.

"Any thoughts on a possible drop of sockets on your network?"

I’m not sure that I follow you here. Any hints on how I can confirm that sockets are dropping?

Best Regards
Niels

Hello,

I believe I found the cause of the problems.

A part of the code used for importing data was doing a lot of blocking calls to get views and also to query these views. This would happen in a spike.

Could that explain the behaviour I described?
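
One way to keep such a spike of blocking calls away from request handling is to run them on a small dedicated pool with a bounded queue. The following is only a sketch under that assumption: BlockingImportPool is a hypothetical name, the thread count and queue capacity are placeholders, and the callable stands in for the actual view query.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: isolates blocking import work on its own fixed-size
// pool so a spike cannot tie up the threads that serve requests.
final class BlockingImportPool {
    private final ThreadPoolExecutor executor;

    BlockingImportPool(int threads, int queueCapacity) {
        this.executor = new ThreadPoolExecutor(
                threads, threads,                      // fixed number of worker threads
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueCapacity), // bounded backlog
                new ThreadPoolExecutor.AbortPolicy()); // reject when workers and queue are full
    }

    // Throws RejectedExecutionException when all workers are busy and the
    // queue is full, which makes the overload visible to the importer.
    <T> Future<T> submit(Callable<T> blockingCall) {
        return executor.submit(blockingCall);
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

With this in place the import code sees a rejection as soon as the spike exceeds the configured capacity, and can slow down or retry later, rather than piling more blocking calls onto shared threads.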

Best Regards
Niels