Thank you for getting back to me.
-We've recently seen some issues with THP in various environments. Can you check that your choice of operating system has this turned off (cat /sys/kernel/mm/transparent_hugepage/enabled)?
I am using CB's commercial AWS AMI. Things like huge pages are configured by you. All I did to the image was add a second ephemeral spindle and point the indices at it.
While this isn't from the cluster in question, it is from your AMI:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory
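If it's useful, I can also check the RHEL-style location on the next cluster I bring up. The two paths I know to look at are in this quick sketch (nothing AMI-specific about it, just the common locations I'm aware of):

import os

# Check the usual transparent-hugepage locations; either may be absent
# depending on how the AMI's kernel was built.
THP_PATHS = [
    "/sys/kernel/mm/transparent_hugepage/enabled",
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",  # RHEL/CentOS 6 style
]

for path in THP_PATHS:
    if os.path.exists(path):
        with open(path) as f:
            # The active setting is shown in brackets, e.g. "always [madvise] never"
            print("%s -> %s" % (path, f.read().strip()))
    else:
        print("%s -> not present" % path)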
-It sounds like you may be approaching some sizing limits...you're putting 75GB into only ~36GB of RAM. While it shouldn't be a problem on paper, it will depend on how large your items are and how quickly you're trying to insert them. Can you share a screenshot of the "summary" graphs on your bucket in the hour/day timeframe? I wouldn't say that this behavior is expected when undersized, but it may be making it worse.
It was my understanding that the key things were to keep the metadata below half of the configured RAM and to give CB only 60% of the machine's RAM. By my observations/calculations, I am using about 75 bytes/item of metadata, or about 2.25 GB total. The cluster has 4.5 GB/node allocated to CB, 13.5 GB in total. 2.25 GB * 2 for the replica leaves me 9 GB of that quota for everything else. By my calculation, I can put another 15M or so documents into this cluster, but this problem is showing up long before we get there.
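To spell the arithmetic out (the ~30M item count and the 3-node figure are implied by the numbers above rather than measured directly; decimal GB throughout):

# Back-of-the-envelope sizing using the figures above (all approximate).
GB = 10**9

meta_per_item = 75                 # observed bytes of metadata per item
items         = 30 * 10**6        # implied by 2.25 GB of metadata / 75 bytes
replicas      = 1                  # one replica copy of everything

cluster_quota = 3 * 4.5 * GB       # 4.5 GB/node for CB across 3 nodes = 13.5 GB

active_meta = items * meta_per_item            # ~2.25 GB
total_meta  = active_meta * (1 + replicas)     # ~4.5 GB including the replica

# Rule of thumb: keep metadata under half of the configured quota.
headroom    = cluster_quota / 2 - total_meta
extra_items = headroom / (meta_per_item * (1 + replicas))

print("metadata incl. replica: %.2f GB" % (total_meta / GB))                   # 4.50 GB
print("quota left for data:    %.2f GB" % ((cluster_quota - total_meta) / GB)) # 9.00 GB
print("headroom: roughly %dM more documents" % (extra_items / 1e6))            # ~15M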
My particular app mostly uses views. I don't, at this time, depend upon keeping documents in memcache. It is a mostly idempotent data application. If I could dedicate more of the machine towards metadata, I would.
Since I've torn the problematic cluster down and moved up the server size chart, I can't make you a screenshot right now. Tomorrow I'll be spinning up a second node of the m2.4xlarge servers; then I can get you a picture.
-On your application side, how frequently are you creating Couchbase client objects? We would always recommend trying to use a single object for as long as possible, and you may also want to look at turning on the "config_cache" (http://www.couchbase.com/wiki/display/couchbase/libcouchbase+configurati...) which will reduce the traffic back to the cluster.
My Python client spins up a connection object per thread and keeps it open for the life of the client. I have 7 threads operating. I move hundreds of thousands of documents through each of those connections. It has been stable. I log each timeout. They are rare except when using this cluster.
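The connection handling looks roughly like this (a simplified sketch, not my exact code; the bucket/host names are placeholders and the calls are the Python SDK's Couchbase.connect()/set() style):

import logging
import threading

from couchbase import Couchbase                # Python SDK
from couchbase.exceptions import CouchbaseError

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("loader")

def partition_work(n):
    # Placeholder: yield n lists of (key, document) pairs for the real data set.
    return [[("doc::%d::%d" % (i, j), {"i": i, "j": j}) for j in range(1000)]
            for i in range(n)]

def worker(docs):
    # One connection object per thread, opened once and reused for the
    # thread's whole lifetime.
    cb = Couchbase.connect(bucket="mybucket", host="cluster.example.com")
    for key, doc in docs:
        try:
            cb.set(key, doc)
        except CouchbaseError:
            # Timeouts and other client-side errors get logged, not retried blindly.
            log.exception("operation failed for %s", key)

threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in partition_work(7)]
for t in threads:
    t.start()
for t in threads:
    t.join()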
-Any views configured? Can you share the definitions for them?
I have about a dozen views configured in three design documents. They are pretty straightforward views. I would be happy to share them privately with you.
I'm happy to add more data as I have it.