Workaround for MB-14036 (memcached high CPU usage)


#1

Hello -

I believe we’re experiencing the issues noted in MB-14036 and MB-12775. We have a two-node cluster of Azure A3 VMs (4 cores, 7 GB memory). We’re running CB 3.0.1 on Ubuntu 12.04.5. Even when the cluster is doing nothing (no index rebuilds, zero ops/sec) we still see CPU usage bounce between 50% and 80%.
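To confirm it really is memcached consuming the CPU (and not another process), a small shell helper like this can sample the `%cpu` that `ps` reports for a named process. This is a hedged sketch, not from the thread; the process name `memcached` is assumed, and note that `ps -o %cpu` reports a lifetime average rather than an instantaneous reading:

```shell
# cpu_of NAME: print the %CPU (as reported by ps) of the first process
# whose exact name matches NAME; prints nothing if no match is found.
cpu_of() {
  pid=$(pgrep -x "$1" | head -n 1)
  [ -n "$pid" ] && ps -o %cpu= -p "$pid" | tr -d ' '
}

# On an affected idle node this was reported bouncing between 50 and 80.
cpu_of memcached || echo "memcached not running"
```

For a per-second view rather than a lifetime average, watching the process in `top` (press `1` to see per-core load) gives a closer match to what the cluster UI shows.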

This sounds like a dead ringer for the two bugs listed above; however, we’re currently on 3.0.1 Community Edition and aren’t in a position to upgrade to Enterprise at this time. Does anyone know if there’s a work-around for this bug? It seems like there would be; otherwise, everyone running Community Edition would be having the same issue.

Thanks,

(Note: I tried uploading an image of the CPU Usage with no ops/sec, but the forum won’t allow me to upload a picture).


#2

I’m not aware of a workaround for this - the fix was in the scheduling of internal ep-engine (i.e. the storage layer) background threads, so it’s not something that can be changed without modifying the source.

The bugs listed are fixed in the forthcoming 4.0 release, so if you can wait until that’s released, the issue will be resolved then.

Otherwise, if you’re comfortable with C++ and building from source, you could attempt to backport the fix to the 3.0.x branch - but I couldn’t say how straightforward backporting would be without attempting it.


#3

Thanks drigby. By work-around, what I meant was more a sequence of steps (reboots, service restarts, bucket config changes) that would prevent the condition from happening. I’m operating on the (possibly incorrect) assumption that not everyone running CB 3.0.1 is having this issue and that it only happens under a unique set of circumstances. Do you think that is an accurate statement?

Thanks,

- Jeff

#4

From what I recall, the bug was related to how tasks are scheduled on worker threads. It was an intermittent issue, possibly related to certain workload patterns and/or environments. However, I can’t characterise it more specifically, I’m afraid, so I don’t know of any specific action to resolve it.


#5

Thanks drigby. Our fix was to reboot the nodes. After going through a reboot cycle, the issues went away for us.
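For anyone else hitting this, the reboot cycle described above can be sketched as a rolling restart, one node at a time, so the cluster stays available throughout. This is a hypothetical sketch, not commands from the thread: the node names, the `ssh`/`sudo` invocation, and the warm-up delay are all assumptions to adapt to your environment (a full VM reboot, as used here, or a `service couchbase-server restart` may both be worth trying).

```shell
# rolling_restart NODE...: cycle through the given nodes one at a time,
# printing which node is being restarted. The actual reboot commands are
# left commented out as placeholders.
rolling_restart() {
  for node in "$@"; do
    echo "restarting $node"
    # In real use, trigger the reboot and wait for the node to warm up, e.g.:
    #   ssh "$node" sudo reboot
    #   sleep 300   # allow the node to come back and warm up before moving on
  done
}

rolling_restart cb-node-1 cb-node-2
```

Waiting for each node to fully rejoin the cluster before restarting the next avoids having both nodes of a two-node cluster down at once.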