Periodic OPS drop on Couchbase Server 3.0.1 every hour

#1

An OPS drop occurs between minute 44 and minute 49 of every hour.
Is there any configuration for a periodic job in Couchbase Server?

I am trying to find out why the OPS drop happens every hour between minutes 44 and 50. So far I have checked:

  • Network traffic (1Gbps) => probably not the cause (the Couchbase Server v2.2 cluster is ok)
  • Garbage collection on the WebAPI (which uses the Couchbase client) => not periodic
  • A periodic system batch job? => there is no crontab entry

2014.12.19 09:44 ~ 09:50

2014.12.21 11:44 ~ 11:50

Couchbase 2.2 is fine (no OPS drop occurred)
2014.12.21 11:44 ~ 11:50

#2

How often is compaction running? Also, are you setting TTLs on documents? The Expiry Pager runs once per hour by default, but typically this doesn't cause a degradation in performance.
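Just to illustrate what "setting TTLs" means here, this is a minimal sketch of how a per-document expiry would be written with the Java SDK 2.x; the host, bucket name, key, and content are placeholders, not anything from the actual application.

```java
// Hypothetical illustration: writing a document with a 1-hour TTL using Java SDK 2.x.
// If documents were written like this, the Expiry Pager would eventually purge them.
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class TtlExample {
    public static void main(String[] args) {
        Cluster cluster = CouchbaseCluster.create("127.0.0.1"); // placeholder host
        Bucket bucket = cluster.openBucket("default");          // placeholder bucket

        JsonObject content = JsonObject.create().put("name", "example");
        int ttlSeconds = 3600; // document expires one hour after the write
        bucket.upsert(JsonDocument.create("some-key", ttlSeconds, content));

        cluster.disconnect();
    }
}
```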

#3

Thanks for your answer, @tgreenstein

I am not setting TTLs on documents, and I don't use the auto-compaction option. :frowning:

#4

Has anyone else experienced this issue on 3.0.1?

At minutes 44 and 48 of every hour, many timeouts occur on the Couchbase clients (a sketch of how these timeouts surface in the SDK follows the list below).
I use the Couchbase Java SDK v2.0.2 and have 2 clusters for joining data.

  • There is no periodic throughput drop like this on Couchbase Server 2.2.
  • There is no batch job on any of the nodes.
  • It occurs at minutes 44 and 48 of every hour, exactly, regardless of GC and request load (it acts like an alarm ;().
  • There are no network effects such as periodic high traffic.
  • I never set a TTL on documents and never configured the auto-compaction option.
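For context, here is a minimal sketch of how the timeouts show up on the client side with Java SDK 2.x: synchronous gets against the two clusters with an explicit timeout. The host names, bucket names, keys, and the 2-second timeout are placeholders, not the real setup.

```java
// Minimal sketch of how the client-side timeouts surface with Java SDK 2.x.
// Hosts, bucket names, keys and the timeout value are placeholders.
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;

import java.util.concurrent.TimeUnit;

public class JoinExample {
    public static void main(String[] args) {
        Cluster clusterA = CouchbaseCluster.create("nodeA1", "nodeA2");
        Cluster clusterB = CouchbaseCluster.create("nodeB1", "nodeB2");
        Bucket bucketA = clusterA.openBucket("bucketA");
        Bucket bucketB = clusterB.openBucket("bucketB");

        try {
            // Synchronous gets with an explicit 2-second timeout; during the
            // periodic drop these calls fail with a TimeoutException wrapped
            // in a RuntimeException.
            JsonDocument left = bucketA.get("some-key", 2, TimeUnit.SECONDS);
            JsonDocument right = bucketB.get("joined-key", 2, TimeUnit.SECONDS);
            // ... join the two documents here ...
        } catch (RuntimeException e) {
            // e.getCause() is typically a java.util.concurrent.TimeoutException
            System.err.println("Operation timed out: " + e);
        } finally {
            clusterA.disconnect();
            clusterB.disconnect();
        }
    }
}
```
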
#5

Given everything you’re saying, it sounds like an issue we’ve not seen yet. Can you file one against Couchbase Server on our issue tracker? Please include a cbcollect_info for the nodes, which can be generated from the console.

#6

@ingenthr

Happy new year !

I’ve filed this issue on http://issues.couchbase.com/browse/MB-13032.

I'm sorry this update is so late. :smile:

#7

I've found 2 factors related to this issue:

  1. data size
  2. node count

One more thing: I assume that key length is also related to the periodic OPS drop.

I tested by adding data to the cluster and recording failure counts per 100 million operations (a rough sketch of this kind of load loop follows the results below):

100 million - no failures
400 million - no failures, but retries
850 million - many failures (10k)
==> added 8 more nodes (16 nodes total)
850 million - just 8 failures (failures decreased)
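For what it's worth, this is roughly the shape of the load loop used for the test above, as a hedged sketch only: the host, bucket name, document shape, and the batch/total sizes are placeholders and much smaller than the real run.

```java
// Rough sketch of the load test: upsert documents in bulk and count
// how many operations fail per batch. All sizes and names are placeholders.
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class LoadTestSketch {
    public static void main(String[] args) {
        Cluster cluster = CouchbaseCluster.create("127.0.0.1"); // placeholder host
        Bucket bucket = cluster.openBucket("default");          // placeholder bucket

        final long batchSize = 1_000_000L;  // the real test counted per 100 million
        final long totalDocs = 10_000_000L; // the real test went up to 850 million
        long failures = 0;

        for (long i = 1; i <= totalDocs; i++) {
            JsonObject content = JsonObject.create().put("seq", i);
            try {
                bucket.upsert(JsonDocument.create("doc-" + i, content));
            } catch (RuntimeException e) {
                failures++; // timeouts and other request failures land here
            }
            if (i % batchSize == 0) {
                System.out.printf("%,d docs written, %,d failures so far%n", i, failures);
            }
        }

        cluster.disconnect();
    }
}
```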

=========
Test Information

*** Server Information ***
nodes : 8

*** node spec ***
OS : Linux (2.6.32-358.6.2.el6.x86_64), 64-bit
CPU : Intel® Xeon® CPU E5-2420 0 @ 1.90GHz (6 cores) * 2
RAM : 128GB (DDR3 [1333 MHz], 16384MB * 8)
DISK : LSI MegaRAID SAS PCI Express ROMB [F/W: 3.340.05-2939] (1024MB)
       299.0 GB * 4

*** bucket spec ***
RAM Quota : 858GB
data size : 1.27 billion documents (1,270,000,000), 284GB, all data in memory
replicas : 1
disk I/O optimization : Low
Auto Compaction : OFF
Flush : enabled