New single node Couchbase cluster startup initially has 2 nodes

Our CI environment frequently deploys new single-node Couchbase clusters into a test Kubernetes cluster. Whenever this happens, I observe that the cb-0000 pod is created and slowly starts up. However, after about 1 minute (I think) it still hasn’t become Ready, so cb-0001 is created. Shortly afterwards cb-0000 becomes Ready, but eventually cb-0000 is terminated and I’m left with a Ready cb-0001.

I’m guessing this is because my test cluster is slow and the operator has a 60s readiness timeout, so it spins up another node. Is there a way to increase this timeout to smooth the creation of the cluster?

There are a couple of things you should be aware of:

What do the logs say that the Operator is doing and why?

Hi,

The operator logs seem to show that once cb-0000 is created, the operator immediately starts “Pod upgrading”, which I believe causes it to start cb-0001?

Here is the snippet:
{“level”:“info”,“ts”:1605170512.46828,“logger”:“couchbaseutil”,“msg”:“Node status”,“cluster”:“journey-reconciler-test-28367/cb”,“name”:“cb-0000”,“version”:“6.5.1”,“class”:“all_services”,“managed”:true,“status”:“warmup”}
{“level”:“info”,“ts”:1605170512.6510231,“logger”:“cluster”,“msg”:“Pods warming up, skipping”,“cluster”:“journey-reconciler-test-28367/cb”}
{“level”:“info”,“ts”:1605170512.8513014,“logger”:“cluster”,“msg”:“Reconcile completed”,“cluster”:“journey-reconciler-test-28367/cb”}
{“level”:“info”,“ts”:1605170514.0321784,“logger”:“cluster”,“msg”:“create CouchbaseUser”,“name”:“tdm-service”}
{“level”:“info”,“ts”:1605170514.2597518,“logger”:“couchbaseutil”,“msg”:“Cluster status”,“cluster”:“journey-reconciler-test-28367/cb”,“balance”:“balanced”,“rebalancing”:false}
{“level”:“info”,“ts”:1605170514.259801,“logger”:“couchbaseutil”,“msg”:“Node status”,“cluster”:“journey-reconciler-test-28367/cb”,“name”:“cb-0000”,“version”:“6.5.1”,“class”:“all_services”,“managed”:true,“status”:“active”}
{“level”:“info”,“ts”:1605170517.0578322,“logger”:“cluster”,“msg”:“Pod upgrading”,“cluster”:“journey-reconciler-test-28367/cb”,“name”:“cb-0000”,“source”:“6.5.1”,“target”:“6.5.1”,“diff”:" strings.Join({\n \t… // 130 identical lines\n \t"hostname: cb-0000",\n \t"restartPolicy: Never",\n+ \t"securityContext:",\n+ \t" fsGroup: 1000",\n \t"subdomain: cb",\n \t"volumes:",\n \t… // 4 identical lines\n }, “\n”)\n"}
{“level”:“info”,“ts”:1605170517.5661707,“logger”:“cluster”,“msg”:“Creating pod”,“cluster”:“journey-reconciler-test-28367/cb”,“name”:“cb-0001”,“image”:“couchbase/server:6.5.1”}
{“level”:“info”,“ts”:1605170570.34202,“logger”:“cluster”,“msg”:“Pod added to cluster”,“cluster”:“journey-reconciler-test-28367/cb”,“name”:“cb-0001”}

Thanks.

Thank you discourse for reformatting perfectly good JSON with unicode :wink:

The interesting bit is why it thought the upgrade was necessary, in particular the diff, which when decoded looks like:

  strings.Join({
      … // 130 identical lines
      "hostname: cb-0000",
      "restartPolicy: Never",
+     "securityContext:",
+     " fsGroup: 1000",
      "subdomain: cb",
      "volumes:",
      … // 4 identical lines
  }, "\n")
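
If you want to pull just the changed lines out of a “Pod upgrading” entry without reading the escaped string by eye, something like the following rough sketch might help. It assumes you feed it the raw operator log line (valid JSON with straight quotes, not the forum-rendered version), and `diff_changes` is a made-up helper name, not anything shipped with the operator.

```python
import json


def diff_changes(log_line: str) -> dict:
    """Split the "diff" field of one JSON log entry into added and removed lines.

    Hypothetical helper: it only looks at the leading "+" or "-" marker that
    the diff output puts in the first column of each changed line.
    """
    entry = json.loads(log_line)
    added, removed = [], []
    for line in entry.get("diff", "").splitlines():
        if line.startswith("+"):
            added.append(line[1:].strip())
        elif line.startswith("-"):
            removed.append(line[1:].strip())
    return {"added": added, "removed": removed}
```

For the entry above, the only added lines are the `securityContext:` and `fsGroup: 1000` ones, which is what points at defaulting rather than a real version change.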

Now, the fsGroup is filled in by the dynamic admission controller (DAC), so the only conclusion I can make here is:

  • You provision the DAC
  • It hasn’t fully started yet
  • The CouchbaseCluster is created, but because the DAC isn’t running, no defaults get added
  • Eventually the DAC comes online
  • The operator writes the CouchbaseCluster in order to update the status
  • DAC fills in the defaults
  • The operator sees a difference and does something about it

The quick fix is to just add the fsGroup attribute to your CouchbaseCluster yourself, so there is nothing left for the DAC to default, and gloss over the race condition.
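
As a minimal sketch of that quick fix, assuming your CI builds the CouchbaseCluster manifest as a Python dict before applying it: the helper below fills in fsGroup only when it is missing. The `spec.securityContext.fsGroup` path and the value 1000 are taken from the diff above and are assumptions about your resource, and `with_fs_group` is a hypothetical name.

```python
import copy


def with_fs_group(cluster: dict, fs_group: int = 1000) -> dict:
    """Return a copy of a CouchbaseCluster manifest with fsGroup set explicitly.

    Assumption: the field lives at spec.securityContext.fsGroup, matching the
    lines the operator reported as added in its "Pod upgrading" diff.
    """
    patched = copy.deepcopy(cluster)
    spec = patched.setdefault("spec", {})
    security_context = spec.get("securityContext") or {}
    # Keep a value that was already chosen; only fill the gap.
    security_context.setdefault("fsGroup", fs_group)
    spec["securityContext"] = security_context
    return patched
```

With the value present from the start, the spec should look the same before and after the DAC comes online, so the operator has no difference to act on.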

What I actually do in our testing is provision the DAC, then perform server-side dry-run creates until I can see the DAC is applying defaults correctly. That removes the race entirely.
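
A rough sketch of that dry-run polling, not the actual test code used here. It assumes a kubectl recent enough to support `--dry-run=server`, a manifest file holding a single CouchbaseCluster that does not set fsGroup itself (otherwise the check passes trivially), and that a defaulted `spec.securityContext.fsGroup` is a good enough signal that the DAC is live. The function names and the retry numbers are made up.

```python
import json
import subprocess
import time


def kubectl_dry_run(manifest_path: str) -> dict:
    """Send the manifest through the API server without persisting it."""
    result = subprocess.run(
        ["kubectl", "create", "--dry-run=server", "-o", "json", "-f", manifest_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def wait_for_defaults(manifest_path, dry_run=kubectl_dry_run,
                      attempts=30, delay=2.0, sleep=time.sleep) -> bool:
    """Poll until the dry-run result carries a defaulted fsGroup.

    Returns True once the default shows up, False if it never does.
    """
    for attempt in range(attempts):
        try:
            obj = dry_run(manifest_path)
        except (subprocess.CalledProcessError, ValueError):
            # kubectl failed or printed something that is not JSON:
            # treat it as "not ready yet" and retry.
            obj = {}
        security_context = obj.get("spec", {}).get("securityContext") or {}
        if security_context.get("fsGroup") is not None:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

The idea would be to call `wait_for_defaults("cb.yaml")` after deploying the DAC and only create the real CouchbaseCluster once it returns True. The `dry_run` and `sleep` parameters are there so the loop can be exercised without a cluster.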

In 2.1, when released, it will actually raise an error when this race condition happens, which is good; you know it’s working then.

You’re exactly right! That solved it :slight_smile:

Thanks!

:boom: :smiley:

So, just to whet your appetite, in 2.2 next year, the base version of Kubernetes will be 1.17, so we can upgrade to CRD v1 and start making use of native defaulting, which means fewer DAC requirements and less chance of races. It’s not a perfect fix because we still need to somehow create an empty securityPolicy object, and only the DAC can do that, but steady progress at least!