Possible bug: eventing function stuck on undeploying

cmaster11 · September 2, 2018, 6:50am

Hi, while playing with eventing functions, I managed to get a function into a state where it is marked as undeployed and paused (green tick), but it’s impossible to delete it (gray button) or deploy it (CB says the function is being undeployed).

I’m using CB 6.0.0-beta on Docker (Windows host).

Once I restarted CB, the function got marked as paused (orange tick) and could be deleted, but only apparently: on page reload the function appears again.

I also previously managed to cause some weird desynchronization between the eventing service and the metakv one… can I send you my logs somewhere?

What I see in eventing.log is:

2018-09-02T07:56:13.037+00:00 [Error] util::ReadAppContent App: db_eventing_integration_test_testFn unmarshal failed for checksum
2018-09-02T07:56:13.037+00:00 [Error] Producer::metakvAppCallback [db_eventing_integration_test_testFn:0] Failed to lookup path: /eventing/apps/ from metakv, err: unexpected end of JSON input

venkat · September 4, 2018, 8:48am

Can you let us know how many nodes are being used, and which services are configured on each node?

cmaster11 · September 4, 2018, 8:51am

I’m testing on my local machine, with CB running in Docker. So, a single node, with the data/index/query/eventing services all together. The test is basic in any case, so no indexes are being generated and only the eventing service is tested.

After running the test several times, I noticed this issue starts happening when:

The function is bootstrapping.
I POST the setting to undeploy the function multiple times.

asingh · September 4, 2018, 5:32pm

Hi,

Does any cbcollect attached to the thread “Eventing timer function takes long time to get triggered” capture this issue? If not, I would request that you share a cbcollect from the setup when you get a chance.

Based on that, I could provide some suggestions.

Thanks.

cmaster11 · September 4, 2018, 6:17pm

So, when I managed to see this issue I manually saved the logs folder. I’m not able to reproduce the issue at the moment, but I attached the folder content from that time.

logs_stuckonundeployed_mask.zip (732.6 KB)

If I manage to reproduce the issue, I’ll also post the properly collected logs.

asingh · September 4, 2018, 6:19pm

Thanks, will review the logs and share update here.

cmaster11 · September 4, 2018, 6:28pm

Oh, I managed to reproduce it!

Steps:

Create a function
Set the setting to deploy/undeploy repeatedly (e.g. 100 times, i % 2 == 0 ? deploy : undeploy)
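The steps above can be sketched roughly as follows. This is only an illustration, not the original test code: the port (8096), the settings endpoint path, and the `deployment_status` / `processing_status` field names are assumptions about the Eventing REST API and are not stated in this thread, and the helper names (`toggle_payloads`, `settings_url`, `hammer`) are hypothetical.

```python
import base64
import json
import urllib.error
import urllib.request


def toggle_payloads(count):
    """Build `count` settings payloads: even i deploys, odd i undeploys."""
    payloads = []
    for i in range(count):
        deploy = i % 2 == 0
        # Field names are assumed, not taken from the thread.
        payloads.append({"deployment_status": deploy, "processing_status": deploy})
    return payloads


def settings_url(host, function_name, port=8096):
    """Assumed settings endpoint of the eventing service."""
    return f"http://{host}:{port}/api/v1/functions/{function_name}/settings"


def hammer(host, function_name, user, password, count=100):
    """POST the payloads back to back and return the HTTP status codes."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}", "Content-Type": "application/json"}
    codes = []
    for payload in toggle_payloads(count):
        request = urllib.request.Request(
            settings_url(host, function_name),
            data=json.dumps(payload).encode(),
            headers=headers,
            method="POST",
        )
        try:
            with urllib.request.urlopen(request) as response:
                codes.append(response.status)
        except urllib.error.HTTPError as err:
            # Keep going on errors so the whole sequence is still sent.
            codes.append(err.code)
    return codes
```

Calling `hammer("localhost", "testFn", "Administrator", "password")` (placeholder credentials) would send the 100 alternating requests without waiting for each deploy or undeploy to finish, which is the condition described here.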

Outcome:

Function is shown in CB UI as undeployed, paused.
Function cannot be deleted in UI, and REST API returns: ERR_APP_NOT_UNDEPLOYED.
Errors in eventing.log.

Logs: https://s3.amazonaws.com/cb-customers/Alberto+Marchetti/collectinfo-2018-09-04T182436-ns_1%40127.0.0.1.zip

asingh · September 6, 2018, 3:49pm

That seems unrealistic to do in the real world, i.e. you would not normally deploy and undeploy every few seconds. Deploy and undeploy operations have a cost associated with them.

Some of the overheads during deploy:

Get state of vbucket distribution across data nodes
Open change streams from the relevant data nodes for the vbuckets they host
Generate a plan on the eventing nodes to distribute the workload evenly across the available nodes
Depending on the state of the source bucket and the feed_boundary for the eventing function, streaming some or all items from disk on the data nodes can be expensive.

During undeploy:

Change streams are again opened from the data nodes, this time for the metadata bucket, to clean up all system-related metadata.

That said, if you feel there is an actual use case that requires frequent deploy & undeploy operations, please feel free to let us know. We could prioritize the request accordingly.

Thanks,
Abhishek

cmaster11 · September 6, 2018, 5:46pm

I guess this is a really extreme brute-force misuse of Couchbase that has no representation in the real world. The issue was mostly caused by testing (without an endpoint that told me whether the function was still deploying), so I think this problem will be solved with a new build.

asingh · September 6, 2018, 6:14pm

Yes, with new builds we’ve exposed /api/v1/status, which summarizes the state of all functions within the cluster.

Thanks.
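A test could use the /api/v1/status endpoint mentioned above to wait for a function to settle before sending the next deploy or undeploy request. The sketch below is a minimal illustration under assumptions: the port (8096) and the response shape (an `apps` list whose entries carry `name` and `composite_status`) are not given in this thread, and the helper names are hypothetical.

```python
import base64
import json
import time
import urllib.request


def function_state(status_doc, function_name):
    """Return the assumed `composite_status` of one function, or None."""
    for app in status_doc.get("apps") or []:
        if app.get("name") == function_name:
            return app.get("composite_status")
    return None


def fetch_status_http(host, user, password, port=8096):
    """GET /api/v1/status from the eventing service (port is assumed)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"http://{host}:{port}/api/v1/status",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode())


def wait_for_state(fetch_status, function_name, wanted, timeout_s=120,
                   interval_s=2, sleep=time.sleep, clock=time.monotonic):
    """Poll until the function reports `wanted`; False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if function_state(fetch_status(), function_name) == wanted:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

For example, `wait_for_state(lambda: fetch_status_http("localhost", "Administrator", "password"), "testFn", "undeployed")` (placeholder credentials and an assumed state name) would block until the function reports that state or two minutes pass. Passing the fetcher, sleep, and clock as parameters keeps the polling logic testable without a running cluster.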