Eventing timer function takes long time to get triggered


#1

Hi, I’m testing the createTimer function with a similar code:

function OnUpdate(doc, meta) {

    log(doc, meta);

    const now = new Date();
    const executeAtDate = new Date(now.getTime() + 2000);

    log('Creating timer', executeAtDate);

    // Setup the timer
    createTimer(onTimer, executeAtDate, meta.id, {id: meta.id});
}

function onTimer(context) {
    log('Running timer', context.id);

    const doc = bucket_source[context.id];

    // Simple set
    doc.processedSimple = true;

    bucket_target[context.id] = doc;

    log('Set document flags', context.id);
}

By inspecting the function logs (in the @eventing dir), I can see that the first block gets invoked correctly on every document update, while the second block takes some times even more than 120 seconds to be invoked at least once.

I can see that the timer is planned:

2018-09-03T04:15:09.037+00:00 [INFO] {"docType":"eventing_integration_test_model","id":"DbModelEventingFunctionTestTimer_1535948053460","createdAt":1535948053460,"updatedAt":1535948053460,"counter":6,"processedSimple":false,"response":null} {"cas":1535948103505477600,"id":"wdm_eventing_integration_test_model_DbModelEventingFunctionTestTimer_1535948053460","expiration":0,"flags":33554432,"vb":622,"seq":7} 
2018-09-03T04:15:09.037+00:00 [INFO] "Creating timer" "2018-09-03T04:15:11.036Z" 

Timer than takes a long time to start:

2018-09-03T04:16:54.201+00:00 [INFO] "Running timer" "wdm_eventing_integration_test_model_DbModelEventingFunctionTestTimer_1535948053460" 
2018-09-03T04:16:54.205+00:00 [INFO] "Set document flags" "wdm_eventing_integration_test_model_DbModelEventingFunctionTestTimer_1535948053460"

I’m running Couchbase 6.0.0-beta in Docker (Windows host), with:

  • Processor Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz, 2904 Mhz, 4 Core(s), 8 Logical Processor(s), 4 logical processors made available to Docker.
  • ~1GB of RAM associated to the eventing service.

I’m trying to figure out if my setup is too weak for the eventing service to operate, or if there’s any tuning I can perform on the function settings.

Do you have any clue what could be going on?

Thanks,
Alberto


Possible bug: eventing function stuck on undeploying
#2

Can you give us an idea of how many timers are getting created at the same timestamp? And also how big individual nodes and the overall cluster is? Also, doc size(both key and value) and mutation rates?


#3

I’m running the whole cluster in Docker, so single node, with the specs listed above. Also, ~256MB of index RAM for each bucket (no indexes created, in any case).

I’m running a simple test though, which means:

  • 1 timer function
  • A mutation happens every 5 seconds, where a counter value gets increased with sub-doc operations.
  • Key size ~85 chars
  • Value size ~200 bytes

It looks like quite a basic and simple scenario, to have problems with it :thinking:


#4

@cmaster11 could you please upload cbcollect from your setup? Details to collect it from UI available here - https://www.couchbase.com/support/working-with-technical-support


#5

Sure! This is a record of when the timer didn’t even got triggered once in 120 seconds.

https://s3.amazonaws.com/cb-customers/Alberto+Marchetti+-+cmaster11/collectinfo-2018-09-04T101537-ns_1%40127.0.0.1.zip


#6

Thanks for the cbcollect.

Function was undeployed when cbcollect was taken, stats from rest endpoint was missing - would have been useful if stats were around. Observation based on stats that gets logged within eventing.log:

  • 6 DCP Mutations(doc creates/updates) were seen by Eventing. Can’t confirm if they are from same document or different documents.
  • 5 timers were sent from eventing-consumer to eventing-producer, not sure how many got created and stored in metadata bucket.
  • None of timers have fired yet.
{"app_worker_setting_events_lost":0,"bucket_op_exception_count":0,"checkpoint_failure_count":0,"dcp_events_lost":0,"debugger_events_lost":0,"delete_events_lost":0,"mutation_events_lost":0,"n1ql_op_exception_count":0,"timeout_count":0,"timer_events_lost":0,"timestamp":"2018-09-04T10:11:58Z","v8worker_events_lost":0} execution stats: {"agg_queue_memory":267,"agg_queue_size":0,"dcp_delete_msg_counter":0,"dcp_mutation_msg_counter":6,"enqueued_dcp_delete_msg_counter":0,"enqueued_dcp_mutation_msg_counter":6,"enqueued_timer_msg_counter":0,"feedback_queue_size":1,"lcb_retry_failure":0,"messages_parsed":46,"on_delete_failure":0,"on_delete_success":0,"on_update_failure":0,"on_update_success":6,"timer_create_failure":0,"timer_msg_counter":0,"timer_responses_sent":5,"timestamp":"2018-09-04T10:11:58Z","uv_try_write_failure_counter":0}

Questions:

(a) When you say timers are taking more than 120s to fire - are you including the time it takes function to bootstrap as well[1]? If yes, you could let bootstrap finish and then trigger events within source bucket to create timers. Would be useful to know how much delay do you see in that case.
(b) If suggestion in (a) doesn’t solve your problem, cbcollect from running system with function(s) in deployed state would be useful for debugging.

QE folks here tried replicating the issue with supplied function JS code, but they weren’t able to see delay in timer execution beyond 15s.

[1] In Beta build, we waited for bootstrap to wrap before firing timers.


#7

It’s also possible Alberto is running into one of the bugs that were fixed since Beta. In particular the ones with floor calculation. @venkat is there any possibility of sharing a beta refresh build?


#8

Thanks for the answer.

I’m including the time the function takes to bootstrap.

I ran the tests again, and got this scenarios (and collected info with deployed functions):

Thanks a lot :slight_smile:


#9

Hi Alberto,

At 2018-09-04T12:50:37.938 db_eventing_integration_test_testFn_timer_timerFunction finished bootstrap

2018-09-04T12:50:37.938+00:00 [Info] SuperSupervisor::HandleSupCmdMsg [1] App: db_eventing_integration_test_testFn_timer_timerFunction_1536065408579 added to deployed apps map

From stats:

  "DCP_MUTATION": 22 - Same doc got created/updated multiple times
  "TIMER_RESPONSES_RECEIVED": 22 - 22 times Javascript code within OnUpdate asked for timers to be created. Given they were with same reference and timer callback function, subsequent timer request would overwrite previous one.
  "TIMER_EVENTS": 1 - One timer event fired off.

Between 2018-09-04T12:55:20.200 & 2018-09-04T12:56:20.20 - timer had fired

2018-09-04T12:55:20.200+00:00 [Info] Consumer::processEvents [worker_db_eventing_integration_test_testFn_timer_timerFunction_1536065408579_1:40117:633] DCP events: DCP_STREAMREQ:341 DCP_SNAPSHOT:21 DCP_MUTATION:21 V8 events: THR_COUNT:1 SOURCE_MAP:1 EXECUTION_STATS:58 LATENCY_STATS:58 LCB_EXCEPTION_STATS:58 LOG_LEVEL:3 THR_MAP:1 V8_INIT:1 V8_LOAD:1 HANDLER_CODE:1 FAILURE_STATS:58 Timer events: Doc: 0, vbs owned len: 341 vbs owned: [342-682]


2018-09-04T12:56:20.200+00:00 [Info] Consumer::processEvents [worker_db_eventing_integration_test_testFn_timer_timerFunction_1536065408579_1:40117:633] DCP events: DCP_STREAMREQ:341 DCP_SNAPSHOT:22 DCP_MUTATION:22 V8 events: LATENCY_STATS:70 THR_COUNT:1 SOURCE_MAP:1 EXECUTION_STATS:70 V8_LOAD:1 HANDLER_CODE:1 FAILURE_STATS:70 LCB_EXCEPTION_STATS:70 LOG_LEVEL:3 THR_MAP:1 V8_INIT:1 Timer events: Doc: 1, vbs owned len: 341 vbs owned: [342-682]

Tue Sep 4 12:55:50 UTC 2018 is when cbcollect was started, probably in the middle of cbcollect run - timer was fired. I see there are delays in timer execution on your setup, not sure if its because of some bug related to floor calculation logic which we may’ve recently fixed. Will check with Venkat if we share an updated stable build with you.

Thanks,
Abhishek


#10

Between 2018-09-04T12:55:20.200 & 2018-09-04T12:56:20.20 - timer had fired

Oh! I see, my bad, it was definitely not capture by my test suite (I have sensible timeouts).

Will check with Venkat if we share an updated stable build with you.

Well, in case, I’m really happy to be a tester!

Thank you for all these analysis :slight_smile:

Alberto