OperationTimeout from the .NET SDK... but that seems unlikely

pkramer · September 15, 2017, 5:52pm

We are getting an OperationTimeout from the SDK, but it occurs only about 500 ms after the request is submitted, so the error seems misleading. The cluster is up and healthy, and subsequent requests succeed. Here is a snippet from our logs, including the SDK stack trace:

2017-09-15 17:08:26,337 [7612] DEBUG ZOLL.Core.Couchbase.Sync.DurableSequenceRepository - Attempting to insert change feed marker document 'PcrCompleteChangeFeed_394870' (attempt 1 of 5): ZOLL.Core.Couchbase.Sync.ChangeFeedMarkerDoc
2017-09-15 17:08:26,852 [10024] DEBUG ZOLL.Core.Couchbase.Sync.DurableSequenceRepository - Save change marker result status for marker document 'PcrCompleteChangeFeed_394870': OperationTimeout, message: , should retry: False, exception: System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Couchbase.Core.Buckets.KeyObserver.<ObserveEveryAsync>d__22.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Couchbase.Core.Buckets.KeyObserver.<ObserveAsync>d__14.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at Couchbase.Core.Buckets.CouchbaseRequestExecuter.<SendWithDurabilityAsync>d__10`1.MoveNext()

Also, according to the operation result ("should retry: False"), an operation timeout apparently cannot be retried?

jmorris · September 16, 2017, 12:29am

@pkramer -

For observe operations, the timeout defaults to 500 ms; you can override it by setting ClientConfiguration.ObserveTimeout. The operation isn't retried because it timed out. However, the retry timeout should be based on ClientConfiguration.DefaultOperationLifespan (2500 ms), not on ObserveTimeout, so this might be a bug; I'd need to look into it more deeply.
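A quick check against the log in the first post: the "Attempting to insert" line is stamped 17:08:26,337 and the OperationTimeout result 17:08:26,852, about 515 ms apart. That is just past the 500 ms ObserveTimeout default and well short of the 2500 ms DefaultOperationLifespan, which fits the explanation that the observe timeout, not the operation lifespan, is what expired. The C# sketch below does that arithmetic. The class and method names are hypothetical, the two defaults are hard-coded from the values quoted in this reply rather than read from the SDK, and the timestamp format is assumed to match the log lines above.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; not part of the Couchbase SDK.
public static class TimeoutBudgetCheck
{
    // Defaults quoted in the reply above, in milliseconds (hard-coded assumption).
    public const int ObserveTimeoutMs = 500;
    public const int DefaultOperationLifespanMs = 2500;

    // Timestamp format assumed to match the log lines in the first post.
    private const string LogFormat = "yyyy-MM-dd HH:mm:ss,fff";

    // Milliseconds between two log timestamps.
    public static double ElapsedMs(string start, string end)
    {
        DateTime t0 = DateTime.ParseExact(start, LogFormat, CultureInfo.InvariantCulture);
        DateTime t1 = DateTime.ParseExact(end, LogFormat, CultureInfo.InvariantCulture);
        return (t1 - t0).TotalMilliseconds;
    }

    // Names the largest of the two budgets that the elapsed time has reached.
    public static string ExceededBudget(double elapsedMs)
    {
        if (elapsedMs >= DefaultOperationLifespanMs) return "DefaultOperationLifespan";
        if (elapsedMs >= ObserveTimeoutMs) return "ObserveTimeout";
        return "none";
    }
}
```

For example, `ExceededBudget(ElapsedMs("2017-09-15 17:08:26,337", "2017-09-15 17:08:26,852"))` evaluates to "ObserveTimeout". The two log lines come from different threads and bracket more than the SDK call itself, so the gap is only an approximation of the time spent observing.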

Note that we do not retry once DefaultOperationLifespan is exceeded because an operation must eventually either succeed or fail; we cannot keep retrying forever. At that point, however, the application can apply its own retry logic if that makes sense for the specific use case.
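A minimal sketch of such application-side retry logic, assuming the SDK call is wrapped in a delegate that returns true on success. The `RetryWithBudgetAsync` helper and its parameters are hypothetical and not part of the Couchbase SDK. Like the SDK behavior described above, it caps both the number of attempts and the total time spent, so it cannot loop forever:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class AppRetry
{
    // Hypothetical helper (not part of the Couchbase SDK): calls `attempt` until it
    // returns true, at most `maxAttempts` times, waiting `delay` between attempts and
    // giving up early if another wait would push the total time past `budget`.
    public static async Task<bool> RetryWithBudgetAsync(
        Func<Task<bool>> attempt,
        int maxAttempts,
        TimeSpan budget,
        TimeSpan delay)
    {
        Stopwatch elapsed = Stopwatch.StartNew();
        for (int i = 1; i <= maxAttempts; i++)
        {
            if (await attempt().ConfigureAwait(false))
            {
                return true; // success: stop retrying
            }

            // Stop after the last attempt, or when waiting again would exceed the budget.
            if (i == maxAttempts || elapsed.Elapsed + delay > budget)
            {
                break;
            }

            await Task.Delay(delay).ConfigureAwait(false);
        }

        return false; // the caller decides how to report the final failure
    }
}
```

In the scenario from the first post, `attempt` could wrap the insert-with-durability call and return true only when the result succeeds. Whether repeating the call is safe depends on the use case: an observe timeout means durability was not confirmed, but the write itself may already have been applied.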

pkramer · September 18, 2017, 3:21pm

Is ObserveTimeout configurable in app.config?

I can certainly increase this timeout, but I'm more curious about why the timeout occurs in the first place. Is there any way to determine the cause?