Stale data on a concurrent request using replicate_to=-1

Kris · January 28, 2020, 3:20pm

I’m having an issue where Couchbase is returning stale data only first thing in the morning. Let me explain…

I’m running production Couchbase servers (4.6) with 3 nodes. In this scenario, I have 2 web requests that are hit by a client app in succession:

1. An endpoint that (over)writes a Couchbase document with data pulled from an external source
2. An endpoint that reads the content of the document that was provided in the response from the first endpoint

So there are 2 separate requests on the web server cluster, and each request is likely to hit a different web server that connects to my Couchbase cluster.

First thing in the morning, the 2nd endpoint will return old data. If I run the process over and over again, I cannot reproduce this behavior: the first endpoint updates the data and the second one reads the updated data as expected. It’s only on the first run of the morning (say, after 12 hours of no activity) that I see the problem.

I should also note that in the first endpoint hit, after writing the data, I read it back and log the contents, which are correct. So the 1st endpoint is working properly, and seems to be writing the data properly. This is consistent even on the first attempt of the morning. However the 2nd endpoint still shows the stale data. This specific functionality is to transfer an end user’s data when they move periodically. I have to be able to rely on the durability of the data on the first try after long periods of inactivity.

I have tried passing replicate_to = -1 and persist_to = -1 (a comment in the PHP Couchbase library says that “-1” means all active nodes).

Any help would be much appreciated. Thanks

matthew.groves · January 28, 2020, 10:07pm

Hi @Kris,

Do you think you could post some of the code, especially for the #2 endpoint? The replicate_to and persist_to options may not figure into the problem, but depending on how you’re retrieving data, you may be getting stale data (if you’re using a JavaScript View or a N1QL query, for instance).

Kris · January 29, 2020, 4:29pm

Hi @matthew.groves,

It’s part of a larger application, but I’ll pull out the applicable lines. To retrieve the data, we’re just using the PHP SDK’s ‘Couchbase\Bucket::get()’.

Request #1

            $key = 'p_123';
            $dataObject = json_encode([ 'level' => 200]);

            $oldData = $this->bucket->get($key);
            $oldLevel = $oldData->value->level;   // level = 100

            $this->bucket->upsert($key, $dataObject, ['expiry'=>0, 'persist_to'=>0, 'replicate_to'=>1]);

            $doc = $this->bucket->getAndTouch($key, 0, []);
            $level = $doc->value->level; // level = 200

Request #2

            $key = 'p_123';
            $data = $this->bucket->get($key);
            $currentLevel = $data->value->level;
            // $currentLevel == 100 ??? (should be 200)

Kris · January 29, 2020, 4:53pm

Also, this only happens once or twice in the morning… Subsequent tests do result in the expected result.

matthew.groves · January 29, 2020, 5:12pm

@Kris,

I don’t see anything that stands out, but I am curious why you are specifying an expiry and why you’re using getAndTouch. Is there anything in your app that is supposed to create documents with a TTL or use document expiration in some way? Since I see “expiry=0”, I’m wondering why you’re using getAndTouch. I can’t think of a reason why this would cause the behavior, but it does appear curious to me.

Kris · January 29, 2020, 5:30pm

@matthew.groves, the expiry setting is just because I’ve pulled these lines from wrapper classes that provide the option to set an expiry. In these specific cases the expiry is set to 0.

Also, I added the “getAndTouch” in a desperate attempt to “shake” the system into propagating the data more reliably. It hasn’t had an effect. What I’ve done now is add a “sleep(1)” after my transfer step, which I will test tomorrow morning. (Again, there’s no point testing it now: since I tried it earlier today, it will work as expected until I leave it alone for a long period of time.)

Could there be a configuration issue with our Couchbase cluster that might cause this behavior? We run 3 nodes in the cluster, and the bucket is set to 1 replica. Both the web servers and couchbase servers are hosted in AWS and communicate directly using private DNS in a VPC.

Kris · January 30, 2020, 5:31pm

So, even after introducing a 1 second sleep at the end of the 1st request that writes the data, the 2nd request still returns stale data.

Kris · January 31, 2020, 2:26pm

I’ve been able to confirm that the problem is on the read. Yesterday I set up 3 tests to break (stop execution) between the first and second request. All 3 tests resulted in the correct data being written to Couchbase. I manually checked Couchbase using the admin portal after each of the initial requests went through (again, having disabled the second request, which reads the data in quick succession). This tells me the problem is in the second request reading Couchbase data that is stale, even as much as 1 second later.

The current code that saves the data is still passing “replicate_to=1” to the ‘upsert’ options.

Kris · February 1, 2020, 2:28pm

@matthew.groves, I’m trying new tests every morning now. This morning I changed back to using “persist_to=1” (using the explicit replica value rather than -1) and made 2 reads on the 2nd request, on the assumption that the first read might “wake it up” and the second read would return the real data.

I set up 3 tests each day now to run the next morning. In each case I change the source data (a different data center where the transfer-in data originates). I then run one of the tests through the client game (Android app), which hits the first request, takes the key ID provided in the response, and includes it in the 2nd request. For the other 2 tests I hit the link manually to avoid the 2nd request being fired at all.

The test run through the game results in stale data, but the 2 tests run manually that never hit the 2nd endpoint (read endpoint) result in the correct data in couchbase (when viewed through the admin portal).

Would it help for me to force a CAS value change? Is there a way for me to do that, or is it already being done when I save the data in the first request? Are there any server configurations that could affect this? For example, is there logging that could show every read/write request along with metadata about each?
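
On the CAS question: every successful mutation in Couchbase already assigns the document a new CAS value, so there should be no need to force one. Here is a rough sketch of how that could be used to detect (and log) the stale read, assuming the first request captures the CAS of its write and passes it to the client together with the key. The helper name `readExpectingCas` and its parameters are hypothetical and not part of the Couchbase SDK; it only assumes that the fetched object exposes a `cas` property, as the documents returned by the PHP SDK 2.x `get()` and `upsert()` calls are expected to.

```php
<?php
// Hypothetical helper, not part of the Couchbase SDK.
// $fetch is any callable returning an object with a ->cas property,
// e.g. function () use ($bucket, $key) { return $bucket->get($key); }
function readExpectingCas(callable $fetch, string $expectedCas, int $maxAttempts = 3, int $delayMs = 100): array
{
    $doc = null;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $doc = $fetch();
        if ((string) $doc->cas === $expectedCas) {
            // This is exactly the version that request #1 wrote.
            return ['doc' => $doc, 'stale' => false, 'attempts' => $attempt];
        }
        // A mismatch means "not the version request #1 wrote": it may be
        // older (stale) or newer (another writer got in between).
        error_log(sprintf(
            'CAS mismatch on attempt %d: expected %s, got %s',
            $attempt,
            $expectedCas,
            (string) $doc->cas
        ));
        if ($attempt < $maxAttempts) {
            usleep($delayMs * 1000); // wait before re-reading
        }
    }
    // Never saw the expected version; return the last document read.
    return ['doc' => $doc, 'stale' => true, 'attempts' => $maxAttempts];
}

// Usage sketch (assumes SDK 2.x, where upsert() returns an object with ->cas):
//   Request #1: $result = $this->bucket->upsert($key, $dataObject, [...]);
//               return $key and (string) $result->cas to the client.
//   Request #2: $r = readExpectingCas(function () use ($key) {
//                   return $this->bucket->get($key);
//               }, $casFromRequest1);
```

The logged mismatches would at least show which CAS the second request actually saw each morning, which helps separate a genuinely old document from a read that hit the wrong key or bucket.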

Thanks in advance, I appreciate any direction you can point me in…

paragkashyap · August 30, 2022, 1:00pm

Hi @Kris ,

Were you able to identify the root cause of the read returning stale data?