Java SDK v3 deficiencies

I’ve got a wrapper library around the Couchbase client, which is fortunate given how much the underlying implementation has changed.

The current things I’ve identified as gaps (so far) are:

  1. Counters. Appear to be totally missing.
    ANSWER: This is in BinaryCollection. Trying to figure this out now.
  2. Documentation. For example, there’s no documentation on how to perform bulk operations, as there was with 2.x.
  3. Append. I’d swear I saw something about this, but I haven’t been able to find it…
    ANSWER: This is in BinaryCollection. Trying to figure this out now.

-H

2 Likes

https://docs.couchbase.com/java-sdk/current/project-docs/migrating-sdk-code-to-3.n.html

That is a reasonable place to start.

2 Likes

Hi unhuman,

Thanks for the feedback – we really appreciate it, especially at this early stage in the 3.x line.

Append and Counters have moved to the BinaryCollection API, as mentioned in the migration guide. Here’s a little example that demonstrates calling Collection.binary() to access the binary operations:

import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.kv.IncrementOptions;
Cluster cluster = Cluster.connect("localhost", "Administrator", "password");
Collection c = cluster.bucket("default").defaultCollection();
c.binary().increment("myCounter", IncrementOptions.incrementOptions().initial(0));
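
Append works the same way through the binary collection. A minimal sketch continuing from the snippet above (the "myLog" document name is just an illustration, and the document must already exist with raw, non-JSON content):

    import java.nio.charset.StandardCharsets;

    // Sketch: append raw bytes to an existing binary document.
    c.binary().append("myLog", ", next entry".getBytes(StandardCharsets.UTF_8));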

To your second point, yes, the documentation is still pretty thin. We’ve been working hard to align the SDK release with the release of Couchbase Server 6.5, and as a result the SDK documentation is not as complete as we’d like… but we’re moving in the right direction, I hope.

2 Likes

Hi unhuman,
I echo David’s point that yes, we are not yet where we want to be with the SDK 3 documentation. You should see much of the shortfall remedied over the next month.
Meanwhile, you can find bulk ops at:
https://docs.couchbase.com/java-sdk/3.0/howtos/concurrent-async-apis.html#batching
Perhaps not the most intuitive place - we have amalgamated the bulk ops and async docs to even out coverage across the SDKs - but it will show up if you search "java bulk" in our search bar.
Thanks for being an early adopter of the 3.0 SDK!

2 Likes

The migration has been going pretty well so far, but I’ve run into a few issues that are fairly severe for us:

  1. The N1QL DSLs were removed. We’d built a query framework around them. What’s the reasoning behind their elimination? Seems like I could copy the entire DSL over into our own repo, but… That doesn’t seem right.

  2. Bulk operations are a bit tricky. They work in the success path, but I’ve got some issues with error handling. For example, here’s my logic for a bulk write:

     // parameter passed in: Map<K, T> entities

     Flux
         .fromIterable(entities.keySet())
         .flatMap(key -> reactiveCollection
             .upsert(getKeyProvider().getCouchbaseKey(key), entities.get(key), upsertOptions)
             .onErrorResume(e -> {
                 if (LOG.isDebugEnabled()) {
                     LOG.debug("Failed to upsert document with Id: " + key.toString(), e);
                 }
                 errorResults.put(key, e);
                 // Note: MutationResult can't be null,
                 // so we blockLast() to ensure we get what we need.
                 return Mono.empty();
             }))
         .doOnNext(successResults::add)
         .blockLast();

I changed from the example .last().block() to .blockLast() to get past:
java.util.NoSuchElementException: Flux#last() didn't observe any onNext signal
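
(In case it helps anyone else hitting the same thing: when every element gets swallowed by onErrorResume(Mono.empty()), the stream completes empty, so last() has nothing to emit while blockLast() just returns null. A minimal illustration, separate from our real code:)

    import reactor.core.publisher.Flux;

    Flux<String> empty = Flux.empty();
    empty.blockLast();      // completes with null, no exception
    empty.last().block();   // throws NoSuchElementException: Flux#last() didn't observe any onNext signal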

When multiple documents are written but one of them is bad (in my test, the key is too long), the “good” writes fail with RequestCanceledException. This isn’t desirable; we still want the good writes to go through.

Another bad case: if a null key is passed in, the blockLast() blows up with:
java.lang.NullPointerException: The iterator returned a null value. I’m not too worried about this case because a null key is a pretty invalid input, but it did work with 2.x.

One thing I’d like to try is returning my own FailedMutationResult instead of Mono.empty() (which sort of breaks things and has to be worked around with blockLast()), but that’s not possible because MutationResult has no accessible constructor (it’s package-private).
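
The alternative I’m considering is a wrapper of my own, so successes and failures both flow through the stream. A sketch reusing the names from my snippet above (UpsertOutcome is a hypothetical class of ours, not an SDK type):

    // Hypothetical wrapper: carries either the MutationResult or the error.
    class UpsertOutcome<K> {
        final K key;
        final MutationResult result;  // null on failure
        final Throwable error;        // null on success
        UpsertOutcome(K key, MutationResult result, Throwable error) {
            this.key = key;
            this.result = result;
            this.error = error;
        }
    }

    List<UpsertOutcome<K>> outcomes = Flux
        .fromIterable(entities.keySet())
        .flatMap(key -> reactiveCollection
            .upsert(getKeyProvider().getCouchbaseKey(key), entities.get(key), upsertOptions)
            .map(result -> new UpsertOutcome<>(key, result, null))
            .onErrorResume(e -> Mono.just(new UpsertOutcome<>(key, null, e))))
        .collectList()
        .block();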

1 Like

Okay, so this is tricky. We thought about this, but felt it wasn’t worth bringing forward: while it works in simple cases, it falls apart on complex queries, because expressing all of N1QL in the Java language gets very complex. It’s also tricky to keep extending it as N1QL changes and still cover everything.

One potential remedy I could see is this evolving as a separate project somewhere, though maybe not as part of the SDK directly. I still think there is value in it; we just haven’t seen as much value in the field as we’d hoped initially. And 3.0 felt like a chance to “clean” things up a bit.
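
In the meantime, the replacement is plain parameterized N1QL strings via cluster.query(). A rough sketch (the bucket and field names here are just examples):

    import com.couchbase.client.java.json.JsonObject;
    import com.couchbase.client.java.query.QueryOptions;
    import com.couchbase.client.java.query.QueryResult;

    // Named parameters instead of the old DSL builders.
    QueryResult result = cluster.query(
            "SELECT t.* FROM `travel-sample` t WHERE t.type = $type LIMIT 10",
            QueryOptions.queryOptions()
                    .parameters(JsonObject.create().put("type", "airline")));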


W.r.t. your problems with Reactor - can you show me the 2.x equivalent that worked? I can help you come up with the 3.x equivalent with the same semantics (modulo certain things that RxJava accepted but Reactor is a little stricter about, though we might be able to work around those as well).

If you have a standalone sample I can run with what is expected and what should happen, I can work with that too.

1 Like

I will create one. Should be pretty trivial.

The existing code (with our internal underlying implementations):

        // Map to collect exceptions during async inserts, use ConcurrentHashMap since it needs to be thread-safe
        Map<K, OperationStatus> upsertExceptions = new ConcurrentHashMap<>();

        // See: http://docs.couchbase.com/developer/java-2.1/documents-bulk.html
        Observable
                .from(documents.entrySet())
                .flatMap(entry -> getCouchProvider().getBucket().async()
                        .upsert(entry.getValue(), getCouchProvider().getDefaultPersistTo())
                        .doOnError(e -> {
                            if (LOG.isDebugEnabled()) {
                                LOG.debug("Could not upsert document with Id: " + entry.getKey(), (Throwable) e);
                            }
                            upsertExceptions.put(entry.getKey(), new OperationStatus((Throwable) e));
                        })
                        .onErrorResumeNext(Observable.<Document>empty()))
                .toList()
                .toBlocking()
                .single();

Still need to put together examples…

@unhuman In your example I’m not sure you even need the list creation at the end, unless you are actually doing something with the result in your real-world code. If the list goes away the stream might be empty, so you’d need to switch to singleOrDefault(null) or similar.

Otherwise I think you can pretty much translate those operators to Reactor 1:1:

Flux
  .fromIterable(data.entrySet())
  .flatMap(entry -> collection
    .reactive()
    .upsert(entry.getKey(), entry.getValue())
    .onErrorResume(throwable -> {
      System.err.println("Got error: " + throwable);
      // write to your map here
      return Mono.empty();
    }))
  .blockLast();

You can fold the logging into the onErrorResume as well, I think, as in my example above.

Yep, for our batch processing, we handle each response.

This isn’t as bad as I first thought.

But… I can highlight a problem still.

Attached, please find a zip file containing 2 different versions of my test code. I don’t necessarily think this is a v2 vs v3 problem, but… there’s still some funny business in v3.

CouchbaseV2V3BulkComparison.zip (16.0 KB)

Now, the problem is caused by the key being too long. In 2.x one cannot even create a document with a key that is too long, so the problem never reaches the upsert; this example really couldn’t happen with v2.

If you look at the code (as provided) for the 3.x version, it fails all 3 documents:

class com.couchbase.client.core.error.ReplicaNotConfiguredException Not enough replicas configured on the bucket: Could not upsert document with Id: U:king_arthur_jr
class com.couchbase.client.core.error.CouchbaseException UpsertRequest failed with unexpected status code: Could not upsert document with Id: U:king_arthur_jr1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
class com.couchbase.client.core.error.RequestCanceledException UpsertRequest: Could not upsert document with Id: U:king_arthur
Key: U:king_arthur_jr Success: false
Key: U:king_arthur_jr1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 Success: false
Key: U:king_arthur Success: false

If the arthur_baby entity with the very long key is commented out (lines 48-54), then it works as expected:

class com.couchbase.client.core.error.ReplicaNotConfiguredException Not enough replicas configured on the bucket: Could not upsert document with Id: U:king_arthur_jr
Key: U:king_arthur_jr Success: false
Key: U:king_arthur Success: true

But I wouldn’t expect the failed UpsertRequest (the com.couchbase.client.core.error.CouchbaseException above) to force-cancel the “good” request with a RequestCanceledException.

EDIT: one more issue… If you add a null-keyed value to the map, it totally breaks everything… Something like:

        entities.put(null, arthur);

Again, this probably falls into a similar category (nobody should ever do this), but the code just flat-out fails in that case.

Hi @unhuman,

This is a bit of a guess, but when I’ve seen this pattern in my own testing (a failed request followed by a RequestCanceledException on the next), it’s usually because the first failure has caused the KV node we’re connected to to drop the connection. It can do this for safety reasons, e.g. if it believes it may have corrupted data and can no longer trust the framing. It takes a little time for the client to recreate the connection, which is why the RequestCanceledException happens next.
You can check this by looking at the KV logs (/opt/couchbase/var/lib/couchbase/logs/memcached.log.0000*.txt if on Linux). The KV service is usually good at logging why it’s disconnected - which, from the above, looks like it’s related to the large key.
I’ve opened two tickets to investigate the large key and null key issues further.
https://issues.couchbase.com/browse/JCBC-1586
https://issues.couchbase.com/browse/JVMCBC-815

1 Like

@unhuman the too-long-key validation is fixed here: http://review.couchbase.org/#/c/121860/ so that will make it into 3.0.2.

The other issue I could not reproduce in isolation; there must be something going on in your code. Can you try a collection.upsert(null, null) or collection.upsert(null, value) in isolation? It should throw an invalid argument exception (same for the reactive equivalent).
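
Something like this rough sketch (reusing whatever collection you already have open; what gets printed should be the invalid-argument rejection rather than a hang or cancellation):

    // Blocking API: a null id should be rejected up front.
    try {
        collection.upsert(null, JsonObject.create());
    } catch (Exception e) {
        System.err.println("Blocking upsert rejected null id: " + e);
    }

    // Reactive API: the same rejection should surface as an error signal.
    collection.reactive()
        .upsert(null, JsonObject.create())
        .onErrorResume(e -> {
            System.err.println("Reactive upsert rejected null id: " + e);
            return Mono.empty();
        })
        .block();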

1 Like

It’s 100% reproducible for me here. I’ll update again with an example.

Here you go. Couchbase3xBulkInsert.v2.zip (8.2 KB)

All you have to do is run it… and you’ll see the failure caused by the null key:

Exception in thread "main" java.lang.NullPointerException: The iterator returned a null value
	at java.base/java.util.Objects.requireNonNull(Objects.java:246)
	at reactor.core.publisher.FluxIterable$IterableSubscription.slowPath(FluxIterable.java:230)
	at reactor.core.publisher.FluxIterable$IterableSubscription.request(FluxIterable.java:201)
	at reactor.core.publisher.FluxFlatMap$FlatMapMain.onSubscribe(FluxFlatMap.java:363)
	at reactor.core.publisher.FluxIterable.subscribe(FluxIterable.java:139)
	at reactor.core.publisher.FluxIterable.subscribe(FluxIterable.java:63)
	at reactor.core.publisher.Flux.subscribe(Flux.java:8143)
	at reactor.core.publisher.Flux.blockLast(Flux.java:2481)
	at Couchbase3xBulkInsert.main(Couchbase3xBulkInsert.java:82)
	Suppressed: java.lang.Exception: #block terminated with an error
		at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:93)
		at reactor.core.publisher.Flux.blockLast(Flux.java:2482)
		... 1 more

Again, this is really bad (what we’re doing) - a null document key - but it was something we had tested for previously… although I think in 2.x it broke the creation of the document instead.

1 Like

@graham.pople @david.nault @daschl

Still working on our port, and I’m down to just a few cases (and subdocuments).

Looking at https://issues.couchbase.com/browse/JVMCBC-815, it seems the long-key case has been fixed to some extent, but I’m still running into a situation where a long key in a bulk operation fails (some of) the other operations. I’ve come up with an example that bulk-sets a map of entities, and it looks like we wind up with:

com.couchbase.client.core.error.RequestCanceledException: UpsertRequest {"cancelled":true,"completed":true,"coreId":1,"idempotent":false,"lastChannelId":"0000000000000001/0000000004AD3998","lastDispatchedFrom":"127.0.0.1:53954","lastDispatchedTo":"127.0.0.1:11210","reason":"NO_MORE_RETRIES (CHANNEL_CLOSED_WHILE_IN_FLIGHT)","requestId":99,"requestType":"UpsertRequest","retried":0,"service":{"bucket":"data1","collection":"_default","documentId":"Local::DropwizardCouchbaseTest::couchbaseTestEntity::testBulkSetKey2","opaque":"0x73","scope":"_default","syncDurability":{"empty":true,"present":false},"type":"kv"},"timeoutMs":30000000,"timings":{"encodingMicros":16870,"totalMicros":21217062}}

Was this missed?

Also, in my tests (which wrap the Couchbase client), I have seen cases where inserting the same item sometimes returns INVAL and other times CANCELLED. This seems to be something of a race condition: it works fine when run independently or with one group of tests, but when run with our entire suite it sometimes fails.

This one is trickier since it’s inconsistent.

@unhuman I’m going to look at your zip file in a bit, but

NO_MORE_RETRIES (CHANNEL_CLOSED_WHILE_IN_FLIGHT)

means that we could not retry the upsert because it was in flight on the socket and the socket closed. The logs should have more info on why the socket closed.

1 Like

@unhuman your code fails because of the Reactor contract, not the SDK. This throws the same exception; Reactor does not accept null values on the stream:

        List<String> it = new ArrayList<>();
        it.add(null);
        Flux.fromIterable(it).subscribe();
1 Like

@graham.pople @daschl
Yep, it looks like the bad key length is forcing Couchbase to close the connection. I was able to prevent some of the problems in our test cleanup by skipping the deletion of keys that are too long, since attempting that delete forces a reconnect.

However, that means long-key handling is not fully covered by the client fixes. This is what I wind up seeing in the logs (the guard we added to our cleanup is sketched after the log):

2020-03-11T13:05:42.220854-04:00 INFO 77: HELO [{"a":"java/3.0.0 (Mac OS X 10.15.3 x86_64; OpenJDK 64-Bit Server VM 11.0.6+10)","i":"0000000000000001/00000000A9A2446A"}] Mutation seqno, XATTR, XERROR, Select bucket, Snappy, Tracing, AltRequestSupport, SyncReplication [ 127.0.0.1:61138 - 127.0.0.1:11210 (not authenticated) ]
2020-03-11T13:05:42.414251-04:00 INFO 77: Client 127.0.0.1:61138 authenticated as <ud>data1</ud>
2020-03-11T13:05:42.414250-04:00 INFO 71: Client 127.0.0.1:61132 authenticated as <ud>data1</ud>
2020-03-11T13:05:42.598797-04:00 WARNING 77: Invalid format specified for "SET" - Status: "Invalid arguments" - Closing connection. Packet:[{"bodylen":538,"cas":0,"datatype":["Snappy"],"extlen":8,"keylen":347,"magic":"ClientRequest","opaque":318767104,"opcode":"SET","vbucket":5}] Reason:"Key length exceeds 250"
2020-03-11T13:05:42.612539-04:00 INFO 77: HELO [{"a":"java/3.0.0 (Mac OS X 10.15.3 x86_64; OpenJDK 64-Bit Server VM 11.0.6+10)","i":"0000000000000001/00000000AF5EE900"}] Mutation seqno, XATTR, XERROR, Select bucket, Snappy, Tracing, AltRequestSupport, SyncReplication [ 127.0.0.1:61140 - 127.0.0.1:11210 (not authenticated) ]
2020-03-11T13:05:42.621316-04:00 INFO 77: Client 127.0.0.1:61140 authenticated as <ud>data1</ud>

So, that explanation allows me to move forward. I’d still love to be able to handle this without the failure, but it at least makes sense, and it’s a very unlikely scenario that only required a change to our tests.
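
The guard is roughly this (a sketch; the 250 limit comes from the “Key length exceeds 250” message above, and the helper is our own):

    // Sketch: skip cleanup removals for keys the KV service would reject,
    // since sending them causes the server to close the connection.
    private static final int MAX_KEY_LENGTH = 250;

    void safeRemove(Collection collection, String key) {
        if (key == null || key.length() > MAX_KEY_LENGTH) {
            LOG.debug("Skipping removal of invalid key: " + key);
            return;
        }
        collection.remove(key);
    }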

Subdocument conversion wasn’t bad.

This documentation:
https://docs.couchbase.com/java-sdk/3.0/concept-docs/xattr.html
appears to be wrong. Particularly:

bucket.lookupIn(key).get("$document.exptime", new SubdocOptionsBuilder().xattr(true)).execute()

I’m still working through it, but that’s not correct.
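
For comparison, this is what I believe the SDK 3 shape is (a sketch I’m still verifying, not copied from the docs):

    import java.util.Collections;
    import com.couchbase.client.java.kv.LookupInResult;
    import com.couchbase.client.java.kv.LookupInSpec;

    // Read the virtual $document.exptime xattr via a sub-document lookup.
    LookupInResult result = collection.lookupIn(key,
            Collections.singletonList(LookupInSpec.get("$document.exptime").xattr()));
    long exptime = result.contentAs(0, Long.class);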

getAndLock() behaves very differently now. Before, if you couldn’t get the lock, it would throw a TemporaryLockFailureException. Now, however, it waits for the lock to be released. Was this intentional? Why?

I’ll defer to @daschl and others, but I believe it’s in keeping with the general idea that in SDK 3 we retry automatically where possible. If you want to handle retries yourself, you can switch to fail-fast behaviour with the FailFastRetryStrategy.
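
A sketch of what that might look like (assuming the usual per-request options plumbing; untested):

    import java.time.Duration;
    import com.couchbase.client.core.retry.FailFastRetryStrategy;
    import com.couchbase.client.java.kv.GetAndLockOptions;
    import com.couchbase.client.java.kv.GetResult;

    // Try to take the lock, but fail immediately instead of retrying
    // until the existing lock is released.
    GetResult locked = collection.getAndLock("someKey", Duration.ofSeconds(15),
            GetAndLockOptions.getAndLockOptions()
                    .retryStrategy(FailFastRetryStrategy.INSTANCE));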

1 Like