Performance concerns with USE KEYS?

We have some older code that uses an old version of the Couchbase client, which performs bulk operations directly by calling into the spymemcached library. In the newer releases of CB, it seems like that has all gone away, and now our wrapper lib does what appears to be the prescribed modern thing: it creates an Observable that bundles up a bunch of individual REST GETs.

This works fine for small datasets, but we have recently encountered a use case where we end up fetching documents numbering in the thousands, and the overhead of doing this is prohibitively high.

After doing some digging, I uncovered the USE KEYS clause of the N1QL SELECT statement. Rudimentary experimentation suggests this is a far faster method of bulk-fetching a lot of documents by key, but our CB admins have reservations about it, having not encountered it before. Can anyone speak to the performance characteristics of USE KEYS as a way of fetching large numbers of records? Any pitfalls, particularly as it pertains to clusters with several nodes?

Edit: should probably also have said, if there are better approaches to bulk fetches of documents on modern Couchbase architecture, I would appreciate a pointer in that direction.

If you already have the keys and don’t need any post-processing, do the Data Node (KV) fetch directly in your SDK instead of going through N1QL. cc @ingenthr

To clarify, I misspoke when I said we were doing batches of REST calls – we are, in fact, using an Observable to bundle up AsyncBucket::get calls, as advised under Bulk Operations. So I believe we already are doing KV. It’s just unacceptably slow under the upgraded CB client right now. It’s possible there’s some other factor at play here.
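For illustration, this is roughly the shape of that pattern — a minimal sketch of the documented Bulk Operations approach, assuming Java SDK 2.x and RxJava 1.x; the class and method names here are placeholders, not our actual wrapper:

```java
import com.couchbase.client.java.AsyncBucket;
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import rx.Observable;

import java.util.Collection;
import java.util.List;

public class BulkFetch {

    // Fan out one async KV get per key and collect the results into a list.
    // Documents that don't exist simply don't appear in the result.
    public static List<JsonDocument> fetchAll(Bucket bucket, Collection<String> keys) {
        AsyncBucket async = bucket.async();
        return Observable
                .from(keys)
                .flatMap(async::get)   // individual KV gets, pipelined by the SDK
                .toList()
                .toBlocking()
                .single();
    }
}
```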

Hello, I also have a couple of questions around optimizing USE KEYS.
Does it need to be backed by a primary index in Couchbase? How is USE KEYS different from a KV fetch?

We are planning to rely on USE KEYS to bulk fetch multiple documents and apply selectors on top of them.

Does it put any workload on the Index nodes?

My understanding is that the following query, select * from sample USE KEYS ["FooBar"], will only need resources from the Query nodes and Data nodes.

Please correct me if I'm wrong.

@tkeating, which version of CB are you referring to here?

USE KEYS does not require any index. The Query service skips indexes and fetches the documents directly from the Data nodes. (A KV fetch means a Data Node fetch.)

Couchbase server 5.1.0, with version 2.5.6 of the Java client.

Are you doing any projection/transformation in your query? If not, and you have the keys, it will be more efficient to .get(). Even if you’re not using all of the fields, when the items are small it may be more efficient for the system to just do a simple fetch.

Of course, this is if you want to optimize the ‘last mile’ so to speak.


Nope, just straight up fetching all the docs. It’s a near-textbook implementation of the call from the Bulk Operations page I linked above.

Now mind you, there are a large-ish number of them (~4k). The docs themselves are pretty small – under 512B, would be my guess. But during testing for a release that included a CB client upgrade, we encountered some broken automated tests due to timeouts getting hit. We traced this back to a series of these bulk fetches, which were getting called multiple times, and were now taking vastly longer than they had under the older CB client.

Where this whole thread came from was me taking this (partial) test dataset and dropping it into a local instance of Couchbase, then running a series of tests against it. Bulk fetching 3700 documents locally consistently takes around 700ms. However, taking the same set of keys and building one giant N1QL query string with all of them in a USE KEYS clause takes about 70ms. A 10x speedup going from KV to N1QL is not consistent with what I’m hearing from anybody who knows anything about Couchbase. So it seems like we’re doing something wrong somewhere, though it’s eluding me what it could be, or how to figure it out.
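For comparison, the N1QL side of the test looks roughly like this — a sketch against Java SDK 2.x that passes the keys as a named parameter rather than splicing them into a statement string (this assumes USE KEYS accepts a parameter here; in my actual test I just built the giant statement string by hand):

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.json.JsonArray;
import com.couchbase.client.java.document.json.JsonObject;
import com.couchbase.client.java.query.N1qlQuery;
import com.couchbase.client.java.query.N1qlQueryResult;

import java.util.Collection;

public class UseKeysFetch {

    // Fetch the same set of documents through the Query service with USE KEYS,
    // which skips indexes entirely and goes straight to the Data nodes.
    public static N1qlQueryResult fetchAll(Bucket bucket, Collection<String> keys) {
        String statement = "SELECT * FROM `" + bucket.name() + "` USE KEYS $keys";
        JsonObject params = JsonObject.create()
                .put("keys", JsonArray.from(keys.toArray()));
        return bucket.query(N1qlQuery.parameterized(statement, params));
    }
}
```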

As far as N1QL is concerned, it may be using all available cores and fetching all the keys in parallel, since you are giving it all the keys at once in a single query.

A couple of thoughts…

  • Is the environment always the same?
  • Is it consistently 700ms? Java isn’t known for speedy startup time on server-side HotSpot. All of that JITting and early GC activity might slow it down if it’s just one invocation.
  • Maybe turn up the log level and see whether the workload is spread out evenly, or whether it has a dropout somewhere?

Can you gist a snippet of code?

Sorry for the trouble, but we can probably figure it out.

Can you please share how your Java client is configured? How many KV nodes do you have in prod and/or testing? Under the covers, N1QL opens many connections to KV in parallel, while by default the SDK opens only one per node. This can all be configured and tuned (as a first step you can play around with the kvEndpoints/kvServiceConfig settings).

Alternatively you can just dump the info here that the SDK logs at startup so we can look at the config
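For example, something like this — a minimal sketch against Java SDK 2.x; the hostname, bucket name, and endpoint count are placeholders to tune for your own environment:

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public class TunedConnection {

    public static Bucket open() {
        // Open more than one KV socket per node (the default is 1) so large
        // bulk fetches can be pipelined over several connections in parallel.
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
                .kvEndpoints(4)   // placeholder value; raise or lower and re-measure
                .build();

        CouchbaseCluster cluster = CouchbaseCluster.create(env, "localhost");
        return cluster.openBucket("sample");
    }
}
```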

The environment is the same. Bear in mind, this is a local test using a copy of production data to try and identify the cause of an observed slowdown. I plan to run similar tests against a more production-like environment today to see whether similar results may be observed.

It is consistently longer on first run by about 20%, which I attribute (with no evidence whatsoever, mind) to Couchbase warming up caches. I generally run my tests ten times, and the 700ms for KV, 70ms for N1QL is a rough average.

This sounds like an interesting candidate for a culprit. If we haven’t configured the updated client lib to allow for multiple threads, then it seems like using the async/observable bulk fetch approach wouldn’t provide a lot in the way of benefit.

If today’s testing doesn’t reveal any kind of smoking gun I’ll work with our CB guys to pull more detailed log data to post here, since we’re getting a bit out of my comfort zone.

I did want to say – thanks for all the feedback, everybody. Pretty much a CB n00b, so this has been very educational!

Actually, it would, in that we pipeline operations, so it’s generally very efficient. The client continues to send requests as the cluster sends responses in parallel. Also, it’s less about threading: there are multiple threads, but the number isn’t terribly important here since they’re performing asynchronous IO and don’t block.

The SDK is usually deployed in environments with lots of application servers (we’ve seen > 15,000 in production deployments), so out of the box it won’t try to create lots of highways to pipeline over. There are some good academic papers demonstrating that many TCP connections deliver higher aggregate throughput. In theory, if you’re pipelining, it shouldn’t matter. In practice, many software systems (OS, apps, libraries) allocate a small buffer per connection, and if you’re after aggregate throughput, more buffering usually helps.

Since it sounds like it’s on a local env and easy to test, maybe just turn up kvEndpoints as @daschl points out and see if there’s a difference?

This also makes me wonder: does this approximate something you’d do in production? Totally happy to help in any case, but just thought I’d ask in case we need to consider how prod would look compared to the test rig.

Brief final summary of this issue:

Increasing kvEndpoints made a notable improvement in the performance of a full-document bulk fetch using the prescribed method (using RxJava). For local tests with about 3k documents, I saw fetch times drop from about 10s to roughly half that. Still slower than the USE KEYS N1QL query, but getting manageable.

I moved my testing into our staging cluster, which is a close analog to production, with multiple nodes. Running my service locally, with a test of about 7K documents, it was taking 15s to load in bulk, but USE KEYS was actually WORSE, at around 21 sec. I surmise that on a local system, the cost of transmitting the query is greatly reduced, using some OS fakery to bypass actually using sockets, but that’s only a guess as to why it is so absurdly cheap locally.

Increasing kvEndpoints gradually showed marked improvement in this test scenario as well, with diminishing returns. The bulk fetch that took 15s with 1 connection/node dropped to 4-6s (saw a lot of variance in this scenario) with 6. This seems like a practical max, at least for my particular environment – increasing beyond this point didn’t show additional performance improvements, and at 12 kvEndpoints it actually hung the service. (My service connects to 18 nodes, so that’s 216 open sockets, which seems like a lot!)

However, more kvEndpoints also improved the performance of N1QL with USE KEYS, to the point where it was very comparable to bulk fetch.

The final round of testing was to deploy my service to our staging environment and make those calls from there (so the calls were not being made across a VPN). Both KV and N1QL saw marked improvement with a modest increase in kvEndpoints (I think I inadvertently deployed with 8, but realistically, I didn’t see a whole lot of improvement beyond about 4). This took that 7k document fetch down to about 3400ms, which is workable for us.

At this point, I’m not planning on doing a whole lot more digging. It seems reasonable to conclude that for our use case, a modest increase in kvEndpoints does show a healthy increase in performance. Take that with a grain of salt, though, as all the testing was done by 1 guy in a fairly ad-hoc manner. It may or may not be a broadly useful metric.

Thanks for the update @tkeating; we’ll probably have a further look into this but glad you’re unblocked!