N1QL parallel queries return random empty or incomplete results

I have the problem that when a query is run in parallel one of them returns empty results.
Initially I found that out when testing in the browser and pressing F5 / refresh 2 or more times very fast afteranother.

I could reproduce it also in Java code with a very simple query:

select meta(beer-sample).id frombeer-sample`

My environment:
Couchbase Server 4.6. DP
Java SDK Client: 2.4.0 (downloaded yesterday)
(exact same behaviour with Java SDK Client 2.3.5)

Here is my simplistic test code:

public class Testcase {

	private static DateTimeFormatter fmt = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.SSS");

	public static void main(String[] args) {

		CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
				// autoReleaseAfter is to avoid "The content of this Observable
				// is already released." errors
				// see
				// https://www.couchbase.com/forums/t/n1ql-query-with-adhoc-false-query-runs-into-illegalstateexception-the-content-of-this-observable-is-already-released/11004/4?u=synesty
				.autoreleaseAfter(50000).queryTimeout(10000).retryStrategy(FailFastRetryStrategy.INSTANCE)
				.reconnectDelay(Delay.fixed(30, TimeUnit.SECONDS)).build();

		CouchbaseCluster cluster = CouchbaseCluster.create(env, "127.0.0.1");
		Bucket bucket = cluster.openBucket("beer-sample");

		String q = "select meta(`beer-sample`).id from `beer-sample`  ";
		
		Map<String, String> values = new HashMap<String, String>();
		JsonObject params = JsonObject.from(values);
		N1qlParams n1qlParams = N1qlParams.build();
		N1qlQuery query = N1qlQuery.parameterized(q, params, n1qlParams);

		N1qlQueryResult r1 = bucket.query(query);
		N1qlQueryResult r2 = bucket.query(query);

		System.out.println(fmt.print(new DateTime()) + " - sync result 1: " + r1.allRows().size());
		System.out.println(fmt.print(new DateTime()) + " - sync result 2: " + r2.allRows().size());

		Observable.just(q, q)
				.flatMap(q1 -> Observable.just(q1).subscribeOn(Schedulers.io()).map(q2 -> runQuery(bucket, q2)), 4)
				.toBlocking().subscribe();

		bucket.close();
		env.shutdown();

	}

	private static Object runQuery(Bucket bucket, String q) {
		

		Map<String, String> values = new HashMap<String, String>();
		JsonObject params = JsonObject.from(values);
		N1qlParams n1qlParams = N1qlParams.build();
		N1qlQuery query = N1qlQuery.parameterized(q, params, n1qlParams);

		N1qlQueryResult r = bucket.query(query);
		int numRows = r.allRows().size();
		System.out.println(fmt.print(new DateTime()) + " - async result:  " + numRows);
		if (numRows == 0) {
			System.err.println("empty result. Errors: " + r.errors());
		}
		return null;
	}
}

Now I run this test program a couple of times:

  1. Run
    2017-01-09 08:02:04.495 - sync result 1: 7303
    2017-01-09 08:02:04.526 - sync result 2: 7303
    2017-01-09 08:02:04.540 - async result: 0
    empty result. Errors:
    2017-01-09 08:02:05.129 - async result: 7303

  2. Run
    2017-01-09 08:03:17.166 - sync result 1: 7303
    2017-01-09 08:03:17.201 - sync result 2: 7303
    2017-01-09 08:03:17.222 - async result: 0
    empty result. Errors:
    2017-01-09 08:03:18.055 - async result: 7303

Always the same. The sync query executions are fine, but from the async queries only one returns results.
Notice that the errors list is also empty, so I don’t know what’s going on.

Notice that the results in one of the async results start jumping up and down.
When you debug it set a breakpoint in the runQuery method in the line:
int numRows = r.allRows().size();

You will notice two running threads you your debug stack and in one of them r.allRows() is empty.

Now I started to bring some random sleep in each thread into the mix by adding the following code in the beginning of the runQuery method:

try {
			Thread.sleep(new Random().nextInt(1000));
		} catch (Exception e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}

Now it looks like this:

1. Run
2017-01-09 08:04:59.197 - sync result 1: 7303
2017-01-09 08:04:59.228 - sync result 2: 7303
2017-01-09 08:04:59.669 - async result:  1349
2017-01-09 08:05:00.341 - async result:  7303

2. Run
2017-01-09 08:05:21.288 - sync result 1: 7303
2017-01-09 08:05:21.322 - sync result 2: 7303
2017-01-09 08:05:22.115 - async result:  4061
2017-01-09 08:05:22.706 - async result:  7303

3. Run
2017-01-09 08:05:57.189 - sync result 1: 7303
2017-01-09 08:05:57.219 - sync result 2: 7303
2017-01-09 08:05:57.979 - async result:  3638
2017-01-09 08:05:58.569 - async result:  7303

Conclusion:
To me that indicates that the queries are blocking each other somehow.

Am I doing something wrong here?
Is this a known 4.6 DP problem?

Thanks
Christoph

Hi @synesty, can you please set adhoc to true and try this again? Thanks.

Same problem, even with adhoc=true

I had tried adhoc=true already when I upgrade to new Java SDK 2.4.0 two days ago (because I was the one reporting and waiting for https://issues.couchbase.com/browse/JCBC-1029).

N1qlParams n1qlParams = N1qlParams.build().adhoc(true);
N1qlQuery query = N1qlQuery.parameterized(q, params, n1qlParams);

2017-01-09 18:31:32.427 - sync result 1: 7303
2017-01-09 18:31:32.461 - sync result 2: 7303
2017-01-09 18:31:32.488 - async result:  0
empty result. Errors: []
2017-01-09 18:31:33.375 - async result:  7303

Hi @synesty,

Can you enable trace logging. You should be able to see the request and response from the query service in the logs. That should help us debug more. Thanks.

hi @subhashni
See attached my trace logging which I enabled like this:

private final static Logger logger = Logger.getLogger("com.couchbase.client");
logger.setLevel(Level.FINEST);
		try {
	        FileHandler handler = new FileHandler("couchbaseTRACE.log", true);
	        logger.addHandler(handler);
	    } catch (IOException e) {
	        throw new IllegalStateException("Could not add file handler.", e);
	    }
for(Handler h : logger.getParent().getHandlers()) {
		    if(h instanceof ConsoleHandler){
		        h.setLevel(Level.FINEST);
		    }
		}

The previous System.out.println logging statements above are logged in INFO level in the logfile.
Search for the string async result in the logfiles to find those places.

couchbaseTRACE.log.zip (1.5 MB)

@synesty the query service replied with a closed status.
I think this is related to https://issues.couchbase.com/browse/MB-22184. Could you experiment with increasing the query endpoints (it can be changed on CouchbaseEnvironment)

[id: 0x3c5cab57, L:/127.0.0.1:56486 - R:localhost/127.0.0.1:8093] RECEIVED: 429B +-------------------------------------------------+ | 0 1 2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 48 54 54 50 2f 31 2e 31 20 32 30 30 20 4f 4b 0d |HTTP/1.1 200 OK.| |00000010| 0a 43 6f 6e 74 65 6e 74 2d 4c 65 6e 67 74 68 3a |.Content-Length:| |00000020| 20 33 30 35 0d 0a 43 6f 6e 74 65 6e 74 2d 54 79 | 305..Content-Ty| |00000030| 70 65 3a 20 61 70 70 6c 69 63 61 74 69 6f 6e 2f |pe: application/| |00000040| 6a 73 6f 6e 3b 20 76 65 72 73 69 6f 6e 3d 31 2e |json; version=1.| |00000050| 36 2e 30 0d 0a 44 61 74 65 3a 20 4d 6f 6e 2c 20 |6.0..Date: Mon, | |00000060| 30 39 20 4a 61 6e 20 32 30 31 37 20 32 32 3a 31 |09 Jan 2017 22:1| |00000070| 38 3a 31 31 20 47 4d 54 0d 0a 0d 0a 7b 0a 20 20 |8:11 GMT....{. | |00000080| 20 20 22 72 65 71 75 65 73 74 49 44 22 3a 20 22 | "requestID": "| |00000090| 62 63 66 37 33 34 37 64 2d 30 33 64 66 2d 34 38 |bcf7347d-03df-48| |000000a0| 38 37 2d 38 63 32 38 2d 66 65 37 34 36 66 66 66 |87-8c28-fe746fff| |000000b0| 61 36 32 34 22 2c 0a 20 20 20 20 22 73 69 67 6e |a624",. "sign| |000000c0| 61 74 75 72 65 22 3a 20 7b 0a 20 20 20 20 20 20 |ature": {. | |000000d0| 20 20 22 69 64 22 3a 20 22 6a 73 6f 6e 22 0a 20 | "id": "json". | |000000e0| 20 20 20 7d 2c 0a 20 20 20 20 22 72 65 73 75 6c | },. "resul| |000000f0| 74 73 22 3a 20 5b 0a 20 20 20 20 5d 2c 0a 20 20 |ts": [. ],. | |00000100| 20 20 22 73 74 61 74 75 73 22 3a 20 22 63 6c 6f | "status": "clo| |00000110| 73 65 64 22 2c 0a 20 20 20 20 22 6d 65 74 72 69 |sed",. "metri| |00000120| 63 73 22 3a 20 7b 0a 20 20 20 20 20 20 20 20 22 |cs": {. "| |00000130| 65 6c 61 70 73 65 64 54 69 6d 65 22 3a 20 22 31 |elapsedTime": "1| |00000140| 2e 34 38 35 32 39 39 6d 73 22 2c 0a 20 20 20 20 |.485299ms",. | |00000150| 20 20 20 20 22 65 78 65 63 75 74 69 6f 6e 54 69 | "executionTi| |00000160| 6d 65 22 3a 20 22 31 2e 34 34 32 32 39 38 6d 73 |me": "1.442298ms| |00000170| 22 2c 0a 20 20 20 20 20 20 20 20 22 72 65 73 75 |",. "resu| |00000180| 6c 74 43 6f 75 6e 74 22 3a 20 30 2c 0a 20 20 20 |ltCount": 0,. | |00000190| 20 20 20 20 20 22 72 65 73 75 6c 74 53 69 7a 65 | "resultSize| |000001a0| 22 3a 20 30 0a 20 20 20 20 7d 0a 7d 0a |": 0. }.}. | +--------+-------------------------------------------------+----------------+

@subhashni yes increasing the queryEndpoints() of the CouchbaseEnvironment helps.

e.g.

CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder().queryEndpoints(4).....

default value for queryEndpoints=1 (seems to be)

My observation: queryEndpoints has to be equal to the number of parallel threads you are using to make N1ql queries

Examples:

  • queryEndpoints=1 (default), numberOfThreads=2 (Observable.just(q, q)…) -> 1 ERROR and 1 SUCCESS (like in my initial example)
  • e.g. queryEndpoints=1 (default), numberOfThreads=4 (Observable.just(q, q,q,q)…) -> 3 ERROR and 1 SUCCESS
  • queryEndpoints=4, numberOfThreads=4 (Observable.just(q, q,q,q)…) -> 4 SUCCESS
  • queryEndpoints=4, numberOfThreads=6 (Observable.just(q, q,q,q,q,q)…) -> 2 ERROR and 4 SUCCESS

Questions:

  • so is this more a Java SDK Bug / problem then?
  • is there any recommendation / documentation on how large/small to set the value for queryEndpoints? I haven’t found yet…

Thanks
Christoph

Actually, I think you might be hitting https://issues.couchbase.com/browse/MB-22184 . The reason is that something is messed up if the java client pipelines requests - which you implicitly avoid when you have the number of query endpoints per threads, since then never more than 1 request is pending on a socket.

Ok, thanks @daschl . I just go ahead trying to set this to… let’s say the number of threads in our AppServer-Threadpool. Because that is actually the number of parallel user requests e.g. hitting a webpage doing the query on a single appserver. It basically corresponds to the number of parallel users / requests / threads you try to serve.

Do you have any plans adding a section to the documentation or will this go away and works with queryEndpoints=1 (or whatever the default will be) once https://issues.couchbase.com/browse/MB-22184 is solved?

Yes, one way or another (either in the server or the client) this will be addressed and is expected to work fine with the default settings of queryEndpoints=1.

If you have a large number of app server threadpool threads and you don’t want that many sockets you can also temporarily put N1Ql queries behind a semaphore of max your configured socket size which would have the same effect but you can decouple the app threads from the number of sockets.

Ok, I guess a fix should also address the problem, how the server or client reacts on this problem. Because at the moment you get an empty response without any errors. There should be errors or exceptions when this happens.

On that note:
Would checking the N1qlQueryResult.status() for “closed” be a way to recognize that error (at least for the time being)?
e.g.

N1qlQueryResult result = bucket.query(query);
if("closed".equals(result.status()){
 throw IllegalStateException("Looks like we got Couchbase Error MB-22184.... ");
}

From reading the bug-entry and extending my testcase to output the status() it seems that “closed” could be a way to at least recognize it in our code and throw an error instead of return empty results to the user (because empty result could lead to wrong effects in other code, because it is not the truth).

@synesty there shouldn’t be an error in the first place, ideally the server does the proper thing on pipelining.

Checking on the status() field is not a good idea, since we have a streaming parser and the onNext of the AsyncN1QlQueryResult is already signalled. In theory we could add more logic for the sync variant you are using (which is placed on top) but then we’d have different semantics for the async and sync codes of n1ql responses which is unituitive. So I think the real way to address this is fix an issue on the server side (or if we find the client is doing something weird fix it there) or in the worst case disable pipelining for the short term internally completely.

@synesty we’ve boiled it down to go not being able to pipeline in http 1.1, so we went ahead and disabled pipelining against http endpoints for the java SDK. It will be part of the next 2.4.1 release, or you can build it from source if you want to. (https://github.com/couchbase/couchbase-jvm-core/commit/711afb196388f7bf9f4f3ee97695e0e74f0800a7 if you are curious)

1 Like

Thanks for the heads up.

Does this impact performance and if yes, how?
Do I now have to increase queryEndpoints to get more parallelism?
Or am I thinking in the wrong direction?