Intermittent failure of couchbase node API for fnFtsQuery


#1

First post, because couchbase has been working really well so far, and this is the first issue that I’m stumped on. I really do appreciate any insights or help that might come from this.

Setup:
Couchbase “version”: "4.0.0-4051-community"
OSX 10.11.6
docker version 1.12.0-a build 11213

Webserver is node 6.4.0 node_modules couchbase npm installation is 2.2.2

Situation: Running almost 800 unit tests with mocha against ottoman models.
Every 2 or 3 or 4 times through the test suite, instead of success, a hard stop without a stack trace comes from the internals.

events.js:160
      throw er; // Unhandled 'error' event
      ^
SyntaxError: Unexpected number in JSON at position 4
  at parse (native)

Importantly, this unit test is exercising code to look for a record that doesn’t exist. I’ve added a bunch of debugging statements to ottoman (up to this level everything looks ok, ottoman has the right data) and then traced this problem down the stack to the couchbase javascript library, and finally this code block:

Bucket.prototype._n1qlReq = function(host, q, adhoc, emitter) {
  var rows = [];
  console.log('this._cb.n1qlquery');
  this._cb.n1qlQuery(
      host, q, adhoc,
  function(errCode, val) {
    console.log('n1ql handler called');

The first log statement prints, the second one does not, which takes me down into the c code for this which seems to be this function in operations.cc

NAN_METHOD(CouchbaseImpl::fnN1qlQuery) {

So this is sort of no-man’s land for me and I’m not sure how to debug it further. In another life I’ve used C and C++ so I’ve got the concepts but haven’t done node integrated c programming. Before getting too crazy with that I figure it’s worth asking to see if there’s something obvious that someone will know straight away.

One last note… the token is always different. Examples:

Unexpected token a in JSON at position 1
Unexpected token , in JSON at position 39

Thanks so much for reading and any help on this!


#2

Some more information, managed to figure out how to rebuild and put some debug information in there. I’m getting to the point where lcb_n1ql_query in ./deps/lcb/src/n1ql/n1ql.cc returns LCB_SUCCESS so i’m kind of stumped as to what the error is. CouchbaseImpl::fnN1qlQuery Also seems to complete with return info.GetReturnValue().Set(true); so I’m a bit lost in finding this bug.


#3

Ok I think I’ve figured out where this is coming from. After 4 tests, one failed and produced some output from couchbase_impl.cc in the function n1qlrow_callback

    printf("datares from jsonParseLcl call\n");
    dataRes = jsonParseLcl->Call(Nan::GetCurrentContext()->Global(), 1, &metaStr);
    printf("dataRes toObject\n");

The first printf statement is shown, the second isn’t, which jives with the error about a failure to parse json. My guess is that sometimes couchbase server doesn’t send back json and this function blows up.

Still need to figure out how to emit the metaStr value (it’s a Handle type) and then how to protect the jsonParseLcl from being called in this case… Any help on this one is very welcome.


#4

Ok got some more debug. The string being sent to the parser has a bit of mangling on it either on the beginning or end. I strongly suspect some kind of memory problem or concatenation issue. It looks like metrics is also screwed up. Either way, the json parse fails and the exception isn’t being caught by the coucbase_impl.cc functions for any of the query row handlers before callbacks. I’m going to try to do some try catch action…

metaStr => c",
    "signature": {
        "id": "json"
    },
    "results": [
    ],
    "status": "success",
    "metrics": {
        "e],
    "status": "success",
    "metrics": {
        "elapsedTime": "14.278066ms",
        "executionTime": "14.184776ms",
        "resultCount": 0,
        "resultSize": 0
    }
}


events.js:160
      throw er; // Unhandled 'error' event
      ^
SyntaxError: Unexpected token c in JSON at position 0
  at parse (native)

#5

So I noticed that in the compiler rebuild output that Nan:TryCatch was a deprecated function, and when I used it my stack would lock up and only respond to a kill -9 non-maskable interrupt, so I replaced it with the new v8 hotness with an isolate, and my exception was caught and things kept rolling when I returned a null result set from this function. Notably all of these functions for callbacks would need similar treatment. Looks like some kind of stack overflow issue on the data in the response object…

Here’s my code with my workaround in the couchbase source file for couchbase_impl.cc (i put an audible bell in there because this is intermittent and I have loads of output…)

   Handle<Value> metaStr =
            Nan::New<String>((const char*)resp->row, (int)resp->nrow).ToLocalChecked();

    String::Utf8Value str(metaStr->ToString());
    printf("(cc)>>>> metaStr [%s]\n", (const char*)(*str));
    v8::Isolate* isolate = v8::Isolate::GetCurrent();
    v8::TryCatch tc(isolate);
    dataRes = jsonParseLcl->Call(Nan::GetCurrentContext()->Global(), 1, &metaStr);
    if( tc.HasCaught() ){
      printf("\a");
      printf("*****************************************Exception **********************************\n");
      dataRes = Nan::Null();
    }
    else{
      Local<Object> metaObj = dataRes->ToObject();
      if (!metaObj.IsEmpty()) {
        metaObj->Delete(Nan::New(CouchbaseImpl::resultsKey));
      }
    }

Here’s the output:

(cc)>>>> metaStr [son"
    },
    "results": [
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "74.68103ms",
        "exe],
    "status": "success",
    "metrics": {
        "elapsedTime": "74.68103ms",
        "executionTime": "74.579949ms",
        "resultCount": 0,
        "resultSize": 0
    }
}
]
*****************************************Exception **********************************
callback call

#6

It would appear that in both cases of failure, no results are actually received. It would be interesting indeed to actually poke and see the raw HTTP response resulting from this situation. The JSON parser in C once had issues with not handling 0-result situations, but that should have been long fixed.

It would seem to be a bug in the parsing code, perhaps?

If you dig in the C/C++ code, you’ll see that lcb_RESPN1QL has an htresp field… poke around there


#7

At this point it appears the http response is code 200 with an empty body, the following code block prints 200 and 0 => (null)

    printf("Got HTTP Status: %d\n", resp->htresp->htstatus);
    printf("Got paylod: %i => %s\n", (int)resp->htresp->nbody, resp->htresp->body);

But it appears this is always the case when the FINAL flag is set - i’m guessing that either that’s how it knows it’s the last part of the response to trigger the callback or that the htresp body is getting free’d and nulled whenever (const char*)resp->row, (int)resp->nrow are set by someone higher in the call stack…