N1QL query performance with large dataset

Hi,
I have been playing with N1QL for some time now, and it works great with small datasets. I just created a bucket and loaded it with about 8.5 M records on a 3-node cluster, with total data in RAM around 4 GB. The bucket is set to 7 GB per node.
I am trying to run some simple N1QL queries like "select count(*)" with WHERE clauses, expecting them to return counts of 1M+. But no matter how many times I run the query, N1QL seems to load all 8.5 M records from the #primary view every time, with calls like the following in the cbq console log:
_time="2015-03-26T15:19:31Z" _level="INFO" _msg="Request View: http://serverURL/pViews/_design/ddl%23primary/_view/%23primary?limit=1001&startkey=%5B160%2C%5B112%2C118%2C105%2C101%2C119%2C95%2C49%2C48%2C48%2C48%2C56%2C57%2C55%5D%5D&startkey_docid=pview_1000897"
My question is:
Is this behavior expected? If yes, is there any way to improve it, e.g. by changing the limit=1001 parameter? If this is not the expected behavior, what am I missing? Please help.

Hi Jagsaha,

this is expected behavior for a bucket that has a primary index as its only index. If you create an index on the attribute you are using in the WHERE clause, you should see better query performance. For example, if the query is:
SELECT count(*) FROM myBucket WHERE foo > 20
Then creating an index:
CREATE INDEX bidx1 ON myBucket(foo)
should give a more responsive query.
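To verify that the new index is actually being used, you can prefix the query with EXPLAIN and check that the plan shows an IndexScan on bidx1 rather than a scan of #primary (myBucket, foo, and bidx1 are just the placeholder names from the example above):

EXPLAIN SELECT count(*) FROM myBucket WHERE foo > 20;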

So I ran the CREATE INDEX command as:
CREATE INDEX pview_cnt_idx ON pViews(cnt);
Adding the view code it created for review, in case I missed something:

function (doc, meta) {
  // Only index JSON documents.
  if (meta.type != "json") return;

  // Encode a string as an array of UTF-8 byte values.
  var stringToUtf8Bytes = function (str) {
    var utf8 = unescape(encodeURIComponent(str));
    var bytes = [];
    for (var i = 0; i < utf8.length; ++i) {
      bytes.push(utf8.charCodeAt(i));
    }
    return bytes;
  };

  // Tag each value with a numeric type prefix so that keys of
  // different types collate in a well-defined order.
  var indexFormattedValue = function (val) {
    if (val === null) {
      return [64];
    } else if (typeof val == "boolean") {
      return [96, val];
    } else if (typeof val == "number") {
      return [128, val];
    } else if (typeof val == "string") {
      return [160, stringToUtf8Bytes(val)];
    } else if (typeof val == "object") {
      if (val instanceof Array) {
        return [192, val];
      } else {
        // Objects: emit sorted keys paired with their encoded values.
        var innerKeys = [];
        for (var k in val) {
          innerKeys.push(k);
        }
        innerKeys.sort();
        var innerVals = [];
        for (var i in innerKeys) {
          innerVals.push(indexFormattedValue(val[innerKeys[i]]));
        }
        return [224, [innerKeys, innerVals]];
      }
    } else {
      return undefined;
    }
  };

  var key1 = indexFormattedValue(doc.cnt);
  var key = [key1];
  // Skip the document if the leading key component is undefined;
  // otherwise truncate the key at the first undefined component.
  var pos = key.indexOf(undefined);
  if (pos == 0) {
    return;
  } else if (pos > 0) {
    key.splice(pos);
  }

  emit(key, null);
}
// salt: 1889555792

Then I ran:
select count(*) from pViews where cnt = 1;
The startkey/endkey of [[128,1]] in the log below shows that the new index is being used (1 is encoded as [128, 1], the number case in indexFormattedValue above), yet the query still issues view requests like this and took a long time to return. Is there any other way to optimize this?
_time="2015-03-26T20:46:54Z" _level="INFO" _msg="Request View: http://serverURL/pViews/_design/ddl_pview_cnt_idx/_view/pview_cnt_idx?endkey=[[128%2C1]]&limit=1001&startkey=[[128%2C1]]&startkey_docid=pview_8502325"
Adding a sample JSON document:

{
   "primaryid": 11111111,
   "secondaryid": 22222222,
   "cnt": 2,
   "actdt": "2010-02-10T21:16:00",
   "id": "11111111",
   "cas": 0
}

Hi Jagsaha,

Please post the results of:

explain select count(*) from pViews where cnt = 1;

Thanks,
Gerald

{
  "results": [
    {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "IndexScan",
          "index": "pview_cnt_idx",
          "keyspace": "pViews",
          "limit": 9223372036854776000,
          "namespace": "default",
          "spans": [
            {
              "Range": {
                "High": [1],
                "Inclusion": 3,
                "Low": [1]
              },
              "Seek": null
            }
          ]
        },
        {
          "#operator": "Parallel",
          "~child": {
            "#operator": "Sequence",
            "~children": [
              {
                "#operator": "Fetch",
                "keyspace": "pViews",
                "namespace": "default"
              },
              {
                "#operator": "Filter",
                "condition": "((pViews.cnt) = 1)"
              },
              {
                "#operator": "InitialGroup",
                "aggregates": ["count(*)"],
                "group_keys": []
              },
              {
                "#operator": "IntermediateGroup",
                "aggregates": ["count(*)"],
                "group_keys": []
              }
            ]
          }
        },
        {
          "#operator": "IntermediateGroup",
          "aggregates": ["count(*)"],
          "group_keys": []
        },
        {
          "#operator": "FinalGroup",
          "aggregates": ["count(*)"],
          "group_keys": []
        },
        {
          "#operator": "Parallel",
          "~child": {
            "#operator": "Sequence",
            "~children": [
              {
                "#operator": "InitialProject",
                "result_terms": [
                  {
                    "expr": "count(*)"
                  }
                ]
              },
              {
                "#operator": "FinalProject"
              }
            ]
          }
        }
      ]
    }
  ]
}

As I mentioned earlier, every time I run this query against my 8.5M-record data set, I see in the cbq-engine console logs that it queries the index and returns results 1001 at a time (the default), which I believe Couchbase can do a better job of. Is there a way to adjust that value, e.g. increase it to 100K? I am thinking that would help the overall performance of the query. Also, are there any timelines for N1QL GA?

thanks
Jag

Hi Jag,

Thanks for this valuable input. We have also been thinking about tweaking this 1k value.

There will be a Developer Preview release of Couchbase Server 4.0 later this month. With that, you can use our new indexes, called GSI (Global Secondary Indexes). Can you please try that out and let us know?
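For reference, creating the GSI version of your index should look something like this (a sketch based on the 4.0 preview syntax, reusing the pViews bucket and cnt attribute from your earlier posts):

CREATE INDEX pview_cnt_idx ON pViews(cnt) USING GSI;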

Meanwhile, I will pass along your observation about the 1k batch size.

Thanks,
Gerald

Thanks Gerald! To use the GSI indexes, will I need to upgrade the Couchbase server, or will that just be a replacement for the cbq-engine?

Thanks again!
-Jag

Hi Jag,

The Developer Preview in April will not support upgrades, although the Beta and GA will. For the Developer Preview, you will need to do a clean install of Couchbase Server, re-import your data, and then create the GSI indexes.

Gerald