Querying large documents is extremely slow from the SDK, but fast from the web UI or cbq

I’m using Couchbase to store blacklists of disallowed passwords, with sizes up to ~15 MB.
Each list contains a JSON array that I’ve indexed to allow fast “any-in-satisfies” lookups.
If I check whether some password is contained in some blacklist, the query finishes after ~5 ms when executed via the web UI or cbq. When I execute the same query via the SDK, it takes more than 1000 ms!

Reproducing this issue:

I’m running a new Couchbase EE 6.5 instance.

The dataset can be downloaded here
You should be able to import it like this:
/opt/couchbase/bin/cbimport json -u <user> -p <password> -b <bucket-name> -f lines -c http://localhost:8091 --generate-key %_id_% -d file://blacklists

I’ve created the following index to allow quick searches for passwords. No primary index exists.

CREATE INDEX blacklisted_passwords_idx
ON `blacklist`(DISTINCT ARRAY LOWER(pw) FOR pw IN passwords END)
WHERE type="password-blacklist"

The following query completes within a few milliseconds when run from the web UI or cbq:

SELECT RAW COUNT(*) FROM `blacklist`
USE INDEX (blacklisted_passwords_idx)
WHERE ANY pw IN passwords SATISFIES LOWER(pw) = LOWER("p4ssword") END
      AND type="password-blacklist"

When executing the query via the SDK (Scala in my case), it takes ~1200 ms to complete!

import com.couchbase.client.scala.Cluster
import com.couchbase.client.scala.query.{QueryOptions, QueryParameters}

object TestApp extends App {
  val cluster = Cluster.connect(
    "127.0.0.1",
    "",
    ""
  ).get

  val bucket = cluster.bucket("blacklist")
  val collection = bucket.defaultCollection

  val n1ql =
    s"""SELECT RAW COUNT(*) FROM blacklist
       |USE INDEX (blacklisted_passwords_idx)
       |WHERE ANY pw IN passwords SATISFIES LOWER(pw) = LOWER($$password) END
       |      AND type="password-blacklist"
       |""".stripMargin

  var start = 0L

  for (i <- 0 until 10) {
    println("Starting...")
    start = System.currentTimeMillis()
    val result = cluster.query(
      n1ql,
      QueryOptions()
        .parameters(QueryParameters.Named("password" -> "p4ssword"))
        .metrics(true)
    ).get
    println(s"Took ${System.currentTimeMillis() - start} ms")
    println("executionTime: " + result.metaData.metrics.get.executionTime.toMillis + " ms")
    println("elapsedTime: " + result.metaData.metrics.get.elapsedTime.toMillis + " ms")
  }
}

This yields the following results:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See SLF4J Error Codes for further details.
Starting…
Took 3015 ms
executionTime: 1273 ms
elapsedTime: 1273 ms
Starting…
Took 1353 ms
executionTime: 1350 ms
elapsedTime: 1350 ms
Starting…
Took 1194 ms
executionTime: 1189 ms
elapsedTime: 1189 ms
Starting…
Took 1267 ms
executionTime: 1263 ms
elapsedTime: 1263 ms
Starting…
Took 1248 ms
executionTime: 1245 ms
elapsedTime: 1245 ms

Hi @aki. I’m the developer of the Scala SDK. I’ll see if I can replicate this with your instructions. There shouldn’t be any major performance difference between the UI and the SDK.

I can replicate this, and Wireshark shows that the delay is spent waiting for the response from the query service. I’m not sure why at this stage, but if you replace the named parameter - e.g. put 'LOWER("p4ssword")' instead of LOWER($password) - then the query completes in a few milliseconds, like the UI.
I confirmed that the same behaviour happens with Java SDK 2.x: specify $password in the query string and it takes nearly a second; hardcode the parameter and it’s a few milliseconds.
I’m not sure why this is the case; perhaps a member of the query team can shed some light - @vsr1?
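For anyone who wants to try the workaround, here is a minimal sketch of inlining the value (hypothetical code, not from the thread; the escaping is naive and only handles backslashes and embedded quotes, so treat it as illustrative rather than injection-safe):

```scala
// Sketch of the workaround: embed the value as a string literal instead of
// passing it as a named parameter.
val password = "p4ssword"

// Slow variant: the named parameter prevents the implicit covering array index.
// Plain triple-quoted string, so $password is literal, not interpolated.
val parameterized =
  """SELECT RAW COUNT(*) FROM `blacklist`
    |USE INDEX (blacklisted_passwords_idx)
    |WHERE ANY pw IN passwords SATISFIES LOWER(pw) = LOWER($password) END
    |      AND type="password-blacklist"
    |""".stripMargin

// Fast variant: replace the parameter with an escaped string literal.
// Naive escaping of backslashes and quotes only - illustrative, not safe
// for untrusted input.
val escaped = password.replace("\\", "\\\\").replace("\"", "\\\"")
val inlined = parameterized.replace("$password", "\"" + escaped + "\"")

println(inlined)
```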

The query uses an array index, and part of the ANY clause uses query parameters. In that situation the planner can’t use the implicit covering index: MB-33009

If you have query parameters, replacing the parameters with their actual values in the statement before executing gives better performance.
You can also try the following index (it uses ALL, so it can cover through UNNEST) and query:

CREATE INDEX blacklisted_passwords_idx
ON `blacklist`(ALL ARRAY LOWER(pw) FOR pw IN passwords END)
WHERE type="password-blacklist";

SELECT RAW COUNT( DISTINCT META(b).id ) FROM `blacklist` AS b
USE INDEX (blacklisted_passwords_idx)
UNNEST b.passwords AS pw
WHERE b.type="password-blacklist" AND LOWER(pw) = LOWER($password) ;

Please check out the reply in this post: Query with embedded parameters 10x faster than parameterized query
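As a sketch of what this suggestion looks like from the Scala SDK (hypothetical code; the query string mirrors the UNNEST query above, and the actual cluster.query call is shown commented out since it needs a running cluster):

```scala
// vsr1's UNNEST variant still accepts a named parameter: the parameter is a
// plain comparison operand rather than part of an ANY ... SATISFIES clause,
// so the array index can still be used (per the MB-33009 discussion above).
val unnestQuery =
  """SELECT RAW COUNT(DISTINCT META(b).id) FROM `blacklist` AS b
    |USE INDEX (blacklisted_passwords_idx)
    |UNNEST b.passwords AS pw
    |WHERE b.type="password-blacklist" AND LOWER(pw) = LOWER($password)
    |""".stripMargin

// With a live cluster (see the TestApp above) this would be executed as:
// val result = cluster.query(
//   unnestQuery,
//   QueryOptions().parameters(QueryParameters.Named("password" -> "p4ssword"))
// ).get
println(unnestQuery)
```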


Thanks a lot @vsr1, your query delivers the result within a few milliseconds, just as when a constant is used inline.
Is there a general issue with prepared statements and index utilization, or are ANY-IN-SATISFIES queries + prepared statements + array indices just one special edge case that I’ve hit?
So far I assumed that when I optimized queries to run efficiently with string constants in the web UI, they would run equally well when I replaced the constants with parameters of prepared statements. Are there any special rules I need to pay attention to?

A great shout-out to @graham.pople, you did a great job designing the Scala SDK! :smiley: :+1:
While e.g. the MongoDB Scala SDK feels like a wrapper around the Java SDK, the Couchbase SDK feels like it was designed for Scala from the ground up. The MongoDB SDK e.g. uses reflection to find codecs and therefore cannot deserialize lists due to type erasure. The Couchbase SDK allows the full flexibility of Scala implicits and moves codec resolution from runtime to compile time! The mutable JSON API is not that idiomatic, but those are just tiny, acceptable trade-offs that allow delivering great performance.


If the query uses an array index and the ANY-SATISFIES clause has query positional/named parameters (constants are okay), it must fetch the document and can’t use the implicit covered array index.

https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/indexing-arrays.html#implicit-covered-array-index

Thanks very much for that feedback @aki, I’m glad you’re enjoying the SDK. And indeed, compile-time codec generation and a truly native API were high priorities.
I hear you on the mutable JSON library; it’s not going to be everyone’s cup of tea :slight_smile: There’s a defence of the approach here https://docs.couchbase.com/scala-sdk/current/howtos/json.html#jsonobject-and-jsonarray, but the main point to take away is that using the built-in JSON is 100% optional. The SDK includes first-class support for several immutable third-party JSON libraries (including the excellent Circe), plus it’s easy for the application to plug in more (literally just a few lines of boilerplate). So please feel free to use an alternative for JSON - there are plenty of examples in the docs I linked to.


@vsr1 I’ve just migrated to Couchbase Server Community 6.5.0, and the index now seems to trigger a bug.
When I use the blacklisted_passwords_idx index, I always get an empty response.

SELECT pw FROM `inseven` AS blacklist
USE INDEX(blacklisted_passwords_idx)
UNNEST blacklist.passwords AS pw
WHERE blacklist.type="password-blacklist" AND LOWER(pw) = LOWER("123456")

{
  "results": []
}

When I force a primary scan, I get the correct results:

SELECT pw FROM `inseven` AS blacklist
USE INDEX(`#primary`)
UNNEST blacklist.passwords AS pw
WHERE blacklist.type="password-blacklist" AND LOWER(pw) = LOWER("123456")

[
  {
    "pw": "123456"
  }
]

Are you able to reproduce this issue?

Check if the index has any pending items.

Execute the query in the UI and post the output from the Plan Text tab; also post the index definition.

I’ve just set up the container another time, once again with Community 6.5.0 and also Enterprise Edition 6.5.0.
The error only appears in the Community Edition. Everything works fine on Enterprise.

Notice the difference in the item count of the indices.


Community Edition 6.5.0 build 4966:

{
    "#operator": "Sequence",
    "~children": [
        {
            "#operator": "DistinctScan",
            "scan": {
                "#operator": "IndexScan3",
                "as": "blacklist",
                "index": "blacklisted_passwords_idx",
                "index_id": "5fab170492ce7769",
                "index_projection": {
                    "primary_key": true
                },
                "keyspace": "inseven",
                "namespace": "default",
                "spans": [
                    {
                        "exact": true,
                        "range": [
                            {
                                "high": "lower(\"123456\")",
                                "inclusion": 3,
                                "low": "lower(\"123456\")"
                            }
                        ]
                    }
                ],
                "using": "gsi"
            }
        },
        {
            "#operator": "Fetch",
            "as": "blacklist",
            "keyspace": "inseven",
            "namespace": "default"
        },
        {
            "#operator": "Parallel",
            "~child": {
                "#operator": "Sequence",
                "~children": [
                    {
                        "#operator": "Unnest",
                        "as": "pw",
                        "expr": "(`blacklist`.`passwords`)"
                    }
                ]
            }
        },
        {
            "#operator": "Parallel",
            "~child": {
                "#operator": "Sequence",
                "~children": [
                    {
                        "#operator": "Filter",
                        "condition": "(((`blacklist`.`type`) = \"password-blacklist\") and (lower(`pw`) = lower(\"123456\")))"
                    },
                    {
                        "#operator": "InitialProject",
                        "result_terms": [
                            {
                                "expr": "`pw`"
                            }
                        ]
                    },
                    {
                        "#operator": "FinalProject"
                    }
                ]
            }
        }
    ]
}

Enterprise Edition 6.5.0 build 4960:

{
  "#operator": "Sequence",
  "#stats": {
    "#phaseSwitches": 1,
    "execTime": "1.99µs"
  },
  "~children": [
    {
      "#operator": "Authorize",
      "#stats": {
        "#phaseSwitches": 3,
        "execTime": "4.625µs",
        "servTime": "1.335019ms"
      },
      "privileges": {
        "List": [
          {
            "Target": "default:inseven",
            "Priv": 7
          }
        ]
      },
      "~child": {
        "#operator": "Sequence",
        "#stats": {
          "#phaseSwitches": 1,
          "execTime": "2.889µs"
        },
        "~children": [
          {
            "#operator": "DistinctScan",
            "#stats": {
              "#itemsIn": 4,
              "#itemsOut": 4,
              "#phaseSwitches": 21,
              "execTime": "35.883µs",
              "kernTime": "2.004699ms"
            },
            "scan": {
              "#operator": "IndexScan3",
              "#stats": {
                "#itemsOut": 4,
                "#phaseSwitches": 21,
                "execTime": "27.207µs",
                "kernTime": "3.507µs",
                "servTime": "1.966096ms"
              },
              "as": "blacklist",
              "index": "blacklisted_passwords_idx",
              "index_id": "6b96febaa2a67c41",
              "index_projection": {
                "primary_key": true
              },
              "keyspace": "inseven",
              "namespace": "default",
              "spans": [
                {
                  "exact": true,
                  "range": [
                    {
                      "high": "lower(\"123456\")",
                      "inclusion": 3,
                      "low": "lower(\"123456\")"
                    }
                  ]
                }
              ],
              "using": "gsi",
              "#time_normal": "00:00.001",
              "#time_absolute": 0.001993303
            },
            "#time_normal": "00:00.000",
            "#time_absolute": 0.000035883
          },
          {
            "#operator": "Fetch",
            "#stats": {
              "#itemsIn": 4,
              "#itemsOut": 4,
              "#phaseSwitches": 21,
              "execTime": "42.66µs",
              "kernTime": "2.048277ms",
              "servTime": "87.164822ms"
            },
            "as": "blacklist",
            "keyspace": "inseven",
            "namespace": "default",
            "#time_normal": "00:00.087",
            "#time_absolute": 0.087207482
          },
          {
            "#operator": "Unnest",
            "#stats": {
              "#itemsIn": 4,
              "#itemsOut": 3067023,
              "#phaseSwitches": 6134057,
              "execTime": "10.389846706s",
              "kernTime": "2.166762357s"
            },
            "as": "pw",
            "expr": "(`blacklist`.`passwords`)",
            "#time_normal": "00:10.389",
            "#time_absolute": 10.389846706
          },
          {
            "#operator": "Sequence",
            "#stats": {
              "#phaseSwitches": 1,
              "execTime": "2.182µs"
            },
            "~children": [
              {
                "#operator": "Filter",
                "#stats": {
                  "#itemsIn": 3067023,
                  "#itemsOut": 4,
                  "#phaseSwitches": 6134057,
                  "execTime": "7.825075115s",
                  "kernTime": "4.731631103s"
                },
                "condition": "(((`blacklist`.`type`) = \"password-blacklist\") and (lower(`pw`) = lower(\"123456\")))",
                "#time_normal": "00:07.825",
                "#time_absolute": 7.825075115
              },
              {
                "#operator": "InitialProject",
                "#stats": {
                  "#itemsIn": 4,
                  "#itemsOut": 4,
                  "#phaseSwitches": 17,
                  "execTime": "62.58µs",
                  "kernTime": "12.556635856s"
                },
                "result_terms": [
                  {
                    "expr": "`pw`"
                  }
                ],
                "#time_normal": "00:00.000",
                "#time_absolute": 0.00006258
              },
              {
                "#operator": "FinalProject",
                "#stats": {
                  "#itemsIn": 4,
                  "#itemsOut": 4,
                  "#phaseSwitches": 13,
                  "execTime": "8.483µs",
                  "kernTime": "11.011µs"
                },
                "#time_normal": "00:00.000",
                "#time_absolute": 0.000008483
              }
            ],
            "#time_normal": "00:00.000",
            "#time_absolute": 0.000002182
          }
        ],
        "#time_normal": "00:00.000",
        "#time_absolute": 0.0000028889999999999998
      },
      "#time_normal": "00:00.001",
      "#time_absolute": 0.001339644
    },
    {
      "#operator": "Stream",
      "#stats": {
        "#itemsIn": 4,
        "#itemsOut": 4,
        "#phaseSwitches": 11,
        "execTime": "91.264µs",
        "kernTime": "12.557988942s"
      },
      "#time_normal": "00:00.000",
      "#time_absolute": 0.000091264
    }
  ],
  "~versions": [
    "2.0.0-N1QL",
    "6.5.0-4960-enterprise"
  ],
  "#time_normal": "00:00.000",
  "#time_absolute": 0.00000199
}

The plan looks right. Check the indexer log for any skipped documents. cc @jeelan.poola

@aki
The Index Item Size and Index Data Size stats for the index differ starkly in the two screenshots you shared. Please confirm that the data set used for both builds was exactly the same. As @vsr1 suggested, check the indexer logs for any errors about skipped mutations in the Community Edition. Could you share the last PeriodicStats logging in indexer.log for both builds for this index, so we can check which stats differ?

I’ve just set up two identical databases in Docker containers, so you can try it yourself and get all the metrics that you need. The username/password for the Couchbase instance in both containers is “Administrator” / “Administrator”.
https://mega.nz/#!BIhVBAZT!0dD10lVWVrypUh0uNgBJz11sJL9dZgcu9GGoin43T1E

@aki I tried creating the containers but I see no logs in the logs directory. Could you please share indexer.log from both community and enterprise setups?

Thanks a lot for looking into this issue. Inside the zip file you’ll find the “/opt/couchbase/var” directory of both Couchbase containers. I think that’s the only directory where Couchbase stores persistent data/logs, isn’t it?

The indexer.log files should be located inside the zip file at community/data/lib/couchbase/logs/indexer.log and enterprise/data/lib/couchbase/logs/indexer.log, if those are the files you’re looking for.

For the sake of completeness, here they are again:

  • On Enterprise there was just a single small log file.
  • On Community the logs contain only weird numbers and grew so large that they were split and gzipped.

Hi @aki, I was able to download the enterprise-indexer-log.zip, but the link you provided for Community is not working. Could you please recheck and share the link?

Seems like I forgot to paste the link; I’ve added it now :sweat_smile:.

@aki Alright, I see both logs now :slight_smile:
In the Enterprise Edition, large keys are always allowed, but the Community Edition has some limitations regarding the size of the keys to be indexed. It looks like the size of the “passwords” array in blacklisted_passwords_idx is too large: the logs from the Community Edition indicate that keys were skipped during indexing due to size limits. Can you look at the array and estimate its size?
Please note that the Enterprise Edition does not have any such limitation, which is why you see all items indexed on the stats page.

The documents are rather large. The arrays in the documents are up to 15 MB each. The longest strings in the arrays are 255 characters long.

I’m kind of surprised to hear of such a restriction, since it seemed to work fine on Community 6.0.
Was this limitation added recently, and is it documented somewhere? So far the only restrictions I’ve heard of are document sizes of 20 MB each, or 10 MB when using ACID transactions.

@aki The behavior should be similar in 6.0 and 6.5 Community. I can try out a similar setup and let you know. Please provide the following details:

  1. The average number of items in the passwords array
  2. The average size of each item in the array (I understand the max could be up to 255)
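A quick pure-Scala sketch for gathering those two numbers, assuming the passwords array has been parsed into a Seq[String] (the sample data here is made up):

```scala
// Hypothetical helper to answer the two questions above: item count and
// average item size (in UTF-8 bytes) for a passwords array. In practice
// `passwords` would be parsed from one of the blacklist documents.
def arrayStats(passwords: Seq[String]): (Int, Double) = {
  val count = passwords.size
  val avgBytes =
    if (count == 0) 0.0
    else passwords.map(_.getBytes("UTF-8").length.toLong).sum.toDouble / count
  (count, avgBytes)
}

// Made-up sample data, just to show the shape of the output.
val sample = Seq("123456", "p4ssword", "qwerty")
val (count, avgBytes) = arrayStats(sample)
println(s"items: $count, average size: $avgBytes bytes")
```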