Design indices to support several filters on the same data

Hi,

my bucket main contains potentially millions of user documents like this one:
{
  "userId": …,
  "customerId": …,
  "creationTimestamp": 1493302910000,
  "lastUpdateTimestamp": 1593302910000,
  "type": "User"
}

It shall be possible to specify zero to two time frames to filter the users:
one filters users by their creation time,
the other filters users by their last update time.

The returned users shall be ordered by creationTimestamp and can be limited to e.g. 1000.
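
In other words, the query I need looks roughly like this (the named parameters are just placeholders of mine for the optional bounds; when a time frame is not given, I would drop or widen its two predicates):

```sql
-- Sketch of the target query; $customerId, $createdFrom, $createdTo,
-- $updatedFrom, $updatedTo are hypothetical named parameters.
SELECT userId, customerId, creationTimestamp, lastUpdateTimestamp
FROM `main`
WHERE `type` = "User"
  AND customerId = $customerId
  AND creationTimestamp >= $createdFrom AND creationTimestamp <= $createdTo
  AND lastUpdateTimestamp >= $updatedFrom AND lastUpdateTimestamp <= $updatedTo
ORDER BY creationTimestamp
LIMIT 1000;
```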

Now I wonder: what should my indices and the N1QL query look like to fetch the correctly filtered users with good performance?

So far, I have tried many different things, but each of them failed in one or the other way, and I am running out of ideas.

The closest I could get to a working solution was creating this index:

CREATE INDEX lastUpdateIndex ON main (type,customerId,creationTimestamp, lastUpdateTimestamp) WHERE type="User";

This N1QL query performs well (ca. 100 ms for 200,000 users) for large timespans, e.g. from 0 to "NOW":

SELECT userId, customerId, creationTimestamp, lastUpdateTimestamp 
FROM `main` 
WHERE `type` = "User" AND customerId="testuser" 
AND creationTimestamp >= 0 AND creationTimestamp <= 1493380543000
AND lastUpdateTimestamp >= 0 AND lastUpdateTimestamp <= 1493380543000
ORDER BY customerId, creationTimestamp, lastUpdateTimestamp ASC
LIMIT 1000;

As soon as the timespan for lastUpdateTimestamp gets smaller, the performance drops considerably, especially if no creation timespan is provided, as e.g. in this query, which takes more than 3 s to execute for 200,000 users in the bucket:

SELECT userId, creationTimestamp 
FROM `main` 
WHERE `type` = "User" AND customerId="testuser" 
AND lastUpdateTimestamp >= 0 AND lastUpdateTimestamp <= 1493214005000
ORDER BY customerId, creationTimestamp
LIMIT 1000;

If you are using 4.5.0+:

The following index is optimal for Query 1. Do EXPLAIN and check that the query is covered.

CREATE INDEX custIdCreatimeIndex ON main (customerId, creationTimestamp, lastUpdateTimestamp, userId) WHERE type="User";

SELECT userId, customerId, creationTimestamp, lastUpdateTimestamp 
FROM `main` 
WHERE `type` = "User" AND customerId="testuser" 
AND creationTimestamp >= 0 AND creationTimestamp <= 1493380543000
AND lastUpdateTimestamp >= 0 AND lastUpdateTimestamp <= 1493380543000
ORDER BY customerId, creationTimestamp, lastUpdateTimestamp ASC
LIMIT 1000;

For Q2 you have the following options.
Option 1: add creationTimestamp as a predicate, with the upper bound set to a huge value or the current time:

SELECT userId, creationTimestamp
FROM `main` 
WHERE `type` = "User" AND customerId="testuser" 
AND creationTimestamp >= 0 AND creationTimestamp <= now_millis()
AND lastUpdateTimestamp >= 0 AND lastUpdateTimestamp <= 1493380543000
ORDER BY customerId, creationTimestamp
LIMIT 1000;

Or try the following index:

CREATE INDEX lastUpdateIndex ON main (customerId,lastUpdateTimestamp,creationTimestamp,userId) WHERE type="User";
SELECT userId, creationTimestamp 
FROM `main` 
WHERE `type` = "User" AND customerId="testuser" 
AND lastUpdateTimestamp >= 0 AND lastUpdateTimestamp <= 1493214005000
ORDER BY customerId, creationTimestamp
LIMIT 1000;

The indexer uses a B-tree. If creationTimestamp is missing from the predicate, the lastUpdateTimestamp range can't be pushed down to the IndexScan, i.e. the query engine needs to fetch all the data from the indexer and post-process it.

Hi,
thanks for the quick response!

I’m on 4.5.0-2601 Community Edition.

I created the covering index you proposed:

CREATE INDEX end_user_creation_index ON main(customerId,creationTimestamp,lastUpdateTimestamp,userId) WHERE (type = "User")

This query needs ca. 60 ms, so that's fine:

SELECT userId, creationTimestamp
FROM `main` 
WHERE `type` = "User" AND customerId="testuser" 
AND creationTimestamp >= 0 AND creationTimestamp <= now_millis()
AND lastUpdateTimestamp >= 0 AND lastUpdateTimestamp <= 1493380543000
ORDER BY customerId, creationTimestamp
LIMIT 1000;

BUT: As soon as I change the lastUpdateTimestamp a bit, e.g. to 1493213820000,
the performance drops to more than 1.5 seconds (for just 200,000 users).

Is there no faster way?

EXPLAIN for both queries is the same (besides the different value for lastUpdateTimestamp):
[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "IndexScan",
              "covers": [
                "cover ((main.customerId))",
                "cover ((main.creationTimestamp))",
                "cover ((main.lastUpdateTimestamp))",
                "cover ((main.userId))",
                "cover ((meta(main).id))"
              ],
              "filter_covers": {
                "cover ((main.type))": "User"
              },
              "index": "end_user_creation_index",
              "index_id": "10f53e1981eadff1",
              "keyspace": "main",
              "namespace": "default",
              "spans": [
                {
                  "Range": {
                    "High": [
                      "successor(\"testuser\")"
                    ],
                    "Inclusion": 1,
                    "Low": [
                      "\"testuser\"",
                      "0",
                      "0"
                    ]
                  }
                }
              ],
              "using": "gsi"
            },
            {
              "#operator": "Parallel",
              "maxParallelism": 1,
              "~child": {
                "#operator": "Sequence",
                "~children": [
                  {
                    "#operator": "Filter",
                    "condition": "((((((cover ((main.type)) = \"User\") and (cover ((main.customerId)) = \"testuser\")) and (0 <= cover ((main.creationTimestamp)))) and (cover ((main.creationTimestamp)) <= now_millis())) and (0 <= cover ((main.lastUpdateTimestamp)))) and (cover ((main.lastUpdateTimestamp)) <= 1493213820000))"
                  },
                  {
                    "#operator": "InitialProject",
                    "result_terms": [
                      {
                        "expr": "cover ((main.userId))"
                      },
                      {
                        "expr": "cover ((main.creationTimestamp))"
                      }
                    ]
                  }
                ]
              }
            }
          ]
        },
        {
          "#operator": "Limit",
          "expr": "1000"
        },
        {
          "#operator": "FinalProject"
        }
      ]
    },
    "text": "SELECT userId, creationTimestamp\nFROM main\nWHERE type = \"User\" AND customerId=\"testuser\"\nAND creationTimestamp >= 0 AND creationTimestamp <= now_millis()\nAND lastUpdateTimestamp >= 0 AND lastUpdateTimestamp <= 1493213820000\nORDER BY customerId, creationTimestamp\nLIMIT 1000;"
  }
]

Do you have documents where creationTimestamp or lastUpdateTimestamp is MISSING or has a NULL value?

If you are not interested in those:

CREATE INDEX end_user_creation_index ON main(customerId,creationTimestamp,lastUpdateTimestamp,userId) WHERE (type = "User" AND creationTimestamp >= 0 AND lastUpdateTimestamp >= 0);

Each user in the bucket contains values for creationTimestamp and lastUpdateTimestamp.
So adding creationTimestamp >= 0 to the create index statement won’t make the index smaller.

It looks like now_millis() is not pushed to the indexer.
Change creationTimestamp <= now_millis() to creationTimestamp <= 4102559999000, i.e.
str_to_millis("2100-01-01T23:59:59") ===> 4102559999000

Hi,

I’m afraid that did not help either. Running this query

SELECT `type`, creationTimestamp, lastUpdateTimestamp 
FROM `main` 
WHERE type = "User" 
    AND customerId = "testuser" 
    AND creationTimestamp > 0 
    AND creationTimestamp < STR_TO_MILLIS("2100-01-01T00:00:01")
    AND lastUpdateTimestamp >= 0 
    AND lastUpdateTimestamp < 1493213965675
ORDER BY `customerId`, `creationTimestamp`
LIMIT 1000;

took more than 1.5 s.

I am not restricted to solving this performance issue solely by optimizing this N1QL query. If it can be solved by other Couchbase features I am not aware of, I'd be glad to hear suggestions in that direction as well.

Would it for example help to create another index over just "lastUpdateTimestamp", or to run two subqueries (one for creationTimestamp and one for lastUpdateTimestamp)?
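
For example, something along these lines (just a sketch of what I mean, untested; the index name is made up):

```sql
-- Hypothetical extra index covering only the update time
CREATE INDEX lastUpdateOnlyIndex ON main (lastUpdateTimestamp) WHERE type = "User";
```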

Can you post the EXPLAIN?

Here you are:

[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "IndexScan",
              "covers": [
                "cover ((main.customerId))",
                "cover ((main.creationTimestamp))",
                "cover ((main.lastUpdateTimestamp))",
                "cover ((main.userId))",
                "cover ((meta(main).id))"
              ],
              "filter_covers": {
                "cover ((main.type))": "User"
              },
              "index": "end_user_creation_index",
              "index_id": "42017fae23d6e6e8",
              "keyspace": "main",
              "namespace": "default",
              "spans": [
                {
                  "Range": {
                    "High": [
                      "\"testuser\"",
                      "4102441201000",
                      "1493213965675"
                    ],
                    "Inclusion": 0,
                    "Low": [
                      "\"testuser\"",
                      "0",
                      "0"
                    ]
                  }
                }
              ],
              "using": "gsi"
            },
            {
              "#operator": "Parallel",
              "maxParallelism": 1,
              "~child": {
                "#operator": "Sequence",
                "~children": [
                  {
                    "#operator": "Filter",
                    "condition": "((((((cover ((main.type)) = \"User\") and (cover ((main.customerId)) = \"testuser\")) and (0 < cover ((main.creationTimestamp)))) and (cover ((main.creationTimestamp)) < str_to_millis(\"2100-01-01T00:00:01\"))) and (0 <= cover ((main.lastUpdateTimestamp)))) and (cover ((main.lastUpdateTimestamp)) < 1493213965675))"
                  },
                  {
                    "#operator": "InitialProject",
                    "result_terms": [
                      {
                        "expr": "cover ((main.type))"
                      },
                      {
                        "expr": "cover ((main.creationTimestamp))"
                      },
                      {
                        "expr": "cover ((main.lastUpdateTimestamp))"
                      }
                    ]
                  }
                ]
              }
            }
          ]
        },
        {
          "#operator": "Limit",
          "expr": "1000"
        },
        {
          "#operator": "FinalProject"
        }
      ]
    },
    "text": "SELECT type, creationTimestamp, lastUpdateTimestamp\nFROM main\nWHERE type = \"User\"\n    AND customerId = \"testuser\"\n    AND creationTimestamp > 0\n    AND creationTimestamp < STR_TO_MILLIS(\"2100-01-01T00:00:01\")\n    AND lastUpdateTimestamp >= 0\n    AND lastUpdateTimestamp < 1493213965675\nORDER BY customerId, creationTimestamp\nLIMIT 1000;"
  }
]

The query plan looks good. Maybe a lot of values lie outside the range you are looking for. You can take a look at the profile stats of the statement in "SELECT * FROM system:completed_requests";

The profile stats for the slow query look like this:

completed_requests
ClientContextID 5d4be9be-a085-4093-943c-47dd42c396ea
ElapsedTime 4.577228152s
ErrorCount 0
PhaseCounts
IndexScan 450035
PhaseOperators
IndexScan 1
RequestId 51076e9d-7dea-4e79-90ce-b5341cc5a5aa
ResultCount 1
ResultSize 180
ServiceTime 4.577175475s
State completed
Statement SELECT type, userId, creationTimestamp, lastUpdateTimestamp
FROM main
WHERE type = "User"
AND customerId = "testuser"
AND creationTimestamp > 0
AND creationTimestamp < now_millis()
AND lastUpdateTimestamp >= 1493705466000
AND lastUpdateTimestamp < 1493705481000
ORDER BY customerId, creationTimestamp ASC
LIMIT 200
OFFSET 0
Time 2017-05-03 14:33:54.636459563 +0200 CEST

The fast query does not appear in the results for "SELECT * FROM system:completed_requests":

SELECT type, userId, creationTimestamp, lastUpdateTimestamp
FROM main
WHERE type = "User"
AND customerId = "testuser"
AND creationTimestamp > 0
AND creationTimestamp < 1495799743000
AND lastUpdateTimestamp > 0
ORDER BY customerId, creationTimestamp ASC
LIMIT 200
OFFSET 0

It was executed in 46 ms.
So when not setting an upper boundary for the lastUpdateTimestamp, queries run much faster.

Profile stats are captured for queries that take more than 1 second.
Look for the statement that has creationTimestamp < STR_TO_MILLIS("2100-01-01T00:00:01").
The output above has creationTimestamp < now_millis(), which doesn't push the predicate to the indexer, so it scanned 450K items (IndexScan 450035) to produce the qualified 200 results.

If the query runs fast when you don't set an upper bound for lastUpdateTimestamp, that means few items have values below that bound; with the bound in place, the indexer needs to scan more entries to satisfy the limit.

The indexer uses a RangeScan. When you have two range predicates, there will be false positives from the indexer (it treats them as a single range on the composite key), and the query engine has to filter on each key separately. (This has been addressed in 5.0.0.)

If you switch the index keys to lastUpdateTimestamp, creationTimestamp it may be faster, but the ORDER BY creationTimestamp will make it slow because it needs to fetch all qualified items. You can try the following and see how much time it takes.
CREATE INDEX custIdlastUpdatedIndex ON main (customerId, lastUpdateTimestamp, creationTimestamp, userId) WHERE type="User";

Here is one with STR_TO_MILLIS(…):

completed_requests
ClientContextID	a05164e2-3992-416e-a998-3d746b0c4abd
ElapsedTime	4.387076628s
ErrorCount	0
PhaseCounts
IndexScan	450035
PhaseOperators
IndexScan	1
RequestId	dc83d1bb-d3b7-4e19-9c63-715865cfd5b1
ResultCount	1
ResultSize	180
ServiceTime	4.454965372s
State	completed
Statement	SELECT `type`, `userId`, `creationTimestamp`, `lastUpdateTimestamp` 
FROM `main` 
WHERE `type` = "User" 
AND customerId = "testuser" 
AND creationTimestamp > 0
AND creationTimestamp < STR_TO_MILLIS("2100-01-01T00:00:01")
AND lastUpdateTimestamp >= 1493705466000
AND lastUpdateTimestamp < 1493705481000
ORDER BY `customerId`, `creationTimestamp` ASC 
LIMIT 200 
OFFSET 0
Time	2017-05-03 15:18:07.669838339 +0200 CEST

and here is another one with the milliseconds value instead of STR_TO_MILLIS(…):

completed_requests
ClientContextID	912f7a86-ecb0-43eb-a064-0d35e98587f9
ElapsedTime	4.107983727s
ErrorCount	0
PhaseCounts
IndexScan	450035
PhaseOperators
IndexScan	1
RequestId	2507d0fe-13f6-4a6d-96c9-9bad74ed81d7
ResultCount	1
ResultSize	180
ServiceTime	4.107905186s
State	completed
Statement	SELECT `type`, `userId`, `creationTimestamp`, `lastUpdateTimestamp` 
FROM `main` 
WHERE `type` = "User" 
AND customerId = "testuser" 
AND creationTimestamp > 0
AND creationTimestamp < 4102444801000
AND lastUpdateTimestamp >= 1493705466000
AND lastUpdateTimestamp < 1493705481000
ORDER BY `customerId`, `creationTimestamp` ASC 
LIMIT 200 
OFFSET 0
Time	2017-05-03 15:21:12.646487842 +0200 CEST

I have updated my previous post with more details. The plan looks good, and the query needs to scan 450K items to produce the results. That is why it is taking time. This has been improved in 5.0.0.

Another option: if you are looking for lastUpdateTimestamp before a specific old date (say, the end of 2016), you can create a partial index on that date.
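
A minimal sketch of such a partial index (the index name and cutoff value are examples of mine; 1483228800000 is 2017-01-01T00:00:00 UTC in milliseconds):

```sql
-- Hypothetical partial index covering only users not updated since the cutoff
CREATE INDEX oldUsersIndex ON main (customerId, creationTimestamp, userId)
WHERE type = "User" AND lastUpdateTimestamp < 1483228800000;
```

Note that a query can only use such a partial index if its WHERE clause implies the index predicate, e.g. by repeating lastUpdateTimestamp < 1483228800000.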

Thanks for your help so far, vsr1!

I'll try out Couchbase 5.0 as soon as it is released.

Developer builds of 5.0.0 are available at https://www.couchbase.com/downloads under the Developer tab.

I installed the 5.0.0 developer preview and ran a very similar query on a bucket with exactly 1 million documents with type "User":

SELECT `type`, `userId`, `creationTimestamp`, `lastUpdateTimestamp` 
FROM `main` 
WHERE `type` = "User" 
  AND customerId = "testuser" 
  AND creationTimestamp > 0
  AND creationTimestamp < 9999999999999
  AND lastUpdateTimestamp >= 1493882615000 
  AND lastUpdateTimestamp <= 1493884600000 
  ORDER BY `customerId`, `creationTimestamp` ASC 
  LIMIT 200 
  OFFSET 0

It ran in 1.45 s. Without the two boundaries for lastUpdateTimestamp, the query executed in 18 ms.
Am I missing something here to make use of the latest features you mentioned?

The "Plan Text":

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "IndexScan2",
            "covers": [
              "cover ((`main`.`customerId`))",
              "cover ((`main`.`creationTimestamp`))",
              "cover ((`main`.`lastUpdateTimestamp`))",
              "cover ((`main`.`userId`))",
              "cover ((meta(`main`).`id`))"
            ],
            "filter_covers": {
              "cover ((`main`.`type`))": "User"
            },
            "index": "end_user_creation_index",
            "index_id": "550d1fbeb51d7379",
            "index_projection": {
              "entry_keys": [
                0,
                1,
                2,
                3
              ]
            },
            "keyspace": "main",
            "limit": "200",
            "namespace": "default",
            "spans": [
              {
                "exact": true,
                "range": [
                  {
                    "high": "\"testuser\"",
                    "inclusion": 3,
                    "low": "\"testuser\""
                  },
                  {
                    "high": "9999999999999",
                    "inclusion": 0,
                    "low": "0"
                  },
                  {
                    "high": "1493884600000",
                    "inclusion": 3,
                    "low": "1493882615000"
                  }
                ]
              }
            ],
            "using": "gsi"
          },
          {
            "#operator": "Parallel",
            "maxParallelism": 1,
            "~child": {
              "#operator": "Sequence",
              "~children": [
                {
                  "#operator": "Filter",
                  "condition": "((((((cover ((`main`.`type`)) = \"User\") and (cover ((`main`.`customerId`)) = \"testuser\")) and (0 < cover ((`main`.`creationTimestamp`)))) and (cover ((`main`.`creationTimestamp`)) < 9999999999999)) and (1493882615000 <= cover ((`main`.`lastUpdateTimestamp`)))) and (cover ((`main`.`lastUpdateTimestamp`)) <= 1493884600000))"
                },
                {
                  "#operator": "InitialProject",
                  "result_terms": [
                    {
                      "expr": "cover ((`main`.`type`))"
                    },
                    {
                      "expr": "cover ((`main`.`userId`))"
                    },
                    {
                      "expr": "cover ((`main`.`creationTimestamp`))"
                    },
                    {
                      "expr": "cover ((`main`.`lastUpdateTimestamp`))"
                    }
                  ]
                },
                {
                  "#operator": "FinalProject"
                }
              ]
            }
          }
        ]
      },
      {
        "#operator": "Limit",
        "expr": "200"
      }
    ]
  },
  "text": "SELECT `type`, `userId`, `creationTimestamp`, `lastUpdateTimestamp` 
FROM `main` 
WHERE `type` = "User" 
  AND customerId = "testuser" 
  AND creationTimestamp > 0
  AND creationTimestamp < 9999999999999
  AND lastUpdateTimestamp >= 1493882615000 
  AND lastUpdateTimestamp <= 1493884600000 
  ORDER BY `customerId`, `creationTimestamp` ASC 
  LIMIT 200 
  OFFSET 0"
}

The query plan looks good. The IndexScan might be taking time to read the records that qualify the predicate. The plan shows that all the predicates are passed to the indexer (the spans in the plan). Even "limit": "200" is pushed to the indexer (see the "#operator": "IndexScan2" section of the plan). cc @deepkaran.salooja

@MacGyves, you can try giving more memory quota to Indexer (UI->Settings).