Performance with geospatial queries

I just noticed that when I'm doing geospatial queries on a given index and I increase the distance, the query becomes slow at around 400km.

  • Sometimes when I use size >= 10k, the queries become slower too:
      SELECT  *
      FROM `dev` AS dev
      WHERE SEARCH(dev, 
        { 
          "from": 0,
          "size": 600, 
          "query": { 
            "location": { "lon": -73.84963268654913, "lat": 45.90213678754309 }, 
            "distance": "400km", 
            "field": "geo" 
          }
        }, { "index": "dev-fts" })

What are some of the ways to increase performance with geospatial queries?

@vsr1 - why don't I just query the geo field with plain logical math using N1QL instead of the spatial query,

or should I try top_left and top_bottom?

Please share your experience on this, thank you.

@ceddy As the number of documents eligible for a query grows, the latency tends to increase.

I can recommend a way to speed up your query by disabling sort - which is ON by default over score (relevancy determined based on tf-idf).

If you are using couchbase-server >= 6.6.1, would you try this …

      SELECT *
      FROM `dev` AS dev
      WHERE SEARCH(dev, 
        { 
          "from": 0,
          "size": 600, 
          "query": { 
            "location": { "lon": -73.84963268654913, "lat": 45.90213678754309 }, 
            "distance": "400km", 
            "field": "geo" 
          },
          "score": "none"
        }, { "index": "dev-fts" })

Removing sort and adding score=none shaves off very few ms; the problem is still there though. When using a large distance, it's super slow,

and when you increase the size it makes things worse.

Would you give us an idea of how many documents you’re trying to match with your geo query?

@abhinav @vsr1 - since Couchbase rocks at indexing fields, why don't I just query the geo fields as numbers?

the tricky part is figuring out the algorithm for the target values,

for example, if I am querying for location lat: 20 and lon: -20, my query should look something like this:

  • first I create a top_left and a top_bottom maximum,
  • e.g. the top_left max would be pseudo lat: 40, lon: -40
  • e.g. the top_bottom max would be pseudo lat: -20, lon: 10

Then in my query I check for documents WHERE lon <= top_left.lon AND lat <= top_left.lat, and then add clauses for the top_bottom too.
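
For illustration, a rough N1QL sketch of that bounding-box idea, using the example corners above (a hypothetical query; which comparison direction applies depends on which corner holds the min and which the max):

      SELECT d.*
      FROM `dev` AS d
      WHERE d.geo.lat <= 40    /* top_left.lat   */
        AND d.geo.lat >= -20   /* top_bottom.lat */
        AND d.geo.lon >= -40   /* top_left.lon   */
        AND d.geo.lon <= 10;   /* top_bottom.lon */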

what do you think?

My bucket is about 40K docs, but the target size is defined in the query using size, correct?

Well, my question was more about how many document hits your query obtains. Like I mentioned earlier - queries that obtain more document hits tend to run longer.

If you would like to index the geo fields as numeric data and move the geo-spatial logic to your application - I’d recommend using a GSI index for the query.

Oh, I see.

I have a lot of matches when I zoom out of the map, but I thought the size limited how many hits it would make,

and below is my index; what is a GSI index?

  1. I’d recommend trying to keep the query narrow so that it matches a limited number of documents, as opposed to all or a majority of the documents.
  2. If (1) is not an option, and if you’re not doing this already - I’d recommend distributing the index over multiple nodes hosting the search service - to distribute the query load across a cluster of nodes as opposed to a single node…
  3. The size parameter of the query is more of a filter/limit setting for the search service to truncate the number of results that the query matches.

Additionally, I’d adapt the index definition you’ve shared a bit …

  • “edit” the default type mapping and check “only index specified fields”
  • “edit” the geo field and uncheck the “store” and “include in _all field” options

This would reduce the size of the search index that queries will have to work with.
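
For reference, those two edits roughly correspond to a fragment like this in the index definition’s JSON, where "dynamic": false corresponds to “only index specified fields” and "store" / "include_in_all" correspond to the two unchecked options (an approximation - the exact shape of your definition may differ):

      "default_mapping": {
        "dynamic": false,
        "enabled": true,
        "properties": {
          "geo": {
            "enabled": true,
            "fields": [
              {
                "name": "geo",
                "type": "geopoint",
                "index": true,
                "store": false,
                "include_in_all": false
              }
            ]
          }
        }
      }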

A GSI (Global Secondary Index) is a different type of index supported by Couchbase; it can be accessed from the Query Workbench. There’s detailed documentation for it here … Index Service | Couchbase Docs
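
For example, a simple GSI on the geo fields from this thread could be created like this (the index name here is hypothetical):

      CREATE INDEX ix_dev_geo ON `dev`(geo.lat, geo.lon);  /* hypothetical index name */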

1 - Can you clarify how to limit the number of documents matched (rather than a majority of them), since size only affects the results, like you’ve described?

2 - is not an option; I think a simple GSI index on the geo fields can do the trick!

PS: The edits don’t improve the performance. I wonder how the geo query works internally, because if it were using a basic GSI index on the geo fields it could be faster.

Let me try that and see

The “size”/“limit” setting is the only way to limit the number of documents returned by a query. This limit is however NOT necessarily the actual number of document hits the query would work on.

For example - setting geo-distance to 10km from a certain {lat, lon} could match 100 documents, and if you specify "size": 2 for the query, only 2 of those 100 documents are returned by the query.

So keeping the query narrow/precise would reduce the number of document hits that become eligible for the search criteria - and would keep the query latency lower.

Geo-spatial support is available only within search indexes. If you choose to use a GSI index, you’ll have to add the necessary logic within your application.

Yes, I agree, a GSI index is the way to go,

Let me dig into the logic

@ceddy,

Can you please confirm a few details here?

  • what are the latency expectations here/SLAs?
  • what is the latency increase observed as you increased your radius to 400KM from the original values?
  • what is the FTS RAM quota set for this cluster? Try overriding the FTS RAM quota to a higher value, like a few GBs.
  • how many cores do you have in the FTS node? Performance peaks if you can spare a core per partition.
  • can you try direct FTS curl calls (a sketch follows below) and measure latency, especially after revisiting the RAM quota?
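
Such a direct call might look like this (a sketch assuming the default FTS port 8094, the dev-fts index from earlier, and placeholder credentials - adjust host and auth to your cluster):

      curl -u Administrator:password -XPOST \
        -H "Content-Type: application/json" \
        http://localhost:8094/api/index/dev-fts/query \
        -d '{
              "size": 600,
              "query": {
                "location": { "lon": -73.84963268654913, "lat": 45.90213678754309 },
                "distance": "400km",
                "field": "geo"
              },
              "score": "none"
            }'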

Just want to give a heads-up here: embedding the geo logic in the application and solving this using a GSI index would be a non-trivial job.

Cheers!

With the GSI, I can now pull 20k documents in 2s over a very large area, like a 50000km radius.

This post from @vsr1 really helped me: Appropriate index for query with unnest - #2 by vsr1

I’ve been reading a lot, and it clearly shows that the GSI is much faster than the internal spatial queries.

Now:

  • I’ve put together the spatial logic in the query,
  • Let’s say I have a Point A [43.65, -79.44],
  • I create a SELECT using the geo formula against this Point A, which computes the distance in km between Point A and the documents in the DB (the formula is spelled out below).
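
For reference, the geo formula used here is the spherical law of cosines for great-circle distance, with all angles in radians and 6371 being Earth’s mean radius in km:

      distance_km = 6371 * ACOS( SIN(lat1) * SIN(lat2) + COS(lat1) * COS(lat2) * COS(lon2 - lon1) )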

This query works:

SELECT 
       d.env, 
       d.geo, 
       d.city,
       (ACOS(SIN( RADIANS(43.65)) * SIN(RADIANS(d.geo.lat)) + COS( RADIANS(43.65)) * COS(RADIANS(d.geo.lat)) * COS(RADIANS(d.geo.lon) - RADIANS(-79.44))) * 6371) AS distance
FROM `dev` d
WHERE d.geo.lat IS NOT NULL AND d.geo.lat IS NOT MISSING
AND (ACOS(SIN( RADIANS(43.65)) * SIN(RADIANS(d.geo.lat)) + COS( RADIANS(43.65)) * COS(RADIANS(d.geo.lat)) * COS(RADIANS(d.geo.lon) - RADIANS(-79.44))) * 6371) <= 50000
ORDER BY distance DESC
LIMIT 20000;

This second one doesn’t work; I tried to make it cleaner like this, but it fails:

SELECT 
       d.env, 
       d.geo, 
       d.city,
       (ACOS(SIN( RADIANS(43.65)) * SIN(RADIANS(d.geo.lat)) + COS( RADIANS(43.65)) * COS(RADIANS(d.geo.lat)) * COS(RADIANS(d.geo.lon) - RADIANS(-79.44))) * 6371) AS distance
FROM `dev` d
WHERE d.geo.lat IS NOT NULL AND d.geo.lat IS NOT MISSING
AND distance <= 50000
ORDER BY distance DESC
LIMIT 20000;

@vsr1 - How can I make the second query work? The first query works, but I am duplicating the geo formula, both in the SELECT clause and in the WHERE AND ... <= 50000 condition.

As you can see, in the second query I’m trying to use distance, but it breaks and fails.
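
A likely explanation (a sketch, not verified against this dataset): in N1QL the WHERE clause is evaluated before the SELECT projection, so an alias defined in the SELECT (like distance) is not visible in WHERE. A LET clause, which is processed after FROM and before WHERE, lets you define the expression once and reuse it in both places:

SELECT 
       d.env, 
       d.geo, 
       d.city,
       distance
FROM `dev` d
LET distance = ACOS(SIN(RADIANS(43.65)) * SIN(RADIANS(d.geo.lat)) + COS(RADIANS(43.65)) * COS(RADIANS(d.geo.lat)) * COS(RADIANS(d.geo.lon) - RADIANS(-79.44))) * 6371
WHERE d.geo.lat IS NOT NULL AND d.geo.lat IS NOT MISSING
AND distance <= 50000
ORDER BY distance DESC
LIMIT 20000;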