Joining a document into another document's array

ahuynh · October 4, 2016, 5:23pm

Would love some advice on a bit of a problem I’m running into. I have two documents, one:

{
    "_type": "graph",
    "id": "graph::1",
    "name": "graph-1",
    "nodes": [{
        "name": "node-1",
        "data_ref": "data::1"
    }, {...}]
}

and the other where the “data_ref” key from above refers to this document and “graph_ref” from the one below refers to the document above.

{
    "id": "data::1"
    "graph_ref": "graph::1"
    "_type": "data",
    "name": "blah",
    "desc": "blah"
}

I’d like to be able to do two things.

Retrieve the entire graph document with the data documents joined into it.
Retrieve just the “nodes” field also with the data documents joined into it.

I’ve played around with N1QL and have a (slow) solution below:

SELECT graph.id, graph.name, data.data
FROM (
    SELECT nodes.*
    FROM default
    USE KEYS "graph::1"
    UNNEST nodes
    WHERE default._type = "graph"
) node
JOIN default data ON KEYS node.data_ref
WHERE data._type = "data"

Would love to get some feedback on the query above on the best way to optimize it for performance and how to get #1 without two separate requests, which is how it is currently implemented. Also open to changing the schema if that makes future queries easier/faster!

geraldss · October 4, 2016, 7:10pm

Both queries below.

SELECT graph.*, (SELECT data.* FROM data USE KEYS graph.nodes[*].data_ref) AS node_data 
FROM graph;

SELECT (SELECT data.* FROM data USE KEYS graph.nodes[*].data_ref) AS node_data
FROM graph;

ahuynh · October 4, 2016, 7:28pm

Thanks Gerald!

Some issues I ran into: The first query doesn’t quite join the “node_data” back into the “nodes” array and the second query grabs just the data documents (without any node data). I’d like to be able to have the node information alongside the data document since they’re related to each other if possible.

geraldss · October 4, 2016, 7:42pm

Ok.

SELECT g.*, ARRAY_AGG({"node":n, "data":d}) AS node_data
FROM graph AS g UNNEST g.nodes AS n JOIN data AS d ON KEYS n.data_ref
GROUP BY g;


SELECT ARRAY_AGG({"node":n, "data":d}) AS node_data
FROM graph AS g UNNEST g.nodes AS n JOIN data AS d ON KEYS n.data_ref
GROUP BY g;

ahuynh · October 4, 2016, 7:52pm

Thanks!

Should I have any particular indices to help with the query performance? Joining 375 data documents into the graph using that query took around 6 seconds.

geraldss · October 4, 2016, 8:37pm

How many graphs was that? Is it a single graph document?

geraldss · October 4, 2016, 8:38pm

And how long did the first two queries take, just for reference?

ahuynh · October 4, 2016, 9:31pm

This is a single graph document with a node array and with many node data documents.

Just ran the queries on a more realistic load, a single graph and 1,153 data documents:

The first two queries ran in under a second, ~800ms
The second two queries ran in a minute, ~1m

The node data documents are often very big so having them nested in the graph document itself reached the 20MB limit pretty quickly.

geraldss · October 6, 2016, 3:47am

You should use the first two queries. The data in the nodes array is also contained in the data documents, right? In that case, the first two queries should work. If not, we can take another look.

ahuynh · October 6, 2016, 5:55pm

There is some data in the nodes array that is not in the data documents. I’d be open to moving those fields in the data document if that’s the more ideal way to represent this structure.

geraldss · October 6, 2016, 6:05pm

Yes, all the data fields should be in the data documents.

You should also try this, and tell us the execution time.

CREATE INDEX idxg ON data( graph_ref );

SELECT * FROM data WHERE graph_ref = 'graph::1';

ahuynh · October 7, 2016, 9:32pm

For a graph with ~1000 nodes the query clocks in at around ~325ms

Status: success	
Elapsed: 325.31ms	
Execution: 325.13ms	
Result Count: 952	
Result Size: 3068333

On a larger graph with ~2400 nodes the query clocks in at around ~4s (albeit it has more data associated per node).

Status: success	
Elapsed: 4.15s
Execution: 4.15s	
Result Count: 2378
Result Size: 42538275

geraldss · October 9, 2016, 6:33pm

Ok. So what are your SLAs and expectations? I would also suggest using Couchbase 4.5.1 and using adequate hardware (RAM and cores). Finally, @vsr1 and @keshav_m can suggest if parallelism will help here.

vsr1 · October 9, 2016, 11:10pm

@ahuynh, Your results size are huge as @geraldss suggested use Couchbse 4.5.1 with “pretty” as false.

In addition to that could you please post the timings for @geraldss first suggestion based on that we can rewrite query to produce desired results

ahuynh · October 11, 2016, 3:47pm

Thanks @geraldss and @vsr1.

Our expectations would be inline or better than what our current MySQL database currently delivers. We’re setting up a test cluster to run benchmarks on a more realistic scale.

How would we set “pretty” to false? Is this a command line option when starting the server?

vsr1 · October 11, 2016, 3:58pm

You can turn off at query service level by using REST API call. Also you can turn off statement level as query parameter. I cbq shell Issue following command.
\set -pretty false;