Best Approach to Model Data and Lookup Data

ivoecpereira · September 2, 2016, 10:24am

Hello,

I’ve started researching Couchbase in the last days and I will be using Couchbase 4.5 EE on a production server very soon. From what I’ve been reading, the following are the main document lookup strategies:

Document Lookup Strategies

1) Lookup by Key. Totally preferred over all other methods and the data structure should be built with this in mind.
2) Lookup by View Queries using MapReduce. Should be used in second instance, when we need data but we don’t have the needed data keys. We should create a view for each feature with a MapReduce, and if we need to get specific data, instead of joins, we should get Object IDs with Reduce and do a MultiGET request with those Object IDs.
3) Lookup by N1QL. Should be used only in last, as it does not use primary nor secondary indexes. It tries to optimize the query but it is not as performant as first and second options. It should be used mainly for dynamic data search or when we aren’t able to get data with option 2).

At first I would like to know if my current assessments of each concept is correct.

Data Modelling

Taking into consideration the previous points, I’ve been thinking about the structure for my data.

As I’ve pointed out in 1) (Document Lookup Strategies), I am making the effort to create documents to be found based on document key.

I am currently building a social feed so let me present one of the concepts I am currently wondering. Let’s see the following example:

user_activity::134

{
  "_id": 134,
  "_type": "user_activity",
  "type": "user_follow",
  "related_id": "user::8",
  "created_by": "user::1",
  "updated_at": 1472741437,
  "created_at": 1472741410
}

Problem 1:
In the Activity we have an entry of what could be a social feed item. At first I see one problem here. I came from a RDBMS background, and my first thought is to relate data instead of putting it directly in the document.

Of course we have advantanges and disadvantages on both approaches, however for this case, as I want to have updated user data on the user I follow (and even my information of course) I guess this could be the best way.

If not, I would need to update every entry once the user changes his username for example. Let’s say is related to or created ~300.000 activities. At first it would be impossible to solve that with a single call to server, and background workers should be implemented to solve this (and add another layer of complexity). Am I thinking well?

Problem 2:
Activites may have likes. People may like activities on their social feed. Activities do show the total number of likes, so I changed activities to the following (ignore the fact that people like a “user_follow” action, it is just an example):

user_activity::134

{
  "_id": 134,
  "_type": "user_activity",
  "type": "user_follow",
  "related_id": "user::8",
  "total_likes": 27,
  "created_by": "user::1",
  "updated_at": 1472741437,
  "created_at": 1472741410
}

As I don’t need to present which users liked an activity I haven’t added that to the document. However, in social feed I need to inform the user if he has liked the post or not (to present an active button or not). So as 1) (Document Lookup Strategies) is preferred, I thought about having something like the following:

user::1737::user_activity::134::user_activity_like

{
  "_type": "user_activity_like",
  "user_id": 1737,
  "user_activity_id": 134,
  "created_at": 1472741416
}

At this point, I already have the user ID that I want to check if has liked the activity or not (the logged-in user) and the user_activity the user is seeing.

This has the disadvantage of having to do another GET request to check if this value exists or not (the user has liked or not), however it should be extremly performant as it is a direct key lookup.

In the future if I do need to see what users like X activity of what activities are liked by Y user, I can use approach 2) (Document Lookup Strategies).

Am I correct on this approach?

Data Parsing

At a third stage, I’ve been noticing that people do relate data by specifying the key of another document. That seems good to me, however I am a little curious of how people do build their outputs to webservers based on that. Let’s take the following example:

user::184

{
  "_id": 184,
  "_type": "user",
  "name": "John Perez",
  "best_friends": [
    "user::184",
    "user::9062",
    "user::123",
  ],
  "created_at": 1472741416
}

Seeing this example, if we want to grab the best friends user IDs what would you do?

It doesn’t make sense to me the need to join each document to get user IDs as we already have them in the key (like user::184). Do you just get the integer in the key? Or do you eventually do a join to guarantee the consistency of data?

Is it just a matter of personal choice?

Sorry for the long questions, but I’ve tried to be the most clear possible to prevent any doubts on my approaches.

Please let me know your feedback.

Thanks

manuel.hurtado · September 5, 2016, 4:41pm

Hello,

You are right in your first point: lookup by key is the more performant way to access Couchbase, and should be used if possible. Even designing your data with this in mind is a good idea.

When you need to access data without knowing the key, you have to make queries. N1QL is the more flexible way to do it. N1QL uses indexes to accelerate the query execution, and have nice tricks to make things really fast. Take some time to study:

[Memory Optimized Indexes] (http://developer.couchbase.com/documentation/server/current/indexes/gsi-for-n1ql.html#concept_bb5_khb_ys__memopt-gsi)
Covering Indexes
Understanding Index Scans in Couchbase N1QL Query

Views is another option for creating indexes and making queries. Views are not as flexible as N1QL for making queries: you need to know your queries in advance, and create custom MapReduce function in Javascript. The good thing about indexes is that it create incremental indexes, so are indicated typically in complex aggregation scenarios. As downside, view execution are scattered among all the data nodes, and access the disks to scan the indexes, so can be slower than N1QL.

Problem 1:

In the scenario you mention, it have sense to maintain relationships pointing to the desired document. This is the easy way to maintain consistency. However, you have to consider:

Data access pattern: When you access your user_activity document, would you need always access to the whole referenced object? If this is the case, you will save a lot of data access by embedding the whole referenced documents. Otherwise, referencing is preferred, and you save data transferred through the network.

Data writing and consistency: How often will be the referenced documents modified? How much effort will you need to maintain consistency bay hand, updating all the referenced products? Depending on your answer referencing can be better.

Problem 2:

I thing you are right in using document lookup for this.
Do not worry about extra access by key. It will be extremely fast.

A different approach can be to maintain an array with the “liked” items for each users.

Data Parsing

I do not understand what are you trying to achieve. Can you elaborate?

From your example, you can directly get the IDs by doing a single get and then referencing
[document].best_friends

If you need the detail of the best friends, you can do it easily with a N1QL query.

ivoecpereira · September 6, 2016, 1:37pm

Hello Manuel,

Thanks a lot for your feedback.

I am gonna give a read to the mentioned docs. I am sure they can help a lot structuring the database.

Thanks for clarification on N1QL being faster than lookup by using views. Didn’t known they would go through all data nodes and would be accessing disks to scan the indexes. That explains a lot.

Regarding Problem 1: Yes, possibly I would need to access the whole referenced object a lot of times, however would an update to something like ~300.000 (just an example, it could be more, it could be less) be easily done (in no time)? I mean, if we need to replace the entire JSON, doing a N1QL query to search for all activities from username “test_example”, and then updating the whole document, wouldn’t it be a little slower (even if indexes are setup properly)?

Problem 2: Thanks. As I don’t need the information about who liked what, I guess I’ll keep it simple to a direct document key lookup.

Another question: Let’s say I would start using N1QL for more complicated queries where I can’t do direct key lookups. What would you return? The whole object or just the object IDs and do direct lookup on each one (as it is supposed to be a lot faster)?

At least for view queries - if I am not mistaken - this was the recommended approach.

Data Parsing: Regarding this question:

Let’s pretend we get the output just from best_friends key. We will have the following:

Let’s say we want to create a link for each user best friend. Let’s say we want to pass the user ID of each user to a link. E.g.

We would need to get the 184 from “user::184”, correct? Usually how do people handle that? Just build a function to get IDs from keys? It would be something easy to build, however I am just curious on how people are handling this.

Thanks a lot!

manuel.hurtado · September 8, 2016, 5:20pm

Hello

if we need to replace the entire JSON, doing a N1QL query to search for all activities from username “test_example”, and then updating the whole document, wouldn’t it be a little slower (even if indexes are setup properly)?

Three things can help here:

Create an index on the activiy.username field. Remember that Couchbase can create indexes on arrays.
Check the Subdocument API feature: you can update only a certain part of the document. This way you save lot of bandwith and parsing time by moving only a part of the document.
A pure N1QL approach may be done by sing massive update

Another question: Let’s say I would start using N1QL for more complicated queries where I can’t do direct key lookups. What would you return? The whole object or just the object IDs and do direct lookup on each one (as it is supposed to be a lot faster)?

It depends, you will typically return only the required fields from the N1QL query.
So, if you only need attributes “name” and “created_at”, you can use:

SELECT name,created_at FROM my_bucket WHERE …

Better than
SELECT * FROM my_bucket WHERE …

You must also know about “USE KEYS” N1QL structure, which specify directly which documents to get if you know the KEYS. This is very fast.

We would need to get the 184 from “user::184”, correct? Usually how do people handle that? Just build a function to get IDs from keys? It would be something easy to build, however I am just curious on how people are handling this.

You can do that with N1QL with String functions. The following will do the trick:

select SPLIT(“user::184”, “::”)[1];

returns “184”

At this point, I recommend you to access the public N1QL tutorial:
http://query.pub.couchbase.com/tutorial
This is the best starting point for mastering N1QL

ivoecpereira · September 9, 2016, 3:22pm

Thanks a lot for all the clarifications @manuel.hurtado!

You helped a lot!

synesty · November 29, 2017, 11:38pm

Are these things about lookup documents still relevant for CB 5.0?
We think yes. We basically need to lookup a main-document by an additional field. We found that doing that lookup with N1QL is too much overhead and very slow. So we currently look into designing it using Lookup documents with the additional complications in the app-layer which needs to maintain these lookup documents.

Are there any new recommendations one year after this inital question? Maybe anything where you could leverage GSI for single document lookups by an additional GSI-indexed field without the whole N1QL overhead?