Indexing Arrays

gizmo74 · February 3, 2016, 8:41am

Hi,

We have a personalised news site. A user can subscribe to editors and tags and gets their own newsfeed. We did this as “fan out on write”, so every user has their own Couchbase object with an array of IDs. If a new story/image is published, its ID is attached to each user that subscribed to that topic.

The Object looks like this:

{
  news: [1, 2, 3, 4, 5, 6, …],
  images: [12, 3, 4, 5, 6, …],
  name: "test",
  … some more metadata
}

If an editor deletes a story/image, it should also be removed from the Couchbase object. So we have to know which users are subscribed to that story/image. For this we created a view (because N1QL can’t index arrays) like this:

// Emit one index row per story ID in the array, pointing back to the user document.
for (var x = 0, l = doc.stories.length; x < l; x++) {
  var idstring = doc.stories[x];
  emit(parseInt(idstring, 10), doc.id);
}

This works so far… but: indexing is very slow and can take hours. If, for example, 100’000 users are subscribed to a tag and each object stores the newest 1’000 stories and 1’000 images, then publishing one new item to that tag updates 100’000 objects, and that creates 100’000 * 2’000 index inserts.
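
To put a number on that, here is a quick back-of-the-envelope sketch. It assumes that every updated user object has all of its array entries re-emitted into the index, which is how the figures above are derived; the function name is only for illustration.

```javascript
// Rough estimate of index inserts caused by one fan-out write.
// Assumption: each updated document re-emits every ID in its arrays.
function estimateIndexInserts(subscribers, idsPerDocument) {
  return subscribers * idsPerDocument;
}

const subscribers = 100000;          // users subscribed to the tag
const idsPerDocument = 1000 + 1000;  // newest stories + newest images
const totalInserts = estimateIndexInserts(subscribers, idsPerDocument);

console.log(totalInserts); // 200000000
```

So a single publish to a popular tag would mean roughly 200 million index inserts under that assumption.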

Any idea to make that better?

Thanks,
Pascal

cihangirb · February 3, 2016, 11:50pm

We are getting ready to showcase array indexing in the preview coming up in a few weeks. Would you be able to wait 2 more weeks to try this out? That will be the easiest way to make this work smoothly for you.

gizmo74 · February 4, 2016, 6:23am

That sounds good, thanks. No problem to wait a few weeks. It has worked for a year now, but the more users we have, the longer indexing the view takes.

How much faster do you expect the new indexing to be compared to indexing views? If I understand correctly, it still has to re-index the whole document if it’s updated? So one set of a document containing 2’000 IDs in an array is still 2’000 inserts in the index, right?

Another idea could be that we remove the index for this completely and run a nightly “cleanup job” that loops over each document and removes the IDs that were deleted during the day. Getting 1 million objects, editing the JSON and setting them again should be done in a few minutes, I think.
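
A minimal sketch of the per-document step of such a cleanup job, assuming the object layout shown earlier (news and images arrays). The helper name and the sets of deleted IDs are hypothetical, and the get/set calls against Couchbase are deliberately left out:

```javascript
// Hypothetical helper for a nightly cleanup job: strips deleted IDs from one
// user feed document. Fetching and storing the document is not shown here.
function removeDeletedIds(doc, deletedNews, deletedImages) {
  const news = doc.news.filter((id) => !deletedNews.has(id));
  const images = doc.images.filter((id) => !deletedImages.has(id));
  const changed =
    news.length !== doc.news.length || images.length !== doc.images.length;
  // Only build a new document when something was actually removed,
  // so unchanged documents do not need to be written back.
  return { changed, doc: changed ? { ...doc, news, images } : doc };
}

const feed = { news: [1, 2, 3, 4], images: [12, 3, 4], name: "test" };
const cleaned = removeDeletedIds(feed, new Set([2, 4]), new Set([12]));
console.log(cleaned.changed, cleaned.doc.news, cleaned.doc.images);
```

The changed flag is there so the job could skip the set for documents that did not contain any deleted ID.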

cihangirb · February 4, 2016, 4:46pm

Global indexes are getting a few important updates in the release we will preview, so I expect both index maintenance and query performance to be better, but I don’t have a number yet for how much better. However, we have data showing that in some cases global indexes in 4.1 are an order of magnitude better in latency than Map/Reduce views. Not every case sees this, but it does happen.

That is correct. We won’t have this fully optimized in the developer preview coming in the next few weeks, but in the future we won’t have to do that. As an aside, with the new preview we are lighting up a new storage option that does index maintenance completely in memory. These are called memory-optimized indexes, and they are an additional option that you can enable for global indexes. With memory-optimized indexes, maintenance runs at the speed of memory, which means it is much lighter weight and lower latency.

thanks
-cihan

cihangirb · February 4, 2016, 4:47pm

One question about the 2000 array items you mentioned in a single document: do you know the total size of the items in the array you’d like to index? Is it over 8K total?
thanks

gizmo74 · February 5, 2016, 8:59am

Hi Cihan,

Thanks for your detailed reply. We index an array of items like this:

story_15741080

So 2’000 * 14 characters = 28K. Too much?
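
The same calculation as a small sketch, assuming one byte per character (the keys are plain ASCII) and reading the 8K from the question as 8 * 1024 bytes:

```javascript
// Size estimate for the indexed array entries of one document.
// Assumption: ASCII keys, so one character is one byte.
const sampleKey = "story_15741080";
const itemsPerArray = 2000;

const bytesPerKey = sampleKey.length;
const totalBytes = bytesPerKey * itemsPerArray;
const overEightK = totalBytes > 8 * 1024;

console.log(bytesPerKey, totalBytes, overEightK); // 14 28000 true
```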

cihangirb · February 5, 2016, 3:47pm

Thanks. We have a configurable setting in the system that can be tuned. I just wanted to get an idea of what the total size is in your case.
thanks
-cihan