Large number of GSIs with continual data ingestion

Hello,

We’re currently building a bulk data product using N1QL and have run into some limitations.

We have around 200 GB of data spread over roughly 60 million documents. We need to subdivide the data into ~600 parts for an API. The data set is growing, and we expect even more additions going forward. We continually ingest data for all 600 feeds (every 15-60 minutes, depending on the feed).

We first tried a few large indexes for this, but it was too slow. Ultimately we switched to giving each part its own set of indexes, but that has left us with over 3,000 GSIs to manage.

We’re reaching out to see if there is something we are missing, or if there are other potential solutions to scaling this up.

@csjoblom, 3,000 GSI indexes are not a good idea. It would be better to create something like 10 indexes, each restricted to a subset of the parts using a WHERE clause.
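As a rough illustration (not from the thread), a handful of partial indexes along those lines might look like the sketch below, assuming documents carry a numeric `part_id` and a timestamp `updated_at` in a bucket called `feeds` (all names are made up for the example):

```sql
-- Illustrative only: each partial index covers a contiguous range of parts,
-- so roughly 10 of these replace the ~3,000 per-part indexes.
CREATE INDEX idx_parts_001_060
ON `feeds`(part_id, updated_at)
WHERE part_id BETWEEN 1 AND 60;

CREATE INDEX idx_parts_061_120
ON `feeds`(part_id, updated_at)
WHERE part_id BETWEEN 61 AND 120;

-- ...continue in the same pattern up to part 600. Queries should include a
-- predicate the planner can match against the index WHERE clause, e.g.
--   WHERE part_id BETWEEN 1 AND 60 AND part_id = 42
```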

If you are using Enterprise Edition, the storage engine should be able to handle 60M documents even with a single index, provided there are enough resources on the box.
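Purely as a sketch of that single-index alternative (again with made-up bucket and field names), one composite index with the part identifier as its leading key lets a query for any one of the ~600 parts do a narrow range scan:

```sql
-- One index over all parts; part_id as the leading key keeps per-part
-- scans cheap even though the index itself is large.
CREATE INDEX idx_feeds_all
ON `feeds`(part_id, updated_at);

-- Example query for a single part/feed.
SELECT f.*
FROM `feeds` AS f
WHERE f.part_id = 42
  AND f.updated_at >= "2020-01-01T00:00:00Z";
```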

For CE, you can experiment with how much a single index can hold given the resources available (memory quota, CPU, etc.).

We are on Enterprise. We'll beef up our box and try some larger indexes instead.

Thanks for the quick response!