What is the impact of secondary indexes on data ingestion rates?

dbsc · August 9, 2016, 10:10pm

Hi, Could you help me understand the impact on data ingestion rates, when multiple secondary indexes have been created for a table/bucket? (Data inserts and updates). Especially in a scenario where there is large number of inserts and updates. Let’s say the index is on attribute A and when the value of attribute A gets updated - then there will be index maintenance for which I presume the system needs to read the base table (assuming it’s not cached in memory, for this situation). Does it not add overhead?

(Even if there is Multi dimension scaling, and indexes can be on separate nodes from data - when the index value is updated - to maintain the index, you still need to consult the base table?)

Hence the question is - What is the impact due to secondary indexes, on data ingestion rates? in scenarios where the indexed values are updated as well as part of the operational application.

jliang · August 10, 2016, 2:16am

The index itself maintain additional data for the “base table”, specifically for index maintenance. For “base table”, here I refer to previous state of the index before mutation is applied, and I hope that’s what you refer to as well. Therefore, when a mutation arrives, it simply forward the mutation to the indexer, without incurring any additional I/O on the latency-sensitive write-path within data service. The indexer will then consult its “base table” and determine how to update the index. At this point, indexer may incur additional I/O, but that’s all separate from the data ingestion path. The whole point is to minimize/eliminate overhead form the performance-sensitive data ingestion path, and keep the work of index maintenance within indexer itself.

dbsc · August 11, 2016, 5:35pm

Thanks for your response!

Couple of follow up questions

How much additional resources are needed for the index service. From the description it seems that index service maintains the previous state of the base table before mutation. So is it a complete copy of my base table (assuming index contains most of my base table columns) ? So I need to have double the number of nodes, half for base table and half for the index service ?
If my load is primarily high speed data ingestion - then how does the index maintenance keep up with this load ? Does it slow down the index update (because index maintenance needs to do additional I/O as you mentioned). Do these additional I/Os compete with the I/O needed for the incoming data ingestion?
What is the best practice for sizing the resources for index service as compared to data service (for the basetable) ?
Is there any benchmark report / numbers which can show the impact on index on ingestion rate ? For my usecase I am worried that as I create more indexes the ingestion performance may degrade.

dbsc · August 15, 2016, 5:28pm

Hi @jliang - Looking forward to hear on this. I need this help to understand the performance characteristics and see applicability to my high ingestion rate usecase.

dbsc · August 22, 2016, 5:51pm

@jliang Looking for some insight into this, would appreciate an update