Building index before inserting documents

Poofcakes · November 23, 2017, 1:58pm

I was wondering if it’s better to build GSI indexes before inserting the documents or after inserting the documents?

vsr1 · November 23, 2017, 5:20pm

Insert the documents first, create required indexes in deferred mode and build all of them at once. https://developer.couchbase.com/documentation/server/current/n1ql/n1ql-language-reference/build-index.html

cc @deepkaran.salooja

Poofcakes · November 23, 2017, 7:41pm

Is it normal for that to take multiple days to finish though? It feels like I’m doing something wrong.

jeelan.poola · November 24, 2017, 2:46am

It is typical and a better practice to build index after inserting documents. Could you please share your full CREATE INDEX and BUILD INDEX statements? How many indexes are you attempting to build? Which CB versions are you using, storage mode of indexes? Could you please share cbcollectinfo as well?

Poofcakes · November 24, 2017, 1:13pm

Yeah I found that if I build indexes before importing, that my RAM goes through the roof and I get time-out errors when I try inserting the documents afterwards. So I won’t do that anymore. Though perhaps when my indexes are done building after inserting the documents, maybe the RAM will be through the roof too.

I use version 5.0.0-3519-1 on a virtual cloud running on ubuntu 16, 32 GB RAM. I got one bucket called ‘samples’. I imported two “runs”, one run is about 14 million docs and 18 GB, so 2 runs is 28 million docs and 36 GB. I only got 1 node, and I don’t currently have access to more servers. This is just a ‘test’ environment. I need to import a lot more “runs” on the real end environment.

I’m attempting to build 28 covering indexes (one for every key in the documents). I don’t know if this is the right way to go about indexing, but if I don’t do it and I query something then the query just times out instead of giving me a result.

I do this first:

create index idx_sample_name on samples(Sample_Name) USING GSI WITH {“defer_build”:true};
create index idx_acc on samples(acc) USING GSI WITH {“defer_build”:true};
create index idx_j_flag on samples(J_flag) USING GSI WITH {“defer_build”:true};
etc etc

and then build them all at once
BUILD INDEX ON samples(idx_acc, idx_beforeMID, idx_MID, idx_afterMID, idx_readingframe, idx_cdr3pep, idx_cdr3nuc, idx_cdr3_qual_min, idx_cdr3_qual_max, idx_cdr3_qual_avg, idx_cdr3_qual, idx_V_flag, idx_V_gene, ifx_J_flag, idx_J_gene, idx_readingframe_seq, idx_seq, idx_pep, idx_qual, idx_V_sub, idx_J_sub, idx_V_main, idx_nr_v_mains, idx_nr_v_subs, idx_nr_v_alleles, idx_nr_j_subs, idx_nr_j_alleles, idx_sample_name, idx_exp_name) USING GSI;

vsr1 · November 24, 2017, 4:49pm

Check out http://blog.couchbase.com/wp-content/uploads/2017/03/N1QL-A-Practical-Guide-v2.pdf how to create right index for query

steeve.descarpentrie · November 24, 2017, 6:06pm

Hi @Poofcakes,

In my opinion, it’s a mistake to build 28 indexes. You dont need an index for every key.

Try to add 2 or 3 index on the most important keys, and make some querries, mixing keys in the indexes, and keys out of them. If some querries are much slower than others, try to identify which key could help, and add a specific index.

Too much indexes is worst than no index. Every time you will try to add a document, couchbase will need to update the indexes, all of them, it’s too much work to do.

Give a feed back when you want, I’m interesting to know how couchbase react with 28 million docs, and the right indexes.

Regards,

Steeve