Reading a subset of documents using Couchbase View

query

#1

Hi. I have around 25M documents in my cluster. I need to read 1M documents at a time, without any specific criterion. I don’t have access to the keys, so I need to create a view that emits documents until a counter reaches 1M.

I have written a Map function inside which I am trying to create a static variable, but JS doesn’t support static variables. I am not sure how to do this operation. The map function which I have written is just to return 1000 documents and it is full of errors. Can someone help me with this functionality?

function (doc, meta) {
  value = foo();
  if(value < 1000)
  {
    emit(meta.id, null);
  }else{
    return;       
  }
}

function incrementor(){
  if(typeof incrementor.counter == 'undefined'){
       incrementor.counter = 0; 
  }
  
  return ++incrementor.counter;
}

#2

Hi @nagaraj.irock, your map function can’t have an outer context. Why don’t you just emit(meta.id, null); straight away and then use the limit at query time to limit your results to the desired number of documents?

There is no built-in way to cap the number of documents that get indexed, other than defining some criterion yourself. For example, imagine your doc has a “counter” field like {"counter": 400}; in your map function you could then do if (doc.counter <= 1000) { emit()… }


#3

How do I limit it straight away? Can I achieve that using ViewQuery? Right now I am using

            ViewQuery query = ViewQuery.from("LCDD", "findAllUsers").stale(Stale.FALSE);
            ViewResult result = theBucket.query(query);

to retrieve the documents.

I think the second method would work, but I don’t have the control over the fields in the document.


#4

After the .stale(...) call you can chain a .limit(1000).


#5

Ah! Okay. Too bad I didn’t see the available options. Will that return 1000 random documents from the database?


#6

It will return the first 1000 entries stored in the index, and as long as the index is not mutated, the next query will return the same ones. There is also a “skip” option in addition to limit, which you can use with a random number to generate a different offset every time. Just make sure that skip + limit <= docs_in_index, otherwise your result set won’t contain 1000 entries.
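The skip + limit <= docs_in_index constraint can be sketched as a small helper. This is only an illustration: RandomWindow and randomSkip are hypothetical names, and the helper assumes you already know (or can estimate) how many rows the index holds.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not part of the Couchbase SDK.
final class RandomWindow {

    /**
     * Picks a random skip value such that skip + limit <= docsInIndex.
     * If the index holds no more rows than the limit, the only sensible offset is 0.
     */
    static int randomSkip(int docsInIndex, int limit) {
        if (docsInIndex < 0 || limit < 0) {
            throw new IllegalArgumentException("docsInIndex and limit must be non-negative");
        }
        if (limit >= docsInIndex) {
            return 0;
        }
        // nextInt's bound is exclusive, so +1 makes skip == docsInIndex - limit possible.
        return ThreadLocalRandom.current().nextInt(docsInIndex - limit + 1);
    }

    public static void main(String[] args) {
        int limit = 1_000;
        int skip = randomSkip(25_000_000, limit);
        System.out.println("skip=" + skip + ", limit=" + limit);
    }
}
```

The returned value would then be passed to the query's skip option next to the limit from the earlier reply.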


#7

On a side note, skip with views still forces the view query to visit (and then discard) the skipped entries, so using it repeatedly with an incrementing skip isn’t very performant (e.g. you could be tempted to do that for paging, but it’s not the best way).


#8

I can store the id of the last document in each batch and pass it to the next iteration as the parameter for .startKey(String Id), right? Will that improve the performance?


#9

Yes, almost, but depending on the emit of the view it could need an extra step, which is to also use startKeyDocId(lastDocumentIdInPage).

Sometimes your view will emit several rows that share the same view key. Their document ids are still unique, so you can combine both pieces of information to restart correctly at the next page. See this blog post on what startkey_docid is all about: http://blog.couchbase.com/startkeydocid-behaviour
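To make the key plus document id restart concrete, here is a minimal in-memory sketch. It does not call the Couchbase SDK: Row, page and readAll are hypothetical names, a sorted List stands in for the view index, and String.compareTo stands in for the view's real collation (which is an assumption, the actual ordering rules differ). The sketch also assumes the start position is inclusive and therefore fetches pageSize + 1 rows, using the extra row as the start of the next page; that is one common way to do it, not something stated in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// In-memory model of paging with a (startKey, startKeyDocId) pair.
final class KeysetPagingSketch {

    /** One index row: the emitted view key plus the id of the document that emitted it. */
    static final class Row {
        final String key;
        final String docId;
        Row(String key, String docId) { this.key = key; this.docId = docId; }
    }

    /** Returns up to 'limit' rows at or after (startKey, startKeyDocId); null means "no bound". */
    static List<Row> page(List<Row> sortedIndex, String startKey, String startKeyDocId, int limit) {
        List<Row> out = new ArrayList<>();
        for (Row r : sortedIndex) {
            if (out.size() == limit) break;
            if (startKey != null) {
                int c = r.key.compareTo(startKey);
                boolean atOrAfter = c > 0
                        || (c == 0 && (startKeyDocId == null || r.docId.compareTo(startKeyDocId) >= 0));
                if (!atOrAfter) continue;
            }
            out.add(r);
        }
        return out;
    }

    /** Walks the whole index page by page and returns the document ids in order. */
    static List<String> readAll(List<Row> sortedIndex, int pageSize) {
        List<String> ids = new ArrayList<>();
        String startKey = null;
        String startDocId = null;
        while (true) {
            // Ask for one extra row; it is not consumed now, it marks where the next page starts.
            List<Row> rows = page(sortedIndex, startKey, startDocId, pageSize + 1);
            int take = Math.min(pageSize, rows.size());
            for (int i = 0; i < take; i++) ids.add(rows.get(i).docId);
            if (rows.size() <= pageSize) return ids;
            Row next = rows.get(pageSize);
            startKey = next.key;
            startDocId = next.docId;
        }
    }

    public static void main(String[] args) {
        List<Row> index = Arrays.asList(
                new Row("a", "d1"), new Row("a", "d2"), new Row("a", "d3"),
                new Row("b", "d4"), new Row("b", "d5"));
        System.out.println(readAll(index, 2));
    }
}
```

With only a start key and no document id, a page that ends in the middle of a run of identical keys would restart at the first row of that run, which is exactly the case the document id is there to resolve.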


#10

Hi, are these operations thread safe? Can I use multiple threads to read from different blocks of data? I have used a multi-threaded approach for insertion, but I am not sure whether it will work for reads as well, given that I have to use .startKey and .startKeyDocId. Right now I am able to read the documents using a single thread (i.e., the main process) without any problem, but the throughput is just ~1500 ops/sec, which is low for my application. Can I improve the performance by using threads?


#11

The view requests are thread safe, but not synchronized across requests. What I mean by that is that you can use the client and its methods totally fine from multiple threads, but if you want to pass the startKey(docId) around across threads you need to handle that yourself in a thread-safe manner (maybe a volatile String, but I don’t know your exact application semantics).
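One possible way to handle the shared position, sketched without the Couchbase SDK: a TreeSet stands in for a view keyed on the document id, and SharedCursorSketch, claimNextPage and readWithThreads are hypothetical names. A volatile field alone makes the latest value visible to other threads, but "read the cursor, take a page, write the new cursor" has to happen as one atomic step, so this sketch guards it with a synchronized method and lets the workers process their pages outside the lock.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// In-memory model of several threads sharing one paging cursor.
final class SharedCursorSketch {
    private final NavigableSet<String> index; // stand-in for a view keyed on the document id
    private final int pageSize;
    private String lastIdSeen = null;         // guarded by "this"

    SharedCursorSketch(Collection<String> ids, int pageSize) {
        this.index = new TreeSet<>(ids);
        this.pageSize = pageSize;
    }

    /** Atomically reads the cursor, takes the next page, and advances the cursor. */
    synchronized List<String> claimNextPage() {
        Iterable<String> remaining =
                (lastIdSeen == null) ? index : index.tailSet(lastIdSeen, false);
        List<String> page = new ArrayList<>();
        for (String id : remaining) {
            if (page.size() == pageSize) break;
            page.add(id);
        }
        if (!page.isEmpty()) lastIdSeen = page.get(page.size() - 1);
        return page;
    }

    /** Starts the workers, waits until the index is drained, and returns every id processed. */
    static List<String> readWithThreads(Collection<String> ids, int pageSize, int threads) {
        SharedCursorSketch cursor = new SharedCursorSketch(ids, pageSize);
        List<String> processed = Collections.synchronizedList(new ArrayList<>());
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread w = new Thread(() -> {
                List<String> page;
                while (!(page = cursor.claimNextPage()).isEmpty()) {
                    processed.addAll(page); // real code would fetch and handle the documents here
                }
            });
            workers.add(w);
            w.start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return processed;
    }
}
```

In this model the page claims are serialized, so any speed-up would come from the per-document work that runs outside the lock. Whether that matches a given application depends on where its time is actually spent.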


#12

Yes. I was able to put in my own logic in order to make the threads work without any problem. Thanks a lot!