Calculating Stats on a Collection of Documents

#1

I need to calculate basic stats (count, mean and standard deviation) of a collection of documents in order to remove outliers.

If I were to run one view to return a collection of document IDs and another view to get the stats on those document IDs (using the built-in _stats reduce function), and one or more new documents were added to the database between running the first view and the second view, wouldn’t this make the collection of document IDs out of sync with the stats returned?
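
For reference, the built-in _stats reduce returns sum, count, min, max and sumsqr, so the mean and standard deviation have to be derived on the client. A minimal sketch of that arithmetic, assuming the reduce result has already been fetched into a plain dict (the `summarize` helper and the sample numbers are made up for illustration):

```python
import math

def summarize(stats):
    """Derive count, mean and population standard deviation from a _stats-style dict."""
    count = stats["count"]
    mean = stats["sum"] / count
    # Population variance: E[x^2] - (E[x])^2, clamped at 0 to absorb rounding error.
    variance = max(stats["sumsqr"] / count - mean ** 2, 0.0)
    return count, mean, math.sqrt(variance)

# Hypothetical reduce output for the values 2, 4, 4, 4, 5, 5, 7, 9.
example = {"sum": 40, "count": 8, "min": 2, "max": 9, "sumsqr": 232}
count, mean, stddev = summarize(example)
```

This gives the population standard deviation; for the sample version the variance would need scaling by count / (count - 1).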

Reading the Couchbase info on simulating transactions, one workaround might be to set a status attribute in the documents to “analysing” or whatever and then look for this status when running the view to get the stats.

Any thoughts?

#2

@shusseina one approach would be to emit the time the document was created as the key in the map function and then query only a specific time frame. This would allow you to “ignore” docs that have been added later, but it would not work if you are removing documents in between.
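
A rough sketch of the time-frame idea on the client side, assuming each view row has already been fetched as a dict with the creation time (epoch seconds) as `key` and the measured number as `value`. The row shape, the `snapshot` and `stats_and_ids` helpers and the sample data are all hypothetical:

```python
import statistics

def snapshot(rows, cutoff):
    """Keep only rows whose creation time (the view key) is at or before the cutoff."""
    return [r for r in rows if r["key"] <= cutoff]

def stats_and_ids(rows):
    """Compute the IDs and the stats from the same set of rows so they cannot drift apart."""
    values = [r["value"] for r in rows]
    return {
        "ids": [r["id"] for r in rows],
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.pstdev(values),
    }

rows = [
    {"id": "doc1", "key": 100, "value": 10.0},
    {"id": "doc2", "key": 200, "value": 14.0},
    {"id": "doc3", "key": 300, "value": 90.0},  # created after the cutoff
]
result = stats_and_ids(snapshot(rows, cutoff=250))
```

Using the same cutoff for both the ID query and the stats query is what keeps the two results consistent; the cutoff should be fixed once, before the first query runs.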

What you can also do is to apply some logic on the client side to filter documents that are not in both versions. Of course you can also use alternative approaches to add more “fields” to your documents that identify a processing state and run your queries based on those.
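
The client-side filtering could look something like the following, assuming both result sets are already available as lists of row dicts with an `id` field. The `in_both` helper and the sample rows are illustrative only:

```python
def in_both(id_rows, value_rows):
    """Return the value rows whose document ID also appears in the ID result set."""
    known = {r["id"] for r in id_rows}
    return [r for r in value_rows if r["id"] in known]

# First query: the document IDs seen at that moment.
id_rows = [{"id": "doc1"}, {"id": "doc2"}]
# Second query: per-document values; doc3 was added in between.
value_rows = [
    {"id": "doc1", "value": 10.0},
    {"id": "doc2", "value": 14.0},
    {"id": "doc3", "value": 90.0},
]
common = in_both(id_rows, value_rows)
```

Note that this only works if the second query returns per-document rows (no reduce); the stats would then be computed on the client from `common`.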

In future server versions there will be a better way to handle this, but it’s a little too early to share all the details.

Does that help for now? If you have some concrete code to look at, that would also be helpful.

#3

That would work in my case as the documents being analysed will only ever be inserted and never updated. However, for documents which may be updated one would need a “modify timestamp” to check against.

Am I right to think it is the responsibility of a Couchbase client to include and set a create timestamp at the time a document is created, or does a document’s metadata include such information?

I notice my documents contain a _sync JSON object with a time_saved attribute. I assume this is because the document resulted from a sync with another Couchbase database (a Couchbase Lite database in my case).

{
  "_sync": {
    "time_saved": "2015-01-13T02:15:47.861827851+13:00",
    "sequence": 316,
    "rev": "1-d5a68ab5f84ca0c96a3252e6ec82867e",
    "history": {

    }

#4

@shusseina okay, one thing up front: it is not a good idea to interfere with the Sync Gateway documents directly, see: http://developer.couchbase.com/mobile/develop/guides/sync-gateway/wcbs/bucket-shadowing/index.html

Also, the SDK does not add a timestamp of any kind to your document; it’s up to you to add it (but it’s quite trivial anyway). The metadata does not contain that kind of information: it contains a TTL if you’ve set one, but no “created” or “modified” timestamp. Note that Sync Gateway and Couchbase Lite do things differently, but check the bucket shadowing link above for more info.
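
To illustrate how small that change is, here is a sketch that stamps a document with a creation time before it is handed to whatever call stores it. The `with_created` helper and the `created` field name are assumptions, not part of any SDK:

```python
from datetime import datetime, timezone

def with_created(doc, now=None):
    """Return a copy of doc with a 'created' ISO 8601 UTC timestamp, keeping any existing one."""
    stamped = dict(doc)
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    # setdefault leaves an existing 'created' value untouched, so re-saving does not reset it.
    stamped.setdefault("created", timestamp)
    return stamped

fixed = datetime(2015, 1, 13, tzinfo=timezone.utc)
doc = with_created({"type": "reading", "value": 42}, now=fixed)
```

A “modified” timestamp would work the same way, except that it would be overwritten on every save rather than set once.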