Limit on number of bytes or items returned from the Reduce function? or, How do I do a UNION DISTINCT of array elements?


#1

Good day, community. I had been posting in Topic 3825 (“Rebalance is stuck”) because I thought we were experiencing an offshoot of that problem, but it turns out we have a more fundamental problem with how we’re using Couchbase, so I’d like to formally start a fresh topic and ask for assistance. Essentially, we’re trying to do a UNION DISTINCT on the elements of arrays within documents, returning from the Reduce function an array of short strings. It works fine for 10 or 100 elements but completely chokes at 1000.

Picture a bunch of customers, and we want to restrict each one to downloading only a certain limited number of some resource per 24-hour sliding time window. (Could be electronic documents, data rows in a database, electronically-purchased songs, or anything where we want to keep track of the items downloaded and not let the customer have more than X of them in a day… after that number, they have to wait until tomorrow, or at least until some number of hours go by.) But to be polite to the customer, we don’t want to double-count items if they download them and then re-download them because their browser crashed, or they wanted to open the query in a new window to print it, or their Internet failed 98% of the way through the progress bar. That is, if they download a specific set of 15 items, and then download those same 15 again, they still have a total of only 15 for the day, not 30. And if they download a set of 25 items, and then download a different set of 10 items that includes 5 overlapping with the first set, then the total so far for the day is 30, not 35. What matters is the set of all unique keys of all the items downloaded by that customer in a 24-hour window. I hope this makes sense and seems like a reasonable thing to want to do in Couchbase or in any storage / query engine.

So each of our little source documents is inserted like this…

{
   "actor": "kou:1073745525",
   "action": "songDownloads",
   "timestamp": 1433688093,
   "count": 0,
   "keys": [
       "-2143267641",
       "-2143270000",
       "-2143026020",
       "-2143031946"
   ]
}

“actor” is basically a user ID (the human / customer who’s asking for the items). “action” is the category of thing the customer is downloading — we want to make this system multi-tenant, so that the same general “how many things in a day” facility can be used internally across different applications. “timestamp” is obviously when the user pulled the items. Then there are two varieties of user limiting available: a plain “count”, which is just an integer, and “keys”, which is a list (array) of unique keys. We want to make this facility flexible enough to do either the unique-keys-based limiting that I’m posting about here, or a more traditional rate-limiting system where the only thing that matters is the sum of all the counts of hits per day, depending on the application. So an inserted document always either fills in the “count” field with a positive integer and leaves “keys” as an empty array, or leaves “count” as zero and fills in “keys” with an array of one or more keys.
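To make the two variants concrete, here is a quick sketch (plain JavaScript, with made-up field values) of what each kind of inserted document would look like:

```javascript
// Sketch of the two document variants described above (values are made up).

// Variant 1: plain rate limiting -- "count" is set, "keys" stays empty.
var countDoc = {
    actor: "kou:1073745525",
    action: "reportViews",
    timestamp: 1433688093,
    count: 7,
    keys: []
};

// Variant 2: unique-key limiting -- "count" is zero, "keys" holds item keys.
var keysDoc = {
    actor: "kou:1073745525",
    action: "songDownloads",
    timestamp: 1433688093,
    count: 0,
    keys: ["-2143267641", "-2143270000"]
};

// Exactly one of the two limiting mechanisms is populated per document.
var isCountDoc = countDoc.count > 0 && countDoc.keys.length === 0;
var isKeysDoc = keysDoc.count === 0 && keysDoc.keys.length > 0;
```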

The regular “count” thing we already have working… we can do a nice MapReduce view and get a sum of all the counts of that user over a time window, and it’s quite fast. That’s not the problem. What we’re trying to do is return the set-wise Union of all of the “keys” arrays of all the matching documents (i.e., for one “actor”, for one “action”, and for all "timestamp"s over a certain time period), and also make it Distinct, i.e., eliminate duplicates from the list — as I said in the above example, 25 documents plus 10 more documents with 5 repeats equals a total of only 30. So here is our map/reduce function pair:

function (doc, meta) {

    // emit one row per source document, keyed for range queries
    emit([ doc.action, doc.actor, doc.timestamp ], { keys: doc.keys, count: doc.count });
}


function( keys, values, rereduce ) {

    var keys = [];
    var count = 0;
    var keyMap = {};

    // loop through all values
    for (var valueIndex = 0; valueIndex < values.length; valueIndex++) {

        // access the value
        var value = values[valueIndex];

        // add the counts
        count += value.count;

        // compute the union of the keys
        for (var keyIndex = 0; keyIndex < value.keys.length; keyIndex++) {

            // access the key
            var key = value.keys[keyIndex];

            // add the key if not already added
            if (!keyMap[key]) {
                keys.push(key);
                keyMap[key] = true;
            }
        }
    }

    // return the combined result
    return { keys: keys, count: count };
}

I apologize on behalf of my developer for using the same variable name for different purposes — “keys” means either the name of the original field in the document, or the passed parameter, or a temporary local-scope array used within the reduce function simply to accumulate the uniquified list of keys before returning them. Sorry about that. Anyway, basically what it’s doing is going through each incoming document (the “values” array) and, for each one, going through that document’s “keys” array and, for each key, putting it into a temp array for output unless a small associative array says that it’s already been inserted there. So the associative array (“keyMap”) is just the duplicate-detecting mechanism, while the “keys” array becomes the new output of the reduce step.
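For anyone who wants to sanity-check the logic outside Couchbase, here is the same union-distinct reduce as a standalone sketch, exercised against the 25-plus-10-with-5-overlapping example from above (sample data is made up):

```javascript
// Standalone sketch of the union-distinct reduce logic, runnable outside
// Couchbase for sanity-checking.
function unionDistinctReduce(keysParm, values, rereduce) {
    var keys = [];
    var count = 0;
    var keyMap = {};

    for (var v = 0; v < values.length; v++) {
        count += values[v].count;
        for (var k = 0; k < values[v].keys.length; k++) {
            var key = values[v].keys[k];
            if (!keyMap[key]) {     // associative array detects duplicates
                keys.push(key);
                keyMap[key] = true;
            }
        }
    }
    return { keys: keys, count: count };
}

// 25 keys in the first "document", 10 in the second, 5 of which overlap:
var first = [], second = [];
for (var i = 0; i < 25; i++) first.push("key" + i);
for (var j = 20; j < 30; j++) second.push("key" + j);   // key20..key24 overlap

var result = unionDistinctReduce(null, [
    { keys: first, count: 0 },
    { keys: second, count: 0 }
], false);
// result.keys.length is 30, matching the example in the text.
```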

This entire logic works great, as long as the end result “keys” has fewer than 10 elements, or even fewer than 100 elements. But if we try to send back 1000 elements, the Couchbase service goes off into the ozone, chewing up RAM until it hits an OOM condition, crashing, and starting over. (Either it’s the indexing process or the compacting process, but either way, that process is calling our Reduce code, I guess, to perform its work.) If it will help, I can post a couple of the altered code snippets of our attempts to figure out what part of it Couchbase doesn’t like, but in simple English:

  • we added a circuit-breaker to simply break out of the for loops when keys.length exceeded 10, or 100. With those limits it completed, but at 1000 it does not, and we need to return 1000 (or more) in our application.

  • thinking that maybe the associative array / hash implementation was the problem, we tried skipping the duplicate checks entirely, i.e., adding together 25 elements and 10 more and returning all 35, duplicates included. But this still choked in the same way, so it seems to be the final “return { keys: keys, count: count }” part that Couchbase doesn’t like. When we changed it to perform the associative-array set for all the elements but still return only 10 of them, it worked, so the associative array isn’t the trouble. In fact, here is that code snippet:

     function( keysParm, values, rereduce ) {

         var keys = [];
         var count = 0;
         var keyMap = {};
         var cont = true;

         // loop through all values
         for (var valueIndex = 0; valueIndex < values.length; valueIndex++) {

             // break if appropriate
             if (!cont) {
                 break;
             }

             // access the value
             var value = values[valueIndex];

             // add the counts
             count += value.count;

             // compute the union of the keys
             for (var keyIndex = 0; keyIndex < value.keys.length; keyIndex++) {

                 // break if appropriate
                 if (!cont) {
                     break;
                 }

                 // access the key
                 var key = value.keys[keyIndex];

                 // add the key if not already added
                 if (!keyMap[key]) {
                     keys.push(key);
                     keyMap[key] = true;
                 }

                 if (keys.length >= 10) {
                     cont = false;
                 }
             }
         }

         // return the combined result
         return { keys: keys, count: count };
     }

So “cont = false” is our circuit-breaker. If we change the length test to >=1000 and then put in several source documents with a few hundred array elements, then Couchbase goes insane.

Thanks to Robert Hamon (@robert_hamon_) for kindly showing me a tuning parameter called “max_kv_size_per_doc”, which looks like a self-protection mechanism within Couchbase, guarding the server against users (i.e., us!) trying to do something dumb. He recommends setting the value lower than the default of 1 MiB, so that it errors out and aborts before it gets into the memory-hogging collapse state. That would give us cleaner failure behavior than we have now, but it still doesn’t explain how to actually succeed at what we’re trying to do. Is what we’re doing reasonable? Is there a better way to do it that will cooperate with Couchbase’s MapReduce implementation? Thank you in advance, everyone.

– Jeff Saxe
SNL Financial, Charlottesville, Virginia, USA


#2

What if you were to use a doc for each event, with a field referencing the user, rather than storing the keys in an array? Storage permitting, this could give you a lot more flexibility in how you aggregate and qualify the information at the granularity you’re after.
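Roughly, each downloaded item would get its own small document, something like this (just a sketch; field names are illustrative, not an official schema):

```javascript
// Sketch of the one-document-per-item shape suggested here
// (field names are illustrative, not an official schema).
var eventDoc = {
    actor: "kou:1073745525",
    action: "songDownloads",
    timestamp: 1433688093,
    key: "-2143267641"      // a single item key instead of an array
};
```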


#3

Thanks for the reply, Todd. So you’re saying that instead of inserting one document per download batch, containing an array of all 10 or 25 or 300 keys the customer pulled in that batch, we would insert 10 or 25 or 300 documents, each with the same timestamp, actor, and action, but with just one key? That seems plausible — it would increase our storage requirements, but perhaps not badly — but then how does that help with pulling back a list of all distinct keys retrieved by that customer over a time period? Doesn’t it still require that MapReduce return the large result set that it’s having a problem returning?

I was thinking of a way that returning the big result set could be avoided. In the code that checks whether the customer is allowed to have the next little result set they’re asking for, we don’t actually need to know the full set of keys; we really only need to know whether it would exceed the daily limit. So we could return only the count of distinct keys over the time period, and the Couchbase-calling client would just compare that single integer against the configured daily limit. But I don’t think this would work if the Reduce had to be done in stages, i.e., “rereduce”. If it were guaranteed that all of the input documents would be sent into just one call of the Reduce function, then that one call could insert them all into the associative array and, at the end, simply return the length of that array, and it would be done. But this doesn’t work if the Reduce needs to be done in stages, with Reduce output being passed as input to another Reduce stage: each time that happened, you would lose the identity of the individual keys, and therefore have no way to carry the DISTINCT-ness all the way down to the final result. Is there some flag we can set (either at the general bucket level or in a specific client call) to force a one-step Reduce, i.e., force absolutely no Re-Reduce steps?
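To illustrate the staging problem with some throwaway JavaScript (made-up data, run outside Couchbase): once a reduce stage collapses its inputs to a bare count, a later rereduce stage can only add integers and has no way to detect overlapping keys.

```javascript
// Sketch: why a count-only reduce gives wrong answers under rereduce.
function countDistinct(docs) {
    var seen = {};
    var n = 0;
    docs.forEach(function (doc) {
        doc.keys.forEach(function (key) {
            if (!seen[key]) { seen[key] = true; n++; }
        });
    });
    return n;
}

var partitionA = [{ keys: ["a", "b", "c"] }];
var partitionB = [{ keys: ["b", "c", "d"] }];

// Single-stage reduce over all documents: the correct answer.
var oneStage = countDistinct(partitionA.concat(partitionB));   // 4 distinct keys

// Two-stage reduce: each partition is reduced separately, then the
// rereduce stage can only sum the integers -- the overlapping keys
// "b" and "c" get double-counted because their identity was lost.
var twoStage = countDistinct(partitionA) + countDistinct(partitionB);  // 3 + 3 = 6
```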


#4

Hmm… no one has come forth with a complete solution to our problem yet, so I’m asking again. I’m looking for a way to reliably perform a UNION DISTINCT of all the keys in a bunch of matching input documents over a time period. Or, heck, as I said in my last reply, we don’t really need the entire UNION of unique keys returned… we really just need a COUNT of DISTINCT elements. Should I be installing a beta of Couchbase 4.0 and using N1QL? I don’t think we particularly care which version of Couchbase we use, as long as this query works. As I said, I can envision our Reduce function simply returning a COUNT, not the actual elements, but (unless I’m missing something) that would only give me the correct answer if the MapReduce could be guaranteed to happen in one Reduce step, with no successive Re-Reduce stages. @tgreenstein Did you see my previous reply?

We’re now drifting toward a non-Couchbase solution to this problem (inserting rows in a SQL Server), but that seems so… old-fashioned. :slight_smile: Please help show us that Couchbase can do this task as well as the others we’ve thrown at it, with scalability and reliability. Thanks in advance, community members.


#5

Hi Jeff,

Yes, you should use Couchbase 4.0 and N1QL to achieve this. If you load your data into Couchbase 4.0, we can provide the N1QL queries to perform your distinct counts. Please post here or via email when the data is loaded.

Thanks,
Gerald


#6

Thank you, Gerald. I went through the N1QL tutorial last night, and the SQL-like implementation looks pretty great, actually, assuming performance and scalability hold up! I will throw together a Couchbase 4.0 Beta server here and see if we can figure out what syntax to use, or I will reach out to you. I appreciate the assistance.
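For anyone following along, the sort of query I have in mind looks something like this. It’s an untested sketch: it assumes a bucket named `limits` holding the documents from my first post, and N1QL syntax as of the 4.0 Beta, so verify against your build (the `keys` field is escaped with backticks to be safe, since KEYS is a keyword in N1QL):

```sql
SELECT COUNT(DISTINCT k) AS distinctKeys
FROM limits AS d
UNNEST d.`keys` AS k
WHERE d.action = "songDownloads"
  AND d.actor = "kou:1073745525"
  AND d.timestamp >= 1433601693   -- start of the 24-hour window
```

The client would then compare `distinctKeys` against the configured daily limit, with no Reduce / Re-Reduce machinery involved.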