I’m trying to learn a bit about the various NoSQL options out there and started with Couchbase. We have a SQL database with a couple of fact tables, each holding a few hundred million rows; doing any sort of analysis across a whole table is arduous. So I figured that would be a good data set to play with.
The SQL tables look like this:
-factMetricFloats
  -300 million rows
  -FK links to host, metric, and configuration dimensions
  -sample date (generally 15-second intervals)
  -metric value as a float (double in .NET)
-factMetricInts … same as the table above, but with integer values instead of floats
-dimension tables for metric and host … just a key and a name string
-configuration dimension has a slew of string attributes in it
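To make the star-schema relationships concrete, here’s a tiny plain-JavaScript sketch with made-up dimension rows (the names "web01" and "cpu_pct" are invented for illustration) showing how one fact row resolves against its dimensions:

```javascript
// Hypothetical dimension lookups: key -> name, as in the dimension tables above.
var hosts = { 38: { name: "web01" } };
var metrics = { 3102: { name: "cpu_pct" } };

// One fact row from factMetricFloats, holding FKs plus the sample itself.
var factRow = { hostkey: 38, metrickey: 3102, sampledt: "2013-03-02T14:11:35", value: 1.0 };

// Resolving the FKs, as a SQL join against the dimensions would.
var resolved = {
  host: hosts[factRow.hostkey].name,
  metric: metrics[factRow.metrickey].name,
  sampledt: factRow.sampledt,
  value: factRow.value
};
```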
I’ve tried creating the documents a few different ways in Couchbase. Most recently, I created a single data bucket and used an incrementing value for each document’s key (via the .NET client’s Increment method). Every document has a "type" key/value indicating whether it is a host, a metric, a configuration, or a sample entry; I used 0, 1, 2, 3 rather than strings. A sample document looks something like this:
{
  "levelkey": 2162129,
  "metrickey": 3102,
  "hostkey": 38,
  "sampledt": "2013-03-02T14:11:35",
  "value": 1,
  "type": 0
}
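For context, here’s roughly how these sample documents get built, sketched in plain JavaScript rather than the .NET client; the local nextKey counter stands in for the client’s Increment call, and the key format is just for illustration:

```javascript
// Stand-in for the atomic Increment() call the .NET client provides.
var counter = 0;
function nextKey() { return ++counter; }

// Build one sample document in the shape shown above.
function makeSample(levelkey, metrickey, hostkey, sampledt, value) {
  return {
    key: "sample:" + nextKey(),   // hypothetical key format
    doc: {
      levelkey: levelkey,
      metrickey: metrickey,
      hostkey: hostkey,
      sampledt: sampledt,
      value: value,
      type: 0                     // 0 = sample, per the type scheme above
    }
  };
}

var s = makeSample(2162129, 3102, 38, "2013-03-02T14:11:35", 1);
```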
I created a view to give me the sum, count, average, and eightieth percentile of the value grouped by the three keys (ignoring date).
I have two nodes, each with 16GB RAM and 4 vCPUs. Disk backing is an FC disk array, but I see very little IO on the two nodes when they’re under load. I loaded about 27 million items to play with.
My view (at the bottom of this post) takes about two minutes to return data each time I run it. CPU on the hosts sits around 40%. The on-disk size also seems excessive given the small amount of data: it takes about 11GB, while the entire SQL database is about 25GB.
I assume I’m doing something wrong, but I can’t find any resources offering examples or advice on designing data models for Couchbase that don’t power a blog site (if I were doing that, there are plenty of examples). If I’m not doing anything wrong, am I trying to use the wrong tool for the job?
The view…
map:
function (doc, meta) {
  if (doc.type == 0) {
    emit([doc.levelkey, doc.metrickey, doc.hostkey], doc.value);
  }
}
reduce:
function (key, values, rereduce) {
  var result = {total: 0, count: 0, eightieth: 0, average: 0};
  var i;
  if (rereduce) {
    // Combine partial results from sub-reduces.
    var vals = [];
    for (i = 0; i < values.length; i++) {
      result.total += values[i].total;
      result.count += values[i].count;
      vals[i] = values[i].eightieth;
    }
    // Note: averaging the partial percentiles only approximates the
    // true 80th percentile across all values.
    result.eightieth = sum(vals) / vals.length;
  } else {
    // Numeric sort; the default sort() compares values as strings.
    values.sort(function (a, b) { return a - b; });
    result.total = sum(values);
    result.count = values.length;
    result.eightieth = values[Math.floor(values.length * 0.8)];
  }
  result.average = result.total / result.count;
  return result;
}
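One thing worth noting about the reduce above: the rereduce branch can only average the partial 80th percentiles from each sub-reduce, which approximates rather than computes the true percentile. A quick plain-Node.js sketch (function names are mine) shows how far off it can be:

```javascript
// True 80th percentile over the whole data set, with a numeric sort.
function percentile80(values) {
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  return sorted[Math.floor(sorted.length * 0.8)];
}

// Simulate the view: reduce each partition, then average the partial
// percentiles, as the rereduce branch does.
function rereduced80(partitions) {
  var partials = partitions.map(function (p) { return percentile80(p); });
  return partials.reduce(function (s, v) { return s + v; }, 0) / partials.length;
}

var all = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100];
var split = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 100]];

console.log(percentile80(all));   // -> 9 (exact, over all values)
console.log(rereduced80(split));  // -> 52.5 (averaged partials, skewed by the outlier)
```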