Cartesian data modeling / query problem

I need some help with the following problem:

person: {
	email: 'john@gmail.com',
	data: {
		category1: {
			"dp": {
				value: 'AAA',
				otherinfo1: 1,
				otherinfo2: 2,
			},
			"datapoint2": {
				value: 'BBB',
				otherinfo1: 1,
				otherinfo2: 2,
			},
			"some datapoint name2": {
				value: 'BBB',
				otherinfo1: 1,
				otherinfo2: 2,
			},
			....
		},
		category2: {
			"datapointAA": {
				value: 'BBB',
				otherinfo1: 1,
				otherinfo2: 2,
			},
			....
		},
		......
	}
}

Given:

  • a large number of persons (millions)
  • an unknown number (10-100) of “categories”, with unknown names
  • an unknown number (5-50) of datapoints (“dp”, “datapoint2”, etc.), with unknown names
  • I only know that person.data.<category>.<datapoint>.value exists

Working with set/get operations is easy for a single person, but it gets tricky when I try to build views/queries for more complex things.

The use cases I'm looking for:
1. Get the list of people who have category1.dp.value = 'AAA' AND category2.datapointAA.value = 'BBB' AND person.data.categoryN."some data".value = 'YES'.
2. Count/sum all of the values, either for a given list of people or for the people matching the same clause as above. (Here's the desired output:)

{
	total: 33, // people
	data: {
		category1: {
			"dp": {
				"value": {
					"AAA": 12, // people
					"BBB": 1,  // people
					"CCC": 1,  // people
					...
				},
				"otherinfo1": {
					"1": 33,
					"2": 22,
					..
				},
				"otherinfo2": {
					"1": 33,
					"2": 22,
					..
				}
			},
			"datapoint2": {
				// similar result
			}
		}
	}
}
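For reference, here is a rough N1QL sketch of what I imagine use case 2 looking like for a single datapoint (assuming the documents live in a bucket named people and carry a type field set to 'person'; both of those names are placeholders I made up, not part of the model above):

/* count, per value of data.category1.dp.value, how many people have it   */
/* `value` is backtick-escaped because VALUE is a reserved word in N1QL   */
SELECT p.data.category1.dp.`value` AS val,
       COUNT(*) AS people
FROM people p
WHERE p.type = 'person'
  AND p.data.category1.dp.`value` IS NOT MISSING
GROUP BY p.data.category1.dp.`value`;

The catch is that I would need one such query per (category, datapoint, field) combination, and I don't know the names up front.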

The options I've already tried:

  1. Map everything as a key, with everything as the value, then reduce with a custom function that counts all occurrences. This works, but for each person the values end up duplicated 100+ times. Besides the huge index size (each piece of information duplicated 100+ times), it still doesn't solve the "AND" part.

      emit( ["category","dp",<its value>,time_of_last_change] , doc.data.<all values from the doc>)
    
  2. Map just the key, with null as the value. Again, no intersection, but I could run a multi-get on the matching docs and compute on the fly. Less data to store, but processing millions of records this way is painful and takes serious time.

The map/reduce approach calculates EVERYTHING, ALL the time, which is good, but it eats up a lot of space/CPU. In reality I only need SOME of these aggregate reports, and only occasionally (mostly when a user requests a report).

Should I stick with map/reduce or go the N1QL route (and in that case, how should I create the indexes)? Or should I create map/reduce functions on the fly, one for each query I know I will need, whenever I need something specific?

Any help/tips/pointers for getting this sorted out nicely (from a space/speed perspective) would be highly appreciated.

I suspect that with N1QL and our new in-memory secondary indexing, using a type identifier field and a WHERE-clause (partial) index would let you aggregate on specific conditions and get the information you're after.
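Roughly something along these lines (an untested sketch; I'm assuming the documents sit in a bucket called people and each carries a type field set to 'person'):

/* partial (WHERE-clause) indexes covering only person documents and the filtered fields */
CREATE INDEX idx_cat1_dp_value ON people(data.category1.dp.`value`)
WHERE type = 'person';

CREATE INDEX idx_cat2_dpAA_value ON people(data.category2.datapointAA.`value`)
WHERE type = 'person';

/* use case 1: plain ANDs give you the intersection */
SELECT p.email
FROM people p
WHERE p.type = 'person'
  AND p.data.category1.dp.`value` = 'AAA'
  AND p.data.category2.datapointAA.`value` = 'BBB';

You would create one such index per field you actually filter or aggregate on, rather than precomputing everything the way the views do.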