Interval and nearest match

How would I go about intervals and matching closest value with n1ql?
file.zip (7.9 KB)
I have several files like this one, one for each day. What I want is to select a initial interval like:

BETWEEN “2015-01-02T10:00:00” AND “2015-06-04T06:02:00” END

which I got help with (thank you, vsr1 :smiley: )

SELECT d.panelid, d.sensorid, ARRAY_FLATTEN(ARRAY_AGG(qdata),2) AS data
FROM data d
LET qdata = ARRAY el FOR el IN d.data WHEN el.timestamp BETWEEN “2015-01-02T10:00:00” AND “2015-06-04T06:02:00” END
WHERE panelid=“a-si” and sensorid=“voc” AND ARRAY_LENGTH(qdata) > 0
GROUP BY d.panelid, d.sensorid;

This query will give me all objects between the two dates, but let’s say I want only one object for each hour from “2015-01-02T10:00:00” to “2015-06-04T06:02:00”. The interval will be given in this format: HH:MM. Managing this is my first problem, but I also need some way to match the closest timestamp (the objects contain timestamps), ie if I want the object closest to 11:00 which isn’t necessarily exactly 11:00 (could be 10:58 f.eks)

SELECT   q.panelid, q.sensorid , MIN(data) FROM (
        SELECT d.panelid, d.sensorid, ARRAY_FLATTEN(ARRAY_AGG(qdata),2) AS data
        FROM data d
        LET qdata = ARRAY el FOR el IN d.data WHEN el.timestamp BETWEEN "2015-01-02T10:00:00" AND "2015-06-04T06:02:00" END
        WHERE panelid="a-si" and sensorid="voc" AND ARRAY_LENGTH(qdata) > 0
        GROUP BY d.panelid, d.sensorid) AS q UNNEST q.data
GROUP BY q.panelid, q.sensorid, SUBSTR(data,0,13);

What happens when the range falls in multiple dates which date hour it should return.

The following query should give one entry for every day. I used 11h time you can change accordingly.

SELECT   q.panelid, q.sensorid , ARRAY_SORT(ARRAY_AGG([timediff, el]))[0][1] data  FROM (
        SELECT d.panelid, d.sensorid, ARRAY_FLATTEN(ARRAY_AGG(qdata),2) AS data
        FROM data d
        LET qdata = ARRAY el FOR el IN d.data WHEN el.timestamp BETWEEN "2015-01-02T10:00:00" AND "2015-06-04T06:02:00" END
        WHERE panelid="a-si" and sensorid="voc" AND ARRAY_LENGTH(qdata) > 0
        GROUP BY d.panelid, d.sensorid) AS q UNNEST q.data el
        LET timediff = ABS((STR_TO_MILLIS(el.timestamp)-STR_TO_MILLIS(DATE_FORMAT_STR(el.timestamp,"2017-01-01")))-(str_to_duration("11h")/1000000))
GROUP BY q.panelid, q.sensorid, SUBSTR(el.timestamp,0,10);

Pre CB 4.6.0

SELECT   q.panelid, q.sensorid , ARRAY_SORT(ARRAY_AGG([timediff, el]))[0][1] data  FROM (
        SELECT d.panelid, d.sensorid, ARRAY_FLATTEN(ARRAY_AGG(qdata),2) AS data
        FROM data d
        LET qdata = ARRAY el FOR el IN d.data WHEN el.timestamp BETWEEN "2015-01-02T10:00:00" AND "2015-06-04T06:02:00" END
        WHERE panelid="a-si" and sensorid="voc" AND ARRAY_LENGTH(qdata) > 0
        GROUP BY d.panelid, d.sensorid) AS q UNNEST q.data el
        LET timediff = ABS((STR_TO_MILLIS(el.timestamp)-STR_TO_MILLIS(SUBSTR(el.timestamp,0,10)))-(str_to_duration("11h")/1000000))
GROUP BY q.panelid, q.sensorid, SUBSTR(el.timestamp,0,10);

[
{
“code”: 3000,
“msg”: “Invalid function DATE_FORMAT_STR. - at )”,
“query_from_user”: “SELECT q.panelid, q.sensorid , ARRAY_SORT(ARRAY_AGG([timediff, el]))[0][1] data FROM (\r\n SELECT d.panelid, d.sensorid, ARRAY_FLATTEN(ARRAY_AGG(qdata),2) AS data\r\n FROM data d\r\n LET qdata = ARRAY el FOR el IN d.data WHEN el.timestamp BETWEEN “2015-01-02T10:00:00” AND “2015-06-04T06:02:00” END\r\n WHERE panelid=“a-si” and sensorid=“voc” AND ARRAY_LENGTH(qdata) > 0\r\n GROUP BY d.panelid, d.sensorid) AS q UNNEST q.data el\r\n LET timediff = ABS((STR_TO_MILLIS(el.timestamp)-STR_TO_MILLIS(DATE_FORMAT_STR(el.timestamp,“2017-01-01”)))-(str_to_duration(“11h”)/1000000))\r\nGROUP BY q.panelid, q.sensorid, SUBSTR(el.timestamp,0,10);”
}
]

DATE_FORMAT_STR() supported in 4.6.0. Replace DATE_FORMAT_STR(el.timestamp,“2017-01-01”) With SUBSTR(el.timestamp,0,10)

Like you said, the query doesn’t fully do what I want. If my dates are:

2015-01-01T10:00:00 and 2015-01-03T10:00:00

with timestep: 05:00 I want the the first value closest to 10:00:00 on 2015-01-01 then the one closest to 15:00:00 then 20:00:00 and so on until my end date. At this point the query is so complicated that I can’t figure anything out.

Seems doable. We have DATE_DIFF() and ABS() functions. You may need a script to generate each hour, and then pass the hour as a parameter to the query. The query would only get the closest entry for that hour.

Trying to do all the hours in one query is maybe too ambitious.

SELECT q.panelid, q.sensorid, MILLIS_TO_STR(iv.interval) AS interval,
       MIN([ABS(STR_TO_MILLIS(el.timestamp)-iv.interval),el.timestamp])[1] AS el FROM (
           SELECT d.panelid, d.sensorid,
               ARRAY_FLATTEN(ARRAY_AGG(ARRAY el FOR el IN d.data WHEN el.timestamp BETWEEN "2015-06-04T09:00:00" AND "2015-06-04T17:00:00" END),2) AS data,
               MIN(ARRAY {"interval": v} FOR v IN ARRAY_RANGE(STR_TO_MILLIS("2015-06-04T09:00:00"),STR_TO_MILLIS("2015-06-04T17:00:00"),str_to_duration("3h")/1000000) END) AS interval
        FROM default d
        WHERE panelid="a-si" and sensorid="voc" AND
              ANY el IN d.data SATISFIES el.timestamp BETWEEN "2015-06-04T09:00:00" AND "2015-06-04T17:00:00" END
        GROUP BY d.panelid, d.sensorid) AS q UNNEST q.interval AS iv UNNEST q.data AS el
GROUP BY  q.panelid, q.sensorid, iv.interval;

Is this IoT stuff? cc @keshav_m

Thanks for the answer, but I need some more :stuck_out_tongue: How would I get the corresponding values? I tried adding
el.value
in the first SELECT, but then i need to group or use aggregate func. If I group I get the wrong answer.

SELECT q.panelid, q.sensorid, MILLIS_TO_STR(iv.interval) AS interval,
       MIN([ABS(STR_TO_MILLIS(el.timestamp)-iv.interval),el])[1] AS el FROM (
           SELECT d.panelid, d.sensorid,
               ARRAY_FLATTEN(ARRAY_AGG(ARRAY el FOR el IN d.data WHEN el.timestamp BETWEEN "2015-06-04T09:00:00" AND "2015-06-04T17:00:00" END),2) AS data,
               MIN(ARRAY {"interval": v} FOR v IN ARRAY_RANGE(STR_TO_MILLIS("2015-06-04T09:00:00"),STR_TO_MILLIS("2015-06-04T17:00:00"),str_to_duration("3h")/1000000) END) AS interval
        FROM default d
        WHERE panelid="a-si" and sensorid="voc" AND
              ANY el IN d.data SATISFIES el.timestamp BETWEEN "2015-06-04T09:00:00" AND "2015-06-04T17:00:00" END
        GROUP BY d.panelid, d.sensorid) AS q UNNEST q.interval AS iv UNNEST q.data AS el
GROUP BY  q.panelid, q.sensorid, iv.interval;

The query you suggested results in this:

 [
      {
        "el": {
          "quality": "raw",
          "timestamp": "2015-06-04T15:00:29",
          "value": 35.612473
        },
        "interval": "2015-06-04T15:00:00Z",
        "panelid": "a-si",
        "sensorid": "voc"
      },
      {
        "el": {
          "quality": "raw",
          "timestamp": "2015-06-04T09:00:15",
          "value": 36.509649
        },
        "interval": "2015-06-04T09:00:00Z",
        "panelid": "a-si",
        "sensorid": "voc"
      },
      {
        "el": {
          "quality": "raw",
          "timestamp": "2015-06-04T12:01:06",
          "value": 34.314411
        },
        "interval": "2015-06-04T12:00:00Z",
        "panelid": "a-si",
        "sensorid": "voc"
      }
    ]

I need it to be structured like this:

[
  {
    "data": [
      {
        "quality": "raw",
        "timestamp": "2015-06-04T09:00:15",
        "value": 36.509649
      },
      {
        "quality": "raw",
        "timestamp": "2015-06-04T12:01:06",
        "value": 34.314411
      },
      {
        "quality": "raw",
        "timestamp": "2015-06-04T15:00:29",
        "value": 35.612473
      }
    ],
    "panelid": "a-si",
    "sensorid": "voc"
  }
]

Thank you for all the help :slight_smile:

you need one more level group, for example

SELECT p.panelid, p.sensorid, 
       ARRAY_AGG({
	            "quality":p.el.quality,
	            "timestamp":p.el.timestamp,
	            "value":p.el.value
                }) AS `data`
  FROM (
		SELECT q.panelid, q.sensorid, MILLIS_TO_STR(iv.interval) AS interval,
		       MIN([ABS(STR_TO_MILLIS(el.timestamp)-iv.interval),el])[1] AS el FROM (
		           SELECT d.panelid, d.sensorid,
		               ARRAY_FLATTEN(ARRAY_AGG(ARRAY el FOR el IN d.data WHEN el.timestamp BETWEEN "2015-06-04T09:00:00" AND "2015-06-04T17:00:00" END),2) AS data,
		               MIN(ARRAY {"interval": v} FOR v IN ARRAY_RANGE(STR_TO_MILLIS("2015-06-04T09:00:00"),STR_TO_MILLIS("2015-06-04T17:00:00"),str_to_duration("3h")/1000000) END) AS interval
		        FROM default d
		        WHERE panelid="a-si" and sensorid="voc" AND
		              ANY el IN d.data SATISFIES el.timestamp BETWEEN "2015-06-04T09:00:00" AND "2015-06-04T17:00:00" END
		        GROUP BY d.panelid, d.sensorid) AS q UNNEST q.interval AS iv UNNEST q.data AS el
		GROUP BY  q.panelid, q.sensorid, iv.interval;
  ) p
 GROUP BY p.panelid, p.sensorid

{"error": "you cannot issue more than one query at once."}

sorry, try this:

SELECT p.panelid, p.sensorid, 
       ARRAY_AGG({
	            "quality":p.el.quality,
	            "timestamp":p.el.timestamp,
	            "value":p.el.value
                }) AS `data`
  FROM (
		SELECT q.panelid, q.sensorid, MILLIS_TO_STR(iv.interval) AS interval,
		       MIN([ABS(STR_TO_MILLIS(el.timestamp)-iv.interval),el])[1] AS el FROM (
		           SELECT d.panelid, d.sensorid,
		               ARRAY_FLATTEN(ARRAY_AGG(ARRAY el FOR el IN d.data WHEN el.timestamp BETWEEN "2015-06-04T09:00:00" AND "2015-06-04T17:00:00" END),2) AS data,
		               MIN(ARRAY {"interval": v} FOR v IN ARRAY_RANGE(STR_TO_MILLIS("2015-06-04T09:00:00"),STR_TO_MILLIS("2015-06-04T17:00:00"),str_to_duration("3h")/1000000) END) AS interval
		        FROM default d
		        WHERE panelid="a-si" and sensorid="voc" AND
		              ANY el IN d.data SATISFIES el.timestamp BETWEEN "2015-06-04T09:00:00" AND "2015-06-04T17:00:00" END
		        GROUP BY d.panelid, d.sensorid) AS q UNNEST q.interval AS iv UNNEST q.data AS el
		GROUP BY  q.panelid, q.sensorid, iv.interval
  ) p
 GROUP BY p.panelid, p.sensorid

Thank you so much for your help, it works perfectly! :smile:

After some testing it turns out that it only works for a small number of days, if I change the query into:

SELECT p.panelid, p.sensorid, 
       ARRAY_AGG({
	            "quality":p.el.quality,
	            "timestamp":p.el.timestamp,
	            "value":p.el.value
                }) AS `data`
  FROM (
		SELECT q.panelid, q.sensorid, MILLIS_TO_STR(iv.interval) AS interval,
		       MIN([ABS(STR_TO_MILLIS(el.timestamp)-iv.interval),el])[1] AS el FROM (
		           SELECT d.panelid, d.sensorid,
		               ARRAY_FLATTEN(ARRAY_AGG(ARRAY el FOR el IN d.data WHEN el.timestamp BETWEEN "2015-06-04T09:00:00" AND "2015-05-04T17:00:00" END),2) AS data,
		               MIN(ARRAY {"interval": v} FOR v IN ARRAY_RANGE(STR_TO_MILLIS("2015-06-04T09:00:00"),STR_TO_MILLIS("2015-05-04T17:00:00"),str_to_duration("3h")/1000000) END) AS interval
		        FROM default d
		        WHERE panelid="a-si" and sensorid="voc" AND
		              ANY el IN d.data SATISFIES el.timestamp BETWEEN "2015-06-04T09:00:00" AND "2015-05-04T17:00:00" END
		        GROUP BY d.panelid, d.sensorid) AS q UNNEST q.interval AS iv UNNEST q.data AS el
		GROUP BY  q.panelid, q.sensorid, iv.interval
  ) p
 GROUP BY p.panelid, p.sensorid

I get:

[
  {
    "data": [
      {}
    ]
  }
]

When you use BETWEEN a AND b, a is smallest and b is largest other way round will not qualify any items.

My bad, a silly mistake. But do you have an idea why the query uses several minutes to complete? The bucket contains approx. 20 000 documents.

each document you have many data points and it needs to calculate for every day.
Also try the following index.
CREATE INDEX ix1 on default( DISTINCT ARRAY el.timestamp FOR el IN data END) WHERE panelid=“a-si” and sensorid=“voc”;

I need some insight in performance:

SELECT q.panelid, q.sensorid, MILLIS_TO_STR(iv.interval) AS interval,
               MIN([ABS(STR_TO_MILLIS(el.timestamp)-iv.interval),el])[1] AS el FROM (
                   SELECT d.panelid, d.sensorid,
                       ARRAY_FLATTEN(ARRAY_AGG(ARRAY el FOR el IN d.data WHEN el.timestamp BETWEEN "2015-06-04T09:00:00" AND "2015-07-05T17:00:00" END),2) AS data,
                       array_flatten(array_agg(ARRAY {"interval": v} FOR v IN ARRAY_RANGE(STR_TO_MILLIS("2015-06-04T09:00:00"),STR_TO_MILLIS("2015-07-05T17:00:00"),str_to_duration("12h")/1000000) END),2) AS interval
                FROM (select panelid, sensorid, array_flatten(array_agg(c.data),2) as data FROM `data` c
                   WHERE panelid="a-si" AND sensorid="voc" AND date BETWEEN "2015-06-04" AND "2015-07-05"
                   GROUP BY panelid, sensorid) as d
                GROUP BY d.panelid, d.sensorid) AS q UNNEST q.interval AS iv UNNEST q.data AS el
        GROUP BY  q.panelid, q.sensorid, iv.interval

This query uses 1 minute to complete, but why is that? The first subquery uses only 3 seconds and the innermost one uses only 300 ms. So why does the outermost query take so long? Are the calculation the culprit (STR_TO_MILLIS(el.timestamp)-iv.interval)? I fail to why it is so inefficient when the subqueries are so much faster.