Join using arbitrary criteria

n1ql
query

#1

Hello, I want to perform join using not a primary key. Is it possible?

What I need is something like the SQL below:

select *
from activity
left join activity as future_activity on
activity.user_id = future_activity.user_id
and activity.date = future_activity.date - interval ‘1 day’

I understand syntax does not support it. But maybe you guys know a workaround? (I tried to perform a join of select, but obviously failed).

Thank you very much in advance!


#2

Hi @Viacheslav

You should try to have parent-child relationship via child referencing the parent’s document-id.

Logically, this will be the equivalent query:

select *
from activity
left join activity as future_activity ON KEYS
activity.user_id = meta(future_activity).id /* future_activity.user_id */
WHERE date_diff_str(activity.date, future_activity.date, “day”) = 1;

actual query is:
select *
from activity
left join activity as future_activity ON KEYS
activity.user_id
WHERE date_diff_str(activity.date, future_activity.date, “day”) = 1;

If the document id is a composite key, you can concatnate the values to form a ON KEYS expression.

With Couchbase, you can join from parent-to-child as well.

See this article https://dzone.com/articles/join-faster-with-couchbase-index-joins for details.


#3

@keshav_m Hello Kechav, thank you for the fast answer.

I’m not sure I follow your idea.

You wrote:

activity.user_id = meta(future_activity).id /* future_activity.user_id */

You commented out future_activity.user_id: but this is what I actually need: activity.user_id = future_activity.user_id as activity.user_id definitely can not be a primary key in future_activity because it’s the same “table”.

Here http://developer.couchbase.com/documentation/server/current/n1ql/n1ql-language-reference/from.html all forms of joins / nest / unnest requires a primary key for LHS or RHS. Though it’s said:

ON [ PRIMARY ] KEY rhs-expression.lhs-expression-key FOR lhs-expression

I like the word expression in lhs-expression. This is an expression! A subquery?? But again use keys is mandatory for subqueries.

As I understand you offer somehow to turn it to a situation with “parent-child”, and I don’t mind to modify my data to make it possible. But I don’t see how. Maybe you know a trick or a common practice having such an impressive experience?

Thank you very much for your attention indeed!


#4

The query you mentioned is the LOGICAL equivalent. Below is the ACTUAL query syntax.

select * 
from activity
left join activity as future_activity 
          ON KEYS activity.user_id 
WHERE date_diff_str(activity.date, future_activity.date, "day") = 1;

#5

activity.user_id is not a primary key in future_activity, so it can not work, unfortunately

Do you suppose to support join not by primary-keys in the future?


#6

What we are trying to achieve is finding out if a user is a new subscriber to our services during a calendar month. We would know if they are a new subscriber if they did not make a subscription last month (note: some subscribers are resurrected so querying the first time they made a subscription payment ever does not work).

The subscription payment document has the following schema:

{
“doc_type”:“subspayment”,
“account_uid”:“acc_HAYPTvMCDGCShwmBFeJqSc”,
“created”:“2015-10-10T12:30:40”
}

The created property is the date when a subscription payment was made. If the account made a payment in October we want to know if they also made a payment in September. If they did not then they are a new user. We would like to be able to get a data series (daily / weekly / monthly) data series of new users.

This type of query is extremely useful for SAAS companies who need to calculate retention metrics etc. for reporting purposes.

Thanks very much for taking a look!


#7

Hi @cwoolgar

I tried this, trying to mimic the model you gave. See the query below.

If you give full sample documents with document-keys, we can try that.
In Couchbase (and NoSQL in general), document key design is important since they form the reference/relationship.

select meta().id as dockey, * from pay;

[
  {
    "dockey": "subpayment::ca0bef29-0007-4d1b-a8cc-6671151a6b36",
    "pay": {
      "account_uid": "user::acc_HAYPTvMCDGCShwmBFeJqSc",
      "created": "2015-08-21T12:30:40",
      "doc_type": "subspayment"
    }
  },
  {
    "dockey": "subpayment::cc9d320b-6eb0-48f2-a59b-b187739318f2",
    "pay": {
      "account_uid": "user::acc_HAYPTvMCDGCShwmBFeJqSc",
      "created": "2015-04-21T12:30:40",
      "doc_type": "subspayment"
    }
  },
  {
    "dockey": "subpayment::f52cfa5f-1ca1-4c04-a475-dead49988d97",
    "pay": {
      "account_uid": "user::acc_HAYPTvMCDGCShwmBFeJqSc",
      "created": "2015-03-23T12:30:40",
      "doc_type": "subspayment"
    }
  },
  {
    "dockey": "subpayment::fbb6c992-e189-4edd-97ce-c64a8ef16faa",
    "pay": {
      "account_uid": "user::acc_HAYPTvMCDGCShwmBFeJqSc",
      "created": "2015-05-21T12:30:40",
      "doc_type": "subspayment"
    }
  },
  {
    "dockey": "user::acc_HAYPTvMCDGCShwmBFeJqSc",
    "pay": {
      "fname": "John",
      "lname": "Miller",
      "udate": "2015-10-10T12:30:40",
      "zip": 94523
    }
  }
]

select * 
FROM pay p USE KEYS ["user::acc_HAYPTvMCDGCShwmBFeJqSc"] 
   INNER JOIN pay s ON KEY s.account_uid  FOR p 
WHERE date_diff_str(p.udate, s.created, "month") = 1;

#8

Hello - thanks for your reply.

Here is a sample of data for each subscription document. There are 3 accounts and what we need to know is when they started subscribing by looking at whether they made a payment the month before also. If they did not make a payment the month before then they are considered newly subscribed.

{
“account_uid”:“acc_HAYPTvMCDGCShwmBFeJqSc”,
“created”:“2015-09-09T12:30:40”
}

{
“account_uid”:“acc_HAYPTvMCDGCShwmBFeJqSc”,
“created”:“2015-10-10T12:30:40”
}

{
“account_uid”:“acc_HTEQTvMCDGCShwmBFeJqSc”,
“created”:“2015-10-10T12:30:40”
}

{
“account_uid”:“acc_HTEQTvMCDGCShwmBFeJqSc”,
“created”:“2015-11-05T12:30:40”
}

{
“account_uid”:“acc_UTCQTvMCDGCShwmBFeJqSc”,
“created”:“2015-10-10T12:30:40”
}

{
“account_uid”:“acc_UTCQTvMCDGCShwmBFeJqSc”,
“created”:“2015-11-05T12:30:40”
}

and the output from the query:

date_series new_accounts
2015-09 1
2015-10 2

Thanks

Catherine


#9
SELECT * FROM c;    
[
      {
        "c": {
          "account_uid": "acc_UTCQTvMCDGCShwmBFeJqSc",
          "created": "2015-11-05T12:30:40"
        }
      },
      {
        "c": {
          "account_uid": "acc_HAYPTvMCDGCShwmBFeJqSc",
          "created": "2015-10-10T12:30:40"
        }
      },
      {
        "c": {
          "account_uid": "acc_HAYPTvMCDGCShwmBFeJqSc",
          "created": "2015-09-09T12:30:40"
        }
      },
      {
        "c": {
          "account_uid": "acc_HTEQTvMCDGCShwmBFeJqSc",
          "created": "2015-10-10T12:30:40"
        }
      },
      {
        "c": {
          "account_uid": "acc_HTEQTvMCDGCShwmBFeJqSc",
          "created": "2015-11-05T12:30:40"
        }
      },
      {
        "c": {
          "account_uid": "acc_UTCQTvMCDGCShwmBFeJqSc",
          "created": "2015-10-10T12:30:40"
        }
      }
    ]

SELECT account_uid,  max(created)
FROM c
GROUP BY account_uid
HAVING (DATE_DIFF_STR("2015-11-15T12:30:40", max(created), "month")) > 0

[
  {
    "$1": "2015-10-10T12:30:40",
    "account_uid": "acc_HAYPTvMCDGCShwmBFeJqSc"
  }
]


SELECT account_uid,  max(created)
FROM c
GROUP BY account_uid
HAVING (DATE_DIFF_STR("2016-11-15T12:30:40", max(created), "month")) > 0;

[
  {
    "$1": "2015-10-10T12:30:40",
    "account_uid": "acc_HAYPTvMCDGCShwmBFeJqSc"
  },
  {
    "$1": "2015-11-05T12:30:40",
    "account_uid": "acc_HTEQTvMCDGCShwmBFeJqSc"
  },
  {
    "$1": "2015-11-05T12:30:40",
    "account_uid": "acc_UTCQTvMCDGCShwmBFeJqSc"
  }
]

SCROLL the window above to see additional queries.

Add the grouping of these results by month (use the date functions…)

SELECT  (TOSTR(date_part_str(dcreated, "year")) || "-" || TOSTR(date_part_str(dcreated, "month"))) as month, count(1) total
FROM
(SELECT account_uid,  max(created) as dcreated
FROM c
GROUP BY account_uid
HAVING (DATE_DIFF_STR("2016-11-15T12:30:40", max(created), "month")) > 0 ) AS newacts
GROUP BY (TOSTR(date_part_str(dcreated, "year")) || "-" || TOSTR(date_part_str(dcreated, "month")))

[
  {
    "month": "2015-11",
    "total": 2
  },
  {
    "month": "2015-10",
    "total": 1
  }
]