Feature request: Elasticsearch Connector should be able to filter documents

We are increasingly faced with the requirement to maintain separate Elasticsearch indexes for the Couchbase documents of a certain bucket that contains roughly 240 million documents.

The current Elasticsearch Connector (4.2.2-SNAPSHOT) [1] does not allow Couchbase documents to be filtered before they are written to Elasticsearch, except by document ID, as in the following example: [2]

[[elasticsearch.type]]
  # Index can be inferred from document ID by including a capturing group
  # named "index". This example matches IDs that start with one or more
  # characters followed by "::". It directs "user::alice" to index "user",
  # and "foo::bar::123" to index "foo".
  regex = '(?<index>.+?)::.*'
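To illustrate what this rule does to the two IDs mentioned in the comment, here is a small Python sketch. It is not the connector's code: the function name `infer_index` is hypothetical, and it assumes the whole ID has to match the pattern. Python spells named groups as `(?P<index>...)` rather than the `(?<index>...)` form used in the TOML file.

```python
import re

# Same pattern as in the config example, rewritten with Python's
# named-group syntax (?P<name>...).
INDEX_PATTERN = re.compile(r"(?P<index>.+?)::.*")


def infer_index(doc_id):
    """Return the index name captured from the document ID, or None.

    Assumption: the entire ID must match the pattern (fullmatch).
    """
    match = INDEX_PATTERN.fullmatch(doc_id)
    return match.group("index") if match else None


print(infer_index("user::alice"))    # user
print(infer_index("foo::bar::123"))  # foo
# A randomized ID without "::" matches nothing, so no index can be inferred.
print(infer_index("qwertyuiopasdfghjklzxcvbnmqwertyu"))  # None
```

The last call shows why ID-based rules do not help us: a randomized ID carries no information to capture.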

We mostly have randomized document IDs, for example 32-character alphabetical strings, so the ID carries no usable prefix. The attributes that classify a document are stored in the document itself, for example type, class, interestGroup and such.

It would be nice to have filters like type="News" AND interestGroup="Group57". It need not be expressed exactly like this in the configuration file at [2], but simple AND + OR combinations with exact matching would be great!
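To make the request concrete, here is a minimal Python sketch of the matching semantics we have in mind: exact matches on top-level fields, combined with AND and OR. The helper names (`field_equals`, `all_of`, `any_of`) are made up for illustration and are not part of the connector or its configuration format.

```python
def field_equals(name, expected):
    """Predicate: the document has top-level field `name` equal to `expected`."""
    return lambda doc: name in doc and doc[name] == expected


def all_of(*predicates):
    """Logical AND: every predicate must accept the document."""
    return lambda doc: all(p(doc) for p in predicates)


def any_of(*predicates):
    """Logical OR: at least one predicate must accept the document."""
    return lambda doc: any(p(doc) for p in predicates)


# type="News" AND interestGroup="Group57"
news_for_group_57 = all_of(
    field_equals("type", "News"),
    field_equals("interestGroup", "Group57"),
)

docs = [
    {"type": "News", "interestGroup": "Group57", "title": "a"},
    {"type": "News", "interestGroup": "Group12", "title": "b"},
    {"type": "Blog", "interestGroup": "Group57", "title": "c"},
]

# Only documents accepted by the filter would be written to the index.
accepted = [d["title"] for d in docs if news_for_group_57(d)]
print(accepted)  # ['a']
```

Because the predicates are plain functions, AND and OR nest freely, for example `any_of(news_for_group_57, field_equals("type", "Alert"))`.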

  • Has anything like this ever been considered?
  • Is there a chance that we missed something and a feature like this already exists?
  • If anywhere, I would probably place the filter right here [3], so that filtered documents never reach the eventSink. What do you think, @david.nault?

[1] https://github.com/couchbase/couchbase-elasticsearch-connector
[2] https://github.com/couchbase/couchbase-elasticsearch-connector/blob/765cf72351719ebbfa52bf437a5c470ca8bb2eb8/src/dist/config/example-connector.toml#L162
[3] https://github.com/couchbase/couchbase-elasticsearch-connector/blob/765cf72351719ebbfa52bf437a5c470ca8bb2eb8/src/main/java/com/couchbase/connector/dcp/DcpHelper.java#L144

Hi Zoltan,

This is a reasonable thing to want to do. The database change protocol (DCP) used by the connector places some limitations on us, though. Deletion notifications do not include the content of the deleted document, so the connector would not be able to route deletions to the correct index. In the past we’ve accommodated this by adding an “ignoreDeletes” flag to the definition and requiring that it be set to true for all rules that need to inspect the document content.
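A rough Python sketch of that constraint, for illustration only. It is not the connector's implementation; `Event`, `ContentRule` and `route` are hypothetical names, and only the “ignoreDeletes” idea comes from the existing configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    doc_id: str
    content: Optional[dict]  # None models a deletion: no document body is sent


@dataclass
class ContentRule:
    field: str
    value: str
    index: str
    ignore_deletes: bool = True

    def __post_init__(self):
        # A rule that inspects content cannot route deletions, because a
        # deletion carries no content. Reject such a rule up front.
        if not self.ignore_deletes:
            raise ValueError("content-based rules require ignore_deletes=True")


def route(event, rule):
    """Return the target index for the event, or None to skip it."""
    if event.content is None:
        return None  # deletion: nothing to inspect, so it is ignored
    if event.content.get(rule.field) == rule.value:
        return rule.index
    return None  # mutation that does not match the rule


rule = ContentRule(field="type", value="News", index="news")
print(route(Event("a1", {"type": "News"}), rule))  # news
print(route(Event("a1", None), rule))              # None
```

The second call is the problematic case: the deletion of "a1" arrives without a body, so there is no "type" field left to decide which index it belongs to.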

Does your use case require that deletions propagate to Elasticsearch, or would you be happy with setting ignoreDeletes=true for types whose ES index is determined by content fields?

Thanks,
David

Hi David,

There are no deletes in our use cases. To the extent there are any, they are handled yearly by dropping the previous year’s data and persisting it to something like HDFS. Reloading the whole index once a year is acceptable, so ignoring deletes is fine with us.

I assume that when a DeleteMutation hits an index that has no matching document to delete (because an earlier UpdateMutation was filtered out), it would fail silently. So even if a small number of DeleteMutations fail silently, the feature would still be practical. Does this sound reasonable?

Would this be a big change/feature? We could help with the development and testing of this right away.

P.S.: Will DCP support Couchbase collections eventually or will it stay as a v-bucket-level protocol? (I have no idea how collections are being implemented.)

Thanks,
Zoltán

Hi Zoltan,

Just wanted to check in. I can’t make a commitment about the timeline, but I do think it would make sense to add this feature to the connector. It seems like a good use for some “json stream matching” code we’ve got sitting on a shelf.

Turns out we’re already tracking this feature request as CBES-146. I’ve reprioritized it to target the June 16th release. That might be optimistic, but we’ll see. I’ll ping you as soon as it’s ready to play with.

“Will DCP support Couchbase collections eventually or will it stay as a v-bucket-level protocol?”

Yes, support for collections is the next big feature we’re adding to the connectors.

Thanks,
David
