I’m not sure whether this is a problem on the Elasticsearch side or the Connector side, but some documents cannot be indexed because of the following error. (I did not create the indexes upfront for the Couchbase collections.)
400 failed to parse field [doc.number] of type [long] in document with id '123456'. Preview of field's value: '15000000000000000282'
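For anyone reading along, here is a minimal sketch of why that value is rejected: it is larger than the biggest signed 64-bit integer, which is what a field of type long can hold. The class and method names are made up for illustration.

```java
import java.math.BigInteger;

class LongRangeCheck {
    static final BigInteger LONG_MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger LONG_MAX = BigInteger.valueOf(Long.MAX_VALUE);

    /** Returns true if the integer literal fits in a signed 64-bit value. */
    static boolean fitsInLong(String literal) {
        BigInteger value = new BigInteger(literal);
        return value.compareTo(LONG_MIN) >= 0 && value.compareTo(LONG_MAX) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(Long.MAX_VALUE);                      // 9223372036854775807
        System.out.println(fitsInLong("9223372036854775807"));  // true
        System.out.println(fitsInLong("15000000000000000282")); // false
    }
}
```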
We store documents with these incredibly large values in Couchbase with no effort. These values come and go between Couchbase and a JVM backend (serialized with
How is the index created in Elasticsearch with the Connector? Is this something that the Connector would create in advance, before inserting the first document?
Based on my search of Elastic issues, it should support BigIntegers in some way, but there is no dedicated Elasticsearch type for them. Some sources suggest indexing these as keywords.
I also attempted to drop these values in an Elasticsearch pipeline, but the JSON is apparently parsed before it reaches the pipeline.
Has anyone else stumbled upon this problem with the Couchbase Elasticsearch Connector?
The connector does not create indexes or type mappings. It relies on Elasticsearch’s automatic index creation and dynamic type mapping features, or pre-existing indexes and mappings.
What happens if you manually create the index with a type mapping that says the field is a keyword?
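A rough sketch of what creating the index manually could look like, in case it helps. It only builds the create-index request with an explicit keyword mapping for doc.number and does not send it. The index name "widgets" and the host are placeholders, and the class name is made up.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class KeywordMappingRequest {
    // Explicit mapping: doc.number is indexed as a keyword instead of a dynamically mapped long.
    static final String MAPPING_BODY =
        "{\"mappings\":{\"properties\":{\"doc\":{\"properties\":"
        + "{\"number\":{\"type\":\"keyword\"}}}}}}";

    /** Builds (but does not send) a PUT request that creates the index with the mapping. */
    static HttpRequest createIndexRequest(String baseUrl, String index) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(MAPPING_BODY))
            .build();
    }

    public static void main(String[] args) {
        // Placeholder host and index name; adjust to your cluster.
        HttpRequest request = createIndexRequest("http://localhost:9200", "widgets");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(MAPPING_BODY);
        // To actually send it, pass the request to java.net.http.HttpClient#send.
    }
}
```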
We do not lose precision. I immediately had that weird feeling as well and panicked, but we serialize BigDecimals in Scala, and it seems to me we are properly using BigInt on the JS side as well.
This is what we must do in this case. However, I see that the Connector attempts to insert these big numbers as 15000000000000000282, so I’m thinking maybe we could do some automatic conversion to String, so that the field would map to a keyword type in ES. This would save sleepless hours of developing and maintaining index mappings. Would such a “transform” feature be useful for the Connector? If so, I’ll look into the complexity of developing it.
If it’s useful to you, others might find it useful as well.
Perhaps the type definitions in the connector config could include a list of JSON pointers to fields that should be coerced to strings? Something like:
matchOnQualifiedKey = true
regex = '[^.]+.widgets.*' # all documents in the "widgets" collection in any scope
coerceToString = ['/doc/number'] # new field, list of JSON pointers
This would let users continue to use automatic index creation and dynamic type mapping.
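To make the idea concrete, here is a minimal sketch of what such a coercion step might do to a parsed document. This is not connector code: the class and method are hypothetical, and it only handles simple pointers through nested objects (no array indices, no "~" escapes).

```java
import java.math.BigInteger;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CoerceToString {
    /**
     * Replaces the value at each pointer (for example "/doc/number") with its
     * string form. Pointers must start with "/"; missing fields are left alone.
     */
    @SuppressWarnings("unchecked")
    static void coerce(Map<String, Object> document, List<String> pointers) {
        for (String pointer : pointers) {
            String[] path = pointer.substring(1).split("/");
            Map<String, Object> parent = document;
            // Walk down to the object that holds the final field.
            for (int i = 0; i < path.length - 1 && parent != null; i++) {
                Object child = parent.get(path[i]);
                parent = child instanceof Map ? (Map<String, Object>) child : null;
            }
            if (parent != null) {
                parent.computeIfPresent(path[path.length - 1], (key, value) -> value.toString());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new LinkedHashMap<>();
        inner.put("number", new BigInteger("15000000000000000282"));
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("doc", inner);

        coerce(document, List.of("/doc/number"));
        System.out.println(inner.get("number").getClass().getSimpleName()); // String
    }
}
```

With the value serialized as a JSON string, dynamic mapping would no longer try to treat it as a number.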
Let me know how your investigation goes. I’m happy to answer any questions about the code.
Thanks for your suggestions, David. I’ll look into implementing this feature in the near future and get back to you with my findings.
I checked the code and quickly started out by adding this to
Then I started to dig deeper into the part that writes a mutation out to ES. I changed the part where it takes the mutation’s bytes and converts them into a Map<String, Object> in order to create an ES document. I saw that it calls the com.fasterxml dependency shaded into the DCP library to do the Map<String, Object> conversion. This looks like a good place to inject some configuration into the fasterxml ObjectMapper as follows:
As you can see in the code snippets above, the ObjectMapper could be configured with custom deserializers and possibly with SerializationFeatures. There are quite a few possibilities there, so I suggest adding a serialization block to each ES type in the configuration so that more features can be added in the future. For example, a boolean configuration key serialization.writeBigDecimalAsString could be added under the TypeConfig, and it could be true by default.
David, please suggest how the idea could be improved.
I’d prefer not to directly expose the Jackson features, just in case we need to migrate to a different JSON library in the future.
However, I agree it would be great to have a config option that enabled Jackson’s WRITE_NUMBERS_AS_STRINGS feature as part of a type definition (and the type defaults).
A coerceToString property would still be useful, in case users want only certain fields to be stringified.
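To illustrate the difference between the two options: a type-wide setting would stringify every number in the document, not just the listed fields. The sketch below mimics that effect on a plain Map/List tree. It is not Jackson's implementation, and the class name is made up.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StringifyNumbers {
    /** Returns a copy of the tree in which every Number is replaced by its string form. */
    @SuppressWarnings("unchecked")
    static Object stringify(Object node) {
        if (node instanceof Number) {
            return node.toString();
        }
        if (node instanceof Map) {
            Map<String, Object> out = new LinkedHashMap<>();
            ((Map<String, Object>) node).forEach((key, value) -> out.put(key, stringify(value)));
            return out;
        }
        if (node instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object item : (List<Object>) node) {
                out.add(stringify(item));
            }
            return out;
        }
        return node; // strings, booleans and nulls pass through unchanged
    }

    public static void main(String[] args) {
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("number", new BigInteger("15000000000000000282"));
        document.put("count", 3);
        document.put("name", "widget");
        // Both numeric fields become strings; "name" is unchanged.
        System.out.println(stringify(document));
    }
}
```

Note that here "count" is stringified too, which is why a per-field list is the gentler option when only a few fields overflow.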
I would recommend:
and parse the JSON pointers as the config is read. This is more in line with how the other config fields are parsed, and would make it easy to fail fast if the user specifies an invalid pointer.
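As a sketch of the fail-fast part, the pointers could be validated and split into tokens while the config is read, so an invalid pointer is reported at startup rather than when the first document arrives. The class below is hypothetical and uses only the JDK; it follows the JSON Pointer escaping rules ("~1" for "/", "~0" for "~").

```java
import java.util.ArrayList;
import java.util.List;

class PointerConfig {
    /** Parses a JSON pointer into its reference tokens, or throws if it is malformed. */
    static List<String> parsePointer(String pointer) {
        if (pointer.isEmpty()) {
            return List.of(); // the empty pointer refers to the whole document
        }
        if (!pointer.startsWith("/")) {
            throw new IllegalArgumentException("JSON pointer must start with '/': " + pointer);
        }
        List<String> tokens = new ArrayList<>();
        for (String raw : pointer.substring(1).split("/", -1)) {
            // A '~' must be followed by '0' or '1'.
            if (raw.matches(".*~(?![01]).*")) {
                throw new IllegalArgumentException("Invalid escape in JSON pointer: " + pointer);
            }
            // Decode "~1" before "~0" so that "~01" becomes "~1", not "/".
            tokens.add(raw.replace("~1", "/").replace("~0", "~"));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(parsePointer("/doc/number")); // [doc, number]
        try {
            parsePointer("doc/number"); // missing leading slash
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```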
Something else to be aware of: there’s an “optimized passthrough” fast path for when the Elasticsearch document is exactly the same as the Couchbase document. In that case, we’d need to check whether any coercion is required; if so, we’d have to parse, transform, and re-serialize the document.