Elasticsearch connector's `BigInteger` support

I’m not sure whether this is a problem on the Elasticsearch side or the Connector side, but some documents cannot be indexed because of the following error. (I did not create the indexes up front for the Couchbase collections.)

400 failed to parse field [doc.number] of type [long] in document with id '123456'. Preview of field's value: '15000000000000000282'

We store documents with these very large values in Couchbase without any trouble. The values move back and forth between Couchbase and a JVM backend (serialized with BigDecimal support using JSON4S), and our JavaScript frontends handle them as well.

How is the index created in Elasticsearch with the Connector? Is this something that the Connector would create in advance, before inserting the first document?

Based on my search of Elastic issues, it should have support for BigIntegers in some way, but there is no dedicated Elasticsearch Type for it. Some sources suggest indexing these as keywords.

I also attempted to drop these values in an Elasticsearch ingest pipeline; however, the JSON is apparently parsed before it reaches the pipeline.

Has anyone else stumbled upon this problem with the Couchbase Elasticsearch Connector?

Hi Zoltan,

I’m under the impression JavaScript cannot accurately represent integers larger than 2^53 - 1 unless you’re using the BigInt type standardized in ECMAScript 2020. Is it possible the frontend is silently losing precision?
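
As a side note, the value from the error message is not only above 2^53 - 1; it also exceeds the signed 64-bit range, which is why Elasticsearch’s `long` field type rejects it outright. A quick check in Java:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class PrecisionCheck {
    public static void main(String[] args) {
        BigInteger value = new BigInteger("15000000000000000282");

        // Bigger than a signed 64-bit long (Long.MAX_VALUE = 9223372036854775807),
        // so Elasticsearch's `long` field type cannot hold it.
        System.out.println(value.compareTo(BigInteger.valueOf(Long.MAX_VALUE)) > 0); // true

        // Round-tripping through double (the only JS number type before BigInt)
        // silently changes the low-order digits.
        BigInteger roundTripped = new BigDecimal(value.doubleValue()).toBigInteger();
        System.out.println(roundTripped.equals(value)); // false
    }
}
```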

The connector does not create indexes or type mappings. It relies on Elasticsearch’s automatic index creation and dynamic type mapping features, or pre-existing indexes and mappings.

What happens if you manually create the index with a type mapping that says the field is a keyword?
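
For reference, a manual index creation request with such a mapping might look like this (the index name `my-index` is made up; the field path comes from the error message above):

```json
PUT /my-index
{
  "mappings": {
    "properties": {
      "doc": {
        "properties": {
          "number": { "type": "keyword" }
        }
      }
    }
  }
}
```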

Thanks,
David

Hi David,

We don’t lose precision. I immediately had that same weird feeling and briefly panicked, but we serialize BigDecimals in Scala, and as far as I can tell we are properly using BigInt on the JS side as well.

That is what we will do in this case. However, since the Connector attempts to insert these big numbers (such as 15000000000000000282) as-is, I’m wondering whether we could do an automatic conversion to String, so the value would map to a keyword type in ES. That would save many sleepless hours of developing and maintaining index mappings. Would such a “transform” feature be useful for the Connector? If so, I’ll look into the complexity of developing it.

If it’s useful to you, others might find it useful as well :slight_smile:

Perhaps the type definitions in the connector config could include a list of JSON pointers to fields that should be coerced to strings? Something like:

[[elasticsearch.type]]
  matchOnQualifiedKey = true
  regex = '[^.]+.widgets.*' # all documents in the "widgets" collection in any scope
  coerceToString = ['/doc/number'] # new field, list of JSON pointers

This would let users continue to use automatic index creation and dynamic type mapping.
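
If it helps, here is a rough sketch of what applying those pointers could look like with Jackson’s tree model (the class and method names are hypothetical, not actual connector code; it assumes the targeted parents are JSON objects):

```java
import com.fasterxml.jackson.core.JsonPointer;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.util.List;

public class CoerceToString {
    // Replace the value at each pointer with its string form, if present.
    public static JsonNode coerce(JsonNode document, List<JsonPointer> pointers) {
        for (JsonPointer pointer : pointers) {
            JsonNode target = document.at(pointer);
            if (target.isMissingNode() || target.isTextual()) {
                continue; // nothing to do, or already a string
            }
            // The parent of "/doc/number" is "/doc"; the last segment is the property name.
            ObjectNode parent = (ObjectNode) document.at(pointer.head());
            parent.put(pointer.last().getMatchingProperty(), target.asText());
        }
        return document;
    }
}
```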

Let me know how your investigation goes. I’m happy to answer any questions about the code.

Thanks,
David


Thanks for your suggestions, David. I’ll do the research in the near future to implement this feature and get back to you with my findings.

I checked the code and quickly started out by adding this to TypeConfig:

@Nullable
ConfigArray coerceToString();

Then I dug deeper, into the part that writes a mutation out to ES.

I traced the part that takes the mutation’s byte[] bytes and converts it into a Map<String, Object> in order to create an ES document. I saw that it calls the com.fasterxml dependency shaded into the DCP library to do the byte[] to Map<String, Object> conversion. This looks like a good place to inject some configuration into the fasterxml ObjectMapper.
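
The kind of injection I have in mind looks roughly like this (a minimal sketch using the unshaded Jackson package names; the helper name `bigNumbersAsStrings` is made up):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.fasterxml.jackson.databind.ser.std.ToStringSerializer;
import java.math.BigDecimal;
import java.math.BigInteger;

public class MapperConfig {
    // Sketch: serialize BigInteger/BigDecimal values as JSON strings, so that
    // Elasticsearch's dynamic mapping sees them as keyword candidates instead
    // of failing to parse them as longs.
    public static ObjectMapper bigNumbersAsStrings() {
        SimpleModule module = new SimpleModule()
            .addSerializer(BigInteger.class, ToStringSerializer.instance)
            .addSerializer(BigDecimal.class, ToStringSerializer.instance);
        return new ObjectMapper().registerModule(module);
    }
}
```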

The ObjectMapper could be configured with custom deserializers and possibly with SerializationFeatures. There are quite a few possibilities there, so I suggest adding a serialization block to each ES type in the configuration so that more features can be added down the road. For example, a configuration key such as serialization.writeBigDecimalAsString (boolean) could be added under the TypeConfig, and it could be true by default.

David, please suggest directions in which this idea could be improved.

Thanks,
Zoltán

Hi Zoltan,

I’d prefer not to directly expose the Jackson features, just in case we need to migrate to a different JSON library in the future.

However, I agree it would be great to have a config option that enables Jackson’s WRITE_NUMBERS_AS_STRINGS feature as part of a type definition (and the type defaults).

The coerceToString property would still be useful, in case users want only certain fields to be stringified.

Instead of:

@Nullable
ConfigArray coerceToString();

I would recommend:

List<JsonPointer> coerceToString();

and parse the JSON pointers as the config is read. This is more in line with how the other config fields are parsed, and would make it easy to fail fast if the user specifies an invalid pointer.
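
For what it’s worth, Jackson’s JsonPointer.compile already throws IllegalArgumentException on malformed input, so the fail-fast wrapper can be small (a sketch; the class name is made up):

```java
import com.fasterxml.jackson.core.JsonPointer;

public class PointerValidation {
    // Compile pointers when the config is read; malformed ones fail immediately,
    // with a message that names the offending config field.
    public static JsonPointer parse(String expr) {
        try {
            return JsonPointer.compile(expr);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                "Invalid JSON pointer in coerceToString: " + expr, e);
        }
    }
}
```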

Something else to be aware of: there’s an “optimized passthrough” fast path for when the Elasticsearch document is exactly the same as the Couchbase document. In that case, we’d need to check whether any coercion is required; if so, we’d have to parse, transform, and re-serialize the document.

Thanks,
David