Couchbase sorting Swedish Å Ä Ö in the wrong order


#1

I am running this query in Query Editor.

SELECT name FROM bucket WHERE type = ‘city’ ORDER BY name ASC

This is the result I get and it’s not in the right order. In Swedish we sort Å Ä Ö. The last 3 cities should be Åmål, Älvsjö and Östersund but Couchbase gives me Älvsjö, Åmål and Östersund instead. What is the problem?

[
{
“name”: “Borås”
},
{
“name”: “Filipsstad”
},
{
“name”: “Göteborg”
},
{
“name”: “Linköping”
},
{
“name”: “Älvsjö”
},
{
“name”: “Åmål”
},
{
“name”: “Östersund”
}
]


#2

N1QL uses byte order collation


#3

Hi Olle,
have you got an index on name WHERE type=“city” per chance?
Could I see you query plan?

Thanks,
Marco


#5

The only index I have right now is the primary index.

{
“plan”: {
#operator”: “Sequence”,
“~children”: [
{
#operator”: “Sequence”,
“~children”: [
{
#operator”: “PrimaryScan”,
“index”: “#primary”,
“keyspace”: “bucket”,
“namespace”: “default”,
“using”: “gsi”
},
{
#operator”: “Fetch”,
“keyspace”: “bucket”,
“namespace”: “default”
},
{
#operator”: “Parallel”,
“~child”: {
#operator”: “Sequence”,
“~children”: [
{
#operator”: “Filter”,
“condition”: “((bucket.type) = “city”)”
},
{
#operator”: “InitialProject”,
“result_terms”: [
{
“expr”: “(bucket.name)”
}
]
}
]
}
}
]
},
{
#operator”: “Order”,
“sort_terms”: [
{
“expr”: “(bucket.name)”
}
]
},
{
#operator”: “FinalProject”
}
]
},
“text”: “SELECT name from bucket WHERE type = ‘city’ ORDER BY name ASC”
}


#6

I see that now when I look at a extended ascii table. Ä = 142, Å = 143, Ö = 153. But the question remain.

Swedish_alphabet


#7

Hi Olle,
since you are not using an index path, the sort is done by the golang sort algorithm, which uses >byte< comparison for sorting, so I’m afraid the order as you get it is expected.

HTH,
Marco


#8

I just created an index.

CREATE INDEX name ON bucket(name) WHERE (type = “city”)

The result is still wrong.


#9

N1QL uses byte order collation i.e memcmp() during sorts, so based on that it sorts.
Ä = 142, Å = 143, Ö = 153. based on ascii code order collation 142 comes before 143.

N1QL doesn’t support unicode collation.


#10

So here we have a modern nosql database that can’t sort Swedish Å Ä Ö in the right order. Ok, when will this be fixed?


#12

Sorry - I have been away.

Let me cover the issue in a bit more detail: N1QL and the Indexing Service rely on the golang string implementation to handle text.
Golang strings are a slice of bytes which are interpreted as a sequence of Unicode runes, which is nice and simple, because JSON strings are in fact Unicode code points.

So far so good - you may even get away with using single byte character encoding like CP1252, and still display properly (although this is technically not supported, since U+00C5 compares as different to CP1252 0xC5, although they both display as Å).

We now enter the realm of locales, collation, equivalences and ligatures, and here is where things turn nasty.
In order to sort things properly, you need to know what language is involved: in Estonian, for instance, the ‘Z’ group is considered to be part of the ‘S’ group, and the sort collation is …S, Š, Z, Ž, T, U, V, Õ, Ä, Ö, Ü. To make things worse, F is not even part of the alphabet!
In German, the diatrictic letters Ä, Ö, Ü are not part of the alphabet, and are considered to be modification of A, O and U (in fact, before the Umlaut, they were printed as Ae, Oe and Ue), so they collate as A, Ä,… O, Ö,… U, Ü, which is different from Estonian (or Swedish) where Õ, Ä, Ö, Ü are considered to be separate from A, O, U.

From a collation point of view, certain runes are considered equivalent to others: the EsZett (ẞ) sorts as equivalent to SS and sits in between “SR” and “ST”, or, to go back to Estonian, Ü is fast becoming equivalent to Y.
Lowercase runes are equivalent to upper case for collation purposes, and ‘a’ definitely comes before ‘B’.

I digress. The long and the short is that you have to know the locale involved before you apply the rules.
Traditionally, this has been done using environmental variables (LANG, NLS_LANG, LC_CTYPE, CLIENT_LOCALE, you name it…) to inform the operating system or your database engine how to sort and display.
In fact, things are even worse than that, because, normally, you would have to indicate to the database engine what locale needs to be used when storing and inserting data, AND tell the client what locale is used on its side: the idea being that if you are using a client application in a Spanish territory to make (say) a booking on a Swedish database, the client and the server should handshake so that characters unknown on one side get translated by the other, so that, for instance, you do not insert invalid characters on the database.

I digress again.
The rub is that JSON does not have any means of specifying the locale that has been used in a document, and golang does not have any support to set a locale for processing purposes, so we do not know what locale to use for the sorting, nor, lacking that, do we have a way to force anything other than byte order at the service level.

This does not mean to say that your requirement is incorrect.
It’s just that the underlying standard, JSON, has not defined the problem, so “fixing the issue” is non trivial because lacking indication from the storage, we’ll have to find some way to enforce a locale that is consistent across services and, more importantly, across clients.