IndexOutOfRangeException during fail over with .NET SDK 2.0

#1

Hello,

I am using the .NET SDK 2.0.2 and created a sample application that was reading from a bunch of documents that I previously inserted. The keys of those documents are just GUIDs.

Here’s my code:

class Program
{
    private static readonly Cluster cluster = new Cluster("couchbaseClients/couchbase");
    private static readonly IBucket bucket = cluster.OpenBucket(cluster.Configuration.BucketConfigs.Single().Value.BucketName);

    static void Main()
    {
        var keys = JsonConvert.DeserializeObject<string[]>(File.ReadAllText("keys.json")).ToList();

        for (int i = 0; i < keys.Count; i++)
        {
            string key = keys[i];
            var result = bucket.Get<string>(key);

            if (!result.Success)
            {
                Console.WriteLine(result.Status);
                Console.WriteLine(result.Message);
                Console.WriteLine(result.Exception);
            }

            Thread.Sleep(250);

            Console.WriteLine("processed {0} records: {1}", (i + 1), result.Value);
        }
    }
}

I have a total of 10000 documents and a Couchbase Server 3.0 cluster with 3 nodes. During the read process I performed a graceful failover of one of the nodes, followed by a rebalance. During the failover, the .NET client intermittently throws the following exception for some of the documents being read:

System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at Couchbase.IO.Converters.AutoByteConverter.ToByte(Byte[] buffer, Int32 offset)
   at Couchbase.IO.Operations.OperationBase`1.Read(Byte[] buffer, Int32 offset, Int32 length)
   at Couchbase.IO.Strategies.DefaultIOStrategy.Execute[T](IOperation`1 operation)
   at Couchbase.Core.Server.Send[T](IOperation`1 operation)

The status of the operation is ClientFailure. Do you have any idea how this can be avoided? Would reading from a replica help in this situation?

#2

@dimitrod -

I am guessing the client doesn’t throw the exception, but returns it as part of the result? During a rebalance the client’s cluster map is updated, and TCP connections are at times torn down and recreated based on the latest cluster map. In certain cases the client will retry a failed operation before returning a failed operation result. For errors classified as “ClientFailure”, the server itself did not return an error; instead, the failure occurred on the client side.

As of 2.0.2 these errors (ClientFailure) are not retried by the SDK, though this could change in later releases. In this case, since it’s a non-mutation operation, the application code should probably retry the Get. That being said, without knowing what led up to the IndexOutOfRangeException, I would consider this a bug in the client. Here is a ticket: https://issues.couchbase.com/browse/NCBC-823
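
A minimal sketch of what such an application-level retry could look like. ReadRetry and its parameters are hypothetical names and not part of the Couchbase SDK; the helper depends only on the standard library, so the caller supplies both the operation and the retry condition. It is only meant for idempotent operations such as Get.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of the Couchbase SDK): re-runs an
// idempotent operation while the caller-supplied predicate asks for a retry.
public static class ReadRetry
{
    public static T Execute<T>(Func<T> operation, Func<T, bool> shouldRetry,
                               int maxAttempts, TimeSpan delay)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException("maxAttempts");

        // First attempt always runs; later attempts only if a retry is requested.
        T result = operation();
        for (int attempt = 1; attempt < maxAttempts && shouldRetry(result); attempt++)
        {
            Thread.Sleep(delay);   // fixed pause between attempts
            result = operation();
        }
        return result;             // last result, successful or not
    }
}

// Possible usage with the Get from the first post (assumes the result
// properties shown there: Success and Status):
//   var result = ReadRetry.Execute(
//       () => bucket.Get<string>(key),
//       r => !r.Success && r.Status == ResponseStatus.ClientFailure,
//       5, TimeSpan.FromMilliseconds(100));
```

The attempt limit and the fixed delay are arbitrary choices for illustration; the last result is returned even if every attempt failed, so the caller still has to check it.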

-Jeff

#3

Yeah, a replica read should help, unless NCBC-823 also impacts replica reads.

#4

@jmorris -

Thanks for the info. In my test setup, when ClientFailure was returned I tried using the GetFromReplica<T> method, but no matter how many times I called it (in an infinite while (!result.Success) loop) it always returned VBucketBelongsToAnotherServer. On the other hand, when I retried the original Get<T> operation, it eventually succeeded. So I guess that either GetFromReplica<T> doesn’t work as expected in 2.0.2 or I am not using it in the correct situation?

Does this mean that if there’s an IOException, the operation will not be retried? In our production environment I have a situation where idle TCP connections will be reset by a firewall. I have verified that the 2.0.1 SDK was correctly handling this case and simply marked the connection in the pool as dead and retried the operation.

#5

@dimitrod -

I’ll take a look into why replica read isn’t working as expected (and I am assuming you are storing replicas?).

There was some work done in 2.0.2 to improve the handling of connections during rebalance scenarios, but nothing specifically changed in the retry policy. In general, we’re trying to reduce the number of connections torn down and recreated because of the latency it introduces.

-Jeff

#6

Hi jmorris

I am getting the same type of error when calling Upsert():

System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at Couchbase.IO.Converters.AutoByteConverter.CopyAndReverse(Byte[] src, Int32 offset, Int32 length)
   at Couchbase.IO.Converters.AutoByteConverter.ToInt16(Byte[] buffer, Int32 offset)
   at Couchbase.IO.Operations.OperationBase`1.Read(Byte[] buffer, Int32 offset, Int32 length)
   at Couchbase.IO.Strategies.DefaultIOStrategy.Execute[T](IOperation`1 operation)
   at Couchbase.Core.Server.Send[T](IOperation`1 operation)

I tried with 2.0.3 and 2.0.2.

Thanks,
Amalraj

#7

@amalraj_charles -

Could you explain the state of the cluster when this occurs (e.g. normal operation, failover, rebalance)? What version of the server are you running? Does it happen on every Upsert? Could you provide example data and the code you are using that illustrates the failure? You can attach it to this ticket: https://issues.couchbase.com/browse/NCBC-823

Thanks,

-Jeff

#8

Updated the JIRA ticket with the requested info.

#9

@jmorris -

Hi, I was able to work around this issue by using integer values as keys instead of strings (combinations of different characters).

E.g. documents with the keys below failed with the same error as in my post above:

object_2_C_PROJ_PRC
object_2_C_PRPS_VNR
object_2_K_PKSA
object_2_K_SUM_PROJ
object_7_C_PROJ_PRC

I hope you can tell me what’s wrong in my code.

Thanks,
Amalraj

#10

We are experiencing the same error under similar conditions.
.NET client version: 2.1.0

Couchbase server:
2 nodes running on Ubuntu
Version: 3.0.3-1716 Enterprise Edition (build-1716)
Cluster State ID: 025-020-213

A few points that might be helpful:

  • The exception isn’t thrown, but the operation returns as unsuccessful
  • Some documents consistently fail when being added
  • Changing the content of the document even slightly (usually by removing any characters) almost always results in a successful add operation for subsequent attempts
  • Even when the IndexOutOfRangeException happens, the documents appear to be saved successfully in CB (subsequent GET operations with the same ID return the expected document)
  • The exception itself traces back to the Read method of OperationBase when the buffer parameter is too short, usually 4 or 5 bytes. It seems that either the server is returning malformed data or the buffer isn’t initialized properly.

public virtual void Read(byte[] buffer, int offset, int length)
{
    if (Header.BodyLength == 0)
    {
        Header = new OperationHeader
        {
            Magic = Converter.ToByte(buffer, HeaderIndexFor.Magic), 
            OperationCode = Converter.ToByte(buffer, HeaderIndexFor.Opcode).ToOpCode(),
            KeyLength = Converter.ToInt16(buffer, HeaderIndexFor.KeyLength),
            ExtrasLength = Converter.ToByte(buffer, HeaderIndexFor.ExtrasLength),
            Status = (ResponseStatus) Converter.ToInt16(buffer, HeaderIndexFor.Status),
            BodyLength = Converter.ToInt32(buffer, HeaderIndexFor.Body),
            Opaque = Converter.ToUInt32(buffer, HeaderIndexFor.Opaque),
            Cas = Converter.ToUInt64(buffer, HeaderIndexFor.Cas)
        };
    }
    LengthReceived += length;
    Data.Write(buffer, offset, length);
}
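
To illustrate the failure mode described above: the memcached binary protocol response header is 24 bytes long, so indexing header fields in a 4 or 5 byte chunk runs past the end of the array. The sketch below buffers incoming chunks until a full header is available before parsing anything. ResponseHeaderReader is a hypothetical class written for this example; it is not the SDK's code and not necessarily how the SDK would fix this. The field offsets (status at bytes 6-7, total body length at bytes 8-11, big-endian) come from the binary protocol layout.

```csharp
using System;
using System.IO;

// Hypothetical illustration (not SDK code): accumulate chunks and only
// parse the response header once all 24 bytes have arrived.
public sealed class ResponseHeaderReader
{
    public const int HeaderLength = 24; // memcached binary protocol header size

    private readonly MemoryStream _pending = new MemoryStream();

    // Returns false while fewer than 24 bytes have been received in total.
    public bool TryRead(byte[] buffer, int offset, int length,
                        out short status, out int bodyLength)
    {
        status = 0;
        bodyLength = 0;

        _pending.Write(buffer, offset, length);
        if (_pending.Length < HeaderLength)
            return false; // too short to index header fields safely

        byte[] header = _pending.ToArray();

        // Both fields are big-endian (network byte order).
        status = (short)((header[6] << 8) | header[7]);
        bodyLength = (header[8] << 24) | (header[9] << 16)
                   | (header[10] << 8) | header[11];
        return true;
    }
}
```

Feeding this reader a 5 byte chunk followed by the remaining 19 bytes yields false for the first call and a parsed header for the second, instead of an IndexOutOfRangeException on the first.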

#11

I’m experiencing a similar issue as well. After some experimenting I concluded that it appears to be size related, so I wrote some test code to find out which document size ranges fail:

public void TestProblematicSizes()
{
    const int start = 0;
    const int end = 50000;
    var errorSb = new StringBuilder();
    var sb = new StringBuilder();
    for (var x = 0; x < start; ++x)
        sb.Append("+");
    var str = sb.ToString();

    for (var i = start; i < end; ++i)
    {
        str += "+";
        var testClass = new TestClass {Id = Guid.NewGuid().ToString(), Message = str};
        try
        {
            _testClassRepository.Add(testClass.Id, testClass);
        }
        catch (Exception)
        {
            var len = JsonConvert.SerializeObject(testClass).Length;
            errorSb.AppendLine(len.ToString());
        }
    }
    Debug.WriteLine(errorSb.ToString());
}

Also, below is from the repository object:

public virtual bool Add(T item)
{
    var result = Add(item.Id, item);
    return result;
}

public virtual bool Add(string key, T item)
{
    using (var bucket = CouchbaseManager.Instance.OpenBucket(Bucket))
    {
        var result = bucket.Insert(new Document<T> { Id = key, Content = item });
        if (!result.Success)
            throw new Exception(result.Message, result.Exception);

        return result.Success;
    }
}

The output consists of the numbers 16317-16339, 32701-32723 and 49085-49107: three ranges, each 23 values wide and exactly 16384 apart, just below 16k, 32k and 48k.
The exception we got comes from the same place brettman described above.
I hope the code above helps identify the issue. Please let me know if you need more information from my end.
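
The arithmetic behind those three ranges can be checked with a short sketch. It rests on assumptions that are not confirmed anywhere in this thread: that the request is framed as a 24-byte header plus 8 bytes of extras plus the 36-character GUID key plus the JSON body, that the body has the same byte length as the JsonConvert output measured in the test, and that a 16384-byte (16 KiB) buffer is involved. RangeCheck and its members are hypothetical names.

```csharp
using System;

// Hypothetical arithmetic check, not SDK code.
public static class RangeCheck
{
    // Assumed buffer size (16 KiB); not confirmed by the SDK source.
    public const int Chunk = 16384;

    // Assumed request framing: 24-byte header + 8 bytes extras + 36-char GUID key.
    public const int AssumedOverhead = 24 + 8 + 36;

    // How many bytes the assumed request extends past the last 16 KiB boundary.
    public static int BytesPastBoundary(int jsonLength)
    {
        return (jsonLength + AssumedOverhead) % Chunk;
    }

    // Number of failing sizes in an inclusive range.
    public static int Width(int first, int last)
    {
        return last - first + 1;
    }

    public static string Describe(int first, int last)
    {
        return string.Format("{0}-{1}: width {2}, {3}..{4} bytes past a boundary",
            first, last, Width(first, last),
            BytesPastBoundary(first), BytesPastBoundary(last));
    }
}
```

Under these assumptions, RangeCheck.Describe gives a width of 23 and 1..23 bytes past a boundary for each of the three reported ranges. That would be consistent with the final fragment being shorter than a 24-byte header, but this is a conjecture, not a confirmed diagnosis.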