Receiving repeated Error Code :: 5003 when we run multiple Java threads concurrently

We have a Java Swing UI where we create multiple threads, and in each thread we run a pull replication. We are trying to create more threads (1000 threads or more, if necessary). With 10 threads we are able to complete all the transactions, but with more than 50 threads we start getting error code 5003.

In each thread we use the Couchbase Lite 2.7 Java platform API: https://docs.couchbase.com/couchbase-lite/2.7/java-platform.html

Basically, in the Java Swing UI we enter the credentials, the thread count, and the number of iterations per thread. Based on these inputs we create the threads and perform the replications, iterating each thread with some interval of time, so multiple threads run concurrently.

With a small thread count and few iterations, we are able to complete all the transactions. But when we increase the thread count and iterations, and when the users have more documents, we get error code :: 5003 continuously. We also observe the CPU spiking as the per-user document count grows: even a 64-core CPU maxes out when we trigger fewer than 5 pull replications concurrently for users with somewhere around 800 to 2000 documents.

Note:

  1. We have NOT set the replication mode to continuous, i.e., replicator.setContinuous(false).
  2. While creating threads in the Java Swing UI, we create each thread and add it to a ThreadGroup.

E/CouchbaseLite/NETWORK:{C4SocketImpl#45}==> class litecore::repl::C4SocketImpl
E/CouchbaseLite/NETWORK:{C4SocketImpl#45} No response received after 15 sec – disconnecting
E/CouchbaseLite/REPLICATOR:{Repl#46}==> class litecore::repl::Replicator D:\CbLite\Resources\Thread-11-1\syncdb.cblite2\ ->
E/CouchbaseLite/REPLICATOR:{Repl#46} Got LiteCore error: Network error 3 "connection timed out"
Error code :: 5003
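
For reference, each thread's replication code follows the standard one-shot pull pattern from the 2.7 Java docs, roughly like this (a sketch only; the endpoint URL, database name, and credentials are placeholders for our real values):

    import com.couchbase.lite.BasicAuthenticator;
    import com.couchbase.lite.CouchbaseLite;
    import com.couchbase.lite.Database;
    import com.couchbase.lite.Replicator;
    import com.couchbase.lite.ReplicatorConfiguration;
    import com.couchbase.lite.URLEndpoint;

    import java.net.URI;

    class OneShotPull {
        static { CouchbaseLite.init(); }  // must run once, before any Database is opened

        static Replicator startPull(String user, String password) throws Exception {
            Database db = new Database("syncdb");  // each thread uses its own local database
            ReplicatorConfiguration config = new ReplicatorConfiguration(
                    db, new URLEndpoint(new URI("ws://sync-gateway-host:4984/syncdb")));
            config.setReplicatorType(ReplicatorConfiguration.ReplicatorType.PULL);
            config.setContinuous(false);  // one-shot pull, NOT continuous
            config.setAuthenticator(new BasicAuthenticator(user, password));

            Replicator replicator = new Replicator(config);
            replicator.start();  // non-blocking; the replication runs in the background
            return replicator;
        }
    }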

The requests from multiple threads will be queued up in Couchbase Lite and executed in queue order. In our test app, we use java.util.concurrent.Executors to optimize the concurrent calls:
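
Something along these lines (a rough sketch; the user list and the runOneShotPull() helper are placeholders for your own replication code):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    class ReplicationRunner {
        // runOneShotPull() stands in for your own pull-replication code.
        static void replicateAll(List<String> users) throws InterruptedException {
            // A fixed-size pool bounds how many replications run at the same time;
            // the remaining tasks wait in the executor's queue instead of each
            // occupying a thread of its own.
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (String user : users) {
                pool.submit(() -> runOneShotPull(user));
            }
            pool.shutdown();                              // stop accepting new tasks
            pool.awaitTermination(30, TimeUnit.MINUTES);  // wait for queued replications to finish
        }

        static void runOneShotPull(String user) { /* pull replication for one user */ }
    }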

@eunice.huang Thanks for your quick turnaround. We still see the CPU hit 100% at times when around 10 - 20 pull replications run concurrently, where each replication pulls around 500 - 1500 documents.

Is there any way we can increase the timeout duration (Network error 3 “connection timed out” – 15 secs)? Otherwise, please advise us on how to resolve this 5003 error.

I believe this behaviour is very similar to what we see here (but from PHP)

@flaviu @eunice.huang We can start multiple independent connections concurrently using threads in Java. For pull replication, when the users have fewer documents (between 10 and 500), CPU usage stays below 40% and we don’t observe any 5003 errors. But when we run it for users with more documents (between 500 and 2500), after a few successful replications we start seeing 5003 errors, and CPU usage reaches almost 100%. We also tried a machine with a 64-core CPU and 256 GB RAM and still face the same issue. Please advise us on this.

I believe the CPU spike is normal, since you are pulling more documents. In my case, what is not normal is that when I start more workers, they begin to fail to connect to the DB.

I also tested with MTR running against the DB servers: the ping is less than 1 ms with no packet loss.

So what I think is that there is some kind of buffer in libcouchbase which fills up with new connections (one for each new worker started), and after a certain limit it simply throws an error instead of creating a new connection to the DB.

I don’t have any other explanation. All my workers run identical code; the only difference is the key they fetch from the DB. So from the code perspective (fewer than 10 lines of code), there is no reason why some workers can connect to the DB and others cannot.

There must be something in libcouchbase or in the DB. I know there is a limit of around 60k connections per node, but we are nowhere near that; we are probably doing 1000-2000 gets per second for the entire cluster.
@avsej any idea what may be the problem?

@flaviu It appears that you are talking about the Couchbase Server Java SDK. That is a completely different product (and implementation) than Couchbase Lite for Java, which is what the original poster is discussing. Can you post a separate question and tag it so the appropriate folks can respond?

Hello Priya, I actually did that, but no one answered. Here is the link to my post

It was very similar to what I encountered, so they may have a similar cause; that’s why I commented here.

@Sanjay
I can practically guarantee that the architecture you describe will not work. 1000 threads is a lot of threads. 200 threads is a lot of threads. You should, instead, be using a task queue of some kind… possibly as @eunice.huang suggested, a thread pool fronted by an Executor.
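
One caveat with the pool approach: replicator.start() returns immediately, so a pooled task has to wait until the replicator reports STOPPED before returning, otherwise the pool does not actually bound the number of replications running at once. A rough sketch (the class and method names here are just illustrative):

    import com.couchbase.lite.AbstractReplicator;
    import com.couchbase.lite.Replicator;
    import com.couchbase.lite.ReplicatorConfiguration;

    import java.util.concurrent.CountDownLatch;

    final class BlockingPull {
        // Runs one pull replication and blocks until it reports STOPPED, so a
        // thread-pool task holds its slot for the full duration of the replication.
        static void runAndWait(ReplicatorConfiguration config) throws InterruptedException {
            CountDownLatch done = new CountDownLatch(1);
            Replicator replicator = new Replicator(config);
            replicator.addChangeListener(change -> {
                if (change.getStatus().getActivityLevel() == AbstractReplicator.ActivityLevel.STOPPED) {
                    done.countDown();  // replication finished, with or without an error
                }
            });
            replicator.start();
            done.await();              // block the pool thread until the pull stops
        }
    }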

I’m not at all surprised that you see the CPU hit 100% when you run 20 replications. Each is running on its own thread, competing for processor cores. Even on a machine with quite a few cores, when you get 20 replications running at once, if one gets suspended while doing I/O, another will get scheduled. I’d expect to see your CPU closer to N * 100%, where N is the number of cores on your processor.

As for the exact cause of the 5003 error, there could be several problems. You haven’t included any information at all about what is on the other side of the connection. It is impossible to tell whether the server actually is responding in time or not.

Here are some possibilities: There are limits on the number of file descriptors that a system can have open at once; it might be that either the server or the client is hitting that limit and is stuck waiting for one. It could be that the network stack sees a large queue of waiting messages and discards some (the safest way of avoiding hanging the whole computer). It could just be that the CPUs are maxed out. If all of these replicators are running against a single database (on either the client or the server side), the file system might be unable to keep up.

Amended to add: A colleague just pointed out a very likely cause of the problem. Initiating and then closing a lot of connections will leave a lot of connections in the TIME_WAIT state. If that happens, the server will not see new connections and the client will see timeouts.

You can’t put an elephant into a matchbox, no matter how hard you push.

There is no way, currently, to change the timeout delay. Making it client-code-configurable is a feature under discussion for possible inclusion in a future release.

@blake.meike @eunice.huang

  • At most, how many threads can perform replications concurrently? Per our settings, each thread does its replications with some interval of time; we are not performing replications concurrently on all the threads at once.
  • We have built a Java GUI in which we give inputs such as the number of threads, the number of iterations, and the user credentials. If we use System.exit(0) to kill the executor threads, the utility closes completely. Is there any way to kill the threads without using System.exit(0)?
  • We monitored the Couchbase Server during the run and did not see any abnormalities. Furthermore, looking at multiple timeout errors, I see that there was no activity in Sync Gateway for the user that hit the timeout. This indicates that Sync Gateway did not receive the sync request for that user at all.
  • How many threads can perform replications concurrently?
    The rule of thumb for threads in an application (that is, the total number of threads started by the entire app) is 2 * CpuCores < N < 3 * CpuCores (see the sketch after this list).
  • Is there any way to “kill” threads?
    No. Because what it means to “kill” a thread is so difficult to define, the methods for stopping, suspending, and resuming threads were deprecated long ago. The correct way to stop a thread is to use Thread.interrupt() to ask the thread to stop, and for the thread to cooperate by stopping itself.
  • The Sync Gateway did not receive the sync request for the user.
    That sounds reasonable. You didn’t provide any information on your network setup, so I’m left to guess at how the connection is failing. If the client is receiving a 5003 status, though, it is certain that the SG is not getting the sync.
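
To tie those last two answers together, here is a rough sketch of sizing a pool from the core count and shutting it down cooperatively instead of calling System.exit(0) (illustrative only, not a drop-in implementation):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    final class ReplicationPool {
        static ExecutorService newPool() {
            // Rule of thumb: total threads for the whole app between 2x and 3x the core count.
            int cores = Runtime.getRuntime().availableProcessors();
            return Executors.newFixedThreadPool(2 * cores);
        }

        static void stop(ExecutorService pool) throws InterruptedException {
            pool.shutdown();                                   // no new tasks; queued work drains
            if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
                pool.shutdownNow();                            // interrupts running tasks; each task
            }                                                  // must cooperate by checking the flag
        }
    }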