CPU Load on upsert bulk process


#1

Hi gurus!

I'm starting to use Couchbase (for testing right now) for a project that will handle around 40 million documents. This is the environment:

 2 servers: 4 CPUs, 26 GB RAM (each)
 Ubuntu 14.04 LTS, Couchbase 4.5.0-2601 ED

I'm using upsert from the Java SDK to load the data from CSV. After around 20 million documents, I get a TemporaryFailureException. I modified the code to wait for a while and retry the upsert when the exception appears, but when the program retries I see high CPU load and get the exception again.

Do you think this is a hardware problem (I need more capacity), or can I solve the issue some other way?

Here are some screenshots from before and after the issue (images not preserved): CPU, OPS, and htop, each before and after.

Really basic code to load the data:

Observable
    .from(reader)
    .map(csvRow -> {
        linea++;
        // Build one JSON document per CSV row.
        JsonObject object = JsonObject.create();
        object.put("type", tabla.toLowerCase());
        object.put("rc", rc.toUpperCase());
        try {
            for (int j = 0; j < line.length; j++) {
                setValor(object, campos[j], csvRow[j], tipos[j]);
            }
        } catch (Exception e) {
            System.out.println(e.getMessage() + " in file: " + archivo + System.getProperty("line.separator") + "--> Line: " + linea + " <--");
        }
        // Random key for every table except utp_common_objects, which uses the first column.
        if (!tabla.toLowerCase().equals("utp_common_objects")) {
            return JsonDocument.create(UUID.randomUUID().toString() + new Random().nextInt(100), object);
        } else {
            return JsonDocument.create(csvRow[0], object);
        }
    })
    .subscribe(document -> {
        // Up to 6 attempts, waiting 1, 2, 3... minutes between them.
        for (int i = 0; i < 6; i++) {
            try {
                bucket.upsert(document);
                break;
            } catch (TemporaryFailureException ex) {
                try {
                    if (i == 5)
                        throw new TemporaryFailureException(ex);
                    System.out.println("Waiting " + (i + 1) + " min...");
                    Thread.sleep(60000 * (i + 1));
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop retrying instead of swallowing it.
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    },
    error -> {
        System.out.println(error);
    });

#2

As covered in the documentation, Couchbase intentionally returns TMPFAIL in temporary out-of-memory situations. This is one thing that is intentionally different in Couchbase compared to traditional databases, and it comes from our memcached history.

The philosophy here is to always be fast and let the application developer decide whether or not to redo the operation given that the cluster can’t service the operation at the moment.

So, what I’d recommend is adjusting the code to do an exponential backoff and retry with a do-while loop or something similar. The documentation covers this in the Error Handling section on Retry with delay.
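To make the idea concrete, here is a minimal sketch of a retry with capped exponential backoff and a little jitter. It is plain Java with no SDK dependency; the class name `BackoffRetry`, its method names, and all parameters are made up for illustration, and the predicate decides which exceptions count as temporary. If I recall correctly, the 2.x Java SDK also ships a `RetryBuilder` for the async API that does something similar via `retryWhen`, so check the docs before rolling your own.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;

/** Hypothetical helper: retries transient failures with capped exponential backoff. */
final class BackoffRetry {

    private BackoffRetry() { }

    /** Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxDelayMs. */
    static long delayFor(int attempt, long baseDelayMs, long maxDelayMs) {
        // Clamp the shift so the factor stays bounded for large attempt counts.
        long factor = 1L << Math.min(attempt, 30);
        return Math.min(maxDelayMs, baseDelayMs * factor);
    }

    /**
     * Runs op, retrying while isTransient accepts the exception and attempts remain.
     * The last exception is rethrown once maxAttempts is reached.
     */
    static <T> T call(Callable<T> op, Predicate<Exception> isTransient,
                      int maxAttempts, long baseDelayMs, long maxDelayMs) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                // Give up on non-transient errors or when the attempts are used up.
                if (!isTransient.test(e) || attempt + 1 >= maxAttempts) {
                    throw e;
                }
                long delay = delayFor(attempt, baseDelayMs, maxDelayMs);
                // Add up to 50% random jitter so parallel loaders do not retry in lockstep.
                long jitter = ThreadLocalRandom.current().nextLong(delay / 2 + 1);
                Thread.sleep(delay + jitter);
            }
        }
    }
}
```

In the loader above, the body of the `for` loop could then become something like `BackoffRetry.call(() -> bucket.upsert(document), e -> e instanceof TemporaryFailureException, 8, 100, 10_000)`, where the attempt count and delays are placeholder values to tune. Starting with short delays matters here: a TMPFAIL often clears within milliseconds to seconds, so waiting whole minutes mostly wastes load time.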


#3

Thanks ingenthr for your comments!

I modified my code to do an exponential backoff and retry when the exception appears, but the load still took a long time. I did some research before expanding my cluster by one or two nodes, and …

Lessons learned (what I did to get better performance):

  1. Schedule auto-compaction for a time when the DB has low load.
  2. Disable replicas (I don’t need them for now).
  3. Set the cache metadata option to Full Ejection (that’s OK for me).
  4. Monitor disk queues: I think this was the root of the issue. I saw that after some time about half of my RAM was occupied by cache. Monitoring the queues, I found more than 4M items waiting to be drained to disk… Both of my nodes are on one cloud provider (on a trial), and sometimes its disks are very, very slow.
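Since the disk queue turned out to be the bottleneck, one thing I could imagine doing is letting the loader pause itself whenever the queue gets too deep, instead of waiting for TemporaryFailure. This is only a sketch of that idea: `QueueThrottle` is a made-up class, and how you obtain the current queue length (the `LongSupplier` below) is left open and is an assumption on my side, for example some code that polls the bucket statistics.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper: pauses a bulk loader while the disk write queue is too deep. */
final class QueueThrottle {
    private final LongSupplier queueDepth;  // assumed source of the current queue length
    private final long highWaterMark;       // pause while the queue is above this value
    private final long pollIntervalMs;      // how long to sleep between checks

    QueueThrottle(LongSupplier queueDepth, long highWaterMark, long pollIntervalMs) {
        this.queueDepth = queueDepth;
        this.highWaterMark = highWaterMark;
        this.pollIntervalMs = pollIntervalMs;
    }

    /**
     * Blocks until the queue is at or below the high-water mark.
     * Returns how many poll intervals were spent waiting.
     */
    int awaitCapacity() throws InterruptedException {
        int waits = 0;
        while (queueDepth.getAsLong() > highWaterMark) {
            Thread.sleep(pollIntervalMs);
            waits++;
        }
        return waits;
    }
}
```

The loader would call `awaitCapacity()` every few thousand documents rather than before every upsert, so that polling the statistics does not become a load of its own.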

Screenshots (images not preserved): memory used by cache, disk queues, and disk write speed.

I’m new to Couchbase, but I hope this information helps somebody. If you are reading this topic and have additional information, any comments would be really appreciated.

Thanks guys!