Unable to update 6m+ documents on community edition 3.0.1

I am trying to update 6 million+ documents in a Community Edition 3.0.1 server cluster. I am using the Java SDK and have tried various ways of reading a batch of documents from a view, updating them, and replacing them back into the bucket.

It seems that as the process progresses, the throughput drops so low that it is not even 300 ops/s. I tried many ways to speed this up, including the bulk operation pattern (using Observables), but in vain. I even let the process run for hours, only to see a TimeoutException later.
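For reference, the bulk-operation attempt followed the async pattern from the Java SDK 2.x documentation, roughly like the sketch below ("bucket" and "ids" come from my surrounding code, and the mutation shown is just a placeholder for my actual update):

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import rx.Observable;

import java.util.List;

// Sketch of the SDK 2.x bulk get/mutate/replace pattern I tried.
// "bucket" is an open Bucket and "ids" is one batch of document IDs
// read from the view; the put() below is a placeholder mutation.
List<JsonDocument> updated = Observable
        .from(ids)
        .flatMap(id -> bucket.async().get(id))        // fetch the batch in parallel
        .map(doc -> {
            doc.content().put("updated", true);       // placeholder mutation
            return doc;
        })
        .flatMap(doc -> bucket.async().replace(doc))  // write the batch back
        .toList()
        .toBlocking()
        .single();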

The last option I tried was to read all the document IDs from the view into a temp file, so that I could read the file back and update the records. But after 3 hours, with only 1.7m IDs read from the view, the DB threw a TimeoutException.

Note that the Couchbase cluster contains 3 servers with 8 cores, 24 GB RAM & 1 TB SSD each, and the Java code doing the updates runs on the same network. There is no other load on this cluster.

It seems even reading all the IDs from the view is impossible. I checked the network throughput, and the DB server was delivering data at barely 1 Mbps.

Below is the sample code used to read all the doc IDs from the view:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.view.Stale;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;
import com.couchbase.client.java.view.ViewRow;

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.util.Iterator;

final Bucket statsBucket = db.getStatsBucket();
int skipCount = 0;
int limitCount = 10000;

System.out.println("reading stats ids ...");

try (DataOutputStream out = new DataOutputStream(new FileOutputStream("rowIds.tmp")))
{
    while (true)
    {
        // Page through the view with skip/limit; Stale.TRUE avoids
        // forcing an index update on every page.
        ViewResult result = statsBucket.query(ViewQuery.from("Stats", "AllLogs")
                .skip(skipCount)
                .limit(limitCount)
                .stale(Stale.TRUE));

        Iterator<ViewRow> rows = result.iterator();

        if (!rows.hasNext())
        {
            break;
        }

        while (rows.hasNext())
        {
            out.writeUTF(rows.next().id());
        }

        skipCount += limitCount;
        System.out.println(skipCount);
    }
}

Is there a way to do this?

@anil, @ingenthr, can anyone check if this is a known issue?

Hey @daschl, can you advise please?

I found the solution. The ViewQuery.skip() method does not really skip and should not be used for pagination. skip() reads all the data from the beginning of the view and only starts emitting rows after that many records have been read, much like traversing a linked list. So each successive page gets more expensive, which explains the collapsing throughput.

The solution is to use startKey() and startKeyDocId(). What goes into these methods is the key and document ID of the last item you read. I got this solution from here: http://tugdualgrall.blogspot.in/2013/10/pagination-with-couchbase.html

So the final code to read all items in a view is:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.view.Stale;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;
import com.couchbase.client.java.view.ViewRow;

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.util.Iterator;

final Bucket statsBucket = db.getStatsBucket();
int limitCount = 10000;
int skipCount = 0;

System.out.println("reading stats ids ...");

try (DataOutputStream out = new DataOutputStream(new FileOutputStream("rowIds.tmp")))
{
    String lastKeyDocId = null;

    while (true)
    {
        ViewResult result;

        if (lastKeyDocId == null)
        {
            // First page: Stale.FALSE forces the index to update once
            // before we start reading.
            result = statsBucket.query(ViewQuery.from("Stats", "AllLogs")
                    .limit(limitCount)
                    .stale(Stale.FALSE));
        }
        else
        {
            // Later pages: resume at the last key read and skip that one
            // row. This works because the view emits the document ID as
            // its key; otherwise combine startKey() with startKeyDocId().
            result = statsBucket.query(ViewQuery.from("Stats", "AllLogs")
                    .limit(limitCount)
                    .stale(Stale.TRUE)
                    .startKey(lastKeyDocId)
                    .skip(1));
        }

        Iterator<ViewRow> rows = result.iterator();

        if (!rows.hasNext())
        {
            break;
        }

        while (rows.hasNext())
        {
            lastKeyDocId = rows.next().id();
            out.writeUTF(lastKeyDocId);
        }

        skipCount += limitCount;  // rough progress counter only
        System.out.println(skipCount);
    }
}
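One caveat on the code above: I pass the last document ID straight into startKey(), which works because this view emits the document ID as its key. If your view's keys are not unique per document, the article above combines both methods, something like this (lastKey and lastDocId being the key and id() of the last row read):

// Sketch: resuming a page in a view whose key is NOT the doc ID.
ViewResult result = statsBucket.query(ViewQuery.from("Stats", "AllLogs")
        .limit(limitCount)
        .stale(Stale.TRUE)
        .startKey(lastKey)          // last key read (may repeat across docs)
        .startKeyDocId(lastDocId)   // disambiguates rows sharing that key
        .skip(1));                  // skip the row already processed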

Ah, I should have seen that! Glad that you found a solution :blush:

@shreyas for reference you can also take a look at the paginator we have in the 1.x series: https://github.com/couchbase/couchbase-java-client/blob/release14/src/main/java/com/couchbase/client/protocol/views/Paginator.java
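From memory, usage in the 1.x series looked roughly like this (1.x API, not the 2.x SDK used above; treat it as a sketch):

import com.couchbase.client.CouchbaseClient;
import com.couchbase.client.protocol.views.Paginator;
import com.couchbase.client.protocol.views.Query;
import com.couchbase.client.protocol.views.Stale;
import com.couchbase.client.protocol.views.View;
import com.couchbase.client.protocol.views.ViewRow;

// "client" is an open 1.x CouchbaseClient; the Paginator handles the
// startkey/startkey_docid bookkeeping internally, one page per next().
View view = client.getView("Stats", "AllLogs");
Query query = new Query().setStale(Stale.OK);
Paginator pages = client.paginatedQuery(view, query, 10000);

while (pages.hasNext()) {
    for (ViewRow row : pages.next()) {
        System.out.println(row.getId());
    }
}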