Problem while exporting data as CSV using the cbtransfer tool


#1

Hi team,

I am facing an issue when trying to export data from a bucket using the cbtransfer tool. My environment is as follows:

Couchbase version: 2.1.1 community edition (build-764)
Cluster details: 3-node Couchbase cluster
One of the buckets is DIMENSION_METADATA; at the moment it contains only 19 documents.

I am trying to use the cbtransfer tool to export the data as a CSV file. I was under the impression that the tool would export all of the bucket's data from every Couchbase node into the specified CSV file. From my experiments, it appears the tool truncates the file before appending the contents from each node. In the following example I expected 19 + 1 (CSV header) lines in the output file but got only 5 + 1 (including the header).

EXAMPLE1*****************************************************START
[root@smaster1 ~]# /opt/couchbase/bin/cbtransfer http://localhost:8091 csv:/tmp/dm_1.csv -b DIMENSION_METADATA -u Administrator -p password -v
2016-04-01 15:48:55,403: mt cbtransfer...
2016-04-01 15:48:55,404: mt source : http://localhost:8091
2016-04-01 15:48:55,404: mt sink : csv:/tmp/dm_1.csv
2016-04-01 15:48:55,404: mt opts : {'username': '', 'source_vbucket_state': 'active', 'destination_vbucket_state': 'active', 'verbose': 1, 'dry_run': False, 'extra': {'max_retry': 10.0, 'rehash': 0.0, 'data_only': 0.0, 'nmv_retry': 1.0, 'conflict_resolve': 1.0, 'cbb_max_mb': 100000.0, 'try_xwm': 1.0, 'batch_max_bytes': 400000.0, 'report_full': 2000.0, 'batch_max_size': 1000.0, 'report': 5.0, 'design_doc_only': 0.0, 'recv_min_bytes': 4096.0}, 'single_node': False, 'bucket_destination': None, 'destination_operation': None, 'vbucket_list': None, 'threads': 4, 'key': None, 'password': '', 'id': None, 'bucket_source': 'DIMENSION_METADATA'}
2016-04-01 15:48:55,531: mt bucket: DIMENSION_METADATA
2016-04-01 15:48:55,626: w0 source : http://localhost:8091(DIMENSION_METADATA@50.50.50.127:8091)
2016-04-01 15:48:55,627: w0 sink : csv:/tmp/dm_1.csv(DIMENSION_METADATA@50.50.50.127:8091)
2016-04-01 15:48:55,627: w0 : total | last | per sec
2016-04-01 15:48:55,627: w0 batch : 22 | 22 | 231.5
2016-04-01 15:48:55,627: w0 byte : 123 | 123 | 1294.3
2016-04-01 15:48:55,627: w0 msg : 7 | 7 | 73.7
2016-04-01 15:48:55,674: w0 source : http://localhost:8091(DIMENSION_METADATA@smaster2.product.company.com:8091)
2016-04-01 15:48:55,674: w0 sink : csv:/tmp/dm_1.csv(DIMENSION_METADATA@smaster2.product.company.com:8091)
2016-04-01 15:48:55,674: w0 : total | last | per sec
2016-04-01 15:48:55,674: w0 batch : 15 | 15 | 318.9
2016-04-01 15:48:55,674: w0 byte : 110 | 110 | 2338.6
2016-04-01 15:48:55,674: w0 msg : 7 | 7 | 148.8
2016-04-01 15:48:55,720: w0 source : http://localhost:8091(DIMENSION_METADATA@snfs2.product.company.com:8091)
2016-04-01 15:48:55,720: w0 sink : csv:/tmp/dm_1.csv(DIMENSION_METADATA@snfs2.product.company.com:8091)
2016-04-01 15:48:55,721: w0 : total | last | per sec
2016-04-01 15:48:55,721: w0 batch : 10 | 10 | 218.3
2016-04-01 15:48:55,721: w0 byte : 83 | 83 | 1811.9
2016-04-01 15:48:55,721: w0 msg : 5 | 5 | 109.2
[####################] 100.0% (19/19 msgs)
bucket: DIMENSION_METADATA, msgs transferred...
: total | last | per sec
batch : 47 | 47 | 219.0
byte : 316 | 316 | 1472.3
msg : 19 | 19 | 88.5
2016-04-01 15:48:55,750: mt warning: cannot save bucket design on a CSV destination
done

[root@smaster1 ~]# wc -l /tmp/dm_1.csv
6
EXAMPLE1*****************************************************END

However, if I export the data on a per-node basis using the --single-node parameter, then I can find all the keys in the generated CSV files (although the records are spread across different files).

EXAMPLE2*****************************************************START
[root@smaster1 ~]# /opt/couchbase/bin/cbtransfer http://50.50.50.127:8091 csv:/tmp/dm_2.csv -b DIMENSION_METADATA -u Administrator -p password -v --single-node
2016-04-01 16:03:32,393: mt cbtransfer...
2016-04-01 16:03:32,393: mt source : http://50.50.50.127:8091
2016-04-01 16:03:32,393: mt sink : csv:/tmp/dm_2.csv
2016-04-01 16:03:32,393: mt opts : {'username': '', 'source_vbucket_state': 'active', 'destination_vbucket_state': 'active', 'verbose': 1, 'dry_run': False, 'extra': {'max_retry': 10.0, 'rehash': 0.0, 'data_only': 0.0, 'nmv_retry': 1.0, 'conflict_resolve': 1.0, 'cbb_max_mb': 100000.0, 'try_xwm': 1.0, 'batch_max_bytes': 400000.0, 'report_full': 2000.0, 'batch_max_size': 1000.0, 'report': 5.0, 'design_doc_only': 0.0, 'recv_min_bytes': 4096.0}, 'single_node': True, 'bucket_destination': None, 'destination_operation': None, 'vbucket_list': None, 'threads': 4, 'key': None, 'password': '', 'id': None, 'bucket_source': 'DIMENSION_METADATA'}
2016-04-01 16:03:32,521: mt bucket: DIMENSION_METADATA
2016-04-01 16:03:32,575: w0 source : http://50.50.50.127:8091(DIMENSION_METADATA@50.50.50.127:8091)
2016-04-01 16:03:32,575: w0 sink : csv:/tmp/dm_2.csv(DIMENSION_METADATA@50.50.50.127:8091)
2016-04-01 16:03:32,575: w0 : total | last | per sec
2016-04-01 16:03:32,575: w0 batch : 15 | 15 | 279.7
2016-04-01 16:03:32,575: w0 byte : 123 | 123 | 2293.9
2016-04-01 16:03:32,575: w0 msg : 7 | 7 | 130.5
[####################] 100.0% (7/7 msgs)
bucket: DIMENSION_METADATA, msgs transferred...
: total | last | per sec
batch : 15 | 15 | 160.8
byte : 123 | 123 | 1318.4
msg : 7 | 7 | 75.0
2016-04-01 16:03:32,617: mt warning: cannot save bucket design on a CSV destination
done
[root@smaster1 ~]# wc -l /tmp/dm_2.csv
8 /tmp/dm_2.csv

[root@smaster1 ~]# /opt/couchbase/bin/cbtransfer http://smaster2.product.company.com:8091 csv:/tmp/dm_3.csv -b DIMENSION_METADATA -u Administrator -p password -v --single-node
2016-04-01 16:02:04,811: mt cbtransfer...
2016-04-01 16:02:04,811: mt source : http://smaster2.product.company.com:8091
2016-04-01 16:02:04,811: mt sink : csv:/tmp/dm_3.csv
2016-04-01 16:02:04,811: mt opts : {'username': '', 'source_vbucket_state': 'active', 'destination_vbucket_state': 'active', 'verbose': 1, 'dry_run': False, 'extra': {'max_retry': 10.0, 'rehash': 0.0, 'data_only': 0.0, 'nmv_retry': 1.0, 'conflict_resolve': 1.0, 'cbb_max_mb': 100000.0, 'try_xwm': 1.0, 'batch_max_bytes': 400000.0, 'report_full': 2000.0, 'batch_max_size': 1000.0, 'report': 5.0, 'design_doc_only': 0.0, 'recv_min_bytes': 4096.0}, 'single_node': True, 'bucket_destination': None, 'destination_operation': None, 'vbucket_list': None, 'threads': 4, 'key': None, 'password': '', 'id': None, 'bucket_source': 'DIMENSION_METADATA'}
2016-04-01 16:02:04,945: mt bucket: DIMENSION_METADATA
2016-04-01 16:02:05,005: w0 source : http://smaster2.product.company.com:8091(DIMENSION_METADATA@smaster2.product.company.com:8091)
2016-04-01 16:02:05,005: w0 sink : csv:/tmp/dm_3.csv(DIMENSION_METADATA@smaster2.product.company.com:8091)
2016-04-01 16:02:05,005: w0 : total | last | per sec
2016-04-01 16:02:05,005: w0 batch : 15 | 15 | 251.2
2016-04-01 16:02:05,005: w0 byte : 110 | 110 | 1842.0
2016-04-01 16:02:05,005: w0 msg : 7 | 7 | 117.2
[####################] 100.0% (7/7 msgs)
bucket: DIMENSION_METADATA, msgs transferred...
: total | last | per sec
batch : 15 | 15 | 152.2
byte : 110 | 110 | 1116.1
msg : 7 | 7 | 71.0
2016-04-01 16:02:05,046: mt warning: cannot save bucket design on a CSV destination
done

[root@smaster1 ~]# wc -l /tmp/dm_3.csv
8 /tmp/dm_3.csv


[root@smaster1 ~]# /opt/couchbase/bin/cbtransfer http://snfs2.product.company.com:8091 csv:/tmp/dm_4.csv -b DIMENSION_METADATA -u Administrator -p password -v --single-node
2016-04-01 16:00:13,267: mt cbtransfer...
2016-04-01 16:00:13,267: mt source : http://snfs2.product.company.com:8091
2016-04-01 16:00:13,267: mt sink : csv:/tmp/dm_4.csv
2016-04-01 16:00:13,267: mt opts : {'username': '', 'source_vbucket_state': 'active', 'destination_vbucket_state': 'active', 'verbose': 1, 'dry_run': False, 'extra': {'max_retry': 10.0, 'rehash': 0.0, 'data_only': 0.0, 'nmv_retry': 1.0, 'conflict_resolve': 1.0, 'cbb_max_mb': 100000.0, 'try_xwm': 1.0, 'batch_max_bytes': 400000.0, 'report_full': 2000.0, 'batch_max_size': 1000.0, 'report': 5.0, 'design_doc_only': 0.0, 'recv_min_bytes': 4096.0}, 'single_node': True, 'bucket_destination': None, 'destination_operation': None, 'vbucket_list': None, 'threads': 4, 'key': None, 'password': '', 'id': None, 'bucket_source': 'DIMENSION_METADATA'}
2016-04-01 16:00:16,194: mt bucket: DIMENSION_METADATA
2016-04-01 16:00:16,247: w0 source : http://snfs2.product.company.com:8091(DIMENSION_METADATA@snfs2.product.company.com:8091)
2016-04-01 16:00:16,248: w0 sink : csv:/tmp/dm_4.csv(DIMENSION_METADATA@snfs2.product.company.com:8091)
2016-04-01 16:00:16,248: w0 : total | last | per sec
2016-04-01 16:00:16,248: w0 batch : 11 | 11 | 207.5
2016-04-01 16:00:16,248: w0 byte : 83 | 83 | 1565.6
2016-04-01 16:00:16,248: w0 msg : 5 | 5 | 94.3
[####################] 100.0% (5/5 msgs)
bucket: DIMENSION_METADATA, msgs transferred...
: total | last | per sec
batch : 11 | 11 | 159.4
byte : 83 | 83 | 1202.4
msg : 5 | 5 | 72.4
2016-04-01 16:00:16,266: mt warning: cannot save bucket design on a CSV destination
done
[root@smaster1 ~]# wc -l /tmp/dm_4.csv
6 /tmp/dm_4.csv

EXAMPLE2*****************************************************END

If we add up the record counts in dm_2.csv, dm_3.csv and dm_4.csv, excluding the headers, we get 19.
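Under the 2.1.1 behaviour, those per-node files can be merged afterwards into a single CSV by keeping only one header line. A minimal sketch using the file names from the examples above (it assumes all three files exist and share the same header):

```shell
# Merge the per-node exports into one CSV, keeping only the first header line.
# Assumes dm_2.csv, dm_3.csv and dm_4.csv were produced by the --single-node runs.
head -n 1 /tmp/dm_2.csv > /tmp/dm_all.csv
for f in /tmp/dm_2.csv /tmp/dm_3.csv /tmp/dm_4.csv; do
    tail -n +2 "$f" >> /tmp/dm_all.csv   # append every line except the header
done
wc -l /tmp/dm_all.csv   # for the files above this should be 20 lines: 19 records + 1 header
```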

Can you please confirm whether the behaviour in EXAMPLE1 is correct?
If it is not, could you please tell me whether this behaviour is different in newer releases of Couchbase?

Thanks,
Richards.


#2

It looks like there is a bug in Couchbase 2.1.1 relating to CSV export. As you mentioned, each time data from a new node is exported, it overwrites the data that was exported from the previous node. I tested this with the Couchbase 4.5 developer preview and the issue is fixed in that release. The correct behavior is for a new file to be created for each node when exporting data from the entire cluster. As a workaround, you can do what you are doing in example 2 and export data from each node individually.
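For reference, that per-node workaround can be scripted as a simple loop. The hostnames below are the ones from the examples in this thread and the cbtransfer path is the default install location, so adjust both for your cluster:

```shell
# Workaround sketch: export the bucket from each node separately with --single-node.
CBTRANSFER=/opt/couchbase/bin/cbtransfer
for node in 50.50.50.127 smaster2.product.company.com snfs2.product.company.com; do
    "$CBTRANSFER" "http://${node}:8091" "csv:/tmp/dm_${node}.csv" \
        -b DIMENSION_METADATA -u Administrator -p password --single-node
done
```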


#3

Hi mikew,

Thank you for confirming the behavior of the Couchbase cbtransfer tool.

Richards Peter.


#4

Hi all,

I would also like to report one more behavior that I noticed with the cbtransfer tool in Couchbase 2.1.1.

When I export data from a Couchbase bucket as per example 2 (exporting data from each node separately), I sometimes get duplicate records in my CSV files. I see this behavior when my bucket has a large number of records (in the millions, e.g. 9 million).

In one of my tests involving 12 million records in a bucket on a standalone Couchbase node, I got 4 million duplicate records: there were 16 million records in my CSV file.

In another test involving 9.7 million records in a bucket on a 3-node Couchbase cluster, I got 9.8 million records in my CSV file.
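Until that is confirmed, one way to work around the duplicates is to de-duplicate the CSV after export. Assuming the duplicates are byte-identical rows (the same record streamed twice) and no field contains embedded newlines, a plain sort -u is enough; the file name below is hypothetical:

```shell
# Drop byte-identical duplicate rows from an exported CSV, preserving the header.
# /tmp/dm_export.csv is a hypothetical export file name; adjust as needed.
head -n 1 /tmp/dm_export.csv > /tmp/dm_dedup.csv
tail -n +2 /tmp/dm_export.csv | sort -u >> /tmp/dm_dedup.csv
```

Note that sort -u only collapses fully identical lines; if the same key were exported with two different values, both rows would survive.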

It would be very helpful if this behavior could also be verified on the newer Couchbase server releases.

Thanks,
Richards Peter.