How do I get all keys from the bucket?


#1

Fetching every key is not fully supported, because it is easy to make requests that consume a lot of memory or other resources if you’re not careful, but it sounds like what you want is TAP. See:
http://www.couchbase.com/wiki/display/couchbase/TAP+Protocol

Note that the Java client (couchbase.com/develop/java/current) has TAP implemented. Use it with caution and at your own risk. As long as you stay away from checkpoint and registration, it will generally be okay but can cause quite a bit of disk IO if you have much more data on disk than in memory.

Updated Sept. 2016: TAP was replaced some time ago (around Couchbase Server 3.0) by a new protocol called DCP. The Go client now has DCP implemented as an unsupported/uncommitted feature. There is also a new Java DCP client as a separate library, also unsupported/uncommitted.


#2

Hello,

I have a question about Couchbase 2.0.
I’m using Membase 1.7 and I need to retrieve all keys from the bucket.
I’ve read that Couchbase 2.0 adds query support.
With Couchbase 2.0, will it be possible to query the bucket and retrieve all keys and/or values? How?

Thanks.


#3

I’ve downloaded jtap (https://github.com/mikewied/jtap) and I’ve compiled this example to retrieve all keys:

— TapRunner.java —
import com.membase.jtap.*;
import com.membase.jtap.exporter.*;
import com.membase.jtap.ops.*;

public class TapRunner
{
    public static void main(String args[])
    {
        TapStreamClient client = new TapStreamClient("localhost", 11210, "default", null);
        Exporter exporter = new FileExporter("results.txt");
        CustomStream tapListener = new CustomStream(exporter, "node1");
        tapListener.keysOnly();
        tapListener.doDump();
        client.start(tapListener);
    }
}
— TapRunner.java —
I can compile it without errors using:

javac -cp .:jtap.jar TapRunner.java

But when I run it I receive these errors:

java -cp .:jtap.jar TapRunner

Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory
at com.membase.jtap.TapStreamClient.<init>(Unknown Source)
at TapRunner.main(TapRunner.java:11)
Caused by: java.lang.ClassNotFoundException: org.slf4j.LoggerFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:336)
… 2 more

Where is the problem?
Is writing to a text file the only way to retrieve all keys with the TAP protocol?
Can’t I iterate over the keys some other way?

Thanks!


#4

Sorry for the confusion, but definitely do not use JTap. All of that functionality has been added to the Couchbase Java Client at couchbase.com/develop/java/current

Again, fair warning, use this at your own risk (have a look at that wiki page).


#5

No, it’s not currently available in the .NET client library and it’s only experimental in the Java client library.


#6

Is it possible to iterate through all the key/value pairs using the .NET api?

Thanks


#7

Wouldn’t it be easiest to just add a view with the following map?

function (doc, meta) {
  emit(meta.id, null);
}

It does add the overhead of an index, but then you can just use the regular query() calls to get all keys.


#8

Yes, create a primary index as adavidson pointed out.
function (doc, meta) {
  emit(meta.id, null);
}
This lets you get all the doc IDs back, search over a range, and so on. You can then fetch the documents using the GET API or a multi-get (mget). That’s the most performant way.
One thing to remember, though, is that this will give you ONLY the persisted, indexed documents. Given Couchbase’s asynchronous architecture, there may be additional documents in the managed cache that haven’t been persisted yet.
You can also use “limit” and “skip” to step through the result set.
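
As a rough illustration of stepping through the result set with "limit" and "skip", here is a minimal Java sketch that only builds the page URLs for the view in this post. The class and method names (SkipPaging, pageUrl, pageCount) are made up for this example, the page size of 1000 is an arbitrary assumption, and no HTTP request is made.

```java
// Illustrative only: builds limit/skip page URLs for a view; names are hypothetical.
final class SkipPaging {
    // View URL taken from the example in this post.
    static final String VIEW_URL =
        "http://127.0.0.1:8092/beer-sample/_design/dev_primary_key/view/primary";

    // URL for the zero-based page `page`, with `pageSize` rows per page.
    static String pageUrl(String viewUrl, int page, int pageSize) {
        if (page < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("page must be >= 0 and pageSize > 0");
        }
        long skip = (long) page * pageSize; // long avoids int overflow on deep pages
        return viewUrl + "?limit=" + pageSize + "&skip=" + skip;
    }

    // Number of pages needed to cover totalRows (the "total_rows" field of a view response).
    static long pageCount(long totalRows, int pageSize) {
        return (totalRows + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        int pageSize = 1000;                      // assumed page size
        long pages = pageCount(7315, pageSize);   // 7315 is total_rows in the sample response
        for (int page = 0; page < pages; page++) {
            System.out.println(pageUrl(VIEW_URL, page, pageSize));
        }
    }
}
```

Each printed URL can then be fetched with any HTTP client, and the returned IDs passed to a get or multi-get.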

http://127.0.0.1:8092/beer-sample/_design/dev_primary_key/view/primary
{"total_rows":7315,"rows":[
{"id":"110f033e61","key":"110f033e61","value":null},
{"id":"110f03499b","key":"110f03499b","value":null},
{"id":"110f035200","key":"110f035200","value":null},
{"id":"110f035db2","key":"110f035db2","value":null},
{"id":"110f035e84","key":"110f035e84","value":null},
{"id":"110f03622c","key":"110f03622c","value":null},
{"id":"110f036718","key":"110f036718","value":null},




hope this helps


#9

@dipti,
I tried to create a view with the following definition:

function (doc, meta) {
  emit(meta.id, null);
}

I’m trying to query it (I tried both calling the view URL directly and using the .NET SDK).

My bucket contains ~17 million documents. If I request a range of documents near the beginning of the bucket, for example &limit=1000&skip=1000, it returns quickly, in under 200 ms. If I request 1000 documents starting at 500,000 (&limit=1000&skip=500000), it’s much slower and takes over 5 seconds. If I request docs starting at 10,000,000 (&limit=1000&skip=10000000), it takes two and a half minutes.

Is there a way to speed it up? Just by creating this view, was the “primary index” automatically created, or do I have to explicitly create it?
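
One commonly suggested way around slow deep skip values is to page by key instead: remember the last key of each page and pass it as startkey on the next request, with skip=1 to step past the row already seen. As far as I know, skip still has to walk the index up to the offset, which would explain the timings above. The Java sketch below only builds such URLs. KeyPaging and nextPageUrl are hypothetical names, and it assumes the keys are unique (true here, since the view emits meta.id) and contain no control characters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative only: builds key-based ("startkey") page URLs; names are hypothetical.
final class KeyPaging {
    // View keys are JSON values, so a string key must be quoted and escaped.
    // Assumption: the key contains no control characters.
    static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // URL for the page that follows `lastKey`; pass null to get the first page.
    static String nextPageUrl(String viewUrl, String lastKey, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        String url = viewUrl + "?limit=" + pageSize;
        if (lastKey == null) {
            return url;
        }
        try {
            String startKey = URLEncoder.encode(jsonString(lastKey), "UTF-8");
            // skip=1 steps over the row for lastKey, which the previous page already returned.
            return url + "&startkey=" + startKey + "&skip=1";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    public static void main(String[] args) {
        String view = "http://127.0.0.1:8092/beer-sample/_design/dev_primary_key/view/primary";
        System.out.println(nextPageUrl(view, null, 1000));         // first page
        System.out.println(nextPageUrl(view, "110f036718", 1000)); // page after that key
    }
}
```

With this approach every request starts from a key lookup and skips a single row, so the cost per page should not grow with how far into the bucket you are.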


#10

I worked around this by not processing in batches at all: I just download the whole JSON without any skip or limit and load the whole thing into memory at the beginning of my script.


#11

I have a similar question.

Using the Java API, I often want to perform operations on large sets of data. I obviously don’t want to load everything into memory at once. What I think I want to do is load a batch of keys, perform the data changes, then move on to the next batch of keys. Is there a recommended way to do this?


#12

You can use the Java DCP client to iterate over the bucket contents consistently. It does not accumulate objects into batches, but that can be done on the user side.
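
As a minimal sketch of what batching on the user side could look like, the Java snippet below groups any stream of keys into fixed-size batches and hands each batch to a callback. It does not use the DCP client at all: KeyBatcher and forEachBatch are hypothetical names, and the key source is a plain Iterator standing in for whatever delivers the keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Illustrative only: groups items from an iterator into fixed-size batches.
final class KeyBatcher {
    // Calls `handler` once per batch and returns the number of batches handled.
    static <T> int forEachBatch(Iterator<T> source, int batchSize, Consumer<List<T>> handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<T> batch = new ArrayList<>(batchSize);
        while (source.hasNext()) {
            batch.add(source.next());
            if (batch.size() == batchSize) {
                handler.accept(batch);
                batches++;
                batch = new ArrayList<>(batchSize); // fresh list so the handler may keep the old one
            }
        }
        if (!batch.isEmpty()) { // final, possibly smaller batch
            handler.accept(batch);
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        // Sample keys taken from the view output earlier in this thread.
        List<String> keys = Arrays.asList(
            "110f033e61", "110f03499b", "110f035200", "110f035db2", "110f035e84");
        int count = forEachBatch(keys.iterator(), 2,
            batch -> System.out.println("processing " + batch));
        System.out.println(count + " batches");
    }
}
```

Only one batch is held in memory at a time, which is the point of the pattern asked about above.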

You can start with

And then look into flow control settings

If you accumulate data in your application and do not send back acknowledgements, the cluster will pause transmission until your application has processed the data and released the objects.
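
The pause-until-acknowledged behaviour described here can be pictured with a toy model. The Java sketch below is not the DCP client's API and not the real protocol: FlowControlModel, trySend, and acknowledge are hypothetical names, and the buffer size of 100 bytes is an arbitrary assumption. It only shows the idea that the sender stops once the unacknowledged bytes would exceed an agreed buffer, and resumes after the consumer acknowledges.

```java
// Toy model of acknowledgement-based flow control; not the DCP client's API.
final class FlowControlModel {
    private final int bufferSize; // bytes the sender may have outstanding
    private int unackedBytes;     // bytes sent but not yet acknowledged

    FlowControlModel(int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        this.bufferSize = bufferSize;
    }

    // Sender side: returns false (transmission paused) if the message would overflow the buffer.
    boolean trySend(int messageBytes) {
        if (unackedBytes + messageBytes > bufferSize) {
            return false;
        }
        unackedBytes += messageBytes;
        return true;
    }

    // Consumer side: call after the data has been processed and released.
    void acknowledge(int bytes) {
        unackedBytes = Math.max(0, unackedBytes - bytes);
    }

    int pending() {
        return unackedBytes;
    }

    public static void main(String[] args) {
        FlowControlModel flow = new FlowControlModel(100); // assumed buffer size
        System.out.println("send 1 accepted: " + flow.trySend(40));
        System.out.println("send 2 accepted: " + flow.trySend(40));
        System.out.println("send 3 accepted: " + flow.trySend(40)); // would exceed the buffer
        flow.acknowledge(40); // consumer finished with the first message
        System.out.println("send 3 retried:  " + flow.trySend(40));
    }
}
```

In this model, an application that batches data simply delays its acknowledge calls until a batch has been handled, and the sender waits in the meantime.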