How do I bulk insert a question?


#1

Hello!
We have about 50M records already in JSON, which we are looking to load into Couchbase in batches of 350k.
Because we are transforming log files into JSON, we are using curl to try to do the bulk operations. I’ve tried googling and following examples that seemed to work for other folks, but I’ve not been able to get a bulk operation to succeed using _bulk_docs.
Here is an example:

curl -X POST 'http://localhost:8092/default/_bulk_docs' -H 'Content-Type: application/json' -d '{"docs":[{"name":"tom"},{"name":"bob"}]}'

to which I get back

{"error":"doc_validation","reason":"User data must be in the `json` field, please nest `name`"}

I see the same thing when I copy-paste the example from the API docs, with the port being the only change. When I try using the port from that example, I get no response.

curl -X POST 'http://localhost:8092/default/_bulk_docs' -H 'Content-Type: application/json' -d '{"docs" : [{"_id" : "FishStew","servings" : 4,"subtitle" : "Delicious with fresh bread","title" : "Fish Stew"},{"_id" : "LambStew","servings" : 6,"subtitle" : "Delicious with scone topping","title" : "Lamb Stew"},{"servings" : 8,"subtitle" : "Delicious with suet dumplings","title" : "Beef Stew"}]}'

{"error":"doc_validation","reason":"User data must be in the `json` field, please nest `_id`"}

I’m wondering what I might change in my doc array to get the bulk action going or if anyone has any tips?

Thanks in advance!
Jop


#2

Thanks for the reply anil!

I’ve not tried the php script just yet, about to give that a go.

I tried the second example referred to in the blog post and downloaded a JSON dataset from the site. When I try to load that zip file, I get errors. It seems to connect and find my bucket, then it gets angry in Python and quits with this output:

{'username': 'jopsUserName', 'node': '192.168.231.58:8091', 'password': 'jopsPassword', 'bucket': 'default', 'ram_quota': 1000} ['test.zip']
[2013-01-31 21:35:44,412] - [rest_client] [139716557670144] - INFO - existing buckets : [u'default', u'logging']
[2013-01-31 21:35:44,412] - [rest_client] [139716557670144] - INFO - found bucket default
Traceback (most recent call last):
  File "/opt/couchbase/lib/python/cbdocloader", line 237, in <module>
    main()
  File "/opt/couchbase/lib/python/cbdocloader", line 229, in main
    docloader.populate_docs()
  File "/opt/couchbase/lib/python/cbdocloader", line 179, in populate_docs
    self.bucket = cb[self.options.bucket]
  File "/opt/couchbase/lib/python/couchbase/client.py", line 161, in __getitem__
    return self.bucket(key)
  File "/opt/couchbase/lib/python/couchbase/client.py", line 118, in bucket
    return Bucket(bucket_name, self)
  File "/opt/couchbase/lib/python/couchbase/client.py", line 217, in __init__
    self.bucket_password)
  File "/opt/couchbase/lib/python/couchbase/couchbaseclient.py", line 686, in __init__
    self.init_vbucket_connections()
  File "/opt/couchbase/lib/python/couchbase/couchbaseclient.py", line 781, in init_vbucket_connections
    self.start_vbucket_connection(i)
  File "/opt/couchbase/lib/python/couchbase/couchbaseclient.py", line 791, in start_vbucket_connection
    serverPort, self.bucket)
  File "/opt/couchbase/lib/python/couchbase/couchbaseclient.py", line 1250, in direct_client
    .encode('ascii'))
  File "/opt/couchbase/lib/python/couchbase/couchbaseclient.py", line 431, in sasl_auth_plain
    password]))
  File "/opt/couchbase/lib/python/couchbase/couchbaseclient.py", line 426, in sasl_auth_start
    return self._doCmd(MemcachedConstants.CMD_SASL_AUTH, mech, data)
  File "/opt/couchbase/lib/python/couchbase/couchbaseclient.py", line 315, in _doCmd
    self._sendCmd(cmd, key, val, opaque, extraHeader, cas)
  File "/opt/couchbase/lib/python/couchbase/couchbaseclient.py", line 261, in _sendCmd
    vbucketId=self.vbucketId)
  File "/opt/couchbase/lib/python/couchbase/couchbaseclient.py", line 270, in _sendMsg
    self.s.send(msg + extraHeader + key + val)
socket.error: [Errno 32] Broken pipe

I’ll let you know how the PHP script goes, although I’d be elated to get this docloader working as well; it looks like just what we need.

Thanks
Jop


#3

Hello,

For bulk operations we provide two options. If you choose to use an SDK, the Couchbase SDK APIs cover this under ‘Performing a Bulk Set’; here is the documentation (http://www.couchbase.com/docs/couchbase-devguide-2.0/populating-cb.html). We also provide the ‘CBDocLoader’ tool for bulk loading JSON documents; here is the blog post on that (http://blog.couchbase.com/loading-json-data-couchbase), and there are some examples here (https://github.com/couchbase/couchbase-examples).

Hope that helps…
Anil


#4

Sure, we would definitely like to get CBDocLoader working for you. Can you send us a sample snippet of the JSON document? We are interested in seeing the ‘format’ of the document.

Anil


#5

Totally, one might find said file here.

I basically just borrowed one of the .json files from the trees data and put it into a directory named ‘test’, zipping it all up.

I also tried feeding in just the directory and just the JSON file, but neither worked. I dug around in the beer examples and they seem to have only one JSON object per file. Is that the proper format?


#6

Yes, that’s correct: one JSON object per file is the proper format. If you check the blog post I mentioned (http://blog.couchbase.com/loading-json-data-couchbase), we use a simple Python script to split the JSON objects into separate files, one object per file. We then loaded the data into Couchbase using the cbdocloader tool.
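
The splitting step can be sketched roughly as follows. This is not the script from the blog post, just a minimal illustration under assumptions: the input is taken to be a single file holding a top-level JSON array, and the function name `split_json_array` and the `doc_000000.json` naming scheme are hypothetical.

```python
import json
import os


def split_json_array(src_path, out_dir, prefix="doc"):
    """Write each element of a top-level JSON array to its own file.

    Returns the list of file paths written, in input order.
    """
    with open(src_path, encoding="utf-8") as fh:
        docs = json.load(fh)
    if not isinstance(docs, list):
        raise ValueError("expected a top-level JSON array")

    os.makedirs(out_dir, exist_ok=True)
    written = []
    for i, doc in enumerate(docs):
        # Hypothetical naming scheme: doc_000000.json, doc_000001.json, ...
        path = os.path.join(out_dir, "%s_%06d.json" % (prefix, i))
        with open(path, "w", encoding="utf-8") as out:
            json.dump(doc, out)
        written.append(path)
    return written
```

Calling `split_json_array("trees.json", "test")` (file and directory names are placeholders) would leave one file per object in `test`, which is the layout described above.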

Hope that helps…
Anil


#7

That totally worked. Thanks a bunch for your help.

As an aside, the issue now is that we have 350k JSON files in a directory which, when zipped, creates a 174MB zip from a 101MB log file. This is okay, and our ops/sec are now at ~300 vs. ~50, which is a fantastic improvement. We’ve found that pointing it at a directory works but pointing it at a .zip doesn’t, which is weird: no errors, it just says ‘done’ and we have no docs. I assumed it might be the async write, so I waited a bit, but the docs never showed up. I’ll dig into that though; it’s probably something with the zip.
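
One way to start digging into the zip is to list what it actually contains and compare that with an archive that is known to load, such as one of the sample datasets. A minimal sketch using only the Python standard library; `summarize_zip` is a hypothetical helper, and it assumes nothing about the layout the loader expects.

```python
import zipfile
from collections import Counter


def summarize_zip(zip_path):
    """Return (number of .json entries, Counter of top-level names)."""
    with zipfile.ZipFile(zip_path) as zf:
        # Skip pure directory entries, which end with a slash.
        names = [n for n in zf.namelist() if not n.endswith("/")]
    json_count = sum(1 for n in names if n.lower().endswith(".json"))
    # Group entries by their first path component to expose the layout.
    top_level = Counter(n.split("/", 1)[0] for n in names)
    return json_count, top_level
```

If the count of `.json` entries is zero, or the top-level names differ from those in a working archive, that would point at how the zip was built rather than at the write path.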

Do you know if there will eventually be support for the CouchDB-style _bulk_docs action?

Thanks again for all your help and the 6x improvement in throughput!
Jop