Loading JSON into buckets

Hi,

I have one file which contains multiple JSON documents separated by commas.

Now I want to load json1 from that file into bucket x (creating document y), and json2 from the same file into the same bucket x (but as a different document, z).

Using cbdocloader we can load a file that contains a single JSON document into a bucket, but not a file with multiple JSON documents.

Can anyone please help me find a way to load multiple JSON documents from the same file into multiple documents in a bucket?

You can probably use a script with N1QL INSERT.

@ogrdsnielsen As Gerald said, I think you won’t get around a simple script, or in bash you could split it up into multiple docs first. That said, if you use a language where we have official SDKs you’d be better off using KV directly, since it gives you better performance on those kinds of operations (inserts where you know the key and the value).
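
For example, a KV insert with the Ruby SDK looks roughly like this. A minimal sketch: the connection string, credentials and SDK version are assumptions; only bucket x and keys y and z come from the question.

require "couchbase"

# Placeholder connection details; adjust for your cluster
cluster = Couchbase::Cluster.connect("couchbase://localhost", "Administrator", "password")
collection = cluster.bucket("x").default_collection

# KV insert: you supply the key and the value directly
collection.insert("y", { "foo" => "bar" })
collection.insert("z", { "baz" => "quux" })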

The performance should be similar with N1QL bulk INSERT.
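
For instance, a single N1QL statement can insert several documents at once. A minimal sketch, reusing the placeholder connection from above and the bucket x from the question:

require "couchbase"

cluster = Couchbase::Cluster.connect("couchbase://localhost", "Administrator", "password")

# One INSERT, many documents: each VALUES pair is (key, document)
cluster.query(<<~N1QL)
  INSERT INTO `x` (KEY, VALUE)
  VALUES ("y", {"foo": "bar"}),
         ("z", {"baz": "quux"})
N1QL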

But we are planning for millions of JSON documents (approximately 4 million).

For example, we want to load 400 files, and each file contains up to 10,000 JSON documents.

If I use N1QL it will be difficult to insert that many JSON documents manually.

Why not just write a converter to the cbdocloader format? cbdocloader cannot support every possible formatting of your documents.

Let’s say you have tmp.json with the following content:

$ cat tmp.json
{"foo":"bar"},{"baz":"quux"}

With a simple one-liner you can split it into multiple documents:

$ ruby -rjson -e 'file="tmp.json"; docs = JSON.load("[" + File.read(file) + "]"); docs.each_with_index{|d,i| File.write(file.sub(".json", "-#{i}.json"), JSON.dump(d))}'
$ cat tmp-0.json 
{"foo":"bar"}
$ cat tmp-1.json 
{"baz":"quux"}

The script is really trivial:

file="tmp.json"
docs = JSON.load("[" + File.read(file) + "]")
docs.each_with_index do |d,i|
  File.write(file.sub(".json", "-#{i}.json"), JSON.dump(d))}
end

You only need to put the results into a docs/ directory and zip them all. For example, you can take a look at how the travel-sample dataset has been created.

Thank you @avsej… but we are already using that approach to split the JSON documents into multiple files.

But we ran into a problem on the Unix box with a limit on the number of files.

Suppose, for example, our table has some 20 million records; when we run the script to generate the JSON files, only 4 million or so get created on the Unix box. Is there some threshold limit in Unix?

Is there any way to overcome that and generate 20 million JSON files on the Unix box?

In this case, why don’t you use a regular SDK to load the documents from that huge file? You could use a streaming JSON parser (which does not load the full file into memory to parse it) and then upsert all the docs.
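
A rough sketch of that loop with the Ruby SDK (the connection details and the "doc-#{i}" key pattern are placeholders; for a file too large for memory you would swap JSON.load for a streaming parser such as yajl-ruby):

require "json"
require "couchbase"

cluster = Couchbase::Cluster.connect("couchbase://localhost", "Administrator", "password")
collection = cluster.bucket("x").default_collection

file = "tmp.json"
# Same trick as earlier in the thread: wrapping the comma-separated documents
# in brackets turns the file into one JSON array. For a truly huge file,
# replace this with a streaming parser so the whole file never sits in memory.
docs = JSON.load("[" + File.read(file) + "]")

docs.each_with_index do |doc, i|
  collection.upsert("doc-#{i}", doc) # key pattern is just an example
end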