Is there any sample code to split a json file into smaller chunks?
text processing - Split a JSON array into multiple files - Unix & Linux Stack Exchange
python - Split a large json file into multiple smaller files - Stack Overflow
posix - Split JSON array into separate files/objects - Stack Overflow
From this SO thread:
jq -cr 'keys[] as $k | "\($k)\n\(.[$k])"' input.json | while read -r key; do
  fname=$(jq --raw-output ".[$key].billingAccountNumber" input.json)
  read -r item
  printf '%s\n' "$item" > "./$fname"
done
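For comparison, the same object-to-files split can be sketched in plain Python. This is a sketch only: the inline sample data and the ".json" suffix on the output names are assumptions (the jq pipeline above writes bare billingAccountNumber file names).

```python
import json

# Inline stand-in for input.json (assumed shape: a top-level object
# whose values each carry a billingAccountNumber field)
sample = {
    "first": {"billingAccountNumber": "12345", "amount": 10},
    "second": {"billingAccountNumber": "67890", "amount": 20},
}
with open("input.json", "w") as f:
    json.dump(sample, f)

with open("input.json") as f:
    data = json.load(f)

# Write each value to a file named after its billingAccountNumber
for item in data.values():
    with open(str(item["billingAccountNumber"]) + ".json", "w") as out:
        json.dump(item, out)
```

Unlike the shell version, this reads the input only once instead of re-running jq per key.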
Give this code a try; it saves each array element as 0.json, 1.json, 2.json, and so on, and works for any number of items in the JSON array.
for i in $(seq 1 $(jq '. | length' sample.json)); do
  j=$(expr $i - 1)
  jq ".[$j]" sample.json > "$j.json"
done
Explanation:
The line below finds the length of the array, which drives the numbering of the output files:
$(jq '. | length' sample.json)
Since array indices start at 0, let's adjust the output file name:
j=$(expr $i - 1)
The line below fetches one element from the JSON document and saves it to a file:
jq ".[$j]" sample.json > "$j.json"
To save the elements literally as x.json, y.json and z.json instead:
count=0
for i in x y z; do
  jq ".[$count]" sample.json > "$i.json"
  count=$(expr $count + 1)
done
Use this at the Linux command prompt:
split -b 53750k <your-file>
cat xa* > <your-file>
Note that split -b cuts at arbitrary byte boundaries, so the individual pieces are generally not valid JSON on their own; the cat line reassembles the original file.
Refer to this link: https://askubuntu.com/questions/28847/text-editor-to-edit-large-4-3-gb-plain-text-file
Whether Python or Node is better for this task is a matter of opinion, and we are not allowed to voice opinions on Stack Overflow. You have to decide for yourself which one you have more experience with and would rather work in.
If you go with Node, there are modules that do streaming JSON parsing and can help with this task, e.g.:
- https://www.npmjs.com/package/JSONStream
- https://www.npmjs.com/package/stream-json
- https://www.npmjs.com/package/json-stream
If you go with Python, there are streaming JSON parsers here as well:
- https://github.com/kashifrazzaqui/json-streamer
- https://github.com/danielyule/naya
- http://www.enricozini.org/blog/2011/tips/python-stream-json/
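Each of the libraries above has its own API; as a library-free illustration of the streaming idea, here is a sketch using only the stdlib json.JSONDecoder. It assumes the input is a stream of concatenated or newline-delimited JSON values (not one giant wrapping array), and the function name and demo file are invented for this example:

```python
import json

def iter_json_values(path, bufsize=65536):
    """Yield top-level JSON values from a file one at a time,
    decoding from a rolling buffer instead of loading everything
    into a single Python object first."""
    decoder = json.JSONDecoder()
    buf = ""
    with open(path) as f:
        while True:
            chunk = f.read(bufsize)
            buf += chunk
            while True:
                buf = buf.lstrip()
                if not buf:
                    break
                try:
                    obj, end = decoder.raw_decode(buf)
                except ValueError:
                    break  # incomplete value; wait for more data
                yield obj
                buf = buf[end:]
            if not chunk:
                break

# Demo: two newline-delimited records
with open("records.json", "w") as f:
    f.write('{"a": 1}\n{"a": 2}\n')

print([v["a"] for v in iter_json_values("records.json")])  # -> [1, 2]
```

Memory use is bounded by the largest single value plus the buffer, which is the property that matters for multi-gigabyte inputs.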
To split a JSON file with many records into chunks of a desired size, I simply use:
jq -c '.[0:1000]' mybig.json
which works like Python slicing.
See the docs here: https://stedolan.github.io/jq/manual/
Array/String Slice: .[10:15]
The .[10:15] syntax can be used to return a subarray of an array or substring of a string. The array returned by .[10:15] will be of length 5, containing the elements from index 10 (inclusive) to index 15 (exclusive). Either index may be negative (in which case it counts backwards from the end of the array), or omitted (in which case it refers to the start or end of the array).
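The same slicing idea turns into a complete chunker in a few lines of Python. A sketch: the 2500-record inline sample, the 1000-record chunk size, and the mybig_N.json naming are all illustrative choices.

```python
import json

# Inline stand-in for mybig.json: a top-level array of 2500 records
with open("mybig.json", "w") as f:
    json.dump([{"n": i} for i in range(2500)], f)

with open("mybig.json") as f:
    data = json.load(f)

chunk_size = 1000
# Like jq's .[0:1000], .[1000:2000], ...: one output file per slice
for start in range(0, len(data), chunk_size):
    with open(f"mybig_{start // chunk_size}.json", "w") as out:
        json.dump(data[start:start + chunk_size], out)
```

This produces mybig_0.json and mybig_1.json with 1000 records each and mybig_2.json with the remaining 500, since Python slices, like jq's, are clamped at the end of the array.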
Using jq, one can split an array into its components using the filter:
.[]
The question then becomes what is to be done with each component. If you want to direct each component to a separate file, you could (for example) use jq with the -c option, and filter the result into awk, which can then allocate the components to different files. See e.g. Split JSON File Objects Into Multiple Files
Performance considerations
One might think that the overhead of calling jq+awk would be high compared to calling python, but both jq and awk are lightweight compared to python+json, as suggested by these timings (using Python 2.7.10):
time (jq -c '.[]' input.json | awk '{print > "doc00" NR ".json";}')
user 0m0.005s
sys 0m0.008s
time python split.py
user 0m0.016s
sys 0m0.046s
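The split.py used in the timing above isn't shown; a minimal stand-in that matches the jq+awk pipeline's doc00N.json naming might look like the following (the inline sample data is an assumption added so the snippet runs on its own):

```python
import json

# Inline stand-in for input.json: a small top-level array
with open("input.json", "w") as f:
    json.dump([{"k": "a"}, {"k": "b"}], f)

with open("input.json") as f:
    docs = json.load(f)

# Mirror awk's NR-based naming: doc001.json, doc002.json, ...
for nr, doc in enumerate(docs, start=1):
    with open(f"doc00{nr}.json", "w") as out:
        json.dump(doc, out)
```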
I haven't had much to do with json files so far, but now I need them for datahoarding. Now I have a 150GB json file here and can't open it because there's not enough free HDD & RAM on my laptop and computers.
So I have to split the file into several pieces (preferably 1 GB each) and open and view them one after the other. How can I do this on Windows?
Google spits out ancient results (and mostly for Linux) which, as usual, have contradictory information.
Maybe it is possible with a Python script. I don't care if the first and last lines of each new file look slightly different from the original JSON file.
Here is a Python solution to your problem.
Don't forget to change in_file_path to the location of your big JSON file.
import json

in_file_path = 'path/to/file.json'  # Change me!

with open(in_file_path, 'r') as in_json_file:
    # Read the file and parse it into a list of objects
    json_obj_list = json.load(in_json_file)

for json_obj in json_obj_list:
    filename = json_obj['_id'] + '.json'
    with open(filename, 'w') as out_json_file:
        # Save each object to its own file,
        # pretty-printed thanks to `indent=4`
        json.dump(json_obj, out_json_file, indent=4)
Side note: I ran this in Python 3; it should work in Python 2 as well.
I ran into this problem today as well, and did some research. Just want to share the resulting Python snippet that lets you also customise the length of split files (thanks to this slicing method).
import os
import json
from itertools import islice
def split_json(
    data_path,
    file_name,
    size_split=1000,
):
    """Split a big JSON file (a top-level object) into chunks.

    data_path : str, e.g. "data_folder"
    file_name : str, e.g. "data_file" (exclude ".json")
    """
    with open(os.path.join(data_path, file_name + ".json"), "r") as f:
        whole_file = json.load(f)
    split = len(whole_file) // size_split
    for i in range(split + 1):
        with open(os.path.join(data_path, file_name + "_" + str(split + 1) + "_" + str(i + 1) + ".json"), "w") as f:
            json.dump(dict(islice(whole_file.items(), i * size_split, (i + 1) * size_split)), f)
    return
Update: Then, when you need to combine them together again, use the following code:
json_all = dict()
split = 4  # the actual number of split files (1-based)
for i in range(1, split + 1):
    with open(os.path.join("data_folder", "data_file_" + str(split) + "_" + str(i) + ".json"), 'r') as f:
        json_i = json.load(f)
    json_all.update(json_i)
Hi!
I have a huge JSON file containing company data that I want to split into several smaller files based on their companyId. The JSON file looks like this:
[
    {
        "companyId": "123456789",
        "name": "Foobar Ltd.",
        // more company data
    },
    // etc.
]
Ideally, I want to split this based on the X first characters of companyId, so that I end up with companies that share the first part of their companyId in separate smaller files;
companyId 123456789 => 1234.json
companyId 234567890 => 2345.json
// etc.
I could write a Perl script to do this for me, but I was wondering if it's at all possible to do it with a one-liner without too much "outside of bash", if that makes sense, at least without having to rely on Perl, Python etc. The only progress I have made so far is this:
cat huge.json | jq '.[]' | jq '.companyId'
...which outputs the companyId, and I could probably get the X first characters from that, but where is the rest of the JSON record?
Thanks in advance!
EDIT: Specified that I don't want to use Perl (or similar tools), because I want to do this as "minimal" as possible.
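For what it's worth, if dropping down to a scripting language is acceptable after all, the prefix-based grouping described above is a short Python script. A sketch, assuming the array shape shown in the question and a 4-character prefix; the inline sample data is invented so the snippet is self-contained:

```python
import json
from collections import defaultdict

# Inline stand-in for huge.json: a top-level array of company objects
companies = [
    {"companyId": "123456789", "name": "Foobar Ltd."},
    {"companyId": "123499999", "name": "Foobaz Ltd."},
    {"companyId": "234567890", "name": "Quux Inc."},
]
with open("huge.json", "w") as f:
    json.dump(companies, f)

with open("huge.json") as f:
    data = json.load(f)

# Group records by the first 4 characters of companyId
groups = defaultdict(list)
for company in data:
    groups[company["companyId"][:4]].append(company)

# One file per prefix: 1234.json, 2345.json, ...
for prefix, records in groups.items():
    with open(prefix + ".json", "w") as out:
        json.dump(records, out)
```

Grouping all records for a prefix into one array per file keeps each output valid JSON, which a purely line-oriented shell split would not guarantee.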