Here is a Python solution to your problem. Don't forget to change in_file_path to the location of your big JSON file.
import json

in_file_path = 'path/to/file.json'  # Change me!

with open(in_file_path, 'r') as in_json_file:
    # Read the file and convert it to a dictionary
    json_obj_list = json.load(in_json_file)

for json_obj in json_obj_list:
    filename = json_obj['_id'] + '.json'
    with open(filename, 'w') as out_json_file:
        # Save each obj to its respective file,
        # pretty-printed thanks to `indent=4`
        json.dump(json_obj, out_json_file, indent=4)
Side note: I ran this in Python 3; it should work in Python 2 as well.
(Answer from Stefan Collier on Stack Overflow.)
I ran into this problem today as well and did some research. I just want to share the resulting Python snippet, which also lets you customise the length of the split files (thanks to this slicing method).
import os
import json
from itertools import islice

def split_json(data_path, file_name, size_split=1000):
    """Split a big JSON file into chunks.

    data_path : str, e.g. "data_folder"
    file_name : str, e.g. "data_file" (exclude ".json")
    """
    with open(os.path.join(data_path, file_name + ".json"), "r") as f:
        whole_file = json.load(f)
    split = len(whole_file) // size_split
    for i in range(split + 1):
        out_name = file_name + "_" + str(split + 1) + "_" + str(i + 1) + ".json"
        with open(os.path.join(data_path, out_name), 'w') as f:
            # Each chunk is a dict of up to `size_split` items
            json.dump(dict(islice(whole_file.items(), i * size_split, (i + 1) * size_split)), f)
Update: Then, when you need to combine them together again, use the following code:
json_all = dict()
split = 4  # the total number of split files (the writer's split + 1)
for i in range(1, split + 1):
    with open(os.path.join("data_folder", "data_file_" + str(split) + "_" + str(i) + ".json"), 'r') as f:
        json_i = json.load(f)
        json_all.update(json_i)
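A quick self-contained sanity check of the islice chunking idea used above (the sample data and the chunk size of 4 are made up for the demo):

```python
import json
from itertools import islice

# Build a sample dict standing in for the big JSON file
whole = {str(i): i * i for i in range(10)}

# Split into chunks of up to 4 items each, as dicts
size = 4
chunks = []
for i in range(len(whole) // size + 1):
    chunk = dict(islice(whole.items(), i * size, (i + 1) * size))
    if chunk:
        chunks.append(chunk)

# Recombine and verify nothing was lost
merged = {}
for c in chunks:
    merged.update(c)
assert merged == whole
print(len(chunks))  # → 3
```

Each chunk is itself a valid JSON object, so dumping and reloading the chunks individually round-trips cleanly.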
Use this command at the Linux command prompt to split the file into 53750 KB pieces:
split -b 53750k <your-file>
and to join the pieces back together:
cat xa* > <your-file>
Refer to this link: https://askubuntu.com/questions/28847/text-editor-to-edit-large-4-3-gb-plain-text-file
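As a minimal demonstration of the split/cat round trip (the file size and names here are arbitrary; `xa*` matches split's default output prefix):

```shell
# Work in a scratch directory
tmp=$(mktemp -d)

# Create a ~5000-byte stand-in for the big file
head -c 5000 /dev/zero | tr '\0' 'x' > "$tmp/big.json"

# Split into 1 KB pieces; the "x" prefix mirrors split's default (xaa, xab, ...)
split -b 1k "$tmp/big.json" "$tmp/x"

# Recombine in name order and verify the result is byte-identical
cat "$tmp"/xa* > "$tmp/rejoined.json"
cmp "$tmp/big.json" "$tmp/rejoined.json" && echo OK
```

Note that `split -b` cuts at byte boundaries, so individual pieces are generally not valid JSON on their own; only the recombined file is.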
Answering the question whether Python or Node will be better for the task would be an opinion and we are not allowed to voice our opinions on Stack Overflow. You have to decide yourself what you have more experience in and what you want to work with - Python or Node.
If you go with Node, there are some modules that can help you with that task, that do streaming JSON parsing. E.g. those modules:
- https://www.npmjs.com/package/JSONStream
- https://www.npmjs.com/package/stream-json
- https://www.npmjs.com/package/json-stream
If you go with Python, there are streaming JSON parsers here as well:
- https://github.com/kashifrazzaqui/json-streamer
- https://github.com/danielyule/naya
- http://www.enricozini.org/blog/2011/tips/python-stream-json/
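If you'd rather avoid extra dependencies, the standard library's json.JSONDecoder.raw_decode can pull elements out of a top-level JSON array one at a time. This sketch still reads the whole text into memory (a true streaming parser like the modules above avoids even that), and the helper name is my own:

```python
import json

def iter_json_array(text):
    """Yield elements of a top-level JSON array one at a time."""
    decoder = json.JSONDecoder()
    idx = text.index('[') + 1
    while True:
        # Skip whitespace and commas between elements
        while idx < len(text) and text[idx] in ' \t\r\n,':
            idx += 1
        if idx >= len(text) or text[idx] == ']':
            return
        # raw_decode returns the parsed object and the index just past it
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

print(list(iter_json_array('[{"a": 1}, {"a": 2}]')))  # → [{'a': 1}, {'a': 2}]
```

Because it yields one object at a time, you can write each object (or each batch of objects) to its own file without ever materialising the full list of parsed elements.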
I haven't had much to do with JSON files so far, but now I need them for data hoarding. I have a 150 GB JSON file here and can't open it because there's not enough free HDD space and RAM on my laptop and computers.
So I have to split the file into several pieces (ideally 1 GB files) and open and view them one after the other. How can I do this on Windows?
Google spits out ancient results (and mostly for Linux) which, as usual, contain contradictory information.
Maybe it is possible with a Python script; I don't care if the first and last lines of each new file look slightly different than in the original JSON file.
I can't figure out how to do that. I've already tried multiple solutions and googled. Does anyone have a script for doing that? Thanks!
You can try:
import json

dct = {
    "client_id": {"0": "abc123", "1": "def456"},
    "client_name": {"0": "companyA", "1": "companyB"},
    "revenue": {"0": "54,786", "1": "62,754"},
    "rate": {"0": "4", "1": "5"},
}

tmp = {}
for k, v in dct.items():
    for kk, vv in v.items():
        tmp.setdefault(kk, {}).update({k: vv})

for i, v in enumerate(tmp.values(), 1):
    with open(f"File{i}.json", "w") as f_out:
        json.dump(v, f_out, indent=4)
This creates two files File1.json, File2.json:
{
    "client_id": "abc123",
    "client_name": "companyA",
    "revenue": "54,786",
    "rate": "4"
}
and
{
    "client_id": "def456",
    "client_name": "companyB",
    "revenue": "62,754",
    "rate": "5"
}
EDIT: To create output dictionary:
dct = {
    "client_id": {"0": "abc123", "1": "def456"},
    "client_name": {"0": "companyA", "1": "companyB"},
    "revenue": {"0": "54,786", "1": "62,754"},
    "rate": {"0": "4", "1": "5"},
}

tmp = {}
for k, v in dct.items():
    for kk, vv in v.items():
        tmp.setdefault(kk, {}).update({k: vv})

out = {}
for i, v in enumerate(tmp.values(), 1):
    out[f"File{i}"] = v

print(out)
Prints:
{
    "File1": {
        "client_id": "abc123",
        "client_name": "companyA",
        "revenue": "54,786",
        "rate": "4",
    },
    "File2": {
        "client_id": "def456",
        "client_name": "companyB",
        "revenue": "62,754",
        "rate": "5",
    },
}
You can use the json package to read your JSON file and process it in a for loop:
import json

with open('json_data.json') as json_file:
    data = json.load(json_file)

new_data = {}
for key, dico in data.items():
    for num, value in dico.items():
        # Create the inner dict the first time each index ("0", "1", ...) appears
        new_data.setdefault(num, {})[key] = value
Your new_data dictionary should look like the following:
{
    "0": {
        "client_id": "abc123",
        "client_name": "companyA",
        "revenue": "54,786",
        "rate": "4"
    },
    "1": {
        "client_id": "def456",
        "client_name": "companyB",
        "revenue": "62,754",
        "rate": "5"
    }
}
Then to save the file you can do something like:
with open('json_data_0.json', 'w') as outfile:
    json.dump(new_data["0"], outfile)
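To write out every index rather than just "0", the same dump can be looped over the whole dictionary (the filenames are my assumption):

```python
import json

# Stand-in for the new_data built above
new_data = {
    "0": {"client_id": "abc123", "rate": "4"},
    "1": {"client_id": "def456", "rate": "5"},
}

# One output file per top-level key
for num, record in new_data.items():
    with open(f'json_data_{num}.json', 'w') as outfile:
        json.dump(record, outfile, indent=4)
```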
Use an iteration grouper; the itertools module recipes list includes the following (in Python 3, use zip_longest instead of izip_longest):
from itertools import izip_longest  # zip_longest in Python 3

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
This lets you iterate over your tweets in groups of 5000:
for i, group in enumerate(grouper(input_tweets, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)
Note that the final group is padded with the fillvalue (None), which you may want to filter out before dumping.
I think your first thought is good. Iterate over all the tweets, collect them in a temporary list, and keep a counter that you increment by one for each tweet. Whenever the counter modulo 5000 equals 0, call a method that serialises the collected tweets to a string and saves it to a file with the counter in the filename. When you reach the end of the tweets, do the same with the remaining batch.
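A sketch of that counter-and-modulo batching (the batch size is shrunk from 5000 to 5 for the demo, and the filenames are made up):

```python
import json

def save_batch(batch, index):
    # Write one batch of tweets as a JSON array
    with open(f'tweets_batch_{index}.json', 'w') as f:
        json.dump(batch, f)

tweets = [{"id": i} for i in range(12)]  # stand-in for the real tweets

batch, batch_index = [], 0
for i, tweet in enumerate(tweets, 1):
    batch.append(tweet)
    if i % 5 == 0:  # flush every 5 tweets (5000 in the answer above)
        save_batch(batch, batch_index)
        batch, batch_index = [], batch_index + 1
if batch:  # don't forget the final partial batch
    save_batch(batch, batch_index)
```

Unlike the grouper recipe, this never pads the last batch, so no fillvalue filtering is needed.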
To split a JSON file with many records into chunks of a desired size, I simply use:
jq -c '.[0:1000]' mybig.json
which works like Python slicing; adjust the indices to take successive chunks.
See the docs here: https://stedolan.github.io/jq/manual/
Array/String Slice: .[10:15]
The .[10:15] syntax can be used to return a subarray of an array or substring of a string. The array returned by .[10:15] will be of length 5, containing the elements from index 10 (inclusive) to index 15 (exclusive). Either index may be negative (in which case it counts backwards from the end of the array), or omitted (in which case it refers to the start or end of the array).
Using jq, one can split an array into its components using the filter:
.[]
The question then becomes what is to be done with each component. If you want to direct each component to a separate file, you could (for example) use jq with the -c option, and filter the result into awk, which can then allocate the components to different files. See e.g. Split JSON File Objects Into Multiple Files
Performance considerations
One might think that the overhead of calling jq+awk would be high compared to calling python, but both jq and awk are lightweight compared to python+json, as suggested by these timings (using Python 2.7.10):
time (jq -c .[] input.json | awk '{print > "doc00" NR ".json";}')
user    0m0.005s
sys     0m0.008s

time python split.py
user    0m0.016s
sys     0m0.046s
The original file is not valid JSON, whereas json.dump creates a file with valid JSON. My suggestion would be to convert the line items to JSON one at a time when writing to the file.
Replace this:
for i in range(total + 1):
    json.dump(ll[i * size_of_the_split:(i + 1) * size_of_the_split],
              open(json_file + "\\split50k" + str(i + 1) + ".json", 'w',
                   encoding='utf8'),
              ensure_ascii=False, indent=True)
with this:
for i in range(len(ll)):
    if i % size_of_the_split == 0:
        if i != 0:
            file.close()
        file = open(json_file + "\\split50k" + str(i + 1) + ".json", 'w', encoding='utf8')
    # json.dumps (not str) keeps each written line valid JSON
    file.write(json.dumps(ll[i], ensure_ascii=False) + '\n')
file.close()
Try using json.loads(line) when reading the file:
with open(os.path.join(json_file, 'test.json'), 'r', encoding='utf-8') as f1:
    ll = [json.loads(line) for line in f1.readlines()]
    # The rest
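Putting both fixes together, a complete sketch (the paths, the split50k naming, and the 50 000 chunk size are assumptions carried over from the snippets above):

```python
import json
import os

def split_jsonl(in_path, out_dir, size_of_the_split=50000):
    """Split a JSON Lines file into chunk files that are
    themselves valid JSON Lines (one json.dumps per record)."""
    with open(in_path, 'r', encoding='utf-8') as f1:
        ll = [json.loads(line) for line in f1]
    for i in range(0, len(ll), size_of_the_split):
        out_path = os.path.join(out_dir, 'split50k' + str(i // size_of_the_split + 1) + '.json')
        with open(out_path, 'w', encoding='utf-8') as out:
            for item in ll[i:i + size_of_the_split]:
                out.write(json.dumps(item, ensure_ascii=False) + '\n')
```

Because every output line goes through json.dumps, each chunk file can be re-read with the same json.loads-per-line loop shown above.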