Here is a Python solution to your problem.
Don't forget to change in_file_path to the location of your big JSON file.
import json

in_file_path = 'path/to/file.json'  # Change me!

with open(in_file_path, 'r') as in_json_file:
    # Read the file and convert it to a list of objects
    json_obj_list = json.load(in_json_file)

for json_obj in json_obj_list:
    filename = json_obj['_id'] + '.json'
    with open(filename, 'w') as out_json_file:
        # Save each obj to its respective file,
        # pretty-printed thanks to `indent=4`
        json.dump(json_obj, out_json_file, indent=4)
Side note: I ran this in Python 3; it should work in Python 2 as well.
Answer from Stefan Collier on Stack Overflow.
I ran into this problem today as well and did some research. I just want to share the resulting Python snippet, which also lets you customise the length of the split files (thanks to this slicing method).
import os
import json
from itertools import islice

def split_json(data_path, file_name, size_split=1000):
    """Split a big JSON file into chunks.

    data_path  : str, e.g. "data_folder"
    file_name  : str, e.g. "data_file" (exclude ".json")
    size_split : int, number of items per chunk
    """
    with open(os.path.join(data_path, file_name + ".json"), "r") as f:
        whole_file = json.load(f)
    split = len(whole_file) // size_split
    for i in range(split + 1):
        out_name = file_name + "_" + str(split + 1) + "_" + str(i + 1) + ".json"
        with open(os.path.join(data_path, out_name), "w") as f:
            json.dump(dict(islice(whole_file.items(), i * size_split, (i + 1) * size_split)), f)
Update: Then, when you need to combine them together again, use the following code:
json_all = dict()
split = 4  # this is the 1-based actual number of splits
for i in range(1, split + 1):
    with open(os.path.join("data_folder", "data_file_" + str(split) + "_" + str(i) + ".json"), 'r') as f:
        json_i = json.load(f)
    json_all.update(json_i)
Use these commands at the Linux command prompt. To split the file into fixed-size pieces (named xaa, xab, and so on):

split -b 53750k <your-file>

To join the pieces back together later:

cat xa* > <your-file>
Refer to this link: https://askubuntu.com/questions/28847/text-editor-to-edit-large-4-3-gb-plain-text-file
Whether Python or Node will be better for the task is a matter of opinion, and we are not allowed to voice opinions on Stack Overflow. You have to decide for yourself which you have more experience with and which you want to work with: Python or Node.
If you go with Node, there are modules that do streaming JSON parsing and can help with the task, e.g.:
- https://www.npmjs.com/package/JSONStream
- https://www.npmjs.com/package/stream-json
- https://www.npmjs.com/package/json-stream
If you go with Python, there are streaming JSON parsers here as well:
- https://github.com/kashifrazzaqui/json-streamer
- https://github.com/danielyule/naya
- http://www.enricozini.org/blog/2011/tips/python-stream-json/
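As a stdlib-only alternative to the streaming libraries above, a sketch of the same idea using json.JSONDecoder.raw_decode, which parses one element of a top-level array at a time from a growing buffer (the function name and buffering scheme are made up for this illustration):

```python
import json

def iter_json_array(path, chunk_size=65536):
    # Stream top-level elements out of a JSON array file one at a time,
    # so the whole array never has to fit in memory at once.
    # Caveat: bare numbers split across chunk boundaries could be
    # mis-parsed; this is safe when the elements are objects or strings.
    decoder = json.JSONDecoder()
    with open(path, "r") as f:
        buf = ""
        # Read until we see the opening '[' of the array
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            buf = (buf + chunk).lstrip()
            if buf:
                break
        if buf[0] != "[":
            raise ValueError("expected a top-level JSON array")
        buf = buf[1:]
        while True:
            buf = buf.lstrip().lstrip(",").lstrip()
            if buf.startswith("]"):
                return
            try:
                obj, end = decoder.raw_decode(buf)
            except ValueError:
                # Incomplete element: pull in more of the file and retry
                chunk = f.read(chunk_size)
                if not chunk:
                    return
                buf += chunk
                continue
            yield obj
            buf = buf[end:]
```

Each yielded object can then be dumped to its own file without ever holding the full array in memory.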
I haven't had much to do with JSON files so far, but now I need them for data hoarding. I have a 150 GB JSON file here and can't open it because there's not enough free disk space and RAM on my laptop or other computers.
So I have to split the file into several pieces (preferably 1 GB files) and open and view them one after the other. How can I do this on Windows?
Google spits out ancient results (mostly for Linux) which, as usual, contain contradictory information.
Maybe it is possible with a Python script. I don't care if the first and last lines of the new files look slightly different from the original JSON file.
I won't teach you how to do file I/O and will assume you can do that yourself.
Once you have loaded the original file as a dict with the json module, do:
>>> org = {"one": "Some data", "two": "Some data"}
>>> dicts = [{k: v} for k, v in org.items()]
>>> dicts
[{'one': 'Some data'}, {'two': 'Some data'}]
which will give you a list of dictionaries that you can dump to a file (or separate files named after the keys), if you wish.
After loading the JSON file you can treat it as a dictionary in Python, then save the contents to separate files by looping through it as you would with a normal Python dictionary. Here is an example related to what you want to achieve:
import json

Data = {"one": "Some data", "two": "Some data"}
for key in Data:
    name = key + '.json'
    with open(name, 'w') as out_file:
        # json.dump handles quoting/escaping correctly,
        # unlike hand-built string formatting
        json.dump({key: Data[key]}, out_file)
To split a JSON file with many records into chunks of a desired size, I simply use:
jq -c '.[0:1000]' mybig.json
which works like python slicing.
See the docs here: https://stedolan.github.io/jq/manual/
Array/String Slice: .[10:15]
The .[10:15] syntax can be used to return a subarray of an array or substring of a string. The array returned by .[10:15] will be of length 5, containing the elements from index 10 (inclusive) to index 15 (exclusive). Either index may be negative (in which case it counts backwards from the end of the array), or omitted (in which case it refers to the start or end of the array).
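jq's slice can be mirrored in plain Python when jq is not available; a minimal sketch that writes successive fixed-size slices of a top-level array to numbered files (the function name and output naming are made up, and the array is loaded fully into memory):

```python
import json

def dump_chunks(in_path, prefix, size=1000):
    # Mirror jq's .[start:stop]: write successive `size`-element slices
    # of a top-level JSON array to prefix_0.json, prefix_1.json, ...
    with open(in_path) as f:
        data = json.load(f)
    for i in range(0, len(data), size):
        with open("{}_{}.json".format(prefix, i // size), "w") as f:
            json.dump(data[i:i + size], f)
```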
Using jq, one can split an array into its components using the filter:
.[]
The question then becomes what is to be done with each component. If you want to direct each component to a separate file, you could (for example) use jq with the -c option, and filter the result into awk, which can then allocate the components to different files. See e.g. Split JSON File Objects Into Multiple Files
Performance considerations
One might think that the overhead of calling jq+awk would be high compared to calling Python, but both jq and awk are lightweight compared to python+json, as these timings suggest (using Python 2.7.10):
time (jq -c .[] input.json | awk '{print > "doc00" NR ".json";}')
user 0m0.005s
sys 0m0.008s
time python split.py
user 0m0.016s
sys 0m0.046s
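The split.py used in the comparison above is not shown; a minimal equivalent that matches the awk naming scheme ("doc00" NR ".json", one file per element) might look like this (the function name is made up):

```python
import json

def split_per_doc(in_path):
    # One output file per top-level array element, numbered from 1
    # to match awk's NR-based names: doc001.json, doc002.json, ...
    with open(in_path) as f:
        docs = json.load(f)
    for i, doc in enumerate(docs, start=1):
        with open("doc00%d.json" % i, "w") as f:
            json.dump(doc, f)
```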
I can't figure out how to do that. I've already tried multiple solutions and googled. Does anyone have a script for doing that? Thanks!
You already have a list; the commas are put there by Python to delimit the values only when printing the list.
Just access element 2 directly:
print(ting[2])
This prints:
[1379962800000, 125.539504822835]
Each of the entries in item['values'] (so ting) is a list of two float values, so you can address each of those with index 0 and 1:
>>> print(ting[2][0])
1379962800000
>>> print(ting[2][1])
125.539504822835
To get a list of all the second values, you could use a list comprehension:
second_vals = [t[1] for t in ting]
When you load the data with json.loads, it is already parsed into a real list that you can slice and index as normal. If you want the data starting with the third element, just use ting[2:]. (If you just want the third element by itself, just use ting[2].)
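A minimal sketch of the idea (the data here is made up to stand in for ting):

```python
import json

# json.loads returns a real Python list: index and slice it as normal
raw = '[[1, 1.5], [2, 2.5], [3, 3.5]]'
ting = json.loads(raw)
third = ting[2]                       # third element: [3, 3.5]
from_third = ting[2:]                 # everything from the third element on
second_vals = [t[1] for t in ting]    # all second values: [1.5, 2.5, 3.5]
```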
Hi!
I have a huge JSON file containing company data that I want to split into several smaller files based on their companyId. The JSON file looks like this:
[
  {
    "companyId": "123456789",
    "name": "Foobar Ltd.",
    // more company data
  },
  // etc.
]
Ideally, I want to split this based on the first X characters of companyId, so that companies sharing the first part of their companyId end up together in separate smaller files:

companyId 123456789 => 1234.json
companyId 234567890 => 2345.json
// etc.
I could write a Perl script to do this for me, but I was wondering if it's at all possible with a one-liner, without too much "outside of bash" if that makes sense, or at least without having to rely on Perl, Python, etc. The only progress I have made so far is this:
cat huge.json | jq '.[]' | jq '.companyId'
...which outputs the companyId, and I could probably get the X first characters from that, but where is the rest of the JSON record?
Thanks in advance!
EDIT: Specified that I don't want to use Perl (or similar tools), because I want to do this as "minimal" as possible.
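If stepping outside pure bash turns out to be acceptable after all, the grouping is short in Python; a sketch under the assumption that the array fits in memory (the function name and output naming follow the 1234.json scheme from the question):

```python
import json
from collections import defaultdict

def split_by_prefix(in_path, prefix_len=4):
    # Group companies by the first prefix_len characters of companyId
    # and write each group to <prefix>.json
    with open(in_path) as f:
        companies = json.load(f)
    groups = defaultdict(list)
    for company in companies:
        groups[company["companyId"][:prefix_len]].append(company)
    for prefix, group in groups.items():
        with open(prefix + ".json", "w") as f:
            json.dump(group, f)
```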
Use an iteration grouper; the itertools module's recipes list includes the following (shown here for Python 3, where izip_longest is named zip_longest):

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
This lets you iterate over your tweets in groups of 5000:
for i, group in enumerate(grouper(input_tweets, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)

Note that the last batch will be padded with fillvalue (None) entries if the total number of tweets isn't an exact multiple of 5000.
I think your first thought is good. Just iterate over all the tweets you have, save them in a temporary array, and keep an index that you increment by one for every tweet. Whenever the current index modulo 5000 equals 0, call a method that converts the tweets to string format and saves them to a file with the index in the filename. When you reach the end of the tweets, do the same with the final remainder.
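The approach described above can be sketched like this (the function name, batch size, and file naming are made up for this illustration):

```python
import json

def dump_tweet_batches(tweets, batch_size=5000, prefix="outputbatch"):
    # Collect tweets in a temporary list, flush to a numbered file
    # every batch_size tweets, and flush the remainder at the end.
    batch, file_index = [], 0
    for tweet in tweets:
        batch.append(tweet)
        if len(batch) == batch_size:
            with open("{}_{}.json".format(prefix, file_index), "w") as f:
                json.dump(batch, f)
            batch, file_index = [], file_index + 1
    if batch:
        with open("{}_{}.json".format(prefix, file_index), "w") as f:
            json.dump(batch, f)
```

Unlike the grouper recipe, this never pads the last batch with None entries.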