You have a JSON Lines format text file. You need to parse your file line by line:
import json
data = []
with open('file') as f:
    for line in f:
        data.append(json.loads(line))
Each line contains valid JSON, but as a whole, it is not a valid JSON value as there is no top-level list or object definition.
Note that because the file contains one JSON document per line, you are saved the headache of trying to parse it all in one go or of figuring out a streaming JSON parser. You can now opt to process each line separately before moving on to the next, saving memory in the process. You probably don't want to append each result to one list and then process everything if your file is really big.
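A minimal sketch of that per-line approach (io.StringIO stands in for the open file, and the records are made up for illustration):

```python
import io
import json

# Simulate a JSON Lines file; in practice you'd use open('file.jsonl').
f = io.StringIO('{"id": 1}\n{"id": 2}\n{"id": 3}\n')

ids = []
for line in f:
    record = json.loads(line)   # parse one object per line
    ids.append(record["id"])    # process it, then let the record go

print(ids)  # [1, 2, 3]
```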
If you have a file containing individual JSON objects with delimiters in-between, use How do I use the 'json' module to read in one JSON object at a time? to parse out individual objects using a buffered method.
(The answer above is from Martijn Pieters on Stack Overflow.)
If you are using pandas and want to load the JSON file as a dataframe, you can use:
import pandas as pd
df = pd.read_json('file.json', lines=True)
And to convert it into a json array, you can use:
df.to_json('new_file.json')
You will go crazy if you try to parse a JSON file line by line. The json module has helper methods to read file objects or strings directly, i.e. the load and loads methods: load takes a file object (as shown below) for a file that contains JSON data, while loads takes a string that contains JSON data.
Option 1 (preferred):
import json
with open('test.json', 'r') as jf:
    weatherData = json.load(jf)
print(weatherData)
Option 2:
import json
with open('test.json', 'r') as jf:
    weatherData = json.loads(jf.read())
print(weatherData)
If you are looking for higher-performance JSON parsing, check out ujson.
In the first snippet, you try to parse it line by line. You should parse it all at once; the easiest way is json.load(jsonFile), which takes a file object directly. (Passing the file object to json.loads is a mistake, since loads expects a string.) So the correct way to parse it:

import json

with open('test.json', 'r') as jsonFile:
    weatherData = json.load(jsonFile)

Storing the JSON on a single line is fine, as it's more concise, but it isn't required for this to work.

In the second snippet, your problem is that you print it as a unicode string, and the u'string here' notation is Python-specific. Valid JSON uses double quotation marks.
I have been working on Project Euler problem 8 ( https://projecteuler.net/problem=8 ), which gives a 1000-digit number. I am trying to import this data into Python as easily as possible, so I copied the number into a .txt and wrapped it in double-quotes. The number is still on 20 lines.
When I try to parse the file into a Python string using json.load(), I get an error that there is an invalid control character at the end of each line. I did some research and found that sometimes converting to a raw string (starting the JSON with an r does this) will allow the number to be parsed, but then I get the error that no JSON object could be decoded. I do not fully understand the difference between json.load() and json.loads(), but I know that json.loads() also does not work; it raises an error that a string or buffer was expected.
My code to parse the string is as follows:
import json
number = json.load(open("ProjectEuler8Number.txt", "r"))
Is there any way to parse a multi-line JSON into a single-line string in Python?
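Since JSON forbids raw control characters inside strings, one fix is to strip the newlines from the file contents before parsing. A sketch (the short digit string below is a stand-in for the real 20-line, 1000-digit file contents):

```python
import json

# Stand-in for open("ProjectEuler8Number.txt").read():
# a double-quoted number split across several lines.
raw = '"7316\n7172\n5338"'

# Remove the newlines so the string becomes a single-line JSON value.
number = json.loads(raw.replace('\n', ''))
print(number)  # '731671725338'
```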
I found out about YAML, but would knowledge of reading and writing JSON transfer over fairly easily?
My dictionary keys would each be several sentences, for now. If it grows larger, I don't want it to be too hard to read when editing it manually.
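For the readability concern, note that json.dumps can pretty-print with indent, which keeps even sentence-length keys easy to edit by hand. A sketch with made-up keys:

```python
import json

settings = {
    "This key is a whole sentence describing the first setting.": True,
    "This key is another long sentence for the second setting.": 42,
}

# indent=2 puts each key on its own line, which stays readable
# when the file is edited by hand.
text = json.dumps(settings, indent=2, ensure_ascii=False)
print(text)
```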
Get rid of all of the backslashes and all of the "Pythonic" quoting in the settings file. Works fine if the file is just:
{
"user":"username",
"password":"passwd"
}
Note also that JSON strings are quoted with double quotes, not single quotes. See JSON spec here:
http://www.json.org/
>>> s = """
{
"user":"username",
"password":"passwd"
}
"""
>>> json.loads(s)
{'password': 'passwd', 'user': 'username'}
json doesn't consider \ to be a line-continuation character.
Just read each line and construct a json object at this time:
with open(file_path) as f:
    for line in f:
        j_content = json.loads(line)
This way, you load a proper, complete JSON object from each line (provided there is no \n inside a JSON value or in the middle of an object), and you avoid memory issues because each object is created only when needed.
There is also this answer:
https://stackoverflow.com/a/7795029/671543

with open(file_path) as f:
    contents = f.read()
data = [json.loads(line) for line in contents.strip().split('\n')]
Why don't you build it as a dictionary, set the variables, and then use the json library to turn it into JSON?
import json
json_serial = "123"
my_json = {
    'settings': {
        'serial': json_serial,
        'status': '2',
        'version': '3',
    },
    'config': {
        'active': '4',
        'version': '5',
    },
}
print(json.dumps(my_json))
If you absolutely insist on generating JSON with string concatenation -- and, to be clear, you absolutely shouldn't -- the only way to be entirely certain that your output is valid JSON is to generate the substrings being substituted with a JSON generator. That is:
'''"settings" : {
"serial" : {serial},
"version" : {version}
}'''.format(serial=json.dumps("5"), version=json.dumps(1))
But don't. Really, really don't. The answer by @davidejones is the Right Thing for this scenario.
Note: Line separated json is now supported in read_json (since 0.19.0):
In [31]: pd.read_json('{"a":1,"b":2}\n{"a":3,"b":4}', lines=True)
Out[31]:
a b
0 1 2
1 3 4
or with a file/filepath rather than a json string:
pd.read_json(json_file, lines=True)
It's going to depend on the size of your DataFrames which is faster, but another option is to use str.join to smash your multi-line "JSON" (note: it's not valid JSON) into valid JSON, and then use read_json:
In [11]: '[%s]' % ','.join(test.splitlines())
Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'
For this tiny example this is slower; at around 100 rows it's similar, with significant gains if it's larger...
In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines()))
1000 loops, best of 3: 977 µs per loop
In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 282 µs per loop
In [23]: test_100 = '\n'.join([test] * 100)
In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines()))
1000 loops, best of 3: 1.25 ms per loop
In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 1.25 ms per loop
In [26]: test_1000 = '\n'.join([test] * 1000)
In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l)
100 loops, best of 3: 9.78 ms per loop
In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines()))
100 loops, best of 3: 3.36 ms per loop
Note: of that time, the join is surprisingly fast.
If you are trying to save memory, then reading the file a line at a time will be much more memory efficient:
with open('test.json') as f:
    data = pd.DataFrame(json.loads(line) for line in f)
Also, if you import simplejson as json, the compiled C extensions included with simplejson are much faster than the pure-Python json module.
Hi all,
I am working on a project where I have text data stored in a massive (30.6G) json lines file. While I do have 32G of RAM, I would obviously like to avoid loading the entire file into memory.
What is the best way to go about loading a json file like this in without hogging memory?
There are several problems with the logic of your code.
ss = s.read()
reads the entire file s into a single string. The next line
for line in ss:
iterates over each character in that string, one by one. So on each loop line is a single character. In
line = ss[7:]
you are getting the entire file contents apart from the first 7 characters (in positions 0 through 6, inclusive) and replacing the previous content of line with that. And then
T.append(json.loads(line))
attempts to convert that to JSON and store the resulting object into the T list.
Here's some code that does what you want. We don't need to read the entire file into a string with .read, or into a list of lines with .readlines, we can simply put the file handle into a for loop and that will iterate over the file line by line.
We use a with statement to open the file, so that it will get closed automatically when we exit the with block, or if there's an IO error.
import json
table = []
with open('simple.json', 'r') as f:
    for line in f:
        table.append(json.loads(line[7:]))

for row in table:
    print(row)
output
{'color': '33ef', 'age': '55', 'gender': 'm'}
{'color': '3444', 'age': '56', 'gender': 'f'}
{'color': '3999', 'age': '70', 'gender': 'm'}
We can make this more compact by building the table list in a list comprehension:
import json
with open('simple.json', 'r') as f:
    table = [json.loads(line[7:]) for line in f]

for row in table:
    print(row)
If you use Pandas you can simply write
df = pd.read_json(f, lines=True)
As per the docs, with lines=True:
Read the file as a json object per line.
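For a file too large for memory, recent pandas versions also accept chunksize together with lines=True, yielding DataFrames piece by piece. A sketch (StringIO stands in for a real file path, and the chunk size of 2 is arbitrary):

```python
import io
import pandas as pd

# StringIO stands in for a large JSON Lines file on disk.
jsonl = io.StringIO('{"a":1,"b":2}\n{"a":3,"b":4}\n{"a":5,"b":6}\n')

reader = pd.read_json(jsonl, lines=True, chunksize=2)  # yields DataFrames
total_rows = 0
for chunk in reader:          # each chunk holds at most 2 rows here
    total_rows += len(chunk)

print(total_rows)  # 3
```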
If your data is exactly in that format, we can edit it into valid JSON.
import json
source = '''\
{
"A":0,
"B":2
}{
"A":3,
"B":4
}{
"C":5,
"D":6
}
'''
fixed = '[' + source.replace('}{', '},{') + ']'
lst = json.loads(fixed)
print(lst)
output
[{'A': 0, 'B': 2}, {'A': 3, 'B': 4}, {'C': 5, 'D': 6}]
This relies on each record being separated by '}{'. If that's not the case, we can use regex to do the search & replace operation.
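A sketch of that regex variant (with the caveat that this simple pattern would also match a literal '}{' inside a string value, so it assumes none appears there):

```python
import json
import re

# Concatenated objects, possibly separated by whitespace or newlines.
source = '{"A":0,\n"B":2}  {"A":3}\n{"C":5}'

# Insert commas between back-to-back objects, however they are spaced.
fixed = '[' + re.sub(r'\}\s*\{', '},{', source) + ']'
records = json.loads(fixed)
print(records)  # [{'A': 0, 'B': 2}, {'A': 3}, {'C': 5}]
```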
Add [ and ] around your input and try this:
import json
with open('data.json') as data_file:
    data = json.load(data_file)

print(data)
This code prints the following:
[{'A': 0, 'B': 2}, {'A': 3, 'B': 4}]
when I put this data into the file:
[
{
"A":0,
"B":2
},{
"A":3,
"B":4
}
]
If you can't edit the file data.json, you can read the string from the file, add [ and ] around it, and call json.loads().
Update: Oh, I see that I also added a comma separator between the JSON objects, which the initial input doesn't have, so my code doesn't work for it as-is. But maybe it is better to modify the generator of this file to add the comma separator?
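For the original comma-less input, the standard library's json.JSONDecoder.raw_decode offers an alternative that needs no text surgery at all: it parses one value and reports the index where it stopped, so concatenated objects can be consumed one at a time. A sketch:

```python
import json

source = '{"A":0,"B":2}{"A":3,"B":4}{"C":5,"D":6}'

decoder = json.JSONDecoder()
objects = []
pos = 0
while pos < len(source):
    # Parse one value starting at pos; get the index just past it.
    obj, pos = decoder.raw_decode(source, pos)
    objects.append(obj)
    # Skip any whitespace between objects.
    while pos < len(source) and source[pos].isspace():
        pos += 1

print(objects)  # [{'A': 0, 'B': 2}, {'A': 3, 'B': 4}, {'C': 5, 'D': 6}]
```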