Answer from kmsquire on Stack Overflow
In versions of Pandas >= 0.19.0, DataFrame.to_json has a parameter, lines, that will write out JSONL format.
Given that, a more succinct version of your solution might look like this:
import pandas as pd
data = [{'label': 'DRUG', 'pattern': 'aspirin'},
        {'label': 'DRUG', 'pattern': 'trazodone'},
        {'label': 'DRUG', 'pattern': 'citalopram'}]
df = pd.DataFrame(data)
# Wrap pattern column in a dictionary
df["pattern"] = df.pattern.apply(lambda x: {"lower": x})
# Output in JSONL format
print(df.to_json(orient='records', lines=True))
Output:
{"label":"DRUG","pattern":{"lower":"aspirin"}}
{"label":"DRUG","pattern":{"lower":"trazodone"}}
{"label":"DRUG","pattern":{"lower":"citalopram"}}
Very short code that should work for easy copying and pasting.
output_path = "/data/meow/my_output.jsonl"
with open(output_path, "w") as f:
    f.write(df_result.to_json(orient='records', lines=True, force_ascii=False))
If you are using a Jupyter notebook, you should use with open(output_path, "w") as f instead of f = open(output_path, "w") to make sure the file is saved (correctly closed) and ready to read in the next cell.
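A self-contained round trip of the pattern above. Here df_result is a hypothetical stand-in frame, and a local filename replaces the site-specific /data/meow/ path:

```python
import pandas as pd

# Hypothetical stand-in for df_result
df_result = pd.DataFrame({"name": ["meow", "purr"], "count": [1, 2]})

output_path = "my_output.jsonl"
with open(output_path, "w") as f:
    f.write(df_result.to_json(orient="records", lines=True, force_ascii=False))

# Reading the file back (as you would in a later cell) confirms it was
# flushed and closed correctly
df_back = pd.read_json(output_path, lines=True)
print(df_back)
```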
The pd.read_json() function in the pandas library reads JSON data into a DataFrame. When reading a JSON Lines (JSONL) file, pass lines=True so that each line in the file is parsed as a separate JSON object.
df = pd.read_json("test.jsonl", lines=True)
If the file is large, you can also pass chunksize to process it in chunks.
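A minimal sketch of chunked reading, using a small hypothetical JSONL file written on the spot:

```python
import json
import pandas as pd

# Build a small JSONL file to read back (hypothetical data)
records = [{"label": "DRUG", "pattern": "aspirin"},
           {"label": "DRUG", "pattern": "trazodone"},
           {"label": "DRUG", "pattern": "citalopram"}]
with open("test.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# With chunksize, read_json returns an iterator of DataFrames rather than
# loading the whole file at once
total = 0
with pd.read_json("test.jsonl", lines=True, chunksize=2) as reader:
    for chunk in reader:
        total += len(chunk)
print(total)  # 3
```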
This Medium article provides a fairly simple answer, which can be adapted to be even shorter. All you need to do is read each line, then parse it with json.loads(), like this:
import json
import pandas as pd
with open('test.jsonl') as f:
    lines = f.read().splitlines()
line_dicts = [json.loads(line) for line in lines]
df_final = pd.DataFrame(line_dicts)
print(df_final)
As cgobat pointed out in a comment, the medium article adds a few extra unnecessary steps, which have been optimized in this answer.
To create newline-delimited json from a dataframe df, run the following
df.to_json("path/to/filename.json",
           orient="records",
           lines=True)
Pay close attention to those optional keyword args! The lines option was added in pandas 0.19.0.
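A hedged sketch of guarding that keyword behind a version check, with an arbitrary filename and a manual fallback mirroring the row-by-row approach shown elsewhere on this page:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# lines=True only exists in pandas 0.19.0 and later, so check the version first
major, minor = (int(part) for part in pd.__version__.split(".")[:2])
if (major, minor) >= (0, 19):
    df.to_json("filename.jsonl", orient="records", lines=True)
else:
    # Manual fallback for older pandas: one JSON object per line
    with open("filename.jsonl", "w") as f:
        for _, row in df.iterrows():
            f.write(row.to_json() + "\n")
```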
You can pass a buffer in to df.to_json():
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"a":[1,3,5], "b":[1.1,1.2,1.2]})
In [3]: df
Out[3]:
a b
0 1 1.1
1 3 1.2
2 5 1.2
In [4]: f = open("temp.txt", "w")
In [5]: for row in df.iterrows():
   ...:     row[1].to_json(f)
   ...:     f.write("\n")
   ...:
In [6]: f.close()
In [7]: open("temp.txt").read()
Out[7]: '{"a":1.0,"b":1.1}\n{"a":3.0,"b":1.2}\n{"a":5.0,"b":1.2}\n'
In newer versions of pandas (0.20.0+, I believe), this can be done directly:
df.to_json('temp.json', orient='records', lines=True)
Direct compression is also possible:
df.to_json('temp.json.gz', orient='records', lines=True, compression='gzip')
The output that you get from df.to_json is a string. So, you can simply slice it according to your requirement and remove the commas from it too.
out = df.to_json(orient='records')[1:-1].replace('},{', '} {')
To write the output to a text file, you could do:
with open('file_name.txt', 'w') as f:
    f.write(out)
Note: Line separated json is now supported in read_json (since 0.19.0):
In [31]: pd.read_json('{"a":1,"b":2}\n{"a":3,"b":4}', lines=True)
Out[31]:
a b
0 1 2
1 3 4
or with a file/filepath rather than a json string:
pd.read_json(json_file, lines=True)
It's going to depend on the size of your DataFrames which is faster, but another option is to use str.join to smash your multi-line "JSON" (note: it's not valid JSON) into valid JSON and use read_json:
In [11]: '[%s]' % ','.join(test.splitlines())
Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'
For this tiny example this is slower; at around 100 rows they're similar; and there are significant gains if it's larger...
In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines()))
1000 loops, best of 3: 977 µs per loop
In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 282 µs per loop
In [23]: test_100 = '\n'.join([test] * 100)
In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines()))
1000 loops, best of 3: 1.25 ms per loop
In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 1.25 ms per loop
In [26]: test_1000 = '\n'.join([test] * 1000)
In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l)
100 loops, best of 3: 9.78 ms per loop
In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines()))
100 loops, best of 3: 3.36 ms per loop
Note: of that timing, the join itself is surprisingly fast.
If you are trying to save memory, then reading the file a line at a time will be much more memory efficient:
with open('test.json') as f:
    data = pd.DataFrame(json.loads(line) for line in f)
Also, if you import simplejson as json, the compiled C extensions included with simplejson are much faster than the pure-Python json module.
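A sketch of that drop-in swap; simplejson is a third-party package, so falling back to the stdlib json module keeps the code working when it is not installed. The parsing code is identical either way:

```python
# Prefer simplejson's compiled C extensions when available
try:
    import simplejson as json
except ImportError:
    import json

import pandas as pd

jsonl_text = '{"a":1,"b":2}\n{"a":3,"b":4}'
df = pd.DataFrame(json.loads(line) for line in jsonl_text.splitlines())
print(df.shape)  # (2, 2)
```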