You have a JSON Lines format text file. You need to parse your file line by line:

import json

data = []
with open('file') as f:
    for line in f:
        data.append(json.loads(line))

Each line contains valid JSON, but as a whole, it is not a valid JSON value as there is no top-level list or object definition.
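This failure mode is easy to demonstrate with a small sketch (the two-line string below stands in for the file's contents): calling json.loads on the whole text raises the "Extra data" error, while per-line parsing succeeds.

```python
import json

# Two JSON values separated by a newline, as in a JSON Lines file.
text = '{"a": 1}\n{"a": 2}\n'

# Parsing the whole thing at once fails: the parser stops after the
# first value and complains about the rest.
try:
    json.loads(text)
except json.JSONDecodeError as e:
    print(e.msg)  # Extra data

# Parsing line by line works, because each line is a complete value.
rows = [json.loads(line) for line in text.splitlines()]
print(rows)  # -> [{'a': 1}, {'a': 2}]
```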

Note that because the file contains JSON per line, you are saved the headaches of trying to parse it all in one go or to figure out a streaming JSON parser. You can now opt to process each line separately before moving on to the next, saving memory in the process. You probably don't want to append each result to one list and then process everything if your file is really big.
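That streaming style can be sketched with a generator (the filename and the process step are placeholders, not part of the answer): each decoded object is handed off as soon as its line is read, so only one record is in memory at a time.

```python
import json

def iter_records(path):
    """Yield one decoded JSON object per line, skipping blank lines."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:                    # tolerate blank/trailing lines
                yield json.loads(line)

# Usage sketch ('process' is a placeholder for your own handling):
# for record in iter_records('file'):
#     process(record)
```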

If you have a file containing individual JSON objects with delimiters in between, see How do I use the 'json' module to read in one JSON object at a time? for parsing out the individual objects with a buffered method.
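One way such a buffered method can look (a sketch, not the linked answer's exact code) is json.JSONDecoder.raw_decode, which parses a single value out of a string and returns the index where it stopped. Skipping whitespace between values handles newline or space delimiters; other delimiters would need their own handling.

```python
import json

def iter_json_values(text):
    """Yield successive JSON values from a string of concatenated
    objects separated by whitespace (newlines, spaces, tabs)."""
    decoder = json.JSONDecoder()
    idx, end = 0, len(text)
    while idx < end:
        while idx < end and text[idx] in " \t\r\n":
            idx += 1                    # skip inter-object whitespace
        if idx >= end:
            break
        value, idx = decoder.raw_decode(text, idx)
        yield value

print(list(iter_json_values('{"a": 1}\n{"b": 2} {"c": 3}')))
# -> [{'a': 1}, {'b': 2}, {'c': 3}]
```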

Answer from Martijn Pieters on Stack Overflow
Reddit
r/learnpython: Parsing Multi-Line JSON into Single-Line Python
March 14, 2020

I have been working on Project Euler problem 8 ( https://projecteuler.net/problem=8 ), which gives a 1000-digit number. I am trying to import this data into Python as easily as possible, so I copied the number into a .txt and wrapped it in double-quotes. The number is still on 20 lines.

When I try to parse the file into a Python string using json.load(), I get an error that there is an invalid control character at the end of each line. I did some research and found that sometimes converting to a raw string (starting the JSON with an r does this) will allow the number to be parsed, but then I get the error that no JSON object could be detected. I do not fully understand the difference between json.load() and json.loads(), but I know that json.loads() also does not work, with an error that a string or buffer was expected.

My code to parse the string is as follows:

import json
number = json.load(open("ProjectEuler8Number.txt", "r"))

Is there any way to parse a multi-line JSON into a single-line string in Python?
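For what it's worth, this particular case doesn't need JSON at all. A sketch (the filename is the asker's, the helper name is made up): strip the newlines and the surrounding quotes, then concatenate the lines into one string.

```python
def join_lines(text):
    """Collapse a multi-line, quote-wrapped number into one string."""
    return "".join(line.strip() for line in text.splitlines()).strip('"')

# Usage with the asker's file:
# with open("ProjectEuler8Number.txt") as f:
#     number = join_lines(f.read())

print(join_lines('"7316\n7538\n8866"'))  # -> 731675388866
```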

Spark By {Examples}
PySpark Read Multiple Lines (multiline) JSON File
March 27, 2024

# Read a multiline JSON file
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

multiline_df = spark.read.option("multiline", "true") \
    .json("resources/multiline-zipcode.json")
multiline_df.printSchema()
multiline_df.show()
PYnative
Python Parse multiple JSON objects from file | Solve ValueError: Extra data
May 14, 2021

If your file contains a list of JSON objects and you want to decode them one at a time, read the file line by line, since each line contains valid JSON; i.e., read one JSON object ...
Medium
Multiline — a Python package for multi-line JSON values | Sanket Tantia
December 29, 2020

If we want to store multiline strings, we have to convert them manually by removing the newlines or replacing them with the \n character. Python's default json package can only parse a JSON file or string if it is valid ...
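The point about \n is easy to check against the standard json module (a small sketch): json.dumps escapes a real newline into the two-character sequence \n, while a raw newline inside a JSON string literal makes the document invalid.

```python
import json

s = "line one\nline two"

# dumps escapes the newline, giving valid single-line JSON that
# round-trips back to the original string.
encoded = json.dumps(s)
assert json.loads(encoded) == s

# A literal (unescaped) newline inside a JSON string is rejected.
try:
    json.loads('"line one\nline two"')
except json.JSONDecodeError:
    print("raw newline in a JSON string is invalid")
```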
Stack Overflow
How to read multiline json-like file with multiple JSON fragments separated by just a new line?
If you know each object sits on its own line, then loading and parsing one JSON object per line will work; otherwise you either need to fix whatever is writing the JSON, or find the boundaries between objects manually.
Top answer (119 votes):

Note: line-separated JSON is now supported in read_json (since pandas 0.19.0):

In [31]: pd.read_json('{"a":1,"b":2}\n{"a":3,"b":4}', lines=True)
Out[31]:
   a  b
0  1  2
1  3  4

or with a file/filepath rather than a json string:

pd.read_json(json_file, lines=True)

It's going to depend on the size of your DataFrames which is faster, but another option is to use str.join to smash your multi-line "JSON" (note: it's not valid JSON) into valid JSON and use read_json:

In [11]: '[%s]' % ','.join(test.splitlines())
Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'

For this tiny example the join approach is slower; at around 100 rows the two are similar, but there are significant gains for larger inputs...

In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines()))
1000 loops, best of 3: 977 µs per loop

In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 282 µs per loop

In [23]: test_100 = '\n'.join([test] * 100)

In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines()))
1000 loops, best of 3: 1.25 ms per loop

In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 1.25 ms per loop

In [26]: test_1000 = '\n'.join([test] * 1000)

In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l)
100 loops, best of 3: 9.78 ms per loop

In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines()))
100 loops, best of 3: 3.36 ms per loop

Note: within that time, the join itself is surprisingly fast.

Second answer (28 votes):

If you are trying to save memory, then reading the file a line at a time will be much more memory efficient:

with open('test.json') as f:
    data = pd.DataFrame(json.loads(line) for line in f)

Also, if you import simplejson as json, the compiled C extensions included with simplejson are much faster than the pure-Python json module.
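A common pattern for that swap (a sketch; simplejson is a third-party package, so fall back to the standard library when it isn't installed):

```python
# Prefer simplejson when available; both modules expose the same
# loads/dumps API, so the rest of the code doesn't change.
try:
    import simplejson as json
except ImportError:
    import json

print(json.loads('{"a": 1}'))  # -> {'a': 1}
```

Note that modern CPython ships a C accelerator for the stdlib json module as well, so it's worth benchmarking on your own data before depending on simplejson.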

Top answer (9 votes):

There are several problems with the logic of your code.

ss = s.read()

reads the entire file s into a single string. The next line

for line in ss:

iterates over each character in that string, one by one. So on each loop line is a single character. In

    line = ss[7:]

you are getting the entire file contents apart from the first 7 characters (in positions 0 through 6, inclusive) and replacing the previous content of line with that. And then

T.append(json.loads(line))

attempts to convert that to JSON and store the resulting object into the T list.


Here's some code that does what you want. We don't need to read the entire file into a string with .read, or into a list of lines with .readlines, we can simply put the file handle into a for loop and that will iterate over the file line by line.

We use a with statement to open the file, so that it will get closed automatically when we exit the with block, or if there's an IO error.

import json

table = []
with open('simple.json', 'r') as f:
    for line in f:
        table.append(json.loads(line[7:]))

for row in table:
    print(row)

Output:

{'color': '33ef', 'age': '55', 'gender': 'm'}
{'color': '3444', 'age': '56', 'gender': 'f'}
{'color': '3999', 'age': '70', 'gender': 'm'}

We can make this more compact by building the table list in a list comprehension:

import json

with open('simple.json', 'r') as f:
    table = [json.loads(line[7:]) for line in f]

for row in table:
    print(row)

Second answer (8 votes):

If you use pandas you can simply write df = pd.read_json(f, lines=True)

Per the documentation, lines=True means:

Read the file as a json object per line.

Databricks Documentation
JSON file | Databricks on AWS

CREATE TEMPORARY VIEW multiLineJsonTable
USING json
OPTIONS (path="/tmp/multi-line.json", multiline=true)

val mdf = spark.read.option("multiline", "true").format("json").load("/tmp/multi-line.json")
mdf.show(false)