You’re trying to flatten two different “depths” in the JSON file, which can’t be done in a single json_normalize call. You could simply use two pd.json_normalize calls, since all entries contain ids that let you match up the parsed data later:
>>> pd.json_normalize(d, record_path='view')
id user_id parent_id created_at updated_at rating_count rating_sum message replies
0 109205 6354 None 2020-11-03T23:32:49Z 2020-11-03T23:32:49Z None None message text1 [{'id': 109298, 'user_id': 5457, 'parent_id': ...
>>> pd.json_normalize(d, record_path=['view', 'replies'])
id user_id parent_id created_at updated_at rating_count rating_sum message
0 109298 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text2
1 109299 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text3
(I’ve added a second reply to your example, with the same data and the id incremented by 1, so we can see what happens with several replies per view.)
Alternatively, you can run your second pd.json_normalize on the replies column of your previous result, which is probably less work. This works best if you .explode() the column first to get one row per reply:
>>> pd.json_normalize(view['replies'].explode())
id user_id parent_id created_at updated_at rating_count rating_sum message
0 109298 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text2
1 109299 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text3
So here’s a way to construct a single dataframe with all the info:
>>> view = pd.json_normalize(d, record_path='view')
>>> df = pd.merge(
... view.drop(columns=['replies']),
... pd.json_normalize(view['replies'].explode()),
... left_on='id', right_on='parent_id', how='right',
... suffixes=('_view', '_reply')
... )
>>> df
id_view user_id_view parent_id_view created_at_view updated_at_view rating_count_view rating_sum_view message_view id_reply user_id_reply parent_id_reply created_at_reply updated_at_reply rating_count_reply rating_sum_reply message_reply
0 109205 6354 None 2020-11-03T23:32:49Z 2020-11-03T23:32:49Z None None message text1 109298 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text2
1 109205 6354 None 2020-11-03T23:32:49Z 2020-11-03T23:32:49Z None None message text1 109299 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text3
>>> df[['user_id_view', 'message_view', 'user_id_reply', 'message_reply']]
user_id_view message_view user_id_reply message_reply
0 6354 message text1 5457 message text2
1 6354 message text1 5457 message text3
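The whole pipeline above can be run end to end on a minimal inline version of the data. The ids and messages below are illustrative stand-ins for the question's actual structure; fields like created_at are omitted for brevity:

```python
import pandas as pd

# Minimal made-up data matching the question's structure
d = {"view": [{
    "id": 109205, "user_id": 6354, "parent_id": None, "message": "message text1",
    "replies": [
        {"id": 109298, "user_id": 5457, "parent_id": 109205, "message": "message text2"},
        {"id": 109299, "user_id": 5457, "parent_id": 109205, "message": "message text3"},
    ],
}]}

view = pd.json_normalize(d, record_path="view")
df = pd.merge(
    view.drop(columns=["replies"]),          # one row per view, minus the raw replies
    pd.json_normalize(view["replies"].explode()),  # one row per reply
    left_on="id", right_on="parent_id", how="right",
    suffixes=("_view", "_reply"),
)
```

The how='right' keeps every reply row and repeats its parent view's columns alongside it, which is exactly the join described above.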
Answer from Cimbali on Stack Overflow
You could just pass the data without any extra params (note that pd.io.json.json_normalize has been deprecated in favour of the top-level pd.json_normalize):
df = pd.json_normalize(data)
df
complete mid.c mid.h mid.l mid.o time volume
0 True 119.743 119.891 119.249 119.341 1488319200.000000000 14651
1 True 119.893 119.954 119.552 119.738 1488348000.000000000 10738
2 True 119.946 120.221 119.840 119.888 1488376800.000000000 10041
If you want to change the column order, use df.reindex:
df = df.reindex(columns=['time', 'volume', 'complete', 'mid.h', 'mid.l', 'mid.c', 'mid.o'])
df
time volume complete mid.h mid.l mid.c mid.o
0 1488319200.000000000 14651 True 119.891 119.249 119.743 119.341
1 1488348000.000000000 10738 True 119.954 119.552 119.893 119.738
2 1488376800.000000000 10041 True 120.221 119.840 119.946 119.888
The data in the OP (after being deserialized from a JSON string, preferably using json.load()) is a list of nested dictionaries, which is an ideal structure for pd.json_normalize(): it takes a list of dictionaries and flattens each dictionary into a single row. So the length of the list determines the number of rows, and the total number of key-value paths across the dictionaries determines the number of columns.
However, if a value under some key is a list, that no longer holds, because presumably the items in those lists need to be in separate rows. For example, if the my_data.json file looks like:
# my_data.json
[
{"price": {"mid": ["119.743", "119.891", "119.341"], "time": "123"}},
{"price": {"mid": ["119.893", "119.954", "119.552"], "time": "456"}},
{"price": {"mid": ["119.946", "120.221", "119.840"], "time": "789"}}
]
then you'll want to put each value in the list in its own row. In that case, you can pass the path to these lists as the record_path= argument. You can also make each record carry its accompanying metadata, whose path you pass as the meta= argument.
# deserialize json into a python data structure
import json
import pandas as pd
with open('my_data.json', 'r') as f:
data = json.load(f)
# normalize the python data structure
df = pd.json_normalize(data, record_path=['price', 'mid'], meta=[['price', 'time']], record_prefix='mid.')
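The same call can be run with the sample data inlined (equivalent to loading my_data.json with json.load()), to see what the record_path/meta combination produces:

```python
import pandas as pd

# Inline equivalent of the my_data.json contents above
data = [
    {"price": {"mid": ["119.743", "119.891", "119.341"], "time": "123"}},
    {"price": {"mid": ["119.893", "119.954", "119.552"], "time": "456"}},
    {"price": {"mid": ["119.946", "120.221", "119.840"], "time": "789"}},
]

# Each of the 9 mid values gets its own row; the record's price.time
# metadata is repeated alongside each of them.
df = pd.json_normalize(data, record_path=['price', 'mid'],
                       meta=[['price', 'time']], record_prefix='mid.')
```

The result has 9 rows (three mid values per input record) and the meta column appears as price.time.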

Ultimately, pd.json_normalize() cannot handle anything more complex than this kind of structure. For example, it cannot add another metadata to the above example if it's nested inside another dictionary. Depending on the data, you'll most probably need a recursive function to parse it (FYI, pd.json_normalize() is a recursive function as well but it's for a general case and won't work for a lot of specific objects).
Oftentimes, you'll need a combination of explode(), pd.DataFrame(col.tolist()), etc. to completely parse the data.
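As a minimal sketch of that explode() + pd.DataFrame(col.tolist()) pattern (the column and field names here are made up): a column holding lists of dicts is first exploded to one row per item, then the dicts are expanded into columns.

```python
import pandas as pd

# Made-up data: "tags" holds lists of dicts of varying length
df = pd.DataFrame({
    "id": [1, 2],
    "tags": [[{"k": "a"}, {"k": "b"}], [{"k": "c"}]],
})

exploded = df.explode("tags").reset_index(drop=True)  # one row per list item
expanded = pd.DataFrame(exploded["tags"].tolist())    # each dict becomes columns
result = pd.concat([exploded.drop(columns="tags"), expanded], axis=1)
```

The parent columns (here id) are repeated for every item that came from the same row, just like the meta columns in json_normalize.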
Pandas also has a convenience function pd.read_json(), but it's even more limited than pd.json_normalize() in that it can only correctly parse a JSON array of one nesting level. Unlike pd.json_normalize(), however, it deserializes a JSON string under the hood, so you can pass the path to a JSON file directly (no need for json.load()). In other words, the following two produce the same output:
df1 = pd.read_json("my_data.json")
df2 = pd.json_normalize(data, max_level=0) # here, `data` is deserialized `my_data.json`
df1.equals(df2) # True
I’ve been using the json_normalize function in pandas to read through a folder of json files and build a dataframe for the entire folder. It’s working today, but it takes much longer than I was hoping it would. Are there any optimizations I can look into, or even an alternative to pandas that could normalize json faster?
For context, the folder holds about 800 MB of json files (each file ~2 MB), and it takes roughly 14 minutes to parse through them all and build the dataframe.
Is it also possible the slow piece is concatenating the dataframes together? How would I go about optimizing that?
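Quite possibly, yes: growing a dataframe with pd.concat inside the loop is quadratic in the number of files, while a single final concat over a list of frames is linear. A hedged sketch of that pattern (the folder and file contents here are hypothetical stand-ins, built in a temp directory so the example runs):

```python
import json
import pathlib
import tempfile
import pandas as pd

# Hypothetical setup: a folder of JSON files, each holding a list of records
folder = pathlib.Path(tempfile.mkdtemp())
for i in range(3):
    (folder / f"part{i}.json").write_text(json.dumps([{"id": i, "v": i * 10}]))

# Normalize each file into its own frame, then concatenate ONCE at the end,
# instead of calling pd.concat inside the loop.
frames = [pd.json_normalize(json.loads(p.read_text()))
          for p in sorted(folder.glob("*.json"))]
df = pd.concat(frames, ignore_index=True)
```

If profiling shows json_normalize itself dominates, a faster JSON parser or flattening the dicts yourself before building the frame are the usual next steps.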
In the pandas example (below) what do the brackets mean? Is there a logic to be followed to go deeper with the []? [...]
result = json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
Each string or list of strings in the ['state', 'shortname', ['info', 'governor']] value is a path to an element to include, in addition to the selected rows. The second json_normalize() argument (record_path, set to 'counties' in the documentation example) tells the function how to select the elements from the input data structure that make up the rows in the output, and the meta paths add further metadata that is included with each of those rows. Think of these as table joins in a database, if you will.
The input for the US States documentation example has two dictionaries in a list, and both of these dictionaries have a counties key that references another list of dicts:
>>> data = [{'state': 'Florida',
... 'shortname': 'FL',
... 'info': {'governor': 'Rick Scott'},
... 'counties': [{'name': 'Dade', 'population': 12345},
... {'name': 'Broward', 'population': 40000},
... {'name': 'Palm Beach', 'population': 60000}]},
... {'state': 'Ohio',
... 'shortname': 'OH',
... 'info': {'governor': 'John Kasich'},
... 'counties': [{'name': 'Summit', 'population': 1234},
... {'name': 'Cuyahoga', 'population': 1337}]}]
>>> pprint(data[0]['counties'])
[{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}]
>>> pprint(data[1]['counties'])
[{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]
Between them there are 5 rows of data to use in the output:
>>> json_normalize(data, 'counties')
name population
0 Dade 12345
1 Broward 40000
2 Palm Beach 60000
3 Summit 1234
4 Cuyahoga 1337
The meta argument then names some elements that live next to those counties lists, and those are then merged in separately. The values from the first data[0] dictionary for those meta elements are ('Florida', 'FL', 'Rick Scott'), respectively, and for data[1] the values are ('Ohio', 'OH', 'John Kasich'), so you see those values attached to the counties rows that came from the same top-level dictionary, repeated 3 and 2 times respectively:
>>> data[0]['state'], data[0]['shortname'], data[0]['info']['governor']
('Florida', 'FL', 'Rick Scott')
>>> data[1]['state'], data[1]['shortname'], data[1]['info']['governor']
('Ohio', 'OH', 'John Kasich')
>>> json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
So, if you pass in a list for the meta argument, then each element in the list is a separate path, and each of those separate paths identifies data to add to the rows in the output.
In your example JSON, there are only a few nested lists to elevate with the first argument, like 'counties' did in the example. The only candidate in that data structure is the nested 'authors' key; you'd have to extract each ['_source', 'authors'] path, after which you can add other keys from the parent object to augment those rows.
The meta argument then pulls in the _id key from the outermost objects, followed by the nested ['_source', 'title'] and ['_source', 'journal'] paths.
The record_path argument takes the authors lists as the starting point, these look like:
>>> d['hits']['hits'][0]['_source']['authors'] # this value is None, and is skipped
>>> d['hits']['hits'][1]['_source']['authors']
[{'affiliations': ['Punjabi University'],
'author_id': '780E3459',
'author_name': 'munish puri'},
{'affiliations': ['Punjabi University'],
'author_id': '48D92C79',
'author_name': 'rajesh dhaliwal'},
{'affiliations': ['Punjabi University'],
'author_id': '7D9BD37C',
'author_name': 'r s singh'}]
>>> d['hits']['hits'][2]['_source']['authors']
[{'author_id': '7FF872BC',
'author_name': 'barbara eileen ryan'}]
>>> # etc.
and so gives you the following rows:
>>> json_normalize(d['hits']['hits'], ['_source', 'authors'])
affiliations author_id author_name
0 [Punjabi University] 780E3459 munish puri
1 [Punjabi University] 48D92C79 rajesh dhaliwal
2 [Punjabi University] 7D9BD37C r s singh
3 NaN 7FF872BC barbara eileen ryan
4 NaN 0299B8E9 fraser j harbutt
5 NaN 7DAB7B72 richard m freeland
and then we can use the third argument, meta, to add more columns like _id, _source.title and _source.journal, using ['_id', ['_source', 'journal'], ['_source', 'title']]:
>>> json_normalize(
... d['hits']['hits'],
... ['_source', 'authors'],
... ['_id', ['_source', 'journal'], ['_source', 'title']]
... )
affiliations author_id author_name _id \
0 [Punjabi University] 780E3459 munish puri 7AF8EBC3
1 [Punjabi University] 48D92C79 rajesh dhaliwal 7AF8EBC3
2 [Punjabi University] 7D9BD37C r s singh 7AF8EBC3
3 NaN 7FF872BC barbara eileen ryan 7521A721
4 NaN 0299B8E9 fraser j harbutt 7DAEB9A4
5 NaN 7DAB7B72 richard m freeland 7B3236C5
_source.journal
0 Journal of Industrial Microbiology & Biotechno...
1 Journal of Industrial Microbiology & Biotechno...
2 Journal of Industrial Microbiology & Biotechno...
3 The American Historical Review
4 The American Historical Review
5 The American Historical Review
_source.title \
0 Development of a stable continuous flow immobi...
1 Development of a stable continuous flow immobi...
2 Development of a stable continuous flow immobi...
3 Feminism and the women's movement : dynamics o...
4 The iron curtain : Churchill, America, and the...
5 The Truman Doctrine and the origins of McCarth...
You can also have a look at the library flatten_json, which does not require you to write column hierarchies as in json_normalize:
import pandas as pd
from flatten_json import flatten
data = d['hits']['hits']
dict_flattened = (flatten(record, '.') for record in data)
df = pd.DataFrame(dict_flattened)
print(df)
See https://github.com/amirziai/flatten.
Hi all, I'm trying to flatten a JSON to a dataframe using json_normalize, but one column has mixed-type data: the first few rows are JSON objects and later rows are arrays, so json_normalize is not working as expected for that column. Any help would be appreciated.