I used the following function (details can be found here):
def flatten_data(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out
Unfortunately, this flattens the whole JSON completely: if you have multi-level JSON (many nested dictionaries), it flattens everything into a single row with a huge number of columns.
What I used in the end was json_normalize(), specifying exactly the structure I required. A nice example of how to do it that way can be found here.
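For reference, here is roughly what that behaviour looks like on a small made-up record (the definition is repeated so the snippet runs standalone):

```python
def flatten_data(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

# A small invented record: nested dict plus a list
data = {'a': 1, 'b': {'c': 2, 'd': [3, 4]}}
print(flatten_data(data))
# → {'a': 1, 'b_c': 2, 'b_d_0': 3, 'b_d_1': 4}
```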
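A rough sketch of the json_normalize() approach with an explicit structure (the payload and field names here are invented for illustration):

```python
import pandas as pd

# Invented payload: one top-level field plus a list of nested records
data = {'school': 'ABC',
        'students': [{'name': 'Tom', 'grade': 9},
                     {'name': 'Ann', 'grade': 10}]}

# record_path says which nested list becomes the rows;
# meta says which outer fields to carry along into each row
df = pd.json_normalize(data, record_path='students', meta='school')
print(df)
#   name  grade school
# 0  Tom      9    ABC
# 1  Ann     10    ABC
```

This only flattens the parts you ask for, instead of exploding every nested level into its own column.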
Cross-posting (but then adapting further) from https://stackoverflow.com/a/62186053/4355695 : In this repo: https://github.com/ScriptSmith/socialreaper/blob/master/socialreaper/tools.py#L8 , I found an implementation of the list-inclusion comment by @roneo on the answer posted by @Imran.
I've added checks to it for catching empty lists and empty dicts, and also added print lines that help one understand precisely how this function works. You can turn on those print statements by passing crumbs=True in the function's args.
from collections.abc import MutableMapping

def flatten(dictionary, parent_key=False, separator='.', crumbs=False):
    """
    Turn a nested dictionary into a flattened dictionary
    :param dictionary: The dictionary to flatten
    :param parent_key: The string to prepend to dictionary's keys
    :param separator: The string used to separate flattened keys
    :param crumbs: Whether to print a trace of what the function is doing
    :return: A flattened dictionary
    """
    items = []
    for key, value in dictionary.items():
        if crumbs: print('checking:', key)
        new_key = str(parent_key) + separator + key if parent_key else key
        if isinstance(value, MutableMapping):
            if crumbs: print(new_key, ': dict found')
            if not value.items():
                if crumbs: print('Adding key-value pair:', new_key, None)
                items.append((new_key, None))
            else:
                # pass crumbs down so the trace covers nested levels too
                items.extend(flatten(value, new_key, separator, crumbs).items())
        elif isinstance(value, list):
            if crumbs: print(new_key, ': list found')
            if len(value):
                for k, v in enumerate(value):
                    items.extend(flatten({str(k): v}, new_key, separator, crumbs).items())
            else:
                if crumbs: print('Adding key-value pair:', new_key, None)
                items.append((new_key, None))
        else:
            if crumbs: print('Adding key-value pair:', new_key, value)
            items.append((new_key, value))
    return dict(items)
Test it:
ans = flatten({'a': 1, 'c': {'a': 2, 'b': {'x': 5, 'y': 10}}, 'd': [1, 2, 3], 'e': {'f': [], 'g': {}}}, crumbs=True)
print('\nflattened:', ans)
Output:
checking: a
Adding key-value pair: a 1
checking: c
c : dict found
checking: a
Adding key-value pair: c.a 2
checking: b
c.b : dict found
checking: x
Adding key-value pair: c.b.x 5
checking: y
Adding key-value pair: c.b.y 10
checking: d
d : list found
checking: 0
Adding key-value pair: d.0 1
checking: 1
Adding key-value pair: d.1 2
checking: 2
Adding key-value pair: d.2 3
checking: e
e : dict found
checking: f
e.f : list found
Adding key-value pair: e.f None
checking: g
e.g : dict found
Adding key-value pair: e.g None
flattened: {'a': 1, 'c.a': 2, 'c.b.x': 5, 'c.b.y': 10, 'd.0': 1, 'd.1': 2, 'd.2': 3, 'e.f': None, 'e.g': None}
And that does the job I need done: I can throw any complicated JSON at this and it flattens it out for me. I added a check to the original code to handle empty lists too.
Credits to https://github.com/ScriptSmith, whose repo I found the initial flatten function in.
Testing OP's sample JSON, here's the output:
{'count': 13,
'virtualmachine.0.id': '1082e2ed-ff66-40b1-a41b-26061afd4a0b',
'virtualmachine.0.name': 'test-2',
'virtualmachine.0.displayname': 'test-2',
'virtualmachine.0.securitygroup.0.id': '9e649fbc-3e64-4395-9629-5e1215b34e58',
'virtualmachine.0.securitygroup.0.name': 'test',
'virtualmachine.0.securitygroup.0.tags': None,
'virtualmachine.0.nic.0.id': '79568b14-b377-4d4f-b024-87dc22492b8e',
'virtualmachine.0.nic.0.networkid': '05c0e278-7ab4-4a6d-aa9c-3158620b6471',
'virtualmachine.0.nic.1.id': '3d7f2818-1f19-46e7-aa98-956526c5b1ad',
'virtualmachine.0.nic.1.networkid': 'b4648cfd-0795-43fc-9e50-6ee9ddefc5bd',
'virtualmachine.0.nic.1.traffictype': 'Guest',
'virtualmachine.0.hypervisor': 'KVM',
'virtualmachine.0.affinitygroup': None,
'virtualmachine.0.isdynamicallyscalable': False}
So you'll see that the 'tags' and 'affinitygroup' keys are also handled and added to the output; the original code omitted them.
2021-05-30 : Updated: collections.MutableMapping is changed to collections.abc.MutableMapping
2023-01-11 : edited, added separator arg in second items.extend() call as advised by @MHebes
2024-02-20 : how did that .abc go missing from the import statement?
2025-07-21 : moved crumbs param into function's args
pip install flatten-json
I am working with extremely nested JSON data and need to flatten out the structure. I have been using pandas json_normalize, but only on a fraction of the data so far, and I need to start flattening all of it. With only a few GB of data, json_normalize is taking me around 3 hours to complete. I need it to run much faster in order to finish my analysis on all of the data. How do I make this more efficient? Is there a better route to go with this function? My team is thinking about moving our work to pyspark, but I am hesitant, as the rest of the ETL processing doesn't take long at all; it is really this part of the process that takes forever. I have also seen people online recommend pandas json_normalize for this procedure rather than pyspark. I would appreciate any insight, thanks!
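One knob that may be worth trying before reaching for pyspark: pd.json_normalize accepts a max_level argument that stops flattening at a given depth, so you only pay for the levels you actually need. A minimal sketch with made-up data:

```python
import pandas as pd

# Invented records with two levels of nesting
records = [{'id': 1, 'meta': {'a': {'x': 1}}},
           {'id': 2, 'meta': {'a': {'x': 2}}}]

# max_level=1 flattens only one level down; deeper dicts
# are kept as raw values instead of being expanded further
df = pd.json_normalize(records, max_level=1)
print(df.columns.tolist())
# the 'meta.a' column still holds dicts like {'x': 1}
```

Whether this actually helps depends on how deep you genuinely need the flattening to go; if you need every leaf, profiling where the time goes (parsing vs. normalizing) would be the next step.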