I actually wrote a package called cherrypicker recently to deal with this exact sort of thing since I had to do it so often!
I think the following code would give you exactly what you're after:
from cherrypicker import CherryPicker
import json
import pandas as pd
with open('file.json') as file:
    data = json.load(file)
picker = CherryPicker(data)
flat = picker['tickets'].flatten().get()
df = pd.DataFrame(flat)
print(df)
This gave me the output:
Location_City Location_State Name hobbies_0 hobbies_1 playerId salary teamId year
0 Los Angeles CA Liam Piano Sports barkele01 870000 ATL 1985
1 Los Angeles CA John Music Running bedrost01 550000 ATL 1985
You can install the package with:
pip install cherrypicker
...and there's more docs and guidance at https://cherrypicker.readthedocs.io.
(The answer above is from big-o on Stack Overflow.)
As you already have a function to flatten a JSON object, you just have to flatten the tickets:
...
with open(args.json_file, "r") as inputFile:  # open json file
    json_data = json.loads(inputFile.read())  # load json content

final_data = pd.DataFrame([flatten_json(elt) for elt in json_data['tickets']])
...
With your sample data, final_data is as expected:
Location_City Location_State Name hobbies_0 hobbies_1 playerId salary teamId year
0 Los Angeles CA Liam Piano Sports barkele01 870000 ATL 1985
1 Los Angeles CA John Music Running bedrost01 550000 ATL 1985
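The answer above assumes you already have a `flatten_json` helper. If you don't, a minimal sketch of one (my own; it joins nested keys and list indices with underscores, matching the column names shown above) could look like this:

```python
def flatten_json(obj, prefix=''):
    """Recursively flatten nested dicts/lists into a single-level dict."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten_json(value, f"{prefix}{key}_"))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            flat.update(flatten_json(value, f"{prefix}{i}_"))
    else:
        # leaf value: strip the trailing separator from the built-up key
        flat[prefix[:-1]] = obj
    return flat

print(flatten_json({"a": {"b": 1}, "c": [2, 3]}))  # {'a_b': 1, 'c_0': 2, 'c_1': 3}
```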
Please scroll down for the newer, faster solution
This is an older question, but I struggled the entire night to get a satisfactory result for a similar situation, and I came up with this:
import json
import pandas

def cross_join(left, right):
    # cartesian product via a constant merge key
    # (recent pandas removed the positional axis argument, so use drop(columns=...))
    return left.assign(key=1).merge(right.assign(key=1), on='key', how='outer').drop(columns='key')

def json_to_dataframe(data_in):
    def to_frame(data, prev_key=''):
        if isinstance(data, dict):
            df = pandas.DataFrame()
            for key in data:
                df = cross_join(df, to_frame(data[key], prev_key + '.' + key))
        elif isinstance(data, list):
            df = pandas.DataFrame()
            for i in range(len(data)):
                df = pandas.concat([df, to_frame(data[i], prev_key)])
        else:
            df = pandas.DataFrame({prev_key[1:]: [data]})
        return df
    return to_frame(data_in)

if __name__ == '__main__':
    with open('somefile') as json_file:
        json_data = json.load(json_file)
    df = json_to_dataframe(json_data)
    df.to_csv('data.csv', mode='w')
Explanation:
The cross_join function is a neat way I found to do a cartesian product. (credit: here)
The json_to_dataframe function does the logic, using pandas dataframes. In my case the JSON was deeply nested: I wanted dictionary key/value pairs to become columns, but lists to become rows of a column (hence the concat), which I then cross join with the upper level. This multiplies the number of records so that each value from the list gets its own row, while the columns from the previous levels are repeated.
The recursiveness creates stacks that cross join with the one below, until the last one is returned.
Then with the dataframe in a table format, it's easy to convert to CSV with the "df.to_csv()" dataframe object method.
This should work with deeply nested JSON, being able to normalize all of it into rows by the logic described above.
I hope this will help someone, someday. Just trying to give back to this awesome community.
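For anyone who wants to see the cartesian-product trick in isolation, here is a small sketch (the frames `a` and `b` and their columns are just placeholders of my own):

```python
import pandas as pd

def cross_join(left, right):
    # merge on a constant helper key, then drop it: every left row
    # is paired with every right row
    return left.assign(key=1).merge(right.assign(key=1), on='key', how='outer').drop(columns='key')

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': ['p', 'q']})
print(cross_join(a, b))  # 4 rows: every x paired with every y
```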
---------------------------------------------------------------------------------------------
LATER EDIT: NEW SOLUTION
I'm coming back to this because, while the dataframe option kinda worked, it took the app minutes to parse not-so-large JSON data. Therefore I thought of doing what the dataframes do, but by myself:
from copy import deepcopy

import pandas

def cross_join(left, right):
    # if the right-hand list is empty, keep the left-hand rows unchanged
    new_rows = [] if right else left
    for left_row in left:
        for right_row in right:
            temp_row = deepcopy(left_row)
            for key, value in right_row.items():
                temp_row[key] = value
            new_rows.append(deepcopy(temp_row))
    return new_rows

def flatten_list(data):
    for elem in data:
        if isinstance(elem, list):
            yield from flatten_list(elem)
        else:
            yield elem

def json_to_dataframe(data_in):
    def flatten_json(data, prev_heading=''):
        if isinstance(data, dict):
            rows = [{}]
            for key, value in data.items():
                rows = cross_join(rows, flatten_json(value, prev_heading + '.' + key))
        elif isinstance(data, list):
            rows = []
            for item in data:
                rows.extend(flatten_list(flatten_json(item, prev_heading)))
        else:
            rows = [{prev_heading[1:]: data}]
        return rows
    return pandas.DataFrame(flatten_json(data_in))
if __name__ == '__main__':
    json_data = {
        "id": "0001",
        "type": "donut",
        "name": "Cake",
        "ppu": 0.55,
        "batters": {
            "batter": [
                {"id": "1001", "type": "Regular"},
                {"id": "1002", "type": "Chocolate"},
                {"id": "1003", "type": "Blueberry"},
                {"id": "1004", "type": "Devil's Food"}
            ]
        },
        "topping": [
            {"id": "5001", "type": "None"},
            {"id": "5002", "type": "Glazed"},
            {"id": "5005", "type": "Sugar"},
            {"id": "5007", "type": "Powdered Sugar"},
            {"id": "5006", "type": "Chocolate with Sprinkles"},
            {"id": "5003", "type": "Chocolate"},
            {"id": "5004", "type": "Maple"}
        ],
        "something": []
    }
    df = json_to_dataframe(json_data)
    print(df)
OUTPUT:
id type name ppu batters.batter.id batters.batter.type topping.id topping.type
0 0001 donut Cake 0.55 1001 Regular 5001 None
1 0001 donut Cake 0.55 1001 Regular 5002 Glazed
2 0001 donut Cake 0.55 1001 Regular 5005 Sugar
3 0001 donut Cake 0.55 1001 Regular 5007 Powdered Sugar
4 0001 donut Cake 0.55 1001 Regular 5006 Chocolate with Sprinkles
5 0001 donut Cake 0.55 1001 Regular 5003 Chocolate
6 0001 donut Cake 0.55 1001 Regular 5004 Maple
7 0001 donut Cake 0.55 1002 Chocolate 5001 None
8 0001 donut Cake 0.55 1002 Chocolate 5002 Glazed
9 0001 donut Cake 0.55 1002 Chocolate 5005 Sugar
10 0001 donut Cake 0.55 1002 Chocolate 5007 Powdered Sugar
11 0001 donut Cake 0.55 1002 Chocolate 5006 Chocolate with Sprinkles
12 0001 donut Cake 0.55 1002 Chocolate 5003 Chocolate
13 0001 donut Cake 0.55 1002 Chocolate 5004 Maple
14 0001 donut Cake 0.55 1003 Blueberry 5001 None
15 0001 donut Cake 0.55 1003 Blueberry 5002 Glazed
16 0001 donut Cake 0.55 1003 Blueberry 5005 Sugar
17 0001 donut Cake 0.55 1003 Blueberry 5007 Powdered Sugar
18 0001 donut Cake 0.55 1003 Blueberry 5006 Chocolate with Sprinkles
19 0001 donut Cake 0.55 1003 Blueberry 5003 Chocolate
20 0001 donut Cake 0.55 1003 Blueberry 5004 Maple
21 0001 donut Cake 0.55 1004 Devil's Food 5001 None
22 0001 donut Cake 0.55 1004 Devil's Food 5002 Glazed
23 0001 donut Cake 0.55 1004 Devil's Food 5005 Sugar
24 0001 donut Cake 0.55 1004 Devil's Food 5007 Powdered Sugar
25 0001 donut Cake 0.55 1004 Devil's Food 5006 Chocolate with Sprinkles
26 0001 donut Cake 0.55 1004 Devil's Food 5003 Chocolate
27 0001 donut Cake 0.55 1004 Devil's Food 5004 Maple
As per what the above does, well, the cross_join function does pretty much the same thing as in the dataframe solution, but without dataframes, thus being faster.
I added the flatten_list generator because I wanted to make sure the JSON arrays are all nicely flattened and then provided as a single list of dictionaries, each carrying the heading built up in the previous iteration and assigned to each of the list's values. This pretty much mimics the pandas.concat behaviour in this case.
The logic in the main function, json_to_dataframe is then the same as before. All that needed to change was having the operations performed by dataframes as coded functions.
Also, in the dataframes solution I was not appending the previous heading to the nested object, but unless you are 100% sure you do not have conflicts in column names, then it is pretty much mandatory.
I hope this helps :).
EDIT: Modified the cross_join function to deal with the case when a nested list is empty, basically maintaining the previous result set unmodified. The output is unchanged even after adding the empty JSON list in the example JSON data. Thank you, @Nazmus Sakib for pointing it out.
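That empty-list guard can be seen in isolation with a quick sketch (toy rows of my own, and a slightly condensed inner loop using dict.update):

```python
from copy import deepcopy

def cross_join(left, right):
    # when right is empty, pass the left rows through unchanged
    new_rows = [] if right else left
    for left_row in left:
        for right_row in right:
            temp_row = deepcopy(left_row)
            temp_row.update(right_row)
            new_rows.append(temp_row)
    return new_rows

print(cross_join([{'a': 1}], []))                    # [{'a': 1}]
print(cross_join([{'a': 1}], [{'b': 2}, {'b': 3}]))  # [{'a': 1, 'b': 2}, {'a': 1, 'b': 3}]
```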
For the JSON data you have given, you could do this by parsing the JSON structure to just return a list of all the leaf nodes.
This assumes that your structure is consistent throughout, if each entry can have different fields, see the second approach.
For example:
import json
import csv

def get_leaves(item, key=None):
    if isinstance(item, dict):
        leaves = []
        for i in item.keys():
            leaves.extend(get_leaves(item[i], i))
        return leaves
    elif isinstance(item, list):
        leaves = []
        for i in item:
            leaves.extend(get_leaves(i, key))
        return leaves
    else:
        return [(key, item)]

with open('json.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    write_header = True
    for entry in json.load(f_input):
        leaf_entries = sorted(get_leaves(entry))
        if write_header:
            csv_output.writerow([k for k, v in leaf_entries])
            write_header = False
        csv_output.writerow([v for k, v in leaf_entries])
If your JSON data is a list of entries in the format you have given, then you should get output as follows:
address_line_1,company_number,country_of_residence,etag,forename,kind,locality,middle_name,month,name,nationality,natures_of_control,notified_on,postal_code,premises,region,self,surname,title,year
Address 1,12345678,England,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,John,individual-person-with-significant-control,Henley-On-Thames,M,2,John M Smith,Vietnamese,ownership-of-shares-50-to-75-percent,2016-04-06,RG9 1DP,161,Oxfordshire,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,Smith,Mrs,1977
Address 1,12345679,England,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,John,individual-person-with-significant-control,Henley-On-Thames,M,2,John M Smith,Vietnamese,ownership-of-shares-50-to-75-percent,2016-04-06,RG9 1DP,161,Oxfordshire,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,Smith,Mrs,1977
If each entry can contain different (or possibly missing) fields, then a better approach would be to use a DictWriter. In this case, all of the entries would need to be processed to determine the complete list of possible fieldnames so that the correct header can be written.
import json
import csv

def get_leaves(item, key=None):
    if isinstance(item, dict):
        leaves = {}
        for i in item.keys():
            leaves.update(get_leaves(item[i], i))
        return leaves
    elif isinstance(item, list):
        leaves = {}
        for i in item:
            leaves.update(get_leaves(i, key))
        return leaves
    else:
        return {key: item}

with open('json.txt') as f_input:
    json_data = json.load(f_input)

# First parse all entries to get the complete fieldname list
fieldnames = set()
for entry in json_data:
    fieldnames.update(get_leaves(entry).keys())

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=sorted(fieldnames))
    csv_output.writeheader()
    csv_output.writerows(get_leaves(entry) for entry in json_data)
A simple way is to build the nested objects as extra columns, then use the to_json method in pandas:
import pandas as pd
df = pd.read_csv('your_file.csv')
df['Purchase'] = df[['b','c','d']].to_dict('records')
df['Sales'] = df[['d','e']].to_dict('records')
out = df[['a', 'Purchase', 'Sales']].to_json(orient='records', indent=4)
Output:
[
{
"a":1,
"Purchase":{
"b":2,
"c":3,
"d":4
},
"Sales":{
"d":4,
"e":5
}
},
{
"a":9,
"Purchase":{
"b":8,
"c":7,
"d":6
},
"Sales":{
"d":6,
"e":5
}
}
]
You don't need any libraries for this, just specify the right dialect, e.g. for tab-separated:
import csv
import json
with open("tmp4.csv", "r") as f:
    result = [
        {
            "a": row["a"],
            "Purchase": {
                "b": row["b"],
                "c": row["c"],
            },
            "Sales": {
                "d": row["d"],
                "e": row["e"],
            },
        }
        for row in csv.DictReader(f, dialect='excel-tab')
    ]

assert (
    json.dumps(result)
    == '[{"a": "1", "Purchase": {"b": "2", "c": "3"}, "Sales": {"d": "4", "e": "5"}}, {"a": "9", "Purchase": {"b": "8", "c": "7"}, "Sales": {"d": "6", "e": "5"}}]'
)
Hi, hope this is appropriate. I'm doing a Python course at the minute and thought I would set some challenges for myself during the summer break. I am trying to convert a JSON file to CSV but am running into some problems with nested data.
The JSON dataset I am working has a structure like the below;
{
    "PolicyNumber": "00001",
    "PolicyData": [
        {
            "TableName": "MEMBERS",
            "Properties": {
                "Surname": "Bloggs",
                "Forename": "Joe",
                "Gender": "Male"
            }
        },
        {
            "TableName": "MEMBERS",
            "Properties": {
                "Surname": "Jobs",
                "Forename": "Steve",
                "Gender": "Male"
            }
        }
    ]
}
As you can see, for each policy there are tables for any member on the policy. There can be up to five members on any policy in this case. The property names are naturally repeated for all members.
Basically I want a column in the CSV file for each parameter. However, as there can be multiple "Members" on each "Policy", the same parameters are used multiple times. The code I wrote only makes one column per parameter and overwrites each value with the next, so in the above example the only data that pulls through is for the second member, "Steve Jobs". Ideally I would like to create columns like "Member1_Forename", "Member2_Forename" etc., but I am unsure how to do this...
Any pointers or tips?
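One way to get the Member1_/Member2_ columns described above is to number each repeated table while flattening. A minimal sketch (flatten_policy is a hypothetical name of my own, and the sample data mirrors the structure shown):

```python
def flatten_policy(policy):
    """Flatten one policy dict, numbering repeated MEMBERS tables."""
    row = {'PolicyNumber': policy['PolicyNumber']}
    # enumerate from 1 so repeated fields get unique Member<N>_ prefixes
    for i, table in enumerate(policy['PolicyData'], start=1):
        for field, value in table['Properties'].items():
            row[f'Member{i}_{field}'] = value
    return row

policy = {
    "PolicyNumber": "00001",
    "PolicyData": [
        {"TableName": "MEMBERS",
         "Properties": {"Surname": "Bloggs", "Forename": "Joe", "Gender": "Male"}},
        {"TableName": "MEMBERS",
         "Properties": {"Surname": "Jobs", "Forename": "Steve", "Gender": "Male"}},
    ],
}
print(flatten_policy(policy))
```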
Thanks to the great blog post by Amir Ziai, which you can find here, I managed to output my data in the form of a flat table, with the following function:
# json_normalize lives at the pandas top level in recent versions
from pandas import json_normalize

# Function that recursively extracts values out of the object into a flattened dictionary
def flatten_json(data):
    flat = []  # list of flat dictionaries
    def flatten(y):
        out = {}
        def flatten2(x, name=''):
            if type(x) is dict:
                for a in x:
                    if a == "name":
                        flatten2(x["value"], name + x[a] + '_')
                    else:
                        flatten2(x[a], name + a + '_')
            elif type(x) is list:
                for a in x:
                    flatten2(a, name + '_')
            else:
                out[name[:-1]] = x
        flatten2(y)
        return out
    # Loop needed to flatten multiple objects
    for i in range(len(data)):
        flat.append(flatten(data[i]).copy())
    return json_normalize(flat)
I am aware that it is not perfectly generalisable because of the name-value if statement. However, if this special case for creating the name-value dictionaries is deleted, the code can be used with other embedded arrays.
I had a task to turn a JSON with nested keys and values into a CSV file a couple of weeks ago. For this task it was necessary to handle the nested keys properly, concatenating them so they could be used as unique headers for the values. The result was the code below, which can also be found here.
def get_flat_json(json_data, header_string, header, row):
    """Parse json files with nested key-values into flat lists using nested column labeling"""
    for root_key, root_value in json_data.items():
        if isinstance(root_value, dict):
            get_flat_json(root_value, header_string + '_' + str(root_key), header, row)
        elif isinstance(root_value, list):
            for value_index in range(len(root_value)):
                for nested_key, nested_value in root_value[value_index].items():
                    header[0].append((header_string +
                                      '_' + str(root_key) +
                                      '_' + str(nested_key) +
                                      '_' + str(value_index)).strip('_'))
                    if nested_value is None:
                        nested_value = ''
                    row[0].append(str(nested_value))
        else:
            if root_value is None:
                root_value = ''
            header[0].append((header_string + '_' + str(root_key)).strip('_'))
            row[0].append(root_value)
    return header, row
This is a more generalized approach, based on An Economist's answer to this question.
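For reference, a small driver showing how the function might be called: header and row are passed in as single-element lists of lists that the function appends to in place (the sample data here is hypothetical, and the function body is repeated so the snippet runs on its own):

```python
def get_flat_json(json_data, header_string, header, row):
    """Parse nested key-values into flat header/row lists (as in the answer above)."""
    for root_key, root_value in json_data.items():
        if isinstance(root_value, dict):
            get_flat_json(root_value, header_string + '_' + str(root_key), header, row)
        elif isinstance(root_value, list):
            for value_index in range(len(root_value)):
                for nested_key, nested_value in root_value[value_index].items():
                    header[0].append((header_string + '_' + str(root_key) +
                                      '_' + str(nested_key) + '_' + str(value_index)).strip('_'))
                    row[0].append('' if nested_value is None else str(nested_value))
        else:
            header[0].append((header_string + '_' + str(root_key)).strip('_'))
            row[0].append('' if root_value is None else root_value)
    return header, row

# hypothetical sample input
data = {"a": 1, "b": {"c": 2}, "d": [{"e": 3}]}
header, row = get_flat_json(data, '', [[]], [[]])
print(header[0])  # ['a', 'b_c', 'd_e_0']
print(row[0])     # [1, 2, '3']
```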