There are many possible solutions. Generally though, you'll probably want to:
- Not loop over fields; instead let Pandas split the fields for you
- Use an actual missing value
  - But later, if you want to represent it differently, you can do that, e.g. using the na_rep parameter of df.style.format
For the first step, you can look at Split / Explode a column of dictionaries into separate columns with pandas. I'll use Lech Birek's solution (json_normalize), then drop the "id" columns and rename the "value" columns.
headers_mapping = {'1': 'field1', '2': 'field2', '3': 'field3', '4': 'field4'}

(
    pd.json_normalize(df['json_field'])
    .filter(like='value')
    .rename(columns=lambda label: headers_mapping[label.removesuffix('.value')])
)
field1 field2 field3 field4
0 value1 value2 NaN NaN
1 value1 NaN value3 NaN
2 NaN NaN value3 value4
If you also need to sort the columns, tack this on at the end:
.reindex(columns=headers_mapping.values())
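As a self-contained sketch of the steps above (the input rows and the json_field column name here are made up for illustration):

```python
import pandas as pd

# Hypothetical input: each row holds a dict of {id: {"value": ...}} pairs
df = pd.DataFrame({'json_field': [
    {'1': {'value': 'value1'}, '2': {'value': 'value2'}},
    {'1': {'value': 'value1'}, '3': {'value': 'value3'}},
    {'3': {'value': 'value3'}, '4': {'value': 'value4'}},
]})
headers_mapping = {'1': 'field1', '2': 'field2', '3': 'field3', '4': 'field4'}

out = (
    pd.json_normalize(df['json_field'].tolist())
    .filter(like='value')  # keep only the flattened "<id>.value" columns
    # removesuffix (Python 3.9+) strips the exact ".value" suffix
    .rename(columns=lambda label: headers_mapping[label.removesuffix('.value')])
    .reindex(columns=headers_mapping.values())  # sort columns
)
print(out)
```

Missing ids come out as NaN automatically, which is the "actual missing value" the bullet list recommends.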
The answer above is from wjandrea on Stack Overflow.
You can try:
import json

# apply `json.loads` if necessary
df["json_field"] = df["json_field"].apply(json.loads)

data = []
for d in df["json_field"]:
    dct = {}
    for k, v in d.items():
        dct[f"field{k}"] = v["value"]
    data.append(dct)

out = pd.DataFrame(data)
print(out)
Prints:
field1 field2 field3 field4
0 value1 value2 NaN NaN
1 value1 NaN value3 NaN
2 NaN NaN value3 value4
You should add the ignore_index=True argument to explode to make sure the following join isn't misaligned.
df = pd.DataFrame(data).explode('countries', ignore_index=True)
df = df.join(pd.json_normalize(df.pop('countries')))
print(df)
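A minimal sketch of why ignore_index matters here (the countries column and its contents are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'year': [1800, 1900],
    'countries': [
        [{'country': 'A'}, {'country': 'B'}],
        [{'country': 'A'}, {'country': 'B'}],
    ],
})

# Without ignore_index=True, explode keeps duplicate index labels
# (0, 0, 1, 1), while json_normalize returns a fresh 0..n-1 index,
# so the join would pair rows incorrectly and duplicate them.
df = df.explode('countries', ignore_index=True)
df = df.join(pd.json_normalize(df.pop('countries')))
```

With ignore_index=True both sides share the same 0..n-1 index, so the join lines up row for row.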
You could try this with explode:
df = df.explode('countries')
# we add to each dictionary the respective value of year with key 'year'
df['countries'] = [{**dc, 'year': y} for dc, y in zip(df['countries'], df['year'])]
pd.DataFrame(df['countries'].tolist())
Example:
j = [{'continent': 'europe',
      'country': 'Yugoslavia',
      'income': None,
      'life_exp': None,
      'population': 4687422},
     {'continent': 'asia',
      'country': 'United Korea (former)',
      'income': None,
      'life_exp': None,
      'population': 13740000}]

df = pd.DataFrame({'countries': [j, j], 'year': [1800, 1900]})
print(df)

df = df.explode('countries')
print(df)

# Here we add the key 'year' with the respective year row value to each dictionary
df['countries'] = [{**dc, 'year': y} for dc, y in zip(df['countries'], df['year'])]
print(df['countries'])

finaldf = pd.DataFrame(df['countries'].tolist())
print(finaldf)
Output:
original df:
countries year
0 [{'continent': 'europe', 'country': 'Yugoslavi... 1800
1 [{'continent': 'europe', 'country': 'Yugoslavi... 1900
df(after explode):
countries year
0 {'continent': 'europe', 'country': 'Yugoslavia... 1800
0 {'continent': 'asia', 'country': 'United Korea... 1800
1 {'continent': 'europe', 'country': 'Yugoslavia... 1900
1 {'continent': 'asia', 'country': 'United Korea... 1900
df.countries(with year added):
0 {'continent': 'europe', 'country': 'Yugoslavia', 'income': None, 'life_exp': None, 'population': 4687422, 'year': 1800}
0 {'continent': 'asia', 'country': 'United Korea (former)', 'income': None, 'life_exp': None, 'population': 13740000, 'year': 1800}
1 {'continent': 'europe', 'country': 'Yugoslavia', 'income': None, 'life_exp': None, 'population': 4687422, 'year': 1900}
1 {'continent': 'asia', 'country': 'United Korea (former)', 'income': None, 'life_exp': None, 'population': 13740000, 'year': 1900}
Name: countries, dtype: object
finaldf
continent country income life_exp population year
0 europe Yugoslavia None None 4687422 1800
1 asia United Korea (former) None None 13740000 1800
2 europe Yugoslavia None None 4687422 1900
3 asia United Korea (former) None None 13740000 1900
I hope I've understood your question well. Try:
from ast import literal_eval
df["experimental_properties"] = df["experimental_properties"].apply(
lambda x: {d["name"]: d["property"] for d in literal_eval(x)}
)
df = pd.concat([df, df.pop("experimental_properties").apply(pd.Series)], axis=1)
print(df)
Prints:
Boiling Point Density
0 115.3 °C NaN
1 91 °C @ Press: 20 Torr NaN
2 58 °C @ Press: 12 Torr 0.8753 g/cm<sup>3</sup> @ Temp: 20 °C
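For reference, the same approach end to end on made-up rows matching the output above (the raw strings and the experimental_properties column name are assumptions):

```python
import pandas as pd
from ast import literal_eval

# Hypothetical input: each cell is the string repr of a list of dicts
df = pd.DataFrame({'experimental_properties': [
    "[{'name': 'Boiling Point', 'property': '115.3 °C', 'sourceNumber': 1}]",
    "[{'name': 'Boiling Point', 'property': '91 °C @ Press: 20 Torr', 'sourceNumber': 1}]",
]})

# Parse each string into a list of dicts, then map name -> property
df['experimental_properties'] = df['experimental_properties'].apply(
    lambda x: {d['name']: d['property'] for d in literal_eval(x)}
)
# Expand each dict into its own columns
df = pd.concat(
    [df, df.pop('experimental_properties').apply(pd.Series)], axis=1
)
```

Rows missing a property (e.g. Density) simply get NaN in that column.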
Is the expected output really what you are looking for? Another way to visualise the data would be to have "name", "property", and "sourceNumber" as column names.
import json
import pandas as pd
data = [
'''[{'name': 'Boiling Point', 'property': '115.3 °C', 'sourceNumber': 1}]''',
'''[{'name': 'Boiling Point', 'property': '91 °C @ Press: 20 Torr', 'sourceNumber': 1}]''',
'''[{'name': 'Boiling Point', 'property': '58 °C @ Press: 12 Torr', 'sourceNumber': 1}, {'name': 'Density', 'property': '0.8753 g/cm<sup>3</sup> @ Temp: 20 °C', 'sourceNumber': 1}]''']
# initialise a naive list
naiveList = []

# string to list
for i in data:
    tempStringOfData = i.replace("'", '"')
    tempJsonData = json.loads(tempStringOfData)
    naiveList.append(tempJsonData)

# initialise a list for dictionaries, flattening the nested lists
newListOfDictionaries = []
for i in naiveList:
    for j in i:
        newListOfDictionaries.append(j)
df = pd.DataFrame(newListOfDictionaries)
print(df)
Which gives you
name property sourceNumber
0 Boiling Point 115.3 °C 1
1 Boiling Point 91 °C @ Press: 20 Torr 1
2 Boiling Point 58 °C @ Press: 12 Torr 1
3 Density 0.8753 g/cm<sup>3</sup> @ Temp: 20 °C 1
Much simpler:
df = pd.DataFrame({'address': [{'state': 'MI', 'town': 'Dearborn'}, {'state': 'CA', 'town': 'Los Angeles'}], 'name': ['John', 'Jane']})
df = df.join(df['address'].apply(pd.Series))
then
df = df.drop(columns='address')
If your address column holds strings rather than dictionaries, you can convert them first:
import ast
df.address = [ast.literal_eval(df.address[i]) for i in df.index]
then:
df.address.apply(pd.Series)
state town
0 MI Dearborn
1 CA Los Angeles
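Putting those pieces together on a small made-up frame, including the string-to-dict conversion:

```python
import ast
import pandas as pd

# Hypothetical input: address stored as strings, as often happens after CSV round-trips
df = pd.DataFrame({
    'address': ["{'state': 'MI', 'town': 'Dearborn'}",
                "{'state': 'CA', 'town': 'Los Angeles'}"],
    'name': ['John', 'Jane'],
})

# Parse the strings into real dicts, expand them into columns,
# then drop the original address column
df['address'] = df['address'].apply(ast.literal_eval)
df = df.join(df['address'].apply(pd.Series)).drop(columns='address')
```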
Depending on the size of your dataset, this can also be achieved with a helper that falls back to the raw value when parsing fails:
def literal_return(val):
    try:
        return ast.literal_eval(val)
    except (ValueError, SyntaxError):
        return val

df.address.apply(literal_return)
%timeit [ast.literal_eval(df.address[i]) for i in df.index]
144 µs ± 2.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df.address.apply(literal_return)
454 µs ± 4.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can use an RDD to extract the columns and the data, then create a DataFrame from them.
rdd = sc.textFile('test.txt')
import json
cols = rdd.map(lambda x: json.loads(x)['columns']).take(1)[0]
data = rdd.map(lambda x: json.loads(x)['data']).take(1)[0]
df = spark.createDataFrame(data, cols)
df.show(truncate=False)
+--------------+-----------+--------------+---------------------+------------+-----------------+-------------------------------+-----------+--------------+---------+------------+----------------------+------------+-------------+---------------------------+------------------+----------------------+-------------+----------------------------+-------------------------------+-------------------------+-----------+------------------+-------------------+--------------------------------+-----------------------+---------------------------+------------------------------------+------------------+-------------------+------------------------------------+---------------------------------+------------------------------+----------------+--------------+
|ApplicationNum|eads59Us01S|HouseDeal_flag|Liability_Asset_Ratio|CBRAvailPcnt|CMSFairIsaacScore|OweTaxes_or_IRAWithdrawalHistry|eads14Fi02S|GuarantorCount|CBRRevMon|CBRInstalMon|CMSApprovedToRequested|SecIncSource|eads59Us01S_4|Liability_Asset_Ratio_40_90|CBRAvailPcnt_20_95|CMSFairIsaacScore_Fund|eads14Fi02S_2|InstalMonthlyPayments_400_3k|RevolvingMonthlyPayments_1k_cap|ApprovedToRequested_0_100|NoSecIncome|coef_eads59Us01S_4|coef_HouseDeal_flag|coef_Liability_Asset_Ratio_40_90|coef_CBRAvailPcnt_20_95|coef_CMSFairIsaacScore_Fund|coef_OweTaxes_or_IRAWithdrawalHistry|coef_eads14Fi02S_2|coef_GuarantorCount|coef_RevolvingMonthlyPayments_1k_cap|coef_InstalMonthlyPayments_400_3k|coef_ApprovedToRequested_0_100|coef_NoSecIncome|coef_Intercept|
+--------------+-----------+--------------+---------------------+------------+-----------------+-------------------------------+-----------+--------------+---------+------------+----------------------+------------+-------------+---------------------------+------------------+----------------------+-------------+----------------------------+-------------------------------+-------------------------+-----------+------------------+-------------------+--------------------------------+-----------------------+---------------------------+------------------------------------+------------------+-------------------+------------------------------------+---------------------------------+------------------------------+----------------+--------------+
|569325.0 |2 |0.0 |1 |92 |825 |0.0 |4 |1.0 |74 |854 |0.51 |2 |2.0 |0.9 |92.0 |825.0 |4.0 |854.0 |1000.0 |0.51 |0.0 |0.11716245 |0.299528064 |0.392119645 |-0.010826643 |-0.004957868 |0.339407077 |0.061509795 |0.3685047 |1.67603E-4 |2.25742E-4 |0.902205454 |-0.371734864 |2.788087559 |
+--------------+-----------+--------------+---------------------+------------+-----------------+-------------------------------+-----------+--------------+---------+------------+----------------------+------------+-------------+---------------------------+------------------+----------------------+-------------+----------------------------+-------------------------------+-------------------------+-----------+------------------+-------------------+--------------------------------+-----------------------+---------------------------+------------------------------------+------------------+-------------------+------------------------------------+---------------------------------+------------------------------+----------------+--------------+
You can use json.loads to turn the JSON string into a dictionary of column-data pairs, then create new columns from that dictionary with .apply(pd.Series):
import json
import pandas as pd
df = pd.DataFrame([["""{"columns":["ApplicationNum","eads59Us01S","HouseDeal_flag","Liability_Asset_Ratio","CBRAvailPcnt","CMSFairIsaacScore","OweTaxes_or_IRAWithdrawalHistry","eads14Fi02S","GuarantorCount","CBRRevMon","CBRInstalMon","CMSApprovedToRequested","SecIncSource","eads59Us01S_4","Liability_Asset_Ratio_40_90","CBRAvailPcnt_20_95","CMSFairIsaacScore_Fund","eads14Fi02S_2","InstalMonthlyPayments_400_3k","RevolvingMonthlyPayments_1k_cap","ApprovedToRequested_0_100","NoSecIncome","coef_eads59Us01S_4","coef_HouseDeal_flag","coef_Liability_Asset_Ratio_40_90","coef_CBRAvailPcnt_20_95","coef_CMSFairIsaacScore_Fund","coef_OweTaxes_or_IRAWithdrawalHistry","coef_eads14Fi02S_2","coef_GuarantorCount","coef_RevolvingMonthlyPayments_1k_cap","coef_InstalMonthlyPayments_400_3k","coef_ApprovedToRequested_0_100","coef_NoSecIncome","coef_Intercept"],"data":[[569325.0,2,0.0,1,92,825,0.0,4,1.0,74,854,0.51,2,2.0,0.9,92.0,825.0,4.0,854.0,1000.0,0.51,0.0,0.11716245,0.299528064,0.392119645,-0.010826643,-0.004957868,0.339407077,0.061509795,0.3685047,0.000167603,0.000225742,0.902205454,-0.371734864,2.788087559]]}"""]], columns=['json_string'])
df['json_loads'] = df['json_string'].apply(json.loads)
df['column_names'] = df['json_loads'].apply(lambda x: x['columns'])
df['data'] = df['json_loads'].apply(lambda x: x['data'][0])
# turning it into a dictionary
df['dict_values']=df.apply(lambda x: dict(zip(x['column_names'],x['data'])), axis=1)
df = pd.concat([df, df['dict_values'].apply(pd.Series)], axis=1)
print(df.head())
If it's a flat JSON, then you can try:
new_df = pd.DataFrame(df['tickers'].tolist())
The DataFrame constructor takes in a list of dictionary objects and turns the keys into columns by default; this is the simplest way if your data is standardized and doesn't have a complex nested structure. Maybe you can try this:
all_data_jsons = df['tickers'].to_list()
df = pd.DataFrame(all_data_jsons)
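A quick runnable sketch of that constructor-based approach, assuming a tickers column that holds one flat dict per row (the column name and fields are invented):

```python
import pandas as pd

# Hypothetical input: one flat dict per row
df = pd.DataFrame({'tickers': [
    {'symbol': 'AAPL', 'price': 190.0},
    {'symbol': 'MSFT', 'price': 410.0},
]})

# The DataFrame constructor turns each dict's keys into columns
new_df = pd.DataFrame(df['tickers'].tolist())
```

For nested structures, prefer pd.json_normalize as shown in the earlier answers.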