There are many possible solutions. Generally though, you'll probably want to:
- Not loop over fields; instead let Pandas split the fields for you
- Use an actual missing value
  - But if you later want to represent it differently, you can do that, e.g. using the na_rep parameter to df.style.format

For the first step, you can look at Split / Explode a column of dictionaries into separate columns with pandas. I'll use Lech Birek's solution (json_normalize), then drop the "id" columns and rename the "value" columns.
headers_mapping = {'1': 'field1', '2': 'field2', '3': 'field3', '4': 'field4'}

(
    pd.json_normalize(df['json_field'])
    .filter(like='value')
    # removesuffix, not rstrip: rstrip strips a *set of characters* from the
    # right, which only happens to work here because the keys are digits
    .rename(columns=lambda label: headers_mapping[label.removesuffix('.value')])
)
field1 field2 field3 field4
0 value1 value2 NaN NaN
1 value1 NaN value3 NaN
2 NaN NaN value3 value4
If you also need to sort the columns, tack this on at the end:
.reindex(columns=headers_mapping.values())
Answer from wjandrea on Stack Overflow.
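For completeness, here is a self-contained sketch of that pipeline; the sample json_field data (including the id numbers) is my assumption, shaped to reproduce the output shown:

```python
import pandas as pd

# Hypothetical input mirroring the sample output: each cell maps a field id
# to a dict holding that field's id and value
df = pd.DataFrame({"json_field": [
    {"1": {"id": 101, "value": "value1"}, "2": {"id": 102, "value": "value2"}},
    {"1": {"id": 101, "value": "value1"}, "3": {"id": 103, "value": "value3"}},
    {"3": {"id": 103, "value": "value3"}, "4": {"id": 104, "value": "value4"}},
]})

headers_mapping = {"1": "field1", "2": "field2", "3": "field3", "4": "field4"}

out = (
    pd.json_normalize(df["json_field"])  # one column per nested key: '1.id', '1.value', ...
    .filter(like="value")                # keep only the '<n>.value' columns
    .rename(columns=lambda label: headers_mapping[label.removesuffix(".value")])
    .reindex(columns=list(headers_mapping.values()))  # enforce column order
)
print(out)

# na_rep, as mentioned above; to_string and to_csv also accept it for plain-text output
print(out.to_string(na_rep="-"))
```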
You can try:

import json

import pandas as pd

# apply `json.loads` if necessary
df["json_field"] = df["json_field"].apply(json.loads)

data = []
for d in df["json_field"]:
    dct = {}
    for k, v in d.items():
        dct[f"field{k}"] = v["value"]
    data.append(dct)

out = pd.DataFrame(data)
print(out)
Prints:
field1 field2 field3 field4
0 value1 value2 NaN NaN
1 value1 NaN value3 NaN
2 NaN NaN value3 value4
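The loop above can also be collapsed into a comprehension; a sketch with an assumed input of the same shape:

```python
import pandas as pd

# Assumed input: dicts already parsed, same shape as in the answer above
df = pd.DataFrame({"json_field": [
    {"1": {"value": "value1"}, "2": {"value": "value2"}},
    {"1": {"value": "value1"}, "3": {"value": "value3"}},
]})

# Build one flat dict per row; pandas fills the missing keys with NaN
out = pd.DataFrame(
    [{f"field{k}": v["value"] for k, v in d.items()} for d in df["json_field"]]
)
print(out)
```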
I hope I've understood your question well. Try:

from ast import literal_eval

import pandas as pd

df["experimental_properties"] = df["experimental_properties"].apply(
    lambda x: {d["name"]: d["property"] for d in literal_eval(x)}
)
df = pd.concat([df, df.pop("experimental_properties").apply(pd.Series)], axis=1)
print(df)
Prints:
Boiling Point Density
0 115.3 °C NaN
1 91 °C @ Press: 20 Torr NaN
2 58 °C @ Press: 12 Torr 0.8753 g/cm<sup>3</sup> @ Temp: 20 °C
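A self-contained version of the same approach, with sample experimental_properties strings assumed from the question's data:

```python
from ast import literal_eval

import pandas as pd

# Assumed sample: each cell is a string holding a list of property dicts
df = pd.DataFrame({"experimental_properties": [
    "[{'name': 'Boiling Point', 'property': '115.3 °C', 'sourceNumber': 1}]",
    "[{'name': 'Density', 'property': '0.8753 g/cm<sup>3</sup> @ Temp: 20 °C', 'sourceNumber': 1}]",
]})

# Parse each string into a {name: property} dict, then spread it into columns
df["experimental_properties"] = df["experimental_properties"].apply(
    lambda x: {d["name"]: d["property"] for d in literal_eval(x)}
)
df = pd.concat([df, df.pop("experimental_properties").apply(pd.Series)], axis=1)
print(df)
```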
Is the expected output really what you are looking for? Another way to visualise the data would be to have "name", "property", and "sourceNumber" as column names.
import json
import pandas as pd

data = [
    '''[{'name': 'Boiling Point', 'property': '115.3 °C', 'sourceNumber': 1}]''',
    '''[{'name': 'Boiling Point', 'property': '91 °C @ Press: 20 Torr', 'sourceNumber': 1}]''',
    '''[{'name': 'Boiling Point', 'property': '58 °C @ Press: 12 Torr', 'sourceNumber': 1}, {'name': 'Density', 'property': '0.8753 g/cm<sup>3</sup> @ Temp: 20 °C', 'sourceNumber': 1}]''',
]

# Initialise a list for the parsed rows
naiveList = []

# Convert each string to JSON (single quotes -> double quotes first)
for i in data:
    tempStringOfData = i.replace("\'", "\"")
    tempJsonData = json.loads(tempStringOfData)
    naiveList.append(tempJsonData)

# Flatten the nested lists into one list of dictionaries
newListOfDictionaries = []
for i in naiveList:
    for j in i:
        newListOfDictionaries.append(j)

df = pd.DataFrame(newListOfDictionaries)
print(df)
Which gives you:
name property sourceNumber
0 Boiling Point 115.3 °C 1
1 Boiling Point 91 °C @ Press: 20 Torr 1
2 Boiling Point 58 °C @ Press: 12 Torr 1
3 Density 0.8753 g/cm<sup>3</sup> @ Temp: 20 °C 1
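As a side note, the quote replacement above breaks if any value contains an apostrophe. Since the strings are Python-style literals, ast.literal_eval parses them directly and avoids that; a sketch:

```python
import ast

import pandas as pd

# Same assumed sample strings as above
data = [
    "[{'name': 'Boiling Point', 'property': '115.3 °C', 'sourceNumber': 1}]",
    "[{'name': 'Density', 'property': '0.8753 g/cm<sup>3</sup> @ Temp: 20 °C', 'sourceNumber': 1}]",
]

# Parse each string and flatten the resulting lists in one pass
rows = [record for s in data for record in ast.literal_eval(s)]
df = pd.DataFrame(rows)
print(df)
```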
I have a large JSON Lines file that I am reading through in chunks using pandas read_json.
Everything is going well, except for one field that is coming across in its original JSON form, which is fine, but I need to further parse it into columns.
The field looks like:
{'food': 'apple', 'type': 'fruit'},{'food': 'beef', 'type': 'meat'},{'food': 'ice-cream', 'type': 'desert'}
I'd like to have three columns in the DataFrame, 'Food1', 'Food2', 'Food3', that I would like to populate with data from this field - there are 28 columns before these 3 that read_json is working fine for.
Some rows don't have the above field populated.
But for this record, I'd like the result to be:
| col1 | col2 | .... | col28 | food1 | food2 | food3 |
|---|---|---|---|---|---|---|
| xxx | xxx | xxx | xxx | apple | beef | ice-cream |
There seem to be three issues:

1. extracting the JSON so I can parse how many food items I have in this record (0-3)
2. applying this to the entire chunk that I've read from the JSON Lines file
3. dealing with lines that don't have this field filled in
I could write some code to parse each line individually, but this looks like it will be much slower than using pandas.
I've tried json.loads and ast.literal_eval, and neither seems to get me closer to what I'm looking for... help!
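One hedged sketch of an approach: the column names and the parse_foods helper below are mine, and I'm assuming the field arrives as a string of comma-separated dicts, exactly as shown above.

```python
import ast

import pandas as pd

def parse_foods(cell):
    """Parse a string like "{'food': 'apple', ...},{'food': 'beef', ...}"
    into a list of food names; return [] for missing or empty cells."""
    if not isinstance(cell, str) or not cell.strip():
        return []
    # Wrap in brackets so the comma-separated dicts form one valid list literal
    return [d["food"] for d in ast.literal_eval(f"[{cell}]")]

# Assumed sample frame: one populated row, one row with the field missing
df = pd.DataFrame({
    "col1": ["xxx", "yyy"],
    "foods": [
        "{'food': 'apple', 'type': 'fruit'},{'food': 'beef', 'type': 'meat'},{'food': 'ice-cream', 'type': 'desert'}",
        None,
    ],
})

# Spread each list into its own columns; rows with fewer items get NaN
foods = df["foods"].apply(parse_foods).apply(pd.Series)
foods.columns = [f"food{i + 1}" for i in range(foods.shape[1])]
df = df.drop(columns="foods").join(foods)
print(df)
```

Because the parsing is applied column-wise, the same call works unchanged on each chunk read from the JSON Lines file.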
Hi!
I have a CSV file that consists of an id, which is a unique movie, and the keywords for this movie. It looks something like this: 15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392, 'name': 'best friend'}, {'id': 179431, 'name': 'duringcreditsstinger'}, {'id': 208510, 'name': 'old men'}]"
I want to split the data so every movie (the id) gets every keyword. But using pd.read_csv, I only get one column with the id and one column with all the keywords, including the keyword id and 'name'. Is there any solution to get only the specific keyword?
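A sketch of one way to do this with ast.literal_eval and explode; the column names are my assumption:

```python
import ast

import pandas as pd

# Assumed shape after read_csv: one id column, one stringified-list column
df = pd.DataFrame({
    "id": [15602],
    "keywords": ["[{'id': 1495, 'name': 'fishing'}, {'id': 12392, 'name': 'best friend'}]"],
})

# Parse the string into a list of dicts, keeping only the keyword names
df["keywords"] = df["keywords"].apply(
    lambda s: [d["name"] for d in ast.literal_eval(s)]
)

# One row per (movie id, keyword) pair
out = df.explode("keywords", ignore_index=True)
print(out)
```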
You should add the ignore_index=True argument to explode to make sure the following join isn't messed up by duplicated index labels:
df = pd.DataFrame(data).explode('countries', ignore_index=True)
df = df.join(pd.json_normalize(df.pop('countries')))
print(df)
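A self-contained sketch of that pattern, with assumed sample data:

```python
import pandas as pd

# Hypothetical sample: each row holds a year and a list of country records
data = {
    "year": [1800, 1900],
    "countries": [
        [{"country": "Yugoslavia", "population": 4687422},
         {"country": "United Korea (former)", "population": 13740000}],
        [{"country": "Yugoslavia", "population": 4687422}],
    ],
}

# ignore_index=True renumbers the exploded rows 0..n-1, so the join below
# aligns each normalized record with exactly one row
df = pd.DataFrame(data).explode("countries", ignore_index=True)
df = df.join(pd.json_normalize(df.pop("countries")))
print(df)
```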
You could try this with explode:

df = df.explode('countries')
# add to each dictionary the respective value of year under the key 'year'
df['countries'] = [{**dc, 'year': y} for dc, y in zip(df['countries'], df['year'])]
pd.DataFrame(df['countries'].tolist())
Example:
j = [{'continent': 'europe',
      'country': 'Yugoslavia',
      'income': None,
      'life_exp': None,
      'population': 4687422},
     {'continent': 'asia',
      'country': 'United Korea (former)',
      'income': None,
      'life_exp': None,
      'population': 13740000}]

df = pd.DataFrame({'countries': [j, j], 'year': [1800, 1900]})
print(df)

df = df.explode('countries')
print(df)

# Add the key 'year' with the respective row's year value to each dictionary
df['countries'] = [{**dc, 'year': y} for dc, y in zip(df['countries'], df['year'])]
print(df['countries'])

finaldf = pd.DataFrame(df['countries'].tolist())
print(finaldf)
Output:
original df:
countries year
0 [{'continent': 'europe', 'country': 'Yugoslavi... 1800
1 [{'continent': 'europe', 'country': 'Yugoslavi... 1900
df(after explode):
countries year
0 {'continent': 'europe', 'country': 'Yugoslavia... 1800
0 {'continent': 'asia', 'country': 'United Korea... 1800
1 {'continent': 'europe', 'country': 'Yugoslavia... 1900
1 {'continent': 'asia', 'country': 'United Korea... 1900
df.countries(with year added):
0 {'continent': 'europe', 'country': 'Yugoslavia', 'income': None, 'life_exp': None, 'population': 4687422, 'year': 1800}
0 {'continent': 'asia', 'country': 'United Korea (former)', 'income': None, 'life_exp': None, 'population': 13740000, 'year': 1800}
1 {'continent': 'europe', 'country': 'Yugoslavia', 'income': None, 'life_exp': None, 'population': 4687422, 'year': 1900}
1 {'continent': 'asia', 'country': 'United Korea (former)', 'income': None, 'life_exp': None, 'population': 13740000, 'year': 1900}
Name: countries, dtype: object
finaldf
continent country income life_exp population year
0 europe Yugoslavia None None 4687422 1800
1 asia United Korea (former) None None 13740000 1800
2 europe Yugoslavia None None 4687422 1900
3 asia United Korea (former) None None 13740000 1900
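An alternative worth knowing: pd.json_normalize can flatten the nested records and carry year along via record_path and meta, without an explicit explode. A sketch on data shaped like the example above:

```python
import pandas as pd

# Same shape as the example: each record holds a year and a list of countries
records = [
    {"year": 1800,
     "countries": [{"continent": "europe", "country": "Yugoslavia", "population": 4687422},
                   {"continent": "asia", "country": "United Korea (former)", "population": 13740000}]},
    {"year": 1900,
     "countries": [{"continent": "europe", "country": "Yugoslavia", "population": 4687422}]},
]

# Flatten each 'countries' list into rows, attaching the parent 'year'
finaldf = pd.json_normalize(records, record_path="countries", meta="year")
print(finaldf)
```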