With the pandas library, this is as easy as using two commands!
df = pd.read_json()
read_json converts a JSON string to a pandas object (either a Series or a DataFrame). Then:
df.to_csv()
This can either return a string or write directly to a CSV file. See the docs for to_csv.
Based on the verbosity of previous answers, we should all thank pandas for the shortcut.
For unstructured JSON see this answer.
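If the input is newline-delimited JSON rather than a single document, read_json can still be used directly; a sketch, with hypothetical sample data, assuming one record per line (lines=True):

```python
import io
import pandas as pd

# Newline-delimited JSON: one record per line (hypothetical sample data)
ndjson = '{"a": 1, "b": "x"}\n{"a": 2, "b": "y"}\n'

# lines=True tells read_json to parse each line as a separate record
df = pd.read_json(io.StringIO(ndjson), lines=True)
df.to_csv("out.csv", index=False)
```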
EDIT: Someone asked for a working minimal example:
import pandas as pd
with open('jsonfile.json', encoding='utf-8') as inputfile:
df = pd.read_json(inputfile)
df.to_csv('csvfile.csv', encoding='utf-8', index=False)
(Answer above from vmg on Stack Overflow.)
First, your JSON has nested objects, so it normally cannot be directly converted to CSV. You need to change that to something like this:
{
    "pk": 22,
    "model": "auth.permission",
    "codename": "add_logentry",
    "content_type": 8,
    "name": "Can add log entry"
},
......]
Here is my code to generate CSV from that:
import csv
import json

x = """[
    {
        "pk": 22,
        "model": "auth.permission",
        "fields": {
            "codename": "add_logentry",
            "name": "Can add log entry",
            "content_type": 8
        }
    },
    {
        "pk": 23,
        "model": "auth.permission",
        "fields": {
            "codename": "change_logentry",
            "name": "Can change log entry",
            "content_type": 8
        }
    },
    {
        "pk": 24,
        "model": "auth.permission",
        "fields": {
            "codename": "delete_logentry",
            "name": "Can delete log entry",
            "content_type": 8
        }
    }
]"""

data = json.loads(x)
# Open in text mode with newline="" for the csv module (Python 3); "wb+" was Python 2
f = csv.writer(open("test.csv", "w", newline=""))
# Write the CSV header; if you don't need it, remove this line
f.writerow(["pk", "model", "codename", "name", "content_type"])
for item in data:
    f.writerow([item["pk"],
                item["model"],
                item["fields"]["codename"],
                item["fields"]["name"],
                item["fields"]["content_type"]])
You will get output as:
pk,model,codename,name,content_type
22,auth.permission,add_logentry,Can add log entry,8
23,auth.permission,change_logentry,Can change log entry,8
24,auth.permission,delete_logentry,Can delete log entry,8
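The same flattening can also be written with csv.DictWriter, which keeps the header and the row keys in one place; a sketch using one of the sample records from above:

```python
import csv

# One of the sample records from the answer above
rows = [
    {"pk": 22, "model": "auth.permission",
     "fields": {"codename": "add_logentry", "name": "Can add log entry",
                "content_type": 8}},
]

with open("test.csv", "w", newline="") as out:
    writer = csv.DictWriter(
        out, fieldnames=["pk", "model", "codename", "name", "content_type"])
    writer.writeheader()
    for r in rows:
        # Merge the flat keys with the nested "fields" dict into one row
        writer.writerow({"pk": r["pk"], "model": r["model"], **r["fields"]})
```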
Hello,
I am trying to convert a basic JSON result to a CSV file. However, it keeps putting the key names on the side instead of on top.
import pandas as pd
import json
df = pd.read_json(r'file.json', typ='series', orient='columns')
df.to_csv(r'file.csv', header=None)
If I set header=True, it merely puts a 0 at the top.
So I get:
Color: blue
Size: small
Stock: 5
Instead of:
Color size stock
blue  small 5
Please help
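One likely fix, sketched with a hypothetical flat JSON object standing in for file.json: reading with typ='series' puts the keys in the index (down the side), so transpose the Series into a one-row DataFrame before writing.

```python
import pandas as pd

# Hypothetical flat JSON object standing in for the contents of file.json
data = {"color": "blue", "size": "small", "stock": 5}

# A Series puts the keys in the index (i.e. down the side);
# to_frame().T transposes them into column headers
s = pd.Series(data)
df = s.to_frame().T
df.to_csv("file.csv", index=False)
```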
Hey everybody,
I am trying to convert a huge file (~65 GB) into smaller subsets. The goal is to split the file into smaller subsets, e.g. 1 million tweets per file, and convert this data to CSV format. I currently have working splitting code that splits the ndjson file into smaller ndjson files, but I have trouble converting the data to CSV. The important part is to create a column for each existing variable, so columns named __crawled_url or w1_balanced. There are quite a few nested variables in the data; for example, w1_balanced is contained in the variable theme_topic, and these need to be flattened.
Splitting code:
import json

# Function to split a big ndjson file into multiple smaller files
def split_file(input_file, lines_per_file):
    file_count = 0
    line_count = 0
    output_lines = []
    with open(input_file, 'r', encoding="utf8") as infile:
        for line in infile:
            output_lines.append(line)
            line_count += 1
            if line_count == lines_per_file:
                with open(f'1mio_split_{file_count}.ndjson', 'w', encoding="utf8") as outfile:
                    outfile.writelines(output_lines)
                file_count += 1
                line_count = 0
                output_lines = []
    # Handle any remaining lines
    if output_lines:
        with open(f'1mio_split_{file_count}.ndjson', 'w', encoding="utf8") as outfile:
            outfile.writelines(output_lines)

# File containing tweets
input_file = input("path to big file: ")
# Example file path: C:/Users/YourName/Documents/tweet.ndjson
# How many lines/tweets should each new file contain?
lines_per_file = int(input("Split after how many lines?: "))
split_file(input_file, lines_per_file)
print("Splitting done!")
Here are 2 sample lines from the data I use:
[{"__crawled_url":"https://twitter.com/example1","theme_topic":{"w1_balanced":{"label":"__label__a","confidence":0.3981},"w5_balanced":{"label":"__label__c","confidence":1}},"author":"author1","author_userid":"116718988","author_username":"author1","canonical_url":"https://twitter.com/example1","collected_by":"User","collection_method":"tweety 1.0.9.4","collection_time":"2024-05-27T14:40:32","collection_time_epoch":1716813632,"isquoted":false,"isreply":true,"isretweet":false,"language":"de","mentioning/replying":"twitteruser","num_likes":"0","num_retweets":"0","plain_text":"@twitteruser here is an exmaple text 🤔","published_time":"2024-04-18T20:14:51","published_time_epoch":1713471291,"published_time_original":"2024-04-18 20:14:51+00:00","replied_tweet":{"author":"Twitter User","author_userid":"1053198649700827136","author_username":"twitteruser"},"spacy_annotations":{"de_core_news_lg":{"noun_chunks":[{"text":"@twitteruser","start_char":0,"end_char":9},{"text":"more exapmle text","start_char":20,"end_char":34},{"text":"Gel","start_char":40,"end_char":43},{"text":"Haar","start_char":47,"end_char":51}],"named_entities":[{"text":"@twitteruser","start_char":0,"end_char":9,"label_":"MISC"}]},"xx_ent_wiki_sm":{"named_entities":{}},"da_core_news_lg":{"noun_chunks":{},"named_entities":{}},"en_core_web_lg":{"noun_chunks":{},"named_entities":{}},"fr_core_news_lg":{"noun_chunks":{},"named_entities":{}},"it_core_news_lg":{"noun_chunks":{},"named_entities":{}},"pl_core_news_lg":{"named_entities":{}},"es_core_news_lg":{"noun_chunks":{},"named_entities":{}},"fi_core_news_lg":{"noun_chunks":{},"named_entities":{}}},"tweet_id":"1781053802398814682","hashtags":{},"outlinks":{},"quoted_tweet":{"outlinks":{},"hashtags":{},"mentioning/replying":{},"replied_tweet":{}}}]
[{"__crawled_url":"https://twitter.com/example2","theme_topic":{"w1_balanced":{"label":"__label__a","confidence":0.3981},"w5_balanced":{"label":"__label__c","confidence":1}},"author":"author2","author_userid":"116712288","author_username":"author2","canonical_url":"https://twitter.com/example2","collected_by":"User","collection_method":"tweety 1.0.9.4","collection_time":"2024-05-27T14:40:32","collection_time_epoch":1716813632,"isquoted":false,"isreply":true,"isretweet":false,"language":"de","mentioning/replying":"twitteruser","num_likes":"0","num_retweets":"0","plain_text":"@twitteruser here is another exmaple text 🤔","published_time":"2024-04-18T20:14:51","published_time_epoch":1713471291,"published_time_original":"2024-04-18 20:14:51+00:00","replied_tweet":{"author":"Twitter User","author_userid":"1053198649700827136","author_username":"twitteruser"},"spacy_annotations":{"de_core_news_lg":{"noun_chunks":[{"text":"@twitteruser","start_char":0,"end_char":9},{"text":"more exapmle text","start_char":20,"end_char":34},{"text":"Gel","start_char":40,"end_char":43},{"text":"Haar","start_char":47,"end_char":51}],"named_entities":[{"text":"@twitteruser","start_char":0,"end_char":9,"label_":"MISC"}]},"xx_ent_wiki_sm":{"named_entities":{}},"da_core_news_lg":{"noun_chunks":{},"named_entities":{}},"en_core_web_lg":{"noun_chunks":{},"named_entities":{}},"fr_core_news_lg":{"noun_chunks":{},"named_entities":{}},"it_core_news_lg":{"noun_chunks":{},"named_entities":{}},"pl_core_news_lg":{"named_entities":{}},"es_core_news_lg":{"noun_chunks":{},"named_entities":{}},"fi_core_news_lg":{"noun_chunks":{},"named_entities":{}}},"tweet_id":"1781053802398814682","hashtags":{},"outlinks":{},"quoted_tweet":{"outlinks":{},"hashtags":{},"mentioning/replying":{},"replied_tweet":{}}}]
As you can see, the lines contain things like emojis and are in different languages, so encoding="utf8" must be included when opening the file. Here are a few examples of what I tried and the error messages I get.
I should also mention that, since every line is its own list, just accessing the elements as with a normal JSON object didn't work.
Thanks a lot for every answer and even reading this post!
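Reading the NDJSON line by line (one json.loads per line, instead of one json.load on the whole file) avoids the "Extra data" error, and pd.json_normalize can flatten the nested objects into dotted column names. A sketch, using trimmed stand-ins for the real sample lines; the real files would be read with open(..., encoding="utf8"):

```python
import json
import pandas as pd

# Trimmed stand-ins for the real sample lines: each NDJSON line is a
# JSON array containing a single tweet object (hence the [0] below)
lines = [
    '[{"__crawled_url": "https://twitter.com/example1", '
    '"theme_topic": {"w1_balanced": {"label": "__label__a", "confidence": 0.3981}}, '
    '"author": "author1"}]',
    '[{"__crawled_url": "https://twitter.com/example2", '
    '"theme_topic": {"w1_balanced": {"label": "__label__c", "confidence": 1}}, '
    '"author": "author2"}]',
]

# Parse each line separately; json.load on the whole file raises "Extra data"
records = [json.loads(line)[0] for line in lines]

# Flatten nested objects into dotted column names,
# e.g. "theme_topic.w1_balanced.label"
df = pd.json_normalize(records)
df.to_csv("flattened.csv", index=False, encoding="utf-8")
```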
# try1
import json
import csv

data = "C:/Users/Sample-tweets.ndjson"
json_data = json.loads(data)
csv_file = "try3.csv"
csv_obj = open(csv_file, "w")
csv_writer = csv.writer(csv_obj)
header = json_data[0].keys()
csv_writer.writerow(header)
for item in json_data:
    csv_writer.writerow(item.values())
csv_obj.close()
# raise JSONDecodeError("Expecting value", s, err.value) from None
# json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
# try2
import json
import csv

with open('Sample-tweets.ndjson', encoding="utf8") as ndfile:
    data = json.load(ndfile)
csv_data = data['emp_details']
data_file = open('try1.csv', 'w', encoding="utf8")
csv_writer = csv.writer(data_file)
count = 0
for data in csv_data:
    if count == 0:
        header = emp.keys()
        csv_writer.writerow(header)  # spacing error?! can't even run the script
        count += 1
    csv_writer.writerow(emp.values())
data_file.close()

with open('Sample-tweets.ndjson', encoding="utf8") as ndfile:
    jsondata = json.load(ndfile)
data_file = open('try2.csv', 'w', newline='', encoding="uft8")
csv_writer = csv.writer(data_file)
count = 0
for data in ndfile:
    if count == 0:
        header = data.keys()
        csv_writer.writerow(header)
        count += 1
    csv_writer.writerow(data.values())
data_file.close()
# error message: raise JSONDecodeError("Extra data", s, end)
# json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 1908)
# try3, to see if the automatic dictionary parsing works
import json

output_lines = []
with open('C:/Users/Sample1-tweets.ndjson', 'r', encoding="utf8") as f:
    json_in = f.read()
json_in = json.loads(json_in)
print(json_in[2])
# error message: raise JSONDecodeError("Extra data", s, end)
# json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 1908)
# -> same error message as above

To keep the desired columns, try this:
cols_to_keep = ['col1', 'col2', 'col3']
df = df[cols_to_keep]
df
You can also read in only the columns you need like this
df = pd.read_csv('test_old.csv', usecols=['col1', 'col2', 'col3'],
                 dtype={"col1": str, "col2": str})
You can do all the grouping in pandas.
The idea behind this solution:
1. Create a new column subset that holds the subset dictionary you want.
2. Group the dataframe by col1 into a new dataframe. Here the subset is connected to each item from col1. Extract the series subset.
3. Loop through this series and collect the data for your JSON in a list.
4. Convert that list to JSON with Python's native tools.
import pandas as pd
import json

df = pd.read_csv('test_old.csv', sep=',',
                 dtype={
                     "col1": str,
                     "col2": str,
                     "col3": str
                 })
# print(df) - compare with the example

df['subset'] = df.apply(lambda x: {'col2': x.col2,
                                   'col3': x.col3}, axis=1)

s = df.groupby('col1').agg(lambda x: list(x))['subset']

results = []
# Series.items() replaces the deprecated iteritems()
for col1, subset in s.items():
    results.append({'col1': col1, 'subset': subset})

with open('ExpectedJsonFile.json', 'w') as outfile:
    outfile.write(json.dumps(results, indent=4))
UPDATE: Since there's a problem with the example,
insert a print(df) line after the pd.read_csv and compare.
The imported dataframe should look like this:
col1 col2 state col3 val2 val3 val4 val5
0 95110 2015-05-01 CA 50 30.0 5.0 3.0 3
1 95110 2015-06-01 CA 67 31.0 5.0 3.0 4
2 95110 2015-07-01 CA 97 32.0 5.0 3.0 6
The final result looks like this:
[
    {
        "col1": "95110",
        "subset": [
            {
                "col2": "2015-05-01",
                "col3": "50"
            },
            {
                "col2": "2015-06-01",
                "col3": "67"
            },
            {
                "col2": "2015-07-01",
                "col3": "97"
            }
        ]
    }
]
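The same grouping can also be written more compactly with a comprehension over groupby and to_dict('records'); a sketch with inline stand-in data matching the example columns:

```python
import json
import pandas as pd

# Inline stand-in data matching the example columns above
df = pd.DataFrame({
    "col1": ["95110", "95110", "95110"],
    "col2": ["2015-05-01", "2015-06-01", "2015-07-01"],
    "col3": ["50", "67", "97"],
})

# One dict per group; to_dict("records") builds the subset list directly
results = [
    {"col1": key, "subset": grp[["col2", "col3"]].to_dict("records")}
    for key, grp in df.groupby("col1")
]
print(json.dumps(results, indent=4))
```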
Tested with Python 3.5.6 32bit, Pandas 0.23.4, Windows7