I notice this question was asked a few years ago, but in case someone else finds this, here are some newer projects trying to address the same problem:
- ObjectPath (for Python and JavaScript): http://objectpath.org/
- jsonpath (Python reimplementation of the JavaScript equivalent): https://pypi.org/project/jsonpath/
- yaql: https://yaql.readthedocs.io/en/latest/readme.html
- pyjq (Python bindings for jq, https://stedolan.github.io/jq/): https://pypi.org/project/pyjq/
- JMESPath: https://github.com/jmespath/jmespath.py
I personally went with pyjq because I use jq all the time for data exploration, but ObjectPath seems very attractive and is not limited to JSON.
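To give a flavor of the kind of query these libraries answer, here is a toy dot-path lookup using only the standard library. The `get_path` helper and the sample document are invented for illustration; the real libraries layer filters, wildcards, and aggregation on top of this idea:

```python
import json

def get_path(obj, path):
    """Toy dot-path lookup: walk a parsed JSON structure key by key.
    Numeric path segments index into lists."""
    for key in path.split("."):
        obj = obj[int(key)] if isinstance(obj, list) else obj[key]
    return obj

data = json.loads('{"store": {"books": [{"title": "A"}, {"title": "B"}]}}')
print(get_path(data, "store.books.1.title"))  # prints: B
```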
I thought about this a little bit, and I leaned away from something as specific as a "JSON Query Language" toward something more generic. I remembered from working with C# that it has a fairly generic querying system called LINQ for handling this sort of querying issue.
It looks as though Python has something similar called Pynq, which supports basic querying such as:
filtered_collection = From(some_collection).where("item.property > 10").select_many()
It even appears to have some basic aggregation functions. While not specific to JSON, I think it's at least a good starting point for querying.
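For comparison, the same filter can be written in plain Python with a list comprehension, no library required. The sample collection here is made up:

```python
# A plain-Python equivalent of the LINQ-style query above,
# using only a list comprehension from the standard language:
some_collection = [{"property": 5}, {"property": 12}, {"property": 42}]

filtered_collection = [item for item in some_collection if item["property"] > 10]

print(filtered_collection)  # prints: [{'property': 12}, {'property': 42}]
```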
return SQL table as JSON in python - Stack Overflow
Here is a really nice example of a pythonic way to do that:
import json
import psycopg2

def db(database_name='pepe'):
    return psycopg2.connect(database=database_name)

def query_db(query, args=(), one=False):
    cur = db().cursor()
    cur.execute(query, args)
    r = [dict((cur.description[i][0], value)
              for i, value in enumerate(row)) for row in cur.fetchall()]
    cur.connection.close()
    return (r[0] if r else None) if one else r

my_query = query_db("select * from majorroadstiger limit %s", (3,))
json_output = json.dumps(my_query)
You get an array of JSON objects:
>>> json_output
'[{"divroad": "N", "featcat": null, "countyfp": "001",...
Or with the following:
>>> j2 = query_db("select * from majorroadstiger where fullname= %s limit %s",
...               ("Mission Blvd", 1), one=True)
you get a single JSON object:
>>> j2 = json.dumps(j2)
>>> j2
'{"divroad": "N", "featcat": null, "countyfp": "001",...
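The dict-from-description trick above relies on `cursor.description`, which the Python DB-API guarantees for every driver (the first element of each entry is the column name), so it is not psycopg2-specific. A self-contained sketch of the same idea using the stdlib's sqlite3; the table and data are invented here:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE roads (fullname TEXT, divroad TEXT)")
cur.execute("INSERT INTO roads VALUES ('Mission Blvd', 'N')")

cur.execute("SELECT * FROM roads")
# DB-API: cur.description[i][0] is the i-th column's name
cols = [d[0] for d in cur.description]
rows = [dict(zip(cols, row)) for row in cur.fetchall()]
print(json.dumps(rows))  # prints: [{"fullname": "Mission Blvd", "divroad": "N"}]
conn.close()
```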
import sqlite3
import json

DB = "./the_database.db"

def get_all_users(json_str=False):
    conn = sqlite3.connect(DB)
    conn.row_factory = sqlite3.Row  # enables column access by name: row['column_name']
    db = conn.cursor()
    rows = db.execute('''
        SELECT * from Users
    ''').fetchall()
    conn.close()
    if json_str:
        return json.dumps([dict(ix) for ix in rows])  # create JSON
    return rows
Calling the method without JSON...
print(get_all_users())
prints:
[(1, 'orvar', 'password123'), (2, 'kalle', 'password123')]
Calling the method with JSON...
print(get_all_users(json_str=True))
prints:
[{"password": "password123", "id": 1, "name": "orvar"}, {"password": "password123", "id": 2, "name": "kalle"}]
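For a self-contained check of the same pattern, here is an in-memory variant; the table schema and rows are invented to match the output above:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become dict-convertible
conn.execute("CREATE TABLE Users (id INTEGER PRIMARY KEY, name TEXT, password TEXT)")
conn.executemany("INSERT INTO Users (name, password) VALUES (?, ?)",
                 [("orvar", "password123"), ("kalle", "password123")])

rows = conn.execute("SELECT * FROM Users").fetchall()
json_str = json.dumps([dict(r) for r in rows])
print(json_str)
conn.close()
```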
spyql is a tool (and Python lib) for querying and transforming data. It is fully written in Python.
In the latest benchmark, spyql outperformed all other tools, including jq, one of the most popular tools, which is written in C.
Here is one example extracted from the benchmark, showing spyql achieving the lowest processing time while keeping memory requirements low when the dataset size is >= 100 MB.
[Figure: processing time and memory requirements vs. size of the input JSON data]
IMO, these results might challenge some preconceived opinions about Python's performance and interpreted languages in general.
The benchmark is very easy to reproduce without installing any software, since it runs on a Google Colab notebook.
Happy to hear your thoughts!
UPDATE 2022/04/22
Thank you all for your feedback. The benchmark was updated, and the fastest tool is NOT written in Python. Here are the highlights:
- Added ClickHouse (written in C++) to the benchmark: I was unaware that the clickhouse-local tool could handle these tasks. ClickHouse is now the fastest (together with OctoSQL).
- OctoSQL (written in Go) was updated in response to the benchmark: updates included switching to fastjson, short-circuiting LIMIT, and eagerly printing when outputting JSON and CSV. Now OctoSQL is one of the fastest, and its memory usage is stable.
- SPyQL (written in Python) is now third: SPyQL leverages orjson (written in Rust) to parse JSON, while the query engine is written in Python. When processing 1 GB of input data, SPyQL takes 4x-5x more time than the best, while still achieving up to 2x higher performance than jq (written in C).
- I removed Pandas from the benchmark and focused on command-line tools. I am planning a separate benchmark on Python libs where Pandas, Polars, and Modin (and eventually others) will be included.
This benchmark is a living document. If you are interested in receiving updates, please subscribe to the following issue: https://github.com/dcmoura/spyql/issues/72
Thank you!
Is there any module that will convert JSON data to a MySQL table?
I would do it this way:
import json

import pandas as pd
from sqlalchemy import create_engine

# engine pointing at your target database (this connection string is just an example)
engine = create_engine('mysql+pymysql://user:password@localhost/db_name')

fn = r'D:\temp\.data\40450591.json'
with open(fn) as f:
    data = json.load(f)

# some of your records seem NOT to have a `Tags` key, hence `KeyError: 'Tags'`
# let's fix it
for r in data['Volumes']:
    if 'Tags' not in r:
        r['Tags'] = []

v = pd.DataFrame(data['Volumes']).drop(['Attachments', 'Tags'], axis=1).set_index('VolumeId')
a = pd.json_normalize(data['Volumes'], 'Attachments', ['VolumeId'], meta_prefix='parent_')
t = pd.json_normalize(data['Volumes'], 'Tags', ['VolumeId'], meta_prefix='parent_')

v.to_sql('volume', engine)
a.to_sql('attachment', engine)
t.to_sql('tag', engine)
Output:
In [179]: v
Out[179]:
AvailabilityZone CreateTime Iops Size SnapshotId State VolumeType
VolumeId
vol-049df61146c4d7901 us-east-1a 2013-12-18T22:35:00.084Z NaN 8 snap-1234567890abcdef0 in-use standard
vol-1234567890abcdef0 us-east-1a 2014-02-27T00:02:41.791Z 1000.0 100 None available io1
In [180]: a
Out[180]:
AttachTime DeleteOnTermination Device InstanceId State VolumeId parent_VolumeId
0 2013-12-18T22:35:00.000Z True /dev/sda1 i-1234567890abcdef0 attached vol-049df61146c4d7901 vol-049df61146c4d7901
1 2013-12-18T22:35:11.000Z True /dev/sda1 i-1234567890abcdef1 attached vol-049df61146c4d7111 vol-049df61146c4d7901
In [217]: t
Out[217]:
Key Value parent_VolumeId
0 Name DBJanitor-Private vol-049df61146c4d7901
1 Owner DBJanitor vol-049df61146c4d7901
2 Product Database vol-049df61146c4d7901
3 Portfolio DB Janitor vol-049df61146c4d7901
4 Service DB Service vol-049df61146c4d7901
Test JSON file:
{
"Volumes": [
{
"AvailabilityZone": "us-east-1a",
"Attachments": [
{
"AttachTime": "2013-12-18T22:35:00.000Z",
"InstanceId": "i-1234567890abcdef0",
"VolumeId": "vol-049df61146c4d7901",
"State": "attached",
"DeleteOnTermination": true,
"Device": "/dev/sda1"
},
{
"AttachTime": "2013-12-18T22:35:11.000Z",
"InstanceId": "i-1234567890abcdef1",
"VolumeId": "vol-049df61146c4d7111",
"State": "attached",
"DeleteOnTermination": true,
"Device": "/dev/sda1"
}
],
"Tags": [
{
"Value": "DBJanitor-Private",
"Key": "Name"
},
{
"Value": "DBJanitor",
"Key": "Owner"
},
{
"Value": "Database",
"Key": "Product"
},
{
"Value": "DB Janitor",
"Key": "Portfolio"
},
{
"Value": "DB Service",
"Key": "Service"
}
],
"VolumeType": "standard",
"VolumeId": "vol-049df61146c4d7901",
"State": "in-use",
"SnapshotId": "snap-1234567890abcdef0",
"CreateTime": "2013-12-18T22:35:00.084Z",
"Size": 8
},
{
"AvailabilityZone": "us-east-1a",
"Attachments": [],
"VolumeType": "io1",
"VolumeId": "vol-1234567890abcdef0",
"State": "available",
"Iops": 1000,
"SnapshotId": null,
"CreateTime": "2014-02-27T00:02:41.791Z",
"Size": 100
}
]
}
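If you want to avoid the pandas dependency, the same reshaping (each nested list split into its own flat table, with every row tagged by the parent key) can be sketched with plain comprehensions. The two-volume sample below is trimmed from the test file above:

```python
volumes = [
    {"VolumeId": "vol-049df61146c4d7901",
     "Attachments": [{"Device": "/dev/sda1", "State": "attached"},
                     {"Device": "/dev/sda1", "State": "attached"}],
     "Tags": [{"Key": "Name", "Value": "DBJanitor-Private"}]},
    {"VolumeId": "vol-1234567890abcdef0", "Attachments": [], "Tags": []},
]

# Split each nested list into its own flat "table", tagging every row
# with the parent VolumeId -- the same reshaping json_normalize does.
attachments = [dict(a, parent_VolumeId=v["VolumeId"])
               for v in volumes for a in v["Attachments"]]
tags = [dict(t, parent_VolumeId=v["VolumeId"])
        for v in volumes for t in v["Tags"]]

print(len(attachments), len(tags))  # prints: 2 1
```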
Analogous to this example: https://github.com/zolekode/json-to-tables/blob/master/example.py
Use the following script. It exports the data as HTML, but you can just as well export it as SQL:
table_maker.save_tables(YOUR_PATH, export_as="sql", sql_connection=YOUR_CONNECTION)
# See the full code below
import json
from extent_table import ExtentTable
from table_maker import TableMaker
Volumes = [
{
"AvailabilityZone": "us-east-1a",
"Attachments": [
{
"AttachTime": "2013-12-18T22:35:00.000Z",
"InstanceId": "i-1234567890abcdef0",
"VolumeId": "vol-049df61146c4d7901",
"State": "attached",
"DeleteOnTermination": "true",
"Device": "/dev/sda1"
}
],
"Tags": [
{
"Value": "DBJanitor-Private",
"Key": "Name"
},
{
"Value": "DBJanitor",
"Key": "Owner"
},
{
"Value": "Database",
"Key": "Product"
},
{
"Value": "DB Janitor",
"Key": "Portfolio"
},
{
"Value": "DB Service",
"Key": "Service"
}
],
"VolumeType": "standard",
"VolumeId": "vol-049df61146c4d7901",
"State": "in-use",
"SnapshotId": "snap-1234567890abcdef0",
"CreateTime": "2013-12-18T22:35:00.084Z",
"Size": 8
},
{
"AvailabilityZone": "us-east-1a",
"Attachments": [],
"VolumeType": "io1",
"VolumeId": "vol-1234567890abcdef0",
"State": "available",
"Iops": 1000,
"SnapshotId": "null",
"CreateTime": "2014-02-27T00:02:41.791Z",
"Size": 100
}
]
volumes = json.loads(json.dumps(Volumes))  # round-trip to ensure plain JSON types
extent_table = ExtentTable()
table_maker = TableMaker(extent_table)
table_maker.convert_json_objects_to_tables(volumes, "volumes")
table_maker.show_tables(8)
table_maker.save_tables("./", export_as="html") # you can also pass in export_as="sql" or "csv". In the case of sql, there is a parameter to pass the engine.
Output in HTML:
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>ID</th>
<th>AvailabilityZone</th>
<th>VolumeType</th>
<th>VolumeId</th>
<th>State</th>
<th>SnapshotId</th>
<th>CreateTime</th>
<th>Size</th>
<th>Iops</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>us-east-1a</td>
<td>standard</td>
<td>vol-049df61146c4d7901</td>
<td>in-use</td>
<td>snap-1234567890abcdef0</td>
<td>2013-12-18T22:35:00.084Z</td>
<td>8</td>
<td>None</td>
</tr>
<tr>
<td>1</td>
<td>us-east-1a</td>
<td>io1</td>
<td>vol-1234567890abcdef0</td>
<td>available</td>
<td>null</td>
<td>2014-02-27T00:02:41.791Z</td>
<td>100</td>
<td>1000</td>
</tr>
<tr>
<td>2</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
</tr>
</tbody>
</table>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>ID</th>
<th>PARENT_ID</th>
<th>is_scalar</th>
<th>scalar</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>False</td>
<td>None</td>
</tr>
</tbody>
</table>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>ID</th>
<th>AttachTime</th>
<th>InstanceId</th>
<th>VolumeId</th>
<th>State</th>
<th>DeleteOnTermination</th>
<th>Device</th>
<th>PARENT_ID</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>2013-12-18T22:35:00.000Z</td>
<td>i-1234567890abcdef0</td>
<td>vol-049df61146c4d7901</td>
<td>attached</td>
<td>true</td>
<td>/dev/sda1</td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
</tr>
</tbody>
</table>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>ID</th>
<th>PARENT_ID</th>
<th>is_scalar</th>
<th>scalar</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>False</td>
<td>None</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>False</td>
<td>None</td>
</tr>
<tr>
<td>2</td>
<td>0</td>
<td>False</td>
<td>None</td>
</tr>
<tr>
<td>3</td>
<td>0</td>
<td>False</td>
<td>None</td>
</tr>
<tr>
<td>4</td>
<td>0</td>
<td>False</td>
<td>None</td>
</tr>
</tbody>
</table>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>ID</th>
<th>Value</th>
<th>Key</th>
<th>PARENT_ID</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>DBJanitor-Private</td>
<td>Name</td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td>DBJanitor</td>
<td>Owner</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>Database</td>
<td>Product</td>
<