There are a number of tools specifically designed for the purpose of manipulating JSON from the command line, and will be a lot easier and more reliable than doing it with Awk, such as jq:
curl -s 'https://api.github.com/users/lambda' | jq -r '.name'
You can also do this with tools that are likely already installed on your system, like Python using the json module, and so avoid any extra dependencies, while still having the benefit of a proper JSON parser. The following assume you want to use UTF-8, which the original JSON should be encoded in and is what most modern terminals use as well:
Python 3:
curl -s 'https://api.github.com/users/lambda' | \
python3 -c "import sys, json; print(json.load(sys.stdin)['name'])"
Python 2:
export PYTHONIOENCODING=utf8
curl -s 'https://api.github.com/users/lambda' | \
python2 -c "import sys, json; print json.load(sys.stdin)['name']"
Frequently Asked Questions
Why not a pure shell solution?
The standard POSIX/Single Unix Specification shell is a very limited language which doesn't contain facilities for representing sequences (list or arrays) or associative arrays (also known as hash tables, maps, dicts, or objects in some other languages). This makes representing the result of parsing JSON somewhat tricky in portable shell scripts. There are somewhat hacky ways to do it, but many of them can break if keys or values contain certain special characters.
Bash 4 and later, zsh, and ksh have support for arrays and associative arrays, but these shells are not universally available (macOS stopped updating Bash at Bash 3, due to a change from GPLv2 to GPLv3, while many Linux systems don't have zsh installed out of the box). It's possible that you could write a script that would work in either Bash 4 or zsh, one of which is available on most macOS, Linux, and BSD systems these days, but it would be tough to write a shebang line that worked for such a polyglot script.
Finally, writing a full fledged JSON parser in shell would be a significant enough dependency that you might as well just use an existing dependency like jq or Python instead. It's not going to be a one-liner, or even small five-line snippet, to do a good implementation.
Why not use awk, sed, or grep?
It is possible to use these tools to do some quick extraction from JSON with a known shape and formatted in a known way, such as one key per line. There are several examples of suggestions for this in other answers.
However, these tools are designed for line based or record based formats; they are not designed for recursive parsing of matched delimiters with possible escape characters.
So these quick and dirty solutions using awk/sed/grep are likely to be fragile, and break if some aspect of the input format changes, such as collapsing whitespace, or adding additional levels of nesting to the JSON objects, or an escaped quote within a string. A solution that is robust enough to handle all JSON input without breaking will also be fairly large and complex, and so not too much different than adding another dependency on jq or Python.
I have had to deal with large amounts of customer data being deleted due to poor input parsing in a shell script before, so I never recommend quick and dirty methods that may be fragile in this way. If you're doing some one-off processing, see the other answers for suggestions, but I still highly recommend just using an existing tested JSON parser.
Historical notes
This answer originally recommended jsawk, which should still work, but is a little more cumbersome to use than jq, and depends on a standalone JavaScript interpreter being installed which is less common than a Python interpreter, so the above answers are probably preferable:
curl -s 'https://api.github.com/users/lambda' | jsawk -a 'return this.name'
This answer also originally used the Twitter API from the question, but that API no longer works, making it hard to copy the examples to test out, and the new Twitter API requires API keys, so I've switched to using the GitHub API which can be used easily without API keys. The first answer for the original question would be:
curl 'http://twitter.com/users/username.json' | jq -r '.text'
Answer from Brian Campbell on Stack OverflowThere are a number of tools specifically designed for the purpose of manipulating JSON from the command line, and will be a lot easier and more reliable than doing it with Awk, such as jq:
curl -s 'https://api.github.com/users/lambda' | jq -r '.name'
You can also do this with tools that are likely already installed on your system, like Python using the json module, and so avoid any extra dependencies, while still having the benefit of a proper JSON parser. The following assume you want to use UTF-8, which the original JSON should be encoded in and is what most modern terminals use as well:
Python 3:
curl -s 'https://api.github.com/users/lambda' | \
python3 -c "import sys, json; print(json.load(sys.stdin)['name'])"
Python 2:
export PYTHONIOENCODING=utf8
curl -s 'https://api.github.com/users/lambda' | \
python2 -c "import sys, json; print json.load(sys.stdin)['name']"
Frequently Asked Questions
Why not a pure shell solution?
The standard POSIX/Single Unix Specification shell is a very limited language which doesn't contain facilities for representing sequences (list or arrays) or associative arrays (also known as hash tables, maps, dicts, or objects in some other languages). This makes representing the result of parsing JSON somewhat tricky in portable shell scripts. There are somewhat hacky ways to do it, but many of them can break if keys or values contain certain special characters.
Bash 4 and later, zsh, and ksh have support for arrays and associative arrays, but these shells are not universally available (macOS stopped updating Bash at Bash 3, due to a change from GPLv2 to GPLv3, while many Linux systems don't have zsh installed out of the box). It's possible that you could write a script that would work in either Bash 4 or zsh, one of which is available on most macOS, Linux, and BSD systems these days, but it would be tough to write a shebang line that worked for such a polyglot script.
Finally, writing a full fledged JSON parser in shell would be a significant enough dependency that you might as well just use an existing dependency like jq or Python instead. It's not going to be a one-liner, or even small five-line snippet, to do a good implementation.
Why not use awk, sed, or grep?
It is possible to use these tools to do some quick extraction from JSON with a known shape and formatted in a known way, such as one key per line. There are several examples of suggestions for this in other answers.
However, these tools are designed for line based or record based formats; they are not designed for recursive parsing of matched delimiters with possible escape characters.
So these quick and dirty solutions using awk/sed/grep are likely to be fragile, and break if some aspect of the input format changes, such as collapsing whitespace, or adding additional levels of nesting to the JSON objects, or an escaped quote within a string. A solution that is robust enough to handle all JSON input without breaking will also be fairly large and complex, and so not too much different than adding another dependency on jq or Python.
I have had to deal with large amounts of customer data being deleted due to poor input parsing in a shell script before, so I never recommend quick and dirty methods that may be fragile in this way. If you're doing some one-off processing, see the other answers for suggestions, but I still highly recommend just using an existing tested JSON parser.
Historical notes
This answer originally recommended jsawk, which should still work, but is a little more cumbersome to use than jq, and depends on a standalone JavaScript interpreter being installed which is less common than a Python interpreter, so the above answers are probably preferable:
curl -s 'https://api.github.com/users/lambda' | jsawk -a 'return this.name'
This answer also originally used the Twitter API from the question, but that API no longer works, making it hard to copy the examples to test out, and the new Twitter API requires API keys, so I've switched to using the GitHub API which can be used easily without API keys. The first answer for the original question would be:
curl 'http://twitter.com/users/username.json' | jq -r '.text'
To quickly extract the values for a particular key, I personally like to use "grep -o", which only returns the regex's match. For example, to get the "text" field from tweets, something like:
grep -Po '"text":.*?[^\\]",' tweets.json
This regex is more robust than you might think; for example, it deals fine with strings having embedded commas and escaped quotes inside them. I think with a little more work you could make one that is actually guaranteed to extract the value, if it's atomic. (If it has nesting, then a regex can't do it of course.)
And to further clean (albeit keeping the string's original escaping) you can use something like: | perl -pe 's/"text"://; s/^"//; s/",$//'. (I did this for this analysis.)
To all the haters who insist you should use a real JSON parser -- yes, that is essential for correctness, but
- To do a really quick analysis, like counting values to check on data cleaning bugs or get a general feel for the data, banging out something on the command line is faster. Opening an editor to write a script is distracting.
grep -ois orders of magnitude faster than the Python standardjsonlibrary, at least when doing this for tweets (which are ~2 KB each). I'm not sure if this is just becausejsonis slow (I should compare to yajl sometime); but in principle, a regex should be faster since it's finite state and much more optimizable, instead of a parser that has to support recursion, and in this case, spends lots of CPU building trees for structures you don't care about. (If someone wrote a finite state transducer that did proper (depth-limited) JSON parsing, that would be fantastic! In the meantime we have "grep -o".)
To write maintainable code, I always use a real parsing library. I haven't tried jsawk, but if it works well, that would address point #1.
One last, wackier, solution: I wrote a script that uses Python json and extracts the keys you want, into tab-separated columns; then I pipe through a wrapper around awk that allows named access to columns. In here: the json2tsv and tsvawk scripts. So for this example it would be:
json2tsv id text < tweets.json | tsvawk '{print "tweet "
text}'
This approach doesn't address #2, is more inefficient than a single Python script, and it's a little brittle: it forces normalization of newlines and tabs in string values, to play nice with awk's field/record-delimited view of the world. But it does let you stay on the command line, with more correctness than grep -o.
bash - Read JSON data in a shell script - Stack Overflow
text processing - How to parse JSON with shell scripting in Linux? - Unix & Linux Stack Exchange
Generating JSON in a shell script - Unix & Linux Stack Exchange
Build a JSON string with Bash variables - Stack Overflow
Videos
There is jq for parsing json on the command line:
jq '.Body'
Visit this for jq: https://stedolan.github.io/jq/
tl;dr
$ cat /tmp/so.json | underscore select '.Messages .Body'
["172.16.1.42|/home/480/1234/5-12-2013/1234.toSort"]
Javascript CLI tools
You can use Javascript CLI tools like
- underscore-cli:
- json:select(): CSS-like selectors for JSON.
Example
Select all name children of a addons:
underscore select ".addons > .name"
The underscore-cli provide others real world examples as well as the json:select() doc.
The availability of parsers in nearly every programming language is one of the advantages of JSON as a data-interchange format.
Rather than trying to implement a JSON parser, you are likely better off using either a tool built for JSON parsing such as jq or a general purpose script language that has a JSON library.
For example, using jq, you could pull out the ImageID from the first item of the Instances array as follows:
jq '.Instances[0].ImageId' test.json
Alternatively, to get the same information using Ruby's JSON library:
ruby -rjson -e 'j = JSON.parse(File.read("test.json")); puts j["Instances"][0]["ImageId"]'
I won't answer all of your revised questions and comments but the following is hopefully enough to get you started.
Suppose that you had a Ruby script that could read a from STDIN and output the second line in your example output[0]. That script might look something like:
#!/usr/bin/env ruby
require 'json'
data = JSON.parse(ARGF.read)
instance_id = data["Instances"][0]["InstanceId"]
name = data["Instances"][0]["Tags"].find {|t| t["Key"] == "Name" }["Value"]
owner = data["Instances"][0]["Tags"].find {|t| t["Key"] == "Owner" }["Value"]
cost_center = data["Instances"][0]["SubnetId"].split("-")[1][0..3]
puts "#{instance_id}\t#{name}\t#{cost_center}\t#{owner}"
How could you use such a script to accomplish your whole goal? Well, suppose you already had the following:
- a command to list all your instances
- a command to get the json above for any instance on your list and output it to STDOU
One way would be to use your shell to combine these tools:
echo -e "Instance id\tName\tcost centre\tOwner"
for instance in $(list-instances); do
get-json-for-instance $instance | ./ugly-ruby-scriptrb
done
Now, maybe you have a single command that give you one json blob for all instances with more items in that "Instances" array. Well, if that is the case, you'll just need to modify the script a bit to iterate through the array rather than simply using the first item.
In the end, the way to solve this problem, is the way to solve many problems in Unix. Break it down into easier problems. Find or write tools to solve the easier problem. Combine those tools with your shell or other operating system features.
[0] Note that I have no idea where you get cost-center from, so I just made it up.
You can use following python script to parse that data. Lets assume that you have JSON data from arrays in files like array1.json, array2.json and so on.
import json
import sys
from pprint import pprint
jdata = open(sys.argv[1])
data = json.load(jdata)
print "InstanceId", " - ", "Name", " - ", "Owner"
print data["Instances"][0]["InstanceId"], " - " ,data["Instances"][0]["Tags"][1]["Value"], " - " ,data["Instances"][0]["Tags"][2]["Value"]
jdata.close()
And then just run:
$ for x in `ls *.json`; do python parse.py $x; done
InstanceId - Name - Owner
i-1234576 - RDS_Machine (us-east-1c) - Jyoti Bhanot
I haven't seen cost in your data, that's why I didn't include that.
According to discussion in comments, I have updated parse.py script:
import json
import sys
from pprint import pprint
jdata = sys.stdin.read()
data = json.loads(jdata)
print "InstanceId", " - ", "Name", " - ", "Owner"
print data["Instances"][0]["InstanceId"], " - " ,data["Instances"][0]["Tags"][1]["Value"], " - " ,data["Instances"][0]["Tags"][2]["Value"]
You can try to run following command:
#ec2-describe-instance <instance> | python parse.py
Simply use printf to format the output into JSON
First, you have a very blatant typo in this part of your code right here:
echo "${array[3]:$var-3:4}
Note there is no closing straight quote: ". Fixed it in the rewrite I did below:
But more to the point, doing something like this (using printf) as suggested in this StackOverflow answer. Tested and works in CentOS 7.
#!/bin/bash
readarray -t array <<< "$(df -h)";
var=$(echo "${array[1]}"| grep -aob '%' | grep -oE '[0-9]+');
df_output="${array[3]:$var-3:4}";
manufacturer=$(cat /sys/class/dmi/id/chassis_vendor);
product_name=$(cat /sys/class/dmi/id/product_name);
version=$(cat /sys/class/dmi/id/bios_version);
serial_number=$(cat /sys/class/dmi/id/product_serial);
hostname=$(hostname);
operating_system=$(hostnamectl | grep "Operating System" | cut -d ' ' -f5-);
architecture=$(arch);
processor_name=$(awk -F':' '/^model name/ {print $2}' /proc/cpuinfo | uniq | sed -e 's/^[ \t]*//');
memory$(dmidecode -t 17 | grep "Size.*MB" | awk '{s+=$2} END {print s / 1024"GB"}');
hdd_model=$(cat /sys/block/sda/device/model);
system_main_ip=$(hostname -I);
printf '{"manufacturer":"%s","product_name":"%s","version":"%s","serial_number":"%s","hostname":"%s","operating_system":"%s","architecture":"%s","processor_name":"%s","memory":"%s","hdd_model":"%s","system_main_ip":"%s"}' "$manufacturer" "$product_name" "$version" "$serial_number" "$hostname" "$operating_system" "$architecture" "$processor_name" "$memory" "$hdd_model" "$system_main_ip"
The output I get is this:
{"manufacturer":"Oracle Corporation","product_name":"VirtualBox","version":"VirtualBox","serial_number":"","hostname":"sandbox-centos-7","operating_system":"CentOS Linux 7 (Core)","architecture":"x86_64","processor_name":"Intel(R) Core(TM) i5-1030NG7 CPU @ 1.10GHz","memory":"","hdd_model":"VBOX HARDDISK ","system_main_ip":"10.0.2.15 192.168.56.20 "}
And if you have jq installed, you can pipe the output of the shell script to jq to “pretty print” the output into some human readable format. Like let’s say your script is named my_script.sh, just pipe it to jq like this:
./my_script.sh | jq
And the output would look like this:
{
"manufacturer": "Oracle Corporation",
"product_name": "VirtualBox",
"version": "VirtualBox",
"serial_number": "",
"hostname": "sandbox-centos-7",
"operating_system": "CentOS Linux 7 (Core)",
"architecture": "x86_64",
"processor_name": "Intel(R) Core(TM) i5-1030NG7 CPU @ 1.10GHz",
"memory": "",
"hdd_model": "VBOX HARDDISK ",
"system_main_ip": "10.0.2.15 192.168.56.20 "
}
The following programs can output json:
lshw:
lshw -json
smartmontools v7+:
smartctl --json --all /dev/sda
lsblk:
lsblk --json
lsipc:
lsipc --json
sfdisk
sfdisk --json
One suggestion is to use --args with jq to create the two arrays and then collect these in the correct location in the main document. Note that --args is required to be the last option on the command line and that all the remaining command line arguments will become elements of the $ARGS.positional array.
{
jq -n --arg key APP-Service1-Admin '{($key): $ARGS.positional}' --args a b
jq -n --arg key APP-Service1-View '{($key): $ARGS.positional}' --args c d
} |
jq -s --arg key 'AD Accounts' '{($key): add}' |
jq --arg Service service1-name --arg 'AWS account' service1-dev '$ARGS.named + .'
The first two jq invocations create a set of two JSON objects:
{
"APP-Service1-Admin": [
"a",
"b"
]
}
{
"APP-Service1-View": [
"c",
"d"
]
}
The third jq invocation uses -s to read that set into an array, which becomes a merged object when passed through add. The merged object is assigned to our top-level key:
{
"AD Accounts": {
"APP-Service1-Admin": [
"a",
"b"
],
"APP-Service1-View": [
"c",
"d"
]
}
}
The last jq adds the remaining top-level keys and their values to the object:
{
"Service": "service1-name",
"AWS account": "service1-dev",
"AD Accounts": {
"APP-Service1-Admin": [
"a",
"b"
],
"APP-Service1-View": [
"c",
"d"
]
}
}
With jo:
jo -d . \
Service=service1-name \
'AWS account'=service1-dev \
'AD Accounts.APP-Service1-Admin'="$(jo -a a b)" \
'AD Accounts.APP-Service1-View'="$(jo -a c d)"
The "internal" object is created using .-notation (enabled with -d .), and a couple of command substitutions for creating the arrays.
Or you can drop the -d . and use a form of array notation:
jo Service=service1-name \
'AWS account'=service1-dev \
'AD Account[APP-Service1-Admin]'="$(jo -a a b)" \
'AD Account[APP-Service1-View]'="$(jo -a c d)"
I often use heredocs when creating complicated json objects in bash:
service=$(thing-what-gets-service)
account=$(thing-what-gets-account)
admin=$(jo -a $(thing-what-gets-admin))
view=$(jo -a $(thing-what-gets-view))
read -rd '' json <<EOF
[
{
"Service": "$service",
"AWS Account": "$account",
"AD Accounts": {
"APP-Service1-Admin": $admin,
"APP-Service1-View": $view
}
}
]
EOF
This uses jo to create the arrays as it's a pretty simple way to do it but it could be done differently if needed.
You are better off using a program like jq to generate the JSON, if you don't know ahead of time if the contents of the variables are properly escaped for inclusion in JSON. Otherwise, you will just end up with invalid JSON for your trouble.
BUCKET_NAME=testbucket
OBJECT_NAME=testworkflow-2.0.1.jar
TARGET_LOCATION=/opt/test/testworkflow-2.0.1.jar
JSON_STRING=$( jq -n \
--arg bn "$BUCKET_NAME" \
--arg on "$OBJECT_NAME" \
--arg tl "$TARGET_LOCATION" \
'{bucketname: $bn, objectname: $on, targetlocation: $tl}' )
You can use printf:
JSON_FMT='{"bucketname":"%s","objectname":"%s","targetlocation":"%s"}\n'
printf "$JSON_FMT" "$BUCKET_NAME" "$OBJECT_NAME" "$TARGET_LOCATION"
much clear and simpler
Note that injecting the values of shell variables into JSON documents may result in broken JSON strings due to the lack of encoding special values. It would be better (more robust) to use a JSON-aware tool to create the JSON document and to insert the wanted values.
Using jq to compose the JSON document from a simple template:
ABC=100000
DFG=600000
jq -n \
--arg Address "$ABC" \
--arg targetId "$DFG" \
--arg name "$ABC:11011" \
'[ $ARGS.named + { Backup: false, port: 6745, weight: 1 } ]'
Output (use jq with -c to get "compact" output):
[
{
"Address": "100000",
"targetId": "600000",
"name": "100000:11011",
"Backup": false,
"port": 6745,
"weight": 1
}
]
This creates an array consisting of a single element, a static object to which we add some named fields with values. The field names (keys) and their values are given using --arg key value repeatedly on the command line.
Since we're using jq to create the JSON document, this would still work if ABC held a value like 100 "My Road" (i.e. with embedded quotes, or newlines, or tabs, etc.).
Your double-quoted string contains double quotes.
You can escape all the inner quotes that you want to keep:
JSON="[{\"Address\":\"$ABC\",\"Backup\":false,\"name\":\"$ABC:11011\",\"port\":6745,\"targetId\":\"$DFG\",\"weight\":1}]"
You can use a single quoted string, interrupted where the variables go (this is hard to read):
JSON='[{"Address":"'"$ABC"'","Backup":false,"name":"'"$ABC"':11011","port":6745,"targetId":"'"$DFG"'","weight":1}]'
You can use printf with a single quoted format string:
printf -v JSON '[{"Address":"%s","Backup":false,"name":"%s:11011","port":6745,"targetId":"%s","weight":1}]' "$ABC" "$ABC" "$DFG"
Or, use a JSON-processing tool to handle any edge cases with the values of the variables:
JSON=$(
jq -n -c --arg a "$ABC" --arg b "$DFG" '[{
Address: $a,
Backup: false,
name: ($a + ":11011"),
port: 6745,
targetId: $b,
weight: 1
}]'
)
echo "$JSON"
That outputs
[{"Address":"100000","Backup":false,"name":"100000:11011","port":6745,"targetId":"600000","weight":1}]
You could use a here-doc:
foo=$(cat <<EOF
{"Comment":"Update DNSName.","Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"alex.","Type":"A","AliasTarget":{"HostedZoneId":"######","DNSName":"$bar","EvaluateTargetHealth":false}}}]}
EOF
)
By leaving EOF in the first line unquoted, the contents of the here-doc will be subject to parameter expansion, so your $bar expands to whatever you put in there.
If you can have linebreaks in your JSON, you can make it a little more readable:
foo=$(cat <<EOF
{
"Comment": "Update DNSName.",
"Changes": [
{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "alex.",
"Type": "A",
"AliasTarget": {
"HostedZoneId": "######",
"DNSName": "$bar",
"EvaluateTargetHealth": false
}
}
}
]
}
EOF
)
or even (first indent on each line must be a tab)
foo=$(cat <<-EOF
{
"Comment": "Update DNSName.",
"Changes": [
{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "alex.",
"Type": "A",
"AliasTarget": {
"HostedZoneId": "######",
"DNSName": "$bar",
"EvaluateTargetHealth": false
}
}
}
]
}
EOF
)
and to show how that is stored, including quoting (assuming that bar=baz):
$ declare -p foo
declare -- foo="{
\"Comment\": \"Update DNSName.\",
\"Changes\": [
{
\"Action\": \"UPSERT\",
\"ResourceRecordSet\": {
\"Name\": \"alex.\",
\"Type\": \"A\",
\"AliasTarget\": {
\"HostedZoneId\": \"######\",
\"DNSName\": \"baz\",
\"EvaluateTargetHealth\": false
}
}
}
]
}"
Because this expands some shell metacharacters, you could run into trouble if your JSON contains something like `, so alternatively, you could assign directly, but be careful about quoting around $bar:
foo='{"Comment":"Update DNSName.","Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"alex.","Type":"A","AliasTarget":{"HostedZoneId":"######","DNSName":"'"$bar"'","EvaluateTargetHealth":false}}}]}'
Notice the quoting for $bar: it's
"'"$bar"'"
│││ │││
│││ ││└ literal double quote
│││ │└ opening syntactical single quote
│││ └ closing syntactical double quote
││└ opening syntactical double quote
│└ closing syntactical single quote
└ literal double quote
It can be stored safely; generating it is a different matter, since the contents of $bar may need to be encoded. Let a tool like jq handle creating the JSON.
var=$(jq -n --arg b "$bar" '{
Comment: "Update DNSName.",
Changes: [
{
Action: "UPSERT",
ResourceRecordSet: {
Name: "alex.",
Type: "A",
AliasTarget: {
HostedZoneId: "######",
DNSName: $b,
EvaluateTargetHealth: false
}
}
}
]
}')