JSON Parsing in Bash

Submitted by: @import:stackexchange-codereview

Problem

I have a JSON file that needs to be restructured.
This is the code I am using:

while IFS='' read -r line || [[ -n "$line" ]]; do
    COUNT=$(( $COUNT + 1 ))
    #echo "[$COUNT]"
    [ $COUNT -lt 5 ] && continue
    sj=`echo $line | jq ._source`
    index=`echo $line | jq ._index | tr -d '"'`
    itype=`echo $line | jq ._type| tr -d '"'`
    echo '{ "index" : { "_index" :"'$index'","_type":"'$itype'"}}' >> bulk_result.bulk
    echo $sj >> bulk_result.bulk
#echo "$COUNT lines processed from file $1"
done < "$1"
echo "$COUNT lines processed from file $1"


Basically, the program is reading a JSON record, for example:

{"_index":"index1","_type":"rm","_id":"AVPkyS9w","_score":1,"_source":{"timestamp":"2016-04-05T05:00:00","token":"8eb38d14","tag":"logs.rm","message":"CouchbaseConnectSuccess,bucket=srmobjects","logsource":"rm.log","RM_pw":"","component":"rm-01-NFR","RM_un":"","timeEpochMs":1459832400.248,"RM_bucket":"srmobjects","RM_eventName":"CouchbaseConnectSuccess"}}


and converting it to the following

{ "index" : { "_index" :"index1","_type":"rm"}}
{ "RM_eventName": "FcgiClose", "timeEpochMs": 1459832435.293, "component": "rm-04-NFR", "logsource": "rm.log", "message": "FcgiClose,requestIndex=0", "tag": "logs.rm", "timestamp": "2016-04-05T05:00:35" }


The file is about 4 GB. The code takes hours to process it. Is there a more efficient way to do this?

Solution

Bash is not well suited to transforming JSON, but jq is. However, calling jq three times for each line of input is certain to be slow, because every call starts a new process.

There are several other issues with the script too. The `...` backtick syntax is obsolete in favor of $(...); the counting can be simplified or, even better, eliminated using tail -n +5; and the repeated bulk_result.bulk file name would be better stored in a variable.
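As a small illustration of the first two points, here is a sketch using a made-up six-line input (the sample data and the variable names sample, records and count are assumptions, not part of the original script). It shows that tail -n +5 starts its output at line 5, which matches the loop skipping every line while COUNT is less than 5, and that $(...) captures command output the same way backticks do.

```shell
#!/usr/bin/env bash
# Hypothetical sample input: four lines to skip, followed by two records.
sample=$(printf '%s\n' h1 h2 h3 h4 rec1 rec2)

# tail -n +5 prints from line 5 onward, so no manual counter is needed.
records=$(printf '%s\n' "$sample" | tail -n +5)

# $(...) replaces the obsolete backtick form and can be nested without escaping.
count=$(printf '%s\n' "$records" | wc -l | tr -d ' ')

echo "$records"
echo "$count records after skipping the first four lines"
```
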

But none of that matters much, as it seems the entire script can be replaced with a single line:

tail -n +5 "$1" | jq -rc '{index: {_index: ._index, _type: ._type}}, ._source'
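To see what this one-liner produces, here is a sketch that feeds it a shortened, made-up record. The temporary files, the four placeholder lines to skip and the trimmed _source object are assumptions for illustration, and jq must be installed. The comma in the jq filter emits two results per input object, and -c prints each result on its own line, so a single jq process handles the whole file.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical input: four lines to skip, then one shortened record.
input=$(mktemp)
output=$(mktemp)
{
  printf 'skipped line %s\n' 1 2 3 4
  printf '%s\n' '{"_index":"index1","_type":"rm","_id":"AVPkyS9w","_source":{"tag":"logs.rm","logsource":"rm.log"}}'
} > "$input"

# One jq process reads every record; for each input object the filter
# emits the action line first and then the _source document.
tail -n +5 "$input" |
  jq -rc '{index: {_index: ._index, _type: ._type}}, ._source' > "$output"

cat "$output"
```

For this input the output file holds two lines: {"index":{"_index":"index1","_type":"rm"}} followed by {"tag":"logs.rm","logsource":"rm.log"}. The one-liner itself writes to standard output, so redirecting it to a file such as bulk_result.bulk reproduces what the original script collected.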

Context

StackExchange Code Review Q#129969, answer score: 15
