Score:0

How to parse large files looking for a word

cn flag

I have some extremely large JSON files, and I am not even sure that all of them are well-formed.

Therefore, viewing them with an editor is not possible (it freezes VS Code, etc.).

I can view them with less thefile.json, but I am looking for all occurrences of a word and the data that follows them.

How can I find these occurrences from the terminal?

For example, I want to find all occurrences of the word OutputResults and their values:

{  "OutputResults": 
     { "first": 0,
       "second": 2
     }
}

EDIT:

I am trying

cat thefile.json | grep -i '"OutputResults":'

cat thefile.json | grep -A 4 '"OutputResults":'

But that shows the whole file, highlighting OutputResults in red. And it is a huge file, so I have to stop it at some point.

How can I extract just the parts where OutputResults occurs?

cn flag
The answer will depend on what "extremely large json files" means. 1 TB? Not many tools can scan that. Is the file larger than half your memory? If so, look for options that use streaming.
Score:2
us flag

If you're parsing structured text like JSON, use a dedicated tool that understands the structure, as generic text processing tools will depend on cues like newlines and whitespace which are not essential to the structure.
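To illustrate the newline dependence, here is a minimal sketch using two hypothetical sample files (pretty.json and compact.json, created only for this demonstration). Both hold the same JSON document, but a line-based tool treats them very differently:

```shell
# Same JSON document, once pretty-printed and once on a single line.
printf '{\n  "OutputResults": {\n    "first": 0\n  }\n}\n' > pretty.json
printf '{"OutputResults":{"first":0}}\n' > compact.json

# Pretty-printed: the matching line plus one line of trailing context.
grep -A 1 '"OutputResults":' pretty.json

# Compact: the whole document is one line, so grep prints all of it.
grep -A 1 '"OutputResults":' compact.json
```

With a huge single-line file, the second case is the one that floods the terminal.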

So, using jq, you could get values of all .OutputResults keys using something like:

jq '.. | select(.OutputResults?) | .OutputResults'

For example:

% cat foo.json | jq '.. | select(.OutputResults?) | .OutputResults'
{
  "first": 0,
  "second": 2
}
{
  "first": 0,
  "second": 2
}

Or if you need the .OutputResults as part of the output:

% jq '.. | select(.OutputResults?) | {OutputResults}' foo.json
{
  "OutputResults": {
    "first": 0,
    "second": 2
  }
}
{
  "OutputResults": {
    "first": 0,
    "second": 2
  }
}

Or with compact output:

% jq '.. | select(.OutputResults?) | {OutputResults}' -c < foo.json
{"OutputResults":{"first":0,"second":2}}
{"OutputResults":{"first":0,"second":2}}

For reading very large JSON files with jq, we have to use its "streaming mode", but the way jq does streaming makes it much more complicated to use. I think the following jq program, obtained by tweaking the example in the jq FAQ, should work for showing just the values of OutputResults keys:

foreach inputs as $in (
  null;
  if has("OutputResults") then null
  else . as $x
  | $in
  | if length != 2 and $x then {"OutputResults": $x}
    elif length != 2 then null
    elif .[0][-2] == "OutputResults" then ($x + {"\(.[0][-1])": "\(.[-1])"})
    else $x
    end
  end;
  select(has("OutputResults")) | .
)

Put this in a file, say, outputresults.jq, and use it like so:

jq -n --stream -f outputresults.jq some-input.json
cn flag
I tried the first one and it froze my PC for a while and killed my vscode... The output of the command was `Killed`
muru avatar
us flag
For _extremely_ large JSON files, `jq` also has a stream mode, but it's annoying to use. See if the update helps.
cn flag
Works great! Quick question: to stop it, I have to press Ctrl-C, I suppose?
muru avatar
us flag
@KansaiRobot yes, but programs generally produce output much faster than the terminal can display it, so for large outputs, by the time you press Ctrl-C the process might already have exited on its own and you're just seeing the terminal catch up on the output.
Score:0
vg flag

grep is what you want. Type man grep if you want more info.

Use this: grep -A 4 '"OutputResults":' thefile.json (note the uppercase -A, which prints lines of context after each match) and it will output something like this:

{  "OutputResults": 
     { "first": 0,
       "second": 2
     }
}
{  "OutputResults": 
     { "first": 5,
       "second": 3
     }
}
{  "OutputResults": 
     { "first": 2,
       "second": 2
     }
}
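If the file has no line breaks, grep sees it as a single line and prints all of it. A minimal sketch of a workaround, assuming the values are flat objects (no nested braces) and using a hypothetical sample file sample.json: the -o option prints only the matched text instead of the whole line.

```shell
# Hypothetical single-line sample with two occurrences of the key.
printf '%s\n' '{"a":{"OutputResults":{"first":0,"second":2}},"b":{"OutputResults":{"first":5,"second":3}}}' > sample.json

# -o prints only the matched part, one match per output line.
# -E enables extended regular expressions.
# '[^}]*\}' runs up to the first closing brace, so nested objects would be cut short.
grep -oE '"OutputResults":[^}]*\}' sample.json
```

On the sample this prints each "OutputResults":{...} fragment on its own line. grep still has to read the whole line into memory, so this may not work for files larger than the available RAM.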
cn flag
grep: "OutputResults":: No such file or directory
cn flag
what is the `-a` option for? (I read the man and did not get it)
cn flag
When I do `cat thefile.json | grep -i 'OutputResults'` I got the whole file but with the words I want in red....
cn flag
Now that I think about it, perhaps the reason it displays the whole thing is that the JSON is not separated by newlines... maybe.
MrArsikk avatar
vg flag
1. Make sure to write exactly `'"OutputResults":'` (there are single quotes around the double quotes). 2. `-A x` (uppercase A) will output x lines after each occurrence of the searched string. 3. Red words are the normal behavior; the color highlights each occurrence. 4. Didn't you need the whole file?
cn flag
I was hoping to get only the parts with OutputResults and some more lines.
MrArsikk avatar
vg flag
@KansaiRobot grep does exactly that