Score:2

Running grep via GNU parallel

jp flag

How can I make searches with grep over a large number of files run faster? My first attempt uses parallel (which could be improved, or other approaches suggested).

The first grep simply gives the list of files, which are then passed to parallel, which runs grep again to output matches.

The parallel command is supposed to wait for each grep to finish so that I get the results from each file together. Otherwise I get a mix-up of the results from different files.

I also use sed to skip files if necessary, via the command

sed -z "${ista}~${istp}!d"
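As a hypothetical illustration of that sed filter (the filenames and the values 1 and 2 for ista and istp are made up), GNU sed's first~step address keeps every istp-th NUL-delimited entry starting at entry ista:

```shell
# Six hypothetical filenames, NUL-delimited as produced by grep --null -l.
# With ista=1 and istp=2, "1~2!d" deletes everything except entries 1, 3, 5.
printf '%s\0' file1 file2 file3 file4 file5 file6 \
  | sed -z '1~2!d' \
  | tr '\0' '\n'
```

Slicing the file list this way lets several invocations of the whole pipeline each take a disjoint subset of the files.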

Multiple patterns are stored in the array ${ptrn[@]}, while the trailing context after matching lines is defined in ${ictx[@]}.

ptrn=("-e" "FN" "-e" "DS")
ictx=(-A 8)

grep --null -r -l "${ictx[@]}"  \
  -f <(printf "%s\n" "${ptrn[@]}") -- "${fdir[@]}"  \
    | sed -z "${ista}~${istp}!d"  \
    | PARALLEL_SHELL=bash psgc=$sgc psgr=$sgr uptrn=$ptrn  \
        parallel -m0kj"$procs"  \
          'for fl in {}; do
            printf "\n%s\n\n" "${psgc}==> $fl <==${psgr}"
            grep -ni '"${ictx[@]@Q}"' '"${ptrn[@]@Q}"' -- "$fl"
          done'
in flag
If you're grepping through files you may want to look at the other tools designed for that: ack-grep (`ack`); The Silver Searcher (`ag`); ripgrep (`rg`); hyperscan; findrepo; and a bunch of others.
jp flag
Does `ripgrep` include concurrent execution on many files?
Score:1
jp flag

grep is one of the most refined and time-proven tools performance-wise ... Please see, for example, the speed comparison of grep with other text-processing tools on very large 1G+ files with 8M+ lines here: https://askubuntu.com/a/1420653 ... Also, proper text processing (i.e. preserving separate files' output with the correct line order) is not, IMHO, a suitable task for parallel because, as you noticed, it will mix the results from different files and shift their line order ... Although you used parallel's -k option to keep the output in the same order as the input, that might only work as intended if:

  1. You limit the parallel jobs to one, i.e. -j 1 (equivalently --max-procs 1 or -P 1).
  2. You make sure the text is passed in the right order, e.g. by piping the actual text (in the right order/sequence) to parallel and using its --pipe option to pipe the text to grep afterwards.

That, however, defeats your intended purpose of running multiple jobs in parallel, and therefore the added speed gain (if any) is negligible.

Also, using a for loop requires grep to run fully for each argument/file in the loop's head, with virtually the same match pattern(s) for each file ... So it might not be the best approach when you are trying to speed things up ... You might be better off using e.g. grep's --recursive option in that case.
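A minimal sketch of that single-process alternative (the directory name, file names, and file contents are hypothetical; the FN/DS patterns are from the question): one recursive grep walks the whole tree, so the pattern machinery is set up once instead of once per file:

```shell
# Build a small hypothetical tree to search.
mkdir -p demo/sub
printf '%s\n' 'FN alpha' 'other' > demo/a.txt
printf '%s\n' 'hay' 'DS beta'    > demo/sub/b.txt

# One recursive grep over the whole tree; -n prefixes line numbers,
# and file names are printed automatically when multiple files match.
grep -rn -e 'FN' -e 'DS' demo
```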

However, you can run multiple jobs in the background by sending each grep call inside your for loop to the background, redirecting its output to a separate file, i.e. grep ... > file1 &, and later joining the resulting output files into one output file if you want ... That will run multiple instances in the background and greatly speed up the loop ... Please see the demonstration below.

For demonstration purposes I will use (sleep N; echo "something" > fileN) & in place of grep ... > file1 & ... The sub-shell syntax (...;...) is necessary when sending multiple nested commands to the background, but not needed for a single command:

$ # Creating some background jobs/processes
i=0
pids=()
for f in file1 file2 file3
  do
  # Start incrementing a counter to use in filenames and calculating sleep seconds.
  ((i++))
  # Send command/s to background
  (sleep $((10*i)); echo "$f $(date)" > "${f}_${i}") &
  # Add background PID to array
  pids+=( "$!" )
  done

# Output:
[1] 31335
[2] 31336
[3] 31338

$ # Monitoring and controlling the background jobs/processes
while sleep 5;
  do
  echo "Background PIDs are: ${pids[*]}"
  for index in "${!pids[@]}"
    do
    if kill -0 "${pids[index]}" &> /dev/null;
      then
      echo "${pids[index]} is running"
      # Do whatever you want here if the process is running ... e.g. kill "${pids[index]}" to kill that process.
      else
      echo "${pids[index]} is not running"
      unset 'pids[index]'
      # Do whatever you want here if the process is not running.
      fi
    done
  if [[ "${#pids[@]}" -eq 0 ]]
    then
    echo "Combined output files contents:"
    cat file*
    unset i
    unset pids
    break
    fi
  done

# Output:
Background PIDs are: 31335 31336 31338
31335 is running
31336 is running
31338 is running
[1]   Done                    ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31335 31336 31338
31335 is not running
31336 is running
31338 is running
Background PIDs are: 31336 31338
31336 is running
31338 is running
[2]-  Done                    ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31336 31338
31336 is not running
31338 is running
Background PIDs are: 31338
31338 is running
[3]+  Done                    ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31338
31338 is not running
Combined output files contents:
file1 Fri Mar 31 12:20:47 AM +03 2023
file2 Fri Mar 31 12:20:57 AM +03 2023
file3 Fri Mar 31 12:21:07 AM +03 2023

Please also see Bash Job Control.
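As a minimal sketch of stopping only your own jobs (sleep stands in for the real grep calls; PIDs are collected exactly as in the loop above):

```shell
# Start two stand-in background jobs and record their PIDs.
pids=()
sleep 30 & pids+=("$!")
sleep 30 & pids+=("$!")

# Stop only the processes this script started, then reap them
# so no "Terminated" job notices or zombie entries linger.
kill "${pids[@]}"
wait "${pids[@]}" 2>/dev/null
echo "all recorded background jobs stopped"
```

Because the kill targets only the recorded PIDs, other background jobs in the same shell are left untouched, unlike kill $(jobs -p).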

jp flag
Yes, you are correct. It is problematic when matches from different files get mixed.
jp flag
I could use `--recursive`, but the search would still be sequential. Are there other solutions? I do not want grep itself to run faster; I want to run multiple instances, but with the ability to stop the grep processes easily when needed.
jp flag
@Backy You can send `grep` inside the `for` loop to the background, redirecting its output to a separate file, i.e. `grep … > file1 &` … That would run multiple instances of it in the background and greatly speed up the loop.
jp flag
Ok, so I would refrain from an approach that calls parallel. How could I make it easy to stop the processes with a single command?
jp flag
@Backy You can kill all background jobs/processes at once with e.g. `kill $(jobs -p)` or selectively see e.g. [Bash Job Control Basics](https://www.gnu.org/software/bash/manual/html_node/Job-Control-Basics.html)
jp flag
I want to selectively kill only the ones my script generates, without users having to search for them to stop them.
jp flag
@Backy I updated the answer for that ... And only the owner or the super user can kill a process.
Score:0
ph flag

This is one of GNU Parallel's documented examples:

https://www.gnu.org/software/parallel/parallel_examples.html#example-parallel-grep

If you are grepping the same files again and again, maybe this is usable too: https://stackoverflow.com/a/11913999/363028
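A minimal sketch of that pattern (the file names and the pattern needle are hypothetical; assumes GNU parallel is installed): the file list is fanned out to one grep job per core, and -k keeps the output blocks in input order:

```shell
# Two hypothetical files to search.
printf '%s\n' 'needle one' 'hay'  > f1.txt
printf '%s\n' 'hay' 'needle two'  > f2.txt

# One grep job per file, run concurrently; -H forces the file-name
# prefix and parallel's -k preserves the input order of the results.
printf '%s\n' f1.txt f2.txt | parallel -k grep -Hn needle {}
```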
