`grep` is one of the most refined and time-proven tools performance-wise ... please see, for example, the speed comparison of `grep` with other text-processing tools on very large 1G+ files with 8M+ lines here: https://askubuntu.com/a/1420653

Also, proper text-processing (i.e. preserving each file's separate output with the correct line order) is not, IMHO, a suitable task for `parallel`, because, as you noticed, it will mix the results from different files and shift their line order. You did use `parallel`'s `-k` option to keep the output order the same as the input order, but that might only work as intended if:
- You limit the parallel jobs to 1, i.e. `-j 1` (also spelled `--max-procs 1` or `-P 1`).
- You make sure the text is passed in the right order, e.g. by piping the actual text (in the right order/sequence) to `parallel` and using its `--pipe` option to pipe the text on to `grep` afterwards.
That, however, defeats your intended purpose of running multiple jobs in parallel, and therefore the added speed gain (if any) is negligible.
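For illustration only, that order-preserving (but effectively serial) invocation might look roughly like the sketch below; the pattern `PATTERN` is a hypothetical placeholder:

$ # Feed the text in the desired order; -k keeps the output order, -j 1 forbids actual parallelism
cat file1 file2 file3 | parallel -k -j 1 --pipe grep 'PATTERN'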
Also, using a `for` loop requires `grep` to run fully once for each file/argument in the loop's head, with virtually the same match pattern(s) for every file ... so it might not be the best approach when you are trying to speed things up. You might be better off using e.g. `grep`'s `--recursive` option in that case.
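For example, a single `grep` invocation can handle all the files itself (a sketch; the pattern `PATTERN` and the directory `dir/` are hypothetical placeholders):

$ # One grep process searching a whole directory tree
grep --recursive 'PATTERN' dir/
$ # ... or one grep process given all the files at once (-H prefixes each match with its filename)
grep -H 'PATTERN' file1 file2 file3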
However, you can run multiple jobs in the background by sending each `grep` call inside your `for` loop to the background, redirecting its output to a separate file, i.e. `grep ... > file1 &`, and then later joining the resulting output files into one output file if you want. That would run multiple instances of `grep` in the background and greatly speed up the loop ... please see the demonstration below.
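Applied to `grep` itself, the idea could look roughly like the following sketch (the pattern `PATTERN` and the `.out`/`combined.out` filenames are hypothetical placeholders):

$ # Run one grep per file in the background, each writing to its own output file
for f in file1 file2 file3
do
    grep 'PATTERN' "$f" > "${f}.out" &
done
# Wait for all the background jobs to finish, then join the per-file results in order
wait
cat file1.out file2.out file3.out > combined.out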
For demonstration purposes I will use `(sleep N; echo "something" > fileN) &` in place of `grep ... > file1 &`. The sub-shell syntax `( ...; ... )` is necessary if you're sending multiple nested commands to the background, but it is not needed for a single command:
$ # Creating some background jobs/processes
i=0
for f in file1 file2 file3
do
    # Increment a counter used in the filenames and for calculating the sleep seconds
    ((i++))
    # Send the command(s) to the background
    (sleep $((10*i)); echo "$f $(date)" > "${f}_${i}") &
    # Add the background PID to an array
    pids+=( "$!" )
done
# Output:
[1] 31335
[2] 31336
[3] 31338
$ # Monitoring and controlling the background jobs/processes
while sleep 5
do
    echo "Background PIDs are: ${pids[@]}"
    for index in "${!pids[@]}"
    do
        if kill -0 "${pids[index]}" &> /dev/null
        then
            echo "${pids[index]} is running"
            # Do whatever you want here if the process is running ... e.g. kill "${pids[index]}" to kill that process.
        else
            echo "${pids[index]} is not running"
            unset 'pids[index]'
            # Do whatever you want here if the process is not running.
        fi
    done
    if [[ "${#pids[@]}" -eq 0 ]]
    then
        echo "Combined output files contents:"
        cat file*
        unset i
        unset pids
        break
    fi
done
# Output:
Background PIDs are: 31335 31336 31338
31335 is running
31336 is running
31338 is running
[1] Done ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31335 31336 31338
31335 is not running
31336 is running
31338 is running
Background PIDs are: 31336 31338
31336 is running
31338 is running
[2]- Done ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31336 31338
31336 is not running
31338 is running
Background PIDs are: 31338
31338 is running
[3]+ Done ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31338
31338 is not running
Combined output files contents:
file1 Fri Mar 31 12:20:47 AM +03 2023
file2 Fri Mar 31 12:20:57 AM +03 2023
file3 Fri Mar 31 12:21:07 AM +03 2023
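If you don't need to monitor or control the individual jobs while they run, a simpler variant (just a sketch) is to block on all the recorded PIDs with the `wait` builtin and then combine the results:

$ # Block until every background PID has finished, then show the combined results
wait "${pids[@]}"
cat file*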
Please also see Bash Job Control.