I need to parse a file containing records beginning with a keyword and
selectively write those records to separate category files. After this
is done, the records that were extracted need to be removed from the main
file. Using bash, is grepping the best way?
The following function and loop do the extraction part:
declare -a keywords
keywords=(alpha bravo gamma delta)
# Append every record that starts with the given keyword to its category file.
extract_records() {
    grep -E "^($1 )" main_file >> "category_file.$1"
}
for i in "${keywords[@]}"; do
    extract_records "$i"
done
Then to re-compress the main file:
grep -E -v \
    -e "^alpha " \
    -e "^bravo " \
    -e "^gamma " \
    -e "^delta " \
    main_file >> main_file.$$
The main_file.$$ file is optionally sorted, replacing the original.
Here, the keyword list is specified twice. It would be better to use the array again for the re-compress part so that only one
keyword list is needed, something like:
grep -E -v "^(${keywords[@]})" main_file >> main_file.$$
but this does not work as each keyword needs a pattern specifier.
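One possible way to reuse the array here is to join its elements with `|` into a single alternation, so that grep receives exactly one pattern. The following is only a minimal sketch: the `alternation`, `pattern`, `remaining` and `workdir` names and the sample records are illustrative, and it assumes the keywords contain no regex metacharacters.

```shell
#!/usr/bin/env bash
# Sketch: derive one alternation pattern from the keywords array.
keywords=(alpha bravo gamma delta)

# Join the elements with '|' inside the command substitution,
# so that IFS is not changed for the rest of the script.
alternation=$(IFS='|'; printf '%s' "${keywords[*]}")
pattern="^(${alternation}) "

# Illustrative sample input in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'alpha 1' 'bravo 1' 'omicron 1' 'sigma 1' 'delta 2' > "$workdir/main_file"

# -v inverts the match, so only records without a listed keyword remain.
remaining=$(grep -E -v "$pattern" "$workdir/main_file")
printf '%s\n' "$remaining"
rm -rf "$workdir"
```

The unquoted `${keywords[@]}` inside the original pattern expands to several separate words, which is probably why grep did not see a single expression.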
Is there a simpler way? Can each record be removed as it's being
extracted, rather than this two-part approach? The keyword list can
be hundreds, loaded from a file, and read into the array (not shown
here). Managing two sets of keywords is prone to a mismatch and loss
of data. Outside of bash, is there a Python or other solution?
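For the single-pass idea, awk (called from the shell) might be a candidate instead of grep, since it can route each record to a file as it is read. This is a hedged sketch, not a tested production script: `keywords.txt` and `main_file.new` are hypothetical names, and it assumes one keyword per line and records whose first whitespace-separated field is the keyword.

```shell
#!/usr/bin/env bash
# Sketch: split main_file in a single pass (file names are illustrative).
workdir=$(mktemp -d)   # scratch directory so no real data is touched
cd "$workdir" || exit 1

# Hypothetical keyword file, one keyword per line.
printf '%s\n' alpha bravo gamma delta > keywords.txt
printf '%s\n' 'alpha 1' 'bravo 1' 'omicron 1' 'alpha 2' > main_file

# Pre-create the remainder file so the mv below works even if every record matches.
: > main_file.new

awk '
    # First file: remember every keyword.
    NR == FNR { wanted[$1] = 1; next }
    # Second file: append matching records to the category file for that keyword.
    ($1 in wanted) { print >> ("category_file." $1); next }
    # Everything else is kept for the new main file.
    { print > "main_file.new" }
' keywords.txt main_file && mv main_file.new main_file
```

Only one keyword list is involved, so extraction and removal cannot drift apart. Some awk implementations limit the number of simultaneously open output files, so with hundreds of keywords a `close()` after each append may be needed.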
EDIT: Mon Feb 20 22:52:53 EST 2023
A minimal representative example follows...
The original main_file consists of 10 data records:
alpha 1
bravo 1
gamma 1
delta 1
omicron 1
sigma 1
alpha 2
bravo 2
gamma 2
delta 2
The category files created by extraction...
category_file.alpha
alpha 1
alpha 2
category_file.bravo
bravo 1
bravo 2
category_file.gamma
gamma 1
gamma 2
category_file.delta
delta 1
delta 2
The resultant compressed main_file is to hold what was not extracted:
omicron 1
sigma 1
On steeldriver's suggestion, creation of an exclusion_args array from the keywords array does compile the correct patterns for what could be an inversion grep and does SOLVE my "two independent lists" problem:
declare -a exclusion_args
for k in "${keywords[@]}"; do exclusion_args+=( -e "\"^$k \"" ); done
printf "%s " "${exclusion_args[@]}"
-e "^alpha " -e "^bravo " -e "^gamma " -e "^delta "
The above string reproduces what was first posted with the "grep -E -v" as
far as the inversion patterns. Correct, the grouping parens were not needed.
Now in what way can the above string be used as additional arguments to the
following grep mock-up:
grep -E -v $(printf " %s " "${exclusion_args[@]}") main_file
If correct, this should yield only the omicron and sigma records.
An echo of the above printf unexpectedly removes the leading -e, as seen here:
"^alpha " -e "^bravo " -e "^gamma " -e "^delta "
and of course this disrupts the inversion pattern for grep. Perhaps this is why the inversion grep returns all 10 records and excludes none.
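A likely explanation, offered as a sketch rather than a definitive diagnosis: the escaped quotes stored in each array element become literal characters of the pattern, the unquoted `$(printf ...)` is word-split on spaces, and bash's `echo` takes a leading `-e` as its own option. Storing the bare pattern and expanding the array directly as `"${exclusion_args[@]}"` avoids all three. The `workdir` and `remaining` names below are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: keep each option and each pattern as its own array element.
keywords=(alpha bravo gamma delta)

exclusion_args=()
for k in "${keywords[@]}"; do
    # No escaped quotes: the pattern is ^keyword followed by one space.
    exclusion_args+=( -e "^$k " )
done

# The 10 example records, written to a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'alpha 1' 'bravo 1' 'gamma 1' 'delta 1' 'omicron 1' \
    'sigma 1' 'alpha 2' 'bravo 2' 'gamma 2' 'delta 2' > "$workdir/main_file"

# Quoted array expansion passes every element to grep as one intact argument.
remaining=$(grep -E -v "${exclusion_args[@]}" "$workdir/main_file")
printf '%s\n' "$remaining"
rm -rf "$workdir"
```

To inspect the array without echo interfering, `printf '%s\n' "${exclusion_args[@]}"` prints one element per line.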
And there might be a better way to take the input to the output than this design.