
How to Categorically Relocate Records Using Grep

Paul_Capo:

I need to parse a file containing records beginning with a keyword and selectively write those records to separate category files. After this is done, the records that were extracted need to be removed from the main file. Using bash, is grepping the best way?

The following function and loop do the extraction part:

declare -a keywords
declare -f extract_records

keywords=(alpha bravo gamma delta)

extract_records() {
    grep -E "^($1 )" main_file >> category_file.$1
}

for i in "${keywords[@]}"; do
    extract_records "$i"
done
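As a quick sanity check, the extraction loop can be exercised against a throwaway main_file in a scratch directory (the sample records and the reduced keyword list here are illustrative only):

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)" || exit 1

# Sample space-delimited records.
printf '%s\n' 'alpha 1' 'omicron 1' 'alpha 2' > main_file

keywords=(alpha bravo)

extract_records() {
    # Append matching records to the per-keyword category file.
    grep -E "^($1 )" main_file >> "category_file.$1"
}

for i in "${keywords[@]}"; do
    extract_records "$i"
done

cat category_file.alpha
# alpha 1
# alpha 2
```

Note that a keyword with no matches (bravo here) still creates an empty category file, because the redirection happens before grep runs.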

Then to re-compress the main file:

grep -E -v \
  -e "^alpha " \
  -e "^bravo " \
  -e "^gamma " \
  -e "^delta " \
main_file >> main_file.$$

The main_file.$$ is optionally sorted, replacing the original. Here, the keyword list is specified twice. It would be better to use the array again for the re-compress part so that only one keyword list is needed, something like:

grep -E -v "^(${keywords[@]})" main_file >> main_file.$$

but this does not work, as each keyword needs its own pattern argument. Is there a simpler way? Can each record be removed as it is extracted, rather than this two-part approach? The keyword list can run to hundreds of entries, loaded from a file and read into the array (not shown here). Managing two sets of keywords is prone to a mismatch and loss of data. Outside of bash, is there a Python or other solution?

EDIT: Mon Feb 20 22:52:53 EST 2023

A minimal representative example follows...

The original main_file consists of 10 data records:

alpha 1
bravo 1
gamma 1
delta 1
omicron 1
sigma 1
alpha 2
bravo 2
gamma 2
delta 2

The category files created by extraction... main_file.alpha

alpha 1
alpha 2

main_file.bravo

bravo 1
bravo 2

main_file.gamma

gamma 1
gamma 2

main_file.delta

delta 1
delta 2

The resultant compressed main_file is to hold what was not extracted:

omicron 1
sigma 1

On steeldriver's suggestion, building an exclusion_args array from the keywords array does compile the correct patterns for what could be an inversion grep, and it does SOLVE my "two independent lists" problem:

declare -a exclusion_args
for k in "${keywords[@]}"; do exclusion_args+=( -e "\"^$k \"" ); done
printf "%s " "${exclusion_args[@]}"

-e "^alpha " -e "^bravo " -e "^gamma " -e "^delta "

The above string reproduces the inversion patterns of the "grep -E -v" first posted. Correct, the grouping parens were not needed. Now, in what way can the above string be used as additional arguments in the following grep mock-up:

grep -E -v $(printf " %s " "${exclusion_args[@]}") main_file

If correct, this should yield only the omicron and sigma records.

Echoing the above printf output unexpectedly removes the leading -e, as seen here:

"^alpha " -e "^bravo " -e "^gamma " -e "^delta "

and of course this disrupts the inversion pattern for grep. Perhaps this is why the inversion grep returns all 10 records and excludes none.

And there might be a better way to take the input to the output than this design.
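The failure comes from the embedded quote characters: they become literal parts of the patterns, and the unquoted command substitution then word-splits the result rather than passing clean arguments. A sketch of the fix, keeping each -e and each pattern as separate array elements (no quote characters inside the strings) and expanding the array quoted:

```shell
# Scratch directory with a small sample main_file.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'alpha 1' 'omicron 1' 'sigma 1' 'alpha 2' > main_file

keywords=(alpha bravo gamma delta)

# Each -e and each pattern is its own array element; no embedded quotes,
# so no later word-splitting or quote removal is needed.
args=()
for k in "${keywords[@]}"; do
    args+=( -e "^$k " )
done

# Quoted expansion hands grep exactly one argument per element.
grep -v "${args[@]}" main_file
# omicron 1
# sigma 1
```

The quoted "${args[@]}" expansion is the key step: it preserves each element as a single word, including the trailing space in each pattern, which an unquoted $(printf ...) substitution cannot do.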

steeldriver:
While you *could* construct an array of grep arguments from the array of keywords, as `args=( -v -w ); for k in "${keywords[@]}"; do args+=( -e "^$k" ); done` for example (I don't see the need for `-E` or the grouping parentheses, and `-w` is probably more robust than adding an explicit trailing space character), there is likely a better way to achieve what you want - it's hard to propose a solution without a minimal representative example of your input and desired output(s).
Paul_Capo:
Added the info you asked for. See question.
steeldriver:

While you could construct an array of grep arguments from the array of keywords, like

args=( -v -w )
for k in "${keywords[@]}"; do 
  args+=( -e "^$k" )
done 

for example (I don't see the need for -E or the grouping parentheses, and -w is probably more robust than adding an explicit trailing space character), it would be more natural IMHO to use a more fully featured text-processing language. For example, in GNU awk you could read the keywords into an internal array and then process the whole main_file in a single pass:

keywords=(alpha bravo gamma delta)

printf '%s\n' "${keywords[@]}" | gawk -i inplace '
  BEGIN{inplace::enable=0} 
  NR==FNR {keywords[$0]; next} 
  ($1 in keywords) {print > (FILENAME "." $1); next} 
  {print}
' - inplace::enable=1 main_file

resulting in

$ head main_file*
==> main_file <==
omicron 1
sigma 1

==> main_file.alpha <==
alpha 1
alpha 2

==> main_file.bravo <==
bravo 1
bravo 2

==> main_file.delta <==
delta 1
delta 2

==> main_file.gamma <==
gamma 1
gamma 2

(switching inplace::enable on and off is not strictly necessary - it just suppresses a warning about in-place editing for invalid FILENAME '-').


If you are determined to use a shell loop instead, then sed is a more appropriate regex tool than grep, which is intended for searching only rather than searching and replacing. So for example:

for k in "${keywords[@]}"; do 
  sed -i -e "/^$k/{w main_file.$k" -e "d}" main_file
done

However, I'd suggest that even with sed, a better approach would be to turn the keyword array into a sed script and then execute that as a single instruction:

printf '%s\n' "${keywords[@]}" | sed 's:.*:/&/{w main_file.&\nd}:' | 
  sed -i.bak -f - main_file
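For reference, here is the intermediate sed script that the printf pipeline generates; this is a sketch assuming GNU sed, where \n in the replacement text produces a literal newline. The newline matters because the w command's filename must extend to the end of its line, so the d command has to go on the next one:

```shell
keywords=(alpha bravo)

# Each keyword becomes a two-line sed command: write matching lines to
# main_file.<keyword>, then delete them from the stream.
printf '%s\n' "${keywords[@]}" | sed 's:.*:/&/{w main_file.&\nd}:'
# /alpha/{w main_file.alpha
# d}
# /bravo/{w main_file.bravo
# d}
```

Feeding this generated script to the second sed via -f - is what makes the whole operation a single pass driven by one keyword list.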

In fact, I'd choose the latter over the gawk solution if you don't mind hard-coding the basename main_file.

Paul_Capo:
@steeldriver This is excellent, and the sed solution is more efficient, I'm sure. Thank you for this alternate approach.
steeldriver:
@Paul_Capo you're welcome - please be aware that the sed approach may need tweaking if any of your actual keywords contain regex metacharacters. See for example [How to ensure that string interpolated into `sed` substitution escapes all metachars](https://unix.stackexchange.com/questions/129059/how-to-ensure-that-string-interpolated-into-sed-substitution-escapes-all-metac)