grep PCRE still greedy

I'm searching a multi-line text file and want to match from a certain word up to the first occurrence of another word:

start
word1
word1
word1
word1
end
word2
word2
word2
start
word3
word3
word3
end

Here's what I use: grep -Pzo "(?s)start.*?end" file.txt

And it matches everything in the above text from beginning to end, whereas I want to match only up to the first occurrence of end, i.e.:

start
word1
word1
word1
word1
end

What am I doing wrong?

Somehow the non-greedy ? quantifier is not working as I expected.

Thank you for your time and contributions!

I think perhaps it is matching non-greedily - but twice
Isn't that greedy? How could I limit it to only the first match?
Greedy would indeed be matching "everything in the above text string from beginning to end" but that's not what's happening, is it? When I test your expression, lines containing `word2` are missing - if you try `grep -Pzo -b '(?s)start.*?end' file.txt` you will see that it matches once at byte offset 0 and again at byte offset 52
So is it some loop that's forming?

A greedy match would include everything from the first start to the last end, thus:

$ grep -Pzo '(?s)start.*end' file.txt
start
word1
word1
word1
word1
end
word2
word2
word2
start
word3
word3
word3
end

What you are actually seeing is two separate non-greedy matches, output on separate "lines" per the -o option, except that with -z, "lines" are delimited by the null character instead of the newline character:

$ grep -Pzo '(?s)start.*?end' file.txt
start
word1
word1
word1
word1
endstart
word3
word3
word3
end

Since we can't see the null byte here, it's clearer if you add -b to indicate the byte offsets of the two matches within the "line":

$ grep -Pzo -b '(?s)start.*?end' file.txt
0:start
word1
word1
word1
word1
end52:start
word3
word3
word3
end
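
Another way to make the separators visible is to count the null bytes, or to translate them into newlines. The following is a minimal sketch, assuming GNU grep built with PCRE support; the temporary file and the variable names `sample` and `match_count` are only illustrative. With the sample input from the question it should report two matches.

```shell
# Recreate the sample input from the question in a temporary file (illustrative).
sample=$(mktemp)
printf '%s\n' start word1 word1 word1 word1 end \
    word2 word2 word2 start word3 word3 word3 end > "$sample"

# Under -z, each -o match is terminated by a NUL byte.
# Delete everything except NUL bytes and count what is left.
match_count=$(grep -Pzo '(?s)start.*?end' "$sample" | tr -cd '\0' | wc -c)
echo "matches: $match_count"

# Translate the NUL terminators into newlines so each match starts on its own line.
grep -Pzo '(?s)start.*?end' "$sample" | tr '\0' '\n'

rm -f "$sample"
```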

Since the -o outputs are null-separated, you could pipe the result through head -z to get just the first match:

$ grep -Pzo '(?s)start.*?end' file.txt | head -z -n 1
start
word1
word1
word1
word1
end

Alternatively, you could use perl itself:

perl -0777 -nE 'say for /(start.*?end)/s' file.txt

which only prints one match in spite of the for loop since the g flag is omitted.
