Score:1

Sort a file according to a field starting with string

jp flag

Suppose I have a file structured like this:

/home/zz/AUTHORBOOKS/Author-Chomsky-Who-Rules-the-World.epub
/home/zz/AUTHORBOOKS/Author-Cioran-Il-nulla.epub
/home/zz/BOOKS/Author-Artemis-Mathematica-Examples.nb
/home/zz/Books/Author-Zigniwe-History-Medicine.pdf
/home/z1/OLDBOOKS1/OLDBOOKS2/Author-Watanabe-Waterloo.pdf
/home/z2/OLDBOOKS1/OLDBOOKS2/Author-Barbero-Lepanto.epub.pdf

I would like a file sorted this way:

/home/zz/BOOKS/Author-Artemis-Mathematica-Examples.nb
/home/z2/OLDBOOKS1/OLDBOOKS2/Author-Barbero-Lepanto.epub.pdf
/home/zz/AUTHORBOOKS/Author-Chomsky-Who-Rules-the-World.epub
/home/zz/AUTHORBOOKS/Author-Cioran-Il-nulla.epub
/home/z1/OLDBOOKS1/OLDBOOKS2/Author-Watanabe-Waterloo.pdf
/home/zz/Books/Author-Zigniwe-History-Medicine.pdf

That is, alphabetically, according to the string Author-...

As you can see, the position of Author-... is not constant.

How can I do this?

FedKad avatar
cn flag
This question would be more interesting if there did not exist a separator (like `-`) to start "key 2". For example, how can we sort using the _whole_ file name?
waltinator avatar
it flag
Read `man sort` and Knuth's book "Sorting and Searching", volume 3 of The Art of Computer Programming.
ar flag
If you think one of the answers is correct, accept it by clicking on the gray check mark ✔️ next to that answer to turn it green ✅. This will help others.
Score:3
hr flag

Although it's overkill for the present example because of the solution proposed in user68186's answer, you could more generally do something like this in GNU awk:

gawk -F/ '
  function mycmp(i1,v1,i2,v2) {
    m = split(v1,a);
    n = split(v2,b);
    return a[m]"" > b[n]"" ? 1 : a[m]"" < b[n]"" ? -1 : 0
  }
  {
    lines[NR] = $0
  }
  END {
    PROCINFO["sorted_in"] = "mycmp";
    for(i in lines) print lines[i]
  }
' file

Note that it sorts according to the lexical value of everything after the last / - so if the format is Author-<author name>-<title>.<extension> that will be

  • the fixed string Author- (which has no effect, since it has the same weight for all lines); then
  • <author name>-; then
  • <title>.; then
  • <extension>

This is similar to how GNU sort's simple KEYDEF -t- -k2 works i.e. the effective sort key starts from the <author name> and continues to the line end.

An explicit delimiter is omitted from the split calls so that they inherit the value of FS, making it easy to change for systems that use a different path separator. The appended empty strings "" in the mycmp function force lexical comparison even if the filenames look numeric - see for example How awk Converts Between Strings and Numbers.
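
As a small illustration of why the appended empty strings matter, here is a minimal sketch with made-up input (it is not part of the answer above, and cmp_demo.txt is just a hypothetical scratch file). It should behave the same in any POSIX awk, not only gawk:

```shell
# Made-up input: two fields that both look like numbers.
echo '10 9' | awk '{
  num = ($1 < $2)          # fields that look numeric are compared as numbers: 10 < 9 is false
  str = ($1 "" < $2 "")    # appending "" forces string comparison: "10" < "9" is true
  print num, str
}' | tee cmp_demo.txt
```

This prints 0 1: the numeric comparison is false, while the string comparison is true because the character 1 sorts before 9.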


If you'd rather stick with the sort command, you could leverage GNU awk's Two-Way Communications with Another Process to:

  • duplicate the last /-separated field at the start of the string
  • pass the result to a sort command
  • read back the sorted result, remove the duplicated prefix and print

i.e.

gawk -F/ '
  BEGIN {OFS=FS; cmd = "sort -d"} 
  {print $NF $0 |& cmd} 
  END {
    close(cmd,"to"); 
    while(cmd |& getline){$1 = ""; print};
    close(cmd,"from")
  }
' file

There's a bit of a cheat here in that the absolute paths (lines start with /) imply an initial empty field; to handle relative paths you'd need to change print $NF $0 to print $NF,$0 to insert the "missing" separator, and then perhaps use a regex sub() instead of the simpler $1 = "" to remove the leading element.

As well as potentially being faster / more memory efficient than the pure gawk solution, this allows other sort options to be added straightforwardly, for example cmd = "sort -d -t " FS " -k1,1r".
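
The same three steps (duplicate the last field, sort, strip the duplicate) can also be sketched as an ordinary pipeline without a co-process. This is a swapped-in variant rather than the gawk method above. It assumes no file name contains a newline; paths.txt and paths_sorted.txt are hypothetical file names, the sample paths are made up, and LC_ALL=C is added only to make the ordering reproducible:

```shell
# Hypothetical sample mixing absolute and relative paths (not from the question).
printf '%s\n' \
  '/home/zz/BOOKS/Author-Cioran-Il-nulla.epub' \
  'Author-Artemis-Mathematica-Examples.nb' \
  'z1/OLD/Author-Barbero-Lepanto.epub.pdf' > paths.txt

# 1. decorate: prefix every line with "<last field>/"
# 2. sort the decorated lines (C locale, plain byte order)
# 3. undecorate: drop everything up to and including the first "/"
awk -F/ '{ print $NF "/" $0 }' paths.txt |
  LC_ALL=C sort |
  cut -d/ -f2- > paths_sorted.txt

cat paths_sorted.txt
```

Because the separator is always inserted explicitly and cut removes exactly one leading field, relative and absolute paths are handled the same way. Lines with identical file names end up ordered by their full path.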

ar flag
Wonderful! I don't quite follow what's going on in the "mycmp" function. Could you point me to a web tutorial? Thanks!
hr flag
@user68186 custom sort functions are discussed in the GNU Awk User's Guide at [Controlling Array Traversal](https://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Array-Traversal) BTW please turn your comment into an answer!
ar flag
Thanks for the link, I understand it a little better, but there is much to learn in `gawk`. And as you wished I have turned my comment to an answer. :)
ar flag
+1 for the added explanation! Now a bit on `return a[m]"" > b[n]"" ? 1 : a[m]"" < b[n]"" ? -1 : 0` would be great! Thanks again.
hr flag
@user68186 it's just two [ternary conditionals](https://www.gnu.org/software/gawk/manual/gawk.html#Conditional-Exp) in a trenchcoat
Score:3
ar flag

Try the following command:

sort -t- -d -k2 -o output.txt input.txt

It has four options plus the name of the input file input.txt. If this file is not in the current directory you will have to provide the path/to/the/folder/input.txt. The options and their arguments are as follows:

  • -t sets the field separator. We use - as the separator, so each - splits the line into separate fields (columns).
  • -d selects dictionary order: only blanks and alphanumeric characters are considered when comparing, so punctuation such as - and . is ignored. For example, Apple sorts before Berry.
  • -k2 sets the sort key. Fields are numbered from 1, so the first field is everything before the first - (for example, /home/zz/BOOKS/Author) and the second field starts right after it (for example, Artemis). Because no end field is given, the key runs from the start of the second field to the end of the line; -k2,2 would compare the second field only.
  • -o output.txt redirects the sorted output to a file rather than to the terminal.
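
As a quick check, the command can be run on the six paths from the question. This is only a sketch: LC_ALL=C is an addition here (not part of the command above) to make the ordering independent of the system locale.

```shell
# The six paths from the question, written to input.txt.
cat > input.txt <<'EOF'
/home/zz/AUTHORBOOKS/Author-Chomsky-Who-Rules-the-World.epub
/home/zz/AUTHORBOOKS/Author-Cioran-Il-nulla.epub
/home/zz/BOOKS/Author-Artemis-Mathematica-Examples.nb
/home/zz/Books/Author-Zigniwe-History-Medicine.pdf
/home/z1/OLDBOOKS1/OLDBOOKS2/Author-Watanabe-Waterloo.pdf
/home/z2/OLDBOOKS1/OLDBOOKS2/Author-Barbero-Lepanto.epub.pdf
EOF

# Sort on everything after the first "-" and write the result to output.txt.
LC_ALL=C sort -t- -d -k2 -o output.txt input.txt

cat output.txt
```

The lines in output.txt come out in the order Artemis, Barbero, Chomsky, Cioran, Watanabe, Zigniwe, which is the order asked for in the question.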

Hope this helps
