Score:3

Bash sort lines starting with punctuation in non-dictionary order

br flag

I have a file containing lines, some of which start with a ! character, some with a ? character, and some with a space ( ) character. The second character is always a letter of the alphabet.

When I try to use the sort command from coreutils in bash, it seems to ignore the first character and sort according to the second only.

This surprised me very much, as I assumed sort would order the punctuation characters by their ASCII value, and lump all the ! lines together, followed by all the ? lines together, etc.

In particular, the documentation says there's a -d option, which explicitly instructs the sort command to ignore such punctuation marks. But what I want is the opposite behaviour, and there's no option to 'reverse' this behaviour. It's as if the -d option has been "baked in" somehow.

I have checked, and as far as I know, I don't have an alias defined somewhere that might activate the -d flag by accident.

Is this a bug in sort (coreutils v8.32)? Is there a way to force it NOT to sort by dictionary order but by strict ASCII value?

OS: Linux Mint 21.1 (based on ubuntu jammy, afaik) in case this is relevant

EDIT: Providing locale and MVP as requested

$ locale
LANG=en_GB.UTF-8
LANGUAGE=en_GB:en
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC=en_GB.UTF-8
LC_TIME=en_GB.UTF-8
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY=en_GB.UTF-8
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER=en_GB.UTF-8
LC_NAME=en_GB.UTF-8
LC_ADDRESS=en_GB.UTF-8
LC_TELEPHONE=en_GB.UTF-8
LC_MEASUREMENT=en_GB.UTF-8
LC_IDENTIFICATION=en_GB.UTF-8
LC_ALL=

$ echo '
> !a
> ?b
>  c
> !f
>  e
> ?d' | sort 

!a
?b
 c
?d
 e
!f
tm flag
What's your locale setting? Can you post a sample of the input?
br flag
@choroba done. see edited question. (the `>` characters denote line continuation input in the terminal)
Score:5
hr flag

You probably want to sort in the C locale. For example, given

$ printf '%2s\n' '!a' '?b' 'c' '!f' 'e' '?d'
!a
?b
 c
!f
 e
?d

then

$ printf '%2s\n' '!a' '?b' 'c' '!f' 'e' '?d' | LC_COLLATE=C sort
 c
 e
!a
!f
?b
?d

Or, perhaps better, use LC_ALL=C, since according to the info page for sort, LC_COLLATE is affected by other variables:

---------- Footnotes ----------

(1) If you use a non-POSIX locale (e.g., by setting ‘LC_ALL’ to ‘en_US’), then ‘sort’ may produce output that is sorted differently than you’re accustomed to. In that case, set the ‘LC_ALL’ environment variable to ‘C’. Note that setting only ‘LC_COLLATE’ has two problems. First, it is ineffective if ‘LC_ALL’ is also set. Second, it has undefined behavior if ‘LC_CTYPE’ (or ‘LANG’, if ‘LC_CTYPE’ is unset) is set to an incompatible value. For example, you get undefined behavior if ‘LC_CTYPE’ is ‘ja_JP.PCK’ but ‘LC_COLLATE’ is ‘en_US.UTF-8’.
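
As a minimal sketch of the LC_ALL=C suggestion, the following reuses the sample input from this answer. The function names sample_lines and sort_bytewise are made up for illustration; only printf, sort, and the LC_ALL variable come from the thread.

```shell
# Sample input from the question: six lines, each starting with
# '!', '?' or a space, followed by a letter.
# (sample_lines is a hypothetical helper name.)
sample_lines() {
    printf '%2s\n' '!a' '?b' 'c' '!f' 'e' '?d'
}

# LC_ALL takes precedence over LC_COLLATE and LANG, so sort compares
# lines byte by byte regardless of the surrounding environment.
# (sort_bytewise is a hypothetical helper name.)
sort_bytewise() {
    LC_ALL=C sort
}

sample_lines | sort_bytewise
```

This should print the two space-prefixed lines first, then the ! lines, then the ? lines, since the C locale compares by byte value (space is 0x20, ! is 0x21, ? is 0x3F). Setting the variable only for the single sort invocation leaves the rest of the shell session in its usual locale.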

br flag
That's great, thanks! This "solves" the problem, though if you could explain a bit more why this discrepancy between en_GB vs C exists, or why the LC_COLLATE variable is the relevant one to override in this case and what it does exactly, that would be great! Also, your solution seems to break the `-d` switch in an equally interesting way ... but I guess that's a question for another day. :)
hr flag
@TasosPapastylianou tbh I don't understand locales sufficiently to answer with confidence - I *think* the short answer is that punctuation is ignored (or more precisely, is assigned equal collation weight) in non-C locales - see for example [strange behavior of sort](https://unix.stackexchange.com/a/577256/65304). I may even be wrong about which variable to set (`info sort` says to use `LC_ALL` not `LC_COLLATE` since the latter is affected by other variables).
br flag
Thank you. The linked answer was very useful in understanding what collation order is, and makes the above behaviour slightly less cryptic (locales are such a mess though...)