Score:2

Get stats on the number of files per unique file extension, including binaries, from the command line


I am trying to recursively count the distinct file extensions in a directory and how often each one occurs, and I also want the number of files with no extension. I am trying:

find . -type f -name "*.*" | grep -o -E "\.[^\.]+$" | 
  grep -o -E "[[:alpha:]]{2,16}" | awk '{print tolower($0)}' | 
   sort | uniq -c | sort -rn

But this does not display binaries with no extensions. How can I do this correctly?

UPDATE: My main intention is to find out how many terraform / terraform plugin binaries have been downloaded into one particular folder, and their total size, so that I can review them and delete them.

pl flag
Remove the `-name "*.*"` part?
anjanesh avatar
bl flag
It's still not showing. I had it as `*` alone.
pl flag
It shows a list of files here, and includes all files. `find . -type f | grep -o -E "\.[^\.]+$" | grep -o -E "[[:alpha:]]{2,16}" | awk '{print tolower($0)}' | sort | uniq -c | sort -rn`
hr flag
Perhaps it would be helpful to *explain what outcome you want*? `grep -o -E "\.[^\.]+$"` is also going to exclude results that don't have "dot extensions" (btw you don't need to escape `.` inside `[]`). If you are using that as a kind of `basename` function there are simpler alternatives
anjanesh avatar
bl flag
"Perhaps it would be helpful to explain what outcome you want?" - I have an `uploads` folder which is 800GB and I want to know what's in there. I need to offload binaries, which are mainly terraform binaries.
hr flag
So... what does that mean *programmatically*? Do you want to find files with a particular name pattern? a particular size range? a particular mime type? I'm trying to understand how your pipeline is intended to implement your goal
terdon avatar
cn flag
Please [edit] your question and i) explain what you are trying to do and ii) show us a few example file names and the output you expect from that example. Make sure to include both things that should be found and things that should not be found.
Score:3
cn flag

The -name "*.*" in your find already excludes any file with no extension, since it matches only names that contain at least one dot. Then your grep explicitly drops files without extensions as well:

grep -o -E "\.[^\.]+$" 

This will print the extension, but if the file has no extension it will print nothing:

$ echo 'foo.txt' | grep -o -E "\.[^\.]+$" 
.txt
$ echo 'foo' | grep -o -E "\.[^\.]+$" 
$

If what you want is to count the number of occurrences for each extension or no extension, try this instead:

find . -type f | 
    awk -F'.' '{
                 if(NF>2){ 
                    ext=tolower($NF)
                    k["."ext]++ } 
                 else{ k["no extension"]++ } 
               }
               END{ for(i in k){ print i":"k[i] } }'

For example, given this directory:

$ touch file{1..3}.txt file.jpg file.JPG file."weird extension with spaces" file
$ ls
 file   file1.txt   file2.txt   file3.txt   file.jpg  file.JPG 
 'file.weird extension with spaces'

You get:

$ find . -type f | 
>     awk -F'.' '{
>                  if(NF>2){ 
>                     ext=tolower($NF)
>                     k["."ext]++ } 
>                  else{ k["no extension"]++ } 
>                }
>                END{ for(i in k){ print i":"k[i] } }'
no extension:1
.txt:3
.weird extension with spaces:1
.jpg:2

Note that this treats the name of a hidden file without an extension as an extension: something named .hidden is counted as one hit for the extension .hidden. Hidden files with extensions are counted correctly (i.e. .hidden.txt increases the hits for the .txt extension). I don't know how you want to handle that, since extensions are largely cosmetic in Linux.
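One more caveat: because awk splits the whole path on dots, a dot in a directory name (say ./dir.d/file) would also be reported as an extension. If that matters, a possible variant is to look only at each file's basename and to skip the leading dot of hidden files. This is only a sketch: it assumes GNU find (for -printf) and file names without newlines, the function name count_ext is made up for illustration, and it prints "count extension" sorted by count rather than the "extension:count" format above.

```shell
# Count files per extension, looking only at each file's basename.
# Assumptions: GNU find (-printf), no newlines in file names.
# "count_ext" is a hypothetical helper name, not an existing tool.
count_ext() {
    find "${1:-.}" -type f -printf '%f\n' |
        awk '{
            name = $0
            sub(/^\.+/, "", name)          # ignore the leading dot(s) of hidden files
            n = split(name, parts, ".")    # split the basename on literal dots
            if (n > 1 && parts[n] != "") { k["." tolower(parts[n])]++ }
            else                         { k["no extension"]++ }
        }
        END { for (i in k) print k[i], i }' |
        sort -rn                           # most frequent extension first
}
```

Run it as count_ext . (or pass another directory). With this variant, .hidden and a trailing-dot name like "file." both land in "no extension".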

By the way, whether a file is binary or not has nothing to do with whether it has an extension or not.
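If the real goal is to find binaries and their total size, it may be more reliable to classify files by content instead of by name. The following is a rough sketch, not a tested solution for your 800GB folder: it assumes the file(1) utility is installed and reports "binary" as the MIME encoding for the files you care about, it skips empty files, and the function name binary_summary is invented for this example.

```shell
# Count files that file(1) classifies as binary and sum their sizes.
# Assumptions: file(1) is installed; "binary_summary" is a hypothetical name.
binary_summary() {
    find "${1:-.}" -type f -size +0 -exec sh -c '
        for f do
            # -b: omit the file name, --mime-encoding: print only the encoding
            if [ "$(file -b --mime-encoding "$f")" = binary ]; then
                wc -c < "$f"               # size of this file in bytes
            fi
        done' sh {} + |
        awk '{ n++; total += $1 }
             END { printf "%d binary files, %d bytes\n", n, total }'
}
```

Call it as binary_summary uploads. Note that it counts anything file considers binary (images, archives and so on), not just terraform binaries, so you would probably still want to narrow the find with a -name or -path test first.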

hr flag
+1 was about to post essentially the same except with GNU awk to roll in the sort-by-count `gawk -F. '{c[tolower(NF > 1 ? $NF : "none")]++} END{PROCINFO["sorted_in"] = "@val_num_desc"; for(i in c) print c[i], i}'`
terdon avatar
cn flag
Heh, great minds and all that, @steeldriver! I think it's still worth posting yours though precisely because of the nice gawk trick. But I will add the tolower, I forgot that from the OP.
hr flag
... or fools ;) I'm happy for you to include it as a variant, tbh I'm still waiting for the OP to clarify exactly how this helps solve "Need to offload binaries which are mainly terraform binaries."
Raffa avatar
jp flag
+1 ... But does it need to be that complicated :) ... What's wrong with e.g. `awk -F'.' '{ext[tolower($3)]++} END {for (t in ext) print t, ext[t]}'` ?
terdon avatar
cn flag
@Raffa first, that won't print the number of files with no extension. If you have a file named `foo`, it will simply not be counted and that's one of the main requirements of the OP. Second, if you have a file named `foo.bar.baz.txt`, yours will take `bar` as the extension and not `.txt`. That's why I'm using `$NF` and not `$3` in mine. Finally, I also wanted to include the `.` in the extension for clarity, hence the `k["."ext]++` as opposed to `ext[tolower($NF)]++`.
Raffa avatar
jp flag
It does count files with no extension, but only the number is printed (*e.g. 10*) ... The other `$NF` idea, however, might indeed be worth the little extra complexity :-)
terdon avatar
cn flag
Ah yes, you're right, @raffa. But exactly, since only a space and a number were printed, I didn't even notice it!