A few things to note: that 40Gbps NIC is only connected to one socket/NUMA node, so if it's attached to the second socket then any file-sharing workload handled by the first socket has to traverse the QPI bus to reach the NIC, turning your second socket into a part-time I/O controller for the first. If the second socket isn't really being used, consider removing it entirely, moving its memory to the first socket's slots and moving the NIC to a slot attached to the first socket. Sometimes less is more :)
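If you want to check that locality before pulling hardware, Linux exposes a NIC's NUMA node through sysfs. A minimal sketch, assuming a Linux box and a hypothetical interface name `eth0` (substitute your 40Gbps NIC):

```python
# Sketch: ask which NUMA node a NIC's PCIe device reports (Linux sysfs).
# "eth0" is a hypothetical interface name - substitute your 40Gbps NIC.
from pathlib import Path
from typing import Optional

def nic_numa_node(nic: str) -> Optional[int]:
    node_file = Path(f"/sys/class/net/{nic}/device/numa_node")
    if node_file.exists():
        # 0 = first socket, 1 = second socket, -1 = no affinity reported
        return int(node_file.read_text().strip())
    return None  # virtual NIC, or no such interface

node = nic_numa_node("eth0")
```

If that returns 1 while your workload runs on socket 0, you're paying the QPI tax described above.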
Secondly, 40Gbps interfaces are often just a pre-bundled set of 4 x 10Gbps links in an LACP/EtherChannel fashion. That's fine if your server talks to lots of varied MAC addresses, but if you're always talking to the same MAC (say one client, or another switch) the per-flow hashing can limit you to roughly 10Gbps of bandwidth. This is one of the reasons we've moved from 40Gbps to 25Gbps NICs.
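To make the hashing point concrete, here's a simplified sketch of a layer-2 bonding hash (modelled on Linux bonding's layer2 policy, which XORs the last byte of the source and destination MACs modulo the member count) - the example MAC addresses are made up:

```python
# Sketch: a simplified layer-2 LACP hash. Every frame between the same
# two MACs lands on the same member link of a hypothetical 4 x 10Gbps
# bundle, so one server<->client pair never exceeds ~10Gbps.
def pick_link(src_mac: str, dst_mac: str, n_links: int = 4) -> int:
    # XOR the last octet of each MAC, modulo the number of member links.
    s = int(src_mac.split(":")[-1], 16)
    d = int(dst_mac.split(":")[-1], 16)
    return (s ^ d) % n_links

# Same MAC pair -> same link, every time:
link = pick_link("00:1a:2b:3c:4d:5e", "00:1a:2b:3c:4d:ff")
```

Hashing on layer 3+4 fields (IPs and ports) spreads flows better, but a single flow is still pinned to one member link.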
On top of that you've got a software RAID 0 setup, which takes away not just CPU resource but, importantly, resources used by the kernel. A lot of people don't realise this, but it doesn't matter how many cores you have: most kernels concentrate kernel work - process scheduling, memory allocation, interrupt handling, I/O and so on - on a limited set of typically low-numbered cores unless tuned otherwise. Software RAID has its benefits, but hardware RAID has its own too, and one of them is the comparatively low CPU overhead it usually carries.
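You can see that concentration for yourself on a Linux box by tallying `/proc/interrupts` - a rough sketch, assuming an untuned system where IRQs often pile onto the low-numbered cores:

```python
# Sketch: tally interrupt counts per core from /proc/interrupts (Linux).
# On untuned boxes the busiest core is frequently CPU0.
from collections import defaultdict

per_cpu = defaultdict(int)
with open("/proc/interrupts") as f:
    cpus = f.readline().split()            # header row: CPU0 CPU1 ...
    for line in f:
        fields = line.split()
        if not fields or fields[0] in ("ERR:", "MIS:"):
            continue                       # skip non-per-CPU totals
        for cpu, count in zip(cpus, fields[1:]):
            if count.isdigit():
                per_cpu[cpu] += int(count)

busiest = max(per_cpu, key=per_cpu.get)    # which core eats the IRQs?
```

If the busiest core is also the one your RAID and NIC interrupts land on, software RAID 0 is competing with everything else for it.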
The small-files thing is a pain, yes - almost all file systems are slower dealing with lots of small files than with large ones. In fact I'm struggling to think of any that aren't, and NTFS certainly isn't great here. Ironically, removing jumbo frame support everywhere can actually help a bit with lots of small files.
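The per-file overhead is easy to demonstrate: the same amount of data costs far more when split across many files, because each one pays for an open, metadata update and close. A quick sketch (timings will vary by filesystem, but the gap is the point):

```python
# Sketch: write the same 1 MB of data as 1000 small files vs one file.
# Per-file overhead (open/metadata/close) dominates the small-file case.
import os
import tempfile
import time

def write_files(dirpath: str, n_files: int, size: int) -> None:
    payload = b"x" * size
    for i in range(n_files):
        with open(os.path.join(dirpath, f"f{i}"), "wb") as f:
            f.write(payload)

with tempfile.TemporaryDirectory() as d:
    t0 = time.perf_counter()
    write_files(d, 1000, 1024)        # 1 MB as 1000 x 1 KB files
    small = time.perf_counter() - t0

with tempfile.TemporaryDirectory() as d:
    t0 = time.perf_counter()
    write_files(d, 1, 1024 * 1000)    # the same 1 MB as one file
    large = time.perf_counter() - t0
```

The same asymmetry hits you over the network: each small file is its own round of opens, attribute lookups and closes on the wire.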
Finally, one thing that does worry me about this kind of system is the single point of failure in the single server. If you have the budget, consider moving all of this file service to a central multi-'head' storage array - it should be faster and more reliable/resilient, and easier to support too. Another option might be some kind of centralised or decentralised distributed file system such as Ceph, or perhaps even Windows' own DFS-R. I have a friend who runs a very large renderfarm for the film/movie industry and they use that kind of thing - not cheap though, to be fair.