I'm experimenting with the parameters for Argon2, using argon2_cffi.
While the iteration count (time_cost) and the memory_cost have an obvious bearing on the speed and security of the result, I've not seen any guidance on a maximum for the parallelism parameter, other than to use enough for all the threads you have.
I have a 4-core i5 (not sure whether that counts as 4 or 8 threads). Using time_cost=4, memory_cost=2**20 kiB, data-dependent addressing (Argon2d), with 'password' and 'some salt', I get the following rough timings (a sketch of the measurement loop follows the table):
parallelism    time (s)
          1        3.93
          2        2.14
          4        1.30
          8        1.29
         16        1.28
         32        1.23
         64        1.23
        128        1.26
        256        1.31
        512        1.34
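For reference, here is a minimal sketch of the kind of loop that produces timings like the ones above, using argon2_cffi's low-level API; hash_len=32 and the exact set of parallelism values are illustrative choices on my part, not anything dictated by the library:

```python
import os
import time

from argon2.low_level import hash_secret_raw, Type

print("logical CPUs reported:", os.cpu_count())

secret = b"password"
salt = b"some salt"

# Rough timing loop; each call hashes with the same cost settings as in
# the question, varying only the parallelism parameter.
for parallelism in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512):
    start = time.perf_counter()
    hash_secret_raw(
        secret=secret,
        salt=salt,
        time_cost=4,
        memory_cost=2**20,   # in kiB, i.e. 1 GiB
        parallelism=parallelism,
        hash_len=32,
        type=Type.D,         # data-dependent addressing (Argon2d)
    )
    print(f"parallelism={parallelism:>3}: {time.perf_counter() - start:.2f} s")
```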
Once I'm using all my cores, I hit a time floor, as expected. I was sort of expecting a significant slowdown at large parallelism values, but there is only a hint of one: it runs very slightly slower at silly numbers.
I don't know how the algorithm uses memory, but I imagine that if it breaks the computation into many disjoint blocks to run in parallel, then each block uses less memory and/or executes fewer iterations. Since the published attacks seem to concentrate on examples with few iterations, I could well imagine that it's stronger to run many iterations over a single block than fewer iterations over many blocks that are combined at the end.
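To put rough numbers on that surmise, assuming (as a guess on my part, not something I've verified against the spec) that the memory is split evenly into parallelism independent lanes:

```python
# Back-of-the-envelope only: assumes the memory is divided evenly into
# `parallelism` independent lanes -- my guess, not verified behaviour.
memory_cost_kib = 2**20  # 1 GiB total, as in the timings above

for parallelism in (1, 4, 64, 512):
    per_lane_kib = memory_cost_kib // parallelism
    print(f"parallelism={parallelism:>3}: ~{per_lane_kib} kiB per lane")
```

If that picture is right, at parallelism=512 each lane would only work over about 2 MiB, which is what makes me wonder about the security impact.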
The question is: do large values of parallelism hurt security, and if so, how badly? Perhaps in the way I've surmised? There doesn't appear to be any significant speed cost to over-specifying parallelism.
If I'm targeting a wide range of hardware, do I set parallelism high, to get the full speed advantage on many-core machines? Is there a threshold below which it makes no difference to security?