Fix for #339 (xGetHADs mismatch between SCALAR and SIMD)
Extended the SIMD HAD calculation to be performed in 32-bit rather than 16-bit arithmetic. This fixes #339.
The adaptations for AVX2 are straightforward and should not impact performance too much. For SSE, quite a few more registers are required, so this might impact performance. The SIMD versions of xGetHADs (both AVX2 and SSE41) now produce the same results as the scalar implementation.
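
For illustration only, here is a minimal scalar sketch of why the intermediate width matters. It assumes the mismatch comes from Hadamard intermediates exceeding the signed 16-bit range. `hadamardSad` is a hypothetical helper, not the actual xGetHADs code, and the 16-bit variant merely emulates wrap-around 16-bit lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper (not the real xGetHADs): in-place fast Walsh-Hadamard
// transform followed by a sum of absolute coefficients. T is the type used
// to store intermediates: int16_t emulates 16-bit lanes, int32_t the fix.
template <typename T>
int64_t hadamardSad(const std::vector<int>& residual) {
    const std::size_t n = residual.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // length must be a power of two

    std::vector<T> v(residual.begin(), residual.end());
    for (std::size_t h = 1; h < n; h *= 2) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const int a = v[j];
                const int b = v[j + h];
                // The cast narrows back to T; for int16_t this wraps modulo
                // 2^16 (guaranteed since C++20, usual behaviour before that).
                v[j]     = static_cast<T>(a + b);
                v[j + h] = static_cast<T>(a - b);
            }
        }
    }

    int64_t sum = 0;
    for (T x : v) {
        sum += std::llabs(static_cast<long long>(x));
    }
    return sum;
}
```

With 64 residuals of value 1023 (the largest 10-bit sample difference), `hadamardSad<int32_t>` returns 65472, while `hadamardSad<int16_t>` returns 64 because the DC coefficient wraps around. For small residuals both variants agree.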