Fix SIMD overflow in 12-bit ALF classification
When the internal bit depth is 12, the values in colSums can span the full 16-bit range. Adding to colSums directly with _mm_add_epi16 can then overflow, because the intrinsic wraps around modulo 2^16 instead of widening or saturating.
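
For illustration only, a minimal sketch of the failure mode and one possible way around it. It assumes the sums are unsigned 16-bit lanes. The helper names add_u16_wrapping and add_u16_widened are hypothetical and do not come from the codebase, and widening to 32 bits is shown as one common remedy, not necessarily the change made in this commit.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: adds eight unsigned 16-bit lanes directly.
 * _mm_add_epi16 wraps modulo 2^16, so large sums lose their carry. */
static void add_u16_wrapping(const uint16_t a[8], const uint16_t b[8],
                             uint16_t out[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
}

/* Hypothetical helper: zero-extends each 16-bit lane to 32 bits before
 * adding, so the sum of two 16-bit values always fits. */
static void add_u16_widened(const uint16_t a[8], const uint16_t b[8],
                            uint32_t out[8])
{
    const __m128i zero = _mm_setzero_si128();
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* Interleaving with zero yields lanes 0..3 (lo) and 4..7 (hi)
     * as zero-extended 32-bit integers. */
    __m128i lo = _mm_add_epi32(_mm_unpacklo_epi16(va, zero),
                               _mm_unpacklo_epi16(vb, zero));
    __m128i hi = _mm_add_epi32(_mm_unpackhi_epi16(va, zero),
                               _mm_unpackhi_epi16(vb, zero));

    _mm_storeu_si128((__m128i *)out, lo);
    _mm_storeu_si128((__m128i *)(out + 4), hi);
}
```

With the made-up inputs 40000 and 30000 in every lane, the direct 16-bit add stores 4464 (70000 - 65536), while the widened add stores 70000.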