conv2dtranspose extension

Weijie Bao requested to merge Weijie/sadl-extension:Conv2dT into dev_v7

Dear Franck,

This MR extends the Conv2dtranspose feature. There are three main changes:

  1. Add support for 4x4 kernels.
  2. Speed up the function conv2dtranspose (without SIMD) using two strategies: contiguous data access (which increases L2 cache hits) and indirect addressing.
  3. Based on the function conv2dtranspose, add the function conv2dtranspose_simd256 (with AVX2) and modify conv2dtranspose_simd512, which also becomes faster.
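
For readers unfamiliar with the contiguous-access idea in item 2, here is a minimal sketch of a scalar transposed convolution whose innermost loop walks the output channels with unit stride. It is not the code in this MR: the function names `conv2dt_out_size` and `conv2dt_nhwc`, the channel-last layouts, and the `[kh][kw][ci][co]` weight order are all assumptions made for illustration, and the sketch does not attempt the indirect-addressing strategy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not SADL's API): output extent along one axis of a
// transposed convolution, using the standard formula.
inline int conv2dt_out_size(int in, int kernel, int stride, int pad, int out_pad) {
  return (in - 1) * stride - 2 * pad + kernel + out_pad;
}

// Scatter-form transposed convolution for batch size 1.
// Assumed layouts: input [ih][iw][ci], weights [kh][kw][ci][co], output [oh][ow][co].
std::vector<float> conv2dt_nhwc(const std::vector<float>& in, int ih, int iw, int ci,
                                const std::vector<float>& w, int kh, int kw, int co,
                                int stride, int pad, int out_pad) {
  assert(in.size() == static_cast<std::size_t>(ih) * iw * ci);
  assert(w.size() == static_cast<std::size_t>(kh) * kw * ci * co);
  const int oh = conv2dt_out_size(ih, kh, stride, pad, out_pad);
  const int ow = conv2dt_out_size(iw, kw, stride, pad, out_pad);
  std::vector<float> out(static_cast<std::size_t>(oh) * ow * co, 0.0f);

  for (int iy = 0; iy < ih; ++iy) {
    for (int ix = 0; ix < iw; ++ix) {
      const float* src = &in[(static_cast<std::size_t>(iy) * iw + ix) * ci];
      for (int ky = 0; ky < kh; ++ky) {
        const int oy = iy * stride - pad + ky;  // output row hit by this tap
        if (oy < 0 || oy >= oh) continue;
        for (int kx = 0; kx < kw; ++kx) {
          const int ox = ix * stride - pad + kx;  // output column hit by this tap
          if (ox < 0 || ox >= ow) continue;
          float* dst = &out[(static_cast<std::size_t>(oy) * ow + ox) * co];
          const float* wk = &w[(static_cast<std::size_t>(ky) * kw + kx) * ci * co];
          for (int c = 0; c < ci; ++c) {
            const float x = src[c];
            const float* wrow = wk + static_cast<std::size_t>(c) * co;
            // Innermost loop: dst and wrow are both read with unit stride.
            for (int o = 0; o < co; ++o) dst[o] += x * wrow[o];
          }
        }
      }
    }
  }
  return out;
}
```

Because the innermost loop touches consecutive floats in both the weights and the output, it is also the natural place to substitute 8-wide (AVX2) or 16-wide (AVX-512) vector operations. As a size check, a 256x256 input with a 3x3 kernel, stride 2, padding 1, and output padding 1 gives `conv2dt_out_size(256, 3, 2, 1, 1) == 512`.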

The models used for speed testing are conv2dt_64_256x256x64_k3_3_s2_2_p1_1_op1_1.onnx and conv2dt_256_128x128x256_k3_3_s2_2_p1_1_op1_1.onnx.

Please check!
