Yi Luo authored
- av1_fht32x32 AVX2 function level time reduction ~89% compared to C.
- av1_fht32x32_avx2() on DCT_DCT improves 42.62% over aom_fdct32x32_avx2().
  But function replacement must go with the corresponding inverse txfm.
- No obvious user level time reduction due to 32x32 TX_TYPE selection.
- Zero high 128b YMM to avoid AVX-SSE transition penalties (fix 16x16 case).
- Added 32x32 AVX2 unit tests to verify bitexact.
- AVX2 optimization summary:
  On CPU i7-6700, based on 16x16/32x32 fwd txfm optimization results:
  C to AVX2: function level time reduction, ~86-89%.
  SSE2 to AVX2: function level time reduction, ~51%.

Change-Id: Idd0cd8bf066a61c7117140ef15ab6c1f8eb4b036
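
For readers unfamiliar with what "verify bitexact" means in the unit-test item above, here is a minimal sketch in plain C. It is not the test added by this commit and does not use the libaom API: `butterfly_ref`, `butterfly_opt`, `check_bitexact`, and the block width `N` are hypothetical stand-ins. The idea is only that an optimized routine is fed the same random inputs as a reference routine and the two outputs must match byte for byte.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define N 32 /* hypothetical block width, chosen to echo the 32x32 case */

/* Reference version: one add/subtract butterfly stage, written plainly. */
static void butterfly_ref(const int16_t *in, int32_t *out, int n) {
  assert(n > 0 && n % 2 == 0);
  for (int i = 0; i < n / 2; ++i) {
    out[i] = in[i] + in[n - 1 - i];
    out[n / 2 + i] = in[i] - in[n - 1 - i];
  }
}

/* Stand-in for an optimized version: same math, two pairs per iteration.
 * A real AVX2 routine would use intrinsics; this is only an illustration. */
static void butterfly_opt(const int16_t *in, int32_t *out, int n) {
  assert(n > 0 && n % 4 == 0);
  const int half = n / 2;
  for (int i = 0; i < half; i += 2) {
    const int32_t a0 = in[i], b0 = in[n - 1 - i];
    const int32_t a1 = in[i + 1], b1 = in[n - 2 - i];
    out[i] = a0 + b0;
    out[i + 1] = a1 + b1;
    out[half + i] = a0 - b0;
    out[half + i + 1] = a1 - b1;
  }
}

/* Returns 1 if both versions agree byte for byte over `runs` random inputs,
 * 0 on the first mismatch. */
static int check_bitexact(unsigned seed, int runs) {
  int16_t in[N];
  int32_t ref[N], opt[N];
  srand(seed);
  for (int r = 0; r < runs; ++r) {
    for (int i = 0; i < N; ++i) in[i] = (int16_t)(rand() % 511 - 255);
    butterfly_ref(in, ref, N);
    butterfly_opt(in, opt, N);
    if (memcmp(ref, opt, sizeof(ref)) != 0) return 0;
  }
  return 1;
}
```

Calling `check_bitexact(1u, 1000)` from a test body and asserting a nonzero result is the whole check. Comparing with `memcmp` rather than a tolerance is what makes the test bit-exact and not merely approximate.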
fed8e1c0