    Hybrid forward transform 32x32 AVX2 optimization · fed8e1c0
    Yi Luo authored
    - av1_fht32x32 AVX2 function level time reduction ~89% compared to C.
    
    - av1_fht32x32_avx2() on DCT_DCT improves 42.62% over aom_fdct32x32_avx2(),
      but replacing that function must go together with the corresponding
      inverse txfm.
    
    - No obvious user level time reduction due to 32x32 TX_TYPE selection.
    
    - Zero the upper 128 bits of the YMM registers to avoid AVX-SSE transition
      penalties (fixes the 16x16 case).
    
    - Added 32x32 AVX2 unit tests to verify bit-exactness against the C version.
    
    - AVX2 optimization summary:
      On CPU i7-6700, based on 16x16/32x32 fwd txfm optimization results:
      C to AVX2: function level time reduction, ~86-89%.
      SSE2 to AVX2: function level time reduction, ~51%.
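
The bit-exactness check mentioned in the list above can be sketched roughly as follows. This is a minimal illustration, not the unit test added by this change: the `fwd_txfm_fn` type, `is_bitexact()`, and the three toy transforms are hypothetical names invented for the example and do not match the libaom signatures. The idea is simply to feed the same random input to a reference and an optimized function and require identical output.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical function-pointer type (not the libaom signature): a forward
 * transform that reads n input samples and writes n coefficients. */
typedef void (*fwd_txfm_fn)(const int16_t *in, int32_t *out, int n);

/* Toy stand-ins for a C reference, an optimized version, and a buggy one. */
void toy_ref(const int16_t *in, int32_t *out, int n) {
  for (int i = 0; i < n; ++i) out[i] = in[i] * 2;
}
void toy_opt(const int16_t *in, int32_t *out, int n) {
  for (int i = 0; i < n; ++i) out[i] = in[i] + in[i];
}
void toy_bad(const int16_t *in, int32_t *out, int n) {
  for (int i = 0; i < n; ++i) out[i] = in[i] * 2;
  out[n - 1] += 1; /* off by one in the last coefficient */
}

/* Returns 1 if ref and opt produce byte-identical output for `runs` random
 * inputs of length n, and 0 otherwise (or if allocation fails). */
int is_bitexact(fwd_txfm_fn ref, fwd_txfm_fn opt, int n, int runs,
                unsigned seed) {
  int16_t *in = malloc((size_t)n * sizeof(*in));
  int32_t *a = malloc((size_t)n * sizeof(*a));
  int32_t *b = malloc((size_t)n * sizeof(*b));
  int ok = (in != NULL && a != NULL && b != NULL);

  srand(seed);
  for (int r = 0; ok && r < runs; ++r) {
    /* Random residuals in [-255, 255], the range for 8-bit video. */
    for (int i = 0; i < n; ++i) in[i] = (int16_t)(rand() % 511 - 255);
    ref(in, a, n);
    opt(in, b, n);
    if (memcmp(a, b, (size_t)n * sizeof(*a)) != 0) ok = 0;
  }

  free(in);
  free(a);
  free(b);
  return ok;
}
```

For a 32x32 block, `n` would be 1024; `is_bitexact(toy_ref, toy_opt, 1024, 100, 1)` passes, while swapping in `toy_bad` fails on the first run.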
    
    Change-Id: Idd0cd8bf066a61c7117140ef15ab6c1f8eb4b036