    Optimize inv 16x16 DCT with 10 non-zero coeffs - P1 · ba6ab46c
    Jingning Han authored
    This commit is the first patch optimizing the SSE2 implementation of the
    inverse 16x16 DCT with <10 non-zero coefficients. It focuses on the first
    1-D (row) transformation. It exploits the fact that only the top-left 4x4
    block contains non-zero coefficients in a 2-D inverse 16x16 DCT with <10
    coefficients.
    The average runtime of the idct16x16_10 unit is reduced from
    883 cycles -> 779 cycles (12% faster).
    For pedestrian_area_1080p 300 frames at 4000 kbps, the speed 2 runtime goes
    down from 310651 ms -> 305910 ms. The decoding speed goes up from
    80.37 fps -> 80.87 fps.
    Change-Id: Ic6f3ac5a637a76c07ba73ddaafe318a699fea645
vp9_idct_intrin_sse2.c 159 KB