    • Accelerated transform in high bit depth
      Julia Robson authored
      When configured with high bitdepth enabled, the 8bit transform
      stopped using optimised code. This made 8bit content decode slowly.
      Change-Id: I67d91f9b212921d5320f949fc0a0d3f32f90c0ea
    • Factor inverse transform functions into vpx_dsp
      Jingning Han authored
      This commit moves the module inverse transform functions from vp9
      to the vpx_dsp folder. The hybrid transform wrapper functions stay in
      the vp9 folder, since they involve codec-specific data structures.
      Change-Id: Ib066367c953d3d024c73ba65157bbd70a95c9ef8
    • Refactor vp9_idct.h file
      Jingning Han authored
      Separate the common coefficient constants into vpx_dsp/txfm_common.h.
      Move the SSE2 macro definitions to vpx_dsp/x86/txfm_common_sse2.h.
      This removes the need for vp9_idct.h in the vpx_dsp folder.
      Change-Id: I319735a2abf42888e5080ac14cfbcde34be7b121
    • Relocate memory operations for common code
      Johann authored
      With the sad functions, and hopefully the variance functions soon,
      moving to the vpx_dsp location, place the defines used in the
      reference C code in a common location.
      Change-Id: I4c8ce7778eb38a0a3ee674d2f1c488eda01cfeca
    • Corrected optimization of 8x8 DCT code
      Peter de Rivaz authored
      The 8x8 DCT uses a fast version whenever possible.
      There was a mistake in the checking code which
      meant sometimes the fast version was used when it
      was not safe to do so.
      Change-Id: I154c84c9e2d836764768a11082947ca30f4b5ab7
      (cherry picked from commit fd05fb0c21e253b4d6f92d7e0b752850ff8ab188)
    • Added high bitdepth sse2 transform functions
      Peter de Rivaz authored
      Also removes some spurious changes in common/vp9_blockd.h which
      were introduced by a rebase issue between the nextgen and master branches.
      Change-Id: If359f0e9a71bca9c2ba685a87a355873536bb282
      (cherry picked from commit 005d80cd05269a299cd2f7ddbc3d4d8b791aebba)
      (cherry picked from commit 08d2f548007fd8d6fd41da8ef7fdb488b6485af3)
      (cherry picked from commit 4230c2306c194c058f56433a5275aa02a2e71d56)
    • Enable SSSE3 inverse 2D-DCT with 10 non-zero coeffs
      Jingning Han authored
      This commit enables the SSSE3 implementation of the inverse 2D-DCT
      with only the first 10 coefficients non-zero. It reduces the runtime
      from 745 cycles (SSE2 version) to 538 cycles, i.e., a 27% speed-up.
      Change-Id: I18ba4128859b09c704a6ee361d69a86c09fe8dfe
    • Inverse 16x16 2D-DCT SSSE3 implementation
      Jingning Han authored
      This commit enables the SSSE3 implementation of full inverse 16x16
      2D-DCT. The unit runtime goes down from 1642 cycles to 1519 cycles,
      about 7% speed-up.
      Change-Id: I14d2fdf9da1fb4ed1e5db7ce24f77a1bfc8ea90d
    • Change eob threshold for partial inverse 8x8 2D-DCT to 12
      Jingning Han authored
      The scanning order has the first 12 coefficients of the 8x8 2D-DCT
      sitting in the top-left 4x4 block. Hence the partial inverse 8x8
      2D-DCT can handle cases with eob below 12.
      The overall runtime of the inverse 8x8 2D-DCT unit is reduced from
      166 cycles (using SSE2) to 150 cycles (using SSSE3).
      Change-Id: I4514f9748042809ac84df4c14382c00f313f1cd2
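      The eob-based selection described in this commit can be sketched roughly
      as below. This is a minimal illustration, not the library's code: the enum,
      the function name select_idct8x8, and the inclusive comparison against 12
      are assumptions made for the example.

```c
/* Hypothetical labels for the three 8x8 inverse-DCT code paths. */
typedef enum {
  IDCT8X8_DC_ONLY, /* only the DC coefficient is non-zero */
  IDCT8X8_PARTIAL, /* non-zero coefficients confined to the top-left 4x4 */
  IDCT8X8_FULL     /* general case */
} idct8x8_kind;

/* Assumed threshold, taken from the commit title; treated as inclusive here. */
#define IDCT8X8_PARTIAL_EOB_MAX 12

/* Pick a code path from eob, the end-of-block position in scan order. */
static idct8x8_kind select_idct8x8(int eob) {
  if (eob == 1) return IDCT8X8_DC_ONLY;
  if (eob <= IDCT8X8_PARTIAL_EOB_MAX) return IDCT8X8_PARTIAL;
  return IDCT8X8_FULL;
}
```

      A caller would switch on the returned value and invoke the matching
      transform routine; the cheaper paths are safe only because the scan order
      bounds where non-zero coefficients can sit.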
    • Optimize inv 16x16 DCT with 10 non-zero coeffs - P2
      Jingning Han authored
      This commit further optimizes the SSE2 operations in the second 1-D
      inverse 16x16 DCT, with <10 non-zero coefficients. The average
      runtime of this module goes down from 779 cycles to 725 cycles.
      Change-Id: Iac31b123640d9b1e8f906e770702936b71f0ba7f
    • Optimize inv 16x16 DCT with 10 non-zero coeffs - P1
      Jingning Han authored
      This commit is the first patch optimizing the SSE2 implementation of the inverse
      16x16 DCT with <10 non-zero coefficients. It focuses on the first 1-D (row)
      transformation. It exploits the fact that only the top-left 4x4 block contains
      non-zero coefficients in a 2-D inverse 16x16 DCT with <10 coefficients.
      The average runtime of the idct16x16_10 unit is reduced from
      883 cycles to 779 cycles (12% faster).
      For pedestrian_area_1080p 300 frames at 4000 kbps, the speed 2 runtime goes
      down from 310651 ms  -> 305910 ms. The decoding speed goes up from
      80.37 fps -> 80.87 fps.
      Change-Id: Ic6f3ac5a637a76c07ba73ddaafe318a699fea645
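      The row-pass saving described in this commit follows from linearity: a 1-D
      transform of an all-zero row is all zero, so rows outside the top-left 4x4
      block need not be processed. The sketch below illustrates this with a
      placeholder transform (a running sum), not the real IDCT butterfly; the
      names toy_1d and rows_then_cols are hypothetical.

```c
#include <string.h>

#define BLOCK_DIM 16

/* Placeholder linear 1-D transform (a running sum). It stands in for the
 * real 1-D inverse DCT only to show the zero-row property. */
static void toy_1d(const int *in, int *out) {
  int acc = 0;
  for (int i = 0; i < BLOCK_DIM; ++i) {
    acc += in[i];
    out[i] = acc;
  }
}

/* Separable 2-D transform: rows first, then columns. Only the first
 * nonzero_rows rows are transformed; the rest are assumed to be zero and
 * are left as zero in the intermediate buffer. */
static void rows_then_cols(const int *in, int *out, int nonzero_rows) {
  int tmp[BLOCK_DIM * BLOCK_DIM];
  int col_in[BLOCK_DIM], col_out[BLOCK_DIM];
  memset(tmp, 0, sizeof(tmp));

  for (int r = 0; r < nonzero_rows; ++r)
    toy_1d(in + r * BLOCK_DIM, tmp + r * BLOCK_DIM);

  for (int c = 0; c < BLOCK_DIM; ++c) {
    for (int r = 0; r < BLOCK_DIM; ++r) col_in[r] = tmp[r * BLOCK_DIM + c];
    toy_1d(col_in, col_out);
    for (int r = 0; r < BLOCK_DIM; ++r) out[r * BLOCK_DIM + c] = col_out[r];
  }
}
```

      Calling rows_then_cols with nonzero_rows set to 4 instead of 16 gives the
      same output whenever the input is zero outside the top-left 4x4 block,
      while running a quarter of the row transforms.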
    • Tune IDCT8_1D macro function interface
      Jingning Han authored
      This commit adds input/output ports to the IDCT8_1D macro function to
      provide more flexibility in variable use. This makes it possible to skip
      several buffer swap operations.
      Change-Id: I21f3450509537322293043b3281bfd3949868677
    • Reduce num of buffer swap calls in idct8_1d_sse2
      Jingning Han authored
      This commit merges the initial buffer swap operations in idct8_1d_sse2
      into the array transpose step, thereby reducing the number of instructions.
      Change-Id: I219f6f50813390d2ec3ee37eecf2a4a2b44ae479
    • Rework idct8x8_10 SSE2 implementation
      Jingning Han authored
      This commit optimizes the SSE2 implementation of idct8x8_10. It exploits
      the fact that only the top-left 4x4 block contains non-zero coefficients,
      and hence reduces the number of instructions needed.
      The runtime of idct8x8_10_sse2 goes down from 216 to 198 CPU cycles,
      estimated by averaging over 100000 runs. For pedestrian_area_1080p 300
      frames coded at 4000kbps, the average decoding speed goes up from
      79.3 fps to 79.7 fps.
      Change-Id: I6d277bbaa3ec9e1562667906975bae06904cb180
    • improve vp9_idct32x32_34(x1.472)&1024(x1.032)_add_sse2
      Abo Talib Mahfoodh authored
      vp9_idct32x32_34_add_sse2 speedup: 1.472
      IDCT32_1D_34 and MULTIPLICATION_AND_ADD_2 are optimized
      based on the fact that only the upper-left 8x8 block has
      non-zero values.
      vp9_idct32x32_1024_add_sse2 speedup: 1.032
      Tested with: park_joy_420_720p50.y4m
      Change-Id: I8670ce547552b48695049de298e2fc46ce28dfbc
    • Improve vp9_iht4x4_16_add_sse2 (x1.341)
      Abo Talib Mahfoodh authored
      This rebased patch is a better implementation than the previous ones.
      Modifications are made to reduce the total clock cycle count.
      Speedup: 1.341
      Compiled with -O3
      Tested with: park_joy_420_720p50.y4m
      Change-Id: I940eaf283f60597ca0d9d2e13d518878d55ff02d
    • Add 32x32 idct function for eob<=34 case
      Yunqing Wang authored
      When only the upper-left 8x8 area has non-zero DCT coefficients, we
      can skip the 1-D IDCT for the 9th to 32nd rows to save operations. This
      function is called when eob <= 34.
      Change-Id: I9684b75947bdde346cfe3720f08a953aa7a13fb5
    • Improve vp9_idct4x4_1_add_sse2
      Abo Talib Mahfoodh authored
      Simple modification to reduce the number of cycles in vp9_idct4x4_1_add_sse2.
      Original function number of cycles: 973
      Modified function number of cycles: 835
      Improvement factor: 1.165
      Tested with: park_joy_420_720p50.y4m
      Change-Id: Ic5857272ea3aafe21d5ef9a69258d78c688f69bd
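      The commits in this log report gains in two different ways: as a factor
      (old cycles divided by new cycles, e.g. 973 / 835 is about 1.165) and as a
      percentage reduction in cycles (e.g. 745 to 538 is about 27.8%, quoted as
      27%). The helpers below are a small sketch for converting cycle counts into
      either figure; the function names are illustrative only.

```c
/* Speed-up expressed as a factor: how many times faster the new code is. */
static double speedup_factor(double cycles_before, double cycles_after) {
  return cycles_before / cycles_after;
}

/* Speed-up expressed as the percentage of cycles saved. */
static double percent_reduction(double cycles_before, double cycles_after) {
  return 100.0 * (cycles_before - cycles_after) / cycles_before;
}
```

      Note that the two figures are not interchangeable: a 27.8% reduction in
      cycles corresponds to a factor of roughly 1.385, not 1.278.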
    • Removing vp9_idct4_1d_sse2 function.
      Dmitry Kovalev authored
      We have two SSE2-optimized functions for idct4_1d:
        vp9_idct4_1d_sse2 <-- removing this one
      vp9_idct4_1d_sse2 was used only by the following functions which already
      have SSE2 optimized variants:
        vp9_idct4x4_16_add_c   -> vp9_idct4x4_16_add_sse2
        idct8_1d               -> vp9_idct8x8_{16, 10, 1}_sse2
        vp9_short_iht4x4_add_c -> vp9_short_iht4x4_add_sse2
      Change-Id: Ib0a7f6d1373dbaf7a4a41208cd9d0671fdf15edb
    • Giving consistent names to IDCT 32x32 functions.
      Dmitry Kovalev authored
        vp9_short_idct32x32_add   -> vp9_idct32x32_1024_add
        vp9_short_idct32x32_1_add -> vp9_idct32x32_1_add
        vp9_idct_add_32x32        -> vp9_idct32x32_add
      Change-Id: Id85306f5814bac6c47463a6b5901a93082510666
    • Giving consistent names to IDCT 16x16 functions.
      Dmitry Kovalev authored
        vp9_short_idct16x16_add    -> vp9_idct16x16_256_add
        vp9_short_idct16x16_10_add -> vp9_idct16x16_10_add
        vp9_short_idct16x16_1_add  -> vp9_idct16x16_1_add
        vp9_idct_add_16x16         -> vp9_idct16x16_add
      Change-Id: Ief8a3904de78deab0f4ede944c4d0339c228cfc3
    • Giving consistent names to IDCT 8x8 functions.
      Dmitry Kovalev authored
        vp9_short_idct8x8_add    -> vp9_idct8x8_64_add
        vp9_short_idct8x8_1_add  -> vp9_idct8x8_1_add
        vp9_short_idct8x8_10_add -> vp9_idct8x8_10_add
        vp9_idct_add_8x8         -> vp9_idct8x8_add
      Change-Id: Ifb8d3a45b4c0397aa805b30463f3d14581bf72c1
    • Giving consistent names to IDCT/IWHT functions.
      Dmitry Kovalev authored
      The idea is to have the following names for each transform size:
      etc for 16x16, 32x32
      The actual list of renames in this patch:
      vp9_idct_add_lossless     -> vp9_iwht4x4_add
      vp9_short_iwalsh4x4_add   -> vp9_iwht4x4_16_add
      vp9_short_iwalsh4x4_1_add -> vp9_iwht4x4_1_add
      vp9_idct_add            -> vp9_idct4x4_add
      vp9_short_idct4x4_add   -> vp9_idct4x4_16_add
      vp9_short_idct4x4_1_add -> vp9_idct4x4_1_add
      Change-Id: I6f43f7437c68dd30cdd05d72e213765578ed30b1