1. 08 May, 2014 1 commit
    • Change eob threshold for partial inverse 8x8 2D-DCT to 12 · 41a350a8
      Jingning Han authored
      The scanning order has the first 12 coefficients of the 8x8 2D-DCT
      sitting in the top-left 4x4 block. Hence the partial inverse 8x8
      2D-DCT can handle cases with eob below 12.
      
      The overall runtime of the inverse 8x8 2D-DCT unit is reduced from
      166 cycles (using SSE2) to 150 cycles (using SSSE3).
      
      Change-Id: I4514f9748042809ac84df4c14382c00f313f1cd2
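The eob-based selection described in this commit can be sketched as follows. This is a minimal illustration, not the library's dispatch code: the enum, the function name `select_idct8x8`, and the inclusive `<=` comparison against 12 are assumptions made for the example.

```c
/* Hypothetical labels for the three 8x8 inverse-DCT variants. */
typedef enum {
  IDCT8X8_DC_ONLY, /* only the DC coefficient is non-zero */
  IDCT8X8_PARTIAL, /* non-zero coefficients confined to the top-left 4x4 */
  IDCT8X8_FULL     /* general case: any of the 64 coefficients may be set */
} idct8x8_kind;

/* Threshold taken from the commit title; the inclusive comparison below
 * is an assumption of this sketch. */
#define PARTIAL_EOB_THRESHOLD 12

/* Pick the cheapest variant that is still exact for the given
 * end-of-block position (number of coefficients in scan order). */
static idct8x8_kind select_idct8x8(int eob) {
  if (eob <= 1) return IDCT8X8_DC_ONLY;
  if (eob <= PARTIAL_EOB_THRESHOLD) return IDCT8X8_PARTIAL;
  return IDCT8X8_FULL;
}
```

Raising the threshold only widens the range of blocks routed to the partial variant; the full variant remains the fallback for everything else.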
  2. 28 Jan, 2014 1 commit
  3. 09 Jan, 2014 1 commit
    • Optimize inv 16x16 DCT with 10 non-zero coeffs - P2 · af31b27a
      Jingning Han authored
      This commit further optimizes SSE2 operations in the second 1-D
      inverse 16x16 DCT, with (<10) non-zero coefficients. The average
      runtime of this module goes down from 779 cycles -> 725 cycles.
      
      Change-Id: Iac31b123640d9b1e8f906e770702936b71f0ba7f
  4. 08 Jan, 2014 1 commit
    • Optimize inv 16x16 DCT with 10 non-zero coeffs - P1 · ba6ab46c
      Jingning Han authored
      This commit is the first patch optimizing the SSE2 implementation of the
      inverse 16x16 DCT with <10 non-zero coefficients. It focuses on the first
      1-D (row) transformation. It exploits the fact that only the top-left 4x4
      block contains non-zero coefficients in a 2-D inverse 16x16 DCT with <10
      coefficients.
      
      The average runtime of idct16x16_10 unit is reduced from
      883 cycles -> 779 cycles (12% faster).
      
      For pedestrian_area_1080p 300 frames at 4000 kbps, the speed 2 runtime goes
      down from 310651 ms -> 305910 ms. The decoding speed goes up from
      80.37 fps -> 80.87 fps.
      
      Change-Id: Ic6f3ac5a637a76c07ba73ddaafe318a699fea645
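The sparsity argument above can be sketched in plain C. This is an illustration only: `kernel` is an arbitrary stand-in linear kernel rather than the real DCT basis, and all names here are hypothetical. The point is that when non-zero coefficients sit in the top-left 4x4 block, the row pass needs only 4 rows of 4 inputs each, and the column pass needs only 4 inputs per column.

```c
#define N 16 /* transform size */
#define NZ 4 /* non-zero coefficients live in the top-left NZ x NZ block */

/* Stand-in 1-D linear kernel (NOT the real DCT basis). */
static int kernel(int n, int k) { return ((n * k * 7 + 3) % 11) - 5; }

/* out[n] = sum over the first `len` inputs of kernel(n, k) * in[k]. */
static void transform_1d(const int *in, int in_stride, int len,
                         int *out, int out_stride) {
  for (int n = 0; n < N; ++n) {
    int acc = 0;
    for (int k = 0; k < len; ++k) acc += kernel(n, k) * in[k * in_stride];
    out[n * out_stride] = acc;
  }
}

/* Reference: transform all N rows, then all N columns. */
static void inverse_2d_full(const int *in, int *out) {
  int tmp[N * N];
  for (int r = 0; r < N; ++r) transform_1d(in + r * N, 1, N, tmp + r * N, 1);
  for (int c = 0; c < N; ++c) transform_1d(tmp + c, N, N, out + c, N);
}

/* Sparse variant: rows NZ..N-1 are all zero, so their row transforms are
 * skipped, and every remaining 1-D transform reads only NZ inputs. */
static void inverse_2d_sparse(const int *in, int *out) {
  int tmp[NZ * N];
  for (int r = 0; r < NZ; ++r) transform_1d(in + r * N, 1, NZ, tmp + r * N, 1);
  for (int c = 0; c < N; ++c) transform_1d(tmp + c, N, NZ, out + c, N);
}
```

Both functions produce the same output whenever the input is zero outside the top-left 4x4 block, because every skipped term is a multiplication by zero.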
  5. 03 Jan, 2014 3 commits
    • Tune IDCT8_1D macro function interface · 3e0c62b5
      Jingning Han authored
      This commit adds input/output ports to the IDCT8_1D macro function to
      provide more flexibility in variable use. This allows several buffer
      swap operations to be skipped.
      
      Change-Id: I21f3450509537322293043b3281bfd3949868677
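Why explicit input/output ports remove swaps can be shown with a toy two-point butterfly. This is a sketch under stated assumptions: the macros and functions below are hypothetical and far simpler than IDCT8_1D, but they show the same idea of letting the caller route results to where the next stage expects them.

```c
/* Fixed in-place interface: results always land in (a, b) as (sum, diff). */
#define BUTTERFLY_INPLACE(a, b) \
  do {                          \
    int t_ = (a) + (b);         \
    (b) = (a) - (b);            \
    (a) = t_;                   \
  } while (0)

/* Ported interface: the caller names the destinations. Both results are
 * computed before any write, so outputs may alias the inputs. */
#define BUTTERFLY(in0, in1, out0, out1) \
  do {                                  \
    int s_ = (in0) + (in1);             \
    int d_ = (in0) - (in1);             \
    (out0) = s_;                        \
    (out1) = d_;                        \
  } while (0)

/* The next stage wants (diff, sum), so the in-place form needs a swap. */
static void stage_with_swap(int x[2]) {
  BUTTERFLY_INPLACE(x[0], x[1]);
  int t = x[0];
  x[0] = x[1];
  x[1] = t;
}

/* With ports, the outputs are routed directly and no swap is needed. */
static void stage_with_ports(int x[2]) {
  BUTTERFLY(x[0], x[1], x[1], x[0]);
}
```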
    • Reduce num of buffer swap calls in idct8_1d_sse2 · 0b1a2713
      Jingning Han authored
      This commit merges the initial buffer swap operations in idct8_1d_sse2
      into the array transpose step, hence reducing the number of instructions
      therein.
      
      Change-Id: I219f6f50813390d2ec3ee37eecf2a4a2b44ae479
    • Rework idct8x8_10 SSE2 implementation · 1bb11781
      Jingning Han authored
      This commit optimizes the SSE2 implementation of idct8x8_10. It exploits
      the fact that only the top-left 4x4 block contains non-zero coefficients,
      and hence reduces the instructions needed.
      
      The runtime of idct8x8_10_sse2 goes down from 216 to 198 CPU cycles,
      estimated by averaging over 100000 runs. For pedestrian_area_1080p 300
      frames coded at 4000kbps, the average decoding speed goes up from
      79.3 fps to 79.7 fps.
      
      Change-Id: I6d277bbaa3ec9e1562667906975bae06904cb180
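Averaging a kernel's cost over many runs, as in the measurement above, can be sketched like this. It is an assumption-laden illustration: the commit reports CPU cycles, whereas this portable sketch uses `clock()` ticks from the C standard library, and `average_ticks`, `speedup`, and `dummy_kernel` are hypothetical names.

```c
#include <time.h>

typedef void (*kernel_fn)(void *ctx);

/* Average CPU time per call, in clock() ticks, over `runs` invocations.
 * A cycle counter would give finer resolution; clock() is used here only
 * because it is portable. */
static double average_ticks(kernel_fn fn, void *ctx, long runs) {
  clock_t start = clock();
  for (long i = 0; i < runs; ++i) fn(ctx);
  clock_t end = clock();
  return (double)(end - start) / (double)runs;
}

/* Speed-up factor: cost before the change divided by cost after it. */
static double speedup(double before, double after) { return before / after; }

/* Placeholder workload standing in for the function under test. */
static void dummy_kernel(void *ctx) {
  volatile int *counter = ctx;
  *counter += 1;
}
```

With the figures quoted in the commit, `speedup(216.0, 198.0)` is roughly 1.09.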
  6. 03 Dec, 2013 1 commit
  7. 26 Nov, 2013 1 commit
    • improve vp9_idct32x32_34(x1.472)&1024(x1.032)_add_sse2 · f97d91ab
      Abo Talib Mahfoodh authored
      vp9_idct32x32_34_add_sse2:
      speedup: 1.472
      IDCT32_1D_34 and MULTIPLICATION_AND_ADD_2 are optimized
      based on the fact that only the upper-left 8x8 block has
      non-zero values.
      
      vp9_idct32x32_1024_add_sse2:
      speedup: 1.032
      
      Tested with: park_joy_420_720p50.y4m
      
      Change-Id: I8670ce547552b48695049de298e2fc46ce28dfbc
  8. 19 Nov, 2013 1 commit
    • Improve vp9_iht4x4_16_add_sse2 (x1.341) · 613e2d2e
      Abo Talib Mahfoodh authored
      This rebase is a better implementation of the previous ones.
      
      Modifications are made to reduce the total clock cycle count.
      Speedup: 1.341
      Compiled with -O3
      Tested with: park_joy_420_720p50.y4m
      
      Change-Id: I940eaf283f60597ca0d9d2e13d518878d55ff02d
  9. 24 Oct, 2013 1 commit
    • Add 32x32 idct function for eob<=34 case · f88315cb
      Yunqing Wang authored
      When only the upper-left 8x8 area has non-zero DCT coefficients, we
      can skip the 1D IDCT for the 9th to 32nd rows to save operations. This
      function is called when eob <= 34.
      
      Change-Id: I9684b75947bdde346cfe3720f08a953aa7a13fb5
  10. 23 Oct, 2013 1 commit
  11. 22 Oct, 2013 1 commit
    • Improve vp9_idct4x4_1_add_sse2 · 908a992d
      Abo Talib Mahfoodh authored
      Simple modification to reduce number of cycles in the
      function.
      Original function number of cycles: 973
      Modified function number of cycles: 835
      Improvement factor: 1.165
      
      Tested with: park_joy_420_720p50.y4m
      
      Change-Id: Ic5857272ea3aafe21d5ef9a69258d78c688f69bd
  12. 12 Oct, 2013 1 commit
  13. 11 Oct, 2013 1 commit
  14. 10 Oct, 2013 2 commits
    • Removing vp9_idct4_1d_sse2 function. · ddf1b762
      Dmitry Kovalev authored
      We have two SSE2-optimized functions for idct4_1d:
        vp9_idct4_1d_sse2 <-- removing this one
        idct4_1d_sse2
      
      vp9_idct4_1d_sse2 was used only by the following functions which already
      have SSE2 optimized variants:
        vp9_idct4x4_16_add_c   -> vp9_idct4x4_16_add_sse2
        idct8_1d               -> vp9_idct8x8_{16, 10, 1}_sse2
        vp9_short_iht4x4_add_c -> vp9_short_iht4x4_add_sse2
      
      Change-Id: Ib0a7f6d1373dbaf7a4a41208cd9d0671fdf15edb
    • Giving consistent names to IDCT 32x32 functions. · 1e766b50
      Dmitry Kovalev authored
      Renames:
        vp9_short_idct32x32_add   -> vp9_idct32x32_1024_add
        vp9_short_idct32x32_1_add -> vp9_idct32x32_1_add
        vp9_idct_add_32x32        -> vp9_idct32x32_add
      
      Change-Id: Id85306f5814bac6c47463a6b5901a93082510666
  15. 07 Oct, 2013 1 commit
    • Giving consistent names to IDCT 16x16 functions. · b096c5a3
      Dmitry Kovalev authored
      Renames:
        vp9_short_idct16x16_add    -> vp9_idct16x16_256_add
        vp9_short_idct16x16_10_add -> vp9_idct16x16_10_add
        vp9_short_idct16x16_1_add  -> vp9_idct16x16_1_add
        vp9_idct_add_16x16         -> vp9_idct16x16_add
      
      Change-Id: Ief8a3904de78deab0f4ede944c4d0339c228cfc3
  16. 06 Oct, 2013 1 commit
    • Giving consistent names to IDCT 8x8 functions. · c6ad70d5
      Dmitry Kovalev authored
      Renames:
        vp9_short_idct8x8_add    -> vp9_idct8x8_64_add
        vp9_short_idct8x8_1_add  -> vp9_idct8x8_1_add
        vp9_short_idct8x8_10_add -> vp9_idct8x8_10_add
        vp9_idct_add_8x8         -> vp9_idct8x8_add
      
      Change-Id: Ifb8d3a45b4c0397aa805b30463f3d14581bf72c1
  17. 04 Oct, 2013 1 commit
    • Giving consistent names to IDCT/IWHT functions. · 3a060257
      Dmitry Kovalev authored
      The idea is to have the following names for each transform size:
      
      vp9_idct4x4_add
        vp9_idct4x4_1_add
        vp9_idct4x4_10_add
        vp9_idct4x4_16_add
      
      vp9_idct8x8_add
        vp9_idct8x8_1_add
        vp9_idct8x8_10_add
        vp9_idct8x8_64_add
      
      etc for 16x16, 32x32
      
      The actual list of renames in this patch:
      
      vp9_idct_add_lossless     -> vp9_iwht4x4_add
      vp9_short_iwalsh4x4_add   -> vp9_iwht4x4_16_add
      vp9_short_iwalsh4x4_1_add -> vp9_iwht4x4_1_add
      
      vp9_idct_add            -> vp9_idct4x4_add
      vp9_short_idct4x4_add   -> vp9_idct4x4_16_add
      vp9_short_idct4x4_1_add -> vp9_idct4x4_1_add
      
      Change-Id: I6f43f7437c68dd30cdd05d72e213765578ed30b1
  18. 02 Oct, 2013 1 commit
  19. 30 Sep, 2013 1 commit
  20. 27 Sep, 2013 1 commit
  21. 26 Sep, 2013 1 commit
  22. 01 Aug, 2013 1 commit
    • Optimize 32x32 2D inverse DCT for speed-up · 9d67495f
      Jingning Han authored
      This commit exploits the sparsity of the quantized coefficient matrix.
      It checks each 32x8 array and skips the corresponding inverse
      transformation if all entries are zero.
      
      For ped1080p at 8000 kbps, this on average reduces the runtime of
      32x32 inverse 2D-DCT SSE2 function from 6256 cycles -> 5200
      cycles. It makes the overall encoding process about 2% faster at
      speed 0. The speed-up is more pronounced for the decoding process.
      
      Change-Id: If20056c3566bd117642a76f8884c83e8bc8efbcf
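The zero-detection step can be sketched in scalar C. This is a minimal illustration rather than the SSE2 code: treating each 32x8 array as a band of 8 consecutive rows of 32 coefficients is an assumption about the layout, and `band_is_zero` and `nonzero_band_mask` are hypothetical names.

```c
#include <stdint.h>

#define SIZE 32     /* 32x32 transform */
#define BAND_ROWS 8 /* assumed layout: each band is 8 rows of 32 coefficients */

/* OR all entries together; the band is all-zero iff the result is zero. */
static int band_is_zero(const int16_t *band, int count) {
  int acc = 0;
  for (int i = 0; i < count; ++i) acc |= band[i];
  return acc == 0;
}

/* Bit b of the result is set when band b holds a non-zero coefficient,
 * i.e. when its inverse transform cannot be skipped. */
static unsigned nonzero_band_mask(const int16_t *coeffs) {
  unsigned mask = 0;
  for (int b = 0; b < SIZE / BAND_ROWS; ++b) {
    if (!band_is_zero(coeffs + b * BAND_ROWS * SIZE, BAND_ROWS * SIZE))
      mask |= 1u << b;
  }
  return mask;
}
```

A caller would loop over the four bands and run the 1-D transform only for those whose bit is set, writing zeros for the rest.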
  23. 29 Jul, 2013 1 commit
    • 16x16 inverse 2D-DCT with DC only · a7c4de22
      Jingning Han authored
      This commit adds special handling for the 16x16 inverse 2D-DCT when
      only the DC coefficient is quantized to a non-zero value.
      
      Change-Id: I7bf71be7fa13384fab453dc8742b5b50e77a277c
  24. 26 Jul, 2013 1 commit
    • Special handle on DC only inverse 8x8 2D-DCT · 325e0aa6
      Jingning Han authored
      This commit enables special handling for the 8x8 inverse 2D-DCT when
      only the DC coefficient is quantized to a non-zero value. For bus_cif
      at 2000 kbps, it provides about 1% speed-up at speed 0.
      
      Change-Id: I2523222359eec26b144cf8fd4c63a4ad63b1b011
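When only the DC coefficient is non-zero, the inverse transform yields the same value at every position, so the whole block reduces to adding one constant to the prediction. The sketch below illustrates this in scalar C. The fixed-point constants (14-bit precision, 11585 for cos(pi/4)) and the final rounding shift of 5 are assumptions modeled on a typical fixed-point DCT; they are not taken from this commit, and the function names are hypothetical.

```c
#include <stdint.h>

#define DCT_CONST_BITS 14 /* assumed fixed-point precision */
#define COSPI_16_64 11585 /* round(2^14 * cos(pi / 4)) */

/* Rounded right shift. Right-shifting a negative int is
 * implementation-defined in C; an arithmetic shift is assumed. */
static int round_shift(int x, int bits) {
  return (x + (1 << (bits - 1))) >> bits;
}

static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* DC-only 8x8 inverse DCT fused with the add to the prediction in dest. */
static void idct8x8_dc_only_add(int16_t dc, uint8_t *dest, int stride) {
  int out = round_shift(dc * COSPI_16_64, DCT_CONST_BITS);  /* row pass */
  out = round_shift(out * COSPI_16_64, DCT_CONST_BITS);     /* column pass */
  const int a1 = round_shift(out, 5); /* assumed final scaling for 8x8 */
  for (int r = 0; r < 8; ++r) {
    for (int c = 0; c < 8; ++c) {
      dest[r * stride + c] = clip_pixel(dest[r * stride + c] + a1);
    }
  }
}
```

The two multiplications replace the full row and column passes, which is where the saving comes from.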
  25. 25 Jul, 2013 1 commit
    • SSE2 inverse 4x4 2D-DCT with DC only · 384e37e3
      Jingning Han authored
      Add an SSE2 implementation to handle the special case of the inverse
      2D-DCT where only the DC coefficient is non-zero.
      
      Change-Id: I2c6a59e21e5e77b8cf39a4af5eecf4d5ade32e2f
  26. 24 Jul, 2013 1 commit
  27. 16 Jul, 2013 1 commit
    • SSE2 16x16 inverse ADST/DCT hybrid transform · d05f66aa
      Jingning Han authored
      This commit enables SSE2 implementation of 16x16 inverse ADST/DCT
      hybrid transform. The runtime goes from 5742 cycles -> 1821 cycles.
      This provides about 1% encoding speed-up at speed 0.
      
      Change-Id: I1678d0988bf30b9efd524877705bbb3645edb17b
  28. 13 Jul, 2013 1 commit
    • SSE2 8x8 inverse ADST/DCT transform · 91365add
      Jingning Han authored
      This commit enables SSE2 implementation of 8x8 inverse ADST/DCT
      transform. The runtime goes from 1216 cycles -> 266 cycles.
      For bus_cif at 2000 kbps, the overall runtime reduces from
      253707ms -> 248430ms, i.e., 2% speed-up at speed 0.
      
      Change-Id: Ib0372e17e9162d7b11a10d653b1c8be547c878fb
  29. 11 Jul, 2013 1 commit
    • SSE2 4x4 inverse ADST/DCT transform · 49b63020
      Jingning Han authored
      Enable SSE2 4x4 inverse ADST/DCT transform. The runtime goes from
      292 cycles down to 89 cycles. Running bus_cif at 2000 kbps, the
      overall runtime of speed 0 goes from 301s to 295s (2% speed-up).
      
      Change-Id: I24098136e7fee7ab2fbf1c11755bdf2ca37f3628
  30. 20 May, 2013 1 commit
    • WIP: 4x4 idct/recon merge · ba48a111
      Scott LaVarnway authored
      This patch eliminates the intermediate diff buffer usage by
      combining the short idct and the add residual into one function.
      The encoder can use the same code as well.
      
      Change-Id: I296604bf73579c45105de0dd1adbcc91bcc53c22
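The merge described in this patch can be sketched as follows. This is an illustration under stated assumptions: `residual_at` is a stand-in for the inverse transform (not a real IDCT), and both function names are hypothetical. The two-pass form stores the residual in an intermediate diff buffer, while the merged form adds each residual sample to the prediction as soon as it is produced.

```c
#include <stdint.h>

#define BS 4 /* block size */

static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Stand-in for one output sample of the inverse transform (NOT a real IDCT). */
static int residual_at(const int16_t *coeffs, int r, int c) {
  return coeffs[r * BS + c] / 2;
}

/* Two-pass form: the transform writes a diff buffer, then a second loop
 * adds it to the prediction held in dest. */
static void idct_then_add(const int16_t *coeffs, uint8_t *dest, int stride) {
  int16_t diff[BS * BS];
  for (int r = 0; r < BS; ++r)
    for (int c = 0; c < BS; ++c)
      diff[r * BS + c] = (int16_t)residual_at(coeffs, r, c);
  for (int r = 0; r < BS; ++r)
    for (int c = 0; c < BS; ++c)
      dest[r * stride + c] = clip_pixel(dest[r * stride + c] + diff[r * BS + c]);
}

/* Merged form: no diff buffer; the residual goes straight into dest. */
static void idct_add(const int16_t *coeffs, uint8_t *dest, int stride) {
  for (int r = 0; r < BS; ++r)
    for (int c = 0; c < BS; ++c)
      dest[r * stride + c] =
          clip_pixel(dest[r * stride + c] + residual_at(coeffs, r, c));
}
```

Both forms give identical pixels; the merged one simply avoids a store and reload of the residual block.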
  31. 16 May, 2013 1 commit
    • WIP: 8x8 idct/recon merge · 794a7bed
      Scott LaVarnway authored
      This patch eliminates the intermediate diff buffer usage by
      combining the short idct and the add residual into one function.
      The encoder can use the same code as well.
      
      Change-Id: Iacfd57324fbe2b7beca5d7f3dcae25c976e67f45
  32. 15 May, 2013 1 commit
    • WIP: 16x16 idct/recon merge · a272ff25
      Scott LaVarnway authored
      This patch eliminates the intermediate diff buffer usage by
      combining the short idct and the add residual into one function.
      The encoder can use the same code as well.
      
      Change-Id: Iea7976b22b1927d24b8004d2a3fddae7ecca3ba1
  33. 14 May, 2013 1 commit
    • WIP: 32x32 idct/recon merge · 2cf0d4be
      Scott LaVarnway authored
      This patch eliminates the intermediate diff buffer usage by
      combining the short idct and the add residual into one function.
      The encoder can use the same code as well.
      
      Change-Id: I4ea09df0e162591e420d869b7431c2e7f89a8c1a
  34. 25 Apr, 2013 1 commit
    • Rename vp9_idct_x86.c · c5b127af
      Johann authored
      Remove similarly named header file. It is obsolete.
      
      Move file to match naming style.
      
      Adjust make file to include the file correctly and remove extra
      unnecessary #if guard.
      
      Change-Id: Ifba07ba9938a5df08a9f4eda54a3ac4d6983f7bf
  35. 27 Mar, 2013 2 commits
    • Modify idct code to use macro · c6c0657c
      Yunqing Wang authored
      Small modification of idct code.
      
      Change-Id: I5c4e3223944c68e4ccf762f6cf07c990250e4290
    • Optimize 32x32 idct function · 21a718d9
      Yunqing Wang authored
      Wrote an SSE2 version of the vp9_short_idct_32x32 function. Compared
      to the C version, the SSE2 version is 5X faster.
      
      Change-Id: I071ab7378358346ab4d9c6e2980f713c3c209864
  36. 21 Mar, 2013 1 commit
    • Optimize 16x16 idct10 function · 869d6c05
      Yunqing Wang authored
      Wrote an SSE2 version of the vp9_short_idct10_16x16 function. Compared
      to the C version, the SSE2 version is 2.3X faster.
      
      Change-Id: I314c4f09369648721798321eeed6f58e38857f26