1. 25 Jul, 2013 1 commit
    • Jingning Han's avatar
      SSE2 inverse 4x4 2D-DCT with DC only · 384e37e3
      Jingning Han authored
      Add SSE2 implementation to handle the special case of inverse 2D-DCT
      where only DC coefficient is non-zero.
      
      Change-Id: I2c6a59e21e5e77b8cf39a4af5eecf4d5ade32e2f
      384e37e3
  2. 24 Jul, 2013 1 commit
  3. 18 Jul, 2013 1 commit
  4. 17 Jul, 2013 1 commit
    • Johann's avatar
      vp9_convolve8_neon placeholder · 59dc4e9c
      Johann authored
      Call the individually optimized horizontal and vertical functions. This
      implementation abuses the temp buffer.
      
      This will be replaced with a custom optimized function.
      
      Over 2x speedup.
      
      Change-Id: I5b908d2a73d264e9810d6022bbff73207a3055dd
      59dc4e9c
  5. 16 Jul, 2013 1 commit
    • Jingning Han's avatar
      SSE2 16x16 inverse ADST/DCT hybrid transform · d05f66aa
      Jingning Han authored
      This commit enables SSE2 implementation of 16x16 inverse ADST/DCT
      hybrid transform. The runtime goes from 5742 cycles -> 1821 cycles.
      This provides about 1% encoding speed-up at speed 0.
      
      Change-Id: I1678d0988bf30b9efd524877705bbb3645edb17b
      d05f66aa
  6. 13 Jul, 2013 1 commit
    • Jingning Han's avatar
      SSE2 8x8 inverse ADST/DCT transform · 91365add
      Jingning Han authored
      This commit enables SSE2 implementation of 8x8 inverse ADST/DCT
      transform. The runtime goes from 1216 cycles -> 266 cycles.
      For bus_cif at 2000 kbps, the overall runtime reduces from
      253707ms -> 248430ms, i.e., 2% speed-up at speed 0.
      
      Change-Id: Ib0372e17e9162d7b11a10d653b1c8be547c878fb
      91365add
  7. 12 Jul, 2013 1 commit
    • Johann's avatar
      vp9_convolve8_[horiz|vert]_avg · a15bebfc
      Johann authored
      Super basic conversion from the other implementations. Any changes to
      one should be trivial to copy over keep in sync.
      
      Change-Id: I1720b4128e0aba4b2779e3761f6494f8a09d3ea8
      a15bebfc
  8. 11 Jul, 2013 4 commits
  9. 10 Jul, 2013 6 commits
  10. 09 Jul, 2013 2 commits
    • Frank Galligan's avatar
      Add Neon horizontal and vertical vp9_mbloop_filter · 198fa6d0
      Frank Galligan authored
      - The vp9 mbfilter C code will branch on flat and mask. This CL
        will perform both branches and combine the data. A later CL will
        perform a check to see if all patch will take one branch.
      - These functions are about 1.75 times faster than the C code on
        Nexus 7.
      
      PS #3
      - Changed all functions to dub limit, blimit, and thresh from
        vld {dx[]}, freeing up r4-r6.
      - Changed code to use vbif to reduce one instruction and free
        up a d register.
      
      Change-Id: I028dae0e434dc9891c3677bdb182e201ffb04777
      198fa6d0
    • Ronald S. Bultje's avatar
      Make intra prediction pointers RTCD-based. · 8350e7fe
      Ronald S. Bultje authored
      This probably has a mildly negative impact on performance, but will
      (in future commits - or possibly merged with this one) allow SIMD
      implementations of individual intra prediction functions. We may
      perhaps want to consider having separate functions per txfm-size
      also (i.e. 4x4, 8x8, 16x16 and 32x32 intra prediction functions for
      each intra prediction mode), but I haven't played much with that
      yet.
      
      Change-Id: Ie739985eee0a3fcbb7aed29ee6910fdb653ea269
      8350e7fe
  11. 01 Jul, 2013 2 commits
    • Ronald S. Bultje's avatar
      Update quantize SSSE3 SIMD to cover 32x32 transform case also. · c8defcfd
      Ronald S. Bultje authored
      Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to
      2min10.1, i.e. a 2.3% overall speed increase.
      
      Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87
      c8defcfd
    • Ronald S. Bultje's avatar
      Quantize (64-bit only, for now) SSSE3 SIMD. · 7353ceab
      Ronald S. Bultje authored
      Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
      goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
      x86-64 only, it needs some minor modifications to be 32bit compatible,
      because it uses 15 xmm registers, whereas 32bit only has 8.
      
      Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
      7353ceab
  12. 29 Jun, 2013 3 commits
  13. 28 Jun, 2013 1 commit
    • Ronald S. Bultje's avatar
      Make coefficient skip condition an explicit RD choice. · af660715
      Ronald S. Bultje authored
      This commit replaces zrun_zbin_boost, a method of biasing non-zero
      coefficients following runs of zero-coefficients to be rounded towards
      zero, with an explicit skip-block choice in the RD loop.
      
      The logic is basically that if individual coefficients should be rounded
      towards zero (from a RD point of view), the trellis/optimize loop should
      take care of it. If whole blocks should be zero (from a RD point of
      view), a single RD check is much more efficient than a complete
      serialization of the quantization loop.
      
      Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim.
      SIMD for quantize will follow in a separate patch. Results for other
      test sets pending.
      
      Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4
      af660715
  14. 27 Jun, 2013 1 commit
    • Frank Galligan's avatar
      Add Neon optimized loop filter functions. · 1d6dc1b7
      Frank Galligan authored
      - Added vp9_loop_filter_horizontal_edge_neon and
        vp9_loop_filter_vertical_edge_neon.
      - The functions are based off the vp8 loopfilter
        functions.
      - Matches x86 md5 checksum.
      
      Change-Id: Id1c4dddb03584227e5ecd29f574a6ac27738fdd0
      1d6dc1b7
  15. 25 Jun, 2013 3 commits
    • Jingning Han's avatar
      Refactor intra predictor block · d19ea386
      Jingning Han authored
      Remove vp9_intra4x4_predict(). Use the common intra prediction
      function for all block sizes.
      
      Change-Id: Ibd19d51dfa3da8bbdfb79ddeb81530b2e2089560
      d19ea386
    • Ronald S. Bultje's avatar
      Add averaging-SAD functions for 8-point comp-inter motion search. · c24d9223
      Ronald S. Bultje authored
      Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2,
      i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc
      the variance of the averaging predictor. This is slightly suboptimal
      because the function is subpixel-position-aware, but it will (at least
      for the SSE2 version) not actually use a bilinear filter for a full-pixel
      position, thus leading to approximately the same performance compared to
      if we implemented an actual average-aware full-pixel variance function.
      That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus
      leading to a total gain of 2.7%.
      
      Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd
      c24d9223
    • Jingning Han's avatar
      Enable sse2 implmentation of 8x8 ADST/DCT · a32a086d
      Jingning Han authored
      This commit makes use of the butterfly structure to enable the sse2
      version implementation of 8x8 ADST/DCT hybrid transform coding.
      
      The runtime of hybrid transform module goes down from 1170 cycles
      to 245 cycles. Overall speed-up around 1.5%.
      
      Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
      a32a086d
  16. 21 Jun, 2013 3 commits
    • John Koleszar's avatar
      Remove unused vp9_build_intra_predictors_sb{y,uv}_s · 9e7019f7
      John Koleszar authored
      The functions no longer referenced.
      
      Change-Id: If2705dfbc607f79ec8ec2242d5e03bec27a35aaf
      9e7019f7
    • Ronald S. Bultje's avatar
      Implement SSE2 block_error. · 54b2a596
      Ronald S. Bultje authored
      Change vp9_block_error() to return a 64bit error variable, change all
      callers to expect a 64bit return value (this will prevent overflows,
      which we basically don't check for at all right now). Remove duplicate
      block_error() function, which fixed that through truncation. Remove
      old (incompatible) mmx/sse2 block_error SIMD versions and replace with
      a new one that returns a 64bit value.
      
      Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to
      3min23, i.e. a 3% overall speedup.
      
      Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68
      54b2a596
    • Ronald S. Bultje's avatar
      Add subtract_block SSE2 version and unit test. · 25c588b1
      Ronald S. Bultje authored
      3% faster overall (3min35.0 to 3min28.5).
      
      Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e
      25c588b1
  17. 20 Jun, 2013 2 commits
    • Ronald S. Bultje's avatar
      SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance(). · 1e6a32f1
      Ronald S. Bultje authored
      Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to
      3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions
      which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't
      perfectly interleaved, and can probably be improved further in the
      future. I've marked this with a few TODOs/FIXMEs in the code.
      
      Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9
      1e6a32f1
    • Ronald S. Bultje's avatar
      Implement sse2 and ssse3 versions for all sub_pixel_variance sizes. · 8fb6c581
      Ronald S. Bultje authored
      Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 ->
      3min58). Specific changes to timings for each function compared to
      original assembly-optimized versions (or just new version timings if
      no previous assembly-optimized version was available):
      
      sse2   4x4:    99 ->   82 cycles
      sse2   4x8:           128 cycles
      sse2   8x4:           121 cycles
      sse2   8x8:   149 ->  129 cycles
      sse2   8x16:  235 ->  245 cycles (?)
      sse2  16x8:   269 ->  203 cycles
      sse2  16x16:  441 ->  349 cycles
      sse2  16x32:          641 cycles
      sse2  32x16:          643 cycles
      sse2  32x32: 1733 -> 1154 cycles
      sse2  32x64:         2247 cycles
      sse2  64x32:         2323 cycles
      sse2  64x64: 6984 -> 4442 cycles
      
      ssse3  4x4:           100 cycles (?)
      ssse3  4x8:           103 cycles
      ssse3  8x4:            71 cycles
      ssse3  8x8:           147 cycles
      ssse3  8x16:          158 cycles
      ssse3 16x8:   188 ->  162 cycles
      ssse3 16x16:  316 ->  273 cycles
      ssse3 16x32:          535 cycles
      ssse3 32x16:          564 cycles
      ssse3 32x32:          973 cycles
      ssse3 32x64:         1930 cycles
      ssse3 64x32:         1922 cycles
      ssse3 64x64:         3760 cycles
      
      Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d
      8fb6c581
  18. 18 Jun, 2013 1 commit
    • Jingning Han's avatar
      Make fdct32 computation flow within 16bit range · a41a4860
      Jingning Han authored
      This commit makes use of dual fdct32x32 versions for rate-distortion
      optimization loop and encoding process, respectively. The one for
      rd loop requires only 16 bits precision for intermediate steps.
      The original fdct32x32 that allows higher intermediate precision (18
      bits) was retained for the encoding process only.
      
      This allows speed-up for fdct32x32 in the rd loop. No performance
      loss observed.
      
      Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3
      a41a4860
  19. 14 Jun, 2013 1 commit
    • Jingning Han's avatar
      Enable sse2 version of sad8x4/4x8 · c43af9a8
      Jingning Han authored
      The encoding time for bus at CIF goes from 661s to 625s. This commit
      also enabled unit test of sad8x4/4x8 in sad_test.cc.
      
      Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
      c43af9a8
  20. 13 Jun, 2013 1 commit
    • Jingning Han's avatar
      Enable sse2 version of sad8x4/4x8 · 15f50e7b
      Jingning Han authored
      The encoding time for bus at CIF goes from 661s to 625s. This commit
      also enabled unit test of sad8x4/4x8 in sad_test.cc.
      
      Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
      15f50e7b
  21. 12 Jun, 2013 3 commits