  1. 06 Aug, 2013 2 commits
    • block error / x86inc mods · 62c6aa88
      Jim Bankoski authored
      Change-Id: Icb607745634e10b9bac5019d06661ece09fcdb40
    • reworked config for use_x86_inc · a93b115c
      Jim Bankoski authored
      Support enabling or disabling it. Moved the read out to configure.sh
      so that it's done once instead of in both make and config.
      
      Change-Id: I73a9190cf31de9f03e8a577f478fa522f8c01c8b
  2. 05 Aug, 2013 2 commits
  3. 02 Aug, 2013 2 commits
  4. 01 Aug, 2013 1 commit
    • Remove unused vp9_short_idct10_32x32_add · 67719abd
      Jingning Han authored
      The inverse 32x32 transform detects all zero entries and skips the
      computations accordingly per 8 rows in the first 1-D operation. The
      function vp9_short_idct10_32x32_add performs differently and is not
      used anywhere, hence removed.
      
      Change-Id: Ic4fad422debbde7b6b6ffed47c69fbd4268a906c
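
      The zero-row skipping described above can be sketched as follows. This
      is a minimal illustration, not the libvpx code: the function names, the
      row_transform_fn signature, and the double_row stand-in (used in place
      of a real 1-D inverse DCT) are all assumptions. It relies only on the
      fact that a linear transform of an all-zero input is all zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature for a 1-D row transform of length n. */
typedef void (*row_transform_fn)(const int16_t *in, int16_t *out, int n);

/* Stand-in for a real 1-D inverse DCT so the sketch is self-contained:
 * it only doubles each coefficient. */
void double_row(const int16_t *in, int16_t *out, int n) {
  for (int i = 0; i < n; ++i) out[i] = (int16_t)(in[i] * 2);
}

/* Returns 1 when all `count` coefficients are zero. */
static int all_zero(const int16_t *coeffs, int count) {
  int acc = 0;
  for (int i = 0; i < count; ++i) acc |= coeffs[i];
  return acc == 0;
}

/* First (row) pass over an n x n block, processed in groups of 8 rows.
 * A group with no non-zero input is written as zeros without calling the
 * transform. Returns the number of groups that were actually transformed. */
int first_pass_skip_zero_rows(const int16_t *input, int16_t *output, int n,
                              row_transform_fn transform) {
  int groups_done = 0;
  assert(n % 8 == 0);
  for (int r = 0; r < n; r += 8) {
    const int16_t *in = input + r * n;
    int16_t *out = output + r * n;
    if (all_zero(in, 8 * n)) {
      memset(out, 0, 8 * (size_t)n * sizeof(*out));
      continue;
    }
    for (int i = 0; i < 8; ++i) transform(in + i * n, out + i * n, n);
    ++groups_done;
  }
  return groups_done;
}
```

      With a 32x32 block whose only non-zero coefficient sits in row 5, only
      the first group of 8 rows is transformed and the other three are
      zero-filled.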
  5. 29 Jul, 2013 1 commit
    • 16x16 inverse 2D-DCT with DC only · a7c4de22
      Jingning Han authored
      This commit provides special handling for the 16x16 inverse 2D-DCT when
      only the DC coefficient is quantized to a non-zero value.
      
      Change-Id: I7bf71be7fa13384fab453dc8742b5b50e77a277c
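
      When only the DC coefficient is non-zero, every output pixel of the
      inverse transform receives the same offset, so the full 2-D transform
      can be replaced by one scalar computation and a fill loop. The sketch
      below shows the idea. The function name, the 14-bit fixed-point
      constant, and the final shift by 6 are assumptions modelled on common
      fixed-point DCT conventions, not taken from this commit.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point convention: cos(pi/4) scaled by 2^14. */
#define COSPI_16_64 11585
#define DCT_CONST_BITS 14

/* Rounding right shift. For negative values this relies on an arithmetic
 * right shift, which is implementation-defined in C but common in practice. */
static int round_shift(int value, int bits) {
  return (value + (1 << (bits - 1))) >> bits;
}

static uint8_t clip_pixel(int value) {
  return (uint8_t)(value < 0 ? 0 : (value > 255 ? 255 : value));
}

/* Adds the reconstruction of a DC-only 16x16 block to `dest`. */
void idct16x16_dc_only_add(int16_t dc, uint8_t *dest, int stride) {
  assert(stride >= 16);
  int out = round_shift(dc * COSPI_16_64, DCT_CONST_BITS); /* row pass */
  out = round_shift(out * COSPI_16_64, DCT_CONST_BITS);    /* column pass */
  const int offset = round_shift(out, 6);                  /* final scaling */
  for (int r = 0; r < 16; ++r) {
    for (int c = 0; c < 16; ++c) dest[c] = clip_pixel(dest[c] + offset);
    dest += stride;
  }
}
```

      Under these assumed constants a DC value of 1024 adds 8 to every pixel
      of the block, with the usual clamping to the 8-bit range.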
  6. 26 Jul, 2013 2 commits
  7. 25 Jul, 2013 1 commit
    • SSE2 inverse 4x4 2D-DCT with DC only · 384e37e3
      Jingning Han authored
      Add an SSE2 implementation to handle the special case of the inverse
      2D-DCT where only the DC coefficient is non-zero.
      
      Change-Id: I2c6a59e21e5e77b8cf39a4af5eecf4d5ade32e2f
  8. 24 Jul, 2013 1 commit
  9. 18 Jul, 2013 1 commit
  10. 17 Jul, 2013 1 commit
    • vp9_convolve8_neon placeholder · 59dc4e9c
      Johann authored
      Call the individually optimized horizontal and vertical functions. This
      implementation abuses the temp buffer.
      
      This will be replaced with a custom optimized function.
      
      Over 2x speedup.
      
      Change-Id: I5b908d2a73d264e9810d6022bbff73207a3055dd
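
      Building a 2-D convolution out of separate horizontal and vertical
      passes through an intermediate buffer can be sketched in plain C as
      below. This is not the NEON code from the commit: the function names,
      the fixed 8-tap kernels (no sub-pixel stepping), and the 64x64 size
      limit are assumptions made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAPS 8
#define FILTER_BITS 7 /* taps are assumed to sum to 1 << FILTER_BITS */
#define MAX_BLOCK 64

static uint8_t clip_pixel(int value) {
  return (uint8_t)(value < 0 ? 0 : (value > 255 ? 255 : value));
}

/* 8-tap horizontal filter; the taps cover src[x - 3] .. src[x + 4]. */
static void filter_horiz(const uint8_t *src, ptrdiff_t src_stride,
                         uint8_t *dst, ptrdiff_t dst_stride,
                         const int16_t *taps, int w, int h) {
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      int sum = 1 << (FILTER_BITS - 1); /* rounding term */
      for (int k = 0; k < TAPS; ++k) sum += src[x + k - 3] * taps[k];
      dst[x] = clip_pixel(sum >> FILTER_BITS);
    }
    src += src_stride;
    dst += dst_stride;
  }
}

/* 8-tap vertical filter; the taps cover rows y - 3 .. y + 4. */
static void filter_vert(const uint8_t *src, ptrdiff_t src_stride,
                        uint8_t *dst, ptrdiff_t dst_stride,
                        const int16_t *taps, int w, int h) {
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      int sum = 1 << (FILTER_BITS - 1);
      for (int k = 0; k < TAPS; ++k)
        sum += src[x + (k - 3) * src_stride] * taps[k];
      dst[x] = clip_pixel(sum >> FILTER_BITS);
    }
    src += src_stride;
    dst += dst_stride;
  }
}

/* Horizontal pass into a temp buffer with 7 extra rows, then vertical pass. */
void convolve8_2d(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst,
                  ptrdiff_t dst_stride, const int16_t *taps_x,
                  const int16_t *taps_y, int w, int h) {
  uint8_t temp[MAX_BLOCK * (MAX_BLOCK + TAPS - 1)];
  assert(w > 0 && w <= MAX_BLOCK && h > 0 && h <= MAX_BLOCK);
  filter_horiz(src - 3 * src_stride, src_stride, temp, MAX_BLOCK, taps_x, w,
               h + TAPS - 1);
  filter_vert(temp + 3 * MAX_BLOCK, MAX_BLOCK, dst, dst_stride, taps_y, w, h);
}
```

      The caller must provide 3 readable pixels to the left of and above the
      block, and 4 to the right of and below it. Rounding to 8 bits between
      the two passes keeps the sketch simple at a small cost in precision.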
  11. 16 Jul, 2013 1 commit
    • SSE2 16x16 inverse ADST/DCT hybrid transform · d05f66aa
      Jingning Han authored
      This commit enables SSE2 implementation of 16x16 inverse ADST/DCT
      hybrid transform. The runtime goes from 5742 cycles -> 1821 cycles.
      This provides about 1% encoding speed-up at speed 0.
      
      Change-Id: I1678d0988bf30b9efd524877705bbb3645edb17b
  12. 13 Jul, 2013 1 commit
    • SSE2 8x8 inverse ADST/DCT transform · 91365add
      Jingning Han authored
      This commit enables SSE2 implementation of 8x8 inverse ADST/DCT
      transform. The runtime goes from 1216 cycles -> 266 cycles.
      For bus_cif at 2000 kbps, the overall runtime reduces from
      253707ms -> 248430ms, i.e., 2% speed-up at speed 0.
      
      Change-Id: Ib0372e17e9162d7b11a10d653b1c8be547c878fb
  13. 12 Jul, 2013 1 commit
    • vp9_convolve8_[horiz|vert]_avg · a15bebfc
      Johann authored
      Super basic conversion from the other implementations. Any changes to
      one should be trivial to copy over to keep them in sync.
      
      Change-Id: I1720b4128e0aba4b2779e3761f6494f8a09d3ea8
  14. 11 Jul, 2013 4 commits
  15. 10 Jul, 2013 6 commits
  16. 09 Jul, 2013 2 commits
    • Add Neon horizontal and vertical vp9_mbloop_filter · 198fa6d0
      Frank Galligan authored
      - The vp9 mbfilter C code will branch on flat and mask. This CL
        will perform both branches and combine the data. A later CL will
        perform a check to see if the whole patch will take one branch.
      - These functions are about 1.75 times faster than the C code on
        Nexus 7.
      
      PS #3
      - Changed all functions to dup limit, blimit, and thresh from
        vld {dx[]}, freeing up r4-r6.
      - Changed code to use vbif to save one instruction and free
        up a d register.
      
      Change-Id: I028dae0e434dc9891c3677bdb182e201ffb04777
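
      The "perform both branches and combine the data" approach can be
      illustrated in scalar C. Both candidate results are computed
      unconditionally and a per-lane mask (all ones or all zeros, as SIMD
      compare instructions produce) picks between them, which is the kind of
      bitwise select that vbif performs. The function and parameter names
      below are hypothetical, and the filtered inputs are assumed to have
      been computed already.

```c
#include <assert.h>
#include <stdint.h>

/* Per-bit select: bits set in `mask` come from `if_set`, the remaining
 * bits come from `if_clear`. */
static uint8_t select_bits(uint8_t mask, uint8_t if_set, uint8_t if_clear) {
  return (uint8_t)((if_set & mask) | (if_clear & ~mask));
}

/* Combines two precomputed filter results lane by lane without branching.
 * flat_mask[i] is expected to be 0xFF or 0x00 for each lane. */
void combine_filter_paths(const uint8_t *flat_mask, const uint8_t *flat_path,
                          const uint8_t *normal_path, uint8_t *out, int n) {
  assert(n >= 0);
  for (int i = 0; i < n; ++i)
    out[i] = select_bits(flat_mask[i], flat_path[i], normal_path[i]);
}
```

      The trade-off is that both paths are always evaluated, which is why a
      later check for "everything takes one branch" can still save work.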
    • Make intra prediction pointers RTCD-based. · 8350e7fe
      Ronald S. Bultje authored
      This probably has a mildly negative impact on performance, but will
      (in future commits - or possibly merged with this one) allow SIMD
      implementations of individual intra prediction functions. We may
      perhaps want to consider having separate functions per txfm-size
      also (i.e. 4x4, 8x8, 16x16 and 32x32 intra prediction functions for
      each intra prediction mode), but I haven't played much with that
      yet.
      
      Change-Id: Ie739985eee0a3fcbb7aed29ee6910fdb653ea269
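
      Run-time CPU dispatch of this kind usually amounts to a function
      pointer that is set once at start-up, which is also where the extra
      indirect call (the likely source of the mild slowdown) comes from. The
      sketch below is an assumption about the general shape, not the
      generated RTCD code: the names, the capability flag, and the "SIMD"
      stub that simply forwards to the C version are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*intra_pred_fn)(uint8_t *dst, ptrdiff_t stride, int bs,
                              const uint8_t *above, const uint8_t *left);

/* Plain C DC predictor: fill the block with the rounded mean of the
 * `bs` pixels above and the `bs` pixels to the left. */
static void dc_predictor_c(uint8_t *dst, ptrdiff_t stride, int bs,
                           const uint8_t *above, const uint8_t *left) {
  int sum = 0;
  assert(bs > 0);
  for (int i = 0; i < bs; ++i) sum += above[i] + left[i];
  const uint8_t dc = (uint8_t)((sum + bs) / (2 * bs));
  for (int r = 0; r < bs; ++r) {
    for (int c = 0; c < bs; ++c) dst[c] = dc;
    dst += stride;
  }
}

/* Stand-in for an optimized version; here it only forwards to the C code. */
static void dc_predictor_simd_stub(uint8_t *dst, ptrdiff_t stride, int bs,
                                   const uint8_t *above, const uint8_t *left) {
  dc_predictor_c(dst, stride, bs, above, left);
}

#define CPU_HAS_SIMD 0x1 /* hypothetical capability flag */

/* Callers go through this pointer instead of naming an implementation. */
intra_pred_fn dc_predictor = dc_predictor_c;

/* Picks the best available implementation for the detected CPU. */
void setup_intra_rtcd(unsigned cpu_flags) {
  dc_predictor = dc_predictor_c;
  if (cpu_flags & CPU_HAS_SIMD) dc_predictor = dc_predictor_simd_stub;
}
```

      Splitting the pointers per transform size, as the message suggests,
      would turn `bs` into a compile-time constant inside each function.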
  17. 01 Jul, 2013 2 commits
    • Update quantize SSSE3 SIMD to cover 32x32 transform case also. · c8defcfd
      Ronald S. Bultje authored
      Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to
      2min10.1, i.e. a 2.3% overall speed increase.
      
      Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87
    • Quantize (64-bit only, for now) SSSE3 SIMD. · 7353ceab
      Ronald S. Bultje authored
      Total encoding time for the first 50 frames of bus (speed 0) @ 1500kbps
      goes from 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
      x86-64 only; it needs some minor modifications to be 32-bit compatible
      because it uses 15 xmm registers, whereas 32-bit mode only has 8.
      
      Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
  18. 29 Jun, 2013 3 commits
  19. 28 Jun, 2013 1 commit
    • Make coefficient skip condition an explicit RD choice. · af660715
      Ronald S. Bultje authored
      This commit replaces zrun_zbin_boost, a method of biasing non-zero
      coefficients following runs of zero-coefficients to be rounded towards
      zero, with an explicit skip-block choice in the RD loop.
      
      The logic is basically that if individual coefficients should be rounded
      towards zero (from a RD point of view), the trellis/optimize loop should
      take care of it. If whole blocks should be zero (from a RD point of
      view), a single RD check is much more efficient than a complete
      serialization of the quantization loop.
      
      Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim.
      SIMD for quantize will follow in a separate patch. Results for other
      test sets pending.
      
      Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4
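
      The single whole-block RD check described above boils down to comparing
      two rate-distortion costs. The sketch below uses the textbook
      Lagrangian form J = D + lambda * R. The encoder's actual cost macro
      uses its own fixed-point scaling, so the formula, function names, and
      parameters here are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified Lagrangian cost J = D + lambda * R (assumed form). */
static int64_t rd_cost(int64_t rate, int64_t distortion, int64_t lambda) {
  return distortion + lambda * rate;
}

/* Returns 1 when signalling the block as skipped (all coefficients zero)
 * has a lower RD cost than coding the quantized coefficients. */
int should_skip_block(int64_t rate_coded, int64_t dist_coded,
                      int64_t rate_skip, int64_t dist_skip, int64_t lambda) {
  assert(lambda >= 0);
  return rd_cost(rate_skip, dist_skip, lambda) <
         rd_cost(rate_coded, dist_coded, lambda);
}
```

      With a large lambda the cheap skip flag wins even at higher distortion,
      while a small lambda favours coding the coefficients.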
  20. 27 Jun, 2013 1 commit
    • Add Neon optimized loop filter functions. · 1d6dc1b7
      Frank Galligan authored
      - Added vp9_loop_filter_horizontal_edge_neon and
        vp9_loop_filter_vertical_edge_neon.
      - The functions are based off the vp8 loopfilter
        functions.
      - Matches x86 md5 checksum.
      
      Change-Id: Id1c4dddb03584227e5ecd29f574a6ac27738fdd0
  21. 25 Jun, 2013 3 commits
    • Refactor intra predictor block · d19ea386
      Jingning Han authored
      Remove vp9_intra4x4_predict(). Use the common intra prediction
      function for all block sizes.
      
      Change-Id: Ibd19d51dfa3da8bbdfb79ddeb81530b2e2089560
    • Add averaging-SAD functions for 8-point comp-inter motion search. · c24d9223
      Ronald S. Bultje authored
      Encode time for the first 50 frames of bus @ 1500kbps goes from 3min22.7
      to 3min18.2, i.e. 2.3% faster. In addition, use the sub_pixel_avg functions
      to calculate the variance of the averaging predictor. This is slightly suboptimal
      because the function is subpixel-position-aware, but it will (at least
      for the SSE2 version) not actually use a bilinear filter for a full-pixel
      position, thus leading to approximately the same performance compared to
      if we implemented an actual average-aware full-pixel variance function.
      That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus
      leading to a total gain of 2.7%.
      
      Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd
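
      An averaging SAD measures the source block against the rounded average
      of the candidate reference and a second, fixed predictor, rather than
      against the reference alone. A minimal scalar sketch follows. The
      function name and signature are hypothetical, and storing the second
      predictor contiguously (stride equal to the block width) is an
      assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between `src` and the rounded average of
 * `ref` and `second_pred`. `second_pred` is assumed to be stored
 * contiguously, i.e. with a stride of `w`. */
unsigned sad_avg(const uint8_t *src, int src_stride, const uint8_t *ref,
                 int ref_stride, const uint8_t *second_pred, int w, int h) {
  unsigned sad = 0;
  assert(w > 0 && h > 0);
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      const int avg = (ref[x] + second_pred[x] + 1) >> 1;
      sad += (unsigned)abs(src[x] - avg);
    }
    src += src_stride;
    ref += ref_stride;
    second_pred += w;
  }
  return sad;
}
```

      In a compound-prediction motion search the second predictor stays
      fixed while `ref` moves over the candidate positions.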
    • Enable sse2 implementation of 8x8 ADST/DCT · a32a086d
      Jingning Han authored
      This commit makes use of the butterfly structure to enable the sse2
      version implementation of 8x8 ADST/DCT hybrid transform coding.
      
      The runtime of hybrid transform module goes down from 1170 cycles
      to 245 cycles. Overall speed-up around 1.5%.
      
      Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
  22. 21 Jun, 2013 1 commit