  1. 15 Aug, 2013 1 commit
  2. 14 Aug, 2013 2 commits
  3. 12 Aug, 2013 1 commit
    • SSE2 high precision 32x32 forward DCT · 78136edc
      Jingning Han authored
      Enable the SSE2 implementation of the high precision 32x32 forward
      DCT. The intermediate stacks are 32 bits wide. The run time drops
      from 32126 cycles to 13442 cycles.
      
      Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56
  4. 07 Aug, 2013 1 commit
  5. 06 Aug, 2013 6 commits
  6. 05 Aug, 2013 2 commits
  7. 02 Aug, 2013 2 commits
  8. 01 Aug, 2013 1 commit
    • Remove unused vp9_short_idct10_32x32_add · 67719abd
      Jingning Han authored
      In its first 1-D pass, the inverse 32x32 transform detects all-zero
      entries in groups of 8 rows and skips the corresponding computations.
      The function vp9_short_idct10_32x32_add behaves differently and is
      not used anywhere, so it is removed.
      
      Change-Id: Ic4fad422debbde7b6b6ffed47c69fbd4268a906c
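The zero-detection idea described above can be sketched in plain C. This is only an illustration under stated assumptions: `first_pass_skip_zero`, `all_zero`, and the `double_row` stand-in transform are hypothetical names, not functions from the codebase, and the real 1-D transform is far more involved.

```c
#include <stdint.h>
#include <string.h>

#define N 32     /* transform size */
#define GROUP 8  /* rows checked together, per the commit message */

/* Hypothetical signature for a 1-D row transform of N coefficients. */
typedef void (*row_transform_fn)(const int16_t *in, int16_t *out);

/* Stand-in transform for illustration only: doubles every coefficient. */
void double_row(const int16_t *in, int16_t *out) {
  for (int i = 0; i < N; ++i) out[i] = (int16_t)(in[i] * 2);
}

/* Returns 1 when all `count` coefficients are zero. */
int all_zero(const int16_t *c, int count) {
  for (int i = 0; i < count; ++i)
    if (c[i] != 0) return 0;
  return 1;
}

/* First (row) pass: skip each all-zero group of GROUP rows.
 * Returns how many rows were actually transformed. */
int first_pass_skip_zero(const int16_t *input, int16_t *output,
                         row_transform_fn transform) {
  int rows_done = 0;
  for (int g = 0; g < N; g += GROUP) {
    const int16_t *src = input + g * N;
    int16_t *dst = output + g * N;
    if (all_zero(src, GROUP * N)) {
      /* A linear transform maps zero input to zero output. */
      memset(dst, 0, GROUP * N * sizeof(*dst));
      continue;
    }
    for (int r = 0; r < GROUP; ++r, ++rows_done)
      transform(src + r * N, dst + r * N);
  }
  return rows_done;
}
```

With a single non-zero coefficient in row 3, only the first group of 8 rows is transformed and the other 24 rows are filled with zeros directly.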
  9. 29 Jul, 2013 1 commit
    • 16x16 inverse 2D-DCT with DC only · a7c4de22
      Jingning Han authored
      This commit adds special handling for the 16x16 inverse 2D-DCT when
      only the DC coefficient is quantized to a non-zero value.
      
      Change-Id: I7bf71be7fa13384fab453dc8742b5b50e77a277c
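A DC-only shortcut works because a block whose only non-zero coefficient is DC reconstructs to a constant, so the full transform collapses to one scaled value added to every pixel. The sketch below assumes 14-bit fixed-point constants and a final 6-bit output shift; the function name and those scaling choices are assumptions for illustration, not taken from the commit.

```c
#include <stdint.h>

#define DCT_CONST_BITS 14   /* assumed fixed-point precision */
#define COSPI_16_64 11585   /* round(cos(pi/4) * 2^14) */

/* Rounding right shift; relies on arithmetic shift for negative values. */
int round_shift(int x, int bits) {
  return (x + (1 << (bits - 1))) >> bits;
}

uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical DC-only path: add one constant to a 16x16 prediction block. */
void idct16x16_dc_only_add(const int16_t *coeffs, uint8_t *dest, int stride) {
  /* Apply the DC basis scaling once per 1-D pass (rows, then columns). */
  int dc = round_shift(coeffs[0] * COSPI_16_64, DCT_CONST_BITS);
  dc = round_shift(dc * COSPI_16_64, DCT_CONST_BITS);
  /* Assumed final output scaling for a 16x16 block. */
  const int offset = round_shift(dc, 6);

  for (int r = 0; r < 16; ++r) {
    for (int c = 0; c < 16; ++c)
      dest[c] = clip_pixel(dest[c] + offset);
    dest += stride;
  }
}
```

Under these assumptions a DC coefficient of 1024 adds 8 to every pixel, with the result clamped to the 8-bit range.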
  10. 26 Jul, 2013 2 commits
  11. 25 Jul, 2013 1 commit
    • SSE2 inverse 4x4 2D-DCT with DC only · 384e37e3
      Jingning Han authored
      Add an SSE2 implementation to handle the special case of the inverse
      2D-DCT where only the DC coefficient is non-zero.
      
      Change-Id: I2c6a59e21e5e77b8cf39a4af5eecf4d5ade32e2f
  12. 24 Jul, 2013 1 commit
  13. 18 Jul, 2013 1 commit
  14. 17 Jul, 2013 1 commit
    • vp9_convolve8_neon placeholder · 59dc4e9c
      Johann authored
      Call the individually optimized horizontal and vertical functions. This
      implementation abuses the temp buffer.
      
      This will be replaced with a custom optimized function.
      
      Over 2x speedup.
      
      Change-Id: I5b908d2a73d264e9810d6022bbff73207a3055dd
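The placeholder approach, a 2-D convolution built from separate horizontal and vertical passes through a temporary buffer, can be sketched in scalar C. All names here are hypothetical, and the 8-tap, 7-bit filter format and the 64x64 size limit are assumptions made for the example, not details confirmed by the commit.

```c
#include <stdint.h>

#define TAPS 8
#define FILTER_BITS 7  /* assumed: taps sum to 128 */
#define MAX_DIM 64     /* assumed maximum block width and height */

uint8_t clip_u8(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Horizontal 8-tap pass; reads src[x - 3 .. x + 4] for every output x. */
void filter_horiz(const uint8_t *src, int src_stride, uint8_t *dst,
                  int dst_stride, const int16_t *f, int w, int h) {
  for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x) {
      int sum = 0;
      for (int k = 0; k < TAPS; ++k)
        sum += src[y * src_stride + x + k - 3] * f[k];
      dst[y * dst_stride + x] =
          clip_u8((sum + (1 << (FILTER_BITS - 1))) >> FILTER_BITS);
    }
}

/* Vertical 8-tap pass; reads rows y - 3 .. y + 4 for every output y. */
void filter_vert(const uint8_t *src, int src_stride, uint8_t *dst,
                 int dst_stride, const int16_t *f, int w, int h) {
  for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x) {
      int sum = 0;
      for (int k = 0; k < TAPS; ++k)
        sum += src[(y + k - 3) * src_stride + x] * f[k];
      dst[y * dst_stride + x] =
          clip_u8((sum + (1 << (FILTER_BITS - 1))) >> FILTER_BITS);
    }
}

/* 2-D filter: horizontal pass into temp (h + 7 rows), then vertical pass.
 * Requires w <= MAX_DIM and h <= MAX_DIM. */
void convolve8_2d(const uint8_t *src, int src_stride, uint8_t *dst,
                  int dst_stride, const int16_t *fx, const int16_t *fy,
                  int w, int h) {
  uint8_t temp[MAX_DIM * (MAX_DIM + TAPS - 1)];
  /* Start 3 rows above so the vertical taps have the context they need. */
  filter_horiz(src - 3 * src_stride, src_stride, temp, MAX_DIM, fx, w,
               h + TAPS - 1);
  filter_vert(temp + 3 * MAX_DIM, MAX_DIM, dst, dst_stride, fy, w, h);
}
```

The intermediate rounding to 8 bits between the two passes is the cost of reusing the 1-D functions unchanged; a dedicated 2-D function could avoid the round trip through the temporary buffer.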
  15. 16 Jul, 2013 1 commit
    • SSE2 16x16 inverse ADST/DCT hybrid transform · d05f66aa
      Jingning Han authored
      This commit enables the SSE2 implementation of the 16x16 inverse
      ADST/DCT hybrid transform. The runtime drops from 5742 cycles to
      1821 cycles. This provides about a 1% encoding speed-up at speed 0.
      
      Change-Id: I1678d0988bf30b9efd524877705bbb3645edb17b
  16. 13 Jul, 2013 1 commit
    • SSE2 8x8 inverse ADST/DCT transform · 91365add
      Jingning Han authored
      This commit enables the SSE2 implementation of the 8x8 inverse
      ADST/DCT transform. The runtime drops from 1216 cycles to 266 cycles.
      For bus_cif at 2000 kbps, the overall runtime drops from
      253707 ms to 248430 ms, i.e., a 2% speed-up at speed 0.
      
      Change-Id: Ib0372e17e9162d7b11a10d653b1c8be547c878fb
  17. 12 Jul, 2013 1 commit
    • vp9_convolve8_[horiz|vert]_avg · a15bebfc
      Johann authored
      Super basic conversion from the other implementations. Any changes to
      one should be trivial to copy over to keep them in sync.
      
      Change-Id: I1720b4128e0aba4b2779e3761f6494f8a09d3ea8
  18. 11 Jul, 2013 4 commits
  19. 10 Jul, 2013 6 commits
  20. 09 Jul, 2013 2 commits
    • Add Neon horizontal and vertical vp9_mbloop_filter · 198fa6d0
      Frank Galligan authored
      - The vp9 mbfilter C code branches on flat and mask. This CL
        performs both branches and combines the data. A later CL will
        check whether the whole patch takes a single branch.
      - These functions are about 1.75 times faster than the C code on
        Nexus 7.
      
      PS #3
      - Changed all functions to dup limit, blimit, and thresh from
        vld {dx[]}, freeing up r4-r6.
      - Changed code to use vbif to reduce one instruction and free
        up a d register.
      
      Change-Id: I028dae0e434dc9891c3677bdb182e201ffb04777
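Computing both branches and then combining them is a standard way to remove data-dependent branches from SIMD code. A scalar C sketch of the combine step follows; the function names and the 0xFF/0x00 mask convention are assumptions for illustration. On NEON, a bitwise select or insert instruction such as vbif does this for a whole register at once.

```c
#include <stdint.h>

/* Per-byte select: mask is 0xFF or 0x00, as a SIMD compare would produce. */
uint8_t select_u8(uint8_t mask, uint8_t if_set, uint8_t if_clear) {
  return (uint8_t)((if_set & mask) | (if_clear & ~mask));
}

/* Hypothetical combine step: both filter results were computed for every
 * pixel, and the flat mask picks which one is kept. */
void combine_branches(const uint8_t *flat_mask, const uint8_t *flat_result,
                      const uint8_t *normal_result, uint8_t *out, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = select_u8(flat_mask[i], flat_result[i], normal_result[i]);
}
```

The trade-off is extra arithmetic for pixels whose result is discarded, which is why a later check for the whole patch taking a single branch can still pay off.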
    • Make intra prediction pointers RTCD-based. · 8350e7fe
      Ronald S. Bultje authored
      This probably has a mildly negative impact on performance, but will
      (in future commits - or possibly merged with this one) allow SIMD
      implementations of individual intra prediction functions. We may
      perhaps want to consider having separate functions per txfm-size
      also (i.e. 4x4, 8x8, 16x16 and 32x32 intra prediction functions for
      each intra prediction mode), but I haven't played much with that
      yet.
      
      Change-Id: Ie739985eee0a3fcbb7aed29ee6910fdb653ea269
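Run-time CPU detection (RTCD) dispatch generally means calling through function pointers that are set once at start-up. The sketch below shows that pattern for a single DC predictor; every name in it (`dc_pred`, `setup_rtcd`, `HAS_SIMD`, the stub) is hypothetical, and the "SIMD" version simply forwards to the C code so the example stays portable.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature shared by all versions of one predictor. */
typedef void (*intra_pred_fn)(uint8_t *dst, ptrdiff_t stride, int size,
                              const uint8_t *above, const uint8_t *left);

enum { HAS_SIMD = 1 };  /* hypothetical CPU capability flag */

/* Plain C DC predictor: fill the block with the rounded neighbor average. */
void dc_pred_c(uint8_t *dst, ptrdiff_t stride, int size,
               const uint8_t *above, const uint8_t *left) {
  int sum = 0;
  for (int i = 0; i < size; ++i) sum += above[i] + left[i];
  const uint8_t dc = (uint8_t)((sum + size) / (2 * size));
  for (int r = 0; r < size; ++r)
    for (int c = 0; c < size; ++c) dst[r * stride + c] = dc;
}

/* Stand-in for an optimized version; a real one would use intrinsics. */
void dc_pred_simd_stub(uint8_t *dst, ptrdiff_t stride, int size,
                       const uint8_t *above, const uint8_t *left) {
  dc_pred_c(dst, stride, size, above, left);
}

/* Callers always go through this pointer instead of a fixed function. */
intra_pred_fn dc_pred = dc_pred_c;

/* Called once at start-up with the detected capability flags. */
void setup_rtcd(int cpu_flags) {
  dc_pred = dc_pred_c;
  if (cpu_flags & HAS_SIMD) dc_pred = dc_pred_simd_stub;
}
```

The indirect call is the likely source of the small performance cost the message mentions; the benefit is that each predictor can later gain an optimized version without touching its callers.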
  21. 01 Jul, 2013 2 commits
    • Update quantize SSSE3 SIMD to cover 32x32 transform case also. · c8defcfd
      Ronald S. Bultje authored
      Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to
      2min10.1, i.e. a 2.3% overall speed increase.
      
      Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87
    • Quantize (64-bit only, for now) SSSE3 SIMD. · 7353ceab
      Ronald S. Bultje authored
      Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
      goes from 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The
      code is x86-64 only; it needs some minor modifications to be 32bit
      compatible, because it uses 15 xmm registers, whereas 32bit only has 8.
      
      Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904