1. 05 Dec, 2015 1 commit
  2. 04 Dec, 2015 4 commits
      Speed up h_predictor_16x16 · e86c7c86
      Jian Zhou authored
      Relocate the function from SSSE3 to SSE2, unroll the loop from 8 to 4,
      and reduce memory accesses to left.
      Speed up by >20% in ./test_intra_pred_speed.
      
      Change-Id: Ie48229c2e32404706b722442942c84983bda74cc
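For readers unfamiliar with the h_predictor functions these commits optimize, here is a minimal scalar sketch of horizontal intra prediction: every pixel in row r of the block is set to left[r]. The name `h_predictor_ref` and the explicit block-size parameter `bs` are assumptions made for illustration and are not the optimized SSE2 code from the commit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar reference (name and bs parameter are illustrative):
 * fill each row r of a bs x bs block with the left-neighbour pixel left[r]. */
static void h_predictor_ref(uint8_t *dst, ptrdiff_t stride, int bs,
                            const uint8_t *left) {
  for (int r = 0; r < bs; ++r) {
    memset(dst, left[r], (size_t)bs); /* one left pixel replicated per row */
    dst += stride;
  }
}
```

A SIMD version typically broadcasts each left[r] byte across a register and stores whole rows at once, which is why reducing the number of loads from left matters.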
      Speed up h_predictor_8x8 · da3f08fa
      Jian Zhou authored
      Relocate the function from SSSE3 to SSE2, unroll the loop from 4 to 2,
      and reduce memory accesses to left.
      Speed up by >20% in ./test_intra_pred_speed.
      
      Change-Id: Ib9f1846819783b6e05e2a310c930eb844b2b4d2e
      MMX in intra 8x8 prediction replaced with SSE2 · aa2764ab
      Jian Zhou authored
      The 8x8 intra predictor implemented with MMX is replaced with an SSE2 version.
      
      Change-Id: I0c90e7c1e1e6942489ac2bfe58903b728aac7a52
      MMX in intra 4x4 prediction replaced with SSE2 · 89a1efa4
      Jian Zhou authored
      The 4x4 intra predictor implemented with MMX is replaced with an SSE2 version.
      
      Change-Id: Id57da2a7c38832d0356bc998790fc1989d39eafc
  3. 30 Nov, 2015 1 commit
      SSE2 speed up of h_predictor_4x4 · 9d29d762
      Jian Zhou authored
      Relocate h_predictor_4x4 from SSSE3 to SSE2 with XMM registers.
      Speed up by ~25% in ./test_intra_pred_speed.
      
      Change-Id: I64e14c13b482a471449be3559bfb0da45cf88d9d
  4. 25 Nov, 2015 1 commit
      add vp9_satd_neon · eb1d0f8d
      James Zern authored
      ~60-65% faster at the function level across block sizes
      
      Change-Id: Iaf8cbe95731c43fdcbf68256e44284ba51a93893
  5. 23 Nov, 2015 1 commit
  6. 20 Nov, 2015 2 commits
      fix vp9_satd_sse2 · 60760f71
      James Zern authored
      accumulate satd in 32-bits
      + add unit test
      
      Change-Id: I6748183df3662ddb9d635f9641f9586f2fd38ad5
      vp9_satd: return an int · 3e0138ed
      James Zern authored
      the final sum may use up to 26 bits
      
      + add a unit test
      + disable the SSE2 version as the result will roll over; this will be
      fixed in a future commit
      
      Change-Id: I2a49811dfaa06abfd9fa1e1e65ed7cd68e4c97ce
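The rollover these two satd commits address can be illustrated with a small scalar sketch, assuming satd here means the sum of absolute coefficient values. The function names `satd_ref` and `satd_16bit` are hypothetical, and the 16-bit variant only mimics a narrow accumulator; it is not the SSE2 code that was fixed.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar reference: sum of absolute coefficient values,
 * accumulated and returned as an int so large blocks do not roll over. */
static int satd_ref(const int16_t *coeff, int length) {
  int satd = 0;
  for (int i = 0; i < length; ++i) satd += abs(coeff[i]);
  return satd;
}

/* Same sum kept in a 16-bit accumulator, purely to show the rollover. */
static uint16_t satd_16bit(const int16_t *coeff, int length) {
  uint16_t satd = 0;
  for (int i = 0; i < length; ++i) satd = (uint16_t)(satd + abs(coeff[i]));
  return satd;
}
```

With 1024 coefficients of magnitude 1000 the true sum is 1,024,000, which needs 20 bits, so the 16-bit accumulator wraps while the int version does not.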
  7. 19 Nov, 2015 1 commit
      Speed up tm_predictor_4x4 · 79b68626
      Jian Zhou authored
      tm_predictor_4x4 is implemented with SSE2 using XMM registers.
      Speed up by ~25% in ./test_intra_pred_speed.
      
      Change-Id: I25074b78d476a2cb17f81cf654bdfd80df2070e0
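As background for tm_predictor, a minimal scalar sketch of TrueMotion prediction follows: each pixel is left[r] + above[c] minus the top-left neighbour, clipped to 8 bits. The name `tm_predictor_ref`, the `bs` parameter, and passing the top-left pixel as an explicit argument are assumptions for illustration, not the SSE2 implementation from the commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit pixel range. */
static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar reference (names are illustrative):
 * pred[r][c] = clip(left[r] + above[c] - top_left). */
static void tm_predictor_ref(uint8_t *dst, ptrdiff_t stride, int bs,
                             const uint8_t *above, const uint8_t *left,
                             uint8_t top_left) {
  for (int r = 0; r < bs; ++r) {
    for (int c = 0; c < bs; ++c) {
      dst[c] = clip_pixel(left[r] + above[c] - top_left);
    }
    dst += stride;
  }
}
```

The intermediate sum can fall outside 0..255, so a vectorized version has to widen to 16 bits before the final saturating pack.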
  8. 14 Nov, 2015 1 commit
  9. 13 Nov, 2015 2 commits
  10. 10 Nov, 2015 1 commit
  11. 09 Nov, 2015 2 commits
  12. 06 Nov, 2015 3 commits
  13. 05 Nov, 2015 1 commit
  14. 03 Nov, 2015 1 commit
  15. 31 Oct, 2015 1 commit
  16. 30 Oct, 2015 1 commit
  17. 29 Oct, 2015 1 commit
  18. 28 Oct, 2015 2 commits
  19. 22 Oct, 2015 1 commit
  20. 21 Oct, 2015 1 commit
      Optimize vp9_highbd_block_error_8bit assembly. · aa8f8522
      Geza Lore authored
      A new version of vp9_highbd_block_error_8bit is now available which is
      optimized with AVX assembly. AVX itself does not buy us too much, but
      the non-destructive 3 operand format encoding of the 128bit SSEn integer
      instructions helps to eliminate move instructions. The Sandy Bridge
      micro-architecture cannot eliminate move instructions in the processor
      front end, so AVX will help on these machines.
      
      Two further optimizations are applied:
      
      1. The common case of computing block error on 4x4 blocks is optimized
      as a special case.
      2. All arithmetic is speculatively done on 32 bits only. At the end of
      the loop, the code detects if overflow might have happened and if so,
      the whole computation is re-executed using higher precision arithmetic.
      This case however is extremely rare in real use, so we can achieve a
      large net gain here.
      
      The optimizations rely on the fact that the coefficients are in the
      range [-(2^15-1), 2^15-1], and that the quantized coefficients always
      have the same sign as the input coefficients (in the worst case they are
      0). These are the same assumptions that the old SSE2 assembly code for
      the non high bitdepth configuration relied on. The unit tests have been
      updated to take this constraint into consideration when generating test
      input data.
      
      Change-Id: I57d9888a74715e7145a5d9987d67891ef68f39b7
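The speculative 32-bit strategy described in this commit message can be sketched in scalar C. This is only an illustration under stated assumptions: the name `block_error_sketch` is hypothetical, and overflow is detected here by checking for unsigned wraparound after each addition, which may differ from how the AVX assembly detects it.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch: returns the sum of squared (coeff - dqcoeff) and
 * stores the sum of squared coeff in *ssz. Tries 32-bit sums first and
 * falls back to 64-bit arithmetic only if a 32-bit sum wrapped around. */
static int64_t block_error_sketch(const int16_t *coeff, const int16_t *dqcoeff,
                                  int n, int64_t *ssz) {
  uint32_t err32 = 0, sqz32 = 0;
  int overflow = 0;
  for (int i = 0; i < n; ++i) {
    const uint32_t ad = (uint32_t)abs(coeff[i] - dqcoeff[i]);
    const uint32_t ac = (uint32_t)abs(coeff[i]);
    const uint32_t d2 = ad * ad; /* at most 65535^2, fits in 32 bits */
    const uint32_t c2 = ac * ac;
    err32 += d2;
    sqz32 += c2;
    /* Unsigned wraparound shows up as a sum smaller than its addend. */
    if (err32 < d2 || sqz32 < c2) overflow = 1;
  }
  if (!overflow) { /* common case: the cheap result is exact */
    *ssz = sqz32;
    return err32;
  }
  /* Rare case: redo the whole computation at higher precision. */
  int64_t err = 0, sqz = 0;
  for (int i = 0; i < n; ++i) {
    const int64_t d = coeff[i] - dqcoeff[i];
    err += d * d;
    sqz += (int64_t)coeff[i] * coeff[i];
  }
  *ssz = sqz;
  return err;
}
```

The gain comes from the fast path being taken almost always; the cost of the rare re-execution is amortized over many blocks.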
  21. 16 Oct, 2015 1 commit
  22. 09 Oct, 2015 2 commits
  23. 08 Oct, 2015 1 commit
      Optimization of 8bit block error for high bitdepth · 0134764f
      Geza Lore authored
      If high bit depth configuration is enabled, but encoding in profile 0,
      the code now falls back on optimized SSE2 assembler to compute the
      block errors, similar to when high bit depth is not enabled.
      
      Change-Id: I471d1494e541de61a4008f852dbc0d548856484f
  24. 07 Oct, 2015 1 commit
  25. 06 Oct, 2015 1 commit
      invalid_file_test: loosen error check w/tile-threading · fb209003
      James Zern authored
      The serial decode check is too strict for tile-threaded decoding as
      there is no guarantee on the decode order nor which specific error
      will take precedence. Currently a tile-level error is not forwarded so
      the frame will simply be marked corrupt.
      
      Change-Id: I51cf1e39e44bedeac93746154b36a4ccb2f059b1
  26. 30 Sep, 2015 4 commits
  27. 26 Sep, 2015 1 commit
      vp9/10: improve support for render_width/height. · 812945a8
      Ronald S. Bultje authored
      In the decoder, map this to the output variable vpx_image_t.r_w/h.
      This is intended as an improved version of VP9D_GET_DISPLAY_SIZE,
      which doesn't work with parallel frame decoding. In the encoder,
      map this to a codec control func (VP9E_SET_RENDER_SIZE) that takes
      a w/h pair argument in an int[2] (identical to VP9D_GET_DISPLAY_SIZE).
      
      Also add render_size to the encoder_param_get_to_decoder unit test.
      
      See issue 1030.
      
      Change-Id: I12124c13602d832bf4c44090db08c1009c94c7e8