1. 08 Jun, 2016 1 commit
  2. 27 May, 2016 1 commit
    • Upgrade fwht4x4_mmx() to fwht4x4_sse2() for vp9 and vp10. · af7fb17c
      Linfeng Zhang authored
      Function level timing test shows about 27% time saving on
      a Xeon E5-2680 v2 desktop.
      
      Rename vp9_dct_sse2.c to vp9_dct_intrin_sse2.c for vp9 and
      rename dct_sse2.c to dct_intrin_sse2.c for vp10 to avoid
      duplicate basenames.
      
      Actually vp9_fwht4x4_mmx/sse2() and vp10_fwht4x4_mmx/sse2()
      are identical. TODO: They should be unified later if there is
      no intention to keep a duplicate.
      
      Change-Id: I3e537b7bbd9ba417c606cd7c68c4dbbfa583f77d
  3. 02 May, 2016 1 commit
  4. 31 Mar, 2016 1 commit
  5. 04 Feb, 2016 1 commit
  6. 14 Dec, 2015 1 commit
  7. 09 Dec, 2015 1 commit
  8. 25 Nov, 2015 1 commit
    • add vp9_satd_neon · eb1d0f8d
      James Zern authored
      ~60-65% faster at the function level across block sizes
      
      Change-Id: Iaf8cbe95731c43fdcbf68256e44284ba51a93893
  9. 20 Nov, 2015 2 commits
    • fix vp9_satd_sse2 · 60760f71
      James Zern authored
      accumulate satd in 32-bits
      + add unit test
      
      Change-Id: I6748183df3662ddb9d635f9641f9586f2fd38ad5
    • vp9_satd: return an int · 3e0138ed
      James Zern authored
      the final sum may use up to 26 bits
      
      + add a unit test
      + disable the sse2 version, as the result will roll over; this will be
      fixed in a future commit
      
      Change-Id: I2a49811dfaa06abfd9fa1e1e65ed7cd68e4c97ce
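To illustrate the rollover described in the two vp9_satd commits above, here is a minimal C sketch. It assumes SATD is the plain sum of absolute coefficient values, a worst case of 1024 coefficients (a 32x32 block) each with magnitude 2^15, and a 16-bit accumulator as the overflowing case. The names satd_wide and satd_narrow are hypothetical and are not the libvpx implementations.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reference: sum of absolute coefficient values,
 * accumulated in a full-width int. */
int satd_wide(const int16_t *coeff, int length) {
  int satd = 0;
  assert(length >= 0);
  for (int i = 0; i < length; ++i) satd += abs(coeff[i]);
  return satd;
}

/* Same sum kept in a 16-bit accumulator (an assumed stand-in for a
 * too-narrow SIMD lane). Unsigned wrap-around is well defined in C,
 * so the result silently rolls over modulo 2^16. */
uint16_t satd_narrow(const int16_t *coeff, int length) {
  uint16_t satd = 0;
  assert(length >= 0);
  for (int i = 0; i < length; ++i) satd = (uint16_t)(satd + abs(coeff[i]));
  return satd;
}
```

Under these assumptions the worst-case sum is 1024 * 2^15 = 2^25, which needs 26 bits to represent, while the narrow accumulator wraps all the way back to 0.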
  10. 13 Nov, 2015 1 commit
    • Changes to exhaustive motion search. · 0149fb3d
      paulwilkins authored
      This change alters the nature and use of exhaustive motion search.
      
      Firstly, any exhaustive search is preceded by a normal step search.
      The exhaustive search is only carried out if the distortion resulting
      from the step search is above a threshold value.
      
      Secondly, the simple +/- 64 exhaustive search is replaced by a
      multi-stage mesh-based search where each stage has a range
      and step/interval size. Subsequent stages use the best position from
      the previous stage as the center of the search but use a reduced range
      and interval size.
      
      For example:
        stage 1: Range +/- 64 interval 4
        stage 2: Range +/- 32 interval 2
        stage 3: Range +/- 15 interval 1
      
      This process, especially when it follows on from a normal step
      search, has shown itself to be almost as effective as a full range
      exhaustive search with step 1 but greatly lowers the computational
      complexity such that it can be used in some cases for speeds 0-2.
      
      This patch also removes a double exhaustive search for sub 8x8 blocks,
      which also contained a bug (the two searches used different distortion
      metrics).
      
      For best quality in my test animation sequence this patch has almost
      no impact on quality but improves encode speed by more than 5X.
      
      Restricted use in good quality speeds 0-2 yields significant quality gains
      on the animation test of 0.2 - 0.5 dB with only a small impact on encode
      speed. On most clips though the quality gain and speed impact are small.
      
      Change-Id: Id22967a840e996e1db273f6ac4ff03f4f52d49aa
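The multi-stage mesh search described in the commit above can be sketched in C roughly as follows. This is an illustration under stated assumptions, not the libvpx code: MvPos, MeshStage, CostFn, mesh_search and toy_cost are hypothetical names, the cost function is a toy quadratic standing in for a real distortion metric, and the preceding step search and its distortion threshold are omitted.

```c
#include <assert.h>

typedef struct { int row, col; } MvPos;             /* hypothetical */
typedef struct { int range, interval; } MeshStage;  /* hypothetical */
typedef unsigned int (*CostFn)(MvPos pos);

/* Toy distortion with its minimum at (37, -21); a stand-in for SAD. */
unsigned int toy_cost(MvPos p) {
  const int dr = p.row - 37, dc = p.col + 21;
  return (unsigned int)(dr * dr + dc * dc);
}

/* Each stage scans a square mesh centered on the best position found so
 * far, using that stage's range and interval. On return *best holds the
 * final position; the return value is the number of cost evaluations. */
int mesh_search(MvPos *best, CostFn cost, const MeshStage *stages,
                int num_stages) {
  unsigned int best_cost = cost(*best);
  int evals = 1;
  for (int s = 0; s < num_stages; ++s) {
    const MvPos center = *best;
    const int range = stages[s].range;
    const int step = stages[s].interval;
    assert(range > 0 && step > 0);
    for (int dr = -range; dr <= range; dr += step) {
      for (int dc = -range; dc <= range; dc += step) {
        const MvPos cand = { center.row + dr, center.col + dc };
        const unsigned int c = cost(cand);
        ++evals;
        if (c < best_cost) {  /* strict compare keeps the earliest best */
          best_cost = c;
          *best = cand;
        }
      }
    }
  }
  return evals;
}
```

With the three example stages from the commit message (64/4, 32/2, 15/1), the meshes contain 33x33, 33x33 and 31x31 points, so 3139 candidates in total, compared with 129x129 = 16641 for a single +/- 64 search at interval 1.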
  11. 11 Nov, 2015 1 commit
    • Add AVX vectorized vp9_diamond_search_sad · 5eefd3eb
      Geza Lore authored
      This function now has an AVX intrinsics version which is about 80%
      faster compared to the C implementation. This provides a 2-4% total
      speed-up for encode, depending on encoding parameters. The function
      utilizes 3 properties of the cost function lookup table, constructed
      in 'cal_nmvjointsadcost' and 'cal_nmvsadcosts'.
      For the joint cost:
        - mvjointsadcost[1] == mvjointsadcost[2] == mvjointsadcost[3]
      For the component costs:
        - For all i: mvsadcost[0][i] == mvsadcost[1][i]
              (equal per component cost)
        - For all i: mvsadcost[0][i] == mvsadcost[0][-i]
              (Cost function is even)
      These must hold, otherwise the AVX version of the function cannot be used.
      
      Change-Id: I6c2791d43022822a9e6ab43cd124a773946d0bdc
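The three table properties listed in the commit above could be checked with a small C helper along these lines. It is a sketch only: the function name sad_cost_tables_allow_simd and its parameters are hypothetical, and it assumes the component tables are passed as pointers to their center entry so that negative indices are valid, as the mvsadcost[0][-i] notation suggests.

```c
#include <assert.h>

/* Returns 1 if all three properties hold, 0 otherwise.
 * joint:        joint cost table with 4 entries
 * comp0, comp1: pointers to the CENTER of each component cost table,
 *               valid for indices in [-max_mv, max_mv] (assumption) */
int sad_cost_tables_allow_simd(const int joint[4], const int *comp0,
                               const int *comp1, int max_mv) {
  assert(max_mv >= 0);
  /* Joint cost: entries 1, 2 and 3 must be equal. */
  if (joint[1] != joint[2] || joint[2] != joint[3]) return 0;
  for (int i = -max_mv; i <= max_mv; ++i) {
    /* Equal per-component cost. */
    if (comp0[i] != comp1[i]) return 0;
    /* Cost function is even. */
    if (comp0[i] != comp0[-i]) return 0;
  }
  return 1;
}
```

A caller could use such a check to decide between the vectorized routine and the C fallback; whether libvpx performs the check at run time or only relies on how the tables are constructed is not stated in the commit message.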
  12. 06 Nov, 2015 1 commit
    • Revert "Add AVX vectorized vp9_diamond_search_sad" · 30466f26
      James Zern authored
      This reverts commit f1342a7b.
      
      This breaks 32-bit builds:
       runtime error: load of misaligned address 0xf72fdd48 for type 'const
      __m128i' (vector of 2 'long long' values), which requires 16 byte
      alignment
      
      + _mm_set1_epi64x is incompatible with some versions of visual studio
      
      Change-Id: I6f6fc3c11403344cef78d1c432cdc9147e5c1673
  13. 05 Nov, 2015 1 commit
    • Add AVX vectorized vp9_diamond_search_sad · f1342a7b
      Geza Lore authored
      This function now has an AVX intrinsics version which is about 80%
      faster compared to the C implementation. This provides a 2-4% total
      speed-up for encode, depending on encoding parameters. The function
      utilizes 3 properties of the cost function lookup table, constructed
      in 'cal_nmvjointsadcost' and 'cal_nmvsadcosts'.
      For the joint cost:
        - mvjointsadcost[1] == mvjointsadcost[2] == mvjointsadcost[3]
      For the component costs:
        - For all i: mvsadcost[0][i] == mvsadcost[1][i]
              (equal per component cost)
        - For all i: mvsadcost[0][i] == mvsadcost[0][-i]
              (Cost function is even)
      These must hold, otherwise the AVX version of the function cannot be used.
      
      Change-Id: I184055b864c5a2dc37b2d8c5c9012eb801e9daf6
  14. 21 Oct, 2015 1 commit
    • Optimize vp9_highbd_block_error_8bit assembly. · aa8f8522
      Geza Lore authored
      A new version of vp9_highbd_block_error_8bit is now available which is
      optimized with AVX assembly. AVX itself does not buy us too much, but
      the non-destructive 3 operand format encoding of the 128bit SSEn integer
      instructions helps to eliminate move instructions. The Sandy Bridge
      micro-architecture cannot eliminate move instructions in the processor
      front end, so AVX will help on these machines.
      
      Two further optimizations are applied:
      
      1. The common case of computing block error on 4x4 blocks is optimized
      as a special case.
      2. All arithmetic is speculatively done on 32 bits only. At the end of
      the loop, the code detects if overflow might have happened and if so,
      the whole computation is re-executed using higher precision arithmetic.
      This case however is extremely rare in real use, so we can achieve a
      large net gain here.
      
      The optimizations rely on the fact that the coefficients are in the
      range [-(2^15-1), 2^15-1], and that the quantized coefficients always
      have the same sign as the input coefficients (in the worst case they are
      0). These are the same assumptions that the old SSE2 assembly code for
      the non high bitdepth configuration relied on. The unit tests have been
      updated to take this constraint into consideration when generating test
      input data.
      
      Change-Id: I57d9888a74715e7145a5d9987d67891ef68f39b7
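The speculative 32-bit strategy from the commit above (optimization 2) can be sketched in plain C as follows. This is not the AVX assembly: block_error_speculative is a hypothetical name, only the squared-error sum is computed, and the overflow test (block length times the largest squared difference, evaluated in 64 bits) is an assumed conservative check rather than the one the assembly uses.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of squared differences between coeff and dqcoeff.
 * Fast path: 32-bit accumulation. If overflow might have happened,
 * the whole computation is redone in 64-bit arithmetic.
 * *used_fallback reports which path produced the result. */
int64_t block_error_speculative(const int16_t *coeff, const int16_t *dqcoeff,
                                int n, int *used_fallback) {
  uint32_t sum32 = 0;
  uint32_t max_sq = 0;
  assert(n >= 0);
  for (int i = 0; i < n; ++i) {
    const int32_t diff = coeff[i] - dqcoeff[i];
    const uint32_t mag = (uint32_t)(diff < 0 ? -diff : diff);
    const uint32_t sq = mag * mag; /* fits: mag <= 65535 */
    sum32 += sq;                   /* may wrap; unsigned wrap is defined */
    if (sq > max_sq) max_sq = sq;
  }
  /* Conservative test: if n copies of the largest term cannot exceed
   * 32 bits, the 32-bit sum is exact. */
  if ((uint64_t)n * max_sq <= UINT32_MAX) {
    *used_fallback = 0;
    return (int64_t)sum32;
  }
  /* Rare case: re-execute with higher precision. */
  *used_fallback = 1;
  int64_t sum64 = 0;
  for (int i = 0; i < n; ++i) {
    const int64_t diff = (int64_t)coeff[i] - dqcoeff[i];
    sum64 += diff * diff;
  }
  return sum64;
}
```

The design relies on the slow path being rare: typical blocks have small differences, so the second loop almost never runs and the common case pays only for the cheaper 32-bit arithmetic.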
  15. 08 Oct, 2015 1 commit
    • Optimization of 8bit block error for high bitdepth · 0134764f
      Geza Lore authored
      If high bit depth configuration is enabled, but encoding in profile 0,
      the code now falls back on optimized SSE2 assembler to compute the
      block errors, similar to when high bit depth is not enabled.
      
      Change-Id: I471d1494e541de61a4008f852dbc0d548856484f
  16. 29 Sep, 2015 1 commit
    • Accelerated transform in high bit depth · 406030d1
      Julia Robson authored
      When configured with high bitdepth enabled, the 8bit transform
      stopped using optimised code. This made 8bit content decode slowly.
      
      Change-Id: I67d91f9b212921d5320f949fc0a0d3f32f90c0ea
  17. 07 Aug, 2015 1 commit
  18. 01 Aug, 2015 1 commit
    • add vp9_vector_var_neon · 7dc5a689
      James Zern authored
      ~50-60% faster depending on the width
      
      Change-Id: I9d007cfa10b9aaa2169c8c009d95522df6123a92
  19. 31 Jul, 2015 2 commits
    • Factor inverse transform functions into vpx_dsp · e8b133c7
      Jingning Han authored
      This commit moves the module inverse transform functions from vp9
      to vpx_dsp folder. The hybrid transform wrapper functions stay in
      the vp9 folder, since it involves codec-specific data structures.
      
      Change-Id: Ib066367c953d3d024c73ba65157bbd70a95c9ef8
    • Code refactor on InterpKernel · 7186a2dd
      Zoe Liu authored
      In essence, this refactors the code for both the interpolation
      filtering and the convolution. The change moves all of the files
      and renames the code from the vp9_ prefix to the vpx_ prefix
      for the following architectures:
      (1) x86;
      (2) arm/neon; and
      (3) mips/msa.
      The work on mips/dspr2 will be done in a separate change list.
      
      Change-Id: Ic3ce7fb7f81210db7628b373c73553db68793c46
  20. 28 Jul, 2015 3 commits
  21. 27 Jul, 2015 1 commit
  22. 22 Jul, 2015 1 commit
  23. 20 Jul, 2015 1 commit
    • Unify the high bit-depth forward hybrid transforms · e253eaa0
      Jingning Han authored
      The SSE2 versions of the high bit-depth forward hybrid transforms
      essentially use the C functions by cross-referencing the 1-D
      functions in vp9_dct.c. This commit unifies the two versions and
      removes the unnecessary dependency.
      
      Change-Id: Ib4d0702a138f8daf7d0bd97c141ee7088f293765
  24. 17 Jul, 2015 1 commit
    • Migrate quantization functions from vp9/ to vpx_dsp/ · 38f1fbbb
      Yunqing Wang authored
      The following quantization functions were moved:
      vp9_quantize_b
      vp9_quantize_b_32x32
      vp9_highbd_quantize_b
      vp9_highbd_quantize_b_32x32
      
      vp9_quantize_dc
      vp9_quantize_dc_32x32
      vp9_highbd_quantize_dc
      vp9_highbd_quantize_dc_32x32
      
      The purpose of doing that was to allow these functions to be shared
      by multiple codecs.
      
      Change-Id: Id8ab939f283353cdd07bd930d47db3d932a5d87f
  25. 16 Jul, 2015 1 commit
  26. 15 Jul, 2015 1 commit
  27. 14 Jul, 2015 1 commit
  28. 13 Jul, 2015 1 commit
  29. 08 Jul, 2015 1 commit
  30. 07 Jul, 2015 1 commit
  31. 06 Jul, 2015 2 commits
  32. 02 Jul, 2015 2 commits
  33. 01 Jul, 2015 2 commits