1. 20 Feb, 2018 1 commit
  2. 19 Feb, 2018 1 commit
    • SSE2 optimization for lpf 16_dual implementations · d6a7dd19
      Maxym Dmytrychenko authored
      Covers horizontal and vertical variations,
      including low and high bitdepth types.
      
      Appropriate tests are enabled
      
      Performance changes, SSE2 over C:
      Horizontal methods: up to  3x
      Vertical   methods: up to  2x
      
      Change-Id: If430a916394c7befa743e4fbaa9913fd37c535ed
  3. 15 Feb, 2018 1 commit
    • Remove CONFIG_TX64X64 · d3d4159f
      Yaowu Xu authored
      The experiment is fully adopted.
      
      Change-Id: I6cc80a2acf0c93c13b0e36e6f4a2378fe5ce33c3
  4. 14 Feb, 2018 1 commit
  5. 03 Feb, 2018 1 commit
    • Add aom_comp_mask_pred_avx2 · 3c74dd45
      Peng Bin authored
      1. Add AVX2 implementation of aom_comp_mask_pred.
      2. For width 8, still use the ssse3 version.
      3. For other widths (16, 32), the AVX2 version is 1.2x-2.0x faster
      than the ssse3 version.
      
      Change-Id: I80acc1be54ab21a52f7847e91b1299853add757c
  6. 02 Feb, 2018 2 commits
    • AVX2 implementation of the Wiener filter · aab6aee3
      Imdad Sardharwalla authored
      Added an AVX2 version of the Wiener filter, along with associated tests. Speed
      tests have been added for all implementations of the Wiener filter.
      
      Speed Test results
      ==================
      
      GCC
      ---
      
      Low bit-depth filter:
      - SSE2 vs C: SSE2 takes ~92% less time
      - AVX2 vs C: AVX2 takes ~96% less time
      - SSE2 vs AVX2: AVX2 takes ~43% less time (~74% faster)
      
      High bit-depth filter:
      - SSSE3 vs C: SSSE3 takes ~92% less time
      - AVX2  vs C: AVX2  takes ~96% less time
      - SSSE3 vs AVX2: AVX2 takes ~46% less time (~84% faster)
      
      CLANG
      -----
      
      Low bit-depth filter:
      - SSE2 vs C: SSE2 takes ~84% less time
      - AVX2 vs C: AVX2 takes ~88% less time
      - SSE2 vs AVX2: AVX2 takes ~27% less time (~36% faster)
      
      High bit-depth filter:
      - SSSE3 vs C: SSSE3 takes ~85% less time
      - AVX2  vs C: AVX2  takes ~89% less time
      - SSSE3 vs AVX2: AVX2 takes ~24% less time (~31% faster)
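
The results above quote both "takes N% less time" and "M% faster". The relationship between the two can be sketched as below. This is an illustrative helper, not part of the patch; the function names are hypothetical. Because the quoted percentages are rounded, recomputed values can differ from the quoted ones by about a percentage point.

```python
def speedup_from_time_saved(pct_less_time: float) -> float:
    """Return the speedup factor for a run that takes pct_less_time % less time."""
    if not 0 <= pct_less_time < 100:
        raise ValueError("percentage must be in the range [0, 100)")
    # If the new time is (1 - p) of the old time, throughput is 1 / (1 - p).
    return 1.0 / (1.0 - pct_less_time / 100.0)


def pct_faster(pct_less_time: float) -> float:
    """Convert '% less time' into '% faster'."""
    return (speedup_from_time_saved(pct_less_time) - 1.0) * 100.0


if __name__ == "__main__":
    # Rounded inputs taken from the AVX2-versus-SSE2/SSSE3 lines above.
    for saved in (43, 46, 27, 24):
        print(f"{saved}% less time is about {pct_faster(saved):.1f}% faster")
```

For example, halving the run time (50% less time) corresponds to being 100% faster, that is, a 2x speedup.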
      
      Change-Id: Ide22d7c09c0be61483e9682caf17a39438e4a208
    • Remove aom_comp_mask_upsampled_pred from rtcd · f8daa92d
      Peng Bin authored
      Since aom_comp_mask_upsampled_pred just calls aom_upsampled_pred
      and aom_comp_mask_pred, there is no need to separate the C version
      from the SIMD version any more.
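
The composition described above can be modelled in scalar form as follows. This is a sketch under stated assumptions, not the library's code: it assumes the mask is a 6-bit weight in the range 0..64 and that the blend rounds to nearest, the `upsample` callable is a hypothetical placeholder for the sub-pixel interpolation step, and the argument order is chosen for illustration only.

```python
from typing import Callable, List, Sequence

MAX_ALPHA = 64   # assumed full-weight mask value
ROUND_BITS = 6   # assumed: log2(MAX_ALPHA)


def comp_mask_pred(pred: Sequence[int], ref: Sequence[int],
                   mask: Sequence[int], invert_mask: bool = False) -> List[int]:
    """Blend two predictions pixel by pixel using a per-pixel mask weight."""
    # Assumption: inverting the mask swaps which source gets the mask weight.
    src0, src1 = (pred, ref) if invert_mask else (ref, pred)
    half = 1 << (ROUND_BITS - 1)
    return [(m * a + (MAX_ALPHA - m) * b + half) >> ROUND_BITS
            for m, a, b in zip(mask, src0, src1)]


def comp_mask_upsampled_pred(pred: Sequence[int], ref: Sequence[int],
                             mask: Sequence[int],
                             upsample: Callable[[Sequence[int]], Sequence[int]],
                             invert_mask: bool = False) -> List[int]:
    """Upsample the reference first, then reuse the plain masked blend."""
    return comp_mask_pred(pred, upsample(ref), mask, invert_mask)


if __name__ == "__main__":
    identity = lambda block: block  # stands in for the subpel == 0 case
    print(comp_mask_upsampled_pred([0, 0, 200], [64, 64, 100], [64, 0, 32], identity))
```

Because the upsampled variant is only a wrapper in this model, any speedup in the two building blocks carries over without a separate optimized entry point.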
      
      Change-Id: I1ff8bcae87d501c68a80708fd2dc6b74c6952f88
  7. 31 Jan, 2018 1 commit
    • Add aom_comp_mask_<upsampled>pred_ssse3 · 33ba1fe5
      Peng Bin authored
      1) For encoder speed, overall ~1% faster with no impact on coding performance.
      2) aom_comp_mask_pred_ssse3 is 3.5x - 6x faster than aom_comp_mask_pred_c
      3) aom_comp_mask_upsampled_pred_ssse3 is 1.5x - 3x faster than
      aom_comp_mask_upsampled_pred_c; for the special case where subpel_x ==
      subpel_y == 0, the optimized version achieves a 4x - 7x speedup.
      
      Unit tests for both functions have been added.
      
      Change-Id: Ib498317975e0dbd9cdcf61be327b640dfac9a7e5
  8. 26 Jan, 2018 1 commit
    • SSE2 optimizations for _6/_16 lowbd lpf functions · ae6e6bc1
      Maxym Dmytrychenko authored
      Includes vertical and horizontal implementations
      and fixes 5/13 TAPs/Parallel deblocking support.
      
      Reworks the filter internals for better
      reuse across different sizes.
      
      Tests are enabled.
      
      Performance changes, SSE2 over C:
      Horizontal methods: up to    3-4x
      Vertical   methods: up to 1.5x-2x
      
      Change-Id: I2e36035355d8c23c1d4b0d59d0e23f598e9d0e3f
  9. 22 Jan, 2018 1 commit
  10. 13 Jan, 2018 1 commit
    • Add implemented functions to rtcd that were missed · 729d0f5d
      Kyle Siefring authored
      "ext-partition-types: Add 4:1 partitions" added a number of SIMD
      functions. The SAD functions introduced in that patch were not
      added to the rtcd file and were not getting called.
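
Why an unregistered SIMD function is never called can be illustrated with a toy model of runtime CPU dispatch. This is only a sketch: the registry layout, the `setup_dispatch` helper, and the `sad_16x64` key are hypothetical and do not reproduce the project's actual rtcd mechanism.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple

Sad = Callable[[Sequence[int], Sequence[int]], int]
Variants = List[Tuple[Optional[str], Sad]]


def sad_c(a: Sequence[int], b: Sequence[int]) -> int:
    """Plain reference: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sad_sse2_standin(a: Sequence[int], b: Sequence[int]) -> int:
    """Stand-in for an optimized version; it must return the same result."""
    return sum(abs(x - y) for x, y in zip(a, b))


def setup_dispatch(registry: Dict[str, Variants], cpu_flags: Set[str]) -> Dict[str, Sad]:
    """Pick, per function, the last listed variant the CPU supports."""
    table: Dict[str, Sad] = {}
    for name, variants in registry.items():
        chosen: Optional[Sad] = None
        for required_flag, impl in variants:  # ordered from baseline to most specialized
            if required_flag is None or required_flag in cpu_flags:
                chosen = impl
        if chosen is None:
            raise ValueError(f"no usable implementation for {name}")
        table[name] = chosen
    return table


# Before the fix: the optimized function exists but is not registered.
registry_before: Dict[str, Variants] = {"sad_16x64": [(None, sad_c)]}
# After the fix: the optimized function is listed alongside the C version.
registry_after: Dict[str, Variants] = {
    "sad_16x64": [(None, sad_c), ("sse2", sad_sse2_standin)],
}

if __name__ == "__main__":
    print(setup_dispatch(registry_before, {"sse2"})["sad_16x64"].__name__)
    print(setup_dispatch(registry_after, {"sse2"})["sad_16x64"].__name__)
```

In this model the dispatcher can only choose among registered variants, so an implementation that is compiled but not listed is never selected, whatever the CPU supports.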
      
      Hash of "ext-partition-types: Add 4:1 partitions"
      93c39e91
      
      Change-Id: I47094799e27d66f74311ff0bcff23ecb7eed8a76
  11. 27 Dec, 2017 1 commit
  12. 18 Dec, 2017 1 commit
    • JNT_COMP: add SIMD and interface for high bit-depth · bf3d4964
      Cheng Chen authored
      Add high bit-depth macro definitions:
      highbd_jnt_sad
      highbd_8(10/12)_jnt_sub_pixel_avg.
      
      Add SIMD functions:
      aom_highbd_jnt_comp_avg_pred_sse2
      aom_highbd_jnt_comp_avg_upsampled_pred_sse2
      
      This patch also solves the seg fault caused by low bit-depth and
      high bit-depth paths
      
      BUG=aomedia:967
      BUG=aomedia:944
      
      Change-Id: Iea69f114e81ca226a30d84a540ad846f1b94b8d6
  13. 15 Dec, 2017 1 commit
    • add copyright to rtcd files · aecbba6d
      Johann authored
      Allows them to pass the license check in chromium.
      
      Based on libvpx e4b3f03
      
      BUG=chromium:795297
      
      Change-Id: I2bb49ecb62f20d7bc5093a1732b6a8228ef5c87f
  14. 14 Dec, 2017 1 commit
    • round_shift_array: Use SSE4 version everywhere. · 1ac47a7c
      Urvang Joshi authored
      Usage of CPU by round_shift_array goes from 2.01% to 1.04%.
      Overall encoding is slightly faster (~0.05%).
      
      This means some of the intermediate arrays have to be aligned.
      Also, these functions were moved to common header/source files.
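
For readers unfamiliar with the operation being vectorized, a scalar model of a rounding shift over an array is sketched below. It assumes the common convention that a positive `bit` means a rounded right shift and a negative `bit` means a left shift; the exact signature in the codebase is not shown in this log, so the names and behavior here are illustrative.

```python
from typing import List, Sequence


def round_shift(value: int, bit: int) -> int:
    """Right-shift by bit (bit >= 1), rounding to nearest by adding half first."""
    return (value + (1 << (bit - 1))) >> bit


def round_shift_array(arr: Sequence[int], bit: int) -> List[int]:
    """Apply a rounding right shift (bit > 0) or a left shift (bit < 0) to each element."""
    if bit == 0:
        return list(arr)
    if bit > 0:
        return [round_shift(v, bit) for v in arr]
    return [v * (1 << -bit) for v in arr]


if __name__ == "__main__":
    print(round_shift_array([5, -5, 8], 1))
    print(round_shift_array([1, 2], -2))
```

Each element is processed independently, which is what makes the loop a natural fit for SIMD once the arrays are suitably aligned.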
      
      BUG=aomedia:1106
      
      Change-Id: I492c9b1f2e7339c6cb83cfe68a61218642654d1b
  15. 13 Dec, 2017 1 commit
  16. 04 Dec, 2017 1 commit
  17. 29 Nov, 2017 1 commit
    • Unify highbd loopfilter function names · 684b7bd1
      James Zern authored
      Rename aom_highbd_lpf_horizontal_edge_8() to aom_highbd_lpf_horizontal_16().
      Rename aom_highbd_lpf_horizontal_edge_16() to aom_highbd_lpf_horizontal_16_dual().
      
      based on the same change from libvpx:
      7f1f35183 Unify loopfilter function names
      
      Change-Id: I40cd587e74e0fe02bae23e6c10280c8e269df1d6
  18. 27 Nov, 2017 1 commit
    • Unify loopfilter function names · 1dbe80bc
      James Zern authored
      Rename aom_lpf_horizontal_edge_8() to aom_lpf_horizontal_16().
      Rename aom_lpf_horizontal_edge_16() to aom_lpf_horizontal_16_dual().
      
      based on the same change from libvpx:
      7f1f35183 Unify loopfilter function names
      
      Change-Id: I4fda7a2e3a893fc3dee0779975e2d4145c32f5d2
  19. 25 Nov, 2017 1 commit
  20. 22 Nov, 2017 2 commits
    • JNT_COMP: add ssse3 implementations for sad_avg · d0179a6b
      Cheng Chen authored
      Add ssse3 implementations for the sad_avg c function at low bit-depth.
      With this, aom_jnt_sad c functions can all have simd implementations.
      This CL follows existing MACRO definitions for multiple combinations
      of block sizes.
      
      Change-Id: I882343684026525f5589a239337cfac2dd411e11
    • JNT_COMP: SIMD implementation for aom_jnt_sub_pixel_avg · d286443c
      Cheng Chen authored
      Change function names and add SIMD implementation for two c functions:
      (1) var_filter_block2d_bil_first_pass
      (2) var_filter_block2d_bil_second_pass
      
      This CL allows aom_jnt_sub_pixel_avg_variance to be implemented in SIMD.
      
      Change-Id: Ib41ef13d62ae91a0ca481bcebb24568dcd4722c4
  21. 10 Nov, 2017 1 commit
    • Remove smooth_hv experiment flag. · b7301cd6
      Urvang Joshi authored
      This experiment has been cleared by Tapas.
      
      Also, fix a couple of hash signatures in the test while we are at it.
      
      Change-Id: I1658bcb07913cf8bd47cfffadd729e16d5c55fc3
  22. 06 Nov, 2017 2 commits
    • JNT_COMP: add SIMD implementations for c functions · ef34fff7
      Cheng Chen authored
      Add SIMD implementations of the C functions for low bit-depth, making
      the encoder 3~4x faster than with the C functions.
      
      Change-Id: Icca0b07b25489759be9504aaec09d1239076fc52
    • JNT_COMP: Refactor code · f78632e0
      Cheng Chen authored
      The refactoring serves two purposes:
      1. Separate the code paths for jnt_comp and the original compound average
      computation. This provides a function interface for jnt_comp while leaving
      the original compound average computation unchanged. In the near future, SIMD
      functions can be added for jnt_comp using this interface.
      
      2. The previous implementation used a hack on second_pred, which may cause
      a segmentation fault when the test clip is small, as reported in Issue
      944. This refactoring removes the hack and makes it possible to address
      the seg fault problem in the future.
      
      Change-Id: Idd2cb99f6c77dae03d32ccfa1f9cbed1d7eed067
  23. 31 Oct, 2017 1 commit
  24. 21 Oct, 2017 1 commit
  25. 20 Oct, 2017 1 commit
    • Lowbd D207E/D63E/D45E intrapred x86 optimization · ae676953
      Yi Luo authored
      D207E
      Predictor  SSE2 vs C
      4x4        ~2.6X
      4x8        ~2.5X
      8x4        ~8.0X
      8x8        ~9.1X
      8x16       ~11.7X
      16x8       ~16.9X
      16x16      ~17.3X
      16x32      ~17.2X
      32x16      ~30.2X
      32x32      ~35.5X
      
      D63E
      Predictor  SSE2 vs C
      4x4        ~4.7X
      4x8        ~4.9X
      8x4        ~7.8X
      8x8        ~8.9X
      8x16       ~9.3X
      16x8       ~15.7X
      16x16      ~14.7X
      16x32      ~17.3X
      32x16      ~18.0X
      32x32      ~15.7X
      
      D45E
      Predictor  SSSE3 vs C
      4x4        ~1.8X
      4x8        ~2.9X
      8x4        ~6.7X
      8x8        ~6.5X
      8x16       ~7.4X
      16x8       ~24.4X
      16x16      ~21.5X
      16x32      ~24.2X
      32x16      ~25.4X
      32x32      ~25.2X
      
      Change-Id: I8215de190e2b6314272749761600e389d1ca0fdf
  26. 16 Oct, 2017 1 commit
    • Highbd D207E/D63E intrapred sse2/avx2 optimization · 0b7127b3
      Yi Luo authored
      D207E
      Predictor SSE2 vs C   AVX2 vs C
      4x4       ~2.7x
      4x8       ~3.0x
      8x4       ~7.2x
      8x8       ~8.5x
      8x16      ~9.4x
      16x8      ~12.8x
      16x16     ~13.0x
      16x32     ~14.3x
      32x16                 ~19.9x
      32x32                 ~23.6x
      
      D63E
      Predictor SSE2 vs C   AVX2 vs C
      4x4       ~3.8x
      4x8       ~4.3x
      8x4       ~6.4x
      8x8       ~6.8x
      8x16      ~8.6x
      16x8                  ~9.0x
      16x16                 ~9.6x
      16x32                 ~10.3x
      32x16                 ~9.1x
      32x32                 ~11.0x
      
      Change-Id: I87373804c9d53276bf4d7788c4ae0d13d01c00dc
  27. 10 Oct, 2017 2 commits
    • Highbd D45E intrapred SSE2/AVX2 speedup · 56ad3dd3
      Yi Luo authored
      Function  SSE2 vs C  AVX2 vs C
      4x4       ~4.5x
      4x8       ~4.5x
      8x4       ~11.7x
      8x8       ~12.7x
      8x16      ~14.0x
      16x8                 ~21.7x
      16x16                ~24.0x
      16x32                ~28.7x
      32x16                ~20.5x
      32x32                ~24.4x
      
      Change-Id: Iaca49727d8df17b7f793b774a8d51a401ef8a8d1
    • Migrate some vp9 highbd intrapred x86 speedup to av1 · 71b6e043
      Yi Luo authored
      Function speedup on i7-6700:
      D117   sse2   ssse3
      4x4    ~1.8x
      8x8           ~3.4x
      16x16         ~5.5x
      32x32         ~2.9x
      
      D135   sse2   ssse3
      4x4    ~1.9x
      8x8           ~3.3x
      16x16         ~5.3x
      32x32         ~3.6x
      
      D153   sse2   ssse3
      4x4    ~1.9x
      8x8           ~2.8x
      16x16         ~5.5x
      32x32         ~3.6x
      
      Change-Id: I43ab5fa8dcbcfa51acbde554abf3e5d7d336f391
  28. 06 Oct, 2017 1 commit
    • Lowbd SMOOTH_PRED intrapred ssse3 optimization · 46ae1ea3
      Yi Luo authored
      On i7-6700:
      Predictor    ssse3 v. C
      4x4          ~1.3x
      4x8          ~1.9x
      8x4          ~2.3x
      8x8          ~3.4x
      8x16         ~4.1x
      16x8         ~4.6x
      16x16        ~5.2x
      16x32        ~5.6x
      32x16        ~4.2x
      32x32        ~4.7x
      
      Change-Id: Ic12383cf9d4446361d6355eb8a480a3c7602060e
  29. 04 Oct, 2017 1 commit
    • Lowbd TM_PRED intra pred avx2 optimization · 237cf1b2
      Yi Luo authored
      For block width >= 16, avx2 can further speed up the
      TM_PRED intra prediction.
      
      Function speedup on i7-6700:
      Predictor  avx2 v. ssse3
      16x8       ~1.6x
      16x16      ~1.8x
      16x32      ~1.9x
      32x16      ~1.9x
      32x32      ~1.9x
      
      Change-Id: I62c20bd7628f52251b0c051b99a9b738ee44f7e6
  30. 02 Oct, 2017 2 commits
  31. 29 Sep, 2017 3 commits
    • Lowbd TM_PRED intrapred ssse3 optimization · a0f66fc0
      Yi Luo authored
      Function speedup (i7-6700)
      Predictor  ssse3 v. C
      4x4        ~2.1x
      4x8        ~2.4x
      8x4        ~4.1x
      8x8        ~5.4x
      8x16       ~6.1x
      16x8       ~5.9x
      16x16      ~6.4x
      16x32      ~6.7x
      32x16      ~7.4x
      32x32      ~8.0x
      
      Change-Id: I52b8ebf8193e76f4ea1137cbad5ad7fa109d86d8
    • Add 32x128/128x32 block sizes · 2fa6e1ce
      Rupert Swarbrick authored
      Change-Id: Ieb28f40d85e4db4af33648c32c406dd2931ceb89
    • Lowbd intrapred DC/TOP/LEFT/128/V/H avx2 · 23c61903
      Yi Luo authored
      For prediction block width equal to 32, avx2 can further speed up
      the prediction functions (i7-6700):
      
      32x32     avx2 v. sse2
      DC        ~1.4x
      top       ~1.5x
      left      ~1.4x
      128       ~1.5x
      v         ~1.6x
      h         ~1.2x
      
      32x16     avx2 v. sse2
      DC        ~2.2x
      top       ~1.7x
      left      ~1.6x
      128       ~1.8x
      v         ~1.9x
      
      Note: 32x16 H_PRED on avx2 does not yet run sufficiently faster than sse2.
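
As background for the DC-family rows in the tables above, a scalar model of these predictors is sketched below. It assumes the usual rounded-average definitions and 8-bit samples; the function names are hypothetical, and the optimized code may compute the rectangular-block average differently (for example with a multiply and shift instead of a division).

```python
from typing import List, Sequence


def _fill(value: int, width: int, height: int) -> List[List[int]]:
    """Return a height x width block filled with one value."""
    return [[value] * width for _ in range(height)]


def dc_pred(above: Sequence[int], left: Sequence[int], width: int, height: int) -> List[List[int]]:
    """DC: rounded average of the row above and the column to the left."""
    count = width + height
    total = sum(above[:width]) + sum(left[:height])
    return _fill((total + (count >> 1)) // count, width, height)


def dc_top_pred(above: Sequence[int], width: int, height: int) -> List[List[int]]:
    """DC top: rounded average of the row above only."""
    return _fill((sum(above[:width]) + (width >> 1)) // width, width, height)


def dc_left_pred(left: Sequence[int], width: int, height: int) -> List[List[int]]:
    """DC left: rounded average of the left column only."""
    return _fill((sum(left[:height]) + (height >> 1)) // height, width, height)


def dc_128_pred(width: int, height: int) -> List[List[int]]:
    """DC 128: mid-grey fill for 8-bit content, used when no neighbors exist."""
    return _fill(128, width, height)


if __name__ == "__main__":
    print(dc_pred([10, 20, 30, 40], [50, 60], 4, 2)[0])
```

After the single average is computed, every variant reduces to filling the block with one value, so wider vector stores help most when the block is wide.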
      
      Change-Id: I145ed504d1b3ea9df283b94927be66a2c6f81225
  32. 28 Sep, 2017 1 commit
    • Lowbd rectangle V/H intra pred sse2 optimization · 0c0fd1e5
      Yi Luo authored
      Function speedup sse2 v. C
      Predictor  V_PRED  H_PRED
      4x8        ~1.7x   ~1.8x
      8x4        ~1.8x   ~2.2x
      8x16       ~1.5x   ~1.4x
      16x8       ~1.9x   ~1.3x
      16x32      ~1.6x   ~1.4x
      32x16      ~2.0x   ~1.9x
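
For reference, the two predictors measured in the table above can be modelled in scalar form as follows. This is a minimal sketch with hypothetical function names: V_PRED repeats the row above the block in every output row, and H_PRED repeats each left neighbor across its row.

```python
from typing import List, Sequence


def v_pred(above: Sequence[int], width: int, height: int) -> List[List[int]]:
    """Vertical prediction: every row is a copy of the row above the block."""
    return [list(above[:width]) for _ in range(height)]


def h_pred(left: Sequence[int], width: int, height: int) -> List[List[int]]:
    """Horizontal prediction: each row is filled with its left neighbor."""
    return [[left[row]] * width for row in range(height)]


if __name__ == "__main__":
    for row in v_pred([1, 2, 3, 4], 4, 2):
        print(row)
    for row in h_pred([7, 9], 4, 2):
        print(row)
```

Both predictors are pure copies with no arithmetic, so any gain from SIMD comes from wider loads and stores rather than from faster computation.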
      
      This patch disables speed tests to save Jenkins build
      time. Developers can enable them manually by passing the
      --gtest_also_run_disabled_tests flag on the test command line.
      
      Change-Id: I81eaee5e8afc55275c7507c99774f78cc9e49f9a
  33. 27 Sep, 2017 1 commit