1. 21 Feb, 2018 1 commit
    • Turn on jnt_comp by default · 238bc287
      Cheng Chen authored
      Turn off CONFIG_RD_DEBUG when jnt_comp is on, to avoid stack size
      Make subpel processing for width <= 4 correct.
      Change-Id: Ic1de96ff2eff4a80543e19531fa75511b0a2f427
  2. 20 Feb, 2018 2 commits
  3. 19 Feb, 2018 3 commits
  4. 16 Feb, 2018 2 commits
    • Remove unused jnt functions · 143de432
      Johann authored
      The 4x2 transform gives a compile warning with gcc 6.3.0 but appears
      to be unused:
      '*((void *)&temp2+8)' is used uninitialized in this function
      Change-Id: I8b08e05d0365dc117b5374ec00bddc6f7bd84bd3
    • Fix a bug in memory access · 3290ba02
      Cheng Chen authored
      Avoid reading/writing outside of the buffer. Triggered by ASAN.
      Change-Id: I7de2a9f01cc13feb1c13556dfe77e9e6e7e55056
  5. 15 Feb, 2018 3 commits
    • Remove CONFIG_TX64X64 · d3d4159f
      Yaowu Xu authored
      The experiment is fully adopted.
      Change-Id: I6cc80a2acf0c93c13b0e36e6f4a2378fe5ce33c3
    • film-grain: fix buffer overflow · aa5904ba
      Dominic Symes authored
      When bit_depth is 8 the copy_rect function was setting the size to
      2 bytes per sample. This causes a buffer overflow as each line copied
      in the loop is twice the number of bytes it should be and the last
      line writes off the end of the buffer.
      Change-Id: Ib9fa11d1dd13806dedbce2cd47dd8d562007428d
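The overflow described above can be illustrated with a minimal sketch. This is not the libaom code: `copy_rect_sketch`, its parameters, and the guard-byte check are hypothetical names chosen for illustration. The only point taken from the commit message is that the per-row copy size must be 1 byte per sample at 8-bit depth and 2 bytes per sample otherwise.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a width x height rectangle of samples.
 * Strides are given in bytes. */
void copy_rect_sketch(const uint8_t *src, int src_stride, uint8_t *dst,
                      int dst_stride, int width, int height,
                      int use_high_bit_depth) {
  /* 8-bit content uses 1 byte per sample; only high bit depth uses 2. */
  const size_t bytes_per_sample = use_high_bit_depth ? 2 : 1;
  for (int y = 0; y < height; ++y) {
    memcpy(dst, src, (size_t)width * bytes_per_sample);
    src += src_stride;
    dst += dst_stride;
  }
}

/* Copies a 4x3 8-bit rectangle into a buffer that is followed by 4 guard
 * bytes. Returns 1 if the data was copied and the guard bytes are intact. */
int copy_rect_keeps_guard(void) {
  uint8_t src[12];
  uint8_t dst[12 + 4];
  for (int i = 0; i < 12; ++i) src[i] = (uint8_t)(i + 1);
  memset(dst, 0xAA, sizeof(dst));
  copy_rect_sketch(src, 4, dst, 4, 4, 3, 0);
  for (int i = 12; i < 16; ++i) {
    if (dst[i] != 0xAA) return 0;
  }
  return memcmp(src, dst, 12) == 0;
}
```

If the copy size were always `width * 2`, the last row of an 8-bit rectangle would write `width` bytes past the end of the destination, which is the failure mode the commit message describes.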
    • [NORMATIVE] Film grain bug-fixes · 2e8ae05c
      Andrey Norkin authored
      Change-Id: I63f84dca86ca426b9c6927b056657741022d5f68
  6. 14 Feb, 2018 4 commits
    • Refactor pair_set_epi16 for speedup · 8b8aaffc
      Peng Bin authored
      Use _mm_set1_epi32 instead of _mm_set_epi16, so the compiler produces
      fewer instructions. This patch also removes the duplicate define of the same
      Speed test results:
      1. Unit tests for each test case in SSE2/AV1LbdInvTxfm2d show a 60%~80%
      speedup (except cases whose TX_TYPE includes iidentity)
      2. A brief speed test shows that with this CL, the speed1 encoder speeds up
      ~3% and the decoder speeds up ~1.8%.
      (Baseline is 18976fa5)
      Change-Id: I2b0e12973fda05a21d6b6eb0f0efe11df6edfb84
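A minimal sketch of the idea, assuming an x86 target with SSE2. The function names below are hypothetical and this is not the libaom macro itself; it only shows that packing the pair (a, b) into one 32-bit value and broadcasting it with `_mm_set1_epi32` yields the same register as listing all eight 16-bit lanes with `_mm_set_epi16`.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Eight 16-bit lanes listed one by one: a in even lanes, b in odd lanes.
 * _mm_set_epi16 takes its arguments from the highest lane to the lowest. */
static __m128i pair_set_epi16_lanes(int16_t a, int16_t b) {
  return _mm_set_epi16(b, a, b, a, b, a, b, a);
}

/* Pack (a, b) into one 32-bit value (a in the low half) and broadcast it
 * to all four 32-bit lanes. */
static __m128i pair_set_epi16_broadcast(int16_t a, int16_t b) {
  const uint32_t packed =
      (uint32_t)(uint16_t)a | ((uint32_t)(uint16_t)b << 16);
  return _mm_set1_epi32((int32_t)packed);
}

/* Returns 1 when both constructions give bit-identical registers. */
int pair_set_matches(int16_t a, int16_t b) {
  const __m128i x = pair_set_epi16_lanes(a, b);
  const __m128i y = pair_set_epi16_broadcast(a, b);
  return memcmp(&x, &y, sizeof(x)) == 0;
}
```

When a and b are compile-time constants, the packed value is also a constant, so the broadcast form gives the compiler a single scalar to splat instead of eight lanes to assemble.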
    • Remove unused variables · cbfffa8e
      Yaowu Xu authored
      Change-Id: I5290f94da6c1a0319357f84b2ec70b4331a0e4af
    • Remove two more LPF macros · 8ec5c077
      Yaowu Xu authored
      Change-Id: I60278e399f4f65aa63526e459947e88084f0e889
    • Remove CONFIG_PARALLEL_DEBLOCKING · 6d0ed3ed
      Yaowu Xu authored
      The experiment is fully adopted now.
      Change-Id: I27906d2af4c746ce55aa17f64d1c0ef281e23ab2
  7. 13 Feb, 2018 1 commit
  8. 12 Feb, 2018 1 commit
    • Add inv txfm2d sse2 for sizes with 4 · 18976fa5
      Peng Bin authored
      Implement av1_lowbd_inv_txfm2d_add_4x4_sse2
      Implement av1_lowbd_inv_txfm2d_add_4x8_sse2
      Implement av1_lowbd_inv_txfm2d_add_8x4_sse2
      Implement av1_lowbd_inv_txfm2d_add_4x16_sse2
      Implement av1_lowbd_inv_txfm2d_add_16x4_sse2
      A brief speed test shows that, using the SSE2 functions
      completed by this CL, the speed1 lowbitdepth encoder speeds up >9%
      and the lowbitdepth decoder speeds up >25%, compared to the highbitdepth
      implementation in the baseline.
      Change-Id: I0576a2a146c0b1a7b483c9d35c3d21d979e263cd
  9. 10 Feb, 2018 1 commit
  10. 09 Feb, 2018 2 commits
    • [wedge/compound-segment, normative] Remove more rounding · 7dbb0051
      David Barker authored
      This reduces the overall rounding in the masked blend process -
      the result is now equivalent to having a single round operation
      at the end of the prediction process.
      This increases the range of the intermediate values inside
      aom_blend_a64_d32_mask() by 2 bits, but has no effect on the
      ranges of any values outside that function.
      Change-Id: I1010ed94c7d8db75bb3d8157c864c5527005725b
    • [wedge/compound-segment, normative] Reduce multiple rounding · d3b99738
      David Barker authored
      As described in the linked bug report, the masked blend operation
      contains multiple stages of rounding. This commit replaces one
      intermediate round with a right shift, which should be slightly
      faster and more accurate.
      Change-Id: Ib24ce687e628b05d645fbde5306ee552f7ad876b
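Why a plain right shift in an intermediate stage can be more accurate than a round can be seen in a small sketch. It does not reproduce the masked blend code; the helper names are hypothetical, and it assumes non-negative inputs and the usual round-to-nearest shift `(x + (1 << (n - 1))) >> n`. Rounding twice can differ from rounding once at the end, while truncating first and rounding afterwards always matches the single final round.

```c
#include <stdint.h>

/* Round-to-nearest division by 2^n, for n >= 1. */
static int32_t round_shift(int32_t x, int n) {
  return (x + (1 << (n - 1))) >> n;
}

/* Two rounding stages: round by a bits, then round by b bits. */
int32_t round_twice(int32_t x, int a, int b) {
  return round_shift(round_shift(x, a), b);
}

/* Plain right shift by a bits, followed by one rounding stage of b bits. */
int32_t shift_then_round(int32_t x, int a, int b) {
  return round_shift(x >> a, b);
}

/* A single rounding stage of a + b bits at the very end. */
int32_t round_once(int32_t x, int a, int b) {
  return round_shift(x, a + b);
}

/* Counts the inputs in [0, limit) for which the staged variant differs
 * from a single final round. */
int count_mismatches(int32_t limit, int a, int b, int use_plain_shift) {
  int mismatches = 0;
  for (int32_t x = 0; x < limit; ++x) {
    const int32_t staged =
        use_plain_shift ? shift_then_round(x, a, b) : round_twice(x, a, b);
    if (staged != round_once(x, a, b)) ++mismatches;
  }
  return mismatches;
}
```

For example, with x = 1 and a = b = 1, rounding twice gives 1, while both the single round and the shift-then-round variant give 0, which is the nearest integer to 1/4.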
  11. 07 Feb, 2018 1 commit
    • SSE2 optimizations for _16 highbd lpf functions · e33f5819
      Maxym Dmytrychenko authored
      Includes vertical and horizontal implementations
      and to fix 13 TAPs/Parallel deblocking support
      Appropriate tests are enabled
      Performance changes, SSE2 over C:
      Horizontal methods: up to    2x
      Vertical   methods: up to  1.5x
      Change-Id: Icbdc217a55353eb33417b81847b73005e043262d
  12. 06 Feb, 2018 3 commits
  13. 05 Feb, 2018 1 commit
  14. 03 Feb, 2018 3 commits
    • Add aom_comp_mask_pred_avx2 · 3c74dd45
      Peng Bin authored
      1. Add AVX2 implementation of aom_comp_mask_pred.
      2. For width 8, still use the ssse3 version.
      3. For other widths (16, 32), the AVX2 version is 1.2x-2.0x faster
      than the ssse3 version.
      Change-Id: I80acc1be54ab21a52f7847e91b1299853add757c
    • comp_mask_pred:process each width separately · 953b77ee
      Peng Bin authored
      There are 3 valid input widths for aom_comp_mask_pred_ssse3.
      Processing each width (8, 16, 32) separately achieves a
      1.2x~1.5x speedup compared to the original ssse3 version.
      Change-Id: Ida3699e2e6ca98d1f9c7662d48806b299af26f10
    • Replace 64 bit operations with 32 bit ones · f06f641f
      Yaowu Xu authored
      Change-Id: Ic51231510fc8bb897f8ca771dd4e750d0e1cd693
  15. 02 Feb, 2018 2 commits
    • AVX2 implementation of the Wiener filter · aab6aee3
      Imdad Sardharwalla authored
      Added an AVX2 version of the Wiener filter, along with associated tests. Speed
      tests have been added for all implementations of the Wiener filter.
      Speed Test results
      Low bit-depth filter:
      - SSE2 vs C: SSE2 takes ~92% less time
      - AVX2 vs C: AVX2 takes ~96% less time
      - SSE2 vs AVX2: AVX2 takes ~43% less time (~74% faster)
      High bit-depth filter:
      - SSSE3 vs C: SSSE3 takes ~92% less time
      - AVX2  vs C: AVX2  takes ~96% less time
      - SSSE3 vs AVX2: AVX2 takes ~46% less time (~84% faster)
      Low bit-depth filter:
      - SSE2 vs C: SSE2 takes ~84% less time
      - AVX2 vs C: AVX2 takes ~88% less time
      - SSE2 vs AVX2: AVX2 takes ~27% less time (~36% faster)
      High bit-depth filter:
      - SSSE3 vs C: SSSE3 takes ~85% less time
      - AVX2  vs C: AVX2  takes ~89% less time
      - SSSE3 vs AVX2: AVX2 takes ~24% less time (~31% faster)
      Change-Id: Ide22d7c09c0be61483e9682caf17a39438e4a208
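The two figures on each comparison line above ("takes p% less time" and "q% faster") describe the same measurement. A small sketch of the conversion, using a hypothetical helper name and assuming "faster" means the gain in work done per unit time:

```c
/* Converts "takes p% less time" into "is q% faster".
 * If the new code needs a fraction (1 - p/100) of the old time, it does
 * 1 / (1 - p/100) times as much work in the same time. */
double percent_faster_from_time_saved(double percent_less_time) {
  const double remaining = 1.0 - percent_less_time / 100.0;
  return (1.0 / remaining - 1.0) * 100.0;
}
```

Feeding in the rounded value 43 gives roughly 75, close to the ~74% quoted above; small differences of this kind are expected because the percentages in the commit message are themselves rounded.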
    • Remove aom_comp_mask_upsampled_pred from rtcd · f8daa92d
      Peng Bin authored
      Since aom_comp_mask_upsampled_pred just calls aom_upsampled_pred
      and aom_comp_mask_pred, there is no need to keep a C version separate
      from the SIMD version any more.
      Change-Id: I1ff8bcae87d501c68a80708fd2dc6b74c6952f88
  16. 01 Feb, 2018 1 commit
  17. 31 Jan, 2018 4 commits
    • use GLOBAL() macro when loading constant · 4972ac81
      Johann authored
      Clear linker error when building with gcc 6:
      relocation R_X86_64_32 against `.rodata' can not be used when making a
      shared object; recompile with -fPIC
      Change-Id: I6c06de1e9dac1c044a4b07125abcaba0943a29b6
    • AVX2 optimization of motion compensation functions · c8e0336a
      Deepa K G authored
      AVX2 implementations of av1_convolve_x_sr, av1_convolve_y_sr and
      av1_convolve_2d_sr have been added.
      Improvements have been made to av1_convolve_x_avx2, av1_convolve_y_avx2
      and av1_convolve_2d_avx2.
      Change-Id: I62a699dd9dcf42de94dd72cc2d43affc0dc31404
    • BUG FIX: sse2 subpel variance is not PIC compliant · 0cf864fd
      Johann authored
      cherry-picked from libvpx:
        commit cb9f4dc1056b39383595f658cfcd166833bc0097
        Author: Scott LaVarnway <slavarnway@google.com>
        Date:   Sat Jan 13 07:01:04 2018 -0800
      Change-Id: Ie1736ea0787f4dad80204dcf5251fbb02d79541e
    • Add aom_comp_mask_<upsampled>pred_ssse3 · 33ba1fe5
      Peng Bin authored
      1) For encoder speed, overall ~1% faster with no impact on coding performance.
      2) aom_comp_mask_pred_ssse3 is 3.5x - 6x faster than aom_comp_mask_pred_c
      3) aom_comp_mask_upsampled_pred_ssse3 is 1.5x - 3x faster than
      aom_comp_mask_upsampled_pred_c; for the special case where subpel_x ==
      subpel_y == 0, the optimized version achieves a 4x - 7x speedup
      Unit tests for both functions have been added.
      Change-Id: Ib498317975e0dbd9cdcf61be327b640dfac9a7e5
  18. 30 Jan, 2018 2 commits
    • aom_lpf_horizontal_6_sse2(): fix valgrind warnings · 5a667bfd
      Yaowu Xu authored
      Change-Id: I12d522c3704083bba5c4332031dff7a01fd7dfb3
    • fwd txfm: cherrypick improvements from libvpx · c048a2d9
      Johann authored
      commit 9a780fa7db79b709787a9ca56fc324a118158da7
      Author: Jingning Han <jingning@google.com>
        Rework forward 8x8 2D-DCT ssse3 implementation
      commit 3e3a5686167a5493a5e2223635d1085cf8c963dd
      Author: Johann <johannkoenig@google.com>
        fwd txfm ssse3: use GLOBAL() for loading constants
      Change-Id: If7ca11a5b3c9dcf2ac7dbf8b7643e3424399d201
  19. 29 Jan, 2018 1 commit
  20. 26 Jan, 2018 2 commits
    • Add CDF_STORAGE_REDUCTION experiment flag. · 5f0c41de
      Thomas Daede authored
      Change-Id: I8ce208e842b738bb729d5732f0f35366c3549063
    • SSE2 optimizations for _6/_16 lowbd lpf functions · ae6e6bc1
      Maxym Dmytrychenko authored
      Includes vertical and horizontal implementations
      and to fix 5/13 TAPs/Parallel deblocking support.
      Re-working internals of the filters for better
      re-usage across different sizes.
      Tests are enabled.
      Performance changes, SSE2 over C:
      Horizontal methods: up to    3-4x
      Vertical   methods: up to 1.5x-2x
      Change-Id: I2e36035355d8c23c1d4b0d59d0e23f598e9d0e3f