1. 21 Feb, 2018 2 commits
  2. 20 Feb, 2018 2 commits
  3. 19 Feb, 2018 3 commits
  4. 16 Feb, 2018 2 commits
    • Remove unused jnt functions · 143de432
      Johann authored
      The 4x2 transforms give a compile warning with gcc 6.3.0 but appear
      to be unused:
      
      '*((void *)&temp2+8)' is used uninitialized in this function
      
      Change-Id: I8b08e05d0365dc117b5374ec00bddc6f7bd84bd3
    • Fix a bug in memory access · 3290ba02
      Cheng Chen authored
      Avoid reading/writing outside of the buffer. Detected by ASAN.
      
      Change-Id: I7de2a9f01cc13feb1c13556dfe77e9e6e7e55056
  5. 15 Feb, 2018 3 commits
    • Remove CONFIG_TX64X64 · d3d4159f
      Yaowu Xu authored
      The experiment is fully adopted.
      
      Change-Id: I6cc80a2acf0c93c13b0e36e6f4a2378fe5ce33c3
    • film-grain: fix buffer overflow · aa5904ba
      Dominic Symes authored
      When bit_depth is 8, the copy_rect function was setting the size to
      2 bytes per sample. This caused a buffer overflow: each line copied
      in the loop was twice the number of bytes it should have been, and
      the last line wrote off the end of the buffer.
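      A minimal sketch of the corrected size logic, assuming strides are
      given in samples and that 8-bit planes use 1 byte per sample while
      higher bit depths use 2. `copy_rect_sketch()` and its parameters are
      hypothetical; this is not the actual libaom film-grain `copy_rect`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy a width x height rectangle between planes
 * whose samples are 1 byte (8-bit) or 2 bytes (high bit depth).
 * Strides are in samples. Not the actual libaom copy_rect(). */
static void copy_rect_sketch(void *dst, int dst_stride, const void *src,
                             int src_stride, int width, int height,
                             int bit_depth) {
  /* The bug described above: using 2 here even when bit_depth == 8
   * doubled every memcpy and overran the buffer on the last line. */
  const size_t bytes_per_sample = (bit_depth > 8) ? 2 : 1;
  uint8_t *d = (uint8_t *)dst;
  const uint8_t *s = (const uint8_t *)src;
  for (int y = 0; y < height; ++y) {
    memcpy(d, s, (size_t)width * bytes_per_sample);
    d += (size_t)dst_stride * bytes_per_sample;
    s += (size_t)src_stride * bytes_per_sample;
  }
}
```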
      
      BUG=aomedia:1389
      
      Change-Id: Ib9fa11d1dd13806dedbce2cd47dd8d562007428d
    • [NORMATIVE] Film grain bug-fixes · 2e8ae05c
      Andrey Norkin authored
      BUG=aomedia:1366
      BUG=aomedia:1368
      
      Change-Id: I63f84dca86ca426b9c6927b056657741022d5f68
  6. 14 Feb, 2018 4 commits
    • Refactor pair_set_epi16 for speedup · 8b8aaffc
      Peng Bin authored
      Use _mm_set1_epi32 instead of _mm_set_epi16, for which the compiler
      emits fewer instructions. This patch also removes a duplicate
      definition of the same function.
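      A sketch of the change, assuming the helper should return a vector
      whose 16-bit lanes alternate a, b, a, b, ... starting from lane 0.
      Both function names below are hypothetical and the committed code may
      differ in detail; the point is that packing the pair into one 32-bit
      value lets a single broadcast replace eight separate lane inserts.
      It requires an x86 target with SSE2.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Older form: eight 16-bit lanes given individually (arguments run from
 * lane 7 down to lane 0), so lane 0 = a, lane 1 = b, and so on. */
static inline __m128i pair_set_epi16_old(int a, int b) {
  return _mm_set_epi16((int16_t)b, (int16_t)a, (int16_t)b, (int16_t)a,
                       (int16_t)b, (int16_t)a, (int16_t)b, (int16_t)a);
}

/* Packed form: one 32-bit value with a in the low half and b in the high
 * half, broadcast to all four 32-bit lanes. */
static inline __m128i pair_set_epi16_new(int a, int b) {
  return _mm_set1_epi32(
      (int32_t)(((uint32_t)(uint16_t)a) | ((uint32_t)(uint16_t)b << 16)));
}
```

      The packed form relies on x86 being little-endian, which is what
      places the low half (a) in the even-numbered 16-bit lanes.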
      
      Speed test results:
      1. Unit tests for each test case in SSE2/AV1LbdInvTxfm2d show a 60%~80%
      speedup (except for cases whose TX_TYPE includes iidentity).
      2. A brief speed test shows that, with this CL, the speed1 encoder
      speeds up by ~3% and the decoder by ~1.8%.
      (Baseline is 18976fa5)
      
      Change-Id: I2b0e12973fda05a21d6b6eb0f0efe11df6edfb84
    • Remove unused variables · cbfffa8e
      Yaowu Xu authored
      Change-Id: I5290f94da6c1a0319357f84b2ec70b4331a0e4af
    • Remove two more LPF macros · 8ec5c077
      Yaowu Xu authored
      Change-Id: I60278e399f4f65aa63526e459947e88084f0e889
    • Remove CONFIG_PARALLEL_DEBLOCKING · 6d0ed3ed
      Yaowu Xu authored
      The experiment is fully adopted now.
      
      Change-Id: I27906d2af4c746ce55aa17f64d1c0ef281e23ab2
  7. 13 Feb, 2018 1 commit
  8. 12 Feb, 2018 1 commit
    • Add inv txfm2d sse2 for sizes with 4 · 18976fa5
      Peng Bin authored
      Implement av1_lowbd_inv_txfm2d_add_4x4_sse2
      Implement av1_lowbd_inv_txfm2d_add_4x8_sse2
      Implement av1_lowbd_inv_txfm2d_add_8x4_sse2
      Implement av1_lowbd_inv_txfm2d_add_4x16_sse2
      Implement av1_lowbd_inv_txfm2d_add_16x4_sse2
      
      A brief speed test shows that, with the SSE2 functions added in this
      CL, the speed1 low-bitdepth encoder speeds up by >9% and the
      low-bitdepth decoder by >25%, compared to the high-bitdepth
      implementation in the baseline.
      
      Change-Id: I0576a2a146c0b1a7b483c9d35c3d21d979e263cd
  9. 10 Feb, 2018 1 commit
  10. 09 Feb, 2018 2 commits
    • [wedge/compound-segment, normative] Remove more rounding · 7dbb0051
      David Barker authored
      This reduces the overall rounding in the masked blend process: the
      result is now equivalent to having a single round operation at the
      end of the prediction process.
      
      This increases the range of the intermediate values inside
      aom_blend_a64_d32_mask() by 2 bits, but has no effect on the
      ranges of any values outside that function.
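      To illustrate why one final rounding is preferable, here is a
      simplified model, assuming non-negative predictions that carry
      `extra_bits` of fractional precision and a 6-bit (A64) mask. The
      helper names are hypothetical and `aom_blend_a64_d32_mask()` itself
      is not reproduced. Note that the single-round sum needs `extra_bits`
      more bits of headroom, the same kind of trade-off as the range
      increase described above.

```c
#include <stdint.h>

/* Round-to-nearest right shift (ties round up), non-negative v only. */
static int32_t round_shift(int32_t v, int n) {
  return (v + ((1 << n) >> 1)) >> n;
}

/* A64 blend: mask m is in [0, 64]; weights m and 64 - m sum to 64. */
#define A64_BITS 6
#define A64_MAX (1 << A64_BITS)

/* Multi-stage: round each high-precision prediction to pixel precision
 * first, then blend and round again. */
static int32_t blend_multi_round(int32_t p0, int32_t p1, int m,
                                 int extra_bits) {
  const int32_t a = round_shift(p0, extra_bits);
  const int32_t b = round_shift(p1, extra_bits);
  return round_shift(m * a + (A64_MAX - m) * b, A64_BITS);
}

/* Single-stage: blend at full precision and round once at the end. */
static int32_t blend_single_round(int32_t p0, int32_t p1, int m,
                                  int extra_bits) {
  return round_shift(m * p0 + (A64_MAX - m) * p1, A64_BITS + extra_bits);
}
```

      With p0 = 24 (1.5 in pixel units when extra_bits = 4), p1 = 0 and
      m = 16, the exact result is 0.375. The single-round form gives 0; the
      multi-round form gives 1 because 1.5 was first rounded up to 2.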
      
      Change-Id: I1010ed94c7d8db75bb3d8157c864c5527005725b
    • [wedge/compound-segment, normative] Reduce multiple rounding · d3b99738
      David Barker authored
      As described in the linked bug report, the masked blend operation
      contains multiple stages of rounding. This commit replaces one
      intermediate round with a right shift, which should be slightly
      faster and more accurate.
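      Why a plain shift can be more accurate than an intermediate round:
      truncating by `a` bits and then rounding by `b` bits (b >= 1) gives
      exactly the same result as rounding once by `a + b` bits, whereas
      rounding twice does not. The sketch below checks this for
      non-negative values and consecutive shifts only; the real blend has
      a multiply between its stages, so this is a simplified model, and the
      function names are hypothetical.

```c
#include <stdint.h>

/* Round-to-nearest right shift (ties round up), non-negative v only. */
static uint32_t round_shift(uint32_t v, int n) {
  return (v + ((1u << n) >> 1)) >> n;
}

/* Two rounding stages: can suffer from double rounding. */
static uint32_t round_then_round(uint32_t v, int a, int b) {
  return round_shift(round_shift(v, a), b);
}

/* Plain shift for the first stage, rounding only at the end. */
static uint32_t shift_then_round(uint32_t v, int a, int b) {
  return round_shift(v >> a, b);
}

/* Reference: a single rounding by the combined amount. */
static uint32_t single_round(uint32_t v, int a, int b) {
  return round_shift(v, a + b);
}
```

      For example, with v = 1 and a = b = 1 the exact value is 0.25:
      single_round() and shift_then_round() return 0, but round_then_round()
      returns 1.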
      
      BUG=aomedia:1292
      
      Change-Id: Ib24ce687e628b05d645fbde5306ee552f7ad876b
  11. 07 Feb, 2018 1 commit
    • SSE2 optimizations for _16 highbd lpf functions · e33f5819
      Maxym Dmytrychenko authored
      Includes vertical and horizontal implementations, and fixes
      13-tap/parallel deblocking support.
      
      The corresponding tests are enabled.
      
      Performance changes, SSE2 over C:
      Horizontal methods: up to    2x
      Vertical   methods: up to  1.5x
      
      Change-Id: Icbdc217a55353eb33417b81847b73005e043262d
  12. 06 Feb, 2018 3 commits
  13. 05 Feb, 2018 1 commit
  14. 03 Feb, 2018 3 commits
    • Add aom_comp_mask_pred_avx2 · 3c74dd45
      Peng Bin authored
      1. Add an AVX2 implementation of aom_comp_mask_pred.
      2. For width 8, the SSSE3 version is still used.
      3. For the other widths (16, 32), the AVX2 version is 1.2x-2.0x faster
      than the SSSE3 version.
      
      Change-Id: I80acc1be54ab21a52f7847e91b1299853add757c
    • comp_mask_pred: process each width separately · 953b77ee
      Peng Bin authored
      aom_comp_mask_pred_ssse3 has 3 valid input widths. Processing each
      width (8, 16, 32) separately achieves a 1.2x~1.5x speedup compared to
      the original SSSE3 version.
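      The shape of "one path per width" can be sketched in scalar C,
      assuming the per-pixel operation is the 6-bit mask blend that libaom
      names `AOM_BLEND_A64` and, for brevity, a single shared stride for
      all buffers. `comp_mask_by_width()` and `comp_mask_rows()` are
      hypothetical; the real SSSE3 code uses hand-written SIMD for each
      width, and the generic fallback here is an addition for completeness.

```c
#include <stdint.h>

/* 6-bit mask blend, modeled on AOM_BLEND_A64: m is in [0, 64]. */
static inline uint8_t blend_a64(int m, int v0, int v1) {
  return (uint8_t)((m * v0 + (64 - m) * v1 + 32) >> 6);
}

/* Row loop taking the width as a parameter. When inlined into a call
 * site with a literal width, the compiler sees a constant trip count and
 * can unroll or vectorize the inner loop for that width. */
static inline void comp_mask_rows(uint8_t *dst, const uint8_t *p0,
                                  const uint8_t *p1, const uint8_t *mask,
                                  int stride, int height, const int w) {
  for (int i = 0; i < height; ++i) {
    for (int j = 0; j < w; ++j) dst[j] = blend_a64(mask[j], p0[j], p1[j]);
    dst += stride;
    p0 += stride;
    p1 += stride;
    mask += stride;
  }
}

/* Hypothetical dispatcher: one specialized path per valid width. */
static void comp_mask_by_width(uint8_t *dst, const uint8_t *p0,
                               const uint8_t *p1, const uint8_t *mask,
                               int stride, int width, int height) {
  switch (width) {
    case 8: comp_mask_rows(dst, p0, p1, mask, stride, height, 8); break;
    case 16: comp_mask_rows(dst, p0, p1, mask, stride, height, 16); break;
    case 32: comp_mask_rows(dst, p0, p1, mask, stride, height, 32); break;
    default: comp_mask_rows(dst, p0, p1, mask, stride, height, width); break;
  }
}
```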
      
      Change-Id: Ida3699e2e6ca98d1f9c7662d48806b299af26f10
    • Replace 64 bit operations with 32 bit ones · f06f641f
      Yaowu Xu authored
      Change-Id: Ic51231510fc8bb897f8ca771dd4e750d0e1cd693
  15. 02 Feb, 2018 2 commits
    • AVX2 implementation of the Wiener filter · aab6aee3
      Imdad Sardharwalla authored
      Added an AVX2 version of the Wiener filter, along with associated tests. Speed
      tests have been added for all implementations of the Wiener filter.
      
      Speed Test results
      ==================
      
      GCC
      ---
      
      Low bit-depth filter:
      - SSE2 vs C: SSE2 takes ~92% less time
      - AVX2 vs C: AVX2 takes ~96% less time
      - SSE2 vs AVX2: AVX2 takes ~43% less time (~74% faster)
      
      High bit-depth filter:
      - SSSE3 vs C: SSSE3 takes ~92% less time
      - AVX2  vs C: AVX2  takes ~96% less time
      - SSSE3 vs AVX2: AVX2 takes ~46% less time (~84% faster)
      
      CLANG
      -----
      
      Low bit-depth filter:
      - SSE2 vs C: SSE2 takes ~84% less time
      - AVX2 vs C: AVX2 takes ~88% less time
      - SSE2 vs AVX2: AVX2 takes ~27% less time (~36% faster)
      
      High bit-depth filter:
      - SSSE3 vs C: SSSE3 takes ~85% less time
      - AVX2  vs C: AVX2  takes ~89% less time
      - SSSE3 vs AVX2: AVX2 takes ~24% less time (~31% faster)
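      The two figures on each comparison line are related: if the faster
      version takes a fraction r less time, it runs 1 / (1 - r) times as
      fast, i.e. r / (1 - r) faster. The helper below is a hypothetical
      checking aid, not part of the test suite; for instance, 42.5% less
      time corresponds to about 73.9% faster, consistent with the reported
      "~43% less time (~74% faster)" once the first figure is rounded.

```c
/* Convert "takes X% less time" into "is Y% faster" (throughput gain).
 * If t_new = (1 - X/100) * t_old, the speedup factor is t_old / t_new,
 * and the gain in percent is (factor - 1) * 100. */
static double less_time_to_percent_faster(double less_time_pct) {
  const double remaining = 1.0 - less_time_pct / 100.0;
  return (1.0 / remaining - 1.0) * 100.0;
}
```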
      
      Change-Id: Ide22d7c09c0be61483e9682caf17a39438e4a208
    • Remove aom_comp_mask_upsampled_pred from rtcd · f8daa92d
      Peng Bin authored
      Since aom_comp_mask_upsampled_pred just calls aom_upsampled_pred
      and aom_comp_mask_pred, there is no need to keep separate C and SIMD
      versions of it any more.
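      A minimal model of that reasoning, assuming rtcd-style dispatch
      through function pointers that are set once at startup. All names are
      hypothetical and the "upsampled prediction" is reduced to a plain
      copy; the point is only that a plain C wrapper over two dispatched
      calls already runs whichever variants are selected.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for two rtcd-style dispatched kernels. A real
 * build would point these at C or SIMD variants once at startup. */
typedef void (*upsampled_pred_fn)(uint8_t *pred, const uint8_t *ref, int n);
typedef void (*comp_mask_pred_fn)(uint8_t *out, const uint8_t *pred,
                                  const uint8_t *ref, const uint8_t *mask,
                                  int n);

/* Stand-in "upsampled prediction": a plain copy, for illustration only. */
static void upsampled_pred_c(uint8_t *pred, const uint8_t *ref, int n) {
  memcpy(pred, ref, (size_t)n);
}

/* A64 mask blend of pred and ref (mask values in [0, 64]). */
static void comp_mask_pred_c(uint8_t *out, const uint8_t *pred,
                             const uint8_t *ref, const uint8_t *mask, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = (uint8_t)((mask[i] * pred[i] + (64 - mask[i]) * ref[i] + 32) >> 6);
}

static upsampled_pred_fn upsampled_pred = upsampled_pred_c;
static comp_mask_pred_fn comp_mask_pred = comp_mask_pred_c;

/* The wrapper only composes the two dispatched calls, so it uses whatever
 * variants are currently selected. tmp must hold n bytes. */
static void comp_mask_upsampled_pred(uint8_t *out, uint8_t *tmp,
                                     const uint8_t *up_ref,
                                     const uint8_t *ref,
                                     const uint8_t *mask, int n) {
  upsampled_pred(tmp, up_ref, n);
  comp_mask_pred(out, tmp, ref, mask, n);
}
```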
      
      Change-Id: I1ff8bcae87d501c68a80708fd2dc6b74c6952f88
  16. 01 Feb, 2018 1 commit
  17. 31 Jan, 2018 4 commits
    • use GLOBAL() macro when loading constant · 4972ac81
      Johann authored
      Fix a linker error when building with gcc 6:
      relocation R_X86_64_32 against `.rodata' can not be used when making a
      shared object; recompile with -fPIC
      
      BUG=aomedia:102
      
      Change-Id: I6c06de1e9dac1c044a4b07125abcaba0943a29b6
    • AVX2 optimization of motion compensation functions · c8e0336a
      Deepa K G authored
      AVX2 implementations of av1_convolve_x_sr, av1_convolve_y_sr and
      av1_convolve_2d_sr have been added.
      
      Improvements have been made to av1_convolve_x_avx2, av1_convolve_y_avx2
      and av1_convolve_2d_avx2.
      
      Change-Id: I62a699dd9dcf42de94dd72cc2d43affc0dc31404
    • BUG FIX: sse2 subpel variance is not PIC compliant · 0cf864fd
      Johann authored
      cherry-picked from libvpx:
        commit cb9f4dc1056b39383595f658cfcd166833bc0097
        Author: Scott LaVarnway <slavarnway@google.com>
        Date:   Sat Jan 13 07:01:04 2018 -0800
      
      BUG=aomedia:102
      
      Change-Id: Ie1736ea0787f4dad80204dcf5251fbb02d79541e
    • Add aom_comp_mask_<upsampled>pred_ssse3 · 33ba1fe5
      Peng Bin authored
      1) Overall, the encoder is ~1% faster, with no impact on coding performance.
      2) aom_comp_mask_pred_ssse3 is 3.5x - 6x faster than aom_comp_mask_pred_c.
      3) aom_comp_mask_upsampled_pred_ssse3 is 1.5x - 3x faster than
      aom_comp_mask_upsampled_pred_c; for the special case where subpel_x ==
      subpel_y == 0, the optimized version achieves a 4x - 7x speedup.
      
      Unit tests for both functions have been added.
      
      Change-Id: Ib498317975e0dbd9cdcf61be327b640dfac9a7e5
  18. 30 Jan, 2018 2 commits
    • aom_lpf_horizontal_6_sse2(): fix valgrind warnings · 5a667bfd
      Yaowu Xu authored
      BUG=aomedia:1285
      
      Change-Id: I12d522c3704083bba5c4332031dff7a01fd7dfb3
    • fwd txfm: cherrypick improvements from libvpx · c048a2d9
      Johann authored
      commit 9a780fa7db79b709787a9ca56fc324a118158da7
      Author: Jingning Han <jingning@google.com>
        Rework forward 8x8 2D-DCT ssse3 implementation
      
      commit 3e3a5686167a5493a5e2223635d1085cf8c963dd
      Author: Johann <johannkoenig@google.com>
        fwd txfm ssse3: use GLOBAL() for loading constants
      
      Change-Id: If7ca11a5b3c9dcf2ac7dbf8b7643e3424399d201
  19. 29 Jan, 2018 1 commit
  20. 26 Jan, 2018 1 commit