1. 08 Mar, 2017 1 commit
    • David Barker's avatar
      Make encoder use vectorized self-guided filter · 506eb723
      David Barker authored
      By rearranging the code in restoration.c, we can allow the
      encoder to use the SSE4.1 version of the self-guided filter
      while picking the loop-restoration filter.
      
      This also helps us prepare for adding a highbitdepth SSE4.1
      version of the self-guided filter.
      
      No effect on encoder output, but gives an end-to-end speedup
      of 1-2%.
      
      Change-Id: Id17ba4a0963ddce9f70a7cae666e212e138d5f2c
      506eb723
  2. 06 Mar, 2017 1 commit
    • David Barker's avatar
      Vectorize self-guided filter · ce110cc5
      David Barker authored
      Add an SSE4.1 lowbd version of the self-guided filter for
      loop-restoration, and apply some optimizations to the C
      version.
      
      Approximate times per 128x128 / 256x256 tile on the machine
      this was developed on:
      Previous C:  620us / 2800us
      Optimized C: 500us / 2200us ( 24% /  27% faster)
      SSE4.1:      147us / 600us  (320% / 370% faster)
      
      Change-Id: I23ff5a5482a191aeb06f9d1f767a9f036bb357fe
      ce110cc5
  3. 02 Mar, 2017 1 commit
    • Yue Chen's avatar
      Use 3-tap spatial filter in FILTER_INTRA experiment · 8d8638a1
      Yue Chen authored
      3-tap recursive intra prediction filters are added.
      Macro USE_3TAP_INTRA_FILTER is set to 1 to use 3-tap by default.
      Coding gain of FILTER_INTRA experiment in AWCY, high delay 150f
      3-tap: 0.51%
      4-tap: 0.68%
      
      Change-Id: I44192dd08bfd8155f58a9b0b5cf1de88fceb762e
      8d8638a1
  4. 28 Feb, 2017 1 commit
    • Michael Bebenita's avatar
      Add SIMD code for PVQ search · 3a88de8f
      Michael Bebenita authored
      This reduces the runtime profile of pvq_search_rdo_double from 37%
      to 15% and improves overall encoding speed when PVQ is enabled by ~40%.
      The SIMD code is not bit accurate with the C version and introduces a
      slight PSNR regression on AWCY:
      
        PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000
      0.0607 |  0.1044 |     N/A |   0.0126 |  N/A | -0.0309 |        N/A
      
      Change-Id: Ie22cebc62df2e72618305f2268668d79167860c6
      3a88de8f
  5. 24 Feb, 2017 1 commit
    • Angie Chiang's avatar
      Let hbd conv func be flexible · 0a2c0cbc
      Angie Chiang authored
      This CL allow us to change filter coefficients easily for SIMD
      implementation of high bitdepth convolution functions
      
      Change-Id: I454a5c76d3ba9e4454118c6a9d87737b3aa24898
      0a2c0cbc
  6. 18 Feb, 2017 1 commit
  7. 19 Jan, 2017 1 commit
  8. 13 Jan, 2017 1 commit
    • Angie Chiang's avatar
      Add rounding option into av1_convolve · 674bffdc
      Angie Chiang authored
      Use a round flag in ConvolveParams to indicate if the destination buffer
      has the result rounded by FILTER_BITS or not.
      This CL is part of the goal of reducing interpolation rounding error in
      compound prediction mode.
      
      Change-Id: I49e522a89a67a771f5a6e7fbbc609e97923aecb6
      674bffdc
  9. 12 Jan, 2017 2 commits
    • David Barker's avatar
      Add SSE2 vectorized warp filter for lowbd · d5dfa96e
      David Barker authored
      End-to-end speed improvements: (measured on tempete_cif.y4m,
      20 frames for encoder and all 260 frames for decoder)
      
      * GLOBAL_MOTION encoder: ~10% faster
      * GLOBAL_MOTION decoder: 100-200% faster depending on bitrate
      * WARPED_MOTION encoder: ~2.5% faster
      * WARPED_MOTION decoder: ~20-40% faster depending on bitrate
      
      The improvement in the GLOBAL_MOTION decoder is particularly
      large because its runtime is dominated by calls to warp_plane().
      
      This introduces minor changes to the output of the warp filter,
      but these should be rare.
      
      Change-Id: I5813ab9e90311e27587045153c32d400b6b9eb92
      d5dfa96e
    • Yi Luo's avatar
      High bit depth 32x32 inverse DCT_DCT transform, AVX2 · 3bd83775
      Yi Luo authored
      - Witness the follow user-level speedup on AV1 baseline:
       Encoding time reduction: 4.26%
       Decoding time reduction: 25.35%
      
      Change-Id: Ideaf3cd473ad45ed9256c80d5a5daed0a6e098cf
      3bd83775
  10. 14 Dec, 2016 1 commit
  11. 29 Nov, 2016 1 commit
    • Angie Chiang's avatar
      Add av1_convolve_init() · e067de00
      Angie Chiang authored
      Generate simd filter structure in av1_convolve_init()
      This will provide flexibility of changing filter coefficients.
      
      Change-Id: If79f84c56483aa08c894d6b12e2b6ce10147f0ce
      e067de00
  12. 28 Nov, 2016 1 commit
  13. 09 Nov, 2016 1 commit
  14. 07 Nov, 2016 1 commit
    • Yushin Cho's avatar
      New experiment: Perceptual Vector Quantization from Daala · 77bba8d3
      Yushin Cho authored
      PVQ replaces the scalar quantizer and coefficient coding with a new
      design originally developed in Daala. It currently depends on the
      Daala entropy coder although it could be adapted to work with another
      entropy coder if needed:
      ./configure --enable-experimental --enable-daala_ec --enable-pvq
      
      The version of PVQ in this commit is adapted from the following
      revision of Daala:
      https://github.com/xiph/daala/commit/fb51c1ade6a31b668a0157d89de8f0a4493162a8
      
      More information about PVQ:
      - https://people.xiph.org/~jm/daala/pvq_demo/
      - https://jmvalin.ca/papers/spie_pvq.pdf
      
      The following files are copied as-is from Daala with minimal
      adaptations, therefore we disable clang-format on those files
      to make it easier to synchronize the AV1 and Daala codebases in the future:
       av1/common/generic_code.c
       av1/common/generic_code.h
       av1/common/laplace_tables.c
       av1/common/partition.c
       av1/common/partition.h
       av1/common/pvq.c
       av1/common/pvq.h
       av1/common/state.c
       av1/common/state.h
       av1/common/zigzag.h
       av1/common/zigzag16.c
       av1/common/zigzag32.c
       av1/common/zigzag4.c
       av1/common/zigzag64.c
       av1/common/zigzag8.c
       av1/decoder/decint.h
       av1/decoder/generic_decoder.c
       av1/decoder/laplace_decoder.c
       av1/decoder/pvq_decoder.c
       av1/decoder/pvq_decoder.h
       av1/encoder/daala_compat_enc.c
       av1/encoder/encint.h
       av1/encoder/generic_encoder.c
       av1/encoder/laplace_encoder.c
       av1/encoder/pvq_encoder.c
       av1/encoder/pvq_encoder.h
      
      Known issues:
      - Lossless mode is not supported, '--lossless=1' will give the same result as
      '--end-usage=q --cq-level=1'.
      - High bit depth is not supported by PVQ.
      
      Change-Id: I1ae0d6517b87f4c1ccea944b2e12dc906979f25e
      77bba8d3
  15. 04 Nov, 2016 1 commit
    • Yushin Cho's avatar
      New experiment: Perceptual Vector Quantization from Daala · 09705fe7
      Yushin Cho authored
      PVQ replaces the scalar quantizer and coefficient coding with a new
      design originally developed in Daala. It currently depends on the
      Daala entropy coder although it could be adapted to work with another
      entropy coder if needed:
      ./configure --enable-experimental --enable-daala_ec --enable-pvq
      
      The version of PVQ in this commit is adapted from the following
      revision of Daala:
      https://github.com/xiph/daala/commit/fb51c1ade6a31b668a0157d89de8f0a4493162a8
      
      More information about PVQ:
      - https://people.xiph.org/~jm/daala/pvq_demo/
      - https://jmvalin.ca/papers/spie_pvq.pdf
      
      The following files are copied as-is from Daala with minimal
      adaptations, therefore we disable clang-format on those files
      to make it easier to synchronize the AV1 and Daala codebases in the future:
       av1/common/generic_code.c
       av1/common/generic_code.h
       av1/common/laplace_tables.c
       av1/common/partition.c
       av1/common/partition.h
       av1/common/pvq.c
       av1/common/pvq.h
       av1/common/state.c
       av1/common/state.h
       av1/common/zigzag.h
       av1/common/zigzag16.c
       av1/common/zigzag32.c
       av1/common/zigzag4.c
       av1/common/zigzag64.c
       av1/common/zigzag8.c
       av1/decoder/decint.h
       av1/decoder/generic_decoder.c
       av1/decoder/laplace_decoder.c
       av1/decoder/pvq_decoder.c
       av1/decoder/pvq_decoder.h
       av1/encoder/daala_compat_enc.c
       av1/encoder/encint.h
       av1/encoder/generic_encoder.c
       av1/encoder/laplace_encoder.c
       av1/encoder/pvq_encoder.c
       av1/encoder/pvq_encoder.h
      
      Known issues:
      - Lossless mode is not supported, '--lossless=1' will give the same result as
      '--end-usage=q --cq-level=1'.
      - High bit depth is not supported by PVQ.
      
      Change-Id: I1ae0d6517b87f4c1ccea944b2e12dc906979f25e
      09705fe7
  16. 03 Nov, 2016 1 commit
  17. 02 Nov, 2016 3 commits
  18. 01 Nov, 2016 2 commits
  19. 20 Oct, 2016 2 commits
    • Yi Luo's avatar
      Fix the overflow of av1_fht32x32() in 2D DCT_DCT · 157e45a4
      Yi Luo authored
      - Use range check function to avoid DCT_DCT overflow.
        We need to re-develop the column txfm side scaling/rounding. Now,
        we prefer to maintain the current BDRate level.
      - Encoder user level time reduction <1% owing to av1_fht32x32_avx2.
      - Add MemCheck unit test and fdct32() unit test.
      
      Change-Id: I1e67030f67bc637859798ebe2f6698afffb8531c
      157e45a4
    • hui su's avatar
      Seperate FILTER_INTRA from EXT_INTRA experiment · 5db9743f
      hui su authored
      Prepare for the av1/nextgenv2 merge.
      
      Coding gain (%):
      
                     lowres     midres
      ext-intra       0.69       0.97
      filter-intra    0.67       0.83
      both            1.05       1.48
      
      Change-Id: Ia24d6fafb3e484c4f92192e0b7eee5e39f4f4ee6
      5db9743f
  20. 19 Oct, 2016 1 commit
  21. 13 Oct, 2016 2 commits
  22. 12 Oct, 2016 1 commit
    • Yi Luo's avatar
      Hybrid forward transform 32x32 AVX2 optimization · fed8e1c0
      Yi Luo authored
      - av1_fht32x32 AVX2 function level time reduction ~89% compared to C.
      
      - av1_fht32x32_avx2() on DCT_DCT improves 42.62% over aom_fdct32x32_avx2()
        But function replacement must go with the corresponding inverse txfm.
      
      - No obvious user level time reduction due to 32x32 TX_TYPE selection.
      
      - Zero high 128b YMM to avoid AVX-SSE transition penalties
        (fix 16x16 case).
      
      - Added 32x32 AVX2 unit tests to verify bitexact.
      
      - AVX2 optimization summary:
        On CPU i7-6700, based on 16x16/32x32 fwd txfm optimization results:
        C to AVX2: function level time reduction, ~86-89%.
        SSE2 to AVX2: function level time reduction, ~51%.
      
      Change-Id: Idd0cd8bf066a61c7117140ef15ab6c1f8eb4b036
      fed8e1c0
  23. 10 Oct, 2016 2 commits
  24. 07 Oct, 2016 1 commit
  25. 06 Oct, 2016 2 commits
  26. 09 Sep, 2016 1 commit
  27. 08 Sep, 2016 1 commit
  28. 01 Sep, 2016 1 commit
  29. 22 Jul, 2016 1 commit
  30. 08 Jun, 2016 1 commit
    • Linfeng Zhang's avatar
      Upgrade fwht4x4_mmx() to fwht4x4_sse2() (from libvpx) · 9bf89689
      Linfeng Zhang authored
      Cherry-pick af7fb17c Upgrade fwht4x4_mmx() to fwht4x4_sse2() for vp9 and
      vp10.
      
      Function level timing test shows about 27% time saving on
      a Xeon E5-2680 v2 desktop.
      
      Rename dct_sse2.c to dct_intrin_sse2.c to avoid duplicate basenames.
      
      Change-Id: I2c504130099af8f0ccc07da0dacef2464197b0ac
      9bf89689
  31. 11 May, 2016 1 commit
  32. 25 Mar, 2016 1 commit