1. 04 Oct, 2017 1 commit
    • Yi Luo's avatar
      Lowbd TM_PRED intra pred avx2 optimization · 237cf1b2
      Yi Luo authored
      For block width >= 16, avx2 can further speedup the
      TM_PREM intra prediction.
      
      Function speedup on i7-6700:
      Predictor  avx2 v. ssse3
      16x8       ~1.6x
      16x16      ~1.8x
      16x32      ~1.9x
      32x16      ~1.9x
      32x32      ~1.9x
      
      Change-Id: I62c20bd7628f52251b0c051b99a9b738ee44f7e6
      237cf1b2
  2. 02 Oct, 2017 2 commits
  3. 29 Sep, 2017 3 commits
    • Yi Luo's avatar
      Lowbd TM_PRED intrapred ssse3 optimization · a0f66fc0
      Yi Luo authored
      Function speedup (i7-6700)
      Predictor  ssse3 v. C
      4x4        ~2.1x
      4x8        ~2.4x
      8x4        ~4.1x
      8x8        ~5.4x
      8x16       ~6.1x
      16x8       ~5.9x
      16x16      ~6.4x
      16x32      ~6.7x
      32x16      ~7.4x
      32x32      ~8.0x
      
      Change-Id: I52b8ebf8193e76f4ea1137cbad5ad7fa109d86d8
      a0f66fc0
    • Rupert Swarbrick's avatar
      Add 32x128/128x32 block sizes · 2fa6e1ce
      Rupert Swarbrick authored
      Change-Id: Ieb28f40d85e4db4af33648c32c406dd2931ceb89
      2fa6e1ce
    • Yi Luo's avatar
      Lowbd intrapred DC/TOP/LEFT/128/V/H avx2 · 23c61903
      Yi Luo authored
      For prediction block width equal to 32, avx2 can further speedup
      the prediction function (i7-6700):
      
      32x32     avx2 v. sse2
      DC        ~1.4x
      top       ~1.5x
      left      ~1.4x
      128       ~1.5x
      v         ~1.6x
      h         ~1.2x
      
      32x16     avx2 v. sse2
      DC        ~2.2x
      top       ~1.7x
      left      ~1.6x
      128       ~1.8x
      v         ~1.9x
      
      Note: 32x16 H_PRED on avx2 does not run faster enough than sse2 yet.
      
      Change-Id: I145ed504d1b3ea9df283b94927be66a2c6f81225
      23c61903
  4. 28 Sep, 2017 1 commit
    • Yi Luo's avatar
      Lowbd rectangle V/H intra pred sse2 optimization · 0c0fd1e5
      Yi Luo authored
      Function speedup sse2 v. C
      Predictor  V_PRED  H_PRED
      4x8        ~1.7x   ~1.8x
      8x4        ~1.8x   ~2.2x
      8x16       ~1.5x   ~1.4x
      16x8       ~1.9x   ~1.3x
      16x32      ~1.6x   ~1.4x
      32x16      ~2.0x   ~1.9x
      
      This patch disables speed tests to save Jenkins build
      time. Developer can manually enable them by using,
      --gtest_also_run_disabled_test flag in test command line.
      
      Change-Id: I81eaee5e8afc55275c7507c99774f78cc9e49f9a
      0c0fd1e5
  5. 27 Sep, 2017 2 commits
    • James Zern's avatar
      cosmetics,*rtcd*.pl: reindent · 1512fa97
      James Zern authored
      Change-Id: I612517c6218c561ee94888c8c14298964851484a
      1512fa97
    • Yi Luo's avatar
      Lowbd rect intrapred DC/LEFT/TOP/128 sse2 optimization · 39bdf36a
      Yi Luo authored
      Add lowbd unit test functionality to intrapred_test.cc
      Function speedup against C (i7-6700):
      Predictor   DC     LEFT   TOP    128
      4x8        ~1.4x  ~1.4x  ~1.7x  ~1.9x
      8x4        ~1.2x  ~1.6x  ~1.6x  ~2.6x
      8x16       ~1.4x  ~1.3x  ~1.4x  ~2.1x
      16x8       ~2.0x  ~1.8x  ~2.3x  ~2.1x
      16x32      ~2.0x  ~1.9x  ~1.8x  ~2.2x
      32x16      ~2.0x  ~2.0x  ~1.9x  ~2.2x
      
      Change-Id: I33db512020ca3c6853a9205a8079f3d00134f584
      39bdf36a
  6. 22 Sep, 2017 1 commit
    • Yi Luo's avatar
      Highbd rectangle intrapred V/DC sse2 optimization · bdddf33a
      Yi Luo authored
      Function speedup (i7-6700),  sse2 verse C:
      Predictor      V_PRED    DC_PRED
      4x8            ~1.5x     ~4.9x
      8x4            ~2.5x     ~4.8x
      8x16           ~1.9x     ~9.1x
      16x8           ~1.9x     ~4.4x
      16x32          ~2.1x     ~5.8x
      32x16          ~2.0x     ~3.6x
      
      Change-Id: I6deffd0637e57ee5d0bd533502f5705148c4cdd4
      bdddf33a
  7. 19 Sep, 2017 1 commit
    • Yi Luo's avatar
      Highbd intrapred DC_LEFT/TOP/128 sse2 optimization · bbf6186e
      Yi Luo authored
      Also extend intra pred speed test to rectangular block.
      Speedup (i7-6700)
      predictor      sse2 v. C
      left 4x4       ~5.6x
      top  4x4       ~7.2x
      128  4x4       ~6.9x
      left 4x8       ~7.7x
      top  4x8       ~10.1x
      128  4x8       ~10.0x
      
      left 8x4       ~8.1x
      top  8x4       ~9.1x
      128  8x4       ~10.1x
      left 8x8       ~10.3x
      top  8x8       ~13.6x
      128  8x8       ~14.8x
      left 8x16      ~12.6x
      top  8x16      ~14.0x
      128  8x16      ~15.5x
      
      left 16x8      ~6.3x
      top  16x8      ~7.0x
      128  16x8      ~6.5x
      left 16x16     ~6.5x
      top  16x16     ~7.1x
      128  16x16     ~8.2x
      left 16x32     ~5.1x
      top  16x32     ~6.4x
      128  16x32     ~5.6x
      
      left 32x16     ~4.2x
      top  32x16     ~4.3x
      128  32x16     ~4.5x
      left 32x32     ~3.8x
      top  32x32     ~3.7x
      128  32x32     ~3.9x
      
      Change-Id: Ie7fcc85b9ded3030ee904623c40e9edeec1695ae
      bbf6186e
  8. 18 Sep, 2017 1 commit
    • Yi Luo's avatar
      Highbd intra pred H_PRED sse2 optimization · 23b9b317
      Yi Luo authored
      sse2 v. C speedup:
      4x4   ~8.0x
      8x8   ~8.2x
      16x16 ~6.5x
      32x32 ~3.8x
      Blocksize:
      4x4, 4x8, 8x4, 8x8, 8x16, 16x8, 16x16, 16x32, 32x16, 32x32
      Square blocksize code is from libvpx:
      "30d9a1916 vpxdsp: [x86] add highbd_h_predictor functions",
      Credit goes to Scott LaVarnway. Speed tests do not support
      rectangular blocksize yet.
      
      Change-Id: I9a1f24aecab8de94f8ea59ec8748fe3537d721ae
      23b9b317
  9. 07 Sep, 2017 1 commit
    • Yi Luo's avatar
      Lowbd parallel_deblocking sse2 optimization · ea8a0d52
      Yi Luo authored
      Baseline + parallel_deblocking:
      
      - Passed unit tests *SSE2/Loop8Test6*, *AVX2/Loop8Test6*.
      - 1080p, 25 frames, profile=0, encoding/decoding, output match.
      - Decoder frame rate increases from 54.15 to 65.84.
      
      Change-Id: I55938c94961066594f4b9080192c7268c19d9bf9
      ea8a0d52
  10. 15 Aug, 2017 2 commits
    • Ralph Giles's avatar
      aom_dsp: regularize EXT_PARTITION_TYPES handling. · ccfdfce1
      Ralph Giles authored
      aom_dsp_rtcd_defs.pl compares most CONFIG_* keys to "yes"
      to see if they're set. The script was checking just
      
        if (aom_config("CONFIG_EXT_PARTITION_TYPES"))
      
      in some cases. The build system doesn't add disabled
      configuration options to libs.mk so this is effectively
      the same, however it means that setting the config
      key explicitly to 0 or "no" in the config headers
      was treated the same as setting it to 1 or "yes",
      and aom_dsp_rtcd.h would have opposite expections
      from aom_config.h or aom_config.asm.
      
      Treat this key similarly to others for consistency.
      
      Change-Id: I27bd7a5532ba4afc2bb289b43b57a1b1971c0348
      ccfdfce1
    • Urvang Joshi's avatar
      Remove ALT_INTRA flag. · 93b543ab
      Urvang Joshi authored
      This experiment has been adopted as it has been cleared by Tapas.
      
      Change-Id: I0682face60f62dd43091efa0a92d09d846396850
      93b543ab
  11. 10 Aug, 2017 1 commit
    • Yi Luo's avatar
      Highbd loop filter AVX2 · 6ae0054c
      Yi Luo authored
      - Speed test (ms) on i7-6700, Linux x86_64
        FUNCTION             SSE2    AVX2
        horizontal_edge_16   55      28
        vertical_16_dual     84      47
        horizontal_4_dual    27      13
        horizontal_8_dual    36      15
        vertical_4_dual      38      25
        vertical_8_dual      44      27
      - Decoder frame rate improves around 1.2% - 2.8%.
      
      Change-Id: I9c4123869bac9b6d32e626173c2a8e7eb0cf49e7
      6ae0054c
  12. 08 Aug, 2017 1 commit
    • Thomas Davies's avatar
      Refactor quantization C code. · f3b5ee14
      Thomas Davies authored
      This commit de-duplicates C reference quantization code
      and unifies quantization matrix (QM) and non-QM code
      paths when there is no SIMD.
      
      The reorganisation also will facilitate re-using SIMD quant
      functions for QM when the matrix is flat, as is the
      default when AOM_QM is enabled.
      
      Change-Id: Idbfdac9eb9a31adcffe734aac1877d58b86fab77
      f3b5ee14
  13. 04 Aug, 2017 1 commit
  14. 21 Jul, 2017 1 commit
    • Angie Chiang's avatar
      Integrate convolve_round with compound_segment · 7b517095
      Angie Chiang authored
      This integration only covers low bitdepth mode for now
      
      The performance of Convolve_round on top of compound_segment
      revives from 0.475% to 0.612% on lowres
      
      Change-Id: I21606c79d0a22c0834966730358267c082d8071e
      7b517095
  15. 12 Jul, 2017 1 commit
    • Rupert Swarbrick's avatar
      ext-partition-types: Add 4:1 partitions · 93c39e91
      Rupert Swarbrick authored
      This patch adds support for 4:1 rectangular blocks to various common
      data arrays, and adds new partition types to the EXT_PARTITION_TYPES
      experiment which will use them.
      
      This patch has the following restrictions, which can be lifted in
      future patches:
      
        * ext-partition-types is incompatible with fp_mb_stats and supertx
          for the moment
      
        * Currently only 32x32 superblocks can use the new partition types
      
      There's a slightly odd restriction about when we allow
      PARTITION_HORZ_4 or PARTITION_VERT_4. Since these both live in the
      EXT_PARTITION_TYPES CDF, read_partition() can only return them if both
      has_rows and has_cols is true. This means that at least half of the
      width and height of the block must be visible. It might be nice to
      relax that restriction but that would imply a change to how we encode
      partition types, which seems already to be in a state of flux, so
      maybe it's better to wait until that has settled down.
      
      Change-Id: Id7fc3fd0f762f35f63b3d3e3bf4e07c245c7b4fa
      93c39e91
  16. 08 Jul, 2017 1 commit
    • Fergus Simpson's avatar
      Fix frame scaling prediction · 505f0068
      Fergus Simpson authored
      Use higher precision offsets for more accurate predictor
      generation when references are at a different scale from
      the coded frame.
      
      Change-Id: I4c2c0ec67fa4824273cb3bd072211f41ac7802e8
      505f0068
  17. 29 Jun, 2017 1 commit
  18. 28 Jun, 2017 1 commit
  19. 27 Jun, 2017 1 commit
    • Yi Luo's avatar
      Fix inv txfm low/high bitdepth selection logic · 51281095
      Yi Luo authored
      We are going to have several commits to setup new low/high
      bitdepth data path selection logic. This patch is for inverse
      transform. Let me summarize the ideas as following.
      
      - For low/high bitdepth selection, encoder depends on
        input configuration, e.g., video sequence bitdepth,
        profile. Decoder depends on input bitstream. This has
        nothing to do with compiler/build  configuration.
      
      - Typical encoder usage for sampling format 4:2:0.
        1) 8-bit video sequence:
         a) --profile=0
         Fastest encoding/decoding pipeline on speedup.
      
         b) --profile=2 --bit-depth=10
         Image pixels are left shifted by 2 bits. It
         employs 16-bit reference frame buffer and has high
         calculation precision. It usually enjoys higher
         compression performance.
      
        2) 10/12-bit video sequence (HDR):
         --profile=2 --bit-depth=10/12
      
      - Transform coefficient type:
        Lowbitdepth:  int16_t
        Highbitdepth: int32_t
      
      - The type, tran_low_t is still used in codebase,
        Which is int32_t, defining the data path capacity.
        Naturally, it is high bitdepth.
      
      Eventually we shall remove the configuration flags,
      CONFIG_HIGHBITDEPTH/CONFIG_LOWBITDEPTH, and seperate
      low and high bitdepth data path. Two data paths co-exist
      in the same build environment.
      
      Change-Id: I35c06d4d4f19ebf80d909168fdddbae57c3cc884
      51281095
  20. 22 Jun, 2017 1 commit
    • Yi Luo's avatar
      Add avx2 highbd_quantize_b · 193422e7
      Yi Luo authored
      - First pass encoding time reduces ~10.9% on i7-6700
        at 100 frames, 1080p.
      - avx2 works for coeff number >= 8 cases; coeff number < 8
        case will be implemented by sse2.
      - Unit test is added type B/FP/DC.
      
      Change-Id: Ibe5b7807c64e6dfc2d59c470ed50a6e8ca94ef7c
      193422e7
  21. 19 Jun, 2017 1 commit
    • Timothy B. Terriberry's avatar
      encoder: Remove 64x upsampled reference buffers · 5d24b6f0
      Timothy B. Terriberry authored
      They do not handle border extension correctly (interpolation and
      border extension do not commute unless you upsample into the
      border), nor do they handle crop dimensions that are not a multiple
      of 8 (the upsampled version is not sufficiently large), in addition
      to using massive amounts of memory and being a criminal waste of
      cache (1 byte used for every 8 bytes fetched).
      
      This commit reimplements use_upsampled_references by computing the
      subpixel samples on the fly. This implementation not only corrects
      the border handling, but is also faster, while maintaining the
      same quality.
      
      HL AWCY results are basically noise:
          PSNR | PSNR HVS |   SSIM | MS SSIM | CIEDE 2000
        0.0188 |   0.0187 | 0.0045 |  0.0063 |     0.0228
      
      Change-Id: I7527db9f83b87a7bb8b35342f7e6457cd0bef9cd
      5d24b6f0
  22. 08 Jun, 2017 1 commit
  23. 06 Jun, 2017 1 commit
    • Urvang Joshi's avatar
      Add a new experiment "rect-intra-pred". · 766a389b
      Urvang Joshi authored
      Earlier, intra prediction for rectangular blocks was performed by
      running two steps of prediction on square sub-blocks.
      
      With this experiment, we do proper intra prediction for rectangular
      blocks. This ensures that we make use of all available neighboring
      pixels especially for directional modes. For this, all the intra
      predictors were updated to work with rectangular transform block sizes.
      
      Performance improvements are small but free of cost:
      
      All Intra frames:
      lowres: -0.126
      midres: -0.154
      
      Video Overall:
      lowres: -0.043
      midres: -0.100
      
      [Could not get AWCY results due to a backlog.]
      
      BUG=aomedia:551
      
      Change-Id: I7936e91b171d5c246cb0a4ea470a981a013892e6
      766a389b
  24. 30 May, 2017 1 commit
  25. 26 May, 2017 1 commit
    • David Barker's avatar
      ext-inter: Vectorize new masked SAD/SSE functions · 0aa39ff0
      David Barker authored
      We would expect that these new functions would be slower than
      the old masked SAD/SSE functions, as they do additional work
      (blending two inputs and comparing to a third, rather than
      just comparing two inputs).
      
      This is true for the SAD functions, which are about 50% slower
      (depending on block size and bit depth). However, the sub-pixel
      SSE functions are comparable to the old speed for the accelerated
      special cases (xoffset or yoffset = 0 or 4), and are
      between 40-90% faster for the generic case.
      
      Change-Id: I1a296ed8fc9e3edc313a6add516ff76b17cd3e9f
      0aa39ff0
  26. 24 May, 2017 1 commit
    • David Barker's avatar
      ext-inter: Further cleanup · f19f35f7
      David Barker authored
      * Rename the 'masked_compound_*' functions to just 'masked_*'.
        The previous names were intended to be temporary, to distinguish
        the old and new masked motion search pipelines. But now that the
        old pipeline has been removed, we can reuse the old names.
      
      * Simplify the new ext-inter compound motion search pipeline
        a bit.
      
      * Harmonize names: Rename
        aom_highbd_masked_compound_sub_pixel_variance* to
        aom_highbd_8_masked_sub_pixel_variance*, to match the naming of
        the corresponding non-masked functions
      
      Change-Id: I988768ffe2f42a942405b7d8e93a2757a012dca3
      f19f35f7
  27. 23 May, 2017 2 commits
    • David Barker's avatar
      Vectorize high-precision convolve filter · 5d34e6a7
      David Barker authored
      Add SSE2 lowbd and SSSE3 highbd versions of the filters
      introduced in https://aomedia-review.googlesource.com/c/11962/ .
      
      These filters are equivalent in speed to the SSE2 implementations
      of the regular convolve filter. The average time to filter a
      64x64 block is:
      
      lowbd C: 52us
      lowbd SSE2: 5.6us
      highbd C: 53us
      highbd SSSE3: 5.8us
      
      Also add a correctness test based on the warp filter tests.
      
      Change-Id: Ia0d81100e8a414bbfc2b5f664d751cf24765299e
      5d34e6a7
    • David Barker's avatar
      ext-inter: Delete dead code · 0f3c94e1
      David Barker authored
      Patches https://aomedia-review.googlesource.com/c/11987/
      and https://aomedia-review.googlesource.com/c/11988/
      replaced the old masked motion search pipeline with
      a new one which uses different SAD/SSE functions.
      This resulted in a lot of dead code.
      
      This patch removes the now-dead code. Note that this
      includes vectorized SAD/SSE functions, which will need
      to be rewritten at some point for the new pipeline. It
      also includes the masked_compound_variance_* functions
      since these turned out not to be used by the new pipeline.
      
      To help with the later addition of vectorized functions, the
      masked_sad/variance_test.cc files are kept but are modified
      to work with the new functions. The tests are then disabled
      until we actually have the vectorized functions.
      
      Change-Id: I61b686abd14bba5280bed94e1be62eb74ea23d89
      0f3c94e1
  28. 22 May, 2017 1 commit
  29. 18 May, 2017 1 commit
    • David Barker's avatar
      ext-inter: Use joint_motion_search for masked compounds · c155e018
      David Barker authored
      Add functions which take both components of a masked compound and
      compute the resulting SAD/SSE. Extend joint_motion_search to understand
      masked compounds, and use it to evaluate NEW_NEWMV modes.
      
      Change-Id: I782199a20d119a6c61c6567df157508125ac7ce7
      c155e018
  30. 15 May, 2017 2 commits
    • Debargha Mukherjee's avatar
      Experimental high precision convolve for Wiener · 28d15c71
      Debargha Mukherjee authored
      Improves coding efficiency.
      
      Change-Id: I7bb12190cdc4581097809a020355cdc8867fc1ad
      28d15c71
    • Ralph Giles's avatar
      Remove armv6 media-extension assembly. · be111b38
      Ralph Giles authored
      Libvpx dropped armv6 support sometime after the aom fork.
      
      We don't intend to support this platform, which is likely
      too slow in any case. Remove the assembly and intrinsics
      optimized routines, their tests, cpu feature detection,
      and rtcd specialization for this instruction set extension.
      
      Change-Id: If44ec28e5ddafc6af179c5d1982ac7e81fe54d5e
      be111b38
  31. 11 May, 2017 1 commit
    • Yi Luo's avatar
      Partial IDCT 32x32 avx2 · 40f22ef8
      Yi Luo authored
      - Function level improvement (ms):
      Functions       ssse3  avx2   Percentage
      idct32x32_1024  794    374    52.9%
      idct32x32_135   354    169    52.2%
      idct32x32_34    197    142    27.9%
      idct32x32_1     n/a     26    n/a
      
      - Integrating in default scan order.
      
      Change-Id: I84815112b26b8a8cb800281a1cfb1706342af57d
      40f22ef8
  32. 08 May, 2017 2 commits
    • Yi Luo's avatar
      Partial IDCT 16x16 avx2 · f6176abb
      Yi Luo authored
      - Function level improvement:
      functions      sse2  avx2  percentage
      idct16x16_256  365   226   38%
      idct16x16_38   n/a   136   n/a
      idct16x16_10   171   110   35%
      idct16x16_1     34    26   23%
      
      - Integrated in AV1 for default scan order.
      
      Change-Id: Ieb1a8e730bea9c371ebc0e5f4a748640d8f5e921
      f6176abb
    • Urvang Joshi's avatar
      Add a new experiment SMOOTH_HV. · e6ca8e83
      Urvang Joshi authored
      This experiment extends ALT_INTRA by adding two new modes:
      smooth horizontal and smooth vertical.
      
      Improvement on *intra frames* in BDRate (PSNR):
      ===============================================
      
      AWCY (high latency): -0.46%
      (Also, -1.0% or more on PSNR Cb,Cr and APSNR Cb,Cr).
      
      AWCY (low latency): -0.43%
      (Also, -0.88% to -0.94% on PSNR Cb,Cr and APSNR Cb,Cr).
      
      Google sets:
      lowres: -0.454
      midres: -0.484
      hdres:  -0.525
      
      Improvement on *video overall* in BDRate (PSNR):
      ================================================
      
      AWCY (high latency): -0.15%
      
      Google sets:
      lowres: -0.085
      midres: -0.079
      
      Change-Id: I9f4e7c1b8ded1fe244c72838f336103ccc715d50
      e6ca8e83