1. 22 Nov, 2017 2 commits
    • JNT_COMP: add ssse3 implementations for sad_avg · d0179a6b
      Cheng Chen authored
      Add ssse3 implementations for the sad_avg c function at low bit-depth.
      With this, aom_jnt_sad c functions can all have simd implementations.
      This CL follows existing MACRO definitions for multiple combinations
      of block sizes.
      
      Change-Id: I882343684026525f5589a239337cfac2dd411e11
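For reference, a minimal scalar sketch of what a sad_avg kernel computes: the sum of absolute differences between the source block and the rounded average of two predictors. The function name and the contiguous layout of second_pred are assumptions of this sketch, and the JNT_COMP variants presumably replace the equal-weight average with a weighted one, which is not shown here.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference sketch (hypothetical name): SAD between a source block
 * and the rounded average of two predictors. second_pred is assumed to be
 * stored contiguously, i.e. with a stride equal to the block width. */
static unsigned int sad_avg_ref(const uint8_t *src, int src_stride,
                                const uint8_t *ref, int ref_stride,
                                const uint8_t *second_pred,
                                int width, int height) {
  unsigned int sad = 0;
  for (int r = 0; r < height; ++r) {
    for (int c = 0; c < width; ++c) {
      /* Equal-weight average of the two predictors, rounded to nearest. */
      const int avg = (ref[c] + second_pred[c] + 1) >> 1;
      sad += (unsigned int)abs(src[c] - avg);
    }
    src += src_stride;
    ref += ref_stride;
    second_pred += width;
  }
  return sad;
}
```

A SIMD version would process a whole row of such averages and differences per instruction; the macro-generated variants differ only in the block width and height.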
    • JNT_COMP: SIMD implementation for aom_jnt_sub_pixel_avg · d286443c
      Cheng Chen authored
      Change function names and add SIMD implementation for two c functions:
      (1) var_filter_block2d_bil_first_pass
      (2) var_filter_block2d_bil_second_pass
      
      This CL lets aom_jnt_sub_pixel_avg_variance run in SIMD.
      
      Change-Id: Ib41ef13d62ae91a0ca481bcebb24568dcd4722c4
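As a rough illustration of the two passes named above, here is a scalar sketch of a separable bilinear filter: the first pass filters horizontally into a 16-bit intermediate buffer, and the second pass filters that buffer vertically back to 8 bits. The function names, the 7-bit tap precision, and the parameter order are assumptions of this sketch, not the signatures in the tree.

```c
#include <stdint.h>

#define FILTER_BITS 7 /* assumed: the two taps sum to 1 << FILTER_BITS */

/* First pass: horizontal 2-tap filter, 8-bit input to 16-bit intermediate.
 * Callers produce one extra row so the second pass has a row below. */
static void bil_first_pass(const uint8_t *src, uint16_t *dst, int src_stride,
                           int pixel_step, int out_h, int out_w,
                           const uint8_t filter[2]) {
  for (int r = 0; r < out_h; ++r) {
    for (int c = 0; c < out_w; ++c) {
      const int sum = src[c] * filter[0] + src[c + pixel_step] * filter[1];
      dst[c] = (uint16_t)((sum + (1 << (FILTER_BITS - 1))) >> FILTER_BITS);
    }
    src += src_stride;
    dst += out_w;
  }
}

/* Second pass: vertical 2-tap filter, 16-bit intermediate to 8-bit output.
 * pixel_step is the distance to the sample one row down. */
static void bil_second_pass(const uint16_t *src, uint8_t *dst, int src_stride,
                            int pixel_step, int out_h, int out_w,
                            const uint8_t filter[2]) {
  for (int r = 0; r < out_h; ++r) {
    for (int c = 0; c < out_w; ++c) {
      const int sum = src[c] * filter[0] + src[c + pixel_step] * filter[1];
      dst[c] = (uint8_t)((sum + (1 << (FILTER_BITS - 1))) >> FILTER_BITS);
    }
    src += src_stride;
    dst += out_w;
  }
}
```

For a W x H block the first pass is called with out_h = H + 1 and the second with out_h = H; the filtered block then feeds the variance computation.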
  2. 10 Nov, 2017 1 commit
    • Remove smooth_hv experiment flag. · b7301cd6
      Urvang Joshi authored
      This experiment has been cleared by Tapas.
      
      Also, fix a couple of hash signatures in the test while we are at it.
      
      Change-Id: I1658bcb07913cf8bd47cfffadd729e16d5c55fc3
  3. 06 Nov, 2017 2 commits
    • JNT_COMP: add SIMD implementations for c functions · ef34fff7
      Cheng Chen authored
      Add SIMD implementations of the C functions for low bit-depth, making
      the encoder about 3-4x faster than with the C functions.
      
      Change-Id: Icca0b07b25489759be9504aaec09d1239076fc52
    • JNT_COMP: Refactor code · f78632e0
      Cheng Chen authored
      The refactoring serves two purposes:
      1. Separate the code paths for jnt_comp and the original compound average
      computation. It provides a function interface for jnt_comp while leaving
      the original compound average computation unchanged. In the near future,
      SIMD functions can be added for jnt_comp using the interface.
      
      2. The previous implementation used a hack on second_pred, which may
      cause a segmentation fault when the test clip is small, as reported in
      Issue 944. This refactoring removes the hack and makes it possible to
      address the seg fault problem in the future.
      
      Change-Id: Idd2cb99f6c77dae03d32ccfa1f9cbed1d7eed067
  4. 31 Oct, 2017 1 commit
  5. 21 Oct, 2017 1 commit
  6. 20 Oct, 2017 1 commit
    • Lowbd D207E/D63E/D45E intrapred x86 optimization · ae676953
      Yi Luo authored
      D207E
      Predictor  SSE2 vs C
      4x4        ~2.6X
      4x8        ~2.5X
      8x4        ~8.0X
      8x8        ~9.1X
      8x16       ~11.7X
      16x8       ~16.9X
      16x16      ~17.3X
      16x32      ~17.2X
      32x16      ~30.2X
      32x32      ~35.5X
      
      D63E
      Predictor  SSE2 vs C
      4x4        ~4.7X
      4x8        ~4.9X
      8x4        ~7.8X
      8x8        ~8.9X
      8x16       ~9.3X
      16x8       ~15.7X
      16x16      ~14.7X
      16x32      ~17.3X
      32x16      ~18.0X
      32x32      ~15.7X
      
      D45E
      Predictor  SSSE3 vs C
      4x4        ~1.8X
      4x8        ~2.9X
      8x4        ~6.7X
      8x8        ~6.5X
      8x16       ~7.4X
      16x8       ~24.4X
      16x16      ~21.5X
      16x32      ~24.2X
      32x16      ~25.4X
      32x32      ~25.2X
      
      Change-Id: I8215de190e2b6314272749761600e389d1ca0fdf
  7. 16 Oct, 2017 1 commit
    • Highbd D207E/D63E intrapred sse2/avx2 optimization · 0b7127b3
      Yi Luo authored
      D207E
      Predictor SSE2 vs C   AVX2 vs C
      4x4       ~2.7x
      4x8       ~3.0x
      8x4       ~7.2x
      8x8       ~8.5x
      8x16      ~9.4x
      16x8      ~12.8x
      16x16     ~13.0x
      16x32     ~14.3x
      32x16                 ~19.9x
      32x32                 ~23.6x
      
      D63E
      Predictor SSE2 vs C   AVX2 vs C
      4x4       ~3.8x
      4x8       ~4.3x
      8x4       ~6.4x
      8x8       ~6.8x
      8x16      ~8.6x
      16x8                  ~9.0x
      16x16                 ~9.6x
      16x32                 ~10.3x
      32x16                 ~9.1x
      32x32                 ~11.0x
      
      Change-Id: I87373804c9d53276bf4d7788c4ae0d13d01c00dc
  8. 10 Oct, 2017 2 commits
    • Highbd D45E intrapred SSE2/AVX2 speedup · 56ad3dd3
      Yi Luo authored
      Function  SSE2 vs C  AVX2 vs C
      4x4       ~4.5x
      4x8       ~4.5x
      8x4       ~11.7x
      8x8       ~12.7x
      8x16      ~14.0x
      16x8                 ~21.7x
      16x16                ~24.0x
      16x32                ~28.7x
      32x16                ~20.5x
      32x32                ~24.4x
      
      Change-Id: Iaca49727d8df17b7f793b774a8d51a401ef8a8d1
    • Migrate some vp9 highbd intrapred x86 speedup to av1 · 71b6e043
      Yi Luo authored
      Function speedup on i7-6700:
      D117   sse2   ssse3
      4x4    ~1.8x
      8x8           ~3.4x
      16x16         ~5.5x
      32x32         ~2.9x
      
      D135   sse2   ssse3
      4x4    ~1.9x
      8x8           ~3.3x
      16x16         ~5.3x
      32x32         ~3.6x
      
      D153   sse2   ssse3
      4x4    ~1.9x
      8x8           ~2.8x
      16x16         ~5.5x
      32x32         ~3.6x
      
      Change-Id: I43ab5fa8dcbcfa51acbde554abf3e5d7d336f391
  9. 06 Oct, 2017 1 commit
    • Lowbd SMOOTH_PRED intrapred ssse3 optimization · 46ae1ea3
      Yi Luo authored
      On i7-6700:
      Predictor    ssse3 v. C
      4x4          ~1.3x
      4x8          ~1.9x
      8x4          ~2.3x
      8x8          ~3.4x
      8x16         ~4.1x
      16x8         ~4.6x
      16x16        ~5.2x
      16x32        ~5.6x
      32x16        ~4.2x
      32x32        ~4.7x
      
      Change-Id: Ic12383cf9d4446361d6355eb8a480a3c7602060e
  10. 04 Oct, 2017 1 commit
    • Lowbd TM_PRED intra pred avx2 optimization · 237cf1b2
      Yi Luo authored
      For block width >= 16, avx2 can further speed up the
      TM_PRED intra prediction.
      
      Function speedup on i7-6700:
      Predictor  avx2 v. ssse3
      16x8       ~1.6x
      16x16      ~1.8x
      16x32      ~1.9x
      32x16      ~1.9x
      32x32      ~1.9x
      
      Change-Id: I62c20bd7628f52251b0c051b99a9b738ee44f7e6
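For context, a scalar sketch of the TrueMotion (TM_PRED) predictor as it is commonly defined: each output pixel is left[r] + above[c] - top_left, clipped to the 8-bit range. The function name and signature are hypothetical. Because every pixel in a row uses the same left[r] and top_left, a row of 16 or 32 pixels maps naturally onto wide vector registers, which is a plausible reason the gain appears at block width >= 16.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit pixel range. */
static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Scalar TM_PRED sketch (hypothetical name). above[-1] is assumed to hold
 * the top-left neighbor, above[0..bw-1] the row above the block, and
 * left[0..bh-1] the column to its left. */
static void tm_pred_ref(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
                        const uint8_t *above, const uint8_t *left) {
  const int top_left = above[-1];
  for (int r = 0; r < bh; ++r) {
    for (int c = 0; c < bw; ++c) {
      dst[c] = clip_pixel(left[r] + above[c] - top_left);
    }
    dst += stride;
  }
}
```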
  11. 02 Oct, 2017 2 commits
  12. 29 Sep, 2017 3 commits
    • Lowbd TM_PRED intrapred ssse3 optimization · a0f66fc0
      Yi Luo authored
      Function speedup (i7-6700)
      Predictor  ssse3 v. C
      4x4        ~2.1x
      4x8        ~2.4x
      8x4        ~4.1x
      8x8        ~5.4x
      8x16       ~6.1x
      16x8       ~5.9x
      16x16      ~6.4x
      16x32      ~6.7x
      32x16      ~7.4x
      32x32      ~8.0x
      
      Change-Id: I52b8ebf8193e76f4ea1137cbad5ad7fa109d86d8
    • Add 32x128/128x32 block sizes · 2fa6e1ce
      Rupert Swarbrick authored
      Change-Id: Ieb28f40d85e4db4af33648c32c406dd2931ceb89
    • Lowbd intrapred DC/TOP/LEFT/128/V/H avx2 · 23c61903
      Yi Luo authored
      For prediction block width equal to 32, avx2 can further speed up
      the prediction function (i7-6700):
      
      32x32     avx2 v. sse2
      DC        ~1.4x
      top       ~1.5x
      left      ~1.4x
      128       ~1.5x
      v         ~1.6x
      h         ~1.2x
      
      32x16     avx2 v. sse2
      DC        ~2.2x
      top       ~1.7x
      left      ~1.6x
      128       ~1.8x
      v         ~1.9x
      
      Note: 32x16 H_PRED on avx2 is not yet sufficiently faster than sse2.
      
      Change-Id: I145ed504d1b3ea9df283b94927be66a2c6f81225
  13. 28 Sep, 2017 1 commit
    • Lowbd rectangle V/H intra pred sse2 optimization · 0c0fd1e5
      Yi Luo authored
      Function speedup sse2 v. C
      Predictor  V_PRED  H_PRED
      4x8        ~1.7x   ~1.8x
      8x4        ~1.8x   ~2.2x
      8x16       ~1.5x   ~1.4x
      16x8       ~1.9x   ~1.3x
      16x32      ~1.6x   ~1.4x
      32x16      ~2.0x   ~1.9x
      
      This patch disables speed tests to save Jenkins build
      time. Developers can enable them manually by passing the
      --gtest_also_run_disabled_tests flag on the test command line.
      
      Change-Id: I81eaee5e8afc55275c7507c99774f78cc9e49f9a
  14. 27 Sep, 2017 2 commits
    • cosmetics,*rtcd*.pl: reindent · 1512fa97
      James Zern authored
      Change-Id: I612517c6218c561ee94888c8c14298964851484a
    • Lowbd rect intrapred DC/LEFT/TOP/128 sse2 optimization · 39bdf36a
      Yi Luo authored
      Add lowbd unit test functionality to intrapred_test.cc
      Function speedup against C (i7-6700):
      Predictor   DC     LEFT   TOP    128
      4x8        ~1.4x  ~1.4x  ~1.7x  ~1.9x
      8x4        ~1.2x  ~1.6x  ~1.6x  ~2.6x
      8x16       ~1.4x  ~1.3x  ~1.4x  ~2.1x
      16x8       ~2.0x  ~1.8x  ~2.3x  ~2.1x
      16x32      ~2.0x  ~1.9x  ~1.8x  ~2.2x
      32x16      ~2.0x  ~2.0x  ~1.9x  ~2.2x
      
      Change-Id: I33db512020ca3c6853a9205a8079f3d00134f584
  15. 22 Sep, 2017 1 commit
    • Highbd rectangle intrapred V/DC sse2 optimization · bdddf33a
      Yi Luo authored
      Function speedup (i7-6700), sse2 versus C:
      Predictor      V_PRED    DC_PRED
      4x8            ~1.5x     ~4.9x
      8x4            ~2.5x     ~4.8x
      8x16           ~1.9x     ~9.1x
      16x8           ~1.9x     ~4.4x
      16x32          ~2.1x     ~5.8x
      32x16          ~2.0x     ~3.6x
      
      Change-Id: I6deffd0637e57ee5d0bd533502f5705148c4cdd4
  16. 19 Sep, 2017 1 commit
    • Highbd intrapred DC_LEFT/TOP/128 sse2 optimization · bbf6186e
      Yi Luo authored
      Also extend intra pred speed test to rectangular block.
      Speedup (i7-6700)
      predictor      sse2 v. C
      left 4x4       ~5.6x
      top  4x4       ~7.2x
      128  4x4       ~6.9x
      left 4x8       ~7.7x
      top  4x8       ~10.1x
      128  4x8       ~10.0x
      
      left 8x4       ~8.1x
      top  8x4       ~9.1x
      128  8x4       ~10.1x
      left 8x8       ~10.3x
      top  8x8       ~13.6x
      128  8x8       ~14.8x
      left 8x16      ~12.6x
      top  8x16      ~14.0x
      128  8x16      ~15.5x
      
      left 16x8      ~6.3x
      top  16x8      ~7.0x
      128  16x8      ~6.5x
      left 16x16     ~6.5x
      top  16x16     ~7.1x
      128  16x16     ~8.2x
      left 16x32     ~5.1x
      top  16x32     ~6.4x
      128  16x32     ~5.6x
      
      left 32x16     ~4.2x
      top  32x16     ~4.3x
      128  32x16     ~4.5x
      left 32x32     ~3.8x
      top  32x32     ~3.7x
      128  32x32     ~3.9x
      
      Change-Id: Ie7fcc85b9ded3030ee904623c40e9edeec1695ae
  17. 18 Sep, 2017 1 commit
    • Highbd intra pred H_PRED sse2 optimization · 23b9b317
      Yi Luo authored
      sse2 v. C speedup:
      4x4   ~8.0x
      8x8   ~8.2x
      16x16 ~6.5x
      32x32 ~3.8x
      Blocksize:
      4x4, 4x8, 8x4, 8x8, 8x16, 16x8, 16x16, 16x32, 32x16, 32x32
      Square blocksize code is from libvpx:
      "30d9a1916 vpxdsp: [x86] add highbd_h_predictor functions",
      Credit goes to Scott LaVarnway. Speed tests do not support
      rectangular blocksize yet.
      
      Change-Id: I9a1f24aecab8de94f8ea59ec8748fe3537d721ae
  18. 07 Sep, 2017 1 commit
    • Lowbd parallel_deblocking sse2 optimization · ea8a0d52
      Yi Luo authored
      Baseline + parallel_deblocking:
      
      - Passed unit tests *SSE2/Loop8Test6*, *AVX2/Loop8Test6*.
      - 1080p, 25 frames, profile=0, encoding/decoding, output match.
      - Decoder frame rate increases from 54.15 to 65.84.
      
      Change-Id: I55938c94961066594f4b9080192c7268c19d9bf9
  19. 15 Aug, 2017 2 commits
    • aom_dsp: regularize EXT_PARTITION_TYPES handling. · ccfdfce1
      Ralph Giles authored
      aom_dsp_rtcd_defs.pl compares most CONFIG_* keys to "yes"
      to see if they're set. The script was checking just
      
        if (aom_config("CONFIG_EXT_PARTITION_TYPES"))
      
      in some cases. The build system doesn't add disabled
      configuration options to libs.mk so this is effectively
      the same, however it means that setting the config
      key explicitly to 0 or "no" in the config headers
      was treated the same as setting it to 1 or "yes",
      and aom_dsp_rtcd.h would have opposite expectations
      from aom_config.h or aom_config.asm.
      
      Treat this key similarly to others for consistency.
      
      Change-Id: I27bd7a5532ba4afc2bb289b43b57a1b1971c0348
    • Remove ALT_INTRA flag. · 93b543ab
      Urvang Joshi authored
      This experiment has been adopted as it has been cleared by Tapas.
      
      Change-Id: I0682face60f62dd43091efa0a92d09d846396850
  20. 10 Aug, 2017 1 commit
    • Highbd loop filter AVX2 · 6ae0054c
      Yi Luo authored
      - Speed test (ms) on i7-6700, Linux x86_64
        FUNCTION             SSE2    AVX2
        horizontal_edge_16   55      28
        vertical_16_dual     84      47
        horizontal_4_dual    27      13
        horizontal_8_dual    36      15
        vertical_4_dual      38      25
        vertical_8_dual      44      27
      - Decoder frame rate improves around 1.2% - 2.8%.
      
      Change-Id: I9c4123869bac9b6d32e626173c2a8e7eb0cf49e7
  21. 08 Aug, 2017 1 commit
    • Refactor quantization C code. · f3b5ee14
      Thomas Davies authored
      This commit de-duplicates C reference quantization code
      and unifies quantization matrix (QM) and non-QM code
      paths when there is no SIMD.
      
      The reorganisation also will facilitate re-using SIMD quant
      functions for QM when the matrix is flat, as is the
      default when AOM_QM is enabled.
      
      Change-Id: Idbfdac9eb9a31adcffe734aac1877d58b86fab77
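To illustrate why a flat matrix allows the QM and non-QM paths to share code, here is a deliberately simplified quantizer sketch. It uses plain division rather than the multiply-and-shift arithmetic a real codec would use, and the names and the 5-bit weight precision are assumptions of this sketch. With a NULL matrix the weight defaults to the flat value, so a flat matrix produces the same output as no matrix at all.

```c
#include <stdint.h>
#include <stdlib.h>

#define QM_BITS 5 /* assumed weight precision: a flat entry is 1 << QM_BITS */

/* Simplified, hypothetical quantizer. qm == NULL means "no quantization
 * matrix"; otherwise qm[i] is the per-coefficient weight. */
static void quantize_sketch(const int32_t *coeff, int n, int32_t step,
                            const uint8_t *qm, int32_t *qcoeff) {
  for (int i = 0; i < n; ++i) {
    /* One code path: a missing matrix behaves like a flat one. */
    const int64_t wt = (qm != NULL) ? qm[i] : (1 << QM_BITS);
    const int64_t abs_coeff = llabs((long long)coeff[i]);
    const int64_t level = (abs_coeff * wt) / ((int64_t)step << QM_BITS);
    qcoeff[i] = (int32_t)(coeff[i] < 0 ? -level : level);
  }
}
```

Since the flat weight cancels against the shift, a SIMD routine written for the non-QM case could in principle serve the flat-QM case unchanged.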
  22. 04 Aug, 2017 1 commit
  23. 21 Jul, 2017 1 commit
    • Integrate convolve_round with compound_segment · 7b517095
      Angie Chiang authored
      This integration only covers low bitdepth mode for now
      
      The performance gain of convolve_round on top of compound_segment
      recovers from 0.475% to 0.612% on lowres
      
      Change-Id: I21606c79d0a22c0834966730358267c082d8071e
  24. 12 Jul, 2017 1 commit
    • ext-partition-types: Add 4:1 partitions · 93c39e91
      Rupert Swarbrick authored
      This patch adds support for 4:1 rectangular blocks to various common
      data arrays, and adds new partition types to the EXT_PARTITION_TYPES
      experiment which will use them.
      
      This patch has the following restrictions, which can be lifted in
      future patches:
      
        * ext-partition-types is incompatible with fp_mb_stats and supertx
          for the moment
      
        * Currently only 32x32 superblocks can use the new partition types
      
      There's a slightly odd restriction about when we allow
      PARTITION_HORZ_4 or PARTITION_VERT_4. Since these both live in the
      EXT_PARTITION_TYPES CDF, read_partition() can only return them if both
      has_rows and has_cols are true. This means that at least half of the
      width and height of the block must be visible. It might be nice to
      relax that restriction but that would imply a change to how we encode
      partition types, which seems already to be in a state of flux, so
      maybe it's better to wait until that has settled down.
      
      Change-Id: Id7fc3fd0f762f35f63b3d3e3bf4e07c245c7b4fa
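The has_rows/has_cols restriction described above can be sketched as follows. Positions and sizes are in mode-info units, hbs is half the block size in those units, and the function name is hypothetical; only the two comparisons reflect the condition the message describes.

```c
/* Sketch: a block may signal PARTITION_HORZ_4 / PARTITION_VERT_4 only when
 * the start of its bottom half and of its right half both lie inside the
 * frame. All arguments are in mode-info units (an assumption here). */
static int can_signal_4way_partition(int mi_row, int mi_col, int hbs,
                                     int mi_rows, int mi_cols) {
  const int has_rows = (mi_row + hbs) < mi_rows; /* bottom half visible */
  const int has_cols = (mi_col + hbs) < mi_cols; /* right half visible */
  return has_rows && has_cols;
}
```

For a frame of 10 x 10 units and hbs = 4, a block at row 8 fails the row test (8 + 4 is not below 10), so the 4:1 partitions cannot be returned for it.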
  25. 08 Jul, 2017 1 commit
    • Fix frame scaling prediction · 505f0068
      Fergus Simpson authored
      Use higher precision offsets for more accurate predictor
      generation when references are at a different scale from
      the coded frame.
      
      Change-Id: I4c2c0ec67fa4824273cb3bd072211f41ac7802e8
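A small fixed-point sketch of why more fractional bits in the offset give a more accurate predictor position when the reference is at a different scale. The 14-bit scale factor and the 4-bit versus 10-bit comparison are assumptions chosen for illustration, and the function names are hypothetical.

```c
#include <stdint.h>

#define REF_SCALE_SHIFT 14 /* assumed precision of the scale factor */

/* Scale factor ref_size / cur_size in fixed point. */
static int32_t scale_factor_fp(int ref_size, int cur_size) {
  return (int32_t)(((int64_t)ref_size << REF_SCALE_SHIFT) / cur_size);
}

/* Map a coordinate in the coded frame to the reference frame, keeping
 * frac_bits fractional bits in the result. */
static int64_t scaled_pos(int pos, int32_t scale_fp, int frac_bits) {
  return ((int64_t)pos * scale_fp) >> (REF_SCALE_SHIFT - frac_bits);
}
```

With a reference at 2/3 of the coded size, coordinate 5 maps to exactly 10/3, about 3.3333. Keeping 4 fractional bits yields 53/16 = 3.3125, while keeping 10 bits yields 3413/1024, about 3.3330, so the sub-pixel filter is selected from a position much closer to the true one.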
  26. 29 Jun, 2017 1 commit
  27. 28 Jun, 2017 1 commit
  28. 27 Jun, 2017 1 commit
    • Fix inv txfm low/high bitdepth selection logic · 51281095
      Yi Luo authored
      We are going to have several commits to set up the new low/high
      bitdepth data path selection logic. This patch is for the inverse
      transform. Let me summarize the ideas as follows.
      
      - For low/high bitdepth selection, encoder depends on
        input configuration, e.g., video sequence bitdepth,
        profile. Decoder depends on input bitstream. This has
        nothing to do with compiler/build configuration.
      
      - Typical encoder usage for sampling format 4:2:0.
        1) 8-bit video sequence:
         a) --profile=0
         Fastest encoding/decoding pipeline.
      
         b) --profile=2 --bit-depth=10
         Image pixels are left shifted by 2 bits. It
         employs 16-bit reference frame buffer and has high
         calculation precision. It usually enjoys higher
         compression performance.
      
        2) 10/12-bit video sequence (HDR):
         --profile=2 --bit-depth=10/12
      
      - Transform coefficient type:
        Lowbitdepth:  int16_t
        Highbitdepth: int32_t
      
      - The type tran_low_t, which is int32_t, is still used in the
        codebase and defines the data path capacity.
        Naturally, it is high bitdepth.
      
      Eventually we shall remove the configuration flags,
      CONFIG_HIGHBITDEPTH/CONFIG_LOWBITDEPTH, and separate the
      low and high bitdepth data paths. The two data paths will
      co-exist in the same build environment.
      
      Change-Id: I35c06d4d4f19ebf80d909168fdddbae57c3cc884
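The points above can be sketched in a few lines: the data path is chosen at run time from the configured bit depth rather than from a build flag, and an 8-bit source coded at 10 bits is left-shifted by 2 into a 16-bit buffer. The function names are hypothetical, and treating any depth above 8 as "high bitdepth" is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Run-time selection sketch: the path depends on the configured bit depth
 * (encoder) or the bitstream (decoder), not on a compile-time flag. */
static int use_highbd_path(int bit_depth) { return bit_depth > 8; }

/* Upshift 8-bit source pixels into a 16-bit buffer for coding at
 * bit_depth (for example, a shift of 2 when bit_depth is 10). */
static void upshift_to_highbd(const uint8_t *src, uint16_t *dst, size_t n,
                              int bit_depth) {
  const int shift = bit_depth - 8;
  for (size_t i = 0; i < n; ++i) {
    dst[i] = (uint16_t)(src[i] << shift);
  }
}
```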
  29. 22 Jun, 2017 1 commit
    • Add avx2 highbd_quantize_b · 193422e7
      Yi Luo authored
      - First-pass encoding time drops by ~10.9% on i7-6700
        at 100 frames, 1080p.
      - avx2 handles the cases with coeff number >= 8; the coeff number < 8
        case will be implemented in sse2.
      - Unit tests are added for types B/FP/DC.
      
      Change-Id: Ibe5b7807c64e6dfc2d59c470ed50a6e8ca94ef7c
  30. 19 Jun, 2017 1 commit
    • encoder: Remove 64x upsampled reference buffers · 5d24b6f0
      Timothy B. Terriberry authored
      They do not handle border extension correctly (interpolation and
      border extension do not commute unless you upsample into the
      border), nor do they handle crop dimensions that are not a multiple
      of 8 (the upsampled version is not sufficiently large), in addition
      to using massive amounts of memory and being a criminal waste of
      cache (1 byte used for every 8 bytes fetched).
      
      This commit reimplements use_upsampled_references by computing the
      subpixel samples on the fly. This implementation not only corrects
      the border handling, but is also faster, while maintaining the
      same quality.
      
      HL AWCY results are basically noise:
          PSNR | PSNR HVS |   SSIM | MS SSIM | CIEDE 2000
        0.0188 |   0.0187 | 0.0045 |  0.0063 |     0.0228
      
      Change-Id: I7527db9f83b87a7bb8b35342f7e6457cd0bef9cd
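A minimal sketch of computing a subpixel sample on the fly instead of reading it from a stored upsampled buffer. It assumes 1/8-pel positions (consistent with 8x upsampling per dimension, hence 64x the memory) and uses bilinear interpolation as a stand-in; the encoder's actual interpolation filters are longer, and the function name is hypothetical.

```c
#include <stdint.h>

/* Return the sample at (x_q3 / 8, y_q3 / 8) in the full-resolution
 * reference, interpolated bilinearly. The caller must guarantee that the
 * sample one column right and one row down is readable, for example via
 * an extended border. */
static uint8_t subpel_sample_bilinear(const uint8_t *ref, int stride,
                                      int x_q3, int y_q3) {
  const int x = x_q3 >> 3, y = y_q3 >> 3;   /* integer pixel position */
  const int fx = x_q3 & 7, fy = y_q3 & 7;   /* 1/8-pel fractions */
  const uint8_t *p = ref + y * stride + x;
  const int top = p[0] * (8 - fx) + p[1] * fx;
  const int bot = p[stride] * (8 - fx) + p[stride + 1] * fx;
  /* Weights sum to 64 over both dimensions; round to nearest. */
  return (uint8_t)((top * (8 - fy) + bot * fy + 32) >> 6);
}
```

Only the full-resolution reference is touched, so every byte fetched is used, and border handling reduces to extending the original frame.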
  31. 08 Jun, 2017 1 commit
  32. 06 Jun, 2017 1 commit
    • Add a new experiment "rect-intra-pred". · 766a389b
      Urvang Joshi authored
      Earlier, intra prediction for rectangular blocks was performed by
      running two steps of prediction on square sub-blocks.
      
      With this experiment, we do proper intra prediction for rectangular
      blocks. This ensures that we make use of all available neighboring
      pixels especially for directional modes. For this, all the intra
      predictors were updated to work with rectangular transform block sizes.
      
      Performance improvements are small but free of cost:
      
      All Intra frames:
      lowres: -0.126
      midres: -0.154
      
      Video Overall:
      lowres: -0.043
      midres: -0.100
      
      [Could not get AWCY results due to a backlog.]
      
      BUG=aomedia:551
      
      Change-Id: I7936e91b171d5c246cb0a4ea470a981a013892e6
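As a simple illustration of predicting a rectangular block in one step, here is a DC predictor that averages all bw above neighbors and all bh left neighbors. DC is chosen only for brevity (the message singles out directional modes as the main beneficiaries), and the function name and signature are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* DC prediction for a bw x bh block using every available neighbor:
 * bw pixels from the row above and bh pixels from the column to the left. */
static void dc_pred_rect(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
                         const uint8_t *above, const uint8_t *left) {
  const int count = bw + bh;
  int sum = 0;
  for (int i = 0; i < bw; ++i) sum += above[i];
  for (int i = 0; i < bh; ++i) sum += left[i];
  const int dc = (sum + (count >> 1)) / count; /* rounded mean */
  for (int r = 0; r < bh; ++r) {
    memset(dst, dc, (size_t)bw);
    dst += stride;
  }
}
```

Splitting a 4x2 block into two square sub-blocks would instead predict each half from its own subset of neighbors, which is the two-step behavior this experiment replaces.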