  1. 20 Nov, 2017 1 commit
    • JNT_COMP: refactor if statements · 8263f80c
      Cheng Chen authored
      Refactor the if statements that use frame_offset == -1 to indicate
      that jnt_comp is not chosen, as the distance can no longer be negative.
      Instead, add a variable use_jnt_comp_avg for the same functionality.
      
      Change-Id: Ie6b9c6ab36131b48bc9e066babada17046729cd8
      8263f80c
  2. 17 Nov, 2017 1 commit
  3. 14 Nov, 2017 1 commit
  4. 13 Nov, 2017 1 commit
    • JNT_COMP: SIMD for av1_warp_affine · fbaf5135
      Cheng Chen authored
      Add low bit-depth SIMD function for av1_warp_affine based on
      existing SIMD implementation.
      Unit tests are added.
      
      Change-Id: I1b4033fa75b53a81cb20a4bb5cc60413708b568c
      fbaf5135
  5. 06 Nov, 2017 2 commits
    • JNT_COMP: Round the weighted sum · 7caa7382
      Cheng Chen authored
      Previously the weighted sums in convolve were right shifted without
      rounding. This patch adds a rounding value before the right shifts.
      
      Change-Id: Iea39aca419ac0ca0c32756f345293ce5e28dbd5b
      7caa7382
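The difference between a truncating and a rounding right shift can be sketched as follows. The function names are illustrative only, and the sketch assumes non-negative sums so that the shift behaviour is fully defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating shift: simply discards the low bits. */
static int32_t shift_truncate(int32_t sum, int bits) {
  assert(sum >= 0 && bits >= 1 && bits < 31);
  return sum >> bits;
}

/* Rounding shift: add half of the divisor (1 << (bits - 1)) before shifting,
 * so the result is rounded to nearest instead of rounded down. */
static int32_t shift_round(int32_t sum, int bits) {
  assert(sum >= 0 && bits >= 1 && bits < 31);
  return (sum + (1 << (bits - 1))) >> bits;
}
```

For example, 7 shifted right by 2 gives 1 when truncated but 2 when rounded, since 7/4 = 1.75.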
    • JNT_COMP: add SIMD implementations for c functions · ef34fff7
      Cheng Chen authored
      Add SIMD implementations of the C functions for low bit-depth, making
      the encoder 3~4x faster than with the C functions.
      
      Change-Id: Icca0b07b25489759be9504aaec09d1239076fc52
      ef34fff7
  6. 03 Nov, 2017 1 commit
    • Remove 4-tap filter intra · e2692c5c
      Yue Chen authored
      We reverted to using 3-tap filters, so the code related to 4-tap
      filters will no longer be used.
      
      Change-Id: I7f65cf227d2eb3e9785474e3b33d0bdbf489b1f1
      e2692c5c
  7. 02 Nov, 2017 2 commits
    • loop-restoration: Rework self-guided filter · 369d8f22
      David Barker authored
      Because we have an (effective) 3-pixel border around each
      processing unit, and the local sums in the self-guided filter are
      only taken over at most 5x5 regions, we have 1 pixel's worth of
      spare border.
      
      We can use this border to greatly simplify the filter: Instead
      of calculating a 64x64 region of the A[] and B[] arrays, we can
      calculate a 66x66 region. Then we don't have to deal with complicated
      boundary conditions when generating the final 64x64 output block.
      
      This also makes a few other related changes:
      * The 'boxnum' function has been effectively redundant
        for a while - due to the way we do the 5x5 (or 3x3) windowing,
        the values we actually use are always (2r+1)^2. So we can skip
        calling this function if MAX_RADIUS <= 2
      
      * We can remove the annoying special case for tiny processing units
        in the self-guided filter, as we no longer have to worry about
        border behaviour
      
      * We change the SSE4.1 code to match the new C code, removing a ton
        of complexity. Further refactoring/speedups are probably
        now possible, but this includes the minimal changes to pass all
        the tests.
      
      Change-Id: I99beee164a31349a5228a9bef048e5f35c9639f2
      369d8f22
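The border argument in this commit can be illustrated with a deliberately simplified two-stage sketch. It uses a tiny 4x4 unit instead of 64x64, plain box sums instead of the real A[] and B[] computation, and an unweighted 3x3 combination; all names are hypothetical. The point is only that computing the intermediate over a region one pixel larger on each side removes every boundary check from the final stage.

```c
#include <assert.h>
#include <stdint.h>

enum {
  UNIT = 4,                    /* output unit size (64 in the commit text) */
  BORDER = 3,                  /* available border around the unit */
  RADIUS = 2,                  /* 5x5 window */
  STRIDE = UNIT + 2 * BORDER,  /* width of the padded source */
  MID = UNIT + 2               /* intermediate region: unit plus 1 pixel each side */
};

/* Stage 1: 5x5 box sums over the (UNIT+2)x(UNIT+2) region. Entry (i, j) of
 * mid corresponds to unit position (i - 1, j - 1). Every window contains
 * exactly (2 * RADIUS + 1)^2 = 25 samples because the border covers it. */
static void box_sums_padded(const uint8_t *src, int32_t *mid) {
  assert(BORDER >= RADIUS + 1);
  for (int i = 0; i < MID; ++i) {
    for (int j = 0; j < MID; ++j) {
      int32_t sum = 0;
      for (int di = -RADIUS; di <= RADIUS; ++di)
        for (int dj = -RADIUS; dj <= RADIUS; ++dj)
          sum += src[(i - 1 + BORDER + di) * STRIDE + (j - 1 + BORDER + dj)];
      mid[i * MID + j] = sum;
    }
  }
}

/* Stage 2: combine the 3x3 neighbourhood of intermediates for each output
 * sample. No boundary handling is needed: mid already extends 1 pixel out. */
static void combine_3x3(const int32_t *mid, int32_t *out) {
  for (int i = 0; i < UNIT; ++i) {
    for (int j = 0; j < UNIT; ++j) {
      int32_t sum = 0;
      for (int di = 0; di < 3; ++di)
        for (int dj = 0; dj < 3; ++dj)
          sum += mid[(i + di) * MID + (j + dj)];
      out[i * UNIT + j] = sum;
    }
  }
}
```

Stage 1 reads at most RADIUS + 1 = 3 pixels beyond the unit, which is exactly the effective border the commit message mentions.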
    • Remove experimental flag of EXT_TX · 3bac9928
      Sebastien Alaiwan authored
      This experiment has been adopted, so we can simplify the code
      by dropping the associated preprocessor conditionals.
      
      Change-Id: I02ed47186bbc32400ee9bfadda17659d859c0ef7
      3bac9928
  8. 25 Oct, 2017 1 commit
    • Avoid UB from misaligned loads in selfguided_sse4.c · 84ffea31
      Rupert Swarbrick authored
      This follows on from the previous patch, which corrects xx_loadl_32
      for misaligned addresses. Calls to xx_loadl_32 in selfguided_sse4.c
      are all followed by a zero-extend, so this patch packages the two into
      the inlinable functions xx_load_extend_8_16 and xx_load_extend_8_32.
      
      There were also some hand-rolled loads (which matched the old body of
      xx_loadl_32 and weren't strictly correct when the pointer was
      misaligned). This patch fixes them up to use xx_load_extend_8_32.
      
      BUG=aomedia:912
      
      Change-Id: I9c76dd4f41baa1343149aa9c432218a17df8b415
      84ffea31
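A scalar sketch of the idea, assuming nothing about the real xx_loadl_32 or xx_load_extend_8_32 bodies: dereferencing a cast pointer at a misaligned address is undefined behaviour in C, whereas a memcpy of the same bytes is well defined and typically compiles to a single load. The function names below are hypothetical analogues.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load 32 bits from a possibly misaligned address without UB. Writing
 * *(const uint32_t *)p here would be undefined for misaligned p. */
static uint32_t sketch_load_u32(const void *p) {
  uint32_t v;
  assert(p != NULL);
  memcpy(&v, p, sizeof(v));
  return v;
}

/* Scalar analogue of "load 4 bytes, then zero-extend each to 32 bits":
 * reads exactly 4 bytes and widens them into 4 lanes. */
static void sketch_load_extend_8_32(const uint8_t *p, uint32_t out[4]) {
  assert(p != NULL);
  for (int i = 0; i < 4; ++i) out[i] = p[i];
}
```

Packaging the load and the zero-extend in one helper also makes it harder to read more bytes than are needed.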
  9. 24 Oct, 2017 1 commit
    • Expose av1_loop_restoration_filter_unit in restoration.h · dd6f09ab
      Rupert Swarbrick authored
      This patch also does a certain amount of rejigging for loop
      restoration coefficients, grouping the information for a given
      restoration unit into a structure called RestorationUnitInfo. The end
      result is to completely dispense with the RestorationInternal
      structure.
      
      The copy_tile functions in restoration.c, together with those
      functions that operate on a single stripe, have been changed so that
      they take pointers to the top-left corner of the area on which they
      should work, together with a width and height.
      
      The same isn't true of av1_loop_restoration_filter_unit, which still
      takes pointers to the top-left of the tile. This is because you
      actually need the absolute position in the tile in order to do striped
      loop restoration properly.
      
      Change-Id: I768c182cd15c9b2d6cfabb5ffca697cd2a3ff9e1
      dd6f09ab
  10. 21 Oct, 2017 1 commit
  11. 19 Oct, 2017 1 commit
    • General tidy-ups in loop restoration code · d3d0615e
      Rupert Swarbrick authored
      This refactors the iteration in restoration.c so that all the scary
      stuff lies in a pair of general functions, filter_frame and
      filter_rest_unit.
      
      filter_frame is currently very simple, iterating over the restoration
      units in the frame. Once we've made it so that restoration units don't
      span tile boundaries, this function is the one we'll need to update to
      iterate over tiles and then restoration units within the tile.
      
      filter_rest_unit replaces the outer loop of the loop_*_filter_tile*
      functions. It deals with chopping the restoration unit into stripes of
      height procunit_height. When CONFIG_STRIPED_LOOP_RESTORATION is true,
      it also deals with calling setup_processing_stripe_boundary and
      restore_processing_stripe_boundary to use boundary data from the
      deblocked output.
      
      Some of the ugly #if/#endif blocks have been elided in the wiener
      filter code (both low and high bit depth), by defining a convolve
      alias based on USE_WIENER_HIGH_INTERMEDIATE_PRECISION.
      
      There are also changes to extend const-ness for the source frame. I've
      adopted the convention that the frame input is called "data" (as it
      was before) while it's non-const. This is true as far as
      filter_rest_unit. Then each "process one stripe" function takes a
      const pointer to the source frame, at which point it's called "src".
      
      The intention is that, once filter_rest_unit no longer needs a
      RestorationInternal pointer, this function can be exposed in
      restoration.h and can be used by pickrst.c
      
      Change-Id: I18043a172ef0ca1154d87cf7f63e3a80944627cd
      d3d0615e
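The stripe chopping attributed to filter_rest_unit can be sketched roughly as below. This is an illustration under simplifying assumptions: the callback type and function names are hypothetical, and the boundary setup and restore steps mentioned in the message are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-stripe callback: first row of the stripe, its height,
 * and an opaque context pointer. */
typedef void (*sketch_stripe_fn)(int y0, int height, void *ctx);

/* Chop a restoration unit of unit_height rows into stripes of at most
 * procunit_height rows and process each one. Returns the stripe count. */
static int sketch_filter_unit(int unit_height, int procunit_height,
                              sketch_stripe_fn fn, void *ctx) {
  int count = 0;
  assert(unit_height >= 0 && procunit_height > 0 && fn != NULL);
  for (int y = 0; y < unit_height; y += procunit_height) {
    const int remaining = unit_height - y;
    const int h = remaining < procunit_height ? remaining : procunit_height;
    fn(y, h, ctx); /* the last stripe may be shorter than the others */
    ++count;
  }
  return count;
}

/* Example callback: accumulate the number of rows visited. */
static void sketch_count_rows(int y0, int height, void *ctx) {
  (void)y0;
  *(int *)ctx += height;
}
```

For a unit of 150 rows and a stripe height of 64, this visits stripes of 64, 64, and 22 rows.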
  12. 16 Oct, 2017 1 commit
  13. 12 Oct, 2017 2 commits
  14. 11 Oct, 2017 1 commit
  15. 10 Oct, 2017 2 commits
    • Add an SSE4.1 implementation of av1_highbd_convolve_2d_scale · 724d31eb
      Rupert Swarbrick authored
      For large blocks this is about 8x the speed of the C version. The code
      needs SSE 4.1 for the PMULLD instruction that we use to do SIMD 32-bit
      multiplies.
      
      The patch uses av1_convolve_scale_test (written already to test the
      low bit depth path) to make sure the optimised code matches the C
      version.
      
      Change-Id: I9304d6bb3d2cb31390de93ed08ff1a852e3ace86
      724d31eb
    • Add an SSE4.1 implementation of av1_convolve_2d_scale · 98dc22b8
      Rupert Swarbrick authored
      For large blocks this is almost 8x the speed of the C version. The
      code needs SSE 4.1 for the PMULLD instruction that we use to do SIMD
      32-bit multiplies.
      
      This patch also makes av1_convolve_scale_test actually test something,
      making sure the optimised code matches the C version. The slightly
      excessive generality in the test (all the templating) is because of a
      following patch, which is for the high bit depth path and can then use
      most of the same test code.
      
      Change-Id: I6732bc6b2378ffaadae5aa6441100cf660f7ee11
      98dc22b8
  16. 07 Oct, 2017 1 commit
    • [intra-edge] Pad intra edge samples to avoid valgrind warning · 7cfd5343
      Joe Young authored
      The SSE4 function filter_intra_edge_sse4_1() reads data slightly
      past the initialized part of the array. Those data are discarded
      later, but the read causes a valgrind warning. This change avoids the
      warning by initializing an extra 16 positions of the array.
      
      BUG=aomedia:868
      
      Change-Id: Ib610492cff91492ae379c5d62895773f8747c4bc
      7cfd5343
  17. 05 Oct, 2017 1 commit
  18. 02 Oct, 2017 1 commit
    • Remove SSE specialisation for av1_fwd_txfm2d_64x64 · b98ea58d
      Rupert Swarbrick authored
      This specialisation can't work: it gets the configuration to use from
      av1_get_fwd_txfm_64x64_cfg, which specifies a TXFM_TYPE equal to
      TXFM_TYPE_DCT64 (reasonable enough), but the code in
      av1_fwd_txfm2d_sse4.c only supports TXFM_TYPE_DCT32 and
      TXFM_TYPE_ADST32.
      
      BUG=aomedia:852
      
      Change-Id: I37ffa0c8ae520c780105b30df9f627c2290de425
      b98ea58d
  19. 01 Oct, 2017 1 commit
  20. 20 Sep, 2017 1 commit
    • [intra-edge] Vectorize upsampling · ad0196b8
      Joe Young authored
      Add sse4_1 functions for Intra-edge experiment:
        av1_upsample_intra_edge_sse4_1()
        av1_upsample_intra_edge_high_sse4_1()
      
      Approx cycle reduction at qp 20, 1 kf:
        Enc:  0.5% to 0.3%
        Dec:  0.4% to 0.2%
      
      Change-Id: I97f0eee09b78218b418b484d80c338cec037f1b9
      ad0196b8
  21. 16 Sep, 2017 1 commit
    • [intra-edge] Vectorize edge filtering functions · 89d321f7
      Joe Young authored
      Add sse4_1 functions for Intra-edge experiment:
        av1_filter_intra_edge_sse4_1()
        av1_filter_intra_edge_high_sse4_1()
      
      Approx cycle reduction at qp 20, 1 kf:
        Enc (lbd) 1.4% to 0.3%
        Dec (lbd) 0.4% to 0.1%
        Enc (hbd) 1.1% to 0.2%
        Dec (hbd) 0.6% to 0.1%
      
      No change to bitstream
      
      Change-Id: I176b2d125424d7d226114c807915c33dde5c3720
      89d321f7
  22. 10 Sep, 2017 2 commits
    • Refactoring/simplification of buffers used for sgr · 1330dfd1
      Debargha Mukherjee authored
      Includes miscellaneous cleanups, test fixes, and code reorganization
      for loop-restoration components.
      
      Change-Id: I5b2e6419234d945e6f4344b22636119b50df4054
      1330dfd1
    • Reduce/Eliminate line buffer for loop-restoration. · e168a783
      Debargha Mukherjee authored
      This patch forces the vertical filtering for the top and bottom
      rows of a processing unit for the Wiener filter to not use border
      more than what is set in the WIENER_BORDER_VERT macro.
      This macro is currently set at 0 to eliminate line buffer completely,
      but it could be increased to 1 or 2 to use limited line buffers
      if the coding efficiency is affected too much with a 0 line-buffer.
      
      Also, for the sgr filter we added the option of using overlapping
      windows horizontally and vertically to improve coding efficiency.
      The vertical border used is set by the SGRPROJ_BORDER_VERT
      macro, while the horizontal border is set by the
      SGRPROJ_BORDER_HORZ macro, currently 2, the max needed. We do not
      recommend setting SGRPROJ_BORDER_HORZ below 2.
      
      The overall line buffer requirement for LR is twice the max of
      WIENER_BORDER_VERT and SGRPROJ_BORDER_VERT.
      Currently both are set as 0, eliminating line buffers completely.
      
      Also this patch extends borders consistently before CDEF / LR.
      
      Change-Id: Ie58a98c784a0db547627b9cfcf55f018c30e8e79
      e168a783
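The line buffer rule stated in this message (twice the max of the two vertical borders) is simple arithmetic, sketched below. The helper name is hypothetical; only the two macro names and the formula come from the commit text.

```c
#include <assert.h>

/* Overall loop-restoration line buffer requirement in rows, following the
 * rule in the commit message: twice the larger of the two vertical borders. */
static int sketch_lr_line_buffer_rows(int wiener_border_vert,
                                      int sgrproj_border_vert) {
  assert(wiener_border_vert >= 0 && sgrproj_border_vert >= 0);
  const int max_border = wiener_border_vert > sgrproj_border_vert
                             ? wiener_border_vert
                             : sgrproj_border_vert;
  return 2 * max_border;
}
```

With both borders at 0, as in this patch, the requirement is 0 rows. Raising WIENER_BORDER_VERT to 2 while keeping SGRPROJ_BORDER_VERT at 1 would give 4 rows.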
  23. 06 Sep, 2017 1 commit
    • Adjust chroma position in warp filter · a60dc9d6
      David Barker authored
      When using chroma subsampling, the warp filter currently behaves
      strangely when projecting chroma pixels, especially when the
      subsamplings are not equal along the x and y axes.
      
      For example, when subsampling_x = 1 and subsampling_y = 0, we
      calculate the destination coordinates (dx, dy) from the source
      coordinates (sx, sy) as:
      dx = project(2*sx+0.5, 2*sy+0.5)/2 - 0.5
      dy = project(sx, sy)
      where project() applies the affine warp model.
      
      This patch changes to a simpler and more consistent model,
      where we:
      * Project the chroma sample into luma coordinates, taking
        the chroma sample to be co-located with the top-left luma
        sample in its (2x2, or 2x1, or 1x2) subsampling block
        (this is done for simplicity; we don't expect the exact
         position to make much difference to the output quality)
      * Apply the transformation in luma coordinates
      * Project the resulting luma sample back into chroma coordinates
      
      Change to software speed is in the noise, but this approach
      should be simpler in hardware, and should slightly improve
      quality for 4:2:2 and 4:4:0 videos.
      
      Change-Id: Idd455fdd3897594ca7d4edff5b85b78961d1638d
      a60dc9d6
  24. 22 Aug, 2017 1 commit
    • Fix ASan errors in SSE4.1 selfguided filter tests · 67a5e148
      David Barker authored
      The selfguided filter code was sometimes fetching 8 bytes of data
      when it only needed 4. This was normally fine, but led to problems
      in the selfguided filter test when compiling for x86-32, where we
      accidentally read off the end of the input buffer.
      
      Fix this by only reading the amount of data we actually need.
      
      BUG=aomedia:700
      
      Change-Id: I2448b7b0d9cb2f9292a092675a66da64c89f913c
      67a5e148
  25. 21 Aug, 2017 1 commit
    • Obey do_average flag when doing convolve_round · 07089c68
      Rupert Swarbrick authored
      Doing this means that we don't have to memset temporary buffers to
      zero in reconinter.c, which was taking ~5% of cycles in a short
      encoding test (using perf to attach to a running encode).
      
      Change-Id: Ibb6e31920000b876c6ee99f454d89c8a97e9fb31
      07089c68
  26. 31 Jul, 2017 1 commit
    • Unified warp_affine and warp_affine_post_round · b6a31753
      Peter de Rivaz authored
      This patch removes the need for a separate warp_affine_post_round
      function by adding the functionality to the warp_affine function.
      
      The encoded output should remain unchanged, but the encoder/decoder
      should operate faster because the sse2 and ssse3 warp implementation
      can now be used when post_rounding is being used.
      
      Change-Id: Ide52cae55de59a9da9c27c5793e17390f6d2c03e
      b6a31753
  27. 17 Jul, 2017 1 commit
    • Unify FWD_TXFM_PARAM and INV_TXFM_PARAM · 27319b6e
      Lester Lu authored
      Change two similar structs, FWD_TXFM_PARAM and INV_TXFM_PARAM,
      into a common struct: TxfmParam. Its definition is moved to
      aom_dsp/txfm_common.h to simplify dependency.
      
      This change is made so that, in later changes of the LGT
      experiment, functions requiring FWD_TXFM_PARAM and
      INV_TXFM_PARAM, such as get_fwd_lgt4 and get_inv_lgt4, can
      also be unified.
      
      Change-Id: I756b0176a02314005060adbf8e62386f10eeb344
      27319b6e
  28. 13 Jul, 2017 1 commit
    • Speed up convolve_round post-rounding by avx2 · 04cef497
      Yi Luo authored
      - Decoder convolve rounding cycle percentage drops from
        2.75% to 0.91% by using avx2 function on i7-6700.
      
      Change-Id: I34ae48f45c0b4073f8962647d2181365ffe3325b
      04cef497
  29. 07 Jul, 2017 1 commit
    • Signature changes for the LGT experiment · d8b1ddce
      Lester Lu authored
      The input arguments of av1_fht* and av1_iht* functions (and their
      HBD versions) are slightly changed. Input arguments tx_type and
      bd are carried by a struct fwd_txfm_param/inv_txfm_param. This
      struct is meant to later on carry other prediction information,
      such as intra top/left boundaries to the transform level, so
      that the choice of transforms can be more adaptive to the
      prediction mode and local video content.
      
      Change-Id: Ia42544248a51845be64b72855b642ef1fe5910a9
      d8b1ddce
  30. 24 Jun, 2017 1 commit
  31. 20 Jun, 2017 1 commit
  32. 12 Jun, 2017 1 commit
    • Clean up hbd transform code · 30dfa883
      Sarah Parker authored
      Responding to some leftover cosmetic comments from
      2b5cdb1cf87c933331a16cc0221455d0a8c255e1
      
      Change-Id: I42e126593526cedd6675adf35b9c1df78e1ddf54
      30dfa883
  33. 09 Jun, 2017 2 commits
    • Vectorize av1_convolve_2d() · 8295c7c7
      David Barker authored
      Includes a test case based on the warp filter tests
      
      Change-Id: I9abea53a088f68bb8a928ebd7cb96b3266a63c13
      8295c7c7
    • Add 'do_average' to ConvolveParams structure · e64d51a9
      David Barker authored
      The 'ref' member of ConvolveParams currently serves two purposes:
      * To indicate which component of a compound we're currently predicting,
        eg. for fetching interpolation filters with dual-filter enabled.
      * To determine whether we should average into the destination buffer.
      
      But there are two cases where we want to separate these out:
      * In joint_motion_search, we want to try combining a fixed second
        prediction with various first predictions.
      * When searching masked interinter compounds, we want to predict
        each component separately then try different combinations.
      
      In these cases, we set 'ref' to 0 and use temporary variables to
      make sure we use the correct interpolation filters. But this is
      quite fragile.
      
      This patch separates out the two uses into separate members.
      This allows us to remove some temporary variables, but more
      importantly gives easy fixes to two bugs in
      build_inter_predictors_single_buf (used by rdopt):
      
      * We previously set ref=0 but didn't fix up the interpolation filters
      * For ZERO_ZEROMV modes, the second component would accidentally
        average into the (uninitialized!) second prediction buffer
      
      BUG=aomedia:577
      BUG=aomedia:584
      BUG=aomedia:595
      
      Change-Id: Ibc31d1ac701a029ea5efaa1197dd402bc4b7af1e
      e64d51a9
  34. 08 Jun, 2017 1 commit
    • Remove deprecated high-bitdepth functions · 31c66502
      Sarah Parker authored
      This unifies the codepath for high-bitdepth transforms and deletes
      all calls to the old deprecated versions. This required reworking
      the way 1d configurations are combined in order to support rectangular
      transforms.
      
      There is one remaining codepath that calls the deprecated 4x4 hbd
      transform from encoder/encodemb.c. I need to take a closer look
      at what is happening there and will leave that for a followup
      since this change has already gotten so large.
      
      lowres 10 bit: -0.035%
      lowres 12 bit: 0.021%
      
      BUG=aomedia:524
      
      Change-Id: I34cdeaed2461ed7942364147cef10d7d21e3779c
      31c66502