1. 10 Apr, 2013 1 commit
    • Make SB coding size-independent. · a3874850
      Ronald S. Bultje authored
      Merge sb32x32 and sb64x64 functions; allow for rectangular sizes. Code
      gives identical encoder results before and after. There are a few
      macros for rectangular block sizes under the sbsegment experiment; this
      experiment is not yet functional and should not yet be used.
      
      Change-Id: I71f93b5d2a1596e99a6f01f29c3f0a456694d728
  2. 04 Apr, 2013 1 commit
  3. 27 Mar, 2013 1 commit
    • Optimize 32x32 idct function · 21a718d9
      Yunqing Wang authored
      Wrote sse2 version of vp9_short_idct_32x32 function. Compared
      to c version, the sse2 version is 5X faster.
      
      Change-Id: I071ab7378358346ab4d9c6e2980f713c3c209864
  4. 26 Mar, 2013 1 commit
    • Implicit weighted prediction experiment · 23144d23
      Deb Mukherjee authored
      Adds an experiment to use a weighted prediction of two INTER
      predictors, where the weight is one of (1/4, 3/4), (3/8, 5/8),
      (1/2, 1/2), (5/8, 3/8) or (3/4, 1/4), and is chosen implicitly
      based on consistency of the predictors to the already
      reconstructed pixels to the top and left of the current macroblock
      or superblock.
      
      Currently the weighting is not applied to SPLITMV modes, which
      default to the usual (1/2, 1/2) weighting. However the code is in
      place controlled by a macro. The same weighting is used for Y and
      UV components, where the weight is derived from analyzing the Y
      component only.
      
      Results (over compound inter-intra experiment)
      derf: +0.18%
      yt: +0.34%
      hd: +0.49%
      stdhd: +0.23%
      
      The experiment suggests bigger benefit for explicitly signaled weights.
      
      Change-Id: I5438539ff4485c5752874cd1eb078ff14bf5235a
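The implicit weight choice described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the experiment's actual code: it assumes the "consistency" measure is a sum of absolute differences between the blended predictors and the already reconstructed border pixels, and the names `pick_implicit_weight`, `blend`, and the flat border lists are hypothetical. Because the border is already reconstructed, a decoder could presumably repeat the same selection without any signaled weight.

```python
# Candidate (w0, w1) weight pairs from the commit message, in eighths.
# (4, 4) is listed first so that ties fall back to the usual (1/2, 1/2).
WEIGHTS_EIGHTHS = [(4, 4), (3, 5), (5, 3), (2, 6), (6, 2)]


def blend(p0, p1, w0, w1):
    """Weighted average of two pixel values with rounding; w0 + w1 == 8."""
    return (w0 * p0 + w1 * p1 + 4) >> 3


def pick_implicit_weight(border_rec, border_pred0, border_pred1):
    """Return the weight pair whose blend best matches the reconstructed border.

    All three arguments are equal-length lists of pixel values taken from
    the row above and the column left of the block (hypothetical layout).
    """
    def sad(weights):
        w0, w1 = weights
        return sum(abs(r - blend(a, b, w0, w1))
                   for r, a, b in zip(border_rec, border_pred0, border_pred1))

    # min() keeps the first of any tied candidates.
    return min(WEIGHTS_EIGHTHS, key=sad)
```

When the first predictor's border matches the reconstruction much better than the second's, the sketch leans toward the (3/4, 1/4) pair; when both match equally well it stays at (1/2, 1/2).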
  5. 21 Mar, 2013 2 commits
    • Optimize 16x16 idct10 function · 869d6c05
      Yunqing Wang authored
      Wrote sse2 version of vp9_short_idct10_16x16 function. Compared
      to c version, the sse2 version is 2.3X faster.
      
      Change-Id: I314c4f09369648721798321eeed6f58e38857f26
    • Optimize 16x16 idct function · ec310066
      Yunqing Wang authored
      Wrote sse2 version of vp9_short_idct16x16 function. Compared to c
      version, the sse2 version is over 2.5X faster.
      
      Change-Id: I38536e2b846427a2cc5c5423aaf305fd0e605d61
  6. 18 Mar, 2013 1 commit
    • Optimize 8x8 idct function · 6344c84c
      Yunqing Wang authored
      Wrote sse2 functions of vp9_short_idct8x8 and vp9_short_idct10_8x8.
      Compared to c version, the sse2 version is 2X faster. The decoder
      test didn't show noticeable gain since 8x8 idct doesn't take much
      of decoding time (less than 1% in my test).
      
      Change-Id: I56313e18cd481700b3b52c4eda5ca204ca6365f3
  7. 15 Mar, 2013 1 commit
    • Faster vp9_short_fdct16x16. · 4418b790
      Christian Duvivier authored
      Scalar path is about 1.5x faster (3.1% overall encoder speedup).
      SSE2 path is about 7.2x faster (7.8% overall encoder speedup).
      
      Change-Id: I06da5ad0cdae2488431eabf002b0d898d66d8289
  8. 13 Mar, 2013 1 commit
    • removed reference to "LLM" and "x8" · 00555263
      Yaowu Xu authored
      The commit changed the names of files and functions to remove obsolete
      references to LLM and x8.
      
      Change-Id: I973b20fc1a55149ed68b5408b3874768e6f88516
  9. 08 Mar, 2013 1 commit
    • Add vp9_idct4_1d_sse2 · 11ca81f8
      Yunqing Wang authored
      Added SSE2 idct4_1d which is called by vp9_short_iht4x4. Also,
      modified the parameter type passed to vp9_short_iht functions to
      make it work with rtcd prototype.
      
      Change-Id: I81ba7cb4db6738f1923383b52a06deb760923ffe
  10. 07 Mar, 2013 1 commit
  11. 06 Mar, 2013 1 commit
    • Optimize add_residual function · 943c6d71
      Yunqing Wang authored
      Optimized adding diff to predictor, which gave 0.8% decoder
      performance gain.
      
      Change-Id: Ic920f0baa8cbd13a73fa77b7f9da83b58749f0f8
  12. 05 Mar, 2013 1 commit
    • Make superblocks independent of macroblock code and data. · 111ca421
      Ronald S. Bultje authored
      Split macroblock and superblock tokenization and detokenization
      functions and coefficient-related data structs so that the bitstream
      layout and related code of superblock coefficients looks less like it's
      a hack to fit macroblocks in superblocks.
      
      In addition, unify chroma transform size selection from luma transform
      size (i.e. always use the same size, as long as it fits the predictor);
      in practice, this means 32x32 and 64x64 superblocks using the 16x16 luma
      transform will now use the 16x16 (instead of the 8x8) chroma transform,
      and 64x64 superblocks using the 32x32 luma transform will now use the
      32x32 (instead of the 16x16) chroma transform.
      
      Lastly, add a trellis optimize function for 32x32 transform blocks.
      
      HD gains about 0.3%, STDHD about 0.15% and derf about 0.1%. There are
      a few negative points here and there that I might want to analyze
      a little closer.
      
      Change-Id: Ibad7c3ddfe1acfc52771dfc27c03e9783e054430
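The unified chroma transform size rule from the commit above ("always use the same size, as long as it fits the predictor") can be illustrated with a small sketch. It assumes 4:2:0 sampling, so the chroma predictor is half the superblock size in each dimension; the function name `uv_transform_size` is hypothetical and is not taken from the codebase.

```python
def uv_transform_size(sb_size, luma_tx_size):
    """Chroma transform size for a square superblock, assuming 4:2:0 sampling.

    The chroma predictor is half the luma block in each dimension, so the
    chroma transform is the luma transform size capped at sb_size // 2.
    """
    return min(luma_tx_size, sb_size // 2)
```

Under these assumptions, `uv_transform_size(64, 32)` gives 32 and `uv_transform_size(32, 16)` gives 16, matching the cases listed in the commit message.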
  13. 04 Mar, 2013 1 commit
  14. 02 Mar, 2013 1 commit
  15. 01 Mar, 2013 1 commit
    • Add eob<=10 case in idct32x32 · c550bb3b
      Yunqing Wang authored
      Simplified idct32x32 calculation when there are only 10 or fewer
      non-zero coefficients in a 32x32 block. This helps the decoder
      performance.
      
      Change-Id: If7f8893d27b64a9892b4b2621a37fdf4ac0c2a6d
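Why a small end-of-block count allows a cheaper inverse transform can be shown with a floating-point sketch. It assumes that the first ten positions in scan order all fall inside the top-left 4x4 corner of the block, which the commit message does not state; the helper names `idct_1d` and `idct_2d` are hypothetical, and the real function is an integer transform rather than this textbook DCT.

```python
import math


def idct_1d(coeffs):
    """Orthonormal inverse DCT (DCT-III) of a list, in floating point."""
    n = len(coeffs)
    out = []
    for x in range(n):
        s = coeffs[0] / math.sqrt(n)
        for k in range(1, n):
            s += math.sqrt(2.0 / n) * coeffs[k] * math.cos(
                math.pi * (2 * x + 1) * k / (2 * n))
        out.append(s)
    return out


def idct_2d(block, nonzero_rows=None):
    """Inverse 2-D DCT of a square block.

    Only the first `nonzero_rows` rows go through the row pass; the
    remaining rows are assumed to hold only zero coefficients.
    """
    n = len(block)
    rows = n if nonzero_rows is None else nonzero_rows
    # Row pass: skipped rows stay zero because the transform is linear.
    tmp = [idct_1d(block[r]) if r < rows else [0.0] * n for r in range(n)]
    # Column pass: always covers every column.
    cols = [idct_1d([tmp[r][c] for r in range(n)]) for c in range(n)]
    return [[cols[c][r] for c in range(n)] for r in range(n)]
```

With coefficients confined to the first four rows, `idct_2d(block, nonzero_rows=4)` runs 4 row transforms instead of 32 and produces the same output as the full transform.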
  16. 28 Feb, 2013 4 commits
  17. 27 Feb, 2013 2 commits
    • Remove unused vp9_copy32xn · 7ad8dbe4
      John Koleszar authored
      This function was part of an optimization used in VP8 that required
      caching two macroblocks. This is unused in VP9, and might not
      survive refactoring to support superblocks, so removing it for now.
      
      Change-Id: I744e585206ccc1ef9a402665c33863fc9fb46f0d
    • Optimize vp9_dc_only_idct_add_c function · 35bc02c6
      Yunqing Wang authored
      Wrote SSE2 version of vp9_dc_only_idct_add_c function. In order to
      improve performance, clipped the absolute diff values to [0, 255].
      This allowed us to keep the additions/subtractions in 8 bits.
      Test showed an over 2% decoder performance increase.
      
      Change-Id: Ie1a236d23d207e4ffcd1fc9f3d77462a9c7fe09d
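The 8-bit trick mentioned in the commit above can be modelled in a few lines. This is a sketch of the idea only: it emulates unsigned saturating byte arithmetic in plain Python, assumes the DC residual has already been computed as a signed integer, and uses the hypothetical names `sat_add_u8`, `sat_sub_u8`, and `dc_only_add`.

```python
def sat_add_u8(a, b):
    """Unsigned 8-bit saturating add (what SSE2 paddusb does per byte)."""
    return min(a + b, 255)


def sat_sub_u8(a, b):
    """Unsigned 8-bit saturating subtract (what SSE2 psubusb does per byte)."""
    return max(a - b, 0)


def dc_only_add(pred, dc):
    """Add a signed DC residual to 8-bit predictor pixels using only
    8-bit saturating operations."""
    mag = min(abs(dc), 255)  # clip |dc| so that it fits in one byte
    op = sat_add_u8 if dc >= 0 else sat_sub_u8
    return [op(p, mag) for p in pred]
```

Clipping the magnitude loses nothing: any residual of 255 or more already saturates every 8-bit pixel, so the result equals clamping `pred + dc` to [0, 255] in wider arithmetic.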
  18. 25 Feb, 2013 1 commit
    • clean up forward and inverse hybrid transform · 77a3becf
      Jingning Han authored
      Rebased.
      
      Remove the old matrix multiplication transform computation. The 16x16
      ADST/DCT can be switched on/off and evaluated by setting ACTIVE_HT16
      300/0 in vp9/common/vp9_blockd.h.
      
      Change-Id: Icab2dbd18538987e1dc4e88c45abfc4cfc6e133f
  19. 23 Feb, 2013 1 commit
  20. 22 Feb, 2013 1 commit
    • Forward butterfly hybrid transform · babbd5d1
      Jingning Han authored
      This patch includes 4x4, 8x8, and 16x16 forward butterfly ADST/DCT
      hybrid transform. The kernel of 4x4 ADST is sin((2k+1)*(n+1)/(2N+1)).
      The kernel of 8x8/16x16 ADST is of the form sin((2k+1)*(2n+1)/4N).
      
      Change-Id: I8f1ab3843ce32eb287ab766f92e0611e1c5cb4c1
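The two ADST kernels named in the commit above can be written out as basis matrices. The sketch below assumes the sine argument carries a factor of pi, which the commit message leaves implicit, and it adds unit-norm scale factors that the message does not mention; the function names are hypothetical. It uses floating point only, whereas the codec's transforms are integer butterflies.

```python
import math


def adst4_basis(n_size=4):
    """Rows k of sin((2k+1)(n+1)*pi/(2N+1)), scaled to unit norm."""
    scale = 2.0 / math.sqrt(2 * n_size + 1)
    return [[scale * math.sin(math.pi * (2 * k + 1) * (n + 1) / (2 * n_size + 1))
             for n in range(n_size)] for k in range(n_size)]


def adst_variant_basis(n_size):
    """Rows k of sin((2k+1)(2n+1)*pi/(4N)), scaled to unit norm."""
    scale = math.sqrt(2.0 / n_size)
    return [[scale * math.sin(math.pi * (2 * k + 1) * (2 * n + 1) / (4 * n_size))
             for n in range(n_size)] for k in range(n_size)]


def gram_error(m):
    """Largest deviation of M * M^T from the identity matrix."""
    size = len(m)
    worst = 0.0
    for i in range(size):
        for j in range(size):
            dot = sum(m[i][t] * m[j][t] for t in range(size))
            worst = max(worst, abs(dot - (1.0 if i == j else 0.0)))
    return worst
```

Both matrices come out orthonormal under these assumptions, so the inverse transform is simply the transpose in this floating-point model.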
  21. 21 Feb, 2013 1 commit
  22. 20 Feb, 2013 1 commit
  23. 19 Feb, 2013 1 commit
    • 16x16 butterfly inverse ADST/DCT hybrid transform · cd907b16
      Jingning Han authored
      rebased.
      
      This patch includes 16x16 butterfly inverse ADST/DCT hybrid
      transform. It uses the variant ADST of kernel
          sin((2k+1)*(2n+1)/4N),
      which allows a butterfly implementation.
      
      The coding gains as compared to DCT 16x16 are about 0.1% for
      both derf and std-hd. It is noteworthy that for the std-hd set
      many sequences gain about 0.5%, some 0.2%. There are also a few
      points that show -1% to -3% performance. Hence the average
      comes to about 0.1%.
      
      Change-Id: Ie80ac84cf403390f6e5d282caa58723739e5ec17
  24. 15 Feb, 2013 1 commit
  25. 13 Feb, 2013 2 commits
    • fix the lossless experiment · 16f25f9d
      Yaowu Xu authored
      Change-Id: I95acfc1417634b52d344586ab97f0abaa9a4b256
    • WIP: ssse3 version of convolve avg functions · 30f866f4
      Scott LaVarnway authored
      Adds initial ssse3 convolve avg functions, which is one step closer
      to using x86inc.asm.  The decoder performance improved by 8% for
      the test clip used.  This should be revisited later to see if
      averaging outside the loop is better than having many similar
      filter functions.
      
      Change-Id: Ice3fafb423b02710b0448ffca18b296bcac649e9
  26. 11 Feb, 2013 1 commit
    • butterfly inverse 4x4 ADST · 57e995ff
      Jingning Han authored
      fixed format issues.
      
      Implement the inverse 4x4 ADST using 9 multiplications. For this
      particular dimension, the original ADST transform can be
      factorized into simpler operations, hence is retained.
      
      Change-Id: Ie5d9749942468df299ab74e90d92cd899569e960
  27. 09 Feb, 2013 2 commits
  28. 08 Feb, 2013 1 commit
    • Restore SSSE3 subpixel filters in new convolve framework · 29d47ac8
      John Koleszar authored
      This commit adds the 8 tap SSSE3 subpixel filters back into the code
      underneath the convolve API. The C code is still called for 4x4
      blocks, as well as compound prediction modes. This restores the
      encode performance to be within about 8% of the baseline.
      
      Change-Id: Ife0d81477075ae33c05b53c65003951efdc8b09c
  29. 07 Feb, 2013 1 commit
    • Butterfly ADST based hybrid transform · d15e1da4
      Jingning Han authored
      Refactor the 8x8 inverse hybrid transform. It is now consistent
      with the new inverse DCT. Overall performance loss (due to the
      use of this variant ADST, and the rounding errors in the butterfly
      implementation) for std-hd is -0.02.
      
      Fixed BUILD warning.
      
      Devise a variant of the original ADST, which allows butterfly
      computation structure. This new transform has kernel of the
      form: sin((2k+1)*(2n+1) / (4N)). One of its butterfly structures
      using floating-point multiplications was reported in Z. Wang,
      "Fast algorithms for the discrete W transform and for the discrete
      Fourier transform", IEEE Trans. on ASSP, 1984.
      
      This patch includes the butterfly implementation of the inverse
      ADST/DCT hybrid transform of dimension 8x8.
      
      Change-Id: I3533cb715f749343a80b9087ce34b3e776d1581d
  30. 06 Feb, 2013 1 commit
  31. 05 Feb, 2013 3 commits
    • [WIP] Add column-based tiling. · 1407bdc2
      Ronald S. Bultje authored
      This patch adds column-based tiling. The idea is to make each tile
      independently decodable (after reading the common frame header) and
      also independently encodable (minus within-frame cost adjustments in
      the RD loop) to speed up hardware & software en/decoders that use
      multi-threading. Column-based tiling has the added advantage (over
      other tiling methods) that it minimizes realtime use-case latency,
      since all threads can start encoding data as soon as the first SB-row
      worth of data is available to the encoder.
      
      There is some test code that does random tile ordering in the decoder,
      to confirm that each tile is indeed independently decodable from other
      tiles in the same frame. At tile edges, all contexts assume default
      values (i.e. 0, 0 motion vector, no coefficients, DC intra4x4 mode),
      and motion vector search and ordering do not cross tiles in the same
      frame.
      
      Tile independence is not maintained between frames ATM, i.e. tile 0 of
      frame 1 is free to use motion vectors that point into any tile of frame
      0. We support 1 (i.e. no tiling), 2 or 4 column-tiles.
      
      The loopfilter crosses tile boundaries. I discussed this briefly with Aki
      and he says that's OK. An in-loop loopfilter would need to do some sync
      between tile threads, but that shouldn't be a big issue.
      
      Results: with tiling disabled, we go up slightly because of improved edge
      use in the intra4x4 prediction. With 2 tiles, we lose about 1% on derf,
      ~0.35% on HD and ~0.55% on STD/HD. With 4 tiles, we lose another ~1.5%
      on derf, ~0.77% on HD and ~0.85% on STD/HD. Most of this loss is
      concentrated in the low-bitrate end of clips, and most of it is because
      of the loss of edges at tile boundaries and the resulting loss of intra
      predictors.
      
      TODO:
      - more tiles (perhaps allow row-based tiling also, and max. 8 tiles)?
      - maybe optionally (for EC purposes), motion vectors themselves
        should not cross tile edges, or we should emulate such borders as
        if they were off-frame, to limit error propagation to within one
        tile only. This doesn't have to be the default behaviour but could
        be an optional bitstream flag.
      
      Change-Id: I5951c3a0742a767b20bc9fb5af685d9892c2c96f
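How a frame might be divided into 1, 2 or 4 column tiles can be sketched as below. The commit message does not give the split rule, so the even division on superblock boundaries, the 64-pixel superblock width, and the name `tile_column_bounds` are all assumptions made for illustration.

```python
def tile_column_bounds(frame_width, num_tiles, sb_size=64):
    """Split a frame into `num_tiles` column tiles on superblock boundaries.

    Returns a list of (start, end) pairs in superblock-column units, with
    `end` exclusive. The even split is an assumed rule, not the codec's.
    """
    if num_tiles not in (1, 2, 4):
        raise ValueError("the commit supports 1, 2 or 4 column tiles")
    sb_cols = (frame_width + sb_size - 1) // sb_size  # round up
    edges = [i * sb_cols // num_tiles for i in range(num_tiles + 1)]
    return list(zip(edges[:-1], edges[1:]))
```

Each tile spans the full frame height, so every tile has work available as soon as the first superblock row is ready, which is the latency argument made in the commit message.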
    • Add SSE3 versions for sad{32x32,64x64}x4d functions. · 58c983d1
      Ronald S. Bultje authored
      Overall encoding about 15% faster.
      
      Change-Id: I176a775c704317509e32eee83739721804120ff2
    • Convert subpixel filters to use convolve framework · 7a07eea1
      John Koleszar authored
      Update the code to call the new convolution functions to do subpixel
      prediction rather than the existing functions. Remove the old C and
      assembly code, since it is unused. This causes a 50% performance
      reduction on the decoder, but that will be resolved when the asm for
      the new functions is available.
      
      There is no consensus for whether 6-tap or 2-tap predictors will be
      supported in the final codec, so these filters are implemented in
      terms of the 8-tap code, so that quality testing of these modes
      can continue. Implementing the lower complexity algorithms is a
      simple exercise, should it be necessary.
      
      This code produces slightly better results in the EIGHTTAP_SMOOTH
      case, since the filter is now applied in only one direction when
      the subpel motion is only in one direction. Like the previous code,
      the filtering is skipped entirely on full-pel MVs. This combination
      seems to give the best quality gains, but this may be indicative of a
      bug in the encoder's filter selection, since the encoder could
      achieve the result of skipping the filtering on full-pel by selecting
      one of the other filters. This should be revisited.
      
      Quality gains on derf are positive on almost all clips. The only clip
      that seemed to be hurt at all datarates was football
      (-0.115% PSNR average, -0.587% min). Overall averages 0.375% PSNR,
      0.347% SSIM.
      
      Change-Id: I7d469716091b1d89b4b08adde5863999319d69ff
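The filtering policy described in the commit above (skip filtering on full-pel motion vectors, and filter in one direction only when the subpel motion is in one direction) can be sketched as follows. The tap values are illustrative 8-tap kernels that sum to 128 and are not claimed to match the codec's tables; the edge replication and the names `convolve_row` and `predict` are likewise assumptions.

```python
# Illustrative 8-tap kernels (taps sum to 128); assumed values.
FULL_PEL = [0, 0, 0, 128, 0, 0, 0, 0]
HALF_PEL = [-1, 6, -19, 78, 78, -19, 6, -1]


def clamp_u8(v):
    """Clamp an integer to the 8-bit pixel range."""
    return max(0, min(255, v))


def convolve_row(row, taps):
    """Apply an 8-tap filter along one row, replicating edge pixels."""
    n = len(row)
    out = []
    for x in range(n):
        acc = 0
        for t, w in enumerate(taps):
            idx = min(max(x + t - 3, 0), n - 1)  # tap 3 sits on pixel x
            acc += w * row[idx]
        out.append(clamp_u8((acc + 64) >> 7))  # round, then divide by 128
    return out


def predict(block, taps_x=None, taps_y=None):
    """Filter only in the directions that have a fractional MV component.

    Passing None for a direction means the motion is full-pel there, so
    that pass is skipped entirely.
    """
    out = [list(r) for r in block]
    if taps_x is not None:
        out = [convolve_row(r, taps_x) for r in out]
    if taps_y is not None:
        cols = [convolve_row(list(c), taps_y) for c in zip(*out)]
        out = [list(r) for r in zip(*cols)]
    return out
```

Calling `predict(block)` with no taps returns the block unchanged, which corresponds to the full-pel case; `predict(block, taps_x=HALF_PEL)` runs the horizontal pass only.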