1. 15 Oct, 2013 1 commit
  2. 14 Oct, 2013 1 commit
    • Jingning Han's avatar
      Move token_cache from cost_coeffs to MACROBLOCK · f60a3910
      Jingning Han authored
      This commit moves token_cache buffer into macroblock struct, instead
      of defining as a local variable in cost_coeffs. This avoids repeatedly
      re-allocating memory space in the rate-distortion optimization loop.
      
      The runtime at speed 0 reduces:
      bus 2000kbps, 161692ms to 159951ms
      football 600kbps, 229505ms to 225821ms
      
      Change-Id: If7da6b0b6d8c5138a16271a33c4548fba33d8840
      f60a3910
  3. 10 Oct, 2013 1 commit
    • Jingning Han's avatar
      Re-design rate-distortion cost tracking buffers · fc19243c
      Jingning Han authored
      This commit re-designs the per transformed block rate-distortion
      costs tracking buffers. It removes redundant buffer usage, makes
      the needed context memory allocation per VP9_COMP instance and
      reuses the same buffer sets inside the rate-distortion optimization
      search loop, thereby avoiding repeatedly requiring memory space.
      
      It reduces speed 0 runtime:
      
      bus at 2000 kbps from 166763ms to 158967ms,
      football at 600 kbps from 246614ms to 234257ms.
      
      Both about 5% speed-up. Local tests suggest about 2% to 5% speed-up
      for speed 1 and 2 settings. This does not change compression
      performance.
      
      Change-Id: I363514c5276b5cf9a38c7251088ffc6ab7f9a4c3
      fc19243c
  4. 09 Oct, 2013 1 commit
    • Jingning Han's avatar
      Deprecate the use of PARTITION_INFO from encoder · 03fe08ca
      Jingning Han authored
      Use b_mode_info to store the inter prediction mode of sub8x8 block,
      in replacement of the use of partition_info. Remove redundant buffer
      update for partition_info. For bus_cif at 2000 kbps, this seem to make
      speed 0 about 1% faster.
      
      Change-Id: Id1b3be45e75a24fb4b42335ac480c23e440978f6
      03fe08ca
  5. 01 Oct, 2013 1 commit
  6. 23 Sep, 2013 1 commit
    • Jingning Han's avatar
      Enable per transformed block zero coeffs forcing · a517343c
      Jingning Han authored
      This commit enables forcing all coefficients zero per transformed
      block, when its rate-distortion cost is lower than regular coeff
      quantization.
      
      The overall performance improvement (including its parent patch on
      calculating rd cost per transformed block) at speed 1:
      derf:  0.298%
      yt:    0.452%
      hd:    0.741%
      stdhd: 0.006%
      
      Change-Id: I66005fe0fd7af192c3eba32e02fd6d77952accb5
      a517343c
  7. 13 Sep, 2013 1 commit
    • Jingning Han's avatar
      Adaptive motion search control · c4826c59
      Jingning Han authored
      This commit enables adaptive constraint on motion search range for
      smaller partitions, given the motion vectors of collocated larger
      partition as a candidate initial search point.
      
      It makes speed 0 runtime of bus at CIF and 2000 kbps goes from
      167s down to 162s (3% speed-up), at 0.01dB performance gains. In
      the settings of speed 1, this makes the runtime goes from 33687 ms
      to 32142 ms (4.5% speed-up), at 0.03dB performance gains.
      
      Compression performance wise, it gains at speed 1:
      derf  0.118%
      yt    0.237%
      hd    0.203%
      stdhd 0.438%
      
      Change-Id: Ic8b34c67810d9504a9579bef2825d3fa54b69454
      c4826c59
  8. 27 Aug, 2013 1 commit
  9. 24 Aug, 2013 1 commit
  10. 07 Aug, 2013 1 commit
    • Jingning Han's avatar
      Use low precision 32x32fdct for encodemb in speed1 · debb9c68
      Jingning Han authored
      The low precision 32x32 fdct has all the intermediate steps within
      16-bit depth, hence allowing faster SSE2 implementation, at the
      expense of larger round-trip error. It was used in the rate-distortion
      optimization search loop only.
      
      Using the low precision version, in replace of the high precision one,
      affects the compression performance by about 0.7% (derf, stdhd) at
      speed 0. For speed 1, it makes derf set down by only 0.017%.
      
      Change-Id: I4e7d18fac5bea5317b91c8e7dabae143bc6b5c8b
      debb9c68
  11. 05 Aug, 2013 1 commit
    • Deb Mukherjee's avatar
      Add variance based mode/skipping · 8b3faccb
      Deb Mukherjee authored
      Adds a speed feature to skip all intra modes other than
      DC_PRED if the source variance is small. This feature is
      made part of speed 1 and up.
      
      Results on derf300: psnr -0.07%, speedup about 1-2%
      
      Also uses the source variance to fine-tune the early
      termination criteria when FLAG_EARLY_TERMINATE is on.
      This feature is made part of speed 2 and up.
      
      Results on derf300: psnr -0.52%, speedup about 5-7%
      
      Change-Id: I59e38aa836557cfa5405ae706fc64815cbfe4232
      8b3faccb
  12. 03 Aug, 2013 1 commit
  13. 29 Jul, 2013 2 commits
  14. 27 Jul, 2013 1 commit
    • Ronald S. Bultje's avatar
      Inverse dimension order in token_cost array. · 118ccdcd
      Ronald S. Bultje authored
      This allows us to increment the position at the band-level only as
      we go from one band to the next; more importantly, that allows us to
      use an add instead of multiply instruction, and omit the instruction
      altogether if the band doesn't change from one coef to the next, thus
      being slightly faster (probably more noticeable on systems where a
      multiply is expensive, like arm).
      
      Change-Id: I4343fe35b9f9a47fa00b217bdcbf5f91ff96c381
      118ccdcd
  15. 19 Jul, 2013 1 commit
    • Deb Mukherjee's avatar
      Reworked the auto_mv_step_size speed feature · 302698fb
      Deb Mukherjee authored
      This patch modifies the auto_mv_step_size speed feature to
      use a combination of the maximum magnitude mv from the last
      inter frame, and the maximum magnitude mv for the two reference
      mvs with the same reference. For arf frames, the max mav step
      for the resolution is used.
      The bounds therefore are slightly tighter. The feature is made
      a speed 1 feature.
      
      Rebased.
      
      Results (when this feature is turned on over speed 0):
      derfraw300: -0.046% psnr, about 5+% speedup
      (tested on football: goes from 4m30.760s to 4m17.410s).
      
      Change-Id: If492797a61b0b4b3e58c0b8f86afb880165fc9f6
      302698fb
  16. 18 Jul, 2013 1 commit
  17. 17 Jul, 2013 1 commit
    • Yunqing Wang's avatar
      Speed up motion estimation using small partitions' result(experiment) · df90d58f
      Yunqing Wang authored
      Current partition checking starts from small sizes, and then goes up
      to large sizes. This experiment uses the small partitions' motion
      estimation result, which is already available, to speed up the
      large partition's motion estimation. We can decide to skip some
      patition checkings if they are unlikely choices. We could use the
      motion vector(MV) result as current partition's prediction MV, limit
      the search range and reference frame.
      
      Current result at speed 1:
      psnr loss: 1.19% for stdhd, 0.287% for derf.
      speed gain: 14% for sunflower(hd), 11% for akiyo.
      
      Further improvement will be done later.
      
      Change-Id: I5abfd070e9cace2e91e2a0247d1325df313887ab
      df90d58f
  18. 15 Jul, 2013 1 commit
    • Jingning Han's avatar
      Skip duplicate block encoding in the rd loop · faff6ed0
      Jingning Han authored
      This speed feature allows the encoder to largely remove the spatial
      dependency between blocks inside a 64x64 superblock, thereby removing
      the need to repeatedly encode superblocks per partition type in the
      rate-distortion optimization loop.
      
      A major challenge lies in the intra modes tested in the rate-distortion
      optimization loop. The subsequent blocks do not have access to the
      reconstructed boundary pixels without the intermediate coding steps.
      This was resolved by using the original pixels for intra prediction
      in the rd loop, followed by an appropriately designed distortion
      modeling on the quantization parameters. Experiments also suggested
      that the performance impact is more discernible at lower bit-rate/psnr
      settings. Hence a quantizer dependent threshold is applied to deactivate
      skip of block coding.
      
      For bus_cif at 2000 kbps,
      speed 0: runtime 269854ms -> 237774ms (12% speed-up) at 0.05dB
               performance loss.
      
      speed 1: runtime 65312ms  -> 61536ms, (7% speed-up) at 0.04dB
               performance loss.
      
      This operation is currently turned on in settings of speed 1.
      
      Change-Id: Ib689741dfff8dd38365d8c1b92860a3e176f56ec
      faff6ed0
  19. 08 Jul, 2013 2 commits
    • Ronald S. Bultje's avatar
      Don't recalculate mv_ref costs for each block/partition. · 8fde07a3
      Ronald S. Bultje authored
      Changes cost_mv_ref() into doing a LUT into pre-calculated cost
      arrays instead. Encode time of first 50 frames of bus (speed 0)
      @ 1500kbps goes from 2min11.6 to 2min10.9, i.e. 0.5% faster overall.
      
      Change-Id: If186e92c34c201b29cbbc058785a15c9c09e433a
      8fde07a3
    • Ronald S. Bultje's avatar
      Make frame-wide filter-type decision fully RD-based. · ed995afb
      Ronald S. Bultje authored
      Overall, on all test sets, this gains about +0.2% on all metrics.
      City is a clip where this really hurts (-1.0% on all metrics), I'm
      not quite sure why yet. Maybe interesting to look into in the future.
      
      Change-Id: I6f0eecb20e72f0194633270d30bf00d76d9eae78
      ed995afb
  20. 01 Jul, 2013 1 commit
    • Ronald S. Bultje's avatar
      Quantize (64-bit only, for now) SSSE3 SIMD. · 7353ceab
      Ronald S. Bultje authored
      Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
      goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
      x86-64 only, it needs some minor modifications to be 32bit compatible,
      because it uses 15 xmm registers, whereas 32bit only has 8.
      
      Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
      7353ceab
  21. 28 Jun, 2013 2 commits
    • Ronald S. Bultje's avatar
      Some minor optimizations for cost_coeffs(). · 91d223bd
      Ronald S. Bultje authored
      Cycle timings for first 3 frames of bus (speed 0) at 1500kbps:
      4x4: 298 -> 234 cycles
      8x8: 1227 -> 878 cycles
      16x16: 23426 -> 18134 cycles
      32x32: 4906 -> 3664 cycles
      
      Total encode time of first 50 frames of bus @ 1500kbps (speed 0) goes
      from 3min0.7 to 2min51.6 seconds, i.e. 5.3% faster.
      
      Change-Id: I68a0e1b530b0563b84a67342cca4b45146077e95
      91d223bd
    • Ronald S. Bultje's avatar
      Make coefficient skip condition an explicit RD choice. · af660715
      Ronald S. Bultje authored
      This commit replaces zrun_zbin_boost, a method of biasing non-zero
      coefficients following runs of zero-coefficients to be rounded towards
      zero, with an explicit skip-block choice in the RD loop.
      
      The logic is basically that if individual coefficients should be rounded
      towards zero (from a RD point of view), the trellis/optimize loop should
      take care of it. If whole blocks should be zero (from a RD point of
      view), a single RD check is much more efficient than a complete
      serialization of the quantization loop.
      
      Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim.
      SIMD for quantize will follow in a separate patch. Results for other
      test sets pending.
      
      Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4
      af660715
  22. 18 Jun, 2013 1 commit
    • Jingning Han's avatar
      Make fdct32 computation flow within 16bit range · a41a4860
      Jingning Han authored
      This commit makes use of dual fdct32x32 versions for rate-distortion
      optimization loop and encoding process, respectively. The one for
      rd loop requires only 16 bits precision for intermediate steps.
      The original fdct32x32 that allows higher intermediate precision (18
      bits) was retained for the encoding process only.
      
      This allows speed-up for fdct32x32 in the rd loop. No performance
      loss observed.
      
      Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3
      a41a4860
  23. 31 May, 2013 2 commits
  24. 30 May, 2013 1 commit
  25. 29 May, 2013 1 commit
    • Deb Mukherjee's avatar
      Balancing coef-tree to reduce bool decodes · b8b3f1a4
      Deb Mukherjee authored
      This patch changes the coefficient tree to move the EOB to below
      the ZERO node in order to save number of bool decodes.
      
      The advantages of moving EOB one step down as opposed to two steps down
      in the other parallel patch are: 1. The coef modeling based on
      the One-node becomes independent of the tree structure above it, and
      2. Fewer conext/counter increases are needed.
      
      The drawback is that the potential savings in bool decodes will be
      less, but assuming that 0s are much more predominant than 1's the
      potential savings is still likely to be substantial.
      
      Results on derf300: -0.237%
      
      Change-Id: Ie784be13dc98291306b338e8228703a4c2ea2242
      b8b3f1a4
  26. 28 May, 2013 1 commit
    • Jingning Han's avatar
      further clean-ups on intra4x4 coding · 4729a6f3
      Jingning Han authored
      Removed one 4x4 prediction step that was unnessary in the rd loop.
      Removed a unused modecosts estimate from encoder side.
      
      Change-Id: I65221a52719d6876492996955ef04142d2752d86
      4729a6f3
  27. 27 May, 2013 1 commit
    • Yaowu Xu's avatar
      a few clean-ups · 2b96ffe0
      Yaowu Xu authored
      1. remove prediction mode conversion
      2. unified bmode, same for key and non-key frame
      3. set I4X4_PRED count for pdf to 0, as I4X4_PRED is no longer
      coded ever. It is determined by ref_frame and block partition
      
      Change-Id: If5b282957c24339b241acdb9f2afef85658fe47d
      2b96ffe0
  28. 26 May, 2013 1 commit
    • Ronald S. Bultje's avatar
      Remove splitmv. · 5cac6607
      Ronald S. Bultje authored
      Also do per-partition motion vector referencing in <sb8x8 partitions,
      and adjust mvref finding for sub8x8 partitions.
      
      Change-Id: Id3ed1ed4d2a8910d11d327db6cc63b8eb79f941f
      5cac6607
  29. 23 May, 2013 1 commit
    • Jingning Han's avatar
      Merge 4x4 block level partition into codebase · 7ac5ac52
      Jingning Han authored
      Move 4x4/4x8/8x4 partition coding out of experimental list.
      
      This commit fixed the unit test failure issues. It also resolved
      the merge conflicts between 4x4 block level partition and iterative
      motion search for comp_inter_inter.
      
      Change-Id: I898671f0631f5ddc4f5cc68d4c62ead7de9c5a58
      7ac5ac52
  30. 14 May, 2013 1 commit
    • Jingning Han's avatar
      Enable recursive partition down to 4x4 · 1f26840f
      Jingning Han authored
      This commit allows the rate-distortion optimization recursion
      at encoder to go down to 4x4 block size. It deprecates the use
      of I4X4_PRED and SPLITMV syntax elements from bit-stream
      writing/reading. Will remove the unused probability models in
      the next patch.
      
      The partition type search and bit-stream are now capable of
      supporting the rectangular partition of 8x8 block, i.e., 8x4
      and 4x8. Need to revise the rate-distortion parts to get these
      two partition tested in the rd loop.
      
      Change-Id: I0dfe3b90a1507ad6138db10cc58e6e237a06a9d6
      1f26840f
  31. 10 May, 2013 1 commit
    • Jingning Han's avatar
      Enable recursive partition type search · e44678c0
      Jingning Han authored
      This commit enables the search for the optimal superblock
      partition types in the recursion form. The intention is to
      make the optimization process more concise and ready to
      support partition down to 4x4 block size next.
      
      Change-Id: Iae279a67df3a7cc372553c84c775bc4d2f3e4336
      e44678c0
  32. 07 May, 2013 1 commit
    • Jingning Han's avatar
      Merge SB8X8 into the codebase · 776c1482
      Jingning Han authored
      Pull sb8x8 out of experimental list. verified via borg run tests.
      Fixed unit test failures.
      
      Change-Id: I12a4bbd17395930580c048ab68becad1ffe46e76
      776c1482
  33. 03 May, 2013 1 commit
    • John Koleszar's avatar
      Separate transform and quant from vp9_encode_sb · 4529c68b
      John Koleszar authored
      This allows removing a large number of transform size specific functions,
      as well as supporting 444/alpha by routing all code through the
      subsampling-aware path.
      
      Change-Id: Ieb085cebe9f37f24fc24de179898b22abfda08a4
      4529c68b
  34. 02 May, 2013 1 commit
  35. 30 Apr, 2013 1 commit
    • Ronald S. Bultje's avatar
      sb8x8 integration in rd loop. · d068d869
      Ronald S. Bultje authored
      Work-in-progress, not yet ready for review. TODO items:
      - bitstream writing (encoder) and reading (decoder)
      - decoder reconstruction
      
      Change-Id: I5afb7284e7e0480847b47cd0097cb469433c9081
      d068d869
  36. 25 Apr, 2013 1 commit