1. 11 Nov, 2013 1 commit
  2. 06 Nov, 2013 1 commit
  3. 30 Oct, 2013 1 commit
  4. 24 Oct, 2013 1 commit
  5. 22 Oct, 2013 1 commit
    • Dmitry Kovalev's avatar
      Removing quantize_b_4x4 function pointer. · ec414372
      Dmitry Kovalev authored
      The pointer was asigned only once with vp9_regular_quantize_b_4x4, calling
      this function directly now. Also removing unused declarations:
        prototype_quantize_block
        prototype_quantize_block_pair
        prototype_quantize_mb
        vp9_regular_quantize_b_4x4_pair
        vp9_regular_quantize_b_8x8
      
      Change-Id: I14325bc2f082336820671eafbc06126651b79f73
      ec414372
  6. 19 Oct, 2013 1 commit
    • Dmitry Kovalev's avatar
      Removing NUM_ prefix from constant names. · 6d2a0da7
      Dmitry Kovalev authored
      Renames for consistency with other constants:
        NUM_FRAME_TYPES -> FRAME_TYPES
        NUM_PARTITION_CONTEXTS -> PARTITION_CONTEXTS
      
      Change-Id: I3db30acb2868eb0a424237c831087b2e264ec47f
      6d2a0da7
  7. 18 Oct, 2013 2 commits
    • Dmitry Kovalev's avatar
      Using INTER_MODES constant instead of MB_MODE_COUNT - NEARESTMV. · 18a4bd25
      Dmitry Kovalev authored
      Change-Id: Ie5ec392904d03fd5485474b33be8408108e9d3c9
      18a4bd25
    • Jingning Han's avatar
      Make memory alloc in pick_mode_context bsize aware · 72033fcf
      Jingning Han authored
      This commit makes the buffer allocation of zcoeff_blk array in
      pick_mode_context block size aware. It calculates the number of
      4x4 blocks in the partition and assigns the memory space accordingly.
      This process (and the uninitialization) is done once for each encoding
      pass. It allows memory copy of smaller buffer when possible.
      
      For football at 600kbps, the runtimes improve by about 1%:
      speed 1, 45961ms -> 45472ms
      speed 2, 23863ms -> 23598ms
      
      Change-Id: Id2ca24906fa89f46fa5fe742ec4b8efc2a61f877
      72033fcf
  8. 16 Oct, 2013 2 commits
  9. 15 Oct, 2013 3 commits
  10. 14 Oct, 2013 1 commit
    • Jingning Han's avatar
      Move token_cache from cost_coeffs to MACROBLOCK · f60a3910
      Jingning Han authored
      This commit moves token_cache buffer into macroblock struct, instead
      of defining as a local variable in cost_coeffs. This avoids repeatedly
      re-allocating memory space in the rate-distortion optimization loop.
      
      The runtime at speed 0 reduces:
      bus 2000kbps, 161692ms to 159951ms
      football 600kbps, 229505ms to 225821ms
      
      Change-Id: If7da6b0b6d8c5138a16271a33c4548fba33d8840
      f60a3910
  11. 10 Oct, 2013 1 commit
    • Jingning Han's avatar
      Re-design rate-distortion cost tracking buffers · fc19243c
      Jingning Han authored
      This commit re-designs the per transformed block rate-distortion
      costs tracking buffers. It removes redundant buffer usage, makes
      the needed context memory allocation per VP9_COMP instance and
      reuses the same buffer sets inside the rate-distortion optimization
      search loop, thereby avoiding repeatedly requiring memory space.
      
      It reduces speed 0 runtime:
      
      bus at 2000 kbps from 166763ms to 158967ms,
      football at 600 kbps from 246614ms to 234257ms.
      
      Both about 5% speed-up. Local tests suggest about 2% to 5% speed-up
      for speed 1 and 2 settings. This does not change compression
      performance.
      
      Change-Id: I363514c5276b5cf9a38c7251088ffc6ab7f9a4c3
      fc19243c
  12. 09 Oct, 2013 1 commit
    • Jingning Han's avatar
      Deprecate the use of PARTITION_INFO from encoder · 03fe08ca
      Jingning Han authored
      Use b_mode_info to store the inter prediction mode of sub8x8 block,
      in replacement of the use of partition_info. Remove redundant buffer
      update for partition_info. For bus_cif at 2000 kbps, this seem to make
      speed 0 about 1% faster.
      
      Change-Id: Id1b3be45e75a24fb4b42335ac480c23e440978f6
      03fe08ca
  13. 01 Oct, 2013 1 commit
  14. 23 Sep, 2013 1 commit
    • Jingning Han's avatar
      Enable per transformed block zero coeffs forcing · a517343c
      Jingning Han authored
      This commit enables forcing all coefficients zero per transformed
      block, when its rate-distortion cost is lower than regular coeff
      quantization.
      
      The overall performance improvement (including its parent patch on
      calculating rd cost per transformed block) at speed 1:
      derf:  0.298%
      yt:    0.452%
      hd:    0.741%
      stdhd: 0.006%
      
      Change-Id: I66005fe0fd7af192c3eba32e02fd6d77952accb5
      a517343c
  15. 13 Sep, 2013 1 commit
    • Jingning Han's avatar
      Adaptive motion search control · c4826c59
      Jingning Han authored
      This commit enables adaptive constraint on motion search range for
      smaller partitions, given the motion vectors of collocated larger
      partition as a candidate initial search point.
      
      It makes speed 0 runtime of bus at CIF and 2000 kbps goes from
      167s down to 162s (3% speed-up), at 0.01dB performance gains. In
      the settings of speed 1, this makes the runtime goes from 33687 ms
      to 32142 ms (4.5% speed-up), at 0.03dB performance gains.
      
      Compression performance wise, it gains at speed 1:
      derf  0.118%
      yt    0.237%
      hd    0.203%
      stdhd 0.438%
      
      Change-Id: Ic8b34c67810d9504a9579bef2825d3fa54b69454
      c4826c59
  16. 27 Aug, 2013 1 commit
  17. 24 Aug, 2013 1 commit
  18. 07 Aug, 2013 1 commit
    • Jingning Han's avatar
      Use low precision 32x32fdct for encodemb in speed1 · debb9c68
      Jingning Han authored
      The low precision 32x32 fdct has all the intermediate steps within
      16-bit depth, hence allowing faster SSE2 implementation, at the
      expense of larger round-trip error. It was used in the rate-distortion
      optimization search loop only.
      
      Using the low precision version, in replace of the high precision one,
      affects the compression performance by about 0.7% (derf, stdhd) at
      speed 0. For speed 1, it makes derf set down by only 0.017%.
      
      Change-Id: I4e7d18fac5bea5317b91c8e7dabae143bc6b5c8b
      debb9c68
  19. 05 Aug, 2013 1 commit
    • Deb Mukherjee's avatar
      Add variance based mode/skipping · 8b3faccb
      Deb Mukherjee authored
      Adds a speed feature to skip all intra modes other than
      DC_PRED if the source variance is small. This feature is
      made part of speed 1 and up.
      
      Results on derf300: psnr -0.07%, speedup about 1-2%
      
      Also uses the source variance to fine-tune the early
      termination criteria when FLAG_EARLY_TERMINATE is on.
      This feature is made part of speed 2 and up.
      
      Results on derf300: psnr -0.52%, speedup about 5-7%
      
      Change-Id: I59e38aa836557cfa5405ae706fc64815cbfe4232
      8b3faccb
  20. 03 Aug, 2013 1 commit
  21. 29 Jul, 2013 2 commits
  22. 27 Jul, 2013 1 commit
    • Ronald S. Bultje's avatar
      Inverse dimension order in token_cost array. · 118ccdcd
      Ronald S. Bultje authored
      This allows us to increment the position at the band-level only as
      we go from one band to the next; more importantly, that allows us to
      use an add instead of multiply instruction, and omit the instruction
      altogether if the band doesn't change from one coef to the next, thus
      being slightly faster (probably more noticeable on systems where a
      multiply is expensive, like arm).
      
      Change-Id: I4343fe35b9f9a47fa00b217bdcbf5f91ff96c381
      118ccdcd
  23. 19 Jul, 2013 1 commit
    • Deb Mukherjee's avatar
      Reworked the auto_mv_step_size speed feature · 302698fb
      Deb Mukherjee authored
      This patch modifies the auto_mv_step_size speed feature to
      use a combination of the maximum magnitude mv from the last
      inter frame, and the maximum magnitude mv for the two reference
      mvs with the same reference. For arf frames, the max mav step
      for the resolution is used.
      The bounds therefore are slightly tighter. The feature is made
      a speed 1 feature.
      
      Rebased.
      
      Results (when this feature is turned on over speed 0):
      derfraw300: -0.046% psnr, about 5+% speedup
      (tested on football: goes from 4m30.760s to 4m17.410s).
      
      Change-Id: If492797a61b0b4b3e58c0b8f86afb880165fc9f6
      302698fb
  24. 18 Jul, 2013 1 commit
  25. 17 Jul, 2013 1 commit
    • Yunqing Wang's avatar
      Speed up motion estimation using small partitions' result(experiment) · df90d58f
      Yunqing Wang authored
      Current partition checking starts from small sizes, and then goes up
      to large sizes. This experiment uses the small partitions' motion
      estimation result, which is already available, to speed up the
      large partition's motion estimation. We can decide to skip some
      patition checkings if they are unlikely choices. We could use the
      motion vector(MV) result as current partition's prediction MV, limit
      the search range and reference frame.
      
      Current result at speed 1:
      psnr loss: 1.19% for stdhd, 0.287% for derf.
      speed gain: 14% for sunflower(hd), 11% for akiyo.
      
      Further improvement will be done later.
      
      Change-Id: I5abfd070e9cace2e91e2a0247d1325df313887ab
      df90d58f
  26. 15 Jul, 2013 1 commit
    • Jingning Han's avatar
      Skip duplicate block encoding in the rd loop · faff6ed0
      Jingning Han authored
      This speed feature allows the encoder to largely remove the spatial
      dependency between blocks inside a 64x64 superblock, thereby removing
      the need to repeatedly encode superblocks per partition type in the
      rate-distortion optimization loop.
      
      A major challenge lies in the intra modes tested in the rate-distortion
      optimization loop. The subsequent blocks do not have access to the
      reconstructed boundary pixels without the intermediate coding steps.
      This was resolved by using the original pixels for intra prediction
      in the rd loop, followed by an appropriately designed distortion
      modeling on the quantization parameters. Experiments also suggested
      that the performance impact is more discernible at lower bit-rate/psnr
      settings. Hence a quantizer dependent threshold is applied to deactivate
      skip of block coding.
      
      For bus_cif at 2000 kbps,
      speed 0: runtime 269854ms -> 237774ms (12% speed-up) at 0.05dB
               performance loss.
      
      speed 1: runtime 65312ms  -> 61536ms, (7% speed-up) at 0.04dB
               performance loss.
      
      This operation is currently turned on in settings of speed 1.
      
      Change-Id: Ib689741dfff8dd38365d8c1b92860a3e176f56ec
      faff6ed0
  27. 08 Jul, 2013 2 commits
    • Ronald S. Bultje's avatar
      Don't recalculate mv_ref costs for each block/partition. · 8fde07a3
      Ronald S. Bultje authored
      Changes cost_mv_ref() into doing a LUT into pre-calculated cost
      arrays instead. Encode time of first 50 frames of bus (speed 0)
      @ 1500kbps goes from 2min11.6 to 2min10.9, i.e. 0.5% faster overall.
      
      Change-Id: If186e92c34c201b29cbbc058785a15c9c09e433a
      8fde07a3
    • Ronald S. Bultje's avatar
      Make frame-wide filter-type decision fully RD-based. · ed995afb
      Ronald S. Bultje authored
      Overall, on all test sets, this gains about +0.2% on all metrics.
      City is a clip where this really hurts (-1.0% on all metrics), I'm
      not quite sure why yet. Maybe interesting to look into in the future.
      
      Change-Id: I6f0eecb20e72f0194633270d30bf00d76d9eae78
      ed995afb
  28. 01 Jul, 2013 1 commit
    • Ronald S. Bultje's avatar
      Quantize (64-bit only, for now) SSSE3 SIMD. · 7353ceab
      Ronald S. Bultje authored
      Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
      goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
      x86-64 only, it needs some minor modifications to be 32bit compatible,
      because it uses 15 xmm registers, whereas 32bit only has 8.
      
      Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
      7353ceab
  29. 28 Jun, 2013 2 commits
    • Ronald S. Bultje's avatar
      Some minor optimizations for cost_coeffs(). · 91d223bd
      Ronald S. Bultje authored
      Cycle timings for first 3 frames of bus (speed 0) at 1500kbps:
      4x4: 298 -> 234 cycles
      8x8: 1227 -> 878 cycles
      16x16: 23426 -> 18134 cycles
      32x32: 4906 -> 3664 cycles
      
      Total encode time of first 50 frames of bus @ 1500kbps (speed 0) goes
      from 3min0.7 to 2min51.6 seconds, i.e. 5.3% faster.
      
      Change-Id: I68a0e1b530b0563b84a67342cca4b45146077e95
      91d223bd
    • Ronald S. Bultje's avatar
      Make coefficient skip condition an explicit RD choice. · af660715
      Ronald S. Bultje authored
      This commit replaces zrun_zbin_boost, a method of biasing non-zero
      coefficients following runs of zero-coefficients to be rounded towards
      zero, with an explicit skip-block choice in the RD loop.
      
      The logic is basically that if individual coefficients should be rounded
      towards zero (from a RD point of view), the trellis/optimize loop should
      take care of it. If whole blocks should be zero (from a RD point of
      view), a single RD check is much more efficient than a complete
      serialization of the quantization loop.
      
      Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim.
      SIMD for quantize will follow in a separate patch. Results for other
      test sets pending.
      
      Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4
      af660715
  30. 18 Jun, 2013 1 commit
    • Jingning Han's avatar
      Make fdct32 computation flow within 16bit range · a41a4860
      Jingning Han authored
      This commit makes use of dual fdct32x32 versions for rate-distortion
      optimization loop and encoding process, respectively. The one for
      rd loop requires only 16 bits precision for intermediate steps.
      The original fdct32x32 that allows higher intermediate precision (18
      bits) was retained for the encoding process only.
      
      This allows speed-up for fdct32x32 in the rd loop. No performance
      loss observed.
      
      Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3
      a41a4860
  31. 31 May, 2013 2 commits
  32. 30 May, 2013 1 commit