1. 24 Jul, 2013 1 commit
  2. 23 Jul, 2013 4 commits
    • Unify the use of encode_b_args/optimize_block_args · ab77828b
      Jingning Han authored
      The struct optimize_block_args is defined identically to encode_b_args.
      Remove this redundant definition, and use encode_b_args consistently.
      
      Change-Id: I1703aeeb3bacf92e98a34f4355202712110173d9
      ab77828b
    • Make xform_quant operations tx_type independent · e9e2fe8e
      Jingning Han authored
      The xform_quant() module is only used by inter modes, hence removing
      the redundant switches therein conditioned on tx_type.
      
      Change-Id: Ib87ce5b2f2e4cbf3ceb133a1108afa173c933a3f
      e9e2fe8e
    • Skip inverse transform when eob is zero · 0359ad7f
      Jingning Han authored
      When all the transform coefficients were quantized to zero, skip
      the inverse transform operation. For bus_cif at 1000 kbps, the
      runtime goes from 154967ms -> 149842ms, i.e., about 3% speed-up,
      at speed 0.
      
      Change-Id: Ic0a813fff5e28972d4888ee42d8747846a6c3cc6
      0359ad7f
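
The early-out described in this commit can be sketched as follows. This is a minimal illustration, not the libvpx code: `toy_inverse_add_4x4` is a hypothetical stand-in that simply adds the dequantized coefficients as a residual, and `reconstruct_block_4x4` and `inverse_calls` are names invented for the example. The point is only that when `eob` is zero every coefficient is zero, so the prediction already equals the reconstruction and the inverse transform can be skipped.

```c
#include <assert.h>
#include <stdint.h>

/* Counts how often the (toy) inverse transform actually runs. */
static int inverse_calls = 0;

static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical stand-in for inverse transform + add: it treats the
 * dequantized coefficients directly as the residual. NOT a real IDCT. */
static void toy_inverse_add_4x4(const int16_t *dqcoeff, uint8_t *dst,
                                int stride) {
  int r, c;
  ++inverse_calls;
  for (r = 0; r < 4; ++r)
    for (c = 0; c < 4; ++c)
      dst[r * stride + c] =
          clip_pixel(dst[r * stride + c] + dqcoeff[r * 4 + c]);
}

/* eob == 0 means all coefficients quantized to zero: the residual is
 * zero, dst (the prediction) is already final, so return early. */
static void reconstruct_block_4x4(const int16_t *dqcoeff, int eob,
                                  uint8_t *dst, int stride) {
  assert(eob >= 0 && eob <= 16);
  if (eob == 0) return;
  toy_inverse_add_4x4(dqcoeff, dst, stride);
}
```

Calling `reconstruct_block_4x4` with `eob == 0` leaves `dst` untouched and does not increment `inverse_calls`.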
    • clean up bw, bh · 86a9dec7
      Jim Bankoski authored
      Many structures use bw and bh, and they have different meanings. This CL
      starts to clean this up and removes unnecessary 2-step operations that look
      up a log and then shift.
      
      Also removed partition type multiple operation code in bitstream.c.
      
      Change-Id: I7e03e552bdfc0939738e430862e3073d30fdd5db
      86a9dec7
  3. 16 Jul, 2013 1 commit
    • Inline vp9_quantize() in xform_quant(). · 1ff94fea
      Ronald S. Bultje authored
      Cycle times:
      4x4:    151 to  131 cycles (15% faster)
      8x8:    334 to  306 cycles (9% faster)
      16x16: 1401 to 1368 cycles (2.5% faster)
      32x32: 7403 to 7367 cycles (0.5% faster)
      
      Total encode time of first 50 frames of bus @ 1500kbps (speed 0)
      goes from 1min39.2 to 1min38.6, i.e. a 0.67% overall speedup.
      
      Change-Id: I799a49460e5e3fcab01725564dd49c629bfe935f
      1ff94fea
  4. 15 Jul, 2013 3 commits
    • Inline xform_quant() in encode_block_intra(). · 6fb41874
      Ronald S. Bultje authored
      Also inline some of the block calculations to assist the compiler to
      not do silly things like calculating the same offset (or converting
      between raster/transform block offset or block, mi and pixel unit)
      many, many, many times.
      
      Cycle times:
      4x4:     584 ->   505 cycles (16% faster)
      8x8:    1651 ->  1560 cycles (6% faster)
      16x16:  7897 ->  7704 cycles (2.5% faster)
      32x32: 16096 -> 15852 cycles (1.5% faster)
      
      Overall, this saves about 0.5 seconds (1min49.8 -> 1min49.3) on the
      first 50 frames of bus (speed 0) @ 1500kbps, i.e. 0.5% overall.
      
      Change-Id: If3dd62453f8e2ab9d4ee616bc4ea956fb8874b80
      6fb41874
    • Skip inter-coded block reconstruction in rd loop · 043e0f9d
      Jingning Han authored
      Skip the inverse transform and reconstruction of inter-mode coded
      blocks in the rate-distortion optimization loop, when skip_encode_sb
      feature is turned on. This provides about 1% speed-up at speed 0,
      and 1.5% speed-up at speed 1. No performance change in either setting.
      
      Change-Id: I2932718bf4d007163702b61b16b6ff100cf9d007
      043e0f9d
    • Skip duplicate block encoding in the rd loop · faff6ed0
      Jingning Han authored
      This speed feature allows the encoder to largely remove the spatial
      dependency between blocks inside a 64x64 superblock, thereby removing
      the need to repeatedly encode superblocks per partition type in the
      rate-distortion optimization loop.
      
      A major challenge lies in the intra modes tested in the rate-distortion
      optimization loop. The subsequent blocks do not have access to the
      reconstructed boundary pixels without the intermediate coding steps.
      This was resolved by using the original pixels for intra prediction
      in the rd loop, followed by an appropriately designed distortion
      modeling on the quantization parameters. Experiments also suggested
      that the performance impact is more discernible at lower bit-rate/psnr
      settings. Hence a quantizer dependent threshold is applied to deactivate
      skip of block coding.
      
      For bus_cif at 2000 kbps,
      speed 0: runtime 269854ms -> 237774ms (12% speed-up) at 0.05dB
               performance loss.
      
      speed 1: runtime 65312ms  -> 61536ms, (7% speed-up) at 0.04dB
               performance loss.
      
      This operation is currently turned on in settings of speed 1.
      
      Change-Id: Ib689741dfff8dd38365d8c1b92860a3e176f56ec
      faff6ed0
  5. 11 Jul, 2013 1 commit
  6. 08 Jul, 2013 1 commit
  7. 02 Jul, 2013 2 commits
    • Removing redundant struct from union b_mode_info. · be77f6bb
      Dmitry Kovalev authored
      Change-Id: I08fc6e474ff2c12cfa065bae4989c724276e2c83
      be77f6bb
    • Calculate rd cost per transformed block · b91a1586
      Jingning Han authored
      Compute the rate-distortion cost per transformed block, and accumulate
      the cost through all blocks inside a partition. This allows the encoder
      to detect if the cumulative rd cost is already above the best rd cost,
      thereby enabling early termination in the rate-distortion optimization
      search.
      
      Change-Id: I0a856367a9a7b6dd0b466e7b767f54d5018d09ac
      b91a1586
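
The early-termination idea in this commit can be sketched as below. This is an illustrative outline under assumptions, not the encoder's actual loop: `accumulate_rd_cost` and its parameters are hypothetical names, and the per-block costs are taken as precomputed inputs rather than derived from rate and distortion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums per-transform-block RD costs and stops as soon as the running
 * total exceeds best_rd (the best cost found so far by the search).
 * Returns 1 if every block was evaluated, 0 on early termination.
 * *total receives the accumulated cost, *visited the blocks examined. */
static int accumulate_rd_cost(const int64_t *block_rd, size_t n,
                              int64_t best_rd, int64_t *total,
                              size_t *visited) {
  size_t i;
  assert(block_rd != NULL && total != NULL && visited != NULL);
  *total = 0;
  *visited = 0;
  for (i = 0; i < n; ++i) {
    *total += block_rd[i];
    ++*visited;
    /* This candidate can no longer beat the best one: stop early. */
    if (*total > best_rd) return 0;
  }
  return 1;
}
```

With per-block costs {10, 20, 30, 40} and a best cost of 45, the loop stops after the third block, so the fourth block is never evaluated.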
  8. 01 Jul, 2013 2 commits
    • Make get_coef_context() branchless. · 26b6318d
      Ronald S. Bultje authored
      This should significantly speed up cost_coeffs(). Basically what the
      patch does is to make the neighbour arrays padded by one item to
      prevent an eob check in get_coef_context(), then it populates each
      col/row scan and left/top edge coefficient with two times the same
      neighbour - this prevents a single/double context branch in
      get_coef_context(). Lastly, it populates neighbour arrays in pixel
      order (rather than scan order), so we don't have to dereference the
      scantable to get the correct neighbours.
      
      Total encoding time of first 50 frames of bus (speed 0) at 1500kbps
      goes from 2min10.1 to 2min5.3, i.e. a 2.6% overall speed increase.
      
      Change-Id: I42bcd2210fd7bec03767ef0e2945a665b851df56
      26b6318d
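
The duplicated-neighbour trick described above can be sketched as follows. This is a simplified illustration under assumptions: the function signature, the two-entries-per-position table layout, and the rounding-average formula are chosen for the example and may differ from the committed code, and the one-item padding that removes the eob check is omitted. The idea shown is that storing the same neighbour twice for edge positions makes the two-neighbour average collapse to the single neighbour's value, so no single/double branch is needed.

```c
#include <assert.h>
#include <stdint.h>

/* neighbors holds two raster positions per coefficient position c.
 * Positions with only one real neighbour store that neighbour twice,
 * so (1 + a + a) >> 1 == a and the same expression covers both cases.
 * token_cache is indexed by raster position, so no scan-table lookup
 * is needed to find a neighbour's cached value. */
static int get_coef_context(const int16_t *neighbors,
                            const uint8_t *token_cache, int c) {
  assert(c >= 0);
  return (1 + token_cache[neighbors[2 * c + 0]] +
          token_cache[neighbors[2 * c + 1]]) >> 1;
}
```

For a cache of {3, 0, 1}, a position whose neighbour pair is (0, 0) gets context 3, while a pair of (0, 1) gets the rounded average 2.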
    • Quantize (64-bit only, for now) SSSE3 SIMD. · 7353ceab
      Ronald S. Bultje authored
      Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
      goes from 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
      x86-64 only, it needs some minor modifications to be 32bit compatible,
      because it uses 15 xmm registers, whereas 32bit only has 8.
      
      Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
      7353ceab
  9. 28 Jun, 2013 2 commits
    • Inline vp9_get_coef_context() (and remove vp9_ prefix). · d00b8e5f
      Ronald S. Bultje authored
      Makes cost_coeffs() a lot faster:
      4x4: 236 -> 181 cycles
      8x8: 888 -> 588 cycles
      16x16: 3550 -> 2483 cycles
      32x32: 17392 -> 12010 cycles
      
      Total encode time of first 50 frames of bus (speed 0) @ 1500kbps goes
      from 2min51.6 to 2min43.9, i.e. 4.7% overall speedup.
      
      Change-Id: I16b8d595946393c8dc661599550b3f37f5718896
      d00b8e5f
    • Some minor optimizations for cost_coeffs(). · 91d223bd
      Ronald S. Bultje authored
      Cycle timings for first 3 frames of bus (speed 0) at 1500kbps:
      4x4: 298 -> 234 cycles
      8x8: 1227 -> 878 cycles
      16x16: 23426 -> 18134 cycles
      32x32: 4906 -> 3664 cycles
      
      Total encode time of first 50 frames of bus @ 1500kbps (speed 0) goes
      from 3min0.7 to 2min51.6 seconds, i.e. 5.3% faster.
      
      Change-Id: I68a0e1b530b0563b84a67342cca4b45146077e95
      91d223bd
  10. 27 Jun, 2013 1 commit
    • Make intra predictor reference buffer configurable · 861cb06c
      Jingning Han authored
      This commit enables configurable reference buffer pointer for intra
      predictor. This allows later removal of spatial dependency between
      blocks inside a 64x64 superblock in the rate-distortion optimization
      loop.
      
      Change-Id: I02418c2077efe19adc86e046a6b49364a980f5b1
      861cb06c
  11. 26 Jun, 2013 1 commit
  12. 25 Jun, 2013 1 commit
    • Removing unused code. · 87ee34aa
      Dmitry Kovalev authored
      Removing block index (ib) parameter from get_tx_type_{8x8, 16x16}
      functions.
      
      Change-Id: Ia213335aae7a7cb027f97b9cc9b04519840250f1
      87ee34aa
  13. 21 Jun, 2013 1 commit
  14. 18 Jun, 2013 1 commit
    • Make fdct32 computation flow within 16bit range · a41a4860
      Jingning Han authored
      This commit makes use of dual fdct32x32 versions for rate-distortion
      optimization loop and encoding process, respectively. The one for
      rd loop requires only 16 bits precision for intermediate steps.
      The original fdct32x32 that allows higher intermediate precision (18
      bits) was retained for the encoding process only.
      
      This allows speed-up for fdct32x32 in the rd loop. No performance
      loss observed.
      
      Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3
      a41a4860
  15. 17 Jun, 2013 1 commit
  16. 10 Jun, 2013 1 commit
    • Fix use of get_uv_tx_size in loopfilter · 717d744a
      John Koleszar authored
      Change the argument of get_uv_tx_size() to be an MBMI pointer, so that the
      correct column's MBMI can be passed to the function.
      
      Change-Id: Ied6b8ec33b77cdd353119e8fd2d157811815fc98
      717d744a
  17. 07 Jun, 2013 1 commit
    • Change ref frame coding. · 6ef805eb
      Ronald S. Bultje authored
      Code intra/inter, then comp/single, then the ref frame selection.
      Use contextualization for all steps. Don't code two past frames
      in comp pred mode.
      
      Change-Id: I4639a78cd5cccb283023265dbcc07898c3e7cf95
      6ef805eb
  18. 06 Jun, 2013 1 commit
  19. 31 May, 2013 2 commits
  20. 29 May, 2013 2 commits
    • Balancing coef-tree to reduce bool decodes · b8b3f1a4
      Deb Mukherjee authored
      This patch changes the coefficient tree to move the EOB to below
      the ZERO node in order to save number of bool decodes.
      
      The advantages of moving EOB one step down as opposed to two steps down
      in the other parallel patch are: 1. The coef modeling based on
      the One-node becomes independent of the tree structure above it, and
      2. Fewer context/counter increases are needed.
      
      The drawback is that the potential savings in bool decodes will be
      less, but assuming that 0s are much more predominant than 1s the
      potential savings is still likely to be substantial.
      
      Results on derf300: -0.237%
      
      Change-Id: Ie784be13dc98291306b338e8228703a4c2ea2242
      b8b3f1a4
    • Residual coding to cache energy class of tokens. · 88a4d4c5
      Sami Pietila authored
      Proposal for tuning the residual coding by changing how the context
      from previous tokens is calculated. Storing the energy class of previous
      tokens instead of the token itself eases the critical path of
      HW implementations.
      
      Change-Id: I6d71d856b84518f6c88de771ddd818436f794bab
      88a4d4c5
  21. 28 May, 2013 1 commit
    • further clean-ups on intra4x4 coding · 4729a6f3
      Jingning Han authored
      Removed one 4x4 prediction step that was unnecessary in the rd loop.
      Removed an unused modecosts estimate from encoder side.
      
      Change-Id: I65221a52719d6876492996955ef04142d2752d86
      4729a6f3
  22. 27 May, 2013 1 commit
    • a few clean-ups · 2b96ffe0
      Yaowu Xu authored
      1. remove prediction mode conversion
      2. unified bmode, same for key and non-key frame
      3. set I4X4_PRED count for pdf to 0, as I4X4_PRED is no longer
      coded ever. It is determined by ref_frame and block partition
      
      Change-Id: If5b282957c24339b241acdb9f2afef85658fe47d
      2b96ffe0
  23. 23 May, 2013 1 commit
  24. 22 May, 2013 2 commits
    • changes intra coding to be based on txfm block · 8ba92a0b
      Yaowu Xu authored
      This commit changed the encoding and decoding of intra blocks to be
      based on transform block. In each prediction block, the intra coding
      iterates through each transform block based on raster scan order.
      
      This commit also fixed a bug in D135 prediction code.
      
      TODO next:
      The RD mode/txfm_size selection should take this into account when
      computing RD values.
      
      Change-Id: I6d1be2faa4c4948a52e830b6a9a84a6b2b6850f6
      8ba92a0b
    • Generalized intra 4x4 encoding for all sizes · 232d90d8
      Yaowu Xu authored
      Change-Id: I1b86744fa247233c8df031b3f4b87b212c8dd094
      232d90d8
  25. 20 May, 2013 1 commit
    • WIP: 4x4 idct/recon merge · ba48a111
      Scott LaVarnway authored
      This patch eliminates the intermediate diff buffer usage by
      combining the short idct and the add residual into one function.
      The encoder can use the same code as well.
      
      Change-Id: I296604bf73579c45105de0dd1adbcc91bcc53c22
      ba48a111
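
The merge described in this commit (and in the 8x8, 16x16 and 32x32 counterparts) can be illustrated roughly as below. Everything here is an assumption made for the sketch: `toy_residual` is a hypothetical DC-only stand-in for the inverse transform, and the function names are invented. It contrasts a two-step version that writes an intermediate diff buffer with a merged version that adds the residual straight into the destination.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Toy DC-only "inverse transform": every residual sample is dc / 8.
 * A stand-in for illustration only, NOT a real IDCT. */
static int toy_residual(const int16_t *coeffs) { return coeffs[0] / 8; }

/* Two-step form: fill an intermediate diff buffer, then add it. */
static void idct_then_add_4x4(const int16_t *coeffs, uint8_t *dst,
                              int stride) {
  int16_t diff[16];
  const int res = toy_residual(coeffs);
  int r, c, i;
  for (i = 0; i < 16; ++i) diff[i] = (int16_t)res;
  for (r = 0; r < 4; ++r)
    for (c = 0; c < 4; ++c)
      dst[r * stride + c] = clip_pixel(dst[r * stride + c] + diff[r * 4 + c]);
}

/* Merged form: no diff buffer, the residual goes straight into dst. */
static void idct_add_4x4(const int16_t *coeffs, uint8_t *dst, int stride) {
  const int res = toy_residual(coeffs);
  int r, c;
  assert(stride >= 4);
  for (r = 0; r < 4; ++r)
    for (c = 0; c < 4; ++c)
      dst[r * stride + c] = clip_pixel(dst[r * stride + c] + res);
}
```

Both functions produce the same pixels; the merged one avoids the extra buffer write and read.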
  26. 17 May, 2013 1 commit
    • Initial version of alpha channel support · 679e4abd
      John Koleszar authored
      This is a mostly-working implementation of an extra channel in the
      bitstream. Configure with --enable-alpha to test. Notable TODOs:
      
       - Add extra channel to all mismatch tests, PSNR, SSIM, etc
       - Configurable subsampling
       - Variable number of planes (currently always uses all 4)
       - Loop filtering
       - Per-plane lossless quantizer
       - ARNR support
      
      This implementation just uses the same contents as the Y channel
      for the A channel, due to lack of content and general pain in
      playing back 4 channel content. A later patch will use the actual
      alpha channel passed in from outside the codec.
      
      Change-Id: Ibf81f023b1c570bd84b3064e9b4b8ae52e087592
      679e4abd
  27. 16 May, 2013 1 commit
    • WIP: 8x8 idct/recon merge · 794a7bed
      Scott LaVarnway authored
      This patch eliminates the intermediate diff buffer usage by
      combining the short idct and the add residual into one function.
      The encoder can use the same code as well.
      
      Change-Id: Iacfd57324fbe2b7beca5d7f3dcae25c976e67f45
      794a7bed
  28. 15 May, 2013 1 commit
    • WIP: 16x16 idct/recon merge · a272ff25
      Scott LaVarnway authored
      This patch eliminates the intermediate diff buffer usage by
      combining the short idct and the add residual into one function.
      The encoder can use the same code as well.
      
      Change-Id: Iea7976b22b1927d24b8004d2a3fddae7ecca3ba1
      a272ff25
  29. 14 May, 2013 1 commit
    • WIP: 32x32 idct/recon merge · 2cf0d4be
      Scott LaVarnway authored
      This patch eliminates the intermediate diff buffer usage by
      combining the short idct and the add residual into one function.
      The encoder can use the same code as well.
      
      Change-Id: I4ea09df0e162591e420d869b7431c2e7f89a8c1a
      2cf0d4be