1. 27 Jun, 2013 1 commit
    • Make intra predictor reference buffer configurable · 861cb06c
      Jingning Han authored
      This commit enables a configurable reference buffer pointer for the intra
      predictor. This allows later removal of spatial dependency between
      blocks inside a 64x64 superblock in the rate-distortion optimization
      Change-Id: I02418c2077efe19adc86e046a6b49364a980f5b1
  2. 25 Jun, 2013 6 commits
    • Refactor intra predictor block · d19ea386
      Jingning Han authored
      Remove vp9_intra4x4_predict(). Use the common intra prediction
      function for all block sizes.
      Change-Id: Ibd19d51dfa3da8bbdfb79ddeb81530b2e2089560
    • Tune the rounding operations in 8x8 ADST/DCT sse2 · 0084e61d
      Jingning Han authored
      Improve the round-trip precision to meet the unit test settings.
      Change-Id: I303febae56b4b990ea3798b8ebed94c0510ecf79
    • Removing unused code. · 87ee34aa
      Dmitry Kovalev authored
      Removing block index (ib) parameter from get_tx_type_{8x8, 16x16}
      Change-Id: Ia213335aae7a7cb027f97b9cc9b04519840250f1
    • Add 8x8 dct/adst unit tests · ab362621
      Jingning Han authored
      This commit enables 8x8 DCT and hybrid transform unit tests. It
      also tunes the forward hybrid transform rounding operations for
      more precise round-trip performance.
      Change-Id: If05c1ce59d75d641b9c6c91527d02d3a6ef498c3
    • Use aligned buffer operations in 8x8/16x16 2D-DCT · 82d504b5
      Jingning Han authored
      This reduces 16x16 2D-DCT runtime from 865 cycles to 837 cycles.
      Change-Id: I137758b81cd127b936175284310e81378db64552
    • Enable sse2 implementation of 8x8 ADST/DCT · a32a086d
      Jingning Han authored
      This commit makes use of the butterfly structure to enable the sse2
      version implementation of 8x8 ADST/DCT hybrid transform coding.
      The runtime of hybrid transform module goes down from 1170 cycles
      to 245 cycles. Overall speed-up around 1.5%.
      Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
  3. 24 Jun, 2013 2 commits
  4. 21 Jun, 2013 8 commits
  5. 20 Jun, 2013 12 commits
    • SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance(). · 1e6a32f1
      Ronald S. Bultje authored
      Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to
      3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions
      which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't
      perfectly interleaved, and can probably be improved further in the
      future. I've marked this with a few TODOs/FIXMEs in the code.
      Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9
    • Improving model rd with variance and quant step · 7947a33d
      Deb Mukherjee authored
      Improves the rd modeling function and implements it using interpolation
      from a table, which is a little faster. Also uses sse as input to the
      modeling function rather than var, since there is no dc prediction
      used and as a result the sse works a little better.
      derfraw300: +0.05%
      Speedup: ~1%
      Change-Id: I151353c6451e0e8fe3ae18ab9842f8f67e5151ff
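The table-interpolation idea in the commit above can be illustrated with a minimal sketch. It assumes a generic smooth curve sampled at evenly spaced points; the table values, `X_STEP`, and the name `interp_lookup` are hypothetical and are not taken from the encoder source.

```python
# Hypothetical samples of a smooth model curve y = f(x), tabulated at evenly
# spaced x values. These numbers are made up for illustration only.
X_STEP = 0.5
TABLE = [1.0, 0.60, 0.36, 0.22, 0.13]  # f(0.0), f(0.5), ..., f(2.0)


def interp_lookup(x):
    """Linearly interpolate TABLE at x, clamping to the tabulated range."""
    pos = x / X_STEP
    if pos <= 0:
        return TABLE[0]
    if pos >= len(TABLE) - 1:
        return TABLE[-1]
    i = int(pos)       # index of the table entry at or below x
    frac = pos - i     # fractional distance towards the next entry
    return (1 - frac) * TABLE[i] + frac * TABLE[i + 1]
```

A lookup like this replaces an expensive function evaluation with one index computation and one multiply-add pair, which is one plausible reason a table can be "a little faster".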
    • adds force partitioning greater than or less than block size · 9f2a1ae2
      Jim Bankoski authored
      adds a new speed feature to force partitioning to be greater than
      or less than a certain size
      Change-Id: I8c048eeeef93700ae822eccf98f8751a45b2e7d0
    • adds a set partitioning to speed features · 18bdf708
      Jim Bankoski authored
      this feature lets you set a partitioning size to be used by the entire
      Change-Id: I208a4c8c701375cbb054418266f677768b6f8f06
    • partition by variance using var from last frame · 476d73d2
      Jim Bankoski authored
      This uses variance to split partition. Variance is calculated using
      nearest mv, always from last ref frame.
      Change-Id: Idd015b4a9aa3bc82591759eac239680c07496896
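As a rough illustration of splitting a partition by variance, the sketch below recursively quarters a square block whenever the variance of a difference signal exceeds a threshold. It assumes `diff` holds source-minus-prediction values (for example, against a block fetched from the last reference frame); the function names, the threshold, and the stopping rule are hypothetical and do not reproduce the encoder's actual logic.

```python
def block_variance(diff, x, y, size):
    """Variance of the size x size block of `diff` whose top-left is (x, y)."""
    vals = [diff[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    n = len(vals)
    mean = sum(vals) / n
    return sum(v * v for v in vals) / n - mean * mean


def choose_partition(diff, x, y, size, threshold, min_size):
    """Return a list of (x, y, size) blocks, splitting where variance is high."""
    if size <= min_size or block_variance(diff, x, y, size) <= threshold:
        return [(x, y, size)]
    half = size // 2
    blocks = []
    for dy in (0, half):
        for dx in (0, half):
            blocks += choose_partition(diff, x + dx, y + dy, half,
                                       threshold, min_size)
    return blocks
```

With a flat difference signal the whole block is kept as one partition; a single outlier causes only the quadrant containing it to be split further.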
    • convert all speed things to speed features · 1f94b976
      Jim Bankoski authored
      Change-Id: Ie24489a4d39f3e53e816eeebf75a1c9c7d94515a
    • new partition via variance · 727fa7b1
      Jim Bankoski authored
      Change-Id: Ideee45cad8b38087c509cd404484728e85d0c427
    • fix to set up new speed feature · 0fad6a9d
      Jim Bankoski authored
      This uses the speed feature functionality for code.
      Change-Id: I9cd16c0c5f98520ae27ebba81aa2c178546587f8
    • don't copy partitions for key frames or altrefs · df2314cf
      Jim Bankoski authored
      force us to go through slow partitioning for keyframes, altref and
      Change-Id: I1a286361bf74083e71973575a7296be46eb98742
    • Implement sse2 and ssse3 versions for all sub_pixel_variance sizes. · 8fb6c581
      Ronald S. Bultje authored
      Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 ->
      3min58). Specific changes to timings for each function compared to
      original assembly-optimized versions (or just new version timings if
      no previous assembly-optimized version was available):
      sse2   4x4:    99 ->   82 cycles
      sse2   4x8:           128 cycles
      sse2   8x4:           121 cycles
      sse2   8x8:   149 ->  129 cycles
      sse2   8x16:  235 ->  245 cycles (?)
      sse2  16x8:   269 ->  203 cycles
      sse2  16x16:  441 ->  349 cycles
      sse2  16x32:          641 cycles
      sse2  32x16:          643 cycles
      sse2  32x32: 1733 -> 1154 cycles
      sse2  32x64:         2247 cycles
      sse2  64x32:         2323 cycles
      sse2  64x64: 6984 -> 4442 cycles
      ssse3  4x4:           100 cycles (?)
      ssse3  4x8:           103 cycles
      ssse3  8x4:            71 cycles
      ssse3  8x8:           147 cycles
      ssse3  8x16:          158 cycles
      ssse3 16x8:   188 ->  162 cycles
      ssse3 16x16:  316 ->  273 cycles
      ssse3 16x32:          535 cycles
      ssse3 32x16:          564 cycles
      ssse3 32x32:          973 cycles
      ssse3 32x64:         1930 cycles
      ssse3 64x32:         1922 cycles
      ssse3 64x64:         3760 cycles
      Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d
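The before/after figures quoted in the commit above can be turned into percentage reductions with a small helper. This is only arithmetic on numbers copied from the sse2 rows and the wall-clock times in the message; the names `SSE2_TIMINGS` and `reduction_pct` are illustrative.

```python
# (before, after) cycle counts copied from the sse2 rows that list both values.
SSE2_TIMINGS = {
    "4x4": (99, 82),
    "8x8": (149, 129),
    "8x16": (235, 245),
    "16x8": (269, 203),
    "16x16": (441, 349),
    "32x32": (1733, 1154),
    "64x64": (6984, 4442),
}


def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after` (negative = slower)."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for size, (before, after) in SSE2_TIMINGS.items():
        print(f"sse2 {size:>5}: {reduction_pct(before, after):6.1f}% fewer cycles")
    # Wall-clock encode time: 4min10 = 250 s, 3min58 = 238 s.
    print(f"encode time: {reduction_pct(250, 238):.1f}% shorter")
```

The 8x16 row comes out negative, which matches the "(?)" the author attached to it, and the wall-clock figure is consistent with the stated "around 5%".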
    • disable speed > 1 speed corrections in firstpass · f954490b
      Jim Bankoski authored
      need to rework these
      Change-Id: I17dc2c88d2faadd2f8fb117c52c25f04ea2e9856
    • copy partitioning from last frame · f033b44e
      Jim Bankoski authored
      Change-Id: I26e80ede80cb4389378a95afa95d229092a9859a
  6. 19 Jun, 2013 3 commits
    • Add two-pass quantization · b5bf7b13
      Yunqing Wang authored
      Optimized the quantization function by making it a two-pass
      process. The first pass does a quick check of the transform
      coefficients against the base ZBIN, and keeps only the good
      enough set of coefficients for quantization. A skipping
      check is added. If all coefficients are within the base ZBIN, no
      quantization is needed. The second pass is the actual quantization
      pass, which only processes the coefficient subset determined
      in the first pass. This reduces the computation. Furthermore, an
      alternative method is used for large transform sizes, which often
      have sparse nonzero quantized coefficients.
      Overall, the encoder speedup is about 4%. The quantization function
      itself gets 20% faster.
      Change-Id: I3a9dd0da6db030260b6d9c314a9fa48ecae89f22
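The two-pass structure described in the commit above can be sketched as follows. This is a simplified dead-zone quantizer under stated assumptions: a single scalar zero-bin threshold `zbin` and a single step size `quant_step` for all coefficients, plain integer division instead of the encoder's rounding and per-position tables, and a hypothetical function name. The separate method for large transforms mentioned in the message is not modelled.

```python
def two_pass_quantize(coeffs, zbin, quant_step):
    """Quantize `coeffs` in two passes; returns (quantized list, end-of-block)."""
    # Pass 1: cheap screening. Keep only the indices whose magnitude reaches
    # the base zero-bin threshold; all other coefficients quantize to zero.
    candidates = [i for i, c in enumerate(coeffs) if abs(c) >= zbin]

    qcoeffs = [0] * len(coeffs)
    if not candidates:
        # Skip check: every coefficient lies inside the zero bin.
        return qcoeffs, 0

    # Pass 2: run the actual quantization on the surviving subset only.
    eob = 0
    for i in candidates:
        magnitude = abs(coeffs[i]) // quant_step
        qcoeffs[i] = magnitude if coeffs[i] >= 0 else -magnitude
        if magnitude:
            eob = i + 1  # position just past the last nonzero coefficient
    return qcoeffs, eob
```

For example, `two_pass_quantize([100, 3, -40, 2, 0, 0, -1, 0], zbin=8, quant_step=16)` touches only two of the eight coefficients in the second pass and returns `([6, 0, -2, 0, 0, 0, 0, 0], 3)`.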
    • Remove unnecessary copying of probs. · 12180c83
      Yaowu Xu authored
      Change-Id: Ic924f07c6ab0c929c6cdf11880d3c625806e272c
    • Renaming 'nmv' to 'mv' for several functions. · 87e1fa76
      Dmitry Kovalev authored
      Change-Id: I183a38997a9d01e4a1b869e92509f6915216fa09
  7. 18 Jun, 2013 1 commit
    • Make fdct32 computation flow within 16bit range · a41a4860
      Jingning Han authored
      This commit makes use of dual fdct32x32 versions for rate-distortion
      optimization loop and encoding process, respectively. The one for
      rd loop requires only 16 bits precision for intermediate steps.
      The original fdct32x32 that allows higher intermediate precision (18
      bits) was retained for the encoding process only.
      This allows speed-up for fdct32x32 in the rd loop. No performance
      loss observed.
      Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3
  8. 17 Jun, 2013 4 commits
  9. 14 Jun, 2013 3 commits