1. 25 Jun, 2013 2 commits
    • Refactor intra predictor block · d19ea386
      Jingning Han authored
      Remove vp9_intra4x4_predict(). Use the common intra prediction
      function for all block sizes.
      
      Change-Id: Ibd19d51dfa3da8bbdfb79ddeb81530b2e2089560
      d19ea386
    • Enable sse2 implementation of 8x8 ADST/DCT · a32a086d
      Jingning Han authored
      This commit uses the butterfly structure to enable an sse2
      implementation of the 8x8 ADST/DCT hybrid transform.

      The runtime of the hybrid transform module goes down from 1170 cycles
      to 245 cycles, for an overall speed-up of around 1.5%.
      
      Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
      a32a086d
  2. 21 Jun, 2013 3 commits
    • Remove unused vp9_build_intra_predictors_sb{y,uv}_s · 9e7019f7
      John Koleszar authored
      These functions are no longer referenced.
      
      Change-Id: If2705dfbc607f79ec8ec2242d5e03bec27a35aaf
      9e7019f7
    • Implement SSE2 block_error. · 54b2a596
      Ronald S. Bultje authored
      Change vp9_block_error() to return a 64-bit error value, and change all
      callers to expect a 64-bit return value. This prevents overflows,
      which are currently not checked for at all. Remove the duplicate
      block_error() function, which avoided overflow through truncation.
      Remove the old (incompatible) mmx/sse2 block_error SIMD versions and
      replace them with a new one that returns a 64-bit value.

      Encoding time for the first 50 frames of bus @ 1500kbps goes from
      3min29 to 3min23, i.e. a 3% overall speedup.
      
      Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68
      54b2a596
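To illustrate why a 64-bit return value matters here, the following is a minimal scalar sketch, not the libvpx source. The function name `block_error_sketch`, the `int16_t` coefficient type, and the parameter layout are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: sums the squared differences between the
 * original and the dequantized coefficients of one block. */
int64_t block_error_sketch(const int16_t *coeff, const int16_t *dqcoeff,
                           size_t n) {
  int64_t error = 0;
  assert(coeff != NULL && dqcoeff != NULL);
  for (size_t i = 0; i < n; ++i) {
    /* Widen before subtracting so the difference cannot wrap in 16 bits. */
    const int32_t diff = (int32_t)coeff[i] - (int32_t)dqcoeff[i];
    /* A single squared difference can already exceed the 32-bit signed
     * range, so the product is formed and accumulated in 64 bits. */
    error += (int64_t)diff * diff;
  }
  return error;
}
```

With extreme inputs (32767 against -32768) one term is 65535 squared, which does not fit in a signed 32-bit integer; a 32-bit accumulator would overflow on the first coefficient.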
    • Add subtract_block SSE2 version and unit test. · 25c588b1
      Ronald S. Bultje authored
      3% faster overall (3min35.0 to 3min28.5).
      
      Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e
      25c588b1
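For readers unfamiliar with the operation, a block subtraction can be sketched in scalar C as below. This is an illustrative reference under assumptions: the name `subtract_block_sketch`, the argument order, and the stride types are hypothetical and are not taken from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: writes the residual (source minus
 * prediction) of a rows x cols block into a 16-bit difference buffer. */
void subtract_block_sketch(int rows, int cols,
                           int16_t *diff, ptrdiff_t diff_stride,
                           const uint8_t *src, ptrdiff_t src_stride,
                           const uint8_t *pred, ptrdiff_t pred_stride) {
  assert(diff != NULL && src != NULL && pred != NULL);
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      /* 8-bit inputs give a difference in [-255, 255], which fits int16_t. */
      diff[c] = (int16_t)(src[c] - pred[c]);
    }
    diff += diff_stride;
    src += src_stride;
    pred += pred_stride;
  }
}
```

Each output depends only on the two inputs at the same position, which is what makes this kind of loop a natural candidate for SIMD.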
  3. 20 Jun, 2013 2 commits
    • SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance(). · 1e6a32f1
      Ronald S. Bultje authored
      Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to
      3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions
      which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't
      perfectly interleaved, and can probably be improved further in the
      future. I've marked this with a few TODOs/FIXMEs in the code.
      
      Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9
      1e6a32f1
    • Implement sse2 and ssse3 versions for all sub_pixel_variance sizes. · 8fb6c581
      Ronald S. Bultje authored
      Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 ->
      3min58). Specific changes to timings for each function compared to
      original assembly-optimized versions (or just new version timings if
      no previous assembly-optimized version was available):
      
      sse2   4x4:    99 ->   82 cycles
      sse2   4x8:           128 cycles
      sse2   8x4:           121 cycles
      sse2   8x8:   149 ->  129 cycles
      sse2   8x16:  235 ->  245 cycles (?)
      sse2  16x8:   269 ->  203 cycles
      sse2  16x16:  441 ->  349 cycles
      sse2  16x32:          641 cycles
      sse2  32x16:          643 cycles
      sse2  32x32: 1733 -> 1154 cycles
      sse2  32x64:         2247 cycles
      sse2  64x32:         2323 cycles
      sse2  64x64: 6984 -> 4442 cycles
      
      ssse3  4x4:           100 cycles (?)
      ssse3  4x8:           103 cycles
      ssse3  8x4:            71 cycles
      ssse3  8x8:           147 cycles
      ssse3  8x16:          158 cycles
      ssse3 16x8:   188 ->  162 cycles
      ssse3 16x16:  316 ->  273 cycles
      ssse3 16x32:          535 cycles
      ssse3 32x16:          564 cycles
      ssse3 32x32:          973 cycles
      ssse3 32x64:         1930 cycles
      ssse3 64x32:         1922 cycles
      ssse3 64x64:         3760 cycles
      
      Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d
      8fb6c581
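As a rough picture of what the functions timed above compute, here is a scalar sketch of a sub-pixel variance. It is not the libvpx code. It assumes offsets in eighths of a pixel (0..7), a two-tap bilinear filter applied horizontally and then vertically, and a source buffer with one extra readable row and column; the names `bilinear_tap` and `sub_pixel_variance_sketch` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bilinear blend of two samples; `offset` is in eighths of a pixel (0..7).
 * With offset 0 the result is exactly `a`. */
static uint8_t bilinear_tap(uint8_t a, uint8_t b, int offset) {
  return (uint8_t)((a * (8 - offset) + b * offset + 4) >> 3);
}

/* Hypothetical scalar reference. `src` must have (w + 1) x (h + 1) readable
 * samples. Stores the sum of squared errors in *sse and returns
 * sse - sum^2 / (w * h). */
unsigned int sub_pixel_variance_sketch(const uint8_t *src, int src_stride,
                                       int x_offset, int y_offset,
                                       const uint8_t *ref, int ref_stride,
                                       int w, int h, unsigned int *sse) {
  int64_t sum = 0;
  uint64_t sq = 0;
  assert(src != NULL && ref != NULL && sse != NULL);
  for (int r = 0; r < h; ++r) {
    for (int c = 0; c < w; ++c) {
      const uint8_t *p = src + r * src_stride + c;
      /* Horizontal pass on two adjacent rows, then a vertical pass. */
      const uint8_t top = bilinear_tap(p[0], p[1], x_offset);
      const uint8_t bot =
          bilinear_tap(p[src_stride], p[src_stride + 1], x_offset);
      const int pred = bilinear_tap(top, bot, y_offset);
      const int diff = pred - ref[r * ref_stride + c];
      sum += diff;
      sq += (uint64_t)(diff * diff);
    }
  }
  *sse = (unsigned int)sq;
  return (unsigned int)(sq - (uint64_t)((sum * sum) / (w * h)));
}
```

A production version would filter each row once instead of recomputing both rows per pixel; the sketch trades speed for readability.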
  4. 18 Jun, 2013 1 commit
    • Make fdct32 computation flow within 16bit range · a41a4860
      Jingning Han authored
      This commit uses two fdct32x32 versions: one for the rate-distortion
      optimization loop and one for the encoding process. The rd loop
      version requires only 16 bits of precision for intermediate steps.
      The original fdct32x32, which allows higher intermediate precision (18
      bits), is retained for the encoding process only.

      This speeds up fdct32x32 in the rd loop. No performance loss was
      observed.
      
      Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3
      a41a4860
  5. 14 Jun, 2013 1 commit
    • Enable sse2 version of sad8x4/4x8 · c43af9a8
      Jingning Han authored
      The encoding time for bus at CIF goes from 661s to 625s. This commit
      also enables the unit tests for sad8x4/4x8 in sad_test.cc.
      
      Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
      c43af9a8
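For context, the sum of absolute differences (SAD) for the two block shapes can be sketched in scalar C as follows. This is an illustrative reference, not the libvpx implementation; the function names and signatures are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar reference: SAD over a w x h block. */
static unsigned int sad_sketch(const uint8_t *src, int src_stride,
                               const uint8_t *ref, int ref_stride,
                               int w, int h) {
  unsigned int sad = 0;
  assert(src != NULL && ref != NULL);
  for (int r = 0; r < h; ++r) {
    for (int c = 0; c < w; ++c) {
      sad += (unsigned int)abs(src[c] - ref[c]);
    }
    src += src_stride;
    ref += ref_stride;
  }
  return sad;
}

/* 8 samples wide, 4 rows tall. */
unsigned int sad8x4_sketch(const uint8_t *src, int src_stride,
                           const uint8_t *ref, int ref_stride) {
  return sad_sketch(src, src_stride, ref, ref_stride, 8, 4);
}

/* 4 samples wide, 8 rows tall. */
unsigned int sad4x8_sketch(const uint8_t *src, int src_stride,
                           const uint8_t *ref, int ref_stride) {
  return sad_sketch(src, src_stride, ref, ref_stride, 4, 8);
}
```

A scalar version like this is also a convenient oracle when unit-testing a SIMD implementation against random inputs.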
  6. 13 Jun, 2013 1 commit
    • Enable sse2 version of sad8x4/4x8 · 15f50e7b
      Jingning Han authored
      The encoding time for bus at CIF goes from 661s to 625s. This commit
      also enables the unit tests for sad8x4/4x8 in sad_test.cc.
      
      Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
      15f50e7b
  7. 12 Jun, 2013 5 commits
  8. 10 Jun, 2013 1 commit
  9. 08 Jun, 2013 1 commit
  10. 06 Jun, 2013 1 commit
    • Reimplementation of loop filter · 043d348a
      John Koleszar authored
      This version of the loop filter supports non-4:2:0 subsampling and
      a fourth plane, as well as changing the filtering order to be more
      friendly to hardware implementations.
      
      The filters are applied first to all vertical edges within the
      64x64 SB, followed by the top horizontal edge and any internal
      horizontal edges. Since filtering is applied on each 4x4 edge
      serially, a dependency is created from filtering one block edge
      to the next. It would be possible to remove this dependency by
      building all filtering decisions from the unfiltered
      reconstruction data.
      
      Change-Id: I08f3e9683eb7bded8a76651cbc50fc0dfdd05fa7
      043d348a
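The filtering order described in this commit message can be pictured with a small sketch that only enumerates edges and does no filtering. It is not the libvpx code: the `Edge` type, the function name, the configurable edge spacing `step`, and the inclusion of the left edge at position 0 are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { EDGE_VERTICAL, EDGE_HORIZONTAL } EdgeDir;

typedef struct {
  EdgeDir dir;
  int pos; /* x for vertical edges, y for horizontal edges */
} Edge;

/* Hypothetical helper: writes the visiting order for one square superblock
 * into `out` (at most `cap` entries) and returns the number written. */
size_t loop_filter_order_sketch(int sb_size, int step, Edge *out, size_t cap) {
  size_t n = 0;
  assert(out != NULL && step > 0);
  /* First pass: every vertical edge inside the superblock. */
  for (int x = 0; x < sb_size && n < cap; x += step) {
    out[n++] = (Edge){EDGE_VERTICAL, x};
  }
  /* Second pass: the top horizontal edge (y == 0), then internal ones. */
  for (int y = 0; y < sb_size && n < cap; y += step) {
    out[n++] = (Edge){EDGE_HORIZONTAL, y};
  }
  return n;
}
```

Because each edge is filtered using samples that earlier edges may already have modified, the order returned here is also the dependency order the message refers to.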
  11. 31 May, 2013 1 commit
    • Creates a new speed 1: · ced21bd6
      Jim Bankoski authored
      Speed 1 uses a variance threshold borrowed from static-thresh
      to decide whether to split. Any superblock whose variance exceeds
      static thresh * quantizer index squared is split. In addition, the
      transform size is set to the largest size less than or equal to the
      partition size, the sub-pixel filter is set to normal, and only 12
      modes are used at all.
      
      Change-Id: If7a2858ee70f96d1eb989c04fd87a332b147abef
      ced21bd6
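The two rules in this message can be sketched as below. This is one possible reading, not the encoder's code: it assumes the threshold is static_thresh * qindex * qindex (the message could also be read as the square of the whole product), that transform sizes are the square sizes 4, 8, 16 and 32, and that the limit is the smaller partition dimension. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical split rule: split when the superblock variance exceeds
 * static_thresh * qindex^2 (assumed interpretation). */
bool should_split_sketch(uint64_t sb_variance, uint32_t static_thresh,
                         uint32_t qindex) {
  const uint64_t threshold = (uint64_t)static_thresh * qindex * qindex;
  return sb_variance > threshold;
}

/* Hypothetical transform-size rule: largest square size in {4, 8, 16, 32}
 * that does not exceed the smaller partition dimension. */
int largest_tx_size_sketch(int partition_w, int partition_h) {
  const int limit = partition_w < partition_h ? partition_w : partition_h;
  int tx = 4;
  assert(limit >= 4);
  while (tx < 32 && tx * 2 <= limit) {
    tx *= 2;
  }
  return tx;
}
```

Both decisions are simple comparisons, so they avoid a rate-distortion search over partitions and transform sizes, which is presumably where this speed setting saves time.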
  12. 23 May, 2013 1 commit
    • Merge 4x4 block level partition into codebase · 7ac5ac52
      Jingning Han authored
      Move 4x4/4x8/8x4 partition coding out of experimental list.
      
      This commit fixed the unit test failure issues. It also resolved
      the merge conflicts between 4x4 block level partition and iterative
      motion search for comp_inter_inter.
      
      Change-Id: I898671f0631f5ddc4f5cc68d4c62ead7de9c5a58
      7ac5ac52
  13. 22 May, 2013 1 commit
    • Optimize variance functions · f4fcfe30
      Yunqing Wang authored
      Added SSE2 version of variance functions for super blocks.
      
      Change-Id: Ibeaae8771ca21c99d41dd74067574a51e97b412d
      f4fcfe30
  14. 21 May, 2013 1 commit
  15. 20 May, 2013 1 commit
    • WIP: 4x4 idct/recon merge · ba48a111
      Scott LaVarnway authored
      This patch eliminates the intermediate diff buffer usage by
      combining the short idct and the add residual into one function.
      The encoder can use the same code as well.
      
      Change-Id: I296604bf73579c45105de0dd1adbcc91bcc53c22
      ba48a111
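The idea of merging the inverse transform with the residual add can be sketched as follows. To keep the example short, a DC-only inverse stands in for the full 4x4 IDCT; the rounding and the divide-by-16 scale are illustrative assumptions, and the names are hypothetical. The point is only that the final stage writes clamped pixels straight into the destination, so no intermediate diff buffer is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a reconstructed value to the 8-bit pixel range. */
static uint8_t clip_pixel_sketch(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical fused "inverse transform + add". `dest` holds the prediction
 * on entry and the reconstruction on exit. Only coeffs[0] (DC) is used, as
 * a stand-in for a full inverse transform. */
void idct4x4_dc_add_sketch(const int16_t *coeffs, uint8_t *dest, int stride) {
  assert(coeffs != NULL && dest != NULL);
  const int dc = coeffs[0];
  /* Symmetric rounding of dc / 16 (assumed scale). */
  const int residual = dc >= 0 ? (dc + 8) / 16 : -((-dc + 8) / 16);
  for (int r = 0; r < 4; ++r) {
    for (int c = 0; c < 4; ++c) {
      /* The residual is added and clamped in place; nothing is written
       * to a separate difference buffer. */
      dest[r * stride + c] = clip_pixel_sketch(dest[r * stride + c] + residual);
    }
  }
}
```

Since the function only needs coefficients and a destination that already contains the prediction, the same routine can serve both the decoder and the encoder's reconstruction path.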
  16. 16 May, 2013 2 commits
    • WIP: 8x8 idct/recon merge · 794a7bed
      Scott LaVarnway authored
      This patch eliminates the intermediate diff buffer usage by
      combining the short idct and the add residual into one function.
      The encoder can use the same code as well.
      
      Change-Id: Iacfd57324fbe2b7beca5d7f3dcae25c976e67f45
      794a7bed
    • Add building blocks for 4x8/8x4 rd search · 8e3d0e4d
      Jingning Han authored
      These building blocks enable rate-distortion optimization search
      over block sizes of 8x4 and 4x8. They still need to be converted
      into mmx/sse forms.
      
      Change-Id: I570ea2d22d14ceec3fe3575128d7dfa172a577de
      8e3d0e4d
  17. 15 May, 2013 1 commit
    • WIP: 16x16 idct/recon merge · a272ff25
      Scott LaVarnway authored
      This patch eliminates the intermediate diff buffer usage by
      combining the short idct and the add residual into one function.
      The encoder can use the same code as well.
      
      Change-Id: Iea7976b22b1927d24b8004d2a3fddae7ecca3ba1
      a272ff25
  18. 14 May, 2013 3 commits
  19. 13 May, 2013 1 commit
  20. 10 May, 2013 2 commits
    • Removing unused simple loopfilter code. · effaa326
      Dmitry Kovalev authored
      Change-Id: Ic11dc052fb641687c015e1bbc37181b9babcd43e
      effaa326
    • Add joint motion search in comp_inter_inter mode (experiment) · 9f5811c2
      Yunqing Wang authored
      In the current code, motion vectors obtained from single prediction mode
      are used directly in compound prediction mode. These motion vectors may
      not give accurate prediction since they are searched independently. In
      this patch, we took Pascal's suggestion and did a joint motion search in
      compound prediction mode to find better motion vectors in this situation.
      Test results:
      Overall PSNR: 0.570% (derf), 0.918% (stdhd);
      SSIM: 0.572% (derf), 1.009% (stdhd);

      The encoder is a little slower. This can be improved, since some C
      code is used in the motion search.
      
      Change-Id: Ib30c9240f6c56c9b070867b4ca89412a76d9f3c6
      9f5811c2
  21. 07 May, 2013 1 commit
    • Merge SB8X8 into the codebase · 776c1482
      Jingning Han authored
      Pull sb8x8 out of the experimental list. Verified via borg run tests.
      Fixed unit test failures.
      
      Change-Id: I12a4bbd17395930580c048ab68becad1ffe46e76
      776c1482
  22. 04 May, 2013 1 commit
  23. 02 May, 2013 2 commits
  24. 01 May, 2013 1 commit
  25. 26 Apr, 2013 3 commits
    • Remove BLOCKD structure · bb41ab4a
      John Koleszar authored
      All members can be referenced from their per-plane counterparts, and
      removing the structure removes assumptions about 24 blocks per
      macroblock.
      
      Change-Id: I7ff2fa72d22c29163eb558981c8193765a8113d9
      bb41ab4a
    • Remove destination pointers from BLOCKD · 4f55c561
      John Koleszar authored
      Access these members from MACROBLOCKD instead.
      
      Change-Id: I7907230dd473ff12ebe182b9280d8b7f12a888c4
      4f55c561
    • Removed bmi from blockd · 57f180b3
      Scott LaVarnway authored
      This originally was "Removed update_blockd_bmi()". Now,
      this patch removes bmi from blockd and uses the bmi found
      in mode_info_context. This eliminates unnecessary bmi copies
      between blockd and mode_info_context.
      
      Change-Id: I287a4972974bb363f49e528daa9b2a2293f4bc76
      57f180b3