    VP9_COPY_CONVOLVE_SSE2 optimization · a5e97d87
    Scott LaVarnway authored
    This function suffers from a few problems on small cores (tablets):
    - The load of the next iteration is blocked by the store of the previous iteration
    - 4K aliasing (between a future store and older loads)
    - Current small-core machines are in-order, so the store will spin the rehabQ until the load is finished
    Fixed by:
    - Prefetching 2 lines ahead
    - Unrolling the copy to 2 rows of the block per iteration
    - Pre-loading all xmm registers before the loop, with the final stores after the loop
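The three fixes above can be sketched in C with SSE2 intrinsics. This is an illustration only, not the assembly from this change: the helper name `copy_16xh_sketch`, the fixed 16-byte width, and the even-height requirement are all assumptions made for the example, and it assumes an x86 target with SSE2. The loop loads the next pair of rows before storing the pair loaded earlier, so no load has to wait behind a store from the same iteration.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_storeu_si128 */
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical helper (not a libvpx function): copies a block that is
 * 16 bytes wide and h rows high. Assumes h is even and h >= 2. */
static void copy_16xh_sketch(const uint8_t *src, ptrdiff_t src_stride,
                             uint8_t *dst, ptrdiff_t dst_stride, int h) {
  /* Pre-load the first two rows before entering the loop. */
  __m128i r0 = _mm_loadu_si128((const __m128i *)src);
  __m128i r1 = _mm_loadu_si128((const __m128i *)(src + src_stride));
  src += 2 * src_stride;

  for (int y = 2; y < h; y += 2) {
    /* Hint the cache about the pair after the one loaded below,
     * but only while such rows still exist inside the block. */
    if (y + 2 < h) {
      _mm_prefetch((const char *)(src + 2 * src_stride), _MM_HINT_T0);
      _mm_prefetch((const char *)(src + 3 * src_stride), _MM_HINT_T0);
    }

    /* Unrolled by two rows: issue both loads for the next pair first... */
    __m128i n0 = _mm_loadu_si128((const __m128i *)src);
    __m128i n1 = _mm_loadu_si128((const __m128i *)(src + src_stride));

    /* ...then store the pair that was loaded one step earlier. */
    _mm_storeu_si128((__m128i *)dst, r0);
    _mm_storeu_si128((__m128i *)(dst + dst_stride), r1);

    r0 = n0;
    r1 = n1;
    src += 2 * src_stride;
    dst += 2 * dst_stride;
  }

  /* Final stores after the loop write the last pair of rows. */
  _mm_storeu_si128((__m128i *)dst, r0);
  _mm_storeu_si128((__m128i *)(dst + dst_stride), r1);
}
```

Unaligned loads and stores are used here so the sketch works for any buffer; the real routine may make different alignment choices.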
    Performance improvement by block size:
    copy_convolve_sse2 64x64 - 16%
    copy_convolve_sse2 32x32 - 52%
    copy_convolve_sse2 16x16 - 6%
    copy_convolve_sse2 8x8 - 2.5%
    copy_convolve_sse2 4x4 - 2.7%
    Credit goes to Tom Craver (tom.r.craver@intel.com) and Ilya Albrekht (ilya.albrekht@intel.com).
    
    Change-Id: I63d3428799c50b2bf7b5677c8268bacb9fc29671