    A working rewrite of the sgr sse code · 064c1d47
    Rupert Swarbrick authored
    This fixes some Valgrind errors caused by reads from x_by_xplus1 that
    used tainted data as an address (see the comments in selfguided_sse4.c
    for what's going on).
    
    It also rewrites the algorithm to use an integral image approach
    instead of the handwritten filters that the code was using. The end
    result is roughly the same efficiency: I think there's one more
    memory load per group of pixels, but this doesn't seem to be
    measurable. I've also done some performance optimisation with perf.
    Several 32-bit multiplications have been replaced by madd
    instructions, which do 16-bit multiplications and add adjacent
    lanes. This is equivalent to a 32-bit multiplication when the 32-bit
    lanes contain non-negative numbers below 2^15, but runs
    significantly faster.
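
To illustrate why the madd substitution is safe, here is a minimal scalar sketch of one 32-bit output lane. It assumes "madd" means the pmaddwd instruction (the _mm_madd_epi16 intrinsic), which multiplies signed 16-bit halves and adds each adjacent pair. The helper names sext16, madd_lane and mul_via_madd are hypothetical and do not come from selfguided_sse4.c.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret the low 16 bits of x as a signed 16-bit value. */
static int32_t sext16(uint32_t x) {
  x &= 0xffffu;
  return (x & 0x8000u) ? (int32_t)x - 0x10000 : (int32_t)x;
}

/* Scalar model of one 32-bit output lane of pmaddwd: the low and high
 * 16-bit halves of a and b are multiplied as signed values and the two
 * products are added (wrapping to 32 bits). */
static uint32_t madd_lane(uint32_t a, uint32_t b) {
  int32_t lo = sext16(a) * sext16(b);             /* low halves  */
  int32_t hi = sext16(a >> 16) * sext16(b >> 16); /* high halves */
  return (uint32_t)lo + (uint32_t)hi;
}

/* When both inputs are below 2^15 the high halves are zero and the low
 * halves are non-negative as signed 16-bit values, so the lane result
 * is exactly the product a * b. */
static uint32_t mul_via_madd(uint32_t a, uint32_t b) {
  assert(a < (1u << 15) && b < (1u << 15));
  return madd_lane(a, b);
}
```

For inputs inside the range, mul_via_madd(a, b) matches a * b. Outside it the model diverges: with a = 40000 the low half reads as a negative 16-bit value, so madd_lane(40000, 3) is not 120000.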
    
    Change-Id: I3d0f3043c7861707a56e2fd1849574dc73897d6c