    Add sad64x64 and sad32x32 SSE2 versions.
    Ronald S. Bultje authored
    Also port the 4x4, 16x16, 8x16 and 16x8 versions to x86inc.asm; this
    makes them all slightly faster, particularly on x86-64. Remove the SSE3
    sad16x16 version, since the SSE2 version is now faster.
    Overall encoding speedup is about 1.5%.
    Change-Id: Id4011a78cce7839f554b301d0800d5ca021af797
