round_shift_array: Use SSE4 version everywhere.
CPU usage of round_shift_array drops from 2.01% to 1.04%, and overall encoding is slightly faster (~0.05%). The SSE4 version requires some of the intermediate arrays to be aligned. These functions were also moved to common header/source files.

BUG=aomedia:1106
Change-Id: I492c9b1f2e7339c6cb83cfe68a61218642654d1b
Showing with 137 additions and 86 deletions
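
For readers unfamiliar with the operation, here is a minimal sketch of a rounding shift over an int32 array and of why the vector path needs aligned buffers. It is not the code from this change: the function names `round_shift_array_scalar` and `round_shift_array_simd` are hypothetical, the 16-byte alignment and the tail handling are assumptions, and the intrinsics used here are available from SSE2 onward, so the real SSE4 implementation may look different.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference (hypothetical name): rounding right shift when bit > 0,
 * plain left shift when bit < 0. Assumes >> on negative values is an
 * arithmetic shift, as on all mainstream compilers. */
static void round_shift_array_scalar(int32_t *arr, int size, int bit) {
  if (bit == 0) return;
  for (int i = 0; i < size; ++i) {
    if (bit > 0)
      arr[i] = (int32_t)(((int64_t)arr[i] + ((int64_t)1 << (bit - 1))) >> bit);
    else
      arr[i] = (int32_t)((int64_t)arr[i] * ((int64_t)1 << -bit));
  }
}

/* Vector sketch (hypothetical name): four int32 lanes per iteration.
 * The aligned load/store requires arr to be 16-byte aligned, which is the
 * kind of constraint that forces callers to align intermediate arrays.
 * Note: the rounding add wraps at 32 bits here, unlike the 64-bit scalar
 * add, so results can differ for values close to INT32_MAX. */
static void round_shift_array_simd(int32_t *arr, int size, int bit) {
#if defined(__SSE2__)
  assert(((uintptr_t)arr & 15) == 0); /* assumption: caller aligned the buffer */
  if (bit == 0) return;
  int i = 0;
  if (bit > 0) {
    const __m128i rounding = _mm_set1_epi32(1 << (bit - 1));
    const __m128i count = _mm_cvtsi32_si128(bit);
    for (; i + 4 <= size; i += 4) {
      __m128i v = _mm_load_si128((const __m128i *)(arr + i));
      v = _mm_sra_epi32(_mm_add_epi32(v, rounding), count);
      _mm_store_si128((__m128i *)(arr + i), v);
    }
  } else {
    const __m128i count = _mm_cvtsi32_si128(-bit);
    for (; i + 4 <= size; i += 4) {
      __m128i v = _mm_load_si128((const __m128i *)(arr + i));
      v = _mm_sll_epi32(v, count);
      _mm_store_si128((__m128i *)(arr + i), v);
    }
  }
  /* Leftover elements (size not a multiple of 4) go through the scalar path. */
  round_shift_array_scalar(arr + i, size - i, bit);
#else
  round_shift_array_scalar(arr, size, bit);
#endif
}
```

A caller would declare its intermediate buffer with an explicit alignment, for example `_Alignas(16) int32_t buf[64];` in C11, before passing it to the vector version. An unaligned buffer would either need unaligned loads or would fault on the aligned ones.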