Commit 1b888f2e authored by David Barker, committed by Debargha Mukherjee

Optimize SSE2 warp filter

Improve the speed of the warp filter itself by ~30%. This leads
to an overall decoder speedup of 5-20%, depending on bitrate,
for the global-motion experiment, and a small speedup for
warped-motion.

Also apply a very minor change to the rounding used during
filter selection: ROUND_POWER_OF_TWO makes slightly more sense
here than ROUND_POWER_OF_TWO_SIGNED, and is faster.

Change-Id: I3f364221d1ec35a8aac0d2c8b0e427f527d12e43
parent 0b04e9b8
@@ -992,8 +992,7 @@ void warp_affine_c(int32_t *mat, uint8_t *ref, int width, int height,
 int ix = ix4 + l - 3;
 // At this point, sx = sx4 + alpha * l + beta * k
 const int16_t *coeffs =
-    warped_filter[ROUND_POWER_OF_TWO_SIGNED(sx,
-                                            WARPEDDIFF_PREC_BITS) +
+    warped_filter[ROUND_POWER_OF_TWO(sx, WARPEDDIFF_PREC_BITS) +
                   WARPEDPIXEL_PREC_SHIFTS];
 int32_t sum = 0;
 for (m = 0; m < 8; ++m) {
@@ -1012,9 +1011,9 @@ void warp_affine_c(int32_t *mat, uint8_t *ref, int width, int height,
 uint8_t *p =
     &pred[(i - p_row + k + 4) * p_stride + (j - p_col + l + 4)];
 // At this point, sy = sy4 + gamma * l + delta * k
-const int16_t *coeffs = warped_filter[ROUND_POWER_OF_TWO_SIGNED(
-                                          sy, WARPEDDIFF_PREC_BITS) +
-                                      WARPEDPIXEL_PREC_SHIFTS];
+const int16_t *coeffs =
+    warped_filter[ROUND_POWER_OF_TWO(sy, WARPEDDIFF_PREC_BITS) +
+                  WARPEDPIXEL_PREC_SHIFTS];
 int32_t sum = 0;
 for (m = 0; m < 8; ++m) {
   sum += tmp[(k + m + 4) * 8 + (l + 4)] * coeffs[m];
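
The rounding change described in the commit message can be illustrated with a small standalone sketch. The macro bodies below are assumptions modelled on the usual libaom-style definitions, since this commit does not show them, and `round_plain`, `round_signed` and `count_mismatches` are hypothetical helpers introduced only for the comparison. Under those assumptions, the two macros agree everywhere except on exact ties for negative inputs, where the plain form rounds the half toward positive infinity and the signed form rounds it away from zero.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definitions (not shown in this commit), modelled on the usual
   libaom-style rounding macros. */
#define ROUND_POWER_OF_TWO(value, n) (((value) + ((1 << (n)) >> 1)) >> (n))
#define ROUND_POWER_OF_TWO_SIGNED(value, n)            \
  (((value) < 0) ? -ROUND_POWER_OF_TWO(-(value), (n)) \
                 : ROUND_POWER_OF_TWO((value), (n)))

/* Hypothetical wrappers so the two roundings can be compared directly.
   Right-shifting a negative int is implementation-defined in C; this
   sketch assumes the arithmetic shift that mainstream compilers provide. */
int32_t round_plain(int32_t v, int n) { return ROUND_POWER_OF_TWO(v, n); }

int32_t round_signed(int32_t v, int n) {
  return ROUND_POWER_OF_TWO_SIGNED(v, n);
}

/* Count the inputs in [lo, hi] for which the two roundings disagree. */
int count_mismatches(int32_t lo, int32_t hi, int n) {
  assert(n > 0 && n < 31);
  int count = 0;
  for (int32_t v = lo; v <= hi; ++v) {
    if (round_plain(v, n) != round_signed(v, n)) ++count;
  }
  return count;
}
```

For example, with `n = 1` the input `-3` (that is, -1.5) gives `-1` from `round_plain` and `-2` from `round_signed`, while non-tie inputs such as `-5` with `n = 2` give the same result from both. The plain form also needs no sign test or negation, which is consistent with the commit's note that it is faster.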