Commit d5dfa96e authored by David Barker, committed by Debargha Mukherjee

Add SSE2 vectorized warp filter for lowbd

End-to-end speed improvements (measured on tempete_cif.y4m,
20 frames for the encoder and all 260 frames for the decoder):

* GLOBAL_MOTION encoder: ~10% faster
* GLOBAL_MOTION decoder: 100-200% faster depending on bitrate
* WARPED_MOTION encoder: ~2.5% faster
* WARPED_MOTION decoder: ~20-40% faster depending on bitrate

The improvement in the GLOBAL_MOTION decoder is particularly
large because its runtime is dominated by calls to warp_plane().
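The link between "dominated by warp_plane()" and the size of the decoder gain can be checked with Amdahl's law. The sketch below is only an illustration: the function names are hypothetical, and it assumes that "N% faster" means an overall speedup of 1 + N/100 and that only the warp code changed speed.

```c
#include <assert.h>

/* Overall speedup when a fraction `f` of the runtime (0 <= f <= 1) is made
 * `s` times faster (s > 0) and the remainder is unchanged (Amdahl's law). */
double overall_speedup(double f, double s) {
  assert(f >= 0.0 && f <= 1.0);
  assert(s > 0.0);
  return 1.0 / ((1.0 - f) + f / s);
}

/* Smallest fraction of the original runtime that the accelerated code must
 * have occupied for an overall speedup of `overall` (>= 1) to be reachable.
 * This is the limit of Amdahl's law as the kernel speedup s grows without
 * bound: overall = 1 / (1 - f), so f = 1 - 1 / overall. */
double min_fraction_for_speedup(double overall) {
  assert(overall >= 1.0);
  return 1.0 - 1.0 / overall;
}
```

Under those assumptions, `min_fraction_for_speedup(2.0)` is 0.5 and `min_fraction_for_speedup(3.0)` is about 0.67, so a decoder that became 100-200% faster must have spent at least half to two thirds of its time in the accelerated code. That is consistent with the statement that warp_plane() dominates the GLOBAL_MOTION decoder's runtime.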

This introduces minor changes to the output of the warp filter,
but these should be rare.

Change-Id: I5813ab9e90311e27587045153c32d400b6b9eb92
parent 3bd83775
@@ -157,4 +157,8 @@ ifeq ($(CONFIG_FILTER_INTRA),yes)
AV1_COMMON_SRCS-$(HAVE_SSE4_1) += common/x86/filterintra_sse4.c
ifneq ($(findstring yes,$(CONFIG_GLOBAL_MOTION) $(CONFIG_WARPED_MOTION)),)
AV1_COMMON_SRCS-$(HAVE_SSE2) += common/x86/warp_plane_sse2.c
$(eval $(call rtcd_h_template,av1_rtcd,av1/common/
@@ -765,4 +765,12 @@ if (aom_config("CONFIG_DERING") eq "yes") {
specialize qw/od_filter_dering_orthogonal_8x8 sse4_1/;
if ((aom_config("CONFIG_WARPED_MOTION") eq "yes") ||
(aom_config("CONFIG_GLOBAL_MOTION") eq "yes")) {
add_proto qw/void warp_affine/, "int32_t *mat, uint8_t *ref, int width, int height, int stride, uint8_t *pred, int p_col, int p_row, int p_width, int p_height, int p_stride, int subsampling_x, int subsampling_y, int ref_frm, int32_t alpha, int32_t beta, int32_t gamma, int32_t delta";
specialize qw/warp_affine sse2/;
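The `add_proto` / `specialize` pair above registers `warp_affine` with the run-time CPU detection (RTCD) machinery, which selects an implementation when the library is initialised. The generated header is not part of this page, so the following is only a minimal sketch of the general function-pointer dispatch pattern. All `demo_` names and the flag value are hypothetical, and the "SSE2" variant is written in plain C so the sketch compiles on any target.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_HAS_SSE2 0x01 /* hypothetical CPU capability flag */

typedef void (*demo_filter_fn)(const uint8_t *src, uint8_t *dst, int n);

/* Portable reference version: halve each pixel with rounding. */
static void demo_filter_c(const uint8_t *src, uint8_t *dst, int n) {
  for (int i = 0; i < n; ++i) dst[i] = (uint8_t)((src[i] + 1) >> 1);
}

/* Stand-in for an optimised version. A real one would use SSE2 intrinsics;
 * this one only unrolls the loop by four and must produce identical output. */
static void demo_filter_sse2(const uint8_t *src, uint8_t *dst, int n) {
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    dst[i + 0] = (uint8_t)((src[i + 0] + 1) >> 1);
    dst[i + 1] = (uint8_t)((src[i + 1] + 1) >> 1);
    dst[i + 2] = (uint8_t)((src[i + 2] + 1) >> 1);
    dst[i + 3] = (uint8_t)((src[i + 3] + 1) >> 1);
  }
  for (; i < n; ++i) dst[i] = (uint8_t)((src[i] + 1) >> 1); /* tail */
}

/* Callers always go through this pointer; it defaults to the C version. */
demo_filter_fn demo_filter = demo_filter_c;

/* Pick the best implementation the CPU supports. `cpu_flags` would normally
 * come from a CPUID query; here the caller passes it in directly. */
void demo_setup_rtcd(int cpu_flags) {
  demo_filter = demo_filter_c;
  if (cpu_flags & DEMO_HAS_SSE2) demo_filter = demo_filter_sse2;
}
```

Call sites invoke `demo_filter(...)` and never name a variant directly, which is why a new optimised implementation can be added by registering it rather than by editing every caller.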
@@ -28,6 +28,8 @@
const int16_t warped_filter[WARPEDPIXEL_PREC_SHIFTS * 3][8];
typedef void (*ProjectPointsFunc)(int32_t *mat, int *points, int *proj,
const int n, const int stride_points,
const int stride_proj,