- 27 Nov, 2017 1 commit
-
-
James Zern authored
Rename aom_lpf_horizontal_edge_8() to aom_lpf_horizontal_16(). Rename aom_lpf_horizontal_edge_16() to aom_lpf_horizontal_16_dual(). Based on the same change from libvpx: 7f1f35183 "Unify loopfilter function names". Change-Id: I4fda7a2e3a893fc3dee0779975e2d4145c32f5d2
-
- 25 Nov, 2017 1 commit
-
-
Sebastien Alaiwan authored
Change-Id: I5ec79635c716b2d1f1b200dcc3067213f2eedd08
-
- 22 Nov, 2017 2 commits
-
-
Cheng Chen authored
Add ssse3 implementations for the sad_avg c function at low bit-depth. With this, aom_jnt_sad c functions can all have simd implementations. This CL follows existing MACRO definitions for multiple combinations of block sizes. Change-Id: I882343684026525f5589a239337cfac2dd411e11
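Note (illustrative, not taken from the commit): a "sad_avg" style function is commonly understood to compute the sum of absolute differences between a source block and the rounded average of a reference block and a second predictor. The scalar sketch below shows that plain averaging variant only; the function name, the packed layout of second_pred, and the rounding are assumptions, and the distance-weighted jnt variant is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar reference: SAD between src and the rounded average
 * of ref and second_pred. second_pred is assumed to be packed, i.e. its
 * stride equals the block width. */
static unsigned int sad_avg_c(const uint8_t *src, int src_stride,
                              const uint8_t *ref, int ref_stride,
                              const uint8_t *second_pred, int width,
                              int height) {
  unsigned int sad = 0;
  assert(width > 0 && height > 0);
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      /* Average of the two predictors, rounded to nearest. */
      const int avg = (ref[x] + second_pred[x] + 1) >> 1;
      sad += (unsigned int)abs(src[x] - avg);
    }
    src += src_stride;
    ref += ref_stride;
    second_pred += width;
  }
  return sad;
}
```

A SIMD version would process a row of pixels per instruction; the scalar loop is the behaviour such a version has to match bit for bit.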
-
Cheng Chen authored
Change function names and add SIMD implementations for two C functions: (1) var_filter_block2d_bil_first_pass (2) var_filter_block2d_bil_second_pass. This CL allows aom_jnt_sub_pixel_avg_variance to run in SIMD. Change-Id: Ib41ef13d62ae91a0ca481bcebb24568dcd4722c4
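Note (illustrative, not taken from the commit): the two functions named above are usually a separable bilinear filter, with a horizontal first pass into a 16-bit intermediate buffer followed by a vertical second pass back to 8 bits. The sketch below assumes 2-tap filters whose weights sum to 128 and round-to-nearest shifts; the shortened function names and the packed intermediate layout are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FILTER_BITS 7 /* assumption: the two taps sum to 1 << 7 = 128 */

/* Round-to-nearest right shift. */
static int round_shift(int value, int bits) {
  return (value + (1 << (bits - 1))) >> bits;
}

/* First pass: filter 8-bit input between src[j] and src[j + pixel_step],
 * writing packed 16-bit intermediates. The caller requests one extra
 * output row so the vertical second pass has a row below to read. */
static void bil_first_pass(const uint8_t *src, uint16_t *dst, int src_stride,
                           int pixel_step, int out_h, int out_w,
                           const uint8_t filter[2]) {
  assert(filter[0] + filter[1] == (1 << FILTER_BITS));
  for (int i = 0; i < out_h; ++i) {
    for (int j = 0; j < out_w; ++j) {
      dst[j] = (uint16_t)round_shift(
          src[j] * filter[0] + src[j + pixel_step] * filter[1], FILTER_BITS);
    }
    src += src_stride;
    dst += out_w;
  }
}

/* Second pass: filter the packed intermediates (stride == out_w) and
 * write packed 8-bit output. */
static void bil_second_pass(const uint16_t *src, uint8_t *dst, int pixel_step,
                            int out_h, int out_w, const uint8_t filter[2]) {
  assert(filter[0] + filter[1] == (1 << FILTER_BITS));
  for (int i = 0; i < out_h; ++i) {
    for (int j = 0; j < out_w; ++j) {
      dst[j] = (uint8_t)round_shift(
          src[j] * filter[0] + src[j + pixel_step] * filter[1], FILTER_BITS);
    }
    src += out_w;
    dst += out_w;
  }
}
```

For a W x H block the first pass is called with pixel_step 1 and H + 1 output rows, and the second pass with pixel_step W and H output rows.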
-
- 10 Nov, 2017 1 commit
-
-
Urvang Joshi authored
This experiment has been cleared by Tapas. Also, fix a couple of hash signatures in the test while we are at it. Change-Id: I1658bcb07913cf8bd47cfffadd729e16d5c55fc3
-
- 06 Nov, 2017 2 commits
-
-
Cheng Chen authored
Add SIMD implementations for C functions for low bit-depth, making the encoder 3~4x faster than with the C functions. Change-Id: Icca0b07b25489759be9504aaec09d1239076fc52
-
Cheng Chen authored
The refactoring serves two purposes:
1. Separate the code paths for jnt_comp and the original compound average computation. It provides a function interface for jnt_comp while leaving the original compound average computation unchanged. In the near future, SIMD functions can be added for jnt_comp using this interface.
2. The previous implementation uses a hack on second_pred, which may cause a segmentation fault when the test clip is small, as reported in Issue 944. This refactoring removes the hack and makes it possible to address the seg fault problem in the future.
Change-Id: Idd2cb99f6c77dae03d32ccfa1f9cbed1d7eed067
-
- 31 Oct, 2017 1 commit
-
-
Sebastien Alaiwan authored
This experiment has been adopted; we can simplify the code by dropping the associated preprocessor conditionals. Change-Id: I2dce80e1e1b2116708b6ba9feeacaacc12af8fc4
-
- 21 Oct, 2017 1 commit
-
-
Yushin Cho authored
Change-Id: Id377c68e30031ad4697ca1ba311487b803a8af4c
-
- 20 Oct, 2017 1 commit
-
-
Yi Luo authored
Predictor speedup vs C (D207E and D63E: SSE2; D45E: SSSE3):
Block   D207E    D63E     D45E
4x4     ~2.6X    ~4.7X    ~1.8X
4x8     ~2.5X    ~4.9X    ~2.9X
8x4     ~8.0X    ~7.8X    ~6.7X
8x8     ~9.1X    ~8.9X    ~6.5X
8x16    ~11.7X   ~9.3X    ~7.4X
16x8    ~16.9X   ~15.7X   ~24.4X
16x16   ~17.3X   ~14.7X   ~21.5X
16x32   ~17.2X   ~17.3X   ~24.2X
32x16   ~30.2X   ~18.0X   ~25.4X
32x32   ~35.5X   ~15.7X   ~25.2X
Change-Id: I8215de190e2b6314272749761600e389d1ca0fdf
-
- 16 Oct, 2017 1 commit
-
-
Yi Luo authored
D207E Predictor SSE2 vs C AVX2 vs C 4x4 ~2.7x 4x8 ~3.0x 8x4 ~7.2x 8x8 ~8.5x 8x16 ~9.4x 16x8 ~12.8x 16x16 ~13.0x 16x32 ~14.3x 32x16 ~19.9x 32x32 ~23.6x D63E Predictor SSE2 vs C AVX2 vs C 4x4 ~3.8x 4x8 ~4.3x 8x4 ~6.4x 8x8 ~6.8x 8x16 ~8.6x 16x8 ~9.0x 16x16 ~9.6x 16x32 ~10.3x 32x16 ~9.1x 32x32 ~11.0x Change-Id: I87373804c9d53276bf4d7788c4ae0d13d01c00dc
-
- 10 Oct, 2017 2 commits
-
-
Yi Luo authored
Function SSE2 vs C AVX2 vs C 4x4 ~4.5x 4x8 ~4.5x 8x4 ~11.7x 8x8 ~12.7x 8x16 ~14.0x 16x8 ~21.7x 16x16 ~24.0x 16x32 ~28.7x 32x16 ~20.5x 32x32 ~24.4x Change-Id: Iaca49727d8df17b7f793b774a8d51a401ef8a8d1
-
Yi Luo authored
Function speedup on i7-6700: D117 sse2 ssse3 4x4 ~1.8x 8x8 ~3.4x 16x16 ~5.5x 32x32 ~2.9x D135 sse2 ssse3 4x4 ~1.9x 8x8 ~3.3x 16x16 ~5.3x 32x32 ~3.6x D153 sse2 ssse3 4x4 ~1.9x 8x8 ~2.8x 16x16 ~5.5x 32x32 ~3.6x Change-Id: I43ab5fa8dcbcfa51acbde554abf3e5d7d336f391
-
- 06 Oct, 2017 1 commit
-
-
Yi Luo authored
On i7-6700:
Predictor   ssse3 v. C
4x4         ~1.3x
4x8         ~1.9x
8x4         ~2.3x
8x8         ~3.4x
8x16        ~4.1x
16x8        ~4.6x
16x16       ~5.2x
16x32       ~5.6x
32x16       ~4.2x
32x32       ~4.7x
Change-Id: Ic12383cf9d4446361d6355eb8a480a3c7602060e
-
- 04 Oct, 2017 1 commit
-
-
Yi Luo authored
For block width >= 16, avx2 can further speed up the TM_PRED intra prediction.
Function speedup on i7-6700:
Predictor   avx2 v. ssse3
16x8        ~1.6x
16x16       ~1.8x
16x32       ~1.9x
32x16       ~1.9x
32x32       ~1.9x
Change-Id: I62c20bd7628f52251b0c051b99a9b738ee44f7e6
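Note (illustrative, not taken from the commit): the true-motion predictor, known as TM_PRED in libvpx-derived code, is usually defined per pixel as left[r] + above[c] - top_left, clipped to the pixel range. The scalar sketch below assumes 8-bit pixels and passes the top-left pixel explicitly; the function name and signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit pixel range. */
static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar true-motion predictor for a bw x bh block.
 * above holds bw pixels, left holds bh pixels. */
static void tm_predictor(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
                         const uint8_t *above, const uint8_t *left,
                         uint8_t top_left) {
  assert(bw > 0 && bh > 0);
  for (int r = 0; r < bh; ++r) {
    for (int c = 0; c < bw; ++c) {
      dst[c] = clip_pixel(left[r] + above[c] - top_left);
    }
    dst += stride;
  }
}
```

Each row depends on a single left pixel and the whole above row, which is why wider blocks leave more room for wide vector registers.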
-
- 02 Oct, 2017 2 commits
-
-
Change-Id: I01c97d6200e3f4d17c6b38095ca7c8c31967a2ce
-
Sebastien Alaiwan authored
This experiment has been adopted; we can simplify the code by dropping the associated preprocessor conditionals. Change-Id: Ic077963f72e8cc2ae9872b58c8a0241988384110
-
- 29 Sep, 2017 3 commits
-
-
Yi Luo authored
Function speedup (i7-6700):
Predictor   ssse3 v. C
4x4         ~2.1x
4x8         ~2.4x
8x4         ~4.1x
8x8         ~5.4x
8x16        ~6.1x
16x8        ~5.9x
16x16       ~6.4x
16x32       ~6.7x
32x16       ~7.4x
32x32       ~8.0x
Change-Id: I52b8ebf8193e76f4ea1137cbad5ad7fa109d86d8
-
Rupert Swarbrick authored
Change-Id: Ieb28f40d85e4db4af33648c32c406dd2931ceb89
-
Yi Luo authored
For prediction block width equal to 32, avx2 can further speed up the prediction functions (i7-6700):
32x32   avx2 v. sse2
DC      ~1.4x
top     ~1.5x
left    ~1.4x
128     ~1.5x
v       ~1.6x
h       ~1.2x
32x16   avx2 v. sse2
DC      ~2.2x
top     ~1.7x
left    ~1.6x
128     ~1.8x
v       ~1.9x
Note: 32x16 H_PRED on avx2 does not yet run sufficiently faster than sse2.
Change-Id: I145ed504d1b3ea9df283b94927be66a2c6f81225
-
- 28 Sep, 2017 1 commit
-
-
Yi Luo authored
Function speedup, sse2 v. C:
Predictor   V_PRED   H_PRED
4x8         ~1.7x    ~1.8x
8x4         ~1.8x    ~2.2x
8x16        ~1.5x    ~1.4x
16x8        ~1.9x    ~1.3x
16x32       ~1.6x    ~1.4x
32x16       ~2.0x    ~1.9x
This patch disables speed tests to save Jenkins build time. Developers can manually enable them by passing the --gtest_also_run_disabled_tests flag on the test command line.
Change-Id: I81eaee5e8afc55275c7507c99774f78cc9e49f9a
-
- 27 Sep, 2017 2 commits
-
-
James Zern authored
Change-Id: I612517c6218c561ee94888c8c14298964851484a
-
Yi Luo authored
Add lowbd unit test functionality to intrapred_test.cc.
Function speedup against C (i7-6700):
Predictor   DC      LEFT    TOP     128
4x8         ~1.4x   ~1.4x   ~1.7x   ~1.9x
8x4         ~1.2x   ~1.6x   ~1.6x   ~2.6x
8x16        ~1.4x   ~1.3x   ~1.4x   ~2.1x
16x8        ~2.0x   ~1.8x   ~2.3x   ~2.1x
16x32       ~2.0x   ~1.9x   ~1.8x   ~2.2x
32x16       ~2.0x   ~2.0x   ~1.9x   ~2.2x
Change-Id: I33db512020ca3c6853a9205a8079f3d00134f584
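Note (illustrative, not taken from the commit): the DC, LEFT, TOP and 128 predictors benchmarked above all fill the block with one value; they differ only in how that value is derived from the neighbouring pixels. The scalar sketch below assumes 8-bit pixels and a rounded division by the number of contributing pixels, which is one plausible way to handle rectangular blocks. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill a bw x bh block with a single value. */
static void fill_block(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
                       int value) {
  assert(bw > 0 && bh > 0);
  for (int r = 0; r < bh; ++r) {
    memset(dst, value, (size_t)bw);
    dst += stride;
  }
}

static int sum_pixels(const uint8_t *p, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i) sum += p[i];
  return sum;
}

/* 128: mid-grey for 8-bit video, no neighbours used. */
static void dc_128_predictor(uint8_t *dst, ptrdiff_t stride, int bw, int bh) {
  fill_block(dst, stride, bw, bh, 128);
}

/* LEFT: rounded mean of the bh pixels in the left column. */
static void dc_left_predictor(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
                              const uint8_t *left) {
  fill_block(dst, stride, bw, bh, (sum_pixels(left, bh) + bh / 2) / bh);
}

/* TOP: rounded mean of the bw pixels in the row above. */
static void dc_top_predictor(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
                             const uint8_t *above) {
  fill_block(dst, stride, bw, bh, (sum_pixels(above, bw) + bw / 2) / bw);
}

/* DC: rounded mean over both edges; for rectangular blocks the divisor
 * bw + bh is not a power of two, so a plain shift is not enough. */
static void dc_predictor(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
                         const uint8_t *above, const uint8_t *left) {
  const int count = bw + bh;
  const int sum = sum_pixels(above, bw) + sum_pixels(left, bh);
  fill_block(dst, stride, bw, bh, (sum + count / 2) / count);
}
```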
-
- 22 Sep, 2017 1 commit
-
-
Yi Luo authored
Function speedup (i7-6700), sse2 versus C:
Predictor   V_PRED   DC_PRED
4x8         ~1.5x    ~4.9x
8x4         ~2.5x    ~4.8x
8x16        ~1.9x    ~9.1x
16x8        ~1.9x    ~4.4x
16x32       ~2.1x    ~5.8x
32x16       ~2.0x    ~3.6x
Change-Id: I6deffd0637e57ee5d0bd533502f5705148c4cdd4
-
- 19 Sep, 2017 1 commit
-
-
Yi Luo authored
Also extend intra pred speed test to rectangular blocks.
Speedup (i7-6700), sse2 v. C:
Block   left     top      128
4x4     ~5.6x    ~7.2x    ~6.9x
4x8     ~7.7x    ~10.1x   ~10.0x
8x4     ~8.1x    ~9.1x    ~10.1x
8x8     ~10.3x   ~13.6x   ~14.8x
8x16    ~12.6x   ~14.0x   ~15.5x
16x8    ~6.3x    ~7.0x    ~6.5x
16x16   ~6.5x    ~7.1x    ~8.2x
16x32   ~5.1x    ~6.4x    ~5.6x
32x16   ~4.2x    ~4.3x    ~4.5x
32x32   ~3.8x    ~3.7x    ~3.9x
Change-Id: Ie7fcc85b9ded3030ee904623c40e9edeec1695ae
-
- 18 Sep, 2017 1 commit
-
-
Yi Luo authored
sse2 v. C speedup:
4x4     ~8.0x
8x8     ~8.2x
16x16   ~6.5x
32x32   ~3.8x
Blocksize: 4x4, 4x8, 8x4, 8x8, 8x16, 16x8, 16x16, 16x32, 32x16, 32x32
Square blocksize code is from libvpx: "30d9a1916 vpxdsp: [x86] add highbd_h_predictor functions". Credit goes to Scott LaVarnway.
Speed tests do not support rectangular blocksize yet.
Change-Id: I9a1f24aecab8de94f8ea59ec8748fe3537d721ae
-
- 07 Sep, 2017 1 commit
-
-
Yi Luo authored
Baseline + parallel_deblocking:
  - Passed unit tests *SSE2/Loop8Test6*, *AVX2/Loop8Test6*.
  - 1080p, 25 frames, profile=0, encoding/decoding, output match.
  - Decoder frame rate increases from 54.15 to 65.84.
Change-Id: I55938c94961066594f4b9080192c7268c19d9bf9
-
- 15 Aug, 2017 2 commits
-
-
Ralph Giles authored
aom_dsp_rtcd_defs.pl compares most CONFIG_* keys to "yes" to see if they're set. In some cases the script was checking just if (aom_config("CONFIG_EXT_PARTITION_TYPES")). The build system doesn't add disabled configuration options to libs.mk, so this is effectively the same. However, it means that setting the config key explicitly to 0 or "no" in the config headers was treated the same as setting it to 1 or "yes", and aom_dsp_rtcd.h would have opposite expectations from aom_config.h or aom_config.asm. Treat this key similarly to the others for consistency. Change-Id: I27bd7a5532ba4afc2bb289b43b57a1b1971c0348
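Note (illustrative, not taken from the commit): the script itself is Perl, but the distinction described above can be modelled in a few lines of C. The sketch contrasts a loose check that treats any defined value as "set" with a strict comparison against "yes". Both helper names are hypothetical and the loose check only models the behaviour described in the message, not the script's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Loose check (models the old behaviour as described): a key counts as
 * enabled whenever it has any value at all, so "no" also passes. */
static int config_is_set_loose(const char *value) {
  return value != NULL;
}

/* Strict check (models the consistent behaviour): only the literal
 * string "yes" enables the key. */
static int config_is_yes(const char *value) {
  return value != NULL && strcmp(value, "yes") == 0;
}
```

With these helpers, a key explicitly set to "no" is enabled under the loose check and disabled under the strict one, which is the mismatch the commit removes.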
-
Urvang Joshi authored
This experiment has been adopted as it has been cleared by Tapas. Change-Id: I0682face60f62dd43091efa0a92d09d846396850
-
- 10 Aug, 2017 1 commit
-
-
Yi Luo authored
  - Speed test (ms) on i7-6700, Linux x86_64:
    FUNCTION             SSE2   AVX2
    horizontal_edge_16   55     28
    vertical_16_dual     84     47
    horizontal_4_dual    27     13
    horizontal_8_dual    36     15
    vertical_4_dual      38     25
    vertical_8_dual      44     27
  - Decoder frame rate improves around 1.2% - 2.8%.
Change-Id: I9c4123869bac9b6d32e626173c2a8e7eb0cf49e7
-
- 08 Aug, 2017 1 commit
-
-
Thomas Davies authored
This commit de-duplicates C reference quantization code and unifies quantization matrix (QM) and non-QM code paths when there is no SIMD. The reorganisation will also facilitate re-using SIMD quant functions for QM when the matrix is flat, as is the default when AOM_QM is enabled. Change-Id: Idbfdac9eb9a31adcffe734aac1877d58b86fab77
-
- 04 Aug, 2017 1 commit
-
-
Rupert Swarbrick authored
Change-Id: I0c3772110e9fa62ac687bd99e290b5006bf3bd6c
-
- 21 Jul, 2017 1 commit
-
-
Angie Chiang authored
This integration only covers low bitdepth mode for now. The performance of Convolve_round on top of compound_segment improves from 0.475% to 0.612% on lowres. Change-Id: I21606c79d0a22c0834966730358267c082d8071e
-
- 12 Jul, 2017 1 commit
-
-
Rupert Swarbrick authored
This patch adds support for 4:1 rectangular blocks to various common data arrays, and adds new partition types to the EXT_PARTITION_TYPES experiment which will use them.
This patch has the following restrictions, which can be lifted in future patches:
  * ext-partition-types is incompatible with fp_mb_stats and supertx for the moment
  * Currently only 32x32 superblocks can use the new partition types
There's a slightly odd restriction about when we allow PARTITION_HORZ_4 or PARTITION_VERT_4. Since these both live in the EXT_PARTITION_TYPES CDF, read_partition() can only return them if both has_rows and has_cols are true. This means that at least half of the width and height of the block must be visible. It might be nice to relax that restriction, but that would imply a change to how we encode partition types, which seems already to be in a state of flux, so maybe it's better to wait until that has settled down.
Change-Id: Id7fc3fd0f762f35f63b3d3e3bf4e07c245c7b4fa
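Note (illustrative, not taken from the commit): the visibility restriction on has_rows and has_cols can be sketched as a midpoint test against the frame boundary. The helper below is hypothetical; it assumes that positions, block size and frame dimensions share one unit and that each flag means "the second half of the block starts inside the frame".

```c
#include <assert.h>

/* Hypothetical helper: returns 1 when the midpoint of the block lies
 * inside the frame in both dimensions, which is the condition assumed
 * here for the 4:1 partition types to be readable. */
static int ext_partition_4_allowed(int row, int col, int block_size,
                                   int frame_rows, int frame_cols) {
  assert(block_size > 0);
  const int half = block_size / 2;
  const int has_rows = (row + half) < frame_rows; /* bottom half visible */
  const int has_cols = (col + half) < frame_cols; /* right half visible */
  return has_rows && has_cols;
}
```

A block straddling the bottom or right frame edge by half or more of its size fails the test, so only the remaining partition types stay available there.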
-
- 08 Jul, 2017 1 commit
-
-
Fergus Simpson authored
Use higher precision offsets for more accurate predictor generation when references are at a different scale from the coded frame. Change-Id: I4c2c0ec67fa4824273cb3bd072211f41ac7802e8
-
- 29 Jun, 2017 1 commit
-
-
Frederic Barbier authored
Cleanup related unit-tests. Change-Id: Ic756e6bbad80f5b9947ca1cdd55cdef77b985f81
-
- 28 Jun, 2017 1 commit
-
-
Yi Luo authored
Change-Id: Iaae46d0735539b8b8daf9faac81c2a3434838020
-
- 27 Jun, 2017 1 commit
-
-
Yi Luo authored
We are going to have several commits to set up new low/high bitdepth data path selection logic. This patch is for the inverse transform. The ideas are summarized as follows.
  - For low/high bitdepth selection, the encoder depends on the input configuration, e.g., video sequence bitdepth and profile. The decoder depends on the input bitstream. This has nothing to do with compiler/build configuration.
  - Typical encoder usage for sampling format 4:2:0:
    1) 8-bit video sequence:
       a) --profile=0
          Fastest encoding/decoding pipeline.
       b) --profile=2 --bit-depth=10
          Image pixels are left shifted by 2 bits. It employs a 16-bit reference frame buffer and has high calculation precision. It usually enjoys higher compression performance.
    2) 10/12-bit video sequence (HDR):
       --profile=2 --bit-depth=10/12
  - Transform coefficient type:
      Lowbitdepth: int16_t
      Highbitdepth: int32_t
  - The type tran_low_t is still used in the codebase. It is int32_t and defines the data path capacity; naturally, it is high bitdepth.
Eventually we shall remove the configuration flags CONFIG_HIGHBITDEPTH/CONFIG_LOWBITDEPTH and separate the low and high bitdepth data paths. The two data paths co-exist in the same build environment.
Change-Id: I35c06d4d4f19ebf80d909168fdddbae57c3cc884
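Note (illustrative, not taken from the commit): the 8-bit input coded at --bit-depth=10 case described above amounts to widening each pixel to 16 bits and shifting it left by 2. The sketch below shows that step together with the two coefficient widths named in the message. The function name and the lowbd_coeff_t typedef are hypothetical; tran_low_t is defined as int32_t per the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int16_t lowbd_coeff_t; /* hypothetical name, lowbitdepth path */
typedef int32_t tran_low_t;    /* int32_t, per the commit message */

/* Widen 8-bit pixels into a 16-bit buffer and shift left by 2 so they
 * occupy the 10-bit range (0..1020). */
static void upshift_8bit_to_10bit(const uint8_t *src, uint16_t *dst,
                                  size_t count) {
  assert(src != NULL && dst != NULL);
  for (size_t i = 0; i < count; ++i) {
    dst[i] = (uint16_t)(src[i] << 2);
  }
}
```

The two extra low-order bits are zero on input, so the gain comes from the added precision in intermediate calculations and reference frames rather than from the source itself.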
-
- 22 Jun, 2017 1 commit
-
-
Yi Luo authored
  - First pass encoding time is reduced by ~10.9% on i7-6700 at 100 frames, 1080p.
  - avx2 works for coeff number >= 8 cases; the coeff number < 8 case will be implemented by sse2.
  - Unit tests are added for types B/FP/DC.
Change-Id: Ibe5b7807c64e6dfc2d59c470ed50a6e8ca94ef7c
-
- 19 Jun, 2017 1 commit
-
-
Timothy B. Terriberry authored
They do not handle border extension correctly (interpolation and border extension do not commute unless you upsample into the border), nor do they handle crop dimensions that are not a multiple of 8 (the upsampled version is not sufficiently large), in addition to using massive amounts of memory and being a criminal waste of cache (1 byte used for every 8 bytes fetched).
This commit reimplements use_upsampled_references by computing the subpixel samples on the fly. This implementation not only corrects the border handling, but is also faster, while maintaining the same quality.
HL AWCY results are basically noise:
PSNR   | PSNR HVS | SSIM   | MS SSIM | CIEDE 2000
0.0188 | 0.0187   | 0.0045 | 0.0063  | 0.0228
Change-Id: I7527db9f83b87a7bb8b35342f7e6457cd0bef9cd
-