1. 12 Jul, 2016 1 commit
    • HBD convolution filtering (10/12 taps) SSE4.1 optimization · 8cacca73
      Yi Luo authored
      - For experiment EXT_INTERP under high bit depth.
      - Add unit test to verify bit-exactness.
      - Speed improvement:
        On Xeon E5-2680, park_joy_1080p_12.y4m, 50 frames, encoding time
        drops from 6682503 ms to 5390270 ms (about 19%).
      
      Change-Id: Iea4debf5414f3accf1eb5672abeab56a0539ac77
  2. 09 Jul, 2016 1 commit
    • Fix assertion failures in mips+msa setting · 4ab19eac
      Yue Chen authored
      Call the C functions directly. When EXT_TX is enabled, hybrid
      transforms other than combinations of DCT/ADST have not been
      implemented, which causes assertion failures in the switch
      statements in vp10_fhtnxn_msa() and vp10_ihtnxn_nxn_add_msa().
      
      BUG=webm:1239
      
      Change-Id: I2379a07e5406f9489edcd2f3205682f679c9b091
  3. 23 Jun, 2016 1 commit
    • Convolution vertical filter SSSE3 optimization · 81ad9536
      Yi Luo authored
      - Apply 8-pixel parallelism in the vertical filtering direction.
      - Add unit tests to verify bit-exact results.
      - Encoder speed improves ~29% (with EXT_INTERP enabled) on Xeon E5-2680.
      - Combinational cycle count of vp10_convolve() drops from 26.06%
        to 6.73%.
      
      Change-Id: Ic1ae48f8fb1909991577947a8c00d07832737e57
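The "8-pixel" parallelism mentioned in 81ad9536 can be pictured with a scalar sketch in which eight adjacent columns share each filter tap, which is the inner loop a SIMD register would replace. This is only an illustration, not the code from the commit: the function names, the 8-tap filter length, and `FILTER_BITS` of 7 are assumptions made for the example.

```c
#include <stdint.h>
#include <string.h>

#define TAPS 8        /* assumed tap count for this sketch */
#define FILTER_BITS 7 /* assumed: taps sum to 1 << FILTER_BITS */
#define LANES 8       /* columns handled per step, mirroring one SIMD register */

static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Filters an 8-column strip vertically. src must point at the first row of
 * the filter window, so output row y reads source rows y .. y + TAPS - 1. */
void vert_filter_8wide(const uint8_t *src, int src_stride, uint8_t *dst,
                       int dst_stride, const int16_t *filter, int h) {
  for (int y = 0; y < h; ++y) {
    int32_t acc[LANES] = { 0 };
    for (int k = 0; k < TAPS; ++k) {
      const uint8_t *row = src + (y + k) * src_stride;
      /* All LANES columns use the same tap k: SIMD would do this in one op. */
      for (int x = 0; x < LANES; ++x) acc[x] += row[x] * filter[k];
    }
    for (int x = 0; x < LANES; ++x) {
      /* Round to nearest, drop the filter scaling, clamp to 8 bits. */
      const int rounded = (acc[x] + (1 << (FILTER_BITS - 1))) >> FILTER_BITS;
      dst[y * dst_stride + x] = clip_pixel(rounded);
    }
  }
}

/* Self-check: a filter with all weight on tap 3 must copy source row y + 3. */
int vert_filter_identity_check(void) {
  enum { H = 4, ROWS = H + TAPS - 1 };
  uint8_t src[ROWS * LANES], dst[H * LANES];
  const int16_t identity[TAPS] = { 0, 0, 0, 128, 0, 0, 0, 0 };
  for (int i = 0; i < ROWS * LANES; ++i) src[i] = (uint8_t)(i * 7 + 3);
  vert_filter_8wide(src, LANES, dst, LANES, identity, H);
  return memcmp(dst, src + 3 * LANES, sizeof(dst)) == 0;
}
```

A bit-exactness unit test of the kind the commit describes would compare an optimized routine against a plain reference like this one over many random inputs.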
  4. 21 Jun, 2016 1 commit
  5. 20 Jun, 2016 1 commit
    • Convolution horizontal filter SSSE3 optimization · 229690a9
      Yi Luo authored
      - Apply signal direction/4-pixel vertical/8-pixel vertical
        parallelism.
      - Add unit test to verify the bit-exact result.
      - Overall encoding time improves ~24% on Xeon E5-2680 CPU.
      
      Change-Id: I104dcbfd43451476fee1f94cd16ca5f965878e59
  6. 10 Jun, 2016 2 commits
    • Move new quant experiment from nextgen · a21afd42
      Sarah Parker authored
      This experiment implements non-uniform quantization where
      the width of the bins increases gradually to more closely
      match a Laplacian distribution of the coefficients.
      
      Performance Gain:
      derflr: 0.15%
      hevcmr: 0.675%
      
      Change-Id: I25234244e3bcd94b87c1f77cf682190b61c8ef94
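To make the idea in a21afd42 concrete, here is a minimal sketch of a quantizer whose bins widen with magnitude. The bin layout (width `base + k * growth` for bin `k`) and both function names are hypothetical and chosen only for illustration. The commit message does not say how the experiment sizes its bins.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical layout: bin k covers a magnitude range of width
 * base + k * growth, so bin 0 is [0, base) and later bins are wider. */
int nonuniform_quantize(int coeff, int base, int growth) {
  assert(base > 0 && growth >= 0); /* needed for the loop to terminate */
  const int sign = coeff < 0 ? -1 : 1;
  int mag = abs(coeff);
  int width = base;
  int index = 0;
  /* Walk outward bin by bin until the remaining magnitude fits. */
  while (mag >= width) {
    mag -= width;
    width += growth;
    ++index;
  }
  return sign * index;
}

/* Lower edge of bin k under the same layout:
 * sum of the widths of bins 0 .. k - 1. */
int nonuniform_bin_start(int k, int base, int growth) {
  assert(k >= 0);
  return k * base + growth * k * (k - 1) / 2;
}
```

With `base = 4` and `growth = 2`, the bins are [0, 4), [4, 10), [10, 18), [18, 28), so small coefficients are resolved finely and large ones coarsely.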
    • Revert "Optimize wedge partition selection." · 95340fcc
      Angie Chiang authored
      This reverts commit efda2831.
      
      This commit causes a segmentation fault in SSE2/SumSquares2DTest.RandomValues/0
      
      Change-Id: I171937e4daf6f15323e8206418773deb03bd8c53
  7. 06 Jun, 2016 1 commit
    • Optimize wedge partition selection. · efda2831
      Geza Lore authored
      We can optimize wedge partition selection by pre-computing the
      residuals of the 2 underlying predictors, and then blend these
      to compute the sse of the compound predictor, without actually
      having to compute and subtract the compound predictor.
      
      Similarly we can pre-compute a proxy array which we can use to
      cheaply check which mask sign would have lower sse.
      
      Details are in wedge_utils.c.
      
      Mathematically these are equivalence transformations, but due to the
      finite precision the encoder output will be perturbed, though on
      average this should make 0% difference.
      
      ext-inter gains a ~4.5% speedup.
      
      Change-Id: Ib2657c3209ae161b4090b58b4b6c392641bf2792
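The equivalence described in efda2831 follows from the blend being linear: if the compound predictor is m*p0 + (64 - m)*p1 (scaled by 64), then its residual against source s is m*(s - p0) + (64 - m)*(s - p1), which is the same blend applied to the two precomputed residuals. The sketch below checks this with exact integer arithmetic. It is not the code from wedge_utils.c; the function names and the 0..64 mask range are assumptions for the example.

```c
#include <stdint.h>

#define MASK_MAX 64 /* assumed blend-weight range 0..64 */

/* Direct route: form the compound predictor, then subtract it from the
 * source. Values stay scaled by MASK_MAX, so no rounding is involved. */
int64_t sse_direct_scaled(const uint8_t *s, const uint8_t *p0,
                          const uint8_t *p1, const uint8_t *m, int n) {
  int64_t sse = 0;
  for (int i = 0; i < n; ++i) {
    const int pred = m[i] * p0[i] + (MASK_MAX - m[i]) * p1[i];
    const int64_t d = MASK_MAX * s[i] - pred;
    sse += d * d;
  }
  return sse;
}

/* Residual route: r0 = s - p0 and r1 = s - p1 are computed once, then
 * only blended for each candidate mask. */
int64_t sse_from_residuals_scaled(const int16_t *r0, const int16_t *r1,
                                  const uint8_t *m, int n) {
  int64_t sse = 0;
  for (int i = 0; i < n; ++i) {
    const int64_t d = m[i] * r0[i] + (MASK_MAX - m[i]) * r1[i];
    sse += d * d;
  }
  return sse;
}

/* Self-check on a small made-up block: both routes must agree exactly. */
int wedge_sse_routes_agree(void) {
  enum { N = 8 };
  const uint8_t s[N]  = { 10, 200, 35, 90, 255, 0, 128, 77 };
  const uint8_t p0[N] = { 12, 190, 40, 80, 250, 5, 120, 70 };
  const uint8_t p1[N] = { 8, 210, 30, 100, 240, 1, 140, 90 };
  const uint8_t m[N]  = { 0, 64, 32, 16, 48, 5, 59, 20 };
  int16_t r0[N], r1[N];
  for (int i = 0; i < N; ++i) {
    r0[i] = (int16_t)(s[i] - p0[i]);
    r1[i] = (int16_t)(s[i] - p1[i]);
  }
  return sse_direct_scaled(s, p0, p1, m, N) ==
         sse_from_residuals_scaled(r0, r1, m, N);
}
```

In this scaled form the two routes match exactly. Once a real implementation rounds the blended predictor or the blended residual back to pixel precision, the two can differ slightly, which is consistent with the perturbation the commit message mentions.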
  8. 24 May, 2016 1 commit
    • HBD inverse HT 8x8 and 16x16 sse4.1 optimization · 28cdee44
      Yi Luo authored
      - Covers tx_type: DCT_DCT, DCT_ADST, ADST_DCT, ADST_ADST.
      - Encoding speed improves ~27% on crowd_run_1080p_12.
      - Merge 4x4, 8x8, 16x16 unit tests in one test file.
      
      Change-Id: I058ef5254d068a9523a826480c78ebbdd231824c
  9. 13 May, 2016 1 commit
    • HBD inverse HT 4x4 SSE4.1 optimization · a3a69b40
      Yi Luo authored
      - Tx_type: DCT_DCT, DCT_ADST, ADST_DCT, ADST_ADST.
      - Encoder overall instruction count drops 2.91%.
      - Decoder overall instruction count drops 1.01%.
      - Add unit test to test bit-exact result against C.
      
      Change-Id: I908c9e0e5106c58f67dd72d28760e6c9ce54278e
  10. 10 May, 2016 2 commits
  11. 22 Apr, 2016 1 commit
    • Change hybrid transform function argument from TXFM_2D_CFG* to int · cf7f0069
      Yi Luo authored
      Unit tests show that hand-written SSE4.1 code would perform ~30%
      better if the TXFM_2D_CFG configuration is set at a lower level.
      This change only updates the function signature. There is no
      performance impact.
      
      Change-Id: I62692bd50a21ffc8a944bbd6c155c0a2020ad77b
  12. 30 Mar, 2016 3 commits
    • Generalize txfm scale in highbd quantizer · c7c40d23
      Angie Chiang authored
      Change-Id: I359aa49c09b244e0d44ebd09442e365a3d22556c
    • change vp10_fwd_txfm2d_#x#_sse2 to vp10_fwd_txfm2d_#x#_sse4_1 · 25520d8d
      Angie Chiang authored
      The timings for 20k runs are as follows.

      Note that the vp10_highbd_fdct#x#_sse2 versions are the
      16-bit versions plus a range check.

      The rest are 32-bit versions.
      
      vp10_fwd_txfm2d_4x4_c (2 ms)
      vp10_fwd_txfm2d_8x8_c (9 ms)
      vp10_fwd_txfm2d_16x16_c (45 ms)
      vp10_fwd_txfm2d_32x32_c (233 ms)
      
      vp10_fwd_txfm2d_4x4_sse4_1 (2 ms)
      vp10_fwd_txfm2d_8x8_sse4_1 (3 ms)
      vp10_fwd_txfm2d_16x16_sse4_1 (16 ms)
      vp10_fwd_txfm2d_32x32_sse4_1 (80 ms)
      
      vp10_highbd_fdct4x4_c (1 ms)
      vp10_highbd_fdct8x8_c (3 ms)
      vp10_highbd_fdct16x16_c (17 ms)
      highbd_fdct32x32_c (160 ms)
      
      vp10_highbd_fdct4x4_sse2 (0 ms)
      vp10_highbd_fdct8x8_sse2 (2 ms)
      vp10_highbd_fdct16x16_sse2 (8 ms)
      highbd_fdct32x32_sse2 (105 ms)
      
      Change-Id: I24daf1e0d4d66e91e4ce61ef71cefa7b70ee90ce
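The timings listed in 25520d8d are easier to compare as ratios. The helper below is a hypothetical convenience, not part of the commit, that turns a baseline time and an optimized time into a speedup factor. It returns -1.0 when the optimized time was reported as 0 ms, since no ratio can be formed from it.

```c
/* Speedup of an optimized routine relative to a baseline, from elapsed
 * times in the same unit. Returns -1.0 if the optimized time is not
 * positive (for example a 0 ms reading at millisecond resolution). */
double speedup(double baseline_ms, double optimized_ms) {
  if (optimized_ms <= 0.0) return -1.0;
  return baseline_ms / optimized_ms;
}
```

Applied to the figures above, `speedup(233, 80)` is about 2.9 for vp10_fwd_txfm2d_32x32, `speedup(45, 16)` is about 2.8 for 16x16, and `speedup(9, 3)` is 3.0 for 8x8. The 4x4 entries (2 ms against 2 ms, and a 0 ms reading) are too close to the timer resolution to support a meaningful ratio.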
    • Add vp10_fwd_txfm2d_sse2 · 11d2bb54
      Angie Chiang authored
      Change-Id: Idfbe3c7f5a7eb799c03968171006f21bf3d96091
  13. 23 Mar, 2016 1 commit
    • Highbd fht4x4 SSE4.1 optimization for DCT_DCT mode · 977dccd1
      Yi Luo authored
      - Set up function vp10_highbd_fht4x4_sse4_1 for highbd SSE4.1
        intrinsics optimization.
      - Wrote SSE4.1 functions: load_buffer_4x4(), write_buffer_4x4(),
        and fdct4x4_sse4_1().
      - Used logical right shift to avoid coeff memory write/read.
      - Turned on vp10_highbd_fht4x4_sse4_1 for DCT_DCT mode only.
      - Improved overall encoding performance by >2.3% on a 50-frame
        sequence, park_joy_1080p_12.y4m, encoded with
        --input-bit-depth=12 and --bit-depth=12.
      - Unit test passed.
      
      Change-Id: Idd6dc6e472cbbf235f0ade4f66fbe859a860a004
  14. 21 Mar, 2016 1 commit
  15. 10 Mar, 2016 1 commit
  16. 08 Mar, 2016 1 commit
    • Implemented DST 16x16 SSE2 intrinsics optimization · 50a164a1
      Yi Luo authored
      - Implemented fdst16_sse2(), fdst16_8col() against C version: fdst16().
      - Turned on 7 DST related hybrid txfm types in vp10_fht16x16_sse2().
      - Replaced vp10_fht16x16_c() with vp10_fht16x16_sse2() in
        fwd_txfm_16x16().
      - Added vp10_fht16x16_sse2() unit test against C version:
        vp10_fht16x16_c() (--gtest_filter=*VP10Trans16x16*).
      - Unit test passed.
      - Speed improvement: 2.4%, 3.2%, 3.2%, for city_cif.y4m, garden_sif.y4m,
        and mobile_cif.y4m.
      
      Change-Id: Ib30a67ce5d5964bef143d588d0f8fa438be8901f
  17. 07 Mar, 2016 1 commit
    • Hybrid 1-D/2-D transform coding · a8dc9694
      Jingning Han authored
      This commit enables a hybrid 1-D/2-D transform coding scheme and
      the accompanying entropy coding system. It currently uses hybrid
      1-D/2-D DCT transform coding. It provides coding performance gains:
      
      lowres_all  0.55%
      hdres_all   0.43%
      
      Change-Id: I2b30dcafd21eb2bb3371f6e854cbab440a4dfa78
  18. 17 Feb, 2016 1 commit
  19. 14 Dec, 2015 1 commit
  20. 09 Nov, 2015 1 commit
    • Release v1.5.0 · cbecf57f
      Johann authored
      Javan Whistling Duck release.
      
      Change-Id: If44c9ca16a8188b68759325fbacc771365cb4af8
  21. 02 Oct, 2015 1 commit
  22. 10 Sep, 2015 2 commits
    • Isolate vp10's fwd_txfm from vp9 · ee5b8059
      Angie Chiang authored
      1) copy fwd_txfm related files from vpx_dsp to vp10

          vpx_dsp/fwd_txfm.h → vp10/common/vp10_fwd_txfm.h
          vpx_dsp/fwd_txfm.c → vp10/common/vp10_fwd_txfm.c
          vpx_dsp/x86/fwd_dct32x32_impl_sse2.h → vp10/common/x86/vp10_fwd_dct32x32_impl_sse2.h
          vpx_dsp/x86/fwd_txfm_sse2.c → vp10/common/x86/vp10_fwd_txfm_sse2.c
          vpx_dsp/x86/fwd_txfm_impl_sse2.h → vp10/common/vp10_fwd_txfm_impl_sse2.h
      
      Change-Id: Ie9428b2ab1ffeb28e17981bb8a142ebe204f3bba
    • Isolate vp10's inv_txfm from vp9 · 87175ed5
      Angie Chiang authored
      1) copy the following files from vpx_dsp/ to vp10/common/
      vp10_inv_txfm.c
      vp10_inv_txfm.h
      vp10_inv_txfm_sse2.c
      vp10_inv_txfm_sse2.h
      
      2) change the function prefix "vpx_" to "vp10_" in the above files

      3) add a unit test in vp10_inv_txfm_test.cc
      
      Change-Id: I206f10f60c8b27d872c84b7482c3bb1d1cb4b913
  23. 12 Aug, 2015 3 commits