  1. 09 Mar, 2017 2 commits
    • Add SSE4.1 highbitdepth self-guided filter · 4d2af5db
      David Barker authored
      Performance is very similar to the lowbd path (only 4-5% slower)
      
      Change-Id: Ifdb272c3f6c0e6f41e7046cc49497c72b5a796d9
    • Avoid out-of-range memory access · 7e9f59e0
      Yaowu Xu authored
      This commit increases the size of a few heap allocations so that later
      accesses stay within bounds.
      
      BUG=aomedia:383
      
      Change-Id: Iadb08faa1e55be361dd3d4adaafeb85cecf23bbb
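The commit message for 7e9f59e0 does not say which buffers were enlarged or by how much. As a general illustration only, one way an allocation can be made safe for code that touches memory in fixed-size groups is to pad each row to a multiple of the group size. The helper names `round_up4` and `alloc_padded_rows`, and the choice of a group size of 4, are assumptions for this sketch and are not taken from the commit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Rounds n up to the next multiple of 4 (n must be non-negative). */
static int round_up4(int n) {
  assert(n >= 0);
  return (n + 3) & ~3;
}

/* Hypothetical helper: allocates a height x width buffer of int32_t whose
   row stride is padded to a multiple of 4. Returns NULL on failure and
   writes the padded stride (in elements) to *stride. */
static int32_t *alloc_padded_rows(int width, int height, int *stride) {
  assert(width > 0 && height > 0 && stride != NULL);
  *stride = round_up4(width);
  /* calloc zero-fills, so the padding columns hold defined values. */
  return (int32_t *)calloc((size_t)(*stride) * (size_t)height,
                           sizeof(int32_t));
}
```

With a width of 7 the stride becomes 8, so a 4-element access starting at column 4 of any row stays inside the allocation. The caller releases the buffer with `free`.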
  2. 08 Mar, 2017 2 commits
    • Make encoder use vectorized self-guided filter · 506eb723
      David Barker authored
      By rearranging the code in restoration.c, we can allow the
      encoder to use the SSE4.1 version of the self-guided filter
      while picking the loop-restoration filter.
      
      This also helps us prepare for adding a highbitdepth SSE4.1
      version of the self-guided filter.
      
      No effect on encoder output, but gives an end-to-end speedup
      of 1-2%.
      
      Change-Id: Id17ba4a0963ddce9f70a7cae666e212e138d5f2c
    • Handle non-multiple-of-4 widths in SSE4.1 self-guided filter · 5765fad5
      David Barker authored
      Adjust the vectorized filter so that it can handle tile widths
      which are not a multiple of 4, so we do not have to fall back
      to the C version of the filter.
      
      Negligible speed impact for tiles with widths which are multiples
      of 4, and greatly improves speed on tiles with non-multiple-of-4
      widths.
      
      Change-Id: Iae9d14f812c52c6f66910d27da1d8e98930df7ba
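The message for 5765fad5 does not describe how the filter copes with leftover columns. A common pattern, sketched here in portable C rather than with real intrinsics, is to run the 4-wide body over the largest multiple of 4 and then push the remaining 1 to 3 pixels through the same step using a small zero-padded block. The names `process4` and `square_row` are hypothetical, and squaring each pixel is only a stand-in for the real per-pixel work.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4-lane step standing in for one SSE4.1 operation. */
static void process4(const uint8_t *src, int32_t *dst) {
  for (int lane = 0; lane < 4; ++lane) dst[lane] = src[lane] * src[lane];
}

/* Squares every pixel of a row of any non-negative width. */
static void square_row(const uint8_t *src, int32_t *dst, int width) {
  assert(width >= 0);
  const int vec_end = width & ~3; /* largest multiple of 4 <= width */
  int x = 0;
  for (; x < vec_end; x += 4) process4(src + x, dst + x);
  if (x < width) {
    /* Tail: copy the 1-3 leftover pixels into a zero-padded block so the
       same 4-lane step runs without reading or writing past the row. */
    uint8_t in[4] = { 0, 0, 0, 0 };
    int32_t out[4];
    for (int i = 0; x + i < width; ++i) in[i] = src[x + i];
    process4(in, out);
    for (int i = 0; x + i < width; ++i) dst[x + i] = out[i];
  }
}
```

For widths that are already multiples of 4 the tail branch is never taken, which is consistent with the negligible speed impact the message reports for that case.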
  3. 06 Mar, 2017 1 commit
    • Vectorize self-guided filter · ce110cc5
      David Barker authored
      Add an SSE4.1 lowbd version of the self-guided filter for
      loop-restoration, and apply some optimizations to the C
      version.
      
      Approximate times per 128x128 / 256x256 tile on the machine
      this was developed on:
      Previous C:  620us / 2800us
      Optimized C: 500us / 2200us ( 24% /  27% faster)
      SSE4.1:      147us / 600us  (320% / 370% faster)
      
      Change-Id: I23ff5a5482a191aeb06f9d1f767a9f036bb357fe
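The "% faster" figures in ce110cc5 appear to be throughput gains, that is, old time divided by new time, minus one. The message does not state the formula, so that reading is an assumption. A minimal sketch of it, with the hypothetical name `percent_faster`:

```c
#include <assert.h>

/* Percentage by which the new time beats the old one, measured as a
   throughput gain: (old / new - 1) * 100. */
static double percent_faster(double old_us, double new_us) {
  assert(old_us > 0.0 && new_us > 0.0);
  return (old_us / new_us - 1.0) * 100.0;
}
```

Under this reading, 620us to 500us is 24%, 2800us to 2200us is about 27.3%, 620us to 147us is about 321.8%, and 2800us to 600us is about 366.7%, which matches the rounded values quoted in the message.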