Add optimized vpx_sum_squares_2d_i16 for vp10.
Using this we can eliminate large numbers of calls to predict intra, and is also faster than most of the variance functions it replaces. This is an equivalence transform so coding performance is unaffected. Encoder speedup is approx 7% when var_tx, super_tx and ext_tx are all enabled. Change-Id: I0d4c83afc4a97a1826f3abd864bd68e41bb504fb