float to short improvements

[22:58.40] <+jmspeex> xnorpx: About rounding, I actually think the right solution is to write functions that convert whole vectors at a time and then we can just use the normal run-time intrinsics method
[22:59.09] <+jmspeex> BTW, does SSE2 even have a proper way to round without messing with rounding modes?
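
(Editor's sketch, not from the log.) A rough idea of what a whole-vector float-to-short conversion could look like. On the SSE2 question: `_mm_cvtps_epi32` rounds according to the current MXCSR rounding mode, which defaults to round-to-nearest-even, so this sketch never touches the mode. The function names `float2short_scalar` and `float2short_vec`, the 32768 scale factor, and the clamp-before-convert step are assumptions made for illustration, not the project's actual code. NaN inputs are not handled.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical scalar reference: scale, clamp, round to nearest (ties to even). */
static int16_t float2short_scalar(float x)
{
    float v = x * 32768.0f;               /* assumed scale factor */
    if (v > 32767.0f) v = 32767.0f;
    if (v < -32768.0f) v = -32768.0f;
    int32_t t = (int32_t)v;               /* truncates toward zero */
    float frac = v - (float)t;            /* exact for values in this range */
    if (frac > 0.5f || (frac == 0.5f && (t & 1))) t++;
    else if (frac < -0.5f || (frac == -0.5f && (t & 1))) t--;
    return (int16_t)t;
}

/* Hypothetical vector version: converts n floats to 16-bit samples. */
static void float2short_vec(const float *in, int16_t *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128 scale = _mm_set1_ps(32768.0f);
    const __m128 hi = _mm_set1_ps(32767.0f);
    const __m128 lo = _mm_set1_ps(-32768.0f);
    for (; i + 8 <= n; i += 8) {
        __m128 a = _mm_mul_ps(_mm_loadu_ps(in + i), scale);
        __m128 b = _mm_mul_ps(_mm_loadu_ps(in + i + 4), scale);
        /* Clamp first: out-of-range floats would convert to INT32_MIN. */
        a = _mm_min_ps(_mm_max_ps(a, lo), hi);
        b = _mm_min_ps(_mm_max_ps(b, lo), hi);
        /* Rounds using the current MXCSR mode (nearest-even by default). */
        __m128i ia = _mm_cvtps_epi32(a);
        __m128i ib = _mm_cvtps_epi32(b);
        /* Narrow two vectors of 4 x int32 into one of 8 x int16. */
        _mm_storeu_si128((__m128i *)(out + i), _mm_packs_epi32(ia, ib));
    }
#endif
    /* Scalar tail, and the whole loop on targets without SSE2. */
    for (; i < n; i++)
        out[i] = float2short_scalar(in[i]);
}
```

Because the rounding happens inside the per-vector function, run-time dispatch only has to choose which `float2short_vec` variant to call, which seems to be the point of converting whole vectors at a time.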