Improve zlib inflate speed by using SSE2 chunk copy

Using SSE2 chunk copies improves the decoding rate of the PNG 140
corpus by an average of 17%, giving a total 37% performance
increase when combined with the SIMD adler32 code (see
https://crbug.com/772870#c3 for details).
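
For readers unfamiliar with the technique, the idea behind a chunk
copy can be sketched roughly as below. This is a simplified
illustration, not the code in this change: the name
chunk_copy_sketch and the CHUNK_SIZE macro are hypothetical, and the
sketch assumes the caller leaves CHUNK_SIZE - 1 bytes of slack after
both buffers and that the two ranges do not overlap.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk size: the width of one 128-bit SIMD register. */
#define CHUNK_SIZE 16

/*
 * Copy len bytes from src to dst in whole CHUNK_SIZE-byte chunks.
 *
 * Assumptions of this sketch: the loop may read up to CHUNK_SIZE - 1
 * bytes past src + len and write up to CHUNK_SIZE - 1 bytes past
 * dst + len, so both buffers need that much slack, and the two
 * ranges must not overlap.
 *
 * Returns dst + len, the logical end of the copied data.
 */
static unsigned char *chunk_copy_sketch(unsigned char *dst,
                                        const unsigned char *src,
                                        size_t len) {
  unsigned char *end = dst + len;
  while (dst < end) {
    /* A fixed-size memcpy is typically lowered by optimizing
       compilers to a single unaligned vector load and store. */
    memcpy(dst, src, CHUNK_SIZE);
    dst += CHUNK_SIZE;
    src += CHUNK_SIZE;
  }
  return end;
}
```

Trading a few over-written bytes for a loop with no per-byte tail
handling is what makes this style of copy cheap for the short
matches that dominate inflate output.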

Move the arm-specific code back into the main chunk copy code and
generalize the SIMD parts of chunkset_core() with inline function
calls for ARM and Intel SSE2 devices. This removes the TODO from
arm/chunkcopy_arm.h, and that file can be deleted as a result.

Add SSE2 vector load / store helpers for chunkset_core(). The
existing NEON load code had alignment issues, as noted in review.
Fix that: use unaligned loads in the ARM helper code.
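
As an illustration of what unaligned load / store helpers can look
like, here is a minimal sketch. The names chunk_t, load_chunk and
store_chunk are hypothetical and are not the helpers added by this
change; the portable memcpy fallback is likewise an assumption made
so the sketch compiles everywhere.

```c
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
typedef __m128i chunk_t;
/* _mm_loadu_si128 / _mm_storeu_si128 accept any address alignment. */
static inline chunk_t load_chunk(const unsigned char *p) {
  return _mm_loadu_si128((const __m128i *)p);
}
static inline void store_chunk(unsigned char *p, chunk_t v) {
  _mm_storeu_si128((__m128i *)p, v);
}
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
typedef uint8x16_t chunk_t;
/* vld1q_u8 / vst1q_u8 only require byte alignment of the pointer. */
static inline chunk_t load_chunk(const unsigned char *p) {
  return vld1q_u8(p);
}
static inline void store_chunk(unsigned char *p, chunk_t v) {
  vst1q_u8(p, v);
}
#else
/* Portable fallback (assumption of this sketch): plain memcpy. */
typedef struct { unsigned char b[16]; } chunk_t;
static inline chunk_t load_chunk(const unsigned char *p) {
  chunk_t v;
  memcpy(v.b, p, sizeof v.b);
  return v;
}
static inline void store_chunk(unsigned char *p, chunk_t v) {
  memcpy(p, v.b, sizeof v.b);
}
#endif
```

With helpers of this shape, the shared copy loop can be written once
in terms of load_chunk() and store_chunk(), leaving only these few
lines architecture-specific.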

Change chunkcopy.h to use __builtin_memcpy if it's available, and
zmemcpy otherwise (such as on MSVC). Also call x86_check_features()
in inflateInit2_() to keep the adler32 SIMD code path enabled.
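
One way such a compile-time selection can be expressed is sketched
below. The macro name CHUNK_MEMCPY is hypothetical, the compiler
checks are an assumption about how availability might be detected,
and plain memcpy stands in for zmemcpy so the sketch is
self-contained.

```c
#include <string.h>

/* Older compilers lack __has_builtin; treat it as "not available". */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

#if __has_builtin(__builtin_memcpy) || defined(__GNUC__)
/* GCC and Clang expand fixed-size builtin copies inline. */
#define CHUNK_MEMCPY __builtin_memcpy
#else
/* Other compilers, such as MSVC: fall back to the library routine.
   Inside zlib this would be zmemcpy; memcpy is used here only to
   keep the sketch standalone. */
#define CHUNK_MEMCPY memcpy
#endif
```

Call sites then use CHUNK_MEMCPY(dst, src, n) and do not need to
know which implementation was picked.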

Update BUILD.gn to conditionally compile the SIMD chunk copy code
on Intel SSE2 and ARM NEON devices. Update names.h to add the new
symbol defined by the inflate chunk copy code path.

The code used a mix of comment styles; pick one and use it
consistently everywhere. Add inffast_chunk.h TODO(cblume).

Bug: 772870
Change-Id: I47004c68ee675acf418825fb0e1f8fa8018d4342
Reviewed-on: https://chromium-review.googlesource.com/708834
Commit-Queue: Noel Gordon <noel@chromium.org>
Reviewed-by: Chris Blume <cblume@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#522764}
Cr-Mirrored-From: https://chromium.googlesource.com/chromium/src
Cr-Mirrored-Commit: c293a3255eb27dee8879f85f2c45dedff58e2452
7 files changed