Increase inflate speed: read decoder input into a uint64_t

The chunk-copy contribution speeds up inflate decoding by writing
decoded DEFLATE data to the output with SIMD methods. Modern compilers
such as gcc, clang, and msvc elide the portable memcpy() calls it uses,
replacing them with much faster SIMD machine instructions.
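
As a rough illustration of the memcpy() idiom described above (a sketch
only, not the chunk-copy code itself), the following copies data in
fixed 16-byte chunks. The names chunk_copy and CHUNK are hypothetical,
and the sketch assumes the caller leaves CHUNK - 1 bytes of slack after
both buffers, since the last round may read and write past len.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK = 16 };  /* hypothetical chunk width, one SSE2/NEON register */

/*
 * Copy len bytes from src to dst in whole CHUNK-sized rounds and return
 * dst + len. The final round may touch up to CHUNK - 1 bytes past len in
 * both buffers, so the caller must provide that slack. The buffers must
 * not overlap.
 */
static unsigned char *chunk_copy(unsigned char *dst,
                                 const unsigned char *src, size_t len) {
    size_t rounds, i;
    assert(len > 0);
    rounds = (len + CHUNK - 1) / CHUNK;
    for (i = 0; i < rounds; i++) {
        unsigned char tmp[CHUNK];
        /* Constant-size memcpy() calls: optimizing compilers typically
         * lower each pair to one wide load and one wide store. */
        memcpy(tmp, src + i * CHUNK, CHUNK);
        memcpy(dst + i * CHUNK, tmp, CHUNK);
    }
    return dst + len;
}
```

Because the copy size is a compile-time constant, no call to the library
memcpy() needs to survive optimization, which is what keeps the portable
spelling cheap.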

Similarly, reading the DEFLATE decoder's input with wide, SIMD methods
can also increase decode speed. See https://crbug.com/760853#c32 for
details: content-encoding: gzip decoding speed improves by a median of
2.17x over the snappy testdata corpus when this method is combined with
the chunk-copy, adler32, and crc32 SIMD contributions. Relative to our
current inflate, this method improves decode speed by 20-30%.

Update the chunk-copy code with a wide input data reader, which consumes
input in 64-bit (8-byte) chunks. Update inflate_fast_chunk_() to use the
wide reader. This feature is supported on little-endian machines only,
and is enabled with the INFLATE_CHUNK_READ_64LE build flag in BUILD.gn,
on Intel CPUs only for now.
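
A minimal sketch of the wide-reader idea follows. It is an illustration
under stated assumptions, not the code in this patch: the bit_reader
struct, refill(), and take_bits() are hypothetical names, the host is
assumed to be little endian, and the caller is assumed to guarantee at
least 8 readable bytes at the input pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoder state; field names are illustrative only. */
typedef struct {
    const unsigned char *in;  /* next unconsumed input byte */
    uint64_t hold;            /* bit accumulator, low bits are oldest */
    unsigned bits;            /* number of valid bits in hold */
} bit_reader;

/* Load 8 bytes as a uint64_t. memcpy() avoids unaligned-access issues
 * and compiles to a single load; the value is only in little-endian
 * order on a little-endian host, which this sketch assumes. */
static uint64_t read64le(const unsigned char *p) {
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Top up the accumulator with one 64-bit read. Only 6 of the 8 bytes
 * are counted as consumed, so bits never exceeds 63; the surplus high
 * bits are re-read, with identical values, by the next refill. */
static void refill(bit_reader *r) {
    assert(r->bits < 16);
    r->hold |= read64le(r->in) << r->bits;
    r->in += 6;
    r->bits += 48;
}

/* Remove and return the low n bits of the accumulator. */
static unsigned take_bits(bit_reader *r, unsigned n) {
    unsigned v;
    assert(n < 32 && n <= r->bits);
    v = (unsigned)(r->hold & ((UINT64_C(1) << n) - 1));
    r->hold >>= n;
    r->bits -= n;
    return v;
}
```

A decode loop built this way would call refill() whenever fewer than 16
bits remain, replacing several byte-at-a-time loads with one wide load
per 48 bits of input.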

The wide reader idea is due to nigeltao@chromium.org, who did the
initial work. This patch is based on his patch [1]. No change in
behavior (other than faster inflate decoding), so no new tests.

[1] https://chromium-review.googlesource.com/c/chromium/src/+/601694/16

Bug: 760853
Change-Id: Ia806d9a225737039367e1b803624cd59e286ce51
Reviewed-on: https://chromium-review.googlesource.com/900982
Commit-Queue: Noel Gordon <noel@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#535365}
Cr-Mirrored-From: https://chromium.googlesource.com/chromium/src
Cr-Mirrored-Commit: 6e212423a214e0e41794e8c9969c2896e2c33121