Splitting CSV files at 3GB/s

Splitting CSV files is a must when dealing with large, potentially larger than RAM, files. But how fast can it be?

Tech

José Muñoz

Backend Developer

The problem ¶

When dealing with large files, for example files that don’t fit on RAM, splitting the file in chunks is a must.

Some data formats like NDJSON can be trivially split, as all \n bytes are line separators. However, CSV can have newlines within quotes, which don’t indicate a new row, but a newline in the row’s field.

There are two kinds of solutions to the problem of finding a splitting point in a CSV file:

Read it. By simply reading the file, we can distinguish between \n bytes that are within quotes and the ones that aren’t, and that are points in which we can split the CSV. With encodings that preserve the ASCII code, like UTF-8, for the special characters we’ll be able to avoid extra state and code.
Statistical solutions. By going directly to the point in which we want to split the file, and by reading some bytes around it, we can guess a good splitting point. This solution is naturally faster, as you need to process less bytes, but it can return wrong splitting points too.

We decided to go with the first approach to favor correctness over speed. However, speed is very important to us, how much faster can we make it?

Speed wins

Our first implementation was a python-based parser. We tested its correct behavior, but we certainly saw that performance with this approach was a no-go.

Hello, C

Python performance can be very limitting, but the good thing about Python is that C code can be integrated easily on simple problems like this one.

We ported the Python implementation to C and called it using CFFI. This single optimization, without any other improvement, increased performance by two orders of magnitude, reaching now the 1GB/s barrier.

Simple is faster

We were quite happy about the 1GB/s initial implementation, but we knew we could do better. We started optimizing the algorithm. The first version was based on a single complex loop. The optimized version was simpler, and had a nested loop to deal with the quoted fields.

SIMD

This implementation was quite fast, but we still thought we could go a step further. This optimized version seemed very difficult to improve, as it just performed three quick comparisons per byte on the happy path.

Improving the time it took to process each byte seemed almost impossible, and actually, we didn’t do that. Instead, we processesed multiple bytes per iteration.

All modern x86 CPUs include SIMD instructions like SSE and AVX. These instructions allow the processor to work on multiple registers in parallel, improving throughput.

This is how our last version looks like:

Used compiler intrinsics:

__m128i _mm_set1_epi8(unsigned char c). Returns a 16-byte SSE group of registers with all bytes filled with c.
__m128i _mm_loadu_si128(__m128i const* mem_addr). Returns a 16-byte SSE group of registers with bytes copied from mem_addr.
__m128i _mm_cmpeq_epi8 (__m128i a, __m128i b). Returns a 16-byte SSE group of registers, comparing for equality the bytes of a and b. Each returned byte is set to 0xFF if the corresponding bytes of a and b are equal, 0 otherwise.
int _mm_movemask_epi8 (__m128i a). Compacts the 16 bytes of a by taking the most significand bit of each byte in a, returning a 16-bit sized mask.
int __builtin_popcount(unsigned int x). Counts the number of bits in x set to 1.
int __builtin_clz(unsigned int x). Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.
int __builtin_ctz (unsigned int x). Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.