Code golf & the art of CSV parsing

So, I recently released but-csv, an absolutely micro package to parse and generate CSVs, at 484 bytes as of writing. Yet it started at almost double that (864 bytes) just a few days ago.

This post is about a kind of code golf in JavaScript—for real shippable software, while not trading off performance. I'll show some tips and tricks that the tools can't do for you.

But first! A graph of progress from a naïve starting point, over a weekend, of making this code small. (The blue line is after being passed through esbuild --minify, and the red is additionally compressed with gzip.)

[Figure: graph showing the drop in but-csv bytes (both minified and gzipped) over a weekend. Caption: "The size got pretty small"]

The gzip'ed version of the starting point actually ended up being larger than the final version's actual source. The compressed version tracks the original, but its line isn't as steep—at the start, gzip saved about 40%, but it reduced to 20% by the end.

(gzip is interesting but not really indicative of how a whole project will compress, since a real project won't just contain a ~500 byte CSV parser: gzip will have more to work with and the ratios will be better.)

But... why?

Look, there's obviously a huge benefit in small and useful libraries, and me saying that "JS bundles are too big" is clearly like saying "water is wet". 💧

I'm of the view that every application is different, and it's needless to include 10-20k of JS that parses CSV in this certain way and so on… you're better off starting with a simple baseline and building the right code for your application. So if you need to parse CSV—including 500 bytes or so is an easy choice to make, because it's not going to bloat your output.

This post isn't just about code golf, because clearly you can't apply this to monolithic libraries: it's also a suggestion to library authors to write libraries which are:

Let's start with the ⛳🏌️ first though, because everyone likes that.

What Minifiers Do Well

So, you want to make your code small. Let's use esbuild --minify, which is really wiping the floor with every other tool today.

Yes, code golf is classically characterized as you writing tiny code. It's admirable and a fun task, but it doesn't help you deliver real software. But instead, let's use the minifier as a baseline and help it out along the way.

To get started, try formatting already minified code to better understand what it does and how you can improve it.

Renaming variables and functions

Source code is about being descriptive, and writing detailed names can help you do that. Don't try to golf your code by shrinking names, because they'll help you understand your code later.

let accumulatedValue = 123;
const doComplexOperation = (arg) => { ... };
// will become:
let a=123;const b=c=>...;

In fact, we can help a minifier by extracting some complex operation that we do often into a helper—especially if it uses built-in JavaScript methods, which can't be renamed. In but-csv, we have a really descriptive helper that looks like:

let getNextValueAndSetPos = (count) => data.slice(at, at += count);
// will become:
let a=c=>d.slice(b,b+=c);

…that's used whenever we need to read a string of a certain length and move the cursor forward. For an even simpler example, but-csv wants to get the current character code a lot:

let sourceCharCodeAt = () => source.charCodeAt(i);

The really simple win here is that by moving this to a helper function, we only ever have one call to .charCodeAt, which is a symbol we cannot rename. If we were calling .charCodeAt directly everywhere, it'd be a long string that many times.
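As a rough sketch of how two helpers like these can fit together, here is a tiny reader for a single unquoted, comma-separated line. The names `readFields`, `charCodeAtOffset` and `takeAndAdvance` are hypothetical and this is not but-csv's actual code; the point is only that `.charCodeAt` and `.slice` are each spelled out once.

```javascript
// Hypothetical sketch: split one unquoted line on commas using two helpers.
const readFields = (source) => {
  let at = 0;
  const fields = [];

  // The only place in the function that spells out ".charCodeAt".
  const charCodeAtOffset = (offset) => source.charCodeAt(at + offset);
  // The only place that spells out ".slice". The first argument reads the
  // old cursor, then the second argument moves the cursor forward.
  const takeAndAdvance = (count) => source.slice(at, at += count);

  for (;;) {
    let length = 0;
    // 44 is the char code for ",".
    while (at + length < source.length && charCodeAtOffset(length) !== 44) {
      ++length;
    }
    fields.push(takeAndAdvance(length));
    if (at >= source.length) {
      return fields;
    }
    ++at; // skip the comma
  }
};
```

Calling `readFields('a,bc,,d')` gives four fields, including the empty one between the two adjacent commas.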

Merge statements into runs

JavaScript programs are a list of statements, blocks and so on. Minifiers are really good at squishing together operations like assignments, and method calls.

The point here is that you shouldn't start by trying to totally rewrite your program into a convoluted format. It can still look like JS and read top-to-bottom.

if (y) {
  doSomething();
  i++;
}
// will become:
y&&(doSomething(),i++);

Many if statements will be turned into && or ternary expressions, so they can be 'run together' with other expressions. Even yield is an expression, so it can be inlined into runs like (yield 123,i++).
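To make the idea of a 'run' concrete, here is the same logic written twice: once as ordinary statements, and once by hand as a single expression of roughly the shape a minifier tends to emit. The function names and the `state` object are made up for illustration, and the exact output of any given minifier will differ.

```javascript
// Statement form, the way you'd normally write it.
const stepStatements = (state) => {
  if (state.ready) {
    state.log.push('run');
    state.count++;
  } else {
    state.skipped++;
  }
  return state;
};

// Hand-written expression form: a ternary picks the branch, and the comma
// operator chains the operations and finally yields "state".
const stepExpression = (state) =>
  (state.ready ? (state.log.push('run'), state.count++) : state.skipped++, state);
```

Both behave identically, which is why you can keep writing the first form and let the tool produce the second.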

What Minifiers Can't Do

Minifiers can rearrange your code, but they can't make intelligent decisions about what order things should happen in, or whether that order is important to your code. But you can, so let's learn how.

Hoist variable declarations

Creating new variables with let or const adds bytes. Instead of creating them throughout your code, you can instead declare them in one contiguous block, so your minifier will group their creation.

let x = 1;
doSomething();
let y = 2;
// x/y won't be merged, so you'll lose a few bytes:
let x=1;doSomething();let y=2;

If you want to be extreme, you can also hide these in a function definition to save more bytes:

const x = (inputString, temp_var_1, temp_var_2) => {
  // nb. arrow functions are always fewer bytes than "function", but can't be generators
};

…and just don't include the extra variables as part of your API definitions (e.g., in your ".d.ts").
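A minimal sketch of that trick, using a hypothetical `countCommas` function: callers only ever pass `inputString`, while `index` and `count` are parameters purely so the body needs no `let`.

```javascript
// Hypothetical example: "index" and "count" are not part of the public API.
// They are assigned before use, so a caller passing extra arguments by
// mistake can't change the result.
const countCommas = (inputString, index, count) => {
  count = 0;
  index = -1;
  while ((index = inputString.indexOf(',', index + 1)) >= 0) {
    ++count;
  }
  return count;
};
```

The trade-off is readability: anyone reading the signature has to know that the trailing parameters are scratch space, which is why keeping them out of your type definitions matters.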

Declare top-level temporary variables

Additionally, if your program needs temporary variables (say, the result of an .indexOf() call), then consider declaring them at the top of your function. (You could even do this in top-level module scope, if your program isn't async: shared module-level temporaries could be overwritten while an async function is paused.)

This looks a bit like:

let string = 'hello,there';
let temp_int;
for (;;) {
  temp_int = string.indexOf(',');
  ...
}
// as above, this will shrink into fewer variable declarations:
let string='hello,there',temp_int;
for(;;) {temp_int = ...}

There's a warning here: you should have one temporary variable per type. Just as you'd declare an int or a float in most compiled languages, JS engines optimize best when a variable only ever holds a single type. So you might create temp_number, temp_string, and so on. Your minifier will rename these variables anyway, so the long names are fine.
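Here is a small sketch of what one temporary per type might look like in practice. The function `describeFirstField` and the variable names are invented for the example; the only point is that each temporary is reused but never changes type.

```javascript
// Hypothetical sketch: one temporary per type, declared once up front.
const describeFirstField = (line) => {
  let tempNumber, tempString;

  tempNumber = line.indexOf(',');                                  // a number
  tempString = tempNumber < 0 ? line : line.slice(0, tempNumber);  // a string
  tempNumber = tempString.length;                                  // reused, still a number

  return tempString + ':' + tempNumber;
};
```

For `'hello,there'` this returns `'hello:5'`.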

Use variables from a higher scope

I show how you can extract common functionality into closures above. You should try to re-use variables e.g., from your function or module scope, to store temporary values.

// both "s" and "index" are declared above
let updateIndexIfPositive = () => {
  index = s.indexOf('something');
  if (index > 0) {
    doSomething();
  }
}

Use booleans as 0/1

TypeScript and friends won't like this, but it's valid to treat false as zero, and true as one. As long as condition is an actual boolean, consider doing this:

if (condition) {
  --x;
}
// could instead be:
x -= condition;

This'll save you a couple of bytes, with no performance hit.
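A quick sketch of the same idea in a loop, plus the one caveat I'd flag. Both function names are hypothetical.

```javascript
// Count quote characters without an "if": true adds 1, false adds 0.
const countQuotes = (text) => {
  let count = 0;
  for (let i = 0; i < text.length; ++i) {
    count += text[i] === '"';
  }
  return count;
};

// Caveat: only real booleans behave this way. A truthy non-boolean such as
// the string "yes" coerces to NaN, so convert with !! first if unsure.
const decrementIf = (x, condition) => x - !!condition;
```

The comparison `text[i] === '"'` always produces a real boolean, so the first function is safe as written; the `!!` in the second is only needed when the condition might be some other truthy value.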

This reminds me of writing OpenGL shaders, where branching is slow—you add 0's or 1's to results to provide choice without if.

Put operations inside function arguments

Minifiers seem to struggle with this, since it's just a bit too much rearranging. Programs are lists of operations interspersed with function calls. If your arguments need to change, you can inline them and use the result of that operation as an argument. For example:

// esbuild won't minify this
x += ',';
callMethodNeedsTrailingComma(x);
// could become
callMethodNeedsTrailingComma(x += ',');

There are two things happening here.

  1. The arguments to a function are evaluated before the function is called, left-to-right
  2. The result of an operation like x += ',' is the new value of x, so we can 'steal' it for our function
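Both points can be checked with a few lines. The `record` function below is a stand-in I've made up; it just captures whatever arguments it receives.

```javascript
const calls = [];
const record = (first, second) => calls.push([first, second]);

let x = 'a';
// Arguments evaluate left to right: the first argument sees the old "x",
// the second performs the append and passes along the new value.
record(x, x += ',');
```

After this runs, `calls` holds `[['a', 'a,']]` and `x` is `'a,'`. That ordering guarantee is what makes the rewrite safe; it also means the trick can bite you if an earlier argument was supposed to see the updated value.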

Use the right built-ins

In but-csv, we do a lot of string manipulations. Strings in JavaScript have both substr and slice: slice is a byte fewer and covers the same ground, although its second argument is an end index rather than a length.
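One thing worth checking before swapping one for the other: the two methods agree when given only a start index, but read a second argument differently. A short illustration, with a made-up input string:

```javascript
const text = 'name,age';

// Same result when only a start index is given.
const tailSlice = text.slice(5);
const tailSubstr = text.substr(5);

// Different meaning for the second argument.
const midSlice = text.slice(1, 4);    // end index (exclusive)
const midSubstr = text.substr(1, 4);  // length
```

Here both tails are `'age'`, while `midSlice` is `'ame'` and `midSubstr` is `'ame,'`.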

This doesn't always give the same performance characteristics, though. It's possible to stringify an array (with comma separators) two ways, but one is much faster:

// don't do this:
const s = ''+arr;
// this is much faster:
const s = arr.join(',');

Similarly, it's possible to call a function by pretending it's a tagged template literal, saving a few bytes (note that the function then receives an array of strings rather than a plain string):

function foo(string_arg) { ... }

// DON'T DO THIS, it's incredibly slow:
foo`hello`
// ... just call it a normal way:
foo('hello')

So it's worth benchmarking your choices.
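If you want a quick way to do that, a throwaway timing helper is usually enough to spot large differences. This is only a rough sketch: `timeIt` is a hypothetical helper, it assumes a runtime with a global `performance` object (modern browsers and recent Node.js versions), and the numbers will vary by engine and machine.

```javascript
// Rough timing helper; treat the results as a hint, not a measurement.
const timeIt = (fn, iterations = 10000) => {
  const start = performance.now();
  let last;
  for (let i = 0; i < iterations; ++i) {
    last = fn();
  }
  return { ms: performance.now() - start, last };
};

const arr = Array.from({ length: 100 }, (_, i) => i);
const viaConcat = timeIt(() => '' + arr);
const viaJoin = timeIt(() => arr.join(','));
console.log(`concat: ${viaConcat.ms.toFixed(2)}ms, join: ${viaJoin.ms.toFixed(2)}ms`);
```

Returning the last result alongside the timing lets you confirm that both variants really produce the same string before comparing their speed.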

Reduce keyword or built-in use

Keywords like return or break take up bytes. You can rewrite your code to avoid them completely, or have a centralized path where they only need to appear once.

if (complexCondition()) {
  return true;
}
while (--x) {
  calculateSomething();
}
return false;
// could become:
let check = complexCondition();
if (!check) {
  while (--x) {
    calculateSomething();
  }
}
return check;

The above example won't save that many bytes, but it can add up. Moving behavior into helper functions, if you call it from multiple places, can also reduce bytes.

Assess work

When minifying but-csv, I found that some operations were being done on every loop iteration but didn't need to be. I was keeping an up-to-date position of the next newline, but if the code hit a multi-line CSV value (wrapped in quotes ""), that position was basically thrown away.

Instead, it was smaller to only find the upcoming newline where we actually checked it—when we were looking for an upcoming comma or newline to end a simple CSV value.

Screw it and use RegExp

Since writing this post, it's been pointed out you can get even smaller using regular expressions. Bravo!

This works very well for string manipulation problems, like CSV.
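To give a feel for the approach, here is my own minimal sketch of a RegExp-based parser. It is not the version that was pointed out to me, nor what but-csv ships, and it assumes `\n` line endings only. The name `parseCsv` is hypothetical.

```javascript
// Hypothetical sketch: each match is one field plus the delimiter after it.
//   group 1: body of a quoted field ("" stands for a literal quote)
//   group 2: body of an unquoted field
//   group 3: the delimiter: a comma, a newline, or empty at end of input
const parseCsv = (source) => {
  const fieldPattern = /(?:"((?:[^"]|"")*)"|([^,\n]*))(,|\n|$)/g;
  const rows = [];
  let row = [];
  for (const [, quoted, plain, delimiter] of source.matchAll(fieldPattern)) {
    row.push(quoted !== undefined ? quoted.replace(/""/g, '"') : plain);
    if (delimiter !== ',') {
      // A newline or end of input finishes the current row.
      rows.push(row);
      row = [];
      if (!delimiter) break; // end of input: stop before an empty extra match
    }
  }
  return rows;
};
```

Quoted fields may contain commas, newlines and doubled quotes. One known limitation of this sketch: input ending in a newline yields a final row holding a single empty string, which you may want to drop.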

Being Productive While Golfing

It's all well and good to make your code small, but it can quickly become unsustainable to work on. Some tips:

The Philosophy Section

I wrote but-csv in a reaction to huge CSV libraries that somehow need entire documentation sites to describe their usage. CSV is a mostly simple format—yes, you can use .split() a lot, but it has some edge cases, hence the library—but 10-20k and setup code just to parse some raw data? Give me a break.

My hope is that small, composable libraries should be able to wrap up complex behavior and present it to you in a concise way. There's a funnel, or maybe an hourglass ⌛ shape metaphor here:

But great software engineering should allow you to meet in the middle at a very succinct point. 💫

Thanks

I hope this has been a good read! Be sure to follow me on Twitter for more adventures in micro libraries, and use but-csv whenever you need to parse a CSV.

Thanks to mcpower for their help on bringing the size of but-csv down and ideas for this post. 📉