Tuesday, November 15, 2011

Code Size When Compiling to JavaScript

When compiling code to JavaScript from some other language, one of the questions is how big the code will be. This is interesting because code must be downloaded on the web, and large downloads are obviously bad. So I wanted to investigate this, to see where we stand and what we need to do (either in current compilers, or in future versions of the JavaScript language - being a better compiler target is one of the goals there).

The following is some preliminary data from two real-world codebases, the Bullet physics library (compiled to JavaScript in the ammo.js project) and Android's H264 decoder (compiled to JavaScript in the Broadway project):

Bullet

.js        19.2  MB
.js.cc      3.0  MB
.js.cc.gz   0.48 MB

.o          1.9  MB
.o.gz       0.56 MB

Android H264

.js       2,493 KB
.js.cc      265 KB
.js.cc.gz    61 KB

.o          110 KB
.o.gz        53 KB

Terms used: 

.js         Raw JS file compiled by Emscripten from LLVM bitcode
.js.cc      JS file with Closure Compiler simple opts
.js.cc.gz   JS file with Closure, gzipped

.o          Native code object file
.o.gz       Native code object file, gzipped

Notes on methodology:
  • Native code was generated with -O2. This leads to smaller code than without optimizations in both cases.
  • Closure Compiler advanced optimizations generate smaller JS code in these two cases, but not by much. While it optimizes better for size, it also does inlining which increases code size. In any case it is potentially misleading since its dead code elimination rationale is different from the one used for LLVM and native code, so I used simple opts instead.
  • gzip makes sense here because you can compress your scripts on the web using it (and probably should). You can even do gzip compression in JS itself (by compiling the decompressor).
  • Debug info was not left in any of the files compared here.
  • This calculation overstates the size of the JS files, because they have the relevant parts of Emscripten's libc implementation statically linked in. But, it isn't that much.
  • LLVM and clang 3.0-pre are used (rev 141881), Emscripten and Closure Compiler are latest trunk as of today.
Analysis

At least in these two cases it looks like compiled, optimized and gzipped JavaScript is very close to (also gzipped) native object files. In other words, the effective size of the compiled code is pretty much the same as you would get when compiling natively. This was a little surprising, I was expecting to see the size be bigger, and to then proceed to investigate what could be improved.

Now, the raw compiled JS is in fact very large. But that is mostly because the original variable names appear there, which is basically fixed by running Closure. After Closure, the main reason the code is large is because it's in string format, not an efficient binary format, so there are things like JavaScript keywords ('while', for example) that take a lot of space. That is basically fixed by running gzip since the same keywords repeat a lot. At that point, the size is comparable to a native binary.

Another comparison we can make is to LLVM bitcode. This isn't an apples-to-apples comparison of course, since LLVM bitcode is a compiler IR: It isn't designed as a way to actually store code in a compact way, instead it's a form that is useful for code analysis. But, it is another representation of the same code, so here are those numbers:

Bullet

.bc         3.9  MB
.bc.gz      2.2  MB

Android H264

.bc         365 KB
.bc.gz      258 KB

LLVM bitcode is fairly large, even with gzip: gzipped bitcode is over 4x larger than either gzipped native code or JS. I am not sure, but I believe the main reason why LLVM bitcode is so large here is because it is strongly and explicitly typed. Because of that, each instruction has explicit types for the expressions it operates on, and elements of different types must be explicitly converted. For example, in both native code and compiled JS, taking a pointer of one type and converting it to another is a simple assignment (which can even be eliminated depending on where it is later used), but in LLVM bitcode the pointer must be explicitly cast to the new type which takes an instruction.

So, JS and native code are similar in their lack of explicit types, and in their gzipped sizes. This is a little ironic since JS is a high level language and native code is the exact opposite. But both JS and native code are pretty space-efficient it turns out, while something that seems to be in between them - LLVM bitcode, which is higher than native code but lower than JS - ends up being much larger. But again, this actually makes sense since native code and JS are designed to simply execute, while LLVM bitcode is designed for analysis, so it really isn't in between those two.

(Note that this is in no way a criticism of LLVM bitcode! LLVM bitcode is an awesome compiler IR, which is why Emscripten and many other projects use it. It is not optimized for size, because that isn't what it is meant for, as mentioned above, it's a form that is useful for analysis, not compression. The reason I included those numbers here is that I think it's interesting seeing the size of another representation of the same compiled code.)

In summary, it looks like JavaScript is a good compilation target in terms of size, at least in these two projects. But as mentioned before, this is just a preliminary analysis (for example, it would be interesting to investigate specific compression techniques for each type of code, and not just generic gzip). If anyone has additional information about this topic, it would be much appreciated :)