IronJS Blog

All things IronJS and F#

Analyzer: Single-Pass vs. Multi-Pass

with 3 comments

I recently wrote about the new lexer and parser in IronJS, which gave an 8x performance boost to parsing. Over the past two days I’ve been looking at the AST analyzer IronJS has been using, and at how to improve it. The analyzer steps through the AST produced by the parser and figures out things like closures, static types, dynamic scopes, etc. Because discriminated unions in F# are immutable, the analyzer has been forced to rebuild the syntax tree whenever its analysis required changes to it, and it has also had to make several passes over the tree.

I’m glad to announce that with some clever use of reference cells I’ve been able to eliminate the need to rebuild the AST. In addition, thanks to having access to the internals of the new TDOP-based parser, I’ve managed to make the analyzer require only a single pass over the syntax tree. The performance difference is pretty staggering.
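To illustrate the reference-cell idea, here is a minimal sketch (the node and function names are invented for this example; the actual IronJS AST is different). Embedding a mutable ref cell inside an otherwise immutable discriminated-union case lets the analyzer annotate a node in place instead of rebuilding the whole tree:

```fsharp
// Hypothetical AST sketch: a ref cell carries a mutable annotation
// inside an immutable discriminated union.
type Node =
  | Number of double
  | Identifier of string * bool ref   // ref cell: "is this closed over?"
  | Binary of string * Node * Node

// The analyzer can flag a variable without reconstructing any nodes.
let markClosedOver node =
  match node with
  | Identifier (_, isClosedOver) -> isClosedOver.Value <- true
  | _ -> ()

let flag = ref false
let x = Identifier ("x", flag)
markClosedOver x
// flag.Value is now true, and the Identifier node was never rebuilt
```

The tree structure stays immutable; only the small, analyzer-owned cells mutate, which is what makes a single pass possible.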

[Chart: milliseconds to parse and analyze jQuery 1.5.1]

As you can see, with both the new parser and analyzer it’s a whopping ~13x faster than the old ANTLR-based parser and multi-pass analyzer. It’s also ~4x faster than the new parser combined with the old multi-pass analyzer.


Written by Fredrik Holmström

March 21, 2011 at 6:42 pm

Posted in IronJS

3 Responses


  1. Very interesting result. I wrote a generic TDOP/Pratt parsing library in C# which also performs scanning and lexing from a declarative grammar specification. I imagine you are doing these steps manually, as Crockford did.

    I haven’t run many benchmarks on my approach since it hasn’t been an issue for the small applications I’m doing for now, but it’s definitely the most expensive part of the library, so I’m curious where it would fall on this graph. I look forward to browsing your code to see how you approached the problem.

    Sandro Magi

    March 27, 2011 at 4:09 pm

    • There are a couple of reasons for the performance improvements. The main thing I did to the lexer was to force inlining on almost every function call by using the F#-specific inline keyword. Most .NET lexers do a function call to read each character, check the type of a character, etc. While the JIT can inline a lot, it can’t inline everything – and this ends up costing a lot of performance when iterating over something like the jQuery source, which is about 256 KB.
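      A hedged sketch of the inlining trick described above (these helper names are invented; they are not the actual IronJS lexer functions). Marking the per-character predicates inline means their bodies are expanded at every call site, so the hot scanning loop contains the comparisons directly rather than a call per character:

```fsharp
// 'inline' forces F# to expand these bodies at each call site,
// removing per-character call overhead the JIT might not eliminate.
let inline isDecimalDigit (c: char) = c >= '0' && c <= '9'
let inline isIdentifierStart (c: char) =
  (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c = '_' || c = '$'

// A tight loop over the source; the digit test compiles straight
// into the loop body instead of being a function call.
let countDigits (source: string) =
  let mutable n = 0
  for i in 0 .. source.Length - 1 do
    if isDecimalDigit source.[i] then n <- n + 1
  n
```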

      On the parser: as you’re familiar with TDOP parsers, I’m sure you know that most implementations use a string symbol and then some type of map keyed on that string to store the binding powers and denotations. This is how my original version worked too, but after some benchmarking it turned out that a great deal of time was spent looking up binding powers and denotations. So I converted every symbol to an integer in the range 0-150 (IronJS only has about ~100-odd tokens, but I added 50 extra to make room for the future), and then replaced the Map objects I had been using with 150-element arrays. So looking up a null denotation function goes from being something like this:

      let mutable nud = Unchecked.defaultof<NudDelegate>
      if nuds.TryGetValue(symbol, &nud) then ... else ...

      to this:

      let nud = nuds.[symbol]
      if Object.ReferenceEquals(nud, null) then ... else ...

      Which is light-years faster, since the ReferenceEquals method is static and always seems to be inlined by the JIT – not to mention there is no need to constantly convert a string to its hash value and do a dictionary lookup.
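      Put together, the array-indexed symbol table looks roughly like this (a self-contained sketch; the token ids and the delegate’s signature are made up for illustration, the real IronJS delegates take parser state):

```fsharp
// Null denotation as a delegate; real nuds would take parser state.
type NudDelegate = delegate of unit -> string

// Symbols are plain integers used directly as array indices.
let NumberSymbol = 0
let PlusSymbol   = 1

// 150-element array; entries with no nud registered stay null.
let nuds : NudDelegate[] = Array.zeroCreate 150
nuds.[NumberSymbol] <- NudDelegate(fun () -> "parsed a number literal")

// Lookup is one array index plus a static null check -
// no string hashing, no dictionary probe.
let tryNud (symbol: int) =
  let nud = nuds.[symbol]
  if System.Object.ReferenceEquals(nud, null) then None
  else Some (nud.Invoke())
```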

      Fredrik Holmström

      March 27, 2011 at 10:04 pm

      • Most Pratt parsers I’ve seen have been in dynamic languages which, as you point out, use pervasive associative data structures instead of a more structured abstraction. This just isn’t feasible in statically typed languages, which is actually fortunate because the natural solution of designing an abstraction is not only clearer, it’s faster too.

        I did the natural thing in my Pratt parsing library: I have Symbol and Token abstractions which have Nud and Led delegates as properties. If I understand what you’re saying, this should have the same performance advantages you discovered.
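        For what it’s worth, the shape I mean is roughly this (sketched here in F# since my library is C#; all names are illustrative, not my actual API):

```fsharp
// A token produced by the tokenizer.
type Token = { Symbol: string; Lexeme: string }

// A symbol carries its binding power and its nud/led delegates
// as typed properties, so lookup is a direct field access.
type Symbol<'T> =
  { Id  : string
    Lbp : int                          // left binding power
    Nud : (Token -> 'T) option         // null denotation
    Led : ('T -> Token -> 'T) option } // left denotation

// Example: a number literal symbol whose nud parses the lexeme.
let numberSymbol : Symbol<float> =
  { Id  = "(number)"
    Lbp = 0
    Nud = Some (fun t -> float t.Lexeme)
    Led = None }
```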

        Where I suspect I’d lose out is that my parsing library does the tokenizing for you too, whereas the original Pratt parser assumed a prior pass had created a token stream. A general tokenizing algorithm isn’t as efficient as a specialized one would be, but it certainly reduces the code you have to write to a single class that declares the operators and their precedences. That was my primary criterion.

        The only piece I’m missing, because I haven’t really needed it yet, is to make the symbol table dynamically extensible to introduce new symbols while parsing. This would be needed for languages that declare new operators, like F# and Haskell. I suspect you haven’t needed this either because JS doesn’t have this ability.

        Sandro Magi

        April 6, 2011 at 3:30 am
