Mar
22

Regex Performance in D Programming Language

Note: The problem described in this post was for DMD v2.054 and it no longer occurs with the more recent DMD v2.058  (see below for details or view the discussion at the D Forums).

I am currently working on a Ruby project that uses a lot of regexes on large volumes of text.  It is currently running too slowly, so I decided to try to optimise it by implementing the regex matching code in the D programming language.  D has given me a lot of joy (compared to C or C++)  by making things like string (with Unicode) handling a breeze without taking the performance hit of supposed “productivity” languages.  I painstakingly reimplemented my Ruby functions in D expecting a huge performance boost (actually I expected an order of magnitude performance jump) but instead I was shocked to see that Ruby outperformed my D code by a significant margin.  The Ruby implementation was finished after 80 seconds, whereas the D program required around 280 seconds using the exact same regexes and the exact same input.

I am not being a Ruby fan boy here, actually I will be the first in line to give you a long list of complaints I have with it, so please don’t view this as a “Ruby is awesome, everything else sucks article”.  It is also entirely possible that I am doing something really dumb in my implementation.  I’d like to ask if anybody out there has done any benchmarking with D (especially in the realm of regexes) if they know what might be going on and if there are known performance problems.  I will try to post my code (in Ruby and D) as soon as possible

Just as an aside, the D programming language is a very interesting language and if you haven’t already played with it then you really should.  I can also highly recommend The D Programming Language”.  There is, quite simply, no other book of this caliber about the D programming language.

Update 26 March 2012:

Apparently comments weren’t working on my blog, so sorry about that (the captcha plugin was sort of half activated, so I just switched it off until I get around to fixing it properly).  You are now able to comment on this post.  I got a nice email telling me that some people were curious over at the official D forums about my problem.  It seems that my post caught Andrei’s eye, so I thought I should get on with it and post my code and not leave people speculating on the forum.

The problem was as follows:  I was unable to post my work because it contained proprietary code and the data I was using was also somewhat sensitive, so I had to spend some time recreating the problem with “postable” code.  I recreated the problem using different input data and new regexes where I found D to vastly outperform Ruby.  This made my both happy and sad at the same time because I still needed to recreate the problem somehow.

My code loads a file from disk, stores the contents into an array and runs a series of regexes against each line in turn.  The time taken is only measured during the regex phase (so the file loading isn’t counted).  I don’t care if the regexes match or not, but I always run every regex against every line.  The regexes are the same in the Ruby and the D code (unless I’ve made a horrible mistake).

In the D code I am not using Compile Time Regexes, and my Regexes are cached before the test (i.e. I am not  recompiling them on every loop iteration).  Additionally,  I am using the more recent std.regex package and not the deprecated std.regexp package.

I am using Ruby 1.9.2 and the Digital Mars D compiler v2.054

I perform the test twice, once for a more or less random CSV (actually pipe delimited, but that doesn’t really matter) file and secondly for the Complete Works of Shakespeare in text format.  Actually, I stop after just 2000 lines because the performance difference is already obvious.  The regexes used are always the same.  The performance is very much dependent upon which input file I use:

With CSV file input:

D Program: 55,813 ms

Ruby Program: 4851 ms

With Shakespeare text:

D Program: 1973 ms

Ruby Program: 4683 ms

Maybe if somebody could take a look at my code you could tell me if I’m doing something dumb or not.

All the code and test files can be downloaded here: Benchmark.zip

Update 27 March 2012: The more recent v2.058 DMD compiler doesn’t suffer this performance problem anymore.  I just tested it and it now runs significantly faster than Ruby in both scenarios.  Thanks for the support in the D forums!

Tags: , , , ,

6 Responses to “Regex Performance in D Programming Language”

  1. Chang Long says:

    I use http://www.pcre.org/ with https://github.com/sleets/oak/blob/master/src/oak/util/Pcre.d

    If you are able to avoid implicit memory allocate, The D will be very fast on most situation.

  2. Jesse Phillips says:

    Look for the use of ‘regex(”pattern”)’ inside loops, especially tight ones. I can’t compare with Ruby performance as I won’t be rewriting my work there. But the latest Regex engine doesn’t handle this well (make since it has to do compilation of the regex each time). Also ctRegex may provide some performance gains.

  3. James says:

    Thanks for your comment Jesse.
    I have posted my Ruby and D source code in the post. My regexes are pre-compiled and cached, so I don’t think that this is the problem.

    I have yet to try ctRegex.
    James

  4. James says:

    Thanks for the tip Chang, I’ll take a look.

  5. Jesse Phillips says:

    Based on your compiler version, you won’t have ctRegex. 2.057 has the newer regex engine, the old one doesn’t appear to have the slowdown from the problem I described either.

  6. Yea, D’s regex engine was completely overhauled in 2.057, so using that should make a big difference. In fact, some people over at the D newsgroup have already reported that merely upgrading to the latest DMD makes all the difference and puts the D version about 3x *faster* than the Ruby version for *both* input files:

    http://forum.dlang.org/post/ybhefxwftrtjqudzcnoo@forum.dlang.org

    So there ya go.