girtby.net » boost http://girtby.net this blog is girtby.net Thu, 17 Sep 2009 14:27:44 +0000 http://wordpress.org/?v=2.9-rare en hourly 1 Wide Finder in C++ http://girtby.net/archives/2007/10/09/wide-finder-in-c/ http://girtby.net/archives/2007/10/09/wide-finder-in-c/#comments Tue, 09 Oct 2007 11:07:00 +0000 alastair http://girtby.net/2007/11/09/wide-finder-in-c Have you been following Tim Bray’s Wide Finder project? It’s an exercise to express a fairly simple task in a manner that will scale across multiple CPU cores. Some of Tim’s initial progress with Erlang, and other contributors in different languages, is quite fascinating.

Like Tim I was also amused at Pete Kirkham’s C++ implementation which was purported to be shorter than an initial attempt by an Erlang expert (in Erlang obviously). However on closer examination it seems that Pete’s C++ implementation was simply handling the I/O portion with simplified parsing and not the subsequent sorting.

So as another C++ coding Kata I decided to have a go. Whereas Tim’s goal was to evaluate different methods of expressing algoritms for parallel computation, mine was a lot more modest: just get it running concisely in C++ and compare performance with the raw Ruby version. Here’s what I came up with.

#include <iostream>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/regex.hpp>

using namespace std;

const boost::regex get_re("GET /ongoing/When/\\d{3}x/(\\d{4}/\\d{2}/\\d{2}/[^ .]+) ");

typedef map<string, unsigned> counts_by_key_t;
typedef multimap<unsigned, string> keys_by_count_t;

int main(int argc, char *argv[])
{
  counts_by_key_t counts_by_key;
  for(int arg = 1; arg < argc; ++arg) {
    try {
      boost::iostreams::mapped_file_source mf(argv[arg]);

      boost::cregex_iterator regi(mf.begin(), mf.end(), get_re), rege;
      for(; regi != rege; ++regi) {
        counts_by_key[(*regi)[1].str()] += 1;
      }
    } catch (ios::failure e) {
      cerr << argv[arg] << ": " << e.what() << endl;
      return 1;
    }
  }

  keys_by_count_t keys_by_count;
  for(counts_by_key_t::const_iterator i = counts_by_key.begin();
      i != counts_by_key.end(); ++i) {
    keys_by_count.insert(make_pair(i->second, i->first));
  }

  unsigned n = 10;
  for(keys_by_count_t::reverse_iterator ri = keys_by_count.rbegin();
      n && ri != keys_by_count.rend(); ++ri, --n) {

    cout << ri->first << ": " << ri->second << endl;
  }
  return 0;
}

So at 42 lines this one really is shorter than the 84-line Erlang version. Not a million miles away from the Ruby version either, in length if not in readability. However, I’ve made some simplifications, or taken some liberties, depending on your point of view:

  • Firstly, Boost is a third-party library and hence this is not standard C++. Given the (general) difficulty of incorporating libraries into C++ apps this might be more of a problem than on other languages where CPAN/RubyForge/CheeseShop/whatever rules supreme. However I would argue that Boost is such an indispensable part of modern C++ development that relying on it is quite acceptable, even for tasks like this.

  • Also, I’m using Boost’s mapped_file_stream, which uses a memory map to iterate through the file. This is very useful and quick but obviously places limits on the size of the file. Fortunately it handles oversize files gracefully.

  • That reverse_iterator should be a const_reverse_iterator, but I had to work around a gcc bug.

  • C++ generic programming purists may be scoffing at my explicit iteration through the containers. Try as I might I could not formulate a way of using the standard algorithms without signficantly increasing the complexity of the code. Inverting the counts_by_key map to create the keys_by_count multimap sounds particularly like the sort of thing that should be possible using the standard algorithms, but I was unable to work it out.

Removing the explicit iteration may prove to be a useful exercise, in order to get better parallelization. I can picture a class of STL algorithms which are smart enough to automatically distribute work amongst different worker threads, coordinate their shared state, etc. A parallel_for_each algorithm perhaps. Fully utilising existing algorithms such as for_each seems like a necessary first step towards this goal.

My other goal with this was to look at performance. The C++ version above processes the 200MB file in about 4.5 seconds on my laptop (running WinXP, code compiled with MSVC++ 8.0).

By comparison the Ruby version runs in 5.0 seconds. This is pretty damn impressive if you ask me, given that it’s not using memory-mapped IO.

For an encore I would like to convert the code above to use regular file I/O. An initial attempt to use the file_iterator class (from the Spirit library, also part of Boost) was not hopeful; boosting the runtime up to 30 seconds. I also looked at using the standard C++ iostream iterators but the regex matching needs bidirectional iterators, and they aren’t.

Oh, and don’t blame me that the code isn’t unicode safe. That was part of the brief!

]]>
http://girtby.net/archives/2007/10/09/wide-finder-in-c/feed/ 4