girtby.net » performance http://girtby.net this blog is girtby.net Thu, 17 Sep 2009 14:27:44 +0000 http://wordpress.org/?v=2.9-rare en hourly 1 Wide Finder 2: The Widening http://girtby.net/archives/2008/07/03/wide-finder-2-the-widening/ http://girtby.net/archives/2008/07/03/wide-finder-2-the-widening/#comments Thu, 03 Jul 2008 03:27:00 +0000 alastair http://girtby.net/2008/07/08/wide-finder-2-the-widening <movie-trailer-guy> Many months ago he attempted Tim Bray’s first Wide Finder in C++, mainly as a coding exercise. Back then the goal was readability and conciseness. This time … it’s performanceal. </movie-trailer-guy>

Targeting the Hardware of the Future

Computer architecture evolution is currently in the process of changing direction. Instead of CPUs becoming progressively ever faster, they are going “wider”. This means more processing cores, each of which is relatively low-powered. The combined processing power of multiple cores is greater than is achievable with traditional single core architectures, but requires new programming techniques. These techniques are perhaps well-understood at a theory level, but the Wide Finder project is an attempt to put theory into practice with real-world tasks by everyday coders, such as myself.

The goal of a Wide Finder 2 implementation is to produce some simple statistics from a very large (42GB) data file. The target machine is a Sun Fire T2000 with 8 cores (32 threads) and 32 GB of RAM. The task is relatively simple: we have to read the file, which contains web server log entries, and produce some elementary statistics such as the top 10 pages by hit, the top 10 clients, and so forth. Obviously it’s I/O bound … or is it?

I attempted this in (hopefully) idiomatic C++, using the Boost libraries. For those unfamiliar with Boost, you can think of it as the non-standard library, or maybe the unofficial standard libary. Basically if you’re not using Boost libraries today, you soon will be, because many of the them have gone on to form the basis for the TR1 standard library extensions, and hence targeted for inclusion in the next official C++ standard, known as C++0x. Thanks to Boost, my code is completely portable, compiling on both gcc (on my Mac) and Sun Studio compiler (on the T2000) without even a single #ifdef.

The Results, So Far

So, C++ should be able to whip all those other languages like Java into submission you might think, right? Well, I think the results speak for themselves. My time of 16 minutes is prettymuch in the middle of the pack. Not bad, but not outstanding either, and still behind some of the Java (or JVM-based) implementations. I have a few more optimisations up my sleeve though, so we’ll see how they pan out.

However the clear winner in my view, and hence worthy of much more recognition than it currently enjoys, is OCaml. With a run time of 5 minutes this is perilously close to the raw I/O speed sustainable on this box. Not only that but it was all done with ~150 lines of code which is frankly amazing (especially compared to my ~500 line C++ implementation). So OCaml is definitely a language to look at, in my humble opinion.

Hit the links from that results page for some often fascinating insights from the other Wide Finder implementers.

How To Go Wide

Fundamentally I think my approach is fairly similar to many of the other Wide Finders. This was to divide the input file into chunks, then walk through each using multiple threads. Each thread accumulates statistics in the form of hash tables. The individual hash tables are then merged before finally being sorted to produce the top-10 reports.

I want to share in detail some of the techniques that I used but like I said I’m still refining them. For now let me just point to a couple of key techniques that got me into contention for this project:

  • When it comes to this much data, you can’t afford to copy anything. This means I/O using mmap, and doing as much processing “in-place” as possible. For C++ this also means throwing out the iostreams library (even though it is otherwise quite well suited to this type of task, as demonstrated previously). And even with a 64-bit binary, you really don’t want to mmap an entire 42GB data file into memory, trust me [Update: Or maybe not. See comment below]. So I ended up with some quite ugly code to deal with mmap-ing segments of the input file while respecting chunk (ie line) boundaries and page alignment boundaries.

  • Multi-threaded C++ applications are always at risk of experiencing contention, but this is especially so when it comes to memory allocation. Doing too much allocation can kill any parallel processing you do with C++, because malloc and free both grab global mutexes (or worse, spinlocks) prior to doing their stuff. To solve this problem I ended up using a custom thread-specific memory allocator based on the Boost.Pool library. This minimises the number of times that the thread grabs memory during the normal course of operation. Code forthcoming.

Mad props to Shark, which is part of the Mac OS X developer suite. It pinpointed my bottlenecks very quickly and simply. An indispensable tool, don’t jump over this Shark.

Future Work

So I’ve got some more ideas on how to go even wider, and I’ll update here with the results. Basically it’s applying the above two techniques more thoroughly; minimise copying of data, and minimise memory allocation. (In particular the Boost.Regex library is next on my list to convert to using a thread-specific memory pool). Will let you know how it goes.

]]>
http://girtby.net/archives/2008/07/03/wide-finder-2-the-widening/feed/ 7
Required Viewing http://girtby.net/archives/2007/11/06/required-viewing/ http://girtby.net/archives/2007/11/06/required-viewing/#comments Tue, 06 Nov 2007 22:39:00 +0000 alastair http://girtby.net/2007/11/13/required-viewing If you’re at all interested in computing technology you can’t help but be amazed at the advances in CPU power over the last few decades, Moore’s Law, blah blah blah. But a few seconds pondering this invariably provokes the question as to how long this party can last.

The commonly accepted wisdom is that CPUs have gotten about as fast as they are likely to go in terms of sheer clock speed, and now manufacturers are turning to multiprocessing to provide more processing power for a given price point. The recent Intel price drops which made the quad-core Q6600 CPU available for less than AUD400 are a highly relevent (and welcome) data point to illustrate this trend.

This raises lots of hairy questions for developers, such as “how are we going to design our software to run efficiently in a multi-processing environment?” The previously-linked wide finder experiment is an attempt to explore some of these issues. And it’s pretty obvious that so far there is no silver bullet.

But wait, it gets worse. I will point you to a long but highly thought-provoking presentation from Herb Sutter. Turns out we are already hitting major architectural hurdles in the form of memory access limitations, and we’ll need to find some solutions for these before tackling the parallel computation problem.

Sutter’s presentation is deeply technical, but still quite accessible, and delivered with an engaging style that makes it required viewing. Highly recommended.

I recently had some experience diagnosing some memory-related performance problems (not quite in the same class as that discussed by Sutter, but similar) and I have to say there is a serious deficit in the development tools for these kinds of problems. Currently we need to look aggregate behaviour over multiple iterations to isolate some of these problems, and this is a difficult and error-prone approach. For example, check out Sutter’s technique to discover the memory cache line size in code. In the future it would be great if we could monitor cache misses, pipeline stalls, page faults, and other performance-impacting events within the debugger.

These issues also make me wonder about how higher-level languages are going to provide appropriate abstractions to avoid the performance problems. For example, garbage collection is a major win for programmer productivity but it does encourage memory usage patterns that are not always conducive to performance given architectural limitations in the underlying hardware. The same abstraction problems affect C/C++ of course but at least there is the option to go “bare-metal” where necessary.

Whatever the answers are here, it’s certain there are some interesting times ahead for developers.

]]>
http://girtby.net/archives/2007/11/06/required-viewing/feed/ 2