Comments on: Wide Finder 2: The Widening
http://girtby.net/archives/2008/07/03/wide-finder-2-the-widening/
girtby.net, Wed, 30 Sep 2009 01:44:34 -0400

By: Chris http://girtby.net/archives/2008/07/03/wide-finder-2-the-widening/comment-page-1/#comment-1801 Thu, 03 Jul 2008 03:27:00 +0000

Grats. That’s looking pretty good. Have you looked at something like Hoard (http://prisms.cs.umass.edu/emery/index.php?page=download-hoard) to help with mallocs? You’re in the best position to profile the parts that are easiest to improve, but Hoard might be a pretty easy drop-in if memory allocation is still hurting.

By: Alastair http://girtby.net/archives/2008/07/03/wide-finder-2-the-widening/comment-page-1/#comment-1802 Thu, 03 Jul 2008 03:27:00 +0000

Thanks Chris. Yes, I have looked at Hoard, and in fact we use it at $WORK. However, I’m finding I get a much better performance improvement by switching to a Boost.Pool accessed through a thread-specific pointer. This drastically reduces the need to do malloc in the first place.

On the flip side, I can reduce the need to deallocate memory simply by introducing memory leaks … yes, you read that right! Basically, when you have many millions of objects allocated, it can take quite a while to deallocate them all. There’s something like a 10-second pause after processing the 42G data file while my application de-allocates all of the objects, so an easy performance win is simply to Not Do That. Every second counts, particularly when you have Java implementations to beat …
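
To make the two ideas above concrete, here is a minimal sketch of a per-thread allocator that never frees. It is not Boost.Pool and not the code from the actual Wide Finder 2 entry: `LeakyArena`, `thread_arena` and `make_leaked` are hypothetical names invented for illustration, and a C++11 `thread_local` stands in for the thread-specific pointer. Each thread carves objects out of its own large chunks, so threads never contend on a lock and `malloc` is called only once per chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <utility>

// Hypothetical bump allocator: hands out slices of large malloc'd chunks
// and never returns them. Intended to be used by one thread only.
class LeakyArena {
public:
    explicit LeakyArena(std::size_t chunk_bytes = std::size_t(1) << 20)
        : chunk_bytes_(chunk_bytes) {
        assert(chunk_bytes > 0);
    }

    void* allocate(std::size_t bytes) {
        // Round up so every returned pointer keeps maximal alignment.
        const std::size_t align = alignof(std::max_align_t);
        bytes = (bytes + align - 1) / align * align;
        if (bytes > remaining_) refill(bytes);
        void* p = cursor_;
        cursor_ += bytes;
        remaining_ -= bytes;
        return p;
    }

    std::size_t chunks_allocated() const { return chunks_; }

    // Deliberately no destructor and no deallocate(): the memory is left
    // for the operating system to reclaim when the process exits.

private:
    void refill(std::size_t at_least) {
        const std::size_t size = at_least > chunk_bytes_ ? at_least : chunk_bytes_;
        char* chunk = static_cast<char*>(std::malloc(size));
        if (chunk == nullptr) throw std::bad_alloc();
        cursor_ = chunk;     // whatever was left of the old chunk is abandoned
        remaining_ = size;
        ++chunks_;
    }

    std::size_t chunk_bytes_;
    char* cursor_ = nullptr;
    std::size_t remaining_ = 0;
    std::size_t chunks_ = 0;
};

// One arena per thread, so allocation needs no locking.
inline LeakyArena& thread_arena() {
    thread_local LeakyArena arena;
    return arena;
}

// Constructs a T in the calling thread's arena. Its destructor never runs.
template <typename T, typename... Args>
T* make_leaked(Args&&... args) {
    return new (thread_arena().allocate(sizeof(T))) T(std::forward<Args>(args)...);
}
```

The trade-off is the one described above: nothing is ever destroyed or freed, which only makes sense for a short-lived batch job whose objects all live until exit anyway.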

By: Thatcher http://girtby.net/archives/2008/07/03/wide-finder-2-the-widening/comment-page-1/#comment-1803 Thu, 03 Jul 2008 03:27:00 +0000

Re: thread contention in malloc, are you aware of tcmalloc? http://goog-perftools.sourceforge.net/doc/tcmalloc.html

-T

By: Brian http://girtby.net/archives/2008/07/03/wide-finder-2-the-widening/comment-page-1/#comment-1804 Thu, 03 Jul 2008 03:27:00 +0000

> And even with a 64-bit binary, you really don’t want to mmap an entire 42GB data file into memory, trust me.

Well, I don’t trust you. I currently and routinely mmap in 10+TB in one shot on a 64-bit machine. So what’s the problem? Elaborate, please?

By: Alastair http://girtby.net/archives/2008/07/03/wide-finder-2-the-widening/comment-page-1/#comment-1805 Thu, 03 Jul 2008 03:27:00 +0000

> Well I don’t trust you.

Well, as it turns out, Brian was right not to trust me. I was too hasty in condemning the all-at-once mmap.

Here’s a test app:

boost::iostreams::mapped_file_source source(argv[1]);
unsigned lines = std::count(source.begin(), source.end(), '\n');
std::cout << argv[1] << ": " << lines << " lines" << std::endl;

As you can see, it just counts the number of \n characters in the requested file. I ran this on the Wide Finder 2 full data set, and here’s what happened:

~/wf2/ 512> time ./readmmap /wf1/data/logs/O.all
/wf1/data/logs/O.all: 218201129 lines

real    16m0.565s
user    7m42.454s
sys     7m53.235s

Two important things to note here. First, obviously mmap-ing the whole file works as advertised. But more importantly, it seems that my Wide Finder 2 implementation is already running at I/O speed.

But other Wide Finder 2 implementations are going faster, which raises the obvious question as to how. mmap is traditionally the fastest form of I/O, given that it doesn’t have to copy the data from kernel space into user space. But obviously that rule doesn’t hold any longer, at least for Solaris.

More investigation needed, I feel.
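
As a possible starting point for that investigation, here is a self-contained sketch that counts newlines two ways: through a single whole-file mapping, and through plain `read()` calls into a fixed buffer. It is not the test app above: it assumes a POSIX system and calls `mmap` directly instead of going through `boost::iostreams::mapped_file_source`, and the function names and the 1 MiB default buffer size are arbitrary choices made for illustration. Timing the two functions on the same file would show whether the copy into user space really is the bottleneck.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Counts '\n' bytes by mapping the whole file in one shot.
std::size_t count_newlines_mmap(const std::string& path) {
    const int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) throw std::runtime_error("cannot open " + path);
    struct stat st;
    if (::fstat(fd, &st) != 0) {
        ::close(fd);
        throw std::runtime_error("cannot stat " + path);
    }
    const std::size_t size = static_cast<std::size_t>(st.st_size);
    if (size == 0) {  // mmap rejects a zero-length mapping
        ::close(fd);
        return 0;
    }
    void* map = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) throw std::runtime_error("cannot mmap " + path);
    const char* begin = static_cast<const char*>(map);
    const std::size_t lines =
        static_cast<std::size_t>(std::count(begin, begin + size, '\n'));
    ::munmap(map, size);
    return lines;
}

// Counts '\n' bytes with sequential read() calls into a reusable buffer.
std::size_t count_newlines_read(const std::string& path,
                                std::size_t buf_bytes = std::size_t(1) << 20) {
    assert(buf_bytes > 0);
    const int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) throw std::runtime_error("cannot open " + path);
    std::vector<char> buf(buf_bytes);
    std::size_t lines = 0;
    for (;;) {
        const ssize_t got = ::read(fd, buf.data(), buf.size());
        if (got < 0) {
            ::close(fd);
            throw std::runtime_error("read failed on " + path);
        }
        if (got == 0) break;  // end of file
        lines += static_cast<std::size_t>(
            std::count(buf.begin(), buf.begin() + got, '\n'));
    }
    ::close(fd);
    return lines;
}
```

Both functions return `std::size_t` rather than `unsigned`, so the count cannot wrap on a file with more than about four billion lines.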

By: Sunny Kalsi http://girtby.net/archives/2008/07/03/wide-finder-2-the-widening/comment-page-1/#comment-1806 Thu, 03 Jul 2008 03:27:00 +0000

Could it be that these implementations are reading from the file sequentially? If you read from the file sequentially in tiny chunks, dynamically starting threads and letting them die, you might get better results than using something which will possibly cache your data or maybe cause your disk to do random reads instead of sequential. If it were me I’d try something like this:

  1. Have 1 thread keeping stats.
  2. Have dynamically starting threads on block boundaries, i.e. read 4k at a time (or whatever the HDD’s block size is) and start the thread with that instead of explicitly searching for a newline. These will send messages to two threads: one for the statistics, and another for a “residual” (a fragment at the edge of a block that is not a complete line).
  3. Have another thread which waits for residuals and matches them up. Once it gets a bunch of ’em, it can dynamically start one of the threads in (2).

The only downside is the mallocs. IMO you need to copy data or else you’re hosed from a multi-threaded perspective. Just thinking out loud here…
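
The scheme above can be sketched roughly as follows. This is a simplification, not a tuned implementation: the input is assumed to be already in memory, the "statistic" is just a line count, one thread is started per block (a real program would cap the number of live threads), and the residuals are matched up on the calling thread after the workers finish rather than by a dedicated thread receiving messages. All names (`BlockResult`, `scan_block`, `count_lines_blockwise`) are made up for illustration. The residuals are copied into strings, which is where the mallocs mentioned above come in.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

struct BlockResult {
    bool has_newline = false;
    std::size_t inner_lines = 0;  // complete lines strictly inside the block
    std::string head;  // bytes before the first '\n' (or the whole block)
    std::string tail;  // bytes after the last '\n'
};

// Step 2: scan one block without looking outside its boundaries.
BlockResult scan_block(const char* begin, const char* end) {
    BlockResult r;
    const char* first = std::find(begin, end, '\n');
    if (first == end) {  // no newline: the whole block is a residual
        r.head.assign(begin, end);
        return r;
    }
    r.has_newline = true;
    const char* last = end;
    while (*(last - 1) != '\n') --last;  // one past the final '\n'
    r.head.assign(begin, first);
    r.tail.assign(last, end);
    r.inner_lines = static_cast<std::size_t>(std::count(first + 1, last, '\n'));
    return r;
}

// Steps 1 and 3: gather per-block counts and stitch residuals into lines.
// If `stitched` is non-null it receives every line rebuilt from residuals.
std::size_t count_lines_blockwise(const std::string& data,
                                  std::size_t block_bytes = 4096,
                                  std::vector<std::string>* stitched = nullptr) {
    assert(block_bytes > 0);
    const std::size_t nblocks = (data.size() + block_bytes - 1) / block_bytes;
    std::vector<BlockResult> results(nblocks);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < nblocks; ++i) {
        const char* b = data.data() + i * block_bytes;
        const char* e = data.data() + std::min(data.size(), (i + 1) * block_bytes);
        // Each worker writes only to its own slot, so no locking is needed.
        workers.emplace_back([&results, i, b, e] { results[i] = scan_block(b, e); });
    }
    for (std::thread& w : workers) w.join();

    std::size_t lines = 0;
    std::string carry;  // unfinished line carried over from earlier blocks
    for (const BlockResult& r : results) {
        lines += r.inner_lines;
        carry += r.head;
        if (r.has_newline) {  // the carried fragment is now a complete line
            ++lines;
            if (stitched != nullptr) stitched->push_back(carry);
            carry = r.tail;
        }
    }
    if (!carry.empty()) ++lines;  // final line with no trailing '\n'
    return lines;
}
```

For example, `count_lines_blockwise("alpha\nbeta\ngamma\ndelta", 4)` splits the input into six 4-byte blocks, none of which holds a complete line, and still reports 4 lines once the residuals are joined.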

By: Marc http://girtby.net/archives/2008/07/03/wide-finder-2-the-widening/comment-page-1/#comment-1807 Thu, 03 Jul 2008 03:27:00 +0000

Hello,

What compiler options are you using? In particular, are you making use of prefetch? Specifying a page size? On something as simple as the example that counts the newlines, it could make a difference.
