C++ 1, Unicode 0
Posted by alastair on girtby.net, Fri, 09 Mar 2007 11:14:00 +0000
http://girtby.net/archives/2007/03/09/c-1-unicode-0/

Yes, another C++ post. Yes, I’ve been doing a lot of it lately.

Recently on WorseThanFailure there have been several instances of functions intended to perform relatively simple string manipulation tasks. Being worthy of posting to WTF, they have of course been hilariously over-long, complicated and bug-ridden. One recent example was attempting to compare two strings in a case-insensitive manner. Another was attempting to remove spaces from a string.

I’m going to have a go at a similar problem, namely writing a C++ program to count whitespace characters in a file.

Ready to follow along? Good!

Generating a Test File

Let’s use this as a test file to start:

Foo Bar Baz

Looks easy, right? Of the 12 characters, our app should count 3 whitespace characters; one for each of the spaces and one for the newline.

But wait a sec, things aren’t as simple as they appear. View source and you’ll see that the second space isn’t an ordinary space. It’s a Unicode NO-BREAK SPACE (U+00A0). Hmm, I wonder how that will turn out?

Well if we’re using Unicode we have to decide on an encoding. These days the default choice pretty much has to be UTF-8. Those of you who are following along at home may want to use the following Python command to write out the test file:

python -c 'open("foo.txt", "wb").write(u"Foo Bar\u00A0Baz\n".encode("utf-8"))'
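If Python isn't handy, roughly the same file can be produced from C++ itself. This is only a sketch: the helper name write_test_file is made up for illustration, and it assumes we are happy to spell out the UTF-8 bytes of the no-break space (0xC2 0xA0) by hand.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes the test line as raw UTF-8 bytes.
// U+00A0 (NO-BREAK SPACE) is the two-byte sequence 0xC2 0xA0 in UTF-8.
bool write_test_file(const std::string& path)
{
  // Binary mode, so the newline is written as a single byte everywhere.
  std::ofstream out(path.c_str(), std::ios::binary);
  // The escapes sit in their own literal so that "Baz" is not
  // swallowed as extra hex digits.
  out << "Foo Bar" << "\xC2\xA0" << "Baz\n";
  return out.good();
}
```

Calling write_test_file("foo.txt") should leave a 13-byte file containing 12 characters.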

As a programmer, particularly a C++ programmer, you may be getting an uneasy feeling at this point. But in case you think my test case is particularly contrived, let me just point out that no-break spaces are very commonly used, particularly on the Internet. Ditto UTF-8.

Attempt #1: The Textbook Approach

On a first attempt we might think about writing a program to take its input from stdin and write output on stdout. This can be trivially used with file based input, and could also be useful if we wanted to use it on the output of another program through a pipe.

For the sake of brevity, I’ll assume the appropriate #includes and using namespace std; declarations have been made. And so we might end up with something like this.

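int main()
{
  // otherwise >> would skip the very whitespace we want to count
  cin.unsetf(ios_base::skipws);
  unsigned ccount = 0, wscount = 0;
  char ch;

  while (cin >> ch) {
    // cast first: isspace() is undefined for negative char values
    if (isspace(static_cast<unsigned char>(ch))) {
      ++wscount;
    }
    ++ccount;
  }
  cout << wscount << " whitespace characters out of " << ccount << endl;
  return 0;
}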

Basically this just iterates over the input, a char at a time. It is literally a textbook approach, as Stroustrup shows something similar in The C++ Programming Language section 21.3.4.

Let’s see how it goes with the test file. Remember we are after 3 whitespace out of 12 characters.

$ ./wscount1 < foo.txt
2 whitespace characters out of 13

Nope. It didn’t even read the right number of characters! Stepping through with a debugger it’s easy to see what is going on. The no-break space is encoded as two bytes, and hence read by our program as two separate characters, neither of which is counted as a space.
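To make the byte-versus-character mismatch concrete, here is a small sketch that counts characters in a UTF-8 string by ignoring continuation bytes. The function name count_code_points is mine, and it assumes the input is well-formed UTF-8; it does no validation at all.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx. Assumes well-formed input.
std::size_t count_code_points(const std::string& utf8)
{
  std::size_t n = 0;
  for (std::size_t i = 0; i < utf8.size(); ++i) {
    unsigned char byte = static_cast<unsigned char>(utf8[i]);
    if ((byte & 0xC0) != 0x80) {
      ++n;  // plain ASCII or a lead byte: a new character starts here
    }
  }
  return n;
}
```

Fed the contents of the test file, this sees 13 bytes but only 12 characters, because the second byte of the no-break space (0xA0) is a continuation byte.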

The fact that it failed on UTF-8 input should come as no surprise but I feel it’s worth highlighting this because the char-at-a-time model is extremely widespread. In fact, reading through the attempted solutions to a similar problem on WorseThanFailure, I got to page 4 of the comments before someone even asked the question “hey, what about multibyte characters?”

Attempt #2: Wide Characters

So pretty obviously we can’t use a char if we are going to be dealing with individual characters from a set of more than 256. Fortunately C++ gives us wchar_t, whose size is implementation-defined but is guaranteed to be big enough to hold characters of the “implementation” character set (more on this later).

The change to wchar_t is necessary, but not sufficient. I won’t show it, but trust me, the result is the same, 2 whitespace out of 13 characters.

The problem is that we haven’t told the iostream how to decode the incoming bytes. In the absence of this information, the iostream does the only thing it can do, namely push every input byte into a separate wchar_t. Not particularly useful.

More intelligent conversion of incoming data is one of the functions of the locale classes. The relevant “facet” of the locale object is called codecvt. It is a template class with an in() method that looks like this:

result
in(state_type& __state, const extern_type* __from,
   const extern_type* __from_end, const extern_type*& __from_next,
   intern_type* __to, intern_type* __to_end,
   intern_type*& __to_next) const

It’s a method signature only a mother could love. But the good news is that you don’t have to call it directly, because iostreams will do it for you. As long as we’re talking file streams, that is. So attempt #2 at the white space problem needs to be written in terms of file streams. And that means some extra error handling and other stuff:

int main(int argc, char * argv[])
{
  for (int a = 1; a < argc; ++a) {
    wifstream fs(argv[a]);
    // otherwise >> would skip the very whitespace we want to count
    fs.unsetf(ios_base::skipws);
    unsigned ccount = 0,  wscount = 0;
    wchar_t ch;

    while (fs >> ch) {
      // iswspace() is the wide-character counterpart of isspace()
      if (iswspace(ch)) {
        ++wscount;
      }
      ++ccount;
    }

    cout << argv[a] << ": ";
    // anything other than a clean end-of-file counts as an error
    if (fs.bad() || !fs.eof()) {
      cout << "error encountered after " << ccount << " characters" << endl;
      return 1;
    } else {
      cout << wscount << " whitespace characters out of " << ccount << endl;
    }
  }
  return 0;
}

Looks a bit more promising, if you forgive the poor error reporting. But we still only get 2 whitespace characters out of 13. What’s going on?

Diversion into locales

I mentioned above that the character set conversion routines live inside the locale part of the standard library. This is a slightly odd place for them to live, but I expect that historically, different regions often had their own unique character sets. In a Unicode world this is no longer the case.
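For the curious, the conversion routine can be reached by hand through use_facet and called directly. The following is a rough sketch rather than anything iostreams literally does internally; the helper name widen_with is made up, and which bytes convert successfully depends entirely on the locale you pass in.

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Converts a narrow string to wide characters by calling codecvt::in()
// directly, using whatever conversion the given locale provides.
// Returns an empty string if the facet reports an error.
std::wstring widen_with(const std::locale& loc, const std::string& bytes)
{
  typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
  const facet_type& cvt = std::use_facet<facet_type>(loc);

  if (bytes.empty()) {
    return std::wstring();
  }

  std::mbstate_t state = std::mbstate_t();
  // Each wide character consumes at least one byte, so this is enough room.
  std::wstring wide(bytes.size(), L'\0');
  const char* from_next = 0;
  wchar_t* to_next = 0;

  std::codecvt_base::result r = cvt.in(
      state,
      bytes.data(), bytes.data() + bytes.size(), from_next,
      &wide[0], &wide[0] + wide.size(), to_next);

  if (r != std::codecvt_base::ok && r != std::codecvt_base::partial) {
    return std::wstring();
  }
  wide.resize(to_next - &wide[0]);  // keep only what was actually converted
  return wide;
}
```

With the classic locale and plain ASCII input such as "Foo" this should hand back L"Foo". What happens to bytes above 127 is up to the locale, which is rather the point of this whole diversion.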

Anyway, the wifstream constructor above is taking a snapshot of the global locale and using it for converting the incoming characters. So what is the global locale? Could it be something to do with the current user’s locale, as visible when you type locale on the unix command line?

Well, not necessarily. At startup, the global locale is set to the “classic” or “C” locale. For maximum compatibility, this is a very simple locale, and assumes an ASCII character set for input data. On the other hand, the current user locale is referred to by an empty string.
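This is easy enough to check. A default-constructed locale object is a copy of the current global locale, so a quick sketch like the following (function names are mine) shows what a program starts out with, assuming nothing has called locale::global() or similar beforehand.

```cpp
#include <locale>
#include <string>

// A default-constructed std::locale is a copy of the current global locale.
std::string global_locale_name()
{
  return std::locale().name();
}

// True while the global locale is still the classic "C" locale.
bool global_is_classic()
{
  return std::locale() == std::locale::classic();
}
```

At program startup global_locale_name() should give "C" regardless of what the user's environment says.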

Here’s a program to print the name of the current user locale:

int main()
{
  cout << "user locale is: " << locale("").name() << endl;
  return 0;
}

Here’s where things get tricky, or at least operating system dependent. Running the above program on MacOS X I get:

$ ./cpplocale
user locale is: C

Not particularly helpful. Using the -a option to locale I can see that there are lots of other locales installed. Let’s see what happens when I try to use one:

$ locale -a | grep en_AU
en_AU
en_AU.ISO8859-1
en_AU.ISO8859-15
en_AU.US-ASCII
en_AU.UTF-8
$ LC_ALL="en_AU" ./cpplocale
terminate called after throwing an instance of 'std::runtime_error'
  what():  locale::facet::_S_create_c_locale name not valid
user locale is: Abort trap

From a brief play it looks like none of the installed locales (besides “C” of course) are available to C++ programs on MacOS X. Boo!

Here’s how it should work, courtesy of Ubuntu Linux:

$ ./cpplocale
user locale is: en_AU.UTF-8

Note that we still don’t have a portable way of specifying that the input file is UTF-8 encoded. Aside from the classic and the user locale, none of the locale names are standardised.
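Given that locale names aren't portable, and that an unrecognised name throws std::runtime_error (as MacOS X demonstrated so gracefully), a program that wants to survive might wrap the constructor. A minimal sketch, with a made-up helper name, assuming that falling back to the classic locale is acceptable:

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: try a named locale and fall back to the classic
// locale if the runtime does not recognise the name.
std::locale locale_or_classic(const std::string& name)
{
  try {
    return std::locale(name.c_str());
  } catch (const std::runtime_error&) {
    return std::locale::classic();
  }
}
```

Calling locale_or_classic("") would then give the user's locale where it is available, and "C" where it is not. Whether silently falling back is the right behaviour is another question; for a whitespace counter it at least beats an abort trap.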

The other thing is that it’s not entirely obvious to me what character set we’re actually using here. This gets back to the question of the “implementation” character set. Sure, they are wchar_ts but are they Unicode? In this case the answer is yes, but is that assumption true on the DeathStation 9000? If Unicode, is it UTF-16 or UTF-32? What normalisation form? As far as I can tell, none of these questions can be answered in a portable manner. (And so Boost Serialization should become your new best friend)
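About the best you can do is ask the implementation what it is willing to admit. The sketch below (function names are mine) reports the width of wchar_t and checks the optional __STDC_ISO_10646__ macro, which an implementation may define to promise that wchar_t values are ISO 10646 code points. It says nothing about normalisation, and an implementation that leaves the macro undefined may still be using Unicode.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on this implementation.
std::size_t wchar_bits()
{
  return sizeof(wchar_t) * CHAR_BIT;
}

// True if the implementation promises, via the optional macro, that
// wchar_t values are ISO 10646 (Unicode) code points.
bool wchar_is_iso10646()
{
#ifdef __STDC_ISO_10646__
  return true;
#else
  return false;
#endif
}
```

A 32-bit wchar_t together with the macro suggests UTF-32; a 16-bit one suggests UTF-16 at best.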

I’ll leave Windows as an exercise for the reader. For the sake of simplicity I’ll switch to Linux for the remainder of this article.

Attempt #3: With Locales

Just because I’m allergic to global variables, I’ll use imbue to set the locale of the stream after construction.

int main(int argc, char * argv[])
{
  locale loc("");  // the user's preferred locale, taken from the environment

  for (int a = 1; a < argc; ++a) {
    wifstream fs(argv[a]);
    fs.unsetf(ios_base::skipws);
    fs.imbue(loc);  // decode the file using this locale's codecvt facet

    /* ... as above ... */
  }
  return 0;
}

And the result:

$ ./wscount3 foo.txt
foo.txt: 2 whitespace characters out of 12

Hooray for Zoidberg! We haven’t quite got the right result, but we are making progress. We successfully converted the input UTF-8 into wide characters, probably UTF-32. But why didn’t it count the right number of whitespace characters?

When is a space a space?

Try this with your favourite C++ compiler:

int main()
{
  locale loc("");
  cout << "isspace(no-break space): " << isspace(wchar_t(0xA0), loc) << endl;
  return 0;
}

On both Linux and MacOS, this program prints 0. In other words, the Unicode NO-BREAK SPACE is not a space according to the isspace() function. I’ll just let that sink in for a bit …

If you look at the glibc sources you’ll see that this has been a deliberate decision. There is even an accompanying comment:

/* Don't make U+00A0 a space. Non-breaking space means that all programs
   should treat it like a punctuation character, not like a space. */

Accurate but not exactly helpful. I suspect the reason has to do with not wanting to conflict with the thousands separator. In some locales they use spaces instead of commas to separate the thousands. If we consider no-break spaces as punctuation then we can use the same code to process large numeric quantities in all locales without risk of breaking them into multiple words. Or something like that.

The point remains though: the various ctype functions (i.e. the isxxxx functions) do not map on to the corresponding Unicode character properties.

Attempt #4: Extending and enhancing isspace

A proper solution here would probably involve hardcoded tests of the input character against each of the Unicode space characters. Which is, you guessed it, another exercise for the reader. I’m going to cheat a bit and just test for the no-break space for now.

So here is the final version, in all its glory:

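int main(int argc, char * argv[])
{
  locale loc("");  // the user's preferred locale, taken from the environment

  for (int a = 1; a < argc; ++a) {
    wifstream fs(argv[a]);
    fs.unsetf(ios_base::skipws);
    fs.imbue(loc);  // decode the file using this locale's codecvt facet
    unsigned ccount = 0,  wscount = 0;
    wchar_t ch;

    while (fs >> ch) {
      // U+00A0 is tested by hand; the locale-aware isspace() copes
      // with wide characters but does not count the no-break space
      if ((0x00a0 == ch) || isspace(ch, loc)) {
        ++wscount;
      }
      ++ccount;
    }

    cout << argv[a] << ": ";
    // anything other than a clean end-of-file counts as an error
    if (fs.bad() || !fs.eof()) {
      cout << "error encountered after " << ccount << " characters" << endl;
      return 1;
    } else {
      cout << wscount << " whitespace characters out of " << ccount << endl;
    }
  }
  return 0;
}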

And the money shot:

$ ./wscount4 foo.txt
foo.txt: 3 whitespace characters out of 12

About freakin’ time, you might be thinking.
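For anyone who wants to do the exercise that Attempt #4 ducks, the hardcoded test might look something like this. It is a sketch under assumptions: the name is_unicode_space is made up, and the list is my reading of the White_Space property in recent versions of the Unicode Character Database (PropList.txt), so check it against the version you care about. Older versions also listed U+180E, for instance.

```cpp
// True for code points with the Unicode White_Space property.
// The list is hardcoded and would need revisiting if a future
// Unicode version changes the property.
bool is_unicode_space(unsigned long cp)
{
  return (cp >= 0x0009 && cp <= 0x000D)   // tab, LF, VT, FF, CR
      || cp == 0x0020                     // SPACE
      || cp == 0x0085                     // NEXT LINE
      || cp == 0x00A0                     // NO-BREAK SPACE
      || cp == 0x1680                     // OGHAM SPACE MARK
      || (cp >= 0x2000 && cp <= 0x200A)   // EN QUAD .. HAIR SPACE
      || cp == 0x2028                     // LINE SEPARATOR
      || cp == 0x2029                     // PARAGRAPH SEPARATOR
      || cp == 0x202F                     // NARROW NO-BREAK SPACE
      || cp == 0x205F                     // MEDIUM MATHEMATICAL SPACE
      || cp == 0x3000;                    // IDEOGRAPHIC SPACE
}
```

Swapping this in for the (0x00a0 == ch) || isspace(ch) test only makes sense if the wide characters really are Unicode code points, which, as discussed, is not something the standard promises. Note that ZERO WIDTH SPACE (U+200B) is deliberately absent: despite the name it does not carry the White_Space property.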

Learnings

So here’s how I would summarise this whole exercise.

  • If you are doing any character-by-character processing of strings, you need to use wide chars. In fact, that’s probably a good idea even if you’re not peering inside strings. Unicode is here to stay, get over it.
  • Don’t rely on the standard libraries to always correctly convert your input data to wide characters. Use the Boost serialisation library mentioned above, or iconv, or (recently discovered and promising) u8u16.
  • Don’t rely on the standard libraries to process Unicode characters. For that you probably want ICU or something.
  • C++ needs to Try Harder to support Unicode, particularly on MacOS (The real WTF).
  • Just give up and use Ruby. Oh no, wait…

Kata Four in C++
Posted by alastair on girtby.net, Mon, 26 Feb 2007 12:37:00 +0000
http://girtby.net/archives/2007/02/26/kata-four-in-c/

On a whim, I attempted Dave Thomas’ Kata Four in C++. Yes that’s right, C++.

Here’s what I ended up with, feel free to throw peanuts.

I did the parts in order, without looking ahead. When it came to part three of the exercise, it became apparent that I needed to separate the analysis of the file from the mechanics of reading it. I used the common Functor idiom, calling it with each line of input:

template <class T>
void analyze_file(const char * dat, T & an)
{
  ifstream fs(dat);  // input only, so ifstream rather than fstream
  unsigned processed = 0;
  string line;

  // Testing getline() itself also picks up a final line that has
  // no trailing newline.
  while (getline(fs, line)) {
    if (an(line))
      processed++;
  }
  cout << processed << " lines processed" << endl;
  if (processed) {
    cout << an << endl;
  }
}

Then it’s a simple matter of defining a functor for each type of analysis. For the weather data, it looks like this:

class WeatherAnalyzer {
public:
  WeatherAnalyzer() : minday(0), minspread(numeric_limits<unsigned>::max()) {}
  // Parses one line; returns true if it was a usable data line.
  bool operator() (const string & line);
  unsigned minday, minspread;
};

ostream & operator << (ostream & os, const WeatherAnalyzer & w)
{
  return os << "Min spread = " << w.minspread << " (day " << w.minday << ")";
}

bool WeatherAnalyzer::operator() (const string & line)
{
  istringstream ls(line);
  unsigned d, maxt, mint;
  ls >> d >> maxt >> mint;
  // fail() rather than !good(), so that a line ending straight after
  // the third number is still accepted
  if (ls.fail() || (maxt < mint))
    // ignore unparseable lines:
    return false;

  unsigned spread = maxt - mint;
  if (spread < minspread) {
    minday = d;
    minspread = spread;
  }
  return true;
}

Add some #includes, a main(), and we’re all set. It all came together pretty quickly thanks mainly to the power of the C++ iostreams library.
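To give an idea of how the same shape carries over to the football part of the kata, here is a sketch of a second functor. It is not the code I actually wrote at the time: the class name and the assumed column layout (rank, team, played, won, lost, drawn, goals for, a literal "-", goals against, points, with no spaces inside team names) are illustrative guesses.

```cpp
#include <limits>
#include <sstream>
#include <string>

// Hypothetical analyzer for the football table, assuming each data line
// reads: rank, team, played, won, lost, drawn, for, "-", against, points.
class FootballAnalyzer {
public:
  FootballAnalyzer() : mindiff(std::numeric_limits<unsigned>::max()) {}
  // Parses one line; returns true if it was a usable data line.
  bool operator() (const std::string & line);
  std::string minteam;
  unsigned mindiff;
};

bool FootballAnalyzer::operator() (const std::string & line)
{
  std::istringstream ls(line);
  std::string rank, team, dash;
  unsigned played, won, lost, drawn, goals_for, goals_against;
  ls >> rank >> team >> played >> won >> lost >> drawn
     >> goals_for >> dash >> goals_against;
  if (ls.fail() || dash != "-")
    // ignore headers, rules and other unparseable lines:
    return false;

  // absolute difference, computed without unsigned wrap-around
  unsigned diff = goals_for > goals_against ? goals_for - goals_against
                                            : goals_against - goals_for;
  if (diff < mindiff) {
    minteam = team;
    mindiff = diff;
  }
  return true;
}
```

To plug it into analyze_file it would also need a stream insertion operator along the lines of the weather one.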

All in all I think it came out pretty well, although it is by no means perfect. If it was going into production here’s what I’d be looking at:

  • Defining a proper stream insertion operator that can handle wide chars.
  • Better error reporting, particularly when the input file cannot be opened.
  • Logging of “unexpected” unparseable lines.
  • Getter methods, rather than public member variables, for accessing the accumulated values of the functor.

In answer to the Kata questions:

  • I made some early design decisions which were not validated when writing subsequent programs. Specifically I decided that each line should be parsed strictly once the start of data was detected. In other words, I had a skip_to_data function in my original part 1 solution. This was overturned when I went to part two because I observed that it was better to simply skip lines that could not be parsed, particularly when there was spurious data in the middle of the dataset. However making this change was fairly simple.
  • The second program was a copy and paste of the first, with the relevant changes in fairly obvious parts. (This is usually very poor practice, but is justified in this case because I wasn’t actually shipping anything)
  • I don’t believe that factoring out common code is always a good thing. For example, it would have been a fairly easy change to use an abstract Analyzer base class instead of a functor object. This might contain some common functionality such as the code to accumulate the minimum value and associated identifier. However I deliberately didn’t do this because it would have required a complicated interface between the base class and subclasses, for very little reuse benefit. As for readability and maintainability of the refactoring, I’d say it was a definite win.
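For comparison, the base-class alternative mentioned in the last point might start out something like this. It is a sketch of the road not taken, with made-up names, and it deliberately leaves out the shared "accumulate the minimum" code, since designing that interface is exactly the complication I wanted to avoid.

```cpp
#include <string>

// Sketch of the rejected alternative: an abstract base class with a
// virtual per-line hook. All names are made up for illustration.
class Analyzer {
public:
  virtual ~Analyzer() {}
  // Returns true if the line was parsed and counted.
  virtual bool process(const std::string & line) = 0;
};

// Trivial subclass, only to show the shape of the interface.
class NonEmptyLineCounter : public Analyzer {
public:
  NonEmptyLineCounter() : count(0) {}
  virtual bool process(const std::string & line)
  {
    if (line.empty())
      return false;
    ++count;
    return true;
  }
  unsigned count;
};
```

With this, analyze_file would take an Analyzer & instead of being a template, and printing the result would need a virtual method of its own.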