Putting UTF-8 into C/C++ Source Code

After much googling, I could not find any tool for converting a UTF-8 string into an escaped C/C++ string literal suitable for pasting into an ASCII source file. So I wrote this Perl script, which produces a fairly readable escaped string:

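use strict;

# Run with "perl -n": each input line arrives in $_ as raw bytes.
chomp;
my $prev_esc = 0;    # true if the previous byte was written as \xNN
print '"';
print map
    {
        if (ord $_ > 0x7f) {
            # Non-ASCII byte: emit a two-digit hex escape.
            $prev_esc = 1;
            sprintf('\\x%02x', ord $_);
        } else {
            # A hex digit directly after \xNN would be read as part of the
            # escape, so close the literal and open a new one first.
            my $need_break = $prev_esc && /[0-9A-Fa-f]/;
            $prev_esc = 0;
            # Quotes and backslashes must themselves be escaped in C.
            my $c = /["\\]/ ? "\\$_" : $_;
            ($need_break ? '" "' : '') . $c;
        }
    }
    split('', $_);
print '"' . "\n";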

Run it with Perl's -n option, and it will output one escaped string literal for each line of input:

$ perl -n utf8esc.pl
Grüße aus Bärenhöfe
"Gr\xc3\xbc\xc3\x9f" "e aus B\xc3\xa4renh\xc3\xb6" "fe"

Hit Ctrl-D on a blank line to exit.
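
For anyone who would rather not depend on Perl, the same idea can be sketched in C. This is only an illustration of the logic described above, not a tested replacement for the script: the function name utf8_escape and its buffer-based interface are my own assumptions, and it additionally escapes quotes and backslashes.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (name and signature are assumptions): writes an
 * escaped C string literal for the bytes of `in` into `out`, which has
 * room for `cap` chars. Returns 0 on success, -1 if `out` is too small. */
int utf8_escape(const char *in, char *out, size_t cap)
{
    size_t n = 0;
    int prev_esc = 0;                 /* was the previous byte written as \xNN? */

    assert(in != NULL && out != NULL);
    if (cap < 3) return -1;           /* need room for two quotes and the NUL */
    out[n++] = '"';

    for (const unsigned char *p = (const unsigned char *)in; *p; ++p) {
        char piece[8];
        if (*p > 0x7f) {
            /* Non-ASCII byte: two-digit hex escape. */
            snprintf(piece, sizeof piece, "\\x%02x", *p);
            prev_esc = 1;
        } else if (*p == '"' || *p == '\\') {
            /* Characters that are special inside a C literal. */
            snprintf(piece, sizeof piece, "\\%c", *p);
            prev_esc = 0;
        } else {
            /* A hex digit right after \xNN would be absorbed into the
             * escape, so close the literal and open a new one. */
            const char *brk = (prev_esc && isxdigit(*p)) ? "\" \"" : "";
            snprintf(piece, sizeof piece, "%s%c", brk, *p);
            prev_esc = 0;
        }
        size_t len = strlen(piece);
        if (n + len + 2 > cap) return -1;   /* keep room for '"' and NUL */
        memcpy(out + n, piece, len);
        n += len;
    }
    out[n++] = '"';
    out[n] = '\0';
    return 0;
}
```

Calling utf8_escape on the UTF-8 bytes of the sample line should fill the buffer with the same text the Perl script printed, including the two breaks before "e" and "fe".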

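Unfortunately, C and C++ place no length limit on a hex escape: every hex digit following “\x” is taken as part of that escape sequence, even though a value above 0xff does not fit in a char on typical platforms and is diagnosed as out of range. In the example above, “\x9fe” would be read as the single value 0x9fe rather than the byte 0x9f followed by the letter e. Therefore it is necessary to break the string into separate segments. The compiler joins adjacent string literals only after the escapes have been processed, so the bytes come out as intended.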
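
To make the rule concrete, here is a small C sketch that spells "ße" two ways and checks that both hold the same bytes. The names split_form, octal_form and literals_match are purely illustrative. The octal spelling is included only for comparison: an octal escape takes at most three digits, so it needs no break, at the cost of being harder to read against a hex dump.

```c
#include <assert.h>
#include <string.h>

/* Split form, as the script emits it: the closing quote ends the \x9f
 * escape before the hex digit 'e' can be absorbed into it. */
static const char split_form[] = "\xc3\x9f" "e";

/* Octal escapes stop after at most three digits, so no split is needed.
 * \303 is 0xc3 and \237 is 0x9f. */
static const char octal_form[] = "\303\237e";

/* The unsplit hex form is left in a comment because a conforming compiler
 * must diagnose it (some report an error, others a warning):
 *     "\xc3\x9fe"    parsed as \xc3 followed by \x9fe, which is out of range
 */

/* Returns 1 if both spellings hold the same three bytes plus the NUL. */
int literals_match(void)
{
    assert(sizeof split_form == 4);   /* c3 9f 65 00 */
    return sizeof split_form == sizeof octal_form
        && memcmp(split_form, octal_form, sizeof split_form) == 0;
}
```

I still prefer the split hex form for the script's output, since the byte values can be compared directly with what a hex editor shows.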
