Boost.Spirit UTF-8 string literal parser with escape support

Submitted by: @import:stackexchange-codereview··

Problem

I wrote (as part of a greater work) a Boost.Spirit grammar that would parse string literals, including support for the various escape sequences known from C/C++ (\n, \x7f, \341, \u017f, \U00010451).

At some point I encountered some problems, mostly due to my lack of understanding either Boost.Spirit or Boost.Phoenix to the detail required to have full control over what I'm doing, and despairing at the rather non-descriptive error messages Boost.Spirit generates. ;-) User sehe was very helpful over at StackOverflow, and my grammar is now functional.

However, some things are still bothering me:

  • The functor cp2utf8_f converts a UChar32 to a UTF-8 byte sequence. However, as a struct inside the grammar, it is not exactly re-usable. I would like to have it as a stand-alone function, but have failed to make it work.



  • The escapes rule basically does the same thing in five different ways: determine a UChar32 code point, and pass it to the functor (see above) using semantic actions, which appends it to the result string. This should really be a rule with a UChar32 result, which is then passed to the functor at the point the rule is called (to avoid the five-fold repetition of the functor call). Again, I had an idea of how it should work, but it didn't.



  • The error handlers (straight from the tutorial) currently print to std::cout. That's not nice; I'd rather have the error message generated by the handler thrown as exception (let's say std::runtime_error for the sake of this review). Again, my lack of in-depth understanding of what is going on here exactly makes me scratch my head at why the compiler complains about "invalid use of void exception" when I replace the std::cout



Any other suggestions (like, how to better learn to fish in Boost.Spirit instead of asking you to hand me the fish...) are likewise welcome. I know the "test driver" main() is crude; I didn't want to make this longer than necessary by going through th

Solution


  1. The error handlers



The problem with throw expressions, as the compiler kindly reminded you, is that they're void expressions: they have no value that could be passed along as an argument.

Even if it compiled, it would not do what you want: it'd throw during the grammar constructor...

The repeating story here is that semantic actions (and error handlers in this case) require Phoenix actors (a.k.a. lazy or deferred functions), so that Spirit knows how to evaluate them against the Spirit context when needed. The simple case:

qi::on_error< qi::fail >
(
    quoted_string,
    phoenix::throw_(
        phoenix::construct<std::runtime_error>( "Illegal string literal. (Unterminated string?)" )
    )
);
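
To illustrate why the handler has to be deferred, here is a plain-C++ analogy that uses a stored lambda instead of a Phoenix actor. It is only a sketch: ToyParser and make_parser are hypothetical names invented for this illustration, and nothing in it is Spirit or Phoenix API. The point is that registering the handler does not throw; the throw happens only when a parse fails and the stored callable is invoked.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a parser that stores an error handler and
// runs it only when a parse actually fails.
class ToyParser {
public:
    using Handler = std::function<void(std::string const&)>;

    // Registering a handler merely stores it; nothing is executed yet.
    void on_error(Handler h) { handler_ = std::move(h); }

    // Accepts input that starts and ends with a double quote.
    // On failure, the stored handler is invoked with the offending input.
    bool parse(std::string const& input) const {
        bool ok = input.size() >= 2 && input.front() == '"' && input.back() == '"';
        if (!ok && handler_) handler_(input);
        return ok;
    }

private:
    Handler handler_;
};

// Builds a parser whose handler throws lazily, i.e. only when invoked.
// Writing a bare `throw` here instead would fire during construction,
// which is the problem described above.
inline ToyParser make_parser() {
    ToyParser p;
    p.on_error([](std::string const& what) {
        throw std::runtime_error("Illegal string literal: " + what);
    });
    return p;
}
```

Calling make_parser() completes without throwing; only a later parse() on malformed input raises std::runtime_error.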


The more complex version requires stream concatenation. You could do this with a local/let-expression, but I'd keep it simple and extract a Phoenix function make_error_message:

qi::on_error< qi::fail >
(
    escapes,
    phoenix::throw_(
        phoenix::construct<std::runtime_error>( make_error_message(qi::_4, qi::_3, qi::_2) )
    )
);


Now, you can just code that function in any which way you like:

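struct make_error_message_f {
    template <typename ...> struct result { using type = std::string; };

    template <typename Info, typename F, typename L>
    std::string operator()(Info const& info, F f, L l) const {
        std::ostringstream oss;
        oss << "Illegal escape sequence. Expecting " << info << " here: \"" << std::string(f,l) << "\"";
        return oss.str();
    }
};

phoenix::function<make_error_message_f> make_error_message;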



See below for ways to make make_error_message a function that's adapted for Phoenix use.

  2. Using a global function




However, as a struct inside the grammar, it is not exactly re-usable. I would like to have it as a stand-alone function, but have failed to make it work.

You can of course just relay the implementation of cp2utf8_f::operator() to a re-usable function of your choice. Of course, that makes the cp2utf8_f function object merely red-tape code. If you don't mind putting traits in the Phoenix extensions namespaces, you can use the existing adaptation macros:

namespace my_helpers {
    void cp2utf8(std::string& a, UChar32 codepoint)
    {
        icu::StringByteSink<std::string> bs(&a);
        icu::UnicodeString::fromUTF32(&codepoint, 1).toUTF8( bs );
    }

    template<typename Iterator>
        std::string make_error_message(boost::spirit::info const& info, Iterator first, Iterator last) {
            std::ostringstream oss;
            oss << "Illegal escape sequence. Expecting " << info << " here: \"" << std::string(first,last) << "\"";
            return oss.str();
        }
}

BOOST_PHOENIX_ADAPT_FUNCTION(void,        cp2utf8_,            my_helpers::cp2utf8,            2)
BOOST_PHOENIX_ADAPT_FUNCTION(std::string, make_error_message_, my_helpers::make_error_message, 3)
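
Once the conversion is a free function, it can also be written and tested without the grammar. As a rough sketch of what such a stand-alone helper does, here is an ICU-free version in standard C++. The name append_utf8 and the choice to throw on invalid code points are assumptions made for this illustration; the answer's cp2utf8 delegates to ICU, which may treat invalid input differently.

```cpp
#include <stdexcept>
#include <string>

// Appends the UTF-8 encoding of `cp` to `out`.
// Hypothetical ICU-free counterpart of my_helpers::cp2utf8.
// Assumption: surrogates and values above U+10FFFF are rejected by throwing.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

Because it has the same shape as cp2utf8 (string reference plus code point), a function like this could presumably be adapted with the same BOOST_PHOENIX_ADAPT_FUNCTION macro.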


// (And I don't like *result* and *cp2utf8* lying around here
// when a stand-alone function should do just as well.)


They're private inner types, and they're prime inlining fodder. What cost did you measure?

Personally, I prefer the localized function objects because they give you more control and prevent namespace pollution. Note that with a sufficiently recent compiler and Boost version you may be able to drop the inner result_type/result<>::type constructs (see the RESULT_OF docs).

  3. Reducing WET-ness (repetition)



Is this what you had in mind:

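// The uint_parser template arguments <type, radix, min digits, max digits>
// were lost in this copy; the values below are assumed from the escape
// forms listed in the question (\x7f, \u017f, \U00010451, \341).
escapes = '\\' > ( 
          escaped_character
        | ("x" > qi::uint_parser<UChar32, 16, 2, 2>())
        | ("u" > qi::uint_parser<UChar32, 16, 4, 4>())
        | ("U" > qi::uint_parser<UChar32, 16, 8, 8>())
        | (      qi::uint_parser<UChar32,  8, 1, 3>()) 
      ) [ cp2utf8_( qi::_val, qi::_1 ) ]
;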
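
To make the shape of that rule concrete, here is a hand-rolled sketch of the same idea in standard C++: every alternative yields one code point, and the caller converts it in a single place. This is not Spirit code. The names read_digits and decode_escape are hypothetical, and the digit counts (2 for \x, 4 for \u, 8 for \U, 1 to 3 for octal) are assumptions based on the escape forms listed in the question.

```cpp
#include <cstddef>
#include <string>

// Reads between min_digits and max_digits digits of the given radix
// (8 or 16) from s, starting at pos. On success, stores the value in cp,
// advances pos past the digits and returns true.
inline bool read_digits(std::string const& s, std::size_t& pos, unsigned radix,
                        std::size_t min_digits, std::size_t max_digits,
                        char32_t& cp) {
    char32_t value = 0;
    std::size_t n = 0;
    while (n < max_digits && pos < s.size()) {
        char const c = s[pos];
        unsigned d;
        if (c >= '0' && c <= '9')      d = static_cast<unsigned>(c - '0');
        else if (c >= 'a' && c <= 'f') d = static_cast<unsigned>(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F') d = static_cast<unsigned>(c - 'A') + 10;
        else break;
        if (d >= radix) break;          // e.g. '9' is not an octal digit
        value = value * radix + d;
        ++pos;
        ++n;
    }
    if (n < min_digits) return false;
    cp = value;
    return true;
}

// Decodes one escape sequence; pos must point just past the backslash.
// All five alternatives produce a single code point in cp.
inline bool decode_escape(std::string const& s, std::size_t& pos, char32_t& cp) {
    if (pos >= s.size()) return false;
    switch (s[pos]) {
        case 'a':  ++pos; cp = 0x07; return true;  // alert
        case 'b':  ++pos; cp = 0x08; return true;  // backspace
        case 'f':  ++pos; cp = 0x0c; return true;  // form feed
        case 'n':  ++pos; cp = 0x0a; return true;  // new line
        case 'r':  ++pos; cp = 0x0d; return true;  // carriage return
        case 't':  ++pos; cp = 0x09; return true;  // horizontal tab
        case 'v':  ++pos; cp = 0x0b; return true;  // vertical tab
        case '"':  ++pos; cp = 0x22; return true;  // literal quotation mark
        case '\\': ++pos; cp = 0x5c; return true;  // literal backslash
        case 'x':  ++pos; return read_digits(s, pos, 16, 2, 2, cp);
        case 'u':  ++pos; return read_digits(s, pos, 16, 4, 4, cp);
        case 'U':  ++pos; return read_digits(s, pos, 16, 8, 8, cp);
        default:          return read_digits(s, pos, 8, 1, 3, cp);  // octal
    }
}
```

The single semantic action on the parenthesized group in the rule plays the role of the one call site that would consume cp here.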


DEMO

Includes the improvements described, and also cleans up some excess scope/namespace pollution issues.

Live on Coliru

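```
#define BOOST_SPIRIT_UNICODE

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>

#include <unicode/unistr.h>
#include <unicode/bytestream.h>

#include <sstream>
#include <stdexcept>
#include <string>

namespace qi = boost::spirit::qi;

using boost::spirit::unicode::char_;
using boost::spirit::eol;

namespace my_helpers {
    // Appends the UTF-8 encoding of codepoint to a (via ICU).
    void cp2utf8(std::string& a, UChar32 codepoint)
    {
        icu::StringByteSink<std::string> bs(&a);
        icu::UnicodeString::fromUTF32(&codepoint, 1).toUTF8( bs );
    }

    template <typename Iterator>
    std::string make_error_message(boost::spirit::info const& info, Iterator first, Iterator last) {
        std::ostringstream oss;
        oss << "Illegal escape sequence. Expecting " << info << " here: \"" << std::string(first,last) << "\"";
        return oss.str();
    }
}

// Lazy (Phoenix) wrappers around the free functions above.
BOOST_PHOENIX_ADAPT_FUNCTION(void,        cp2utf8_,            my_helpers::cp2utf8,            2)
BOOST_PHOENIX_ADAPT_FUNCTION(std::string, make_error_message_, my_helpers::make_error_message, 3)

template <typename Iterator>
struct QuotedString : qi::grammar<Iterator, std::string()>
{
    QuotedString() : QuotedString::base_type( quoted_string )
    {
        quoted_string = '"' > *( +( char_ - ( '"' | eol | '\\' ) ) | escapes ) > '"';

        // uint_parser arguments <type, radix, min digits, max digits> are
        // assumed from the escape forms listed in the question.
        escapes = '\\' > (
                  escaped_character
                | ("x" > qi::uint_parser<UChar32, 16, 2, 2>())
                | ("u" > qi::uint_parser<UChar32, 16, 4, 4>())
                | ("U" > qi::uint_parser<UChar32, 16, 8, 8>())
                | (      qi::uint_parser<UChar32,  8, 1, 3>())
              ) [ cp2utf8_( qi::_val, qi::_1 ) ]
        ;

        escaped_character.add
            (  "a", 0x07 ) // alert
            (  "b", 0x08 ) // backspace
            (  "f", 0x0c ) // form feed
            (  "n", 0x0a ) // new line
            (  "r", 0x0d ) // carriage return
            (  "t", 0x09 ) // horizontal tab
            (  "v", 0x0b ) // vertical tab
            ( "\"", 0x22 ) // literal quotation mark
            ( "\\", 0x5c ) // literal backslash
        ;

        namespace phx = boost::phoenix;

        qi::on_error< qi::fail > (
            quoted_string,
            phx::throw_( phx::construct<std::runtime_error>( "Illegal string literal. (Unterminated string?)" ) )
        );

        qi::on_error< qi::fail > (
            escapes,
            phx::throw_( phx::construct<std::runtime_error>( make_error_message_(qi::_4, qi::_3, qi::_2) ) )
        );
    }

    // The member declarations are cut off in this copy; the types below are assumed.
    qi::rule<Iterator, std::string()> quoted_string, escapes;
    qi::symbols<char const, UChar32>  escaped_character;
};

// The rest of the demo (including main()) is cut off in this copy.
```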

Code Snippets

qi::on_error< qi::fail >
(
    quoted_string,
    phoenix::throw_(
        phoenix::construct<std::runtime_error>( "Illegal string literal. (Unterminated string?)" )
    )
);

qi::on_error< qi::fail >
(
    escapes,
    phoenix::throw_(
        phoenix::construct<std::runtime_error>( make_error_message(qi::_4, qi::_3, qi::_2) )
    )
);

struct make_error_message_f {
    template <typename ...> struct result { using type = std::string; };

    template <typename Info, typename F, typename L>
    std::string operator()(Info const& info, F f, L l) const {
        std::ostringstream oss;
        oss << "Illegal escape sequence. Expecting " << info << " here: \"" << std::string(f,l) << "\"";
        return oss.str();
    }
};

phoenix::function<make_error_message_f> make_error_message;

namespace my_helpers {
    void cp2utf8(std::string& a, UChar32 codepoint)
    {
        icu::StringByteSink<std::string> bs(&a);
        icu::UnicodeString::fromUTF32(&codepoint, 1).toUTF8( bs );
    }

    template<typename Iterator>
        std::string make_error_message(boost::spirit::info const& info, Iterator first, Iterator last) {
            std::ostringstream oss;
            oss << "Illegal escape sequence. Expecting " << info << " here: \"" << std::string(first,last) << "\"";
            return oss.str();
        }
}

BOOST_PHOENIX_ADAPT_FUNCTION(void,        cp2utf8_,            my_helpers::cp2utf8,            2)
BOOST_PHOENIX_ADAPT_FUNCTION(std::string, make_error_message_, my_helpers::make_error_message, 3)

// (And I don't like *result* and *cp2utf8* lying around here
// when a stand-alone function should do just as well.)

Context

StackExchange Code Review Q#102374, answer score: 6
