
HTML Parser (using SAX)

Submitted by: @import:stackexchange-codereview

Problem

I got bored writing a review of an HTML parser and decided I wanted to try writing one myself.

So I threw this together to see if I could parse an Amazon page.

curl -A 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.13) Gecko/20080313 Firefox' https://www.amazon.com | parser


Look what I found on Amazon's home page:

Comment:        _
       .__(.)< (MEOW)
        \___)
 ~~~~~~~~~~~~~~~~~~


Looks like a duck cat!!!

Note: This is not designed as a parser for strictly valid HTML. The idea was to parse the invalid HTML that is found on the web, so it makes allowances for a couple of common problems seen in real pages.

It also assumes (incorrectly) that all text between ` => and => is one big blob of text.

parser.h

#ifndef THORSANVIL_HTMLPARSER_PARSER_H
#define THORSANVIL_HTMLPARSER_PARSER_H

#include <istream>
#include <map>
#include <string>

namespace ThorsAnvil
{
    namespace HTMLParser
    {

// Attribute name -> attribute value.
using Attributes = std::map<std::string, std::string>;

class HTMLTokenI
{
    public:
        virtual ~HTMLTokenI() {}
        // By default the functions deliberately do nothing.
        virtual void DocType(std::string const& docString) {}
        virtual void tagOpen(std::string const& tagName, Attributes const& attr) {}
        virtual void tagOpenClose(std::string const& tagName, Attributes const& attr) {}
        virtual void tagClose(std::string const& tagName) {}
        virtual void comment(std::string const& comment) {}
        virtual void text(std::string const& text) {}
        virtual void error(std::string const& message) {}
};

class HTMLSaxParser
{
    std::istream&   htmlpage;
    HTMLTokenI&     callback;
    public:
        HTMLSaxParser(std::istream& htmlpage, HTMLTokenI& callback)
            : htmlpage(htmlpage)
            , callback(callback)
        {}

        void parse();
    private:
        void parseDocType();
        void parseTag();
        void parseComment();
        void parseTagClose();
        void parseTagOpen();

        bool attributesFinished;
        Attrib
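
As a rough illustration of how a client might consume SAX-style callbacks like the ones declared above, here is a minimal sketch. The interface is a trimmed stand-in written for this example, not the original header: `TokenHandler` and `LinkCollector` are hypothetical names, and the attribute map is assumed to map a name string to a value string.

```cpp
#include <map>
#include <string>
#include <vector>

// Assumption: attributes are stored as name -> value strings.
using Attributes = std::map<std::string, std::string>;

// Hypothetical, trimmed-down stand-in for a SAX callback interface.
// Every callback does nothing by default, so a handler only overrides
// the events it cares about.
class TokenHandler
{
    public:
        virtual ~TokenHandler() {}
        virtual void tagOpen(std::string const&, Attributes const&) {}
        virtual void tagClose(std::string const&) {}
        virtual void text(std::string const&) {}
};

// Hypothetical handler: records the href of every <a> tag it is told about.
class LinkCollector: public TokenHandler
{
    public:
        std::vector<std::string> links;

        void tagOpen(std::string const& tagName, Attributes const& attr) override
        {
            if (tagName != "a")
            {
                return;
            }
            auto found = attr.find("href");
            if (found != attr.end())
            {
                links.push_back(found->second);
            }
        }
};
```

A parser holding a `TokenHandler&` would call `tagOpen`, `tagClose`, and `text` as it walks the input; the handler never sees a document tree, only the stream of events.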

Solution

Minor nitpicks, I suppose.

- You need to add

#include <algorithm>

to use std::find and std::find_if.
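
Code that omits this header often still compiles because another standard header happens to pull it in, but that is not guaranteed across toolchains. A minimal sketch of both algorithms with the include spelled out; `tagNameOf` and `hasAssignment` are hypothetical helpers made up for this example, not functions from the reviewed code:

```cpp
#include <algorithm>  // std::find, std::find_if
#include <cctype>     // std::isspace
#include <string>

// Hypothetical helper: the characters of `tag` before the first whitespace.
std::string tagNameOf(std::string const& tag)
{
    auto end = std::find_if(tag.begin(), tag.end(),
                            [](unsigned char c) { return std::isspace(c) != 0; });
    return std::string(tag.begin(), end);
}

// Hypothetical helper: true if `tag` contains an '=' character.
bool hasAssignment(std::string const& tag)
{
    return std::find(tag.begin(), tag.end(), '=') != tag.end();
}
```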

- Make loop an unsigned type.

You have

int loop;
for(loop = 0; loop < tag.size(); ++loop)


That produces a compiler warning from g++:

warning: comparison between signed and unsigned integer expressions [-Wsign-compare]

I suggest changing that to:

decltype(tag.size()) loop;
for(loop = 0; loop < tag.size(); ++loop)
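
An equivalent spelling, assuming `tag` is a `std::string`, is to name the type directly as `std::string::size_type`, which is the type `size()` returns. The sketch below uses a hypothetical helper, `nameLength`, to show why the index is declared outside the `for`: it is still needed after the loop ends.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of characters before the first space in `tag`
// (or the whole length if there is no space).
std::size_t nameLength(std::string const& tag)
{
    std::string::size_type loop;  // same type that tag.size() returns
    for (loop = 0; loop < tag.size(); ++loop)
    {
        if (tag[loop] == ' ')
        {
            break;
        }
    }
    return loop;  // still in scope because it was declared before the loop
}
```

Both sides of `loop < tag.size()` are now unsigned, so `-Wsign-compare` has nothing to report.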


- Example usage and real usage don't match.

Your example usage says:

curl -A '.....' www.amazon.com | parser


However, main uses a hard-coded filename, "t1":

std::ifstream       amazon("t1");


The usage should be:

curl -A '.....' www.amazon.com > t1 && ./parser
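
Alternatively, since the parser constructor takes a `std::istream&`, the program could read from standard input so that the piped form works as advertised. A minimal sketch under that assumption; `selectInput` is a hypothetical helper, and falling back to a filename in `argv[1]` is a design choice made for this example only:

```cpp
#include <fstream>
#include <iostream>
#include <istream>

// Hypothetical helper: decide where the HTML comes from.
// With no file argument, read standard input so `curl ... | parser` works;
// otherwise open argv[1] into `file` and read from that.
std::istream& selectInput(int argc, char* argv[], std::ifstream& file)
{
    if (argc < 2)
    {
        return std::cin;
    }
    file.open(argv[1]);
    return file;
}
```

In `main`, the returned stream would be passed to the parser in place of the hard-coded `std::ifstream`. The caller should still test the stream (`if (!input)`) before parsing, because opening the file can fail.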

Code Snippets

#include <algorithm>

int loop;
for(loop = 0; loop < tag.size(); ++loop)

decltype(tag.size()) loop;
for(loop = 0; loop < tag.size(); ++loop)

curl -A '.....' www.amazon.com | parser

std::ifstream       amazon("t1");

Context

StackExchange Code Review Q#138092, answer score: 2
