HTML Parser (using SAX)
Problem
Got bored writing a review on an HTML parser and decided I wanted to try.
So I threw this together to see if I could parse an Amazon page.
    curl -A 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.13) Gecko/20080313 Firefox' https://www.amazon.com | parser
Look what I found on Amazon's home page:
    Comment: _
    .__(.)< (MEOW)
    \___)
    ~~~~~~~~~~~~~~~~~~
Looks like a duck cat!!!
Note: This is not designed to parse valid HTML. The idea was to parse invalid HTML that is found on the web. So it makes allowances for a couple of common problems found in HTML that you see on the web.
It also assumes (incorrectly) that all text between `
=> and => is one big blob of text.
parser.h
    #ifndef THORSANVIL_HTMLPARSER_PARSER_H
    #define THORSANVIL_HTMLPARSER_PARSER_H
    #include <istream>
    #include <map>
    #include <string>
    namespace ThorsAnvil
    {
        namespace HTMLParser
        {
    // Attribute name -> attribute value for a single tag.
    using Attributes = std::map<std::string, std::string>;
    class HTMLTokenI
    {
        public:
            virtual ~HTMLTokenI() {}
            // By default the functions deliberately do nothing.
            virtual void DocType(std::string const& docString) {}
            virtual void tagOpen(std::string const& tagName, Attributes const& attr) {}
            virtual void tagOpenClose(std::string const& tagName, Attributes const& attr) {}
            virtual void tagClose(std::string const& tagName) {}
            virtual void comment(std::string const& comment) {}
            virtual void text(std::string const& text) {}
            virtual void error(std::string const& message) {}
    };
    class HTMLSaxParser
    {
        std::istream&   htmlpage;
        HTMLTokenI&     callback;
        public:
            HTMLSaxParser(std::istream& htmlpage, HTMLTokenI& callback)
                : htmlpage(htmlpage)
                , callback(callback)
            {}
            void parse();
        private:
            void parseDocType();
            void parseTag();
            void parseComment();
            void parseTagClose();
            void parseTagOpen();
            bool attributesFinished;
            Attrib
Solution
Minor nitpicks, I suppose.
- You need to add
      #include <algorithm>
  to use `std::find` and `std::find_if`.
- Make `loop` an unsigned type. You have
      int loop;
      for(loop = 0; loop < tag.size(); ++loop)
  That produces a compiler warning from g++:
      warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
  I suggest changing that to:
      decltype(tag.size()) loop;
      for(loop = 0; loop < tag.size(); ++loop)
- Example usage and real usage don't match. Your example usage says:
      curl -A '.....' www.amazon.com | parser
  However, in `main` you are using a hard-coded filename, "t1":
      std::ifstream amazon("t1");
  The usage should be:
      curl -A '.....' www.amazon.com > t1 && ./parser
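The signed/unsigned point can be shown in a complete function. This is a minimal sketch, assuming the loop scans a tag string and the index is needed after the loop ends; `findFirstSpace` is a hypothetical helper, not a function from the reviewed code. For a `std::string`, `decltype(tag.size())` and `std::string::size_type` name the same type, so either spelling avoids the warning.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index of the first whitespace character
// in `tag`, or tag.size() when there is none.
std::size_t findFirstSpace(std::string const& tag)
{
    // The index has the same unsigned type that tag.size() returns, so the
    // loop condition compares unsigned against unsigned.
    std::string::size_type loop = 0;
    for (; loop < tag.size(); ++loop)
    {
        // Cast before calling std::isspace: passing a negative char value
        // is undefined behaviour.
        if (std::isspace(static_cast<unsigned char>(tag[loop])))
        {
            break;
        }
    }
    return loop;
}
```

Declaring the index outside the `for` statement, as the original code does, keeps it usable after the loop, which is why a range-based `for` is not a drop-in replacement here.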
Context
StackExchange Code Review Q#138092, answer score: 2