Simple HTML tokenization & validation in JavaScript
Problem
I've been down the tokenization rabbit hole for a little while now, if it wasn't obvious from the previous articles on bracket pair matching and math expression tokenization. This time, I wanted to try something a little more complex, but still simple enough to be done in a single article. So, I decided to try my hand at tokenizing an HTML string and validating that its tags are balanced correctly.
> [!NOTE]
>
> Huge disclaimer right here that this is a learning exercise and, quite possibly, an exercise in futility. What I hope to achieve here is you coming out of this article with a basic understanding of how to tokenize some more complex inputs and how to validate them. This is not a full-fledged HTML parser, nor is it meant to be.
If you've read the previous article on math expression tokenization, you'll know that tokenization is the process of breaking down a string into smaller, more manageable pieces. In that article, we didn't really delve into multi-character tokens, except for numbers, which made the process a little simpler. In this article, we'll be dealing with multi-character tokens, specifically HTML tags and text nodes.
Solution
The first problem we'll have to solve is distinguishing between an HTML tag and a text node. An HTML tag is a string that starts with < and ends with >, while a text node is everything else. To tackle this, we'll create a flushBuffer function, much like we did in the math expression tokenizer. Only this time around, we'll make it a simple conditional that delegates responsibility to different functions based on the detected token type.

const flushBuffer = () => {
  // Nothing buffered, nothing to flush
  if (!buffer.length) return;
  const value = buffer.trim();
  // Loose detection: anything that looks like a tag goes to the tag processor
  if (value.startsWith('<') || value.endsWith('>'))
    processTagToken(value);
  else
    processTextToken(value);
};

This function loosely detects the token type based on the buffer's contents and delegates the processing to the appropriate function. If the buffer starts with < or ends with >, we'll assume it's an HTML tag and pass it to processTagToken. Otherwise, we'll assume it's a text node and pass it to processTextToken. We'll implement these functions next.
Text tokens aren't very interesting. They're just text nodes that don't contain any HTML tags. We'll simply add them to a tokens array as an object with a type of text and a value of the text node.
Code Snippets
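None of the snippets here show where buffer and tokens come from, or the loop that decides when to call flushBuffer. As a rough sketch, and assuming the loop flushes right before a < and right after a >, the pieces could be wired together like this. The tokenize wrapper, the flushing rule, and the choice to skip whitespace-only buffers are all assumptions made for illustration, and attribute handling is left out to keep it short.

```javascript
const SELF_CLOSING_TAGS = new Set(['br', 'img', 'input', 'meta', 'hr', 'link']);

// Hypothetical wrapper (not from the article): keeps `buffer` and `tokens`
// in a single closure so the helper functions can share them.
const tokenize = html => {
  const tokens = [];
  let buffer = '';

  const processTextToken = str => {
    tokens.push({ type: 'text', value: str });
  };

  // Simplified tag processing: attributes are ignored in this sketch
  const processTagToken = str => {
    if (!/^<[^<>]+>$/.test(str))
      throw new Error(`${str} is not a valid HTML tag`);
    const tagName = str.match(/^<\/?([^<>/ ]+)/)[1];
    const isClosingTag = str.startsWith('</');
    const isSelfClosingTag =
      str.endsWith('/>') || SELF_CLOSING_TAGS.has(tagName);
    tokens.push({
      type: 'tag',
      tagName,
      opening: !isClosingTag || isSelfClosingTag,
      closing: isClosingTag || isSelfClosingTag
    });
  };

  const flushBuffer = () => {
    const value = buffer.trim();
    buffer = '';
    // Assumption: whitespace-only runs between tags are dropped
    if (!value.length) return;
    if (value.startsWith('<') || value.endsWith('>')) processTagToken(value);
    else processTextToken(value);
  };

  for (const char of html) {
    // Assumed rule: a `<` starts a new tag, so flush any pending text first
    if (char === '<') flushBuffer();
    buffer += char;
    // Assumed rule: a `>` ends the current tag, so flush it immediately
    if (char === '>') flushBuffer();
  }
  // Flush whatever is left once the input runs out
  flushBuffer();

  return tokens;
};

console.log(tokenize('<p>Hello <br/> world</p>'));
```

Under these assumptions, the call above produces five tokens: an opening p tag, the text Hello, a br tag marked as both opening and closing, the text world, and a closing p tag. An unterminated tag such as <p at the end of the input still reaches processTagToken and throws, which matches the loose detection described earlier.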
// Text tokens are stored as-is; the buffer is reset afterwards
const processTextToken = str => {
  tokens.push({ type: 'text', value: str });
  buffer = '';
};

// Tags that are treated as self-closing even without a trailing slash
const SELF_CLOSING_TAGS = new Set([
  'br', 'img', 'input', 'meta', 'hr', 'link'
]);
const processTagToken = str => {
  // Reject anything that isn't a single, well-delimited <...> tag
  if (!str.match(/^<[^<>]+>$/))
    throw new Error(`${str} is not a valid HTML tag`);
  const tagName = str.match(/^<\/?([^<>/ ]+)/)[1];
  const isClosingTag = str.startsWith('</');
  const isSelfClosingTag =
    str.endsWith('/>') || SELF_CLOSING_TAGS.has(tagName);
  // Whatever remains after stripping the tag name and the closing bracket
  const tagAttributeString = str
    .replace(new RegExp(`^</?${tagName}`), '')
    .replace(/\/?>/, '')
    .trim() || null;
  tokens.push({
    type: 'tag',
    tagName,
    opening: !isClosingTag || isSelfClosingTag,
    closing: isClosingTag || isSelfClosingTag,
    tagAttributeString
  });
  buffer = '';
};

Context
From 30-seconds-of-code: simple-html-tokenization-validation
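The title also promises validation, which the snippets above stop short of. Here's one way we might check that tags are balanced, assuming tokens shaped like the ones processTagToken produces. The areTagsBalanced name and the boolean return value are made up for this sketch. A stack is a common way to do this kind of check, but it isn't necessarily how the original article goes about it.

```javascript
// Hypothetical validator (not from the article). Assumes each tag token
// looks like { type: 'tag', tagName, opening, closing }.
const areTagsBalanced = tokens => {
  const stack = [];

  for (const token of tokens) {
    // Text tokens have no effect on balance
    if (token.type !== 'tag') continue;
    // Self-closing tags open and close themselves
    if (token.opening && token.closing) continue;

    if (token.opening) {
      stack.push(token.tagName);
      continue;
    }

    // A closing tag must match the most recently opened tag
    if (stack.pop() !== token.tagName) return false;
  }

  // Anything still on the stack was opened but never closed
  return stack.length === 0;
};

// Sample tokens written by hand for the example
const sample = [
  { type: 'tag', tagName: 'p', opening: true, closing: false },
  { type: 'text', value: 'Hello' },
  { type: 'tag', tagName: 'br', opening: true, closing: true },
  { type: 'tag', tagName: 'p', opening: false, closing: true }
];

console.log(areTagsBalanced(sample));
```

The sample tokens are balanced, so the check passes. Swapping the last token's tagName for div, or dropping it entirely, makes the function return false.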