patterncsharpMinor
Lexer for C# source code
sourcecodeforlexer
Problem
This code reads a .cs source file in the \bin folder and parses the C# code in it. The program outputs keywords, identifiers, separators and numerical constants.
How could it be improved?
```
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Tema3Compilatoare
{
class LexicalAnalysis
{
string[] keywords = { "abstract", "as", "base", "bool", "break", "by",
"byte", "case", "catch", "char", "checked", "class", "const",
"continue", "decimal", "default", "delegate", "do", "double",
"descending", "explicit", "event", "extern", "else", "enum",
"false", "finally", "fixed", "float", "for", "foreach", "from",
"goto", "group", "if", "implicit", "in", "int", "interface",
"internal", "into", "is", "lock", "long", "new", "null", "namespace",
"object", "operator", "out", "override", "orderby", "params",
"private", "protected", "public", "readonly", "ref", "return",
"switch", "struct", "sbyte", "sealed", "short", "sizeof",
"stackalloc", "static", "string", "select", "this",
"throw", "true", "try", "typeof", "uint", "ulong", "unchecked",
"unsafe", "ushort", "using", "var", "virtual", "volatile",
"void", "while", "where", "yield" };
string[] separator = { ";", "{","}","\r","\n","\r\n"};
string[] comments = { "//", "/*", "*/" };
string[] operators = { "+", "-", "*", "/", "%", "&","(",")","[","]",
"|", "^", "!", "~", "&&", "||",",",
"++", "--", ">", "==", "!=", "", "=", "=", "+=", "-=", "*=", "/=", "%=", "&=", "|=",
"^=", ">=", ".", "[]", "()", "?:", "=>", "??" };
public string Parse(string item)
{
StringBuilder str = new StringBuilder();
int ok;
if (Int32.TryParse(item, out ok))
{
```
Solution
The `keywords`, `separator`, `comments`, and `operators` arrays could be `static readonly`, so that they don't need to be re-initialized for every instance of the `LexicalAnalysis` class you create. The type would probably be better off named `LexicalAnalyzer` though: the analyzer performs the analysis, but it is not "the analysis".
Your `private` members are all explicitly `private`, except these arrays. That's an easily fixed inconsistency.
There's a lot of this:
```
if (Array.IndexOf(comments, str) > -1)
    return true;
return false;
```
That could be simply written as:
```
return comments.Contains(str);
```
Your public interface looks like this:
```
string Parse(string item);
string GetNextLexicalAtom(ref string item);
```
...which isn't crystal-clear. What's an `item`? Is a "lexical atom" a token? Why is `item` passed by reference?
Looking at the commented-out code in the `Program` class, it seems you're not sure either.
I like to design my objects starting with the public interface: if I like what I'm seeing, then I go and implement it. In this case I'd probably have gone with something like this:
```
IParseTree Parse(string code);
```
What's an `IParseTree`? It's what parsers typically return: a tree structure representing the code that was parsed. A `Parse(string)` method that returns a `string` is quite puzzling, actually.
The further you extend this code toward an actual lexer/parser, the more pressing the need becomes to define a grammar and generate the lexer/parser from the grammar rules. Look into ANTLR4 for C# if that's an avenue you wish to explore.
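To illustrate the points about shared lookup tables and a clearer public interface, here is a minimal sketch of a hand-written tokenizer. Every name in it (`TokenKind`, `Token`, `LexicalAnalyzer.Tokenize`) is hypothetical and not taken from the reviewed code. The keyword set is abbreviated, and anything that is not a word or a run of digits is treated as a one-character symbol, so this is far from a complete C# lexer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration only; none of them come from the reviewed code.
public enum TokenKind { Keyword, Identifier, Number, Symbol }

public sealed class Token
{
    public Token(TokenKind kind, string text) { Kind = kind; Text = text; }
    public TokenKind Kind { get; private set; }
    public string Text { get; private set; }
}

public class LexicalAnalyzer
{
    // static readonly: built once and shared by every instance.
    // Abbreviated on purpose; the full keyword list from the question would go here.
    private static readonly HashSet<string> Keywords =
        new HashSet<string> { "class", "int", "return", "if", "else", "while" };

    // Takes the whole source text and returns tokens lazily, one at a time.
    public IEnumerable<Token> Tokenize(string code)
    {
        int i = 0;
        while (i < code.Length)
        {
            char c = code[i];
            if (char.IsWhiteSpace(c)) { i++; continue; }

            int start = i;
            if (char.IsLetter(c) || c == '_')
            {
                // Consume a word, then decide whether it is a keyword or an identifier.
                while (i < code.Length && (char.IsLetterOrDigit(code[i]) || code[i] == '_')) i++;
                string word = code.Substring(start, i - start);
                TokenKind kind = Keywords.Contains(word) ? TokenKind.Keyword : TokenKind.Identifier;
                yield return new Token(kind, word);
            }
            else if (char.IsDigit(c))
            {
                // Consume a run of decimal digits (integer literals only).
                while (i < code.Length && char.IsDigit(code[i])) i++;
                yield return new Token(TokenKind.Number, code.Substring(start, i - start));
            }
            else
            {
                // Simplification: every other character is a one-character symbol,
                // so a multi-character operator such as "==" comes out as two tokens.
                i++;
                yield return new Token(TokenKind.Symbol, c.ToString());
            }
        }
    }
}
```

Calling `new LexicalAnalyzer().Tokenize("int x = 42;")` yields a keyword, an identifier, a symbol, a number and a symbol, in that order. Returning typed tokens rather than a formatted `string` leaves the decision about how to print them to the caller.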
Alternatively, if you're on C# 6.0, you could leverage the Roslyn API and parse C# code using the actual C# compiler!
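As a rough idea of what the Roslyn route could look like, the sketch below parses a snippet into a syntax tree and walks its tokens. It assumes the project references the `Microsoft.CodeAnalysis.CSharp` NuGet package; the `RoslynTokenDump` class and its `Describe` method are hypothetical names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

// Hypothetical helper; assumes the Microsoft.CodeAnalysis.CSharp package is referenced.
public static class RoslynTokenDump
{
    // Returns one "category: text" line per token in the given source text.
    public static IEnumerable<string> Describe(string code)
    {
        // The compiler's own parser builds the syntax tree.
        SyntaxTree tree = CSharpSyntaxTree.ParseText(code);

        foreach (SyntaxToken token in tree.GetRoot().DescendantTokens())
        {
            SyntaxKind kind = token.Kind();

            // The tree ends with an empty end-of-file token; skip it.
            if (kind == SyntaxKind.EndOfFileToken) continue;

            // Collapse all keyword kinds into one label, keep the kind name otherwise.
            string category = SyntaxFacts.IsKeywordKind(kind) ? "Keyword" : kind.ToString();
            yield return category + ": " + token.Text;
        }
    }

    public static void Main()
    {
        foreach (string line in Describe("class C { int x = 42; }"))
            Console.WriteLine(line);
    }
}
```

The same tree also exposes declarations, statements and expressions as nodes, which is what a method shaped like `IParseTree Parse(string code)` would hand back.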
Code Snippets
```
if (Array.IndexOf(comments, str) > -1)
    return true;
return false;
```
```
return comments.Contains(str);
```
```
string Parse(string item);
string GetNextLexicalAtom(ref string item);
```
```
IParseTree Parse(string code);
```
Context
StackExchange Code Review Q#113418, answer score: 8