HiveBrain v1.2.0
Get Started
← Back to all entries
patternModerate

any hope for a universal automatic parser?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
parseruniversalanyhopeforautomatic

Problem

Say you are a program, and you are given some source code but you don't know in what language, it can be C++/Java/Python/Lisp/... all you know is that it is highly structured and LR(1) parse-able, and you want to make some guesses on the corresponding syntactic tree.

How would you try to achieve this, and would you have any hope to make it working on some toy problems ?

To be clear, the goal isn't to make a parser for 4 or 5 different (but similar) programming languages, but really to detect -from some examples- the regularity in how the data is structured, and build from it a syntactic tree, with the hope to generate a grammar, a syntactic parser and -let's dream- a semantic parser and an interpreter.

My thoughts : a major problem is that in general the grammar needed for parsing a given source code isn't LR(1), but only locally LR(1), and globally recursive. For example HTML contains some Javascript, itself containing some HTML+Javascript hidden in a string :


   document.onLoad = (function(){ 
            document.write("     \
                        \
                           function fn() { alert('wow a language nested into another'); }    \
                     ");


For parsing this, you need to generate a grammar for HTML and another one for Javascript, and generate a recursive grammar saying that sometimes in the syntactic tree (from one node to its child) it can jump from one language to the other.

Do you have any reference/idea on this problem?

Solution

You might be interested in learning about grammar induction: given a set of examples of strings from a context-free language, there are algorithms to learn a context-free grammar that generates those strings.

To learn more about it, read the Wikipedia article I linked to, and Inducing a context free grammar, Is there a known method for constructing a grammar given a finite set of finite strings?, and the the Sequitur, Lempel-Ziv-Welch, and byte pair encoding algorithms.

There's no hope for automatically building an interpreter for the language, because that depends on the semantics (meaning) of the programs, and you can't determine that solely from the program source code; the most you can hope to learn is to characterize the syntax of the language, not the semantics.

Context

StackExchange Computer Science Q#64556, answer score: 10

Revisions (0)

No revisions yet.