ln -s /usr/local/waxeye/bin/waxeye ~/bin/
Copyright © 2008 Orlando D. A. R. Hill
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".
As programmers, we are required to make use of data that is presented in a variety of formats. In order to extract and manipulate the desired information, we need the ability to navigate the structure of the language the data is written in. Unless the language is very simple, we must use a parser that understands the language and gives us the data in a form we can more readily use.
Manually creating parsers can be boring and time consuming. It is, therefore, common to use a use parser generator to do the grunt work of constructing the parser. This is where Waxeye comes in handy.
You can download the current official release of Waxeye from Sourceforge. If you want to get the very latest version of Waxeye's source, you can download it from GitHub using the Git version control system.
There are no external dependencies needed to run a pre-built version of Waxeye. If you build from source, you'll need MzScheme.
To use a generated parser, you need a supported programming language to run it from.
Extract the files of the distribution.
Copy the waxeye directory to where you wish to install it.
Add the bin/waxeye binary to your search path. e.g. If you have ~/bin in your PATH and installed waxeye to /usr/local/waxeye then you might do the following.
ln -s /usr/local/waxeye/bin/waxeye ~/bin/
Extract the files of the distribution.
Copy the waxeye directory to where you wish to install it.
Currently, Waxeye may be used from a command-line interface. You can use it as a command-line tool or as part of a script or build-system. There are plans to develop a graphical tool at a later stage.
Run Waxeye by executing the waxeye binary.
Use a command prompt to run waxeye.exe.
This chapter gives a brief overview of how to use Waxeye. All material presented here will be covered in greater detail throughout the following chapters. The main purpose of this chapter is to jump-start those already familiar with parser generators. If you are just beginning to learn about parsing, feel free to skip this chapter and return to it later on.
To describe a language with Waxeye, write a series of non-terminal definitions in a grammar file. By convention, Waxeye grammars end with the .waxeye file extension.
A non-terminal definition is of the following form:
Where Name matches [a-zA-Z] *[a-zA-Z0-9_-].
Example <- A | B
To define a non-terminal whose result will never be included in the AST, use this form:
Example <: *([ \t] | Comment)
By default, the first non-terminal in a Waxeye grammar file is treated as the starting symbol of the language.
An expression can be any of the following types:
.
Matches any character from the input.
'text'
Matches text in the input.
"text"
Matches text in the input while ignores case. This equivalent to the expression [tT][eE][xX][tT] but, is much more readable.
[a-z_-]
Character-class that matches either a lower-case English character, _ or -.
NT
References the non-terminal named NT.
(e)
Raises the precedence of the expression e.
:e
Doesn't include the result of e when building the AST.
*e
Puts e within a closure.
+e
Puts e within a plus-closure.
?e
Puts e within an optional.
!e
Checks that e fails.
&e
Checks that e succeeds.
e1 e2
Matches e1 and e2 in sequence.
e1|e2
Tries to match e1 and, if that fails, tries to match e2.
When we want to understand data that has been written in a language of interest (L), we need to break our data into units of the language. A parser (P), for language L is a program specifically designed to perform that task. We give P data that is written in L and P will split our data based on how L is structured.
When language L is simple, it is easy for us to, manually, write our parser in the programming language of our choice. However, as the structural complexity of L increases, so, too, does the size and complexity of the parser program.
Manually writing and maintaining a large parser soon becomes a tedious and laborious job. Thankfully, automating boring tasks is exactly what computers were designed for. Enter parser generators.
A parser generator is a tool to help software developers by automating the process of creating a parser. Just like compilers and assemblers, a parser generator takes a description of a program, automatically does the dirty work for you and gives you your program as output. Each accepts input in one language (L1), performs various transformations and creates output in another language (L2).
L1 --> Compiler --> L2 L1 --> Assembler --> L2 L1 --> Parser Generator --> L2
The key difference between the three tools is the level of abstraction held by each of the input and output languages. The assembler works at the lowest level by taking assembly files and producing machine code. The compiler works above the assembler by taking a more abstract programming language and generating assembly files or machine code directly. Finally, the parser generator has the highest level of abstraction and transforms a grammar file into programming language source code for a compiler to process.
A grammar file is a file that describes a language. To generate a parser for a language (L), you must supply the parser generator with a grammar file that describes L.
To make the most of your data, it must be presented in a form that is easy to understand and manipulate. The most effective form for this is a tree.
A Waxeye parser will automatically represent your data as an Abstract Syntax Tree (AST). The layout of the AST is based on the structure of your language's grammar.
We can define a language as the set of strings it contains. While it is sometimes possible to specify a language simply by enumerating all of its strings, such an approach has significant drawbacks. To try to write out each string in our language would, likely, be very time consuming and, potentially, take forever.
Suppose we need to read time information from within files as part of a larger program. In a trivial case, the time information may be presented as two digits for the hours, a colon :, and then two digits for the minutes.
00:00, 00:01, 00:02, ... 14:23, 14:24, 14:25, ... 23:57, 23:58, 23:59
We could describe our time language this way, but writing all 1,440 possible hour/minute combinations wouldn't be much fun. Not to mention how bad things would be if we extended our language to include date information.
As another example, consider the language that consists of all strings of one or more alphabet character.
a, b, c, ... z, aa, ab, ac, ... az, aaa, aab, aac, ...
Even worse than our time example, this language is infinite. It would be impossible for us to explicitly list every string in the language.
If we want to talk about such languages, we need a notation that is more abstract and compact than simply writing out each string. We call such a notation a grammar.
A grammar is made up of expressions. These expressions give us a succinct representation of a language and allow us to describe its syntactic structure.
.
Matches any character from the input.
'text'
Matches text in the input.
"text"
Matches text in the input while ignores case. This equivalent to the expression [tT][eE][xX][tT] but, is much more readable.
[a-z_-]
Character-class that matches either a lower-case English character, _ or -.
NT
References the non-terminal named NT.
(e)
Raises the precedence of the expression e.
:e
Doesn't include the result of e when building the AST.
*e
Puts e within a closure.
+e
Puts e within a plus-closure.
?e
Puts e within an optional.
!e
Checks that e fails.
&e
Checks that e succeeds.
e1 e2
Matches e1 and e2 in sequence.
e1|e2
Tries to match e1 and, if that fails, tries to match e2.
In Waxeye grammars, some expressions can have other expressions nested within them. When we use parentheses, we are explicitly denoting the nesting structure of the expressions.
((?A) B) | C
At times, this can seem needlessly verbose. In many cases, we are able to omit the parentheses in favor of a shorter notation. We do this by exploiting the precedence of each expression type.
?A B | C
The precedence of an expression determines the priority it has when resolving implicitly nested expressions. Each expression type has a level of precedence relative to all other types. There are four different precedence levels in Waxeye grammars.
The highest precedence is held by the atomic expressions. Because these expressions cannot, themselves, contain expressions, there is no need to consider which expressions are nested within them.
The prefix expressions hold the next precedence level. Their nesting is resolved directly after the atomic expressions.
Sequences of expressions are formed once the atomic and prefix expressions have been resolved.
Finally, once all other expressions have been resolved, the different choices of the alternation expression are resolved.
Sometimes, creating a new AST node from a non-terminal will give us more information than we need. To make it easier to process the AST, we can remove this vertical noise by using the pruning non-terminal type. This non-terminal type has the following form:
When Name has successfully parsed a string, one of three things will happen, depending on the number of results to be included from Name's expressions.
If there are no expression results to be included, nothing new will be added to the AST.
If there is one expression result to be included, that result will take the place of the Name AST node.
Otherwise, a new Name AST node will be created, just like a normal non-terminal.
To help understand how this works, consider an example from a simple arithmetic grammar.
Product <- Number *([*/] Number) Number <- +[0-9]
If we use the Product rule to parse the string 3*7, we get a tree with Product at the root and, below that, a Number, a * character and then another Number.
Product
-> Number
| 3
| *
-> Number
| 7
However, if the Product rule parses a string with just one Number in it, we will get a tree that is slightly bigger than we need. Parsing the string 5 produces the following tree.
Product
-> Number
| 5
In this case, having a Product node at the root of the AST isn't necessary. If we want to, we can rewrite the original grammar to use a pruning non-terminal.
Product <= Number *([*/] Number) Number <- +[0-9]
Now, when we use Product to parse 3*7, we will get the same result as before but, when parsing 5, we get an AST with Number as the root.
Number | 5
As a second example, let's look at a grammar for nested parentheses.
A <- :'(' A :')' | B B <- 'b'
Here are some example inputs and their resulting ASTs:
Input: b
A
-> B
| b
Input: (b)
A
-> A
-> B
| b
Input: (((b)))
A
-> A
-> A
-> A
-> B
| b
Unless we want to know the number of parentheses matched, trees like these contain more information than we need. Again, we are able to solve this by rewriting the grammar using a pruning non-terminal.
A <= :'(' A :')' | B B <- 'b'
This time, parsing the input (((b))) gives us a much shorter tree.
B | b
There are two types of comments in Waxeye grammars; single-line and multi-line.
Single-line comments start at the first # outside of an atomic expression and extend until the end of the line.
# This is a single-line comment.
Multi-line comments are opened at the first /* outside of an atomic expression and closed with a */.
/* This is a multi-line comment. */
/* This is, also, a multi-line comment. */
As an added convenience for when editing a grammar, multi-line comments can be nested within each other. This is handy when you want to comment out a section of the grammar that already contains a comment.
/* This is the outer comment. A <- 'a' /* * This is the inner comment. */ B <- 'b' */
This chapter will show you how to setup Waxeye for your programming language. It covers language specific installation requirements and presents some basic boilerplate code to get you started. You can find copies of this boilerplate code in src/example/. I use $WAXEYE_HOME to refer to the location where you have installed the files of the Waxeye distribution.
The example grammar we'll be using can be found in grammars/num.waxeye. You may wish to copy it to the directory you're working in so you can experiment with extending and modifying the grammar.
Num <- '0' | [1-9] *[0-9]
Once setup and run, the boilerplate example will use the parser you generated to parse the string 42 and print the AST it creates.
Num | 4 | 2
Waxeye's C runtime is currently supported on unix platforms and MacOSX.
To install the C runtime, you need to compile it and, optionally, install the header files and library in your system search paths.
To compile the runtime, perform the command make lib from within the src/c directory of your waxeye installation.
cd $WAXEYE_HOME/src/c make lib make clean
To install the header files and library in your search paths, you could copy the files directly but, creating symbolic links to them will make upgrading easier.
sudo ln -s $WAXEYE_HOME/lib/libwaxeye.a /usr/local/lib/ sudo ln -s $WAXEYE_HOME/src/c/include/waxeye.h /usr/local/include/ sudo ln -s $WAXEYE_HOME/src/c/include/waxeye /usr/local/include/
waxeye -g c . num.waxeye
#include <string.h> #include "parser.h" int main() { // Create our parser struct parser_t *parser = parser_new(); // Setup our input char data[] = "42"; struct input_t *input = input_new(data, strlen(data)); // Parse our input struct ast_t *ast = parse(parser, input); // Print our ast display_ast(ast, type_strings); ast_recursive_delete(ast); input_delete(input); parser_delete(parser); return 0; }
If you installed the headers and library in your system path:
gcc example.c parser.c -lwaxeye -o example
Otherwise:
FLAGS="-I $WAXEYE_HOME/src/c/include/ -L $WAXEYE_HOME/lib/" gcc $FLAGS example.c parser.c -lwaxeye -o example
Finally,
./example
Waxeye's Java runtime is compatible with version 1.5 and 1.6 of the JRE. It should also be possible to use Waxeye with JRE versions 1.3 and 1.4 of by retrofitting the classes with Retroweaver or Retrotranslator.
To use a Waxeye parser from Java, you need Waxeye's Java runtime in your classpath. The required classes are in the Jar file lib/waxeye.jar.
waxeye -g java . num.waxeye
import org.waxeye.input.InputBuffer; import org.waxeye.parser.ParseResult; public class Example { public static void main(final String[] args) { // Create our parser final Parser parser = new Parser(); // Setup our input final InputBuffer input = new InputBuffer("42".toCharArray()); // Parse our input final ParseResult<Type> result = parser.parse(input); // Print our ast System.out.println(result); } }
javac -cp .:$WAXEYE_HOME/lib/waxeye.jar Example.java Parser.java Type.java
java -cp .:$WAXEYE_HOME/lib/waxeye.jar Example
Waxeye's Ruby runtime is compatible with Ruby version 1.8.6.
Install the Waxeye gem; either from Rubyforge or, from the gem file in lib.
# Install the Waxeye gem from Rubyforge sudo gem install waxeye
waxeye -g ruby . num.waxeye
require 'rubygems' require 'waxeye' require 'parser' # Create our parser p = Parser.new() # Parse our input ast = p.parse("42") # Print our AST ast.display
ruby example.rb
Waxeye's Scheme runtime is compatible with v372 and v4 of PLT-Scheme's MzScheme. You will need to have MzScheme installed to use it; either by itself or with DrScheme.
Install the waxeye collection in your preferred place where MzScheme can find it.
# Install the Waxeye collection; change to your install paths as needed sudo ln -s /usr/local/waxeye/src/scheme/waxeye /usr/local/plt/lib/plt/collects/
waxeye -g scheme . num.waxeye
(module example mzscheme (require "parser.scm") (define (run) ;; Parse our input (let ((ast (parser "42"))) ;; Print the ast (if (ast? ast) (display-ast ast) (display-parse-error ast)))) (run) )
mzscheme -mv -t example.scm
mzscheme -t example.scm
Since just printing an Abstract Syntax Tree isn't very interesting, let's have a look at how to access the information the ASTs contain.
When you use a Waxeye parser, the result will be one of two things. If the parser successfully parsed the input, the result will be an AST. If the input doesn't match the syntax of the language, the result will be a parse error.
switch (result->type) { case AST_TREE: return "tree ast"; case AST_EMPTY: return "empty ast"; case AST_ERROR: return "error"; }
ParseResult<Type> result = parser.parse(input); if (result.getAST() != null) { if (result instanceof IEmpty) { return "empty ast"; } else { return "tree ast"; } } else { return "error"; }
case result.class when Waxeye::AST "tree ast" when Waxeye::ParseError "error" else "empty ast"
(cond ((ast? result) "tree ast") ((parse-error? result) "error") (else "empty ast"))
ASTs come in three different forms; tree, char and empty.
A tree AST contains a type, a list of children and, the start and end position in the input.
A char AST contains a single character and has no children.
An empty AST simply signifies that parsing was successful. If your starting non-terminal is voided or is pruning and had no children, you will get an empty AST.
If a given AST node will only ever have char children, you may wish to treat that node as a single string.
char *str = ast_children_as_string(ast); printf("%s\n", str); free(str);
System.out.println(ast.childrenAsString());
puts ast.children.to_s
(display (list->string (ast-c ast))) (newline)
A parse error contains information about where the input is invalid and hints about what is wrong with it.
Now that we know how to write grammars, generate parsers and manipulate AST, we can put these skills together to build a small language interpreter. In this chapter, we create a command-line calculator.
Our calculator reads a line of input, parses it as an arithmetic expression and computes the result. The arithmetic language supports the following constructs.
floating point numbers
binary operators +,-,*,/
unary negation
parentheses
calc <- ws sum sum <- prod *([+-] ws prod) prod <- unary *([*/] ws unary) unary <= '-' ws unary | :'(' ws sum :')' ws | num num <- +[0-9] ?('.' +[0-9]) ws ws <: *[ \t\n\r]
#include <stdio.h> #include "parser.h" double sum(struct ast_t *ast); // GNU libc extension extern ssize_t getline(char **lineptr, size_t *n, FILE *stream); double num(struct ast_t *ast) { char *buf = ast_children_as_string(ast); double val = atof(buf); free(buf); return val; } double unary(struct ast_t *ast) { struct ast_tree_t *tree = ast->data.tree; switch (tree->type) { case TYPE_UNARY: return - unary(vector_get(tree->children, 1)); case TYPE_SUM: return sum(ast); default: return num(ast); } } double prod(struct ast_t *ast) { struct vector_t *chil = ast->data.tree->children; size_t num_chil = chil->size; double val = unary(vector_get(chil, 0)); size_t i; for (i = 1; i < num_chil; i +=2) { char operator = ((struct ast_t*) vector_get(chil, i))->data.c; double operand = unary(vector_get(chil, i + 1)); if (operator == '*') { val *= operand; } else { val /= operand; } } return val; } double sum(struct ast_t *ast) { struct vector_t *chil = ast->data.tree->children; size_t num_chil = chil->size; double val = prod(vector_get(chil, 0)); size_t i; for (i = 1; i < num_chil; i +=2) { char operator = ((struct ast_t*) vector_get(chil, i))->data.c; double operand = prod(vector_get(chil, i + 1)); if (operator == '+') { val += operand; } else { val -= operand; } } return val; } void calc(struct parser_t *parser, struct input_t *input) { struct ast_t *ast = parse(parser, input); if (ast->type == AST_ERROR) { display_ast(ast, type_strings); } else { printf("%f\n", sum(vector_get(ast->data.tree->children, 0))); } ast_recursive_delete(ast); } struct input_t *rl() { size_t data_size = 0; char* data = NULL; ssize_t line_size = getline(&data, &data_size, stdin); if (line_size == -1) { free(data); return NULL; } return input_new(data, line_size); } int main() { struct input_t *input; struct parser_t *parser = parser_new(); printf("calc> "); input = rl(); while (input != NULL) { calc(parser, input); input_and_data_delete(input); printf("calc> "); input = rl(); } printf("\n"); parser_delete(parser); return 0; }
import java.io.BufferedReader; import java.io.InputStreamReader; import java.io.IOException; import java.util.List; import org.waxeye.ast.IAST; import org.waxeye.ast.IChar; import org.waxeye.input.InputBuffer; import org.waxeye.parser.ParseResult; /** * A commandline arithmetic calculator. */ public final class Calculator { private static final Parser p = new Parser(); private Calculator() { } private static Object calc(final String input) { final ParseResult<Type> result = p.parse(new InputBuffer(input.toCharArray())); final IAST<Type> ast = result.getAST(); if (ast == null) { return result.getError(); } else { return sum(ast.getChildren().get(0)); } } private static double sum(final IAST<Type> sum) { final List<IAST<Type>> chil = sum.getChildren(); double val = prod(chil.get(0)); for (int i = 1; i < chil.size(); i += 2) { final char operator = ((IChar)chil.get(i)).getValue(); if (operator == '+') { val += prod(chil.get(i + 1)); } else { val -= prod(chil.get(i + 1)); } } return val; } private static double prod(final IAST<Type> prod) { final List<IAST<Type>> chil = prod.getChildren(); double val = unary(chil.get(0)); for (int i = 1; i < chil.size(); i += 2) { final char operator = ((IChar)chil.get(i)).getValue(); if (operator == '*') { val *= unary(chil.get(i + 1)); } else { val /= unary(chil.get(i + 1)); } } return val; } private static double unary(final IAST<Type> unary) { switch (unary.getType()) { case Unary: return - unary(unary.getChildren().get(1)); case Sum: return sum(unary); default: return num(unary); } } private static double num(final IAST<Type> num) { return Double.parseDouble(num.childrenAsString()); } public static void main(String[] args) { try { final BufferedReader in = new BufferedReader( new InputStreamReader(System.in)); System.out.print("calc> "); String line = in.readLine(); while (line != null) { System.out.println(calc(line)); System.out.print("calc> "); line = in.readLine(); } System.out.println(); } catch (IOException e) { e.printStackTrace(); } } }
require 'rubygems' require 'waxeye' require 'parser' # A commandline arithmetic calculator. class Calculator @@p = Parser.new() def self.calc(input) ast = @@p.parse(input) if ast.is_a?(Waxeye::ParseError) ast.display() else puts sum(ast.children[0]) end end def self.bin_op(ast, fn, ch, op1, op2) chil = ast.children val = self.send(fn, (chil[0])) i = 1 until i == chil.size # Increment val by the operator applied to val and the operand val = val.send(chil[i] == ch ? op1 : op2, self.send(fn, chil[i + 1])) i += 2 end val end def self.sum(ast) bin_op(ast, :prod, '+', :'+', :'-') end def self.prod(ast) bin_op(ast, :unary, '*', :'*', :'/') end def self.unary(unary) case unary.type when :unary - unary(unary.children[1]) when :sum sum(unary) else num(unary) end end def self.num(num) num.children.to_s.to_f end end print 'calc> ' STDIN.each {|input| Calculator.calc(input); print 'calc> ' } puts
(module calculator mzscheme (require "parser.scm") ;; A commandline arithmetic calculator. (define (calc input) (let ((ast (parser input))) (if (ast? ast) (begin (display (sum (car (ast-c ast)))) (newline)) (display-parse-error ast)))) (define (bin-op ast fn ch op1 op2) (let* ((chil (list->vector (ast-c ast))) (val (fn (vector-ref chil 0)))) (let loop ((i 1)) (unless (= i (vector-length chil)) ;; Increment val by the operator applied to val and the operand (set! val ((if (equal? (vector-ref chil i) ch) op1 op2) val (fn (vector-ref chil (+ i 1))))) (loop (+ i 2)))) val)) (define (sum ast) (bin-op ast prod #\+ + -)) (define (prod ast) (bin-op ast unary #\* * /)) (define (unary ast) (case (ast-t ast) ((unary) (- (unary (cadr (ast-c ast))))) ((sum) (sum ast)) (else (num ast)))) (define (num ast) (string->number (list->string (ast-c ast)))) (define (rl) (display "calc> ") (read-line (current-input-port))) (let loop ((input (rl))) (if (eof-object? input) (newline) (begin (calc input) (loop (rl))))) )
;; These are tests for the 'Grammar' non-terminal (Grammar ; <- This is the non-terminal's name ;; Following the name are pairs of input string and expected output. The ;; output is either the keyword 'pass', the keyword 'fail' or an AST. The AST ;; specifies the structure of the expected tree, the names of the nodes and ;; the individual characters. If you don't want to specify the whole tree, ;; just use the wild-card symbol '*' for the portion of the tree you want to ;; skip. "" ; <- This is the input (Grammar) ; <- This