1. Introduction
As programmers, we are required to make use of data that is presented in a variety of formats. In order to extract and manipulate the desired information, we need the ability to navigate the structure of the language the data is written in. Unless the language is very simple, we must use a parser that understands the language and gives us the data in a form we can more readily use.
Manually creating parsers can be boring and time consuming. It is, therefore, common to use a use parser generator to do the grunt work of constructing the parser. This is where Waxeye comes in handy.
2. Getting Started
2.1. Downloading
You can download the latest version of Waxeye’s source code from GitHub.
2.2. Requirements
There are no external dependencies needed to run a pre-built version of Waxeye. If you build from source, you’ll need Racket.
To use a generated parser, you need a supported programming language to run it from.
2.3. Installation
2.3.1. Unix and MacOSX
-
Extract the files of the distribution.
-
Copy the waxeye directory to where you wish to install it.
-
Add the bin/waxeye binary to your search path. e.g. If you have
~/bin
in your PATH and installed waxeye to /usr/local/waxeye then you might do the following.
ln -s /usr/local/waxeye/bin/waxeye ~/bin/
2.3.2. Windows
-
Extract the files of the distribution.
-
Copy the waxeye directory to where you wish to install it.
2.4. Running
Currently, Waxeye is used from a command-line interface. You can use it as a command-line tool or, as part of a script or build-system. There are plans to develop a graphical tool at a later stage.
2.4.1. Unix and MacOSX
Run Waxeye by executing the waxeye
binary.
2.4.2. Windows
Use a command prompt to run waxeye.exe
.
3. Basic Concepts
3.1. What is a parser?
When we want to understand data that has been written in a language of interest (L), we need to break our data into units of the language. This process of breaking our input into different parts, based on the structure of L, is called parsing. A program used for parsing is called a parser.
3.2. What is the result of a parser?
Once your input has been parsed, you need the result to be presented in a from that is easy to understand and manipulate. Since the input was organized based on the hierarchical structure of the language, it makes sense that the output of the parser mimic this structure. The most effective form to do this with is a tree.
Such a tree is known as an Abstract Syntax Tree (AST). A Waxeye parser will automatically give you an AST that represents your input. The structure of this AST is based on the structure of your language’s grammar.
3.3. What is a parser generator?
If L is simple, it is easy for us to use our programming lanugage of choice to, manually, write a parser for L. However, as the structural complexity of L increases, so too, does the size and complexity of the parser program. Writing and maintaining a large parser, by hand, can quickly become a tedious and laborious job. Thankfully, we can use a parser generator to automate the work of creating a parser so we can focus on other problems.
A parser generator is a tool designed to help software developers automate the process of creating a parser. Just like compilers and assemblers, a parser generator takes a description of a program, automatically does the boring work for you and gives you a transformed program as output. Each tool accepts input in one language (L1), performs various transformations and creates output in another language (L2).
L1 --> Compiler --> L2
L1 --> Assembler --> L2
L1 --> Parser Generator --> L2
The key difference between the three tools is the level of abstraction held by the input and output languages. The assembler works at the lowest level by taking assembly files and producing machine code. The compiler works above the assembler by taking a more abstract programming language and generating assembly files or machine code directly. Finally, the parser generator has the highest level of abstraction and transforms a grammar file into programming language source code for a compiler to process.
3.4. What is a grammar file?
We can define a language as the set of strings it contains. While it is sometimes possible to specify a language simply by enumerating all of its strings, such an approach has significant drawbacks. Trying to write each string in our language could be very time consuming and, potentially, take forever.
Suppose we need to read time information as part of a larger program. In a
trivial case, the time information may be presented as two digits for the
hours, a colon :
, and then two digits for the minutes.
00:00, 00:01, 00:02, ... 14:23, 14:24, 14:25, ... 23:57, 23:58, 23:59
We could describe our time language this way but, writing all 1,440 possible hour/minute combinations wouldn’t be much fun. Not to mention how bad things would be if we extended our language to include date information.
As another example, consider the language that consists of all strings of one or more alphabet character.
a, b, c, ... z, aa, ab, ac, ... az, aaa, aab, aac, ...
Even worse than our time example, this language is infinite. It would be impossible for us to explicitly list every string in the language.
If we want to describe such languages, we need a notation that is more abstract than simply writing out strings. We call this notation a grammar and the file that contains it a grammar file.
4. Waxeye Grammars
To generate a parser for a language, you must supply the parser generator with
a grammar file that describes the language. Waxeye grammar files are written as
text documents and are, by convention, given the .waxeye
file extension.
A Waxeye grammar consists of a set of rule definitions, called non-terminals. Together, the non-terminals succinctly describe the syntax of the language. By default, the first non-terminal is considered the starting point of the language definition.
4.1. Non-terminals
Non-terminals are defined in three parts; a name, a rule type and one or more grammar expressions.
The most common non-terminal type is the tree constructing non-terminal. A tree constructing non-terminal has the following form:
Where Name matches [a-zA-Z_] *[a-zA-Z0-9_-]
.
Example <- A | B
The other common non-terminal type is the void non-terminal. The result of a void non-terminal is not included in the AST that is constructed by the parser. To define a void non-terminal, use this form:
Example <: A | B
4.2. Expressions
The most important part of each non-terminal definition is the set of expressions it contains. Grammar expressions come in different forms and have their own meanings. Places where an expression can be contained within another expression are denoted with an e.
4.2.1. Atomic Expressions
Wildcard
.
Matches any character from the input.
Literal
'text'
Matches text
in the input.
Case-insensitive Literal
"text"
Matches text
in the input while ignores case. This is equivalent to the
expression [tT][eE][xX][tT]
but, is much more readable.
Character Class
[a-z_-]
Character-class that matches either a lower-case English character, _
or
-
.
Non-terminal
NT
References the non-terminal named NT
.
Parentheses
(
e)
Raises the precedence of the expression e.
4.2.2. Prefix Expressions
Void
:
e
Doesn’t include the result of e when building the AST.
Closure
*
e
Puts e within a closure.
Plus
+
e
Puts e within a plus-closure.
Optional
?
e
Puts e within an optional.
Negative Check
!
e
Checks that e fails.
Positive Check
&
e
Checks that e succeeds.
4.2.3. Sequence Expressions
e1 e2
Matches e1 and e2 in sequence.
4.2.4. Alternation Expressions
e1|
e2
Tries to match e1 and, if that fails, tries to match e2.
4.3. Precedence
In Waxeye grammars, some expressions can have other expressions nested within them. When we use parentheses, we are explicitly denoting the nesting structure of the expressions.
((?A) B) | C
At times, this can seem needlessly verbose. In many cases, we are able to omit the parentheses in favor of a shorter notation. We do this by exploiting the precedence of each expression type.
?A B | C
The precedence of an expression determines the priority it has when resolving implicitly nested expressions. Each expression type has a level of precedence relative to all other types. There are four different precedence levels in Waxeye grammars.
4.3.1. Level 4
The highest precedence is held by the atomic expressions. Because these expressions cannot, themselves, contain expressions, there is no need to consider which expressions are nested within them.
4.3.2. Level 3
The prefix expressions hold the next precedence level. Their nesting is resolved directly after the atomic expressions.
4.3.3. Level 2
Sequences of expressions are formed once the atomic and prefix expressions have been resolved.
4.3.4. Level 1
Finally, once all other expressions have been resolved, the different choices of the alternation expression are resolved.
4.4. Pruning Non-terminals
Sometimes, creating a new AST node will give us more information than we need. We might want to create a new AST node, only if doing so will tell us something interesting about our input. If the additional node gives us nothing of interest, our tree could be said to contain vertical noise.
To make it easier to process the AST, we can remove this vertical noise by using the pruning non-terminal type. This non-terminal type has the following form:
When Name has successfully parsed a string, one of three things will happen, depending on the number of results to be included from Name's expressions.
-
If there are no expression results to be included, nothing new will be added to the AST.
-
If there is one expression result to be included, that result will take the place of the Name AST node.
-
Otherwise, a new Name AST node will be created, just like a tree constructing non-terminal.
To help understand how this works, consider an example from a simple arithmetic grammar.
Product <- Number *([*/] Number) Number <- +[0-9]
If we use the Product rule to parse the string 3*7
, we get a tree with
Product at the root and, below that, a Number, a *
character and then
another Number.
Product
-> Number
| 3
| *
-> Number
| 7
However, if the Product rule parses a string with just one Number in it, we
will get a tree that is slightly bigger than we need. Parsing the string 5
produces the following tree.
Product
-> Number
| 5
In this case, having a Product node at the root of the AST isn’t necessary. If we want to, we can rewrite the original grammar to use a pruning non-terminal.
Product <= Number *([*/] Number) Number <- +[0-9]
Now, when we use Product to parse 3*7
, we will get the same result as
before but, when parsing 5
, we get an AST with Number as the root.
Number
| 5
As a second example, let’s look at a grammar for nested parentheses.
A <- :'(' A :')' | B B <- 'b'
Here are some example inputs and their resulting ASTs:
Input: b
A
-> B
| b
Input: (b)
A
-> A
-> B
| b
Input: (((b)))
A
-> A
-> A
-> A
-> B
| b
Unless we want to know the number of parentheses matched, trees like these contain more information than we need. Again, we are able to solve this by rewriting the grammar using a pruning non-terminal.
A <= :'(' A :')' | B B <- 'b'
This time, parsing the input (((b)))
gives us a much shorter tree.
B
| b
4.5. Comments
There are two types of comments in Waxeye grammars; single-line and multi-line.
4.5.1. Single-line
Single-line comments start at the first #
outside of an atomic expression and
extend until the end of the line.
# This is a single-line comment.
4.5.2. Multi-line
Multi-line comments are opened at the first /*
outside of an atomic
expression and closed with a */
.
/* This is a multi-line comment. */
/* This is, also, a multi-line comment. */
As an added convenience for when editing a grammar, multi-line comments can be nested within each other. This is handy when you want to comment out a section of the grammar that already contains a comment.
/* This is the outer comment. A <- 'a' /* * This is the inner comment. */ B <- 'b' */
5. Using Waxeye
This chapter will show you how to setup Waxeye for your programming language.
It covers language specific installation requirements and presents some basic
boilerplate code to get you started. You can find copies of this boilerplate
code in src/example/
. I use $WAXEYE_HOME
to refer to the location where you
have installed the files of the Waxeye distribution.
The example grammar we’ll be using can be found in grammars/num.waxeye
. You
may wish to copy it to the directory you’re working in so you can experiment
with extending and modifying the grammar.
Num <- '0' | [1-9] *[0-9]
Once setup and run, the boilerplate example will use the parser you generated
to parse the string 42
and print the AST it creates.
Num
| 4
| 2
5.1. Using Waxeye from Racket
Waxeye’s Racket runtime is compatible with Racket.
5.1.1. Install
Install the waxeye collection where Racket can find it.
# Install the Waxeye collection; change to your install paths as needed
sudo ln -s /usr/local/waxeye/src/racket/waxeye /usr/local/racket/lib/racket/collects/
5.1.2. Generate Parser
waxeye -g racket . num.waxeye
5.1.3. Use Parser
#lang racket (require "parser.rkt") ;; Parse our input (let ((ast (parser "42"))) ;; Print our AST (display-ast ast))
5.1.4. Run from Racket
racket -t example.rkt
6. Using ASTs and Parse Errors
Since just printing an Abstract Syntax Tree isn’t very interesting, let’s have a look at how to access the information the ASTs contain.
When you use a Waxeye parser, the result will be one of two things. If the parser successfully parsed the input, the result will be an AST. If the input doesn’t match the syntax of the language, the result will be a parse error.
6.1. ASTs
ASTs come in three different forms; tree, char and empty.
-
A tree AST contains a type, a list of children and, the start and end position in the input.
-
A char AST contains a single character and has no children.
-
An empty AST simply signifies that parsing was successful. If your starting non-terminal is voided or is pruning and had no children, you will get an empty AST.
6.1.1. Using an AST node as string
If a given AST node will only ever have char children, you may wish to treat that node as a single string.
From Racket
(display (list->string (ast-c ast))) (newline)
6.2. Parse Errors
A parse error contains information about where the input is invalid and hints about what is wrong with it.
6.3. Determining the result type
6.3.1. From Racket
(cond ((ast? result) "tree ast") ((parse-error? result) "error") (else "empty ast"))
7. Example: A Calculator
Now that we know how to write grammars, generate parsers and manipulate AST, we can put these skills together to build a small language interpreter. In this chapter, we create a command-line calculator.
Our calculator reads a line of input, parses it as an arithmetic expression and computes the result. The arithmetic language supports the following constructs.
-
floating point numbers
-
binary operators +,-,*,/
-
unary negation
-
parentheses
calc <- ws sum sum <- prod *([+-] ws prod) prod <- unary *([*/] ws unary) unary <= '-' ws unary | :'(' ws sum :')' ws | num num <- +[0-9] ?('.' +[0-9]) ws ws <: *[ \t\n\r]
7.1. Calculator in Racket
#lang racket (require "parser.rkt") ;; A commandline arithmetic calculator. (define (calc input) (let ((ast (parser input))) (if (ast? ast) (begin (display (sum (car (ast-c ast)))) (newline)) (display-parse-error ast)))) (define (bin-op ast fn ch op1 op2) (let* ((chil (list->vector (ast-c ast))) (val (fn (vector-ref chil 0)))) (let loop ((i 1)) (unless (= i (vector-length chil)) ;; Increment val by the operator applied to val and the operand (set! val ((if (equal? (vector-ref chil i) ch) op1 op2) val (fn (vector-ref chil (+ i 1))))) (loop (+ i 2)))) val)) (define (sum ast) (bin-op ast prod #\+ + -)) (define (prod ast) (bin-op ast unary #\* * /)) (define (unary ast) (case (ast-t ast) ((unary) (- (unary (cadr (ast-c ast))))) ((sum) (sum ast)) (else (num ast)))) (define (num ast) (string->number (list->string (ast-c ast)))) (define (rl) (display "calc> ") (read-line (current-input-port))) (let loop ((input (rl))) (if (eof-object? input) (newline) (begin (calc input) (loop (rl)))))
8. Grammar Testing
;; These are tests for the 'Grammar' non-terminal (Grammar ; <- This is the non-terminal's name ;; Following the name are pairs of input string and expected output. The ;; output is either the keyword 'pass', the keyword 'fail' or an AST. The AST ;; specifies the structure of the expected tree, the names of the nodes and ;; the individual characters. If you don't want to specify the whole tree, ;; just use the wild-card symbol '*' for the portion of the tree you want to ;; skip. "" ; <- This is the input (Grammar) ; <- This is the expected output "A <- 'a'" pass ; <- The keyword 'pass' "A" fail ; <- The keyword 'fail' "A <- 'a' B <- 'b'" (Grammar (Definition (Identifier #\A) *) ; <- Here we skip some of (Definition (Identifier #\B) *)) ; Definition's children "A <- 'a'" (Grammar (*)) ; <- Here we specify a child tree of any type "A <- [a-z] *[a-z0-9]" (Grammar (Definition (Identifier #\A) (LeftArrow) (Alternation *))) "A <- 'a'" (Grammar (Definition (Identifier #\A) (LeftArrow) (Alternation (Sequence (Unit (Literal (LChar #\a))))))) )
9. Modular Grammars
It is sometimes desirable to have a grammar split across multiple files and to have a final grammar built from those files. We can do this by using a modular grammar.
Having our grammar split in this way provides us with the opportunity to manipulate the definition of the non-terminals and, in the process, create new languages. Depending on how we compose our final grammar, we can create vastly different languages from the same base grammars and only need to change the one modular grammar.
One of the biggest advantages of modular grammars is that they make it very easy to embed one language within another. Many languages can be thought of in this way. Prime examples are when a programming language is embedded within XML or HTML. Or, going the other way, you could embed a data language like SQL within a programming language.
There are also cases when a language’s syntax changes subtly over time. We want to have parsers for each version of the language but without duplicating large parts of our grammars.
9.1. Grammar Composition
A modular grammar is made up of expressions that pull together non-modular grammars. Some modular expressions can have other expressions nested within them. An expression is one of the following:
-
"grammar.waxeye"
A path to a .waxeye file. This path should either be relative to the modular grammar or be an absolute path. -
(rename modular-exp (old-name . new-name) …)
Renames the specified non-terminals with their new names. -
(only modular-exp non-term …)
Includes only the listed non-terminals. -
(all-except modular-exp non-term …)
Includes all non-terminals except those listed. -
(prefix prefix modular-exp)
Prefixes the names of non-terminals from modular-exp. -
(prefix-only prefix modular-exp non-term …)
Prefixes only the listed non-terminals. -
(prefix-all-except prefix modular-exp non-term …)
Prefixes all non-terminals except those listed. -
(join modular-exp …)
Combines the results of multiple modular expressions into a single expression. Not needed at the top-level.
;; A contrived example where we replace the definition of Number in Json with a ;; much simpler one that only supports integers. (all-except "../json.waxeye" Number) (rename (only "../num.waxeye" Num) (Num . Number))
10. Waxeye Options
waxeye [ <option> ... ] <grammar>
where <option> is one of
Waxeye modes:
/ -g <language> <dir> : Generate
| -i : Interpret
\ -t <test> : Test
Grammar options:
-m : Modular Grammar - default: false
-s <start> : Starting non-terminal - default: first non-terminal
Parser options:
-c <comment> : Header comment for generated files - default: none
-e <eof> : Check parser consumes all input - default: true
-n <namespace> : Module or package namespace - default: none
-p <prefix> : Name prefix for generated files - default: none
Misc options:
--debug : Activates debug information
--version : Prints version number and copyright notice
--help, -h : Show this help
-- : Do not treat any remaining argument as a switch (at this level)
/|\ Brackets indicate mutually exclusive options.
Multiple single-letter switches can be combined after one `-'; for
example: `-h-' is the same as `-h --'
10.1. Waxeye Modes
grammar
The grammar file describing the language you want to parse. It is the last argument given to Waxeye and is required by all of Waxeye’s operating modes.
10.1.1. Generate
-g <language> <dir>
Creates a parser written in the specified programming language. Writes the parser’s files to the specified directory.
Currently supported programming languages:
-
racket
waxeye -g racket . grammar.waxeye
10.1.2. Interpret
-i
Parses input as a string from the language defined by the grammar. Displays the resulting AST or parse error.
waxeye -i grammar.waxeye < input.txt
10.1.3. Test
-t <test>
Runs the tests in the specified test file for the language defined by the grammar. Displays any test errors.
waxeye -t tests.rkt grammar.waxeye
10.2. Grammar Options
-m
Indicates that the grammar is a modular grammar.
-s <start>
Specifies the non-terminal that starts the language. Default - The first non-terminal in the grammar.
10.3. Parser Options
-c <comment>
The file to be used as the header comment of generated files. Default - none.
-e <eof>
Whether to check that the parser consumes all input. Default - true.
-n <namespace>
The module or package namespace. Default - none.
-p <prefix>
The name prefix for any generated files. Default - none.
10.4. Misc Options
--debug
Activates debug information.
--version
Prints the version number and copyright notice.
--help, -h
Prints a message describing the available command-line options.
Copyright © 2008-2021 Orlando Hill
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.