Simple, but not so simple

In spite of a Scheme implementation, Guile is also an extension language platform. This means you can write new language on it, which could be totally different from Scheme. Say, PHP/Lua/Ecmascript...and all these front-end will take advantage of the compiling optimization machenism of Guile.

This article introduced a simple way to define a very simple programing language with 50 lines code, in Guile.

The contents may require some basic knowledge of compiler principle. If you're not so comfortable with the terms in the article, please read the Dragon Book first.

Some concepts

It's better to clarify something before we start. When I'm talking 'create a language', I mean designing the grammar of the language in certain form, usually, in BNF. And when I say 'implement a language', it means writing lexer/parser, then transfer the source code to an immediate form, like AST(Abstract Syntax Tree) or other IL(Intermediate Language) for some optimizations. Finally we may have two choices:

1. To interprete the IL to get the meaningful result -- It's an Interpreter!

2. To generate some kind of code to store in a file -- It's a Compiler, if the final code is bytecode, you're writing a VM one; or native code for AOT compiler.

This article is about the front-end only: lexer and parser, and transforming a simple AST (actually it's list type in Scheme) to another kind of AST, tree-il, the first level of Guile intermediate language. After the tree-il was generated, the rest of the compiling work would be taken by Guile.

So we don't have to face the complicated compiling optimization stuffs. This feature makes Guile very easy to implement new languages. Well, now we need to create a language first, we name it 'simple', here's the BNF:

The language:

exp ::= exp op exp
      | number

op  ::= * 
      | /  ;; `multi' and `div' has higher precedence than `plus' and `minus'
      | +
      | -

Very easy huh? Only the elementary arithmetic. The only difficulty is to handle the precedence, we can rewrite it to this:

exp ::= exp + term
      | exp - term
      | term

term ::= term * factor
       | term / factor
       | factor

This new BNF is clearer than the old one, and it promises the correct precedence.

LALR parser generator:

Guile has intergrated LALR(1) parser generator in the core, but no lexer generator, so users have to write lexer manually. It's fine to me, since writing lexer is interesting. Then the key point to multi-lang is to know what does this lalr module wants to eat.

Anyway there's an alternative external lexer generator: silex.

The parser generated by the lalr-parser macro is a function that takes two parameters. The first parameter is a lexical analyzer while the second is an error handler. A token is either a symbol, or a record created by the function make-lexical-token:

(make-lexical-token category source value)

A lexical token record has three fields: category, which must be a symbol, a source location object source, and a semantic value associated with the token. For example, a string token would have a category set to 'STRING while its semantic value is set to the string value "hello". The field accessors are:

lexical-token-category
lexical-token-source
lexical-token-value

Once the end of file encountered, the lexical analyzer must always return the symbol '*eoi*. In spite of this, your lexer must return the token struct to the parser, then let the parser do its work.

Before we start our 50-lines tour, we need some preparation, the code below is to load LALR module, and define two helper functions.

;; Be sure you imported LALR module:
(use-modules (system base lalr))

;; Two helper macros to create the token struct for returning
(define-syntax-rule (port-source-location port)
  (make-source-location (port-filename port)
                        (port-line port)
                        (port-column port)
                        (false-if-exception (ftell port))
                        #f))
(define-syntax-rule (return port category value)
  (make-lexical-token category (port-source-location port) value))

The lexer:

The lexer is a tool of lexical analysis, which aims to produce tokens.

Let's see some code:

These functions are useful to predicate different token for you.

(define (is-whitespace? c) (char-set-contains? char-set:whitespace c))
(define (is-number? c) (char-set-contains? char-set:hex-digit c))
;; operators, in this simple case, we just have four operators
(define (is-op? c) (string-contains "+-*/" (string c)))
(define (is-delimiter? c) 
  (or (eof-object? c) (string-contains " +-*/;\n" (string c))))

And these two functions are used to get the basic two types of token in our Simple language: numbers, and the operators.

(define (get-number port)
 (let lp((c (peek-char port)) (ret '()))
   (cond
    ((is-delimiter? c) ; encounter delimiter, finish to read a number
     ;; convert to a number representation
     (string->number (list->string (reverse ret))))
    (else
     (read-char port) ; next char
     (lp (peek-char port) (cons c ret))))))
(define (get-op port) (string->symbol (string (read-char port))))

The key function is next-token which is used to check then get then return the proper token to the parser.

(define (next-token port)
  (let ((c (peek-char port)))
    (cond
     ((or (eof-object? c) (char=? c #\nl)) ; end of line, or end src
      '*eoi*) ; return '*eoi* because LALR module need it
     ((is-whitespace? c)
      (read-char port)
      (next-token port)) ; skip white space
     ((is-number? c)
      (return port 'number (get-number port)))
     ((is-op? c)
      (return port (get-op port) #f))
     (else
      (read-char port)
      (next-token port)))))

This tokenizer is important because the parser need it be passed in.

(define (make-simple-tokenizer port) (lambda () (next-token port)))

The tokenizer must return a thunk (stands for a function without any args) Each time calling this thunk, it returns a token. The parser will call it several times automatically, depends on the length of your source code. That's what LALR parser need to be fed.

The parser

Now it's the parser part:

(define (make-parser)
  (lalr-parser
   ;; Since we handled precedence manually in BNF, so we
   ;; don't need to specify it in lalr-parser.
   ;; (number (left: + -) (left: * /))
   (number + - * /)
   (program (exp) : $1
            (*eoi*) : (call-with-input-string "" read)) ; *eof-object*
   (exp  (exp + term) : `(+ ,$1 ,$3)
         (exp - term) : `(- ,$1 ,$3)
         (term) : $1)
   (term (term * factor) : `(* ,$1 ,$3)
         (term / factor) : `(/ ,$1 ,$3)
         (factor) : $1)
   (factor (number) : `(number ,$1))))

You may found that we just converted the BNF to a parse tree, which is actually the list in Scheme, say:

'(* (number 1) (number 3))

This is one of the perfect features of Scheme, because you can use this fundamental data type to indicate a tree. You don't have to write any new/complex data structure for that. That's why I emphasized that it is very cool to implement new languages with Scheme language.

Then you may try this line in the REPL to test if the precedence is correct:

(call-with-input-string "1+1*2/3-4" 
  (lambda (port) 
    ((make-parser) (make-simple-tokenizer port) error)))
It should return:
(- (+ (number 1) 
      (/ (* (number 1) (number 2))
         (number 3)))
   (number 4))
Maybe you're not so comfortable with this form, it's actually the same with:
(1 + ((1 * 2) / 3)) - 4

It's the correct precedence according to our BNF.

Tree IL

These code defined a function named compile-tree-il which is used to transform our parse tree to sort of AST, say, Tree-IL [1]. You see, only few lines.

(define (compile-tree-il exp env opts)
  (values (parse-tree-il (comp exp '())) env env))
(define (comp src e)
  (match src
    (('number x) `(const ,x))
    ;; If you're using master branch of Guile, please use `call' to replace `apply'.
    ;; If you are using stable-2.0 or Guile-2.0.x, `apply' should be fine [1].
    ((op x y) `(apply (primitive ,op) ,(comp x e) ,(comp y e)))))

You may put all the code showed above into one file, say, simple.scm. And make sure it's in language/simple directory, please make sure for it. Because Guile will pick the language front-end from language directory when you ask for loading the language.

Final step, the spec!

Although we've done all the coding work for our Simple language, we have to write one more file to let Guile know it. This is the spec file. The syntax is very easy:

(define-module (language simple spec)
  #:use-module (system base language)
  #:use-module (language simple simple)
  #:export (simple))

;; The definition of Simple language
;; You don't have to understand it, just copy and modify it from other front-end
(define-language simple
  #:title       "simple"
  #:reader      (lambda (port env) 
                  ((make-parser) (make-simple-tokenizer port) error))
  #:compilers   `((tree-il . ,compile-tree-il))
  #:printer     write)

Run it!

You may type all the code yourself, it's better. But here is a git repo if you need any example:

git clone https://github.com/NalaGinrut/simple.git

Please type sudo make install, and run Guile REPL (just run 'guile' in your command line). And run ,L simple in your REPL, you should see:

GNU Guile 2.1.0.1771-48c2a
Copyright (C) 1995-2014 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> ,L simple
;;; note: auto-compilation is enabled, set GUILE_AUTO_COMPILE=0
;;;       or pass the --no-auto-compile argument to disable.
;;; compiling /usr/local/share/guile/2.2/language/simple/spec.scm
;;; compiled /home/nalaginrut/.cache/guile/ccache/2.2-LE-8-3.4/usr/local/share/guile/2.2/language/simple/spec.scm.go
Happy hacking with simple!  To switch back, type `,L scheme'.
simple@(guile-user)> 

OK, now you see the language was changed to 'simple', and just play maths in it:

1+1
=> 2
3*4
=> 12

Any further?

'simple' is just a simple language, maybe too simple for a serious compiler writer. Because even a front-end will take you a lot of time and hack power. No mention the backend. Fortunately, Guile provides a nice way to let language fans focus on the grammar rather than optimization. Nevertheless, all the languages front-end can call from each other, If you're interested in this feature, please read Wingo's post[2].

I wrote an IMP[3] language before. I expect to share more in the future.

----------

Refs

[1] Tree-IL in Guile: https://www.gnu.org/software/guile/manual/html_node/Tree_002dIL.html

[2] Andy Wingo introduced Ecmacript in Guile, and inter calling between Ecmascript and Scheme: http://wingolog.org/archives/2009/02/22/ecmascript-for-guile

[3] https://github.com/NalaGinrut/imp