Sunday, November 20, 2011

Kaleidoscope

I'm proud to release my own programming language Kaleidoscope! With this powerful language, you can define functions, which can call another functions, add or multiply values, and... well, that's it. But look at this fabulous example:
# imports
extern input()
extern print(v)

# magic function
def transform(a, b)
    a * a + 2 * a * b + b * b

main
    print(transform(input(), input()))
Just kidding. This project is only a tutorial, and aims to use libfirm, "a library that provides an intermediate representation and optimisations for compilers. Programs are represented in a graph based SSA form." Since I'm writing my master thesis on this theme, I've wanted to dig into a tutorial to understand the core concepts.

Moreover, I used Bison / Flex for the first time, and this is pretty fun to program!

Note: Kaleidoscope comes from a tutorial written by the authors of libfirm, which is sadly not up-to-date with the latest versions of the library.

Parsing the code

The code needs to be cut into semantic tokens: it's the role of Flex. It is a lexical analyzer generator, that is, you define token rules, and it creates a C function which can match these rules.

For instance, given the rules
"+"
    { return '+'; }

-?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
    { return DOUBLE_LITERAL; }

[A-Za-z_][0-9A-Za-z_]*
    { return IDENTIFIER; }
Flex will produce a function which transform the following code
foobar + 42
into this series of token
IDENTIFIER, "+", DOUBLE_LITERAL
Quite easy. Then comes Bison, a parser generator. Once again, you define rules to translate tokens into C code, and the generated parser will do the job. Here is an extract of the rules I use:
expr : DOUBLE_LITERAL
        {
            // constant value
            num_expr_t *expr = (num_expr_t *)malloc(sizeof(num_expr_t));
            expr->next = NULL;
            expr->which = EXPR_NUM;
            expr->val = $1;
            $$ = (expr_t *)expr;
        }
    | IDENTIFIER
        {
            // variable
            id_expr_t *expr = (id_expr_t *)malloc(sizeof(id_expr_t));
            expr->next = NULL;
            expr->which = EXPR_ID;
            expr->name = $1;
            $$ = (expr_t *)expr;
        }
    | expr '+' expr
        {
            // addition
            bin_expr_t *expr = (bin_expr_t *)malloc(sizeof(bin_expr_t));
            expr->next = NULL;
            expr->which = EXPR_BIN;
            expr->op = '+';
            expr->lhs = $1;
            expr->rhs = $3;
            $$ = (expr_t *)expr;
        }
    ;
which parses an expression.

As you see, I do not generate the final graph here, instead I construct an abstract syntax tree (AST). I used this solution because when parsing an expression, the context is still unknown, especially the function the expression belongs to. The AST would be something like:
{
    "which" : EXPR_BIN,
    "op" : "+",
    "lhs" : {
        "which" : EXPR_ID,
        "name" : "foobar"
    },
    "rhs" : {
        "which" : EXPR_NUM,
        "val" : 42
    }
}
when represented in JSON :)

Constructing the graph

Before parsing the code to compile, I initialize the libfirm library, and create a program representation inside the lib. Parsing shall complete this program with nodes, which describes the program in a logical way.

When the tokens EXTERN, DEF or MAIN are parsed, I create a new function prototype. MAIN has its own prototype (no parameters, returns an int), whereas EXTERN and DEF have custom prototypes depending on the function they declare (parameters and return value are double).

Moreover, DEF and MAIN define a function body, which is an expression. Actually, it's the previously constructed AST. So the tree is traversed to construct nodes. Here is the representation of the function graph for the first example's "transform" function:

transform: (a, b) -> a*a + 2*a*b + b*b

A function has basically three blocks: a start block (prototype, etc.), a body block, and an end block. In the start block we see that the function takes two parameters (nodes 75 and 76) and returns only one (node 72). Green nodes are just links between blocks, unfortunately you'll have to guess these links.

Note that the graph has been automatically optimized: the "2 * a" has been translated into "a + a" (node 84), which may be faster.

Generating the assembly file

Once the parsing is done, the program is transformed into another language (libfirm cannot directly generate executable file, it relies on assemblers to do this complex task). Here is the x86 assembly for "transform":
# -- Begin  transform
    .p2align 4,,15
.globl transform
    .type transform, @function
transform:
    .p2align 4,,7
    /* .L66: preds: none, freq: 1.000000 */
    pushl %ebp                       /* ia32_Push T[217:33]  */
    movl %esp, %ebp                  /* be_Copy Iu[220:36]  */
    fldl 8(%ebp)                     /* ia32_fld T[189:11]  */
    fld %st                          /* ia32_fpush ANY[229:45]  */
    fmul %st(1), %st                 /* ia32_fmul E[192:14]  */
    fxch %st(1)                      /* ia32_fxch ANY[230:46]  */
    fadd %st, %st                    /* ia32_fadd E[193:15]  */
    fldl 16(%ebp)                    /* ia32_fld T[194:16]  */
    fxch %st(1)                      /* ia32_fxch ANY[231:47]  */
    fmul %st(1), %st                 /* ia32_fmul E[196:18]  */
    faddp %st, %st(2)                /* ia32_faddp E[197:19]  */
    fmul %st, %st                    /* ia32_fmul E[198:20]  */
    faddp %st, %st(1)                /* ia32_faddp E[199:21]  */
    movl %ebp, %esp                  /* ia32_CopyEbpEsp Iu[224:40]  */
    popl %ebp                        /* ia32_PopEbp T[225:41]  */
    ret                              /* be_Return X[171:22]  */
    .size transform, .-transform
# -- End  transform
That's it. Have fun with Kaleidoscope!

No comments:

Post a Comment