Semantics I: Expressions, Bindings, Functions

Here is a toy language featuring expressions, variable bindings, and functions. These are basic features present in almost all languages. We will also run into a few topics and discussions.

For least setup, I also start with dynamic checking of types (during evaluation, not upfront) and even dynamic checking that variables exist at use sites. Static checking of those two are two big lessons in their own right.

Vocabulary: Generally, “static X” means doing X by analysing the code, without needing to run it and wait; “dynamic X” and recently “just in time X” means doing X while running the code, and therefore only on code paths actually taken in a run.

Most of my interpreter is meant to be more mathematical than efficient. I am secretly writing mathematical formulas to define the meanings of the constructs.

“A formula is worth a thousand pictures.” – Dijkstra

Setup

I will have several kinds of run-time errors: Type errors, variable not found. If you support division, you will have have divide by zero.

I use the Either monad to represent this possibility. For simplicity, I use strings for error messages; you should use an algebraic data type instead, especially if later you support catching and handling errors (you don't want to examine an arbitrary string in your handler).

So my interpreter type goes as:

mainInterp :: Expr -> Either String Value

in SemanticsFunctions.hs

Immediately you can already think that I use a monad to model the fact that my language has the effect of errors/exceptions.

I have a function “raise” for raising errors. It is defined to be simply Left in this lecture; but it won't be that easy in a more featureful interpreter for a more complex language (e.g., stateful languages). I want to get into the habit of using raise, and it can always be re-defined to fit the interpreter.

raise :: String -> Either String a
raise = Left

in SemanticsFunctions.hs

For reasons explained in a later section (about variables and environments), I will be passing around a dictionary that maps variables to values. So I need this:

mainInterp expr = interp expr Map.empty

interp :: Expr -> Map String Value -> Either String Value

in SemanticsFunctions.hs

And now I implement interp for each construct.

Basic constructs

My language has number literals, boolean literals, and binary operators:

data Expr = Num Integer
          | Bln Bool
          | Prim2 Op2 Expr Expr         -- Prim2 op operand operand
          | ...

data Op2 = Eq | Plus | Mul

in SemanticsFunctions.hs

I will be evaluating them (and more) to number values, boolean values, and later another kind of values. A clean habit is not to re-use the abstract syntax tree type, but to define a separate type, since values have much fewer possibilities, and some possibilities will not correspond well to any abstract syntax tree.

data Value = VN Integer
           | VB Bool
           | ...

in SemanticsFunctions.hs

Here is how I evaluate a number literal. (Boolean literal is similar.)

interp (Num i) _ = pure (VN i)

in SemanticsFunctions.hs

Arithmetic (all operands evaluated)

Here is how I evaluate addition; most other arithmetic operators are similar. The insight is to use structural recursion to evaluate the operands, then you will have number values to add. The annoying part is to check that the values are actually numbers—so I re-factor out the checking to a helper function (learned it the hard way).

interp (Prim2 Plus e1 e2) env =
    interp e1 env
    >>= \a -> intOrDie a
    >>= \i -> interp e2 env
    >>= \b -> intOrDie b
    >>= \j -> return (VN (i+j))

intOrDie :: Value -> Either String Integer
intOrDie (VN i) = pure i
intOrDie _ = raise "type error"

in SemanticsFunctions.hs

I don't support division. If you do, you have one more annoying task to check for zero in the denominator!

Short-circuiting, conditionals (operands selectively evaluated)

I have an if-then-else:

data Expr = ...
          | Cond Expr Expr Expr         -- Cond test then-branch else-branch

in SemanticsFunctions.hs

This is a short-circuiting operator: some operands are selectively evaluated, others skipped. Here it goes:

interp (Cond test eThen eElse) env =
    interp test env
    >>= \a -> case a of
      VB True -> interp eThen env
      VB False -> interp eElse env
      _ -> raise "type error"

in SemanticsFunctions.hs

You can add short-circuting logical operators (and, or), and their semantics will be similar. Also, we could/should re-factor checking that we actually get a boolean value! (Exercise: implement and use a boolOrDie.)

Variables, local bindings, environments (scopes)

data Expr = ...
          | Var String

in SemanticsFunctions.hs

Where do we get the contents of variables from? A nice solution is to maintain a dictionary that maps variables to contents, so we can just look up. This dictionary is called “environment”; and mapping a variable to its content is called “binding”.

Also, the nature of the contents depend on the evaluation strategy of the language, e.g., call by value just needs values (what we do in this lecture), lazy evaluation needs something more complex to cater for partly evaluated, partly unevaluated expressions.

I use Data.Map (internally binary search trees) for dictionaries. Practical interpreters use hash tables; compilers use an array because they compile variable names to addresses.

Two constructs in this lecture are responsible for setting up environments: local bindings in this section later, and function application in a later section. For now we first read from the environment.

So to evaluate a variable, just look up, and give up if not found.

interp (Var v) env = case Map.lookup v env of
  Just a -> pure a
  Nothing -> raise "variable not found"

in SemanticsFunctions.hs

Roast: In some languages, when a variable is not found, it is created on the spot and bound to a “default” value. What's wrong is that rationally there is no such thing. What should be the default boolean value? (Note that either choice is wrong half of the time.) What should be the default number value? (Note that 0 is wrong for many use cases, e.g., when you plan to multiply numbers.) What should be the default value for time? For temperature?

My local binding construct wraps an expression inside a new local context of 0 or more “name = expr” bindings:

data Expr = ...
          | Let [(String, Expr)] Expr   -- Let [(name, rhs), ...] eval-me

in SemanticsFunctions.hs

E.g.,

let { x=1; y=0 }
 in x+y

is represented as

Let [("x", Num 1), ("y", Num 0)] (Prim2 Plus (Var "x") (Var "y"))

There are a number of decisions to make about the semantics of the local binding construct. (It is also possible to offer many different local binding constructs, one for each way of making these decisions. Scheme is a good example.)

Scoping and recursion: Suppose you have
```
let { x=2+3; y=x+4; } in ... 
```
Is it equivalent to
```
let { x=2+3 } in let { y=x+4 } in ...
```
so y=x+4 use the x in x=2+3?
I do this in this lecture. This is also what “let*” in Scheme does. Perhaps we can call it “sequential binding”, as opposed to…
An alternative is that y=x+4 uses an outer x. Scheme has “let” for this. This is called “parallel binding”. Note: This does not always imply parallel computing, it only implies semantic independence. Perhaps we call it “independent binding”.
Those two choices don't support [mutual] recursion. The third alternative supports mutual recursion, so every equation may use every variable defined in the same group. If you do call by value, then you also need to place restrictions on the RHSes. (Exercise: Why? Give an example of what wouldn't make sense.) Scheme has “letrec” for this.
Evaluation strategy: The choices are call by value, lazy evaluation, and call by name (like macro expansion but no name clash).
I use call by value in this lecture. So first go through the equations in the given order, evaluate the RHS of each one right away, and lastly evaluate the wrapped expression.

Because I chose sequential binding, right after I process one equation, I have to extend the environment to include its new binding, under which I process the remaining equations and eventually the wrapped expression.

interp (Let eqns evalMe) env =
    extend eqns env
    >>= \env' -> interp evalMe env'
    -- Example:
    --    let x=2+3; y=x+4 in x+y
    -- -> x+y   (with x=5, y=9 in the larger environment env')
    -- "extend env eqns" builds env'
  where
    extend [] env = return env
    extend ((v,rhs) : eqns) env =
        interp rhs env
        >>= \a -> let env' = Map.insert v a env
                  in extend eqns env'

in SemanticsFunctions.hs

E.g., working on let { x=2+4; y=x+4; } in x+y:

extend's nth iteration	v	rhs	env	env'
1	x	2+4	{}	{ x = 5 }
2	y	x+4	{ x = 5 }	{ x = 5, y = 9 }

Now we're ready to evaluate x+y under { x = 5, y = 9 }.

How these are local variables unknown to the outside: Suppose I have

(let {x=2+3; y=x+4;} in x+y) * (1+1)

I make a recursive call to handle the “let-in”. Inside that recursive call the new environment { x = 5, y = 9 } is built for internal use, but not returned or passed back to the outside. The outside still uses the outside environment for “1+1” etc.

Function construction (lambda), closures

For simplicity, I just have a lambda construct for anonymous functions; if you want to define a function with a name, use lambda together with “let”. I don't support recursive functions for now.

data Expr = ...
          | Lambda String Expr          -- Lambda var body

in SemanticsFunctions.hs

E.g.,

\x -> ...

is represented as

Lambda "x" (...)

To evaluate a lambda, the first instinct is that there is almost nothing you can do now, the value is like the lambda itself, simply a tuple/record storing the parameter name and the function body. You're almost there! This is just missing some environment because…

Suppose your lambda is “\y->x+y”. What do you use for the x there?

Terminology: In that expression, y is a “bound variable”, x is a “free variable”.

This also carries to “let”: In “let y=x+1 in x+1” y is a bound variable, x is a free variable.

Bound variable (of/in an expression): You can see where the variable is bound (introduced or declared or defined).

Free variable (of/in an expression): You can't, it has to come from the outside.

A free variable like x is probably bound in an outer context, e.g.,

let x=10 in ..... (\y->x+y) ....

When the interpreter runs into the lambda, and if it already knows x=10 from the outer context, it needs to attach “x=10” to the lambda so it is not forgotten!

So the value after evaluating a lambda needs to remember 3 things: parameter name, function body, the environment in scope for this lambda. (In practice people trim the environment down to just what the function body needs. For simplicity I don't.)

Terminology: The combination of “\y->x+y” plus “oh BTW x=10” is called a “closure”.

A closure is a record or data structure that stores an expression (usually a lambda, but generally can be any expression) together with the environment for all of its free variables.

In this lecture, my only use of closures is for lambdas, so my closure representation is specialized for that purpose only.

data Value = ...
           | VClosure (Map String Value) String Expr

interp (Lambda v body) env = pure (VClosure env v body)

in SemanticsFunctions.hs

Fun fact: When Javascript, Java, and C++ finally added lambdas, they finally realized how difficult this business of free variables and closures is! They also have an extra issue:

Since x is a mutable variable in their languages, you get to ask: Does it mean the value at the time of creating the lambda, or does it mean a reference?

Java: value at the time of lambda creation (“final variable”)
Javascript: reference
C++: extra syntax for you to choose

Function application

data Expr = ...
          | App Expr Expr               -- App func param

in SemanticsFunctions.hs

Exercise: Give some scenerios supporting why the function can be an arbitrary expression, not always a lambda or a name. Most 1970s languages allowed a name only! How dare…

There is a decision to make about the semantics of function application, namely evaluation strategy: call by value, lazy evaluation, call by name. I do call by value.

However, all of them require you to evaluate the function until you get a function closure, sooner or later. (Even if it's a name, you still have to look it up.) Usually people do it first, so do I.

Afterwards, since I do call by value, I evaluate the parameter until I get a value. Now I'm ready to plug and chug.

How do I plug and chug? Just like evaluating let, I can first extend the environment to bind the parameter name to the parameter value, then it makes sense to evaluate the function body under that environment.

Exercise: But extend which environment? Recall I have two now: The one in scope, and the one stored in the closure. Which one?

interp (App f e) env =
    interp f env
    >>= \c -> case c of
      VClosure fEnv v body ->
          interp e env
          >>= \eVal -> let bEnv = Map.insert v eVal fEnv  -- fEnv, not env
                       in interp body bEnv
          -- E.g.,
          --    (\y -> 10+y) 17
          -- -> 10 + y      (but with y=17 in environment)
          --
      _ -> raise "type error"

in SemanticsFunctions.hs

E.g., for (let x=7 in \y -> x+y) 10:

interp on the function gives:
```
VClosure {x = VN 7} "y" («x+y»)
```
interp on the “10” gives VN 10
Now do
```
interp («x+y») {x = VN 7, y = VN 10}
```

What if we extend the environment in scope, i.e., bEnv extends env rather than fEnv? This was basically the bug made in early implementations of Lisp. Some people like this bug and regard it as a feature: “dynamic scoping”. Roast: It's a bug, not a feature.

Dynamic scoping: a variable name refers to whoever has that name at the time of evaluation. Don't do this. Example:

  let { x=10; f = \y->x+y ; } in
  let { x=5; } in
  f 0
→ 5+0

Lexical/Static scoping: a variable name refers to whoever has that name in the code location. This is the sane way we all do today. Do this

Why environment, not substitution

It is possible to not use environments. In let and function application, we could perform substitutions instead. When done correctly, this would give the same result.

But subtitution suffers from a few problems, even academically:

It is tricky to do correctly. Look up “variable capture” for what can go wrong in a naïve implementation. The environment way is easy to get right.
It takes exponential time in code size. (Exercise: Come up with an example.)

Recursion

I don't have a recursion or iteration construct yet. Do you believe that I can do factorial? Do you even believe that I can do non-termination?

Answer: Try “(\x -> x x) (\x -> x x)”. TEE HEE HEE! “(\x -> x x)” is called “diagonal” and “Delta”. Recall the diagonalization proofs in computability, which talk about running a program with itself as input.

And here is how to do factorial:

let mkFac = \f -> \n -> if n=0 then 1 else n * (f f) (n-1)
in mkFac mkFac 2

The trick is to take an extra parameter for a function to be called. Then, like in diagonal, call a function with itself as a parameter.

Here is a trace:

   mkFac mkFac 2
→
   (\f -> \n -> if n=0 then 1 else n * (f f) (n-1)) mkFac 2
→
   (\n -> if n=0 then 1 else n * (mkFac mkFac) (n-1)) 2
→
   if 2=0 then 1 else 2 * (mkFac mkFac) (2-1)
→
   2 * (mkFac mkFac) (2-1)
→
   2 * (\f -> \n -> ...) mkFac (2-1)
→
   2 * (\n -> if n=0 then 1 else n * (mkFac mkFac) (n-1)) (2-1)
→
   2 * (\n -> if n=0 then 1 else n * (mkFac mkFac) (n-1)) 1
→
   2 * if 1=0 then 1 else 1 * (mkFac mkFac) (1-1)
→
   2 * 1 * mkFac mkFac (1-1)
→
   2 * 1 * (\n -> if n=0 then 1 else n * mkFac mkFac (n-1)) (1-1)
→
   2 * 1 * (\n -> if n=0 then 1 else n * mkFac mkFac (n-1)) 0
→
   2 * 1 * if 0=0 then 1 else 0 * ...
→
   2 * 1 * 1

Exercise: Extend this technique to mutual recursion among multiple functions.

Question for advanced theory or philosophy: Where does the recursion come from?!

So why do languages support recursion directly? Why not just make you write code like this? Answer: Language design tries to let you express your intention directly rather than encoding/simulating your intention as something else. Your code is then more comprehensible.

Supporting recursive functions

I support recursive functions by this local binding construct storing the function name, the parameter name, the function body, and the expression that uses this function:

data Expr = ...
          | LetRecFun String String Expr Expr
            -- LetRecFun funcname paramname funcbody eval-me

in SemanticsFunctions.hs

E.g.,

letrec f n = if n==0 then 0 else f (n-1)
in f 10

is represented as

LetRecFun "f" "n" («if n==0 then 0 else f (n-1)») («f 10»)

How to make the closure for the recursive function? The hard part is the environment it stores.

closure for f = VClosure fEnv "n" («if n==0 then 0 else f (n-1)»)

fEnv = { f = ? }  -- necessary because function body uses f

That environment must bind the function name to some closure because the function body will call it (recursive call!). That closure must store an environment that must bind the function name again likewise, ad infinitum…

Or isn't it just a cyclic data structure? The closure stores an environment. That environment binds the function name to said closure. Then recursive calls at any depth will resolve to the same closure again and again.

closure for f = VClosure fEnv "n" («if n==0 then 0 else f (n-1)»)

fEnv = { f = the above }

Example trace: for letrec f n = if n==0 then 0 else f(n-1) in f 10:

This closure and this environment ar built, I call them fc and fEnv:

fc = VClosure fEnv "n" («if n==0 then 0 else f(n-1)»)
fEnv = { f = fc }

interp («f 10») { f = fc }, this is an App case. After a few steps…
interp («if n==0 then 0 else f(n-1)») { f = fc, n = VN 10 }
This environment comes from taking fEnv and adding n = VN 10 for the parameter.
This contains f = fc so when later we hit f(n-1) we can continue.

Another perspective: When calling a recursive function, my interpreter traverses that cyclic data structure (until base case is hit and no more recursive calls).

There are two ways to build this cyclic data structure (or any cyclic data structure). In Haskell, recursive data are directly allowed:

interp (LetRecFun f v fbody evalMe) env =
    let closure = VClosure env' v fbody
        env' = Map.insert f closure env
    in interp evalMe env'

in SemanticsFunctions.hs

In other languages, you have mutable fields, so you stage it in two stages:

closure := VClosure garbage v fbody
env' := Map.insert f closure env
closure.1stField := env'

Exercise: Extend this technique (Haskell version or mutable field version) to mutual recursion among multiple functions.