Recursive Descent Parsing

Recursive descent parsers can be nicely expressed in an embedded domain-specific language, built from a few primitives and composed using the connectives from Functor, Applicative, and Monad; and there is one more relevant connective type class, Alternative, for failures and choices.

For minimal infrastructure, we will parse characters directly, but we are aware that going through tokenization first is more tidy, and we can still write our parsers in that spirit.

Parser representation

Each parser can be represented as a function taking an input string, consuming a prefix of it, and giving one of:

When success, it is important to give the unconsumed suffix rather than losing it. When liftA2 and >>= chain up two parsers, the second parser needs to see the leftover from the first parser! You can also think of a state variable for the current string to parse. Overall we are combining two effects, failure and state.

So we use such a function type to define our parser type:

data Parser a = MkParser (String -> Maybe (String, a))

unParser :: Parser a -> String -> Maybe (String, a)
unParser (MkParser sf1) = sf1
in ParserLib.hs

I use Maybe because I anticipate at most one valid answer, and for simplicity I don't include error information for failures.

It is also possible to use [] to anticipate ambiguous grammars and multiple valid answers, with the empty list for failure.

Here is the function for using a parser. You give it your input string, it gives you overall failure or final answer. (It discards the final leftover; usually you aren't interested. If you're interested, use unParser above.)

runParser :: Parser a -> String -> Maybe a
runParser (MkParser sf) inp = case sf inp of
                                Nothing -> Nothing
                                Just (_, a) -> Just a
                              -- OR: fmap (\(_,a) -> a) (sf inp)
in ParserLib.hs

Parsing primitives (character level)

A good way to deepen understanding of the parser type is to implement a few primitives. They will also be actually used.

A most basic primitive reads a character and gives it to you. So when does it fail? When there is no character to read!

anyChar :: Parser Char
anyChar = MkParser sf
  where
    sf "" = Nothing
    sf (c:cs) = Just (cs, c)
in ParserLib.hs

Next, perhaps you have an expected character in mind and want to read and check. Failure if mismatch or no character to read.

char :: Char -> Parser Char
char wanted = MkParser sf
  where
    sf (c:cs) | c == wanted = Just (cs, c)
    sf _ = Nothing
in ParserLib.hs

E.g., if you expect ‘A’ (and reject everything else), you say “char 'A'”.

More generally, perhaps you expect a character that satisfies a predicate of yours.

satisfy :: (Char -> Bool) -> Parser Char
satisfy pred = MkParser sf
  where
    sf (c:cs) | pred c = Just (cs, c)
    sf _ = Nothing
in ParserLib.hs

E.g., if you expect a letter, you say “satisfy isAlpha”. (isAlpha is from Data.Char.)

One last character-level primitive, the other most basic one: Checking that the input string is empty. So its failure/success criterion is the opposite of char's.

eof :: Parser ()
eof = MkParser sf
  where
    sf "" = Just ("", ())
    sf _ = Nothing
in ParserLib.hs

You should test them using runParser or even unParser to get a feeling of them.

Functor, Applicative, Monad, Alternative connectives

The effects of the Parser type are a combination of failure and state. Accordingly, the implementation of the Functor, Applicative, and Monad methods also combine those of Maybe and State, i.e., checking for Nothing vs Just, and plumbing for state values (input strings and leftovers).

instance Functor Parser where
    -- fmap :: (a -> b) -> Parser a -> Parser b
    fmap f (MkParser sf) = MkParser sfb
      where
        sfb inp = case sf inp of
                    Nothing -> Nothing
                    Just (rest, a) -> Just (rest, f a)
                  -- OR: fmap (\(rest, a) -> (rest, f a)) (sf inp)

instance Applicative Parser where
    -- pure :: a -> Parser a
    pure a = MkParser (\inp -> Just (inp, a))

    -- liftA2 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c
    -- Consider the 1st parser to be stage 1, 2nd parser stage 2.
    liftA2 op (MkParser sf1) p2 = MkParser g
      where
        g inp = case sf1 inp of
                  Nothing -> Nothing
                  Just (middle, a) ->
                      case unParser p2 middle of
                        Nothing -> Nothing
                        Just (rest, b) -> Just (rest, op a b)

instance Monad Parser where
    -- return :: a -> Parser a
    return = pure

    -- (>>=) :: Parser a -> (a -> Parser b) -> Parser b
    MkParser sf1 >>= k = MkParser g
      where
        g inp = case sf1 inp of
                  Nothing -> Nothing
                  Just (rest, a) -> unParser (k a) rest
in ParserLib.hs

In Control.Applicative there are more utiltiy connectives, two of which I am fond of using:

(*>) :: Applicative f => f a -> f b -> f b
p *> q = liftA2 (\a b -> b) p q
-- Drop p's answer, give only q's answer.  Like (>>) but Applicative.

(<*) :: Applicative f => f a -> f b -> f a
p <* q = liftA2 (\a b -> a) p q
-- Drop q's answer, give only p's answer.

Example of chaining up several primitive parsers sequentially: I want a letter, then a digit, then ‘!’; the answer is the letter and the digit in a string, and drop the ‘!’:

lde :: Parser String
lde = liftA2 (\x y -> [x,y]) (satisfy isAlpha) (satisfy isDigit)
      <* (char '!')
-- Try: runParser lde "B6!", runParser lde "B6a"

There is one more type class “Alternative” (in the standard library; import it from Control.Applicative) containing methods for failure and choice. Why they are grouped together: choice is an associative binary operator, and failure is the identity element.

class Applicative f => Alternative f where
    empty :: f a
    (<|>) :: f a -> f a -> f a
    many :: f a -> f [a]  -- has default implementation
    some :: f a -> f [a]  -- has default implementation

This type class was actually inspired by parsing! The <|> operator came from the “|” in BNF, and many and some came from “0 or more times” and “1 or more times”.

And our implementation for our Parser:

instance Alternative Parser where
    -- empty :: Parser a
    -- Always fail.
    empty = MkParser (\_ -> Nothing)

    -- (<|>) :: Parser a -> Parser a -> Parser a
    -- Try the 1st one. If success, done; if failure, do the 2nd one
    MkParser sf1 <|> p2 = MkParser g
      where
        g inp = case sf1 inp of
                  Nothing -> unParser p2 inp
                  j -> j        -- the Just case

    -- many :: Parser a -> Parser [a]
    -- 0 or more times, maximum munch, collect the answers into a list.
    -- Can use default implementation. And it goes as:
    many p = some p <|> pure []
    -- How to make sense of it: To repeat 0 or more times, first try 1 or more
    -- times!  If that fails, then we know it's 0 times, and the answer is the
    -- empty list.

    -- some :: Parser a -> Parser [a]
    -- 1 or more times, maximum munch, collect the answers into a list.
    -- Can use default implementation. And it goes as:
    some p = liftA2 (:) p (many p)
    -- How to make sense of it: To repeat 1 or more times, do 1 time, then 0 or
    -- more times!  Use liftA2 to chain up and collect answers.
in ParserLib.hs

Example use of empty: satisfy can be expressed as:

satisfyV2 pred = anyChar >>= \c -> if pred c then return c else empty

Example use of <|>: I want ‘A’ or ‘B’, followed by ‘0’ or ‘1’:

ab01 :: Parser String
ab01 = liftA2 (\x y -> [x,y]) (char 'A' <|> char 'B') (char '0' <|> char '1')

In Control.Applicative there are also a utility connective based on Alternative. It's very handy when an ENBF rules says “0 or 1 time”:

optional :: Alternative f => f a -> f (Maybe a)
optional p = fmap Just p <|> pure Nothing

Parsing Primitives (lexeme/token level)

We won't actually use the character-level primitives directly. A reason is that spaces will get into the way. Another is that we think at the token level actually, even intuitively before you learned grammars and parsing! We only use character-level primitives to implement token-level primitives such as the ones below. Then we use connectives and token-level primitives for the grammar.

Whitespace handling convention: Token-level primitives assume there is no leading spaces, and skip trailing spaces (so the next primitive may assume no leading spaces). Something else at the outermost level will have to skip initial leading spaces; we'll get there later.

-- | Space or tab or newline (unix and windows).
whitespace :: Parser Char
whitespace = satisfy (\c -> c `elem` ['\t', '\n', '\r', ' '])

-- | Consume zero or more whitespaces, maximum munch.
whitespaces :: Parser String
whitespaces = many whitespace

-- | Read a natural number (non-negative integer), then skip trailing spaces.
natural :: Parser Integer
natural = fmap read (some (satisfy isDigit)) <* whitespaces
-- read :: Read a => String -> a
-- For converting string to your data type, assuming valid string.  Integer
-- is an instance of Read, and our string is valid, so we can use read.

-- | Read an identifier, then skip trailing spaces.  Disallow the listed keywords.
identifier :: [String] -> Parser String
identifier keywords =
    satisfy isAlpha
    >>= \c -> many (satisfy isAlphaNum)
    >>= \cs -> whitespaces
    >> let str = c:cs
    in if str `elem` keywords then empty else return str

-- | Read the wanted keyword, then skip trailing spaces.
keyword :: String -> Parser String
keyword wanted =
    satisfy isAlpha
    >>= \c -> many (satisfy isAlphaNum)
    >>= \cs -> whitespaces
    *> if c:cs == wanted then return wanted else empty
in ParserLib.hs

There are a few more in ParserLib.hs, but you get the point.

CFG Parsing

A parser for a context-free grammar can mostly look like the grammar rules. There are however a few gotchas to watch out for, some tricks, and that lingering issue of initial leading spaces.

The parsers here will produce abstract syntax trees of this type:

data Expr
    = Num Integer
    | Var String
    | Prim2 Op2 Expr Expr       -- Prim2 op operand operand
    | Let [(String, Expr)] Expr -- Let [(name, rhs), ...] body

data Op2 = Add | Mul
in Parsing.hs

Right-associating operator

Take this simple rule, and suppose we intend the operator to associate to the right:

muls ::= natural { "*" natural }
OR
muls ::= natural [ "*" muls ]

The second form uses right recursion to convey right association. Perfect for recursive descent parsing!

mulsRv1 :: Parser Expr
mulsRv1 = liftA2 link
          (fmap Num natural)
          (optional (liftA2 (,)
                      (operator "*" *> pure (Prim2 Mul))
                      mulsRv1))
  where
    link x Nothing = x
    link x (Just (op,y)) = op x y
in Parsing.hs

After you understand this, you would rather not write this recursion by hand again for every right-associating operator, instead call a re-factored function and specify just your operand parser and operator parser.

Here is the re-factored general function for right-associating operators.

chainr1 :: Parser a               -- ^ operand parser
        -> Parser (a -> a -> a)   -- ^ operator parser
        -> Parser a               -- ^ whole answer
chainr1 getArg getOp = liftA2 link
                       getArg
                       (optional
                         (liftA2 (,) getOp (chainr1 getArg getOp)))
  where
    link x Nothing = x
    link x (Just (op,y)) = op x y
in ParserLib.hs

So here is how we will implement the rule in practice:

mulsRv2 :: Parser Expr
mulsRv2 = chainr1 (fmap Num natural) (operator "*" *> pure (Prim2 Mul))
in Parsing.hs

Left-associating operator

Suppose we want the operator to associate to the left instead. We cannot code up left recursion directly. But here is the trick. Implement the other form of the rule:

muls ::= natural { "*" natural }

Use many for the “{ "*" natural }” part, get a list of tuples of (operator, number). For example if the input string is “2 * 5 * 3 * 7”, my plan is to:

  1. read “2” and get Num 2
  2. read “* 5 * 3 * 7” with the help of many and get [(Prim2 Mul, Num 5), (Prim2 Mul, Num 3), (Prim2 Mul, Num 7)]

Then a simple use of foldl on the list, starting with Num 2 as the initial accumulator, will build the left-leaning tree!

mulsLv1 :: Parser Expr
mulsLv1 = liftA2 link
          (fmap Num natural)
          (many (liftA2 (,)
                  (operator "*" *> pure (Prim2 Mul))
                  (fmap Num natural)))
  
  where
    link x opys = foldl (\accum (op,y) -> op accum y) x opys
in Parsing.hs

Again in practice we don't write this code again, we re-factor this into a general function:

chainl1 :: Parser a               -- ^ operand parser
        -> Parser (a -> a -> a)   -- ^ operator parser
        -> Parser a               -- ^ whole answer
chainl1 getArg getOp = liftA2 link
                       getArg
                       (many (liftA2 (,) getOp getArg))
  where
    link x opys = foldl (\accum (op,y) -> op accum y) x opys
in ParserLib.hs

Then we use it like:

mulsLv2 :: Parser Expr
mulsLv2 = chainl1 (fmap Num natural) (operator "*" *> pure (Prim2 Mul))
in Parsing.hs

Initial space, final junk

Token-level parsers assume no leading spaces; someone has to skip initial leading spaces to kick-start this invariant.

As well, a small parser for a part of the grammar can leave non-space stuff unconsumed, since we anticipate that later a small parser for another part may need it. But the overall combined parser for the whole grammar cannot: by the time you're done with the whole grammar, any non-space leftover means the original input string is actually erroneous, e.g., we don't consider “2*3*” to be a legal arithmetic expression (our muls parsers can make sense of the prefix “2*3” but leaves the last “*” unconsumed).

Here is the trick for solving both. Have a “main” parser, and its job is simply to clear initial leading spaces, call the parser for the start symbol, then use eof to check that there is nothing left.

Example: If the muls rule is already my whole grammar, and so muls is my start symbol, I write:

lesson2 :: Parser Expr
lesson2 = whitespaces *> muls <* eof
  where
    muls = chainl1 (fmap Num natural) (operator "*" *> pure (Prim2 Mul))
in Parsing.hs

Why not do these inside the parser for the start symbol: There may be recursive calls to it in the middle of your grammar! Wrong to do the eof check there/then.

Operator precedence and parentheses

Suppose I have two operators “*” and “+”, with “+” having lower precedence, and I also support parentheses for overriding precedence.

In other words, from lowest precedence (binding most loosely) to highest (binding most tightly): “+”, then “*”, then individual numbers and parentheses (same level without ambiguity).

The trick is to have lower (looser) rules call higher (tighter) rules, and have the parentheses rule call the lowest rule for recursion. The start symbol is from the lowest rule. This is also how you can write your grammar to convey precedence.

So my grammar goes like (start symbol is adds):

adds ::= muls { "+" muls }
muls ::= atom { "*" atom }
atom ::= natural | "(" adds ")"

And my parser goes like (let's say left-associating operators):

lesson3 :: Parser Expr
lesson3 = whitespaces *> adds <* eof
  where
    adds = chainl1 muls (operator "+" *> pure (Prim2 Add))
    muls = chainl1 atom (operator "*" *> pure (Prim2 Mul))
    atom = fmap Num natural <|> (openParen *> adds <* closeParen)
in Parsing.hs

Keywords and variables

And now my really whole grammar (start symbol is expr):

expr ::= local | adds
local ::= "let" { var "=" expr ";" } "in" expr
adds ::= muls { "+" muls }
muls ::= atom { "*" atom }
atom ::= natural | var | "(" expr ")"

Problem: “let inn+4” should be a syntax error, but a naïve parser implementation sees “let”, “in”, “n”, “+”, “4”.

Solution: A parser for a reserved word should first read as many alphanums as possible (not just the expected letters), and then check that the whole string equals the keyword. This is what keyword does in an earlier section.

Conversely, the parser for identifiers should read likewise, but then check that the string doesn't clash with reserved words. This is why identifier from that earlier section takes a parameter for reserved words to avoid.

With that, here is the whole parser:

lesson4 :: Parser Expr
lesson4 = whitespaces *> expr <* eof
  where
    expr = local <|> adds

    local = pure (\_ eqns _ e -> Let eqns e)
            <*> keyword "let"
            <*> many equation
            <*> keyword "in"
            <*> expr
    -- Basically a liftA4.
    -- Could also be implemented in monadic style, like equation below.

    equation = var
               >>= \v -> operator "="
               >> expr
               >>= \e -> semicolon
               >> return (v, e)
    -- Basically a liftA4.
    -- Could also be implemented in applicative style, like local above.

    semicolon = char ';' *> whitespaces
    adds = chainl1 muls (operator "+" *> pure (Prim2 Add))
    muls = chainl1 atom (operator "*" *> pure (Prim2 Mul))
    atom = fmap Num natural
           <|> fmap Var var
           <|> (openParen *> expr <* closeParen)
    var = identifier ["let", "in"]
in Parsing.hs