Recursive Descent Parsing

Recursive descent parsers can be nicely built from a few primitives and combined by connectives for sequential chaining and backtracking choice. Some of them can be non-obvious until you realize you also need to convert received strings to data such as numbers and ASTs, or even to perform evaluation right away.

For minimal infrastructure, we will parse characters directly, but we are aware that going through tokenization first is more tidy, and we can still write our parsers in that spirit.

Parser representation

A parser can be represented as a function taking an input string, consuming a prefix of it, and resulting in one of:

data Parser a = MkParser (String -> Maybe (String, a))

unParser :: Parser a -> String -> Maybe (String, a)
unParser (MkParser sf1) = sf1

in ParserLib.hs

It would be possible to replace Maybe by [] to allow ambiguous grammars and multiple valid answers.

Here is the function for using a parser. You give it your input string, it gives you overall failure or final answer. (It discards the final leftover; usually you aren't interested. If you're interested, use unParser above.)

runParser :: Parser a -> String -> Maybe a
runParser (MkParser sf) inp = case sf inp of
                                Nothing -> Nothing
                                Just (_, a) -> Just a

in ParserLib.hs

Parsing primitives (character level)

A good way to deepen understanding of the parser type is to implement a few primitives. They will also be actually used.

A most basic primitive reads a character and gives it to you. So when does it fail? When there is no character to read!

anyChar :: Parser Char
anyChar = MkParser sf
  where
    sf "" = Nothing
    sf (c:cs) = Just (cs, c)

in ParserLib.hs

Next, perhaps you have an expected character in mind and want to read and check. Failure if mismatch or no character to read.

char :: Char -> Parser Char
char wanted = MkParser sf
  where
    sf (c:cs) | c == wanted = Just (cs, c)
    sf _ = Nothing

in ParserLib.hs

E.g., if you expect ‘A’ (and reject everything else), you say “

char
'A'

”.

Why return the character that the caller already specifies: We can have a grammar rule like “<B> ::= '0' | '1'”, the parser code shown later will be “char '0' <|> char '1'”, now you want to know which one succeeds!

More generally, perhaps you expect a character that satisfies a given predicate:

satisfy :: (Char -> Bool) -> Parser Char
satisfy pred = MkParser sf
  where
    sf (c:cs) | pred c = Just (cs, c)
    sf _ = Nothing

in ParserLib.hs

E.g., if you expect a letter, you say “satisfy isAlpha”. (isAlpha is from Data.Char.)

One last character-level primitive, the other most basic one: Checking that the input string is empty. In a sense an opposite of anyChar.

eof :: Parser ()
eof = MkParser sf
  where
    sf "" = Just ("", ())
    sf _ = Nothing

in ParserLib.hs

You should test them using runParser or even unParser to get a feeling of them.

Connectives

You would not want to write your own String -> Maybe (String, a) functions from scratch for larger parsers. You want to just glue the primitives above with connectives that say “foo and then bar”, “foo or bar”, or “foo but convert the answer”.

Converting Answers

fmap :: (a -> b) -> Parser a -> Parser b
fmap f (MkParser sf) = MkParser sfb
  where
    sfb inp = case sf inp of
                Nothing -> Nothing
                Just (rest, a) -> Just (rest, f a)

in ParserLib.hs

Example: Read a character, but yield its ASCII or Unicode code (the function is ord from Data.Char):

anyCode :: Parser Int
anyCode = fmap ord anyChar
-- Try: runParser anyCode "AB", runParser anyCode ""

in Parsing.hs

Sequential Composition

Sequential composition of two parsers, “foo and then bar”: Unconsumed input from the 1st becomes input of the 2nd. Combine their answers by a given binary operator/function.

liftA2 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c
liftA2 op (MkParser sf1) p2 = MkParser g
  where
    g inp = case sf1 inp of
              Nothing -> Nothing
              Just (middle, a) ->
                  case unParser p2 middle of
                    Nothing -> Nothing
                    Just (rest, b) -> Just (rest, op a b)

in ParserLib.hs

(<*) :: Parser a -> Parser b -> Parser a
pa <* pb = liftA2 (\a _ -> a) pa pb

(*>) :: Parser a -> Parser b -> Parser b
pa *> pb = liftA2 (\_ b -> b) pa pb

in ParserLib.hs

Example: I want a letter, then a digit, then ‘!’; the answer is the string consisting of the letter and the digit:

ldeV1 :: Parser String
ldeV1 = liftA2 (\x y -> [x,y]) (satisfy isAlpha) (satisfy isDigit)
        <*
        (char '!')
-- Try: runParser ldeV1 "B6!", runParser ldeV1 "B6a"

in Parsing.hs

liftA2 has some kind of identity element: A “trivial” parser that always yields the same answer and doesn't consume input:

pure :: a -> Parser a
pure a = MkParser (\inp -> Just (inp, a))

in ParserLib.hs

If the binary operator to liftA2 is function application, it has enough power (together with pure) to express liftA3, liftA4, …

(<*>) :: Parser (a -> b) -> Parser a -> Parser b
pf <*> pa = liftA2 (\f a -> f a) pf pa

liftA3 :: (a -> b -> c -> d) -> Parser a -> Parser b -> Parser c -> Parser d
liftA3 f pa pb pc = ((pure f <*> pa) <*> pb) <*> pc
-- Note: Those parentheses are redundant, <*> is left-associative.

in ParserLib.hs

Dependent Sequential Composition

Sometimes you want the 2nd parser to know/depend on the answer from the 1st parser:

(>>=) :: Parser a -> (a -> Parser b) -> Parser b
MkParser sf1 >>= k = MkParser g
  where
    g inp = case sf1 inp of
              Nothing -> Nothing
              Just (rest, a) -> unParser (k a) rest

in ParserLib.hs

Example: I want a letter, then a digit, then ‘!’; the answer is the string consisting of the letter and the digit:

ldeV2 :: Parser String
ldeV2 = satisfy isAlpha
        >>= (\x -> satisfy isDigit
        >>= (\y -> char '!'
        >>= (\_ -> pure [x, y])))
        -- Note: Those parentheses are redundant.
-- Try: runParser ldeV2 "B6!", runParser ldeV2 "B6a"

in Parsing.hs

Choice

Backtracking choice/alternation, “foo or else bar”: Try the 1st parser, if it fails, try the 2nd parser. And its identity element: The parser that always fails. Backtracking is the expensive part of recursive descent parsing.

(<|>) :: Parser a -> Parser a -> Parser a
MkParser sf1 <|> p2 = MkParser g
  where
    g inp = case sf1 inp of
              Nothing -> unParser p2 inp
              j -> j        -- the Just case

empty :: Parser a
empty = MkParser (\_ -> Nothing)

in ParserLib.hs

ab01 :: Parser String
ab01 = liftA2 (\x y -> [x,y]) (char 'A' <|> char 'B') (char '0' <|> char '1')

in Parsing.hs

satisfyV2 pred = anyChar >>= \c -> if pred c then pure c else empty

in Parsing.hs

With choice and recursion, we can try a parser for 0 or 1 time, 0 or more times, 1 or more times:

-- | 0 or 1 time.
optional :: Parser a -> Parser (Maybe a)
optional p = fmap Just p <|> pure Nothing
-- Try 1 time. If success, tag with Just; if failure, answer Nothing.

-- | 0 or more times, maximum munch, collect the answers into a list.
many :: Parser a -> Parser [a]
many p = some p <|> pure []
-- Explanation: To repeat 0 or more times, first try 1 or more
-- times!  If that fails, then we know it's 0 times, and the answer is the
-- empty list.

-- | 1 or more times, maximum munch, collect the answers into a list.
some :: Parser a -> Parser [a]
some p = liftA2 (:) p (many p)
-- Explanation: To repeat 1 or more times, do 1 time, then 0 or more times!

in ParserLib.hs

Parsing Primitives (lexeme/token level)

We won't actually use the character-level primitives directly. A reason is that spaces will get into the way. Another is that we think at the token level actually, even intuitively before you learned grammars and parsing! We only use character-level primitives to implement token-level primitives such as the ones below. Then we use connectives and token-level primitives for the grammar.

Whitespace handling convention: Token-level primitives assume there is no leading spaces, and skip trailing spaces (so the next token primitive may assume no leading spaces). Something else at the outermost level will have to skip initial leading spaces; we'll get there later.

-- | Space or tab or newline (unix and windows).
whitespace :: Parser Char
whitespace = satisfy (\c -> c `elem` ['\t', '\n', '\r', ' '])

-- | Consume zero or more whitespaces, maximum munch.
whitespaces :: Parser String
whitespaces = many whitespace

-- | Read a natural number (non-negative integer), then skip trailing spaces.
natural :: Parser Integer
natural = fmap read (some (satisfy isDigit)) <* whitespaces
-- read :: Read a => String -> a
-- For converting string to your data type, assuming valid string.  Integer
-- is an instance of Read, and our string is valid, so we can use read.

-- | Read an identifier, then skip trailing spaces.  Disallow the listed keywords.
identifier :: [String] -> Parser String
identifier keywords =
    satisfy isAlpha
    >>= \c -> many (satisfy isAlphaNum)
    >>= \cs -> whitespaces
    *> let str = c:cs
    in if str `elem` keywords then empty else pure str

-- | Read the wanted keyword, then skip trailing spaces.
keyword :: String -> Parser String
keyword wanted =
    satisfy isAlpha
    >>= \c -> many (satisfy isAlphaNum)
    >>= \cs -> whitespaces
    *> if c:cs == wanted then pure wanted else empty

in ParserLib.hs

CFG Parsing

A parser for a context-free grammar can mostly look like the grammar rules. There are however a few gotchas to watch out for, some tricks, and that lingering issue of initial leading spaces.

data Expr
    = Num Integer
    | Var String
    | Prim2 Op2 Expr Expr       -- Prim2 op operand operand
    | Let [(String, Expr)] Expr -- Let [(name, rhs), ...] body

data Op2 = Add | Sub | Mul | Pow

in Parsing.hs

Right-associating operator

Take this simple rule, and suppose we intend the operator to associate to the right:

The second form uses right recursion to convey right association. Perfect for recursive descent parsing!

powsV1 :: Parser Expr
powsV1 = fmap Num natural
         >>= \x ->     ((operator "^" *> pure (Prim2 Pow))
                        >>= \op -> powsV1
                        >>= \y -> pure (op x y))
                   <|>
                       pure x

in Parsing.hs

After you understand this, you would rather not write this recursion by hand again for every right-associating operator, instead call a re-factored function and specify just your operand parser and operator parser.

chainr1 :: Parser a               -- ^ operand parser
        -> Parser (a -> a -> a)   -- ^ operator parser
        -> Parser a               -- ^ whole answer
chainr1 getArg getOp = getArg
                       >>= \x ->     (getOp
                                      >>= \op -> chainr1 getArg getOp
                                      >>= \y -> pure (op x y))
                                 <|>
                                     pure x

in ParserLib.hs

powsV2 :: Parser Expr
powsV2 = chainr1 (fmap Num natural) (operator "^" *> pure (Prim2 Pow))

in Parsing.hs

Left-associating operator

Suppose we want the operator to associate to the left instead. We cannot code up left recursion directly. But here is the trick. Implement the other form of the rule:

Use many for the “{ "*" natural }” part, get a list of tuples of (operator, number). For example if the input string is “2 * 5 * 3 * 7”, my plan is to:

Then a simple use of foldl on the list, starting with Num 2 as the initial accumulator, will build the left-leaning tree!

mulsV1 :: Parser Expr
mulsV1 = fmap Num natural
         >>= \x -> many (liftA2 (,)
                                (operator "*" *> pure (Prim2 Mul))
                                (fmap Num natural))
         >>= \opys -> pure (foldl (\accum (op,y) -> op accum y) x opys)

in Parsing.hs

Again in practice we don't write this code again, we re-factor this into a general function:

chainl1 :: Parser a               -- ^ operand parser
        -> Parser (a -> a -> a)   -- ^ operator parser
        -> Parser a               -- ^ whole answer
chainl1 getArg getOp = getArg
                       >>= \x -> many (liftA2 (,) getOp getArg)
                       >>= \opys -> pure (foldl (\accum (op,y) -> op accum y) x opys)

in ParserLib.hs

mulsV2 :: Parser Expr
mulsV2 = chainl1 (fmap Num natural) (operator "*" *> pure (Prim2 Mul))

in Parsing.hs

So for example “7-3+2” should make sense. The cool trick is how do you specify one single operator that tries both plus and minus?

Initial space, final junk

Token-level parsers assume no leading spaces; someone has to skip initial leading spaces to kick-start this invariant.

As well, a small parser for a part of the grammar can leave non-space stuff unconsumed, since we anticipate that later a small parser for another part may need it. But the overall combined parser for the whole grammar cannot: by the time you're done with the whole grammar, any non-space leftover means the original input string is actually erroneous, e.g., we don't consider “2*3*” to be a legal arithmetic expression (our muls parsers can make sense of the prefix “2*3” but leaves the last “*” unconsumed).

Here is the trick for solving both. Have a “main” parser, and its job is simply to clear initial leading spaces, call the parser for the start symbol, then use eof to check that there is nothing left.

Example: If the muls rule is already my whole grammar, and so muls is my start symbol, I write:

lesson2 :: Parser Expr
lesson2 = whitespaces *> muls <* eof
  where
    muls = chainl1 (fmap Num natural) (operator "*" *> pure (Prim2 Mul))

in Parsing.hs

Why not do these inside the parser for the start symbol: There may be recursive calls to it in the middle of your grammar! Wrong to do the eof check there/then.

Operator precedence and parentheses

Suppose I have two operators “^” and “*”, with “*” having lower precedence, and I also support parentheses for overriding precedence.

In other words, from lowest precedence (binding most loosely) to highest (binding most tightly): “*”, then “^”, then individual numbers and parentheses (same level without ambiguity).

The trick is to have lower (looser) rules call higher (tighter) rules, and have the parentheses rule call the lowest rule for recursion. The start symbol is from the lowest rule. This is also how you can write your grammar to convey precedence.

So my grammar goes like (start symbol is muls):

muls ::= pows { "*" pows }
pows ::= atom { "^" atom }
atom ::= natural | "(" muls ")"

And my parser goes like:

lesson3 :: Parser Expr
lesson3 = whitespaces *> muls <* eof
  where
    muls = chainl1 pows (operator "*" *> pure (Prim2 Mul))
    pows = chainr1 atom (operator "^" *> pure (Prim2 Pow))
    atom = fmap Num natural <|> (openParen *> muls <* closeParen)

in Parsing.hs

Exercise: Extend to cover “+” and “-”, both same precedence and lower than “*”.

Keywords and variables

And now my really whole grammar (start symbol is expr):

expr ::= local | muls
local ::= "let" { var "=" expr ";" } "in" expr
muls ::= pows { "*" pows }
pows ::= atom { "^" atom }
atom ::= natural | var | "(" expr ")"

Problem: “let inx” should be a syntax error, but a naïve parser implementation sees “let”, “in”, “x”.

Solution: A parser for a reserved word should first read as many alphanums as possible (not just the expected letters), and then check that the whole string equals the keyword. This is what keyword does in an earlier section.

Conversely, the parser for identifiers should read likewise, but then check that the string doesn't clash with reserved words. This is why identifier from that earlier section takes a parameter for reserved words to avoid.

With that, here is the whole parser:

lesson4 :: Parser Expr
lesson4 = whitespaces *> expr <* eof
  where
    expr = local <|> muls

    local = pure (\_ eqns _ e -> Let eqns e)
            <*> keyword "let"
            <*> many equation
            <*> keyword "in"
            <*> expr
    -- Basically a liftA4.
    -- Could also be implemented with (>>=), like equation below.

    equation = var
               >>= \v -> operator "="
               *> expr
               >>= \e -> semicolon
               *> pure (v, e)
    -- Basically a liftA4.
    -- Could also be implemented with (<*>), like local above.

    semicolon = char ';' *> whitespaces
    muls = chainl1 pows (operator "*" *> pure (Prim2 Mul))
    pows = chainr1 atom (operator "^" *> pure (Prim2 Pow))
    atom = fmap Num natural
           <|> fmap Var var
           <|> (openParen *> expr <* closeParen)
    var = identifier ["let", "in"]

in Parsing.hs

Contents

Recursive Descent Parsing

Parser representation

Parsing primitives (character level)

Connectives

Converting Answers

Sequential Composition

Dependent Sequential Composition

Choice

Parsing Primitives (lexeme/token level)

CFG Parsing

Right-associating operator

Left-associating operator

Initial space, final junk

Operator precedence and parentheses

Keywords and variables