Recursive descent parsers can be nicely expressed in an embedded domain-specific language, built from a few primitives and composed using the connectives from Functor, Applicative, and Monad; and there is one more relevant connective type class, Alternative, for failures and choices.
For minimal infrastructure, we will parse characters directly, but we are aware that going through tokenization first is more tidy, and we can still write our parsers in that spirit.
Each parser can be represented as a function taking an input string, consuming a prefix of it, and giving one of:
When success, it is important to give the unconsumed suffix rather than
losing it. When liftA2
and >>=
chain up two
parsers, the second parser needs to see the leftover from the first parser! You
can also think of a state variable for the current string to parse. Overall we
are combining two effects, failure and state.
So we use such a function type to define our parser type:
data Parser a = MkParser (String -> Maybe (String, a)) unParser :: Parser a -> String -> Maybe (String, a) unParser (MkParser sf1) = sf1in ParserLib.hs
I use Maybe
because I anticipate at most one valid answer,
and for simplicity I don't include error information for failures.
It is also possible to use []
to anticipate ambiguous grammars
and multiple valid answers, with the empty list for failure.
Here is the function for using a parser. You give it your input string, it
gives you overall failure or final answer. (It discards the final leftover;
usually you aren't interested. If you're interested, use unParser
above.)
runParser :: Parser a -> String -> Maybe a runParser (MkParser sf) inp = case sf inp of Nothing -> Nothing Just (_, a) -> Just a -- OR: fmap (\(_,a) -> a) (sf inp)in ParserLib.hs
A good way to deepen understanding of the parser type is to implement a few primitives. They will also be actually used.
A most basic primitive reads a character and gives it to you. So when does it fail? When there is no character to read!
anyChar :: Parser Char anyChar = MkParser sf where sf "" = Nothing sf (c:cs) = Just (cs, c)in ParserLib.hs
Next, perhaps you have an expected character in mind and want to read and check. Failure if mismatch or no character to read.
char :: Char -> Parser Char char wanted = MkParser sf where sf (c:cs) | c == wanted = Just (cs, c) sf _ = Nothingin ParserLib.hs
E.g., if you expect ‘A’ (and reject everything else), you say “char
'A'
”.
More generally, perhaps you expect a character that satisfies a predicate of yours.
satisfy :: (Char -> Bool) -> Parser Char satisfy pred = MkParser sf where sf (c:cs) | pred c = Just (cs, c) sf _ = Nothingin ParserLib.hs
E.g., if you expect a letter, you say “satisfy
isAlpha
”. (isAlpha
is from Data.Char.)
One last character-level primitive, the other most basic one: Checking that
the input string is empty. So its failure/success criterion is the opposite of
char
's.
eof :: Parser () eof = MkParser sf where sf "" = Just ("", ()) sf _ = Nothingin ParserLib.hs
You should test them using runParser
or even
unParser
to get a feeling of them.
The effects of the Parser
type are a combination of failure and
state. Accordingly, the implementation of the Functor, Applicative, and Monad
methods also combine those of Maybe
and State
, i.e.,
checking for Nothing
vs Just
, and plumbing for state
values (input strings and leftovers).
instance Functor Parser where -- fmap :: (a -> b) -> Parser a -> Parser b fmap f (MkParser sf) = MkParser sfb where sfb inp = case sf inp of Nothing -> Nothing Just (rest, a) -> Just (rest, f a) -- OR: fmap (\(rest, a) -> (rest, f a)) (sf inp) instance Applicative Parser where -- pure :: a -> Parser a pure a = MkParser (\inp -> Just (inp, a)) -- liftA2 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c -- Consider the 1st parser to be stage 1, 2nd parser stage 2. liftA2 op (MkParser sf1) p2 = MkParser g where g inp = case sf1 inp of Nothing -> Nothing Just (middle, a) -> case unParser p2 middle of Nothing -> Nothing Just (rest, b) -> Just (rest, op a b) instance Monad Parser where -- return :: a -> Parser a return = pure -- (>>=) :: Parser a -> (a -> Parser b) -> Parser b MkParser sf1 >>= k = MkParser g where g inp = case sf1 inp of Nothing -> Nothing Just (rest, a) -> unParser (k a) restin ParserLib.hs
In Control.Applicative there are more utiltiy connectives, two of which I am fond of using:
(*>) :: Applicative f => f a -> f b -> f b p *> q = liftA2 (\a b -> b) p q -- Drop p's answer, give only q's answer. Like (>>) but Applicative. (<*) :: Applicative f => f a -> f b -> f a p <* q = liftA2 (\a b -> a) p q -- Drop q's answer, give only p's answer.
Example of chaining up several primitive parsers sequentially: I want a letter, then a digit, then ‘!’; the answer is the letter and the digit in a string, and drop the ‘!’:
lde :: Parser String lde = liftA2 (\x y -> [x,y]) (satisfy isAlpha) (satisfy isDigit) <* (char '!') -- Try: runParser lde "B6!", runParser lde "B6a"
There is one more type class “Alternative” (in the standard library; import it from Control.Applicative) containing methods for failure and choice. Why they are grouped together: choice is an associative binary operator, and failure is the identity element.
class Applicative f => Alternative f where empty :: f a (<|>) :: f a -> f a -> f a many :: f a -> f [a] -- has default implementation some :: f a -> f [a] -- has default implementation
This type class was actually inspired by parsing! The <|>
operator came from the “|” in BNF, and many
and some
came from “0 or more times” and “1 or more times”.
And our implementation for our Parser
:
instance Alternative Parser where -- empty :: Parser a -- Always fail. empty = MkParser (\_ -> Nothing) -- (<|>) :: Parser a -> Parser a -> Parser a -- Try the 1st one. If success, done; if failure, do the 2nd one MkParser sf1 <|> p2 = MkParser g where g inp = case sf1 inp of Nothing -> unParser p2 inp j -> j -- the Just case -- many :: Parser a -> Parser [a] -- 0 or more times, maximum munch, collect the answers into a list. -- Can use default implementation. And it goes as: many p = some p <|> pure [] -- How to make sense of it: To repeat 0 or more times, first try 1 or more -- times! If that fails, then we know it's 0 times, and the answer is the -- empty list. -- some :: Parser a -> Parser [a] -- 1 or more times, maximum munch, collect the answers into a list. -- Can use default implementation. And it goes as: some p = liftA2 (:) p (many p) -- How to make sense of it: To repeat 1 or more times, do 1 time, then 0 or -- more times! Use liftA2 to chain up and collect answers.in ParserLib.hs
Example use of empty
: satisfy
can be expressed as:
satisfyV2 pred = anyChar >>= \c -> if pred c then return c else empty
Example use of <|>
: I want ‘A’ or ‘B’, followed by ‘0’ or
‘1’:
ab01 :: Parser String ab01 = liftA2 (\x y -> [x,y]) (char 'A' <|> char 'B') (char '0' <|> char '1')
In Control.Applicative there are also a utility connective based on Alternative. It's very handy when an ENBF rules says “0 or 1 time”:
optional :: Alternative f => f a -> f (Maybe a) optional p = fmap Just p <|> pure Nothing
We won't actually use the character-level primitives directly. A reason is that spaces will get into the way. Another is that we think at the token level actually, even intuitively before you learned grammars and parsing! We only use character-level primitives to implement token-level primitives such as the ones below. Then we use connectives and token-level primitives for the grammar.
Whitespace handling convention: Token-level primitives assume there is no leading spaces, and skip trailing spaces (so the next primitive may assume no leading spaces). Something else at the outermost level will have to skip initial leading spaces; we'll get there later.
-- | Space or tab or newline (unix and windows). whitespace :: Parser Char whitespace = satisfy (\c -> c `elem` ['\t', '\n', '\r', ' ']) -- | Consume zero or more whitespaces, maximum munch. whitespaces :: Parser String whitespaces = many whitespace -- | Read a natural number (non-negative integer), then skip trailing spaces. natural :: Parser Integer natural = fmap read (some (satisfy isDigit)) <* whitespaces -- read :: Read a => String -> a -- For converting string to your data type, assuming valid string. Integer -- is an instance of Read, and our string is valid, so we can use read. -- | Read an identifier, then skip trailing spaces. Disallow the listed keywords. identifier :: [String] -> Parser String identifier keywords = satisfy isAlpha >>= \c -> many (satisfy isAlphaNum) >>= \cs -> whitespaces >> let str = c:cs in if str `elem` keywords then empty else return str -- | Read the wanted keyword, then skip trailing spaces. keyword :: String -> Parser String keyword wanted = satisfy isAlpha >>= \c -> many (satisfy isAlphaNum) >>= \cs -> whitespaces *> if c:cs == wanted then return wanted else emptyin ParserLib.hs
There are a few more in ParserLib.hs, but you get the point.
A parser for a context-free grammar can mostly look like the grammar rules. There are however a few gotchas to watch out for, some tricks, and that lingering issue of initial leading spaces.
The parsers here will produce abstract syntax trees of this type:
data Expr = Num Integer | Var String | Prim2 Op2 Expr Expr -- Prim2 op operand operand | Let [(String, Expr)] Expr -- Let [(name, rhs), ...] body data Op2 = Add | Mulin Parsing.hs
Take this simple rule, and suppose we intend the operator to associate to the right:
muls ::= natural { "*" natural } OR muls ::= natural [ "*" muls ]
The second form uses right recursion to convey right association. Perfect for recursive descent parsing!
mulsRv1 :: Parser Expr mulsRv1 = liftA2 link (fmap Num natural) (optional (liftA2 (,) (operator "*" *> pure (Prim2 Mul)) mulsRv1)) where link x Nothing = x link x (Just (op,y)) = op x yin Parsing.hs
After you understand this, you would rather not write this recursion by hand again for every right-associating operator, instead call a re-factored function and specify just your operand parser and operator parser.
Here is the re-factored general function for right-associating operators.
chainr1 :: Parser a -- ^ operand parser -> Parser (a -> a -> a) -- ^ operator parser -> Parser a -- ^ whole answer chainr1 getArg getOp = liftA2 link getArg (optional (liftA2 (,) getOp (chainr1 getArg getOp))) where link x Nothing = x link x (Just (op,y)) = op x yin ParserLib.hs
So here is how we will implement the rule in practice:
mulsRv2 :: Parser Expr mulsRv2 = chainr1 (fmap Num natural) (operator "*" *> pure (Prim2 Mul))in Parsing.hs
Suppose we want the operator to associate to the left instead. We cannot code up left recursion directly. But here is the trick. Implement the other form of the rule:
muls ::= natural { "*" natural }
Use many
for the “{ "*" natural }
” part, get a list
of tuples of (operator, number). For example if the input string is “2 * 5 * 3
* 7”, my plan is to:
Num 2
many
and
get [(Prim2 Mul, Num 5), (Prim2 Mul, Num 3), (Prim2 Mul, Num
7)]
Then a simple use of foldl
on the list, starting with Num
2
as the initial accumulator, will build the left-leaning tree!
mulsLv1 :: Parser Expr mulsLv1 = liftA2 link (fmap Num natural) (many (liftA2 (,) (operator "*" *> pure (Prim2 Mul)) (fmap Num natural))) where link x opys = foldl (\accum (op,y) -> op accum y) x opysin Parsing.hs
Again in practice we don't write this code again, we re-factor this into a general function:
chainl1 :: Parser a -- ^ operand parser -> Parser (a -> a -> a) -- ^ operator parser -> Parser a -- ^ whole answer chainl1 getArg getOp = liftA2 link getArg (many (liftA2 (,) getOp getArg)) where link x opys = foldl (\accum (op,y) -> op accum y) x opysin ParserLib.hs
Then we use it like:
mulsLv2 :: Parser Expr mulsLv2 = chainl1 (fmap Num natural) (operator "*" *> pure (Prim2 Mul))in Parsing.hs
Token-level parsers assume no leading spaces; someone has to skip initial leading spaces to kick-start this invariant.
As well, a small parser for a part of the grammar can leave non-space stuff unconsumed, since we anticipate that later a small parser for another part may need it. But the overall combined parser for the whole grammar cannot: by the time you're done with the whole grammar, any non-space leftover means the original input string is actually erroneous, e.g., we don't consider “2*3*” to be a legal arithmetic expression (our muls parsers can make sense of the prefix “2*3” but leaves the last “*” unconsumed).
Here is the trick for solving both. Have a “main” parser, and its job is
simply to clear initial leading spaces, call the parser for the start symbol,
then use eof
to check that there is nothing left.
Example: If the muls rule is already my whole grammar, and so muls is my start symbol, I write:
lesson2 :: Parser Expr lesson2 = whitespaces *> muls <* eof where muls = chainl1 (fmap Num natural) (operator "*" *> pure (Prim2 Mul))in Parsing.hs
Why not do these inside the parser for the start symbol: There may be
recursive calls to it in the middle of your grammar! Wrong to do
the eof
check there/then.
Suppose I have two operators “*” and “+”, with “+” having lower precedence, and I also support parentheses for overriding precedence.
In other words, from lowest precedence (binding most loosely) to highest (binding most tightly): “+”, then “*”, then individual numbers and parentheses (same level without ambiguity).
The trick is to have lower (looser) rules call higher (tighter) rules, and have the parentheses rule call the lowest rule for recursion. The start symbol is from the lowest rule. This is also how you can write your grammar to convey precedence.
So my grammar goes like (start symbol is adds):
adds ::= muls { "+" muls } muls ::= atom { "*" atom } atom ::= natural | "(" adds ")"
And my parser goes like (let's say left-associating operators):
lesson3 :: Parser Expr lesson3 = whitespaces *> adds <* eof where adds = chainl1 muls (operator "+" *> pure (Prim2 Add)) muls = chainl1 atom (operator "*" *> pure (Prim2 Mul)) atom = fmap Num natural <|> (openParen *> adds <* closeParen)in Parsing.hs
And now my really whole grammar (start symbol is expr):
expr ::= local | adds local ::= "let" { var "=" expr ";" } "in" expr adds ::= muls { "+" muls } muls ::= atom { "*" atom } atom ::= natural | var | "(" expr ")"
Problem: “let inn+4” should be a syntax error, but a naïve parser implementation sees “let”, “in”, “n”, “+”, “4”.
Solution: A parser for a reserved word should first read as many alphanums as
possible (not just the expected letters), and then check that the whole string
equals the keyword. This is what keyword
does in an earlier
section.
Conversely, the parser for identifiers should read likewise, but then check
that the string doesn't clash with reserved words. This is
why identifier
from that earlier section takes a parameter for
reserved words to avoid.
With that, here is the whole parser:
lesson4 :: Parser Expr lesson4 = whitespaces *> expr <* eof where expr = local <|> adds local = pure (\_ eqns _ e -> Let eqns e) <*> keyword "let" <*> many equation <*> keyword "in" <*> expr -- Basically a liftA4. -- Could also be implemented in monadic style, like equation below. equation = var >>= \v -> operator "=" >> expr >>= \e -> semicolon >> return (v, e) -- Basically a liftA4. -- Could also be implemented in applicative style, like local above. semicolon = char ';' *> whitespaces adds = chainl1 muls (operator "+" *> pure (Prim2 Add)) muls = chainl1 atom (operator "*" *> pure (Prim2 Mul)) atom = fmap Num natural <|> fmap Var var <|> (openParen *> expr <* closeParen) var = identifier ["let", "in"]in Parsing.hs