Recursive descent parsers can be nicely built from a few primitives and combined by connectives for sequential chaining and backtracking choice. Some of them can be non-obvious until you realize you also need to convert received strings to data such as numbers and ASTs, or even to perform evaluation right away.
For minimal infrastructure, we will parse characters directly, but we are aware that going through tokenization first is more tidy, and we can still write our parsers in that spirit.
A parser can be represented as a function taking an input string, consuming a prefix of it, and resulting in one of:
data Parser a = MkParser (String -> Maybe (String, a)) unParser :: Parser a -> String -> Maybe (String, a) unParser (MkParser sf1) = sf1in ParserLib.hs
It would be possible to replace Maybe
by []
to allow
ambiguous grammars and multiple valid answers.
Here is the function for using a parser. You give it your input string, it
gives you overall failure or final answer. (It discards the final leftover;
usually you aren't interested. If you're interested, use unParser
above.)
runParser :: Parser a -> String -> Maybe a runParser (MkParser sf) inp = case sf inp of Nothing -> Nothing Just (_, a) -> Just ain ParserLib.hs
A good way to deepen understanding of the parser type is to implement a few primitives. They will also be actually used.
A most basic primitive reads a character and gives it to you. So when does it fail? When there is no character to read!
anyChar :: Parser Char anyChar = MkParser sf where sf "" = Nothing sf (c:cs) = Just (cs, c)in ParserLib.hs
Next, perhaps you have an expected character in mind and want to read and check. Failure if mismatch or no character to read.
char :: Char -> Parser Char char wanted = MkParser sf where sf (c:cs) | c == wanted = Just (cs, c) sf _ = Nothingin ParserLib.hs
E.g., if you expect ‘A’ (and reject everything else), you say “char
'A'
”.
Why return the character that the caller already specifies: We can have a
grammar rule like “<B> ::= '0' | '1'
”, the parser code shown
later will be “char '0' <|> char '1'
”, now you want to know
which one succeeds!
More generally, perhaps you expect a character that satisfies a given predicate:
satisfy :: (Char -> Bool) -> Parser Char satisfy pred = MkParser sf where sf (c:cs) | pred c = Just (cs, c) sf _ = Nothingin ParserLib.hs
E.g., if you expect a letter, you say “satisfy isAlpha
”.
(isAlpha
is from Data.Char.)
One last character-level primitive, the other most basic one: Checking that
the input string is empty. In a sense an opposite of anyChar
.
eof :: Parser () eof = MkParser sf where sf "" = Just ("", ()) sf _ = Nothingin ParserLib.hs
You should test them using runParser
or even
unParser
to get a feeling of them.
You would not want to write your own String -> Maybe (String, a)
functions from scratch for larger parsers. You want to just glue the primitives
above with connectives that say “foo and then bar”, “foo or bar”, or “foo but
convert the answer”.
Convert a parser's answer by a given function:
fmap :: (a -> b) -> Parser a -> Parser b fmap f (MkParser sf) = MkParser sfb where sfb inp = case sf inp of Nothing -> Nothing Just (rest, a) -> Just (rest, f a)in ParserLib.hs
Example: Read a character, but yield its ASCII or Unicode code (the function
is ord
from Data.Char):
anyCode :: Parser Int anyCode = fmap ord anyChar -- Try: runParser anyCode "AB", runParser anyCode ""in Parsing.hs
Sequential composition of two parsers, “foo and then bar”: Unconsumed input from the 1st becomes input of the 2nd. Combine their answers by a given binary operator/function.
liftA2 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c liftA2 op (MkParser sf1) p2 = MkParser g where g inp = case sf1 inp of Nothing -> Nothing Just (middle, a) -> case unParser p2 middle of Nothing -> Nothing Just (rest, b) -> Just (rest, op a b)in ParserLib.hs
Derived utilities when you just want one of the answers:
(<*) :: Parser a -> Parser b -> Parser a pa <* pb = liftA2 (\a _ -> a) pa pb (*>) :: Parser a -> Parser b -> Parser b pa *> pb = liftA2 (\_ b -> b) pa pbin ParserLib.hs
Example: I want a letter, then a digit, then ‘!’; the answer is the string consisting of the letter and the digit:
ldeV1 :: Parser String ldeV1 = liftA2 (\x y -> [x,y]) (satisfy isAlpha) (satisfy isDigit) <* (char '!') -- Try: runParser ldeV1 "B6!", runParser ldeV1 "B6a"in Parsing.hs
liftA2
has some kind of identity element: A “trivial” parser
that always yields the same answer and doesn't consume input:
pure :: a -> Parser a pure a = MkParser (\inp -> Just (inp, a))in ParserLib.hs
It is analogous to identity because these hold:
liftA2 op (pure a) pb = fmap (\b -> op a b) pb liftA2 op pa (pure b) = fmap (\a -> op a b) pa
If the binary operator to liftA2
is function application, it has
enough power (together with pure
) to express liftA3, liftA4, …
(<*>) :: Parser (a -> b) -> Parser a -> Parser b pf <*> pa = liftA2 (\f a -> f a) pf pa liftA3 :: (a -> b -> c -> d) -> Parser a -> Parser b -> Parser c -> Parser d liftA3 f pa pb pc = ((pure f <*> pa) <*> pb) <*> pc -- Note: Those parentheses are redundant, <*> is left-associative.in ParserLib.hs
You can pronounce <*>
as “ap”.
Brief explanation:
pure f :: Parser (a -> (b -> (c -> d))) pure f <*> pa :: Parser (b -> (c -> d)) (pure f <*> pa) <*> pb :: Parser (c -> d) ((pure f <*> pa) <*> pb) <*> pc :: Parser d
Sometimes you want the 2nd parser to know/depend on the answer from the 1st parser:
(>>=) :: Parser a -> (a -> Parser b) -> Parser b MkParser sf1 >>= k = MkParser g where g inp = case sf1 inp of Nothing -> Nothing Just (rest, a) -> unParser (k a) restin ParserLib.hs
You can pronounce >>=
as “bind”.
Example: I want a letter, then a digit, then ‘!’; the answer is the string consisting of the letter and the digit:
ldeV2 :: Parser String ldeV2 = satisfy isAlpha >>= (\x -> satisfy isDigit >>= (\y -> char '!' >>= (\_ -> pure [x, y]))) -- Note: Those parentheses are redundant. -- Try: runParser ldeV2 "B6!", runParser ldeV2 "B6a"in Parsing.hs
If you feel that it's callback programming, yes it is!
pure
is also a kind of identity for >>=
:
pure a >>= k = k a k >>= pure = k
Backtracking choice/alternation, “foo or else bar”: Try the 1st parser, if it fails, try the 2nd parser. And its identity element: The parser that always fails. Backtracking is the expensive part of recursive descent parsing.
(<|>) :: Parser a -> Parser a -> Parser a MkParser sf1 <|> p2 = MkParser g where g inp = case sf1 inp of Nothing -> unParser p2 inp j -> j -- the Just case empty :: Parser a empty = MkParser (\_ -> Nothing)in ParserLib.hs
Example: I want ‘A’ or ‘B’, followed by ‘0’ or ‘1’:
ab01 :: Parser String ab01 = liftA2 (\x y -> [x,y]) (char 'A' <|> char 'B') (char '0' <|> char '1')in Parsing.hs
Example satisfy
can be expressed with >>=
and
empty
:
satisfyV2 pred = anyChar >>= \c -> if pred c then pure c else emptyin Parsing.hs
With choice and recursion, we can try a parser for 0 or 1 time, 0 or more times, 1 or more times:
-- | 0 or 1 time. optional :: Parser a -> Parser (Maybe a) optional p = fmap Just p <|> pure Nothing -- Try 1 time. If success, tag with Just; if failure, answer Nothing. -- | 0 or more times, maximum munch, collect the answers into a list. many :: Parser a -> Parser [a] many p = some p <|> pure [] -- Explanation: To repeat 0 or more times, first try 1 or more -- times! If that fails, then we know it's 0 times, and the answer is the -- empty list. -- | 1 or more times, maximum munch, collect the answers into a list. some :: Parser a -> Parser [a] some p = liftA2 (:) p (many p) -- Explanation: To repeat 1 or more times, do 1 time, then 0 or more times!in ParserLib.hs
Example: 1 or more digits: some (satisfy isDigit)
We won't actually use the character-level primitives directly. A reason is that spaces will get into the way. Another is that we think at the token level actually, even intuitively before you learned grammars and parsing! We only use character-level primitives to implement token-level primitives such as the ones below. Then we use connectives and token-level primitives for the grammar.
Whitespace handling convention: Token-level primitives assume there is no leading spaces, and skip trailing spaces (so the next token primitive may assume no leading spaces). Something else at the outermost level will have to skip initial leading spaces; we'll get there later.
-- | Space or tab or newline (unix and windows). whitespace :: Parser Char whitespace = satisfy (\c -> c `elem` ['\t', '\n', '\r', ' ']) -- | Consume zero or more whitespaces, maximum munch. whitespaces :: Parser String whitespaces = many whitespace -- | Read a natural number (non-negative integer), then skip trailing spaces. natural :: Parser Integer natural = fmap read (some (satisfy isDigit)) <* whitespaces -- read :: Read a => String -> a -- For converting string to your data type, assuming valid string. Integer -- is an instance of Read, and our string is valid, so we can use read. -- | Read an identifier, then skip trailing spaces. Disallow the listed keywords. identifier :: [String] -> Parser String identifier keywords = satisfy isAlpha >>= \c -> many (satisfy isAlphaNum) >>= \cs -> whitespaces *> let str = c:cs in if str `elem` keywords then empty else pure str -- | Read the wanted keyword, then skip trailing spaces. keyword :: String -> Parser String keyword wanted = satisfy isAlpha >>= \c -> many (satisfy isAlphaNum) >>= \cs -> whitespaces *> if c:cs == wanted then pure wanted else emptyin ParserLib.hs
There are a few more in ParserLib.hs, but you get the point.
A parser for a context-free grammar can mostly look like the grammar rules. There are however a few gotchas to watch out for, some tricks, and that lingering issue of initial leading spaces.
The parsers here will produce abstract syntax trees of this type:
data Expr = Num Integer | Var String | Prim2 Op2 Expr Expr -- Prim2 op operand operand | Let [(String, Expr)] Expr -- Let [(name, rhs), ...] body data Op2 = Add | Sub | Mul | Powin Parsing.hs
Take this simple rule, and suppose we intend the operator to associate to the right:
pows ::= natural { "^" natural } OR pows ::= natural [ "^" pows ]
The second form uses right recursion to convey right association. Perfect for recursive descent parsing!
powsV1 :: Parser Expr powsV1 = fmap Num natural >>= \x -> ((operator "^" *> pure (Prim2 Pow)) >>= \op -> powsV1 >>= \y -> pure (op x y)) <|> pure xin Parsing.hs
After you understand this, you would rather not write this recursion by hand again for every right-associating operator, instead call a re-factored function and specify just your operand parser and operator parser.
Here is the re-factored general function for right-associating operators.
chainr1 :: Parser a -- ^ operand parser -> Parser (a -> a -> a) -- ^ operator parser -> Parser a -- ^ whole answer chainr1 getArg getOp = getArg >>= \x -> (getOp >>= \op -> chainr1 getArg getOp >>= \y -> pure (op x y)) <|> pure xin ParserLib.hs
So here is how we will implement the rule in practice:
powsV2 :: Parser Expr powsV2 = chainr1 (fmap Num natural) (operator "^" *> pure (Prim2 Pow))in Parsing.hs
Suppose we want the operator to associate to the left instead. We cannot code up left recursion directly. But here is the trick. Implement the other form of the rule:
muls ::= natural { "*" natural }
Use many
for the “{ "*" natural }
” part, get a list
of tuples of (operator, number). For example if the input string is “2 * 5 * 3
* 7”, my plan is to:
Num 2
many
to
[(Prim2 Mul, Num 5), (Prim2 Mul, Num 3), (Prim2 Mul, Num 7)]
Then a simple use of foldl
on the list, starting with
Num 2
as the initial accumulator, will build the left-leaning tree!
mulsV1 :: Parser Expr mulsV1 = fmap Num natural >>= \x -> many (liftA2 (,) (operator "*" *> pure (Prim2 Mul)) (fmap Num natural)) >>= \opys -> pure (foldl (\accum (op,y) -> op accum y) x opys)in Parsing.hs
Again in practice we don't write this code again, we re-factor this into a general function:
chainl1 :: Parser a -- ^ operand parser -> Parser (a -> a -> a) -- ^ operator parser -> Parser a -- ^ whole answer chainl1 getArg getOp = getArg >>= \x -> many (liftA2 (,) getOp getArg) >>= \opys -> pure (foldl (\accum (op,y) -> op accum y) x opys)in ParserLib.hs
Then we use it like:
mulsV2 :: Parser Expr mulsV2 = chainl1 (fmap Num natural) (operator "*" *> pure (Prim2 Mul))in Parsing.hs
Exercise: Use one single chainl
to implement
addorsubs ::= natural { op natural } op ::= "+" | "-"
So for example “7-3+2” should make sense. The cool trick is how do you specify one single operator that tries both plus and minus?
Token-level parsers assume no leading spaces; someone has to skip initial leading spaces to kick-start this invariant.
As well, a small parser for a part of the grammar can leave non-space stuff unconsumed, since we anticipate that later a small parser for another part may need it. But the overall combined parser for the whole grammar cannot: by the time you're done with the whole grammar, any non-space leftover means the original input string is actually erroneous, e.g., we don't consider “2*3*” to be a legal arithmetic expression (our muls parsers can make sense of the prefix “2*3” but leaves the last “*” unconsumed).
Here is the trick for solving both. Have a “main” parser, and its job is
simply to clear initial leading spaces, call the parser for the start symbol,
then use eof
to check that there is nothing left.
Example: If the muls rule is already my whole grammar, and so muls is my start symbol, I write:
lesson2 :: Parser Expr lesson2 = whitespaces *> muls <* eof where muls = chainl1 (fmap Num natural) (operator "*" *> pure (Prim2 Mul))in Parsing.hs
Why not do these inside the parser for the start symbol: There may be
recursive calls to it in the middle of your grammar! Wrong to do
the eof
check there/then.
Suppose I have two operators “^” and “*”, with “*” having lower precedence, and I also support parentheses for overriding precedence.
In other words, from lowest precedence (binding most loosely) to highest (binding most tightly): “*”, then “^”, then individual numbers and parentheses (same level without ambiguity).
The trick is to have lower (looser) rules call higher (tighter) rules, and have the parentheses rule call the lowest rule for recursion. The start symbol is from the lowest rule. This is also how you can write your grammar to convey precedence.
So my grammar goes like (start symbol is muls):
muls ::= pows { "*" pows } pows ::= atom { "^" atom } atom ::= natural | "(" muls ")"
And my parser goes like:
lesson3 :: Parser Expr lesson3 = whitespaces *> muls <* eof where muls = chainl1 pows (operator "*" *> pure (Prim2 Mul)) pows = chainr1 atom (operator "^" *> pure (Prim2 Pow)) atom = fmap Num natural <|> (openParen *> muls <* closeParen)in Parsing.hs
Exercise: Extend to cover “+” and “-”, both same precedence and lower than “*”.
And now my really whole grammar (start symbol is expr):
expr ::= local | muls local ::= "let" { var "=" expr ";" } "in" expr muls ::= pows { "*" pows } pows ::= atom { "^" atom } atom ::= natural | var | "(" expr ")"
Problem: “let inx” should be a syntax error, but a naïve parser implementation sees “let”, “in”, “x”.
Solution: A parser for a reserved word should first read as many alphanums as
possible (not just the expected letters), and then check that the whole string
equals the keyword. This is what keyword
does in an earlier
section.
Conversely, the parser for identifiers should read likewise, but then check
that the string doesn't clash with reserved words. This is why
identifier
from that earlier section takes a parameter for reserved
words to avoid.
With that, here is the whole parser:
lesson4 :: Parser Expr lesson4 = whitespaces *> expr <* eof where expr = local <|> muls local = pure (\_ eqns _ e -> Let eqns e) <*> keyword "let" <*> many equation <*> keyword "in" <*> expr -- Basically a liftA4. -- Could also be implemented with (>>=), like equation below. equation = var >>= \v -> operator "=" *> expr >>= \e -> semicolon *> pure (v, e) -- Basically a liftA4. -- Could also be implemented with (<*>), like local above. semicolon = char ';' *> whitespaces muls = chainl1 pows (operator "*" *> pure (Prim2 Mul)) pows = chainr1 atom (operator "^" *> pure (Prim2 Pow)) atom = fmap Num natural <|> fmap Var var <|> (openParen *> expr <* closeParen) var = identifier ["let", "in"]in Parsing.hs