Compilers - Lexical Analysis
compsci ✒ compilers ✒ lexical-analysis
Lexical analysis converts source code (a list of characters) to a list of tokens.
A simple program goes along the source code trying to match as much as possible at each point.
i.e. (in pseudocode)
lex s = token : lex (drop matchlength s)
where (token, matchlength) = biggestMatch s
Regular expressions are often used to describe and match tokens.
Example token set (for addition and subtraction of integers)
| Token | Regular Expression | Example match |
| integer | [0-9]+ | 427 |
| plus | \+ | + |
| minus | \- | - |
Here, backslash is used to escape the symbols + and - so they don't mean anything special in the regular expression.
With this token set, this string:
123+4+23-1-45+2+0000000
Would be converted to this token stream:
integer("123"), plus, integer("4"), plus, integer("23"), minus, integer("1"), minus, integer("45"), plus, integer("2"), plus, integer("0000000")
A little code could run after lexing to change the strings into numbers, e.g. "00000000" to 0.
Lexing with regular expressions
Simon Thompson has written a an excellent primer on regular expression therory and the RegEx ✒ NFA ✒ DFA conversion process.
Why bother?
- no other parts of the compiler have to care about character sets
- simplifies the parser
- allows changing of language keywords etc. [e.g. allowing colour to be treated like an existing keyword color] without having to modify other parts
Copyright (C) 2006-8 Ryan Lothian. All rights reserved.
