Compilers - Lexical Analysis

compsci ✒ compilers ✒ lexical-analysis
Lexical analysis

Lexical analysis converts source code (a list of characters) to a list of tokens.


A simple program goes along the source code trying to match as much as possible at each point.

i.e. (in pseudocode)

lex s = token : lex (drop matchlength s)
    where (token, matchlength) = biggestMatch s

Regular expressions are often used to describe and match tokens.


Example token set (for addition and subtraction of integers)

TokenRegular ExpressionExample match
integer[0-9]+427
plus\++
minus\--

Here, backslash is used to escape the symbols + and - so they don't mean anything special in the regular expression.

With this token set, this string:

123+4+23-1-45+2+0000000

Would be converted to this token stream:

integer("123"), plus, integer("4"), plus, integer("23"), minus, integer("1"), minus, integer("45"), plus, integer("2"), plus, integer("0000000")

A little code could run after lexing to change the strings into numbers, e.g. "00000000" to 0.

Lexing with regular expressions

Simon Thompson has written a an excellent primer on regular expression therory and the RegEx ✒ NFA ✒ DFA conversion process.

Why bother?

comment on this page