Introduction
In this post, I will share my approach to building a lexical analyzer, and I hope this information helps you if your Theory of Computation teacher asks you to do this as a final project.
The purpose of this post is to explain the tasks performed by the lexical analyzer and how they are carried out. It describes how to build the first phase of a compiler, the lexer, for a programming language; in this case, BASIC.
I used Python to develop this project because it is the language I have been using lately.
What's a compiler?
A compiler is just a program that translates other programs. Traditional compilers translate source code into executable machine code that your computer understands. The compiler derives its name from the way it works, looking at the entire piece of source code and collecting and reorganizing the instructions. The compilation process is a sequence of various phases.
What's a lexical analyzer?
The lexer is also called the lexical analyzer or tokenizer. In this first phase, the compiler breaks the submitted source code into meaningful elements called lexemes and, from those lexemes, generates a sequence of tokens, where each token corresponds to a unit in the grammar of our language. For example, these are some valid tokens in the BASIC programming language:
15 (NUMBER)
PRINT (KEYWORD)
“Hello” (STRING)
i (IDENTIFIER_ALNUM)
The lexer reads the input source code character by character and removes any whitespace or comments. At the end of the process, it represents these lexemes in the form of tokens:
<token-name, attribute-value>
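As a rough illustration (the exact representation is up to the implementation), such a pair can be modeled in Python as a small named tuple, using the example lexemes above; the attribute value is often the lexeme itself or a reference to a symbol-table entry:

```python
from collections import namedtuple

# A token pairs a token name with an attribute value.
Token = namedtuple("Token", ["name", "value"])

# The example lexemes above, represented as tokens:
tokens = [
    Token("NUMBER", "15"),
    Token("KEYWORD", "PRINT"),
    Token("STRING", '"Hello"'),
    Token("IDENTIFIER_ALNUM", "i"),
]
```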
If the lexical analyzer finds an invalid token, it generates an error. The lexical analyzer works closely with the syntax analyzer: it reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer when the latter demands it.
Design of the lexer
This section introduces the workflow and briefly explains how the lexical analyzer is built.
Scanner or Tokenizer
Scanners may be handwritten or automatically generated by a lexical analyzer generator (e.g., Lex or Flex). For this project, the lexer will be implemented as a class whose constructor takes an input string as a parameter (the source code to perform lexical analysis on). It exposes a method that recognizes and returns the next token in the input. In this case, regular expressions will be used to recognize the tokens of the BASIC language.
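A minimal sketch of such a class, assuming a handful of illustrative token rules (the keyword list and the next_token name here are my own assumptions, not the full BASIC specification):

```python
import re

# Illustrative token rules: each pair maps a token name to a regular
# expression. Order matters: keywords are tried before identifiers.
TOKEN_SPEC = [
    ("NUMBER",           r"\d+"),
    ("KEYWORD",          r"(?:PRINT|LET|IF|THEN|GOTO)\b"),
    ("STRING",           r'"[^"]*"'),
    ("IDENTIFIER_ALNUM", r"[A-Za-z][A-Za-z0-9]*"),
    ("SKIP",             r"[ \t]+"),  # whitespace to discard
]

MASTER_RE = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC))

class Lexer:
    def __init__(self, source):
        self.source = source  # the source code to analyze
        self.pos = 0          # current position in the input

    def next_token(self):
        """Recognize and return the next (token-name, lexeme) pair, or None at end of input."""
        while self.pos < len(self.source):
            match = MASTER_RE.match(self.source, self.pos)
            if match is None:
                raise SyntaxError(f"Illegal character {self.source[self.pos]!r} at position {self.pos}")
            self.pos = match.end()
            if match.lastgroup != "SKIP":  # drop whitespace
                return (match.lastgroup, match.group())
        return None
```

For example, calling next_token() repeatedly on the input "PRINT 15" yields ("KEYWORD", "PRINT") and then ("NUMBER", "15").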
All possible lexemes that can appear in code written in a programming language are described in the specification of that language as a set of rules called the lexical grammar. Rules in the lexical grammar are often transformed into automata called finite state machines (FSMs).
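For instance, a typical lexical-grammar rule for identifiers reads "a letter followed by zero or more letters or digits". Expressed as a regular expression (this particular rule is an assumption for illustration, not quoted from the BASIC specification):

```python
import re

# Rule: identifier -> letter (letter | digit)*
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")

print(bool(IDENTIFIER.fullmatch("i")))       # True
print(bool(IDENTIFIER.fullmatch("count2")))  # True
print(bool(IDENTIFIER.fullmatch("2count")))  # False: must start with a letter
```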
Finite State Machines
A Finite State Machine or FSM is an abstract machine that is in one and only one state at any point in time. The FSM can change from one state to another as a result of an event. Changing from a state to another is called a transition. To recognize a token described by a regular definition, the regular expression in the definition is transformed into a FSM. The resulting FSM has a finite number of states comprising an initial state and a set of accepting states.
The FSM moves from one state to another by consuming one of the characters or elements in the regular expression. Following the transitions from the initial state to one of the accepting states yields a valid string described by the regular expression. To conclude this part on FSMs, let's see how we can use one to recognize identifiers.
The rows of the transition table are labeled with the states of the FSM and the columns with all the characters or elements that can possibly be consumed. Each cell of the table contains the state the FSM will be in if the character at the corresponding column is consumed while the FSM is in the state at the corresponding row. A cell containing the letter E means that there is no transition in the FSM for the corresponding state and character. The transition table provides a quick visual reference when we are writing a transition function.
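Here is a sketch of this idea in Python for the identifier rule above; the state names are my own, and E plays the role of the error cell described earlier:

```python
# States: "start" (initial), "ident" (accepting), "E" (no transition / error).
def char_class(ch):
    """Map a character to a column of the transition table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# Transition table: rows are states, columns are character classes.
TRANSITIONS = {
    ("start", "letter"): "ident",
    ("start", "digit"):  "E",
    ("start", "other"):  "E",
    ("ident", "letter"): "ident",
    ("ident", "digit"):  "ident",
    ("ident", "other"):  "E",
}

ACCEPTING = {"ident"}

def is_identifier(lexeme):
    """Run the FSM over the lexeme; accept if it ends in an accepting state."""
    state = "start"
    for ch in lexeme:
        state = TRANSITIONS[(state, char_class(ch))]
        if state == "E":
            return False
    return state in ACCEPTING

print(is_identifier("i"))    # True
print(is_identifier("x1"))   # True
print(is_identifier("1x"))   # False: starts with a digit
```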
Conclusion
I have now shared the conceptual information about the workflow of the lexical analyzer, along with some concepts we need in order to understand this process well.
In the next article, I will share the development process of creating a lexical analyzer: how to break raw source code down into meaningful tokens, analyze them with a finite state machine (FSM), and return a table with the identified tokens.