Building a Lexical Analyzer for Theory of Computation: Part 3 - Developing the Parser
Introduction
In this post, I will share how I built a parser in Python. The purpose of this post is to understand the tasks performed by a Finite State Machine (FSM): it describes how I developed the FSM that validates the syntax of the programming language.
Armed with the Lexer class, we can now develop the parser. It will use an FSM to validate the syntax of the raw source code, based on the tokens that the Lexer class generates.
The parser
With the transition table already done, we can start with the parser. First, I will add two dictionaries. The first one, called input_transitions_map, links each token type with its column in the transition table.
from src.TokenType import TokenType

# Mapping specific token types to their column in the transition table
input_transitions_map = {
    TokenType["LINE_NUMBER"].value: 0,
    TokenType["BASIC_KEYWORDS"].value[1]: 1,  # INPUT
    TokenType["BASIC_KEYWORDS"].value[0]: 2,  # PRINT
    TokenType["BASIC_KEYWORDS"].value[2]: 3,  # IF
    TokenType["BASIC_KEYWORDS"].value[5]: 4,  # FOR
    TokenType["BASIC_KEYWORDS"].value[8]: 5,  # REM
    # ...
    TokenType["NE"].value: 21,
}
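As a quick, illustrative sanity check (using only the entries shown above), a line-number token maps to column 0 and the INPUT keyword maps to column 1:

# Illustrative lookups against the mapping above
print(input_transitions_map[TokenType["LINE_NUMBER"].value])  # 0
print(input_transitions_map[TokenType["BASIC_KEYWORDS"].value[1]])  # 1 (INPUT)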
The next step is the creation of the second dictionary, called sparse_transitions_table. It is a nested dictionary, meaning a dictionary that contains other dictionaries as its values. Its keys represent the rows of the FSM (the states), and each key holds another dictionary representing the columns of the transition table (the inputs), where each inner key is an input and its value is the next state. By using a sparse transition table, we only have to store valid transitions.
# A nested dictionary representing a sparse FSM transition table.
# Outer keys: FSM state (row index).
# Inner keys: Input symbol (column index).
# Inner values: Next state.
sparse_transitions_table = {
    0: {0: 1},
    1: {
        0: 1,
        1: 2,
        2: 7,
        3: 12,
        4: 24,
        5: 41,
        6: 43,
        7: 38,
        8: 33,
        9: 30,
        10: 32,
    },
    2: {0: 1, 7: 6, 8: 5, 11: 3},
    3: {17: 4},
    # ...
    44: {0: 1, 13: 39},
}
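The payoff of the sparse layout is that a missing entry can stand in for the FSM's error state. Here is a minimal sketch, using only the entries shown above: a lookup with .get and a default of -1 distinguishes valid transitions from invalid ones.

# State 0 only accepts a line number (column 0)
print(sparse_transitions_table[0].get(0, -1))  # 1  -> valid transition
print(sparse_transitions_table[0].get(3, -1))  # -1 -> no entry, invalid input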
In this Python file, I will add the Parser class, which analyses the list of tokens generated by the Lexer class. I give this class attributes to keep track of the current state of our FSM.
class Parser:
    """Represents the parser for source code analysis.

    Attributes:
        tokens_list: A list of tokens generated by the Lexer class.
        line: Tracks the current line number.
        current_state: Represents the current state in the FSM.
    """

    def __init__(self, tokens: list):
        """Creates a parser instance with a list of tokens.

        Args:
            tokens: A list of tokens generated by the Lexer class.
        """
        self.tokens_list = tokens
        self.line = 0
        self.current_state = 0
In this class, we will use two functions: one for token validation and another for error handling. The validate_syntax function reads the tokens one by one while keeping track of the current state; for each token, it updates the current state based on the next transition. If, during the loop, the next state is -1, the error_handler function is called. At the end of the loop, validate_syntax returns a dictionary with the analysis result, which contains two keys:
successful: A boolean indicating if the analysis was successful.
message: "No errors found. Analysis successful." on success, or an error message if an invalid transition occurs.
def validate_syntax(self):
    """Validates the sequence of tokens based on the FSM transition
    states.

    Returns:
        analysis_result: A dictionary with:
            - "successful": True if analysis completes without errors,
              False otherwise.
            - "message": A success message or an error description.
    """
    analysis_result = {
        "successful": True,
        "message": "No errors found. Analysis successful.",
    }
    for token in self.tokens_list:
        # Keywords are looked up by value, every other token by type
        transition_column = (
            input_transitions_map[token.value]
            if token.type == "KEYWORD"
            else input_transitions_map[token.type]
        )
        previous_state = self.current_state
        previous_line = self.line
        self.line = token.line
        # A missing entry in the sparse table means an invalid transition
        self.current_state = sparse_transitions_table[
            self.current_state
        ].get(transition_column, -1)
        if self.current_state == -1:
            error_message = self.error_handler(
                sparse_transitions_table[previous_state],
                previous_line,
            )
            return {"successful": False, "message": error_message}
    return analysis_result
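To make the loop concrete, here is a hand trace (illustrative, using only the table fragment shown earlier) of the statement 10 PRINT. The line number is column 0 and PRINT is column 2, so the FSM moves 0 → 1 → 7:

# Hand-walking the FSM for "10 PRINT" (columns 0 and 2, per the mapping above)
state = 0
for column in [0, 2]:
    state = sparse_transitions_table[state].get(column, -1)
    if state == -1:
        break  # validate_syntax would call error_handler here
print(state)  # 7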
The error_handler function is responsible for generating an appropriate error message when an invalid transition occurs during syntax validation. The message includes the expected inputs, their corresponding symbols, and the line number where the error was detected. This function provides feedback about syntax errors in the source code.
def error_handler(
    self,
    previous_state_dict: dict,
    previous_line: int,
    input_transitions_map: dict = input_transitions_map,
):
    """Handles syntax errors by identifying expected inputs
    and generating an error message.

    Args:
        previous_state_dict: A dictionary representing the previous
            state transitions.
        previous_line: An integer representing the previous line number.
        input_transitions_map: A mapping of valid state transitions.

    Returns:
        message: An error message indicating the expected inputs and the
            line number where the syntax error occurred.
    """
    row_input_transitions_map = {}
    # Swapping keys and values of input_transitions_map, grouping
    # every token that shares the same column index
    for item in input_transitions_map:
        row_input_transitions_map.setdefault(
            input_transitions_map[item], []
        ).append(item)
    # Mapping specific token values to their symbols
    token_type_map = {
        TokenType["SEMMICOLON"].value: ";",
        TokenType["LE"].value: "<=",
        TokenType["GE"].value: ">=",
        TokenType["PLUS"].value: "+",
        TokenType["MINUS"].value: "-",
        TokenType["TIMES"].value: "*",
        TokenType["DIVIDE"].value: "/",
        TokenType["EQUALS"].value: "=",
        TokenType["NE"].value: "<>",
    }
    # Collect expected inputs for the previous state
    expected_inputs = [
        row_input_transitions_map[element]
        for element in previous_state_dict
    ]
    # Flatten the list of lists
    expected_inputs = [item for row in expected_inputs for item in row]
    # Add symbolic representations where applicable
    for index, value in enumerate(expected_inputs):
        if value in token_type_map:
            expected_inputs[index] += " (" + token_type_map[value] + ")"
    message = (
        f"Syntax error: Expected {', '.join(expected_inputs)} "
        f"in line {previous_line}"
    )
    return message
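The key-swapping step at the top of error_handler is a plain dictionary inversion that groups every token sharing the same column index. Here is a standalone sketch of the idiom, with made-up entries for illustration:

# Inverting a mapping with setdefault: values become keys,
# and original keys that collide are collected into a list
column_map = {"LINE_NUMBER": 0, "INPUT": 1, "PRINT": 2}  # hypothetical entries
inverted = {}
for item in column_map:
    inverted.setdefault(column_map[item], []).append(item)
print(inverted)  # {0: ['LINE_NUMBER'], 1: ['INPUT'], 2: ['PRINT']}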
Testing the parser
To verify that the parser works correctly, I will do the same as with the Lexer class. I created a file called test_parser.py, where I define the test functions. In each test, I instantiate the Lexer class, passing raw source code to its constructor, and then call the extract_tokens function. Once the token list is returned, I instantiate the Parser class, pass this list to its constructor, and call the validate_syntax function. Finally, I compare both keys (successful and message) with the expected result to ensure that everything works properly.
from src.Lexer import Lexer
from src.Parser import Parser


def test_validate_input_keyword_string_alnum_identifier():
    """
    Test that an INPUT statement containing a string and an alphanumeric
    identifier is correctly parsed without errors.
    """
    raw_code = '10 INPUT "What is your name:"; NN$'
    tokens = Lexer(raw_code).extract_tokens()
    result = Parser(tokens).validate_syntax()
    assert result.get("successful") is True
    assert result.get("message") == "No errors found. Analysis successful."


def test_validate_input_keyword_alnum_identifier():
    """
    Test that an INPUT statement containing an alphanumeric identifier
    is correctly parsed without errors.
    """
    raw_code = "10 INPUT NN$"
    tokens = Lexer(raw_code).extract_tokens()
    result = Parser(tokens).validate_syntax()
    assert result.get("successful") is True
    assert result.get("message") == "No errors found. Analysis successful."


# ...


def test_validate_rem_keyword_comment():
    """
    Test that a statement containing a REM keyword and a comment
    is correctly parsed without errors.
    """
    raw_code = "25 REM This is a comment"
    tokens = Lexer(raw_code).extract_tokens()
    result = Parser(tokens).validate_syntax()
    assert result.get("successful") is True
    assert result.get("message") == "No errors found. Analysis successful."
Once we have this in place, we can run our tests with pytest, and they should all pass.
Creating a GUI for the Lexical Analyzer
For this project, I will be using Tkinter, the standard Python GUI library. It provides a set of tools and widgets for creating desktop applications with graphical interfaces, including various controls such as buttons, labels, text boxes, checkboxes, and more.
I will be using Tkinter Designer to speed up the GUI development process in Python. This tool uses Figma to make Tkinter GUIs: it analyses the design file and generates the code and files needed for the GUI.
Once I finalize the design in Figma, I will pass the required information to Tkinter Designer to generate the code for this design. From there, I will implement the functionality of the lexical analyzer, integrating it into the interface.
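The generated code is specific to my Figma design, so I won't reproduce it here. Instead, here is a minimal hand-written sketch (the widget names and layout are my own for illustration, not Tkinter Designer's output) of how the Lexer and Parser plug into a Tkinter window:

import tkinter as tk

from src.Lexer import Lexer
from src.Parser import Parser


def run_analysis():
    """Reads the source code from the text box and shows the result."""
    raw_code = code_input.get("1.0", tk.END).strip()
    tokens = Lexer(raw_code).extract_tokens()
    result = Parser(tokens).validate_syntax()
    result_label.config(text=result["message"])


window = tk.Tk()
window.title("Lexical Analyzer")

code_input = tk.Text(window, height=10, width=60)
code_input.pack(padx=10, pady=10)

tk.Button(window, text="Analyze", command=run_analysis).pack()

result_label = tk.Label(window, text="")
result_label.pack(padx=10, pady=10)

window.mainloop()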
Final Result
Here is the completed project:
In the first picture, you can see when the analysis is successful.
In the second image, you can see when it detects an error.
🔗 Check out the repository here: https://github.com/Saul-Lara/Lexical-Analyzer
Conclusion
This project serves as a demonstration of how a lexical analyzer can be implemented, showing the whole process: breaking raw source code down into meaningful tokens, analyzing them using a finite state machine (FSM), and returning a result.