Ken Thompson (sitting) with Dennis Ritchie. Ken Thompson wrote a 1968 journal paper on his regexp implementation for the QED editor, which he wrote in assembly language on an IBM 7094 running the CTSS OS.

Lexical Analyzer for Decaf

Start on May 26, 2025 | Due on Jun 9, 2025 | With grace days Jun 11, 2025

Your task for this homework is to write a lexical analyzer (lexer for short) for Decaf, the programming language designed specifically for this course.

Lex

We will be using a lexical analyzer generator called Lex to do this homework. It is very important that you work through the Lex practice problems before you start programming for this homework.

The programming tool is called lex and the implementation we will be using is GNU flex.

Getting Started

You must have git and python (3.x) on your system to run the assignments. Once you’ve confirmed this, run this command:

git clone https://github.com/anoopsarkar/compilers-class-hw.git

In the decaflex directory you will find various python programs which you will use to test your solution to this homework.

You will get updates to the homework files by going to the directory where you cloned the repository and then doing:

# go to the directory where you did a git clone for HW1
git pull origin master

To get started with your homework do the following steps.

Set up your repository

Make sure you have already followed the instructions in HW0 to set up your GitHub repository.

Copy over files

Clone your repository, enter that directory, and copy over the files:

git clone git@github.sfu.ca:GROUPUSER/CMPT379-1254-g-GROUP.git
cd CMPT379-1254-g-GROUP
mkdir -p decaflex
cd decaflex
cp -r /your-path-to/compilers-class-hw/decaflex/* .
git add *
git commit -m 'initial commit'
git push

If you later update my repository using git pull, you may need to copy the new files into your repository again. Be careful not to clobber your own files in the answer directory.

Default solution

Your solution must be compiled in the answer directory and must be called decaflex. There is an incomplete solution to this homework in answer/default.lex. Copy it over as your initial solution:

cd your-repo-name/answer
cp default.lex decaflex.lex
make decaflex

The Challenge

The goal of this homework is to write a lexical analyzer for the Decaf programming language. The details of the lexical elements in Decaf are in the Decaf specification:

Decaf specification

Read the specification carefully, at least up to the section called Decaf Program Structure, and pay particular attention to the section called List of Tokens.

The lexical analyzer produces a stream of tokens for a given Decaf program. The input is taken from stdin (standard input) and the output token stream is sent to stdout (standard output). You must issue errors on the stderr (standard error) stream.
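The stream conventions above can be exercised from a small script. The following Python sketch is only an illustration and is not how the provided helper programs work; run_lexer is a hypothetical helper name, and the command passed to it is whatever binary you want to test.

```python
import subprocess

def run_lexer(cmd, source):
    """Feed `source` to the command on stdin.

    Returns (exit status, stdout text, stderr text) so that the token
    stream and the error messages can be inspected separately.
    """
    proc = subprocess.run(cmd, input=source, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr
```

Assuming you have built decaflex in the current directory, a call such as run_lexer(["./decaflex"], "package Test { }\n") returns the exit status together with the token stream from stdout and any error text from stderr.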

For example, for the input Decaf program:

package Test { func main() int { } }

The lexical analyzer produces the following token stream:

T_PACKAGE package
T_WHITESPACE
T_ID Test
T_WHITESPACE
T_LCB {
T_WHITESPACE
T_FUNC func
T_WHITESPACE
T_ID main
T_LPAREN (
T_RPAREN )
T_WHITESPACE
T_INTTYPE int
T_WHITESPACE
T_LCB {
T_WHITESPACE
T_RCB }
T_WHITESPACE
T_RCB }
T_WHITESPACE \n

The default lexer you were provided does work for this input. Run it and see:

# go to the answer directory and build your binary (see instructions above)
./decaflex < ../testcases/dev/default-passes.decaf

The full list of tokens is provided in the section List of Tokens in the Decaf specification.
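To make the mapping from input text to tokens concrete, here is a minimal Python sketch of a regular-expression tokenizer for the example program above. It is not a flex solution and it covers only the tokens that appear in that example. The identifier pattern and the set of whitespace characters are assumptions; the Decaf specification is the authority for both. Note also that this sketch resolves keyword/identifier clashes by ordering and word boundaries, whereas lex prefers the longest match and then the earliest rule.

```python
import re

# Token patterns for the subset of Decaf used in the example program.
# Keywords come before T_ID and end in \b, so "int" is T_INTTYPE but
# "int8" or "integer" falls through to T_ID.
TOKEN_SPEC = [
    ("T_PACKAGE", r"package\b"),
    ("T_FUNC", r"func\b"),
    ("T_INTTYPE", r"int\b"),
    ("T_ID", r"[A-Za-z_][A-Za-z0-9_]*"),   # assumed identifier shape
    ("T_LCB", r"\{"),
    ("T_RCB", r"\}"),
    ("T_LPAREN", r"\("),
    ("T_RPAREN", r"\)"),
    ("T_WHITESPACE", r"[ \t\r\n]+"),       # assumed whitespace set
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token name, lexeme) pairs for `text`."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"lexical error at offset {pos}: {text[pos]!r}")
        lexeme = m.group()
        if m.lastgroup == "T_WHITESPACE":
            # Show newlines as the two-character string \n, as in the sample output.
            lexeme = lexeme.replace("\n", "\\n")
        tokens.append((m.lastgroup, lexeme))
        pos = m.end()
    return tokens

if __name__ == "__main__":
    for name, lexeme in tokenize("package Test { func main() int { } }\n"):
        print(name, lexeme)
```

Each whitespace run becomes a single T_WHITESPACE token whose lexeme holds the whole run, which is why the final token carries the escaped newline.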

Your Task

Using the Decaf language specification as your guide, provide a lex program that is a lexical analyzer for the Decaf language.

Make sure you obey the following requirements:

- If your program successfully tokenizes the input, exit using exit(EXIT_SUCCESS). If it finds a lexical error, exit using exit(EXIT_FAILURE). The definitions of EXIT_SUCCESS and EXIT_FAILURE are in cstdlib (for C++) and in stdlib.h (for C).
- The token names and lexeme values must be identical to the sample output provided to you in the testcases directory.
- You must use the token names provided in the List of Tokens section of the Decaf specification.
- You must include special whitespace and comment tokens. The whitespace token should have a lexeme value that includes all the whitespace characters. The whitespace and comment lexemes should convert the newline character into the literal string \n so that the line number and character number of each token can be recovered from the lexical analyzer output.
- Provide appropriate error reporting with the line number and the location in the line where the error was detected. Note that check.py does not check the contents of the error message, only the return value from your lexer.
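The reason for writing newlines as the literal string \n is that token positions can then be recomputed from the token stream alone. The following Python sketch shows one way to do that, assuming the tokens are already available as (name, lexeme) pairs. The function name token_positions is hypothetical, and T_COMMENT is an assumed name for the comment token; check the List of Tokens for the real one. One limitation: a comment that itself contains a backslash followed by n would be indistinguishable from an escaped newline.

```python
# Token names whose lexemes have newlines escaped as the two characters \n.
# T_COMMENT is an assumed name; use the one from the Decaf specification.
ESCAPED_TOKENS = ("T_WHITESPACE", "T_COMMENT")

def token_positions(tokens):
    """Return (line, column, name, lexeme) for each token, both 1-based."""
    line, col = 1, 1
    positions = []
    for name, lexeme in tokens:
        positions.append((line, col, name, lexeme))
        # Undo the escaping only for whitespace and comments; in other tokens
        # a backslash followed by n is two ordinary source characters.
        raw = lexeme.replace("\\n", "\n") if name in ESCAPED_TOKENS else lexeme
        for ch in raw:
            if ch == "\n":
                line += 1
                col = 1
            else:
                col += 1
    return positions
```

The same running line and column counters are what a lexer would keep internally in order to report where a lexical error occurred.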

Development and upload procedure

Remember to push your solution source code to your git repository:

git add decaflex.lex
git commit -m 'initial solution'
git push

Then each time you finish a component of your solution you can push it to the remote repository:

git add decaflex.lex # or other files you worked on
git commit -m 'commit message' decaflex.lex # or other files you worked on
git push

You have been given three helper programs (zipout.py, check.py and zipsrc.py) to support you while you develop your solution to this homework.

Run your solution on testcases

Run your solution program on the testcases using the Python program zipout.py. Your solution must be compiled in the answer directory and must be called decaflex. Run against all testcases as follows:

# go to the directory with the file zipout.py
python3 zipout.py

This creates a directory called output and a file output.zip which can be checked against the reference output files (see section on Check your solution below).

If you run zipout.py multiple times it will overwrite your output directory and zip file which should be fine most of the time (but be careful).

Check your solution

Check the accuracy of your solution using the Python program check.py. You must first create an output.zip file by following the step in Run your solution on testcases above. Note that the references are only available for the dev testcases. When you are graded you will be evaluated on both the dev and test testcases. output.zip contains your output for both sets of testcases.

python3 check.py 
Correct(dev): 4 / 59
Score(dev): 4.00
Total Score: 4.00

Package your source for Coursys

You must also upload your source code to Coursys. You should prepare your source for upload using the Python program zipsrc.py.

# go to the directory with the file zipsrc.py
python3 zipsrc.py

This will create a zip file called source.zip. You should upload this file as your submission to hw1 on Coursys.

Be careful: zipsrc.py will only package files in the answer directory. Make sure you have put all your supporting files in that directory. In particular, put relevant documentation into answer/README.md.

If you add any testcases of your own please put them in the directories answer/testcases/[your-username]/ and answer/references/[your-username]/ using the same convention used by zipout.py and check.py.

Ground Rules

You must turn in two things:
- Your source code from the answer directory, packaged as source.zip by running python3 zipsrc.py, must be uploaded to the hw1 submission page on Coursys.
- Your output on the testcases, packaged as output.zip by running python3 zipout.py, must be uploaded to the hw1 submission page on Coursys. When we run check.py on the public testcases, your score must be higher than the score of the default.lex program for you to get any marks.
Your source code from source.zip must be on your GitHub repository.
Make sure that we can run make decaflex in your answer directory to create the decaflex binary.
You cannot use data or code resources outside of what is provided to you. If you use external code snippets provide citations in the answer/README.md file.
For the written description of your submission and supporting documentation, you can use plain ASCII but for math equations it is better to use kramdown. Do not use any proprietary or binary file formats such as Microsoft Word.

Grading

Score on both the dev and test testcases.
Code review by TAs. Please check for comments on your code on GitHub.

If you have any questions or you’re confused about anything, just ask.