Java Programming:
Introduction: Lexer 1
In this project, we will begin our lexer. Our lexer will start by reading the strings of the .shank file
that the user wants to run. It will break the Shank code up into "words" or tokens and build a
collection of these tokens. We can consider the lexer complete when it can take any Shank file
and output a list of the tokens generated.
We will not be using the Scanner class that you may be familiar with for reading from a file; we
are, instead, using Files.readAllLines. This is a much simpler way of dealing with files.
Example of readAllLines (imports shown for completeness):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

Path myPath = Paths.get("someFile.shank");
List<String> lines = Files.readAllLines(myPath, StandardCharsets.UTF_8);
A second concept you may not be familiar with is the "enum" (short for enumeration). This is a
Java language construct that lets us create a variable that may hold any one of a fixed list of
values. We will use this to define the types of tokens - a tokenType.
enum colorType { RED, GREEN, BLUE }
colorType myFavorite = colorType.BLUE;
System.out.println(myFavorite); // prints BLUE
Details
To start with, we will build a lexer that accepts words and numbers and generates a collection of
tokens. There are three types of tokens in this assignment - WORD, NUMBER and ENDOFLINE.
A word is defined as a letter (upper or lower case) and then any number of letters or numbers. In
regular expression terms: [A-Za-z][A-Za-z0-9]* - anything not a letter or number ends the word.
A number is defined as integer or floating point; in regular expressions: [0-9]*[.]?[0-9]+ - anything
else ends the number.
Any character that is not word or number is not a token (for now) but does show the end of a
token.
For example:
word1 123.456 2
would lex to 4 tokens:
WORD (word1)
NUMBER (123.456)
NUMBER (2)
ENDOFLINE
The lexer class will hold a collection of tokens; it will start out empty and calls to lex() will add to it.
You must use a state machine to keep track of what type of token you are in the middle of. Any
character that is not a letter or number will reset the state machine and output the current token.
The end of the line will also cause the current token to be output. Output an ENDOFLINE token at
the end of each line.
This assignment must have three different source code files.
One file must be called Shank.java.
Shank.java must contain main. Your main must ensure that there is exactly one command-line
argument (args). If there are none, or more than one, it must print an appropriate error message
and exit. That one argument will be treated as a filename. Your main must then use
Files.readAllLines to read all of the lines from the file denoted by that filename. Your main must
instantiate one instance of your Lexer class (defined below). You must lex all lines using the lex
method of the Lexer class (calling it repeatedly, once per line). If lex throws an exception, you
must catch it and print that there was an exception. Once lexing is complete, you must print each
token out (this is a temporary step to show that it works).
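One possible shape for Shank.java is sketched below. The readSource helper is our own choice (split out so the file-reading step is easy to exercise on its own, not something the assignment requires), and the lexing calls are shown as comments because the Lexer and Token classes are defined separately:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class Shank {
    // Reads the whole source file; separated out so it is easy to test.
    static List<String> readSource(String filename) throws IOException {
        return Files.readAllLines(Paths.get(filename), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        if (args.length != 1) {              // exactly one argument allowed
            System.err.println("Usage: java Shank <file.shank>");
            System.exit(1);
        }
        try {
            List<String> lines = readSource(args[0]);
            // Lexer lexer = new Lexer();                    // defined in Lexer.java
            // for (String line : lines) lexer.lex(line);    // one call per line
            // lexer.getTokens().forEach(System.out::println); // temporary output
            System.out.println(lines.size() + " lines read");
        } catch (Exception e) {
            System.err.println("There was an exception: " + e.getMessage());
        }
    }
}
```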
One file must be called Token.java. This file must contain a Token class. The token class is made
up of an instance of an enum (tokenType) and a value string. There must be a public accessor for
both the enum and the value string; the underlying variables must be private. You may create
whatever constructors you choose. The enum must be defined as containing values appropriate
to what we will be processing. The definition of the enum should be public, but the instance inside
Token must be private. We will add to this enum in the next several assignments. You will find it
helpful to create an appropriate toString() override. The enum should be inside the Token class,
so you will reference it as:
Token.tokenType.WORD
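A minimal sketch of Token.java under these requirements; the accessor names getType and getValue and the exact constructor are our choices, since the assignment leaves those open:

```java
// Token.java sketch: private fields, public accessors, public nested enum.
public class Token {
    public enum tokenType { WORD, NUMBER, ENDOFLINE }

    private final tokenType type;
    private final String value;

    public Token(tokenType type, String value) {
        this.type = type;
        this.value = value;
    }

    public tokenType getType() { return type; }
    public String getValue() { return value; }

    @Override
    public String toString() {
        // Prints e.g. "WORD (word1)"; types with no value print just the type.
        return (value == null || value.isEmpty())
                ? type.toString()
                : type + " (" + value + ")";
    }
}
```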
The final file must be called Lexer.java. The Lexer class must contain a lex method that accepts a
single string and returns nothing. The lex method must use one or more state machine(s) to
iterate over the input string and create appropriate Tokens. Any character not allowed by your
state machine(s) should throw an exception. The lexer needs to accumulate characters for some
types (consider 123 - we need to accumulate 1, then 2, then 3, then the state machine can tell that
the number is complete because the next character is not a number). You may not use regular
expressions to do your lexical analysis - you must build your own state machine(s).
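The accumulate-then-emit state machine described above can be sketched as follows. For self-containment this version stores tokens as plain "TYPE(value)" strings rather than Token objects, and the State enum and emit helper are our own naming, not part of the assignment:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal state-machine lexer for WORD / NUMBER / ENDOFLINE.
public class Lexer {
    private enum State { NONE, WORD, NUMBER }
    private final List<String> tokens = new ArrayList<>();

    public void lex(String line) {
        State state = State.NONE;
        StringBuilder current = new StringBuilder();
        for (char c : line.toCharArray()) {
            if (Character.isLetter(c)) {
                if (state == State.NUMBER)
                    emit(state, current);    // e.g. "2a": the number ends here
                state = State.WORD;
                current.append(c);
            } else if (Character.isDigit(c)) {
                if (state == State.NONE)
                    state = State.NUMBER;    // a digit also continues a WORD
                current.append(c);
            } else if (c == '.') {
                if (state == State.WORD)
                    emit(state, current);    // '.' ends a word...
                state = State.NUMBER;        // ...but starts/continues a number
                current.append(c);
            } else {
                emit(state, current);        // any other character ends the token
                state = State.NONE;
            }
        }
        emit(state, current);                // end of line ends the token too
        tokens.add("ENDOFLINE");
    }

    private void emit(State state, StringBuilder current) {
        if (state != State.NONE && current.length() > 0)
            tokens.add(state + "(" + current + ")");
        current.setLength(0);
    }

    public List<String> getTokens() { return tokens; }
}
```

Lexing "word1 123.456 2" with this sketch yields WORD(word1), NUMBER(123.456), NUMBER(2), ENDOFLINE, matching the example above.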
Below is the rubric (not included in this copy).
Introduction: Lexer 2
In this assignment, we will work toward completing our lexer. We will start by switching from just
"WORD" to actual keywords. We will add the punctuation that we need. We will process string and
character literals ("hello" 'a'), deal with/ignore comments and finally, we will deal with indentation
levels.
There are a lot of words in Shank that we must deal with. When we read (in English), we
recognize words from the words that we have learned in the past. You might say that we associate
a word with a concept or an idea. This association is a powerful concept and Java has support for
it with HashMap.
A HashMap is a data structure that maps (in the mathematical sense), or associates, two values.
It can also be called a "key-value store": you provide a key and get back a value. This is perfect
for looking up words. We want to look up a string ("while", for example) and get back a token
type (tokenType.WHILE).
HashMap<String, tokenType> knownWords = new HashMap<String, tokenType>();
knownWords.put("while", tokenType.WHILE);
boolean doWeHaveWhile = knownWords.containsKey("while");
tokenType whileType = knownWords.get("while");
Details
Look through the Language Description and build a list of keywords. Add a HashMap to the Lexer
class and initialize all the keywords. Change the lexer so that it checks each string before making
the WORD token and creates a token of the appropriate type if the word is a keyword. When the
exact type of a token is known (like "WHILE"), you should NOT fill in the value string; the type is
enough. For tokens with no exact type (like "hello"), we still need to fill in the token's string. Finally,
rename "WORD" to "IDENTIFIER".
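The keyword check can be sketched like this; the keyword list here is illustrative only (the full set comes from the Language Description), and the classify helper name is ours:

```java
import java.util.HashMap;

// Keyword lookup done just before a WORD/IDENTIFIER token would be made.
public class Keywords {
    public enum tokenType { IDENTIFIER, WHILE, IF, DEFINE }

    private static final HashMap<String, tokenType> knownWords = new HashMap<>();
    static {
        // Illustrative subset of Shank keywords.
        knownWords.put("while", tokenType.WHILE);
        knownWords.put("if", tokenType.IF);
        knownWords.put("define", tokenType.DEFINE);
    }

    // Returns the keyword's type, or IDENTIFIER when the word is not known.
    public static tokenType classify(String word) {
        return knownWords.getOrDefault(word, tokenType.IDENTIFIER);
    }
}
```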
Similarly, look through the Language Description for the list of punctuation. A hash map is not
necessary or helpful for these - they need to be added to the state machine. Be particularly careful
about the multi-character operators like := or >=. These require a little more complexity in your
state machine. See the comment state machine example for an idea on how to implement this.
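One common way to handle multi-character operators is to peek at the next character before deciding which token to make. A sketch, with an illustrative operator set (the real set comes from the Language Description):

```java
// Peek-ahead handling for one- vs two-character punctuation.
public class Punctuation {
    // Returns the operator starting at position i: ":=", ">=", "<="
    // when the second character matches, otherwise the single character.
    public static String operatorAt(String line, int i) {
        char c = line.charAt(i);
        char next = (i + 1 < line.length()) ? line.charAt(i + 1) : '\0';
        if ((c == ':' || c == '>' || c == '<') && next == '=')
            return "" + c + next;      // two-character operator
        return String.valueOf(c);      // single-character operator
    }
}
```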
Strings and characters will require some additions to your state machine. Create
"STRINGLITERAL" and "CHARACTERLITERAL" token types. These cannot cross line
boundaries. Note that we aren't going to build in escaping the way Java does; there is no way to
put a double quote inside a string literal or a single quote inside a character literal.
Comments, too, require a bit more complexity in your state machine. When a comment starts, you
need to accept and ignore everything until the closing comment character. Assume that comments
cannot be nested - {{this is invalid} and will be a syntax error later}. Remember, though, that
comments can span lines, unlike numbers or words or symbols; no token should be output for
comments.
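Because comments can span lines, the "in a comment" flag must survive across calls to lex(). A sketch of that idea in isolation (the class and method names are ours; in the real lexer this would be part of the state machine rather than a separate pass):

```java
// Comment state that carries over between lines, as lex() is called per line.
public class CommentSkipper {
    private boolean inComment = false;

    // Returns the line with { ... } comment text removed; the inComment
    // flag persists between calls, so comments may span lines.
    public String stripComments(String line) {
        StringBuilder kept = new StringBuilder();
        for (char c : line.toCharArray()) {
            if (inComment) {
                if (c == '}') inComment = false;  // closing brace ends it
            } else if (c == '{') {
                inComment = true;                 // opening brace: start ignoring
            } else {
                kept.append(c);
            }
        }
        return kept.toString();
    }
}
```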
Your lexer should throw an exception if it encounters a character that it doesn't expect outside of a
comment, string literal or character literal. Create a new exception type that includes a good error
message and the token that failed. Ensure that the exception's toString() method prints nicely. An example of
this might be:
ThisIsAnIdentifier 123 ! { <- that exclamation is unexpected }
Add "line number" to your Token class. Keep track of the current line number in your lexer and
populate each Token's line number; this is straightforward because each call to lex() will be one
line greater than the last one. The line number should be added to the exception, too, so that
users can fix the exceptions.
Finally, indentation. This is not as bad as it seems. For each line, count from the beginning the
number of spaces and tabs until you reach a non-space/tab. Each tab OR four spaces is an
indentation level. If the indentation level is greater than the last line (keep track of this in the lexer),
output one or more INDENT tokens. If the indentation level is less than the last line, output one or
more DEDENT tokens (obviously you will need to make new token types). For example:
1 { indent level 0, output NUMBER 1 }
a { indent level 1, output an INDENT token, then IDENTIFIER a }
b { indent level 2, output an INDENT token, then IDENTIFIER b }
c { indent level 4, output 2 INDENT tokens, then IDENTIFIER c }
2 { indent level 0; output 4 DEDENT tokens, then NUMBER 2 }
Be careful of the two exceptions:
If there are no non-space/tab characters on the line, don't output an INDENT or DEDENT and
don't change the stored indentation level.
If we are in the middle of a multi-line comment, indentation is not considered.
Note that at end of file, you must output DEDENTs to get back to level 0.
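The indentation logic above, including the blank-line exception, can be sketched like this; representing INDENT/DEDENT as plain strings and the indentTokens method name are our simplifications:

```java
import java.util.ArrayList;
import java.util.List;

// One tab OR four spaces = one indentation level; the lexer remembers the
// previous line's level and emits INDENT/DEDENT markers for the change.
public class Indentation {
    private int lastLevel = 0;

    public List<String> indentTokens(String line) {
        List<String> out = new ArrayList<>();
        int spaces = 0, level = 0, i = 0;
        while (i < line.length() && (line.charAt(i) == ' ' || line.charAt(i) == '\t')) {
            if (line.charAt(i) == '\t') { level++; spaces = 0; }
            else if (++spaces == 4)     { level++; spaces = 0; }
            i++;
        }
        if (i == line.length())
            return out;                 // blank line: no tokens, level unchanged
        for (; lastLevel < level; lastLevel++) out.add("INDENT");
        for (; lastLevel > level; lastLevel--) out.add("DEDENT");
        return out;
    }
}
```

Feeding the five example lines above through this sketch produces no indent tokens, then 1 INDENT, 1 INDENT, 2 INDENTs, and finally 4 DEDENTs, matching the example.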
Requirements:
Your exception must be called "SyntaxErrorException" and be in its own file. Unterminated strings
or characters are invalid and should throw this exception, along with any invalid symbols.
Below is the rubric:

State machine(s) - numbers handled: Nonexistent or never correct (0); Some cases handled (7);
Most cases handled (13); All cases handled (20)
End of Line output: None (0); Outputs (5)