Lexical Analyzers and Parsers


Published in: Technology, Education

  1. 1. Heshan T. Suriyaarachchi
  2. 3. <ul><ul><li>Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. </li></ul></ul><ul><ul><li>Programs performing lexical analysis are called lexical analyzers or lexers. </li></ul></ul><ul><ul><li>A lexer consists of a scanner and a tokenizer. </li></ul></ul>
  3. 5. <ul><ul><li>A lexical analyzer breaks an input stream of characters into tokens. </li></ul></ul><ul><ul><li>Writing lexical analyzers by hand can be a tedious process, so software tools have been developed to ease this task. </li></ul></ul><ul><ul><li>Perhaps the best-known such utility is Lex. Lex is a lexical analyzer generator for the UNIX operating system, targeted to the C programming language. </li></ul></ul>
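
To make the idea of breaking a character stream into tokens concrete, here is a minimal hand-written lexer sketch. The class name `TinyLexer`, the token labels, and the three token kinds are assumptions chosen for illustration; they do not come from Lex or JLex.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written lexer for illustration only; not part of Lex or JLex.
class TinyLexer {
    // Splits the input into number, identifier, and single-character symbol tokens.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace separates tokens and is discarded
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add("NUMBER(" + input.substring(start, i) + ")");
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) i++;
                tokens.add("IDENT(" + input.substring(start, i) + ")");
            } else {
                tokens.add("SYMBOL(" + c + ")"); // any other character is its own token
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("count = count + 42;"));
    }
}
```

Even this small example needs an explicit loop and index bookkeeping for every token kind, which is the tedium that generators such as Lex are meant to remove.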
  4. 6. <ul><ul><li>Lex takes a specially-formatted specification file containing the details of a lexical analyzer. This tool then creates a C source file for the associated table-driven lexer. </li></ul></ul><ul><ul><li>The JLex utility is based upon the Lex lexical analyzer generator model. </li></ul></ul><ul><ul><li>JLex takes a specification file similar to that accepted by Lex, then creates a Java source file for the corresponding lexical analyzer. </li></ul></ul>
  5. 7. <ul><ul><li>A JLex input file is organized into three sections, separated by double-percent directives (&quot;%%&quot;). </li></ul></ul><ul><ul><li>A proper JLex specification has the following format. </li></ul></ul>
  6. 8. <ul><li>user code %% JLex directives %% regular expression rules </li></ul><ul><li>The &quot;%%&quot; directives distinguish sections of the input file and must be placed at the beginning of their line. </li></ul>
  7. 9. <ul><ul><li>The user code section - the first section of the specification file - is copied directly into the resulting output file. This area of the specification provides space for the implementation of utility classes or return types. </li></ul></ul><ul><ul><li>The JLex directives section is the second part of the input file. Here, macro definitions are given and state names are declared. </li></ul></ul><ul><ul><li>The third section contains the rules of lexical analysis, each of which consists of three parts: an optional state list, a regular expression, and an action. </li></ul></ul>
  8. 10. <ul><ul><li>The user code is copied verbatim into the lexical analyzer source file that JLex outputs, at the top of the file. </li></ul></ul><ul><ul><li>Therefore, if the lexer source file needs to begin with a package declaration or with the import of an external class, the user code section should begin with the corresponding declaration. </li></ul></ul><ul><ul><li>This declaration will then be copied to the top of the generated source file. </li></ul></ul>
  9. 11. <ul><ul><li>The JLex directive section begins after the first &quot;%%&quot; and continues until the second &quot;%%&quot; delimiter. Each JLex directive should be contained on a single line and should begin that line. </li></ul></ul>
  10. 12. <ul><ul><li>The third part of the JLex specification consists of a series of rules for breaking the input stream into tokens. </li></ul></ul><ul><ul><li>These rules specify regular expressions, then associate these expressions with actions consisting of Java source code. </li></ul></ul><ul><ul><li>The rules have three distinct parts: the optional state list, the regular expression, and the associated action. This format is represented as follows. </li></ul></ul><ul><ul><li>[&lt;states&gt;] &lt;expression&gt; { &lt;action&gt; } </li></ul></ul>
  11. 13. <ul><ul><li>If more than one rule matches the input, the generated lexer resolves the conflict greedily, choosing the rule that matches the longest string. </li></ul></ul><ul><ul><li>If several rules match strings of the same length, the rule appearing earlier in the specification is given the higher priority. </li></ul></ul><ul><ul><li>If the generated lexical analyzer receives input that does not match any of its rules, an error will be raised. </li></ul></ul>
  12. 14. <ul><li>Therefore, all input should be matched by at least one rule. This can be guaranteed by placing the following rule at the bottom of a JLex specification: </li></ul><ul><ul><li>. { java.lang.System.out.println(&quot;Unmatched input: &quot; + yytext()); } </li></ul></ul><ul><li>The dot (.) matches any input character except the newline, so newlines still need a rule of their own. </li></ul>
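
The conflict-resolution policy described above (longest match first, earlier rule on a tie, error when nothing matches) can be sketched in plain Java with `java.util.regex`. This is only an illustration of the policy, not the table-driven code JLex generates; the class name `LongestMatchDemo`, the rule names, and the patterns are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: JLex generates a table-driven automaton, not a loop like this.
class LongestMatchDemo {
    // Hypothetical rules; their order encodes priority, as in a specification file.
    static final String[] NAMES = { "IF", "IDENT", "NUMBER", "SPACE", "UNMATCHED" };
    static final Pattern[] RULES = {
        Pattern.compile("if"),
        Pattern.compile("[a-z]+"),
        Pattern.compile("[0-9]+"),
        Pattern.compile("[ \\t]+"),
        Pattern.compile(".") // catch-all; like the dot rule, it does not match a newline
    };

    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int best = -1;
            int bestLen = 0;
            for (int r = 0; r < RULES.length; r++) {
                Matcher m = RULES[r].matcher(input).region(pos, input.length());
                // Strictly-greater comparison keeps the earliest rule when lengths tie.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    best = r;
                    bestLen = m.end() - pos;
                }
            }
            if (best < 0) {
                throw new IllegalStateException("No rule matches at position " + pos);
            }
            if (!NAMES[best].equals("SPACE")) {
                out.add(NAMES[best] + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("if iffy 42 ?"));
    }
}
```

In this sketch `if` is claimed by the earlier `IF` rule because both `IF` and `IDENT` match two characters, while `iffy` goes to `IDENT` because it matches four. A newline in the input raises the error, since no rule here covers it.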
  13. 15. <ul><ul><li>JLex will take a properly-formed specification and transform it into a Java source file for the corresponding lexical analyzer. </li></ul></ul><ul><ul><li>A benchmark experiment was conducted, comparing the performance of a lexical analyzer generated by JLex to that of a hand-written lexical analyzer. </li></ul></ul>
  14. 16. <ul><ul><li>The comparison was made for lexical analyzers of a simple &quot;toy&quot; programming language. The hand-written lexical analyzer was written in Java. </li></ul></ul><ul><ul><li>The experiment consisted of running each lexical analyzer on two source files written in the toy language, then measuring the time required to process these files. </li></ul></ul>
  15. 17. <ul><li>The generated lexical analyzer proved to be quite quick, as the following results show. </li></ul><ul><ul><ul><li>Source file | JLex-generated lexical analyzer | Hand-written lexical analyzer </li></ul></ul></ul><ul><ul><ul><li>177 lines | 0.42 seconds | 0.53 seconds </li></ul></ul></ul><ul><ul><ul><li>897 lines | 0.98 seconds | 1.28 seconds </li></ul></ul></ul><ul><li>In both cases the JLex-generated lexical analyzer was faster, taking roughly 21% and 23% less time than the hand-written lexer. </li></ul>
  16. 18. <ul><li>One of the biggest complaints about table-driven lexical analyzers generated by programs like JLex is that these lexical analyzers do not perform as well as hand-written ones. </li></ul><ul><li>Therefore, this experiment is particularly important in demonstrating the relative speed of JLex lexical analyzers. </li></ul>
  17. 19. <ul><li>The following is a (possibly incomplete) list of unimplemented features of JLex. </li></ul><ul><li>1) The regular expression lookahead operator is unimplemented, and not included in the list of special regular expression metacharacters. </li></ul><ul><li>2) The start-of-line operator (^) has the following nonstandard behavior: a match on a regular expression that uses this operator causes the newline that precedes the match to be discarded. </li></ul>
  18. 20. <ul><ul><li>javac Main.java </li></ul></ul><ul><ul><li>java JLex.Main Sample.lex </li></ul></ul><ul><ul><li>javac Sample.lex.java </li></ul></ul><ul><ul><li>java Sample </li></ul></ul>
  19. 21. <ul><ul><li>Java CUP is a parser generator for Java. </li></ul></ul><ul><ul><li>Java CUP compatibility is turned off by default, but can be activated with the following JLex directive. </li></ul></ul><ul><ul><li>%cup </li></ul></ul>
  20. 22. <ul><li>When given, this directive makes the generated scanner conform to the java_cup.runtime.Scanner interface. It has the same effect as the following three directives: </li></ul><ul><li>%implements java_cup.runtime.Scanner </li><li>%function next_token </li><li>%type java_cup.runtime.Symbol </li></ul>
  21. 26. <ul><li>Parsing (syntactic analysis) is the process of analyzing a sequence of tokens to determine their grammatical structure with respect to a given (more or less) formal grammar. </li></ul>
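
As a small illustration of determining grammatical structure from a token sequence, the sketch below is a recursive-descent parser for an assumed two-level grammar (`expr := term ('+' term)*`, `term := NUMBER ('*' NUMBER)*`). The grammar, the class name `TinyParser`, and the parenthesized output format are hypothetical choices for this example, and the code is unrelated to Java CUP.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical recursive-descent parser for illustration; not generated by Java CUP.
class TinyParser {
    private final List<String> tokens;
    private int pos = 0;

    TinyParser(List<String> tokens) {
        this.tokens = tokens;
    }

    // expr := term ('+' term)*
    String expr() {
        String left = term();
        while (pos < tokens.size() && tokens.get(pos).equals("+")) {
            pos++;
            left = "(" + left + " + " + term() + ")";
        }
        return left;
    }

    // term := NUMBER ('*' NUMBER)*
    String term() {
        String left = number();
        while (pos < tokens.size() && tokens.get(pos).equals("*")) {
            pos++;
            left = "(" + left + " * " + number() + ")";
        }
        return left;
    }

    // NUMBER := one or more decimal digits
    String number() {
        if (pos >= tokens.size() || !tokens.get(pos).matches("[0-9]+")) {
            throw new IllegalArgumentException("Expected a number at token " + pos);
        }
        return tokens.get(pos++);
    }

    // Parses the whole token sequence and returns the structure as a parenthesized string.
    static String parse(String... tokens) {
        TinyParser p = new TinyParser(Arrays.asList(tokens));
        String tree = p.expr();
        if (p.pos != p.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token at " + p.pos);
        }
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parse("1", "+", "2", "*", "3"));
    }
}
```

The grouping in the result reflects the grammar: because `term` sits below `expr`, the multiplication is grouped before the addition.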
  22. 27. <ul><li>Document Object Model </li></ul><ul><li>Platform- and language-independent standard object model for representing HTML or XML and related formats. </li></ul><ul><li>Tree-structure-based API: a DOM parser implements the DOM API and creates a DOM tree in memory for an XML document. </li></ul>
  23. 28. <ul><li>DOM supports navigation in any direction (e.g., parent and previous sibling) and allows for arbitrary modifications. </li></ul><ul><li>An implementation must at least buffer the document that has been read so far (or some parsed form of it). </li></ul><ul><li>DOM is best suited for applications where the document must be accessed repeatedly or out of sequence order. </li></ul>
  24. 29. <ul><li>DOM parsers must have the entire tree in memory before any processing can begin, so the amount of memory used by a DOM parser depends entirely on the size of the input data. </li></ul>
  25. 30. <ul><li>When to use DOM parser </li></ul><ul><li>Manipulate the document </li></ul><ul><li>Traverse the document back and forth </li></ul><ul><li>Small XML files </li></ul><ul><li>Drawbacks of DOM parser </li></ul><ul><li>Consumes a lot of memory </li></ul>
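
The DOM properties listed above (whole tree in memory, navigation in any direction, arbitrary modification) can be sketched with the standard `javax.xml.parsers` and `org.w3c.dom` APIs. The class name `DomDemo`, the method name, and the sample document are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Illustrative sketch; class and method names are hypothetical.
class DomDemo {
    static String navigateAndModify(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The whole document is parsed into an in-memory tree before we touch it.
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();

            // Navigate in several directions from the last child of the root.
            Node last = root.getLastChild();
            Node previous = last.getPreviousSibling();
            Node parent = last.getParentNode();

            // Modify the tree in place by appending a new element.
            root.appendChild(doc.createElement("c"));

            return parent.getNodeName() + ">" + previous.getNodeName() + "," + last.getNodeName()
                    + " children=" + root.getChildNodes().getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(navigateAndModify("<r><a/><b/></r>"));
    }
}
```

The sample document has no whitespace between elements, so every child node of the root is an element; with indented XML, text nodes would also appear among the siblings.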
  26. 31. <ul><li>SAX (Simple API for XML) is a serial-access parser API for XML. </li></ul><ul><li>Provides a mechanism for reading data from an XML document. </li></ul><ul><li>Popular alternative to the DOM. </li></ul><ul><li>The quantity of memory that a SAX parser must use in order to function is typically much smaller than that of a DOM parser. </li></ul>
  27. 32. <ul><li>Because of the event-driven nature of SAX, processing documents with it can often be faster than with DOM-style parsers. </li></ul>
  28. 33. <ul><li>When to use SAX parser </li></ul><ul><li>No structural modification </li></ul><ul><li>Huge XML files </li></ul><ul><li>Drawbacks of SAX Parser </li></ul><ul><li>Certain kinds of XML validation require access to the document in full </li></ul>
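
A minimal sketch of SAX's event-driven, serial-access style, using the standard `javax.xml.parsers.SAXParser` and `org.xml.sax.helpers.DefaultHandler` classes. The class name `SaxDemo` and the choice to collect element names are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch; class and method names are hypothetical.
class SaxDemo {
    // Collects element names in document order without building a tree.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                // The parser calls this once per start tag as it streams through the input.
                names.add(qName);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<r><a/><b>text</b></r>"));
    }
}
```

The handler only keeps what it chooses to record, which is why memory use stays small; the cost is that earlier parts of the document cannot be revisited once the parser has moved past them.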
  29. 34. <ul><li>OM stands for Object Model (also known as AXIOM - AXis Object Model). </li></ul><ul><li>Refers to the XML infoset model that was initially developed for Apache Axis2. </li></ul><ul><li>For an object-oriented language the obvious choice is a model made up of objects. DOM and JDOM are two such XML models. </li></ul>
  30. 35. <ul><li>OM is conceptually similar to such an XML model in its external behavior, but deep down it is very different. </li></ul><ul><li>OM is based on pull parsing instead of push parsing. </li></ul>
  31. 36. <ul><li>Pull parsing is a recent trend in XML processing. </li></ul><ul><li>The previously popular XML processing frameworks such as SAX and DOM were &quot;push-based&quot;, which means the control of the parsing was in the hands of the parser itself. </li></ul>
  32. 37. <ul><li>The push-based approach is fine and easy to use, but it is not efficient in handling large XML documents, since a complete model of the document is built in memory. </li></ul><ul><li>Pull parsing inverts the control, so the parser only proceeds at the user's command. </li></ul><ul><li>The user can decide to store or discard events generated by the parser. </li></ul>
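
The inversion of control described above can be sketched with StAX (`javax.xml.stream`), the standard Java pull-parsing API. This shows pull parsing in general rather than the AXIOM API itself; the class name `PullDemo`, the method name, and the `title` element are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Illustrative sketch of pull parsing with StAX; not the AXIOM API.
class PullDemo {
    // Returns the text of the first <title> element, or null if there is none.
    static String firstTitle(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // The caller asks for each event; the parser never runs ahead on its own.
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("title")) {
                    String text = reader.getElementText();
                    reader.close(); // stop pulling: the rest of the document is never read
                    return text;
                }
                // All other events are simply discarded.
            }
            reader.close();
            return null;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstTitle(
                "<books><book><title>Compilers</title></book>"
                + "<book><title>Parsing</title></book></books>"));
    }
}
```

Because the calling code drives the loop, it can stop as soon as it has what it needs, which is the practical benefit of letting the user decide which events to store or discard.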
  33. 38. <ul><li>Credit goes to </li></ul><ul><ul><li>Mr Elliot Joel Berk, who wrote JLex. </li></ul></ul><ul><ul><li>The Department of Computer Science, Princeton University, for maintaining JLex. </li></ul></ul><ul><ul><li>All the others who contributed towards these projects. </li></ul></ul><ul><li>Special thanks go to </li></ul><ul><ul><li>Dr. Damith Karunaratne for giving me this opportunity. </li></ul></ul>