About Tokens and Lexemes



A parser is an integral part of building a domain-specific language or a file-format reader, such as our example use case: the iCalendar format. This session covers the general concepts of tokenizing and parsing input into a data structure, and goes into depth on how to keep the memory footprint and runtime low with the help of a stream tokenizer.

Published in: Technology


  1. About tokens and lexemes (Ben Scholzen, Game Developer, Gameforge Productions GmbH)
  2. What we'll cover
      • Definition of a compiler, tokenizer and parser
      • Basic structure of a tokenizer and a parser
      • Where to optimize things for PHP
  5. What about parser generators?
  6. They are evil!
      • PHP_LexerGenerator, PHP_ParserGenerator, lemon-PHP
      • Create lots of function calls, like Lemon parsers in C
      • Do not perform well
      • Will eat up all your memory
  10. Conclusion
      • Don't use them!
  11. Let's get started
  12. What a compiler is and how it works
      • Acts as frontend for the application
      • Converts human-readable data into machine-readable data
      • Consists of two components:
          • The lexer:
              • Is a finite-state machine
              • Reads the input stream
              • Cleans up the input data
              • Creates a list of tokens
          • The parser:
              • Gets tokens from the tokenizer
              • Converts them into a data structure
  19. What a compiler is and how it works (diagram with the labels: Lexer, Parser, Tokens, Document, Stream, Structure)
  20. Sounds great, but where do I need it?
      • Formatting languages
          • BB-Code
          • Wiki-Codes
      • Description languages
          • iCalendar / vCalendar
          • XML
      • Even programming languages
          • JavaScript
          • PHP
      • Anything else you want your program to understand
  24. The lexer (or tokenizer)
  25. What are tokens?
      • A categorized block of text
          • Token type
          • Corresponding block of text (lexeme)
      • A list of tokens represents an entire document
      • Example in PHP: $value = 5 * 7 ;
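To make the type/lexeme pairing concrete, here is a minimal sketch of how the example statement could be held as a list of tokens. The `[type, lexeme]` array layout follows the dump shown later in the deck; the helper name `lexemes()` is hypothetical and only there to show that the lexemes together cover the whole statement.

```php
<?php
// Each token is a pair: [token type, lexeme].
$tokens = [
    ['variable', '$value'],
    ['operator', '='],
    ['number',   '5'],
    ['operator', '*'],
    ['number',   '7'],
    ['operator', ';'],
];

// Hypothetical helper: joins the lexemes back into source text.
function lexemes(array $tokens): string
{
    return implode(' ', array_column($tokens, 1));
}

echo lexemes($tokens), "\n"; // prints: $value = 5 * 7 ;
```

Whitespace is not stored as a token here, which is why the rebuilt text uses a single space between lexemes.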
  28. How the tokenizer works
      • Define possible states of the lexer
      • Tokenize the input in a loop
          • Scan with preg_match()
              • strtok() is usually too simple
              • Reading char-by-char is too slow
              • Use the offset parameter
              • Use the \G assertion (^ won't work)
          • Always store the current position
          • Use either a switch statement or a structured array
      • Return the tokens
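The point about the offset parameter and `\G` is easy to miss, so here is a small sketch of the difference. The input string and variable names are made up for illustration; `preg_match()` and its fifth `$offset` argument are standard PHP.

```php
<?php
$input  = 'ab12cd';
$offset = 2; // current position: "ab" has already been consumed

// "^" anchors to the start of the whole subject, not to the offset,
// so this fails even though digits start at position 2.
$caret = preg_match('/^\d+/', $input, $m1, 0, $offset);

// "\G" anchors to the offset passed as the fifth argument.
$anchored = preg_match('/\G\d+/', $input, $m2, 0, $offset);

// Without any anchor the engine skips ahead on its own, which would
// silently ignore input the lexer has not tokenized yet.
$unanchored = preg_match('/\d+/', $input, $m3, 0, 0);

// Always store the current position after a successful match.
$offset += strlen($m2[0]);

var_dump($caret, $anchored, $m2[0], $offset);
```

Anchoring every pattern with `\G` means a failed match really is "no token of this kind at the current position", which is what a state machine needs to decide what to try next.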
  34. What we can optimize
      • Use little memory
          • Read only part of the document into memory at a time
              • Via fopen() and fgets()
              • Requires previous knowledge about when tokens end
          • Offer a method for the parser to get a partial bunch of tokens
      • Speed up execution time
          • Avoid internal function calls where possible
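One way to combine both memory points is sketched below: read the input line by line with `fgets()` and hand tokens to the consumer one at a time. This is an illustration, not the code from the slides. It assumes that no token spans a line break (the "previous knowledge about when tokens end"), the function name `streamTokens()` and the three token patterns are hypothetical, and it uses a generator (PHP 5.5+), which the slides do not mention.

```php
<?php
// Hypothetical streaming tokenizer; assumes no token spans a line break.
function streamTokens($handle): Generator
{
    $patterns = [
        'variable' => '/\G\$[A-Za-z_]\w*/',
        'number'   => '/\G\d+/',
        'operator' => '/\G[=*+\-\/;]/',
    ];

    // Only one line of the document is held in memory at a time.
    while (($line = fgets($handle)) !== false) {
        $offset = 0;
        $length = strlen($line);

        while ($offset < $length) {
            // Skip whitespace between tokens.
            if (preg_match('/\G\s+/', $line, $m, 0, $offset)) {
                $offset += strlen($m[0]);
                continue;
            }
            foreach ($patterns as $type => $pattern) {
                if (preg_match($pattern, $line, $m, 0, $offset)) {
                    $offset += strlen($m[0]);
                    yield [$type, $m[0]];
                    continue 2; // back to the inner while loop
                }
            }
            throw new RuntimeException("Unexpected input at offset $offset");
        }
    }
}

// Demo input held in a memory stream instead of a real file.
$handle = fopen('php://memory', 'r+');
fwrite($handle, "\$a = 1;\n\$b = \$a * 2;\n");
rewind($handle);

$tokens = iterator_to_array(streamTokens($handle), false);
fclose($handle);

echo count($tokens), " tokens\n";
```

A parser would normally iterate over `streamTokens()` directly rather than collecting everything with `iterator_to_array()`; the array is only built here to keep the demo short.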
  36. Going into practice
  37. The beginning
      • Use little memory
          • Via fopen() and fread()
              • Requires previous knowledge about when tokens end
          • Offer a method for the parser to get a partial bunch of tokens
      • Speed up execution time
          • Avoid internal function calls where possible
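With `fread()` the chunk boundary can fall in the middle of a token, so the reader has to know where a token may end and carry the unfinished tail over to the next chunk. The sketch below assumes that a newline is such a safe boundary, which is a simplification (real iCalendar data uses CRLF and folded lines). The function name `readCompleteLines()` and the chunk size are hypothetical.

```php
<?php
// Hypothetical chunked reader: yields complete lines from fixed-size
// fread() chunks and carries an unfinished tail over to the next chunk.
function readCompleteLines($handle, int $chunkSize = 8192): Generator
{
    $buffer = '';

    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            break;
        }
        $buffer .= $chunk;

        // Everything up to the last newline is known to be complete.
        $end = strrpos($buffer, "\n");
        if ($end === false) {
            continue; // no boundary yet, keep reading
        }
        foreach (explode("\n", substr($buffer, 0, $end)) as $line) {
            yield $line;
        }
        $buffer = substr($buffer, $end + 1);
    }

    if ($buffer !== '') {
        yield $buffer; // final line without a trailing newline
    }
}

// Demo: a tiny chunk size forces lines to be split across reads.
$handle = fopen('php://memory', 'r+');
fwrite($handle, "BEGIN:VCALENDAR\nVERSION:2.0\nEND:VCALENDAR");
rewind($handle);

$lines = iterator_to_array(readCompleteLines($handle, 8), false);
fclose($handle);

print_r($lines);
```

Each yielded line could then be passed to the tokenizer, so memory use depends on the chunk size and the longest line rather than on the size of the file.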
  39. Throwing in a file
  40. Preparing stuff
  41. Base state
  42. Operator state
  43. Value state
  44. Rounding it up
  45. Some actual testing
  46. And what we get
        array(6) {
          [0]=>
          array(2) {
            [0]=>
            string(8) "variable"
            [1]=>
            string(6) "$value"
          }
          [1]=>
          array(2) {
            [0]=>
            string(8) "operator"
            [1]=>
            string(1) "="
          }
          [2]=>
          array(2) {
            [0]=>
            string(6) "number"
            [1]=>
            string(1) "5"
          }
          [3]=>
          array(2) {
            [0]=>
            string(8) "operator"
            [1]=>
            string(1) "*"
          }
          [4]=>
          array(2) {
            [0]=>
            string(6) "number"
            [1]=>
            string(1) "7"
          }
          [5]=>
          array(2) {
            [0]=>
            string(8) "operator"
            [1]=>
            string(1) ";"
          }
        }
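The code slides themselves are not part of this transcript, so the following is only a guess at what a tokenizer with a base, an operator and a value state could look like. The function name `tokenize()`, the state transitions and the patterns are all assumptions; the only things taken from the slides are the state names, the use of `preg_match()` with an offset and `\G`, the switch statement, and the `[type, lexeme]` shape of the dump above.

```php
<?php
// Hypothetical reconstruction; only the state names come from the slides.
function tokenize(string $input): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);
    $state  = 'base';

    while ($offset < $length) {
        // Whitespace is skipped in every state.
        if (preg_match('/\G\s+/', $input, $m, 0, $offset)) {
            $offset += strlen($m[0]);
            continue;
        }

        switch ($state) {
            case 'base': // a statement starts with a variable
                if (!preg_match('/\G\$[A-Za-z_]\w*/', $input, $m, 0, $offset)) {
                    throw new RuntimeException("Expected variable at offset $offset");
                }
                $tokens[] = ['variable', $m[0]];
                $state    = 'operator';
                break;

            case 'operator': // "=", an arithmetic operator, or ";"
                if (!preg_match('/\G[=+\-*\/;]/', $input, $m, 0, $offset)) {
                    throw new RuntimeException("Expected operator at offset $offset");
                }
                $tokens[] = ['operator', $m[0]];
                $state    = $m[0] === ';' ? 'base' : 'value';
                break;

            case 'value': // a number or a variable
                if (preg_match('/\G\d+/', $input, $m, 0, $offset)) {
                    $tokens[] = ['number', $m[0]];
                } elseif (preg_match('/\G\$[A-Za-z_]\w*/', $input, $m, 0, $offset)) {
                    $tokens[] = ['variable', $m[0]];
                } else {
                    throw new RuntimeException("Expected value at offset $offset");
                }
                $state = 'operator';
                break;
        }

        // Store the new position after every emitted token.
        $offset += strlen($m[0]);
    }

    return $tokens;
}

$tokens = tokenize('$value = 5 * 7;');
var_dump($tokens);
```

For the example statement this produces six tokens with the same types and lexemes as the dump above. The whole loop lives in one function, in line with the advice to avoid internal function calls.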
  89. The parser
  90. So we have a bunch of tokens, what now?
      • Loop through the tokens and analyze them
      • Create an object-oriented tree structure or interpret
      • Avoid non-tail recursion
          • Use tail recursion (trampoline) instead
          • Saves you from hitting the stack limit
      • That's it!
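As an illustration of the trampoline idea, the sketch below interprets a token list by returning the next step as a closure instead of calling it directly, so the call depth stays constant no matter how many tokens there are. Everything here is hypothetical: the function names, the restriction to `+` and `*`, and the strict left-to-right evaluation without operator precedence. PHP itself does not optimize tail calls, and deep plain recursion can run out of memory or hit a configured limit such as Xdebug's `xdebug.max_nesting_level`.

```php
<?php
// Keeps calling returned closures in a flat loop, so the call stack
// never grows with the number of tokens.
function trampoline($result)
{
    while ($result instanceof Closure) {
        $result = $result();
    }
    return $result;
}

// Hypothetical interpreter step for "number (operator number)*".
// Folds strictly left to right; operator precedence is ignored.
function evaluateStep(array $tokens, int $index, int $acc)
{
    if ($index >= count($tokens)) {
        return $acc; // a plain value ends the trampoline
    }
    if (!isset($tokens[$index + 1])) {
        throw new RuntimeException('Operator without operand');
    }
    $operator = $tokens[$index][1];
    $operand  = (int) $tokens[$index + 1][1];

    switch ($operator) {
        case '*':
            $next = $acc * $operand;
            break;
        case '+':
            $next = $acc + $operand;
            break;
        default:
            throw new RuntimeException("Unsupported operator $operator");
    }

    // Tail position: hand back the next step instead of calling it.
    return function () use ($tokens, $index, $next) {
        return evaluateStep($tokens, $index + 2, $next);
    };
}

function evaluate(array $tokens): int
{
    return trampoline(evaluateStep($tokens, 1, (int) $tokens[0][1]));
}

echo evaluate([['number', '5'], ['operator', '*'], ['number', '7']]), "\n"; // prints 35
```

The same pattern would apply when building a tree instead of interpreting: each step returns either the finished node or a closure for the next step.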
  94. Summary. Questions?
  95. Where to go from here
      • Wikipedia: http://en.wikipedia.org/wiki/Compiler http://en.wikipedia.org/wiki/Parsing
      • About tail-recursion in PHP: http://www.alternateinterior.com/2006/09/tail-recursion-in-php.html
      • My blog: http://www.dasprids.de
      • Rate this talk: http://joind.in/635
      • Follow me on Twitter: http://www.twitter.com/dasprid
  101. Thank you!
