About Tokens and Lexemes

A parser is an integral component when building a domain-specific language or a file-format reader, such as our example use case: the iCalendar format. This session covers the general concepts of tokenizing and parsing into a data structure, and goes into depth on how to keep the memory footprint and runtime low with the help of a stream tokenizer.

Presentation Transcript

  • About tokens and lexemes. Ben Scholzen, Game Developer, Gameforge Productions GmbH
  • What we'll cover
    • Definition of a compiler, tokenizer and parser
    • Basic structure of a tokenizer and a parser
    • Where to optimize things for PHP
  • What about parser generators?
  • They are evil!
    • PHP_LexerGenerator, PHP_ParserGenerator, lemon-PHP
    • Generate lots of function calls, just like Lemon-generated parsers in C
    • Perform poorly
    • Will eat up all your memory
  • Conclusion
      Don't use them!
  • Let's get started
  • What a compiler is and how it works
    • Acts as frontend for the application
    • Converts human-readable data into machine-readable data
    • Consists of two components:
      • The lexer:
        • Is a finite-state-machine
        • Reads the input stream
        • Clears up the input data
        • Creates a list of tokens
      • The parser:
        • Gets tokens from the tokenizer
        • Converts them into a data structure
  • What a compiler is and how it works: document stream → lexer → tokens → parser → structure
  • Sounds great, but where do I need it?
    • Formatting languages
      • BBCode
      • Wiki markup
    • Description languages
      • iCalendar / vCalendar
      • XML
    • Even programming languages
      • JavaScript
      • PHP
    • Anything else you want your program to understand
  • The lexer (or tokenizer)
  • What are tokens?
    • Categorized block of text
      • Token type
      • Corresponding block of text (lexeme)
    • A list of tokens represents the entire document
    • Example in PHP: $value = 5 * 7; (tokenized in the sketch after this list)
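
    A minimal sketch of what that statement looks like as a token list, mirroring the var_dump output shown later in the deck (the type/lexeme pair representation comes from those slides; the short array syntax is just modern notation):

        <?php
        // Each token is a pair: the token type plus the lexeme,
        // i.e. the exact block of text that was categorized.
        $tokens = [
            ['variable', '$value'],
            ['operator', '='],
            ['number',   '5'],
            ['operator', '*'],
            ['number',   '7'],
            ['operator', ';'],
        ];
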
  • How the tokenizer works
    • Define possible states of the lexer
    • Tokenize the input in a loop
      • Scan with preg_match()
        • strtok() is mostly too simple
        • Reading char-by-char is too slow
        • Use the offset parameter
        • Use the \G assertion (^ won't work with an offset)
      • Always store the current position
      • Use either a switch-statement or a structured array
    • Return the tokens
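
    A hedged sketch of that loop, following the rules above: a structured array of patterns, preg_match() with the offset parameter, and the \G assertion. The tokenize() function and its pattern names are illustrative, not the speaker's actual code:

        <?php
        // Hypothetical tokenizer: one anchored pattern per token type,
        // scanned in a loop without ever copying the remaining input.
        function tokenize(string $input): array
        {
            // A structured array instead of a switch statement.
            $patterns = [
                'whitespace' => '/\G\s+/',
                'variable'   => '/\G\$[a-zA-Z_][a-zA-Z0-9_]*/',
                'number'     => '/\G[0-9]+/',
                'operator'   => '/\G[=*;]/',
            ];

            $tokens = [];
            $offset = 0; // always store the current position

            while ($offset < strlen($input)) {
                foreach ($patterns as $type => $pattern) {
                    // \G anchors the match at the current offset;
                    // ^ would keep matching at the start of the string.
                    if (preg_match($pattern, $input, $match, 0, $offset)) {
                        if ($type !== 'whitespace') {
                            $tokens[] = [$type, $match[0]];
                        }
                        $offset += strlen($match[0]);
                        continue 2; // next iteration of the while loop
                    }
                }
                throw new RuntimeException("Unexpected input at offset $offset");
            }

            return $tokens;
        }
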
  • What we can optimize
    • Use little memory
      • Only read part of the document into memory at a time
        • Via fopen() and fgets()
        • Requires prior knowledge of where tokens end
      • Offer a method for the parser to fetch tokens in batches (see the sketch after this list)
    • Speed up execution time
      • Avoid internal function calls where possible
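
    A sketch of the memory side, assuming the input format guarantees that no token spans a line boundary (the "prior knowledge of where tokens end" from the list above). The generator is a modern stand-in for a "give me the next batch of tokens" method:

        <?php
        // Hypothetical streaming variant: read the document piece by
        // piece via fopen()/fgets() and hand tokens out incrementally,
        // so the whole file never sits in memory at once.
        function tokenizeStream(string $filename): Generator
        {
            $handle = fopen($filename, 'rb');

            while (($chunk = fgets($handle)) !== false) {
                // Safe only because no token crosses a chunk boundary.
                foreach (tokenize($chunk) as $token) {
                    yield $token;
                }
            }

            fclose($handle);
        }
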
  • Going into practice
  • The beginning
  • Throwing in a file
  • Preparing stuff
  • Base state
  • Operator state
  • Value state
  • Rounding it up
  • Some actual testing
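
    The code shown on these slides is not captured in the transcript. As a stand-in, feeding the example statement through the hypothetical tokenize() sketch from above reproduces the dump on the next slide:

        <?php
        var_dump(tokenize('$value = 5 * 7;'));
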
  • And what we get
    array(6) {
      [0]=>
      array(2) {
        [0]=>
        string(8) "variable"
        [1]=>
        string(6) "$value"
      }
      [1]=>
      array(2) {
        [0]=>
        string(8) "operator"
        [1]=>
        string(1) "="
      }
      [2]=>
      array(2) {
        [0]=>
        string(6) "number"
        [1]=>
        string(1) "5"
      }
      [3]=>
      array(2) {
        [0]=>
        string(8) "operator"
        [1]=>
        string(1) "*"
      }
      [4]=>
      array(2) {
        [0]=>
        string(6) "number"
        [1]=>
        string(1) "7"
      }
      [5]=>
      array(2) {
        [0]=>
        string(8) "operator"
        [1]=>
        string(1) ";"
      }
    }
  • The parser
  • So we have a bunch of tokens, what now?
    • Loop through the tokens and analyze them
    • Build an object-oriented tree structure, or interpret on the fly
    • Avoid non-tail recursion
      • Use tail recursion driven by a trampoline instead (sketched after this list), since PHP does not optimize tail calls
      • Saves you from hitting the stack limit
    • That's it!
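
    A minimal trampoline sketch (the names are illustrative, not from the talk): each step returns a closure for the next step instead of calling itself, and a small driver loop keeps invoking closures until a plain value comes back, so the call stack never grows:

        <?php
        // Hypothetical trampoline: keep invoking the returned closures;
        // stack depth stays constant no matter how many steps run.
        function trampoline(callable $step, ...$args)
        {
            $result = $step(...$args);
            while ($result instanceof Closure) {
                $result = $result();
            }
            return $result;
        }

        // Example: a "recursive" sum that never nests calls, because
        // every recursive step is returned to the driver, not called.
        function sumTo(int $n, int $acc = 0)
        {
            if ($n === 0) {
                return $acc;
            }
            return fn () => sumTo($n - 1, $acc + $n);
        }

        var_dump(trampoline('sumTo', 1000000)); // int(500000500000), no stack overflow
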
  • Summary — Questions?
  • Where to go from here
    • Wikipedia: http://en.wikipedia.org/wiki/Compiler and http://en.wikipedia.org/wiki/Parsing
    • About tail-recursion in PHP: http://www.alternateinterior.com/2006/09/tail-recursion-in-php.html
    • My blog: http://www.dasprids.de
    • Rate this talk: http://joind.in/635
    • Follow me on Twitter: http://www.twitter.com/dasprid
  • Thank you!