NLP: CFG Parsing with Earley Algorithm


  1. 1. Natural Language Processing CFG Parsing with Earley Algorithm Vladimir Kulyukin
  2. 2. Outline  Background  Earley Algorithm  Parser States (aka Dotted Rules)  Prediction, Completion, Scanning  Parse Tree Retrieval  Part-of-Speech Tagging with Apache OpenNLP
  3. 3. Background
  4. 4. Syntax & Parsing  Syntax in the NLP context refers to the study of sentence or text structure  Parsing is the process of assigning a parse tree to a string  A grammar is required to generate parse trees  Grammars for natural languages consist of syntactic categories and parts of speech
  5. 5. Part of Speech vs. Syntactic Category  In NLP, every wordform has a part of speech (e.g., NOUN, PREPOSITION, VERB, DET, etc.)  Syntactic categories are higher level in that they are composed of parts of speech (e.g., NP → DET NOMINAL)  In context-free grammars for formal languages, a part of speech is any variable V that has a production V → t, where t is a terminal
  6. 6. NL Example S ::= NP VP S ::= AUX NP VP S ::= VP NP ::= DET NOMINAL NOMINAL ::= NOUN NOMINAL ::= NOUN NOMINAL NP ::= ProperNoun VP ::= VERB VP ::= VERB NP DET ::= this | that | a NOUN ::= book | flight VERB ::= book | include AUX ::= does PREP ::= from | to | on ProperNoun ::= Houston Suppose a CFG G has the following productions: S, NP, VP, NOMINAL are syntactic categories; DET, NOUN, VERB, AUX, PREP, ProperNoun are parts of speech
  7. 7. Formal Language Example E ::= E MINUS E E ::= E TIMES E E ::= E EQUALS E E ::= E LESS E E ::= LP E RP E ::= A E ::= B E ::= C MINUS ::= - TIMES ::= * EQUALS ::= = LESS ::= < LP ::= ( RP ::= ) A ::= a B ::= b C ::= c Suppose a CFG G has the following productions: E is the only syntactic category; MINUS, TIMES, EQUALS, LESS, LP, RP, A, B, C are parts of speech
  8. 8. Earley Algorithm
  9. 9. General Remarks  The Earley Algorithm (EA) uses a dynamic programming approach to find all possible parse trees of a given input  EA makes a single left-to-right pass that gradually fills an array called a chart (this class of algorithms is known as chart parsers)  Each found parse tree can be retrieved from the chart, which makes this algorithm well suited for retrieving partial parses  EA can be used as a prediction management framework
  10. 10. Helper Data Structures & Functions Production { LHS; // left-hand side variable RHS; // sequence of right-hand side variables and terminals } // Suppose that A is a syntactic or part-of-speech category and G is a CFG. FindProductions(A, G) returns a set of grammar productions {(A → λ) | where A is the LHS and λ is the RHS of a production in G} PartOfSpeech(wordform) returns a set of part-of-speech symbols for wordform (e.g., PartOfSpeech(“make”) returns { NOUN, VERB })
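
A minimal, self-contained Java sketch of the two helper lookups above, assuming the grammar productions and the lexicon are kept in plain maps; the class and method names (GrammarHelpers, findProductions, partOfSpeech) are illustrative and are not part of the lecture's source code.

import java.util.*;

public class GrammarHelpers {
    // LHS category -> list of right-hand sides (each RHS is a list of symbols)
    static Map<String, List<List<String>>> productions = new HashMap<>();
    // wordform -> part-of-speech categories
    static Map<String, List<String>> lexicon = new HashMap<>();

    // FindProductions(A, G): all productions of G whose LHS is the given category
    static List<List<String>> findProductions(String category) {
        return productions.getOrDefault(category, Collections.emptyList());
    }

    // PartOfSpeech(wordform): all part-of-speech symbols listed for the wordform
    static List<String> partOfSpeech(String wordform) {
        return lexicon.getOrDefault(wordform, Collections.emptyList());
    }

    public static void main(String[] args) {
        productions.put("S", List.of(List.of("NP", "VP"), List.of("AUX", "NP", "VP"), List.of("VP")));
        lexicon.put("make", List.of("NOUN", "VERB"));
        System.out.println(findProductions("S")); // [[NP, VP], [AUX, NP, VP], [VP]]
        System.out.println(partOfSpeech("make")); // [NOUN, VERB]
    }
}
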
  11. 11. NL Example S ::= NP VP S ::= AUX NP VP S ::= VP NP ::= DET NOMINAL NOMINAL ::= NOUN NOMINAL ::= NOUN NOMINAL NP ::= ProperNoun VP ::= VERB VP ::= VERB NP DET ::= this | that | a NOUN ::= book | flight VERB ::= book | include AUX ::= does PREP ::= from | to | on ProperNoun ::= Houston Suppose a CFG G has the following productions: FindProductions(S, G) returns { S ::= NP VP, S ::= AUX NP VP, S ::= VP }; PartOfSpeech(book) returns { NOUN, VERB}
  12. 12. Formal Language Example E ::= E MINUS E E ::= E TIMES E E ::= E EQUALS E E ::= E LESS E E ::= LP E RP E ::= A E ::= B E ::= C MINUS ::= - TIMES ::= * EQUALS ::= = LESS ::= < LP ::= ( RP ::= ) A ::= a B ::= b C ::= c Suppose a CFG G has the following productions: FindProductions(E, G) returns { E ::= E MINUS E, E ::= E TIMES E, E ::= E EQUALS E, E ::= E LESS E, E ::= LP E RP, E ::= A, E ::= B, E ::= C}; PartOfSpeech(a) returns { A }; PartOfSpeech(<) returns { LESS }
  13. 13. Parser States (aka Dotted Rules)
  14. 14. Input & Input Positions INPUT: “MAKE A LEFT”; input positions: 0 “MAKE” 1 “A” 2 “LEFT” 3
  15. 15. Relationship between Input & Parser States A parser state (B → E F * G, i, j) covers the input span from position i to position j; a state such as (A → B C D *, 0, N) spans the entire input from 0 to N
  16. 16. Input & Parser State: Examples INPUT: “MAKE A LEFT” with positions 0 “MAKE” 1 “A” 2 “LEFT” 3; example parser states: (S → * VP, 0, 0), (NP → DET * NOMINAL, 1, 2), (VP → Verb NP *, 0, 3)
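
To make the (production, dot, start, end) tuples above concrete, here is a small sketch that writes the three example states down as plain Java data; the State record is illustrative and is not the RecognizerState class shown later in the implementation notes.

import java.util.List;

public class DottedRuleExamples {
    // lhs -> rhs with the dot before rhs[dot], spanning input positions from..upto
    record State(String lhs, List<String> rhs, int dot, int from, int upto) {}

    public static void main(String[] args) {
        // (S -> * VP, 0, 0): nothing matched yet, a VP is expected starting at position 0
        State s1 = new State("S", List.of("VP"), 0, 0, 0);
        // (NP -> DET * NOMINAL, 1, 2): DET matched from 1 to 2, NOMINAL still expected
        State s2 = new State("NP", List.of("DET", "NOMINAL"), 1, 1, 2);
        // (VP -> VERB NP *, 0, 3): a completed VP spanning the entire input
        State s3 = new State("VP", List.of("VERB", "NP"), 2, 0, 3);
        System.out.println(s1);
        System.out.println(s2);
        System.out.println(s3);
    }
}
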
  17. 17. Predictions, Scans, & Completions
  18. 18. Prediction  Prediction is the creation of new parser states (aka dotted rules) that predict what is expected to be seen in the input  If a parser state’s production’s right-hand side has the dot to the left of a variable (i.e., non-terminal) that is not a part-of-speech category, prediction is applicable  Prediction generates new parser states that start and end at the same input position as the parser state from which they were predicted
  19. 19. Prediction Example  Suppose a CFG G contains the following productions: S → VP, VP → Verb, VP → Verb NP  Suppose that there is a parser state (S → * VP, 0, 0)  Since VP is not a part-of-speech category, prediction applies  Prediction generates two new parser states from the old parser state (S → * VP, 0, 0): 1) (VP → * Verb, 0, 0) 2) (VP → * Verb NP, 0, 0)
  20. 20. Prediction Pseudocode // G is a grammar, chart[] is an array of parser state lists Predict((A → α *X β, start, end), G, chart[]) { Productions = FindProductions(X, G); for each production (X → λ) in Productions { // Insert a new parser state (X → *λ, end, end) into chart[end] AddToChart((X → *λ, end, end), chart[end]); } }
  21. 21. Example INPUT: “a * b” with positions 0 “a” 1 “*” 2 “b” 3 Suppose a CFG G has these productions: E ::= E MINUS E E ::= E TIMES E E ::= E EQUALS E E ::= E LESS E E ::= LP E RP E ::= A E ::= B E ::= C MINUS ::= - TIMES ::= * EQUALS ::= = LESS ::= < LP ::= ( RP ::= ) A ::= a B ::= b C ::= c
  22. 22. Example INPUT: “a * b” E ::= E MINUS E E ::= E TIMES E E ::= E EQUALS E E ::= E LESS E E ::= LP E RP E ::= A E ::= B E ::= C MINUS ::= - TIMES ::= * EQUALS ::= = LESS ::= < LP ::= ( RP ::= ) A ::= a B ::= b C ::= c Suppose a CFG G has these productions: 1. ((λ → *E), 0, 0) // dummy prediction to start everything going
  23. 23. Example INPUT: “a * b” E ::= E MINUS E E ::= E TIMES E E ::= E EQUALS E E ::= E LESS E E ::= LP E RP E ::= A E ::= B E ::= C MINUS ::= - TIMES ::= * EQUALS ::= = LESS ::= < LP ::= ( RP ::= ) A ::= a B ::= b C ::= c Suppose a CFG G has these productions: 1. ((λ → *E), 0, 0) // dummy prediction to start everything going 2. ((E → *E MINUS E), 0, 0) // prediction from E ::= E MINUS E
  24. 24. Example INPUT: “a * b” E ::= E MINUS E E ::= E TIMES E E ::= E EQUALS E E ::= E LESS E E ::= LP E RP E ::= A E ::= B E ::= C MINUS ::= - TIMES ::= * EQUALS ::= = LESS ::= < LP ::= ( RP ::= ) A ::= a B ::= b C ::= c Suppose a CFG G has these productions: 1. ((λ → *E), 0, 0) // dummy prediction to start everything going 2. ((E → *E MINUS E), 0, 0) // prediction from E ::= E MINUS E 3. ((E → *E TIMES E), 0, 0) // prediction from E ::= E TIMES E
  25. 25. Example INPUT: “a * b” E ::= E MINUS E E ::= E TIMES E E ::= E EQUALS E E ::= E LESS E E ::= LP E RP E ::= A E ::= B E ::= C MINUS ::= - TIMES ::= * EQUALS ::= = LESS ::= < LP ::= ( RP ::= ) A ::= a B ::= b C ::= c Suppose a CFG G has these productions: 1. ((λ → *E), 0, 0) // dummy prediction to start everything going 2. ((E → *E MINUS E), 0, 0) // prediction from E ::= E MINUS E 3. ((E → *E TIMES E), 0, 0) // prediction from E ::= E TIMES E 4. ((E → *E EQUALS E), 0, 0) // prediction from E ::= E EQUALS E
  26. 26. Example INPUT: “a * b” E ::= E MINUS E E ::= E TIMES E E ::= E EQUALS E E ::= E LESS E E ::= LP E RP E ::= A E ::= B E ::= C MINUS ::= - TIMES ::= * EQUALS ::= = LESS ::= < LP ::= ( RP ::= ) A ::= a B ::= b C ::= c Suppose a CFG G has these productions: 1. ((λ → *E), 0, 0) // dummy prediction to start everything going 2. ((E → *E MINUS E), 0, 0) // prediction from E ::= E MINUS E 3. ((E → *E TIMES E), 0, 0) // prediction from E ::= E TIMES E 4. ((E → *E EQUALS E), 0, 0) // prediction from E ::= E EQUALS E 5. ((E → *E LESS E), 0, 0) // prediction from E ::= E LESS E
  27. 27. Example INPUT: “a * b” E ::= E MINUS E E ::= E TIMES E E ::= E EQUALS E E ::= E LESS E E ::= LP E RP E ::= A E ::= B E ::= C MINUS ::= - TIMES ::= * EQUALS ::= = LESS ::= < LP ::= ( RP ::= ) A ::= a B ::= b C ::= c Suppose a CFG G has these productions: 1. ((λ → *E), 0, 0) // dummy prediction to start everything going 2. ((E → *E MINUS E), 0, 0) // prediction from E ::= E MINUS E 3. ((E → *E TIMES E), 0, 0) // prediction from E ::= E TIMES E 4. ((E → *E EQUALS E), 0, 0) // prediction from E ::= E EQUALS E 5. ((E → *E LESS E), 0, 0) // prediction from E ::= E LESS E 6. ((E → *LP E RP), 0, 0) // prediction from E ::= LP E RP
  28. 28. Example INPUT: “a * b” E ::= E MINUS E E ::= E TIMES E E ::= E EQUALS E E ::= E LESS E E ::= LP E RP E ::= A E ::= B E ::= C MINUS ::= - TIMES ::= * EQUALS ::= = LESS ::= < LP ::= ( RP ::= ) A ::= a B ::= b C ::= c Suppose a CFG G has these productions: 1. ((λ → *E), 0, 0) // dummy prediction to start everything going 2. ((E → *E MINUS E), 0, 0) // prediction from E ::= E MINUS E 3. ((E → *E TIMES E), 0, 0) // prediction from E ::= E TIMES E 4. ((E → *E EQUALS E), 0, 0) // prediction from E ::= E EQUALS E 5. ((E → *E LESS E), 0, 0) // prediction from E ::= E LESS E 6. ((E → *LP E RP), 0, 0) // prediction from E ::= LP E RP 7. ((E → *A), 0, 0) // prediction from E ::= A
  29. 29. Example INPUT: “a * b” E ::= E MINUS E E ::= E TIMES E E ::= E EQUALS E E ::= E LESS E E ::= LP E RP E ::= A E ::= B E ::= C MINUS ::= - TIMES ::= * EQUALS ::= = LESS ::= < LP ::= ( RP ::= ) A ::= a B ::= b C ::= c Suppose a CFG G has these productions: 1. ((λ → *E), 0, 0) // dummy prediction to start everything going 2. ((E → *E MINUS E), 0, 0) // prediction from E ::= E MINUS E 3. ((E → *E TIMES E), 0, 0) // prediction from E ::= E TIMES E 4. ((E → *E EQUALS E), 0, 0) // prediction from E ::= E EQUALS E 5. ((E → *E LESS E), 0, 0) // prediction from E ::= E LESS E 6. ((E → *LP E RP), 0, 0) // prediction from E ::= LP E RP 7. ((E → *A), 0, 0) // prediction from E ::= A 8. ((E → *B), 0, 0) // prediction from E ::= B
  30. 30. Example INPUT: “a * b” E ::= E MINUS E E ::= E TIMES E E ::= E EQUALS E E ::= E LESS E E ::= LP E RP E ::= A E ::= B E ::= C MINUS ::= - TIMES ::= * EQUALS ::= = LESS ::= < LP ::= ( RP ::= ) A ::= a B ::= b C ::= c Suppose a CFG G has these productions: 1. ((λ → *E), 0, 0) // dummy prediction to start everything going 2. ((E → *E MINUS E), 0, 0) // prediction from E ::= E MINUS E 3. ((E → *E TIMES E), 0, 0) // prediction from E ::= E TIMES E 4. ((E → *E EQUALS E), 0, 0) // prediction from E ::= E EQUALS E 5. ((E → *E LESS E), 0, 0) // prediction from E ::= E LESS E 6. ((E → *LP E RP), 0, 0) // prediction from E ::= LP E RP 7. ((E → *A), 0, 0) // prediction from E ::= A 8. ((E → *B), 0, 0) // prediction from E ::= B 9. ((E → *C), 0, 0) // prediction from E ::= C
  31. 31. Scanning  Scanning advances the input pointer one wordform to the right  Scanning applies to any parser state whose production has the dot (written here as *) to the left of a part-of-speech category  Scanning creates a new parser state (NPS) from an old parser state (OPS) by 1) moving the dot in the OPS’s production’s RHS one symbol to the right; 2) incrementing the OPS’s input end position by 1, and 3) placing the NPS into the chart entry that follows the OPS’s chart entry
  32. 32. Scanning Example  Suppose the current wordform at position 0 in the input is MAKE  Suppose an old parser state in chart[0] is (VP → * VERB NP, 0, 0)  Since VERB is a part-of-speech category and MAKE is a VERB, scanning applies and generates a new parser state (VP → VERB * NP, 0, 1) that is placed in chart[1]
  33. 33. Scanning Pseudocode // α, β are possibly empty sequences of variables and terminals Scan((A → α *X β, start, end), input[], G, chart[]) { POSList = PartOfSpeech(input[end]); if ( X is in POSList ) { AddToChart((X → input[end] *, end, end+1), chart[end+1]); } }
  34. 34. Example INPUT: “a * b” E ::= E MINUS E E ::= E TIMES E E ::= E EQUALS E E ::= E LESS E E ::= LP E RP E ::= A E ::= B E ::= C MINUS ::= - TIMES ::= * EQUALS ::= = LESS ::= < LP ::= ( RP ::= ) A ::= a B ::= b C ::= c CFG G 1. ((λ → *E), 0, 0) // dummy prediction to start everything going 2. ((E → *E MINUS E), 0, 0) // prediction from E ::= E MINUS E 3. ((E → *E TIMES E), 0, 0) // prediction from E ::= E TIMES E 4. ((E → *E EQUALS E), 0, 0) // prediction from E ::= E EQUALS E 5. ((E → *E LESS E), 0, 0) // prediction from E ::= E LESS E 6. ((E → *LP E RP), 0, 0) // prediction from E ::= LP E RP 7. ((E → *A), 0, 0) // prediction from E ::= A 8. ((E → *B), 0, 0) // prediction from E ::= B 9. ((E → *C), 0, 0) // prediction from E ::= C 10. ((A → a *), 0, 1) // scanning
  35. 35. Completion  Completion marks the point at which a production’s RHS has been fully recognized  Completion is applied to any parser state whose dot has reached the end of its production’s RHS (to the right of the rightmost symbol)  In other words, completion signals that the parser has completed a production (LHS → RHS) over a specific portion of the input  Completion triggers the process of finding all previously created parser states that wait for a specific category (the completed production’s LHS) at the completed parser state’s start position
  36. 36. Completion Example  Suppose that there are two parser states: 1) (NP → DET NOMINAL *, 1, 3) 2) (VP → VERB * NP, 0, 1)  Since the dot has reached the end of the 1st production’s RHS, completion applies (this parser state is called the completed parser state)  Completion finds all states that end at position 1 and look for NP; in this case, (VP → VERB * NP, 0, 1) is found  For each found old parser state (OPS), a new parser state (NPS) is created by advancing the OPS’s dot one symbol to the right and setting the NPS’s end position to the completed parser state’s end position; in this case, (VP → VERB NP *, 0, 3)  NPS is placed in chart[completed parser state’s end position]
  37. 37. Completion Pseudocode // α, β, λ are sequences of variables and terminals, G is a grammar Complete((A → λ *, compStart, compEnd), input[], G, chart[]) { for each old parser state (X → α *A β, oldStart, compStart) in chart[compStart] { AddToChart((X → α A * β, oldStart, compEnd), chart[compEnd]); } }
  38. 38. EA Pseudocode EarleyParse(input[], G, chart[]) { AddToChart((λ → *S, 0, 0), chart[0]); for i from 0 to length(input[]) { for each parser state PS in chart[i] { if ( isIncomplete(PS) && !isPartOfSpeech(nextCat(PS)) ) { Predict(PS, G, chart[]); } else if ( isIncomplete(PS) && isPartOfSpeech(nextCat(PS)) ) { Scan(PS, input[], G, chart[]); } else { Complete(PS, input[], G, chart[]); } }}}
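
The pseudocode above can be turned into a compact recognizer. Below is a minimal, self-contained Java sketch of the same chart loop, hardwired to the toy expression grammar used in the running example; the class and method names (EarleySketch, State, add, etc.) are illustrative and are not the lecture's implementation, which is shown in the following slides.

import java.util.*;

public class EarleySketch {
    // A dotted rule: lhs -> rhs with the dot before rhs[dot], spanning input positions from..upto
    static class State {
        String lhs; List<String> rhs; int dot, from, upto;
        State(String lhs, List<String> rhs, int dot, int from, int upto) {
            this.lhs = lhs; this.rhs = rhs; this.dot = dot; this.from = from; this.upto = upto;
        }
        boolean complete() { return dot == rhs.size(); }
        String next() { return complete() ? null : rhs.get(dot); }
        public boolean equals(Object o) {
            if (!(o instanceof State)) return false;
            State s = (State) o;
            return lhs.equals(s.lhs) && rhs.equals(s.rhs) && dot == s.dot && from == s.from && upto == s.upto;
        }
        public int hashCode() { return Objects.hash(lhs, rhs, dot, from, upto); }
    }

    static Map<String, List<List<String>>> rules = new HashMap<>(); // syntactic-category productions
    static Map<String, List<String>> posOf = new HashMap<>();       // terminal -> POS categories
    static Set<String> posCats = new HashSet<>();                   // part-of-speech categories

    static void rule(String lhs, String... rhs) {
        rules.computeIfAbsent(lhs, k -> new ArrayList<>()).add(Arrays.asList(rhs));
    }

    static void add(List<List<State>> chart, int i, State s) {
        if (!chart.get(i).contains(s)) chart.get(i).add(s);         // never add duplicate states
    }

    public static void main(String[] args) {
        rule("E", "E", "MINUS", "E"); rule("E", "E", "TIMES", "E");
        rule("E", "E", "EQUALS", "E"); rule("E", "E", "LESS", "E");
        rule("E", "LP", "E", "RP"); rule("E", "A"); rule("E", "B"); rule("E", "C");
        posCats.addAll(Arrays.asList("MINUS", "TIMES", "EQUALS", "LESS", "LP", "RP", "A", "B", "C"));
        posOf.put("-", List.of("MINUS")); posOf.put("*", List.of("TIMES"));
        posOf.put("=", List.of("EQUALS")); posOf.put("<", List.of("LESS"));
        posOf.put("(", List.of("LP")); posOf.put(")", List.of("RP"));
        posOf.put("a", List.of("A")); posOf.put("b", List.of("B")); posOf.put("c", List.of("C"));

        String[] input = { "a", "*", "b" };
        List<List<State>> chart = new ArrayList<>();
        for (int i = 0; i <= input.length; i++) chart.add(new ArrayList<>());
        chart.get(0).add(new State("GAMMA", List.of("E"), 0, 0, 0)); // dummy start state (λ -> *E)

        for (int i = 0; i <= input.length; i++) {
            for (int j = 0; j < chart.get(i).size(); j++) {          // chart[i] may grow while we scan it
                State s = chart.get(i).get(j);
                if (!s.complete() && !posCats.contains(s.next())) {  // PREDICT
                    for (List<String> rhs : rules.getOrDefault(s.next(), List.of()))
                        add(chart, i, new State(s.next(), rhs, 0, i, i));
                } else if (!s.complete()) {                          // SCAN (next symbol is a POS category)
                    if (i < input.length && posOf.getOrDefault(input[i], List.of()).contains(s.next()))
                        add(chart, i + 1, new State(s.next(), List.of(input[i]), 1, i, i + 1));
                } else {                                             // COMPLETE
                    for (int k = 0; k < chart.get(s.from).size(); k++) {
                        State old = chart.get(s.from).get(k);
                        if (!old.complete() && old.next().equals(s.lhs))
                            add(chart, i, new State(old.lhs, old.rhs, old.dot + 1, old.from, i));
                    }
                }
            }
        }
        // Accept if the dummy state has been completed over the whole input.
        boolean accepted = chart.get(input.length).contains(new State("GAMMA", List.of("E"), 1, 0, input.length));
        System.out.println("accepted: " + accepted); // prints "accepted: true" for "a * b"
    }
}
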
  39. 39. Example INPUT: “a * b” E ::= E MINUS E E ::= E TIMES E E ::= E EQUALS E E ::= E LESS E E ::= LP E RP E ::= A E ::= B E ::= C MINUS ::= - TIMES ::= * EQUALS ::= = LESS ::= < LP ::= ( RP ::= ) A ::= a B ::= b C ::= c CFG G 1) ((λ → *E), 0, 0) // dummy prediction to start everything 2) ((E → *E MINUS E), 0, 0) // prediction 3) ((E → *E TIMES E), 0, 0) // prediction 4) ((E → *E EQUALS E), 0, 0) // prediction 5) ((E → *E LESS E), 0, 0) // prediction 6) ((E → *LP E RP), 0, 0) // prediction 7) ((E → *A), 0, 0) // prediction 8) ((E → *B), 0, 0) // prediction 9) ((E → *C), 0, 0) // prediction 10) ((A → a *), 0, 1) // scanning 11) ((E → A *), 0, 1) // completion of parser state 7 12) ((λ → E *), 0, 1) // completion of parser state 1 13) ((E → E * MINUS E), 0, 1) // completion of parser state 2 14) ((E → E * TIMES E), 0, 1) // completion of parser state 3 15) ((E → E * EQUALS E), 0, 1) // completion of parser state 4 16) ((E → E * LESS E), 0, 1) // completion of parser state 5
  40. 40. Implementation Notes
  41. 41. CFGSymbol.java public class CFGSymbol { String mSymbolName; public CFGSymbol() { mSymbolName = ""; } public CFGSymbol(String n) { mSymbolName = new String(n); } public CFGSymbol(CFGSymbol s) { mSymbolName = new String(s.mSymbolName); } public String toString() { return mSymbolName; } public boolean isEqual(CFGSymbol s) { if ( s == null ) return false; else return this.mSymbolName.equals(s.mSymbolName); } } source code is here
  42. 42. CFGrammar.java public class CFGrammar { protected ArrayList<String> mVariables; // sorted array of variables protected ArrayList<String> mTerminals; // sorted array of terminals protected TreeMap<String, ArrayList<CFProduction>> mProductions; // variable to // productions map where variable is lhs protected CFGSymbol mStartSymbol; protected ArrayList<CFProduction> mIdToProductionMap; // given an id, find a rule. protected ArrayList<String> mPosVars; // sorted lists of parts of speech. protected TreeMap<String, ArrayList<CFGSymbol>> mTerminalsToPosVarsMap; … } source code is here
  43. 43. CFGrammar.java: Compiling CFG from File public CFGrammar() { mVariables = new ArrayList<String>(); mTerminals = new ArrayList<String>(); mProductions = new TreeMap<String, ArrayList<CFProduction>>(); mIdToProductionMap = new ArrayList<CFProduction>(); mPosVars = new ArrayList<String>(); mTerminalsToPosVarsMap = new TreeMap<String, ArrayList<CFGSymbol>>(); mStartSymbol = null; } public CFGrammar(String grammarFilePath) { this(); compileCFGGrammar(grammarFilePath); } source code is here
  44. 44. CFProduction.java public class CFProduction { int mID; CFGSymbol mLHS; // left-hand side ArrayList<CFGSymbol> mRHS; // right-hand side // production ID defaults to 0 public CFProduction(CFGSymbol lhs, ArrayList<CFGSymbol> rhs) { this.mID = 0; this.mLHS = new CFGSymbol(lhs); this.mRHS = new ArrayList<CFGSymbol>(); Iterator<CFGSymbol> iter = rhs.iterator(); while ( iter.hasNext() ) { this.mRHS.add(new CFGSymbol(iter.next())); } } public CFProduction(int id, CFGSymbol lhs, ArrayList<CFGSymbol> rhs) { this(lhs, rhs); this.mID = id; } } source code is here
  45. 45. RecognizerState.java public class RecognizerState { int mRuleNum; // number of a cfg rule that this state tracks. int mDotPos; // dot position in the rhs of the rule. int mInputFromPos; // where in the input the rule starts. int mUptoPos; // where in the input the rule ends. CFProduction mTrackedRule; // the actual CFG rule being tracked; the rule's number == ruleNum. public CFGSymbol nextCat() { if ( mDotPos < mTrackedRule.mRHS.size() ) return mTrackedRule.mRHS.get(mDotPos); else return null; } public boolean isComplete() { return mDotPos == mTrackedRule.mRHS.size(); } public int getDotPos() { return mDotPos; } public int getFromPos() { return mInputFromPos; } public int getUptoPos() { return mUptoPos; } public CFProduction getCFGRule() { return mTrackedRule; } public int getRuleNum() { return mRuleNum; } … } source code is here
  46. 46. ParserState.java public class ParserState extends RecognizerState { static int mCount = 0; int mID = 0; // this is the unique id of this ParserState ArrayList<ParserState> mParentParserStates = null; // parent parser states that generated this state String mOrigin = "N"; // it can be N (None), S (Scanner), P (Predictor), C (Completer) public ParserState(int ruleNum, int dotPos, int fromPos, int uptoPos, CFProduction r) { super(ruleNum, dotPos, fromPos, uptoPos, r); mID = mCount; mCount = mCount + 1; mParentParserStates = new ArrayList<ParserState>(); } String getID() { return "PS" + mID; } void addPreviousState(ParserState ps) { mParentParserStates.add(ps); } // add all previous states of ps into this state. void addPreviousStatesOf(ParserState ps) { if ( ps.mParentParserStates.isEmpty() ) return; Iterator<ParserState> iter = ps.mParentParserStates.iterator(); while ( iter.hasNext() ) { addPreviousState(iter.next()); } } void setOrigin(String origin) { mOrigin = origin; } source code is here
  47. 47. Retrieving Parse Trees
  48. 48. Retrieval of Parse Trees  Each state is augmented with an additional member variable pointing to its parents, i.e., the completed parser states that generated it  The parse tree retrieval algorithm begins with each parser state that spans the entire input and completes a production whose LHS is S  The algorithm recursively retrieves a sub-tree for each of the parents and then links these sub-trees into the tree of the current parser state
  49. 49. Retrieval of Parse Trees private ArrayList<ParseTree> getAllParseTrees(ArrayList<CFGSymbol> words) { parseCFGSymbols(words); if ( !isInputAccepted(words.size()) ) return null; ArrayList<ParserState> pstates = getFinalStates(words.size()); ArrayList<ParseTree> ptrees = new ArrayList<ParseTree>(); Iterator<ParserState> iter = pstates.iterator(); while ( iter.hasNext() ) { ptrees.add(toParseTree(iter.next())); } return ptrees; } public ArrayList<ParseTree> parse(String input) { return getAllParseTrees((splitStringIntoSymbols(input))); } public ArrayList<ParseTree> parse(ArrayList<String> input) { ArrayList<CFGSymbol> symbols = new ArrayList<CFGSymbol>(); if ( input == null ) { return getAllParseTrees(symbols); } else { Iterator<String> iter = input.iterator(); while ( iter.hasNext() ) { symbols.add(new CFGSymbol(iter.next())); } return getAllParseTrees(symbols); } }
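
A hypothetical usage sketch of the parse() methods above; the parser class name (EarleyParser), its constructor, and the grammar file path are assumptions, since the slide shows only the methods themselves (CFGrammar's file-based constructor appears on slide 43).

// Hypothetical driver; EarleyParser and the grammar file path are assumed names.
CFGrammar grammar = new CFGrammar("grammars/expr.txt");   // compile a CFG from a file
EarleyParser parser = new EarleyParser(grammar);          // assumed constructor
ArrayList<ParseTree> trees = parser.parse("a * b");       // null if the input is rejected
if (trees != null) {
    for (ParseTree tree : trees) {
        System.out.println(tree);                         // one tree per distinct parse
    }
}
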
  50. 50. Part-of-Speech Tagging With Apache OpenNLP
  51. 51. Installing Apache OpenNLP  Download & unzip the JAR files from http://opennlp.apache.org/cgi-bin/download.cgi  Create a Java project and add the JARs to your project  I included the following JARs:  jwnl-1.3.3.jar  opennlp-maxent-3.0.3.jar  opennlp-tools-1.5.3.jar  opennlp-uima-1.5.3.jar  Download the language model files from http://opennlp.sourceforge.net/models-1.5/  Place the model files into a directory
  52. 52. Tokenization of Input  Apache OpenNLP provides a set of tools that can be used to: - Split text into tokens or sentences - Do part-of-speech tagging and use the output as the input to a parser - The categories used in OpenNLP POS tagging are at http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html Java example is here
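
Since the linked Java example is not reproduced on the slide, here is a minimal sketch of tokenization and POS tagging with the OpenNLP 1.5.x API; the model file names (en-token.bin, en-pos-maxent.bin) come from the models-1.5 download page, while the local models/ directory and the class name are assumptions.

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class OpenNLPTagDemo {
    public static void main(String[] args) throws Exception {
        try (InputStream tokIn = new FileInputStream("models/en-token.bin");
             InputStream posIn = new FileInputStream("models/en-pos-maxent.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
            POSTaggerME tagger = new POSTaggerME(new POSModel(posIn));
            String[] tokens = tokenizer.tokenize("Book a flight to Houston.");
            String[] tags = tagger.tag(tokens);   // Penn Treebank tags, e.g. VB, DT, NN
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
        }
    }
}
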
  53. 53. References & Reading Suggestions  Earley, J. (1968). An Efficient Context-Free Parsing Algorithm. Ph.D. thesis, CMU, Pittsburgh, PA  Jurafsky & Martin. Speech & Language Processing. Prentice Hall  Hopcroft & Ullman. Introduction to Automata Theory, Languages, and Computation. Narosa Publishing House  Moll, Arbib, & Kfoury. An Introduction to Formal Language Theory
