GATE, HLT and Machine Learning, Sheffield, July 2003


  1. 1. <ul><li>GATE, Human Language and Machine Learning </li></ul><ul><li>http://gate.ac.uk/ http://nlp.shef.ac.uk/ </li></ul><ul><li>Hamish Cunningham, Valentin Tablan, </li></ul><ul><li>Kalina Bontcheva, Diana Maynard </li></ul><ul><li>9th July 2003 </li></ul><ul><ul><li>The Knowledge Economy and Human Language Technology </li></ul></ul><ul><ul><li>GATE: a General Architecture for Text Engineering </li></ul></ul><ul><ul><li>GATE, Information Extraction and Machine Learning </li></ul></ul>
  2. 2. <ul><li>1. The Knowledge Economy and Human Language </li></ul><ul><li>Gartner, December 2002: </li></ul><ul><li>taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications </li></ul><ul><li>through 2012 more than 95% of human-to-computer information input will involve textual language </li></ul><ul><li>A contradiction: formal knowledge in semantics-based systems vs. ambiguous informal natural language </li></ul><ul><li>The challenge: to reconcile these two opposing tendencies </li></ul>
  3. 3. Information Extraction (1): from text to structured data <ul><li>Two trends in the early 1990s: </li></ul><ul><li>NLU: too difficult! Restrict the task and increase the performance </li></ul><ul><li>Quantitative measurement (MUC – Message Understanding Conference, ACE – Automatic Content Extraction, TREC – Text Retrieval Conference...) means good estimation of accuracy </li></ul><ul><li>Types of extraction: </li></ul><ul><li>Identify named entities (domain independent) </li></ul><ul><ul><li>Persons </li></ul></ul><ul><ul><li>Dates </li></ul></ul><ul><ul><li>Numbers </li></ul></ul><ul><ul><li>Organizations </li></ul></ul><ul><li>Identify domain-specific events and terms; e.g., if we’re processing football: </li></ul><ul><ul><li>Relations: which team a player plays for </li></ul></ul><ul><ul><li>Events: goal, foul, etc. </li></ul></ul>
  4. 4. Information Extraction (2) <ul><li>MUC-7 tasks </li></ul><ul><li>NE: Named Entity recognition and typing </li></ul><ul><li>CO: co-reference resolution </li></ul><ul><li>TE: Template Elements </li></ul><ul><li>TR: Template Relations </li></ul><ul><li>ST: Scenario Templates </li></ul><ul><li>Example: </li></ul><ul><li>The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc. </li></ul><ul><li>NE: entities are &quot;rocket&quot;, &quot;Tuesday&quot;, &quot;Dr. Head&quot; and &quot;We Build Rockets&quot; </li></ul><ul><li>CO: &quot;it&quot; refers to the rocket; &quot;Dr. Head&quot; and &quot;Dr. Big Head&quot; are the same </li></ul><ul><li>TE: the rocket is &quot;shiny red&quot; and Head's &quot;brainchild&quot;. </li></ul><ul><li>TR: Dr. Head works for We Build Rockets Inc. </li></ul><ul><li>ST: a rocket launching event occurred with the various participants. </li></ul>
  5. 5. IE and Knowledge: Closing the Language Loop. [Diagram labels: Human Language; Formal Knowledge (ontologies and instance bases); (A)IE; CLIE; (M)NLG; Controlled Language; OIE; Semantic Web; Semantic Grid; Semantic Web Services.] Key: MNLG: Multilingual Natural Language Generation. OIE: Ontology-aware Information Extraction. AIE: Adaptive IE. CLIE: Controlled Language IE.
  6. 6. Populating Ontologies with IE
  7. 7. Protégé and Ontology Management
  8. 8. IE: the bad news… Domain specificity vs. task complexity. [Figure: axes &quot;complexity&quot; and &quot;specificity&quot;; labels: “acceptable” accuracy, domain specific, very general, simple entities, events and relations.]
  9. 9. <ul><li>2. GATE: Software Architecture for HLT </li></ul><ul><li>Software lifecycle in collaborative research </li></ul><ul><ul><li>Project Proposal : We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to. </li></ul></ul><ul><ul><li>Analysis and Design : Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg. </li></ul></ul><ul><ul><li>Implementation : Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator. </li></ul></ul><ul><ul><li>Integration and Testing : The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input (&quot;well, you know, it's still a prototype...&quot;). </li></ul></ul><ul><ul><li>Evaluation : Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the World over (or at least the European business travel industry). </li></ul></ul>
  10. 10. <ul><ul><li>GATE, a General Architecture for Text Engineering </li></ul></ul><ul><ul><li>An architecture: A macro-level organisational picture for LE software systems. </li></ul></ul><ul><ul><li>A framework: For programmers, GATE is an object-oriented class library that implements the architecture. </li></ul></ul><ul><ul><li>A development environment: For language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction. </li></ul></ul><ul><ul><li>Some free components... ...and wrappers for other people's components </li></ul></ul><ul><ul><li>Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc. </li></ul></ul><ul><ul><li>Free software (LGPL). Download at http://gate.ac.uk/download/ </li></ul></ul>
  11. 11. <ul><ul><li>Architectural principles </li></ul></ul><ul><ul><li>Non-prescriptive, theory neutral (strength and weakness) </li></ul></ul><ul><ul><li>Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka) </li></ul></ul><ul><ul><li>(Almost) everything is a component, and component sets are user-extendable </li></ul></ul><ul><ul><li>Component-based development </li></ul></ul><ul><ul><li>An OO way of chunking software: Java Beans </li></ul></ul><ul><ul><li>GATE components: CREOLE = modified Java Beans (Collection of REusable Objects for Language Engineering) </li></ul></ul><ul><ul><li>The minimal component = 10 lines of Java, 10 lines of XML, 1 URL. </li></ul></ul>
  12. 12. <ul><li>GATE Language Resources </li></ul><ul><ul><li>GATE LRs are documents, ontologies, corpora, lexicons, … </li></ul></ul><ul><ul><li>Documents / corpora: </li></ul></ul><ul><ul><li>GATE documents loaded from local files or the web... </li></ul></ul><ul><ul><li>Diverse document formats: text, HTML, XML, email, RTF, SGML. </li></ul></ul><ul><li>Processing Resources </li></ul><ul><ul><li>Algorithmic components known as PRs: beans with execute methods. </li></ul></ul><ul><ul><li>All PRs can handle Unicode data by default. </li></ul></ul><ul><ul><li>Clear distinction between code and data (simple repurposing). </li></ul></ul><ul><ul><li>20-30 freebies with GATE </li></ul></ul><ul><ul><li>e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene </li></ul></ul>
  13. 13. Visual Resources
  14. 14. <ul><li>  Performance Evaluation </li></ul><ul><li>At document level – annotation diff </li></ul><ul><li>At corpus level – corpus benchmark tool – tracking system’s performance over time </li></ul>
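The document-level comparison above can be illustrated with a small sketch. It assumes strict matching (a response annotation counts as correct only if its start, end, and type all equal those of a key annotation) and reports precision, recall, and F1. The `Span` and `AnnotationDiffSketch` classes are hypothetical names for this illustration and are not GATE classes; GATE's own diff tool may apply other matching modes.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Hypothetical annotation span for this sketch; not a GATE class. */
final class Span {
    final int start;
    final int end;
    final String type;

    Span(int start, int end, String type) {
        this.start = start;
        this.end = end;
        this.type = type;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Span)) return false;
        Span s = (Span) o;
        return start == s.start && end == s.end && type.equals(s.type);
    }

    @Override
    public int hashCode() {
        return Objects.hash(start, end, type);
    }
}

final class AnnotationDiffSketch {
    /** Returns {precision, recall, F1} under strict matching (an assumption of this sketch). */
    static double[] score(List<Span> key, List<Span> response) {
        Set<Span> keySet = new HashSet<>(key);
        Set<Span> responseSet = new HashSet<>(response);
        // Correct annotations are those present in both the key and the response.
        Set<Span> correct = new HashSet<>(responseSet);
        correct.retainAll(keySet);
        double p = responseSet.isEmpty() ? 0.0 : (double) correct.size() / responseSet.size();
        double r = keySet.isEmpty() ? 0.0 : (double) correct.size() / keySet.size();
        double f1 = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f1};
    }

    public static void main(String[] args) {
        List<Span> key = List.of(new Span(0, 8, "Person"), new Span(20, 27, "Date"));
        List<Span> response = List.of(new Span(0, 8, "Person"), new Span(30, 35, "Date"));
        double[] s = score(key, response);
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n", s[0], s[1], s[2]);
    }
}
```

Running the same scoring over a whole corpus after each change to the system gives the kind of performance-over-time tracking that the corpus-level bullet describes.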
  15. 15. Regression Test – Corpus Benchmark Tool
  16. 16. Information Retrieval Based on the Lucene IR engine
  17. 17. Editing Multilingual Data <ul><li>GATE Unicode Kit (GUK) </li></ul><ul><ul><li>Java provides no special support for text input (this may change) </li></ul></ul><ul><li>Support for defining additional Input Methods (IMs) </li></ul><ul><li>currently 30 IMs for 17 languages </li></ul><ul><li>Pluggable in other applications </li></ul>
  18. 18. Processing Multilingual Data All the visualisation and editing tools for ML LRs use enhanced Java facilities:
  19. 19. A bit of a nuisance (users) <ul><li>GATE team projects: </li></ul><ul><li>Conceptual indexing: MUMIS: automatic semantic indices for sports video </li></ul><ul><li>MUSE, cross-genre entity finder </li></ul><ul><li>HSL, Health-and-safety IE </li></ul><ul><li>ETCSL: collaboration with IOAS Oxford on Sumerian </li></ul><ul><li>Old Bailey: collaboration with HRI on 17th century court reports </li></ul><ul><li>Multiflora: plant taxonomy text analysis for biodiversity research e-science </li></ul><ul><li>Advanced Knowledge Technologies: €12m UK five site collaborative project </li></ul><ul><li>H-TechSight: knowledge portal for Chemical Engineers </li></ul><ul><li>Framework 6: SEKT, PrestoSpace, KnowledgeWeb </li></ul><ul><li>A representative fraction of GATE users: </li></ul><ul><li>IBM TJ Watson, US </li></ul><ul><li>the American National Corpus project, US </li></ul><ul><li>the Perseus Digital Library project, Tufts University, US </li></ul><ul><li>Longman Pearson publishing, UK </li></ul><ul><li>Merck KGaA, Germany </li></ul><ul><li>Canon Europe, UK </li></ul><ul><li>Knight Ridder (the second biggest US news publisher) </li></ul><ul><li>BBN (leading HLT research lab), US </li></ul><ul><li>SMEs in Sirma AI Ltd., Bulgaria </li></ul><ul><li>Imperial College, London, the University of Manchester, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities </li></ul><ul><li>UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, MUSE, Poesia... </li></ul>
  20. 20. 3. Machine Learning in GATE <ul><li>Uses classification. </li></ul><ul><ul><li>[Attr 1, Attr 2, Attr 3, … Attr n] → Class </li></ul></ul><ul><li>Classifies annotations. </li></ul><ul><ul><li>(Documents can be classified as well using a simple trick.) </li></ul></ul><ul><li>Annotations of a particular type are selected as instances. </li></ul><ul><li>Attributes refer to instance annotations. </li></ul><ul><li>Attributes have a position relative to the instance annotation they refer to. </li></ul>
  21. 21. Attributes <ul><li>Attributes can be: </li></ul><ul><ul><li>Boolean </li></ul></ul><ul><ul><ul><li>The [lack of] presence of an annotation of a particular type [partially] overlapping the referred instance annotation. </li></ul></ul></ul><ul><ul><li>Nominal </li></ul></ul><ul><ul><ul><li>The value of a particular feature of the referred instance annotation. The complete set of acceptable values must be specified a-priori. </li></ul></ul></ul><ul><ul><li>Numeric </li></ul></ul><ul><ul><ul><li>The numeric value (converted from String) of a particular feature of the referred instance annotation. </li></ul></ul></ul>
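A minimal sketch of how nominal attributes with relative positions could be turned into one attribute vector per instance. It assumes the instances are tokens, that each token carries a single `category` value, and that positions outside the document are filled with a `?` placeholder; the class name, method name, and placeholder are all hypothetical and not taken from GATE.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionalAttributesSketch {
    /** Placeholder for positions that fall outside the document (an assumption of this sketch). */
    static final String MISSING = "?";

    /**
     * Builds the attribute vector for the instance at index i.
     * Each entry of positions is an offset relative to the instance:
     * -1 is the previous token, 0 the token itself, +1 the next token.
     */
    static List<String> attributesFor(List<String> categories, int i, int[] positions) {
        List<String> vector = new ArrayList<>();
        for (int offset : positions) {
            int j = i + offset;
            vector.add(j >= 0 && j < categories.size() ? categories.get(j) : MISSING);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> categories = List.of("DT", "JJ", "NN", "VBD");
        int[] positions = {-1, 0, 1};
        // One line per instance: [previous, current, next]
        for (int i = 0; i < categories.size(); i++) {
            System.out.println(attributesFor(categories, i, positions));
        }
    }
}
```

Boolean and numeric attributes would fit the same loop: the lookup at each position would return a presence flag or a parsed number instead of a category string.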
  22. 22. Implementation <ul><li>Machine Learning PR in GATE. </li></ul><ul><li>Has two functioning modes: </li></ul><ul><ul><li>training </li></ul></ul><ul><ul><li>application </li></ul></ul><ul><li>Uses an XML file for configuration: </li></ul><ul><li><?xml version=&quot;1.0&quot; encoding=&quot;windows-1252&quot;?> </li></ul><ul><ul><li><ML-CONFIG> </li></ul></ul><ul><ul><ul><li><DATASET> … </DATASET> </li></ul></ul></ul><ul><ul><ul><li><ENGINE>…</ENGINE> </li></ul></ul></ul><ul><ul><li></ML-CONFIG> </li></ul></ul>
  23. 23. <DATASET> <ul><li><DATASET> </li></ul><ul><li><INSTANCE-TYPE>Token</INSTANCE-TYPE> </li></ul><ul><li><ATTRIBUTE> </li></ul><ul><li><NAME>POS_category(0)</NAME> </li></ul><ul><li><TYPE>Token</TYPE> </li></ul><ul><li><FEATURE>category</FEATURE> </li></ul><ul><li><POSITION>0</POSITION> </li></ul><ul><li><VALUES> </li></ul><ul><li><VALUE>NN</VALUE> </li></ul><ul><li><VALUE>NNP</VALUE> </li></ul><ul><li><VALUE>NNPS</VALUE> </li></ul><ul><li>… </li></ul><ul><li></VALUES> </li></ul><ul><li>[<CLASS/>] </li></ul><ul><li></ATTRIBUTE> </li></ul><ul><li>… </li></ul><ul><li></DATASET> </li></ul>
  24. 24. <ENGINE> <ul><li><ENGINE> </li></ul><ul><li><WRAPPER> gate.creole.ml.weka.Wrapper </WRAPPER> </li></ul><ul><li><OPTIONS> </li></ul><ul><li><CLASSIFIER>weka.classifiers.j48.J48</CLASSIFIER> </li></ul><ul><li><CLASSIFIER-OPTIONS>-K 3</CLASSIFIER-OPTIONS> </li></ul><ul><li><CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD> </li></ul><ul><li></OPTIONS> </li></ul><ul><li></ENGINE> </li></ul>
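To make the shape of the configuration concrete, here is a minimal sketch that assembles the elements shown on the three slides above into one document and reads a few values back with the JDK's DOM parser. It is not how GATE loads the file (the `setOptions` signature elsewhere in the deck suggests JDOM); the `MlConfigSketch` class and its helper methods are hypothetical. The value list is elided on the slide, so only the three values shown there appear here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class MlConfigSketch {
    // Only elements and values that appear on the slides; nothing elided there is filled in.
    static final String CONFIG =
        "<ML-CONFIG>"
      + "<DATASET>"
      + "<INSTANCE-TYPE>Token</INSTANCE-TYPE>"
      + "<ATTRIBUTE>"
      + "<NAME>POS_category(0)</NAME><TYPE>Token</TYPE>"
      + "<FEATURE>category</FEATURE><POSITION>0</POSITION>"
      + "<VALUES><VALUE>NN</VALUE><VALUE>NNP</VALUE><VALUE>NNPS</VALUE></VALUES>"
      + "<CLASS/>"
      + "</ATTRIBUTE>"
      + "</DATASET>"
      + "<ENGINE>"
      + "<WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>"
      + "<OPTIONS>"
      + "<CLASSIFIER>weka.classifiers.j48.J48</CLASSIFIER>"
      + "<CLASSIFIER-OPTIONS>-K 3</CLASSIFIER-OPTIONS>"
      + "<CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>"
      + "</OPTIONS>"
      + "</ENGINE>"
      + "</ML-CONFIG>";

    /** Parses the XML string; checked parser exceptions are wrapped for brevity. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse configuration", e);
        }
    }

    /** Trimmed text of the first element with the given tag name. */
    static String text(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    /** Name of the attribute marked with <CLASS/>, or null if none is marked. */
    static String classAttribute(Document doc) {
        NodeList attrs = doc.getElementsByTagName("ATTRIBUTE");
        for (int i = 0; i < attrs.getLength(); i++) {
            Element a = (Element) attrs.item(i);
            if (a.getElementsByTagName("CLASS").getLength() > 0) {
                return a.getElementsByTagName("NAME").item(0).getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = parse(CONFIG);
        System.out.println("instance type:   " + text(doc, "INSTANCE-TYPE"));
        System.out.println("class attribute: " + classAttribute(doc));
        System.out.println("classifier:      " + text(doc, "CLASSIFIER"));
    }
}
```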
  25. 25. Attributes Position Instances type: Token
  26. 26. Machine Learning PR <ul><li>Can save a learnt model to an external file for later use. </li></ul><ul><ul><li>Saves the actual model and the collected dataset. </li></ul></ul><ul><li>Can export the collected dataset in .arff format. </li></ul>
  27. 27. Standard Use Scenario <ul><li>Training </li></ul><ul><li>Prepare training data by enriching the documents with annotation for attributes. (e.g. run Tokeniser, POS tagger, Gazetteer, etc). </li></ul><ul><li>Run the ML PR in training mode. </li></ul><ul><li>Export the dataset as .arff and perform experiments using the WEKA interface in order to find the best attribute set / algorithm / algorithm options. </li></ul><ul><li>Update the configuration file accordingly. </li></ul><ul><li>Run the ML PR again to collect the actual data. </li></ul><ul><li>[ Save the learnt model. ] </li></ul><ul><li>Application </li></ul><ul><li>Prepare data by enriching the documents with annotation for attributes. (e.g. run Tokeniser, POS tagger, Gazetteer, etc). </li></ul><ul><li>[ Load the previously saved model. ] </li></ul><ul><li>Run the ML PR in application mode. </li></ul><ul><li>[ Save the learnt model. ] </li></ul>
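The export step in the training scenario produces a file in Weka's ARFF text format. As a rough illustration of what such a file contains, the sketch below renders nominal attributes and data rows as ARFF text. The `ArffSketch` class, the relation name, and the attribute names are hypothetical, and the sketch does no quoting or escaping, so it assumes names and values without spaces or special characters.

```java
import java.util.List;

final class ArffSketch {
    /**
     * Renders nominal attributes and data rows in ARFF layout:
     * a @relation line, one @attribute line per attribute, then @data rows.
     */
    static String toArff(String relation, List<String> names,
                         List<List<String>> values, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (int i = 0; i < names.size(); i++) {
            sb.append("@attribute ").append(names.get(i))
              .append(" {").append(String.join(",", values.get(i))).append("}\n");
        }
        sb.append("\n@data\n");
        for (List<String> row : rows) {
            sb.append(String.join(",", row)).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String arff = toArff(
            "pos_context",
            List.of("pos_prev", "pos"),
            List.of(List.of("DT", "JJ"), List.of("NN", "NNP")),
            List.of(List.of("DT", "NN"), List.of("JJ", "NNP")));
        System.out.print(arff);
    }
}
```

By Weka's usual convention the class attribute is listed last, which is why `pos` follows `pos_prev` in this example.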
  28. 28. An Example <ul><li>Learn POS category from POS context. </li></ul>
  29. 29. Using Other ML Libraries <ul><li>The MLEngine Interface </li></ul><ul><li>Method Summary </li></ul><ul><li>void addTrainingInstance(List attributes): Adds a new training instance to the dataset. </li></ul><ul><li>Object classifyInstance(List attributes): Classifies a new instance. </li></ul><ul><li>void init(): This method will be called after an engine is created and has its dataset and options set. </li></ul><ul><li>void setDatasetDefinition(DatasetDefintion definition): Sets the definition for the dataset used. </li></ul><ul><li>void setOptions(org.jdom.Element options): Sets the options from an XML JDom element. </li></ul><ul><li>void setOwnerPR(ProcessingResource pr): Registers the PR using the engine with the engine. </li></ul>
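To show the kind of work an engine behind this interface does, here is a self-contained sketch of a very simple learner. It does not implement the real GATE interface: `SimpleEngine` is a hypothetical stand-in that mirrors only three of the method names above so the example compiles without GATE. The sketch further assumes that a training instance carries its class label as the last list element and that an instance to classify carries only the non-class attributes; the real interface may order things differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in mirroring three MLEngine method names; not the GATE interface. */
interface SimpleEngine {
    void init();
    void addTrainingInstance(List<String> attributes);
    Object classifyInstance(List<String> attributes);
}

/** Predicts the most frequent class seen for an attribute context, else the most frequent overall. */
final class MostFrequentClassEngine implements SimpleEngine {
    private Map<List<String>, Map<String, Integer>> byContext;
    private Map<String, Integer> overall;

    @Override
    public void init() {
        byContext = new HashMap<>();
        overall = new HashMap<>();
    }

    /** Assumption of this sketch: the last element of the list is the class label. */
    @Override
    public void addTrainingInstance(List<String> attributes) {
        int last = attributes.size() - 1;
        String label = attributes.get(last);
        List<String> context = new ArrayList<>(attributes.subList(0, last));
        byContext.computeIfAbsent(context, k -> new HashMap<>()).merge(label, 1, Integer::sum);
        overall.merge(label, 1, Integer::sum);
    }

    /** Assumption: the list holds only non-class attributes. Returns null if nothing was learnt. */
    @Override
    public Object classifyInstance(List<String> attributes) {
        Map<String, Integer> counts = byContext.getOrDefault(attributes, overall);
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MostFrequentClassEngine engine = new MostFrequentClassEngine();
        engine.init();
        // Each training instance: [POS of previous token, POS of current token (the class)]
        engine.addTrainingInstance(List.of("DT", "NN"));
        engine.addTrainingInstance(List.of("DT", "NN"));
        engine.addTrainingInstance(List.of("DT", "JJ"));
        engine.addTrainingInstance(List.of("TO", "VB"));
        System.out.println(engine.classifyInstance(List.of("DT")));
        System.out.println(engine.classifyInstance(List.of("TO")));
    }
}
```

A wrapper around another ML library would follow the same outline: collect instances in `addTrainingInstance`, then delegate to the library's model in `classifyInstance`.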
  30. 30. <ul><li>Conclusion </li></ul><ul><ul><li>GATE is: </li></ul></ul><ul><ul><li>Addressing the need for scalable, reusable, and portable HLT solutions </li></ul></ul><ul><ul><li>Supporting large data, in multiple media, languages, formats, and locations </li></ul></ul><ul><ul><li>Lowering the cost of creation of new language processing components </li></ul></ul><ul><ul><li>Promoting quantitative evaluation metrics via tools and a level playing field </li></ul></ul><ul><ul><li>Promoting experimental repeatability by developing and supporting free software </li></ul></ul><ul><ul><li>Perhaps it may become: </li></ul></ul><ul><ul><li>A vehicle for the spread of collaborative experiments in ML and HLT? </li></ul></ul>
