Apache UIMA Introduction


Slides for GiW 2010/2011 Course

Published in: Technology

  1. Apache UIMA Introduction. Gestione delle Informazioni su Web (Web Information Management) - 2010/2011. Tommaso Teofili, tommaso [at] apache [dot] org
  2. UIM? Unstructured Information Management. A wide topic: text, audio, video. Different (possibly mixed) approaches: NLP, Machine Learning, IR, Ontologies, Automated reasoning, Knowledge Sources. Apache UIMA.
  3. Apache Software Foundation: a non-profit corporation that “...provides organizational, legal, and financial support for a broad range of open source software projects...”, with a “...collaborative and meritocratic development process...” and the “...pragmatic Apache License...”
  4. Apache UIMA: an architectural framework to manage unstructured data (Java, C++, ...). Former IBM research project donated to the ASF. OASIS Standard for unstructured information management.
  5. Apache UIMA - Goals: “Our goal is to support a thriving community of users and developers of UIMA frameworks, tools, and annotators, facilitating the analysis of unstructured content such as text, audio and video”
  6. Apache UIMA - bridging worlds
  7. Apache UIMA - Overview: UIMA supports the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies.
  8. Apache UIMA - Multimodal Analysis: multimodal analysis means the ability to process a resource from various “points of view”. Example: a video stream from which we want to extract subtitles and also automatically recognize the actors involved. Here, though, we are mainly interested in text...
  9. Sample scenario: a Content Management System containing free-text articles about movies. We want such articles to be automatically enriched with metadata contained in the text (movies, directors, actors/actresses, distribution) and linked to “similar” articles (i.e. articles dealing with the same movies or actors), so that we can search for “similar” articles.
  10. Sample scenario - articles about movies
  11. Sample scenario: UIMA can help enrich articles with metadata. Think of filling the instance variables of an Article.java object with proper values, then persisting it to a database in order to query articles dealing with the same actors.
  12. Filling Article with metadata
  13. Sample scenario - metadata
  14. UIMA - Annotations
  15. Apache UIMA - Annotation: the association of metadata, such as a label, with a region of text (or other type of artifact). For example, the label “Person” associated with the region of text “Fred Center” constitutes an annotation. We say “Person” annotates the span of text from X to Y containing exactly “Fred Center”.
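
The definition above can be pictured with a small plain-Java sketch. It does not use the UIMA API: `SpanAnnotation` is a hypothetical class invented for this illustration, and it assumes offsets are character positions with an exclusive end.

```java
// Plain-Java illustration only: SpanAnnotation is a hypothetical class,
// not part of the UIMA API.
class SpanAnnotation {
    final String label; // e.g. "Person"
    final int begin;    // offset of the first covered character (X)
    final int end;      // offset just past the last covered character (Y)

    SpanAnnotation(String label, int begin, int end) {
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the document that this annotation covers.
    String coveredText(String document) {
        return document.substring(begin, end);
    }

    public static void main(String[] args) {
        String document = "Fred Center works at the lab.";
        int begin = document.indexOf("Fred Center");
        SpanAnnotation person =
                new SpanAnnotation("Person", begin, begin + "Fred Center".length());
        System.out.println(person.label + " -> " + person.coveredText(document));
    }
}
```

The annotation stores only the label and the offsets; the covered text is always recovered from the document itself, which is why several annotations can share one artifact.
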
  16. Apache UIMA - Basic Steps: domain model definition; analysis pipeline definition, i.e. arranging components: define components that read data from sources, add and customize analysis components (patterns, dictionaries, RegEx, external services, NLP, etc.), define components that write information to target storage; analysis pipeline(s) execution.
  17. Defining the domain model within UIMA using Type Systems: the Type System is the place where we describe which metadata we would like to extract. Low representational gap. Like almost everything in UIMA, it is described (and generated!) using XML. It is possible to define multiple Type Systems for different purposes.
  18. How does UIMA extract metadata?
  19. Apache UIMA - Analysis Engines: the basic UIMA building blocks. They analyze a document, infer and record descriptive attributes (about documents or regions), and generate analysis results.
  20. Apache UIMA - AEs: Analysis Engines are described by an (XML) descriptor. They can be Primitive (a single AE) or Aggregate (a pipeline of AEs). Analysis algorithms can be switched by changing the descriptor instead of the code. Descriptors contain Type System definitions and define Capabilities.
  21. Apache UIMA - AnalysisComponent API: initialize: performs (once) any startup tasks required by this component; process: processes the resource to analyze, generating analysis results (metadata); destroy: frees all resources held, and is called only once, when the framework has finished using this component.
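
The call order described above (initialize once, process per document, destroy once) can be sketched in plain Java. This is an illustration under assumptions, not the UIMA interface: `SimpleComponent` and `LifecycleDemo` are hypothetical names, and the real AnalysisComponent methods take framework-specific arguments.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only: a simplified, hypothetical lifecycle contract.
interface SimpleComponent {
    void initialize();                     // one-time startup work
    List<String> process(String document); // called once per document
    void destroy();                        // one-time cleanup
}

class LifecycleDemo implements SimpleComponent {
    final List<String> calls = new ArrayList<>(); // records the call order
    private boolean ready;

    public void initialize() {
        ready = true;
        calls.add("initialize");
    }

    public List<String> process(String document) {
        if (!ready) {
            throw new IllegalStateException("initialize() was not called");
        }
        calls.add("process");
        // Toy "analysis": collect whitespace-separated words starting with an uppercase letter.
        List<String> results = new ArrayList<>();
        for (String word : document.trim().split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                results.add(word);
            }
        }
        return results;
    }

    public void destroy() {
        ready = false;
        calls.add("destroy");
    }

    public static void main(String[] args) {
        LifecycleDemo component = new LifecycleDemo();
        component.initialize();
        try {
            System.out.println(component.process("Michael J. Fox stars in the movie"));
            System.out.println(component.process("a second document about Rome"));
        } finally {
            component.destroy(); // always release resources, even if process() fails
        }
        System.out.println(component.calls);
    }
}
```
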
  22. Apache UIMA - Annotators: the Analysis Engine algorithm. Annotator: a software component implemented to produce and record annotations over regions of an artifact (e.g. a text document, audio, or video). Annotators implement the AnalysisComponent interface.
  23. Apache UIMA - Roles. AnalysisEngine: high-level block responsible for analysis; contains at least one AnalysisComponent. AnalysisComponent: interface for any component responsible for analyzing artifacts. Annotator: implementation of AnalysisComponent responsible for creating Annotations.
  24. Apache UIMA - AEs
  25. Analysis Engines in a Pipeline
  26. Apache UIMA - Analysis Results: Where do analysis results end up? How do annotators represent and share their results? In the CAS (Common Analysis Structure), which maintains typed indexes of extracted results.
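
As a rough mental model of "typed indexes of extracted results", the sketch below keeps one document text plus a map from type name to recorded spans. `MiniCas` is a hypothetical stand-in written for this example; the real CAS is considerably richer (type system, features, views) and has a different API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only: MiniCas is a hypothetical stand-in, not the UIMA CAS API.
class MiniCas {
    final String documentText;
    // type name -> spans recorded for that type, each span stored as {begin, end}
    private final Map<String, List<int[]>> index = new LinkedHashMap<>();

    MiniCas(String documentText) {
        this.documentText = documentText;
    }

    // Records a result of the given type over [begin, end).
    void add(String type, int begin, int end) {
        List<int[]> spans = index.get(type);
        if (spans == null) {
            spans = new ArrayList<>();
            index.put(type, spans);
        }
        spans.add(new int[] {begin, end});
    }

    // Returns the covered text of every result of the given type, in insertion order.
    List<String> coveredTexts(String type) {
        List<String> texts = new ArrayList<>();
        List<int[]> spans = index.get(type);
        if (spans != null) {
            for (int[] span : spans) {
                texts.add(documentText.substring(span[0], span[1]));
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String text = "Michael J. Fox stars as Marty McFly.";
        MiniCas cas = new MiniCas(text);
        // One component records a first name...
        cas.add("Name", 0, "Michael".length());
        // ...and a later one adds Person results to the same shared structure.
        cas.add("Person", 0, "Michael J. Fox".length());
        int marty = text.indexOf("Marty McFly");
        cas.add("Person", marty, marty + "Marty McFly".length());
        System.out.println(cas.coveredTexts("Name"));
        System.out.println(cas.coveredTexts("Person"));
    }
}
```

Because every component reads from and writes to the same structure, a later annotator can build on the results of an earlier one without the two knowing about each other.
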
  27. Common Analysis Structure
  28. Which algorithms lie under AEs?
  29. Apache UIMA & NLP: NLP (Natural Language Processing) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications. It’s an AI discipline.
  30. Apache UIMA & NLP: “accomplish human-like language processing”: paraphrase an input text; translate the text into another language; answer questions about the contents of the text; draw inferences from the text.
  31. Apache UIMA & NLP: “an NLP-based IR system has the goal of providing more precise, complete information in response to a user’s real information need”. Various levels of processing.
  32. Apache UIMA - Approaches. Simplest: write RegEx and dictionaries and mix them together. NLP-like: Tokenize -> Sentence identification -> PoS Tagging -> Anaphora resolution -> Named Entity Recognition -> Coreference Identification ...
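
A minimal sketch of the "simplest" approach, mixing a dictionary lookup with a regular expression, could look as follows. It is plain Java rather than a UIMA annotator; the class name, the two dictionary entries and the year pattern are assumptions chosen only for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only: names and dictionary entries are hypothetical.
class RegexAndDictionary {
    // A tiny dictionary of known movie titles.
    static final List<String> MOVIE_TITLES = Arrays.asList("Back to the Future", "Blade Runner");
    // A four-digit year between 1900 and 2099.
    static final Pattern YEAR = Pattern.compile("\\b(?:19|20)\\d{2}\\b");

    // Returns "Label=text" entries: dictionary hits first, then regex hits.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        for (String title : MOVIE_TITLES) {
            int at = text.indexOf(title);
            while (at >= 0) {
                found.add("Movie=" + title);
                at = text.indexOf(title, at + title.length());
            }
        }
        Matcher years = YEAR.matcher(text);
        while (years.find()) {
            found.add("Year=" + years.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("Back to the Future was released in 1985."));
    }
}
```
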
  33. Analysis Engines in a Pipeline
  34. NLP - Language Identification: NLP takes advantage of language-specific syntax, forms, rules and meanings. It is not easy to write language-independent extraction algorithms. Language identification is often the first block of NLP pipelines. Techniques: stopword dictionaries, statistical models, etc.
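
The stopword-dictionary technique can be sketched as counting, for each language, how many words of the text appear in that language's stopword list. The class name and the (very short) stopword lists below are assumptions for illustration; a usable identifier would need much larger lists or a statistical model.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only: hypothetical class with toy stopword lists.
class StopwordLanguageGuesser {
    static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "of", "is", "in", "to")));
        STOPWORDS.put("it", new HashSet<>(Arrays.asList("il", "la", "e", "di", "che", "un")));
    }

    // Returns the language code with the most stopword hits, or "unknown" if there are none.
    static String guess(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("The director of the movie is in Rome"));
        System.out.println(guess("Il regista e la moglie di un attore"));
    }
}
```
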
  35. NLP - Tokens and Sentences: humans learn the meaning of words in order to understand the semantics of the whole context. Split the target text into words to be able to analyze their meaning and role. Discover sentences to later assign roles to each token. Easiest for English, Italian & co., but what about Chinese?
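
For whitespace-delimited languages, a naive segmenter can be sketched with two regular expressions. The class name and both patterns are assumptions made for this example, not what any particular UIMA annotator does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only: a deliberately naive, hypothetical segmenter.
class SimpleSegmenter {
    // A token is a run of letters/digits, or a single punctuation character.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    // Splits on whitespace that follows '.', '!' or '?'.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        for (String sentence : text.trim().split("(?<=[.!?])\\s+")) {
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    static List<String> tokens(String sentence) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(sentence);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String sentence : sentences("UIMA is a framework. It analyzes text!")) {
            System.out.println(sentence + " -> " + tokens(sentence));
        }
    }
}
```

The sketch also shows why this step is harder than it looks: an initial such as the "J." in "Michael J. Fox" would be treated as a sentence end, and a script written without spaces between words, such as Chinese, would come out as one token per sentence.
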
  36. NLP - PoS Tagging: assign a “Part of Speech” (noun, adjective, verb, etc.) to each token generated in the previous step. Many language- or domain-specific patterns can be discovered and exploited using just PoS-tagged tokens and sentences.
  37. NLP - Chunking & Parsing: parse sentences into a meaningful set or tree of relationships. Chunks are the sentence building blocks (e.g. verbal forms). The parse tree highlights the structure of a sentence. Can leverage logic analysis. (Figure labels: chunking, parsing.)
  38. NLP - Named Entity Recognition: answer the questions: where? when? who? how often? how much? Identify key entities in the text. Common techniques: dictionaries, rules, statistical models.
  39. Debugging NER in UIMA
  40. Using UIMA: define the TypeSystem; define the AnalysisEngine descriptor(s); implement the Annotator(s); execute the UIMA pipeline.
  41. Sample scenario - extract actors: tokenize the article text; identify sentences; tag PoS; identify Persons using regular expressions and PoS; use Person annotations, the Tokens’ PoS and Sentences to extract relations between terms and so identify Persons who are also Actors.
  42. Sample scenario - extract persons: I have a dictionary of first names (simple to find and/or build). I use a dictionary-based Annotator to extract annotations of first names (NameAnnotation). I don’t have a dictionary of surnames. Every time a matching name (a NameAnnotation) is found, we look for one or more subsequent tokens (to allow for persons with a double name or surname) whose PoS is “undefined” or a noun (but not a verb) and which start with an uppercase letter. If found, the name + token(s) sequence is annotated as a Person (e.g. “Michael J. Fox”).
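
The heuristic above can be sketched in plain Java. To stay self-contained, this version makes two simplifying assumptions: it has no PoS tagger, so it checks only the uppercase condition, and it treats trailing punctuation as the end of a name. `PersonFinder` and its two-entry first-name dictionary are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only: hypothetical class, no PoS information is used.
class PersonFinder {
    static final Set<String> FIRST_NAMES = new HashSet<>(Arrays.asList("Michael", "Christopher"));

    static List<String> findPersons(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> persons = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            String first = clean(tokens[i]);
            if (!FIRST_NAMES.contains(first)) {
                i++;
                continue;
            }
            StringBuilder person = new StringBuilder(first);
            // Punctuation stripped right after a token ends the candidate sequence.
            boolean closed = !first.equals(tokens[i]);
            int extra = 0;
            int j = i + 1;
            while (!closed && j < tokens.length) {
                String next = clean(tokens[j]);
                if (next.isEmpty() || !Character.isUpperCase(next.charAt(0))) {
                    break;
                }
                person.append(' ').append(next);
                extra++;
                closed = !next.equals(tokens[j]);
                j++;
            }
            // A first name alone is not enough: at least one following token is required.
            if (extra > 0) {
                persons.add(person.toString());
            }
            i = j;
        }
        return persons;
    }

    // Drops trailing punctuation, but keeps the dot of an initial such as "J.".
    static String clean(String token) {
        String t = token.replaceAll("[,;:!?]+$", "");
        if (t.endsWith(".") && t.length() > 2) {
            t = t.substring(0, t.length() - 1);
        }
        return t;
    }

    public static void main(String[] args) {
        System.out.println(findPersons(
                "Michael J. Fox stars as Marty McFly, while Christopher Lloyd plays Doc."));
    }
}
```

Without the PoS check, a capitalized verb or sentence-initial word directly after a name would be swallowed into the Person, which is exactly the case the "noun but not a verb" condition is meant to exclude.
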
  43. From Persons to Actors: getting actors can be simple if we know that Persons who are also actors perform some well-known actions, or that there are widely used patterns. E.g. a Person who “stars as” CharacterInTheMovie (which may itself be tagged as a Person too) is also an Actor. E.g. if the snippet “CharacterInTheMovie (Person)” exists, then Person is usually an Actor. We could then build an ActorAnnotator.
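
Both patterns can be approximated with regular expressions once the Persons are known. The sketch below is plain Java, not the ActorAnnotator itself: the class name, the exact regular expressions and the sample text are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Plain-Java illustration only: hypothetical class and patterns.
class ActorPatterns {
    // Returns the persons that match one of the two actor patterns, in input order.
    static Set<String> findActors(String text, List<String> persons) {
        Set<String> actors = new LinkedHashSet<>();
        for (String person : persons) {
            String quoted = Pattern.quote(person);
            // Pattern 1: "<Person> stars as <Capitalized character name>"
            Pattern starsAs = Pattern.compile(quoted + "\\s+stars\\s+as\\s+\\p{Lu}");
            // Pattern 2: "<Capitalized character name> (<Person>)"
            Pattern inParentheses = Pattern.compile("\\p{Lu}[\\p{L}.]*\\s*\\(" + quoted + "\\)");
            if (starsAs.matcher(text).find() || inParentheses.matcher(text).find()) {
                actors.add(person);
            }
        }
        return actors;
    }

    public static void main(String[] args) {
        String text = "Michael J. Fox stars as Marty McFly. "
                + "Doc Brown (Christopher Lloyd) builds the machine, says Robert Zemeckis.";
        List<String> persons = Arrays.asList("Michael J. Fox", "Christopher Lloyd", "Robert Zemeckis");
        System.out.println(findActors(text, persons));
    }
}
```
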
  44. Step 1 - Define TypeSystem: define at least one Type inside the Type System for each object in the domain model. It is useful to define more fine-grained Types (for the values of type properties, called Features). If we want to extract information about articles, we create an Article type inside the Type System. We’ll also need to create annotations/entities for movies, actors, directors, etc.
  45. Step 2 - Define AnalysisEngine descriptor: define which type system it is going to use; define which capabilities the analysis engine has, i.e. which annotations it needs in order to work and which annotations it will (possibly) generate; define configuration parameters for the underlying algorithm; define the resources needed by the analysis engine.
  46. Step 3 - Implement Annotator: create a new class extending JCasAnnotator_ImplBase; implement the process() method, which actually does the job (the algorithm implementation is, or is called, in the process() method); you can use the configuration parameters/resources defined in the descriptor; optionally override the initialize() and destroy() methods.
  47. DummyPersonAnnotator
  48. Step 4 - Execute the UIMA pipeline: instantiate the AnalysisEngine with its descriptor as a parameter; create a CAS which will contain the text to be analyzed and the extracted annotations; run the AnalysisEngine on the given CAS; browse the results.
  49. Execute a UIMA pipeline
  50. What’s next: UIMA use cases; using UIMA in search engines; hands-on code (assignment).
  51. References: http://www.apache.org ; http://uima.apache.org ; http://www.oasis-open.org ; http://uima.apache.org/d/uimaj-2.3.1/index.html ; http://uima.apache.org/d/uimaj-2.3.1/overview_and_setup.html#ugr.ovv.eclipse_setup ; http://www.manning.com/ingersoll/ ; https://github.com/tteofili/samplett/tree/master/giw1011