Apache UIMA and Metadata Generation


Slides giving an overview of Apache UIMA and how it can be used for metadata generation, presented in the "Information management on the Web" course at DIA (Computer Science Department) of Roma Tre University

  1. Apache UIMA and Metadata Generation
     • Gestione delle Informazioni su Web - 2009/2010
     • Tommaso Teofili, tommaso [at] apache [dot] org
     • Wednesday, April 14, 2010
  2. Agenda
     • Unstructured information management
     • The ASF
     • Apache UIMA: goals, overview, components, usage
  3. UIM?
     • Unstructured Information Management
     • A wide topic: text, audio, video
     • Different (possibly mixed) approaches (NLP, Machine Learning, IR, Ontologies, Automated reasoning, Knowledge Sources)
     • Apache UIMA
  4. Apache Software Foundation
     • Non-profit corporation
     • “...provides organizational, legal, and financial support for a broad range of open source software projects...”
     • “...collaborative and meritocratic development process...”
     • “...pragmatic Apache License...”
  5. Apache UIMA
     • Architectural framework to manage unstructured data (Java, C++)
     • Just graduated as an Apache Top Level Project
     • Former IBM research project donated to the ASF
     • OASIS Standard
  6. Apache UIMA - Goals
     • “Our goal is to support a thriving community of users and developers of UIMA frameworks, tools, and annotators, facilitating the analysis of unstructured content such as text, audio and video”
  7. Apache UIMA - bridging worlds
  8. Apache UIMA - Overview
     • UIMA supports the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies
  9. Apache UIMA - Multimodal Analysis
     • Multimodal analysis means the ability to process a resource from various “points of view”
     • Sample: a video stream from which we want to extract subtitles and also automatically recognize the actors involved
     • Here, though, we are mainly interested in text
  10. Sample scenario
      • Content Management System containing free-text articles about movies
      • We want such articles to be automatically enriched with metadata contained in the text (movies, directors, actors/actresses, distribution) and linked to “similar” articles (i.e., dealing with the same movies or actors)
      • So that we can search for “similar” articles
  11. Sample scenario - articles about movies
  12. Sample scenario
      • UIMA can help enrich articles with metadata
      • Think of filling the instance variables of an Article.java class with proper values
      • Then persisting it to a database, so that we can query for articles dealing with the same actors
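A minimal sketch of what such an Article class might look like. The actual Article.java from the slides is not part of this transcript, so the class below is hypothetical: its field names simply follow the metadata listed in the sample scenario (movies, directors, actors, distribution), and `sharesActorWith` is an assumed helper standing in for the database query.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical domain object; the real Article.java is not shown in the slides.
class Article {
    final String text;                                  // free text of the article
    final List<String> movies = new ArrayList<>();      // metadata to be filled in by the analysis
    final List<String> directors = new ArrayList<>();
    final List<String> actors = new ArrayList<>();
    String distribution;

    Article(String text) {
        this.text = text;
    }

    // Assumed stand-in for "query articles dealing with the same actors".
    boolean sharesActorWith(Article other) {
        for (String actor : actors) {
            if (other.actors.contains(actor)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Article first = new Article("Michael J. Fox stars as Marty McFly...");
        first.actors.add("Michael J. Fox");
        Article second = new Article("An interview with Michael J. Fox...");
        second.actors.add("Michael J. Fox");
        System.out.println(first.sharesActorWith(second));
    }
}
```

In the scenario, the analysis pipeline would be the part that fills these lists; here they are filled by hand only to show the shape of the data.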
  13. Filling Article with metadata
  14. Sample scenario - metadata
  15. UIMA - Annotations and Entities
  16. Apache UIMA - Annotation
      • The association of metadata, such as a label, with a region of text (or other type of artifact)
      • For example, the label “Person” associated with the region of text “Fred Center” constitutes an annotation
      • We say “Person” annotates the span of text from X to Y containing exactly “Fred Center”
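The definition above can be sketched in a few lines of plain Java. `SpanAnnotation` is a simplified stand-in written for illustration, not UIMA's own Annotation class (which similarly exposes begin and end offsets and the covered text); the sample sentence is also an assumption.

```java
// Simplified stand-in for an annotation: a label attached to a span of text.
class SpanAnnotation {
    final String label;
    final int begin; // offset X: first character of the span (inclusive)
    final int end;   // offset Y: position just after the last character (exclusive)

    SpanAnnotation(String label, int begin, int end) {
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the document this annotation covers.
    String coveredText(String document) {
        return document.substring(begin, end);
    }

    // Annotates the first occurrence of `phrase` in `document` with `label`.
    static SpanAnnotation annotate(String label, String document, String phrase) {
        int begin = document.indexOf(phrase);
        if (begin < 0) {
            throw new IllegalArgumentException("phrase not found: " + phrase);
        }
        return new SpanAnnotation(label, begin, begin + phrase.length());
    }

    public static void main(String[] args) {
        String document = "Yesterday Fred Center visited Rome.";
        SpanAnnotation person = annotate("Person", document, "Fred Center");
        System.out.println(person.label + " [" + person.begin + ", " + person.end + ") -> "
                + person.coveredText(document));
    }
}
```

Storing offsets rather than a copy of the text keeps the annotation tied to its exact position in the document, which is what lets later analysis steps reason about neighbouring annotations.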
  17. Apache UIMA - Basic Steps
      • Domain model definition
      • Analysis pipeline definition
      • Arrange components:
        • Define components draining data from sources
        • Add and customize analysis components: Patterns, Dictionaries, RegEx, External services, NLP, etc.
        • Define components outputting information to target storage
      • Analysis pipeline(s) execution
  18. Defining domain model within UIMA using Type Systems
      • The Type System is the place where we describe which metadata we would like to extract
      • Low representational gap
      • Like almost everything in UIMA: described (and generated!) using XML
      • Possible to define multiple Type Systems for different purposes
  19. Defining domain model within UIMA using Type Systems
      • Define at least one Type inside the Type System for each object in the domain model
      • Useful to define more fine-grained Types (for the values of type properties, called Features)
      • If we want to extract information about articles, we create an Article type inside the Type System
      • We’ll also need to create annotations/entities for movies, actors, directors, etc.
      • Types usually extend Annotation or TOP
  20. Type System for Articles
  21. How does UIMA extract metadata?
  22. Apache UIMA - Analysis Engines
      • Basic UIMA building blocks
      • Analyze a document
      • Infer and record descriptive attributes (about documents/regions)
      • Generate analysis results
  23. Apache UIMA - AEs
      • Analysis Engines are described by a descriptor (XML)
      • Can be Primitive (a single AE) or Aggregate (a pipeline of AEs)
      • Analysis algorithms can be switched by changing the descriptor instead of the code
      • Contain Type System definitions
      • Define Capabilities
  24. Apache UIMA - AnalysisComponent API
      • initialize: performs (once) any startup tasks required by this component
      • process: processes the resource to analyze, generating analysis results (metadata)
      • destroy: frees all resources held; called only once, when the framework has finished using this component
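The lifecycle described above can be sketched as follows. `SimpleComponent` is a deliberately simplified stand-in, not the real UIMA interface (whose methods take framework objects such as a context and a CAS rather than plain strings); `YearFinder` and its year pattern are hypothetical examples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LifecycleSketch {
    // Drives one component through its lifecycle: initialize once, process each document, destroy once.
    static List<String> run(SimpleComponent component, List<String> documents) {
        List<String> results = new ArrayList<>();
        component.initialize();
        try {
            for (String document : documents) {
                component.process(document, results);
            }
        } finally {
            component.destroy(); // always release resources, even if processing fails
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Released in 1985.", "Sequels followed in 1989 and 1990.");
        System.out.println(run(new YearFinder(), docs));
    }
}

// Simplified stand-in for the three lifecycle methods; not the real UIMA signatures.
interface SimpleComponent {
    void initialize();
    void process(String document, List<String> results);
    void destroy();
}

// Hypothetical component that records four-digit years found in the text.
class YearFinder implements SimpleComponent {
    private Pattern yearPattern;
    int initializeCalls = 0;
    int destroyCalls = 0;

    public void initialize() {
        yearPattern = Pattern.compile("\\b(19|20)\\d{2}\\b"); // startup work done once
        initializeCalls++;
    }

    public void process(String document, List<String> results) {
        Matcher matcher = yearPattern.matcher(document);
        while (matcher.find()) {
            results.add(matcher.group());
        }
    }

    public void destroy() {
        yearPattern = null;
        destroyCalls++;
    }
}
```

Doing the expensive setup in `initialize` rather than in `process` is the point of the split: the pattern is compiled once, however many documents flow through.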
  25. Apache UIMA - Annotators
      • Analysis Engine algorithm
      • Annotator: a software component implemented to produce and record annotations over regions of an artifact (e.g., text document, audio, and video)
      • Annotators implement the AnalysisComponent interface
  26. Apache UIMA - Roles
      • AnalysisEngine: high-level block responsible for analysis; contains at least one AnalysisComponent
      • AnalysisComponent: interface for any component responsible for analyzing artifacts
      • Annotator: implementation of AnalysisComponent responsible for creating Annotations
  27. Apache UIMA - AEs
  28. Analysis Engines in a Pipeline
  29. Apache UIMA - Analysis Results
      • Where do analysis results end up?
      • How do annotators represent and share their results?
      • CAS - Common Analysis Structure
      • Maintains typed indexes of extracted results
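To make the idea of "typed indexes" concrete, here is a toy structure that keeps the document text together with spans indexed by type name. `MiniCas` is an illustration written for this transcript, not the UIMA CAS API, and the type names and sample sentence are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of a shared analysis structure: one document plus results indexed by type.
class MiniCas {
    final String documentText;
    // Type name -> list of [begin, end) spans recorded for that type.
    private final Map<String, List<int[]>> indexByType = new HashMap<>();

    MiniCas(String documentText) {
        this.documentText = documentText;
    }

    // Records a span of the given type; any annotator in the pipeline could call this.
    void add(String type, int begin, int end) {
        indexByType.computeIfAbsent(type, key -> new ArrayList<>()).add(new int[] {begin, end});
    }

    // Convenience helper: annotates the first occurrence of a phrase.
    void annotate(String type, String phrase) {
        int begin = documentText.indexOf(phrase);
        if (begin >= 0) {
            add(type, begin, begin + phrase.length());
        }
    }

    // Looks up all results of one type, which is what a later annotator or consumer would do.
    List<String> coveredTexts(String type) {
        List<String> texts = new ArrayList<>();
        for (int[] span : indexByType.getOrDefault(type, Collections.<int[]>emptyList())) {
            texts.add(documentText.substring(span[0], span[1]));
        }
        return texts;
    }

    static MiniCas sample() {
        MiniCas cas = new MiniCas("Michael J. Fox stars in Back to the Future.");
        cas.annotate("Person", "Michael J. Fox");
        cas.annotate("Movie", "Back to the Future");
        return cas;
    }

    public static void main(String[] args) {
        MiniCas cas = sample();
        System.out.println(cas.coveredTexts("Person"));
        System.out.println(cas.coveredTexts("Movie"));
    }
}
```

Because every component reads from and writes to the same structure, one annotator can build on the results of another without the two knowing about each other.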
  30. Common Analysis Structure
  31. Which algorithms lie beneath AEs?
  32. Apache UIMA & NLP
      • NLP (Natural Language Processing) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications
      • It’s an AI discipline
  33. Apache UIMA & NLP
      • “accomplish human-like language processing”
      • Paraphrase an input text
      • Translate the text into another language
      • Answer questions about the contents of the text
      • Draw inferences from the text <--
  34. Apache UIMA & NLP
      • “an NLP-based IR system has the goal of providing more precise, complete information in response to a user’s real information need”
      • Various levels of processing
      • That’s where we are!
  35. Apache UIMA - First Approaches
      • Simplest: write RegEx and Dictionaries and mix them together
      • NLP-like: Tokenize -> Sentence identification -> PoS Tagging -> Custom (domain-specific) structures
  36. Analysis Engines in a Pipeline
  37. Sample scenario - extract actors
      • Tokenize article text
      • Identify sentences
      • Tag PoS
      • Identify Persons using regular expressions and PoS
      • Use Person annotations, Tokens’ PoS and Sentences to extract relations between terms, to identify Persons who are also Actors
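The first three steps above can be sketched as a small pipeline in which each stage adds its results to a shared document object. Everything here is a toy written for illustration: the `Stage` interface, the regular-expression tokenizer, the punctuation-based sentence splitter and the five-word lexicon are assumptions, not the UIMA components used in the slides (a real pipeline would use a proper tokenizer and a trained tagger).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipelineSketch {
    // Shared results container passed from stage to stage.
    static class Doc {
        final String text;
        final List<String> tokens = new ArrayList<>();
        final List<List<String>> sentences = new ArrayList<>();
        final List<String> posTags = new ArrayList<>(); // posTags.get(i) tags tokens.get(i)

        Doc(String text) {
            this.text = text;
        }
    }

    interface Stage {
        void process(Doc doc);
    }

    // Step 1: split the text into word and punctuation tokens.
    static final Stage TOKENIZER = doc -> {
        Matcher matcher = Pattern.compile("\\w+|[^\\w\\s]").matcher(doc.text);
        while (matcher.find()) {
            doc.tokens.add(matcher.group());
        }
    };

    // Step 2: naive sentence identification, closing a sentence at '.', '!' or '?'.
    static final Stage SENTENCE_SPLITTER = doc -> {
        List<String> current = new ArrayList<>();
        for (String token : doc.tokens) {
            current.add(token);
            if (token.equals(".") || token.equals("!") || token.equals("?")) {
                doc.sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            doc.sentences.add(current);
        }
    };

    // Toy lexicon standing in for a real PoS tagger.
    static final Map<String, String> TOY_LEXICON = new HashMap<>();
    static {
        TOY_LEXICON.put("stars", "verb");
        TOY_LEXICON.put("is", "verb");
        TOY_LEXICON.put("as", "preposition");
        TOY_LEXICON.put("the", "determiner");
        TOY_LEXICON.put("movie", "noun");
    }

    // Step 3: tag each token; words missing from the lexicon are tagged "undefined".
    static final Stage POS_TAGGER = doc -> {
        for (String token : doc.tokens) {
            doc.posTags.add(TOY_LEXICON.getOrDefault(token.toLowerCase(), "undefined"));
        }
    };

    // Runs the stages in order over one document.
    static Doc run(String text, Stage... stages) {
        Doc doc = new Doc(text);
        for (Stage stage : stages) {
            stage.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Doc doc = run("Tom stars as Marty. The movie is fun.", TOKENIZER, SENTENCE_SPLITTER, POS_TAGGER);
        System.out.println(doc.tokens);
        System.out.println(doc.sentences);
        System.out.println(doc.posTags);
    }
}
```

The order matters: the sentence splitter and the tagger both rely on tokens produced by the first stage, which mirrors how later annotators depend on earlier ones.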
  38. Sample scenario - PersonAnnotator
      • I have a dictionary of names (simple to find and/or build)
      • I use a DictionaryAnnotator to extract NameAnnotations
      • I don’t have a dictionary of surnames
      • Every time a matching name (a NameAnnotation) is found, we look for one or more subsequent tokens (to allow for persons with a double name or surname) whose PoS is “undefined” or a noun (but not a verb) and which start with an uppercase letter
      • If found, then the name + token(s) sequence annotates a Person (e.g., “Michael J. Fox”)
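A minimal sketch of the rule described above, written in plain Java rather than against the UIMA API. The `Token` class, the PoS tag strings and the assumption that “J.” arrives as a single token are all illustrative choices; the real PersonAnnotator from the slides is not shown in this transcript.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PersonSketch {
    // A token together with its part-of-speech tag (tag strings are assumptions).
    static class Token {
        final String text;
        final String pos;

        Token(String text, String pos) {
            this.text = text;
            this.pos = pos;
        }
    }

    // A token may continue a person's name if its PoS is "undefined" or "noun"
    // and it starts with an uppercase letter.
    static boolean canContinueName(Token token) {
        boolean posOk = token.pos.equals("undefined") || token.pos.equals("noun");
        return posOk && !token.text.isEmpty() && Character.isUpperCase(token.text.charAt(0));
    }

    // For every dictionary name, absorb the following qualifying tokens;
    // a Person is emitted only if at least one such token was found.
    static List<String> findPersons(List<Token> tokens, Set<String> nameDictionary) {
        List<String> persons = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (!nameDictionary.contains(tokens.get(i).text)) {
                i++;
                continue;
            }
            StringBuilder person = new StringBuilder(tokens.get(i).text);
            int next = i + 1;
            while (next < tokens.size() && canContinueName(tokens.get(next))) {
                person.append(' ').append(tokens.get(next).text);
                next++;
            }
            if (next > i + 1) {
                persons.add(person.toString());
            }
            i = next; // resume after the tokens already consumed
        }
        return persons;
    }

    static List<Token> sampleTokens() {
        return Arrays.asList(
                new Token("Michael", "undefined"), new Token("J.", "undefined"),
                new Token("Fox", "noun"), new Token("stars", "verb"),
                new Token("as", "preposition"), new Token("Marty", "undefined"));
    }

    public static void main(String[] args) {
        Set<String> names = new HashSet<>(Arrays.asList("Michael"));
        System.out.println(findPersons(sampleTokens(), names));
    }
}
```

Note that “Marty” is not reported here because it is not in the name dictionary; the quality of the result depends directly on how complete that dictionary is.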
  39. PersonAnnotator sample
  40. Sample scenario - articles about movies
  41. Sample scenario
      • Getting actors can be simple if we know that Persons who are also actors perform some well-known actions
      • e.g., a Person “stars as” CharacterInTheMovie (which will eventually be tagged as a Person too) when he or she is also an Actor
      • e.g., if the snippet “CharacterInTheMovie (Person)” exists, then Person is usually an Actor
      • Then we can build an ActorAnnotator
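The two cues above can be sketched with regular expressions over the text, given the Persons already found. This is an illustrative approximation and not the ActorAnnotator from the slides: the method name, the exact patterns and the sample sentence are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ActorSketch {
    // Returns the persons that match one of the two cues:
    //   1. "<Person> stars as ..."      2. "... (<Person>)"
    static List<String> findActors(String text, List<String> persons) {
        List<String> actors = new ArrayList<>();
        for (String person : persons) {
            String quoted = Pattern.quote(person); // treat the name literally, dots included
            boolean starsAs = Pattern.compile(quoted + "\\s+stars\\s+as\\b").matcher(text).find();
            boolean inParentheses = Pattern.compile("\\(\\s*" + quoted + "\\s*\\)").matcher(text).find();
            if (starsAs || inParentheses) {
                actors.add(person);
            }
        }
        return actors;
    }

    static final String SAMPLE_TEXT =
            "Michael J. Fox stars as Marty McFly. Doc Brown (Christopher Lloyd) builds the machine. "
            + "Robert Zemeckis directed it.";

    static final List<String> SAMPLE_PERSONS = Arrays.asList(
            "Michael J. Fox", "Marty McFly", "Doc Brown", "Christopher Lloyd", "Robert Zemeckis");

    public static void main(String[] args) {
        System.out.println(findActors(SAMPLE_TEXT, SAMPLE_PERSONS));
    }
}
```

Rules like these are cheap to write but easy to over-trigger; any parenthesized person name, such as a quoted critic, would also be reported as an actor.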
  42. Sample scenario
  43. Apache UIMA experience
      • Some examples are available under SVN at http://svn.apache.org/repos/asf/uima/uimaj/trunk/uimaj-examples/
      • The getting started guides are also very useful for getting in touch with UIMA: http://uima.apache.org/documentation.html#getting_started
      • Subscribe to the users@ and dev@uima.apache.org mailing lists
  44. Apache UIMA - Components
      • Type Systems
      • CAS Consumers
      • Analysis Engines
      • Asynchronous Scaleout
      • CAS
      • Sandbox Components
      • Collection Processing Manager/Engine
      • Eclipse Plugins
      • Flow Controllers
      • Tools
  45. Apache UIMA - Flow Controllers
      • A component which implements the interfaces needed to specify a custom flow within an Aggregate Analysis Engine
      • Enables conditional pipelines
  46. Apache UIMA - CAS Consumers
      • Components responsible for taking the results from the CAS and storing them in a database or other storage device
  47. Apache UIMA - Collection Processing and a bigger picture
  48. Apache UIMA - Asynchronous Scaleout
      • Add-on to the base Java framework, supporting a very flexible scaleout capability based on JMS (Java Message Service) and Apache ActiveMQ (a messaging and integration patterns provider)
      • A powerful clustering solution
      • Very useful when the size of the source documents is huge
  49. Apache UIMA - Sandbox
      • Basics
      • Tokenizer
      • HMM Tagger
      • Dictionaries (DictionaryAnnotator, ConceptMapper)
      • Snowball
      • ConfigurableFeatureExtractor
  50. Apache UIMA - External Services
      • External IE engines exposing web services, easily integrated inside UIMA:
        • AlchemyAPI Annotator
        • OpenCalais Annotator
  51. Apache UIMA - Tika
      • Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries
      • The TikaAnnotator uses Tika to generate annotations representing the original markup of a document and to extract its text and metadata
  52. Apache UIMA - Lucas
      • Very useful to build search engines!
      • Stores CAS data in Lucene indexes
      • Transforms annotation objects of a CAS into Lucene token streams, which are stored in a Lucene document
  53. Apache UIMA - Tools
      • JCasGen
      • PEAR Installer, Merger, Packager
      • Component Descriptor Editor
      • CPE Configurator
      • Java Annotation Viewer
      • CAS Visual Debugger
      • Document Analyzer
  54. Apache UIMA
      • We can aggregate existing components or write and deploy our own new ones
      • There are lots of repositories for UIMA containing open source analysis engines, type systems, etc.
      • We do, however, need to know our domain well enough
      • Please mind the “false positives” issue
  55. References
      • http://www.apache.org
      • http://uima.apache.org
      • http://www.oasis-open.org
      • http://www.cnlp.org/publications/03NLP.LIS.Encyclopedia.pdf
      • http://nlp.stanford.edu/
      • http://www.opencalais.com/gnosis/
      • http://www.dsi.unive.it/~marin/docs/hmm-it.pdf
      • http://en.wikipedia.org/wiki/Hidden_Markov_model