Apache UIMA Introduction
Gestione delle Informazioni su Web - 2010/2011
Tommaso Teofili
tommaso [at] apache [dot] org
UIM?
- Unstructured Information Management
- A wide topic: text, audio, video
- Different (possibly mixed) approaches (NLP, Machine Learning, IR, Ontologies, Automated reasoning, Knowledge Sources)
- Apache UIMA
Apache Software Foundation
- Non-profit corporation
- “...provides organizational, legal, and financial support for a broad range of open source software projects...”
- “...collaborative and meritocratic development process...”
- “...pragmatic Apache License...”
Apache UIMA
- Architectural framework to manage unstructured data (Java, C++, ...)
- Former IBM research project donated to the ASF
- OASIS Standard for unstructured information management
Apache UIMA - Goals
- “Our goal is to support a thriving community of users and developers of UIMA frameworks, tools, and annotators, facilitating the analysis of unstructured content such as text, audio and video”
Apache UIMA - Overview
- UIMA supports the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies
Apache UIMA - Multimodal Analysis
- Multimodal analysis means the ability to process a resource from various “points of view”
- Example: a video stream from which we want to extract subtitles and also automatically recognize the actors involved
- Here, though, we are mainly interested in text...
Sample scenario
- A Content Management System containing free-text articles about movies
- We want such articles to be automatically enriched with metadata contained in the text (movies, directors, actors/actresses, distribution) and linked to “similar” articles (i.e. articles dealing with the same movies or actors)
- So that we can search for “similar” articles
Sample scenario
- UIMA can help with enriching articles with metadata
- Think of filling the instance variables of an Article.java object with proper values
- Then persisting it to a database, to query for articles dealing with the same actors
Apache UIMA - Annotation
- The association of metadata, such as a label, with a region of text (or other type of artifact)
- For example, the label “Person” associated with the region of text “Fred Center” constitutes an annotation
- We say “Person” annotates the span of text from X to Y containing exactly “Fred Center”
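
The "label plus span" idea above can be sketched in plain Java. This is only an illustration of the concept, not the UIMA API: the class name `SpanAnnotation`, its fields and the sample sentence are assumptions made for this example (UIMA's own `Annotation` type exposes comparable begin/end offsets and covered text).

```java
// Plain-Java sketch of the annotation concept; NOT the UIMA API.
// SpanAnnotation and the sample sentence are hypothetical names for illustration.
class SpanAnnotation {
    final String label; // the metadata, e.g. "Person"
    final int begin;    // offset of the first covered character (inclusive), the "X"
    final int end;      // offset just past the last covered character (exclusive), the "Y"

    SpanAnnotation(String label, int begin, int end) {
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the document that this annotation covers.
    String coveredText(String document) {
        return document.substring(begin, end);
    }

    // Annotates the first occurrence of a phrase with the given label.
    static SpanAnnotation annotate(String document, String label, String phrase) {
        int begin = document.indexOf(phrase);
        if (begin < 0) {
            throw new IllegalArgumentException("phrase not found: " + phrase);
        }
        return new SpanAnnotation(label, begin, begin + phrase.length());
    }

    public static void main(String[] args) {
        String doc = "Yesterday Fred Center gave a talk.";
        SpanAnnotation person = annotate(doc, "Person", "Fred Center");
        System.out.println(person.label + " [" + person.begin + ", " + person.end + "): "
                + person.coveredText(doc));
    }
}
```

With the sample sentence, "Person" annotates the span from offset 10 to offset 21, which contains exactly "Fred Center".
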
Apache UIMA - Basic Steps
- Domain model definition
- Analysis pipeline definition
  - Arrange components:
    - Define components reading data from sources
    - Add and customize analysis components: Patterns, Dictionaries, RegEx, External services, NLP, etc.
    - Define components writing information to target storage
- Analysis pipeline(s) execution
Defining domain model within UIMA using Type Systems
- The Type System is the place where we describe which metadata we would like to extract
- Low representational gap
- Like almost everything in UIMA: described (and generated!) using XML
- Possible to define multiple Type Systems for different purposes
Apache UIMA - Analysis Engines
- Basic UIMA building blocks
- Analyze a document
- Infer and record descriptive attributes (about documents/regions)
- Generate analysis results
Apache UIMA - AEs
- Analysis Engines are described by a descriptor (XML)
- Can be Primitive (a single AE) or Aggregate (a pipeline of AEs)
- Analysis algorithms can be switched by changing the descriptor instead of the code
- Contain Type System definitions
- Define Capabilities
Apache UIMA - AnalysisComponent API
- initialize: performs (once) any startup tasks required by this component
- process: processes the resource to analyze, generating analysis results (metadata)
- destroy: frees all resources held; called only once, when the framework has finished using this component
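
The initialize / process / destroy lifecycle above can be illustrated with a small plain-Java sketch. It does not use the real UIMA interfaces: `Component`, `UppercaseWordCounter` and `runAll` are hypothetical names, and the "analysis result" is just a string, chosen only to show the order in which a framework would call the three methods.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the three-phase component lifecycle; NOT the UIMA interfaces.
class LifecycleDemo {
    // Hypothetical stand-in for the initialize/process/destroy contract.
    interface Component {
        void initialize();                                   // called once, before any document
        void process(String document, List<String> results); // called once per document
        void destroy();                                      // called once, after the last document
    }

    // Toy component: counts words that start with an uppercase letter.
    static class UppercaseWordCounter implements Component {
        private boolean ready;

        public void initialize() {
            ready = true; // a real component would load dictionaries, models, etc. here
        }

        public void process(String document, List<String> results) {
            if (!ready) {
                throw new IllegalStateException("initialize() must be called first");
            }
            int count = 0;
            for (String word : document.trim().split("\\s+")) {
                if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                    count++;
                }
            }
            results.add("capitalizedWords=" + count);
        }

        public void destroy() {
            ready = false; // a real component would release its resources here
        }
    }

    // Drives the lifecycle the way a framework would: init once, process many, destroy once.
    static List<String> runAll(Component component, List<String> documents) {
        List<String> results = new ArrayList<String>();
        component.initialize();
        try {
            for (String document : documents) {
                component.process(document, results);
            }
        } finally {
            component.destroy();
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<String>();
        docs.add("Fred Center met Anna");
        docs.add("no names here");
        System.out.println(runAll(new UppercaseWordCounter(), docs));
    }
}
```

The `try`/`finally` in `runAll` reflects the "called only once" guarantee for destroy, even when processing a document fails.
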
Apache UIMA - Annotators
- Analysis Engine algorithm
- Annotator: a software component implemented to produce and record annotations over regions of an artifact (e.g., text document, audio, and video)
- Annotators implement the AnalysisComponent interface
Apache UIMA - Roles
- AnalysisEngine: high-level block responsible for analysis; contains at least one AnalysisComponent
- AnalysisComponent: interface for any component responsible for analyzing artifacts
- Annotator: implementation of AnalysisComponent responsible for creating Annotations
Apache UIMA & NLP
- NLP (Natural Language Processing) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications
- It’s an AI discipline
Apache UIMA & NLP
“accomplish human-like language processing”
- Paraphrase an input text
- Translate the text into another language
- Answer questions about the contents of the text
- Draw inferences from the text
Apache UIMA & NLP
- “an NLP-based IR system has the goal of providing more precise, complete information in response to a user’s real information need”
- Various levels of processing
NLP - Language Identification
- NLP takes advantage of language-specific syntax, forms, rules and meanings
- It is not easy to write language-independent extraction algorithms
- Often this is the first block of an NLP pipeline
- Techniques: stopword dictionaries, statistical models, etc.
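
As a rough illustration of the stopword-dictionary technique mentioned above, the following plain-Java sketch guesses a language by counting how many tokens appear in each language's stopword list. The class name, the two tiny word lists and the "unknown" fallback are assumptions made for this example; a realistic identifier would use far larger lists or a statistical model.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Sketch of stopword-based language identification; word lists are tiny and illustrative.
class StopwordLanguageGuesser {
    private static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "of", "is", "in", "to", "a")));
        STOPWORDS.put("it", new HashSet<>(Arrays.asList("il", "la", "e", "di", "che", "un", "del", "per")));
    }

    // Returns the language code whose stopwords occur most often, or "unknown" if none occur.
    static String guess(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("The director of the movie is in town"));
        System.out.println(guess("Il regista del film e la protagonista"));
    }
}
```

Ties are resolved in favor of the language listed first, which is one of many simplifications in this sketch.
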
NLP - Tokens and Sentences
- Humans learn the meaning of words in order to understand the semantics of the whole context
- Split the target text into words to be able to analyze their meaning and role
- Discover sentences, to later assign roles to each token
- Fairly easy for English, Italian and similar languages, but what about Chinese?
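
For whitespace-delimited languages, tokenization and sentence splitting can be sketched with the JDK's `java.text.BreakIterator`. This is just one possible approach, not what UIMA components necessarily use; the `Segmenter` class and its filtering of punctuation-only segments are assumptions made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch of sentence and word segmentation using the JDK's BreakIterator.
class Segmenter {
    // Splits text into trimmed sentences.
    static List<String> sentences(String text) {
        return split(BreakIterator.getSentenceInstance(Locale.ENGLISH), text, false);
    }

    // Splits text into word tokens, dropping whitespace and punctuation segments.
    static List<String> words(String text) {
        return split(BreakIterator.getWordInstance(Locale.ENGLISH), text, true);
    }

    private static List<String> split(BreakIterator it, String text, boolean wordsOnly) {
        List<String> parts = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String part = text.substring(start, end).trim();
            if (part.isEmpty()) {
                continue; // whitespace between tokens
            }
            if (wordsOnly && !Character.isLetterOrDigit(part.charAt(0))) {
                continue; // punctuation such as "." or ","
            }
            parts.add(part);
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "He stars as Marty. The movie was a hit.";
        System.out.println(sentences(text));
        System.out.println(words("He stars as Marty."));
    }
}
```

Boundary rules of this kind rely on spaces and punctuation, which is why languages written without spaces between words, such as Chinese, typically need dictionary-based or statistical segmenters instead.
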
NLP - PoS Tagging
- Assign a “Part of Speech” (noun, adjective, verb, etc.) to each token generated in the previous step
- Many language/domain-specific patterns can be discovered and exploited just with PoS-tagged tokens and sentences
NLP - Chunking & Parsing
- Parse sentences into a meaningful set or tree of relationships
- Chunks are the sentence building blocks (e.g. verbal forms)
- The parse tree highlights the structure of a sentence
- Can leverage logic analysis
[Figure: chunking, parsing]
NLP - Named Entity Recognition
- Answer the questions: where? when? who? how often? how much?
- Identify key entities in the text
- Common techniques: dictionaries, rules, statistical models
Using UIMA
- Define TypeSystem
- Define AnalysisEngine descriptor(s)
- Implement Annotator(s)
- Execute the UIMA pipeline
Sample scenario - extract actors
- Tokenize article text
- Identify sentences
- Tag PoS
- Identify Persons using regular expressions and PoS
- Use Person annotations, Tokens’ PoS and Sentences to extract relations between terms, to identify Persons who are also Actors
Sample scenario - extract persons
- I have a dictionary of first names (simple to find and/or build)
- I use a dictionary-based Annotator to extract annotations of first names (NameAnnotation)
- I don’t have a dictionary of surnames
- Every time a matching name (a NameAnnotation) is found, we look for one or more subsequent tokens (to allow for persons with a double name or surname) whose PoS is “undefined” or a noun (but not a verb) and which start with an uppercase letter
- If found, then the name + token(s) sequence annotates a Person (e.g. “Michael J. Fox”)
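
The person-extraction rule above can be approximated in plain Java. This sketch makes several assumptions: the three-entry `FIRST_NAMES` dictionary and the `PersonSketch` class are invented for illustration, tokens are split on whitespace, and the PoS check is left out entirely, so only the "starts with an uppercase letter" part of the rule is applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of dictionary + capitalization based person detection (no PoS check).
class PersonSketch {
    // Hypothetical first-name dictionary; a real one would be much larger.
    static final Set<String> FIRST_NAMES =
            new HashSet<>(Arrays.asList("Michael", "Christopher", "Lea"));

    // Strips trailing punctuation, but keeps initials such as "J." intact.
    static String clean(String token) {
        if (token.matches("[A-Z]\\.")) {
            return token;
        }
        return token.replaceAll("[.,;:!?]+$", "");
    }

    static List<String> findPersons(String text) {
        String[] tokens = text.split("\\s+");
        List<String> persons = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            String first = clean(tokens[i]);
            // Need a dictionary hit that is not immediately followed by punctuation.
            if (!FIRST_NAMES.contains(first) || !first.equals(tokens[i])) {
                continue;
            }
            StringBuilder person = new StringBuilder(first);
            int j = i + 1;
            while (j < tokens.length) {
                String next = clean(tokens[j]);
                if (next.isEmpty() || !Character.isUpperCase(next.charAt(0))) {
                    break; // not a capitalized token: the name is over
                }
                person.append(' ').append(next);
                boolean closedByPunctuation = !next.equals(tokens[j]);
                j++;
                if (closedByPunctuation) {
                    break; // e.g. "Lloyd," ends the name
                }
            }
            if (j > i + 1) { // at least one following token was accepted
                persons.add(person.toString());
                i = j - 1;   // continue scanning after the person
            }
        }
        return persons;
    }

    public static void main(String[] args) {
        System.out.println(findPersons(
                "The film stars Michael J. Fox and Christopher Lloyd, with Lea Thompson."));
    }
}
```

Without the PoS filter, any capitalized word that follows a first name is absorbed into the person, which is exactly the kind of false positive the PoS condition in the rule is meant to prevent.
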
From Persons to Actors
- Getting actors can be simple if we know that Persons who are also actors perform some well-known actions, or that there exist widely used patterns
- e.g.: a Person who is also an Actor “stars as” CharacterInTheMovie (which may be tagged as a Person too)
- e.g.: if the snippet “CharacterInTheMovie (Person)” exists, then Person is usually an Actor
- Then we could build an ActorAnnotator
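
The two patterns above can be sketched with regular expressions in plain Java. In the scenario described, Person spans would come from the previous annotation step; here, as a simplifying assumption, a name is approximated as a sequence of capitalized words and initials. The class name `ActorPatternSketch` and the sample text are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of pattern-based actor detection; names are approximated by capitalization.
class ActorPatternSketch {
    // A capitalized word followed by further capitalized words or initials like "J."
    private static final String NAME =
            "[A-Z][A-Za-z]+(?: (?:[A-Z]\\.|[A-Z][A-Za-z]+))*";

    // Pattern 1: Person "stars as" CharacterInTheMovie
    private static final Pattern STARS_AS =
            Pattern.compile("\\b(" + NAME + ") stars as (" + NAME + ")");

    // Pattern 2: CharacterInTheMovie (Person)
    private static final Pattern IN_PARENTHESES =
            Pattern.compile("\\b(" + NAME + ") \\((" + NAME + ")\\)");

    // Returns a map from actor name to the character they play.
    static Map<String, String> findActors(String text) {
        Map<String, String> actorToCharacter = new LinkedHashMap<>();
        Matcher starsAs = STARS_AS.matcher(text);
        while (starsAs.find()) {
            actorToCharacter.put(starsAs.group(1), starsAs.group(2));
        }
        Matcher inParentheses = IN_PARENTHESES.matcher(text);
        while (inParentheses.find()) {
            // Here the character comes first and the actor is inside the parentheses.
            actorToCharacter.put(inParentheses.group(2), inParentheses.group(1));
        }
        return actorToCharacter;
    }

    public static void main(String[] args) {
        System.out.println(findActors(
                "Michael J. Fox stars as Marty McFly. "
                + "Doc Brown (Christopher Lloyd) builds a time machine."));
    }
}
```

Because names are guessed from capitalization alone, a capitalized word directly before the actor's name would be captured as part of it; relying on previously produced Person annotations avoids this.
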
1. Define TypeSystem
- Define at least one Type in the Type System for each object in the domain model
- It is useful to define more fine-grained Types (for the values of type properties, called Features)
- If we want to extract information about articles, we create an Article type in the Type System
- We’ll also need to create annotations/entities for movies, actors, directors, etc.
2. Define AnalysisEngine descriptor
- Define which type system it’s going to use
- Define which capabilities the analysis engine has: which annotations it needs in order to work and which annotations it will (possibly) generate
- Define configuration parameters for the underlying algorithm
- Define resources needed by the analysis engine
3. Implement Annotator
- Create a new class extending JCasAnnotator_ImplBase
- Implement the process() method, which actually does the job
- The algorithm implementation is (called) in the process() method
- You can use configuration parameters/resources defined in the descriptor
- Optionally override the initialize() and destroy() methods
4. Execute the UIMA pipeline
- Instantiate the AnalysisEngine with its descriptor as a parameter
- Create a CAS which will contain the text to be analyzed and the annotations extracted
- Run the AnalysisEngine on the given CAS
- Browse results