Apache UIMA
   Introduction
Gestione delle Informazioni su Web - 2010/2011
                Tommaso Teofili
        tommaso [at] apache [dot] org
UIM ?

Unstructured Information Management

A wide topic: text, audio, video

  Different (possibly mixed) approaches
  (NLP, Machine Learning, IR, Ontologies,
  Automated reasoning, Knowledge Sources)

Apache UIMA
Apache Software Foundation

  Non-profit corporation

  “...provides organizational, legal, and financial
  support for a broad range of open source
  software projects...”

  “...collaborative and meritocratic development
  process...”

  “...pragmatic Apache License...”
Apache UIMA


Architectural framework to manage
unstructured data (Java, C++, ...)

Former IBM research project donated to ASF

OASIS Standard for unstructured
information management
Apache UIMA - Goals


“Our goal is to support a thriving community
of users and developers of UIMA
frameworks, tools, and annotators, facilitating
the analysis of unstructured content such as
text, audio and video”
Apache UIMA - bridging worlds
Apache UIMA - Overview


 UIMA supports the development, discovery,
 composition and deployment of multi-modal
 analytics for the analysis of unstructured
 information and its integration with search
 technologies
Apache UIMA -
 Multimodal Analysis
Multimodal analysis means the ability to
process a resource from various
“points of view”

Example: a video stream from which we want to
extract subtitles and also automatically
recognize the actors involved

Here, though, we are mainly interested in text...
Sample scenario
Content Management System containing free
text articles about movies

We want such articles to be automatically
enriched with metadata contained inside the
text (movies, directors, actors/actresses,
distribution) and linked to “similar” articles
(i.e., dealing with the same movies or actors)

So that we can search for “similar” articles
Sample scenario - articles
      about movies
Sample scenario

UIMA can help enrich articles with
metadata

Think of filling the instance variables of an
Article.java object with proper values

Then persisting it to a database so that we can
query for articles dealing with the same actors
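A minimal sketch of what such an Article class might look like. The field names and the isSimilarTo() helper are assumptions derived from the metadata listed in the scenario, not code from the course material.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical domain object: the fields mirror the metadata named in the scenario.
class Article {
    final String text;                                   // free text of the article
    final Set<String> movies = new LinkedHashSet<>();    // filled in by the analysis
    final Set<String> directors = new LinkedHashSet<>();
    final Set<String> actors = new LinkedHashSet<>();

    Article(String text) {
        this.text = text;
    }

    // Assumed notion of "similar": the two articles share at least one actor or movie.
    boolean isSimilarTo(Article other) {
        for (String actor : actors) {
            if (other.actors.contains(actor)) return true;
        }
        for (String movie : movies) {
            if (other.movies.contains(movie)) return true;
        }
        return false;
    }
}
```

Once the sets are filled by the analysis, "similar" articles can be found by comparing them, either in memory as above or with an equivalent database query.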
Filling Article with metadata
Sample scenario - metadata
UIMA - Annotations
Apache UIMA -
       Annotation

The association of metadata, such as a label,
with a region of text (or other type of artifact).

For example, the label “Person” associated with a
region of text “Fred Center” constitutes an
annotation. We say “Person” annotates the span
of text from X to Y containing exactly “Fred
Center”
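The definition above can be sketched as a label plus a pair of character offsets. SpanAnnotation and SpanAnnotationDemo are simplified stand-ins written for this example; they are not the UIMA Annotation class, which exposes the same idea through getBegin(), getEnd() and getCoveredText().

```java
// Simplified stand-in for an annotation: a label attached to the span [begin, end) of a text.
class SpanAnnotation {
    final String label;
    final int begin;   // offset of the first covered character (X)
    final int end;     // offset just past the last covered character (Y)

    SpanAnnotation(String label, int begin, int end) {
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns exactly the region of the document that this annotation covers.
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}

class SpanAnnotationDemo {
    // Annotates the first occurrence of the mention with the given label.
    static SpanAnnotation annotate(String document, String label, String mention) {
        int begin = document.indexOf(mention);
        if (begin < 0) {
            throw new IllegalArgumentException("mention not found: " + mention);
        }
        return new SpanAnnotation(label, begin, begin + mention.length());
    }
}
```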
Apache UIMA - Basic Steps

  Domain model definition

  Analysis pipeline definition

  Arrange components:

      Define components reading data from sources

      Add and customize analysis components: Patterns,
      Dictionaries, RegEx, External services, NLP, etc.

      Define components writing information to target
      storage

  Analysis pipeline(s) execution
Defining domain model within
 UIMA using Type Systems

 Type System is the place where we describe which
 metadata we would like to extract

 Low representational gap

 Like almost everything in UIMA: described (and
 generated!) using XML

 Possible to define multiple Type Systems for different
 purposes
How does UIMA extract
     metadata?
Apache UIMA - Analysis
       Engines

 Basic UIMA building blocks

 Analyze a document

   Infer and record descriptive attributes
   (about documents/regions)

 Generate analysis results
Apache UIMA - AEs
Analysis Engines are described by a descriptor
(XML)

Can be Primitive (a single AE) or Aggregate (a
pipeline of AEs)

Analysis algorithms can be switched by changing
the descriptor instead of the code

Contain Type System definitions

Define Capabilities
Apache UIMA -
AnalysisComponent API

 initialize : Performs (once) any startup tasks
 required by this component

 process : Process the resource to analyze
 generating analysis results (metadata)

 destroy : Frees all resources held; called only once,
 when the framework has finished using this component
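The lifecycle described above can be sketched with a deliberately simplified stand-in. SimpleComponent, LifecycleLogger and LifecycleRunner are hypothetical names for this example; the real AnalysisComponent interface in UIMA has more methods, and its initialize() and process() take framework types (a UimaContext and a CAS) rather than plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the three lifecycle methods; NOT the real UIMA interface.
interface SimpleComponent {
    void initialize();               // called once, before any document is processed
    void process(String document);   // called once per document
    void destroy();                  // called once, when the component is no longer needed
}

// Records the order in which the lifecycle methods are invoked.
class LifecycleLogger implements SimpleComponent {
    final List<String> events = new ArrayList<>();

    public void initialize() { events.add("initialize"); }
    public void process(String document) { events.add("process:" + document); }
    public void destroy() { events.add("destroy"); }
}

class LifecycleRunner {
    // Plays the role of the framework: drives a component through its whole lifecycle.
    static void run(SimpleComponent component, List<String> documents) {
        component.initialize();
        try {
            for (String document : documents) {
                component.process(document);
            }
        } finally {
            component.destroy();   // resources are released even if processing fails
        }
    }
}
```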
Apache UIMA -
       Annotators
Analysis Engine algorithm

  Annotator : A software component
  implemented to produce and record
  annotations over regions of an artifact
  (e.g., text document, audio, and video)

  Annotators implement AnalysisComponent
  interface
Apache UIMA - Roles
AnalysisEngine : High level block responsible
for analysis - contains at least one
AnalysisComponent

AnalysisComponent : interface for any
component responsible for analyzing artifacts

Annotator : implementation of
AnalysisComponent responsible for creating
Annotations
Apache UIMA - AEs
Analysis Engines in a
      Pipeline
Apache UIMA - Analysis Results


  Where do analysis results end up?

  How do annotators represent and share their
  results?

  CAS - Common Analysis Structure

  Maintains typed indexes of extracted results
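To make the idea of "typed indexes" concrete, here is a toy structure that holds the document text together with results grouped by type name. MiniCas is a hypothetical, heavily simplified stand-in and does not reproduce the real CAS API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a CAS: the document text plus extracted spans indexed by type name.
class MiniCas {
    final String documentText;
    private final Map<String, List<int[]>> indexByType = new HashMap<>();

    MiniCas(String documentText) {
        this.documentText = documentText;
    }

    // Records a result of the given type covering the span [begin, end).
    void add(String type, int begin, int end) {
        indexByType.computeIfAbsent(type, t -> new ArrayList<>()).add(new int[] {begin, end});
    }

    // Returns the covered text of every result of the given type, in insertion order.
    List<String> covered(String type) {
        List<String> out = new ArrayList<>();
        for (int[] span : indexByType.getOrDefault(type, Collections.emptyList())) {
            out.add(documentText.substring(span[0], span[1]));
        }
        return out;
    }
}
```

One annotator can call add() and a later annotator in the same pipeline can call covered() on the same object, which is how results are shared in this sketch.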
Common Analysis Structure
Which algorithms lie
    beneath AEs?
Apache UIMA & NLP
NLP (Natural Language Processing) is a
theoretically motivated range of
computational techniques for analyzing and
representing naturally occurring texts at one
or more levels of linguistic analysis for the
purpose of achieving human-like language
processing for a range of tasks or
applications

It’s an AI discipline
Apache UIMA & NLP

“accomplish human-like language processing”

  Paraphrase an input text

  Translate the text into another language

  Answer questions about the contents of
  the text

  Draw inferences from the text
Apache UIMA & NLP


“an NLP-based IR system has the goal of
providing more precise, complete information
in response to a user’s real information
need”

various levels of processing
Apache UIMA -
       Approaches

Simplest : Write RegEx and Dictionaries and
mix them together

NLP-like : Tokenize -> Sentence identification
-> PoS Tagging -> Anaphora resolution ->
Named Entities Recognition -> Coreference
Identification ...
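A rough sketch of the "simplest" approach: a dictionary lookup mixed with a regular expression. The movie titles, the year pattern and the class name SimpleExtractor are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleExtractor {
    // Hypothetical dictionary of movie titles.
    static final List<String> MOVIES = List.of("Back to the Future", "Alien");

    // Regular expression for four-digit years from 1900 to 2099.
    static final Pattern YEAR = Pattern.compile("\\b(19|20)\\d{2}\\b");

    // Returns "Movie:<title>" for each dictionary hit, then "Year:<year>" for each regex hit.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        for (String title : MOVIES) {
            if (text.contains(title)) {
                found.add("Movie:" + title);
            }
        }
        Matcher matcher = YEAR.matcher(text);
        while (matcher.find()) {
            found.add("Year:" + matcher.group());
        }
        return found;
    }
}
```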
Analysis Engines in a
      Pipeline
NLP - Language Identification

  NLP takes advantage of language specific
  syntax, forms, rules and meanings

  Not easy to write language independent
  extraction algorithms

  Often this is the first block of NLP pipelines

  Techniques: stopword dictionaries, statistical
  models, etc.
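A minimal sketch of the stopword-dictionary technique: count how many tokens belong to each language's stopword list and pick the language with the most hits. The tiny word lists and the class name are assumptions; a usable identifier would need much larger dictionaries or a statistical model.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class StopwordLanguageGuesser {
    // Tiny, illustrative stopword lists keyed by language code.
    static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("en", Set.of("the", "and", "of", "is", "in", "to"));
        STOPWORDS.put("it", Set.of("il", "la", "di", "e", "che", "un", "per"));
    }

    // Returns the language whose stopwords occur most often, or "unknown" if none occur.
    static String guess(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) hits++;
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```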
NLP - Tokens and Sentences

  Humans learn words’ meaning in order to
  understand whole context semantics

  Split the target text into words to be able to
  analyze their meaning and role

  Discover sentences to later assign roles to
  each token

  Easier for English, Italian & co., but what
  about Chinese?
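A naive sketch of both steps using regular expressions. It assumes that sentences end with '.', '!' or '?' followed by whitespace and that tokens are runs of letters or digits, which is roughly adequate for simple English or Italian text and does not work for languages written without spaces between words. NaiveSegmenter is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveSegmenter {
    // A token is a maximal run of letters or digits; punctuation is dropped.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    // Splits after '.', '!' or '?' when followed by whitespace (abbreviations will break this).
    static List<String> sentences(String text) {
        return Arrays.asList(text.trim().split("(?<=[.!?])\\s+"));
    }

    static List<String> tokens(String sentence) {
        List<String> out = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(sentence);
        while (matcher.find()) {
            out.add(matcher.group());
        }
        return out;
    }
}
```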
NLP - PoS Tagging
Assign a “Part of Speech” (noun, adjective,
verb, etc.) to each token generated in the
previous step

Many language/domain specific patterns can
be discovered and exploited just with
PoS-tagged tokens and sentences
NLP - Chunking & Parsing
 Parse sentences into a meaningful set or
 tree of relationships

 Chunks are the sentence building blocks (e.g.
 verbal forms)

 Parse tree highlights the structure of a
 sentence

 Can leverage logic analysis



 [Figure: chunking and parsing examples]
NLP - Named Entities
       Recognition
Answer the questions: where? when? who?
how often? how much?

Identify key entities in the text

Common techniques: dictionaries, rules,
statistical models
Debugging NER in UIMA
Using UIMA


Define TypeSystem

Define AnalysisEngine descriptor(s)

Implement Annotator(s)

Execute the UIMA pipeline
Sample scenario -
    extract actors
Tokenize article text

Identify sentences

Tag PoS

Identify Persons using regular expressions and PoS

Use Person annotations, Tokens’ PoS and Sentences
to extract relations between terms to identify
Persons who are also Actors
Sample scenario -
     extract persons
I have a dictionary of names (simple to find and/or build)

I use a dictionary-based Annotator to extract annotations of
first names (NameAnnotation)

I don’t have a dictionary of surnames

Every time a matching name (a NameAnnotation) is found we
look for one or more (to cover persons with a double name or
surname) subsequent tokens whose PoS is “undefined” or a
noun (but not a verb) and which start with an uppercase letter

If found, the name + token(s) sequence annotates a
Person (e.g. “Michael J. Fox”)
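The heuristic above can be sketched in plain Java. This version is a simplification under stated assumptions: there is no PoS tagger, so "PoS is undefined or a noun" is approximated by the capitalization check alone; the first-name dictionary is a tiny made-up set; and PersonFinder is a hypothetical name, not the annotator from the course code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PersonFinder {
    // Hypothetical first-name dictionary.
    static final Set<String> FIRST_NAMES = Set.of("Michael", "Sigourney");

    // A token is either an initial such as "J." or a run of letters.
    static final Pattern TOKEN = Pattern.compile("\\p{Lu}\\.|\\p{L}+");

    static List<String> findPersons(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }

        List<String> persons = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (!FIRST_NAMES.contains(tokens.get(i))) {
                i++;
                continue;
            }
            // A first name was found: collect the capitalized tokens that follow it.
            StringBuilder person = new StringBuilder(tokens.get(i));
            int j = i + 1;
            while (j < tokens.size() && Character.isUpperCase(tokens.get(j).charAt(0))) {
                person.append(' ').append(tokens.get(j));
                j++;
            }
            // Only name + at least one following token counts as a Person.
            if (j > i + 1) {
                persons.add(person.toString());
            }
            i = j;
        }
        return persons;
    }
}
```

Because punctuation is ignored, this sketch would wrongly extend a name across a sentence boundary; running it per sentence, as the pipeline above suggests, avoids that.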
From Persons to Actors
 Getting actors can be simple if we know that
 Persons who are also actors perform some well-known
 actions, or that there are widely used patterns

 e.g.: a Person who “stars as” a CharacterInTheMovie (which
 will eventually be tagged as a Person too) is
 also an Actor

 e.g.: if the snippet “CharacterInTheMovie (Person)”
 exists, then Person is usually an Actor

 then we could build an ActorAnnotator
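A rough sketch of how the two patterns could be checked once Persons are known. It works on plain strings rather than on UIMA annotations, and for the second pattern it only checks that the Person appears in parentheses, without verifying that a character name precedes it. ActorFinder is a hypothetical name, not the ActorAnnotator itself.

```java
import java.util.ArrayList;
import java.util.List;

class ActorFinder {
    // Returns the persons that match one of the two textual patterns, in input order.
    static List<String> findActors(String text, List<String> persons) {
        List<String> actors = new ArrayList<>();
        for (String person : persons) {
            // Pattern 1: Person "stars as" CharacterInTheMovie
            boolean starsAs = text.contains(person + " stars as ");
            // Pattern 2 (simplified): "... (Person)"
            boolean inParentheses = text.contains("(" + person + ")");
            if ((starsAs || inParentheses) && !actors.contains(person)) {
                actors.add(person);
            }
        }
        return actors;
    }
}
```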
1. Define TypeSystem

Define at least one Type in the Type System for each
object in the domain model

It is useful to define more fine-grained Types (for the values of
type properties, called Features)

If we want to extract information about articles, we
create an Article type in the Type System

We’ll also need to create annotations/entities for movies,
actors, directors, etc.
2. Define AnalysisEngine descriptor

  Define which type system it’s going to use

  Define which capabilities the analysis engine
  has: which annotations it needs as input and
  which annotations it will (possibly) generate

  Define configuration parameters for the
  underlying algorithm

  Define resources needed by the analysis
  engine
3. Implement Annotator
 create a new class extending JCasAnnotator_ImplBase

 implement the process() method that actually does the
 job

    the algorithm implementation is (called) in the
    process() method

 you can use configuration parameters/resources defined
 in the descriptor

 optionally override the initialize() and destroy() methods
DummyPersonAnnotator
4. Execute the UIMA pipeline

  Instantiate the AnalysisEngine with its
  descriptor as a parameter

  Create a CAS which will contain the text to
  be analyzed and the annotations extracted

  Run the AnalysisEngine on the given CAS

  Browse results
Execute a UIMA pipeline
What’s next


UIMA Use cases

Using UIMA in search engines

Hands on code (assignment)
References
http://www.apache.org

http://uima.apache.org

http://www.oasis-open.org

http://uima.apache.org/d/uimaj-2.3.1/index.html

http://uima.apache.org/d/uimaj-2.3.1/overview_and_setup.html#ugr.ovv.eclipse_setup

http://www.manning.com/ingersoll/

https://github.com/tteofili/samplett/tree/master/giw1011
