Apache UIMA Introduction
1. Apache UIMA
Introduction
Gestione delle Informazioni su Web - 2010/2011
Tommaso Teofili
tommaso [at] apache [dot] org
2. UIM ?
Unstructured Information Management
A wide topic: text, audio, video
Different (possibly mixed) approaches
(NLP, Machine Learning, IR, Ontologies,
Automated reasoning, Knowledge Sources)
Apache UIMA
3. Apache Software Foundation
Non-profit corporation
"...provides organizational, legal, and financial
support for a broad range of open source
software projects..."
"...collaborative and meritocratic development
process..."
"...pragmatic Apache License..."
4. Apache UIMA
Architectural framework to manage
unstructured data (Java, C++, ...)
Former IBM research project donated to ASF
OASIS Standard for unstructured
information management
5. Apache UIMA - Goals
"Our goal is to support a thriving community
of users and developers of UIMA
frameworks, tools, and annotators, facilitating
the analysis of unstructured content such as
text, audio and video"
7. Apache UIMA - Overview
UIMA supports the development, discovery,
composition and deployment of multi-modal
analytics for the analysis of unstructured
information and its integration with search
technologies
8. Apache UIMA -
Multimodal Analysis
Multimodal analysis is the ability to
process a resource from various
"points of view"
Example: a video stream from which we want to
extract subtitles and also automatically
recognize the actors involved
Here, though, we are mainly interested in text...
9. Sample scenario
Content Management System containing free
text articles about movies
We want such articles to be automatically
enriched with metadata contained inside the
text (movies, directors, actors/actresses,
distribution) and linked to "similar" articles
(i.e., dealing with the same movies or actors)
So that we can search for "similar" articles
11. Sample scenario
UIMA can help enrich articles with
metadata
Think of filling the instance variables of an
Article.java object with proper values
Then persisting it to a database to query
articles dealing with the same actors
15. Apache UIMA -
Annotation
The association of metadata, such as a label,
with a region of text (or other type of artifact).
For example, the label "Person" associated with a
region of text "Fred Center" constitutes an
annotation. We say "Person" annotates the span
of text from X to Y containing exactly "Fred
Center"
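A minimal sketch of the idea above: a label attached to a span of text given by begin/end offsets. The class SpanAnnotationSketch and its fields are hypothetical names chosen for illustration; this is not the UIMA Annotation class, only the concept it is built on.

```java
// Hypothetical, simplified model of an annotation (label + span); not a UIMA class.
class SpanAnnotationSketch {
    final String label; // e.g. "Person"
    final int begin;    // offset of the first covered character (inclusive)
    final int end;      // offset just past the last covered character (exclusive)

    SpanAnnotationSketch(String label, int begin, int end) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span: " + begin + ".." + end);
        }
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the document text that this annotation covers.
    String coveredText(String documentText) {
        return documentText.substring(begin, end);
    }

    public static void main(String[] args) {
        String text = "Fred Center works in Rome.";
        int begin = text.indexOf("Fred Center");
        SpanAnnotationSketch person =
                new SpanAnnotationSketch("Person", begin, begin + "Fred Center".length());
        System.out.println(person.label + " annotates [" + person.begin + ", " + person.end
                + "): " + person.coveredText(text));
    }
}
```

Storing offsets rather than the covered string keeps the annotation tied to a position in the document, so two occurrences of the same name remain distinct annotations.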
16. Apache UIMA - Basic Steps
Domain model definition
Analysis pipeline definition
Arrange components:
Define components reading data from sources
Add and customize analysis components: Patterns,
Dictionaries, RegEx, External services, NLP, etc...
Define components writing information to target
storage
Analysis pipeline(s) execution
17. Defining the domain model within
UIMA using Type Systems
Type System is the place where we describe which
metadata we would like to extract
Low representational gap
Like almost everything in UIMA: described (and
generated!) using XML
Possible to define multiple Type Systems for different
purposes
19. Apache UIMA - Analysis
Engines
Basic UIMA building blocks
Analyze a document
Infer and record descriptive attributes
(about documents/regions)
Generating analysis results
20. Apache UIMA - AEs
Analysis Engines are described by a descriptor
(XML)
Can be Primitive (a single AE) or Aggregate (a
pipeline of AEs)
Analysis algorithms can be switched by changing
the descriptor instead of the code
Contain Type System definitions
Define Capabilities
21. Apache UIMA -
AnalysisComponent API
initialize : Performs (once) any startup tasks
required by this component
process : Process the resource to analyze
generating analysis results (metadata)
destroy : Frees all resources held; called only once,
when the framework has finished using this component
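The lifecycle above can be sketched in plain Java. The SimpleComponent interface below is a hypothetical simplification written for this example: it is not the real AnalysisComponent interface, whose methods receive framework objects (a context and a CAS) rather than plain strings. It only shows the calling order: initialize once, process once per document, destroy once.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified lifecycle; NOT the real UIMA AnalysisComponent interface.
class LifecycleSketch {

    interface SimpleComponent {
        void initialize();                  // once, before the first document
        List<String> process(String text);  // once per document, returns results
        void destroy();                     // once, after the last document
    }

    // Toy component whose "analysis" collects the capitalized words of a document.
    static class CapitalizedWords implements SimpleComponent {
        final List<String> calls = new ArrayList<>(); // records the order of lifecycle calls

        public void initialize() { calls.add("initialize"); }

        public List<String> process(String text) {
            calls.add("process");
            List<String> results = new ArrayList<>();
            for (String word : text.split("\\W+")) {
                if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                    results.add(word);
                }
            }
            return results;
        }

        public void destroy() { calls.add("destroy"); }
    }

    // Drives a component the way a framework would: initialize, process each document, destroy.
    static List<List<String>> run(SimpleComponent component, List<String> documents) {
        List<List<String>> allResults = new ArrayList<>();
        component.initialize();
        try {
            for (String document : documents) {
                allResults.add(component.process(document));
            }
        } finally {
            component.destroy(); // resources are released even if processing fails
        }
        return allResults;
    }

    public static void main(String[] args) {
        CapitalizedWords component = new CapitalizedWords();
        List<String> docs = new ArrayList<>();
        docs.add("Michael J. Fox stars in the movie");
        System.out.println(run(component, docs));
        System.out.println(component.calls);
    }
}
```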
22. Apache UIMA -
Annotators
Analysis Engine algorithm
Annotator : A software component
implemented to produce and record
annotations over regions of an artifact
(e.g., text document, audio, and video)
Annotators implement AnalysisComponent
interface
23. Apache UIMA - Roles
AnalysisEngine : High level block responsible
for analysis - contains at least one
AnalysisComponent
AnalysisComponent : interface for any
component responsible for analyzing artifacts
Annotator : implementation of
AnalysisComponent responsible for creating
Annotations
26. Apache UIMA - Analysis Results
Where do analysis results end up?
How do annotators represent and share their
results?
CAS - Common Analysis Structure
Maintains typed indexes of extracted results
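To make the idea of a shared, typed result store concrete, here is a small sketch. MiniCas is a hypothetical stand-in written for this example; the real CAS has a much richer API. It keeps the document text together with the spans recorded for each type name, so that one annotator can write results and a later one can read them back by type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a CAS: document text plus results indexed by type name.
class MiniCas {
    private final String documentText;
    // type name -> spans ([begin, end) offsets) recorded for that type
    private final Map<String, List<int[]>> index = new HashMap<>();

    MiniCas(String documentText) {
        this.documentText = documentText;
    }

    String getDocumentText() {
        return documentText;
    }

    // An annotator records a result under its type.
    void add(String type, int begin, int end) {
        index.computeIfAbsent(type, k -> new ArrayList<>()).add(new int[] {begin, end});
    }

    // A later annotator (or the application) reads back all results of one type.
    List<String> coveredTexts(String type) {
        List<String> texts = new ArrayList<>();
        for (int[] span : index.getOrDefault(type, Collections.<int[]>emptyList())) {
            texts.add(documentText.substring(span[0], span[1]));
        }
        return texts;
    }

    public static void main(String[] args) {
        String text = "Michael J. Fox stars in Back to the Future.";
        MiniCas cas = new MiniCas(text);
        int movieStart = text.indexOf("Back to the Future");
        cas.add("Person", 0, "Michael J. Fox".length());
        cas.add("Movie", movieStart, movieStart + "Back to the Future".length());
        System.out.println(cas.coveredTexts("Person"));
        System.out.println(cas.coveredTexts("Movie"));
    }
}
```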
29. Apache UIMA & NLP
NLP (Natural Language Processing) is a
theoretically motivated range of
computational techniques for analyzing and
representing naturally occurring texts at one
or more levels of linguistic analysis for the
purpose of achieving human-like language
processing for a range of tasks or
applications
It's an AI discipline
30. Apache UIMA & NLP
"accomplish human-like language processing"
Paraphrase an input text
Translate the text into another language
Answer questions about the contents of
the text
Draw inferences from the text
31. Apache UIMA & NLP
"an NLP-based IR system has the goal of
providing more precise, complete information
in response to a user's real information
need"
various levels of processing
32. Apache UIMA -
Approaches
Simplest : Write RegEx and Dictionaries and
mix them together
NLP-like : Tokenize -> Sentence identification
-> PoS Tagging -> Anaphora resolution ->
Named Entity Recognition -> Coreference
Identification ...
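The "simplest" approach can be illustrated with a few lines of plain Java. The dictionary content (MOVIE_TITLES) and the year pattern are assumptions made for this sketch; in a UIMA pipeline each extractor would be wrapped in its own annotator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: one dictionary lookup and one regular expression, used side by side.
class SimpleExtractors {
    // Assumed sample dictionary of known movie titles.
    static final List<String> MOVIE_TITLES = Arrays.asList("Back to the Future", "Casablanca");
    // Four-digit year between 1900 and 2099, delimited by word boundaries.
    static final Pattern YEAR = Pattern.compile("\\b(19|20)\\d{2}\\b");

    // Dictionary approach: report every known title that occurs in the text.
    static List<String> findMovies(String text) {
        List<String> found = new ArrayList<>();
        for (String title : MOVIE_TITLES) {
            if (text.contains(title)) {
                found.add(title);
            }
        }
        return found;
    }

    // RegEx approach: report every substring that looks like a year.
    static List<String> findYears(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = YEAR.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Back to the Future (1985) was directed by Robert Zemeckis.";
        System.out.println(findMovies(text));
        System.out.println(findYears(text));
    }
}
```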
34. NLP - Language Identification
NLP takes advantage of language-specific
syntax, forms, rules and meanings
Not easy to write language-independent
extraction algorithms
Often this is the first block of NLP pipelines
Techniques: Stopwords dictionaries, statistical
models, etc.
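As an illustration of the stopword-dictionary technique, the sketch below counts how many known stopwords of each language occur in a text and picks the language with the most hits. The two tiny word lists are assumptions for the example; real lists are far longer, and statistical models are usually more robust on short texts.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative stopword-based language guesser; the word lists are tiny assumed samples.
class StopwordLanguageGuesser {
    static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "of", "is", "in", "to")));
        STOPWORDS.put("it", new HashSet<>(Arrays.asList("il", "la", "e", "di", "che", "un")));
    }

    // Returns the language whose stopwords occur most often, or "unknown" if none occur.
    static String guess(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("The actor is in the movie"));
        System.out.println(guess("Il film e la storia di un attore"));
    }
}
```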
35. NLP - Tokens and Sentences
Humans learn the meanings of words in order to
understand the semantics of the whole context
Split the target text into words to be able to
analyze their meaning and role
Discover sentences to later assign roles to
each token
Easier for English, Italian & co., but what
about Chinese?
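For languages that separate words with spaces, a first approximation of both steps is available in the Java standard library. The sketch below uses java.text.BreakIterator; it is one possible baseline, not the tokenizer a UIMA pipeline must use, and the class name is chosen for this example. Text written without spaces between words, such as Chinese, generally needs a dedicated segmenter.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Baseline sentence and word splitting with the JDK's BreakIterator.
class BreakIteratorSegmenter {

    static List<String> sentences(String text, Locale locale) {
        return segments(BreakIterator.getSentenceInstance(locale), text, false);
    }

    static List<String> words(String text, Locale locale) {
        return segments(BreakIterator.getWordInstance(locale), text, true);
    }

    // Walks the boundaries reported by the iterator and collects the pieces between them.
    private static List<String> segments(BreakIterator it, String text, boolean wordsOnly) {
        List<String> result = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (piece.isEmpty()) {
                continue; // whitespace between tokens
            }
            if (wordsOnly && !Character.isLetterOrDigit(piece.charAt(0))) {
                continue; // punctuation is reported as its own segment; skip it
            }
            result.add(piece);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The movie was great. I loved it.";
        System.out.println(sentences(text, Locale.ENGLISH));
        System.out.println(words(text, Locale.ENGLISH));
    }
}
```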
36. NLP - PoS Tagging
Assign a "Part of Speech" (noun, adjective,
verb, etc.) to each token generated in the
previous step
Many language/domain-specific patterns can
be discovered and exploited just with
PoS-tagged tokens and sentences
37. NLP - Chunking & Parsing
Parse sentences into a meaningful set or
tree of relationships
Chunks are the sentence building blocks (e.g.,
verbal forms)
Parse tree highlights the structure of a
sentence
Can leverage logic analysis
38. NLP - Named Entity
Recognition
Answer the questions: where? when? who?
how often? how much?
Identify key entities in the text
Common techniques: dictionaries, rules,
statistical models
41. Sample scenario -
extract actors
Tokenize article text
Identify sentences
Tag PoS
Identify Persons using regular expressions and PoS
Use Person annotations, Tokens' PoS and Sentences
to extract relations between terms to identify
Persons who are also Actors
42. Sample scenario -
extract persons
I have a dictionary of names (simple to find and/or build)
I use a dictionary-based Annotator to extract annotations of
first names (NameAnnotation)
I don't have a dictionary of surnames
Every time a matching name (a NameAnnotation) is found we
look for one or more (considering persons with a double name or
surname) subsequent tokens whose PoS is "undefined" or a
noun (but not a verb) and which start with an uppercase letter
If found, then the name + token(s) sequence annotates a
Person (e.g. "Michael J. Fox")
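The heuristic above can be sketched without any NLP library. In this simplified version the PoS check is replaced by the uppercase test alone, tokens are split on whitespace, and the first-name dictionary (FIRST_NAMES) is an assumed two-entry sample; all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of "first name from a dictionary + following capitalized tokens = Person".
// Simplification: no PoS tags; a capitalized token stands in for "noun or undefined".
class PersonFinderSketch {
    static final Set<String> FIRST_NAMES = new HashSet<>(Arrays.asList("Michael", "Robert"));

    static boolean startsUppercase(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    // Strips trailing punctuation, but keeps the period of an initial such as "J.".
    static String clean(String token) {
        String stripped = token.replaceAll("[,;:!?.)]+$", "");
        boolean isInitial = stripped.length() == 1 && token.startsWith(stripped + ".");
        return isInitial ? stripped + "." : stripped;
    }

    static List<String> findPersons(String text) {
        List<String> persons = new ArrayList<>();
        String[] tokens = text.split("\\s+");
        int i = 0;
        while (i < tokens.length) {
            if (!FIRST_NAMES.contains(tokens[i])) {
                i++;
                continue;
            }
            StringBuilder name = new StringBuilder(tokens[i]);
            int extra = 0; // number of tokens appended after the first name
            int j = i + 1;
            while (j < tokens.length && startsUppercase(tokens[j])) {
                String raw = tokens[j++];
                String cleaned = clean(raw);
                name.append(' ').append(cleaned);
                extra++;
                if (!cleaned.equals(raw)) {
                    break; // trailing punctuation ends the name
                }
            }
            if (extra > 0) {
                persons.add(name.toString()); // a first name alone is not a Person
            }
            i = j;
        }
        return persons;
    }

    public static void main(String[] args) {
        System.out.println(findPersons(
                "Michael J. Fox stars in a movie directed by Robert Zemeckis."));
    }
}
```

With real PoS tags, the uppercase test would be combined with the "noun or undefined, not a verb" condition, which avoids treating a capitalized sentence-initial word as part of a name.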
43. from Persons to Actors
Getting actors can be simple if we know that
Persons who are also actors perform some well-known
actions or appear in widely used patterns
e.g.: a Person "stars as" CharacterInTheMovie (which
may be tagged as Person too) when that Person is
also an Actor
e.g.: if the snippet "CharacterInTheMovie (Person)"
exists, then Person is usually an Actor
then we could build an ActorAnnotator
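A minimal sketch of the two patterns, assuming the Persons have already been extracted and are passed in as plain strings. The class and method names are hypothetical, and a real ActorAnnotator would work on annotations in the CAS instead of substring checks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch: decide which already-extracted Persons are Actors using two textual patterns.
class ActorFinderSketch {

    // Pattern 1: Person "stars as" CharacterInTheMovie
    static boolean starsAs(String text, String person) {
        return text.contains(person + " stars as ");
    }

    // Pattern 2: CharacterInTheMovie (Person), i.e. a capitalized word right before "(Person)"
    static boolean inParenthesesAfterCharacter(String text, String person) {
        Pattern pattern = Pattern.compile(
                "\\p{Lu}[\\p{L}.]*\\s+\\(" + Pattern.quote(person) + "\\)");
        return pattern.matcher(text).find();
    }

    static List<String> findActors(String text, List<String> persons) {
        List<String> actors = new ArrayList<>();
        for (String person : persons) {
            boolean matches = starsAs(text, person) || inParenthesesAfterCharacter(text, person);
            if (matches && !actors.contains(person)) {
                actors.add(person);
            }
        }
        return actors;
    }

    public static void main(String[] args) {
        String text = "Michael J. Fox stars as Marty McFly. "
                + "Doc Brown (Christopher Lloyd) builds the machine. "
                + "Robert Zemeckis directed it.";
        List<String> persons = new ArrayList<>();
        persons.add("Michael J. Fox");
        persons.add("Christopher Lloyd");
        persons.add("Robert Zemeckis");
        System.out.println(findActors(text, persons));
    }
}
```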
44. 1. Define TypeSystem
Define at least one Type inside the Type System for each
object inside the domain model
Useful to define more fine-grained Types (for values of
type properties, called Features)
If we want to extract information about articles we
create an Article type inside the Type System
We'll also need to create annotations/entities for movies,
actors, directors, etc...
45. 2. Define AnalysisEngine descriptor
Define which type system it's going to use
Define which capabilities the analysis engine
has: which annotations it needs in order to work and
which annotations it will (possibly) generate
Define configuration parameters for the
underlying algorithm
Define resources needed by the analysis
engine
46. 3. Implement Annotator
create a new class extending JCasAnnotator_ImplBase
implement the process() method that actually does the
job
the algorithm implementation is (called) in the
process() method
you can use configuration parameters/resources defined
in the descriptor
optionally override initialize() and destroy() methods
48. 4. Execute the UIMA pipeline
Instantiate the AnalysisEngine with its
descriptor as a parameter
Create a CAS which will contain the text to
be analyzed and the annotations extracted
Run the AnalysisEngine on the given CAS
Browse results