Apache UIMA
   Introduction
Gestione delle Informazioni su Web - 2010/2011
                Tommaso Teofili
        tommaso [at] apache [dot] org
UIM ?

Unstructured Information Management

A wide topic: text, audio, video

  Different (possibly mixed) approaches
  (NLP, Machine Learning, IR, Ontologies,
  Automated reasoning, Knowledge Sources)

Apache UIMA
Apache Software Foundation

  Non-profit corporation

  “...provides organizational, legal, and financial
  support for a broad range of open source
  software projects...”

  “...collaborative and meritocratic development
  process...”

  “...pragmatic Apache License...”
Apache UIMA


Architectural framework to manage
unstructured data (Java, C++, ...)

Former IBM research project donated to ASF

OASIS Standard for unstructured
information management
Apache UIMA - Goals


“Our goal is to support a thriving community
of users and developers of UIMA
frameworks, tools, and annotators, facilitating
the analysis of unstructured content such as
text, audio and video”
Apache UIMA - bridging worlds
Apache UIMA - Overview


 UIMA supports the development, discovery,
 composition and deployment of multi-modal
 analytics for the analysis of unstructured
 information and its integration with search
 technologies
Apache UIMA -
 Multimodal Analysis
Multimodal analysis means the ability to
process a resource from various
“points of view”

Sample: a video stream from which we want to
extract subtitles and also automatically
recognize the actors involved

Here, though, we are mainly interested in text...
Sample scenario
Content Management System containing free
text articles about movies

We want such articles to be automatically
enriched with metadata contained inside the
text (movies, directors, actors/actresses,
distribution) and linked to “similar” articles
(i.e., dealing with the same movies or actors)

So that we can search for “similar” articles
Sample scenario - articles
      about movies
Sample scenario

UIMA can help enrich articles with
metadata

Think of filling the instance variables of an
Article.java class with proper values

Then persisting it to a database to query for
articles dealing with the same actors
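
A minimal plain-Java sketch of that idea, without UIMA or a real database. The Article fields and the ArticleStore class are hypothetical names chosen for illustration; the in-memory list only stands in for the database.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical domain object; the field names are illustrative assumptions. */
final class Article {
    final String title;
    final Set<String> movies = new HashSet<>();
    final Set<String> directors = new HashSet<>();
    final Set<String> actors = new HashSet<>();

    Article(String title) { this.title = title; }
}

/** In-memory stand-in for the database mentioned in the slide. */
final class ArticleStore {
    private final List<Article> articles = new ArrayList<>();

    void save(Article article) { articles.add(article); }

    /** Returns the other stored articles sharing at least one actor with the target. */
    List<Article> similarByActor(Article target) {
        List<Article> result = new ArrayList<>();
        for (Article other : articles) {
            if (other != target && !Collections.disjoint(other.actors, target.actors)) {
                result.add(other);
            }
        }
        return result;
    }
}
```

Once the metadata sets are filled (by hand here, by an analysis pipeline later), "similar" becomes a simple set-overlap query.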
Filling Article with metadata
Sample scenario - metadata
UIMA - Annotations
Apache UIMA -
       Annotation

The association of metadata, such as a label,
with a region of text (or another type of artifact).

For example, the label “Person” associated with a
region of text “Fred Center” constitutes an
annotation. We say “Person” annotates the span
of text from X to Y containing exactly “Fred
Center”
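
The definition can be modelled in a few lines of plain Java. This is only a sketch: SpanAnnotation and AnnotationDemo are hypothetical names, not UIMA classes, and the sample sentence is made up. X and Y are modelled as a begin offset and an exclusive end offset.

```java
/** A label attached to the half-open character range [begin, end) of a text. */
record SpanAnnotation(String label, int begin, int end) {
    SpanAnnotation {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span");
        }
    }

    /** Returns the text the annotation covers in the given document. */
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}

final class AnnotationDemo {
    /** Annotates the first occurrence of phrase, or returns null if it is absent. */
    static SpanAnnotation annotate(String document, String phrase, String label) {
        int begin = document.indexOf(phrase);
        return begin < 0 ? null : new SpanAnnotation(label, begin, begin + phrase.length());
    }

    public static void main(String[] args) {
        String doc = "Yesterday Fred Center visited the lab.";
        SpanAnnotation person = annotate(doc, "Fred Center", "Person");
        System.out.println(person + " covers: " + person.coveredText(doc));
    }
}
```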
Apache UIMA - Basic Steps

  Domain model definition

  Analysis pipeline definition

  Arrange components:

      Define components draining data from sources

      Add and customize analysis components: Patterns,
      Dictionaries, RegEx, External services, NLP, etc...

      Define components outputting information on target
      storages

  Analysis pipeline(s) execution
Defining domain model within
 UIMA using Type Systems

 Type System is the place where we describe which
 metadata we would like to extract

 Low representational gap

 Like almost everything in UIMA: described (and
 generated!) using XML

 Possible to define multiple Type Systems for different
 purposes
How does UIMA extract
     metadata?
Apache UIMA - Analysis
       Engines

 Basic UIMA building blocks

 Analyze a document

   Infer and record descriptive attributes
   (about documents/regions)

 Generating analysis results
Apache UIMA - AEs
Analysis Engines are described by a descriptor
(XML)

Can be Primitive (a single AE) or Aggregate (a
pipeline of AEs)

Analysis algorithms can be switched by changing
the descriptor instead of the code

Contain Type System definitions

Define Capabilities
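
The primitive/aggregate distinction can be sketched in plain Java. This is a library-free model, not the UIMA API: Doc, Engine and Aggregate are hypothetical names, and the list of delegates plays the role that the XML descriptor plays in UIMA.

```java
import java.util.ArrayList;
import java.util.List;

/** Shared state passed along the pipeline (a stand-in for the analysis results). */
final class Doc {
    final String text;
    final List<String> notes = new ArrayList<>();

    Doc(String text) { this.text = text; }
}

/** Hypothetical minimal engine contract. */
interface Engine {
    void process(Doc doc);
}

/** An aggregate is itself an Engine that delegates to its children in order. */
final class Aggregate implements Engine {
    private final List<Engine> delegates;

    Aggregate(List<Engine> delegates) { this.delegates = List.copyOf(delegates); }

    @Override
    public void process(Doc doc) {
        for (Engine engine : delegates) {
            engine.process(doc);
        }
    }
}

final class PipelineDemo {
    static Doc run(String text) {
        // Two "primitive" engines, each adding one result to the shared document.
        Engine lengthCounter = d -> d.notes.add("chars=" + d.text.length());
        Engine wordCounter = d -> d.notes.add("words=" + d.text.trim().split("\\s+").length);
        // Swapping algorithms means changing this list, not the engines' code.
        Engine pipeline = new Aggregate(List.of(lengthCounter, wordCounter));
        Doc doc = new Doc(text);
        pipeline.process(doc);
        return doc;
    }
}
```

Because the aggregate implements the same contract as its parts, callers do not need to know whether they hold a single engine or a whole pipeline.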
Apache UIMA -
AnalysisComponent API

 initialize : Performs (once) any startup tasks
 required by this component

 process : Process the resource to analyze
 generating analysis results (metadata)

 destroy : Frees all resources held; called only once,
 when the framework has finished using this component
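
A plain-Java sketch of this three-phase lifecycle. The Component interface below is a simplified, hypothetical stand-in for the real AnalysisComponent interface (whose method signatures differ); the "keyword" parameter is likewise an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical three-phase contract mirroring the slide; not the real UIMA interface. */
interface Component {
    void initialize(Map<String, String> params);
    List<String> process(String text);
    void destroy();
}

/** Reports every position at which a configured keyword occurs. */
final class KeywordComponent implements Component {
    private String keyword;
    final List<String> lifecycle = new ArrayList<>(); // records the call order for the demo

    @Override
    public void initialize(Map<String, String> params) {
        keyword = params.getOrDefault("keyword", "UIMA");
        lifecycle.add("initialize");
    }

    @Override
    public List<String> process(String text) {
        lifecycle.add("process");
        List<String> hits = new ArrayList<>();
        for (int i = text.indexOf(keyword); i >= 0; i = text.indexOf(keyword, i + 1)) {
            hits.add(keyword + "@" + i);
        }
        return hits;
    }

    @Override
    public void destroy() {
        keyword = null; // release whatever was acquired in initialize()
        lifecycle.add("destroy");
    }
}

final class LifecycleDemo {
    /** initialize once, process once per document, destroy once. */
    static KeywordComponent runOnce(String... docs) {
        KeywordComponent component = new KeywordComponent();
        component.initialize(Map.of("keyword", "UIMA"));
        try {
            for (String doc : docs) {
                component.process(doc);
            }
        } finally {
            component.destroy();
        }
        return component;
    }
}
```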
Apache UIMA -
       Annotators
Analysis Engine algorithm

  Annotator : A software component
  implemented to produce and record
  annotations over regions of an artifact
  (e.g., text document, audio, and video)

  Annotators implement AnalysisComponent
  interface
Apache UIMA - Roles
AnalysisEngine : High level block responsible
for analysis - contains at least one
AnalysisComponent

AnalysisComponent : interface for any
component responsible for analyzing artifacts

Annotator : implementation of
AnalysisComponent responsible for creating
Annotations
Apache UIMA - AEs
Analysis Engines in a
      Pipeline
Apache UIMA - Analysis Results


  Where do analysis results end up?

  How do annotators represent and share their
  results?

  CAS - Common Analysis Structure

  Maintain typed indexes of extracted results
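
To make the idea concrete, here is a toy, library-free model of a shared structure with typed indexes. MiniCas is a hypothetical class, far simpler than the real CAS; it only shows how one analysis step can read the results another step left behind.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy stand-in for a CAS: the document text plus results indexed by type name. */
final class MiniCas {
    record Entry(String type, int begin, int end) {}

    private final String text;
    private final Map<String, List<Entry>> index = new LinkedHashMap<>();

    MiniCas(String text) { this.text = text; }

    void add(String type, int begin, int end) {
        index.computeIfAbsent(type, t -> new ArrayList<>()).add(new Entry(type, begin, end));
    }

    /** All results of one type, in insertion order (a snapshot copy). */
    List<Entry> select(String type) {
        return List.copyOf(index.getOrDefault(type, List.of()));
    }

    String covered(Entry entry) { return text.substring(entry.begin(), entry.end()); }
}

final class MiniCasDemo {
    static MiniCas analyze(String text) {
        MiniCas cas = new MiniCas(text);
        // First step: record a Token for every run of letters.
        Matcher m = Pattern.compile("\\p{L}+").matcher(text);
        while (m.find()) {
            cas.add("Token", m.start(), m.end());
        }
        // Second step: reads the Tokens left by the first step and adds its own type.
        for (MiniCas.Entry token : cas.select("Token")) {
            if (Character.isUpperCase(text.charAt(token.begin()))) {
                cas.add("Capitalized", token.begin(), token.end());
            }
        }
        return cas;
    }
}
```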
Common Analysis Structure
Which algorithms lay
    under AEs?
Apache UIMA & NLP
NLP (Natural Language Processing) is a
theoretically motivated range of
computational techniques for analyzing and
representing naturally occurring texts at one
or more levels of linguistic analysis for the
purpose of achieving human-like language
processing for a range of tasks or
applications

It’s an AI discipline
Apache UIMA & NLP

“accomplish human-like language processing”

  Paraphrase an input text

  Translate the text into another language

  Answer questions about the contents of
  the text

  Draw inferences from the text
Apache UIMA & NLP


“an NLP-based IR system has the goal of
providing more precise, complete information
in response to a user’s real information
need”

various levels of processing
Apache UIMA -
       Approaches

Simplest : Write RegEx and Dictionaries and
mix them together

NLP-like : Tokenize -> Sentence identification
-> PoS Tagging -> Anaphora resolution ->
Named Entities Recognition -> Coreference
Identification ...
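
The "simplest" approach can be sketched in plain Java as follows. The year-in-parentheses rule, the title dictionary and the class names are illustrative assumptions, not part of UIMA.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SimpleExtractors {
    record Hit(String label, int begin, int end, String text) {}

    // Illustrative rule: a four-digit year in parentheses, e.g. "(1985)".
    private static final Pattern YEAR = Pattern.compile("\\((\\d{4})\\)");

    static List<Hit> byRegex(String doc) {
        List<Hit> hits = new ArrayList<>();
        Matcher m = YEAR.matcher(doc);
        while (m.find()) {
            hits.add(new Hit("Year", m.start(1), m.end(1), m.group(1)));
        }
        return hits;
    }

    static List<Hit> byDictionary(String doc, List<String> titles) {
        List<Hit> hits = new ArrayList<>();
        for (String title : titles) {
            for (int i = doc.indexOf(title); i >= 0; i = doc.indexOf(title, i + 1)) {
                hits.add(new Hit("Movie", i, i + title.length(), title));
            }
        }
        return hits;
    }

    /** Mixes both sources and orders the hits by their position in the text. */
    static List<Hit> extract(String doc, List<String> titles) {
        List<Hit> all = new ArrayList<>(byDictionary(doc, titles));
        all.addAll(byRegex(doc));
        all.sort(Comparator.comparingInt(Hit::begin));
        return all;
    }
}
```

The NLP-like chain goes further by letting each step build on the output of the previous one instead of looking at raw characters only.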
Analysis Engines in a
      Pipeline
NLP - Language Identification

  NLP takes advantage of language-specific
  syntax, forms, rules and meanings

  Not easy to write language-independent
  extraction algorithms

  Often this is the first block of NLP pipelines

  Techniques: Stopwords dictionaries, statistical
  models, etc.
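
A minimal sketch of the stopword-dictionary technique in plain Java. The two tiny word lists are assumptions made for the example; a usable identifier would need far larger lists or a statistical model.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class StopwordLanguageGuesser {
    // Tiny illustrative stopword lists (assumed for this sketch).
    private static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("en", Set.of("the", "and", "is", "of", "in", "to", "a"));
        STOPWORDS.put("it", Set.of("il", "la", "e", "di", "che", "un", "per"));
    }

    /** Returns the language whose stopwords occur most often, or "unknown" if none occur. */
    static String guess(String text) {
        String best = "unknown";
        int bestCount = 0;
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        for (Map.Entry<String, Set<String>> lang : STOPWORDS.entrySet()) {
            int count = 0;
            for (String word : words) {
                if (lang.getValue().contains(word)) {
                    count++;
                }
            }
            if (count > bestCount) {
                bestCount = count;
                best = lang.getKey();
            }
        }
        return best;
    }
}
```

The guessed language can then select which language-specific tokenizer, tagger or dictionary the later steps use.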
NLP - Tokens and Sentences

  Humans learn words’ meanings in order to
  understand the semantics of the whole context

  Split the target text into words to be able to
  analyze their meaning and role

  Discover sentences to later assign roles to
  each token

  Easier for English, Italian & co., but what
  about Chinese?
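
For whitespace-delimited languages, both steps can be sketched with the JDK's java.text.BreakIterator. This is an illustration only, not what a UIMA tokenizer does internally; TextSegmenter is a hypothetical helper name.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TextSegmenter {
    /** Splits text into sentences using the JDK's locale-aware boundary rules. */
    static List<String> sentences(String text, Locale locale) {
        return segments(BreakIterator.getSentenceInstance(locale), text, false);
    }

    /** Splits text into word tokens, dropping whitespace and punctuation segments. */
    static List<String> tokens(String text, Locale locale) {
        return segments(BreakIterator.getWordInstance(locale), text, true);
    }

    private static List<String> segments(BreakIterator it, String text, boolean wordsOnly) {
        List<String> out = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (piece.isEmpty()) {
                continue; // pure whitespace between words
            }
            if (wordsOnly && !Character.isLetterOrDigit(piece.codePointAt(0))) {
                continue; // punctuation segment
            }
            out.add(piece);
        }
        return out;
    }
}
```

Languages written without spaces between words generally need dictionary-based or statistical segmenters, which is why the question about Chinese matters.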
NLP - PoS Tagging
Assign a “Part of Speech” (noun, adjective,
verb, etc.) to each token generated in the
previous step

Many language/domain-specific patterns can
be discovered and exploited just with
PoS-tagged tokens and sentences
NLP - Chunking & Parsing
 Parse sentences into a meaningful set or
 tree of relationships

 Chunks are the sentence building blocks (i.e.
 verbal forms)

 Parse tree highlights the structure of a
 sentence

 Can leverage logic analysis



    (Figures: chunking example; parsing example)
NLP - Named Entities
       Recognition
Answer the
questions: where?
when? who? how
often? how much?

Identify key entities
in the text

Common techniques:
dictionaries, rules,
statistical models
Debugging NER in UIMA
Using UIMA


Define TypeSystem

Define AnalysisEngine descriptor(s)

Implement Annotator(s)

Execute the UIMA pipeline
Sample scenario -
    extract actors
Tokenize article text

Identify sentences

Tag PoS

Identify Persons using regular expressions and PoS

Use Person annotations, Tokens’ PoS and Sentences
to extract relations between terms to identify
Persons who are also Actors
Sample scenario -
     extract persons
I have a dictionary of first names (simple to find and/or build)

I use a dictionary-based Annotator to extract annotations of
first names (NameAnnotation)

I don’t have a dictionary of surnames

Every time a matching name (a NameAnnotation) is found we
look for one or more (considering persons with a double name or
surname) subsequent tokens whose PoS is “undefined” or a
noun (but not a verb) and which start with an uppercase letter

If found, then the name + token(s) sequence annotates a
Person (e.g. “Michael J. Fox”)
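
The heuristic above can be sketched in plain Java as follows. This is a library-free model, not the deck's DummyPersonAnnotator: the Token record, the PoS labels ("noun", "verb", "undefined") and the PersonFinder class are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class PersonFinder {
    /** A token with a coarse part-of-speech tag (label values are assumed). */
    record Token(String text, String pos) {}

    private final Set<String> firstNames;

    PersonFinder(Set<String> firstNames) { this.firstNames = firstNames; }

    /** Returns "first name + following name-like tokens" sequences found in the tokens. */
    List<String> find(List<Token> tokens) {
        List<String> persons = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (!firstNames.contains(tokens.get(i).text())) {
                i++;
                continue;
            }
            StringBuilder name = new StringBuilder(tokens.get(i).text());
            int j = i + 1;
            while (j < tokens.size() && looksLikeNamePart(tokens.get(j))) {
                name.append(' ').append(tokens.get(j).text());
                j++;
            }
            if (j > i + 1) {
                persons.add(name.toString()); // at least one following token matched
            }
            i = j; // j is always greater than i, so the loop advances
        }
        return persons;
    }

    /** PoS is "undefined" or a noun, and the token starts with an uppercase letter. */
    private static boolean looksLikeNamePart(Token token) {
        boolean posOk = token.pos().equals("undefined") || token.pos().equals("noun");
        return posOk && !token.text().isEmpty() && Character.isUpperCase(token.text().charAt(0));
    }
}
```

A first name followed by no name-like token yields nothing, so the sketch trades recall for precision.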
From Persons to Actors
 Getting actors can be simple if we know that
 Persons who are also actors perform some well-known
 actions, or that there are widely used patterns

 e.g.: a Person “stars as” CharacterInTheMovie (which
 may also end up tagged as a Person) when the Person is
 also an Actor

 e.g.: if the snippet “CharacterInTheMovie (Person)”
 exists, then Person is usually an Actor

 then we could build an ActorAnnotator
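
A possible shape for those two rules, as a plain-Java sketch that works on raw text and an already-known list of persons. ActorRules is a hypothetical name; the second rule is simplified to "the person appears in parentheses", and a real ActorAnnotator would work on annotations rather than strings.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

final class ActorRules {
    /** Returns the persons that match one of two illustrative surface patterns. */
    static Set<String> actors(String text, List<String> persons) {
        Set<String> actors = new LinkedHashSet<>();
        for (String person : persons) {
            String quoted = Pattern.quote(person);
            // Rule 1: "<Person> stars as ..."
            boolean starsAs = Pattern.compile(quoted + "\\s+stars\\s+as\\b").matcher(text).find();
            // Rule 2 (simplified): "<Character> (<Person>)", the person is in parentheses
            boolean inParens = Pattern.compile("\\(\\s*" + quoted + "\\s*\\)").matcher(text).find();
            if (starsAs || inParens) {
                actors.add(person);
            }
        }
        return actors;
    }
}
```

Persons matched by neither rule, such as a director mentioned in passing or the character itself, stay plain Persons.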
1. Define TypeSystem

Define at least one Type inside the Type System for each
object inside the domain model

Useful to define more fine-grained Types (for the values of
type properties, called Features)

If we want to extract information about articles we
create an Article type inside the Type System

We’ll also need to create annotations/entities for movies,
actors, directors, etc...
2. Define AnalysisEngine descriptor

  Define which type system it’s going to use

  Define which capabilities the analysis engine
  has: which annotations it needs in order to work and
  which annotations it may generate

  Define configuration parameters for the
  underlying algorithm

  Define resources needed by the analysis
  engine
3. Implement Annotator
 create a new class extending JCasAnnotator_ImplBase

 implement the process() method that actually does the
 job

    the algorithm implementation is (called) in the
    process() method

 you can use configuration parameters/resources defined
 in the descriptor

 optionally override the initialize() and destroy() methods
DummyPersonAnnotator
4. Execute the UIMA pipeline

  Instantiate the AnalysisEngine with its
  descriptor as a parameter

  Create a CAS which will contain the text to
  be analyzed and the annotations extracted

  Run the AnalysisEngine on the given CAS

  Browse results
Execute a UIMA pipeline
What’s next


UIMA Use cases

Using UIMA in search engines

Hands on code (assignment)
References
http://www.apache.org

http://uima.apache.org

http://www.oasis-open.org

http://uima.apache.org/d/uimaj-2.3.1/index.html

http://uima.apache.org/d/uimaj-2.3.1/overview_and_setup.html#ugr.ovv.eclipse_setup

http://www.manning.com/ingersoll/

https://github.com/tteofili/samplett/tree/master/giw1011

Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
Ā 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the Web
Ā 

Recently uploaded

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
Ā 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
Ā 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
Ā 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
Ā 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
Ā 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
Ā 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
Ā 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
Ā 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
Ā 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
Ā 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
Ā 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
Ā 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vƔzquez
Ā 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
Ā 
Navi Mumbai Call Girls šŸ„° 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls šŸ„° 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls šŸ„° 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls šŸ„° 8617370543 Service Offer VIP Hot ModelDeepika Singh
Ā 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
Ā 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
Ā 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
Ā 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
Ā 

Recently uploaded (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
Ā 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
Ā 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
Ā 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
Ā 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Ā 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
Ā 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Ā 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Ā 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Ā 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
Ā 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
Ā 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
Ā 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
Ā 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Ā 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
Ā 
Navi Mumbai Call Girls šŸ„° 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls šŸ„° 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls šŸ„° 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls šŸ„° 8617370543 Service Offer VIP Hot Model
Ā 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
Ā 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Ā 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
Ā 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Ā 

Apache UIMA Introduction

  • 1. Apache UIMA Introduction Gestione delle Informazioni su Web - 2010/2011 Tommaso Teofili tommaso [at] apache [dot] org
  • 2. UIM ? Unstructured Information Management A wide topic: text, audio, video Different (possibly mixed) approaches (NLP, Machine Learning, IR, Ontologies, Automated reasoning, Knowledge Sources) Apache UIMA
  • 3. Apache Software Foundation Non-profit corporation “...provides organizational, legal, and financial support for a broad range of open source software projects...” “...collaborative and meritocratic development process...” “...pragmatic Apache License...”
  • 4. Apache UIMA Architectural framework to manage unstructured data (Java, C++, ...) Former IBM research project donated to ASF OASIS Standard for unstructured information management
  • 5. Apache UIMA - Goals “Our goal is to support a thriving community of users and developers of UIMA frameworks, tools, and annotators, facilitating the analysis of unstructured content such as text, audio and video”
  • 6. Apache UIMA - bridging worlds
  • 7. Apache UIMA - Overview UIMA supports the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies
  • 8. Apache UIMA - Multimodal Analysis Multimodal Analysis means the ability to process a resource from various “points of view” Sample: a video stream from which we want to extract subtitles and also automatically recognize the actors involved We are mainly interested in text, though...
  • 9. Sample scenario Content Management System containing free text articles about movies We want such articles to be automatically enriched with metadata contained inside the text (movies, directors, actors/actresses, distribution) and linked to “similar” articles (i.e. dealing with the same movies or actors) So that we can search for “similar” articles
  • 10. Sample scenario - articles about movies
  • 11. Sample scenario UIMA can help enrich articles with metadata Think of filling the instance variables of an Article.java class with proper values Then persisting it to a database to query articles dealing with the same actors
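To make the "Article.java" idea concrete, here is a minimal sketch of such a domain object. The field names and the `sharesActorWith` helper are assumptions made for illustration; the slides do not show the actual class, and persistence is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical domain object: the fields mirror the metadata named in the
// scenario (movies, directors, actors); none of this comes from the slides.
class Article {
    private final String text;
    private final List<String> movies = new ArrayList<>();
    private final List<String> directors = new ArrayList<>();
    private final List<String> actors = new ArrayList<>();

    Article(String text) {
        this.text = text;
    }

    String getText() { return text; }
    List<String> getMovies() { return movies; }
    List<String> getDirectors() { return directors; }
    List<String> getActors() { return actors; }

    // Assumed notion of "similar": two articles mention at least one common actor.
    boolean sharesActorWith(Article other) {
        for (String actor : actors) {
            if (other.actors.contains(actor)) {
                return true;
            }
        }
        return false;
    }
}
```

An analysis pipeline would fill the lists; a database query on the actor column would then play the role of `sharesActorWith`.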
  • 13. Sample scenario - metadata
  • 15. Apache UIMA - Annotation The association of metadata, such as a label, with a region of text (or other type of artifact). For example, the label “Person” associated with a region of text “Fred Center” constitutes an annotation. We say “Person” annotates the span of text from X to Y containing exactly “Fred Center”
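The "label over a span from X to Y" idea can be sketched in plain Java as below. `SpanAnnotation` is a hypothetical stand-in written for this example, not UIMA's own annotation class; it only assumes that X is an inclusive start offset and Y an exclusive end offset.

```java
// Stand-in for an annotation: a label attached to the character span [begin, end).
final class SpanAnnotation {
    final String label;
    final int begin; // offset of the first covered character (X)
    final int end;   // offset just past the last covered character (Y)

    SpanAnnotation(String label, int begin, int end) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span: " + begin + ".." + end);
        }
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns exactly the region of the document that this annotation covers.
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}
```

Keeping only offsets (rather than a copy of the text) lets several annotations overlap on the same document without duplicating it.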
  • 16. Apache UIMA - Basic Steps Domain model definition Analysis pipeline definition Arrange components: Define components draining data from sources Add and customize analysis components: Patterns, Dictionaries, RegEx, External services, NLP, etc... Define components outputting information on target storages Analysis pipeline(s) execution
  • 17. Defining domain model within UIMA using Type Systems Type System is the place where we describe which metadata we would like to extract Low representational gap Like almost everything in UIMA: described (and generated!) using XML Possible to define multiple Type Systems for different purposes
  • 18. How does UIMA extract metadata?
  • 19. Apache UIMA - Analysis Engines Basic UIMA building blocks Analyze a document Infer and record descriptive attributes (about documents/regions) Generating analysis results
  • 20. Apache UIMA - AEs Analysis Engines are described by a descriptor (XML) Can be Primitive (a single AE) or Aggregate (a pipeline of AEs) Analysis algorithms can be switched by changing the descriptor instead of the code Contain TypeSystem definitions Define Capabilities
  • 21. Apache UIMA - AnalysisComponent API initialize : Performs (once) any startup tasks required by this component process : Processes the resource to analyze, generating analysis results (metadata) destroy : Frees all resources held, called only once when the framework has finished using this component
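The initialize / process / destroy lifecycle can be illustrated with a simplified stand-in. `SketchComponent` below is not UIMA's `AnalysisComponent` interface (the real methods take framework-specific arguments); it is a hypothetical reduction that only shows the calling order: initialize once, process once per document, destroy once at the end.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the three lifecycle methods (an assumption, not the UIMA API).
interface SketchComponent {
    void initialize();                                    // one-time startup work
    void process(String document, List<String> results);  // called once per document
    void destroy();                                       // one-time cleanup
}

// Toy component: records every word that starts with an uppercase letter.
class UppercaseWordFinder implements SketchComponent {
    private boolean ready;

    public void initialize() {
        ready = true;
    }

    public void process(String document, List<String> results) {
        if (!ready) {
            throw new IllegalStateException("initialize() was not called");
        }
        for (String word : document.split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                results.add(word);
            }
        }
    }

    public void destroy() {
        ready = false;
    }
}

class LifecycleDemo {
    // Drives a component through its lifecycle the way a framework would.
    static List<String> run(SketchComponent component, List<String> documents) {
        List<String> results = new ArrayList<>();
        component.initialize();
        try {
            for (String document : documents) {
                component.process(document, results);
            }
        } finally {
            component.destroy(); // always release resources, even on failure
        }
        return results;
    }
}
```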
  • 22. Apache UIMA - Annotators Analysis Engine algorithm Annotator : A software component implemented to produce and record annotations over regions of an artifact (e.g., text document, audio, and video) Annotators implement AnalysisComponent interface
  • 23. Apache UIMA - Roles AnalysisEngine : High level block responsible for analysis - contains at least one AnalysisComponent AnalysisComponent : interface for any component responsible for analyzing artifacts Annotator : implementation of AnalysisComponent responsible for creating Annotations
  • 25. Analysis Engines in a Pipeline
  • 26. Apache UIMA - Analysis Results Where do analysis results end up? How do annotators represent and share their results? CAS - Common Analysis Structure Maintains typed indexes of extracted results
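As a rough mental model of "typed indexes of extracted results", the sketch below keeps the document text together with a map from type name to the spans recorded for that type. `MiniCas` is a hypothetical toy class; the real CAS is considerably richer (type hierarchy, features, multiple views), so this only illustrates how annotators can share results through one structure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a shared analysis structure: the text plus an index of spans per type.
class MiniCas {
    private final String documentText;
    private final Map<String, List<int[]>> index = new LinkedHashMap<>();

    MiniCas(String documentText) {
        this.documentText = documentText;
    }

    String getDocumentText() {
        return documentText;
    }

    // An annotator records a result by adding a span under a type name.
    void add(String type, int begin, int end) {
        index.computeIfAbsent(type, key -> new ArrayList<>()).add(new int[] {begin, end});
    }

    // A later annotator (or the application) reads back all results of one type.
    List<String> coveredTexts(String type) {
        List<String> texts = new ArrayList<>();
        for (int[] span : index.getOrDefault(type, Collections.<int[]>emptyList())) {
            texts.add(documentText.substring(span[0], span[1]));
        }
        return texts;
    }
}
```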
  • 28. Which algorithms lie under AEs?
  • 29. Apache UIMA & NLP NLP (Natural Language Processing) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications It’s an AI discipline
  • 30. Apache UIMA & NLP “accomplish human-like language processing” Paraphrase an input text Translate the text into another language Answer questions about the contents of the text Draw inferences from the text
  • 31. Apache UIMA & NLP “an NLP-based IR system has the goal of providing more precise, complete information in response to a user’s real information need” various levels of processing
  • 32. Apache UIMA - Approaches Simplest : Write RegEx and Dictionaries and mix them together NLP-like : Tokenize -> Sentence identification -> PoS Tagging -> Anaphora resolution -> Named Entities Recognition -> Coreference Identification ...
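The "simplest" approach, mixing regular expressions and dictionaries, might look like the following sketch. The year pattern, the method names, and the dictionary contents are assumptions chosen for the movie scenario; they are not taken from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleExtractors {
    // Assumed pattern: four-digit years in the 1900s or 2000s.
    private static final Pattern YEAR = Pattern.compile("\\b(19|20)\\d{2}\\b");

    // RegEx-based extraction: every substring matching the pattern.
    static List<String> findYears(String text) {
        List<String> years = new ArrayList<>();
        Matcher matcher = YEAR.matcher(text);
        while (matcher.find()) {
            years.add(matcher.group());
        }
        return years;
    }

    // Dictionary-based extraction: every token whose lowercase form is in the dictionary.
    static List<String> findDictionaryTerms(String text, Set<String> dictionary) {
        List<String> hits = new ArrayList<>();
        for (String token : text.split("\\W+")) {
            if (dictionary.contains(token.toLowerCase(Locale.ROOT))) {
                hits.add(token);
            }
        }
        return hits;
    }
}
```

Both methods are cheap to write and fast to run, which is why they are a common first step before a full NLP pipeline.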
  • 33. Analysis Engines in a Pipeline
  • 34. NLP - Language Identification NLP takes advantage of language-specific syntax, forms, rules and meanings Not easy to write language-independent extraction algorithms Often this is the first block of NLP pipelines Techniques: Stopword dictionaries, statistical models, etc.
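A stopword-dictionary language guesser can be sketched as follows: count how many tokens appear in each language's stopword list and pick the language with the highest count. The tiny English and Italian lists are illustrative assumptions; a usable system would need much larger lists or a statistical model.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class StopwordLanguageGuesser {
    private final Map<String, Set<String>> stopwords = new LinkedHashMap<>();

    StopwordLanguageGuesser() {
        // Deliberately tiny, hand-picked lists (an assumption for this example).
        stopwords.put("en", new HashSet<>(Arrays.asList("the", "and", "of", "is", "in", "to")));
        stopwords.put("it", new HashSet<>(Arrays.asList("il", "la", "e", "di", "che", "un")));
    }

    // Returns the language code with the most stopword hits, or "unknown" if there are none.
    String guess(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> entry : stopwords.entrySet()) {
            int score = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```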
  • 35. NLP - Tokens and Sentences Humans learn words’ meanings in order to understand whole-context semantics Split the target text into words to be able to analyze their meaning and role Discover sentences to later assign roles to each token Easiest for English, Italian & co. but what about Chinese?
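For whitespace-delimited languages, the JDK's `java.text.BreakIterator` already gives a reasonable baseline for both steps. The sketch below is one possible way to use it; the `TextSegmenter` class and its method names are assumptions, and real pipelines often rely on trained tokenizers and sentence detectors instead.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TextSegmenter {
    // Splits text into sentences using locale-aware boundary rules.
    static List<String> sentences(String text, Locale locale) {
        return split(BreakIterator.getSentenceInstance(locale), text);
    }

    // Splits text into word tokens, dropping pieces that are only punctuation.
    static List<String> tokens(String text, Locale locale) {
        List<String> words = new ArrayList<>();
        for (String piece : split(BreakIterator.getWordInstance(locale), text)) {
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                words.add(piece);
            }
        }
        return words;
    }

    // Walks the boundaries reported by the iterator and collects the non-blank pieces.
    private static List<String> split(BreakIterator iterator, String text) {
        List<String> pieces = new ArrayList<>();
        iterator.setText(text);
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                pieces.add(piece);
            }
        }
        return pieces;
    }
}
```

Languages written without spaces between words, such as Chinese, are where this simple boundary-based splitting stops being adequate and dictionary or statistical segmentation is needed.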
  • 36. NLP - PoS Tagging Assign a “Part of Speech” (noun, adjective, verb, etc.) to each token generated in the previous step Many language/domain-specific patterns can be discovered and exploited just with PoS-tagged tokens and sentences
  • 37. NLP - Chunking & Parsing Parse sentences into a meaningful set or tree of relationships Chunks are the sentence building blocks (i.e. verbal forms) Parse tree highlights the structure of a sentence Can leverage logic analysis chunking parsing
  • 38. NLP - Named Entities Recognition Answer the questions: where? when? who? how often? how much? Identify key entities in the text Common techniques: dictionaries, rules, statistical models
  • 40. Using UIMA Define TypeSystem Define AnalysisEngine descriptor(s) Implement Annotator(s) Execute the UIMA pipeline
  • 41. Sample scenario - extract actors Tokenize article text Identify sentences Tag PoS Identify Persons using regular expressions and PoS Use Person annotations, Tokens’ PoS and Sentences to extract relations between terms to identify Persons who are also Actors
  • 42. Sample scenario - extract persons I have a dictionary of names (simple to find and/or build) I use a dictionary-based Annotator to extract annotations of first names (NameAnnotation) I don’t have a dictionary of surnames Every time a matching name (a NameAnnotation) is found we look for one or more (considering persons with a double name or surname) subsequent tokens whose PoS is “undefined” or a noun (but not a verb) and which start with an uppercase letter If found then the name + token(s) sequence annotates a Person (e.g. “Michael J. Fox”)
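The first-name lookup followed by a scan over subsequent capitalized tokens can be sketched as below. This is a simplification under stated assumptions: no PoS tagger is available here, so the "undefined or noun" check from the slide is replaced by the uppercase test alone, tokens are split on whitespace, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PersonSketch {
    // Finds "first name + following capitalized tokens" sequences, e.g. "Michael J. Fox".
    static List<String> findPersons(String text, Set<String> firstNames) {
        String[] tokens = text.split("\\s+");
        List<String> persons = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (!firstNames.contains(clean(tokens[i]))) {
                i++;
                continue;
            }
            StringBuilder name = new StringBuilder(clean(tokens[i]));
            boolean open = !endsClause(tokens[i]); // punctuation right after the name stops the scan
            int j = i + 1;
            while (open && j < tokens.length && startsUppercase(tokens[j])) {
                name.append(' ').append(clean(tokens[j]));
                open = !endsClause(tokens[j]);
                j++;
            }
            if (j > i + 1) { // at least one following token was attached
                persons.add(name.toString());
            }
            i = j;
        }
        return persons;
    }

    private static boolean startsUppercase(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    // A middle initial such as "J." keeps its period.
    private static boolean isInitial(String token) {
        return token.length() == 2 && Character.isUpperCase(token.charAt(0)) && token.charAt(1) == '.';
    }

    private static String clean(String token) {
        return isInitial(token) ? token : token.replaceAll("\\p{Punct}+$", "");
    }

    private static boolean endsClause(String token) {
        return !isInitial(token) && token.matches(".*\\p{Punct}$");
    }
}
```

Dropping the PoS check makes the sketch over-eager: any capitalized word after a first name is attached, including a capitalized verb at the start of a new clause.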
  • 43. from Persons to Actors Getting actors can be simple if we know that Persons who are also actors perform some well-known actions or that there exist widely used patterns e.g. a Person who is also an Actor “stars as” CharacterInTheMovie (which will possibly be tagged as Person too) e.g. if the snippet “CharacterInTheMovie (Person)” exists, then Person is usually an Actor then we could build an ActorAnnotator
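The two patterns can be checked with plain string matching once the Persons are known. The sketch below assumes the list of Person strings has already been extracted; `ActorSketch` and `findActors` are hypothetical names, and a real ActorAnnotator would work on annotations and offsets rather than on raw strings.

```java
import java.util.ArrayList;
import java.util.List;

class ActorSketch {
    // Keeps the persons that match one of the two assumed actor patterns.
    static List<String> findActors(String text, List<String> persons) {
        List<String> actors = new ArrayList<>();
        for (String person : persons) {
            boolean starsAs = text.contains(person + " stars as ");   // "Person stars as Character"
            boolean inParentheses = text.contains("(" + person + ")"); // "Character (Person)"
            if (starsAs || inParentheses) {
                actors.add(person);
            }
        }
        return actors;
    }
}
```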
  • 44. 1. Define TypeSystem Define at least one Type inside the Type System for each object inside the domain model Useful to define more fine-grained Types (for values of type properties, called Features) If we want to extract information about articles we create an Article type inside the Type System Also we’ll need to create annotations/entities for movies, actors, directors, etc...
  • 45. 2. Define AnalysisEngine descriptor Define which type system it’s going to use Define which capabilities the analysis engine has: which annotations it needs in order to work and which annotations it’ll (possibly) generate Define configuration parameters for the underlying algorithm Define resources needed by the analysis engine
  • 46. 3. Implement Annotator create a new class extending JCasAnnotator_ImplBase implement the process() method that actually does the job the algorithm implementation is (called) in the process() method you can use configuration parameters/resources defined in the descriptor optionally override initialize() and destroy() methods
  • 48. 4. Execute the UIMA pipeline Instantiate the AnalysisEngine with its descriptor as a parameter Create a CAS which will contain the text to be analyzed and the annotations extracted Run the AnalysisEngine on the given CAS Browse results
  • 49. Execute a UIMA pipeline
  • 50. What’s next UIMA Use cases Using UIMA in search engines Hands on code (assignment)