Invited talk at Processing ROmanian in Multilingual, Interoperational and Scalable Environments (PROMISE 2010) on how to port the QALL-ME framework to a new language
1. Porting the QALL-ME framework to Romanian
Constantin Orăsan
Research Group in Computational Linguistics
Research Institute in Information and Language Processing
University of Wolverhampton
http://www.wlv.ac.uk/~in6093/
29th March 2010
3. Structure of the presentation
1 Introduction
2 The QALL-ME project
3 Multilingual information access in QALL-ME
4 Conclusions
4. Need to access information
• as the Internet grows, more and more information becomes
available
• this information is in many languages
• fields from computational linguistics such as automatic
summarisation, question answering, text mining, etc. can help
people deal with information
6. Question answering (QA)
• Question answering aims at identifying the answer to a
question in a large collection of documents
• the information provided by QA is more focused than
information retrieval
• the output can be the exact answer or a text snippet which
contains the answer
• the field took off after the introduction of the QA track at
TREC, and cross-lingual QA took off after CLEF
9. Types of QA systems
• open-domain QA systems: can answer any question from any
collection
+ can potentially answer any question
- very low accuracy (especially in cross-lingual settings)
• canned QA systems: rely on a very large repository of
questions for which the answer is known
+ very little processing necessary
- limited to the answers in the database
• closed-domain QA systems: are built for very specific domains
and exploit expert knowledge in them
+ very high accuracy
- can require extensive language processing and is limited to
one domain
11. Purpose of the presentation
• briefly present the QALL-ME project
• show how it was adapted to answer questions in Romanian
about movies
13. The QALL-ME project
• QALL-ME = Question Answering Learning technologies in a
multiLingual and Multimodal Environment
• EU-funded project, part of FP6
• 7 partners:
• FBK-irst, Italy
• University of Wolverhampton, UK
• University of Alicante, Spain
• DFKI, Germany
• Comdata, Italy
• UbiEST, Italy
• WayCom, Italy
• Web page: http://qallme.fbk.eu
14. The QALL-ME project
• aimed at establishing a shared infrastructure for multilingual
and multimodal QA in the domain of tourism
• In the QALL-ME system
• users ask natural language questions in several languages (as
text or speech) using a variety of input devices
(e.g. mobile phones), and
• the system returns a list of specific answers formatted in the
most appropriate modality: short texts, maps, videos, or
pictures.
15. [Figure: QALL-ME architecture. Components shown: local information sources, semantic representation, service provider, the QALL-ME central QA planner, answer extractors for English, German, Spanish and Italian, question type ontology, answer type ontology, speech recognizers, dialog models]
16. Main outputs of the project
• an ontology for the domain of tourism
• entailment-based QA framework
• the QALL-ME benchmark
• an entailment framework
(all accessible from the project’s web page:
http://qallme.fbk.eu)
17. The ontology
• A domain-specific ontology for the tourism domain was
developed and shared among all the partners.
• The ontology was used to serve as:
• bridge between different languages
• communication language between different components of the
system
• The ontology was linked to domain-independent ontologies
such as MultiWordNet and SUMO
• For more information see (Ou et al., 2008)
18. Design of the ontology
• Analysis of data from content providers
• Analysis of user requirements
• Inspired by similar ontologies:
• Harmonise and eTourism: focus on static information (e.g.
accommodation and events/activities)
• Like eTourism, it is written in OWL rather than RDFS
• but wider coverage
• Introspection
19. The ontology
• Main classes: Country, Destination, Site (i.e.
Accommodation, Attraction, Gastro, and Infrastructure),
Transportation, EventContent and Event
• Element classes: Facility, Room, PersonOrganization,
Language, and Currency
• Attribute classes: Contact, Location, Period and Price.
• Element and attribute classes cannot exist independently and
have to be attached to other main or element classes
20. [Figure: fragment of the ontology for the cinema and movie domain. Classes shown: Site, Cinema, CinemaRoom, SiteFacility, RoomFacility, Event, MovieShow, EventContent, Movie, Director, Producer, Star, Writer, Price, TicketPrice, Currency, Contact, PostalAddress, GPSCoordinate, DirectionLocation, Period, TimePeriod, DatePeriod, DateTimePeriod. Properties shown: subClassOf, hasGPSCoordinate, hasPostalAddress, hasCurrency, isInSite, hasPrice, hasContact, hasSiteFacility, hasRoom, hasRoomFacility, hasPeriod, hasEventContent, hasTimePeriod, hasDatePeriod, hasDirector, hasProducer, hasStar, hasWriter, priceType, priceValue, name, description, startTime, endTime, startDate, endDate, certificate, synopsis, genre]
21. The ontology
• Encoded using OWL DL, since it has more expressive power
than OWL Lite and has more efficient reasoning support than
OWL Full
• Used Protégé-OWL as the editor and RacerPro7 as the
reasoner
• The ontology contains
• 122 classes (concepts),
• 55 datatype properties and
• 52 object properties which indicate the relationships among
the 122 classes.
• 15 top-level classes.
• The class hierarchy has a maximum depth of 4.
22. The QALL-ME framework
• is an architecture skeleton for multilingual QA systems for
closed domains
• designed in such a way that it allows fast development of
closed domain QA systems
• freely available from http://qallme.sourceforge.net/
• is based on a Service Oriented Architecture (SOA) which is
realised using web services
• relies on textual entailment recognisers
23. Web services
1 Context providers: are used to anchor questions in space
and time
2 Annotators: Currently three types of annotators are
available:
• named entity annotators which identify names of cinemas,
movies, persons, etc.
• term annotators which identify hotel facilities, movie genres
and other domain-specific terminology
• temporal annotators that are used to recognise and normalise
temporal expressions in user questions
3 Entailment engine: determines whether a user question
entails a retrieval procedure
4 Query generator: relies on an entailment engine to
generate a query that extracts the answer.
5 Answer pool: retrieves the answers from a database.
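The way these five services hand a question to one another can be pictured with a minimal sketch. It is only an illustration: the real framework exposes the services as web services, and every name below (QuestionState, run, the stub functions and their return values) is hypothetical rather than part of the QALL-ME API.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List


@dataclass
class QuestionState:
    """Data handed from one service to the next (hypothetical structure)."""
    text: str
    context: Dict[str, object] = field(default_factory=dict)
    annotations: Dict[str, str] = field(default_factory=dict)
    query: str = ""
    answers: List[str] = field(default_factory=list)


def context_provider(state: QuestionState) -> QuestionState:
    # Static context; a real provider could read it from a mobile phone.
    state.context = {"place": "Wolverhampton",
                     "time": datetime(2010, 3, 29, 10, 0)}
    return state


def annotator(state: QuestionState) -> QuestionState:
    # Toy stand-in for the named entity, term and temporal annotators.
    if "tonight" in state.text.lower():
        state.annotations["TIMEX"] = "tonight"
    return state


def query_generator(state: QuestionState) -> QuestionState:
    # Stand-in for the entailment engine plus query generator.
    state.query = "SPARQL query for: " + state.text
    return state


def answer_pool(state: QuestionState) -> QuestionState:
    # Stand-in for running the query against the answer database.
    state.answers = ["<answers for " + state.query + ">"]
    return state


PIPELINE: List[Callable[[QuestionState], QuestionState]] = [
    context_provider, annotator, query_generator, answer_pool,
]


def run(text: str) -> QuestionState:
    """Pass the question through every service in order."""
    state = QuestionState(text=text)
    for service in PIPELINE:
        state = service(state)
    return state
```

Because each stage only reads and extends the shared state, a language-specific stage (for example an annotator) can be swapped without touching the others.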
24. Context providers
• are used to anchor a question in space and time
• return the current position and time
• used by the presentation module when maps are displayed
• used by the temporal annotator to normalise temporal expressions
• determine which services are used in a cross-lingual scenario
• can be static or determined from a mobile phone
25. Named entity and term annotators
• named entity recogniser = identifies names of hotels, movies,
persons, etc.
• term annotator = identifies domain specific terms such as
hotel facilities, movie genres, etc.
• the entities and terms are known, so the task is reduced to a
database look up
• Gazetteers are the main source for determining the entities
• The annotation module needs to determine the canonical form
of an entity
• it uses a greedy algorithm that combines character-based
similarity with a modified TF*IDF
• it does not allow overlapping annotations, and there are few
ambiguities
26. Named entity and term annotators
• Annotates both standard and non-standard entities: cinema,
movie, location, genre, certificate
• Needs to deal with noisy input:
• misspelt words/input from ASR engines/SMS input, e.g.
"becaming Jane" or "becoming Jade" for "Becoming Jane"
• free word order (Will Smith / Smith, Will)
• equivalent strings (saw III / three / 3; Smith, Will / Smith,
W.)
• Needs to deal with questions in mixed languages
• Needs to deal with ambiguous entities
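A rough sketch of gazetteer lookup that tolerates misspellings and free word order is given below. It is not the project's algorithm: the slides mention character-based similarity combined with a modified TF*IDF, whereas this sketch substitutes the standard library's difflib.SequenceMatcher ratio. The gazetteer entries, the 0.8 threshold and the function names are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Illustrative gazetteer; the real one is built from the content providers' data.
GAZETTEER = {
    "MOVIE": ["Becoming Jane", "Saw III"],
    "PERSON": ["Will Smith"],
}


def normalise(text):
    """Lowercase, drop punctuation and sort tokens so word order is ignored."""
    return " ".join(sorted(re.findall(r"\w+", text.lower())))


def similarity(a, b):
    """Character-based similarity in [0, 1] (stand-in for the project's measure)."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def canonical_form(mention, threshold=0.8):
    """Return (entity type, canonical name) for the closest entry, or None."""
    best, best_score = None, 0.0
    for entity_type, names in GAZETTEER.items():
        for name in names:
            score = similarity(mention, name)
            if score > best_score:
                best, best_score = (entity_type, name), score
    return best if best_score >= threshold else None
```

With these assumptions, canonical_form("becaming Jane") and canonical_form("Smith, Will") both resolve to their gazetteer entries. Equivalent strings such as "III" and "3" are not handled here and would need an extra mapping.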
27. Temporal annotator
• questions from the domain of tourism contain a large number
of temporal expressions
• we use a simplified version of the tagger implemented by
Pușcașu (2004)
• the simplification was done to reduce the processing time
(Varga, Pușcașu, and Orăsan, 2009)
• identifies both self-contained temporal expressions (TEs) and
indexical/under-specified TEs
• uses the TIMEX2 standard
• the output is used by the TIMEX2SPARQL service to restrict
the extracted answers
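How an indexical expression such as "tonight" can be normalised against the date supplied by a context provider is sketched below. This is a toy illustration, not the Pușcașu tagger: the three-word lexicon and the function name annotate_timex are assumptions, and only the VAL attribute of TIMEX2 is produced.

```python
import re
from datetime import date, timedelta

# Hypothetical mini-lexicon: expression -> (day offset, TIMEX2 time-of-day suffix).
INDEXICAL = {
    "today": (0, ""),
    "tonight": (0, "TNI"),
    "tomorrow": (1, ""),
}

PATTERN = re.compile(r"\b(" + "|".join(INDEXICAL) + r")\b", re.IGNORECASE)


def annotate_timex(question, today):
    """Wrap known indexical expressions in TIMEX2 tags with a normalised VAL."""
    def tag(match):
        offset, suffix = INDEXICAL[match.group(0).lower()]
        value = (today + timedelta(days=offset)).isoformat() + suffix
        return '<TIMEX2 VAL="%s">%s</TIMEX2>' % (value, match.group(0))
    return PATTERN.sub(tag, question)
```

The reference date is passed in explicitly, which mirrors the role of the context provider: the same question yields a different VAL on a different day.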
28. Entailment engine
• often closed-domain QA systems transform a question to a
Prolog fact or SQL query
• often this solution works only partially due to language
variability
• in QALL-ME this problem is solved using textual entailment
• the entailment engine determines whether two questions entail
the same meaning so they share the same retrieval procedure:
• T is the input question
• H is a textual pattern stored in a repository
• each textual pattern has an associated SPARQL retrieval procedure
• we calculate the similarity between the two sentences to
determine whether an entailment relation holds between them
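A minimal sketch of similarity-based pattern selection follows. The slides do not specify the similarity measure, so Jaccard overlap of word tokens is used here as a stand-in. The threshold and the rule that every slot in H must also occur in T are assumptions of this sketch, not documented behaviour of the QALL-ME entailment engine.

```python
import re

# Patterns (H) as used in the talk's running example.
PATTERNS = [
    "Who is the director of [MOVIE]?",
    "Where can I see [MOVIE] [TIMEX]?",
    "What movies are on in [DESTINATION] [TIMEX]?",
    "What is the address of [CINEMA]?",
]


def tokens(text):
    """Lowercased word tokens; slots such as [TIMEX] are kept whole."""
    return set(re.findall(r"\[\w+\]|\w+", text.lower()))


def slots(text):
    return {t for t in tokens(text) if t.startswith("[")}


def similarity(t, h):
    """Jaccard overlap between the token sets of T and H."""
    a, b = tokens(t), tokens(h)
    return len(a & b) / len(a | b)


def best_pattern(question, patterns, threshold=0.3):
    """Return the most similar pattern whose slots can all be filled from T."""
    candidates = [h for h in patterns if slots(h) <= slots(question)]
    scored = [(similarity(question, h), h) for h in candidates]
    scored = [pair for pair in scored if pair[0] >= threshold]
    return max(scored)[1] if scored else None
```

For the annotated question "What movie can I see [TIMEX] in [DESTINATION]?" this selects "What movies are on in [DESTINATION] [TIMEX]?". The slot constraint matters here: on token overlap alone, "Where can I see [MOVIE] [TIMEX]?" would score higher, although the question contains no movie to fill it.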
29. Query generation service
• produces a SPARQL query that can be used to answer the
question
• has a list of question templates with their associated SPARQL
queries
• relies on the entailment engine to determine which of the
question patterns entail the same meaning as the user
question
• fills in the slots of the question patterns
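The slot-filling step can be sketched as plain string substitution over a stored template. The namespace http://example.org/qallme# and the exact shape of the SPARQL query are hypothetical; only the property names (name, hasPostalAddress) are taken from the ontology diagram, and the real retrieval procedures may look quite different.

```python
# One question pattern with its associated SPARQL template (illustrative only).
TEMPLATES = {
    "What is the address of [CINEMA]?": (
        "PREFIX q: <http://example.org/qallme#>\n"
        "SELECT ?address WHERE {\n"
        '  ?cinema q:name "[CINEMA]" ;\n'
        "          q:hasPostalAddress ?address .\n"
        "}"
    ),
}


def generate_query(pattern, slot_values):
    """Fill the slots of the template attached to the selected pattern."""
    query = TEMPLATES[pattern]
    for slot, value in slot_values.items():
        # Escape backslashes and quotes so the value stays inside the literal.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        query = query.replace("[" + slot + "]", escaped)
    return query
```

A call such as generate_query("What is the address of [CINEMA]?", {"CINEMA": "Example Cinema"}) returns the template with the cinema name inserted as a quoted literal.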
31. Example
User question (T): What movie can I see tonight in
Wolverhampton? → What movie can I see [TIMEX] in
[DESTINATION]?
List of patterns (H):
• Who is the director of [MOVIE]?
• Where can I see [MOVIE] [TIMEX]?
• What movies are on in [DESTINATION] [TIMEX]?
• What is the address of [CINEMA]?
• ...
Select the retrieval pattern associated with the question
What movies are on in Wolverhampton tonight?
32. Answer Pool service
• takes the SPARQL query generated by the query generator
and extracts the answer
• SPARQL is a query language for accessing RDF graphs, developed
by the W3C RDF Data Access Working Group
• SPARQL provides interoperability between languages
34. Cross-lingual QA
• the QALL-ME tourism prototype is designed to allow both
monolingual and cross-lingual QA
• relevant web services are activated depending on the source
and target language
• user scenario: a Romanian tourist in the UK who wants to find
out more about the movies showing in Wolverhampton
36. Prototype for Romanian
• we wanted to find out how long it takes to develop a demo for
Romanian
• components had to be adapted:
• named entity and term annotators had to be trained on a
different list of entities
• a simple temporal annotator was implemented on the basis of
the English one
• the language independent similarity entailment engine was used
• the question patterns were translated to Romanian
• the answer pool did not require any change
• the whole process took under one week
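Why the answer pool needs no change can be illustrated with a small sketch in which question patterns are language specific but point to shared, language-independent retrieval procedures. The procedure identifiers, the placeholder query strings and the Romanian patterns (translated for this sketch, not taken from the project) are all assumptions.

```python
# Language-independent retrieval procedures (query bodies left as placeholders).
PROCEDURES = {
    "movie_director": "SELECT ?director WHERE { ... }",
    "cinema_address": "SELECT ?address WHERE { ... }",
}

# Per-language question patterns mapped to the shared procedure identifiers.
PATTERNS = {
    "en": {
        "Who is the director of [MOVIE]?": "movie_director",
        "What is the address of [CINEMA]?": "cinema_address",
    },
    "ro": {
        "Cine este regizorul filmului [MOVIE]?": "movie_director",
        "Care este adresa cinematografului [CINEMA]?": "cinema_address",
    },
}


def retrieval_procedure(language, pattern):
    """Look up the shared retrieval procedure for a language-specific pattern."""
    return PROCEDURES[PATTERNS[language][pattern]]
```

Under this layout, adding a language means adding one entry to PATTERNS plus language-specific annotators; PROCEDURES and the database stay untouched.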
39. Conclusions
• multilinguality is a very important issue for the QALL-ME
project
• the ontology constitutes the bridge between languages
• the QALL-ME framework can be used to quickly develop
prototypes for other languages
42. References
Ou, Shiyan, Viktor Pekar, Constantin Orăsan, Christian Spurk, and Matteo Negri. 2008. Development and alignment of a domain-specific ontology for question answering. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May 28-30. European Language Resources Association (ELRA).
Pușcașu, Georgiana. 2004. A framework for temporal resolution. In Proceedings of the 4th Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, May 26-28.
Varga, Andrea, Georgiana Pușcașu, and Constantin Orăsan. 2009. Identification of temporal expressions in the domain of tourism. In Knowledge Engineering: Principles and Techniques, volume 1, pages 29-32, Cluj-Napoca, Romania, July 2-4.