Language Resources, Language
Technology, Text Mining, the Semantic
Web: How interoperability of machines
can help humans in the multilingual web
Felix Sasaki
DFKI / University of Appl. Sciences Potsdam
W3C German-Austrian Office
felix.sasaki@dfki.de
W3C Workshop “The Multilingual Web - Where Are We?” 26-27 October 2010, Madrid
Purpose of this talk (1)
• Show gaps
– Between machines
– Between machines and humans
• … which we need to fill to bridge gaps
between humans
Purpose of this talk (2)
• Identify groups / communities
– To fill gaps
– To come together in new alliances
Basics:
What are machines doing
(not only on the Web)?
Language Technology
• Summarization
LT → “These texts are about ...”
Language Technology
• Machine Translation
LT: “このワークショップは…で開催される” ↔ “The workshop takes place in …”
Language Technology
• Spell and grammar checking
LT: “The worksop take place in …” → “The workshop takes place in …”
• And many more applications
• Coreference resolution, discourse analysis,
named entity recognition, natural language
generation, question answering, …
Text mining
• Finding out things you did not know
Text mining:
– “Text A and text B are similar”
– “The text collection has clusters of topics: …”
– Visualization of results
Basics:
What are machines doing
(not only on the Web)?
How are they doing it?
They are using resources
Resources in language technology
• Sample resources for summarization
LT → “These texts are about ...”
Resources: NLG output, text mining output, stop word list, …
Language Technology
• Sample resources in Machine Translation
LT: “このワークショップは…で開催される” ↔ “The workshop takes place in …”
Diagram labels: Lexicon, Grammar, (Training) corpora, …, Generation
Language Technology
• Sample resources for spell and grammar
checking
LT: “The worksop take place in …” → “The workshop takes place in …”
Resources: Lexicon, Grammar, …
Text mining
• Sample resources for text mining
Text mining:
– “Text A and text B are similar”
– “The text collection has clusters of topics: …”
Resources: Lexicon, Stop word list, …
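To make the role of a stop word list concrete, here is a minimal sketch of a “text A and text B are similar” check. The stop word list, the sample sentences and the choice of Jaccard overlap are illustrative assumptions, not something the talk prescribes; real text mining tools use richer resources and measures.

```python
import re

# Hypothetical, tiny stop word list; real systems load a per-language resource.
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "is", "are", "takes"}


def content_words(text: str) -> set[str]:
    """Lowercase the text, split it into words, and drop stop words."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS}


def jaccard(text_a: str, text_b: str) -> float:
    """Share of content words the two texts have in common (0.0 to 1.0)."""
    a, b = content_words(text_a), content_words(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


text_a = "The workshop takes place in Madrid"
text_b = "A workshop in Madrid"
text_c = "Online banking and brokerage"

similarity_ab = jaccard(text_a, text_b)  # texts share "workshop" and "madrid"
similarity_ac = jaccard(text_a, text_c)  # no content words in common
```

Changing the stop word list changes the result, which is why a tool that does not say which resources it uses is hard to combine with others.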
In general: you need three types of data: input, resources, workflow
Input → Workflow (using Resources, Resources, …) → Output
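The three types of data can be pictured as one record that travels through the processing steps. The sketch below is only an illustration under assumed names (`Job`, `remove_stop_words`, `lowercase` are hypothetical): each step reads the resources it needs and appends its own name to the workflow log, so later steps can see what happened to the input.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """One processing job: input text, named resources, and a workflow log."""
    text: str
    resources: dict[str, set[str]]
    workflow: list[str] = field(default_factory=list)


def remove_stop_words(job: Job) -> Job:
    """Drop every word found in the 'stop_words' resource and log the step."""
    stop_words = job.resources["stop_words"]
    kept = [w for w in job.text.split() if w.lower() not in stop_words]
    return Job(" ".join(kept), job.resources, job.workflow + ["remove_stop_words"])


def lowercase(job: Job) -> Job:
    """Lowercase the text and log the step."""
    return Job(job.text.lower(), job.resources, job.workflow + ["lowercase"])


job = Job("The workshop takes place in Madrid", {"stop_words": {"the", "in"}})
for step in (remove_stop_words, lowercase):
    job = step(job)
```

After the loop, `job.text` holds the output and `job.workflow` lists the steps in the order they ran.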
What gaps need to be filled for truly
“multilingual content processing”?
• Gap 1: machines don’t use metadata available
in the input
• Gap 2: machines don’t know about the
workflow (input) data goes through
• Gap 3: machines don’t make explicit
– “Who” they are
– What resources they are using
Gap 1: machines don’t use metadata
available in the input
• Input from www.postbank.de
„Ob Postbank direkt, Online-Banking,
Online-Brokerage oder myBHW. Die
häufigsten Fragen zu unseren
Transaktionssystemen finden Sie an
dieser Stelle.“
• Output via Google Translate
“Whether Postbank direct, online
banking, online brokerage or myBHW.
Frequently asked questions about our
transaction systems can be found at
this location.”
Fixed terminology should not have been translated. But the MT tool had no chance to “know” that. Why?
Gap 2: machines don’t know about the processes data goes through
• Input from the database – the “hidden web”:
„Ob <term>Postbank direkt</term>,
<term>Online-Banking</term>,
<term>Online-Brokerage</term> …“
• Output on the Web:
„Ob <em>Postbank direkt</em>,
<em>Online-Banking</em>,
<em>Online-Brokerage</em> …“
Fixed terminology (= metadata) … is lost on the Web during the publication process.
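The loss can be reproduced in a few lines. This is a sketch, not the actual Postbank publication pipeline: `publish_lossy` and `publish_preserving` are hypothetical functions, and the regular expression assumes simple, non-nested `<term>` elements. The first function mirrors the slide (terms become plain emphasis); the second keeps the information by emitting the RDFa attribute `property="its:term"`.

```python
import re

# Source snippet as shown on the slide (the trailing "…" is part of the slide).
source = (
    "Ob <term>Postbank direkt</term>, <term>Online-Banking</term>, "
    "<term>Online-Brokerage</term> …"
)

TERM = re.compile(r"<term>(.*?)</term>")


def publish_lossy(text: str) -> str:
    """Turn terminology markup into presentational emphasis; the metadata is gone."""
    return TERM.sub(r"<em>\1</em>", text)


def publish_preserving(text: str) -> str:
    """Keep the terminology information as a machine-readable RDFa attribute."""
    return TERM.sub(r'<span property="its:term">\1</span>', text)


lossy = publish_lossy(source)
preserving = publish_preserving(source)
```

In `lossy`, a downstream tool can no longer tell a term from any other emphasized phrase; in `preserving`, it still can.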
Gap 3: no common identification …
• Of metadata and process chains (previous slides)
• Of resources – e.g. what is a lexicon
– In machine translation?
– In localization?
– For a human reader?
– Ability to combine tools depends on knowing
about them (capabilities, resources) in detail
Who can fill these gaps – people
dealing with multilingual content
• Content producers
– Allow for terminology identification in source formats
/ CMS
• Localizers
– Make localization workflows aware of (process /
source content) metadata
• “Machine” experts
– Make their tools sensitive to source content metadata and expose their capabilities (what resources / workflows) in a clearly defined way
Who can fill these gaps – people
dealing with multilingual content
• Users
– Add metadata to source content
– Use (machine translation) tools without knowing the
details – e.g. in the browser!
• Browser vendors
– Create APIs which make use of automatic tools /
resource and workflow descriptions / source code
metadata
• …
→ The people in this room!
How can they fill the gaps?
• All these groups need to agree upon one
machine readable information space for filling
the gaps
• It’s actually already here – the Semantic Web!
What is the Semantic Web
• The Web as humans see it: Identification of
“meaning” e.g. via (typographic or other)
conventions
„Ob Postbank direkt …“
What is the Semantic Web
• The Web as machines see it: Identification of
meaning via RDF-based mechanisms (here via
RDFa)
„Ob <span property="its:term">Postbank direkt</span> …“
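Once terms are marked up like this, a tool can pick them out before translation. The following sketch uses only Python's standard `html.parser`; the `page` string is adapted from the earlier slides, and the placeholder scheme (`__TERM0__` and so on) is a hypothetical convention. Whether a given MT system leaves such placeholders untouched is an assumption that would need checking.

```python
from html.parser import HTMLParser


class TermCollector(HTMLParser):
    """Collect the text of every <span property="its:term"> element."""

    def __init__(self):
        super().__init__()
        self.terms = []
        self._in_term = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and dict(attrs).get("property") == "its:term":
            self._in_term = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_term:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes term spans are not nested.
        if tag == "span" and self._in_term:
            self.terms.append("".join(self._buffer))
            self._in_term = False


def protect_terms(text, terms):
    """Swap each term for a placeholder so it can be restored after translation."""
    placeholders = {}
    for index, term in enumerate(terms):
        key = f"__TERM{index}__"
        placeholders[key] = term
        text = text.replace(term, key)
    return text, placeholders


page = (
    'Ob <span property="its:term">Postbank direkt</span>, '
    '<span property="its:term">Online-Banking</span>, '
    '<span property="its:term">Online-Brokerage</span> oder myBHW.'
)
collector = TermCollector()
collector.feed(page)
collector.close()

plain = "Ob Postbank direkt, Online-Banking, Online-Brokerage oder myBHW."
protected, placeholders = protect_terms(plain, collector.terms)
```

After translation, the placeholders in `placeholders` would be substituted back, so “Postbank direkt” stays “Postbank direkt”.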
What is the Semantic Web –
RDF in 30 seconds
• A framework for making statements about
resources, using URIs
• RDF can help to fill our gaps
1. Metadata in the input
2. Metadata for workflows
3. Identify 1., 2. and language technology resources
uniquely
• In one information space – the machine
readable Web
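As a rough illustration of “statements about resources, using URIs”, the three gaps can each be written as a subject, predicate, object triple. Everything under `http://example.org/` below is a made-up namespace, and plain Python tuples stand in for a real RDF library; this is a sketch of the idea, not a proposed vocabulary.

```python
EX = "http://example.org/"  # hypothetical namespace for this illustration

triples = {
    # 1. Metadata in the input: a text fragment is marked as a term.
    (EX + "doc1#fragment1", EX + "vocab/isTerm", "true"),
    # 2. Metadata for workflows: the document was processed by a given tool.
    (EX + "doc1", EX + "vocab/processedBy", EX + "tools/mt-1"),
    # 3. Unique identification of resources: the tool states which lexicon it uses.
    (EX + "tools/mt-1", EX + "vocab/usesResource", EX + "lexicons/banking-de-en"),
}


def objects(subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


# Follow the links: which lexicon was involved in processing doc1?
tools = objects(EX + "doc1", EX + "vocab/processedBy")
lexicons = set()
for tool in tools:
    lexicons |= objects(tool, EX + "vocab/usesResource")
```

Because every item is named by a URI, statements from content producers, localizers and tool builders can be merged into one set and queried together.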
Instead of a summary – call for (participation in) project proposals
• Who needs to come together
– Content producers, localizers, “machine” experts, browser
vendors, users
• What should their work be based upon
– Semantic Web technologies
– Clear interfaces to the human (e.g. browser) Web, like RDFa
• What we do not need
– Web-centred standardization of formats for language resources
themselves – that is already done elsewhere (see this session)
• Where is the place to do that work?
– W3C, since it needs to be part of core Web technologies
• For making it happen, we need a strong alliance of Web
technologies, other fields and machine technologies
META-NET
• EU-funded project, closely related to
“Multilingual Web”
• Main aim: build an alliance for improving
language technologies in Europe
• Laaarge: soon 40+ participating organizations
in 30+ countries
• Very important: bring users of language
technology in
META-NET
• Users and language technology companies: in Europe these are not only large companies, but more and more SMEs
• META-NET targets these small and fast units – including you :-)
• EU has started special funding programs for
SMEs – see http://tinyurl.com/eu-lt-sme
(“objective 4.1”)
META-NET
• Event: META-NET Forum
• Brussels, November 17th/18th
• Aim: Bring users / language technology
developers / policy makers together
• Discuss a road map for the next 10 years of language technology and its applications
• Details and registration at
http://www.meta-net.eu/events

More Related Content

Viewers also liked

Sasaki practical-linked-data
Sasaki practical-linked-dataSasaki practical-linked-data
Sasaki practical-linked-dataFelix Sasaki
 
Sasaki webtechcon2010
Sasaki webtechcon2010Sasaki webtechcon2010
Sasaki webtechcon2010Felix Sasaki
 
Freme at feisgiltt 2015 freme & linked data & localisers
Freme at feisgiltt 2015   freme & linked data & localisersFreme at feisgiltt 2015   freme & linked data & localisers
Freme at feisgiltt 2015 freme & linked data & localisersFelix Sasaki
 
HTML5 - presentation at W3C-Tag 2009
HTML5 - presentation at W3C-Tag 2009HTML5 - presentation at W3C-Tag 2009
HTML5 - presentation at W3C-Tag 2009Felix Sasaki
 
Sasaki ins-netz-gegangen-20111117
Sasaki ins-netz-gegangen-20111117Sasaki ins-netz-gegangen-20111117
Sasaki ins-netz-gegangen-20111117Felix Sasaki
 
Sasaki datathon-madrid-2015
Sasaki datathon-madrid-2015Sasaki datathon-madrid-2015
Sasaki datathon-madrid-2015Felix Sasaki
 
Terminologie als Baustein der CMS-Einführung
Terminologie als Baustein der CMS-EinführungTerminologie als Baustein der CMS-Einführung
Terminologie als Baustein der CMS-EinführungHans Pich
 
1114 sasaki-metadata
1114 sasaki-metadata1114 sasaki-metadata
1114 sasaki-metadataFelix Sasaki
 
Sasaki Presentation at EVA 2016
Sasaki Presentation at EVA 2016Sasaki Presentation at EVA 2016
Sasaki Presentation at EVA 2016Felix Sasaki
 
"Warum Metadaten? Ein Plädoyer und mehr …" - webtechcon 2011 Präsentation
"Warum Metadaten? Ein Plädoyer und mehr …" - webtechcon 2011 Präsentation"Warum Metadaten? Ein Plädoyer und mehr …" - webtechcon 2011 Präsentation
"Warum Metadaten? Ein Plädoyer und mehr …" - webtechcon 2011 PräsentationFelix Sasaki
 
Tdahtdahok 111120161211-phpapp01 (1)
Tdahtdahok 111120161211-phpapp01 (1)Tdahtdahok 111120161211-phpapp01 (1)
Tdahtdahok 111120161211-phpapp01 (1)Cristina Orientacion
 
Prof Klaus: Terminology Management
Prof Klaus: Terminology ManagementProf Klaus: Terminology Management
Prof Klaus: Terminology Managementakashjd
 
tekom/tcworld 2013 – T2: Einheitliche Terminologie in Technischer Dokumentation
tekom/tcworld 2013 – T2: Einheitliche Terminologie in Technischer Dokumentationtekom/tcworld 2013 – T2: Einheitliche Terminologie in Technischer Dokumentation
tekom/tcworld 2013 – T2: Einheitliche Terminologie in Technischer DokumentationGeorg Eck
 
Its2 ontology-localization
Its2 ontology-localizationIts2 ontology-localization
Its2 ontology-localizationFelix Sasaki
 
Terminologie für Alle - Praktischer Nutzen und unternehmensweite Wertschöpfung
Terminologie für Alle - Praktischer Nutzen und unternehmensweite Wertschöpfung Terminologie für Alle - Praktischer Nutzen und unternehmensweite Wertschöpfung
Terminologie für Alle - Praktischer Nutzen und unternehmensweite Wertschöpfung SDL Language Technologies
 
Sasaki markupforum2011
Sasaki markupforum2011Sasaki markupforum2011
Sasaki markupforum2011Felix Sasaki
 

Viewers also liked (17)

Sasaki practical-linked-data
Sasaki practical-linked-dataSasaki practical-linked-data
Sasaki practical-linked-data
 
Sasaki webtechcon2010
Sasaki webtechcon2010Sasaki webtechcon2010
Sasaki webtechcon2010
 
Freme at feisgiltt 2015 freme & linked data & localisers
Freme at feisgiltt 2015   freme & linked data & localisersFreme at feisgiltt 2015   freme & linked data & localisers
Freme at feisgiltt 2015 freme & linked data & localisers
 
HTML5 - presentation at W3C-Tag 2009
HTML5 - presentation at W3C-Tag 2009HTML5 - presentation at W3C-Tag 2009
HTML5 - presentation at W3C-Tag 2009
 
Sasaki ins-netz-gegangen-20111117
Sasaki ins-netz-gegangen-20111117Sasaki ins-netz-gegangen-20111117
Sasaki ins-netz-gegangen-20111117
 
XML Seminar
XML SeminarXML Seminar
XML Seminar
 
Sasaki datathon-madrid-2015
Sasaki datathon-madrid-2015Sasaki datathon-madrid-2015
Sasaki datathon-madrid-2015
 
Terminologie als Baustein der CMS-Einführung
Terminologie als Baustein der CMS-EinführungTerminologie als Baustein der CMS-Einführung
Terminologie als Baustein der CMS-Einführung
 
1114 sasaki-metadata
1114 sasaki-metadata1114 sasaki-metadata
1114 sasaki-metadata
 
Sasaki Presentation at EVA 2016
Sasaki Presentation at EVA 2016Sasaki Presentation at EVA 2016
Sasaki Presentation at EVA 2016
 
"Warum Metadaten? Ein Plädoyer und mehr …" - webtechcon 2011 Präsentation
"Warum Metadaten? Ein Plädoyer und mehr …" - webtechcon 2011 Präsentation"Warum Metadaten? Ein Plädoyer und mehr …" - webtechcon 2011 Präsentation
"Warum Metadaten? Ein Plädoyer und mehr …" - webtechcon 2011 Präsentation
 
Tdahtdahok 111120161211-phpapp01 (1)
Tdahtdahok 111120161211-phpapp01 (1)Tdahtdahok 111120161211-phpapp01 (1)
Tdahtdahok 111120161211-phpapp01 (1)
 
Prof Klaus: Terminology Management
Prof Klaus: Terminology ManagementProf Klaus: Terminology Management
Prof Klaus: Terminology Management
 
tekom/tcworld 2013 – T2: Einheitliche Terminologie in Technischer Dokumentation
tekom/tcworld 2013 – T2: Einheitliche Terminologie in Technischer Dokumentationtekom/tcworld 2013 – T2: Einheitliche Terminologie in Technischer Dokumentation
tekom/tcworld 2013 – T2: Einheitliche Terminologie in Technischer Dokumentation
 
Its2 ontology-localization
Its2 ontology-localizationIts2 ontology-localization
Its2 ontology-localization
 
Terminologie für Alle - Praktischer Nutzen und unternehmensweite Wertschöpfung
Terminologie für Alle - Praktischer Nutzen und unternehmensweite Wertschöpfung Terminologie für Alle - Praktischer Nutzen und unternehmensweite Wertschöpfung
Terminologie für Alle - Praktischer Nutzen und unternehmensweite Wertschöpfung
 
Sasaki markupforum2011
Sasaki markupforum2011Sasaki markupforum2011
Sasaki markupforum2011
 

Similar to Bridging Gaps Between Humans With Interoperable Machines

From Provider to Portal - a chain of interoperability
From Provider to Portal - a chain of interoperabilityFrom Provider to Portal - a chain of interoperability
From Provider to Portal - a chain of interoperabilityAndy Powell
 
Realizing a Semantic Web Application - ICWE 2010 Tutorial
Realizing a Semantic Web Application - ICWE 2010 TutorialRealizing a Semantic Web Application - ICWE 2010 Tutorial
Realizing a Semantic Web Application - ICWE 2010 TutorialEmanuele Della Valle
 
Overview AG AKSW
Overview AG AKSWOverview AG AKSW
Overview AG AKSWSören Auer
 
The Europeana Strategy and Linked Open Data
The Europeana Strategy and Linked Open DataThe Europeana Strategy and Linked Open Data
The Europeana Strategy and Linked Open DataDavid Haskiya
 
Busy Architects Guide to Modern Web Architecture in 2014
Busy Architects Guide to  Modern Web Architecture in 2014Busy Architects Guide to  Modern Web Architecture in 2014
Busy Architects Guide to Modern Web Architecture in 2014Particular Software
 
Metaverse for Dataverse
Metaverse for DataverseMetaverse for Dataverse
Metaverse for Dataversevty
 
Open for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin CoreOpen for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin CoreAndy Powell
 
Semantic Wiki: Social Semantic Web in Use
Semantic Wiki: Social Semantic Web in UseSemantic Wiki: Social Semantic Web in Use
Semantic Wiki: Social Semantic Web in UseJesse Wang
 
Semantic Web in the Plateau of Productivity
Semantic Web in the Plateau of ProductivitySemantic Web in the Plateau of Productivity
Semantic Web in the Plateau of ProductivityIoannis Stavrakantonakis
 
ResearchSpace- Example of a VRE Based on CIDOC CRM
ResearchSpace- Example of a VRE Based on CIDOC CRMResearchSpace- Example of a VRE Based on CIDOC CRM
ResearchSpace- Example of a VRE Based on CIDOC CRMVladimir Alexiev, PhD, PMP
 
Semantic Wikis - Social Semantic Web in Action
Semantic Wikis - Social Semantic Web in ActionSemantic Wikis - Social Semantic Web in Action
Semantic Wikis - Social Semantic Web in ActionJesse Wang
 
Power to the Users (and Librarians)
Power to the Users (and Librarians)Power to the Users (and Librarians)
Power to the Users (and Librarians)Guus van den Brekel
 
Crowd wales, Building a crowdsourcing platform for Wales by Paul McCann - Eur...
Crowd wales, Building a crowdsourcing platform for Wales by Paul McCann - Eur...Crowd wales, Building a crowdsourcing platform for Wales by Paul McCann - Eur...
Crowd wales, Building a crowdsourcing platform for Wales by Paul McCann - Eur...Europeana
 
Building an ecosystem of networked references
Building an ecosystem of networked referencesBuilding an ecosystem of networked references
Building an ecosystem of networked referencesHugo Manguinhas
 
Amersfoort 2016 koch_wg_v02
Amersfoort 2016 koch_wg_v02Amersfoort 2016 koch_wg_v02
Amersfoort 2016 koch_wg_v02walter koch
 
Instructional Design for the Semantic Web
Instructional Design for the Semantic WebInstructional Design for the Semantic Web
Instructional Design for the Semantic Webguest649a93
 
Presentation of context: Web Annotations (& Pundit) during the StoM Project (...
Presentation of context: Web Annotations (& Pundit) during the StoM Project (...Presentation of context: Web Annotations (& Pundit) during the StoM Project (...
Presentation of context: Web Annotations (& Pundit) during the StoM Project (...Net7
 

Similar to Bridging Gaps Between Humans With Interoperable Machines (20)

From Provider to Portal - a chain of interoperability
From Provider to Portal - a chain of interoperabilityFrom Provider to Portal - a chain of interoperability
From Provider to Portal - a chain of interoperability
 
Digital Libraries of the Future
Digital Libraries of the Future
Digital Libraries of the Future
Digital Libraries of the Future
 
Realizing a Semantic Web Application - ICWE 2010 Tutorial
Realizing a Semantic Web Application - ICWE 2010 TutorialRealizing a Semantic Web Application - ICWE 2010 Tutorial
Realizing a Semantic Web Application - ICWE 2010 Tutorial
 
Overview AG AKSW
Overview AG AKSWOverview AG AKSW
Overview AG AKSW
 
The Europeana Strategy and Linked Open Data
The Europeana Strategy and Linked Open DataThe Europeana Strategy and Linked Open Data
The Europeana Strategy and Linked Open Data
 
Busy Architects Guide to Modern Web Architecture in 2014
Busy Architects Guide to  Modern Web Architecture in 2014Busy Architects Guide to  Modern Web Architecture in 2014
Busy Architects Guide to Modern Web Architecture in 2014
 
Metaverse for Dataverse
Metaverse for DataverseMetaverse for Dataverse
Metaverse for Dataverse
 
Open for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin CoreOpen for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin Core
 
Semantic Wiki: Social Semantic Web in Use
Semantic Wiki: Social Semantic Web in UseSemantic Wiki: Social Semantic Web in Use
Semantic Wiki: Social Semantic Web in Use
 
Semantic Web in the Plateau of Productivity
Semantic Web in the Plateau of ProductivitySemantic Web in the Plateau of Productivity
Semantic Web in the Plateau of Productivity
 
Irish Digital Libraries Summit
Irish Digital Libraries SummitIrish Digital Libraries Summit
Irish Digital Libraries Summit
 
ResearchSpace- Example of a VRE Based on CIDOC CRM
ResearchSpace- Example of a VRE Based on CIDOC CRMResearchSpace- Example of a VRE Based on CIDOC CRM
ResearchSpace- Example of a VRE Based on CIDOC CRM
 
Semantic Wikis - Social Semantic Web in Action
Semantic Wikis - Social Semantic Web in ActionSemantic Wikis - Social Semantic Web in Action
Semantic Wikis - Social Semantic Web in Action
 
Power to the Users (and Librarians)
Power to the Users (and Librarians)Power to the Users (and Librarians)
Power to the Users (and Librarians)
 
Crowd wales, Building a crowdsourcing platform for Wales by Paul McCann - Eur...
Crowd wales, Building a crowdsourcing platform for Wales by Paul McCann - Eur...Crowd wales, Building a crowdsourcing platform for Wales by Paul McCann - Eur...
Crowd wales, Building a crowdsourcing platform for Wales by Paul McCann - Eur...
 
Building an ecosystem of networked references
Building an ecosystem of networked referencesBuilding an ecosystem of networked references
Building an ecosystem of networked references
 
Amersfoort 2016 koch_wg_v02
Amersfoort 2016 koch_wg_v02Amersfoort 2016 koch_wg_v02
Amersfoort 2016 koch_wg_v02
 
Silicon Valley Semantic Web Meet Up
Silicon Valley Semantic Web Meet UpSilicon Valley Semantic Web Meet Up
Silicon Valley Semantic Web Meet Up
 
Instructional Design for the Semantic Web
Instructional Design for the Semantic WebInstructional Design for the Semantic Web
Instructional Design for the Semantic Web
 
Presentation of context: Web Annotations (& Pundit) during the StoM Project (...
Presentation of context: Web Annotations (& Pundit) during the StoM Project (...Presentation of context: Web Annotations (& Pundit) during the StoM Project (...
Presentation of context: Web Annotations (& Pundit) during the StoM Project (...
 

Recently uploaded

MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxabhijeetpadhi001
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 

Recently uploaded (20)

MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptx
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 

Bridging Gaps Between Humans With Interoperable Machines

  • 1. Language Resources, Language Technology, Text Mining, the Semantic Web: How interoperability of machines can help humans in the multilingual web Felix Sasaki DFKI / University of Appl. Sciences Potsdam W3C German-Austrian Office felix.sasaki@dfki.de W3C Workshop “The Multilingual Web - Where Are We?” 26-27 October 2010, Madrid 1
  • 2. Purpose of this talk (1) • Show gaps – Between machines – Between machines and humans • … which we need to fill to bridge gaps between humans
  • 3. Purpose of this talk (2) • Identify groups / communities – To fill gaps – To come together in new alliances
  • 4. Basics: What are machines doing (not only on the Web)?
  • 5. Language Technology • Summarization LT “These texts are about ...”
  • 6. Language Technology • Machine Translation LT このワークショップは…で開催される “The workshop takes place in …”
  • 7. Language Technology • Spell and grammar checking LT “The workshop takes place in …” “The worksop take place in …” • And many more applications • Coreference resolution, discourse analysis, named entity recognition, natural language generation, question answering, …
  • 8. Text mining • Finding out things you did not know Text mining • “Text A and text B are similar” • “The text collection has clusters of topics: …” Visualization of results
  • 9. Basics: What are machines doing (not only on the Web)? How are they doing it? They are using resources
  • 10. Resources in language technology • Sample resources for summarization LT “These texts are about ...” NLG output text mining output stop word list …
  • 11. Language Technology • Sample resources in Machine Translation LT このワークショップは…で開催される “The workshop takes place in …” Lexicon Grammar (Training) corpora … Generation
  • 12. Language Technology • Sample resources for spell and grammar checking LT “The workshop takes place in …” “The worksop take place in …” Lexicon Grammar …
  • 13. Text mining • Sample resources for text mining Text mining • “Text A and text B are similar” • “The text collection has clusters of topics: …” Lexicon Stop word list …
  • 14. In general: you need three types of data: input, resources, workflow Input Workflow Output Resources Resources …
  • 15. What gaps need to be filled for truly “multilingual content processing”? • Gap 1: machines don’t use metadata available in the input • Gap 2: machines don’t know about the workflow (input) data goes through • Gap 3: machines don’t make explicit – “Who” they are – What resources they are using
  • 16. Gap 1: machines don’t use metadata available in the input • Input from www.postbank.de „Ob Postbank direkt, Online-Banking, Online-Brokerage oder myBHW. Die häufigsten Fragen zu unseren Transaktionssystemen finden Sie an dieser Stelle.“ • Output via Google Translate “Whether Postbank direct, online banking, online brokerage or myBHW. Frequently asked questions about our transaction systems can be found at this location.”
  • 17. Gap 1: machines don’t use metadata available in the input • Input from www.postbank.de „Ob Postbank direkt, Online-Banking, Online-Brokerage oder myBHW. Die häufigsten Fragen zu unseren Transaktionssystemen finden Sie an dieser Stelle.“ • Output via Google Translate “Whether Postbank direct, online banking, online brokerage or myBHW. Frequently asked questions about our transaction systems can be found at this location.” Fixed terminology should not have been translated. But – the MT tool had no chance to “know” that – why?
  • 18. Gap 2: machines don’t know about processes data goes through • Input from the database – the “hidden web”: „Ob <term>Postbank direkt</term>, <term>Online-Banking</term>, <term>Online-Brokerage</term> …“ • Output on the Web: „Ob <em>Postbank direkt</em>, <em>Online-Banking</em>, <em>Online-Brokerage</em> …“ fixed terminology (= metadata) … … is lost on the Web in the publication process
  • 19. Gap 3: no common identification … • Of metadata and process chains (previous slides) • Of resources – e.g. what is a lexicon – In machine translation? – In localization? – For a human reader? – Ability to combine tools depends on knowing about them (capabilities, resources) in detail
  • 20. Who can fill these gaps – people dealing with multilingual content • Content producers – Allow for terminology identification in source formats / CMS • Localizers – Make localization workflows aware of (process / source content) metadata • “Machine” experts – Make their tools sensitive to source content metadata and expose their capabilities (what resources / workflows) in a clearly defined way
  • 21. Who can fill these gaps – people dealing with multilingual content • Users – Add metadata to source content – Use (machine translation) tools without knowing the details – e.g. in the browser! • Browser vendors – Create APIs which make use of automatic tools / resource and workflow descriptions / source code metadata • … The people in this room!
  • 22. How can they fill the gaps? • All these groups need to agree upon one machine-readable information space for filling the gaps • It’s actually already here – the Semantic Web!
  • 23. What is the Semantic Web • The Web as humans see it: Identification of “meaning” e.g. via (typographic or other) conventions „Ob Postbank direkt …“
  • 24. What is the Semantic Web • The Web as machines see it: Identification of meaning via RDF-based mechanisms (here via RDFa) „Ob <span property="its:term">Postbank direkt</span> …“
  • 25. What is the Semantic Web – RDF in 30 seconds • A framework for making statements about resources, using URIs • RDF can help to fill our gaps 1. Metadata in the input 2. Metadata for workflows 3. Identify 1., 2. and language technology resources uniquely • In one information space – the machine-readable Web
  • 26. Instead of a summary – call for (participating in) project proposals • Who needs to come together – Content producers, localizers, “machine” experts, browser vendors, users • What should their work be based upon – Semantic Web technologies – Clear interfaces to the human (e.g. browser) Web, like RDFa • What we do not need – Web-centred standardization of formats for language resources themselves – that is already done elsewhere (see this session) • Where is the place to do that work? – W3C, since it needs to be part of core Web technologies • For making it happen, we need a strong alliance of Web technologies, other fields and machine technologies
  • 27. META-NET • EU-funded project, closely related to “Multilingual Web” • Main aim: build an alliance for improving language technologies in Europe • Laaarge: soon 40+ participating organizations in 30+ countries • Very important: bring users of language technology in
  • 28. META-NET • Users and language technology companies = in Europe not only large companies, but more and more small SMEs • The targets of META-NET are these small and fast units – including you • EU has started special funding programs for SMEs – see http://tinyurl.com/eu-lt-sme (“objective 4.1”)
  • 29. META-NET • Event: META-NET Forum • Brussels, November 17th/18th • Aim: Bring users / language technology developers / policy makers together • Discuss a road map for the next 10 years of language technology and its applications • Details and registration at http://www.meta-net.eu/events
  • 30. Language Resources, Language Technology, Text Mining, the Semantic Web: How interoperability of machines can help humans in the multilingual web Felix Sasaki DFKI / University of Appl. Sciences Potsdam W3C German-Austrian Office felix.sasaki@dfki.de
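
Slides 18, 24 and 25 can be tied together in a small sketch. It assumes a hypothetical publication step that maps the database's `<term>` markup to the RDFa form shown on slide 24 instead of to `<em>`, so that a downstream tool (for example a machine translation front end) can still recover the fixed terminology and state it as simple subject/predicate/object statements. The function names (`publish`, `extract_terms`, `as_statements`) and the page URI are illustrative assumptions, not part of the talk or of any existing tool.

```python
from html.parser import HTMLParser


def publish(db_fragment: str) -> str:
    """Hypothetical publication step: keep term metadata as RDFa, not <em>."""
    out = db_fragment.replace("<term>", '<span property="its:term">')
    return out.replace("</term>", "</span>")


class TermCollector(HTMLParser):
    """Collects the text of elements carrying property="its:term".

    Simplification: assumes well-formed markup without void elements
    (such as <br>) inside a term.
    """

    def __init__(self):
        super().__init__()
        self.terms = []
        self._depth = 0  # greater than 0 while inside a term element
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1  # nested element inside a term
        elif dict(attrs).get("property") == "its:term":
            self._depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.terms.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def extract_terms(html: str) -> list:
    """Return the fixed terms a tool could protect from translation."""
    collector = TermCollector()
    collector.feed(html)
    collector.close()
    return collector.terms


def as_statements(page_uri: str, terms: list) -> list:
    """Express the metadata as (subject, predicate, object) statements."""
    return [(page_uri, "its:term", term) for term in terms]
```

Called on the slide 18 input, `extract_terms(publish(...))` yields the three fixed terms, while the `<em>` version of the same sentence yields none, which is the loss that gap 2 describes.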

Editor's Notes

  1. The purpose of this talk is as shown on this and on the following slide.
  2. Let us start with some basics: what are machines doing, not only on the Web? We focus on a few technology fields like “language technology” and “text mining”, which are sufficient to identify the gaps we want to point out.
  3. We start with the field of language technology and some of its applications. Language technology can be used e.g. for text summarization. A language technology component takes one or several documents as an input and produces as an output a summary of the document(s).
  4. Another application of language technology is machine translation. A language technology component takes a text as an input and produces an output in a different language.
  5. Another application is checking of spelling and grammar. On the left side we see the input sentence from the last slide with some mistakes. A language technology component takes this as an input and produces suggestions for corrections, as shown on the right side of the slide. These are only a few sample applications of language technology, to give you a rough idea of what this technology is used for. Some others are shown on this slide, but we will not go into details here.
  6. The purpose of text mining is to find out new information from large amounts of unstructured text. For example, you may find out from a text mining process that two texts A and B are similar, or that there is a cluster around certain topics in a text collection. Results of text mining are often visualized, which is only indicated on this slide.
  7. Now let us get closer to the point we want to make: what do the things machines are doing have in common? All the technologies we introduced rely on resources. Some of the resources will be described now.
  8. Summarization depends on the output of other processes. There are various approaches to summarization, making use of different kinds of resources or outputs. Sample outputs come from natural language generation and text mining. Text mining itself relies on a stop word list, since you don’t want words like “and, because, or, …” as part of the mining process.
  9. For machine translation, what you need depends very much on your approach. Again we will not go into details here, but list some prototypical resources. You may need a lexicon for your source and your target language, or grammar(s), again for both languages. You may use corpora for several reasons. With corpora you can “train” your translation component: a corpus can help you to generate a lexicon or a grammar from examples of real use. It can be used to calculate probabilities for translations, given examples of aligned source and target language sentences. And it can be used to test and enhance the quality of a translation, given (hand-written) examples.
  10. For spell or grammar checking, again you need a lexicon and a grammar, and potentially other resources.
  11. For text mining, you need a lexicon since you want to find e.g. similarities between texts about “multilingualism” and “multilingual”. The lexicon will help you to bring these two expressions together. You will also need something like a stop word list. It contains words like “and”, “when”, “because” which you do not want to take into account for your mining process.
  12. In the previous slides we described resources like a lexicon, grammars or corpora. Generalizing the picture, you may call the resources a kind of “data”. There are other types of data you need: you have input data which you want to summarize, translate, spell check or mine for new information. And there is a workflow in the examples we had: you describe for a language technology or mining process what is happening: a lexicon is used for translation, a corpus to find previously translated examples, a grammar to check correctness in the target language, and so on.
  13. The first gap we want to point out is: machines don’t use metadata which is available in the input. Here we see an example of a translation generated automatically, via Google Translate. The result looks OK, but let’s take a closer look.
  14. The machine translation process should have “known” that there is fixed terminology not to be translated. But it could never know that – why?
  15. The reason is that the information about the fixed terminology is not available in the data. By “in the data”, we mean the input data for the machine translation tool. Actually, the information about fixed terminology was available in the “hidden web”, that is, in a database. But the database does not appear on the Web. Here we come to gap 2: machines need to know about the process the data went through before it appeared on the “surface” of the Web. This is of course closely related to the first gap. Filling the second gap is in a sense the prerequisite for filling the first gap.
  16. The third gap is related to the two others: we want to be able to uniquely identify metadata (for source content) and processing chain descriptions. We also want to clearly identify the characteristics of the tools that use this kind of metadata.
  17. This slide shows what groups can help to fill these gaps. It is actually the people in this room! Note that the examples are just examples, useful for our “machine translation and terminology” problem. There is no time during this presentation to show other problems which can be solved in the same manner, but believe me that they exist.
  18. It is necessary that the people in this room come together and fill the gaps mentioned. It is also helpful if they do it in one machine-readable information space – the Semantic Web.
  19. The Semantic Web is the Web made processable for machines. On this slide you see how humans see the Web. “Meaning” is conveyed by the content itself and e.g. typographic conventions.
  20. In the Semantic Web, meaning is conveyed with specific means: via RDF-based mechanisms. The mechanism to add machine-readable meaning to Web pages is called RDFa and is exemplified here. The “property” attribute expresses that the content of the “span” element is to be interpreted as a term.
  21. Instead of a summary, let’s have a call for project proposals!
  22. These slides provide some background about META-NET.
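
The text mining resources described in the notes above (a lexicon that brings “multilingualism” and “multilingual” together, and a stop word list that filters out words like “and” or “because”) can be illustrated with a toy similarity measure. This is a minimal sketch only: the tiny `LEXICON`, the `STOP_WORDS` set and the use of Jaccard overlap are illustrative assumptions, not the resources or the method of any particular text mining tool.

```python
import re

# Hypothetical resources, far smaller than anything used in practice.
STOP_WORDS = {"and", "when", "because", "or", "the", "is", "a", "of", "on"}
LEXICON = {"multilingualism": "multilingual"}  # maps variants to one entry


def content_terms(text: str) -> set:
    """Lowercase, tokenize, drop stop words, normalize via the lexicon."""
    tokens = re.findall(r"\w+", text.lower())
    return {LEXICON.get(t, t) for t in tokens if t not in STOP_WORDS}


def similarity(text_a: str, text_b: str) -> float:
    """Jaccard overlap of the content terms of two texts (0.0 to 1.0)."""
    terms_a, terms_b = content_terms(text_a), content_terms(text_b)
    if not terms_a and not terms_b:
        return 0.0  # nothing left to compare after stop word removal
    return len(terms_a & terms_b) / len(terms_a | terms_b)
```

With these resources, “Multilingualism on the Web” and “The multilingual Web” reduce to the same two content terms, so they count as similar even though their surface forms differ. Without the lexicon entry the two texts would share only “web”.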