Language Resources, Language Technology, Text Mining, the Semantic Web: How interoperability of machines can help humans in the multilingual web
Felix Sasaki, W3C Workshop "The Multilingual Web - Where Are We?", 26-27 October 2010, Madrid

Notes for slides

  • The purpose of this talk is as shown on this and on the following slide.
  • Let us start with some basics: what are machines doing, not only on the Web? We focus on a few technology fields like “language technology” and “text mining”, which are sufficient to identify the gaps we want to point out.
  • We start with the field of language technology and some of its applications. Language technology can be used e.g. for text summarization: a language technology component takes one or several documents as input and produces a summary of the document(s) as output. (A toy summarizer sketch appears after these notes.)
  • Another application of language technology is machine translation. A language technology component takes a text as input and produces output in a different language.
  • Another application is checking of spelling and grammar. On the left side we see the input sentence from the last slide, with some mistakes. A language technology component takes this as input and produces suggestions for corrections, as shown on the right side of the slide.
    These are only a few sample applications of language technology, to give you a rough idea of what this technology is used for. Some others are shown on this slide, but we will not go into details here.
  • The purpose of text mining is to find out new information from large amounts of unstructured text. For example, you may find out from a text mining process that two texts A and B are similar, or that there is a cluster around certain topics in a text collection. Results of text mining are often visualized, which is only indicated on this slide. (A toy similarity sketch appears after these notes.)
  • Now let us get closer to the point we want to make: what is common to what machines are doing? All the technologies we introduced rely on resources. Some of these resources will be described now.
  • Summarization depends on the output of other processes. There are various approaches to summarization, making use of different kinds of resources or outputs, for example output from natural language generation or from text mining. Text mining itself relies on a stop word list, since you don't want words like "and", "because", "or", … to be part of the mining process.
  • For machine translation, what you need depends very much on your approach. Again we will not go into details here, but list some prototypical resources. You may need a lexicon for your source and your target language, or grammar(s), again for both languages. You may use corpora for several reasons. With corpora you can "train" your translation component: they can help you to generate a lexicon or a grammar from examples of real use. They can be used to calculate probabilities for translations, given examples of aligned source and target language sentences. And they can be used to test and enhance the quality of a translation, given (hand-written) examples. (A toy sketch of estimating translation probabilities from an aligned corpus appears after these notes.)
  • For spell or grammar checking, again you need a lexicon and a grammar, and potentially other resources.
  • For text mining, you need a lexicon since you want to find e.g. similarities between texts about “multilingualism” and “multilingual”. The lexicon will help you to bring these two expressions together. You will also need something like a stop word list. It contains words like “and”, “when”, “because” which you do not want to take into account for your mining process.
  • In the previous slides we described resources like a lexicon, grammars or corpora. Generalizing the picture, you may call these resources a kind of "data". There are other types of data you need: you have input data which you want to summarize, translate, spell-check or mine for new information. And there is a workflow in the examples we had: you describe what is happening in a language technology or mining process: a lexicon is used for translation, a corpus to find previously translated examples, a grammar to check correctness in the target language, and so on.
  • The first gap we want to point out is: machines don’t use metadata which is available in the input. Here we see an example of a translation generated automatically, via Google translate. The result looks OK, but let’s take a closer look.
  • The machine translation process should have “known” that there is fixed terminology not to be translated. But it could never know that – why?
  • The reason is that the information about the fixed terminology is not available in the data. By "in the data", we mean the input data for the machine translation tool. Actually, the information about fixed terminology was available in the "hidden web", that is, in a database. But the database does not appear on the Web. Here we come to gap 2: machines need to know about the process the data went through before it appeared on the "surface" of the Web. This is of course closely related to the first gap; filling the second gap is in a sense a prerequisite for filling the first. (A toy sketch of this metadata loss appears after these notes.)
  • The third gap is related to the other two: we want to be able to uniquely identify metadata (for source content) and descriptions of processing chains. We also want to identify the characteristics of tools that use this kind of metadata, in an equally clear manner. (A toy RDF sketch of such identification appears after these notes.)
  • This slide shows which groups can help to fill these gaps: it is actually the people in this room! Note that the examples are just that – examples, useful for our "machine translation and terminology" problem. There is no time during this presentation to show other problems which can be solved in the same manner, but believe me that they exist.
  • It is necessary that the people in this room come together and fill the gaps mentioned. It is also helpful if they do so in one machine-readable information space – the Semantic Web.
  • The Semantic Web is the Web made processable for machines. On this slide you see how humans see the Web: "meaning" is conveyed by the content itself and e.g. by typographic conventions.
  • In the Semantic Web, meaning is conveyed with specific means: via RDF-based mechanisms. The mechanism for adding machine-readable meaning to Web pages is called RDFa and is exemplified here. The "property" attribute expresses that the content of the "span" element is to be interpreted as a term. (A toy sketch of extracting such term annotations appears after these notes.)
  • Instead of a summary, let's have a call for project proposals!
  • These slides provide some background about META-NET.
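
As a concrete illustration of the summarization note above: a minimal frequency-based extractive summarizer in Python. This is only a sketch, not the approach discussed in the talk; the stop word list and the scoring heuristic are illustrative assumptions.

```python
# Frequency-based extractive summarization: score each sentence by the
# frequency of its non-stop words, then keep the top-scoring sentences.
import re
from collections import Counter

# Toy stop word list; a real component would use a curated resource.
STOP_WORDS = {"and", "or", "because", "the", "a", "in", "is", "of", "to"}

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = [w for w in re.findall(r"\w+", sentence.lower())
                  if w not in STOP_WORDS]
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Re-emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)
```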
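
The text mining notes mention two resources: a stop word list, and a lexicon that brings "multilingualism" and "multilingual" together. Below is a minimal similarity sketch showing how both are used; the lexicon, stop word list and example texts are toy assumptions.

```python
# Cosine similarity over bags of words, after stop word removal and
# lexicon-based normalization of surface forms to a shared base form.
import math
import re
from collections import Counter

STOP_WORDS = {"a", "about", "and", "because", "is", "the", "when"}
LEXICON = {"multilingualism": "multilingual"}  # toy surface-form -> base-form map

def bag_of_words(text):
    tokens = [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]
    return Counter(LEXICON.get(t, t) for t in tokens)

def cosine_similarity(a, b):
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

# The lexicon maps "multilingualism" to "multilingual", so these match fully:
print(cosine_similarity("A text about multilingualism.", "A multilingual text."))
```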
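
The machine translation note mentions calculating translation probabilities from aligned source and target sentences. The toy sketch below uses plain co-occurrence counting as a crude stand-in for real training; the two-pair "corpus" and all numbers are illustrative assumptions.

```python
# Naive word translation probabilities from a sentence-aligned corpus:
# count which target words co-occur with each source word.
from collections import Counter, defaultdict

aligned = [  # (English, German) sentence pairs - a toy corpus
    ("the workshop takes place", "der workshop findet statt"),
    ("the workshop ends", "der workshop endet"),
]

cooc = defaultdict(Counter)  # source word -> Counter of co-occurring target words
for src, tgt in aligned:
    for s in src.split():
        for t in tgt.split():
            cooc[s][t] += 1

def translation_prob(source_word, target_word):
    """Relative co-occurrence frequency as a naive translation probability."""
    total = sum(cooc[source_word].values())
    return cooc[source_word][target_word] / total if total else 0.0

print(translation_prob("workshop", "workshop"))  # 2/7: co-occurs in both pairs
```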
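
The note on gap 2 (and slide 18) shows how the publication process turns semantic <term> markup from the database into purely presentational <em> markup. Here is a small sketch of exactly that loss; the one-line "template" is a hypothetical stand-in for a real CMS publication step.

```python
# The database export marks fixed terminology; publication flattens it.
import re

database_fragment = ("Ob <term>Postbank direkt</term>, "
                     "<term>Online-Banking</term>, "
                     "<term>Online-Brokerage</term> oder myBHW.")

# Publication step: semantic markup is replaced by typographic markup.
published = (database_fragment.replace("<term>", "<em>")
                              .replace("</term>", "</em>"))

# An MT tool looking for terminology on the published page finds nothing:
print(re.findall(r"<term>(.*?)</term>", published))  # [] - the metadata is gone
```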
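
For the note on gap 3, a toy RDF sketch of unique identification, using the third-party rdflib package. Every URI and property name below is an invented placeholder, not an existing vocabulary; the point is only that tools, resources and metadata get globally unique identifiers.

```python
# Describe an MT tool, its resources and the metadata it understands as
# RDF statements, so other machines can identify all of them by URI.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/lt/")  # placeholder vocabulary

g = Graph()
tool = URIRef("http://example.org/tools/my-mt-engine")
g.add((tool, EX.usesResource, URIRef("http://example.org/resources/de-en-lexicon")))
g.add((tool, EX.understandsMetadata, EX.term))
g.add((tool, EX.description, Literal("toy MT engine that respects term markup")))

print(g.serialize(format="turtle"))
```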
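
Finally, for the RDFa note: a minimal sketch of how a tool could recover the term annotation from the markup shown on slide 24, using only the Python standard library. A real consumer would use a full RDFa processor; this parser deliberately ignores nested elements and other RDFa features.

```python
# Extract the text of elements annotated with property="its:term".
from html.parser import HTMLParser

class TermExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_term = False
        self.terms = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("property") == "its:term":
            self._in_term = True
            self.terms.append("")

    def handle_data(self, data):
        if self._in_term:
            self.terms[-1] += data

    def handle_endtag(self, tag):
        self._in_term = False

parser = TermExtractor()
parser.feed('Ob <span property="its:term">Postbank direkt</span> …')
print(parser.terms)  # ['Postbank direkt'] - now usable by an MT tool
```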
Transcript

    1. Language Resources, Language Technology, Text Mining, the Semantic Web: How interoperability of machines can help humans in the multilingual web. Felix Sasaki, DFKI / University of Applied Sciences Potsdam, W3C German-Austrian Office, felix.sasaki@dfki.de. W3C Workshop "The Multilingual Web - Where Are We?", 26-27 October 2010, Madrid.
    2. Purpose of this talk (1): Show gaps between machines, and between machines and humans, which we need to fill to bridge gaps between humans.
    3. Purpose of this talk (2): Identify groups / communities to fill the gaps and to come together in new alliances.
    4. Basics: What are machines doing (not only on the Web)?
    5. Language Technology: Summarization. LT: "These texts are about ..."
    6. Language Technology: Machine Translation. LT: このワークショップは…で開催される → "The workshop takes place in …"
    7. Language Technology: Spell and grammar checking. LT: "The worksop take place in …" → "The workshop takes place in …". And many more applications: coreference resolution, discourse analysis, named entity recognition, natural language generation, question answering, …
    8. Text mining: Finding out things you did not know. Text mining: "Text A and text B are similar"; "The text collection has clusters of topics: …". Visualization of results.
    9. Basics: What are machines doing (not only on the Web)? How are they doing it? They are using resources.
    10. Resources in language technology: Sample resources for summarization ("These texts are about ..."). Resources: NLG output, text mining output, stop word list, …
    11. Language Technology: Sample resources in Machine Translation (このワークショップは…で開催される → "The workshop takes place in …"). Resources: lexicon, grammar, (training) corpora, generation, …
    12. Language Technology: Sample resources for spell and grammar checking. Resources: lexicon, grammar, …
    13. Text mining: Sample resources for text mining ("Text A and text B are similar"; "The text collection has clusters of topics: …"). Resources: lexicon, stop word list, …
    14. In general, you need three types of data: input, resources and workflow. Input → workflow → output, with resources feeding the workflow.
    15. What gaps need to be filled for truly "multilingual content processing"? Gap 1: machines don't use metadata available in the input. Gap 2: machines don't know about the workflow (input) data goes through. Gap 3: machines don't make explicit "who" they are and what resources they are using.
    16. Gap 1: machines don't use metadata available in the input. Input from www.postbank.de: „Ob Postbank direkt, Online-Banking, Online-Brokerage oder myBHW. Die häufigsten Fragen zu unseren Transaktionssystemen finden Sie an dieser Stelle." Output via Google translate: "Whether Postbank direct, online banking, online brokerage or myBHW. Frequently asked questions about our transaction systems can be found at this location."
    17. Gap 1, continued: same input and output as on the previous slide. Fixed terminology should not have been translated. But the MT tool had no chance to "know" that – why?
    18. Gap 2: machines don't know about the processes data goes through. Input from the database – the "hidden web": „Ob <term>Postbank direkt</term>, <term>Online-Banking</term>, <term>Online-Brokerage</term> …" Output on the Web: „Ob <em>Postbank direkt</em>, <em>Online-Banking</em>, <em>Online-Brokerage</em> …" The fixed terminology (= metadata) is lost on the Web in the publication process.
    19. Gap 3: no common identification of metadata and process chains (previous slides), or of resources – e.g. what is a lexicon in machine translation? In localization? For a human reader? The ability to combine tools depends on knowing about them (capabilities, resources) in detail.
    20. Who can fill these gaps – people dealing with multilingual content. Content producers: allow for terminology identification in source formats / CMS. Localizers: make localization workflows aware of (process / source content) metadata. "Machine" experts: make their tools sensitive to source content metadata and expose their capabilities (what resources / workflows) in a clearly defined way.
    21. Who can fill these gaps – people dealing with multilingual content (continued). Users: add metadata to source content; use (machine translation) tools without knowing the details – e.g. in the browser! Browser vendors: create APIs which make use of automatic tools, resource and workflow descriptions, and source content metadata. … The people in this room!
    22. How can they fill the gaps? All these groups need to agree upon one machine-readable information space for filling the gaps. It's actually already here – the Semantic Web!
    23. What is the Semantic Web? The Web as humans see it: identification of "meaning" e.g. via (typographic or other) conventions. „Ob Postbank direkt …"
    24. What is the Semantic Web? The Web as machines see it: identification of meaning via RDF-based mechanisms (here via RDFa). „Ob <span property="its:term">Postbank direkt</span> …"
    25. What is the Semantic Web – RDF in 30 seconds. A framework for making statements about resources, using URIs. RDF can help to fill our gaps: 1. metadata in the input; 2. metadata for workflows; 3. identifying 1., 2. and language technology resources uniquely – in one information space, the machine-readable Web.
    26. Instead of a summary – a call for (participation in) project proposals. Who needs to come together: content producers, localizers, "machine" experts, browser vendors, users. What their work should be based upon: Semantic Web technologies, with clear interfaces to the human (e.g. browser) Web, like RDFa. What we do not need: Web-centred standardization of formats for language resources themselves – that is already done elsewhere (see this session). Where is the place to do that work? W3C, since it needs to be part of core Web technologies. To make it happen, we need a strong alliance of Web technologies, other fields and machine technologies.
    27. META-NET: an EU-funded project, closely related to "Multilingual Web". Main aim: build an alliance for improving language technologies in Europe. Laaarge: soon 40+ participating organizations in 30+ countries. Very important: bring users of language technology in.
    28. META-NET: users and language technology companies in Europe are not only large companies, but more and more small SMEs. The target of META-NET is exactly these small and fast units – including you. The EU has started special funding programs for SMEs – see http://tinyurl.com/eu-lt-sme ("objective 4.1").
    29. META-NET event: the META-NET Forum, Brussels, November 17th/18th. Aim: bring users, language technology developers and policy makers together, and discuss a road map for the next 10 years of language technology and its applications. Details and registration at http://www.meta-net.eu/events
    30. Closing slide: repeats the title slide.
