Apache Stanbol 
and the Web of Data - ApacheCon 2011


Presentation on Apache Stanbol (incubating) and related projects given by Olivier Grisel during ApacheCon 2011.

More information:
- http://incubator.apache.org/stanbol/
- http://www.iks-project.eu


  1. 1. Apache Stanbol (Incubating) and the Web of Data Olivier Grisel, Nuxeo ogrisel@apache.org, 2011-11-11 11/7/11
  2. 2. My Background
      - Olivier Grisel, R&D Engineer at Nuxeo (Open Source ECM)
      - European project: IKS
      - Stuff I do: Machine Learning, Natural Language Processing, all things data
  3. 3. Agenda
      - The Web of Data: what, why, how?
      - CMS integration demo
      - Semantic Components in Stanbol
      - Building models for Stanbol
  4. 4. The Web of Data What, Why, How?
  5. 6. “To a computer, then, the web is a flat, boring world devoid of meaning” Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
  6. 7. “This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them” Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
  7. 8. “The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
  8. 9. “Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values.” Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
  9. 10. The Web of Data – What?
      - Shared description of the real world
        - Structured with vocabularies
        - Decentralized
        - Scoped by namespaces
        - Linked
  10. 11. The Web of Data – Why?
      - Strings are ambiguous
        - New York / The Big Apple / NYC
        - Washington (Person, State, City, Sports Team...)
      - Structured context helps humans
        - Who is this guy?
        - Where is this city?
      - Conceptual frame helps machines
        - Explicit user intent decoding
        - Smarter indexing / search?
  11. 12. Decoding User Intents
  12. 13. Decoding User Intents
      - Next Generation User Interfaces
        - Siri - conversational interface
        - IBM DeepQA: Watson for Health Care
      - Tell Google about your stuff
        - Publish structured prediction of your products
        - "3 bedrooms flat near Montmartre"
      - Useful for non-public data as well
        - Intranet query: "ApacheCon slides"
        - Intranet query: "Xerox invoices"
        - Intranet query: "Xerox salesperson email"
  13. 14. The Web of Data - How?
      - RDF / Triple Stores / SPARQL
        - Graph stores with dynamic schemas
        - Strong interoperability
      - JSON-LD
        - Upgrade your JSON with scoped vocabularies
        - Web / Mobile / JS developer friendly
      - RDFa + schema.org & rNews
        - Publish annotation in structured markup
        - Vocabulary understood by Search Engines
  14. 15. HTML example
      <p>
        My name is Manu Sporny and you can give me a ring via
        1-800-555-0155.
        <img src="http://manu.sporny.org/images/manu.png" />
        I have a <a href="http://manu.sporny.org/">blog</a>.
      </p>
  15. 16. RDFa example
      <p vocab="http://schema.org/"
         prefix="foaf: http://xmlns.com/foaf/0.1/"
         about="#manu" typeof="Person">
        My name is <span property="name">Manu Sporny</span>
        and you can give me a ring via
        <span property="telephone">1-800-555-0155</span>.
        <img rel="image"
             src="http://manu.sporny.org/images/manu.png" />
        I have a <a rel="foaf:weblog"
                    href="http://manu.sporny.org/">blog</a>.
      </p>
  16. 17. JSON-LD example
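The slide's JSON-LD example did not survive in this transcript. As a rough sketch only, the facts from the RDFa example above could be written as JSON-LD along these lines. It assumes the keywords of the later JSON-LD 1.0 specification (`@context`, `@vocab`, `@id`, `@type`), which may differ from the draft syntax shown at the talk.

```python
import json

# Assumption: JSON-LD 1.0 keywords; the 2011 draft shown on the slide may have differed.
person = {
    "@context": {
        "@vocab": "http://schema.org/",        # default vocabulary, as in the RDFa example
        "foaf": "http://xmlns.com/foaf/0.1/",  # scoped prefix for FOAF terms
    },
    "@id": "#manu",
    "@type": "Person",
    "name": "Manu Sporny",
    "telephone": "1-800-555-0155",
    "image": {"@id": "http://manu.sporny.org/images/manu.png"},
    "foaf:weblog": {"@id": "http://manu.sporny.org/"},
}

print(json.dumps(person, indent=2))
```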
  17. 18. [Figure panels: 2007, 2008, 2009, 2010]
  18. 19. 2011
  19. 20. Bridging the Web of Data and my CMS
  20. 22. Apache Stanbol
      - Enhancer
        - Text analysis with Apache OpenNLP / Tika
      - EntityHub / ContentHub
        - Linked Data Indexing with Apache Solr
        - Graph Storage with Apache Clerezza / Jena
      - Reasoner / Rules
        - Inference with Apache Jena & OWLApi
      - Components / HTTP Services
        - OSGi with Apache Felix / JAX-RS with Jersey
  21. 27. RESTful is Beautiful
  22. 28. Minimalist HTTP Client
      curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
           --data "John Smith was born in London." \
           http://stanbol.demo.nuxeo.com/engines
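For readers who prefer a script to curl, the same request can be sketched with Python's standard library. The URL and headers come from the slide; the function names `build_enhance_request` and `enhance` are illustrative, and the demo server may no longer be reachable, so the example only builds the request and does not send it.

```python
import urllib.request

# Endpoint taken from the slide; the demo server may no longer be online.
STANBOL_ENGINES_URL = "http://stanbol.demo.nuxeo.com/engines"


def build_enhance_request(text, url=STANBOL_ENGINES_URL):
    """Build (but do not send) the POST request made by the curl command."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Accept": "text/turtle", "Content-Type": "text/plain"},
        method="POST",
    )


def enhance(text, url=STANBOL_ENGINES_URL, timeout=10):
    """Send the text and return the response body (Turtle was requested)."""
    request = build_enhance_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_enhance_request("John Smith was born in London.")
print(request.get_method(), request.full_url)
```

Calling `enhance("John Smith was born in London.")` against a running Stanbol instance would return the enhancements as a Turtle document.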
  23. 31. [Diagram: Local IT infrastructure (LAN) with Nuxeo DM addon and Apache Stanbol (Engine 1, Engine 2, Engine 3); DBpedia, Freebase, Geonames, LDAP]
  24. 32. Stanbol Enhancer
      - Chain of Enhancement Engines
        - Language Detection (Tika)
        - Named Entity Detection (OpenNLP)
        - Linked Data dereferencing (Solr)
        - Refactoring / Translation (Jena)
  25. 33. Stanbol EntityHub
      - Referenced Sites
        - DBpedia
        - Geonames
        - (NY Times, MusicBrainz, ProductDB, UniProt...)
      - Fast local offline indices (Solr)
        - Batch indexing utilities for RDF dumps
        - Multilingual fulltext search in labels & descriptions
        - Vocabulary mapping / merging
  26. 34. Stanbol Reasoner
      - RDFS / OWL-lite / OWL2
      - Consistency checks
        - Cardinality checks: each person has 1 birth date
        - Range constraints: birth dates are valid dates
      - Materializing types / properties
        - Types from subclass: Musician > Artist > Person
        - Symmetric property: A worked with B
        - Transitive property: A is located in B
      - Query-time expansion / inference?
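To make "materializing types / properties" concrete, here is a small plain-Python sketch. It does not use Stanbol, Jena or the OWL API; the `SUBCLASS_OF` table and both function names are hypothetical, and the subclass hierarchy is assumed to be acyclic.

```python
# Hypothetical subclass table mirroring the slide: Musician > Artist > Person.
SUBCLASS_OF = {"Musician": "Artist", "Artist": "Person"}


def materialize_types(direct_type, subclass_of=SUBCLASS_OF):
    """Return the direct type followed by all of its ancestors (acyclic hierarchy assumed)."""
    types = [direct_type]
    while types[-1] in subclass_of:
        types.append(subclass_of[types[-1]])
    return types


def transitive_closure(pairs):
    """Expand (a, b) facts of a transitive property such as 'is located in'."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a -> b and b -> d imply a -> d
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


print(materialize_types("Musician"))
print(sorted(transitive_closure({("Montmartre", "Paris"), ("Paris", "France")})))
```

Materializing at index time keeps queries simple, at the cost of storing the inferred facts; the slide's closing question about query-time expansion is the alternative.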
  27. 35. Stanbol Rules
      - Simple Prolog-like language
        uncleRule[ has(<http://example.org/family.owl#hasParent>, ?x, ?z) .
                   has(<http://example.org/family.owl#hasSibling>, ?z, ?y)
                   -> has(<http://example.org/family.owl#hasUncle>, ?x, ?y) ]
      - SPARQL Construct or SWRL
        PREFIX family: <http://example.org/family.owl#>
        CONSTRUCT { ?x family:hasUncle ?y }
        WHERE { ?x family:hasParent ?z . ?z family:hasSibling ?y }
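The effect of the uncle rule can be illustrated in plain Python over a set of (subject, predicate, object) triples. This is only a sketch of the join the rule expresses, not the Stanbol Rules engine; `apply_uncle_rule` and the example individuals are made up for illustration.

```python
FAMILY = "http://example.org/family.owl#"


def apply_uncle_rule(triples):
    """Return the hasUncle triples implied by hasParent followed by hasSibling."""
    parents = [(s, o) for s, p, o in triples if p == FAMILY + "hasParent"]
    siblings = [(s, o) for s, p, o in triples if p == FAMILY + "hasSibling"]
    inferred = set()
    for x, z in parents:
        for z2, y in siblings:
            if z == z2:  # join on the shared variable ?z
                inferred.add((x, FAMILY + "hasUncle", y))
    return inferred


# Hypothetical individuals used only for this example.
facts = {
    ("alice", FAMILY + "hasParent", "bob"),
    ("bob", FAMILY + "hasSibling", "carl"),
}
print(apply_uncle_rule(facts))
```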
  28. 36. Online Demos
      - Simple analyzer with small index: https://stanbol.demo.nuxeo.com
      - All services deployed: http://dev.iks-project.eu:8081
  29. 37. Building Stanbol Enhancer models from Wikipedia with the Apache data tools
  30. 38. Universal Topic Classification: Use Apache Lucene / Solr MoreLikeThis to perform a truncated nearest neighbors query in the TF-IDF vector space of Wikipedia
  31. 39. Universal Topic Classification
      - Index text of all articles grouped by topic
      - Solr MoreLikeThis query on new document
      - DBpedia dumps provide:
        - Text summaries for each article
        - “subject” relationships between articles and topics
        - “broader” / “narrower” SKOS hierarchy between topics
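The idea of a truncated nearest-neighbors query in TF-IDF space can be sketched without Solr. The code below is a toy stand-in for MoreLikeThis, not its actual scoring: the function names, the smoothed IDF formula and the `max_terms` cut-off are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_index(topic_texts):
    """topic_texts maps a topic name to the concatenated text of its articles."""
    docs = {topic: Counter(tokenize(text)) for topic, text in topic_texts.items()}
    doc_freq = Counter()
    for counts in docs.values():
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    vectors = {
        topic: {term: tf * idf[term] for term, tf in counts.items()}
        for topic, counts in docs.items()
    }
    return idf, vectors


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_topics(text, idf, vectors, k=5, max_terms=25):
    """Rank topics by cosine similarity, keeping only the heaviest query terms."""
    counts = Counter(tokenize(text))
    weights = {term: tf * idf[term] for term, tf in counts.items() if term in idf}
    # Truncation: keep the max_terms highest-weighted terms of the new document.
    query = dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:max_terms])
    scored = sorted(((cosine(query, vec), topic) for topic, vec in vectors.items()), reverse=True)
    return [topic for score, topic in scored[:k] if score > 0]


# Tiny made-up corpus for illustration only.
idf, vectors = build_index({
    "Pensions": "pension retirement benefits workers",
    "Astronauts": "nasa astronaut space moon",
})
print(top_topics("nasa sues astronaut", idf, vectors))
```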
  32. 40. About the Data
      - 500k purely technical categories
        - “People_with_missing_birth_place”, “Rivers_in_Romania”
      - 70k “semantically grounded” categories
      - Paths to roots require both “technical” and “grounded” categories
      - Scale:
        - 1.2M topic / topic links
        - 30M topic / article links
  33. 41. Some results (Wikinews)
      - Article: US children who celebrate Independence Day more likely to become Republicans, says Harvard study
      - Topics: Fireworks; Voting theory; Republican Party (United States); Statistics; Electoral systems
  34. 42. Some results (Wikinews)
      - Article: U.S. space agency NASA sues ex-astronaut
      - Topics: American astronauts; Aviation halls of fame; Edwards Air Force Base; Apollo program; Exploration of the Moon
  35. 43. Some results (Wikinews)
      - Article: Hundreds of thousands of British public sector workers strike over planned pension changes
      - Topics: Retirement in the United Kingdom; United Kingdom pensions and benefits; Pensions in the United Kingdom; Labor disputes by country; Labor disputes
  36. 44. Some results (PLoS One)
      - Article: Metabolic Programming during Lactation Stimulates Renal Na+ Transport in the Adult Offspring Due to an Early Impact on Local Angiotensin II Pathways
      - Topics: Renal physiology; Kidney; Nephrology; Hypertension; Membrane biology
  37. 45. Wrap Up
      - Web of Data brings Structured Context Frame to decode User Intention
      - NLP + Entities & Topics indices to automate Content Enrichment to provide Disambiguation
  38. 46. Resources
      - Documentation, svn, mailing list: http://incubator.apache.org/stanbol
      - IKS project blog: http://blog.iks-project.eu
      - Blog posts about Semantic ECM: http://blogs.nuxeo.com/dev/semantic/
  39. 47. Thank you for your attention! Olivier Grisel [email_address] https://twitter.com/ogrisel
  40. 48. Training models for NER from Wikipedia
      - Extract sentences with link positions in Wikipedia articles
      - DBpedia to find the type of the target entity (Person, Location, Organization)
      - Apache Pig scripts to compute the join + format the result as training files for OpenNLP
      - Apache OpenNLP to build and evaluate the models
      - Apache Hadoop / Apache Whirr for distributed processing
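The formatting step of this pipeline can be sketched on a single sentence. The talk used Apache Pig for the join; the Python below is only an illustration. It converts wiki links into the `<START:type> ... <END>` markup that the OpenNLP name finder trainer reads, using a hypothetical `ENTITY_TYPES` dictionary in place of the DBpedia type lookup.

```python
import re

# Matches [[Target]] and [[Target|anchor text]] wiki links.
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

# Hypothetical lookup standing in for DBpedia type information.
ENTITY_TYPES = {
    "John Smith (explorer)": "person",
    "London": "location",
}


def to_opennlp_sentence(wiki_sentence, entity_types=ENTITY_TYPES):
    """Replace typed wiki links with OpenNLP name finder training markup."""

    def replace(match):
        target = match.group(1).strip()
        anchor = (match.group(2) or target).strip()
        entity_type = entity_types.get(target)
        if entity_type is None:
            # Unknown target: keep the anchor text as plain, unannotated words.
            return anchor
        return "<START:%s> %s <END>" % (entity_type, anchor)

    return WIKI_LINK.sub(replace, wiki_sentence)


print(to_opennlp_sentence(
    "[[John Smith (explorer)|John Smith]] was born in [[London]] ."
))
```

The input is assumed to be whitespace-tokenized already, one sentence per line, which is the layout the OpenNLP trainer expects.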