Semantic Technology: The Basics


Published in: Technology, Business

  1. Semantic Technology: the basics Jans Aasman, Ph.D. CEO Franz Inc [email_address]
  2. Contents <ul><li>The basics of triples and metadata </li></ul><ul><li>Storing triples, and the difference between an RDB and a triple store </li></ul><ul><li>Linked Open Data, and how to use SPARQL </li></ul><ul><li>Where do little triples come from? An example of entity extraction </li></ul><ul><li>Tim Berners-Lee’s dream come true: RDFa, ontologies, another example </li></ul><ul><li>What do companies use triple stores for? </li></ul><ul><li>And why do companies work with triple stores? </li></ul><ul><li>Can’t I do that with a NoSQL database like Hadoop, Bigtable, HBase, Cassandra…? </li></ul><ul><li>General requirements for RDB, triple stores and Big Data </li></ul><ul><li>What about rules and reasoning? </li></ul><ul><li>What about geospatial, temporal and social network analysis? </li></ul>
  3. Caveat <ul><li>I will talk mostly about the AllegroGraph tool suite </li></ul><ul><li>But 90% of what I discuss will also work with </li></ul><ul><ul><li>Virtuoso </li></ul></ul><ul><ul><li>Oracle </li></ul></ul><ul><ul><li>OWLIM </li></ul></ul><ul><ul><li>Jena </li></ul></ul><ul><ul><li>Sesame </li></ul></ul>
  4. Trends in mainstream IT <ul><li>Gartner Group’s 2008 list of the Top 10 Disruptive Technologies that will affect IT in the next five years </li></ul><ul><ul><li>Multi-core and hybrid processors </li></ul></ul><ul><ul><li>Virtualisation and fabric computing </li></ul></ul><ul><ul><li>Metadata and semantic technologies </li></ul></ul><ul><ul><li>Social networks and social software </li></ul></ul><ul><ul><li>Cloud computing and cloud platforms </li></ul></ul><ul><ul><li>Web mashups </li></ul></ul><ul><ul><li>User interface </li></ul></ul><ul><ul><li>Ubiquitous computing </li></ul></ul><ul><ul><li>Contextual computing </li></ul></ul><ul><ul><li>Augmented reality </li></ul></ul>
  6. The whole Semantic Web rests on a little thingy: <ul><li>The triple </li></ul>
  7. A triple: subject, predicate, object. With a graph added: quads
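  8. N-Triples format <ul><li>Subject: Resource or Blank Node </li></ul><ul><li>Predicate: only Resource </li></ul><ul><li>Object: Resource, Literal or Blank Node </li></ul><ul><li>Regular expression for the terms: (?:(?<=<)[^>]+(?=>))|(?:"(?:(?:[^"]|(?:\"))*)"[^. ]*)|(?:_:[-_a-zA-Z0-9]+) </li></ul>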
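To make the N-Triples rules concrete, here is a minimal Python sketch that splits one statement into subject, predicate and object. The pattern is a simplification written for this example, not the full N-Triples grammar and not the regular expression from the slide; the sample IRI under example.org is made up.

```python
import re

# Simplified term patterns (assumptions for illustration, not the full grammar).
IRI = r"<[^>]+>"
BLANK = r"_:[-_a-zA-Z0-9]+"
LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[a-zA-Z-]+|\^\^<[^>]+>)?'

# Subject: IRI or blank node. Predicate: IRI only. Object: IRI, blank node or literal.
STATEMENT = re.compile(
    rf"^\s*({IRI}|{BLANK})\s+({IRI})\s+({IRI}|{BLANK}|{LITERAL})\s*\.\s*$"
)


def parse_ntriple(line):
    """Return (subject, predicate, object) as raw strings, or None if the line does not match."""
    match = STATEMENT.match(line)
    return match.groups() if match else None


example = '<http://example.org/jans> <http://xmlns.com/foaf/0.1/name> "Jans Aasman" .'
print(parse_ntriple(example))
```

A literal in the subject position, or a blank node as predicate, makes `parse_ntriple` return `None`, which mirrors the position rules listed on the slide.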
  9. As RDF/XML
  10. As TriX
  11. As N3
  12. But don’t worry <ul><li>Most triple stores will read </li></ul><ul><ul><li>RDF/XML </li></ul></ul><ul><ul><li>N-Triples </li></ul></ul><ul><ul><li>N3 </li></ul></ul><ul><li>And otherwise you just use rapper… </li></ul>
  15. How can I read up on triples and meta data?
  16. A triple store <ul><li>How do you store triples, and what is the difference between a relational database and a triple store? </li></ul>
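One way to picture the difference is a toy in-memory store: every fact is a (subject, predicate, object) tuple, and there is no table schema to alter when a new kind of fact appears. The sketch below is an illustration only; the class name `TinyTripleStore`, its three indexes and the sample data are assumptions made for this example and say nothing about how AllegroGraph or any other product is implemented.

```python
from collections import defaultdict


class TinyTripleStore:
    """Toy triple store: a set of tuples plus one index per position."""

    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)
        self.by_predicate = defaultdict(set)
        self.by_object = defaultdict(set)

    def add(self, s, p, o):
        triple = (s, p, o)
        self.triples.add(triple)
        self.by_subject[s].add(triple)
        self.by_predicate[p].add(triple)
        self.by_object[o].add(triple)

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        bound = ((self.by_subject, s), (self.by_predicate, p), (self.by_object, o))
        candidates = [index.get(term, set()) for index, term in bound if term is not None]
        # Start from the smallest index hit, or from everything if nothing is bound.
        pool = min(candidates, key=len) if candidates else self.triples
        return sorted(
            t for t in pool
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


store = TinyTripleStore()
store.add("person:jans", "name", "Jans")
store.add("person:jans", "worksFor", "org:franz")
store.add("person:ann", "name", "Ann")
# A brand-new attribute needs no schema change, just another triple.
store.add("person:ann", "nickname", "Annie")

print(store.match(p="name"))
print(store.match(s="person:jans", p="worksFor"))
```

In a relational table the `nickname` attribute would require a new column or a new table; here it is just one more row of the same shape.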
  17. An artist’s impression of a db for persons January, 2008
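  20. Comparing RDB, HBase, Triple Stores
  <table>
  <tr><th>Feature</th><th>RDB</th><th>HBase</th><th>Modern Triple Store</th></tr>
  <tr><td>Transactions</td><td>+</td><td>-</td><td>+</td></tr>
  <tr><td>ACID</td><td>+</td><td>-</td><td>+</td></tr>
  <tr><td>Concurrent and Dynamic</td><td>+</td><td>+</td><td>+</td></tr>
  <tr><td>Random Access</td><td>+</td><td>+</td><td>+</td></tr>
  <tr><td>Flexibility</td><td>-</td><td>-</td><td>+</td></tr>
  <tr><td>High Availability</td><td>+</td><td>+</td><td>+</td></tr>
  <tr><td>Joins / complex graph search</td><td>-</td><td>-</td><td>+</td></tr>
  <tr><td>Structured + Unstructured</td><td>-</td><td>-</td><td>+</td></tr>
  <tr><td>Scalability</td><td>+</td><td>+</td><td>+</td></tr>
  </table>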
  21. Can you give us some examples? <ul><li>Linked Open Data and Pharmaceutical demo </li></ul><ul><li>Entity Extraction and a Triple Store </li></ul><ul><li>Tim Berners-Lee’s dream comes true: </li></ul><ul><ul><li>RDFa – Yahoo/Bing/Google/Overstock </li></ul></ul>
  22. Pharmaceutical
  25. sider
  26. Federated 11 linked data sets <ul><li>We took 5 public databases: DrugBank, DailyMed, Clinical Trials, Diseasome, and Sider. Entities are mostly linked together through same-as relationships. </li></ul><ul><li>Using entity extraction, we created additional links and triples </li></ul><ul><ul><li>CT-discusses-drug, CT-discusses-side-effect </li></ul></ul><ul><ul><li>CT-discusses-target, CT-discusses-disease </li></ul></ul><ul><li>Alitora did extensive NLP and entity extraction on Rheumatoid Arthritis </li></ul><ul><ul><li>CT-mentions-genes </li></ul></ul><ul><li>And to facilitate search through schema space: Schema-connections </li></ul>
  29. DrugBank: A repository of almost 5000 FDA-approved small molecule and biotech drugs. Contains detailed information about drugs including chemical, pharmacological and pharmaceutical data, along with comprehensive drug target data such as sequence, structure, and pathway information.
  30. LinkedCT: Clinical Trials. Up-to-date information for locating federally and privately supported clinical trials for a wide range of diseases and conditions. It contains 81,571 trials sponsored by the National Institutes of Health, other federal agencies, and private industry. It receives over 40 million page views per month and 50,000 visitors daily.
  31. Diseasome: Publishes a network of 4,300 disorders and disease genes linked by known disorder-gene associations, for exploring all known phenotype and disease gene associations and indicating the common genetic origin of many diseases. The list of disorders, disease genes, and associations between them was obtained from the Online Mendelian Inheritance in Man (OMIM), a compilation of human disease genes and phenotypes.
  32. DailyMed: Published by the National Library of Medicine, DailyMed provides high-quality information about marketed drugs, including general background on the chemical structure of the compound and its therapeutic purpose, details on the compound's clinical pharmacology, indication and usage, contraindications, warnings, precautions, adverse reactions, overdosage, and patient counseling.
  33. Sider: Contains information on marketed drugs and their adverse effects. The information is extracted from public documents and package inserts.
  34. Finding entities in Clinical Trials <ul><li>CT has too much text </li></ul><ul><li>We searched for drugs, diseases, targets and side effects in Clinical trials and created new triples </li></ul><ul><ul><li>CT100385 discusses-side-effect headache </li></ul></ul><ul><ul><li>CT100385 discusses-drug aspirin </li></ul></ul><ul><ul><li>CT100385 discusses-disease alcohol-addiction </li></ul></ul><ul><ul><li>CT100385 discusses-target some-protein </li></ul></ul>
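The step from free text to triples can be sketched with a simple dictionary lookup. This is only an illustration of the idea: the lexicon, the sample sentence and the function name `extract_triples` are invented for this example, and real entity extraction (as used in the demo) is considerably more sophisticated than word matching.

```python
import re

# Hypothetical mini-lexicon: predicate -> terms to look for in the trial text.
LEXICON = {
    "discusses-drug": ["aspirin", "lipitor"],
    "discusses-side-effect": ["headache", "nausea"],
    "discusses-disease": ["alcohol addiction"],
}


def extract_triples(trial_id, text):
    """Create one (trial, predicate, entity) triple per lexicon term found in the text."""
    lowered = text.lower()
    triples = []
    for predicate, terms in LEXICON.items():
        for term in terms:
            # Whole-word match so that, for example, 'aspirin' does not match 'aspirins'.
            if re.search(rf"\b{re.escape(term)}\b", lowered):
                triples.append((trial_id, predicate, term.replace(" ", "-")))
    return triples


# Made-up sentence standing in for the text of a clinical trial record.
sample = "Patients with alcohol addiction received aspirin; headache was the most common complaint."
for triple in extract_triples("CT100385", sample):
    print(triple)
```

Each triple has the same shape as the examples on the slide, so the extracted facts can be loaded next to the linked data sets and queried together with them.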
  35. Combined with advanced entity extraction <ul><li>CT-mentions-drugs </li></ul><ul><li>CT-mentions-genes </li></ul><ul><li>CT-mentions-side-effects </li></ul><ul><li>(currently only for </li></ul>
  36. Interesting queries <ul><li>SPARQL </li></ul><ul><ul><li>Give me the titles of all clinical trials that discuss the drug Lipitor and the side effect “Diabetes type 2” </li></ul></ul><ul><ul><li>Give me clinical trials that discuss Rheumatoid Arthritis and give me the genes and drugs discussed </li></ul></ul><ul><li>Prolog </li></ul><ul><ul><li>Find all clinical trials that resemble clinical trial NCT00130091 given diseases, drugs, targets, and side-effects </li></ul></ul>
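The first query is a join of three triple patterns on the same trial. The Python sketch below evaluates that join over a handful of made-up triples; the trial identifiers, titles and predicate names are assumptions for illustration, and the SPARQL in the comment uses a hypothetical `ex:` vocabulary rather than the actual LinkedCT schema.

```python
# Roughly the shape of the SPARQL query (prefix and property names are hypothetical):
#   SELECT ?title WHERE {
#     ?trial ex:title ?title ;
#            ex:discussesDrug ex:Lipitor ;
#            ex:discussesSideEffect ex:DiabetesType2 .
#   }

# Made-up data: (subject, predicate, object).
triples = {
    ("ct:1", "title", "Trial A"),
    ("ct:1", "discusses-drug", "Lipitor"),
    ("ct:1", "discusses-side-effect", "Diabetes type 2"),
    ("ct:2", "title", "Trial B"),
    ("ct:2", "discusses-drug", "Lipitor"),
    ("ct:2", "discusses-side-effect", "Headache"),
    ("ct:3", "title", "Trial C"),
    ("ct:3", "discusses-drug", "Aspirin"),
    ("ct:3", "discusses-side-effect", "Diabetes type 2"),
}


def subjects_with(predicate, obj):
    """All subjects that have the given predicate/object pair."""
    return {s for s, p, o in triples if p == predicate and o == obj}


# Join: trials that satisfy both patterns, then look up their titles.
matching_trials = (
    subjects_with("discusses-drug", "Lipitor")
    & subjects_with("discusses-side-effect", "Diabetes type 2")
)
titles = sorted(o for s, p, o in triples if p == "title" and s in matching_trials)
print(titles)
```

A triple store performs the same intersection using its indexes, so the query author only writes the patterns and never the join logic.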
  37. Where do triples come from <ul><li>By user programming </li></ul><ul><ul><li>(loop </li></ul></ul><ul><ul><li>repeat 1000 </li></ul></ul><ul><ul><li>do(add-triple (ran 100)(ran 100)(ran 100))) </li></ul></ul><ul><li>From Relational Databases </li></ul><ul><li>From entity extractors </li></ul>
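For the relational route, a common starting point is a purely mechanical mapping: the row becomes a subject, each column becomes a predicate, and each cell becomes an object. The sketch below assumes a hypothetical `person` table and an invented naming scheme (`table:key` for subjects, `table:column` for predicates); actual RDB-to-RDF tools offer far more control over the mapping.

```python
def row_to_triples(table, key_column, row):
    """Turn one relational row (a dict) into triples; NULL (None) cells produce no triple."""
    subject = f"{table}:{row[key_column]}"
    triples = [(subject, "rdf:type", table)]
    for column, value in row.items():
        if column != key_column and value is not None:
            triples.append((subject, f"{table}:{column}", value))
    return triples


# Hypothetical row from a 'person' table.
row = {"id": 7, "name": "Jans Aasman", "employer": "Franz Inc", "phone": None}
for triple in row_to_triples("person", "id", row):
    print(triple)
```

Note that the missing phone number simply yields no triple, whereas the relational row has to carry an explicit NULL.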
  43. Guided Interaction Advisor: visualization of the solution <ul><li>Triple store with business concepts: events from many source systems are transformed into a set of related business concepts </li></ul><ul><li>Diagram labels: Chronology of events, Interactions, Bills, Orders, Payments, Collections, Charge dispute, Individual Customer, Pay instructions, Device Activated, Device heartbeat, Subscriptions, Device changes, Many events </li></ul>
  44. Guided Interaction Advisor: visualization of the solution <ul><li>Charge dispute: probabilistic assessment is performed; corporate policy rules determine actions </li></ul><ul><li>Diagram labels: Plan Overage, Bill greater than last month, Prorates, Roaming charges, Third Party Charges, Abnormal fee, Rate increases </li></ul><ul><li>Diagram labels: Charge dispute, First bill, Past due amount, Pay bill, Customer Cancellation, Reactivation, Device activation, Device education, Device exchange, Device Lost, Device not working, Device resume, Service data not working, Service text not working, Service voice not working, Subscription cancellation, Wrong plan </li></ul><ul><li>Chronology of events: Device Activation, Third Party Charges, Abnormal fee </li></ul>
  45. Entity Extraction: from <ul><li>People (has-people) </li></ul><ul><ul><li>And their roles </li></ul></ul><ul><li>Places (has-places) </li></ul><ul><ul><li>And the county, state, country they are in </li></ul></ul><ul><li>Organizations (has-organizations) </li></ul><ul><ul><li>Government departments, company names, etc. </li></ul></ul><ul><li>Main Categories (has-domains) </li></ul><ul><ul><li>Politics, sports, ministries, energy, finance, economics, ecology, oil, mining industry, etc.. </li></ul></ul><ul><li>Main Concepts (has-main-groups) </li></ul><ul><ul><li>Other important nouns and phrases in a text </li></ul></ul>
  46. How would you do this with your standard search engine? <ul><li>Give me a newspaper text with a Republican and a Democrat who serve on two subcommittees that have the same parent committee. </li></ul><ul><li>Which [Democrat|Republican] is most vocal on the oil spill disaster? </li></ul><ul><li>Given this text, find all the other texts that have the same people and the same main topics but no Democrats in the text. </li></ul><ul><li>Which [Democrat|Republican|senator|representative] got the most attention in the last week? </li></ul><ul><li>Give me the distribution of the most important topics yesterday. </li></ul>
  47. The process <ul><li>Every day we spider more than 300 online newspapers and thousands of blogs </li></ul><ul><li>And search specifically for all the members of the Senate, the House of Representatives and the executive branch </li></ul><ul><li>Apply Cogito to the text and extract main concepts </li></ul><ul><ul><li>About 150 triples per text… </li></ul></ul><ul><li>Hook up these concepts with a detailed database of each politician. </li></ul>
  50. Demo <ul><li>Looking at the main properties of a text </li></ul><ul><li>Full text indexing </li></ul><ul><li>Simple SPARQL and Prolog queries </li></ul><ul><li>Finding related articles </li></ul><ul><li>Finding connections between two texts </li></ul>
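Once every text is described by extracted entities, "finding related articles" can be approximated by comparing entity sets. The sketch below uses Jaccard similarity, which is a stand-in chosen for this example and not necessarily what the demo uses; the article identifiers and entity names are made up.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical entity sets, as they might come out of entity extraction.
articles = {
    "text1": {"person:smith", "topic:oil-spill", "org:senate"},
    "text2": {"person:smith", "topic:oil-spill", "org:house"},
    "text3": {"person:jones", "topic:finance"},
}


def related(article_id, minimum=0.3):
    """Other articles whose entity overlap with article_id is at least `minimum`, best first."""
    base = articles[article_id]
    scored = [
        (other, jaccard(base, entities))
        for other, entities in articles.items()
        if other != article_id
    ]
    return sorted(
        [(other, score) for other, score in scored if score >= minimum],
        key=lambda pair: (-pair[1], pair[0]),
    )


print(related("text1"))
```

The same comparison works for any resource that is described by a set of linked entities, for instance clinical trials described by their drugs, diseases, targets and side effects.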
  51. How Tim Berners-Lee’s dream came true
  52. Vocabularies, Thesauri, Taxonomy, Ontologies <ul><li>Vocabularies : the heart of linking </li></ul><ul><ul><li>bc:Citi rdf:type bc:VocabularyEntry </li></ul></ul><ul><li>Thesauri: linking variants to Vocabulary </li></ul><ul><ul><li>bc:Citi bc:hasAlternativeName ‘Citi Group’ </li></ul></ul><ul><li>Taxonomy: finding the hierarchy in your data </li></ul><ul><ul><li>bc:Banamex bc:part of bc:Citi </li></ul></ul><ul><li>Ontology: types, subtypes, constraints </li></ul><ul><ul><li>bc:Citi rdf:type bc:Bank </li></ul></ul><ul><ul><li>bc:Bank rdf:type owl:Class </li></ul></ul><ul><ul><li>bc:Bank rdfs:subClassOf bc:Company </li></ul></ul><ul><ul><li>bc:Company rdfs:subClassOf bc:Organization </li></ul></ul>
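The ontology triples on this slide already support a small inference: because `bc:Bank` is a subclass of `bc:Company`, which is a subclass of `bc:Organization`, anything typed as a bank is also an organization. The Python sketch below follows `rdfs:subClassOf` links transitively over exactly those triples; the function names are invented for this example and it covers only this one RDFS rule.

```python
# Triples taken from the slide.
triples = {
    ("bc:Citi", "rdf:type", "bc:Bank"),
    ("bc:Bank", "rdf:type", "owl:Class"),
    ("bc:Bank", "rdfs:subClassOf", "bc:Company"),
    ("bc:Company", "rdfs:subClassOf", "bc:Organization"),
}


def superclasses(cls):
    """All classes reachable from cls by following rdfs:subClassOf."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen


def types_of(resource):
    """Asserted types of a resource plus the types inferred through the class hierarchy."""
    direct = {o for s, p, o in triples if s == resource and p == "rdf:type"}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(cls)
    return inferred


print(sorted(types_of("bc:Citi")))
```

A query for all organizations would therefore return `bc:Citi` even though no triple states that directly.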
  53. Schema Spaces <ul><li>Create Schema Connection Spaces </li></ul><ul><ul><li>Take original RDB schemas and syntactically transform to RDF and RDFS </li></ul></ul><ul><ul><li>bc:customer1 rdf:type bc:table </li></ul></ul><ul><ul><li>bc:customerID1 rdf:type bc:columnName </li></ul></ul><ul><ul><li>bc:customerID1 bc:dataType bc:long </li></ul></ul><ul><ul><li>Annotate with origin </li></ul></ul><ul><ul><li>bc:customer1 bc:fromDB bc:ERP1 </li></ul></ul><ul><ul><li>Annotate with connections to other schema </li></ul></ul><ul><ul><li>bc:customer1 bc:relatesTo bc:customer2 </li></ul></ul>
  63. What do companies use triple stores for?
  65. Bill Guin, CTO, Amdocs Semantic Real Time Intelligent Decision Automation . Thursday, 12:45, Grand A Vijay Bulusu. Knowledge Sharing Using Semantic Technologies Thursday, 3:30, Franciscan A
  66. And why do these companies work with triple stores? <ul><li>When you need ultimate flexibility </li></ul><ul><li>Modeling knowledge and assets </li></ul><ul><li>Hundreds to thousands of classes with different features </li></ul><ul><li>Every day, new classes and new features </li></ul><ul><li>You work with rules and reasoning </li></ul><ul><li>When you need ultimate ‘linkability’ </li></ul><ul><li>For (ad hoc) integration of databases </li></ul><ul><li>When you need pattern recognition and network analysis </li></ul><ul><li>Complex networks of people, companies, products, etc. </li></ul><ul><li>When you need event processing using geospatial, temporal reasoning and social network analysis combined with flexible metadata </li></ul>
  67. Can’t I do that with a NoSQL database like Hadoop, HBase, Cassandra…? <ul><li>HBase is an open-source, distributed database modeled after Google’s BigTable and written in Java. It is developed as part of Apache Software Foundation's Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. ... </li></ul><ul><li>It sacrifices ACID guarantees and complex joins for web-scale scalability. </li></ul>
  69. Classes of applications and their natural database <ul><li>Regular enterprise applications: relational databases </li></ul><ul><li>Web-scale shallow objects: Hadoop, HBase, Cassandra, etc. </li></ul><ul><li>Complex metadata applications: triple stores </li></ul>
  71. Geospatial, temporal and social network analysis with a triple store…
  72. Social Network Analysis Answers 4 questions <ul><li>How far is P1 from P2 (and how strong is the relation?) </li></ul><ul><li>To what groups does this person belong (ego groups, cliques?) </li></ul><ul><li>How important is this person in the group? </li></ul><ul><li>Does this group have a leader, how cohesive are they? </li></ul>
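Three of these questions map onto small graph algorithms. The Python sketch below works over a made-up "knows" graph and shows shortest-path distance (breadth-first search), an ego group to a given depth, and a simple degree count inside a group as a stand-in for importance. The names, the graph and the choice of degree as the centrality measure are assumptions for illustration, not AllegroGraph's SNA library.

```python
from collections import deque

# Hypothetical symmetric 'knows' relation.
knows = {
    "jans": {"ann", "bob"},
    "ann": {"jans", "carl"},
    "bob": {"jans"},
    "carl": {"ann", "dora"},
    "dora": {"carl"},
}


def distance(start, goal):
    """Number of 'knows' hops on the shortest path, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        person, depth = queue.popleft()
        if person == goal:
            return depth
        for friend in knows.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return None


def ego_group(person, depth):
    """The person plus everyone reachable within `depth` hops."""
    group = {person}
    frontier = {person}
    for _ in range(depth):
        frontier = {f for p in frontier for f in knows.get(p, ())} - group
        group |= frontier
    return group


def degree_in_group(group):
    """For each member, the number of direct contacts inside the group."""
    return {p: len(knows.get(p, set()) & group) for p in group}


group = ego_group("jans", 2)
print(distance("jans", "dora"), sorted(group), degree_in_group(group))
```

Cohesion and leadership questions build on the same primitives, for example by comparing the number of links inside a group with the number that would be possible.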
  73. GeoSpatial <ul><li>Make the following super efficient </li></ul><ul><ul><li>Where did something happen? </li></ul></ul><ul><ul><li>How far was event1 from event2? </li></ul></ul><ul><ul><li>Find all the events that occurred in a bounding box or radius of M miles? </li></ul></ul><ul><ul><li>Do these two shapes overlap? </li></ul></ul><ul><ul><li>Find all the objects in the intersection of two shapes </li></ul></ul><ul><li>On a very large scale </li></ul><ul><ul><li>when things don’t fit in memory </li></ul></ul><ul><ul><li>millions of events and polygons </li></ul></ul>
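The individual questions are simple to state in code; the hard part the slide refers to is answering them efficiently at scale, which needs spatial indexing that this sketch does not attempt. The Python below shows great-circle distance (haversine formula), a bounding-box test, a radius filter and a rectangle-overlap test. The event names are invented and the coordinates are approximate values used only for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def in_bounding_box(lat, lon, box):
    """box is (south, west, north, east)."""
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


def boxes_overlap(a, b):
    """True if two (south, west, north, east) rectangles share any area or edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def within_radius(events, lat, lon, miles):
    """Names of events no further than `miles` from the given point (linear scan)."""
    return sorted(
        name for name, (elat, elon) in events.items()
        if haversine_miles(lat, lon, elat, elon) <= miles
    )


# Approximate coordinates, for illustration only.
BERKELEY = (37.8715, -122.2730)
events = {
    "meeting-1": (37.8044, -122.2712),  # roughly Oakland
    "meeting-2": (37.7749, -122.4194),  # roughly San Francisco
}
print(within_radius(events, *BERKELEY, 5))
```

The linear scan in `within_radius` is exactly what stops working with millions of events, which is why a store with native geospatial indexing matters.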
  74. Temporal Reasoning <ul><li>Adhere to our convention to encode StartTimes and EndTimes and enjoy efficient temporal primitives </li></ul><ul><li>Implementation of Allen’s interval logic primitives </li></ul>
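Allen's interval logic names the thirteen possible relations between two intervals that each have a start and an end. The function below is a minimal Python sketch of that classification, assuming each interval satisfies start < end; the function name and the returned labels are choices made for this example, not AllegroGraph's primitives.

```python
from datetime import date


def allen_relation(a_start, a_end, b_start, b_end):
    """Name the Allen relation of interval A to interval B (requires start < end for both)."""
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Only partial overlaps remain at this point.
    return "overlaps" if a_start < b_start else "overlapped-by"


# A hypothetical two-day meeting checked against a query window.
meeting = (date(2008, 11, 3), date(2008, 11, 4))
window = (date(2008, 11, 1), date(2008, 11, 6))
print(allen_relation(*meeting, *window))
```

Storing start and end times in one agreed encoding is what lets a store answer such relations with index lookups instead of per-row date arithmetic.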
  75. A Simple Event Ontology <ul><li>A type </li></ul><ul><ul><li>Meetings, communications event, financial transactions, visit, attack/truce, an insurance claim, a purchase order </li></ul></ul><ul><ul><li>RDFS++ reasoning </li></ul></ul><ul><li>A list of actors </li></ul><ul><ul><li>Social Network Analysis </li></ul></ul><ul><li>A place </li></ul><ul><ul><li>GeoSpatial Reasoning </li></ul></ul><ul><li>A start time and possibly an end time </li></ul><ul><ul><li>Temporal Reasoning </li></ul></ul><ul><li>Anything else that describes the event </li></ul><ul><ul><li>Goods that changed hands </li></ul></ul>
  76. Activity Recognition <ul><li>Our customers use AllegroGraph as an event database with social network analysis and geospatial and temporal reasoning </li></ul><ul><ul><li>Find all meetings that happened in November within 5 miles of Berkeley and were attended by the most important person among Jans’ friends and friends of friends. </li></ul></ul><ul><li>(select (?x) </li></ul><ul><li>(ego-group person:jans knows ?group 2) ; SNA </li></ul><ul><li>(actor-centrality-members ?group knows ?x ?num) ; SNA </li></ul><ul><li>(q ?event fr:actor ?x) ; DB lookup </li></ul><ul><li>(qs ?event rdf:type fr:Meeting) ; RDFS </li></ul><ul><li>(interval-during ?event "2008-11-01" "2008-11-06") ; Temporal </li></ul><ul><li>(geo-box-around geoname:Berkeley ?event 5 miles) ; Spatial </li></ul><ul><li>!) </li></ul>
  78. On the road map <ul><li>We now routinely handle 20 billion triples on a big blade machine </li></ul><ul><li>Expect a trillion triples in December </li></ul><ul><li>Problems we are solving </li></ul><ul><ul><li>Keep it ACID </li></ul></ul><ul><ul><li>Smart partitioning </li></ul></ul><ul><ul><li>Smart (re-)balancing </li></ul></ul><ul><ul><li>Smart indexing </li></ul></ul><ul><ul><li>Query pipelines </li></ul></ul><ul><ul><ul><li>Do (a part of) a query where the data is. </li></ul></ul></ul><ul><ul><ul><li>Parallel query execution </li></ul></ul></ul>
  79. Reasoning