Semantic Technology: the basics
  Jans Aasman, Ph.D.
  CEO Franz Inc
  [email_address]
Contents
  The basics about triples and metadata
  Storing triples and the difference between an RDB and a triple store
  Linked Open Data, and we learn how to SPARQL
  Where do little triples come from, and an example of entity extraction
  Tim Berners-Lee's dream come true: RDFa, ontologies, another example
  What do companies use triple stores for? And why do companies work with triple stores?
  Can't I do that with a NoSQL database like Hadoop, Bigtable, HBase, Cassandra…?
  General requirements for RDB, triple stores and Big Data
  What about rules and reasoning?
  What about geospatial, temporal and social network analysis?
Caveat
  I will talk mostly about the AllegroGraph tool suite
  But 90% of what I discuss will work with
    Virtuoso
    Oracle
    Owlim
    Jena
    Sesame
Trends in mainstream IT
  Gartner Group's 2008 list of the Top 10 Disruptive Technologies that will affect IT in the next five years:
    Multi-core and hybrid processors
    Virtualisation and fabric computing
    Metadata and semantic technologies
    Social networks and social software
    Cloud computing and cloud platforms
    Web mashups
    User interface
    Ubiquitous computing
    Contextual computing
    Augmented reality
 
The whole Semantic Web rests on a little thingy: The triple
Quads: subject, predicate, object, graph
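One way to picture the four positions is as a small record type with a wildcard lookup. This is only a minimal sketch: the `Quad` type, the `match` helper and the `ex:` names are hypothetical illustrations and not part of any triple store API.

```python
from typing import NamedTuple, Optional


class Quad(NamedTuple):
    """A triple plus an optional graph name (all values are plain strings here)."""
    subject: str
    predicate: str
    obj: str
    graph: Optional[str] = None  # None stands for the default graph


# Hypothetical example data.
store = {
    Quad("ex:jans", "ex:worksFor", "ex:Franz"),
    Quad("ex:jans", "ex:knows", "ex:tim", graph="ex:social"),
}


def match(quads, s=None, p=None, o=None, g=None):
    """Return the quads that fit the pattern; None acts as a wildcard."""
    return [q for q in quads
            if (s is None or q.subject == s)
            and (p is None or q.predicate == p)
            and (o is None or q.obj == o)
            and (g is None or q.graph == g)]


about_jans = match(store, s="ex:jans")
social = match(store, g="ex:social")
```

A real store indexes these positions instead of scanning, but the lookup pattern (fix some positions, leave the others open) is the same idea.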
N-Triples format
  Subject: resource or blank node
  Predicate: only resource
  Object: resource, literal, or blank node
  (?:(?<=<)[^>]+(?=>))|(?:\"(?:(?:[^\"]|(?:\\\"))*)\"[^.  ]*)|(?:_:[-_a-zA-Z0-9]+)
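To see what the pattern on this slide does, here is a small sketch that applies it to single N-Triples lines. The pattern is the slide's regular expression with the quote escaping removed so that it fits a Python raw string; the `tokenize` helper and the example.org URIs are assumptions made for illustration. It is a tokenizer, not a validating parser.

```python
import re

# The slide's pattern, split over three lines for readability.
NTRIPLE_TOKEN = re.compile(
    r'(?:(?<=<)[^>]+(?=>))'                 # URI between angle brackets
    r'|(?:"(?:(?:[^"]|(?:\\"))*)"[^. ]*)'   # quoted literal plus any suffix such as @en
    r'|(?:_:[-_a-zA-Z0-9]+)'                # blank node label
)


def tokenize(line):
    """Return the subject, predicate and object tokens found in one line."""
    return NTRIPLE_TOKEN.findall(line)


literal_line = '<http://example.org/jans> <http://example.org/name> "Jans"@en .'
blank_line = '_:b1 <http://example.org/knows> _:b2 .'

literal_tokens = tokenize(literal_line)
blank_tokens = tokenize(blank_line)
```

URIs come back without their angle brackets because the pattern uses lookarounds, while literals keep their quotes and language tag.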
As RDF/XML
As TriX
As N3
But don't worry
  Most triple stores will read
    RDF/XML
    N-Triples
    N3
  And otherwise you just use rapper…
 
 
How can I read up on triples and metadata?
A triple store
  How do you store triples, and what is the difference between a relational database and a triple store?
An artist’s impression of a db for persons January, 2008
 
Comparing RDB, HBase, Triple Stores
                                  RDB    HBase    Modern Triple Store
  Transactions                     +       -              +
  ACID                             +       -              +
  Concurrent & Dynamic             +       +              +
  Random Access                    +       +              +
  Flexibility                      -       -              +
  High Availability                +       +              +
  Joins / complex graph search     -       -              +
  Structured + Unstructured        -       -              +
  Scalability                      +       +              +
Can you give us some examples?
  Linked Open Data and pharmaceutical demo
  Entity extraction and a triple store
  Tim Berners-Lee's dream comes true: RDFa (Yahoo/Bing/Google/Overstock)
Pharmaceutical
 
 
sider
Federated 11 linked data sets
  We took 5 public databases: DrugBank, DailyMed, Clinical Trials, Diseasome, and Sider.
  Entities are mostly linked together through same-as relationships.
  Using some entity extraction, we created some more links (and triples):
    CT-discusses-drug, CT-discusses-side-effect
    CT-discusses-target, CT-discusses-disease
  Alitora did some extensive NLP and entity extraction on Rheumatoid Arthritis:
    CT-mentions-genes
  And to facilitate search through schema space:
    Schema-connections
 
 
DrugBank
  A repository of almost 5000 FDA-approved small molecule and biotech drugs. Contains detailed information about drugs, including chemical, pharmacological and pharmaceutical data, along with comprehensive drug target data such as sequence, structure, and pathway information.
LinkedCT: Clinical Trials
  Up-to-date information for locating federally and privately supported clinical trials for a wide range of diseases and conditions. It contains 81,571 trials sponsored by the National Institutes of Health, other federal agencies, and private industry. ClinicalTrials.gov receives over 40 million page views per month and 50,000 visitors daily.
Diseasome
  Publishes a network of 4,300 disorders and disease genes linked by known disorder-gene associations, for exploring all known phenotype and disease gene associations and indicating the common genetic origin of many diseases. The list of disorders, disease genes, and associations between them was obtained from the Online Mendelian Inheritance in Man (OMIM), a compilation of human disease genes and phenotypes.
DailyMed
  Published by the National Library of Medicine; provides high-quality information about marketed drugs. DailyMed provides much information, including general background on the chemical structure of the compound and its therapeutic purpose, and details on the compound's clinical pharmacology, indication and usage, contraindications, warnings, precautions, adverse reactions, overdosage, and patient counseling.
Sider
  Contains information on marketed drugs and their adverse effects. The information is extracted from public documents and package inserts.
Finding entities in Clinical Trials
  CT has too much text
  We searched for drugs, diseases, targets and side effects in Clinical Trials and created new triples:
    CT100385  discusses-side-effect  headache
    CT100385  discusses-drug  aspirin
    CT100385  discusses-disease  alcohol-addiction
    CT100385  discusses-target  some-protein
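The slide does not say how the search was done. The sketch below assumes the simplest possible approach, a dictionary lookup of known names in the trial text; the `LEXICON` contents, the `extract_triples` helper and the sample sentence are hypothetical, and a real pipeline would load its term lists from sources such as DrugBank or Sider.

```python
import re

# Hypothetical term lists, keyed by the predicate they produce.
LEXICON = {
    "discusses-drug": {"aspirin"},
    "discusses-side-effect": {"headache"},
    "discusses-disease": {"alcohol addiction"},
}


def extract_triples(trial_id, text):
    """Emit one (trial, predicate, entity) triple per lexicon term found in the text."""
    lowered = text.lower()
    triples = []
    for predicate, terms in sorted(LEXICON.items()):
        for term in sorted(terms):
            # Whole-word match so that "aspirin" does not fire inside a longer word.
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                triples.append((trial_id, predicate, term.replace(" ", "-")))
    return triples


found = extract_triples(
    "CT100385",
    "Aspirin for headache in patients with alcohol addiction.",
)
```

Each hit becomes a new triple that links the trial to an entity, which is what makes the long free text queryable.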
Combined with advanced entity extraction
  CT-mentions-drugs
  CT-mentions-genes
  CT-mentions-side-effects (currently only for
Interesting queries (SPARQL)
  SPARQL
    Give me the title of all clinical trials that discuss the drug Lipitor and the side effect "Diabetes type 2"
    Give me clinical trials that discuss Rheumatoid Arthritis, and give me the genes and drugs discussed
  Prolog
    Find all clinical trials that resemble clinical trial NCT00130091 given diseases, drugs, targets, and side effects
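The first query is a join of two triple patterns on the same trial. The sketch below shows roughly what such a query could look like in SPARQL (as an unexecuted string, with a hypothetical `ex:` vocabulary) and evaluates the same join in plain Python over a few invented triples. None of the trial identifiers or titles come from the real data sets.

```python
# Hypothetical SPARQL form of the first query; shown for reference, not executed.
SPARQL_SKETCH = """
PREFIX ex: <http://example.org/>
SELECT ?title WHERE {
  ?trial ex:title ?title ;
         ex:discusses-drug "Lipitor" ;
         ex:discusses-side-effect "Diabetes type 2" .
}
"""

# Invented sample data.
triples = {
    ("CT1", "title", "Statin safety study"),
    ("CT1", "discusses-drug", "Lipitor"),
    ("CT1", "discusses-side-effect", "Diabetes type 2"),
    ("CT2", "title", "Statin efficacy study"),
    ("CT2", "discusses-drug", "Lipitor"),
}


def subjects(data, predicate, obj):
    """All subjects that have the given predicate and object."""
    return {s for (s, p, o) in data if p == predicate and o == obj}


def titles_for(data, drug, side_effect):
    """Titles of trials that discuss both the drug and the side effect."""
    trials = (subjects(data, "discusses-drug", drug)
              & subjects(data, "discusses-side-effect", side_effect))
    return sorted(o for (s, p, o) in data if p == "title" and s in trials)


result = titles_for(triples, "Lipitor", "Diabetes type 2")
```

The set intersection plays the role of the shared `?trial` variable: only trials that satisfy both patterns survive.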
Where do triples come from?
  By user programming
    (loop repeat 1000 do (add-triple (ran 100) (ran 100) (ran 100)))
  From relational databases
  From entity extractors
 
 
 
 
 
Guided Interaction Advisor: visualization of the solution
  Triple store with business concepts
  Events from many source systems are transformed into a set of related business concepts
  Diagram labels: Chronology of events, Interactions, Bills, Orders, Payments, Collections, Charge dispute, Individual Customer, Pay instructions, Device Activated, Device heartbeat, Subscriptions, Device changes, Many events
Guided Interaction Advisor: visualization of the solution
  Charge dispute
  Probabilistic assessment is performed
  Corporate policy rules determine actions
  Diagram labels: Plan Overage, Bill greater than last month, Prorates, Roaming charges, Third Party Charges, Abnormal fee, Rate increases, Charge dispute, First bill, Past due amount, Pay bill, Customer Cancellation, Reactivation, Device activation, Device education, Device exchange, Device Lost, Device not working, Device resume, Service data not working, Service text not working, Service voice not working, Subscription cancellation, Wrong plan
  Chronology of events: Device Activation, Third Party Charges, Abnormal fee
Entity Extraction: from
  People (has-people)
    And their roles
  Places (has-places)
    And the county, state, country they are in
  Organizations (has-organizations)
    Government departments, company names, etc.
  Main Categories (has-domains)
    Politics, sports, ministries, energy, finance, economics, ecology, oil, mining industry, etc.
  Main Concepts (has-main-groups)
    Other important nouns and phrases in a text
How would you do this with your standard search engine?
  Give me a newspaper text with a Republican and a Democrat who serve on two subcommittees that have the same parent committee.
  Which [Democrat|Republican] is most vocal about the oil spill disaster?
  Given this text, find all the other texts that have the same people and the same main topics but no Democrats in the text.
  Which [Democrat|Republican|senator|representative] got the most attention in the last week?
  Give me the distribution of the most important topics yesterday.
The process
  We spider more than 300 online newspapers and thousands of blogs daily.
  We search specifically for all the members of the Senate, the House of Representatives, and the executive branch.
  We apply Cogito to the text and extract the main concepts: about 150 triples per text.
  We hook up these concepts with a detailed database of each politician.
 
 
Demo
  Looking at the main properties of a text
  Full text indexing
  Simple SPARQL and Prolog queries
  Finding related articles
  Finding connections between two texts
How Tim Berners-Lee’s dream came true
Vocabularies, Thesauri, Taxonomy, Ontologies
  Vocabularies: the heart of linking
    bc:Citi rdf:type bc:VocabularyEntry
  Thesauri: linking variants to the vocabulary
    bc:Citi bc:hasAlternativeName 'Citi Group'
  Taxonomy: finding the hierarchy in your data
    bc:Banamex bc:partOf bc:Citi
  Ontology: types, subtypes, constraints
    bc:Citi rdf:type bc:Bank
    bc:Bank rdf:type owl:Class
    bc:Bank rdfs:subClassOf bc:Company
    bc:Company rdfs:subClassOf bc:Organization
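The ontology triples on this slide are what let a store conclude that Citi is also a company and an organization. A minimal sketch of that subclass inference, using only the slide's own triples; the helper names `superclasses` and `types_of` are hypothetical, and a real store would do this inside its reasoner.

```python
triples = {
    ("bc:Citi", "rdf:type", "bc:Bank"),
    ("bc:Bank", "rdf:type", "owl:Class"),
    ("bc:Bank", "rdfs:subClassOf", "bc:Company"),
    ("bc:Company", "rdfs:subClassOf", "bc:Organization"),
}


def superclasses(data, cls):
    """Follow rdfs:subClassOf links transitively from cls."""
    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for s, p, o in data:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                todo.append(o)
    return found


def types_of(data, resource):
    """Asserted types of a resource plus everything implied by the class hierarchy."""
    direct = {o for s, p, o in data if s == resource and p == "rdf:type"}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(data, cls)
    return inferred


citi_types = types_of(triples, "bc:Citi")
```

Only one type is asserted for `bc:Citi`; the other two follow from the two `rdfs:subClassOf` statements.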
Schema Spaces
  Create Schema Connection Spaces
  Take original RDB schemas and syntactically transform them to RDF and RDFS
    bc:customer1 rdf:type bc:table
    bc:customerID1 rdf:type bc:columnName
    bc:customerID1 bc:dataType bc:long
  Annotate with origin
    bc:customer1 bc:fromDB bc:ERP1
  Annotate with connections to other schemas
    bc:customer1 bc:relatesTo bc:customer2
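The syntactic transformation can be pictured as a loop over table and column definitions. This is a sketch under assumptions: the `schema_to_triples` function and its argument shapes are hypothetical, and only the predicates shown on the slide are emitted.

```python
def schema_to_triples(db_name, table_name, columns):
    """Describe one relational table as triples.

    columns is a list of (column_name, data_type) pairs.
    """
    table = f"bc:{table_name}"
    triples = [
        (table, "rdf:type", "bc:table"),
        (table, "bc:fromDB", f"bc:{db_name}"),  # annotate with origin
    ]
    for column_name, data_type in columns:
        column = f"bc:{column_name}"
        triples.append((column, "rdf:type", "bc:columnName"))
        triples.append((column, "bc:dataType", f"bc:{data_type}"))
    return triples


schema_triples = schema_to_triples("ERP1", "customer1", [("customerID1", "long")])
```

Connections between schemas, such as `bc:customer1 bc:relatesTo bc:customer2`, would be added afterwards as ordinary triples.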
 
 
 
 
 
 
 
 
 
What do companies use triple stores for?
 
Bill Guin, CTO, Amdocs: Semantic Real Time Intelligent Decision Automation. Thursday, 12:45, Grand A
Vijay Bulusu: Knowledge Sharing Using Semantic Technologies. Thursday, 3:30, Franciscan A
And why do these companies work with triple stores?
  When you need ultimate flexibility
    Modeling knowledge and assets
    Hundreds to thousands of classes with different features
    Every day new classes and new features
    You work with rules and reasoning
  When you need ultimate 'linkability'
    For (ad hoc) integration of databases
  When you need pattern recognition and network analysis
    Complex networks of people, companies, products, etc.
  When you need event processing
    Using geospatial and temporal reasoning and social network analysis, combined with flexible metadata
Can't I do that with a NoSQL database like Hadoop, HBase, Cassandra…?
  "HBase is an open-source, distributed database modeled after Google's BigTable and written in Java. It is developed as part of Apache Software Foundation's Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. ..." (en.wikipedia.org/wiki/Hbase)
  It sacrifices ACID-ness and complex joins for web-scale scalability.
 
Classes of applications and their natural database
  Regular enterprise applications: relational databases
  Web-scale shallow objects: Hadoop, HBase, Cassandra, etc.
  Complex metadata applications: triple stores
Geospatial, temporal and social network analysis with a triple store…
Social Network Analysis answers 4 questions
  How far is P1 from P2 (and how strong is the relation)?
  To what groups does this person belong (ego groups, cliques)?
  How important is this person in the group?
  Does this group have a leader, and how cohesive is it?
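Three of these questions map onto short graph routines. The sketch below uses a tiny invented "knows" network and hypothetical helper names; degree centrality is used here as one simple stand-in for "importance", and the slide does not say which centrality measure the product applies.

```python
from collections import deque

# Invented symmetric "knows" relation.
knows = {
    "jans": {"ann", "bob"},
    "ann": {"jans", "bob", "cid"},
    "bob": {"jans", "ann"},
    "cid": {"ann"},
}


def distance(graph, start, goal):
    """Number of 'knows' hops on a shortest path, or None if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == goal:
            return depth
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return None


def ego_group(graph, person, depth):
    """The person plus everyone reachable within `depth` hops."""
    group, frontier = {person}, {person}
    for _ in range(depth):
        frontier = {n for p in frontier for n in graph.get(p, ())} - group
        group |= frontier
    return group


def degree_centrality(graph, group):
    """For each member, count how many other group members they know."""
    return {p: len(graph.get(p, set()) & group) for p in group}


group = ego_group(knows, "jans", 2)
centrality = degree_centrality(knows, group)
most_important = max(centrality, key=centrality.get)
```

With this data the friends-of-friends group of `jans` is the whole network, and `ann` comes out as the best-connected member.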
GeoSpatial
  Make the following super efficient:
    Where did something happen?
    How far was event1 from event2?
    Find all the events that occurred in a bounding box or within a radius of M miles
    Do these two shapes overlap?
    Find all the objects in the intersection of two shapes
  On a very large scale, when things don't fit in memory: millions of events and polygons
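The distance and radius questions can be pinned down with the haversine formula. This sketch is a brute-force scan over a handful of invented events, so it only illustrates what the query means; it says nothing about the indexing that makes such queries efficient at scale. The event names and the approximate coordinates are assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(events, lat, lon, miles):
    """Names of events located at most `miles` from the given point."""
    return [name for name, (elat, elon) in events.items()
            if haversine_miles(lat, lon, elat, elon) <= miles]


# Invented events with approximate city-centre coordinates.
events = {
    "oakland_meeting": (37.8044, -122.2712),
    "san_jose_meeting": (37.3382, -121.8863),
}
BERKELEY = (37.8716, -122.2727)

nearby = within_radius(events, BERKELEY[0], BERKELEY[1], 5)
```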
Temporal Reasoning
  Adhere to our convention to encode StartTimes and EndTimes and enjoy efficient temporal primitives
  Implementation of Allen's interval logic primitives
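Allen's interval logic names the thirteen ways two intervals can relate. A minimal sketch of a classifier follows; it assumes every interval has a start strictly before its end, and the function name is hypothetical rather than one of the store's primitives. ISO 8601 date strings are used because they sort chronologically as plain strings.

```python
def allen_relation(a_start, a_end, b_start, b_end):
    """Name the Allen relation of interval A to interval B (start < end assumed)."""
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    if b_start < a_start < b_end < a_end:
        return "overlapped-by"
    if b_end == a_start:
        return "met-by"
    return "after"


# A two-day meeting compared with the month of November 2008.
relation = allen_relation("2008-11-03", "2008-11-05", "2008-11-01", "2008-11-30")
```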
A Simple Event Ontology
  A type (RDFS++ reasoning)
    Meetings, communication events, financial transactions, visits, attacks/truces, an insurance claim, a purchase order
  A list of actors (Social Network Analysis)
  A place (GeoSpatial Reasoning)
  A start time and possibly an end time (Temporal Reasoning)
  Anything else that describes the event
    Goods that changed hands
Activity Recognition
  Our customers use AllegroGraph as an event database with social network analysis and geospatial and temporal reasoning
  Find all meetings that happened in November within 5 miles of Berkeley that were attended by the most important person in Jans' friends and friends of friends.
    (select (?x)
      (ego-group person:jans knows ?group 2)              ; SNA
      (actor-centrality-members ?group knows ?x ?num)     ; SNA
      (q ?event fr:actor ?x)                              ; DB lookup
      (qs ?event rdf:type fr:Meeting)                     ; RDFS
      (interval-during ?event "2008-11-01" "2008-11-06")  ; Temporal
      (geo-box-around geoname:Berkeley ?event 5 miles)    ; Spatial
      !)
Comparing RDB, HBase, Triple Stores
                                  RDB    HBase    Modern Triple Store
  Transactions                     +       -              +
  ACID                             +       -              +
  Concurrent & Dynamic             +       +              +
  Random Access                    +       +              +
  Flexibility                      -       -              +
  High Availability                +       +              +
  Joins / complex graph search     -       -              +
  Structured + Unstructured        -       -              +
  Scalability                      +       +              +
On the road map
  We now routinely handle 20 billion triples on a big blade machine
  Expect a trillion triples in December
  Problems we are solving
    Keep it ACID
    Smart partitioning
    Smart (re-)balancing
    Smart indexing
    Query pipelines: do (a part of) a query where the data is
    Parallel query execution
Reasoning
