Semantic Technology: The Basics

Jans Aasman, Ph.D., CEO, Franz Inc.

Contents
- The basics about triples and metadata
- Storing triples, and the difference between an RDB and a triple store
- Linked Open Data, and we learn how to SPARQL
- Where do little triples come from, and an example of entity extraction
- Tim Berners-Lee’s dream come true: RDFa, ontologies, another example
- What do companies use triple stores for? And why do companies work with triple stores?
- Can’t I do that with a NoSQL database like Hadoop, Bigtable, HBase, Cassandra…?
- General requirements for RDB, triple stores and Big Data
- What about rules and reasoning?
- What about geospatial, temporal and social network analysis?

Caveat: I will talk mostly about the AllegroGraph tool suite, but 90% of what I discuss will also work with Virtuoso, Oracle, OWLIM, Jena and Sesame.

Trends in mainstream IT
Gartner Group’s 2008 list of the Top 10 Disruptive Technologies that will affect IT in the next five years:
- Multi-core and hybrid processors
- Virtualisation and fabric computing
- Meta Data and Semantic technologies
- Social networks and social software
- Cloud computing and cloud platforms
- Web mashups
- User Interface
- Ubiquitous computing
- Contextual computing
- Augmented reality

The whole Semantic Web rests on a little thingy: The triple

Subject, predicate, object. Add a graph: quads.
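As a rough illustration (plain Python, not the API of any triple store), a triple can be modelled as a three-part record and a quad as the same record plus a graph name. The ex: identifiers are made up for the example.

```python
from typing import NamedTuple, Optional


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: Optional[str] = None  # None: no named graph, i.e. a plain triple


# Hypothetical data; the ex: names do not come from the slides.
triple = Quad("ex:Jans", "ex:worksFor", "ex:Franz")
quad = Quad("ex:Jans", "ex:worksFor", "ex:Franz", graph="ex:staffDirectory")


def as_triple(q: Quad) -> tuple:
    """Drop the graph component and keep subject, predicate, object."""
    return (q.subject, q.predicate, q.obj)
```

The same statement can live in several graphs, so two quads may differ only in their fourth part.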

N-Triples format
- Subject: resource or blank node
- Predicate: only resource
- Object: resource, literal or blank node
Regular expression for the three kinds of terms:
(?:(?<=<)[^>]+(?=>))|(?:\"(?:(?:[^\"]|(?:\\\"))*)\"[^. ]*)|(?:_:[-_a-zA-Z0-9]+)
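A minimal sketch of how the regular expression on this slide could be used from Python to split one N-Triples line into its terms. The function name and the sample line are assumptions for illustration. It is only a rough tokenizer: the suffix after a literal stops at the first dot or space, so a datatype URI is not captured whole.

```python
import re

# Regex taken from the slide; the \" escapes are written as plain quotes here.
NT_TOKEN = re.compile(
    r'(?:(?<=<)[^>]+(?=>))'                 # resource: the text between < and >
    r'|(?:"(?:(?:[^"]|(?:\\"))*)"[^. ]*)'   # literal, plus a suffix such as @en
    r'|(?:_:[-_a-zA-Z0-9]+)'                # blank node
)


def split_ntriple(line: str) -> list:
    """Return the terms found in one N-Triples line, in order of appearance."""
    return NT_TOKEN.findall(line)


# Hypothetical input line.
parts = split_ntriple('_:b1 <http://example.org/name> "Jans"@en .')
```

For this line, `parts` holds the blank node, the resource without its angle brackets, and the literal with its language tag.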

But don’t worry: most triple stores will read RDF/XML, N-Triples and N3. And otherwise you just use rapper…

How can I read up on triples and meta data?

A triple store: how do you store triples, and what is the difference between a relational database and a triple store?

An artist’s impression of a db for persons January, 2008

Comparing RDB, HBase, Triple Stores

                               RDB   HBase   Modern Triple Store
Transactions                    +      -              +
ACID                            +      -              +
Concurrent & Dynamic            +      +              +
Random Access                   +      +              +
Flexibility                     -      -              +
High Availability               +      +              +
Joins / complex graph search    -      -              +
Structured + Unstructured       -      -              +
Scalability                     +      +              +

Can you give us some examples?
- Linked Open Data and Pharmaceutical demo
- Entity Extraction and a Triple Store
- Tim Berners-Lee’s dream comes true: RDFa – Yahoo/Bing/Google/Overstock

Federated 11 linked data sets
- We took 5 public databases: DrugBank, DailyMed, Clinical Trials, Diseasome, and Sider.
- Entities are mostly linked together through same-as relationships.
- Using some entity extraction we created more links and triples: CT-discusses-drug, CT-discusses-side-effect, CT-discusses-target, CT-discusses-disease.
- Alitora did some extensive NLP and entity extraction on Rheumatoid Arthritis: CT-mentions-genes.
- And to facilitate search through schema space: Schema-connections.

DrugBank A repository of almost 5000 FDA-approved small molecule and biotech drugs. Contains detailed information about drugs including chemical, pharmacological and pharmaceutical data; along with comprehensive drug target data such as sequence, structure, and pathway information.

LinkedCT: Clinical Trials Up-to-date information for locating federally and privately supported clinical trials for a wide range of diseases and conditions. It contains 81,571 trials sponsored by the National Institutes of Health, other federal agencies, and private industry. ClinicalTrials.gov receives over 40 million page views per month and 50,000 visitors daily.

Diseasome Publishes a network of 4,300 disorders and disease genes linked by known disorder-gene associations for exploring all known phenotype and disease gene associations, indicating the common genetic origin of many diseases. The list of disorders, disease genes, and associations between them was obtained from the Online Mendelian Inheritance in Man (OMIM), a compilation of human disease genes and phenotypes.

DailyMed Published by the National Library of Medicine; provides high-quality information about marketed drugs. DailyMed provides much information, including general background on the chemical structure of the compound and its therapeutic purpose, details on the compound's clinical pharmacology, indication and usage, contraindications, warnings, precautions, adverse reactions, overdosage, and patient counseling.

Sider Contains information on marketed drugs and their adverse effects. The information is extracted from public documents and package inserts.

Finding entities in Clinical Trials
CT has too much text. We searched for drugs, diseases, targets and side effects in Clinical Trials and created new triples:
CT100385 discusses-side-effect headache
CT100385 discusses-drug aspirin
CT100385 discusses-disease alcohol-addiction
CT100385 discusses-target some-protein
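The slide does not say how the search was done. As a sketch only, assuming a simple dictionary lookup (real entity extractors do far more), new triples could be produced like this. The sample text and the vocabulary are invented; only the trial ID and predicate names come from the slide.

```python
def extract_triples(trial_id, text, vocab):
    """Naive lookup: emit one triple for every known term found in the text."""
    lowered = text.lower()
    found = []
    for predicate, terms in vocab.items():
        for term in terms:
            if term in lowered:
                found.append((trial_id, predicate, term))
    return found


# Hypothetical vocabulary and trial text, for illustration only.
vocab = {
    "discusses-drug": ["aspirin", "lipitor"],
    "discusses-side-effect": ["headache", "nausea"],
}
text = "Patients taking Aspirin frequently reported headache."

new_triples = extract_triples("CT100385", text, vocab)
```

Substring matching like this produces false hits on short terms, which is one reason real extractors use tokenization and disambiguation.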

Combined with advanced entity extraction CT-mentions-drugs CT-mentions-genes CT-mentions-side-effects (currently only for

Interesting queries (SPARQL)
SPARQL:
- Give me the title of all clinical trials that discuss the drug Lipitor and the side effect “Diabetes type 2”.
- Give me clinical trials that discuss Rheumatoid Arthritis, and give me the genes and drugs discussed.
Prolog:
- Find all clinical trials that resemble clinical trial NCT00130091 given diseases, drugs, targets, and side effects.
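To show the shape of these queries without a triple store at hand, here is a small Python stand-in over an in-memory set of triples. The first query is a join of two patterns on the trial; for "resemble" the slide gives no formula, so Jaccard overlap of the (predicate, object) pairs is used here purely as an assumed measure. All data except the ID NCT00130091 is invented.

```python
# Hypothetical toy data; NCT-A and NCT-B and every link below are invented.
triples = {
    ("NCT00130091", "discusses-drug", "Lipitor"),
    ("NCT00130091", "discusses-side-effect", "Diabetes type 2"),
    ("NCT-A", "discusses-drug", "Lipitor"),
    ("NCT-A", "discusses-side-effect", "Diabetes type 2"),
    ("NCT-A", "discusses-disease", "Rheumatoid Arthritis"),
    ("NCT-B", "discusses-drug", "aspirin"),
}


def subjects(store, predicate, obj):
    """All subjects that have the given predicate and object."""
    return {s for s, p, o in store if p == predicate and o == obj}


# Conjunctive query: both patterns must hold for the same trial.
both = subjects(triples, "discusses-drug", "Lipitor") & subjects(
    triples, "discusses-side-effect", "Diabetes type 2"
)


def features(store, trial):
    """The (predicate, object) pairs attached to one trial."""
    return {(p, o) for s, p, o in store if s == trial}


def resemblance(store, a, b):
    """Jaccard overlap of two trials' features (an assumed similarity measure)."""
    fa, fb = features(store, a), features(store, b)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0
```

In SPARQL the first query would be two triple patterns sharing one variable; the set intersection above plays that role.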

Where do triples come from?
- By user programming: (loop repeat 1000 do (add-triple (ran 100) (ran 100) (ran 100)))
- From relational databases
- From entity extractors
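A sketch of the first two sources in Python. It assumes that (ran 100) on the slide means a random integer below 100, and it uses one simple, hypothetical mapping from a relational row to triples (one triple per non-key column). The table and column names are invented.

```python
import random


def random_triples(n, upper=100, seed=None):
    """Counterpart of the slide's loop: n triples of random integers below upper."""
    rng = random.Random(seed)
    return [
        (rng.randrange(upper), rng.randrange(upper), rng.randrange(upper))
        for _ in range(n)
    ]


def row_to_triples(table, key, row):
    """One triple per non-key column: (table/key-value, column, value)."""
    subject = f"{table}/{row[key]}"
    return [(subject, column, value) for column, value in row.items() if column != key]


# Hypothetical relational row.
row = {"id": 7, "name": "Alice", "city": "Oakland"}
person_triples = row_to_triples("person", "id", row)
generated = random_triples(1000, seed=1)
```

The row mapping makes the primary key part of the subject, so every column value of that row hangs off the same node.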

Guided Interaction Advisor: visualization of the solution
Triple store with business concepts: events from many source systems are transformed into a set of related business concepts.
Diagram labels: Chronology of events; Interactions; Bills; Orders; Payments; Collections; Charge dispute; Individual Customer; Pay instructions; Device Activated; Device heartbeat; Subscriptions; Device changes; Many events.

Guided Interaction Advisor: visualization of the solution (charge dispute)
- Probabilistic assessment is performed.
- Corporate policy rules determine actions.
Diagram labels: Plan Overage; Bill greater than last month; Prorates; Roaming charges; Third Party Charges; Abnormal fee; Rate increases; Charge dispute; First bill; Past due amount; Pay bill; Customer Cancellation; Reactivation; Device activation; Device education; Device exchange; Device Lost; Device not working; Device resume; Service data not working; Service text not working; Service voice not working; Subscription cancellation; Wrong plan.
Chronology of events: Device Activation; Third Party Charges; Abnormal fee.

Entity Extraction: from
- People (has-people), and their roles
- Places (has-places), and the county, state, country they are in
- Organizations (has-organizations): government departments, company names, etc.
- Main Categories (has-domains): politics, sports, ministries, energy, finance, economics, ecology, oil, mining industry, etc.
- Main Concepts (has-main-groups): other important nouns and phrases in a text

How would you do this with your standard search engine?
- Give me a newspaper text with a Republican and a Democrat that serve on two subcommittees that have the same parent committee.
- Which [Democrat|Republican] is most vocal in the oil spill disaster?
- Given this text, find all the other texts that have the same people and the same main topics but no Democrats in the text.
- Which [Democrat|Republican|senator|representative] got the most attention in the last week?
- Give me the distribution of the most important topics yesterday.

The process
- We spider more than 300 online newspapers and thousands of blogs daily,
- and search specifically for all the members of the Senate, the House of Representatives and the executive branch.
- Apply Cogito to the text and extract main concepts: about 150 triples per text…
- Hook up these concepts with a detailed database of each politician.

Demo
- Looking at the main properties of a text
- Full text indexing
- Simple SPARQL and Prolog queries
- Finding related articles
- Finding connections between two texts

How Tim Berners-Lee’s dream came true

Vocabularies, Thesauri, Taxonomy, Ontologies
- Vocabularies: the heart of linking
    bc:Citi rdf:type bc:VocabularyEntry
- Thesauri: linking variants to vocabulary
    bc:Citi bc:hasAlternativeName ‘Citi Group’
- Taxonomy: finding the hierarchy in your data
    bc:Banamex bc:partOf bc:Citi
- Ontology: types, subtypes, constraints
    bc:Citi rdf:type bc:Bank
    bc:Bank rdf:type owl:Class
    bc:Bank rdfs:subClassOf bc:Company
    bc:Company rdfs:subClassOf bc:Organization
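What the ontology triples buy you is inference: from the statements on this slide a reasoner can conclude that bc:Citi is also a bc:Company and a bc:Organization. A minimal sketch of that one rule (rdfs:subClassOf is transitive and types are inherited), written in plain Python rather than with a real reasoner:

```python
triples = {
    ("bc:Citi", "rdf:type", "bc:Bank"),
    ("bc:Bank", "rdfs:subClassOf", "bc:Company"),
    ("bc:Company", "rdfs:subClassOf", "bc:Organization"),
}


def superclasses(store, cls):
    """All classes reachable from cls by following rdfs:subClassOf."""
    seen, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for s, p, o in store:
            if s == current and p == "rdfs:subClassOf" and o not in seen:
                seen.add(o)
                todo.append(o)
    return seen


def types_of(store, instance):
    """Asserted types plus the types inferred through the class hierarchy."""
    direct = {o for s, p, o in store if s == instance and p == "rdf:type"}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(store, cls)
    return inferred
```

Only the first type is stored; the other two are derived at query time, so a query for all organizations also finds Citi.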

Schema Spaces
Create Schema Connection Spaces
- Take original RDB schemas and syntactically transform to RDF and RDFS
    bc:customer1 rdf:type bc:table
    bc:customerID1 rdf:type bc:columnName
    bc:customerID1 bc:dataType bc:long
- Annotate with origin
    bc:customer1 bc:fromDB bc:ERP1
- Annotate with connections to other schemas
    bc:customer1 bc:relatesTo bc:customer2

What do companies use triple stores for?

Bill Guin, CTO, Amdocs: Semantic Real Time Intelligent Decision Automation. Thursday, 12:45, Grand A
Vijay Bulusu: Knowledge Sharing Using Semantic Technologies. Thursday, 3:30, Franciscan A

And why do these companies work with triple stores?
- When you need ultimate flexibility
    - Modeling knowledge and assets
    - Hundreds to thousands of classes with different features
    - Every day new classes and new features
    - You work with rules and reasoning
- When you need ultimate ‘linkability’
    - For (ad hoc) integration of databases
- When you need pattern recognition and network analysis
    - Complex networks of people, companies, products, etc.
- When you need event processing using geospatial, temporal reasoning and social network analysis combined with flexible metadata

Can’t I do that with a NoSQL database like Hadoop, HBase, Cassandra…?
“HBase is an open-source, distributed database modeled after Google’s BigTable and written in Java. It is developed as part of Apache Software Foundation's Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. ...” (en.wikipedia.org/wiki/Hbase)
It sacrifices ACID-ness and complex joins for web-scale scalability.

Classes of applications and their natural database
- Regular enterprise applications: relational databases
- Web-scale shallow objects: Hadoop, HBase, Cassandra, etc.
- Complex metadata applications: triple stores

Geospatial, temporal and social network analysis with a triple store…

Social Network Analysis
Answers 4 questions:
- How far is P1 from P2 (and how strong is the relation)?
- To what groups does this person belong (ego groups, cliques)?
- How important is this person in the group?
- Does this group have a leader, and how cohesive is it?
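Three of these questions can be sketched in a few lines of Python over a small "knows" network. This is not AllegroGraph's SNA library: distance is a breadth-first search, the ego group is everyone within a given number of hops, and importance is approximated with degree centrality, which is just one simple choice among many. The people are invented.

```python
from collections import deque

# Hypothetical, symmetric "knows" network.
knows = {
    "jans": {"ann", "bob"},
    "ann": {"jans", "carl"},
    "bob": {"jans"},
    "carl": {"ann", "dina"},
    "dina": {"carl"},
}


def distance(graph, start, goal):
    """Shortest number of hops between two people, or None if unconnected."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        person, hops = queue.popleft()
        if person == goal:
            return hops
        for friend in graph.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None


def ego_group(graph, person, depth):
    """Everyone within depth hops of person, including the person."""
    group, frontier = {person}, {person}
    for _ in range(depth):
        frontier = {f for p in frontier for f in graph.get(p, ())} - group
        group |= frontier
    return group


def degree_centrality(graph, group):
    """Share of the other group members each person is directly linked to."""
    others = len(group) - 1
    if others <= 0:
        return {}
    return {p: len(graph.get(p, set()) & group) / others for p in group}
```

With depth 2, the ego group of "jans" is his friends and their friends, which is the group the later activity-recognition query ranks by centrality.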

GeoSpatial
Make the following super efficient:
- Where did something happen?
- How far was event1 from event2?
- Find all the events that occurred in a bounding box or radius of M miles.
- Do these two shapes overlap?
- Find all the objects in the intersection of two shapes.
On a very large scale, when things don’t fit in memory: millions of events and polygons.
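The questions themselves are easy to state in code; the hard part a geospatial index solves is answering them without scanning every event. The sketch below is the naive, scan-everything version in Python, using the haversine formula for distance. The coordinates are approximate and the event names are invented.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi, dlam = p2 - p1, radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def within_radius(events, lat, lon, miles):
    """Names of events at most `miles` from the given point (linear scan)."""
    return [
        name
        for name, (elat, elon) in events.items()
        if haversine_miles(lat, lon, elat, elon) <= miles
    ]


def in_bounding_box(point, south, west, north, east):
    """True if the (lat, lon) point lies inside the box (no date-line handling)."""
    lat, lon = point
    return south <= lat <= north and west <= lon <= east


# Hypothetical events with approximate coordinates.
BERKELEY = (37.8716, -122.2727)
events = {
    "oakland_meeting": (37.8044, -122.2712),
    "san_jose_meeting": (37.3382, -121.8863),
}
nearby = within_radius(events, BERKELEY[0], BERKELEY[1], 5)
```

A linear scan like this stops being practical at millions of events, which is where a spatial index earns its keep.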

Temporal Reasoning
- Adhere to our convention to encode StartTimes and EndTimes and enjoy efficient temporal primitives.
- Implementation of Allen’s interval logic primitives.
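Allen's interval logic distinguishes thirteen possible relations between two intervals. As a sketch of what such primitives compute (not the store's own implementation), the function below classifies a pair of (start, end) intervals; it assumes start < end and works on any comparable values, including ISO date strings.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a to interval b.

    Each interval is a (start, end) pair with start < end.
    """
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met-by"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"


# Hypothetical event and query window, as ISO date strings.
event = ("2008-11-03", "2008-11-04")
window = ("2008-11-01", "2008-11-06")
relation = allen_relation(event, window)
```

ISO dates compare correctly as strings, which is why no date parsing is needed here.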

A Simple Event Ontology
- A type (RDFS++ reasoning): meetings, communications event, financial transactions, visit, attack/truce, an insurance claim, a purchase order
- A list of actors (Social Network Analysis)
- A place (GeoSpatial Reasoning)
- A start time and possibly an end time (Temporal Reasoning)
- Anything else that describes the event: goods that changed hands

Activity Recognition
Our customers use AllegroGraph as an event database with social network analysis and geospatial and temporal reasoning.
Find all meetings that happened in November within 5 miles of Berkeley that were attended by the most important person in Jans’ friends and friends of friends.

    (select (?x)
      (ego-group person:jans knows ?group 2)               ; SNA
      (actor-centrality-members ?group knows ?x ?num)      ; SNA
      (q ?event fr:actor ?x)                               ; DB lookup
      (qs ?event rdf:type fr:Meeting)                      ; RDFS
      (interval-during ?event "2008-11-01" "2008-11-06")   ; Temporal
      (geo-box-around geoname:Berkeley ?event 5 miles)     ; Spatial
      !)

On the road map
- We now routinely do 20 billion triples on a big blade machine.
- Expect a trillion triples in December.
Problems we are solving:
- Keep it ACID
- Smart partitioning
- Smart (re-)balancing
- Smart indexing
- Query pipelines: do (a part of) a query where the data is
- Parallel query execution
