Applied semantic technology and linked data
Mapping a human brain generates petabytes of gene listings and the corresponding locations of those genes throughout the brain. Because of the size of this dataset, a prototype Semantic Web application was created with the ability to link new datasets from similar fields of research and present the resulting models to an online community. The application presents a large set of gene-to-location mappings and provides new information about diseases, drugs, and side effects in relation to the genes and areas of the human brain.

In this presentation we discuss the normalization processes and tools for adding new datasets, the user experience throughout the publishing process, and the underlying technologies behind the application, and we demonstrate the preliminary use cases of the project.

  • Hello, my name is William Smith, and today we will be talking about a project near and dear to my heart. I served as project manager for a prototype application, worked closely with two German teams, and we were the first customer for several of the tools used to assemble this application. I was also the chief integration point into Vulcan, so I am well aware of the technologies, code bases, and data sets that went into assembling this project…
  • So what are we discussing today? First and foremost, this was a project for an internal organization at Vulcan involved in mapping the human brain. That, of course, generates petabytes of data and millions of triples' worth of gene mappings – but we took a smaller slice of a couple hundred thousand genes for the initial prototype. There were also several parallel research programs generating data in a format we could use, and a conference of industry professionals was held to find the interlinking pieces of these datasets. Finally, I'm going to walk through the data pipeline, the application itself, and a set of our original use cases.
  • Why? Well, a core problem in neurobiology, and in most sciences for that matter, is the inability to share and author sets of data across projects by industry professionals. This leaves an odd gap where people with computer science degrees are linking data they don't fully understand, and the people who understand the data don't have the ability to add the interlinks that would give greater vision into the data. With this problem known, our original prototype soon expanded into how we get these tools into the hands of the research community, and that in itself created three core questions: ownership, authorship, and publishing provenance of the newly linked data.
  • The organization that chartered this project and provided the original data sets is the Allen Institute for Brain Science – or AIBS. When you hear me say AIBS, I'm referring to this organization. It was launched in 2003 by Paul G. Allen and has the explicit focus of mapping the human brain to accelerate our understanding of the brain and neurological systems. Furthermore, the institute is a 501(c)(3) nonprofit medical research organization employing hundreds of neuroscientists, molecular biologists, informaticists, and engineers in the Seattle area.
  • And this is the Institute's core product… or rather, several screenshots of the core product. Here we have gene heat maps, some location data, and where it all sits within the human brain. As odd as those screen captures look, they are accessed by thousands of researchers daily, and this is considered a major success. It's open: the public can go to this site right now and browse the catalog. There are currently three human brains fully mapped, with a fourth in progress. Each of these donors has generated a genomic analysis of brain structure and a thorough catalog of genes with respect to location. While the captions are small, they are part of a much larger suite of atlas navigation tools with several components – for example, a heat map pinpointing genes expressed down to the cellular level. And most importantly, for our purposes, they generate terabytes of data with industry-wide IDs we can link to other sources!
  • And here's our prototype in screenshots. No page is hand-typed and no graph is hand-entered: four static templates pull data from our normalized mine, creating all these pretty pictures and full pages of text. There are over 30 thousand of these pages. We will be discussing the first two points in depth – RDF and the LDIF pipeline. The charting tools use SPARQL, which we will not be discussing in depth – however, I have a hidden slide of the details should somebody be really malicious and want to ask about SPARQL queries. Finally, our navigation closely resembles the common MediaWiki installation, which everybody who has been on the internet in the last 10 years is familiar with… editing, on the other hand, is very different, and currently only bots create and maintain the pages.
  • Which brings us to these parallel tracks of research data I keep mentioning. To choose these sets, we had a conference of industry researchers and data professionals go through the hundreds of biology mines looking for useful projects that closely relate to genes found in the human brain. The four prototype sets chosen were KEGG, Diseasome, DrugBank, and SIDER.
  • Our original cross-section of data found these connections. Not the full dump, but with roughly 15 thousand gene connections, plenty of pages produced relevant connections and were filled with interesting data points. And to the right we have our simplified ontology. Looks incredible, right… hey, they can't all be winners, and don't blame me – blame Protégé. This was generated with basic one-to-one relations and domain-range logic, where applicable. The simplification was created in part because nobody who does anything in neuroscience agrees with another person who does the same thing. We could get them to agree that, in some gray-area way, these things are related at the domain-range level… so that generates this, and it looks far worse if I try to spread the boxes out in any other way.
  • Which brings us to the pretty graph I hate… because it makes unifying things into that ugly Protégé graph look easy. It's not, but it does give a good overall view of what we were able to convert directly into the wiki: 32,900 instances turned directly into pages, with over 500 thousand properties across the set. Even more important, after "same as" connections were made we had 20 thousand fully populated pages – and these are the pages with connections across the datasets. That brings up an important point: if I imported all of the gene data I would end up with a huge wiki by page count, but the better part of those pages would be nothing more than a page title and empty templates. Hence the importance of finding these connections and only tracking the useful data points – that is, pages with more than a title. On the right we have the simplified process, which I will be going into in more detail very soon.
  • And those parts that just turned red are the process we will be discussing, in a section I like to call: Linked Data Integration Framework.
  • LDIF was created over the last four years by the Free University of Berlin – the same team that helped build the prototype, and we were their first customer. It is still active, with the last update in late 2013, and it has two main components: R2R and SILK.
  • And this is why I don't like the oversimplification of that process chart: there are plenty of difficult computer science problems here, and none of them are cut and dried to solve. Assuming we can find overlapping data sources, you then have to unify vocabularies – the predicate of the triple. Once this is done and you can agree on what the name of the entity is, you will still have data sets with the same entity going by a range of names and IDs. Finally, once you've located the same entities, there's no guarantee the normalized vocabularies will be referencing the same value. Without the normalization pipeline – LDIF – this creates queries that are siloed to a specific data set, basically creating an API… and that's good for companies like Facebook and Google, but terrible for independent research. The last point is less of a problem for us because we decided long ago this was a philanthropic prototype with 501(c)(3) data – but it is something to be considered when working with, say, national security data.
  • Lucky for us, as customer number one of the LDIF framework, we get to test all of the steps in normalization and hope for the best, or fix it ourselves! If this works right we will…
  • And here's the LDIF architecture. All this stuff on the bottom is the five data sets; the arrows don't really apply, because the sets didn't link up that well before LDIF, and then it all goes into the pipeline. After processing and re-releasing, the arrows do apply, and we load everything into our own public triple store for use in the application.
  • And here’s your application.
  • Pubby was created five years ago by the Free University of Berlin and is used in DBpedia. There is no search – you have to follow links – it is not a very modern viewing experience, and there is no expression of the data via links.
  • Less than helpful – FINE.
  • Well, I am in this business to please the consumer, and my consumer understands common web architectures – even if they don't know they do – so let's try an installation of Semantic MediaWiki. Invented roughly five years ago, it's a series of plugins that run on MediaWiki, which was created by the good folks who invented Wikipedia! Millions of people see it every day while researching homework they don't feel like doing, when sloppily referencing college term papers, or – in my opinion – creating one of the most accurate and comprehensive encyclopedias humanity has to date. Even better, we can display the semantic properties of our normalized data inline! Of course I can.
  • I'm going to build you four base templates by category – Gene, Drug, Disease, and Side Effect. These templates will have the base information displaying our semantic properties.
  • This created a problem – namely, how do I create 30,000 pages and not get fired for entering data over the course of two years? So, a lot of what you see on Wikipedia isn't actually input or maintained by humans. The gene pages all have very complex infoboxes tracking IDs, regions, and a variety of known properties mined from other sources. The pieces of code that do this mining and page creation are called wiki-bots. We wrote a wiki-bot to create our 30,000 pages, one for each page type, and this is the creation pipeline these bots utilized.
  • I'll be running through three core use cases we used to test the project and explaining how the pages and graphs were generated. All of the graphs related to the genes, diseases, drugs, and side effects within the next few slides are generated from the wiki. However, it's far easier to view the wiki when you have access behind the Vulcan firewall… so I had to rely on screenshots for this portion.
  • Calcium is a difficult use case: it is found within all creatures and has lots of connections to other entities, but we don't want to create all of those pages.
  • Propofol had its 15 minutes of fame five years ago. It is a powerful sedative used in anesthesiology – you should not use it as a sleep aid – and it was listed as the cause of death for a popular musician.
  • Fix this
  • Finally, we head over to DrugBank and search for an obscure drug page… Bicalutamide. It's an oral antiandrogen used in the treatment of cancer that affects the androgen receptor, thus validating our links across the data. This is an example of how a not-so-simple correlation of data can give researchers deeper vision by merging sets and presenting the interlinks.
  • Aura wiki – it was used to test crowdsourcing of data authoring for a proto-AI.

Presentation Transcript

  • APPLIED LINKED DATA AND SEMANTIC TECHNOLOGY Expanding a Neurobiology Dataset
  • Today we are discussing… • What is the use case and who requested it? • How do you import and normalize thousands of RDF triples worth of gene data? • How do we enrich the normalized gene data with parallel research data sets? • Creating instance pages without knowing exactly what will be displayed on them. • Demonstration of the initial use cases • Question and answer session
  • Why? • Prototype: How do we assemble the data mine and refine the authoring tools? How do we expand this to the research community? • How do we expand ownership of the data to research professionals? • How do we build systems in a way that research professionals can author and link the data? • How do we publish these new relationships to the wider research community?
  • What is the Allen Institute for Brain Science? • Launched in 2003 with seed funding from founder and philanthropist Paul G. Allen. • Serving the scientific community is at the center of our mission to accelerate progress toward understanding the brain and neurological systems. • The Allen Institute's multidisciplinary staff includes neuroscientists, molecular biologists, informaticists, and engineers. “The Allen Institute for Brain Science is an independent 501(c)(3) nonprofit medical research organization dedicated to accelerating the understanding of how the human brain works.”
  • Human Brain Map • Open, public online access • A detailed, interactive three-dimensional anatomic atlas of the "normal" human brain • Data from multiple human brains • Genomic analysis of every brain structure, providing a quantitative inventory of which genes are turned on where • High-resolution atlases of key brain structures, pinpointing where selected genes are expressed down to the cellular level • Navigation and analysis tools for accessing and mining the data
  • Biological Linked Data Map • Open, public online access • Data from multiple RDF data stores • Complete import pipeline using the LDIF framework • Outlines of each imported instance embedding inline wiki properties and providing views of imported properties from original RDF datasets • Charting tools that "pivot" SPARQL queries, providing several views of each query • Navigation and composition tools for accessing and mining the data
  • Where did we get the data? • KEGG : Kyoto Encyclopedia of Genes and Genomes • “KEGG GENES is a collection of gene catalogs for all complete genomes generated from publicly available resources, mostly NCBI RefSeq.” • Diseasome • “The Diseasome website is a disease/disorder relationships explorer and a sample of an innovative map-oriented scientific work. Built by a team of researchers and engineers, it uses the Human Disease Network data set.” • DrugBank • “The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug target information.” • SIDER • “SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and package inserts.”
  • New ontology map for import • Genes • • • • Diseases • • • DrugBank : 4,772 KEGG : 2,482 SIDER : 924 Effects • • Diseasome : 4,213 KEGG : 459 Drugs • • • • DrugBank : 4,553 Diseasome : 3,919 KEGG : 9,841 SIDER : 1,737 Pathways • KEGG : 28,442 We chose to intentionally simplify the ontology due to disagreements between researchers about entity relationships and subclasses.
  • Importing and mapping the Linked Data • R2R: 32,900 instances were converted to the wiki ontology; 583,746 properties mapped. Pathways were ignored for the wiki ontology import, but are available within the triple store's KEGG Pathway graph. • SIEVE: 20,849 instances available in the wiki ontology after SILK normalization. Instance merging affected drugs, genes, and diseases across datasets. (Diagram: Download → Networked Storage / Local Storage → R2R Mapping Engine maps entities to the new ontology → Sieve Mapping Engine normalizes entities across data sources → Triple Store, available with SPARQL queries and SPARQL Update → Import to Wiki.)
  • LDIF: LINKED DATA INTEGRATION FRAMEWORK Expanding a Neurobiology Dataset
  • Linked Data challenges • Data sources that overlap in content may: • Use a wide range of different RDF vocabularies • Use different identifiers for the same real-world entity • Provide conflicting values for the same properties • Implications • Queries become hand crafted for a specific RDF data set – no different than using a proprietary API. • Individual, improvised and manual merging techniques for data sets. • Integrating public datasets with internal databases poses the same problems
  • Linked Data Integration Framework • LDIF normalizes the Linked Data from multiple sources into a clean, local target representation while keeping track of data provenance. 1 Collect data: Managed download and update 2 Translate data into a single, target vocabulary 3 Resolve identifier aliases into local target URIs 4 Cleanse data and resolve conflicting values 5 Output to local file system or triple store
  • LDIF Pipeline – 1. Collect data • Supported data formats: RDF files (multiple formats), SPARQL endpoints, crawling Linked Data. (Component stack sidebar: 1 Collect data, 2 Translate data, 3 Resolve identities, 4 Cleanse data, 5 Output data.)
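To make the collect step concrete, here is a minimal Python/rdflib sketch that gathers several RDF dumps into one working graph. The file paths, formats, and the idea of holding everything in a single in-memory graph are illustrative assumptions; LDIF itself performs this step with managed download-and-update components.

```python
# Minimal sketch of the "collect" step, assuming the sources are available as
# RDF dump files (the paths and formats below are placeholders, not the
# project's real dump locations). SPARQL endpoints or crawling would need the
# extra plumbing that LDIF's own import components provide.
from rdflib import Graph

SOURCE_DUMPS = {
    "drugbank":  ("data/drugbank.nt",  "nt"),
    "sider":     ("data/sider.nt",     "nt"),
    "diseasome": ("data/diseasome.nt", "nt"),
    "kegg":      ("data/kegg.nt",      "nt"),
}

def collect(sources: dict) -> Graph:
    """Load every dump into one working graph for the later pipeline stages."""
    g = Graph()
    for name, (location, fmt) in sources.items():
        g.parse(location, format=fmt)
        print(f"loaded {name}: {len(g)} triples so far")
    return g

if __name__ == "__main__":
    working_graph = collect(SOURCE_DUMPS)
```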
  • LDIF Pipeline – 2. Translate data • Sources use a wide range of different RDF vocabularies. (Diagram: dbpedia-owl:City, schema:Place, fb:location.citytown → R2R → location:City.)
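A toy version of the translate step, reusing the slide's City example: class assertions from several source vocabularies are rewritten into one target vocabulary. In the real pipeline this is declared as R2R mappings; the hand-rolled Python below, and the target namespace it uses, are only illustrative.

```python
# Toy "translate" step: rewrite rdf:type statements from several source
# vocabularies onto a single target class, as R2R mappings do declaratively.
# Namespace URIs approximate the prefixes shown on the slide; the target
# vocabulary (LOCATION) is a placeholder.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

DBPEDIA_OWL = Namespace("http://dbpedia.org/ontology/")
SCHEMA      = Namespace("http://schema.org/")
FB          = Namespace("http://rdf.freebase.com/ns/")
LOCATION    = Namespace("http://example.org/vocab/location/")

CLASS_MAP = {
    DBPEDIA_OWL.City:        LOCATION.City,
    SCHEMA.Place:            LOCATION.City,
    FB["location.citytown"]: LOCATION.City,
}

def translate(source: Graph) -> Graph:
    """Copy the graph, replacing mapped source classes with the target class."""
    target = Graph()
    for s, p, o in source:
        if p == RDF.type and o in CLASS_MAP:
            target.add((s, RDF.type, CLASS_MAP[o]))
        else:
            target.add((s, p, o))
    return target
```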
  • LDIF Pipeline – 3. Resolve identities • Sources use different identifiers for the same entity. (Diagram: "London" matched by SILK against London, England / London, MA, USA / London, TN, USA / London, TX, USA → London = London, England.)
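SILK expresses record linkage as declarative link specifications combining several comparison rules (labels, coordinates, shared links). The sketch below fakes the idea with a single string-similarity rule: a messy source label is matched against candidate resources, and the best match above a threshold is emitted as an owl:sameAs link. The candidates, URIs, and threshold are made up for illustration.

```python
# Toy identity resolution in the spirit of SILK: link a messy source label to
# the best-matching candidate resource and record the match as owl:sameAs.
from difflib import SequenceMatcher
from rdflib import Graph, URIRef, Namespace
from rdflib.namespace import OWL

EX = Namespace("http://example.org/resource/")  # placeholder URIs

CANDIDATES = {
    "London, England": EX["London_England"],
    "London, MA, USA": EX["London_MA_USA"],
    "London, TN, USA": EX["London_TN_USA"],
    "London, TX, USA": EX["London_TX_USA"],
}

def link(label: str, subject: URIRef, threshold: float = 0.8) -> Graph:
    """Emit `subject owl:sameAs best_candidate` if similarity clears the threshold."""
    best_label, best_score = None, 0.0
    for cand_label in CANDIDATES:
        score = SequenceMatcher(None, label.lower(), cand_label.lower()).ratio()
        if score > best_score:
            best_label, best_score = cand_label, score
    links = Graph()
    if best_label is not None and best_score >= threshold:
        links.add((subject, OWL.sameAs, CANDIDATES[best_label]))
    return links

if __name__ == "__main__":
    # A label like "london england" links cleanly; the bare "London" of the
    # slide would need more comparison rules than one string similarity.
    print(link("london england", EX["London"]).serialize(format="nt"))
```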
  • LDIF Pipeline – 4. Cleanse data • Sources provide different values for the same property. (Diagram: "London, England has a population of 8.174M people" vs. "London, England has a population of 9.2M people" → SILK → rdfs:population: 8.174M.)
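When sources disagree, as in the slide's two population figures, the cleanse step has to pick or fuse one value. The sketch below shows one simple policy, "keep the value from the most trusted source"; the trust scores and source names are purely illustrative.

```python
# Toy conflict resolution: when sources disagree on a property value, keep the
# value from the most trusted source. Trust scores and source names are made up.
SOURCE_TRUST = {"national-statistics-dump": 0.9, "community-wiki-dump": 0.4}

def fuse(values):
    """values: list of (value, source) pairs for one subject + property."""
    return max(values, key=lambda pair: SOURCE_TRUST.get(pair[1], 0.0))[0]

conflicting = [("8.174M", "national-statistics-dump"), ("9.2M", "community-wiki-dump")]
print(fuse(conflicting))  # -> "8.174M", the value the slide shows surviving
```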
  • LDIF Pipeline – 5. Output data • Supported output formats: N-Quads, N-Triples, SPARQL Update Stream • Provenance tracking using Named Graphs
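Provenance tracking with named graphs means every output quad carries the graph, and therefore the source, it came from. A minimal rdflib sketch of N-Quads output with one named graph per dataset follows; the graph URIs and example triples are placeholders.

```python
# Minimal provenance-aware output: one named graph per source dataset,
# serialized as N-Quads. Graph URIs and example triples are illustrative.
from rdflib import Dataset, URIRef, Literal, Namespace

EX = Namespace("http://example.org/")
ds = Dataset()

drugbank = ds.graph(URIRef("http://example.org/graph/drugbank"))
drugbank.add((EX["Propofol"], EX["genericName"], Literal("Propofol")))

sider = ds.graph(URIRef("http://example.org/graph/sider"))
sider.add((EX["Propofol"], EX["sideEffect"], EX["Hypotension"]))

# Every line of the N-Quads output carries the graph name as its fourth term,
# so downstream consumers can always tell which dataset a statement came from.
print(ds.serialize(format="nquads"))
```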
  • LDIF Architecture
  • Normalized Linked Data is not always pretty.
  • Normalized Linked Data is not always pretty.
  • SELECT DISTINCT ?group1 ?item1 ?group2 ?item2 {
      GRAPH ?G {
        ?target drugbank:geneName "{{{1}}}" ;
                drugbank:geneName ?geneName .
        ?drug drugbank:target ?target ;
              drugbank:genericName ?item2 ;
              drugbank:affectedOrganism ?group2 .
      }
      GRAPH ?G1 {
        ?siderDrug sider:drugName ?item2 ;
                   rdfs:label ?group1 ;
                   sider:sideEffect ?effect .
        ?effect rdfs:label ?item1 .
      }
    }
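For completeness, here is a rough Python sketch of how a charting widget could run a pivot query like the one above against the triple store and group the rows into the inner and outer rings of a donut chart. The endpoint URL, the vocabulary prefixes, and the substitution of a concrete gene name (AR) for the template parameter {{{1}}} are assumptions, not the project's exact configuration.

```python
# Sketch of running the pivot query against a SPARQL endpoint and grouping the
# rows for a two-ring donut chart (drugs by affected organism, side effects by
# drug). Endpoint URL, prefixes, and the hard-coded gene name are placeholders.
from collections import defaultdict
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX sider:    <http://www4.wiwiss.fu-berlin.de/sider/resource/sider/>
PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?group1 ?item1 ?group2 ?item2 {
  GRAPH ?G {
    ?target drugbank:geneName "AR" .
    ?drug drugbank:target ?target ;
          drugbank:genericName ?item2 ;
          drugbank:affectedOrganism ?group2 .
  }
  GRAPH ?G1 {
    ?siderDrug sider:drugName ?item2 ;
               rdfs:label ?group1 ;
               sider:sideEffect ?effect .
    ?effect rdfs:label ?item1 .
  }
}
"""

def chart_data(endpoint="http://localhost:8890/sparql"):
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]

    inner = defaultdict(set)  # inner ring: drugs grouped by affected organism
    outer = defaultdict(set)  # outer ring: side effects grouped by drug
    for row in rows:
        inner[row["group2"]["value"]].add(row["item2"]["value"])
        outer[row["item2"]["value"]].add(row["item1"]["value"])
    return inner, outer
```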
  • Semantic MediaWiki Semantic MediaWiki is a full-fledged framework, in conjunction with many spinoff extensions, that can turn a wiki into a powerful and flexible knowledge management system. All data created within SMW can easily be published via the Semantic Web, allowing other systems to use this data seamlessly.
  • Four initial templates for each instance by category 1. Custom infobox within outline template • Visible inline properties 2. Outline template providing instance information 3. Widget template displaying dynamic charts or third party services • Donut charts and AIBS gene feed 4. Broad table SPARQL queries showing instance relationships 5. Hidden inline properties for other extensions
  • Creating instance wiki pages • The Triple Store now contained tens of thousands of recognized category instances; creating the pages requires a bot. 1. Fetch the RDF dumps from an active D2R server. 2. Use regex to fetch the rdf:label property that was mapped by R2R as an instance name. 3. Open a category-specific text file of wiki markup (a page of template includes). 4. Contact Neurowiki and request a new page for each name on the list, with the category content. (Diagram: 1.0 RDF Data Download → Sanitize Script → 2.0 Create CSV of Category Page Names → 3.0 Create MediaWiki Page via the MediaWiki Gateway rb framework REST interface, reading the text of wiki markup for the page instance → 4.0 Neurowiki Instance Page.)
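The project's bot used the Ruby mediawiki-gateway framework's REST interface; a comparable sketch in Python against the stock MediaWiki API is shown below. The regex, file names, URLs, and credentials are illustrative stand-ins for the pipeline's real pieces.

```python
# Sketch of the page-creation bot: pull rdf:label values out of an RDF dump,
# then create one wiki page per instance name with the category's template
# markup, using the standard MediaWiki API. All paths/URLs/credentials are
# illustrative; the project itself used Ruby's mediawiki-gateway framework.
import re
import requests

LABEL_RE = re.compile(r"rdfs?:label[^>]*>\s*([^<]+)<")  # crude regex over an RDF/XML dump

def page_names_from_dump(dump_path):
    """Steps 1-2: read the dump and collect distinct label strings as page names."""
    with open(dump_path, encoding="utf-8") as f:
        return sorted(set(LABEL_RE.findall(f.read())))

def create_pages(api_url, names, template_path, username, password):
    """Steps 3-4: read the category's wiki markup and create one page per name."""
    with open(template_path, encoding="utf-8") as f:
        category_markup = f.read()  # the same template includes for every page

    session = requests.Session()
    # Log in and fetch an edit (CSRF) token via the standard MediaWiki API.
    login_token = session.get(api_url, params={
        "action": "query", "meta": "tokens", "type": "login", "format": "json",
    }).json()["query"]["tokens"]["logintoken"]
    session.post(api_url, data={
        "action": "login", "lgname": username, "lgpassword": password,
        "lgtoken": login_token, "format": "json",
    })
    csrf_token = session.get(api_url, params={
        "action": "query", "meta": "tokens", "format": "json",
    }).json()["query"]["tokens"]["csrftoken"]

    for name in names:
        session.post(api_url, data={
            "action": "edit", "title": name, "text": category_markup,
            "token": csrf_token, "createonly": "1", "format": "json",
        })
```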
  • Final application stack • JavaScript view layer (Highcharts / SproutCore / jQuery) • Semantic MediaWiki • Triple Store (Virtuoso) • Relational Database (MySQL) • LDIF • AIBS REST API (gene heat map data) • Source datasets: AIBS, Diseasome, DrugBank, SIDER, KEGG
  • NEUROWIKI Expanding a Neurobiology Dataset
  • How are base entities like Calcium represented? 1. The wiki page and corresponding template components are rendered. 2. Relations are pulled from the normalized data store of linked data. 3. The JavaScript components are populated via a data feed. (Diagram: Drug Search → 1.0 Wiki Page, an aggregate page of components → 2.0 Calcium Relations from the Neurobase data stores → 3.0 Selected widget for display.)
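A hypothetical sketch of the middle stage, the per-entity "relations" feed the JavaScript widgets consume: everything the normalized store holds about one instance, grouped by property and returned as JSON. The entity URI, the grouping by property local name, and reading from a local file instead of the live store are all placeholders for illustration.

```python
# Hypothetical per-entity "relations" feed: collect everything the normalized
# store knows about one instance (e.g. Calcium), grouped by property local
# name, as JSON a chart or infobox widget could consume. URIs are placeholders.
import json
from collections import defaultdict
from rdflib import Graph, URIRef

def relations_feed(graph, entity_uri):
    entity = URIRef(entity_uri)
    grouped = defaultdict(list)
    for _, predicate, obj in graph.triples((entity, None, None)):
        local_name = str(predicate).rstrip("/#").rsplit("/", 1)[-1].rsplit("#", 1)[-1]
        grouped[local_name].append(str(obj))
    return json.dumps({"entity": entity_uri, "relations": grouped}, indent=2)

if __name__ == "__main__":
    g = Graph()
    g.parse("neurobase_sample.nt", format="nt")  # placeholder local export of the store
    print(relations_feed(g, "http://example.org/resource/Calcium"))
```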
  • How are base entities like Calcium represented? • Because so many organisms contain calcium the mappings to affected species were never created to conserve space in the data store. Drug and Disease Class Ratios of Calcium Inner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class
  • What are the dangers of Propofol? 1. Propofol DrugBank relations are rendered in corresponding JavaScript components. 2. The Diseasome disease relations show classes of illness Propofol affects. 3. An aggregate of SIDER side effects is rendered in relation to Propofol and disease classes. (Diagram: Drug Search → Neurobase data stores → 1.0 Propofol Relations → 2.0 Aggregate Components / Propofol Disease Relations → 3.0 Propofol Side Effects.)
  • What are the dangers of Propofol?
  • What are the dangers of Propofol?
  • What are the dangers of Propofol?
  • Which drugs are used in Chemotherapy? 1. Diseasome disease relations normalized by LDIF. 2. DrugBank and AIBS relations to genes affected by both the disease and drug. 3. SIDER side effects related to the gene, disease, and drug. 4. DrugBank drug glossary definition specifying various forms of Cancer treatment. (Diagram: Disease Search → Neurobase data stores → 1.0 Disease Relations → Aggregate Components → 2.0 Gene Drug Relations → 3.0 Drug Side Effects → 4.0 Drug Info Box.)
  • Which Drugs are used in Chemotherapy?
  • Which drugs are used in Chemotherapy? Drug and Disease Class Ratios of AR Inner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class
  • Which drugs are used in Chemotherapy? Drug and Side Effect Ratios of AR Inner Circle: Drugs by Affected Species, Outer Circle: Side Effect Ratios of Drugs
  • Which drugs are used in Chemotherapy?
  • Which drugs are used in Chemotherapy? Drug and Disease Class Ratios of Nilutamide Inner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class
  • Which drugs are used in Chemotherapy? Drug and Disease Class Ratios of Bicalutamide Inner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class
  • Which drugs are used in Chemotherapy?
  • Expanding the Prototype • Semantic MediaWiki query construction • Could this be done in SPARQL? • Authoring SILK / R2R mappings for the LDIF Pipeline • Extremely difficult, and the editors are not intuitive • How do you get data owners to fuse the sets and create the data store themselves? • Tested with the Aura Wiki prototype • Expand authoring provenance • How do we ensure new data / links come from an authoritative source?
  • Today we discussed… • The Allen Institute for Brain Science (AIBS) • Four similar research data sets to interlink with the AIBS data set • An import pipeline named the Linked Data Integration Framework (LDIF) • The interlinking process for 5 concurrent research data sets (AIBS, DrugBank, Diseasome, KEGG, SIDER) • A prototype neurobiology authoring platform • Creating instance pages to display the new connections • Demonstration of the initial use cases
  • QUESTIONS? COMMENTS? Expanding a Neurobiology Dataset
  • THANK YOU. Expanding a Neurobiology Dataset