Producing Life Sciences Linked Data (Problems)
Most Linked Open Data is created and provided without the help of the original data provider
Almost all Linked Open Data in Life Sciences is provided by Bio2RDF
Producing Life Sciences Linked Data (Problems)
• A database is a life's work for a biologist, who wants to publish it but not lose control of it
• An RDF dump of the DB is cheap, but supporting queries and data analysis is expensive. Where is the money coming from?
• Biologists are very motivated to add value to the data, but they still lack up-to-date ICT skills
• Help is wanted to kill Bio2RDF
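The "cheap RDF dump" point can be illustrated with a minimal sketch, assuming the database table has already been exported as rows of Python dictionaries. The base URI `http://example.org/`, the table and column names, and the helpers `escape_literal` and `rows_to_ntriples` are all hypothetical; only the line shape `<subject> <predicate> "literal" .` is taken from the N-Triples format.

```python
# Hypothetical namespace for illustration; a real provider would use its own.
BASE = "http://example.org/"


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(table, key, rows):
    """Emit one triple per non-key, non-empty column of each row.

    Assumes the key values are already safe to embed in an IRI.
    """
    lines = []
    for row in rows:
        subject = f"<{BASE}{table}/{row[key]}>"
        for column, value in row.items():
            if column == key or value is None:
                continue
            predicate = f"<{BASE}{table}#{column}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


# Made-up rows standing in for a biologist's protein table.
rows = [
    {"id": "P1", "name": "kinase A", "organism": "E. coli"},
    {"id": "P2", "name": 'protein "B"', "organism": None},
]
dump = rows_to_ntriples("protein", "id", rows)
print("\n".join(dump))
```

A dump like this is the inexpensive part. It gives no query endpoint, no links to other datasets, and no reasoning, which is where the cost discussed on the slide comes in.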
Consuming Linked Data
• The number of Linked Data repositories will keep growing
• Using Linked Data in Life Sciences means linking data with existing tools that are de facto standards in certain subdomains:
  • Pathways http://sbmm.uma.es
  • Proteins
Consuming Linked Data
• Data analysis services (not only queries but also data mining, crawling, and reasoning) are needed to engage the community
  – Biomedical uses (pharmaceuticals testing, drug screening)
Consuming Linked Data
• Reasoning, which was removed to make data reuse possible, should be re-introduced in some cases over real, complex ontologies with large data sets
  – BioPax Level 3 (Level 4 under development)
    • OWL Species: DL
    • DL Expressivity: SHIF(D)
    • Consistent: Yes
  – BioPax Level 3 (4 officially identified databases, more DBs publish data as BioPax Level 3 instances)
    • Reactome Database
      – 1.54 GB
      – 2 980 230 triples
  – BioPax Level 2 (9 officially identified databases)
• Beforehand, data and ontologies should be cleaned up
Consuming Linked Data
• Reasoning services over real, complex ontologies with large data sets
  – Cost reduction in experiment design
  – Hypothesis demonstration/refutation
  – Privacy in reasoning with public + private data
Consuming Linked Data
• Reasoning for classification problems
  – Disease classification / diagnosis
  – Protein identification
  – Pathway alignment
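As a rough illustration of classification by reasoning, the sketch below infers every class an individual belongs to by following subclass links. It is a toy closure over a made-up hierarchy, not a description logic reasoner: the class names, the individual `patient42_diagnosis`, and the helpers `superclasses` and `inferred_types` are assumptions introduced for the example.

```python
# Toy TBox: (subclass, superclass) pairs. All names are invented.
SUBCLASS_OF = {
    ("TypeIIDiabetes", "Diabetes"),
    ("Diabetes", "MetabolicDisease"),
    ("MetabolicDisease", "Disease"),
}

# Toy ABox: individual -> asserted class.
ASSERTED_TYPE = {"patient42_diagnosis": "TypeIIDiabetes"}


def superclasses(cls, subclass_of):
    """Return every class reachable from cls by following subclass links."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for sub, sup in subclass_of:
            # The membership check also keeps cyclic hierarchies from looping.
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found


def inferred_types(individual):
    """Asserted class plus all classes entailed by the subclass hierarchy."""
    asserted = ASSERTED_TYPE[individual]
    return {asserted} | superclasses(asserted, SUBCLASS_OF)


types = inferred_types("patient42_diagnosis")
print(sorted(types))
```

Real disease classification or protein identification would rely on class restrictions and property characteristics as well, which is why a full OWL reasoner is needed rather than a closure like this one.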
Consuming Linked Data
• Digital data curation / cross-validation
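One simple reading of cross-validation is to compare what two sources say about the same entity and flag disagreements for a curator. The sketch below assumes both sources have been loaded as `{entity: {property: value}}` dictionaries that share identifiers; the function name `cross_validate` and the sample records are hypothetical.

```python
def cross_validate(source_a, source_b):
    """Compare two {entity: {property: value}} datasets.

    Returns (conflicts, only_in_a, only_in_b), where conflicts lists
    (entity, property, value_a, value_b) for shared properties that disagree.
    """
    conflicts = []
    for entity in sorted(source_a.keys() & source_b.keys()):
        props_a, props_b = source_a[entity], source_b[entity]
        for prop in sorted(props_a.keys() & props_b.keys()):
            if props_a[prop] != props_b[prop]:
                conflicts.append((entity, prop, props_a[prop], props_b[prop]))
    only_in_a = sorted(source_a.keys() - source_b.keys())
    only_in_b = sorted(source_b.keys() - source_a.keys())
    return conflicts, only_in_a, only_in_b


# Made-up records from two databases describing overlapping proteins.
db_a = {
    "P1": {"organism": "E. coli", "length": 310},
    "P2": {"organism": "H. sapiens", "length": 120},
}
db_b = {
    "P1": {"organism": "E. coli", "length": 312},
    "P3": {"organism": "S. cerevisiae", "length": 95},
}
conflicts, only_in_a, only_in_b = cross_validate(db_a, db_b)
print(conflicts, only_in_a, only_in_b)
```

In practice the hard part is the assumption of shared identifiers: entities usually need to be aligned across databases before values can be compared.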
Consuming Linked Data
• Domain-oriented (customizable) user interfaces
Scalability Issues in Life Sciences
• Real scenarios with rich ontologies are starting to appear:
  – BioPax Level 3 4: complex OWL ontology (transitive, reflexive, inverse and functional properties, restrictions in most of the classes, 70 classes)
  – Big data sets in OWL format (from 20 MB to 45 GB of data)
  – Problems with the data:
    • Undetected ABox (and even TBox) inconsistencies because scalable reasoners are lacking
    • Lack of SPARQL endpoints to query these data
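To make the idea of an ABox inconsistency concrete, the sketch below scans a list of triples for subjects that carry two different values for a property declared functional. It assumes the differing values denote distinct individuals; OWL itself makes no unique-name assumption, so a real reasoner would only report an inconsistency if the values were also stated to be different. The property `hasOrganism`, the sample triples, and the helper `functional_violations` are hypothetical.

```python
from collections import defaultdict

# Properties assumed to be declared functional in the ontology (invented name).
FUNCTIONAL = {"hasOrganism"}


def functional_violations(triples, functional_properties):
    """Return {(subject, property): values} where a functional property
    has more than one distinct value for the same subject."""
    values = defaultdict(set)
    for subject, prop, obj in triples:
        if prop in functional_properties:
            values[(subject, prop)].add(obj)
    return {key: objs for key, objs in values.items() if len(objs) > 1}


# Made-up ABox assertions.
triples = [
    ("proteinX", "hasOrganism", "Ecoli"),
    ("proteinX", "hasOrganism", "Human"),
    ("proteinY", "hasOrganism", "Human"),
    ("proteinX", "participatesIn", "pathway1"),
    ("proteinX", "participatesIn", "pathway2"),
]
violations = functional_violations(triples, FUNCTIONAL)
print(violations)
```

A single pass like this is linear in the number of triples. Checks that combine transitive and inverse properties with class restrictions are what become expensive on data sets of the sizes quoted on the slide.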
Summary: Are we losing the war?
• Producing Linked Data in Life Sciences: some risks and some needs detected:
  – A motivating reward scheme for the data owner
  – Some specific infrastructure support (action, facility, institute, foundation, private…) could be useful
    • to engage data owners,
    • to contribute technical capability and
    • to share costs
Summary: Are we losing the war?
• Consuming Linked Data in Life Sciences: opportunities
  – Connecting and linking data with existing tools that are de facto standards in certain LS subdomains
    • to multiply impact
  – Not only query services but also data analysis services (crawling, mining, reasoning, etc.) should be provided to the community
    • but this is expensive for the average DB owner
  – Data must be cleaned up, curated and cross-validated
    • main threat
  – The domain lacks specific user interfaces
    • this is related to the connection of LD to (de facto) standard tools
  – In this domain it makes sense to reason
    • but scalability is still an issue
Linked Data and Life Sciences José F. Aldana Montes email@example.com