The Progress on Sagace and
Data Integration

Maori Ito
Two main topics
• Sagace
–Cross Search

• RDF
–Data Integration

Sagace
• Search for Biomedical Data &
Resources in Japan
Features
• Focus on biomedical database
• Manual Semi-automated Ranking
• Refining search results with facets
• More info...
Mechanism of Search Engine
1. Crawling
2. Indexing
3. Query Processing
4. Scoring
Crawling

Databases

Crawling Program
Indexing
• Split the data into conveniently sized
pieces and store them on our own server
Indexing Data

Internal Server
Query Processing and Scoring
Search System
NIBIO

NBDC / DBCLS

AgriTogo

MEDALS

Collaborate by
using P2P
architecture

JCGGDB

What is the most important
thing in cross search?

Speed and accuracy!
Features

• Focus on biomedical database
• Semi-automated Ranking
Log analysis reflected in
search results
• The members of top 8 databases are
almost the same.
– Patents
– KEGG ...
Comparison of databases
• Popular databases are medical or
pharmaceutical “literal-rich” (text-rich)
databases.
• Top databases run aw...
Log data has been reflected in
ranking.
• Original score -> A:12,000,B:8,000
• Gather clicked data
• Eliminate duplicating...
Unpopular databases
• Sagace started its service in
March 2012.
• Some databases have never been clicked
since then.
• Elim...
Results
• Accuracy for users must have
improved.
• Reducing the number of databases also
sped up the search.

Specific databases in life
science
• Some life science databases
lack “literal information”.
• Cross search engin...
Metadata
• If the developers mark up data with
metadata…

Metadata
• Literal information can be added to
search results!

Results Image
How to mark up and reflect the
results?
【HTML】

Declare scope itemtype with normal html tag

<div itemscope itemtype="http...
Win Win Win!
• Database developers can showcase rich
database information.
• Users can find valuable information
easily.
• C...
What is schema.org?
• "Schema.org is a set of extensible schemas that
enables webmasters to embed structured data on
their...
Microdata
“You use the schema.org vocabulary, along
with the microdata format, to add information to
your HTML content.”
(...
Current Situation
• Define original "property"
(entryID, isEntryOf, taxon, seeAlso, reference).
• Please refer to
– http:/...
6 DBs, 1 catalog and 1 DB
archive applied microdata!
• DoBISCUIT(Database Of BIoSynthesis clusters
CUrated and InTegrated)...
To add biological database
vocabularies into schema.org,
• “Need more people who think it is a good
idea.” (by organizers ...
Data Integration with RDF

http://www.mkbergman.com/968/a-new-best-friend-gephi-for-large-scale-networks/

http://www.cyto...
What is RDF?
• Resource Description Framework
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank:
<h...
RDF

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank: <http://bio2rdf.org/drugbank:> .
@prefix dr...
SPARQL(SPARQL Protocol and
RDF Query Language)
• “SPARQL (pronounced "sparkle", a
recursive acronym for SPARQL
Protocol an...
How to use?
RDF
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank: <http://bio2rdf.org/drugbank:> ....
SPARQL Endpoint
e.g:http://drugbank.bio2rdf.org/sparql

What is the target of “Acetaminophen”

Results
• You can get results from the
endpoint.

RDFization in life science
• Many datasets have already been RDFized.
• Affymetrix, DrugBank, GO, OMIM, KEGG,
PDB, UniProt, Pub...
Let’s try!
• Bio2RDF
– http://bio2rdf.org/

• EBI RDF Platform
– http://www.ebi.ac.uk/rdf/

• SPARQL endpoint
– e.g:http:/...
Pros of RDF
• Excellent with life science data
• Compared with an RDB
– Easy to extend
– RDB  RDF

• Excellent with No S...
Cons of RDF
• A bit hard to create RDF
• A bit hard to set up a development
environment
• Speed of SPARQL

Current situation in NIBIO
• Toxygates
– Johan-san and Igarashi-san have been
developing it.

• Orphan Drug Data

Toxygates
• RDFization of Open TG-GATEs data.
– microarray data, pathological data
(kidney, liver, grade ,... )

• Linked to ...
http://toxygates.nibio.go.jp/

Orphan Drug
• RDFize orphan drug information in
NIBIO.

<http://www.nibio.go.jp/orphanDrugTarget#80> drgn:designationFisca...
Let’s try it, and give me your
ideas!
• RDF data will extend many kinds of
data in life science.
• NBDC encouraged this moveme...
Future Perspective
• RDFize other databases in NIBIO
– E.g. bioresource

• Examine the benefit
• Spread RDF to many scient...
This slide deck explains the progress of Sagace (a cross search engine for biomedical data and resources in Japan) and data integration using RDF.

Published in: Technology, Education

The Progress on Sagace and Data Integration

  1. 1. The Progress on Sagace and Data Integration Maori Ito
  2. 2. Two main topics • Sagace – Cross Search • RDF – Data Integration
  3. 3. Sagace • Search for Biomedical Data & Resources in Japan
  4. 4. Features • Focus on biomedical database • Manual Semi-automated Ranking • Refining search results with facets • More informative search results with metadata
  5. 5. Mechanism of Search Engine 1. Crawling 2. Indexing 3. Query Processing 4. Scoring
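The four steps above can be sketched in a few lines of Python. This is only a toy illustration of the general crawl, index, query and score pipeline, not Sagace's actual implementation: the page IDs, the page texts and the term-frequency scoring are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Step 1 (crawling): a hypothetical stand-in for pages fetched from databases.
CRAWLED_PAGES = {
    "db_a/entry1": "acetaminophen target prostaglandin synthase",
    "db_a/entry2": "cell bank catalog human cell line",
    "db_b/entry1": "acetaminophen drug information acetaminophen dosage",
}


def build_index(pages):
    """Step 2 (indexing): map each term to the pages containing it, with counts."""
    index = defaultdict(Counter)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term][page_id] += 1
    return index


def search(index, query):
    """Steps 3 and 4: split the query into terms, then score pages by term frequency."""
    scores = Counter()
    for term in query.lower().split():
        for page_id, count in index.get(term, {}).items():
            scores[page_id] += count
    # Highest score first; ties are broken by page ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


INDEX = build_index(CRAWLED_PAGES)
print(search(INDEX, "acetaminophen target"))
```

A real cross search engine would replace the in-memory dictionary with a crawler and a persistent index, and the scoring step is where a log-based ranking could be plugged in.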
  6. 6. Crawling (diagram: Databases, Crawling Program)
  7. 7. Indexing • Split the data into conveniently sized pieces and store them on our own server (diagram: Indexing Data, Internal Server)
  8. 8. Query Processing and Scoring
  9. 9. Search System (diagram: NIBIO, NBDC / DBCLS, AgriTogo, MEDALS and JCGGDB collaborate by using a P2P architecture)
  10. 10. What is the most important thing in cross search? Speed and accuracy!
  11. 11. Features • Focus on biomedical database • Semi-automated Ranking
  12. 12. Log analysis reflected in search results • The members of the top 8 databases are almost the same. – Patents – KEGG MEDICUS – Medicine and pharmaceutical proceedings – Drug emergency call – Ingredients information of health food – Merck Manual – Medical Information Network Distribution Service – The Encyclopedia of Psychoactive Drugs
  13. 13. Comparison of databases • Popular databases are medical or pharmaceutical “literal-rich” (text-rich) databases. • Top databases run away with the winnings! • More than half of the databases have never been clicked!
  14. 14. Log data has been reflected in ranking. • Original score -> A: 12,000, B: 8,000 • Gather click data • Eliminate duplicate clicks on the same database within the same day and pick the lowest denotative (displayed) rank. – If the database score is lower than 12,400, add 200. – Otherwise add 100, but if the database's denotative rank is lower than 10, add 200. • The Patents score is fixed at 8,100. • The maximum score is 30,000.
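The update rules on this slide could be read roughly as follows. The sketch below is one possible interpretation, not the production code: it assumes that "lowest rank" means the largest rank number shown to the user, that "lower than 10" means a rank number greater than 10, and that each day's deduplicated click is applied once. The function names, database names and log format are hypothetical.

```python
MAX_SCORE = 30_000                 # "Maximum score is 30,000"
FIXED_SCORES = {"Patents": 8_100}  # "Patents score is fixed 8,100"


def daily_clicks(click_log):
    """Keep one click per (day, database): the lowest position (assumed: largest rank number)."""
    lowest = {}
    for day, database, rank in click_log:
        key = (day, database)
        lowest[key] = max(rank, lowest.get(key, rank))
    return lowest


def update_scores(scores, click_log):
    """Apply the slide's bonus rules to a copy of `scores` and return it."""
    updated = dict(scores)
    for (day, database), rank in sorted(daily_clicks(click_log).items()):
        if database in FIXED_SCORES or database not in updated:
            continue
        current = updated[database]
        # Assumed reading: low scores and clicks found far down the list get +200, others +100.
        bonus = 200 if current < 12_400 or rank > 10 else 100
        updated[database] = min(current + bonus, MAX_SCORE)
    for name, value in FIXED_SCORES.items():
        if name in updated:
            updated[name] = value
    return updated


# Hypothetical scores and click log: (day, database, displayed rank).
SCORES = {"A": 12_000, "C": 12_500, "Patents": 8_100}
LOG = [
    ("2013-01-01", "A", 3),
    ("2013-01-01", "A", 12),   # duplicate click on A the same day
    ("2013-01-02", "A", 1),
    ("2013-01-01", "C", 2),
    ("2013-01-01", "Patents", 1),
]
print(update_scores(SCORES, LOG))
```

If the intended meaning of "lower than 10" is the opposite (rank numbers below 10), only the `rank > 10` comparison and the `max` in `daily_clicks` need to change.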
  15. 15. Unpopular databases • Sagace started its service in March 2012. • Some databases have never been clicked since then. • Eliminate these databases. • Databases – 272 DB -> 122 DB
  16. 16. Results • Accuracy for users must have improved. • Reducing the number of databases also sped up the search.
  17. 17. Specific databases in life science • Some life science databases lack “literal information”. • A cross search engine is well suited to showing literal information. • Metadata will help these databases.
  18. 18. Metadata • If the developers mark up data with metadata…
  19. 19. Metadata • Literal information can be added to search results! (Results image)
  20. 20. How to mark up and reflect the results? 【HTML】 Declare the scope and itemtype with a normal HTML tag <div itemscope itemtype="http://schema.org/BiologicalDatabaseEntry"> <span >2012-10-24</span> </div> Select property 【Result】 Content
  21. 21. Win Win Win! • Database developers can showcase rich database information. • Users can find valuable information easily. • The crawler program can find this metadata properly.
  22. 22. What is schema.org? • "Schema.org is a set of extensible schemas that enables webmasters to embed structured data on their web pages for use by search engines and other applications.” • "Search engines including Bing, Google, Yahoo! and Yandex rely on this markup to improve the display of search results, making it easier for people to find the right web pages.” (http://schema.org/)
  23. 23. Microdata “You use the schema.org vocabulary, along with the microdata format, to add information to your HTML content.” (http://schema.org/docs/gs.html) • Finalizing the proposal of schema.org extension is a requirement to show “rich” results for major search engines.
  24. 24. Current Situation • We defined original properties (entryID, isEntryOf, taxon, seeAlso, reference). • Please refer to – http://sagace.nibio.go.jp/press/metadata/markup/
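To illustrate how a crawler might pick up this kind of markup, here is a minimal sketch using Python's standard `html.parser`. It is not the Sagace crawler: it handles only flat (non-nested) items, and the sample HTML, including the `entryID` and `taxon` values, is made up for the example. The `itemtype` URL and the property names are the ones mentioned in the slides.

```python
from html.parser import HTMLParser


class MicrodataExtractor(HTMLParser):
    """Collect itemprop text values that follow an itemscope element (flat items only)."""

    def __init__(self):
        super().__init__()
        self.items = []            # one dict per itemscope element
        self._current_prop = None  # itemprop whose text we are waiting for

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes:
            self.items.append({"itemtype": attributes.get("itemtype"), "properties": {}})
        if "itemprop" in attributes and self.items:
            self._current_prop = attributes["itemprop"]

    def handle_data(self, data):
        if self._current_prop and data.strip():
            self.items[-1]["properties"][self._current_prop] = data.strip()
            self._current_prop = None

    def handle_endtag(self, tag):
        self._current_prop = None


def extract_microdata(html):
    parser = MicrodataExtractor()
    parser.feed(html)
    return parser.items


# Hypothetical page fragment; the values are illustrative only.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/BiologicalDatabaseEntry">
  <span itemprop="entryID">EXAMPLE-0001</span>
  <span itemprop="taxon">Homo sapiens</span>
</div>
"""
print(extract_microdata(SAMPLE_HTML))
```

The extracted properties are the "literal information" that a cross search engine could then show next to each hit.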
  25. 25. 6 DBs, 1 catalog and 1 DB archive have applied microdata! • DoBISCUIT (Database Of BIoSynthesis clusters CUrated and InTegrated) • JCRB Cell Bank • Functional Glycomics with KO mice database • Glyco-Disease Genes Database • JCGGDB Report • MEDALS • Integbio Database Catalog • Life Science Database Archive
  26. 26. To add biological database vocabularies into schema.org, • “Need more people who think it is a good idea.” (by organizers @ schema.org) – public-vocabs@w3.org (<- ML Let’s join !) • We need more databases and web pages that are marked up with microdata. • I want your opinion on microdata. • Let's talk!
  27. 27. Data Integration with RDF http://www.mkbergman.com/968/a-new-best-friend-gephi-for-large-scale-networks/ http://www.cytoscape.org/what_is_cytoscape.html
  28. 28. What is RDF? • Resource Description Framework
      @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
      @prefix drugbank: <http://bio2rdf.org/drugbank:> .
      drugbank:DB00316 rdfs:label "Acetaminophen" .
      Diagram: Subject (drugbank:DB00316) → Predicate (rdfs:label) → Object ("Acetaminophen")
  29. 29. RDF
      @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
      @prefix drugbank: <http://bio2rdf.org/drugbank:> .
      @prefix drugbank_vocab: <http://bio2rdf.org/drugbank_vocabulary:> .
      @prefix drugbank_target: <http://bio2rdf.org/drugbank_target:> .
      # subject predicate object
      drugbank:DB00316 rdfs:label "Acetaminophen" ;
          drugbank_vocab:target drugbank_target:290 .
      drugbank_target:290 rdfs:label "Prostaglandin G/H synthase 2" .
      Diagram: drugbank:DB00316 (subject) → rdfs:label (predicate) → "Acetaminophen" (object); drugbank:DB00316 (subject) → drugbank_vocab:target (predicate) → drugbank_target:290 (object, and in turn subject) → rdfs:label (predicate) → "Prostaglandin G/H synthase 2" (object)
  30. 30. SPARQL (SPARQL Protocol and RDF Query Language) • “SPARQL (pronounced "sparkle", a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language, that is, a query language for databases, able to retrieve and manipulate data stored in Resource Description Framework format.” (http://en.wikipedia.org/wiki/SPARQL)
  31. 31. How to use?
      RDF:
      @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
      @prefix drugbank: <http://bio2rdf.org/drugbank:> .
      @prefix drugbank_vocab: <http://bio2rdf.org/drugbank_vocabulary:> .
      @prefix drugbank_target: <http://bio2rdf.org/drugbank_target:> .
      drugbank:DB00316 rdfs:label "Acetaminophen" ;
          drugbank_vocab:target drugbank_target:290 .
      drugbank_target:290 rdfs:label "Prostaglandin G/H synthase 2" .
      SPARQL (What is the target of “Acetaminophen”?):
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX drugbank: <http://bio2rdf.org/drugbank_vocabulary:>
      select distinct ?v where {  # distinct excludes duplicates
        ?s rdfs:label "Acetaminophen" ;
           drugbank:target ?t .
        ?t rdfs:label ?v .
      }
      Results: "Prostaglandin G/H synthase 2"
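To make the matching step concrete, the sketch below stores the three triples from this slide as plain Python tuples and walks the same pattern as the SPARQL query. It is a teaching aid only, assuming an in-memory list instead of a real triple store; the helper names `match` and `target_labels` are made up for the example.

```python
# The three triples from the slide, written as (subject, predicate, object).
TRIPLES = [
    ("drugbank:DB00316", "rdfs:label", "Acetaminophen"),
    ("drugbank:DB00316", "drugbank_vocab:target", "drugbank_target:290"),
    ("drugbank_target:290", "rdfs:label", "Prostaglandin G/H synthase 2"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None behaves like a SPARQL variable."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def target_labels(triples, drug_label):
    """Follow label -> target -> label, as the SPARQL query does."""
    labels = set()  # a set plays the role of DISTINCT
    for s, _, _ in match(triples, predicate="rdfs:label", obj=drug_label):
        for _, _, t in match(triples, subject=s, predicate="drugbank_vocab:target"):
            for _, _, v in match(triples, subject=t, predicate="rdfs:label"):
                labels.add(v)
    return sorted(labels)


print(target_labels(TRIPLES, "Acetaminophen"))
# prints ['Prostaglandin G/H synthase 2']
```

Each nested loop corresponds to one triple pattern in the query, and the shared variables (`s`, `t`) are what join the patterns together.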
  32. 32. SPARQL Endpoint e.g. http://drugbank.bio2rdf.org/sparql What is the target of “Acetaminophen”?
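An endpoint like this is normally queried over HTTP by passing the query text in a `query` parameter, as defined by the SPARQL Protocol. The sketch below builds such a request with the Python standard library. It assumes the endpoint from the slide is still reachable and returns SPARQL JSON results, which may no longer be true, so `run_query` is defined but not called here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint URL taken from the slide; its current availability is an assumption.
ENDPOINT = "http://drugbank.bio2rdf.org/sparql"

QUERY = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://bio2rdf.org/drugbank_vocabulary:>
SELECT DISTINCT ?v WHERE {
  ?s rdfs:label "Acetaminophen" ;
     drugbank:target ?t .
  ?t rdfs:label ?v .
}"""


def build_query_url(endpoint, query):
    """Encode the query as the `query` parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


def run_query(endpoint, query, timeout=30):
    """Send the query and return the values bound to ?v (requires network access)."""
    request = Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request, timeout=timeout) as response:
        results = json.load(response)
    return [binding["v"]["value"] for binding in results["results"]["bindings"]]


print(build_query_url(ENDPOINT, QUERY))
```

Whether the exact label match succeeds against the live data depends on how the endpoint stores labels (for example language tags or decorated label strings), so the query may need adjusting.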
  33. 33. Results • You can get results from the endpoint.
  34. 34. RDFization in life science • Many datasets have already been RDFized. • Affymetrix, DrugBank, GO, OMIM, KEGG, PDB, UniProt, PubMed...
  35. 35. Let’s try! • Bio2RDF – http://bio2rdf.org/ • EBI RDF Platform – http://www.ebi.ac.uk/rdf/ • SPARQL endpoint – e.g. http://drugbank.bio2rdf.org/sparql • How to learn? – Learning SPARQL
  36. 36. Pros of RDF • Excellent with life science data • Compared with an RDB – Easy to extend – RDB  RDF • Excellent with NoSQL too – key value
  37. 37. Cons of RDF • A bit hard to create RDF • A bit hard to set up a development environment • Speed of SPARQL
  38. 38. Current situation in NIBIO • Toxygates – Johan-san and Igarashi-san have been developing it. • Orphan Drug Data
  39. 39. Toxygates • RDFization of Open TG-GATEs data. – microarray data, pathological data (kidney, liver, grade, ...) • Linked to other databases by using RDF – KEGG pathway – GO terms – ChEMBL – DrugBank
  40. 40. http://toxygates.nibio.go.jp/
  41. 41. Orphan Drug • RDFize orphan drug information in NIBIO.
      <http://www.nibio.go.jp/orphanDrugTarget#80>
          drgn:designationFiscalYear "1996";
          drgn:designationDate "1996/4/1";
          drgn:number "(8yaku A) No. 81";
          drgb:name "Imiglucerase";
          dc:description "Improvement of symptoms of anaemia, thrombocytopenia, hepatosp
          drgn:designationApplicant "Genzyme Japan K.K.";
          drgb:pharmacology "Improvement of symptoms of anaemia, thrombocytopenia, hep
          drgb:manufacturer "Genzyme Japan K.K.";
          eob:approvalDate "1998/3/6";
          drgb:product "Cerezyme injection 200U";
          drgb:brand "CEREZYME_ injection";
          drgn:approvedName "Imiglucerase (Genetical Recombination)";
          drgn:status "Approved".
  42. 42. Let’s try it, and give me your ideas! • RDF data will extend many kinds of data in life science. • NBDC encouraged this movement.
  43. 43. Future Perspective • RDFize other databases in NIBIO – E.g. bioresource • Examine the benefit • Spread RDF to many scientists • Make useful environments for those who are not familiar with computers
