Building Data

Slides I used at BioHackathon 2012 in Toyama.

    1. Building Data. Yasunori Yamamoto
    2. [Figure: growth in the number of databases. 1994: 4 DBs (Nucleic Acids Research, 1994, Vol. 22, No. 17, 3442); diagram labels include NCBI Taxonomy, 4,000 biomedical journals indexed at NLM, GenBank, SWISSPROT, PIR, EMBL, PRF, DDBJ, PDB, dbEST, dbSTS, LANL and Patent. 2012: 35 DBs (http://www.ncbi.nlm.nih.gov/sites/gquery).] Database Center for Life Science
    3. NAR database issue. [Chart: number of databases, 2008 to 2012; plotted values 1078, 1170, 1230, 1330, 1380. Source: Oxford University Press.] 92 databases added every year. (Photo: dullhunk)
    4. Finding a relevant database is an important topic; at the same time, discussing what kinds of databases are “good” is also significant.
    5. Data before applications / services. (Photo: NASA Goddard Photo and Video)
    6. Good fish first. Yummy! Yummy!
    7. Nature provides good fish. A chef mashes up good materials. (Photos: Aziz T. Saltik, mrjorgen)
    8. What should be considered, and how can these be assessed? Interesting, useful & reliable: reliable in terms of content and structure; peer-reviewed → published in the NAR database issue or another scientific journal. Sustainable, reusable & discoverable: appropriate licenses, bulk downloadable via the Internet, Linked Data... Fresh & stable: frequent updates with the least amount of downtime.
    9. We should focus on building “good” data, or on developing tools that help build it.
    10. Allie (http://allie.dbcls.jp/): abbreviation / long form pairs in life sciences. Japanese translation; CC 2.1 (Japan); monthly update; SPARQL endpoint / bulk downloadable (N-Triples or tab-delimited plain text); links to PubMed and DBpedia (currently, RDF data only). Web search service: 7000+ unique visits / mo to the search service.
    11. Allie data model. [Diagram: a PairCluster contains a PairList and absorbs lexical variants; each Pair has a ShortForm and a LongForm (examples: SPF / specific pathogen-free; spf / specified pathogen free). Other labels in the diagram: appearsIn PubMedIDList, cooccursWith CoocurringShortFormList, inResearchAreaOf ResearchArea, frequency.]
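The idea of a PairCluster that absorbs lexical variants can be sketched roughly as follows. This is only an illustration, not Allie's implementation: the `Pair` class, `variant_key` and `cluster_pairs` are hypothetical names, and the normalization (lowercasing, treating hyphens as spaces) is an assumed stand-in that handles only case and hyphenation differences. The second example pair is made up for the demonstration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One abbreviation / long form pair (hypothetical class for illustration)."""
    short_form: str
    long_form: str


def variant_key(pair):
    """ASSUMED normalization: lowercase, hyphens to spaces, collapse whitespace."""
    def norm(text):
        return " ".join(text.lower().replace("-", " ").split())
    return (norm(pair.short_form), norm(pair.long_form))


def cluster_pairs(pairs):
    """Group pairs whose normalized forms are identical into one cluster."""
    clusters = defaultdict(list)
    for pair in pairs:
        clusters[variant_key(pair)].append(pair)
    return dict(clusters)


pairs = [
    Pair("SPF", "specific pathogen-free"),
    Pair("spf", "specific pathogen free"),   # made-up variant for the demo
    Pair("spf", "specified pathogen free"),
]
clusters = cluster_pairs(pairs)
for key, members in clusters.items():
    print(key, len(members))
```

With this simple key, "specific" and "specified" stay in separate clusters; merging those would need something beyond string normalization.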
    12. Allie class hierarchy: http://purl.org/allie/ontology/201108
    13. Allie RDF data (excerpt). [Diagram: the pair http://purl.org/allie/id/pair/1547869 has rdf:type allie:EachPair, allie:hasShortFormOf a short form, and allie:hasLongFormOf http://purl.org/allie/id/longform/1528191. The short form (abbreviation) has rdf:type allie:ShortForm and rdfs:label "SPF"@en. The long form has rdf:type allie:LongForm and two labels: rdfs:label "specific pathogen-free"@en (English) and rdfs:label "特定病原体除去の"@ja (Japanese).]
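As a rough sketch, the excerpted graph could be written out as N-Triples with the Python standard library alone. Two things here are assumptions, not taken from the slides: the expansion of the `allie:` prefix (the ontology URL plus `#`), and the short form URI, which is not legible in the excerpt and is replaced by a placeholder under example.org.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
# ASSUMPTION: the allie: prefix expands to the ontology URL plus "#".
ALLIE = "http://purl.org/allie/ontology/201108#"

PAIR = "http://purl.org/allie/id/pair/1547869"
LONG_FORM = "http://purl.org/allie/id/longform/1528191"
# HYPOTHETICAL placeholder: the real short form URI is not shown in the excerpt.
SHORT_FORM = "http://example.org/allie/shortform/SPF"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text, lang):
    """Build a language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


triples = [
    (iri(PAIR), iri(RDF_TYPE), iri(ALLIE + "EachPair")),
    (iri(PAIR), iri(ALLIE + "hasShortFormOf"), iri(SHORT_FORM)),
    (iri(PAIR), iri(ALLIE + "hasLongFormOf"), iri(LONG_FORM)),
    (iri(SHORT_FORM), iri(RDF_TYPE), iri(ALLIE + "ShortForm")),
    (iri(SHORT_FORM), iri(RDFS_LABEL), literal("SPF", "en")),
    (iri(LONG_FORM), iri(RDF_TYPE), iri(ALLIE + "LongForm")),
    (iri(LONG_FORM), iri(RDFS_LABEL), literal("specific pathogen-free", "en")),
    (iri(LONG_FORM), iri(RDFS_LABEL), literal("特定病原体除去の", "ja")),
]


def to_ntriples(rows):
    """Serialize (subject, predicate, object) terms, one triple per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in rows)


ntriples = to_ntriples(triples)
print(ntriples)
```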
    14. Useful / reliable? Database, Vol. 2011, Article ID bar013, doi:10.1093/database/bar013. Original article. Allie: a database and a search service of abbreviations and long forms. Yasunori Yamamoto (1,*), Atsuko Yamaguchi (1), Hidemasa Bono (1) and Toshihisa Takagi (2). (1) Database Center for Life Science, Bunkyo-ku, Tokyo and (2) Department of Computational Biology, University of Tokyo, Kashiwa, Chiba, Japan. *Corresponding author: Tel: +81 (0)3 5841 0251; Fax: +81 (0)3 5841 8090; Email: yy@dbcls.rois.ac.jp. Submitted 25 November 2010; Revised 25 March 2011; Accepted 28 March 2011.
        Many abbreviations are used in the literature especially in the life sciences, and polysemous abbreviations appear frequently, making it difficult to read and understand scientific papers that are outside of a reader’s expertise. Thus, we have developed Allie, a database and a search service of abbreviations and their long forms (a.k.a. full forms or definitions). Allie searches for abbreviations and their corresponding long forms in a database that we have generated based on all titles and abstracts in MEDLINE. When a user query matches an abbreviation, Allie returns all potential long forms of the query along with their bibliographic data (i.e. title and publication year). In addition, for each candidate, co-occurring abbreviations and a research field in which it frequently appears in the MEDLINE data are displayed. This function helps users learn about the context in which an abbreviation appears. To deal with synonymous long forms, we use a dictionary called GENA that contains domain-specific terms such as gene, protein or disease names along with their synonymic information. Conceptually identical domain-specific terms are regarded as one term, and then conceptually identical abbreviation-long form pairs are grouped taking into account their appearance in MEDLINE. To keep up with new abbreviations that are continuously introduced, Allie has an automatic update system. In addition, the database of abbreviations and their long forms with their corresponding PubMed IDs is constructed and updated weekly. Database URL: The Allie service is available at http://allie.dbcls.jp/.
    15. Discoverable? http://thedatahub.org/dataset/allie-abbreviation-and-long-form-database-in-life-science
    16. Reliable? http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php
    17. Reliable/stable? http://stats.lod2.eu/rdfdocs
    18. Stable? http://labs.mondeca.com/sparqlEndpointsStatus/ and http://labs.mondeca.com/sparqlEndpointsStatus/details/allie-abbreviation-and-long-form-database-in-life-science.html
    19. We consider Allie to be on the right track.
    20. Projects in this hackathon
    21. RDFization of Life Science Dictionary. Life Science Dictionary: English-Japanese / Japanese-English dictionary in life sciences; thesaurus and concordance; project started in 1993; 110k English words and 120k Japanese words (as of Mar. 2011). Can be used to inter- or intra-connect life science databases and to bridge English-Japanese resources in life sciences. Prefix would be http://purl.org/lsd/
    22. http://lsd.pharm.kyoto-u.ac.jp/en/service/weblsd/index.html
    23. RDFization of Colil. Comments on Literature in Literature (Colil): citation data extracted from the PMC OA subset; citing comments on each cited literature (citation context); relevant literature based on co-citation data; similar to the MS academic search service. Can be used for a literature recommendation service and for curation/annotation assistance services. Bulk downloadable.
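A minimal sketch of "relevant literature based on co-citation data": two papers count as related when the same citing paper cites both, and the more often that happens, the stronger the relation. The function names and the toy citation data are hypothetical, and this is not a description of how Colil itself computes relevance.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations):
    """Count, for each unordered pair of cited papers, how many citing papers cite both."""
    counts = Counter()
    for cited in citations.values():
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return counts


def related(paper, counts):
    """Rank the papers co-cited with `paper`, most frequently co-cited first."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == paper:
            scores[b] += n
        elif b == paper:
            scores[a] += n
    return scores.most_common()


# Toy data: citing paper -> list of papers it cites (made up for the demo).
citations = {
    "citing1": ["A", "B", "C"],
    "citing2": ["A", "B"],
    "citing3": ["B", "C"],
}
counts = cocitation_counts(citations)
print(related("A", counts))
```

In this toy data, A and B are cited together by two papers and A and C by one, so B ranks above C as literature related to A.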
    24. Colil
    25. Enjoy hack & Toyama! (Photo: digicacy)
