Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Opportunities and challenges presented by Wikidata in the context of biocuration

1,895 views

Published on

Abstract—Wikidata is a world readable and writable knowledge base maintained by the Wikimedia Foundation. It offers the opportunity to collaboratively construct a fully open access knowledge graph spanning biology, medicine, and all other domains of knowledge. To meet this potential, social and technical challenges must be overcome - many of which are familiar to the biocuration community. These include community ontology building, high precision information extraction, provenance, and license management. By working together with Wikidata now, we can help shape it into a trustworthy, unencumbered central node in the Semantic Web of biomedical data.

Published in: Science
  • Be the first to comment

Opportunities and challenges presented by Wikidata in the context of biocuration

  1. 1. Opportunities and challenges presented by Wikidata in the context of biocuration Benjamin Good BioCreative, Corvalis Oregon 2016 @bgood bgood@scripps.edu http://www.slideshare.net/goodb
  2. 2. Road Map • Introduction to Wikidata • Wikidata and biocuration • Wikidata and BioCreative
  3. 3. Is to data as Wikipedia is to text “Giving more people more access to more knowledge” A free and open repository of knowledge • Initiated by WikiMedia Germany • In transition to the WikiMedia Foundation • Not a grant funded ‘project’… as stable as Wikipedia
  4. 4. It’s a knowledge base! • Anyone can edit (human or robot) • Anyone can use (CC0)
  5. 5. Elements of the kb are called ‘items’ https://www.wikidata.org/wiki/Q146
  6. 6. Items are unique concepts, used to link different language Wikipedias together Q146 Af:Kat En:cat Als:Hauskatze Ang:Catte Av:Keto
  7. 7. Items are described by “statements” that link together to form the language-independent wikidata knowledge graph Cat Domesticated Animal Animal Subclass Of Subclass Of Animalia Taxon name Kingdom Taxon rank
  8. 8. Item: Q84
  9. 9. Item: Q414043 RELN Genomic start: 103471784 GenLoc assembly: GRCh38 Stated in: Ensembl Release 83 Retrieved: 19 January 2016 Value (numeric) Property Claim Qualifiers References https://www.wikidata.org/wiki/Q414043 Statement Genomic position for Reelin gene
  10. 10. Item: Q414043 RELN Encodes: Reelin (protein) Stated in: NCBI homo sapiens annotation release 107 Retrieved: 19 January 2016 Value (item) Property Claim Qualifiers References https://www.wikidata.org/wiki/Q414043 Statement Linking the Reelin gene to a protein it encodes
  11. 11. Item: Q13561329 Reelin Cell component: dendrite Determination method: • ISS (Sequence or structural Similarity) • IEA (Electronic annotation) Stated in: Uniprot Retrieved: 21 March 2016 Value (item) Property Claim Qualifiers References https://www.wikidata.org/wiki/Q13561329 Statement Gene ontology annotation for Reelin protein with evidence codes modeled as qualifiers
  12. 12. graphical view RELN Reelin encodes dendrite cellular component claim ISS IEA Determination method: qualifiers UniProt stated in retrieved 21 March 2016 References Statement
  13. 13. Inter-item links form a giant knowledge graph Everything is connected Reelin, Heart disease, Barack Obama, everything.. https://query.wikidata.org SPARQL endpoint for Wikidata
  14. 14. Sample of current biomedical content • All human, mouse genes and proteins (swissprot) • All Gene Ontology terms • All Human Disease Ontology terms • All FDA approved drugs • 109 reference microbial genomes Burgstaller-Muelbacher et al (2016) Database Mitraka et al (2015) Semantic Web Applications for the Life Sciences Putman et al (2016) Database
  15. 15. http://tinyurl.com/biowiki-sparql Sample queries that are currently possible: • “GO cellular localization annotations for Reelin with evidence code ISS” • “Diseases treated by Metformin” • “Diseases that might be treated by Metformin” http://query.wikidata.org
  16. 16. Example question: repurposing Metformin http://tinyurl.com/zem3oxz Metformin ?disease interacts with protein SLC22A3encoded by genetic association Might treat ? Solute carrier family 22 member 3 SLC22A3 prostate cancer
  17. 17. Road Map • Introduction to Wikidata • Wikidata and biocuration • Wikidata and BioCreative
  18. 18. API Flatfiles The dominant paradigm for open biocuration API Flatfiles Your Database Your Database Your Databasexrefs Your Database Pain points • API or flatfile parsing • Ambiguous or non-existent xrefs • Persistence of funding • Too much information to curate My Web Application My Database My Database Curators My Research Grants $ Biomedical knowledge
  19. 19. A new paradigm for open biocuration? My Application Our Database? Our Database Curators And our community Biomedical knowledge My Application My Application My Research Grants $ Reducing the pain • Reduces API/parser proliferation • Forces up-front integration • Facilitates coordination • Ensures that if funding is lost, data is not • Invites community input
  20. 20. A new platform for open biocuration? My Application Our Database Curators And our community Biomedical knowledge My Application My Application My Research Grants $ • SPARQL = a common API for accessing content • 1 endpoint to maintain… • Its working
  21. 21. The first application built on wikidata is Wikipedia Our Database Curators And our community Biomedical knowledge Our Applications Our Applications Su, Schriml, Pavlidis R01 Grant… $
  22. 22. Deeply integrated, (incredible SEO)
  23. 23. Application #1 Burgstaller et al (2016)
  24. 24. Impact of wikidata on Wikipedia Gene Wiki Version 1. {{GNF_Protein_box | Name = Reelin| image = | image_source = | PDB = {{PDB2|4AD9}} | HGNCid = 18512 | MGIid = | Symbol = LACTB2 | AltSymbols =; CGI-83 | IUPHAR = | ChEMBL = | OMIM = None | ECnumber = | Homologene = 9349 | GeneAtlas_image1 = | GeneAtlas_image2 = | GeneAtlas_image3 = | Protein_domain_image = | Function = {{GNF_GO|id=GO:0005515 |text = protein binding}} {{GNF_GO|id=GO:0016787 |text = hydrolase activity}} {{GNF_GO|id=GO:0046872 |text = metal ion binding}} | Component = {{GNF_GO|id=GO:0005739 |text = mitochondrion}} | Process = {{GNF_GO|id=GO:0008152 |text = metabolic process}} | Hs_EntrezGene = 51110 | Hs_Ensembl = ENSG00000147592 | Hs_RefseqmRNA = NM_016027 | Hs_RefseqProtein = NP_057111 | Hs_GenLoc_db = hg38 | Hs_GenLoc_chr = 8 | Hs_GenLoc_start = 70635318 | Hs_GenLoc_end = 70669174 | Hs_Uniprot = Q53H82 | Mm_EntrezGene = 212442 | Mm_Ensembl = ENSMUSG00000025937 | Mm_RefseqmRNA = NM_145381 | Mm_RefseqProtein = NP_663356 | Mm_GenLoc_db = mm10 | Mm_GenLoc_chr = 1 | Mm_GenLoc_start = 13623330 | Mm_GenLoc_end = 13660546 | Mm_Uniprot = Q99KR3 | path = PBB/51110}} = Gene Wiki Version 2. {{Infobox gene}} • All data in Wikidata • 1 Lua script works for all genes = (1 of these for every gene)
  25. 25. Wikidata use increasing on Wikipedia • https://en.wikipedia.org/wiki/Category: Templates_using_data_from_Wikidata • 81 templates indicate that they use it
  26. 26. Application #2. Centralized Model Organism Database (CMOD) http://sulab.scripps.edu/CMOD/
  27. 27. The next application built on wikidata, yours? Our community Biomedical knowledge CMOD ????
  28. 28. Road Map • Introduction to Wikidata • Wikidata and biocuration • Wikidata and BioCreative
  29. 29. Challenges • Community ontology building • Establishing computable trust • Expanding the knowledge base “Dogs and cats living together! Mass hysteria!” (leave that for ICBO) BioCreative Challenges?
  30. 30. ‘Statements’ on Wikidata 2013 2016 100M Statements Bad Good Ugly 60M 20M https://tools.wmflabs.org/wikidata-todo/stats.php
  31. 31. Computable trust RELN Genomic start: 103471784 GenLoc assembly: GRCh38 Claim Add References 1. Add references 2. Check that references concur with the claim or not 3. Estimate ‘truthiness’ of claim 4. Provide humans with sources to follow up. • References can come from databases, articles in PubMed, etc. BadUgly Good
  32. 32. Expanding the knowledge base RELN ? ? ? New Claims • Given external knowledge source (text or database) • Create claims and references automatically with very high precision • Allow for human verification PMID: 77901 PMID: 523070
  33. 33. Unique characteristics Wikidata w/ regard to IE tasks. • 16,000+ ‘active’ editors and growing • Could be a powerful crowdsourcing resource • Must be kept involved or will block progress • Constrained data model and some limits on content type • CC0 requirement
  34. 34. One known attempt: “StrepHit” • Individual Engagement Grant (IEG) from the Wikimedia Foundation (30k, Start Jan. 2016) • Goal to: • “Generate trust and reliability over Wikidata content” • “Alleviate the burden of manual curation” (sounds familiar, right?) • Ended up working on Biographical and Soccer data…
  35. 35. StrepHit NLP pipeline: https://github.com/Wikidata/StrepHit Text corpus Claim Extractor “Primary sources” tool on wikidata
  36. 36. ‘Primary Sources’ optional userscript that wikidata users can install. Approving a suggested reference for the claim that came from German Wikipedia https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
  37. 37. The Wikidata Game(s) (= microtasks…) https://tools.wmflabs.org/wikidata-game/ https://tools.wmflabs.org/wikidata-game/distributed/ Code for making your own!
  38. 38. StrepHit, Primary Sources, Wikidata games… All works in progress https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal
  39. 39. BioCreative Challenges with Wikidata ? Our Database Curators And our community Biomedical knowledge ???? ?
  40. 40. Acknowledgements Gene Wikidata Team Andra Waagmeester (Micelio) Sebastian Burgstaller (Scripps) Tim Putman (Scripps) Elvira Mitraka (U Maryland) Julia Turner (Scripps) Justin Leong (UBC) Lynn Schriml (U Maryland) Paul Pavlidis (UBC) Andrew Su (Scripps) Ginger Tsueng (Scripps) Contact bgood@scripps.edu @bgood on twitter Adapted logo Su Laboratory at TSRI The 16,950 other active editors of Wikidata and especially the 693 that joined last month and the 809 that joined the month before that and the 721 that joined the month before that.. This work was supported by the US National Institute of Health (grants GM089820 and U54GM114833) and by the Scripps Translational Science Institute with an NIH-NCATS Clinical and Translational Science Award (CTSA; 5 UL1 TR001114).
  41. 41. Social controls • Anyone can • Add or edit labels, descriptions, statements, references etc. on existing items • Create new items • Link items to Wikipedia articles • Query using https://query.wikidata.org • Read and write small numbers of edits with https://www.wikidata.org/w/api.php • Propose a new property • Request a bot account for high-volume automated editing Here be dragons..
  42. 42. Properties (as of April 10, 2016) • 2196 active properties • 114 new properties that have been proposed but not yet approved Proposal https://www.wikidata.org/wiki/Wikidata:Property_proposal
  43. 43. After proposal, community discussion • Each property is left open for discussion by anyone until • An administrator or other person blessed with the power either creates it or decides not to create it based on the discussion • People that enjoy ontology arguments needed here! Lengthy (cut-off) discussion of proposal for ‘extinct’ property
  44. 44. https://www.wikidata.org/wiki/Wikidata:Property_proposal/ Property proposal on wikidata Proposal Community discussion
  45. 45. Bot accounts • https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot • Same basic process.
  46. 46. Proposal discussions • Can not be avoided • The discussions are long and tiring but important • Many of the people involved are quite experienced • All are trying to make something great • Persistence and patience required
  47. 47. RELN Reelin encodes dendrite cellular component claim ISS IEA Determination method: qualifiers UniProt stated in retrieved 21 March 2016 References Statement SPARQL graph.. http://tinyurl.com/biowiki-sparql

×