
Wikidata: Verifiable, Linked Open Knowledge That Anyone Can Edit

Slides for my September 23 talk on Wikidata and WikiCite – NIH Frontiers in Data Science lecture series.

Persistent URL: https://dx.doi.org/10.6084/m9.figshare.3850821

Published in: Technology

  1. 1. Wikidata: Verifiable, Linked Open Knowledge That Anyone Can Edit Dario Taraborelli @readermeter National Institutes of Health • September 23, 2016
  2. 2. Wikimedia Research https://www.mediawiki.org/wiki/Wikimedia_Research
  3. 3. The altmetrics manifesto http://altmetrics.org/manifesto/
  4. 4. A short history of Wikipedia ● A website that anyone can edit ● The largest reference work on the internet ● A multi-language online encyclopedia
  5. 5. A short history of Wikipedia ● A website that anyone can edit ● The largest reference work on the internet ● A multi-language online encyclopedia
  6. 6. A short history of Wikipedia ● A website that anyone can edit ● The largest reference work on the internet ● A multi-language online encyclopedia
  7. 7. Wikipedia: unintended outcomes ● accelerate the dissemination of scholarship ● provide an infrastructure for open scientific research ● enable distributed fact-checking and curation of scientific knowledge
  8. 8. Outline 1. Wikipedia as the front matter to all research 2. A new kind of open knowledge 3. Wikidata: Collaboratively curated linked open data 4. WikiCite: Building the sum of all human citations 5. Applications and opportunities for open science 6. Concluding remarks
  9. 9. 1. Wikipedia as the front matter to all research
  10. 10. “Wikipedia is not the bottom layer of authority, nor the top, but in fact the highest layer without formal vetting. In this unique role, it serves as an ideal bridge between the validated and unvalidated Web.” Casper Grathwohl Chronicle of Higher Education http://chronicle.com/article/article-content/125899/
  11. 11. Top sources of DOI lookups http://crosstech.crossref.org/2014/02/many-metrics-such-data-wow.html http://blog.crossref.org/2016/05/https-and-wikipedia.html wikipedia.org
  12. 12. World’s most accessed online medical resources Heilman and West (2015) doi.org/10.2196/jmir.4069
  13. 13. Most visited resource on Ebola in West Africa Heilman (2016) http://tinyurl.com/jfuyduv ● Most used internet site in Liberia, Sierra Leone and Guinea for Ebola during the 2014 outbreak ● Greater than CNN, CDC and WHO
  14. 14. 2. A new kind of open knowledge
  15. 15. Schmachtenberg et al (2014) http://lod-cloud.net [CC BY SA]
  16. 16. Challenges ● Biases / errors ● Coverage ● Diversity and inclusiveness ● Verifiability
  17. 17. Machine-readable linked open data ● Editable by anyone ● Supporting human + algorithmic curation ● Comprehensive ● Transparently verifiable
  18. 18. Machine-readable linked open data ● Editable by anyone ● Supporting human + algorithmic curation ● Comprehensive ● Transparently verifiable
  19. 19. Machine-readable linked open data ● Editable by anyone ● Supporting human + algorithmic curation ● Comprehensive ● Transparently verifiable
  20. 20. 3. Wikidata: Collaboratively curated linked open data
  21. 21. Wikidata ● Free knowledge base that anyone can edit ● Launched in 2012 ● Integrated with Wikipedia and other sister projects ● Statistics (September 2016): over 20M items, over 100M statements
  22. 22. Wikidata: Growth http://reportcard.wmflabs.org/graphs/active_editors English Wikipedia Wikidata
  23. 23. Wikidata: Growth http://reportcard.wmflabs.org/graphs/very_active_editors English Wikipedia Wikidata
  24. 24. Wikidata’s anatomy https://www.wikidata.org/wiki/Wikidata:Introduction
  25. 25. Wikidata’s anatomy Linked data, San Francisco, Jeblad https://commons.wikimedia.org/wiki/File:Linked_Data_-_San_Francisco.svg [CC BY SA]
  26. 26. SPARQL: https://t.co/cDR4Lt7V6P Birth place of people employed by MIT Wikidata: queries
  27. 27. SPARQL: http://tinyurl.com/h2lqv9y Authors with a known location and ORCID Wikidata: queries
  28. 28. Expert curation of scientific open data Benjamin Good (2016) Opportunities and challenges presented by Wikidata in the context of biocuration http://tinyurl.com/hk9qrmz
  29. 29. Sample of current biomedical content in Wikidata ● All human, mouse genes and proteins (swissprot) ● All Gene Ontology terms ● All Human Disease Ontology terms ● All FDA approved drugs ● 109 reference microbial genomes Mitraka et al (2015) Semantic Web Applications for the Life Sciences Burgstaller-Muehlbacher et al (2016) Database Putman et al (2016) Database
  30. 30. Expert curation of scientific open data
  31. 31. Expert curation of scientific open data Gene Wiki: Wikidata SPARQL examples https://bitbucket.org/sulab/wikidatasparqlexamples/overview ● Get all known drug-drug interactions for Methadone via its ChEMBL ID ● Get a list of all diseases known to be treated by Metformin ● Get a list of all diseases that might be treated by Metformin
  32. 32. 4. WikiCite: Building the sum of all human citations Randall Munroe, Wikipedian protester http://tinyurl.com/p3rodlb [CC BY]
  33. 33. https://twitter.com/egonwillighagen/status/718474906858582016
  34. 34. Benjamin Good (2016) Opportunities and challenges presented by Wikidata in the context of biocuration http://tinyurl.com/hk9qrmz
  35. 35. the disappearance of provenance http://bit.ly/SumOfAllCitations
  36. 36. the disappearance of provenance
  37. 37. a provenance-preserving answer engine The sum of all human knowledge The sum of all data and sources backing human knowledge +
  38. 38. References in Wikidata [chart, 2013 to 2016; callout: 77%] https://tools.wmflabs.org/wikidata-todo/stats.php https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Source_MetaData#Sources_used_as_references_on_Wikidata
  39. 39. The molecular origins of insulin go at least as far back as the simplest unicellular [[eukaryotes]].<ref name='LeRoith'>{{cite journal | vauthors = LeRoith D, Shiloach J, Heffron R, Rubinovitz C, Tanenbaum R, Roth J | title = Insulin-related material in microbes: similarities and differences from mammalian insulins | journal = Can. J. Biochem. Cell Biol. | volume = 63 | issue = 8 | pages = 839–49 | year = 1985 | pmid = 3933801 | doi = 10.1139/o85-106 }}</ref> Apart from animals, insulin-like proteins are also known to exist in Fungi and Protista kingdoms. References in Wikipedia
  40. 40. WikiCite: goals Build a repository of all Wikimedia citations and bibliographic metadata Design data models and technology to improve the coverage, quality, standards-compliance and machine-readability of citations and bibliographic metadata in Wikimedia projects @wikicite • meta.wikimedia.org/wiki/WikiCite
  41. 41. Vision Technology Community Scale Licensing Independence
  42. 42. https://tools.wmflabs.org/sqid/#/view?id=P2860 All biomedical OA review articles of the last 5 years
  43. 43. The Zika corpus ● Open citation graph layer ● Bibliographic metadata layer ● Expert annotation layer ● Encyclopedic layer
  44. 44. The Zika corpus ● Encyclopedic layer
  45. 45. The Zika corpus ● Expert annotation layer ● Encyclopedic layer ● Pathogen transmission process
  46. 46. The Zika corpus ● Bibliographic metadata layer ● Expert annotation layer ● Encyclopedic layer
  47. 47. The Zika corpus ● Open citation graph layer ● Bibliographic metadata layer ● Expert annotation layer ● Encyclopedic layer
  48. 48. 5. Applications
  49. 49. Co-author graphs for individual researchers SPARQL: http://tinyurl.com/zml3jox
  50. 50. Most cited authors in the Zika research corpus (+ filtering by journal, OA status, type of statement) SPARQL: http://tinyurl.com/jb8da68
  51. 51. Semi-automated recommendation of entities, missing statements, references for unsourced statements https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
  52. 52. Semi-automated recommendation of entities, missing statements, references for unsourced statements https://meta.wikimedia.org/wiki/Grants:Project/WikiFactMine https://twitter.com/larswillighagen/status/774614483394236416
  53. 53. Tools for crowdsourcing entity matching / disambiguation http://www.generalist.org.uk/blog/2014/wikidata-identifiers-and-the-odnb-where-next/ http://www.generalist.org.uk/blog/2014/wikidata-and-identifiers-part-2-the-matching-process/
  54. 54. read/write interfaces for biocuration
  55. 55. New opportunities for linked open knowledge curation and discovery ● all statements citing a New York Times article ● the most popular scholarly journals used as citations for statements in any item that is a subclass of economics ● all statements citing the works of Joseph Stiglitz ● all statements citing journal articles by physicists from Oxford University ● all statements citing a journal article that was retracted ● all statements citing a source that cites a journal article that was retracted https://meta.wikimedia.org/wiki/WikiCite_2016/Report/Group_5
  56. 56. More reliable data for altmetrics services https://www.altmetric.com/blog/new-source-alert-wikipedia/
  57. 57. 6. Concluding remarks
  58. 58. Dominant biocuration paradigm ● Cost of ad-hoc parsing of API responses or flatfile data ● Ambiguous or non-existent xrefs ● Persistence of funding ● Too much information to curate B. Good (2016) Opportunities and challenges presented by Wikidata in the context of biocuration http://tinyurl.com/hk9qrmz
  59. 59. A new paradigm for biocuration ● Reduce API/parser proliferation ● Force up-front integration ● Facilitate coordination ● Ensure that if funding is lost, data is not ● Leverage community input B. Good (2016) Opportunities and challenges presented by Wikidata in the context of biocuration http://tinyurl.com/hk9qrmz
  60. 60. T. Putman (2016) Centralizing content and distributing labor: a community model for curating the very long tail of microbial genomes https://doi.org/10.6084/m9.figshare.3201796.v1
  61. 61. Accelerate the discoverability, reusability, and societal impact of open access
  62. 62. ● Support new forms of open curation and distributed fact-checking ● Provide long-term, sustainable infrastructure to support open science ● Benefit from large-scale distribution of data in the linked data ecosystem ● Wikidata: Verifiable, Linked Open Knowledge That Anyone Can Edit
  63. 63. meta.wikimedia.org/wiki/WikiCite • @wikicite
  64. 64. Thank you ● Acknowledgments: Daniel Mietchen, Jonathan Dugan, Lydia Pintscher, Cameron Neylon, James Hare, James Heilman, Magnus Manske, Egon Willighagen, the Gene Wiki team (especially Andra Waagmeester, Tim Putman, Benjamin Good), the ContentMine team, the University of Chicago Knowledge Lab, all WikiCite 2016 participants and Wikidata Source Metadata project contributors. ● Additional image credits: Library, National Park Service Collection, thenounproject.com/term/library/191/ [CC0]; Robot, Creative Stall, thenounproject.com/term/robot/132360/ [CC BY]; Open Access logo, commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_transparent.svg [CC0] ● dario@wikimedia.org • @readermeter • @Wikidata • @WikiCite • @WikiResearch
  65. 65. A short history of NIH and Wikimedia ● 2002: article National Institutes of Health started on English Wikipedia ● 2003: MEDLINE ● 2004: PubMed ● 2005: PubMed Central ○ along with Template:PMC ● 2007: WikiProject National Institutes of Health ○ along with Template:National Institutes of Health ● 2009: first Wikipedia Academy in the US took place at NIH ○ Susannah Fox: “Shared Kismet: Wikipedia and the NIH” ○ triggers Guidelines for Participating in Wikipedia from NIH ● 2012: bot imports multimedia from PMC into Wikimedia Commons ○ triggers formation of JATS for Reuse working group ● 2015: Template:NIH properties on Wikidata ● 2016: First papers using Wikidata queries appear in PMC
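
Illustrative sketch for slide 26 ("Birth place of people employed by MIT"). The slide links to its query only through a shortened URL, so the SPARQL below is a reconstruction and not the query shown in the talk. Assumptions: the Wikidata properties P108 (employer), P19 (place of birth) and P625 (coordinate location), the item ID Q49108 for MIT, and the public endpoint https://query.wikidata.org/sparql. The helper names `birthplace_query` and `request_url` are hypothetical.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint (assumed; not named on the slide).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed Wikidata item ID for the Massachusetts Institute of Technology.
MIT_QID = "Q49108"


def birthplace_query(employer_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query for the birth places of people employed by an item.

    Hypothetical helper: a reconstruction of the kind of query on slide 26.
    """
    if not (employer_qid.startswith("Q") and employer_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {employer_qid!r}")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT ?person ?personLabel ?place ?placeLabel ?coord WHERE {\n"
        # P108 = employer, P19 = place of birth (assumed property IDs)
        f"  ?person wdt:P108 wd:{employer_qid} ;\n"
        "          wdt:P19 ?place .\n"
        # P625 = coordinate location, so results can be plotted on a map
        "  ?place wdt:P625 ?coord .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def request_url(query: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    # Print the URL only; fetching it requires network access.
    print(request_url(birthplace_query(MIT_QID, limit=10)))
```

The sketch only builds the query text and request URL; sending the request is left out so the example runs offline. Swapping the item ID gives the same query for another employer.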
