
Bioschemas Workshop

Cape Town - Bioschemas workshop held before the Bioinformatics Education Summit.
Covers schema.org, Bioschemas, a TeSS case study, and the tools and implementation techniques adopters can use.



Bioschemas Workshop

  1. 1. Bioschemas Workshop Niall Beard Bioinformatics Education Summit 13th May 2019
  2. 2. Preliminary Agenda
  3. 3. Expected Learning Outcomes • Understand what schema.org is and how it can be applied to a project • Understand what Bioschemas is, how it differs from schema.org, and what vocabularies are available • Know the benefits and limitations of using schema.org • Gain an understanding of how to apply (bio/)schema.org to your site.
  4. 4. Workshop style • Please do interrupt me if: – You have any questions – If you have difficulty reading the slides – If I'm not speaking clearly enough – Or if I am going too fast/slow
  5. 5. What is…
  6. 6. Search Engines connect Users with Information
  7. 7. Search Engines connect Users with Information using 'signals': query text, demographic, location, device type, document content, web traffic, link count, freshness, … (21 'signals')
  8. 8. Search Engines connect Users with Information using those 21 'signals' plus algorithms to guess matches: text matching, named entity recognition, TF-IDF, NLP
  9. 9. Take out some of the guesswork… • Search engines need to predict what a page is about… • What if, instead, search engines allowed information providers to explicitly define their pages' contents • Rather than relying on algorithmic guesswork!
  10. 10. Slide courtesy of Alasdair Gray
  11. 11. Schema.org • A lightweight way of structuring data online • Created by a consortium of search engines to improve experience and search efficacy • Thousands of different vocabularies to describe information online
  12. 12. Metadata model, i.e. the Recipe type
  13. 13. <div itemscope itemtype="http://schema.org/Recipe">
        <div itemprop="nutrition" itemscope itemtype="http://schema.org/NutritionInformation">
          Nutrition facts: <span itemprop="calories">144 kcal</span>,
        </div>
        Ingredients:
        - <span itemprop="recipeIngredient">800g small new potato</span>
        - <span itemprop="recipeIngredient">3 shallot</span>
        . . .
  14. 14. <script type="application/ld+json">
      {
        "@context": "http://schema.org",
        "@type": "Recipe",
        "name": "Potato Salad",
        "nutrition": {
          "@type": "NutritionInformation",
          "calories": "144 kcal"
        },
        "recipeIngredient": [
          "800g small new potato",
          "3 shallot"
        ]
        . . .
  15. 15. Readable by search engines (diagram: each page's content paired with its schema.org markup)
  16. 16. A training event – marked up in schema.org – as shown by Google https://search.google.com/structured-data/testing-tool
  17. 17. https://toolbox.google.com/datasetsearch
  18. 18. Search engines favour websites containing schema.org in their search results
  19. 19. Readable by Registries (diagram: each resource paired with its schema.org markup)
  20. 20. Schema.org is community made • Schema.org is made up of decentralized extensions from different industries
  21. 21. Schema.org is community made • Extensions that see good usage get ‘folded-in’ to the core schema.org vocabularies
  22. 22. Schema.org is community made • To take advantage of schema.org for Bioinformatics, we need to make our own community (the Bioinformatics / Life science community)
  23. 23. Part 2
  24. 24. Bioschemas See: "The FAIR Guiding Principles for scientific data management and stewardship", Mark D. Wilkinson et al., 2016
  25. 25. Schema.org is community made • … Bioschemas is a community to propose Life science specifications to schema.org (the Bioinformatics / Life science community)
  26. 26. Bioschemas • Bioschemas is a community project which: – Creates Types for Life science resources • Proteins, Samples, Beacons, Tools, Training, etc. – Creates Profiles to Refine & Enhance Types • Marginality • Cardinality • Controlled Vocabularies – Creates tools to make Bioschemas easier to create, validate, and extract
  27. 27. Types • Types = New vocabularies to propose to schema.org – Some are Biological Types – Some are Generic Types that are useful to Life scientists – These new types will be hosted at bio.schema.org – Currently at: http://bio.sdo-bioschemas-227516.appspot.com
  28. 28. Biological Types
  29. 29. http://bio.sdo-bioschemas-227516.appspot.com/BioChemEntity
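To make the Types idea concrete, below is a minimal JSON-LD sketch of how a page might describe a protein with the proposed BioChemEntity type. It uses only generic schema.org/Thing properties (name, identifier, url, sameAs); the full type and its profiles define further, more specific properties, and the identifiers and URLs shown are purely illustrative.

    <!-- Illustrative only: a protein described with the proposed BioChemEntity type -->
    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "BioChemEntity",
      "name": "Haemoglobin subunit alpha",
      "identifier": "P69905",
      "url": "https://www.example.org/proteins/P69905",
      "sameAs": "https://www.uniprot.org/uniprot/P69905"
    }
    </script>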
  30. 30. Profiles • Profiles = Refinement & Interoperability Layer - Because every industry and domain shares in these specifications… - Every domain includes its own properties - So we inherit lots of properties we don’t care about Schema.org is messy!
  31. 31. Profiles - Tidying up Schema.org • For example; – Dataset inherits from schema.org/CreativeWork – CreativeWork (and therefore Dataset) contains properties for: • Character • IsFamilyFriendly • Material (e.g. leather, wool, cotton, paper) • Genre • Bioschemas offers an indication of how relevant / recommended each property is, by grouping into • Minimum | Recommended | Optional
  32. 32. Profiles • Profiles = Refinement & Interoperability Layer - schema.org's generality means it does not recommend which ontologies to annotate with - Lack of restrictions on cardinality makes it difficult to parse the data (if you're not a huge search engine) Schema.org is not great for interoperability!
  33. 33. Profiles - Improving interoperability • Bioschemas profiles include cardinality restrictions and controlled vocabularies tailored to our use-cases
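As an illustration of a profile in use, here is a sketch of training-material markup with the kind of properties a Bioschemas training profile recommends; it assumes name, description, keywords, audience, and an about property pointing at an EDAM topic, so check the published profile for the exact property list, marginality, and cardinality.

    <!-- Sketch only: a training material described with profile-recommended properties -->
    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "CreativeWork",
      "name": "Introduction to Metagenomics",
      "description": "A half-day, hands-on introduction to metagenomic analysis.",
      "keywords": "metagenomics, sequencing, training",
      "audience": {
        "@type": "Audience",
        "audienceType": "PhD students"
      },
      "about": {
        "@type": "Thing",
        "name": "Metagenomics",
        "sameAs": "http://edamontology.org/topic_3174"
      }
    }
    </script>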
  34. 34. Profiles and their adoption
  35. 35. Profile Development process • Determining the schema is a process of empirical surveying and expert opinion. • We do a Cross-walk to find what fields are missing and use this to gauge marginality
  36. 36. Profile Development process: go through each attribute (row) of the schema and agree on answers to these questions. If we already have it: do we want to keep it? If we don't have it: do we want to include it? Is the description provided okay, or do we want to rewrite it? Should it be Minimum / Recommended / Optional? Should there be one or many of them? Should values be restricted to a controlled vocabulary?
  37. 37. Profile Development process • Discussions through our public mailing list
  38. 38. Profile Development process We use Github to request new properties, identify and manage bug fixing, and publicly present our decision making
  39. 39. Case Study - TeSS: Training materials, Events, and Courses Part 3
  40. 40. ELIXIR All Hands 2018, June 2018, Berlin, Germany
  41. 41. The ELIXIR Training Portal - TeSS https://tess.elixir-europe.org
  42. 42. TeSS • A training portal that indexes metadata from across the web. • Presents a wide selection of openly available training resources across the bioinformatics discipline. • Displays these in a navigable, easy-to-find manner, in a feature-rich environment.
  43. 43. View upcoming events of interest https://tess.elixir-europe.org/events
  44. 44. Find training materials from around the Web https://tess.elixir-europe.org/materials
  45. 45. TeSS Features • Search and Filter: 270+ upcoming events, 800+ training materials, filter with 10+ different facets • Institutional Login: log in with ELIXIR AAI using your institutional or Google credentials with 1-click sign-on, to favourite resources, add new events & materials, and create new training workflows • Events: stay informed about upcoming events of interest via e-mail subscription or import into calendar applications
  46. 46. TeSS Features • Link with other registries: training events and materials can be linked with resources from other registries (tools & data services from bio.tools; databases, standards, & policies from fairsharing.org) • Ontological Classification: the BioPortal Annotator web service predicts topics of resources added to TeSS; these can be approved/rejected easily by our curation group • Events map: view filtered events plotted on a map to find the most accessible & relevant events
  47. 47. Content sourcing • Rely on community to register resources? • Community needs to be moderated (to avoid spammers) • Hard to get critical mass of community involvement • Rely on curators to enter content? • Curators need to be paid / incentivized • Data entry is boring • A drop in curation/moderation attention can lead to inaccurate, malevolent, or insufficient content • Instead develop a solution that • Takes metadata directly from sources • Adds any resources to TeSS as they appear • Updates any resources that have changed
  48. 48. How TeSS works (architecture diagram): an Automated Aggregator runs custom scrapers that find relevant resources and extract metadata from online training material and event pages; the Back End stores these in a Metadata Catalogue of Events, Materials, and Training Workflows, and users can also enter resources via a form; the Front End provides the Search Interface and Workflow Viewer.
  49. 49. Automatic extraction techniques • There are several techniques we can use to extract metadata from content provider websites; which one applies depends on what's on the site. • Interface with an API: handy but rare and difficult for websites to implement, and content aggregators must write a bespoke API client for each. • Structured data already embedded in the page (RSS, ICS): limited amount of data. • HTML scraping: a fragile technique that can break when there are changes to the website.
  50. 50. Trade-off between ease of adoption and usefulness to aggregators (chart axes: ease to implement on a website vs. usefulness to aggregator)
  51. 51. Content Provider extraction technique statistics
          Technique                  Events   Materials   Total
          Schema.org / Bioschemas       9         6         15
          HTML                          3         5          8
          XML/JSON/YAML/CSV             4         3          7
          iCal                          5        --          5
          JSON API                     --         2          2
          RSS                           1        --          1
          Total                                             38
  52. 52. Content aggregation via Bioschemas (architecture diagram): as above, but a generic schema.org scraper replaces one of the custom scrapers in the Automated Aggregator, extracting metadata from training material and event pages into the Back End's Metadata Catalogue, which the Front End presents through the Search Interface and Workflow Viewer.
  53. 53. Tools and Techniques for Implementation Part 4
  54. 54. Technique for adding Bioschemas to a website • 1. Identify an appropriate schema(s) for your content type • 1.a If it doesn't exist, e-mail the mailing list (W3C) or add it to the GitHub issue tracker. Issue tracker: https://github.com/BioSchemas/specifications Mailing List: https://www.w3.org/community/bioschemas/
  55. 55. Technique for adding Bioschemas to a website • 2. Draw a table and write down your metadata fields on the left hand side and the schema.org properties on the right. • Map the ones that correlate
  56. 56. Technique for adding Bioschemas to a website • 3a. Use the Bioschemas generator to create a JSON-LD snippet that you can (hopefully) copy and paste into your site. (This would mean creating one for every new schema.org record you want to add) http://www.macs.hw.ac.uk/SWeL/BioschemasGenerator/
  57. 57. Technique for adding Bioschemas to a website • 3b. If you can modify your site, paste in the JSON-LD template of the schema (from 3a), and render your metadata variables as the values of the keys, as in the sketch below
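For example, if your back end exposes an event's title, dates, and venue, the rendered page for one event might end up containing something like the following snippet. The values are hypothetical; the property names come from schema.org/Event, and the relevant Bioschemas profile may recommend more.

    <!-- Hypothetical rendered output for a single event page -->
    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "Event",
      "name": "Bioschemas Workshop",
      "startDate": "2019-05-13",
      "endDate": "2019-05-13",
      "location": {
        "@type": "Place",
        "name": "University of Cape Town",
        "address": "Cape Town, South Africa"
      },
      "organizer": {
        "@type": "Organization",
        "name": "ELIXIR Training Platform"
      }
    }
    </script>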
  58. 58. Technique for adding Bioschemas to a website • 3c. If your site is using a CMS such as WordPress or Drupal, explore whether there is an appropriate schema.org plugin you can use (or ask on the Bioschemas mailing list)
  59. 59. Tutorials • Bioschemas Training Portal – There is a step-by-step tutorial there for adding schema.org to Jekyll pages / GitHub Pages sites. – Hopefully there will be more to come https://bioschemas.gitbook.io/training-portal
  60. 60. Tools • Bioschemas Generator – Form-based tool to generate valid Bioschemas JSON-LD – http://www.macs.hw.ac.uk/SWeL/BioschemasGenerator/ • Validata [under construction] – Web application for validating Bioschemas markup https://bioschemas.org/software/
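As a rough sketch of the Jekyll approach (assuming the standard page.title, page.description, and page.url front-matter variables; the tutorial's own layout and variable names may differ), a small include can emit JSON-LD built from each page's front matter:

    <!-- _includes/bioschemas.html : builds JSON-LD from the page's front matter -->
    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "CreativeWork",
      "name": "{{ page.title }}",
      "description": "{{ page.description }}",
      "url": "{{ page.url | absolute_url }}"
    }
    </script>

It can then be pulled into a layout with {% include bioschemas.html %} so every page carries its own markup.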
  61. 61. Tools • GoCrawlt – JSON-LD schema.org extractor • Buzzbang [on hold] – Search engine that crawls the web for Bioschemas JSON-LD https://bioschemas.org/software/
  62. 62. Freebies from Schema.org • Google Search Console – Shows you what schema.org data Google is picking up from your site, any errors, and advice on how to fix them – https://search.google.com/search-console
  63. 63. Freebies from Schema.org • Google Structured Data Testing Tool – Extracts the schema.org from a given web-page or from a code-snippet, validates it, and shows you what errors there are – https://search.google.com/structured-data/testing-tool
  64. 64. Freebies from Schema.org ecosystem • 3rd party plug-ins – Lots available to help add schema.org to your framework
  65. 65. Slide courtesy of Alasdair Gray

Editor's Notes

  • Collection of schemas can be used to describe online objects
  • Schema.org very lightweight
  • Going clockwise from top right – we have international organizations, communities surrounding technologies, national institutions, and other academic institutions.

    All output training events and/or materials and share via their own websites. Many, many opportunities in many, many locations.
  • 273 Upcoming events – 7540 collected previously.
  • 831 Training materials
