Automated Curation with Internal and External Validation

WikiPathways 2018 Summit presentation about using WikiPathways RDF and Jenkins for unit testing the (biological) content of WikiPathways, detecting common errors, like wrong or outdated identifiers, unlikely biology, etc.

Published in: Science

  1. WikiPathways Summit 2018
     Automated Curation with Internal and External Validation
     Egon Willighagen, Maastricht University
  2. Computer-Assisted Curation?
     ● Because people repeatedly make the same mistake, and so do computers
       ○ e.g. homology-translated pathways
     ● Some things are hard to detect
       ○ the metabolite with CAS 2646-71-1 is a salt
       ○ outdated/replaced identifiers
     ● Typos
       ○ Incorrect identifier format
       ○ Incorrect database source
       ○ Incorrect DataNode (e.g. Metabolite with NCBI Gene identifier)
  3. Frequent Testing
  4. Tests for Genes
     ● Wrong species (e.g. ENSG… for mouse, etc.)
       ○ harder: miRNAs ...
     ● NCBI Genes should be integers
     ● CHEBI identifiers for genes (e.g. for gene probes)
     ● Outdated Ensembl identifiers
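
Two of the gene checks above can be sketched in plain Java. This is only an illustration, not the project's implementation: the class name, the method names, and the prefix table are assumptions, and the table lists gene-level Ensembl prefixes only, keyed by NCBI Taxonomy identifier.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of WikiPathwaysCurator.
final class GeneIdentifierChecks {

    // Assumed gene-level Ensembl prefixes, keyed by NCBI Taxonomy ID.
    private static final Map<String, String> ENSEMBL_GENE_PREFIX = Map.of(
        "9606", "ENSG",      // human
        "10090", "ENSMUSG",  // mouse
        "10116", "ENSRNOG"   // rat
    );

    private static final Pattern INTEGER = Pattern.compile("[0-9]+");

    // NCBI Gene identifiers should consist of digits only.
    static boolean isIntegerId(String id) {
        return id != null && INTEGER.matcher(id).matches();
    }

    // True when the Ensembl gene identifier carries the prefix expected for the species.
    static boolean ensemblGeneMatchesSpecies(String id, String taxonId) {
        String prefix = ENSEMBL_GENE_PREFIX.get(taxonId);
        if (prefix == null) {
            throw new IllegalArgumentException("No prefix known for taxon " + taxonId);
        }
        return id != null && id.startsWith(prefix);
    }

    public static void main(String[] args) {
        System.out.println(isIntegerId("7157"));
        System.out.println(ensemblGeneMatchesSpecies("ENSG00000141510", "10116"));
    }
}
```

The miRNA case flagged as harder on the slide is deliberately left out of this sketch.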
  5. Tests for Metabolites
     ● Secondary identifiers
       ○ ChEBI, HMDB
     ● ChemSpider identifiers are integers
     ● DataNodes with an identifier not marked as type Metabolite
       ○ PubChem, KEGG, …
     ● DataNodes of type Metabolite with a gene identifier
     ● DataNodes of type Metabolite that are also of type GeneProduct, …
     ● KEGG Compound identifiers with the wrong pattern
       ○ Start with cpd:
       ○ Do not start with a ‘C’
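
The identifier-pattern checks for metabolites could look roughly like the following. It is a sketch under assumptions: the class and method names are made up, and the KEGG Compound pattern is taken to be a capital C followed by five digits (as in C00031).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of WikiPathwaysCurator.
final class MetaboliteIdentifierChecks {

    // Assumed KEGG Compound pattern: 'C' followed by five digits.
    private static final Pattern KEGG_COMPOUND = Pattern.compile("C[0-9]{5}");
    private static final Pattern INTEGER = Pattern.compile("[0-9]+");

    // ChemSpider identifiers should consist of digits only.
    static boolean isValidChemSpider(String id) {
        return id != null && INTEGER.matcher(id).matches();
    }

    // Returns "ok" or a short description of what is wrong with the identifier.
    static String describeKeggProblem(String id) {
        if (id == null || id.isEmpty()) {
            return "missing identifier";
        }
        if (KEGG_COMPOUND.matcher(id).matches()) {
            return "ok";
        }
        if (id.startsWith("cpd:")) {
            return "starts with cpd:";
        }
        if (!id.startsWith("C")) {
            return "does not start with 'C'";
        }
        return "unexpected pattern";
    }

    public static void main(String[] args) {
        System.out.println(describeKeggProblem("cpd:C00031"));
        System.out.println(isValidChemSpider("5589"));
    }
}
```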
  6. Tests for Interactions
     ● A Metabolite that gets converted to a GeneProduct
       ○ Exceptions, e.g. Thioredoxin, bradykinin
     ● Genes that get converted into genes
     ● Proteins that get converted into genes
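
The interaction rules above, including the exception list, can be expressed as a small predicate. Everything here is illustrative: the enum, the method name, and the choice to match exceptions on either node label (case-insensitively) are assumptions; the actual tests run as SPARQL queries over the RDF.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of WikiPathwaysCurator.
final class InteractionChecks {

    enum NodeType { METABOLITE, GENE_PRODUCT, PROTEIN }

    // Labels for which a Metabolite-to-GeneProduct conversion is accepted.
    private static final Set<String> EXCEPTIONS = Set.of("thioredoxin", "bradykinin");

    private static boolean isException(String label) {
        return label != null && EXCEPTIONS.contains(label.toLowerCase(Locale.ROOT));
    }

    // True when a conversion from source to target looks biologically unlikely.
    static boolean isSuspiciousConversion(NodeType source, String sourceLabel,
                                          NodeType target, String targetLabel) {
        if (target != NodeType.GENE_PRODUCT) {
            return false; // only conversions into genes are checked here
        }
        if (source == NodeType.METABOLITE) {
            // Assumption: an exception applies if either end carries a listed label.
            return !(isException(sourceLabel) || isException(targetLabel));
        }
        // Genes or proteins that get converted into genes.
        return source == NodeType.GENE_PRODUCT || source == NodeType.PROTEIN;
    }

    public static void main(String[] args) {
        System.out.println(isSuspiciousConversion(
            NodeType.METABOLITE, "ATP", NodeType.GENE_PRODUCT, "TP53"));
    }
}
```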
  7. Tests for Outdated Conventions
     ● Outdated Data Sources
       ○ ChemSpider → misspelling, should be Chemspider
       ○ Ensembl Human → new convention is just “Ensembl” for all species
       ○ PubChem → split, and now PubChem Compound or PubChem Substance
       ○ Uniprot/TrEMBL, UniProt/TrEMBL → upper/lower case errors
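
A lookup table is one simple way to flag the outdated data source names listed above. The class and method names below are hypothetical, and the advice strings only restate the slide; no replacement is suggested where the slide does not name one.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of WikiPathwaysCurator.
final class OutdatedDataSources {

    // Outdated data source name -> advice, as given on the slide.
    private static final Map<String, String> ADVICE = Map.of(
        "ChemSpider", "misspelling, should be \"Chemspider\"",
        "Ensembl Human", "use \"Ensembl\" for all species",
        "PubChem", "split: use \"PubChem Compound\" or \"PubChem Substance\"",
        "Uniprot/TrEMBL", "upper/lower case error",
        "UniProt/TrEMBL", "upper/lower case error"
    );

    // Returns advice for an outdated name, or an empty Optional if the name is not flagged.
    static Optional<String> adviceFor(String dataSource) {
        return Optional.ofNullable(dataSource).map(ADVICE::get);
    }

    public static void main(String[] args) {
        System.out.println(adviceFor("Ensembl Human").orElse("not flagged"));
        System.out.println(adviceFor("Ensembl").orElse("not flagged"));
    }
}
```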
  8. How does it work? SPARQL
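  9. Anatomy of a Test: RDF data

         @BeforeClass
         public static void loadData() throws InterruptedException {
             // SPARQLEP selects a remote SPARQL endpoint; when it is unset,
             // fall back to the local WikiPathways RDF files instead of failing.
             String endpoint = System.getProperty("SPARQLEP");
             if (endpoint != null && endpoint.startsWith("http")) {
                 // ok, assume the SPARQL end point is online
                 System.err.println("SPARQL EP: " + endpoint);
             } else {
                 Model data = OPSWPRDFFiles.loadData();
                 // Sanity check that the RDF files actually loaded.
                 Assert.assertTrue(data.size() > 5000);
             }
         }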
  10. Anatomy of a Test: JUnit

          @Test
          public void wrongEnsemblIDForRatSpecies() throws Exception {
              // Load the SPARQL query that selects the offending DataNodes.
              String sparql = ResourceHelper.resourceAsString("genes/ensemblGenesWrongSpecies_Rat.rq");
              Assert.assertNotNull(sparql);
              System.out.println("Wrong Ensembl gene for rat for: " + System.getProperty("SUBSETPREFIX"));
              // Query the remote endpoint if one is configured, else the local RDF files.
              // Same check as in loadData(), so https endpoints are handled too.
              String endpoint = System.getProperty("SPARQLEP");
              StringMatrix table = (endpoint != null && endpoint.startsWith("http"))
                  ? SPARQLHelper.sparql(endpoint, sparql)
                  : SPARQLHelper.sparql(OPSWPRDFFiles.loadData(), sparql);
              Assert.assertNotNull(table);
              // Every returned row is one error, so the test passes only on an empty result.
              Assert.assertEquals(
                  "Ensembl identifiers for wrong species for a rat pathway:\n" + table,
                  0, table.getRowCount()
              );
          }
  11. Anatomy of a Test: SPARQL

          SELECT DISTINCT ?homepage ?label ?identifier
          WHERE {
            ?gene dc:source "Ensembl"^^xsd:string ;
                  rdfs:label ?label ;
                  dcterms:identifier ?identifier ;
                  dcterms:isPartOf ?pathway ;
                  a wp:GeneProduct .
            FILTER (!strStarts(?identifier, "ENSRN"))
            ?pathway wp:organism <http://purl.obolibrary.org/obo/NCBITaxon_10116> ;
                     foaf:page ?homepage .
          }
  12. Anatomy of a Test: Post-processing

          SELECT DISTINCT ?homepage ?label ?identifier
          WHERE {
            ?gene dc:source "Ensembl"^^xsd:string ;
                  rdfs:label ?label ;
                  dcterms:identifier ?identifier ;
                  dcterms:isPartOf ?pathway ;
                  a wp:GeneProduct .
            FILTER (!strStarts(?identifier, "ENSRN"))
            ?pathway wp:organism <http://purl.obolibrary.org/obo/NCBITaxon_10116> ;
                     foaf:page ?homepage .
          }
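
The slide does not spell out the post-processing step. One plausible reading is that the selected columns (?homepage, ?label, ?identifier) are turned into a readable report for curators. The sketch below assumes exactly that and uses plain lists and maps in place of the StringMatrix result type; the class name and the output layout are made up.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of WikiPathwaysCurator.
final class ReportFormatter {

    // Formats one line per result row: pathway page, DataNode label, identifier.
    static String format(String title, List<Map<String, String>> rows) {
        StringBuilder sb = new StringBuilder(title)
            .append(": ").append(rows.size()).append(" issue(s)\n");
        for (Map<String, String> row : rows) {
            sb.append("  ").append(row.get("homepage"))
              .append("  ").append(row.get("label"))
              .append(" (").append(row.get("identifier")).append(")\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative values only; the homepage is a placeholder URL.
        List<Map<String, String>> rows = List.of(Map.of(
            "homepage", "https://example.org/pathway",
            "label", "Tp53",
            "identifier", "ENSG00000141510"));
        System.out.print(format("Ensembl identifiers for wrong species", rows));
    }
}
```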
  13. Test Reports
  14. Test Reports
  15. Test Reports
  16. Limitations / Choices
      ● Some things will fail too often
        ○ Uncorrected lines
        ○ DataNodes without identifiers
        ○ …
      ● Some queries have exceptions (e.g. bradykinin, …)
      ● Some things give too many false positives
        ○ labels with IUPAC names
      ● Do not overload the curators
  17. Conclusions / Outlook
      ● Prevent regressions
      ● Systematic searching of common mistakes
      ● Shape Expressions?
        ○ “All DataNodes must have…”
      ● Regressions versus All issues
      ● Pathway-specific Error Reports
      ● Chemistry
      ● Requests: github.com/BiGCAT-UM/WikiPathwaysCurator
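
The "Regressions versus All issues" distinction can be illustrated with a set difference: report only the issues that were absent from the previous run. This is a sketch under assumptions; the class name and the idea of identifying each issue by a string key are not from the slides.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of WikiPathwaysCurator.
final class RegressionFilter {

    // Returns the issues present in the current run but not in the previous one.
    static Set<String> newIssues(Set<String> previous, Set<String> current) {
        Set<String> fresh = new TreeSet<>(current); // sorted copy, inputs stay untouched
        fresh.removeAll(previous);
        return fresh;
    }

    public static void main(String[] args) {
        // Illustrative issue keys only.
        Set<String> previous = Set.of("pathwayA:geneX", "pathwayB:geneY");
        Set<String> current = Set.of("pathwayB:geneY", "pathwayC:geneZ");
        System.out.println(newIssues(previous, current));
    }
}
```

Reporting only the difference keeps long-standing known issues from drowning out new ones, which fits the aim of not overloading the curators.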
  18. Acknowledgments
      ● WikiPathways teams
        ○ Gladstone Institutes: Alex Pico et al.
        ○ Maastricht University: Chris Evelo et al.
      ● Infrastructure
        ○ Nuno Nunes → maintaining the Jenkins installation
        ○ Andra Waagmeester → WikiPathways RDF
