Linking the silos. Data and predictive models integration in toxicology.


Silo storage systems are typically designed to store a single type of grain. Likewise, an information silo is characterized by a rigid design that does not allow easy exchange of information or integration with other systems. Life sciences software systems, and those applied in toxicology in particular, are often developed independently, and compatibility is rarely perceived as a primary design goal. The traditional approach to the integration challenge is to build data warehouses and custom applications that manage all the information under a central authority and store data in a single database schema. The majority of existing chemical databases rely on a centralized design, offering data access via unique and incompatible interfaces. Loosely coupled systems, which allow end users to share and integrate data on the fly, are increasingly considered more appropriate and are actively developed [1], but are not yet mainstream in cheminformatics. Our experience in several past and ongoing projects dealing with the integration of chemical structure and toxicity databases is reflected in the architecture of the systems we have designed (e.g. OpenTox, a set of distributed web services with semantic representation of resources, including data [2], [3], [4]). The core concepts of this information integration are a web service oriented architecture and a data representation design that includes semantic and provenance information, uncoupled from the software code and any proprietary metadata.

Published in: Education, Technology

  1. 1. NINA JELIAZKOVA, IdeaConsult Ltd., Sofia, Bulgaria, www.ideaconsult.net. 15TH INTERNATIONAL QSAR WORKSHOP, TALLINN, ESTONIA
  2. 2. LAST DECADE CHANGES
     • Biology, bioinformatics: data, human genome
     • Internet: online databases, online collaboration, crowdsourcing
     • Open access, open source
     • Database management systems
     • Machine learning and statistics
     • Cheminformatics
     Figure caption (Williams J M et al. Brief Bioinform 2010;11:598-609): "Although biological databases are only a fraction of all available bioinformatics software resources, their rise is representative of the overall growth of these resources, and is concurrent with the number of base pairs released in GenBank."
  3. 3. OPEN SOURCE CHEMINFORMATICS
     • A. Dalke, EuroQSAR 2008 poster
     • Noel M O'Boyle et al. Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on. Journal of Cheminformatics 2011, 3:37
  4. 4. DATABASES
     • 2001: Binding DB
     • 2002: NIAID ChemDB HIV/AIDS Database
     • 2003: Ligand Depot
     • 2004: ZINC database
     • 2004: PDBbind database
     • 2004: PubChem, >30 mln structures
     • 2004: sc-PDB
     • 2004: Binding MOAD
     • 2005: DrugBank
     • 2006: Chemical Structure Lookup Service (CSLS), >80 mln structures
     • 2007: ChemSpider
     • 2008/2009: ChEMBL
     • 2009: Chemical Identifier Resolver (CIR)
     • 2003: First Nucleic Acids Research (NAR) annual Database issue
     • >1000 databases published per year since 2008; >1300 in 2011
     Source: Marc Nicklaus, 5th Meeting on U.S. Government Chemical Databases and Open Chemistry, 2011, http://cactus.nci.nih.gov/presentations/meeting-08-2011/meeting-2011-08-25.html
  5. 5. EASY ACCESS
     • Increased activity
     • Technology advances, more products
     • What was revolutionary a few decades ago, establishing software for chemical registration (e.g. the CAS database), is routine nowadays. It could be a Computer Science Master's project or a research project …
  6. 6. EU PROJECTS (AN INCOMPLETE LIST)
     • EU FP6/FP7: 2-FUN, ACROPOLIS, ACuteTox, AQUATERRA, ARTEMIS, CADASTER, CAESAR, carcinoGENOMICS, CONTAMED, ESNATS, GENESIS, HEIMSTA, INVITROHEART, LIINTOP, MENTRANS, MODELKEY, NOMIRACLE, OSIRIS, PREDICT-IV, OpenTox, ReProTect, RISKBASE, RISKCYCLE, Sens-it-iv, VITROCELLOMICS
     • Innovative Medicines Initiative: 30 ongoing projects, including eTox and OpenPhacts
     • Safety Evaluation Ultimately Replacing Animal Testing (SEURAT) cluster: 6 research projects
  7. 7. SCIENTIFIC DATABASES IMPLEMENTATION
     • Identify the data model and functionality
     • Translate the data model into a database schema
     • Implement the database and user interface functionality
     • (Optionally) provide libraries or expose (some of) the functionality as web services
     Advantages:
     • Use one’s favourite technology and jump directly into implementation
     • Attract end users with a nice GUI relatively quickly
     • Relatively easy to persuade funding organisations that this will be a useful resource
     Disadvantages:
     • Proliferation of incompatible resources that provide similar functionality through incompatible programming interfaces
     • Difficult to extract and collate data automatically
     This is neither the only nor the best way!
  8. 8. EASY ACCESS, HENCE MANY NEW SYSTEMS
     SOFTWARE SYSTEMS
     • Life sciences software, and toxicology software in particular, live in their own world
     • Mostly developed independently
     • Compatibility is rarely perceived as a primary design goal
     • Lack of communication and common goals
     THE SILO EFFECT
     • Silo storage system: designed to store one single type of grain
     • Information silo: rigid design; no easy exchange of information; no integration with other systems
  9. 9. THE NEED FOR INTEGRATION
     2005: “Integrated Informatics in Life and Materials Sciences: An Oxymoron?” *
     • Calculations, descriptors, statistics, models
     • Data (chemical structures, properties, predictions)
     Approaches toward integration:
     • Workflow management systems
     • Standalone container applications (chassis)
     • Web applications
     • Web services, web mashups
     * Gilardoni, F., Curcin, V., Karunanayake, K., Norgaard, J., & Guo, Y. (2005). QSAR Combinatorial Science, 24(1), 120-130.
  10. 10. WORKFLOW MANAGEMENT SYSTEMS
     • Commercial: Accelrys Pipeline Pilot
     • Open source: Kepler, KNIME, Taverna, Triana
     • Workflow repositories: MyExperiment, http://www.myexperiment.org/
     Advantages:
     • Flexibility
     • Reproducibility
     Disadvantages:
     • Overhead: 30% of the tasks defined and run in workflows are related to data conversion, rather than data analysis.
     • Convenience: domain experts often prefer simple user interfaces or software with predefined functionality; power users prefer scripting languages, rather than GUIs or graphical workflow builders with their specific constraints.
     • Compatibility: nodes are not transferable between workflow engines.
  11. 11. INTEGRATION PLATFORMS
     CONTAINER APPLICATIONS
     • OECD Toolbox: Windows only; OECD / ECHA funded
     • Bioclipse: multiplatform, Java based; standard OSGI interface for modules; open source
     WEB TOOLS
     • ChemBench
     • ChemMine
     • Collaborative Drug Discovery (CDD)
     • OCHEM
     • OpenTox
     Recently reviewed in Jeliazkova, N. (2012). Web tools for predictive toxicology model building. Expert opinion on drug metabolism & toxicology, 8(5), 1-11.
  12. 12. CHEMBENCH * (screenshot, word cloud)
     * Walker, T., Grulke, C. M., Pozefsky, D., & Tropsha, A. (2010). Chembench: a cheminformatics workbench. Bioinformatics (Oxford, England), 26(23), 1-2.
  13. 13. CHEMMINE TOOLS * (screenshot, word cloud)
     * Backman, T. W. H., Cao, Y., & Girke, T. (2011). ChemMine tools: an online service for analyzing and clustering small molecules. Nucleic Acids Research, 39(Web Server issue), W486-W491. Oxford University Press.
  14. 14. COLLABORATIVE DRUG DISCOVERY (CDD) * (screenshot, word cloud)
     * Hohman, M., Gregory, K., Chibale, K., Smith, P. J., Ekins, S., & Bunin, B. (2009). Novel web-based tools combining chemistry informatics, biology and social networks for drug discovery. Drug Discovery Today, 14(5-6), 261-270. Elsevier Ltd.
  15. 15. OCHEM * (screenshot, word cloud)
     * Sushko, I., Novotarskyi, S., Körner, et al. (2011). Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information. Journal of computer-aided molecular design,
  16. 16. OPENTOX *, ** (screenshot, word cloud)
     * Hardy, B., et al. (2010). Collaborative development of predictive toxicology applications. Journal of cheminformatics, 2(1), 7.
     ** Jeliazkova, N., et al. (2011). AMBIT RESTful web services: an implementation of the OpenTox application programming interface. Journal of cheminformatics, 3, 18.
  17. 17. MORE WEB PROJECTS *
     * Jeliazkova, N. (2012). Web tools for predictive toxicology model building. Expert opinion on drug metabolism & toxicology, 8(5), 1-11.
  18. 18. OPENTOX ALGORITHMS
     Uniform interface (OpenTox web services API):
     • Descriptor calculation, feature selection
     • Classification and regression algorithms
     • Rule based algorithms
     • Applicability domain algorithms
     • Visualization, similarity and substructure queries
     • Composite algorithms (workflows)
     • Structure optimization, metabolite generation, tautomer generation, etc.
     An algorithm (/ˈælɡərɪðəm/) is a step-by-step procedure for calculations. Algorithms are used for calculation, data processing, and automated reasoning.
  19. 19. UNIFORM APPROACH TO DATA PROCESSING
     Read data from a web address – process – write to a web address.
     Dataset + Algorithm = Dataset
     • Dataset: http://myhost1.com/dataset/trainingset1
     • Algorithm: http://myhost2.com/algorithm/{myalgorithm}
     • Resulting dataset: http://myhost3.com/dataset/results
     Each resource is shown with the operations GET, POST, PUT and DELETE.
     Once we have a set of uniform building blocks, we can build new tools.
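The read, process, write pattern on slide 19 can be sketched as a single HTTP request. This is a minimal sketch under stated assumptions, not the OpenTox client: it only prepares a POST to the algorithm's address without sending it, the form field name `dataset_uri` is an assumption made for illustration, and `myalgorithm` stands in for the `{myalgorithm}` placeholder on the slide.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_processing_request(algorithm_uri: str, dataset_uri: str) -> Request:
    """Prepare (but do not send) a POST asking an algorithm to process a dataset."""
    # The field name "dataset_uri" is an assumption for this sketch.
    body = urlencode({"dataset_uri": dataset_uri}).encode("ascii")
    return Request(
        algorithm_uri,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Placeholder hosts taken from the slide; "myalgorithm" is a hypothetical name.
request = build_processing_request(
    "http://myhost2.com/algorithm/myalgorithm",
    "http://myhost1.com/dataset/trainingset1",
)
```

Sending the request (for example with `urllib.request.urlopen`) would need a running service. The point of the sketch is that the input is passed by address rather than by value, so the algorithm service can read the data from wherever it is hosted.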
  20. 20. UNIFORM APPROACH TO MODELS BUILDING
     Read data from a web address – process – write to a web address.
     Dataset + Algorithm = Model
     • Dataset: http://myhost1.com/dataset/trainingset1
     • Algorithm: http://myhost2.com/algorithm/{myalgorithm}
     • Resulting model: http://myhost.com/model/predictivemodel1
     Each resource is shown with the operations GET, POST, PUT and DELETE.
     Once we have a set of uniform building blocks, we can build new tools.
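The resource algebra on slides 19 and 20 can be illustrated without any network at all. The following is a toy, in-memory sketch: the `resources` dictionary stands in for the web, the `example.org` addresses and all function names are hypothetical, and the "algorithm" is a trivial mean predictor chosen only to keep the example short.

```python
from statistics import mean

resources = {}  # URI -> stored resource; stands in for the web in this sketch


def publish(kind: str, content) -> str:
    """Store a resource and return its newly assigned (hypothetical) URI."""
    uri = f"http://example.org/{kind}/{len(resources) + 1}"
    resources[uri] = content
    return uri


def mean_algorithm(dataset_uri: str) -> str:
    """Dataset + Algorithm = Model: train a trivial mean predictor."""
    values = resources[dataset_uri]
    return publish("model", {"prediction": mean(values), "trained_on": dataset_uri})


def apply_model(model_uri: str, dataset_uri: str) -> str:
    """Applying a model to a dataset writes the predictions to a new dataset URI."""
    model = resources[model_uri]
    predictions = [model["prediction"] for _ in resources[dataset_uri]]
    return publish("dataset", predictions)


training_uri = publish("dataset", [10.0, 20.0, 30.0])
model_uri = mean_algorithm(training_uri)
results_uri = apply_model(model_uri, training_uri)
```

Because every intermediate result receives its own address, the trained model can be looked up, reused, or traced back to its training set later, which is one way to read the "side effect" of model management mentioned on the next slide.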
  21. 21. ALGORITHM & MODEL SERVICES
     Model management is obtained as a side effect.
  22. 22. OPENTOX WEB SERVICES AS BUILDING BLOCKS
     WEB APPLICATIONS
     • http://ToxPredict.org – aggregates remote predictions
     • OpenTox wrapper for OCHEM/CADASTER web services
     • QMRF database
     • Applicability domain used by the CADASTER web site
     BIOCLIPSE-OPENTOX *
     * Willighagen E., Jeliazkova N., Hardy B., Grafstrom R., Spjuth O., Computational toxicology using the OpenTox application programming interface and Bioclipse, BMC Research Notes 2011 4 (1), 487
  23. 23. HOW TO CREATE/PUBLISH A MODEL
     • Upload the training set and rebuild the model. Example: upload the Open Melting Point dataset and run the linear regression or random forest algorithm.
     • Develop an OpenTox API compatible solution that can train and run predictive models. Examples: Lazar models, OpenTox partner models (BG, DE, CH, GR, RU).
     • Use thin wrappers for third-party models and expose them through the compatible web service API. OCHEM models are exposed as OpenTox API compatible models using this approach.
     All models become potentially visible to client applications (ToxPredict, Bioclipse), subject to access rights.
  24. 24. OPEN MELTING POINT DATASET
     Uploading a dataset makes it structure and similarity searchable; access rights can be controlled. A unique URI is assigned. This is a web service as well.
  25. 25. LOAD DATA, BUILD MODEL, APPLY MODEL
  26. 26. MELTING POINT MODELS COMPARISON
  27. 27. EXAMPLE: REPDOSE DATASETS
     Similarity search results
  28. 28. PUBLISH/SHARE A MODEL
     • Upload the training set and rebuild the model. Use existing algorithms (descriptors, statistics), under the existing license. Host on existing servers.
     • Develop an OpenTox API compatible solution that can train and run predictive models. Guaranteed exact reproduction of the model! Any license (closed or open source). Host anywhere.
     • Use thin wrappers for third-party models and expose them through the compatible web service API. Guaranteed exact reproduction of the model! Any license (closed or open source). Host anywhere.
     Any OpenTox resource can be assigned restricted access.
  29. 29. PUBLISH/SHARE A MODEL
     The advantage of a clean underlying API:
     • Compounds, properties, datasets, algorithms and models
     Any kind of processing can be formalized as an algorithm or a model:
     • Applicability domain is an algorithm
     • Structural alerts are models! Models of human expertise
     • We don’t need a separate database to handle structural alerts
     • Just publish the structural alerts as models and apply the models to your dataset!
  30. 30. DATASET COMPARISON USING OPENTOX ALGORITHMS AND MODELS
     [Figure: two charts, "ECHA Preregistration list" and "PubChem"; vertical axis 0 to 100; categories: Yes, No]
  31. 31. MODELS FOUND IN THE LITERATURE: AMBIGUOUS!
     Cramer rules (textual description):
     • Q29. Readily hydrolysed: how to implement, abiotic or biotic?
     • Context: oral toxicity
     • Yet there are different implementations
     • Knowledge of the model internals is essential; compare predictions!
     Substructure alerts:
     • Have come a long way since the practice of publishing only textual descriptions
     • Mostly SMARTS
     • Custom languages in some of the tools
     • No established procedure to extend SMARTS or SMILES, hence all the custom extensions
  32. 32. SMARTS
     Multiple implementations:
     • Not entirely compatible
     • No real standard, except the Daylight web page
     Steep learning curve (even for experts):
     • Q: Does the SMARTS query [CH2][NH2] match [CH2][NH2+]?
     • A: YES. Any attribute which is not specified in the SMARTS is not tested. So if you do not mention formal charge for an atom, any charge is allowed. The same is true for any other query attribute.
     • Source: http://blueobelisk.shapado.com/questions/does-smarts-query-ch2-nh2-match-ch2-nh2
     Validate published structural alerts!
     • Publishing structural alerts should be no different from publishing QSAR models and should require a validation dataset
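The answer quoted on slide 32 ("any attribute which is not specified is not tested") can be made concrete with a toy atom matcher. This is not a SMARTS parser and it ignores bonds entirely; the `Atom` class, the dictionary-based queries and all names are hypothetical stand-ins written only to show the matching rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    element: str
    hydrogens: int
    charge: int


def atom_matches(query: dict, atom: Atom) -> bool:
    """Test only the attributes the query specifies; ignore all others."""
    return all(getattr(atom, name) == wanted for name, wanted in query.items())


# Toy stand-ins for the SMARTS atoms [NH2] and [NH2+].
query_nh2 = {"element": "N", "hydrogens": 2}                    # charge left unspecified
query_nh2_plus = {"element": "N", "hydrogens": 2, "charge": 1}  # charge specified

neutral_amine = Atom("N", hydrogens=2, charge=0)
charged_amine = Atom("N", hydrogens=2, charge=1)

unspecified_matches_charged = atom_matches(query_nh2, charged_amine)
specified_matches_neutral = atom_matches(query_nh2_plus, neutral_amine)
```

The query without a charge constraint accepts both the neutral and the charged atom, while the query that names a charge accepts only the charged one. An alert author who intends "neutral amine only" therefore has to say so explicitly, which is one reason validation datasets for published alerts matter.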
  33. 33. THE EXPERT KNOWLEDGE IS AMBIGUOUS
     • Modes of action
     • Adverse outcome pathways
     Effectopedia (http://www.qsari.org/index.php/software/100-effectopedia), an attempt to gather the expert knowledge:
     • Hopefully will not become yet another silo
     • Could be useful if it adopts the approach that biology and bioinformatics use to capture similar information
     Ontologies; semantic annotation:
     • Knowledge representation with strong grounding in logic and computer science
     • Allow automatic reasoning
     • Many computer frameworks
     • Huge amount of biology/bioinformatics resources
     Ontology: a formal, shared conceptualization of a domain.
  34. 34. CHEBI
     Chemical entities of biological interest, http://www.ebi.ac.uk/chebi/
  35. 35. CHEBI
  36. 36. OBO FOUNDRY
  37. 37. BIOPORTAL
  38. 38. BIOPORTAL
     Search BioPortal for “hepatocellular necrosis”
  39. 39. WIKI PATHWAYS
  40. 40. LINKED OPEN DATA CLOUD
     http://www.w3.org/DesignIssues/LinkedData.html
     SPARQL: http://linkedlifedata.com/sparql
  41. 41. BIOINFORMATICS
     • Proliferation of databases
     • Compatibility is an issue
     • Many (relatively recent) open data initiatives / standardisation efforts
     • MIABE: Minimum Information about a Bioactive Entity, Nature Reviews Drug Discovery (2011): “Industry and academic actors from chemistry world agree on new bioactive molecule standard”
     • ISA-TAB: Susanna-Assunta Sansone et al., Toward interoperable bioscience data, Nature Genetics 44, 121–126 (2012), http://isa-tools.org
  42. 42. CHEMINFORMATICS
     • Historically, the cheminformatics world has been driven by de facto standards, developed and proposed by different vendors. Examples: SDF, MOL, SMILES, PDB
     • No agreed way to modify or extend the formats!
     • A number of (relatively recent) initiatives have adopted open standardisation procedures. Examples: InChI, CML, Blue Obelisk initiatives, ToXML
     • No requirements for independent interoperable implementations so far
  43. 43. DATA STANDARDS (LACK OF)
     Example record (field and value):
       ID 73; CAS 00091-66-7; NAME N,N-Diethylaniline; WA 2.873; Mv 0.58; H-073 0; nCb- 1; MAXDP 0.333; nN 1; Log1/LC50 Exp 3.959; Y-Pred. 3.51; Hat 0.034
     SDF excerpt:
       OpenBabel11300911173D
       26 26 0 0 0 0 0 0 0 0999 V2000
       0.2812 0.8575 -0.6609 N 0 0 0 0 0
       -0.3126 2.0484 -1.0695 C 0 0 0 0 0
       …… [output skipped]
       11 25 1 0 0 0
       11 26 1 0 0 0
       M END
       $$$$
     SDF files come in many different flavours.
  44. 44. DATA STANDARDS (LACK OF)
     CPDBAS: Carcinogenic Potency Database
       http://www.epa.gov/ncct/dsstox/sdf_cpdbas.html#SDFFields
       ActivityOutcome: active; unspecified/blank; inactive
     ISSCAN: Chemical Carcinogens Database
       http://www.iss.it/ampp/dati/cont.php?id=233&lang=1&tipo=7
       Canc: 3 = carcinogen; 2 = equivocal; 1 = noncarcinogen
     QMRF (SDF excerpt):
       NAME 0616091047
       56 55 0 0 0 0 0 0 0 0999 V2000
       0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       1.4940 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       …. [output skipped]
       20 55 1 0 0 0 0
       20 56 1 0 0 0 0
       M END
       > <Eye irritation SP pred.>
       4.380000114440918
       > <CAS>
       30399-84-9
       > <Type>
       Testing
       $$$$
     No common meaning of SDF fields! Each tool provides means to map them into its internal semantics.
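The mapping step that slide 44 says every tool has to provide can be sketched as a small lookup table. The field names and codes (`ActivityOutcome`, `Canc`, and their values) come from the slide; the shared vocabulary on the right-hand side, the treatment of "active" as "carcinogen", and the function name `harmonise` are assumptions made for illustration only.

```python
# Field names and raw codes are from the slide; the target vocabulary is hypothetical.
FIELD_MAPPINGS = {
    "CPDBAS": ("ActivityOutcome", {
        "active": "carcinogen",
        "inactive": "noncarcinogen",
        "unspecified/blank": "unknown",
    }),
    "ISSCAN": ("Canc", {
        "3": "carcinogen",
        "2": "equivocal",
        "1": "noncarcinogen",
    }),
}


def harmonise(source: str, record: dict) -> str:
    """Translate one source-specific SDF field value into the shared vocabulary."""
    field, codes = FIELD_MAPPINGS[source]
    raw = record.get(field, "").strip()
    # Missing or unrecognised values are reported as "unknown" rather than guessed.
    return codes.get(raw, "unknown")


cpdbas_label = harmonise("CPDBAS", {"ActivityOutcome": "active"})
isscan_label = harmonise("ISSCAN", {"Canc": "3"})
```

Both records end up with the same label even though the original files use different field names and different encodings. Keeping the table as data rather than code is a design choice that makes the mapping itself something that could be published and reviewed.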
  45. 45. OPENTOX FEATURES AND ONTOLOGY LINKS
     Every data field is a resource with a unique URI and metadata assigned (not only a name!).
  46. 46. DATA CURATION
     “JUST structure-name validation is a long, torturous, iterative task.” Antony Williams, RSC, “ChemSpider: A Crowdsourcing Environment for Hosting and Validating Chemistry Resources”, 5th Meeting on U.S. Government Chemical Databases and Open Chemistry, Frederick, MD, August 25-26, 2011
     Approaches:
     • Manual, crowdsourced
     • A “standard” workflow (these may differ across toolkits and models!)
     • Compare with PubChem
     • Compare as many sources as possible
     “obviously, cheminformaticians must only use correct chemical structures and biological activities in their studies”
  47. 47. OPENTOX DATABASE QUALITY ASSURANCE
     Automatic classification | Initial quality label assigned
     Consensus                | OK
     Majority                 | Probably OK for the structure that belongs to the majority; Probably ERROR for the structure(s) that belong(s) to the minority
     Ambiguous                | Unknown (multiple sources)
     Unconfirmed              | Unknown (single source)
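The labelling rules on slide 47 can be sketched as a short function. This is a sketch under stated assumptions, not the OpenTox or AMBIT implementation: structures are compared as plain strings, "majority" is taken to mean the single most frequent structure, and a tie for first place is treated as the ambiguous case. The source names and structure placeholders in the example are hypothetical.

```python
from collections import Counter


def quality_labels(structures_by_source: dict) -> dict:
    """Label each source's structure by how well it agrees with the other sources."""
    if not structures_by_source:
        return {}
    if len(structures_by_source) == 1:
        # Unconfirmed: a single source cannot be cross-checked.
        return {source: "Unknown" for source in structures_by_source}
    counts = Counter(structures_by_source.values())
    if len(counts) == 1:
        # Consensus: every source reports the same structure.
        return {source: "OK" for source in structures_by_source}
    (top, top_n), (_, second_n) = counts.most_common(2)
    if top_n == second_n:
        # Ambiguous: several sources, but no clear majority (assumed tie rule).
        return {source: "Unknown" for source in structures_by_source}
    # Majority: the most frequent structure is probably right, the rest probably wrong.
    return {
        source: "Probably OK" if structure == top else "Probably ERROR"
        for source, structure in structures_by_source.items()
    }


labels = quality_labels({"source1": "A", "source2": "A", "source3": "B"})
```

In practice the comparison key would have to be a normalised structure representation, and the choice of normalisation is itself one of the curation decisions the preceding slide warns about.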
  48. 48. OPENTOX DATABASE QUALITY LABELS DISTRIBUTION (entire database)
     [Figure: distribution of quality labels per data source; axis 0 to 100; legend: OK, ProbablyOK, Unknown, Probably ERROR]
     Data sources shown: ChemIDplus (20110503), Chemical Identifier Resolver…, ChemDraw (20110505), CPDBAS, DBPCAN, EPAFHM, FDAMDD, HPVCSI, HPVISD, IRISTR, KIERBL, NCTRER, NTPBSI, NTPHTS, ISSCAN, ISSMIC, ISSSTY, TOXCST, TXCST2, ECETOC, LLNA, LLNA-2nd compilation, Benchmark Data Set for pKa…, Benchmark Data Set for In Silico…, Bursi AMES Toxicity Dataset, EPI_AOP, EPI_BCF, EPI_BioHC, EPI_Biowin, EPI_Boil_Pt, EPI_Henry, EPI_KM, EPI_KOA, EPI_Kowwin, EPI_Melt_Pt, EPI_PCKOC, EPI_VP, EPI_WaterFrag, EPI_Wskowwin, ECBPRS, NAME2STRUCTURE (OPSIN), PubChem Structures + Assays, Leadscope_carc_level_2, Leadscope_ccris_genetox, Leadscope_cder_chronic, Leadscope_cder_genetox, Leadscope_cder_repro_dev, Leadscope_cfsan_acute, Leadscope_cfsan_chronic, Leadscope_cfsan_genetox, Leadscope_cfsan_repro_dev, Leadscope_fda_marketed_drugs, Leadscope_genetox_level_2, Leadscope_ntp_genetox, Pharmatrope_AERS_hepatobiliary_s…
  49. 49. QMRF DATABASE: NOT ERROR FREE!
  50. 50. CHEMICAL STRUCTURE REPRESENTATION: IT DEPENDS!
     http://tinyurl.com/smilesquiz
  51. 51. CHEMICAL STRUCTURE REPRESENTATION: IT DEPENDS!
     http://tinyurl.com/smilesquiz
  52. 52. PROLIFERATION OF NEW TOOLS THAT ARE RARELY ABLE TO TALK TO EACH OTHER
     • Chemistry & biology software and databases may continue to live in their own worlds, unless we want data shared and tools interoperable.
     • Interoperability / standards may affect business models.
     THE SILO EFFECT
     • Silo storage system: designed to store one single type of grain
     • Information silo: rigid design; no easy exchange of information; no integration with other systems
  53. 53. THE INTERNET & INTERNET STANDARDS
     Internet Engineering Task Force (IETF) working groups are responsible for developing and reviewing specifications intended as Internet Standards. The process starts by publishing a Request for Comments (RFC); the goal is peer review or to convey new concepts or information.
     The IETF accepts some RFCs as Internet standards via its three-step standardisation process. If an RFC is labelled as a Proposed Standard, it needs to be implemented by at least two independent and interoperable implementations, further reviewed, and after correction it becomes a Draft Standard. With a sufficient level of technical maturity, a Draft Standard can become an Internet Standard.
     Organisations such as the World Wide Web Consortium and OASIS support collaboration on open standards for software interoperability.
     The existence of the Internet itself, based on compatible hardware and software components and services, is a demonstration of the opportunities offered by collaborative innovation, flexibility, interoperability, cost effectiveness and freedom of action.
     A unique example of what society can achieve by adopting common standards.
  54. 54. SCIENTIFIC DATABASES IMPLEMENTATION
     • Identify the data model and functionality
     • Translate the data model into a database schema
     • Implement the database and user interface functionality
     • (Optionally) provide libraries or expose (some of) the functionality as web services
     Advantages:
     • Use one’s favourite technology and jump directly into implementation
     • Attract end users with a nice GUI relatively quickly
     • Relatively easy to persuade funding organisations that this will be a useful resource
     Disadvantages:
     • Proliferation of incompatible resources that provide similar functionality through incompatible programming interfaces
     • Difficult to extract and collate data automatically
     This is neither the only nor the best way!
  55. 55. DON’T LIVE IN THE PAST
     “Twenty to thirty years ago, most applications were written to solve a particular problem and were bound to a single database. The application was the only way data got into and out of the database. Today, data is much more distributed and data consistency, particularly in the face of extreme scale, poses some very interesting challenges”
     http://www.zdnet.com/blog/microsoft/microsoft-big-brains-dave-campbell/1749
  56. 56. A COMMON API FIRST, MULTIPLE INDEPENDENT INTEROPERABLE IMPLEMENTATIONS LATER
     Advantages:
     • Compatibility!
     • Facilitates collation of distributed resources!
     • Avoids proliferation of incompatible resources (this, however, only makes sense if the API is adopted beyond a single implementation)
     • Easy to develop multiple GUI applications once the API/library functionality is in place
     Disadvantages:
     • Think first, then implement. GUI comes last
     • Harder to persuade funding organisations (because reviewers usually look for nice GUIs)
     Integration means compatibility and interaction; NOT necessarily storing everything in a single place.
  57. 57. DECENTRALIZED INFORMATION INTEGRATION
     • Distributed, yet sufficiently interoperable model for information access.
     • The future convergence between cheminformatics and bioinformatics databases poses new challenges to the management and analysis of large data sets.
     • Evolution towards the right mix of flexibility, performance, scalability, interoperability, sets of unique features offered, friendly user interfaces, programmatic access for advanced users, platform independence, results reproducibility, curation and crowdsourcing utilities, collaborative sharing and secure access.
     • Interoperability is key.
  58. 58. WHY DOES INTEROPERABILITY MATTER
     Facilitates:
     • Comparison of data, models and prediction results
     • A key to decision making
     No need to wait until the perfect standard emerges:
     • Annotate and link the sources
     • The least powerful standard wins!
     There will always be new databases and tools:
     • Let them talk to each other
  59. 59. THANK YOU!
     Acknowledgments:
     • OpenTox REST API: http://opentox.org/dev/apis
     • Download the AMBIT implementation of the OpenTox API and launch your own OpenTox service: http://ambit.sourceforge.net
