4/30/10 Ph.D defense


  • The topic of my dissertation is service-oriented architecture for integration of bioinformatic data and applications
  • This is a summary of the main contributions of this dissertation. I am going to focus on these two main topics.
  • This is the outline of my talk today. First I am going to introduce SOA.
  • SOA is an architectural style of distributed computing. Services are the basic units of an SOA. Each individual service is an abstracted, logical view of actual programs, databases, and business processes. The features of SOA include separation of the standard interface from the implementation, platform neutrality, and a standardized format for message passing. These features facilitate the integration of applications developed by different groups. There are three roles in SOA: the service provider, the service broker, and the service requester.
  • Web services are one realization of SOA. They include a number of standards and protocols. SOAP, WSDL, and UDDI are the three basic standards used for the creation of web services.
  • Research in the SOA area is attracting considerable interest from diverse research communities. Semantic web, grid computing, and peer-to-peer (P2P) technologies are applied in the realization of SOA. We summarize the current research trends in merging these technologies using a Venn diagram. Adding a meaningful description of the interface using semantic web technology can avoid ambiguous interpretations of information and service descriptions and increase the soundness of search results. Grid computing is a computing platform that is intended to integrate resources (data and computational resources) from different organizations, called a virtual organization, in a shared, coordinated, and collaborative way. P2P computing exchanges information in a completely decentralized manner. A P2P architecture does not have a single point of failure, and its information is up to date. The semantic web enhances the automation of the service discovery and service composition processes. Grid computing provides more flexibility for SOA in computational resource sharing. P2P increases scalability and reliability in the service discovery and workflow execution processes.
  • Nowadays, advanced sequencing technologies make raw sequences accumulate much faster than they can be understood and analyzed. The heterogeneity of the independently developed data sources makes the analysis process extremely difficult. Computer scientists need to provide environments and tools that make data analysis faster in order to close the gap. One of the difficulties is that these data sources are independently developed. The distributed, heterogeneous data formats make data sharing and integration more difficult.
  • Several large database providers take advantage of the features of service-oriented architecture and publish their data and analysis tools as services to increase interoperability. Several active middleware projects aim to provide infrastructure to compose, manage, and integrate these services. However, community efforts are still needed to provide more shared and reliable services, as well as practical projects that demonstrate best practices for building SOA-based systems using these technologies.
  • This is one of the objectives of the Mother of Green project.
  • The Mother of Green project is a collaborative project with Dr. Romero-Severson for studying the deep phylogeny of the plastid. The project builds on the existing technologies of web services and the Taverna workbench to produce a service-oriented approach for designing and executing phylogenetic comparisons automatically. The project provides an environment to support scientific investigations and increase productivity.
  • A better understanding of the evolution of the plastid can be applied to drug design for treating human malaria. Plastids are organelles descended from cyanobacteria. As more cyanobacterial and plastid genomes are sequenced, the study of plastid genomics and phylogeny may lead to advances in the treatment of diseases such as malaria. Malaria is caused by a parasite called Plasmodium falciparum. This parasite contains a plastid genome, which does not exist in humans or other animals. Therefore, a drug that disrupts the function of this plastid (the apicoplast) might be harmless to humans. However, at present the phylogeny of the apicoplast is not clear. A phylogenomics approach that examines the genes, the linear order of the genes, the proteins, and the temporal order of protein expression of related organisms can suggest possible apicoplast functions. Such an approach requires the extraction and analysis of genomic information from diverse scientific disciplines: plant, algal, and cyanobacterial systematics, plant biochemistry, animal parasitology, genetics, and cell biology. The problem is the accurate identification of relatives, or even closely related plastid genes, of known function. Surprisingly, malaria parasites harbor a plastid similar to plant chloroplasts, which they acquired by engulfing (or being invaded by) a eukaryotic alga and retaining the algal plastid as a distinctive organelle encased within four membranes (see endosymbiotic theory). Animals and insects have only two genomes. Target the third genome: no harm to animals.
  • A typical phylogenetic analysis workflow includes querying complete genome sequences, finding the coding gene sequences in each genome sequence, performing a preliminary sequence alignment, choosing candidate genes, and then feeding them into more sophisticated phylogenetic analysis tools. The problem is the accurate identification of relatives, or even closely related plastid genes, of known function.
  • The distribution and heterogeneity of the data make the service-oriented workflow approach a feasible solution, one that provides the flexibility to add new services and develop new workflows.
  • The system we designed and implemented has three layers. The middle layer interacts with other data and service providers, uses the services they provide, and integrates them into the system. The middle layer also provides services that can be accessed through the web interface or integrated into other applications.
  • The system has a local database to store the sequences retrieved from multiple data sources. It also records the experiments performed by users.
  • The system has a table-based service/workflow registry that stores some properties of the services and workflows provided in the system. In the first prototype, this registry is not intended to support service discovery and workflow creation. It helps answer end users' questions about their experimental results: which algorithm was used to generate the data, and what is the source of the input data?
  • The system captures several types of metadata to facilitate sequence queries and experimental data tracking.
  • The execution of services and workflows that require long running times is managed through two components.
  • Most of the services provided in the system can be accessed through this web site. These services offer users: easy and rapid extraction of DNA and protein sequences from public databases into a local database, which saves scientists months of repetitive searching, downloading, and data management; painless automatic reformatting of the extracted data for commonly used analytical tools; preliminary data inspection and analysis using these tools within the web-services environment, which permits inspection of many conserved gene candidates and enables the investigator to rapidly determine the suitability of a chosen gene for deep phylogenetic analysis; user-specified additions to the local database, which allow sequences to be uploaded into the local database; and user-specified additions to the automated queries, which provide a free-text search interface for constructing data sets of interest. The raw sequence data is collected from the remote databases. Users can query the local database to find a subset of these sequences, and they can manipulate the set by adding or deleting sequences. After the data set is created, it can be fed into the data analysis tools as needed. Users can use the job management service to monitor executed jobs and query their inputs and outputs.
  • This is a screen capture of the Taverna workbench, which we used to create and test workflows.
  • This is another example workflow, which we use to collect genome sequences from the remote database in FASTA format and XML format. The coding sequences are extracted by parsing the XML file.
  • In this first prototype design and implementation, we worked closely with the end user. The approach we took is sufficient to provide a reliable system for users to do their preliminary investigations. The long-term goal of the MoG project is to provide services and data that can be shared with other researchers with the same interests. We apply semantic web technology to give our data, services, and workflows more meaningful descriptions.
  • The semantic web is an evolving extension of the World Wide Web in which web content can be expressed not only in natural language but also in a form that can be understood, interpreted, and used by software agents, thus permitting them to find, share, and integrate information more easily. The terms and vocabularies used to describe web information are defined by ontologies. OWL is the standard used to publish and share sets of terms. Some elements of the semantic web are expressed as prospective future possibilities that have yet to be implemented or realized. Other elements are expressed in formal specifications. These include the Resource Description Framework (RDF), a variety of data interchange formats (e.g. RDF/XML, N3, Turtle, N-Triples), and notations such as RDF Schema (RDFS) and the Web Ontology Language (OWL), all of which are intended to formally describe concepts, terms, and relationships within a given knowledge domain. RDF is a simple data model for referring to objects ("resources") and how they are related. An RDF-based model can be represented in XML syntax. RDF Schema is a vocabulary for describing properties and classes of RDF resources, with a semantics for generalization hierarchies of such properties and classes. OWL adds more vocabulary for describing properties and classes: among others, relations between classes (e.g. disjointness), cardinality (e.g. "exactly one"), equality, richer typing of properties, characteristics of properties (e.g. symmetry), and enumerated classes. SPARQL is a protocol and query language for semantic web data sources. Common metadata vocabularies (ontologies) and maps between vocabularies allow document creators to know how to mark up their documents so that agents can use the information in the supplied metadata (so that Author in the sense of 'the Author of the page' won't be confused with Author in the sense of a book that is the subject of a book review).
RDF is used to represent information and to exchange knowledge on the Web. OWL is used to publish and share sets of terms called ontologies, supporting advanced Web search, software agents, and knowledge management. DIG reasoner: an OWL DL ontology can be translated into a description logic representation, which is a decidable fragment of first-order logic (FOL). Reasoning tasks: subsumption, superclasses of a class, consistency.
  • The formal format used to represent web information is RDF. RDF is a graph model of statements.
  • The principle in using the ontological modules for data and service annotation is to use existing ontology vocabulary whenever possible and to augment the existing ontological modules when necessary. We use two common existing ontology modules in the bioinformatics field to increase interoperability.
  • This is a sample data annotation defined using terms from these ontological modules. By following the path, users can track the origin of their data results and find related data sets that may generate similar results.
  • This is a sample service annotation. A service requester can find a service that has operations that accept nucleotide_sequence as a parameter. By following the path, the URI of the service and the operation name can be returned.
  • We implemented the annotation and query components on top of the Sesame RDF store. Individual data items can be annotated each time data is generated by a service or workflow. This is an example of a query that can be encoded into the query component.
  • Ontologies are large and contain many concepts, so it is difficult to find accurate concepts for annotation. The myGrid project has thousands of services, but only several hundred are annotated with semantics. Annotating a service requires expertise to guarantee accuracy.
  • In addition to adding semantic annotations for the data and services in the system, we also developed a new approach to better reuse verified knowledge and workflows.
  • One of the main features of SOA is the capability of composing several services to form a more complex process. An ideal workflow system should let users define workflows based on their understanding of each service, from a high-level abstract definition down to a low-level composition of actual services. For example, an investigator may be interested in knowing whether the gene genealogies for ATP subunits alpha, beta, and gamma are different. When users have detailed knowledge of the services provided in the system, they may form a workflow using these executable services.
  • There are several existing workflow systems that provide an environment for designing, executing, and monitoring workflows.
  • We expect that the knowledge collected during the workflow composition process can help translate higher-level definitions into lower-level executable workflows.
  • Abstract workflow and concrete workflow are two terms introduced in the Pegasus system. An abstract workflow depicts the scientific analysis, including the data used and generated, but does not include information about the resources needed for execution. A concrete workflow is an executable workflow that includes details of the execution environment. We extend the term abstract workflow to represent a workflow that is defined using a high-level functional description and other properties. It does not necessarily contain the actual executable services.
  • The identification of connectivity is based on the semantic description of services.
  • The correctness of match detection for connectivity needs to be guaranteed by accurate annotation. Binary classification table.
  • Give a recommendation to the user at each step to find the next matching service. We don't simply discard the knowledge we gain here. We want our system to save the information and lessons learned during this process. This knowledge can be used to help future workflow design.
  • Load into the RDF store. Apply the matching rule to generate the connectivity graph. Apply the shortest-path algorithm to find a path.
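The path-finding step in these notes can be illustrated with a small sketch. It assumes the connectivity graph is already available as an adjacency dictionary and uses breadth-first search as the shortest-path algorithm; the talk does not say which algorithm was used, and the edges below are hypothetical, reusing operation names that appear on the connectivity slide.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search over an unweighted graph.

    Returns the shortest list of operations from start to goal, or None.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical connectivity graph: an edge A -> B means that some output
# of operation A was judged to match some input of operation B.
connectivity = {
    "QueryLocal.createSet": ["ClustalW.runClustalWdf"],
    "ClustalW.runClustalWdf": ["FormatConversion.convert"],
    "FormatConversion.convert": [],
}

path = shortest_path(connectivity, "QueryLocal.createSet", "FormatConversion.convert")
print(path)
```

Breadth-first search is enough here because every edge counts the same; a weighted variant would be needed if edges carried quality-of-service costs.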
  • Graph matching algorithm
  • Computational requirements in MoG for real phylogeny. The current work solves information gathering and collection. GridSAM is an open-source job submission and monitoring web service for jobs managed by a variety of Distributed Resource Managers (DRMs). Grid computing technology needs to be integrated as the MoG project focuses more on large computational phylogenetic analyses. Open Middleware Infrastructure Institute.

    1. 1. Service-oriented architecture for integration of bioinformatic data and applications
       Xiaorong Xiang
       Department of Computer Science and Engineering, University of Notre Dame
    2. 2. Contributions
       • Survey of research issues and challenges in service-oriented computing (Chapter 2)
       • Built an SOA-based system for supporting bioinformatics research (Chapter 3)
       • Explored the deep phylogeny of the plastid with the system (Chapter 4)
       • Enhanced the system with semantic web technology and a novel approach to reusing workflows (Chapters 5 & 6)
    3. 3. Outline
       • Introduction to SOA
       • MoG project and MoGServ
       • Ontological data and service representation model
       • Knowledge and workflow reuse
    4. 4. SOA – an architectural style of distributed computing
       • Why SOA
           • Reusability
           • Interoperability
           • Security
           • Maintenance
           • Save cost when integrating applications
       • Adoption of SOA
           • e-Business
           • e-Science
           • e-Government
       [Figure: the three SOA roles (Service Requester, Service Broker, Service Provider) and their numbered interactions (1–5): Publish interface, Discovery, Invoke]
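The publish, discovery, and invoke interactions among the three roles on this slide can be sketched in a few lines. This is only an in-process illustration: `ServiceBroker`, the service name `reverse_complement`, and the function behind it are hypothetical stand-ins, not part of MoGServ or of any web-service standard.

```python
class ServiceBroker:
    """Toy registry: providers publish an interface, requesters discover it."""

    def __init__(self):
        self._registry = {}

    def publish(self, name, endpoint):
        self._registry[name] = endpoint

    def discover(self, name):
        # Returns None when no provider has published under this name.
        return self._registry.get(name)


def reverse_complement(sequence):
    """Hypothetical provider implementation, hidden behind the interface."""
    return sequence[::-1].translate(str.maketrans("ACGT", "TGCA"))


broker = ServiceBroker()
broker.publish("reverse_complement", reverse_complement)  # provider publishes
service = broker.discover("reverse_complement")           # requester discovers
result = service("GATTACA")                                # requester invokes
print(result)
```

The requester only depends on the published name and call shape, which is the separation of interface from implementation that the slide lists as a reason for adopting SOA.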
    5. 5. Web services – one realization of SOA
       [Figure: web services protocol stack]
       • Network Transport Protocols: TCP/IP, HTTP, SMTP, FTP, etc.
       • Meta Language: XML
       • Services Communication: SOAP (Simple Object Access Protocol)
       • Services Description: WSDL (Web Service Description Language)
       • Service Publishing & Discovery: UDDI (Universal Description, Discovery and Integration)
       • Business Process Execution: BPEL4WS, WFML, WSFL, BizTalk, …
       • Additional WS* Standards: …
       • Transactions, Management, Security
    6. 6. SOA research orientations
       [Figure: Venn diagram of Semantic Web, Grid Computing, and Service-oriented Architecture (Web Service), with numbered overlaps (1–3) labeled Semantic Web Service, Semantic Grid, Open Grid Service Architecture (OGSA), and Semantic Grid Service]
       The P2P technology plays an important role in increasing scalability and reliability in the service discovery and workflow execution process
    7. 7. Bioinformatics today
       • Rapidly accumulating data: DNA sequences, contigs, expression data, ontologies, annotations, etc.
       • Non-standard independently developed heterogeneous data sources
       • Data sharing, data integration, and security
       From the article “Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade” by Folker Meyer, CTWatch Quarterly, August 2006, volume 2, number 3
    8. 8. SOA in Bioinformatics
       [Figure: large public databases and others recently exposing data & analysis tools as services; middleware projects and others providing infrastructure to compose, manage, execute, and connect the distributed services]
       • MORE
       • Community efforts needed to provide more shared and reliable services
       • More demonstration projects needed => best practices, measured utility, feedback to middleware projects, etc.
    9. 9. Outline
       • Introduction to SOA
       • MoG project and MoGServ
       • Ontological data and service representation model
       • Knowledge and workflow reuse
    10. 10. Mother of Green (MoG) project
       • Biological science
           • In collaboration with Prof. Jeanne Romero-Severson, Biological Sciences, University of Notre Dame.
           • Study the deep phylogeny of the plastid
       • Computer science
           • Provide an environment to support scientists’ investigations
           • A case study of using SOA for data and application integration
           • A prototype for future research in the service-oriented architecture domain
    11. 11. MoG project – one motivation
       • Malaria causes 1.5 - 2.7 million deaths every year
       • 3,000 children under age five die of malaria every day
       • Plasmodium falciparum (P. falciparum) causes human malaria
       • Targeted drug design through phylogenomics
       • P. falciparum has three genomes: nuclear, mitochondrial, plastid (apicoplast)
           • Find the ancestors of the apicoplast, for a better understanding of the evolution of the plastid
           • Identify genes in the ancestors
           • Determine gene function
       [Figures: P. falciparum; apicoplast in P. falciparum]
    12. 12. A typical in-silico investigation
       Data-driven research workflow:
       • A: Query complete genome sequences given a taxon
       • B: Query protein coding genes for each genome sequence
       • C: Eliminate vector sequences
       • D: Sequence alignment
       • E: Phylogenetic analysis
    13. 13. Challenges (time-consuming manual web-based operations)
       • Data collection and information gathering
           • Rapid accumulation of raw sequence information
           • Rate of accumulation is increasing
           • Information accumulates faster than analyses finish
           • Information in forms not readily accessible
       • Analysis tool usage
       • Experimental data recording
       • Repetitive experiments for scientific discovery
    14. 14. MoGServ System Architecture
       [Figure: layered architecture diagram. Labels: Web Interface, Applications, Services Access Client, Others; MoGServ Middle Layer with Application Server, Data Access Services, Data Analysis Services, Job Manager, Job Launcher, Service/Workflow Registry, Metadata Search, Local Data Storage, Workflow/SOAP Engines; Data/Services Providers with NCBI, DDBJ, EMBL, Services, Others]
    15. 15. Data storage and access services
       • Local database
           • Integrating data from multiple data sources according to scientists' interests
           • Supporting repetitive investigations against several subsets of sequences
           • Avoiding network traffic and service failure when retrieving data on-the-fly from public data sources
       • Accessing the data in the local database by services
    16. 16. Service and workflow registry
       • A table-based description with necessary properties
           • Text description
           • Service location
           • Input/output
           • Provider
           • Version
           • Algorithm
           • Invocation method
       • Not intended for supporting service discovery or composition at current stage
       • A repository of services and workflows used by local application developers
    17. 17. Indexing and querying metadata
       • Metadata
           • Service and workflow description
           • Description of sequence data in order to track the origination of data
           • Experimental data output, input, and intermediate data
       • Indexing and querying with keyword
           • Lucene
           • Implemented as services
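Keyword indexing of metadata, as listed on this slide, can be illustrated with a minimal inverted index. This is a plain-Python stand-in and not the Lucene API that MoGServ used; the record ids and descriptions are made up for the example.

```python
from collections import defaultdict

def build_index(records):
    """Map each lowercase token to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in text.lower().split():
            index[token].add(record_id)
    return index

def search(index, query):
    """Return ids of records that contain every keyword in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits

# Hypothetical metadata records (sequence descriptions and a job record).
metadata = {
    "seq-1": "atpA gene plastid source NCBI",
    "seq-2": "atpB gene cyanobacteria source DDBJ",
    "job-7": "ClustalW alignment input atpA atpB",
}
index = build_index(metadata)
print(sorted(search(index, "atpA")))
```

Because matching is purely on tokens, a query for "alignment" would not find a record that only says "aligning", which is the syntactic-search limitation the deck later contrasts with semantic search.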
    18. 18. Service and workflow enactment
       [Figure: enactment flow]
       • INPUT: task name, parameters
       • Job Manager: find the service/workflow definition using the task name (Service/Workflow Registry); form a job description
       • OUTPUT: job ID
       • Job Launcher (Timer): instances of workflow/service engines
       • Job information
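One possible reading of this enactment flow is sketched below: a job manager looks up the task in the registry and records a job description, and a launcher later hands queued jobs to an engine. All class names, fields, and the registry entry are hypothetical; the real components ran as services against a database, and the launcher would presumably be driven by the timer shown in the figure.

```python
import itertools

class JobManager:
    """Looks up a task definition and records a job description."""

    def __init__(self, registry):
        self.registry = registry
        self.jobs = {}
        self._ids = itertools.count(1)

    def submit(self, task_name, parameters):
        definition = self.registry[task_name]  # KeyError for unknown tasks
        job_id = next(self._ids)
        self.jobs[job_id] = {
            "task": task_name,
            "definition": definition,
            "parameters": parameters,
            "status": "queued",
        }
        return job_id


class JobLauncher:
    """Runs queued jobs; in a real system this would be called on a timer."""

    def __init__(self, manager, engines):
        self.manager = manager
        self.engines = engines

    def run_pending(self):
        for job in self.manager.jobs.values():
            if job["status"] == "queued":
                engine = self.engines[job["definition"]["engine"]]
                job["output"] = engine(job["parameters"])
                job["status"] = "finished"


# Hypothetical registry entry and engine.
registry = {"align": {"engine": "service", "location": "hypothetical://clustalw"}}
engines = {"service": lambda params: f"aligned set {params['setid']}"}

manager = JobManager(registry)
job_id = manager.submit("align", {"setid": 42})
JobLauncher(manager, engines).run_pending()
print(job_id, manager.jobs[job_id]["status"])
```

Returning the job ID immediately and running the work later is what lets long-running services and workflows be monitored after submission.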
    19. 19. Implementation
       • Development and deployment
           • J2EE, JSP, XSLT
           • Tomcat 5.0.18 / Axis 1_2RC2
       • Database
           • PostgreSQL 8.1
       • Index and search of metadata
           • Apache Lucene library
       • Service implementation
           • Java2WSDL
           • Wrap command line applications with JLaunch library
       • Workflow
           • Taverna workbench, part of myGrid project
           • Freefluo workflow engine
    20. 21. Taverna workbench
    21. 22. A more complex workflow
    22. 23. Issues with the first prototype
       • Metadata description
           • Solution
               • Index-based (keyword syntactic search)
               • Capture most properties to support the end-users' requirements
               • Support data provenance
           • Limitation
               • Similar to most services in the bioinformatics community
               • Lack of semantic description (goal => semantic search)
       • Failure tolerance and recovery
           • Solution
               • Statically encode alternative services in the workflow to prevent service failure
               • Record status of the service and workflow execution into the database for possible recovery strategy
               • Multiple workflow engines deployment to prevent the hardware or network failure
           • Limitation
               • No dynamic service selection (more semantic description support) during execution time
               • No fine grained resource management and monitoring
       • Security
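Statically encoding alternative services, as described under failure tolerance, amounts to trying a fixed list of providers in order. The sketch below assumes that reading; the function names and the two stand-in services are hypothetical.

```python
def call_with_alternatives(services, payload):
    """Try each (name, callable) in order; return the first successful result."""
    errors = []
    for name, service in services:
        try:
            return name, service(payload)
        except Exception as exc:  # any failure moves on to the next alternative
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all services failed: {errors}")


def primary(query):
    # Stand-in for a remote provider that is currently unavailable.
    raise ConnectionError("primary provider unavailable")


def mirror(query):
    # Stand-in for an alternative provider offering the same operation.
    return f"record for {query}"


used, record = call_with_alternatives([("primary", primary), ("mirror", mirror)], "atpA")
print(used, record)
```

The list of alternatives is fixed at design time, which matches the limitation noted on the slide: nothing here selects a service dynamically at execution time.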
    23. 24. Outline
       • Introduction to SOA
       • MoG project and MoGServ
       • Ontological data and service representation model
       • Knowledge and workflow reuse
    24. 25. Semantic web
       • Semantic web vision
           • Giving meaning (semantics) to web-based information
           • Machine-understandable such that software agents can autonomously process them
       • Two standards: OWL & RDF
           • The Web Ontology Language (OWL)
               • Defines common vocabularies for specifying the concepts and relationship among concepts
           • Resource Description Framework (RDF)
               • Formal format for encoding web content using defined vocabularies
       • Semantic web for Bioinformatics
           • UniProt RDF project
       • Semantic web for SOA
           • Automated service discovery, composition
    25. 26. Resource Description Framework (RDF)
       [Figure: example RDF graph. Node and edge labels: http://www.nd.edu/~mog, #hasCreator, #gmadey, #hasFullName, Gregory Madey, #hasTitle, #professor, #hasPersonalSite, http://www.nd.edu/~gmadey, #hasTextDescription, MoG is a … project, #hasResearchTopic, #bioinformatics, #hasFundedBy, #foundation. Legend: Literal, Resource; the URI before # provides the definition of these vocabularies]
       • A graph model of statements, a set of triples:
           • Predicate (Subject, Object)
       • Representations:
           • RDF/XML
           • N-triples
           • Turtle
       • A standard format to connect web information
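The idea of a graph as a set of triples can be shown with plain tuples. The labels below come from the slide's figure, but which node each edge connects is an assumption, because the figure was flattened during extraction; the `match` helper is hypothetical and is not an RDF library call.

```python
# Each statement is a (subject, predicate, object) tuple.
# Edge endpoints are an assumed reading of the flattened figure.
triples = {
    ("http://www.nd.edu/~mog", "#hasCreator", "#gmadey"),
    ("http://www.nd.edu/~mog", "#hasResearchTopic", "#bioinformatics"),
    ("#gmadey", "#hasFullName", "Gregory Madey"),
    ("#gmadey", "#hasTitle", "#professor"),
    ("#gmadey", "#hasPersonalSite", "http://www.nd.edu/~gmadey"),
}

def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every position that is not None."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Follow two edges: project -> creator -> full name.
creator = match(triples, s="http://www.nd.edu/~mog", p="#hasCreator")[0][2]
full_name = match(triples, s=creator, p="#hasFullName")[0][2]
print(full_name)
```

Chaining two lookups like this is the "following the path" operation the talk uses for tracking where data came from.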
    26. 27. Ontological modules used for semantic description of data, services & workflows
       [Figure: Generic Service Description Ontology (myGrid/Feta model), Service Domain Ontology (myGrid), and Application Domain Ontology (MoGServ) feed the software components for annotation, which annotate the data, services, and workflows of the MoGServ application and store the results in the RDF store]
    27. 28. MoGServ Application Domain Ontology
       • To better track the data origination
       • To support the automation of workflow creation
       • To better share the data on the web in the future
       Example concepts and properties defined in MoGServ (property: domain -> range):
       • hasSetName: Set -> XML:String
       • isInstanceOf: Job -> Service
       • isParentOf: Set -> Set
       • invokedby: Job -> User
       [Table: number of concepts and number of object and datatype properties for each ontological module (myGrid/Feta model, myGrid, MoGServ); the cell values could not be recovered from the extraction]
    28. 29. Sample data annotation – metadata from MoG local database
       [Figure: RDF graph displayed by Rdf-Gravity]
    29. 30. Sample service/workflow annotation
       Question: Which service has an operation that accepts nucleotide_sequence as a parameter?
       Answer: Uri: http://www.ebi.ac.uk …/alignment:blastn_ncbi, OperationName: Run
       [Figure: RDF graph displayed by Rdf-Gravity]
    30. 31. Implementation of annotation and query components for data, services & workflows
       • Sesame 1.2.6 library
           • Supports files, RDBMS, SeRQL
       [Figure: annotation components (annotation templates for data and for services) and query components (query templates) built on the Sesame RDF store]
       Example SeRQL query:
           Select Y, W, X
           from {Y} mg:hasOperation {W} mg:inputParameter {X} rdf:type {mog:set}
           using namespace
               rdf = <http://www.w3.org/1999/02/22-rdf-syntax-ns#>,
               mg = <http://www.mygrid.org.uk/ontology#>,
               mog = <http://almond.cse.nd.edu:10000/mog#>
       Result:
           Service: http://almond.cse.nd.edu:10000/axis/services/ClustalW?wsdl
           Operation: runClustalWdf
           inputParameter: setid
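The path pattern in the SeRQL query on this slide (service, hasOperation, operation, inputParameter, parameter, rdf:type, mog:set) can be mimicked over an in-memory triple list. This is a rough sketch and not how Sesame evaluates SeRQL; the ClustalW triples mirror the result shown on the slide, while the second service is added only as a non-matching contrast.

```python
triples = [
    ("ClustalW", "mg:hasOperation", "runClustalWdf"),
    ("runClustalWdf", "mg:inputParameter", "setid"),
    ("setid", "rdf:type", "mog:set"),
    # Contrast case: its parameter has a different semantic type.
    ("FormatConversion", "mg:hasOperation", "convert"),
    ("convert", "mg:inputParameter", "filen"),
    ("filen", "rdf:type", "mygrid:sequence_alignment_report"),
]

def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def operations_accepting(triples, semantic_type):
    """Return (service, operation, parameter) for each full path match."""
    results = []
    for service, predicate, operation in triples:
        if predicate != "mg:hasOperation":
            continue
        for parameter in objects(triples, operation, "mg:inputParameter"):
            if semantic_type in objects(triples, parameter, "rdf:type"):
                results.append((service, operation, parameter))
    return sorted(results)

matches = operations_accepting(triples, "mog:set")
print(matches)
```

Each nested loop corresponds to one hop in the query's path expression, so extending the pattern by another hop means adding another level of lookup.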
    31. 32. Limitations
       • The MoGServ ontology is not complete
           • Contains a small portion of necessary concepts used for tracking the data provenance
       • Service domain ontology is not complete
           • Needs more concepts as more services are published
       • Challenges of using semantic web in general
           • Ontology creation, never complete
           • Data and service annotation accuracy, efficiency
           • Ontology integration
    32. 33. Outline
       • Introduction to SOA
       • MoG project and MoGServ
       • Ontological data and service representation model
       • Knowledge and workflow reuse
    33. 34. Three user-defined workflows from different views
       Question: “are gene genealogies for ATP subunit α, β, γ different?”
       • Workflow A, defined by a less experienced user using the functional definition of services. Steps shown: Retrieving, Aligning
       • Workflow B, defined by an intermediate user with executable services. Services shown: queryGene, clustalW
       • Workflow C, defined by an expert user with two extra executable services to ensure the accurate output of the biological process. Services shown: queryGene, setIds, setFilter, clustalW
    34. 35. Limitations of current workflow management systems
       • Existing workflow management systems and bioinformatics middleware
           • Taverna, Kepler, Triana, Pegasus
           • Design, execute, monitor, re-run
       • Support ad-hoc, semi-automated and automated service discovery and composition from scratch
       • Our approach: reuse the verified knowledge and workflow
           • Increase the correctness over time
           • Provide more accurate guidelines
    35. 36. Enhanced workflow system
       [Figure: system diagram with the following components]
       • User: create abstract workflow using ontology
       • Service annotator: annotate services using ontology
       • Ontology and DL reasoner
       • Semantics-enabled service registry, semantics-enabled service discovery, service matchmaking: find appropriate service
       • Workflow composer (software agent/experienced users): abstract workflow to concrete workflow
       • Workflow execution engine
       • Data provenance management: collect and manage information about data origination
       • Knowledge base management, knowledge discovery
    36. 37. Our hierarchical workflow structure
       [Figure: workflow levels, compared with the Pegasus workflow structure]
       • Abstract workflow (Task A, Task B) -> concrete workflow (Service A, Service B, Service C, Service D): encode, convert the high-level definition to a low-level executable
       • Concrete workflow -> optimal workflow (Service A, Service B, Service C’, Service D): replace individual services with their optimal alternatives
       • Optimal workflow -> workflow instance (input, output): invoke a workflow with specific input data and record the data provenance and performance of services and workflows
    37. 38. Reusable knowledge
       • Connectivity
           • Helps to convert from abstract workflow to concrete workflow
       • Alternatives and quality-of-service profiles
           • Helps to convert from concrete workflow to optimal workflow
       • Mapping of abstract workflow and concrete workflow
           • Helps to choose reusable workflows
    38. 39. Connectivity identification (match detection). Parameter notation: name(data type, semantic type). Service: QueryLocal; operation: createSet; performTask: mygrid:retrieving; inputPara: Settype(String, mog:gene), Queryterm(String, null); outputPara: Setid(string, mog:geneset); useResource: MoG. Service: ClustalW; operation: runClustalWdf; performTask: mygrid:aligning; inputPara: Setid(String, mog:set), Sequencetype(String, mog:sequence); outputPara: filen(string, mygrid:sequence_alignment_report); useResource: EBI. Service: FormatConversion; operation: convert; performTask: mygrid:translating; inputPara: filen(String, mygrid:sequence_alignment_report); outputPara: Out(String, mygrid:nexus_paup_format); useResource: MoG. Matching rule: operation ij -> operation mn if there exists a parameter k that is an output parameter of operation ij and a parameter o that is an input parameter of operation mn such that data type(parameter o) = data type(parameter k) and semantic type(parameter o) = semantic type(parameter k).
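The matching rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the MoGServ implementation: the `Parameter`, `Operation`, `find_matches`, and `connects` names are hypothetical, data types are compared case-insensitively (the slide mixes "String" and "string"), and a null semantic type is assumed never to match. The rule is applied as strict equality of semantic types, so no ontological reasoning is involved.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Parameter:
    name: str
    data_type: str                # e.g. "String"
    semantic_type: Optional[str]  # ontology concept, or None when unannotated


@dataclass(frozen=True)
class Operation:
    service: str
    name: str
    inputs: Tuple[Parameter, ...]
    outputs: Tuple[Parameter, ...]


def find_matches(src: Operation, dst: Operation) -> List[Tuple[Parameter, Parameter]]:
    """Return every (output of src, input of dst) pair that satisfies the rule."""
    matches = []
    for out in src.outputs:
        for inp in dst.inputs:
            # Assumption: data types are compared ignoring case.
            same_data = out.data_type.lower() == inp.data_type.lower()
            # Assumption: an unannotated (None) semantic type never matches.
            same_semantic = (
                out.semantic_type is not None
                and out.semantic_type == inp.semantic_type
            )
            if same_data and same_semantic:
                matches.append((out, inp))
    return matches


def connects(src: Operation, dst: Operation) -> bool:
    """True if src -> dst holds under the strict-equality matching rule."""
    return bool(find_matches(src, dst))


# Annotations transcribed from the slide.
create_set = Operation(
    "QueryLocal", "createSet",
    inputs=(Parameter("Settype", "String", "mog:gene"),
            Parameter("Queryterm", "String", None)),
    outputs=(Parameter("Setid", "string", "mog:geneset"),),
)
run_clustalw = Operation(
    "ClustalW", "runClustalWdf",
    inputs=(Parameter("Setid", "String", "mog:set"),
            Parameter("Sequencetype", "String", "mog:sequence")),
    outputs=(Parameter("filen", "string", "mygrid:sequence_alignment_report"),),
)
convert = Operation(
    "FormatConversion", "convert",
    inputs=(Parameter("filen", "String", "mygrid:sequence_alignment_report"),),
    outputs=(Parameter("Out", "String", "mygrid:nexus_paup_format"),),
)
```

With these annotations, `connects(run_clustalw, convert)` holds because the `filen` parameters agree on both types. `connects(create_set, run_clustalw)` does not hold under strict equality, since `mog:geneset` and `mog:set` are different concept names; deciding whether the two should be treated as compatible is outside this sketch.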
    39. 40. Need for verified service connectivity: the mismatching problem (diagram). The match detection output (yes/no) is compared against the real match (yes/no), giving TP, FP, FN, and TN cases. Labeled conditions: accurate annotation; inaccurate annotation; lack of semantic annotation; inaccurate reasoning. Examples: GenBankService (out: GenBank record) to Blastp (in: protein sequence), marked X, with the label "mediator, adaptor, shim"; DDBJ-XML (out: sequence data record) to NCBI blast (in: sequence data record), marked X, with the labels "fasta format" and "self-defined format". Notes: can be detected automatically; may be detected by expertise at design time or after a run.
    40. 41. Connectivity graph implementation. Registration process: the registry automatically identifies the connectivity, and the knowledge base stores it. Workflow translation / service composition process: refine, update, and decompose the workflow. Relations: connect(service a, operation ai, parameter c, service b, operation bi, parameter d); identifyConnect(single service, RDF repository). Search at the syntactic level: search for a path between two nodes; search for the next available service; automatic composition based on input and output. Implementation: Dijkstra's shortest path algorithm.
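A path search over stored connect(...) tuples might look like the following sketch. It assumes that graph nodes are (service, operation) pairs and that every arc has weight 1; the slide names Dijkstra's algorithm but does not state the node granularity or the weights. The function names `build_graph` and `shortest_path` are hypothetical, and the sample tuples are illustrative entries, assumed to have already been verified and stored in the knowledge base.

```python
import heapq
from collections import defaultdict


def build_graph(connections):
    """Build adjacency sets from connect(...) tuples.

    Each tuple is (service_a, operation_a, parameter_c,
                   service_b, operation_b, parameter_d).
    """
    graph = defaultdict(set)
    for svc_a, op_a, _param_c, svc_b, op_b, _param_d in connections:
        graph[(svc_a, op_a)].add((svc_b, op_b))
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm with unit arc weights (an assumption).

    Returns the list of nodes from start to goal, or None if unreachable.
    """
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in graph.get(node, ()):
            nd = d + 1
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Illustrative connect tuples (hypothetical knowledge-base content).
connections = [
    ("QueryLocal", "createSet", "Setid", "ClustalW", "runClustalWdf", "Setid"),
    ("ClustalW", "runClustalWdf", "filen", "FormatConversion", "convert", "filen"),
]
graph = build_graph(connections)
path = shortest_path(graph, ("QueryLocal", "createSet"), ("FormatConversion", "convert"))
```

Here `path` lists the three operations in order, which is the chain a composer could propose when asked to get from a query result to a converted alignment file. Non-unit weights, for example quality-of-service scores, would only change the `nd = d + 1` line.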
    41. 42. Experiment <ul><li>Used 418 concepts from the domain ontology for semantic types; defined 10 concepts for data types. </li></ul><ul><li>Randomly generated service annotations, each with 1 input and 1 output </li></ul><ul><li>Connectivity graph for 1000 services (right side) </li></ul><ul><li>Intel Pentium Mobile, 1.5 GHz </li></ul>Path lengths: length 0 = 724, length 1 = 587, length 2 = 448, length 3 = 281, length 4 = 114, length 5 = 71, length 6 = 28, length 7 = 16, length 8 = 4, length 9 = 2. Number of matched pairs (number of services): 10 (200), 34 (400), 84 (600), 138 (800), 225 (1000). Load RDF repository (milliseconds): 3325, 3015, 2600, 2346, 1547. Average time of match detection per single service (milliseconds): 12.51, 12.35, 12.31, 13.01, 12.02. Connectivity graph: 724 nodes; 587 arcs; load time 220 milliseconds; average path search time less than 1 millisecond. Conclusion: feasible solution.
    42. 43. Reuse of workflows <ul><li>Reuse of abstract workflows </li></ul><ul><li>Reuse of concrete workflows </li></ul><ul><li>Compare structural similarity of two workflows </li></ul><ul><li>Implementation: SUBDUE algorithm </li></ul>Graph view: nodes input, output, query_term, task, task, retrieving, aligning, multiple_alignment_report; edge labels hasParameter, hasInput, hasNext, performTask, hasOutput. SUBDUE input format: vertices v 1 input, v 2 output, v 3 task, v 4 task, v 5 query_term, v 6 retrieving, v 7 aligning, v 8 multiple_aligning_report; edges e 3 4 hasNext, e 3 1 hasInput, e 4 2 hasOutput, e 3 6 performTask, e 4 7 performTask, e 1 5 hasParameter, e 2 8 hasParameter.
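The vertex and edge listing on this slide can be produced mechanically from a labeled workflow graph. The sketch below is an illustration only: `to_subdue` is a hypothetical helper, not part of SUBDUE, and it emits only the `v <id> <label>` and `e <source> <target> <label>` lines exactly as they appear on the slide. Vertices carry a separate key because two vertices share the label "task".

```python
from typing import List, Tuple


def to_subdue(vertices: List[Tuple[str, str]],
              edges: List[Tuple[str, str, str]]) -> str:
    """Serialize a labeled graph as 'v' and 'e' lines.

    vertices: (key, label) pairs; ids are assigned in list order from 1.
    edges: (source key, target key, label) triples.
    """
    ids = {key: i for i, (key, _label) in enumerate(vertices, start=1)}
    lines = [f"v {ids[key]} {label}" for key, label in vertices]
    lines += [f"e {ids[src]} {ids[dst]} {label}" for src, dst, label in edges]
    return "\n".join(lines)


# The abstract workflow from the slide: a retrieving task followed by an aligning task.
vertices = [
    ("in", "input"),
    ("out", "output"),
    ("t1", "task"),
    ("t2", "task"),
    ("q", "query_term"),
    ("ret", "retrieving"),
    ("ali", "aligning"),
    ("rep", "multiple_aligning_report"),
]
edges = [
    ("t1", "t2", "hasNext"),
    ("t1", "in", "hasInput"),
    ("t2", "out", "hasOutput"),
    ("t1", "ret", "performTask"),
    ("t2", "ali", "performTask"),
    ("in", "q", "hasParameter"),
    ("out", "rep", "hasParameter"),
]
subdue_text = to_subdue(vertices, edges)
```

Keeping keys separate from labels means two workflows that use the same task concepts serialize to the same labels, which is what a structural comparison needs.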
    43. 44. Pro and Con <ul><li>Pro </li></ul><ul><ul><li>Increase the correctness of the formed workflow over time </li></ul></ul><ul><ul><ul><li>Avoid incorrect or inaccurate semantic annotations </li></ul></ul></ul><ul><ul><ul><li>Take advantage of verified knowledge </li></ul></ul></ul><ul><ul><ul><li>Avoid the ontological reasoning process </li></ul></ul></ul><ul><ul><li>Better support for semi-automated and automated service composition over time </li></ul></ul><ul><ul><ul><li>Provide more accurate guidelines to users over time </li></ul></ul></ul><ul><li>Con </li></ul><ul><ul><li>The connectivity graph can be big </li></ul></ul><ul><ul><ul><li>Number of parameters </li></ul></ul></ul><ul><ul><ul><li>Number of services </li></ul></ul></ul><ul><ul><li>Searching for the connectivity of a service when it is registered in the system may take a relatively long time </li></ul></ul><ul><ul><ul><li>More complex matching rule </li></ul></ul></ul><ul><ul><ul><li>Number of parameters </li></ul></ul></ul><ul><ul><li>May not have high accuracy at the beginning </li></ul></ul>
    44. 45. Summary <ul><li>Described the design and implementation of MoGServ </li></ul><ul><li>Explored the ontological representation of data and services </li></ul><ul><li>Described a new approach for reusing workflows and the connectivity of services </li></ul>
    45. 46. Future work <ul><li>Integrate GridSam into MoGServ for execution and monitoring </li></ul><ul><li>Integrate Grid computing technology for resource allocation </li></ul><ul><li>Refine the MoGServ application domain ontology </li></ul><ul><li>Create an interface for end-user workflow creation </li></ul><ul><li>Create an interface for individual workspaces </li></ul><ul><li>Evaluate the scalability and accuracy of the connectivity graph approach and the graph matching approach with a large number of real workflows and services </li></ul>
    46. 47. Acknowledgements <ul><li>Dr. Madey </li></ul><ul><li>Dr. Romero-Severson </li></ul><ul><li>Dr. Flynn </li></ul><ul><li>Dr. Striegel </li></ul><ul><li>Dr. Chaudhary </li></ul><ul><li>Dr. Collins </li></ul><ul><li>Mr. Eric Morgan </li></ul><ul><li>Dr. Jean-Christophe Ducom </li></ul><ul><li>Partially supported by the Indiana Center for Insect Genomics (ICIG) with funding from the Indiana 21st Century fund </li></ul>
    47. 48. Publications <ul><li>X. Xiang, G. Madey and J. Romero-Severson, “A Service-oriented Data Integration and Analysis Environment for In-Silico Experiments and Bioinformatics Research”, Proceedings of the 40th Annual Hawaii International Conference on System Sciences (CD-ROM), January 3-6 2007, Computer Society Press. </li></ul><ul><li>Xiaorong Xiang and Greg Madey, “A Semantic Web Services Enabled Web Portal Architecture”, IEEE International Conference on Web Services (ICWS 2004), San Diego, July 2004. </li></ul><ul><li>Xiaorong Xiang and Greg Madey, “Improving the reuse of scientific workflows and their by-products”, International Conference on Web Services (ICWS 2007). Under review. </li></ul><ul><li>Xiaorong Xiang and Eric Lease Morgan, “Exploiting ‘Light-weight’ Protocols and Open Source Tools to Implement Digital Library Collections and Services”, D-Lib Magazine, October 2005, Volume 11, Number 10. </li></ul>
    48. 49. Publications planned <ul><li>One journal paper for BMC Bioinformatics </li></ul><ul><ul><li>Chapter 3, chapter 4, chapter 5 </li></ul></ul><ul><li>Future IEEE ICWS proceedings </li></ul><ul><ul><li>Chapter 6 </li></ul></ul><ul><li>Biology journal – TBD </li></ul><ul><ul><li>Results from using MoGServ </li></ul></ul>
    49. 50. Thank you
    50. 51. Summary <ul><li>A practical demonstration of building an SOA-based system </li></ul><ul><li>Applied in a bioinformatics application to study deep phylogeny </li></ul><ul><ul><li>Easy and rapid extraction of DNA and protein sequences from public databases to a local database, which saves scientists months of repetitive searching, downloading, and data management. </li></ul></ul><ul><ul><li>Painless reformatting of the extracted data for commonly used analytical tools. </li></ul></ul><ul><ul><li>Preliminary data inspection and analysis using these tools within the web-services environment, which permits inspection of many conserved gene candidates and enables the investigator to rapidly determine the suitability of the chosen gene for deep phylogenetic analysis. </li></ul></ul><ul><ul><li>User-specified additions to the local database, which allow users to upload sequences into the local database. </li></ul></ul><ul><ul><li>User-specified additions to the automated queries, which provide a free-text search interface for constructing data sets of interest. </li></ul></ul>
    51. 52. Ontological modules <ul><li>Generic service description ontology </li></ul><ul><ul><li>Feta model from myGrid </li></ul></ul><ul><li>Service domain ontology </li></ul><ul><ul><li>myGrid bioinformatics ontology </li></ul></ul><ul><li>MoG application domain ontology </li></ul><ul><ul><li>Adding more concepts particularly used in the MoG project </li></ul></ul><ul><ul><li>Small portion of concepts and properties </li></ul></ul>
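    52. 53. Service provider and service requester (diagram). The service requester sends a request in XML format over the Internet; the service provider returns results in XML format. On the provider side, a software agent implements the service. On the requester side, a software agent has knowledge of the service in terms of the description, not the implementation. Shared element: service description.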
    53. 54. Adoption of SOA <ul><li>Why SOA </li></ul><ul><ul><li>Reusability </li></ul></ul><ul><ul><li>Interoperability </li></ul></ul><ul><ul><li>Security </li></ul></ul><ul><ul><li>Maintenance </li></ul></ul><ul><ul><li>Saves cost when integrating applications </li></ul></ul><ul><li>Application of SOA </li></ul><ul><ul><li>e-Business </li></ul></ul><ul><ul><li>e-Science </li></ul></ul><ul><ul><li>e-Government </li></ul></ul>
    54. 63. Data and services. Data: <ul><li>Complete genome sequences </li></ul><ul><li>ATP gene sequences </li></ul><ul><li>Sequence sets </li></ul><ul><li>Saved jobs </li></ul>Services and workflows: <ul><li>Data collection from remote database </li></ul><ul><li>Query local database </li></ul><ul><li>Data analysis tools: blast, clustalw </li></ul><ul><li>Data format conversion: readseq </li></ul><ul><li>Management of data sets and jobs </li></ul><ul><li>Download and upload </li></ul>
    55. 64. The information gathering problem <ul><li>Rapid accumulation of raw sequence information </li></ul><ul><ul><li>~100 sequenced chloroplast genomes </li></ul></ul><ul><ul><li>~57 sequenced cyanobacterial genomes </li></ul></ul><ul><ul><li>Rate of accumulation is increasing </li></ul></ul><ul><ul><li>Information accumulates faster than analyses finish </li></ul></ul><ul><ul><li>Information in forms not readily accessible </li></ul></ul><ul><li>Solution </li></ul><ul><ul><li>Semi-automated web-services </li></ul></ul><ul><ul><li>“Smart” web-services </li></ul></ul><ul><ul><li>Semantic web </li></ul></ul>
    56. 65. Time consuming manual web-based operations <ul><li>Data collection </li></ul><ul><ul><li>Copy & paste! </li></ul></ul><ul><li>Analysis tool usage </li></ul><ul><ul><li>Copy & paste! </li></ul></ul><ul><li>Experiment data recording </li></ul><ul><ul><li>Copy & paste! </li></ul></ul><ul><li>Repetitive experiments for scientific discovery </li></ul><ul><ul><li>Copy & paste! </li></ul></ul>
    57. 66. MoGServ system architecture <ul><li>MoGServ interface </li></ul><ul><ul><li>Web interface </li></ul></ul><ul><ul><li>Application interface </li></ul></ul><ul><li>MoGServ middle layer </li></ul><ul><ul><li>Data access storage </li></ul></ul><ul><ul><li>Data and analysis services </li></ul></ul><ul><ul><li>Service and workflow registry </li></ul></ul><ul><ul><li>Indexing and querying metadata </li></ul></ul><ul><ul><li>Service and workflow enactment </li></ul></ul><ul><li>Acting in two roles: service requester and service provider </li></ul>
    58. 67. MoG project <ul><li>Find the ancestors of the apicoplast to better understand the evolution of plastids </li></ul><ul><li>Identify genes in the ancestors </li></ul><ul><li>Determine gene function </li></ul><ul><li>Look for these genes in the P. falciparum nucleus </li></ul><ul><li>Then study regulatory mechanisms in candidate genes </li></ul>
    59. 68. Improvement of the system <ul><li>Use an existing domain ontology from the bioinformatics community to describe services, workflows, and data </li></ul><ul><li>Integrate grid computing technologies to address security and resource allocation issues </li></ul><ul><li>Integrate semantic web technology to support end-user workflow creation based on users' knowledge of the scientific domain </li></ul>