Session 43 :: Accessing data using a common interface: OGSA-DAI as an example


Speaker: Elias Theocharopoulos

Published in: Technology, Health & Medicine

  1. Sessions 43 & 44
     Accessing data using a common interface: OGSA-DAI as an example
     Elias Theocharopoulos and Tilaye Alemu
     ISSGC ‘09 – Sophia Antipolis – Tuesday, 14th July 2009
  2. Overview
     • The problem: Sharing data in a grid
     • What is OGSA-DAI?
     • Data-centric workflows
     • Key OGSA-DAI terms
     • The OGSA-DAI client toolkit
     • Use cases and extensibility points
     • Pros and cons
  3. The problem: Sharing and accessing data in a grid
  4. Distributed data resources
  5. How about a central server?
     (Diagram: the client sends a query to a central server, labelled FR, and receives data.)
  6. Central server pros and cons
     • Access to up-to-date data
     • Single point of access
     • Data in common format
     • Database can handle joins
     • Initial overhead in terms of time, effort and cost
     • Keeping data up to date
     • Loss of control by data providers
       o Assuming they even let go
     • Security and trust
  7. How about providing direct access?
     (Diagram: the client sends a query to each of the UK, ES and IA data resources, receives data from each, then translates and joins the data itself.)
  8. Direct access pros and cons
     • Access to up-to-date data
     • Fast access
     • Data providers retain control
     • Fat clients
     • Heterogeneity and inconsistency
       o Data
       o Databases
       o Connection
       o Security
     • Security overheads for data providers
       o Manage firewalls and usernames/passwords for multiple clients
     • Hard to use in grid/web service workflows
  9. How about providing a ZIP on the web?
     (Diagram: the UK, ES and IA data are each published as a ZIP over HTTP; the client issues a GET for each, then unzips, translates and joins the data.)
  10. ZIP on the web pros and cons
      • Fast access
      • Data providers retain control
      • Very large downloads even if the client only needs a subset
      • Providers have to select and ZIP their data
      • Client has to install data into a local database
      • Static snapshot
  11. Sharing distributed heterogeneous resources with OGSA-DAI
      (Diagram: the client sends a query to OGSA-DAI; OGSA-DAI queries the UK, ES, IA and FR data resources, translates and joins the data, and returns it to the client.)
  12. Motivation
      • Grid is about sharing resources
      • Need to share structured data resources
      (Diagram labels: relational database, XML database, indexed file.)
  13. What is OGSA-DAI?
      • Open Grid Services Architecture Data Access and Integration
      • A framework that executes workflows
      • Workflows are data-centric
      • Workflow components are designed for data access, integration, transformation and delivery
      • Can access heterogeneous data resources
      • Web service interface
      • Intended as a toolkit for building higher-level application-specific data services
  14. OGSA-DAI’s vision
      • Sharing data resources to enable collaboration
      • Data access
        o Structured data in distributed heterogeneous data resources
      • Data integration
        o e.g. expose multiple databases to users as a single virtual database
      • Data transformation
        o e.g. expose data in schema X to users as data in schema Y
      • Data delivery
        o To where it’s needed by the most appropriate means
        o e.g. web service, e-mail, HTTP, FTP, GridFTP
  15. OGSA-DAI and data-centric workflows
  16. OGSA-DAI workflow
      • Executes workflows
      • Workflows contain activities
        o Well-defined functional units
        o Data goes in, something is done, data comes out
        o Equivalent to programming language methods
      • Workflows are submitted by clients
        o To an OGSA-DAI web service
  17. An OGSA-DAI workflow - a simple analogy
      The client asks, in French: SELECT Pays,Capital FROM Pays
      English branch:
        o Convert query from French to English
        o Run SQL query: SELECT Country, Capital FROM Countries
        o Convert data from English to French
      Spanish branch:
        o Convert query from French to Spanish
        o Run SQL query: SELECT País, Capital FROM Países
        o Convert data from Spanish to French
      Join the data from both branches.
      English source (Countries):
        Country  Capital
        UK       London
        France   Paris
      Spanish source (Países):
        País     Capital
        España   Madrid
        Italia   Roma
      Joined result:
        Pays             Capital
        Grande-Bretagne  Londres
        France           Paris
        l'Espagne        Madrid
        l'Italie         Rome
  18. How it appears to the client
      The client submits: OGSA-DAI workflow(SELECT Pays,Capital FROM Pays)
      The client receives:
        Pays             Capital
        Grande-Bretagne  Londres
        France           Paris
        l'Espagne        Madrid
        l'Italie         Rome
  19. A query-transform-update example
      o Run SQL query
          Country  Capital
          Spain    Madrid
          Italy    Rome
      o Convert data from English to Spanish
          País     Capital
          España   Madrid
          Italia   Roma
      o Run SQL update
  20. A query-transform-join example
      File branch:
        o Read file
            FIELDS País,Capital
            ENTRY=1 España,Madrid
            ENTRY=2 Italia,Roma
        o Convert data from file to relational
        o Convert data from Spanish to French
            Pays       Capital
            l'Espagne  Madrid
            l'Italie   Rome
      Database branch:
        o Run SQL query
            Country  Capital
            UK       London
            France   Paris
        o Convert data from English to French
            Pays             Capital
            Grande-Bretagne  Londres
            France           Paris
      Join:
            Pays             Capital
            Grande-Bretagne  Londres
            France           Paris
            l'Espagne        Madrid
            l'Italie         Rome
  21. Data integration with OGSA-DAI workflows
      • Across OGSA-DAI services
      (Diagram: Workflow 1, on the OGSA-DAI server for DB1, runs SQLQuery (DB1) -> Deliver to OGSA-DAI. Workflow 2, on the OGSA-DAI server for DB2, runs Receive from OGSA-DAI and SQLQuery (DB2) -> JOIN -> Deliver.)
  22. Key OGSA-DAI terms: activities, resources, workflows
  23. OGSA-DAI: Key Term Activity
      • An activity is a named unit of functionality
        o A well defined workflow unit
        o Pluggable
        o Composable
      • An activity can have
        o 0 or more named inputs
        o 0 or more named outputs
      • Blocks of data flow from an activity’s output into another activity’s input
  24. OGSA-DAI: Key Term Activity (cont.)
      • Example activities include
        o Execute an SQL query
        o ZIP a batch of data
        o List the files in a directory
        o Execute an XSL transform on an XML document
        o Deliver data to an FTP server
  25. OGSA-DAI: Key Term Activity (cont.)
      • Activity Connections
        o All required inputs must be connected
        o All outputs must be connected
        o Optional inputs
      • Inputs
        o Literal
        o Streamed
        o Types
  26. Connecting activities - examples
  27. Data grouping: Lists
      • Special blocks are used to mark the beginning and the end of a list.
      • A list groups related data as one unit.
      • For example, ReadFromFileActivity can dynamically take any number of filenames as input.
        o Without a way to group the output byte arrays, we would have no way to tell the binary data of file f1 apart from that of file f2.
        o Streaming is preserved, since each file produces a number of byte arrays that are forwarded to subsequent activities.
      (Diagram: f1,f2 -> ReadFromFileActivity -> [byte[]…],[byte[]…])
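The grouping idea on this slide can be sketched in plain Java. This is only an illustration under assumptions: `ListMarkerSketch`, `LIST_BEGIN` and `LIST_END` are hypothetical names and are not taken from the OGSA-DAI API. The sketch shows how begin and end marker blocks let a consumer keep the byte arrays of each file apart in a single flat stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; none of these names come from the OGSA-DAI API.
class ListMarkerSketch {
    // Sentinel blocks that mark the start and end of one list in the stream.
    static final Object LIST_BEGIN = new Object();
    static final Object LIST_END = new Object();

    // Splits a flat stream of blocks into one group per LIST_BEGIN/LIST_END pair.
    static List<List<byte[]>> group(List<Object> stream) {
        List<List<byte[]>> groups = new ArrayList<>();
        List<byte[]> current = null;
        for (Object block : stream) {
            if (block == LIST_BEGIN) {
                current = new ArrayList<>();
            } else if (block == LIST_END) {
                if (current != null) {
                    groups.add(current);
                }
                current = null;
            } else if (current != null) {
                current.add((byte[]) block);
            }
        }
        return groups;
    }

    // Example stream: file f1 arrives as two chunks, file f2 as one chunk.
    static List<Object> exampleStream() {
        List<Object> stream = new ArrayList<>();
        stream.addAll(Arrays.asList(LIST_BEGIN, new byte[] {1, 2}, new byte[] {3}, LIST_END));
        stream.addAll(Arrays.asList(LIST_BEGIN, new byte[] {4}, LIST_END));
        return stream;
    }
}
```

Calling `ListMarkerSketch.group(ListMarkerSketch.exampleStream())` yields two groups, one per file, even though the chunks arrived in one stream.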
  28. Passing data internally: OGSA-DAI Tuple
      • A special type of data passed between activities
      • A Tuple is a data representation similar to a row of relational data. Each element of a Tuple represents a column.
      • Tuples are normally grouped in lists and are preceded by a metadata block.
      (Diagram: SqlQuery runs SELECT city, temp FROM weather; and outputs the tuples (Athens, 20), (Madrid, 22), (Rome, 25).)
  29. An interesting activity: Tee
      • Some activities operate at the level of blocks and are not concerned with the type and values of the data they handle, e.g. TeeActivity.
      (Diagram: [A,B,C,D] -> TeeActivity, number of outputs: 2 -> [A,B,C,D] and [A,B,C,D])
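A minimal sketch of the Tee behaviour, assuming nothing about the real TeeActivity beyond what the slide shows: every input block is copied, unchanged and in order, to each output. `TeeSketch` and `tee` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; this is not the OGSA-DAI TeeActivity implementation.
class TeeSketch {
    // Copies every input block, unchanged and in order, to each of the outputs.
    static <T> List<List<T>> tee(Iterable<T> input, int numberOfOutputs) {
        List<List<T>> outputs = new ArrayList<>();
        for (int i = 0; i < numberOfOutputs; i++) {
            outputs.add(new ArrayList<>());
        }
        for (T block : input) {
            for (List<T> output : outputs) {
                output.add(block); // the block itself is never inspected
            }
        }
        return outputs;
    }
}
```

Because the method is generic in `T`, it works for any block type, which mirrors the point that such activities do not care what the data is.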
  30. OGSA-DAI: Key Term Resource
      • Data request execution resource
      • Data resources
      • Data sources
      • Data sinks
      • Sessions
        o A state container associated with a set of workflows
        o One workflow can lodge state
        o A subsequent workflow can retrieve it
      • Requests
        o One per workflow submitted to a DRER
        o Access request status
  31. OGSA-DAI: Key Term Workflow
      • A workflow can contain:
        o Activities
          • Resource-based: SQLQuery
          • Non-Resource: Transformation and Delivery
        o Resources
          • Targeted by Activities
        o Other Workflows
          • Sub workflows
          • Other types of workflow
  32. OGSA-DAI: Key Term Workflow (cont.)
      • OGSA-DAI can be used as a workflow processing system that is designed to stream data through a set of activities in a pipelined manner.
      • In the Query->Transform->Deliver workflow, if the activities are well defined, all three will run concurrently, each processing a different portion of the data stream.
  33. OGSA-DAI: Key Term Workflow (cont.)
      • Pipeline workflow: a set of chained activities that are executed in parallel, with data flowing between the activities.
      • Sequence workflow: all the sub-workflows added to this workflow are executed in sequence. For example, the 1st sub-workflow in a sequence creates a table and the 2nd bulk loads transformed data into this table.
      • Parallel workflow: all the sub-workflows added to this workflow are executed in parallel.
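The pipelined Query->Transform->Deliver behaviour can be imitated with threads and bounded queues. This is a sketch under assumptions, not OGSA-DAI code: `PipelineSketch`, the `END` marker and the upper-casing "transform" are all made up for illustration. The small queue capacity shows why streaming keeps the memory footprint low: a stage blocks until the next stage has consumed earlier blocks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration of pipelined streaming; not OGSA-DAI code.
class PipelineSketch {
    static final String END = "__END__"; // marks the end of the data stream

    static List<String> run(List<String> rows) {
        BlockingQueue<String> queried = new ArrayBlockingQueue<>(2);
        BlockingQueue<String> transformed = new ArrayBlockingQueue<>(2);
        List<String> delivered = new ArrayList<>();

        // "Query" stage: emits rows one block at a time.
        Thread query = new Thread(() -> {
            try {
                for (String row : rows) {
                    queried.put(row);
                }
                queried.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // "Transform" stage: starts work as soon as the first block arrives.
        Thread transform = new Thread(() -> {
            try {
                String block;
                while (!(block = queried.take()).equals(END)) {
                    transformed.put(block.toUpperCase(Locale.ROOT));
                }
                transformed.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        query.start();
        transform.start();

        // "Deliver" stage: runs on the calling thread.
        try {
            String result;
            while (!(result = transformed.take()).equals(END)) {
                delivered.add(result);
            }
            query.join();
            transform.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return delivered;
    }
}
```

A sequence workflow would instead run one stage to completion before starting the next, which is what a "create table, then bulk load" pair needs.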
  34. Getting to the first practical: The OGSA-DAI client toolkit
  35. OGSA-DAI client toolkit
      • OGSA-DAI client toolkit
        o Construct and submit requests in Java, not XML
          • Toolkit manages interaction with web services via SOAP over HTTP; it handles SOAP request construction and response parsing.
        o Provides Java abstractions of
          • Services
          • OGSA-DAI resources and properties
          • Requests
          • Activities
  36. The client toolkit
      • The workflow description is sent to the OGSA-DAI server as an XML document.
      • The application developer does not need to worry about creating this document.
      • The client toolkit provides ways of assembling activity workflows programmatically.
      • We will see how to use the client toolkit during the hands-on session.
  37. Service/resource model
      (Diagram labels: Client; Data Request Execution Service; Data Request Execution Resource (MyDRER); Data Resources One, Two and Three; Session; Request Management Service; Request (MyRequest123456).)
  38. Client Toolkit Activities
      • One client activity per server activity
      • Same input and output names
      • Plus some convenience methods, for example:
        o Retrieve results as a JDBC ResultSet from a TupleToWebRowSet activity
        o Retrieve update count as an Integer from a SQLUpdate activity
  39. Step by Step Guide for Writing Clients
      • Create activities
        o There’s a corresponding client toolkit activity for each server-side activity
            DeliverToFTP deliver = new DeliverToFTP();
            ReadFromFile readFile = new ReadFromFile();
  40. Connecting activities
      • Set inputs for each activity (e.g. parameters)
      • Every input parameter can either be a literal input or be streamed from another activity
        o Literal inputs, e.g. for constant parameters:
            deliver.addFilename("results1.txt");
            deliver.addHost("");
        o Connect an input to the output of another activity to stream data:
            deliver.connectDataInput(readFile.getDataOutput());
  41. Gaining access to the results
      • If the output of an activity can be provided in a user-friendly type, then there are methods to access the results:
        o Check whether there are more results to be retrieved
            boolean hasNext = sqlUpdate.hasNextResult();
        o Get the next result in a convenient type
            int count = sqlUpdate.getNextResult();
  42. Build and execute the Workflow Request
      • Create a workflow and add activities to it
      • A data service executes the workflow and returns a response (or an error!)
      • The response may contain data (depending on the activities)
      • Each client toolkit activity provides utility methods for retrieving its response data
  43. First hands-on session
      Go to:
  44. Extensibility points & components
  45. Extending OGSA-DAI: What
      • OGSA-DAI
        o A Framework
        o Extensible
      • Out of the Box is the basics
        o Different applications have different needs
        o New Sources of Data
        o New Functionality
  46. Extending OGSA-DAI: Overview
      (Diagram labels, by layer:)
      • Presentation Layer: OMII, GT, Axis
        o New message frameworks: UNICORE, WS-DAI, gLite, embedded, ?
      • OGSA-DAI Core: Workflow Execution Engine, Persistence and Configuration, Activity Framework
        o Activities: XPathQuery, XSLTransform, SQLQuery, DeliverToURL
        o New functionality: MyOwnActivity
      • Resources: Sessions, Request, Data Source, Data Sink, Data Resources
        o New types of data
  47. Extending OGSA-DAI: Activities
      • Activities do some unit of work
      • Specific transformation
        o Data Format: SWISS-PROT to format X
      • Delivery
        o Deliver to a target service
      • Data analysis and Integration
        o Combine data from different sources
  48. Extending OGSA-DAI: Resources
      • New resources – why?
        o New Products
        o New Applications
        o Specialised Access
      • Required:
        o DataResource
        o DataResourceState
        o ResourceAccessor
  49. Extending OGSA-DAI: Remote Resource
      • Accessing Resources on Remote OGSA-DAI
      • Avoid replication of resources
      • Security Issues
        o Devolved to Local OGSA-DAI
        o Security between OGSA-DAI Deployments
  50. SQL views
      • Define a drPatient view
        o SELECT patient.ID, patient.Name, patient.Age, patient.Sex, doctor.Name AS drName FROM patient, doctor WHERE patient.DrID = doctor.ID;
      doctor:
        ID   Name      DN
        123  Greene    US-Chicago-G
        456  Ross      US-Chicago-R
        789  Fairhead  UK-Holby-F
      patient:
        ID  Name   Age  Sex  ZIP        DrID
        1   Ken    42   M    IL1478305  456
        2   Josie  25   F    BN1 7QP    789
      • Client runs SELECT * FROM drPatient;
      • Shorthand for complex query results
      • Data access control, e.g. users of drPatient
        o Cannot access a patient’s ZIP
        o Are unaware of the doctor or patient tables
  51. OGSA-DAI SQL views
      • OGSA-DAI SQL views data resource
        o Represents a view across a database exposed by an OGSA-DAI relational resource
      • SQLQuery activity
        o Parses query
        o Splices in view definition
        o Submits transformed query to database
      • Can define views for read-only databases
      • Schema transformation
        o Map a logical schema to a physical schema
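The "splice in view definition" step can be sketched as a query rewrite in which the view name becomes an inline derived table. The slide says OGSA-DAI parses the query; the version below swaps in a naive regular expression purely for illustration, and `ViewSpliceSketch` and `splice` are hypothetical names. It would mishandle view names inside string literals or joins, which is exactly why a real implementation parses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; a regex stands in for the real query parser.
class ViewSpliceSketch {
    // Rewrites "FROM <view>" so that the view name becomes an inline derived table.
    static String splice(String query, String viewName, String viewDefinition) {
        Pattern from = Pattern.compile(
            "\\bFROM\\s+" + Pattern.quote(viewName) + "\\b",
            Pattern.CASE_INSENSITIVE);
        String derived = "FROM (" + viewDefinition + ") AS " + viewName;
        // quoteReplacement keeps '$' and '\' in the definition literal.
        return from.matcher(query).replaceAll(Matcher.quoteReplacement(derived));
    }
}
```

Since the database only ever sees the rewritten query, no view has to be created in the database itself, which is how views over read-only databases become possible.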
  52. OGSA-DAI SQL views and security
      • Factor in client’s security credentials
      • e.g. define drPatient view as
        o SELECT patient.* FROM patient, doctor WHERE patient.DrID = doctor.ID AND doctor.DN = $DN$;
      • Replace $DN$ by client’s DN provided by Grid security components
      • Doctors can only view their own patients
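A minimal sketch of the `$DN$` replacement, assuming the DN is inserted as a single-quoted SQL string literal with embedded quotes doubled. The quoting rule, `DnSubstitutionSketch` and `bind` are assumptions for illustration; the slide does not say how OGSA-DAI performs the substitution.

```java
// Hypothetical illustration; names and quoting rules are assumptions.
class DnSubstitutionSketch {
    // Replaces the $DN$ placeholder with the client's DN as a quoted SQL literal.
    static String bind(String viewDefinition, String clientDn) {
        // Double any single quotes so the DN cannot terminate the literal early.
        String literal = "'" + clientDn.replace("'", "''") + "'";
        return viewDefinition.replace("$DN$", literal);
    }
}
```

With the doctor table from the SQL views slide, binding the DN `US-Chicago-R` restricts the view to the patients of doctor 456.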
  53. Distributed query processing
      • OGSA-DQP
        o Developed by Universities of Manchester and Newcastle
        o Refactored for OGSA-DAI 3.0 by EPCC as part of the NextGrid project
        o OGSA-DAI DQP package
      • Multiple tables on multiple databases are exposed to clients as multiple tables in one “virtual database”
      • Clients are unaware of the multiple databases
      • Databases can be exposed
        o EITHER within one OGSA-DAI server
        o OR via multiple remote OGSA-DAI servers
  54. OGSA-DAI DQP
      1: The client submits
         SELECT Archeo_Finds.ID, Archeo_Finds.Provenance, Annotations_Ratings.Confidence
         FROM Annotations_Ratings, Archeo_Finds
         WHERE Annotations_Ratings.Confidence > 0.99 AND Annotations_Ratings.ID = Archeo_Finds.ID;
      2: OGSA-DAI (core + DQP coordinator) parses the query and forms a query plan
      3: Execute sub-queries on the OGSA-DAI servers that expose each table
         3a: SELECT Archeo_Finds.ID, Archeo_Finds.Provenance FROM Archeo_Finds;
         3b: SELECT Annotations_Ratings.ID, Annotations_Ratings.Confidence FROM Annotations_Ratings WHERE Annotations_Ratings.Confidence > 0.99
      4: Push results to OGSA-DAI (DQP query evaluator)
      5: Combine and post-process – do the JOIN
      6: Results are returned to the client
  55. OGSA-DAI workflows – a de-facto standard
      • OGSA-DAI workflows are a de-facto standard
        o Of use to many projects as we’ll see
      • For some applications workflows are too powerful
        o Too expressive
        o Infer semantics from names of activities available on server
          • Must interrogate the server
        o Problems using OGSA-DAI services in workflow engines, e.g. Taverna
        o Not compatible with existing data analysis tools
  56. Facades
      • Define facades on top of OGSA-DAI
      • Why?
        o Provide interfaces with more tightly-defined semantics
        o Comply with standards
        o Exploit existing data analysis tools
      • Continue to exploit the power of workflows under-the-hood
        o “Canned workflows”
        o Templates selected and populated, executed and parsed
        o Map service operations to “template” OGSA-DAI workflows
  57. Grid-enabling existing data-related products
      (Diagram labels: OGSA-DAI, OGSA-DAI mediator, data analysis tool.)
  58. OGSA-DAI in action
  59. VOTES – data with different schema distributed across multiple databases within a group of strategic partners
      • Virtual Organisations for Trials and Epidemiological Studies (VOTES)
        o UK Medical Research Council project
      • Data access and integration in the clinical domain
        o Relational databases – Microsoft SQL Server, Access, …
        o Distributed database joins
          • Patient information
          • Clinical trials records
        o Linking key is Scotland’s CHI number
  60. VOTES – cross-database join activity workflow
      (Diagram: one OGSA-DAI server with access to DB1 and DB2.)
      o SQLQuery (DB1): SELECT CHI, Sex, DOB FROM Patients ORDER BY CHI
        -> ordered stream of (CHI, Sex, DOB)
      o SQLQuery (DB2): SELECT CHI, Diagnosis FROM TrialX ORDER BY CHI
        -> ordered stream of (CHI, Diagnosis)
      o Join: merge the two ordered streams -> (CHI, Sex, DOB, Diagnosis)
      o Deliver
      • This is equivalent to running:
          SELECT chi, sex, DOB, diagnosis FROM patients, trialX WHERE patients.chi = trialX.chi;
      • patients and trialX are in two different databases
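Merging two streams that are already ordered by CHI can be sketched as a classic merge join. This is an illustration under assumptions, not the VOTES activity: `MergeJoinSketch` is a hypothetical name, rows are plain `String[]` with the CHI in position 0, CHI is assumed unique on the patient side, and the database's `ORDER BY` is assumed to agree with `String.compareTo`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a merge join over two streams ordered by CHI.
class MergeJoinSketch {
    // patients rows: {CHI, Sex, DOB}; trial rows: {CHI, Diagnosis}.
    // Assumes both inputs are sorted by CHI and CHI is unique in patients.
    static List<String[]> join(List<String[]> patients, List<String[]> trial) {
        List<String[]> out = new ArrayList<>();
        int p = 0;
        int t = 0;
        while (p < patients.size() && t < trial.size()) {
            int cmp = patients.get(p)[0].compareTo(trial.get(t)[0]);
            if (cmp < 0) {
                p++; // this patient has no (further) trial rows
            } else if (cmp > 0) {
                t++; // this trial row has no matching patient
            } else {
                String[] a = patients.get(p);
                String[] b = trial.get(t);
                out.add(new String[] {a[0], a[1], a[2], b[1]});
                t++; // keep the patient: it may match further trial rows
            }
        }
        return out;
    }
}
```

Because each input is read once, front to back, the join can run on streams without holding either table in memory, which is the reason both queries use ORDER BY CHI.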
  61. Public Health Grid – data with different schema distributed across multiple databases within a group of strategic partners
      • US Public Health Grid
        o US Centers for Disease Control
        o University of Pittsburgh
        o Tarrant County Public Health Department
        o Dallas County Public Health Department
      • Real-time Outbreak and Disease Surveillance
        o Health query system
        o Look for incidences of some disease on the rise over an area
        o Historical and live data
      • Health centres maintain their own databases
        o Distributed databases
        o Different products and schemas
          • e.g. PatientID, Id, PatientIdentifier, PatientNumber
        o Security and privacy are important
  62. Public Health Grid – workflows, DQP and views
      (Diagram labels: workflow; OGSA-DAI servers over DB1 to DB6; OGSA-DQP; views.)
      o SQLQuery (DB6):
          SELECT zip, count(*) as total FROM Cases WHERE Reason = 'Flu' GROUP BY zip ORDER BY zip
      o Cases view:
          SELECT * FROM DB1.Cases UNION DB2.Cases UNION DB4.Cases
      o Result: (15112, 3), (15144, 1)
  63. SEE-GEO – working with private and public data
      • SEcurE access to GEOspatial services
        o EDINA, MIMAS, NeSC, NCeSS
        o UK JISC project
      • Geographical information systems
      • Virtual integration of and access control to
        o Census data – geo-data access service
        o Borders data – web feature service
        o Data hosted by other organisations and exposed as services
  64. SEE-GEO – geo-linking service portal
      (Diagram labels: Portal, GLS, Maps, OGSA-DAI, MIMAS Census, UK BORDERS, Image Creation Service; workflow: Get, Get, Join, Transform, Deliver.)
      1: GLSQuery submitted via portal, e.g. “Leeds population distribution by census output area”
      2: Workflow is populated with query parameters and run
      3: Image is placed on a map server
      4: URL of image is returned to portal – avoids costly SOAP/HTTP transfer of image
      5: Portal gets image using URL
  65. What did OGSA-DAI give SEE-GEO?
      • Could implement GLS service without OGSA-DAI
      • But using OGSA-DAI allowed leverage of
        o Workflow engine
        o Out-of-the-box activities for
          • Queries
          • Delivery
        o Security
        o Other grid technologies, e.g. GridFTP
  66. What did OGSA-DAI give SEE-GEO?
      • A toolkit to
        o Develop domain-specific activities
        o Develop support for domain-specific data resources
        o Execute workflows using these
        o Build OGC Web Processing Services (WPS)
      • Relatively little effort to
        o Choose different data resources dynamically
        o Merge GDAS XML into a relational data resource
        o Transfer data using GridFTP
        o Protect data using GSI
        o Experiment!
  67. Why OGSA-DAI?
  68. Workflows
      • A workflow can represent a complex data management scenario, involving:
        o Data access
        o Transformation
        o Filtering
        o Updating
        o Numerous distributed, heterogeneous databases
  69. Workflows and performance
      • OGSA-DAI is one more layer between clients and data
      • Therefore, OGSA-DAI is not as fast as a direct connection to a database
        o OGSA-DAI uses JDBC so will never be as fast as a direct JDBC connection
      • But this is not what OGSA-DAI is designed to do
  70. Workflows and performance
      • Having a server execute workflows yields
        o Thinner clients with less memory and CPU requirements
        o Minimised client-server communication overheads
      • Activities process data on the server
        o Minimises data movement
        o As opposed to BPEL or Taverna or web service-based workflow engines which pass data to and fro via web services
      • Data streaming
        o Activities work on different parts of the data stream in parallel
        o Reduces memory footprint on server
        o Reduces execution time
  71. Workflows and inter-operability
      • A workflow is a simple way of representing a complex set of related, ordered actions
        o A de-facto standard
        o Very expressive
      • How to standardise and promote inter-operability?
        o Use a facade and exploit workflows behind a well-defined interface
        o Facilitate inter-operability with other data products
  72. Why another layer can be good
      • Data providers retain control of their data
      • A place to hide database heterogeneities
        o Yields thinner clients
      • A place to enforce additional security
        o Hide the actual location of the data
        o Filter the data according to the rights of clients
        o Manage access to federations, databases, tables, documents, files, rows, lines
      • A place to define views on read-only databases
  73. Developing applications
      • OGSA-DAI is highly extensible
        o Data resources, activities, security, presentation layers
      • An enabling framework
        o Save development time
        o Focus on application-specific features
        o Get standard functionalities out-of-the-box
          • Queries, updates, transformations, deliveries
  74. Portability
      • OGSA-DAI is 100% Java
        o Runs under Windows, UNIX, Linux
      • OGSA-DAI uses web services
        o Clients can be written in any language and on any platform that supports web services
  75. Accessibility
      • 100% Java open source freeware
      • Compliant with free open source web and grid products
        o Globus Toolkit 4.0.x
        o Apache Axis/Tomcat
        o OMII 3.4.0
        o UNICORE – by OMII-Europe
        o VOMS – by OMII-Europe
  76. Second and third hands-on sessions
      Go to: #ScenarioTwoDataIntegration
  77. Further information
      • WWW site:
      • Info:
      • Users e-mail list: