Published on

Published in: Technology, Education
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. Policy-based Data ManagementIntegrated Rule Oriented Data Grid (iRODS) Reagan W. Moore (DICE-UNC) Arcot Rajasekar (DICE-UNC)
  2. 2. What is the Opportunity Play for iRODS At a high level …  The Management of Big Data is the #1 concern for IT • Life Cycle Management; • Useful (actionable) and searchable metadata • Integrity • Collaboration (Federation of Immutable data)  iRODS Provides Policy-Based data management: • Next Generation data management cyber-infrastructure • System that enables a flexible, adaptive, customizable data management architecture • Tool for large collections (Petabytes, hundreds of millions of files)
  3. 3. We will touch on … Properties of policy-based data management systems Management of the data life cycle (project collection, digital library, persistent archive, processing pipeline) Applications of iRODS  LifeTime™ Library (digital library for students)  Genomics data grid  Carolina Digital Repository (institution repository)  French National Library (IT automation)  DataNet Federation Consortium (data and workflow sharing for collaborative research) 1. What iRODS is and what problems it is solving today, and tomorrow. 2. Speak to different use cases (there will be many companies attending representing many departments with different opportunities/problems) a. Digitization of University Assets- Library archive b. Genomic pipeline automation c. IT service automation
  4. 4. Topics• Principles behind policy-based data management – Enable collaborative research – Enable reproducible science – Enable creation of reference collections• Integrated Rule-Oriented Data System (iRODS) – Enforce management policies – Automate administrative functions – Validate assessment criteria
  5. 5. Shared Collections – Data Grid 50 clients: web browser, Client unix shell command, … Data grid middleware Data Grid provides global name, single sign-on, policy enforcement, metadata, replication File TapeSystem Archive Multiple types of systems can be used to store data
  6. 6. Policy-based Data Management ClientiRODS-server iRODS-server Rule-engine Rule Engine Rule base Rule base Workflows Logical Workflows Collection (data grid) Storage Storage Consensus on Policies and Procedures controls the Data Collection
  7. 7. Policy-Based Data Environments Purpose  Reason a collection is assembled Properties  Attributes needed to ensure the purpose Policies  Controls for enforcing desired properties, • mapped to computer actionable rules Procedures  Functions that implement the policies • Mapped to computer actionable workflows Persistent state information  Results of applying the procedures • mapped to system metadata Property verification  Validation that state information conforms to the desired purpose • mapped to periodically executed policies 7
  8. 8. Community-based Collection Life Cycle The driving purpose changes at each stage of the data life cycle Data Project Data Processing Digital Reference FederationCollection Grid Pipeline Library Collection Private Shared Analyzed Published Preserved Sustained Local Distribution Service Description Representation Re-purposing Policy Policy Policy Policy Policy Policy Stages correspond to addition of new policies for a broader community Virtualize the stages of the collection life cycle through policy evolution
  9. 9. Applications Data Grids (data sharing)  Ocean Observatories Initiative  The iPlant Collaborative  National Optical Astronomy Observatory  Babar High Energy Physics  Broad Institute genomics data grid  WellCome Trust Sanger Institute genomics data grid Digital Libraries (data publication)  Texas Digital Library  French National Library  UNC-CH SILS LifeTime Library Repositories / Archives (data preservation)  NASA Center for Climate Simulation  Carolina Digital Repository
  10. 10. Sequencing Work – an Infrastructure View RC, RENCI, LCCC, HTSF Infrastructure Managing several hundred TBs of genomic data hardware: ITS; software: LCCC, UNC High Throughput Sequencing facility Production RENCI Infrastructure Test/Development Archive Genome Databases Genome Pipelines Databases VarDB Hadoop Pipelines Data Production UNC HTFS Distributed ad-hoc Third Party processing Vendors iRODS data-grid managed Genome processing AnnotationsNational Resources Ref Se dbSN 1000 HGMD P Genomes q Open Science Data Sharing RENCI Grid Clinical Data Systems Science Local Portal TeraGrid NIH NCGenes (TUCASI) Secure Medical UNC BASS Other Workspace … Institutions
  11. 11. Managing Data on the Research Side UNC RENCI External Genomics Lab External STORAGE STORAGE Compute: Storage Machines Partners (Tape, Drives) (Tape, Drives) Open Science Grid Genomics HPC Clemson NIH UNC HPC RENCI HPC IT Machines Genomics Clouds RENCI Hadoop Hadoop iRODS gracefully allows for introducing control: •Data movement and replicationWild West Managed •Metadata standards •Archival, deletion, and retention •Integration with workflows, hadoop, databases •Hiding complexities Data •Automation Students IT StaffProviders •…, all policy driven Researchers External •…, without breaking the in-place systems Collaborators
  12. 12. SILS LifeTime Library Student digital libraries  Enable students to build collections of  Photographs  MP3 audio files  Class documents  Video  Web site archive Resources provided by School of Information and Library Science at UNC-CH  Student collections range from 2 GBytes to 150 Gbytes  Number of files from 2000 to 12,000
  13. 13. SILS LifeTime Library Policies Library management  Replication  Checksums  Versioning  Strict access controls  Quotas  Metadata catalog replication  Installation environment archiving Ingestion  Automated synchronization of student directory with LifeTime Library  Automated loading of MP3 metadata
  14. 14. Carolina Digital Repository Policy-Driven Repository Infrastructure projectfunded by the Institute for Museum and Library Services
  15. 15. Carolina Digital RepositoryIngest Workflow
  16. 16. iRODS Data Grid More than 50 different clients have been used to interact with the data grid  Web browsers (iDrop-web, Rich Web client)  Web services (VOSpace)  Load libraries (Python, Java)  I/O libraries (C, C++, Fortran)  File systems (FUSE, WebDav, Parrot)  Synchronization interfaces (iDrop)  Unix tools / Grid tools (icommands, SAGA, SRM, Griphyn)  Workflows (Kepler, Taverna)  Digital Libraries (Fedora, DSpace)  Portals (EnginFrame)
  17. 17. Managing Information & KnowledgeConcepts Data objects Information names Knowledge relationships between names Wisdom relationships between relationshipsImplementation Data bytes Storage system Information metadata Relational database Knowledge policies / procedures Rule base / Rule engine Wisdom policy enforcement point Data Grid
  18. 18. Data Virtualization Access Interface • Map from the actions requested by the client to multiple policyPolicy Enforcement Points enforcement points. • Map from policy toStandard Micro-services standard micro-services. • Map from micro-servicesStandard I/O Operations to standard Posix I/O operations. • Map standard I/O Storage Protocol operations to the protocol supported by Storage System the storage system
  19. 19. System and User-driven Rules The data grid automatically applies rules defined in the rule base, You can define rules that are applied interactively, or that are deferred for later execution irule –F “rule-file.r”
  20. 20. Example Rule Write “Hello World” Create rule file call ruleHello.r: myTestRule { writeLine("stdout”, "Hello World"); } INPUT null OUTPUT ruleExecOutirule –F ruleHello.r
  21. 21. Production Integrity Rule Verify all input parameters for consistency. Query the iRODS metadata catalog to retrieve status information Verify the integrity of each file in a collection Update all replicas to the most recent version. Minimize the load on production services through a deadline scheduler Differentiate between the logical name for a file and the physical replica locations. Identify all missing replicas and document their lack. Create new replicas to replace missing replicas. Implement load leveling to distribute the new replicas across the storage systems Create a log file that records all repair operations performed upon the collection. Track progress of the policy execution. Initialize the rule for the first execution. Enable restart of the process from the last set of checked files in case of a system halt. Manipulate files in batches of 256 files at a time to handle arbitrarily large collections. Minimize the number of sleep periods used by the deadline scheduler. Include the checking of new files that have been added during the execution of the policy Write out statistics about the effective execution rate, and the number of files checked.
  22. 22. Workflow Management & Registration Workflow fileeCWkflow.mss Directory holding all input and output files/earthCube/eCWkflow associated with workflow file (mounted collection that is linked to the workflow file) Automatically generated run file for Executing each input file eCWkflow.mpf Input parameter file, lists parameters and input and output file names eCWkflow2.mpf /earthCube/eCWkflow/eCWkflow.runDir0 Directory holding all output files generated for invocation Outfile of, the version number is incremented /earthCube/eCWkflow/eCWkflow2.runDir 0 Output file created for Newfile eCWKflow.mpf
  23. 23. Publications Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, “iRODS Primer: Integrated Rule-Oriented Data System”, Morgan & Claypool, 2010. Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, “The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook”, DICE Foundation, November 2011, ISBN: 9781466469129,
  24. 24. iRODS - Open Source Software Reagan W. Moore http://irods.diceresearch.orgNSF OCI-0940841 “DataNet Federation Consortium”NSF OCI-1032732 “Improvement of iRODS for Multi-Disciplinary Applications”NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype”NSF SDCI-0721400 “Data Grids for Community Driven Applications”
  25. 25. iRODS Distributed Data Management
  26. 26. Initializing Workflow Parameters*Val = "0”;msiExecStrCondQuery("SELECT COUNT(META_COLL_ATTR_NAME) where COLL_NAME =*Coll and META_COLL_ATTR_NAME = TEST_DATA_ID", *GenQOut2);foreach (*GenQOut2) {msiGetValByKey(*GenQOut2, "META_COLL_ATTR_NAME", *Val); }if(int(*Val) == 0) { *Str1 = "TEST_DATA_ID=0”; msiString2KeyValPair(*Str1,*kvp); msiAssociateKeyValuePairsToObj(*kvp,*Coll,"-C"); writeLine("*Lfile","added TEST_DATA_ID attribute to collection *Coll");}# on a restart TEST_DATA_ID will be greater than 0 msiMakeGenQuery("META_COLL_ATTR_VALUE", "COLL_NAME = *Coll andMETA_COLL_ATTR_NAME = TEST_DATA_ID", *GenQInp2);msiExecGenQuery(*GenQInp2,*GenQOut2); foreach(*GenQOut2) { msiGetValByKey(*GenQOut2, "META_COLL_ATTR_VALUE", *colldataID); }
  27. 27. Workflow Operations Used Arithmetic (+, -, *, /) Boolean tests (==, !=, &&, ||, >, <, >=) Conditional statements  if / then / else Control  break / fail Loops  for / foreach / while List manipulation  initialization / list addition (cons) / extracting an element from a list (elem) / updating an element in a list (setelem) Variable manipulation  initialization / type conversion (int, double, str)
  28. 28. Micro-services Used Metadata catalog manipulation  msiGetValByKey get metadata from structure  msiExecStrCondQuery execute string conditional query  msiString2KeyValPair convert string to key-value pair  msiAssociateKeyValuePairsToObj add metadata  msiMakeGenQuery create a query  msiExecGenQuery execute a query  msiCloseGenQuery release query buffers  msiGetContInxFromGenQueryOut check for more rows  msiRemoveKeyValuePairsFromObj remove metadata  msiGetMoreRows get more rows from query
  29. 29. Micro-services Used Data and directory manipulation  msiIsColl check whether name is a collection  msiCollCreate create a collection  msiDataObjCreate create a file  msiDataObjRepl replicate a file  msiDataObjChksum checksum a file  msiDataObjUnlink delete a file System functions  msiGetSystemTime get the system time  writeLine write a line to a file or standard out  msiSleep sleep
  30. 30. Performance at renci’• Execute call to rule engine 18 msecs• Execute metadata query 714 msecs• Disk seek latency 5 msecs• Disk rotational latency 11 msecs• Production loop logic 6.3 msecs• Checksum verification 21 msecs
  31. 31. Data Analysis Use Cases• Demonstrate reproducible science. A use case could include the registration, storage, sharing, and re-execution of a workflow. The hypoxia use case from the Cross-Domain and Brokering Concept groups could be used as an example.• Automate data retrieval. A use case could demonstrate remote access to a data collection, retrieval of desired data sets, transformation, and use in an analysis workflow. An eco-hydrology example that automates access to digital elevation maps and land use coverage is being built.• Integrate community resources with collaboration environments. An example would be use of the DAB protocol to identify and cache local copies of relevant data sets for local analysis.• Integrate multiple community resources. A use case could be demonstration of invocation of multiple workflow systems within the same analysis. An example is the integration of Cyber-integrator workflow with collaboration environments to support drought prediction.
  32. 32. Eco-Hydrology Choose gauge or outlet (HIS) RHESSys workflow to develop a nested Extract drainage area watershed parameter file (worldfile) (NHDPlus) containing a nested ecogeomorphic object Digital Elevation framework, and full, initial system state. Slope Model (DEM) Aspect Nested watershedStreams (NHD) structure Soil and vegetationRoads (DOT) parameter files Strata Patch Land Use NLCD (EPA) Hillslope Basin Leaf Area Landsat TM Index Stream network Phenology MODIS Worldfile Flowtable Soil Data USDA RHESSys
  33. 33. iRODS Rule for RHESSys Modular workflow composed by chaining basic transformation Define input variables Call functions to apply each transformation step Store results in shared collectionmain { getExtentForGageReachcode(*gageReachcode, *extentInNHD_Vect_Coords); convertExtentToNHD_DEM(*extentInNHD_Vect_Coords, *extentInNHD_DEM_Coords); extractTileFromNHD_DEM(trimr(*extentInNHD_DEM_Coords, "n")); importDEMTileIntoNewGRASSLocationAsUTM(*extentInNHD_Vect_Coords, *newLocPhysPath,*newLocObjPath); delineateWatershedForNHDGage(*nhdStreamGageID, *newLocPhysPath, *newLocObjPath);}
  34. 34. extractTileFromNHD_DEM(*extentCoords) {# Split path to object into collection and name msiSplitPath(*nhdDEMObjPath, *nhdDEMObjColl, *nhdDEMObjName); writeLine("serverLog", *nhdDEMObjColl); writeLine("serverLog", *nhdDEMObjName);# Build query to discover physical path msiAddSelectFieldToGenQuery("DATA_PATH", "null", *genQInp); msiAddConditionToGenQuery("DATA_NAME", "=", *nhdDEMObjName, *genQInp); msiAddConditionToGenQuery("COLL_NAME", "=", *nhdDEMObjColl, *genQInp); msiAddConditionToGenQuery("DATA_RESC_NAME", "=", *rescName, *genQInp);# Run query msiExecGenQuery(*genQInp, *genQOut);# Extract path from query result foreach (*genQOut) {msiGetValByKey(*genQOut, "DATA_PATH", *filePath); } writeLine("serverLog", *filePath);# Determine physical path of input directory msiSplitPath(*filePath, *inFileDir, *headerFileIgnore);# Generate physical path of output file msiSplitPath(*inFileDir, *inFileParentDir, *rasterDatasetName) *tileFileName = "SUBSET-"++*rasterDatasetName++".img" *tileFilePath = *inFileParentDir++"/"++*tileFileName;# Generate iRODS path of output msiSplitPath(*nhdDEMObjColl, *nhdDEMObjCollParent, *junk) *tileObjPath = *nhdDEMObjCollParent++"/"++*tileFileName *args = "-of HFA -projwin "++*extentCoords++" "++"*inFileDir"++" "++"*tileFilePath"; writeLine("serverLog", *args); msiExecCmd("gdal_translate", *args, "", "null", "null", *cmd_out); writeLine("serverLog", *cmd_out);# Register tile file with iRODS msiPhyPathReg(*tileObjPath, *rescName, *tileFilePath, "null", *status);}
  35. 35. Summary iRODS is a power Policy-Based engine for Managing NextGen Big Data Cyber-Infrastructures Enables a Flexible, Adaptive and Customizable Data Management Architecture “Canned” scripts (policies) can be created to standardize and automated users processes  Simple menu driven interface  No CS Degree needed iRODS is the middleware for Distributed Data Management
  36. 36. Thank youQuestions?