Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

The Big Data challenge consists of managing, storing, analyzing and visualizing huge and ever-growing data sets to extract meaning and knowledge. As the volume of data grows exponentially, managing these data becomes correspondingly more complex.

A key point is to handle the complexity of the 'Data Life Cycle', i.e. the various operations performed on data: transfer, archiving, replication, deletion, etc. Indeed, data-intensive applications span a large variety of devices and e-infrastructures, which implies that many systems are involved in data management and processing.


''Active Data'' is a new approach to automate data management applications and improve their expressiveness. It consists of:

* a 'formal model' for the Data Life Cycle, based on Petri nets, that makes it possible to describe and expose the data life cycle across heterogeneous systems and infrastructures.

* a 'programming model' that allows code execution at each stage of the data life cycle: routines provided by programmers are executed when one of a set of events (creation, replication, transfer, deletion) happens to any data item.
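
The two parts above can be illustrated with a minimal Java sketch: a life cycle made of places and transitions, plus client routines that run whenever a transition fires for a data item. This is an illustration under stated assumptions, not the Active Data API: `LifeCycleSketch` and its methods (`addTransition`, `subscribe`, `create`, `fire`, `placeOf`) are hypothetical names, and the arcs between the Created, Written and Read places are assumed from the example diagram in the talk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical, simplified model for illustration; not the Active Data API.
class LifeCycleSketch {
    // A transition (data operation) moves a token between two places (data states).
    private static final class Transition {
        final String from;
        final String to;
        Transition(String from, String to) { this.from = from; this.to = to; }
    }

    private final Map<String, Transition> transitions = new HashMap<>();
    private final Map<String, List<Consumer<String>>> handlers = new HashMap<>();
    // Token uid -> current place; each token stands for one data item.
    private final Map<String, String> tokens = new HashMap<>();

    void addTransition(String name, String from, String to) {
        transitions.put(name, new Transition(from, to));
    }

    // Registers client code to run each time the named transition fires.
    void subscribe(String transition, Consumer<String> handler) {
        handlers.computeIfAbsent(transition, k -> new ArrayList<>()).add(handler);
    }

    // Puts a new token for the given data item in its initial place.
    void create(String uid, String initialPlace) {
        tokens.put(uid, initialPlace);
    }

    String placeOf(String uid) {
        return tokens.get(uid);
    }

    // Fires a transition for one token, then runs the subscribed handlers.
    void fire(String name, String uid) {
        Transition t = transitions.get(name);
        if (t == null) {
            throw new IllegalArgumentException("Unknown transition: " + name);
        }
        if (!t.from.equals(tokens.get(uid))) {
            throw new IllegalStateException(name + " is not enabled for " + uid);
        }
        tokens.put(uid, t.to);
        for (Consumer<String> h : handlers.getOrDefault(name, Collections.emptyList())) {
            h.accept(uid);
        }
    }

    public static void main(String[] args) {
        LifeCycleSketch lc = new LifeCycleSketch();
        lc.addTransition("t1", "Created", "Written"); // assumed arc
        lc.addTransition("t2", "Written", "Read");    // assumed arc
        lc.subscribe("t1", uid -> System.out.println("checksum requested for " + uid));
        lc.create("file-42", "Created");
        lc.fire("t1", "file-42");
        System.out.println(lc.placeOf("file-42"));
    }
}
```

Keeping the token's place in the model, rather than in each handler, is what lets a client observe the life cycle without knowing how the underlying system stores the data.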

Transcript

  • 1. Active Data: Data Life Cycle Management Across Heterogeneous Systems and Infrastructures. Anthony Simonet, Gilles Fedak (INRIA); Matei Ripeanu, Samer Al-Kiswany (UCB); Kyle Chard, Ian Foster (ANL/UC). Hot Topics in High-Performance Distributed Computing Workshop, IBM Almaden Research Center, San Jose, California, March 12, 2015. 1/13 G. Fedak() Active Data March 12, 2015
  • 2. Big Data ... Huge and growing volume of information originating from multiple sources. [Figure: data sources labelled Instruments, Simulations, Big Science, Open Data, Internet.] ... or Big Bottlenecks? How to scale the infrastructure: end-to-end performance improvement, inter-system optimization. How to improve the productivity of data-intensive scientists: data-oriented programming languages, data quality, improved automation and error recovery.
  • 3. Data Life Cycle. Definition: the Data Life Cycle (DLC) is the course of operational stages through which data pass from the time when they enter a set of systems to the time when they leave it. [Figure: stages labelled Acquisition, Preprocessing, Storage, Analysis.] Challenges: expose a high-level view of the DLC across distributed systems and infrastructures; expose interactions between the infrastructure and the DLC (e.g. failures).
  • 4. Active Data. Active Data makes it possible to reason about data sets handled by heterogeneous software and infrastructures. It provides a formal model that captures the essential life cycle stages and properties (creation, deletion, faults, replication, error checking, ...) and a programming model for easily developing data life cycle management applications. It allows legacy systems to expose their intrinsic data life cycle.
  • 5. Active Data: Principles & Features. System programmers expose their system's internal data life cycle with a model based on Petri nets. A Life Cycle Model is made of places (data states) and transitions (data operations). [Figure: Petri net with places Created, Written, Read and Terminated and transitions t1 to t4; the token is in Created.] Each token has a unique identifier, corresponding to that of the actual data item.
  • 6. Active Data: Principles & Features. System programmers expose their system's internal data life cycle with a model based on Petri nets. A Life Cycle Model is made of places (data states) and transitions (data operations). [Figure: the same Petri net after t1 has fired; the token is now in Written.] A transition is fired whenever a data state changes.
  • 7. Active Data: Principles & Features. System programmers expose their system's internal data life cycle with a model based on Petri nets. A Life Cycle Model is made of places (data states) and transitions (data operations). [Figure: the same Petri net with a handler, public void handler() { computeMD5(); }, attached to a transition.] Clients may plug code into transitions. It is executed whenever the transition is fired.
  • 8. Active Data Framework. [Figure: life cycle view of file, file transfer, dataset and metadata tokens, with tagged tokens, guards, code execution and notification.] Framework features: captures data events in legacy systems; gives a high-level, life cycle-centered view of data; offers a single namespace for all files, datasets and metadata. Powerful filters based on data tags: taggers are installed on transitions, and guarded transitions only execute on tokens that have specific tags. Publish/subscribe transitions. Custom user reaction to data progress: custom code execution and custom notifications (Twitter, email, gdoc, IFTTT, ...).
  • 9. Use Case: Advanced Photon Source. [Figure: Detector, Local Storage, Compute Cluster, Globus and Globus Catalog, linked by four steps: 1. Local Transfer, 2. Extract Metadata, 3. Globus Transfer, 4. Swift Parallel Analysis.] 3 to 5 TB of data per week on this detector. Raw data are pre-processed and registered in the Globus Catalog: data are curated by several applications and shared among scientific users.
  • 10. Data Surveillance Framework. 4 goals (that would otherwise require a lot of scripting and hacking): monitoring data set progress; better automation; sharing & notification; error discovery & recovery.
  • 11. APS Data Life Cycle Model. [Figure: composed Petri net life cycle models for the Detector, Globus transfer (two instances), Shared storage, Globus Catalog and Swift. Places include Created, Succeeded, Failed and Terminated; transitions include Start transfer, End transfer, Success, Failure, Extract, Update, Remove, Initialize, Set, Derive, Start Swift and End.] Data life cycle model composed of 6 systems.
  • 12. Error Detection & Recovery. Example scenario: recover from system-wide errors. Faulty acquired files are detected only after Swift fails to process them. In this situation, the user manually drops the whole dataset, removes any associated files and metadata, and re-acquires the dataset using the same parameters.
  • 13. E.D.&R. implementation. [Figure: the APS data life cycle model (6 systems) annotated with Active Data clients. Labels, in source order: Failure and Recovery; Handler: remove metadata from the catalog; Filters data likely to fail; Tagger; Guard; Handler: run the Globus Catalog UI scripts.]
  • 14. Handler Code

        TransitionHandler handler = new TransitionHandler() {
            public void handler(Transition t, boolean isLocal, Token[] inTokens, Token[] outTokens) {
                // Get the dataset identifier
                LifeCycle lc = ad.getLifeCycle(inTokens[0]);
                String datasetId = lc.getTokens("Shared storage.Created")[0].getUid();

                // Remove the dataset annotations from the catalog
                String url = "https://catalog.globus.org/dataset/" + datasetId;
                Runtime r = Runtime.getRuntime();
                Process p = r.exec("catalog_client.py remove " + url);
                p.waitFor();

                // Locally, remove the datasets
                String path = "~/aps/" + datasetId;
                FileUtils.deleteDirectory(new File(path));

                // Publish the "Detector.End" transition
                Token root = lc.getTokens("Detector.Created")[0];
                ad.publishTransition("Detector.End", lc);

                // Notify the user
                sendEmail("user@server.com", "APS - Corrupted dataset " + datasetId);
            }
        };

        HandlerGuard guard = new HandlerGuard() {
            public boolean accept(Transition t, Token[] inTokens, Token[] outTokens) {
                // Tag string as recovered from the slide; any separator between the words was lost
                return inTokens[0].hasTag("failure corrupted");
            }
        };

        ad.subscribeTo("Swift.Failure", handler, guard);
  • 15. Conclusion. Active Data makes it possible to expose the Data Life Cycle across heterogeneous systems and infrastructures. It provides a transition-based programming model for DLC management applications; supports monitoring, automation, and error detection & recovery; and enables cross-system (X-systems) optimizations: incremental computing, data staging, caching, throttling, etc. Perspectives: use AD to deploy a data management software stack on IaaS (Asma Ben Cheick, Heithem Abbes, Univ. Tunis); Big Data Apache stack X-optimization (H. He, CAS, Beijing); volunteer & crowd computing (M. Moca, BBU, Romania).
  • 16. Thank you! Questions?
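
The tag and guard mechanism described on slides 8 and 14 can be sketched in a few lines of Java. This is a hedged illustration, not the framework's implementation: `GuardedSubscriptionSketch`, `Token`, `publish` and the tag name `corrupted` are assumptions made for the example; only the `subscribeTo(transition, handler, guard)` argument order and the `Swift.Failure` transition name mirror the slide code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical sketch of tagged tokens and guarded handlers; not the Active Data API.
class GuardedSubscriptionSketch {
    // A token identifies one data item and carries free-form tags.
    static final class Token {
        final String uid;
        final Set<String> tags = new HashSet<>();
        Token(String uid) { this.uid = uid; }
        boolean hasTag(String tag) { return tags.contains(tag); }
    }

    // A handler paired with the guard that decides whether it runs.
    private static final class Subscription {
        final String transition;
        final Consumer<Token> handler;
        final Predicate<Token> guard;
        Subscription(String transition, Consumer<Token> handler, Predicate<Token> guard) {
            this.transition = transition;
            this.handler = handler;
            this.guard = guard;
        }
    }

    private final List<Subscription> subscriptions = new ArrayList<>();

    void subscribeTo(String transition, Consumer<Token> handler, Predicate<Token> guard) {
        subscriptions.add(new Subscription(transition, handler, guard));
    }

    // Publishes a transition for one token and returns how many handlers ran.
    int publish(String transition, Token token) {
        int ran = 0;
        for (Subscription s : subscriptions) {
            if (s.transition.equals(transition) && s.guard.test(token)) {
                s.handler.accept(token);
                ran++;
            }
        }
        return ran;
    }

    public static void main(String[] args) {
        GuardedSubscriptionSketch ad = new GuardedSubscriptionSketch();
        ad.subscribeTo("Swift.Failure",
                t -> System.out.println("recovering dataset " + t.uid),
                t -> t.hasTag("corrupted")); // hypothetical tag name

        Token dataset = new Token("dataset-7");
        dataset.tags.add("corrupted"); // a tagger would normally add this on a transition
        ad.publish("Swift.Failure", dataset);
    }
}
```

Separating the guard from the handler keeps the recovery code free of filtering logic: the same handler can be reused with a different guard when the failure condition changes.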
