Your SlideShare is downloading. ×
0
Active Data: Data Life Cycle Management
Across Heterogeneous Systems and
Infrastructures
Anthony Simonet, Gilles Fedak (IN...
Big Data ...
Huge and growing volume of information originating from multiple
sources.
!"#$%&'("$#) *+'&,-./"#)0+1)*2+("2(...
Data Life Cycle
Definition
Data Life Cycle (DLC) is the course of operational stages through which
data pass from the time ...
Active Data
Active Data:
Allow to reason about data sets handled by heterogeneous software
and infrastructures.
A formal m...
Active Data: Principles & Features
System programmers expose their system’s internal data life cycle with a
model based on...
Active Data: Principles & Features
System programmers expose their system’s internal data life cycle with a
model based on...
Active Data: Principles & Features
System programmers expose their system’s internal data life cycle with a
model based on...
Active Data Framework
6/13
G. Fedak() Active Data March 12, 2015
Life Cycle View
File transferFile Dataset Metadata
Guard
...
Use Case: Advanced Photon Source
Globus Catalog Globus
Detector Local Storage Compute Cluster
1. Local
Transfer
2. Extract...
Data Surveillance Framework
4 goals (that would otherwise require a lot of scripting and hacking):
Monitoring Data Set Pro...
APS Data Life Cycle Model
Created Start transfer
Terminated
End
Detector
Created
SuccessFailure
SucceededFailed
EndEnd
Ter...
Error Detection & Recovery
10/13
G. Fedak() Active Data March 12, 2015
Example scenario
Recover from system-wide errors: f...
E.D.&R. implementationAvalon Daniel Arnaud Anthony Vincent
Use-case: APS data life cycle model
Created Start transfer
Term...
Handler Code
TransitionHandler handler = new TransitionHandler () {
public void handler(Transition t, boolean isLocal , To...
Conclusion
Active Data
allows to expose Data Life Cycle across heterogeneous systems and
infrastructures
transition-based ...
Thank you!
Questions?
Upcoming SlideShare
Loading in...5
×

Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

1,656

Published on

Active Data : Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

The Big Data challenge consists in managing, storing, analyzing and visualizing these huge and ever growing data sets to extract sense and knowledge. As the volume of data grows exponentially, the management of these data becomes more complex in proportion.

A key point is to handle the complexity of the 'Data Life Cycle', i.e. the various operations performed on data: transfer, archiving, replication, deletion, etc. Indeed, data-intensive applications span over a large variety of devices and e-infrastructures which implies that many systems are involved in data management and processing.


''Active Data'' is new approach to automate and improve the expressiveness of data management applications. It consists of

* a 'formal model' for Data Life Cycle, based on Petri Net, that allows to describe and expose data life cycle across heterogeneous systems and infrastructures.

* a 'programming model' allows code execution at each stage of the data life cycle: routines provided by programmers are executed when a set of events (creation, replication, transfer, deletion) happen to any data.

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,656
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
7
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Transcript of "Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures"

  1. 1. Active Data: Data Life Cycle Management Across Heterogeneous Systems and Infrastructures Anthony Simonet, Gilles Fedak (INRIA) Matei Ripeanu, Samer Al-Kiswany (UCB) Kyle Chard, Ian Foster (ANL/UC) Hot Topics in High-Performance Distributed Computing Workshop IBM Almaden Research Center San Jose, California March 12, 2015 1/13 G. Fedak() Active Data March 12, 2015
  2. 2. Big Data ... Huge and growing volume of information originating from multiple sources. !"#$%&'("$#) *+'&,-./"#)0+1)*2+("2() 34(")5-$-)!"$(%"($) . . . or Big Bottlenecks ? how to scale the infrastructure ? end-to-end performance improvement, inter-system optimization. how to improve productivity of data-intensive scientist ? data-oriented programming language, data quality, improve automation and errors recovery 2/13 G. Fedak() Active Data March 12, 2015
  3. 3. Data Life Cycle Definition Data Life Cycle (DLC) is the course of operational stages through which data pass from the time when they enter a set of systems to the time when they leave it. !"#$%&%'()* +,-.,("-&&%)/* 01(,2/-* !)234&%&* !)234&%&* Challenges : Expose high level view DLC across distributed systems and infrastructures Expose interactions between the infrastructure and the DLC (e.g failures) 3/13 G. Fedak() Active Data March 12, 2015
  4. 4. Active Data Active Data: Allow to reason about data sets handled by heterogeneous software and infrastructures. A formal model that captures the essential life cycle stages and properties: creation, deletion, faults, replication, error checking . . . programming model to develop easily data life cycle management applications. Allows legacy systems to expose their intrinsic data life cycle. 4/13 G. Fedak() Active Data March 12, 2015
  5. 5. Active Data: Principles & Features System programmers expose their system’s internal data life cycle with a model based on Petri Nets. A Life Cycle Model is made of Places: data states Transitions : data operations • Created t1 Written t2 Read t3 t4 Terminated Each token has a unique identifier, corresponding to the actual data item’s. 5/13 G. Fedak() Active Data March 12, 2015
  6. 6. Active Data: Principles & Features System programmers expose their system’s internal data life cycle with a model based on Petri Nets. A Life Cycle Model is made of Places: data states Transitions : data operations Created t1 • Written t2 Read t3 t4 Terminated A transition is fired whenever a data state changes. 5/13 G. Fedak() Active Data March 12, 2015
  7. 7. Active Data: Principles & Features System programmers expose their system’s internal data life cycle with a model based on Petri Nets. A Life Cycle Model is made of Places: data states Transitions : data operations Created t1 • Written t2 Read t3 t4 Terminated public void handler () { computeMD5 (); } Code may be plugged by clients to transitions. It is executed whenever the transition is fired. 5/13 G. Fedak() Active Data March 12, 2015
  8. 8. Active Data Framework 6/13 G. Fedak() Active Data March 12, 2015 Life Cycle View File transferFile Dataset Metadata Guard Code Execution }Tagged Tokens Notification Framework features: Captures data events in legacy systems High-level life cycle-centered view of data Single namespace for all the files, datasets and metadata Powerful filters based on Data Tags Install Taggers on Transitions Guarded Transitions : only executes on token which have specific tags. Publish/subscribe transitions Custom user reaction to data progress Custom code execution Custom notifications (twitter, email, gdoc, ifttt . . . )
  9. 9. Use Case: Advanced Photon Source Globus Catalog Globus Detector Local Storage Compute Cluster 1. Local Transfer 2. Extract Metadata 3. Globus Transfer 4. Swift Parallel Analysis 3 to 5 TB of data per week on this detector Raw data are pre-processed and registered in the Globus Catalog : Data are curated by several applications Data are shared amongst scientific user 7/13 G. Fedak() Active Data March 12, 2015
  10. 10. Data Surveillance Framework 4 goals (that would otherwise require a lot of scripting and hacking): Monitoring Data Set Progress Better Automation Sharing & Notification Error Discovery & Recovery 8/13 G. Fedak() Active Data March 12, 2015
  11. 11. APS Data Life Cycle Model Created Start transfer Terminated End Detector Created SuccessFailure SucceededFailed EndEnd Terminated End transfer Globus transfer Created End Terminated Start transfer Shared storage Created SuccessFailure SucceededFailed EndEnd Terminated Start Swift Globus transfer CreatedExtract Update TerminatedRemove Globus Catalog Created Initialize Set End Failure Terminated Derive Swift Data life cycle model composed of 6 systems. 9/13 G. Fedak() Active Data March 12, 2015
  12. 12. Error Detection & Recovery 10/13 G. Fedak() Active Data March 12, 2015 Example scenario Recover from system-wide errors: faulty acquired files are detected only after Swift fails to process them. In this situation, the user manually: Drops the whole dataset Removes any associated file and metadata Re-acquire the dataset using the same parameters
  13. 13. E.D.&R. implementationAvalon Daniel Arnaud Anthony Vincent Use-case: APS data life cycle model Created Start transfer Terminated End Detector Created SuccessFailure SucceededFailed EndEnd Terminated End transfer Globus transfer Created End Terminated Start transfer Shared storage Created SuccessFailure SucceededFailed EndEnd Terminated Start Swift Globus transfer CreatedExtract Update TerminatedRemove Globus Catalog Created Initialize Set End Failure Terminated Derive Swift Data life cycle model composed of 6 systems. Avalon June 3rd, 2014 22/30 Active Data Client Failure and Recovery Handler remove metadata from the catalog Filters data likely to fail Tagger Active Data Client Guard Handler run the Globus catalog UI scripts 11/13 G. Fedak() Active Data March 12, 2015
  14. 14. Handler Code TransitionHandler handler = new TransitionHandler () { public void handler(Transition t, boolean isLocal , Token [] inTokens , Token [] outTokens) { // Get the dataset identifier LifeCycle lc = ad. getLifeCycle (inTokens [0]); datasetId = lc.getTokens("Shared storage.Created")[0]. getUid (); // Remove the dataset annotations from the catalog String url = "https :// catalog.globus.org/dataset/" + datasetId; Runtime r = Runtime.getRuntime (); Process p = r.exec(" catalog_client .py remove " + url); p.waitFor (); // Locally , remove the datasets String path = "~/aps/" + datasetId; FileUtils. deleteDirectory (new File(path)); // Publish the " Detector .End" Token root = lc.getTokens("Detector.Created")[0]; ad. publishTransition ("Detector.End", lc); // Notify the user sendEmail("user@server.com", "APS - Corrupted dataset " + datasetId); } }; HandlerGuard guard = new HandlerGuard () { public boolean accept ( Transition t , Token [] inTokens , Token [] outTokens ) { return input [0]. hasTag(" f a i l u r e c o r r u p t e d "); }} ad.subscribeTo("Swift.Failure", handler , guard); 12/13 G. Fedak() Active Data March 12, 2015
  15. 15. Conclusion Active Data allows to expose Data Life Cycle across heterogeneous systems and infrastructures transition-based programming model for DLC management application Monitoring, automation, error detection & recovery X-systems optimizations: incremental computing, data staging, caching, throttling etc. . . Perspectives : Use AD to deploy data management software stack on IaaS (Asma Ben Cheick, Heithem Abbes, Univ. Tunis) Big Data Apache stack X-optimization (H. He, CAS, Beijing) Volunteer & crowd computing (M. Moca, BBU, Romania) 13/13 G. Fedak() Active Data March 12, 2015
  16. 16. Thank you! Questions?
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×