Active Data PDSW'13


  1. Active Data: A Data-Centric Approach to Data Life-Cycle Management
     Anthony Simonet (1), Gilles Fedak (1), Matei Ripeanu (2), Samer Al-Kiswany (2)
     (1) Inria, ENS Lyon, University of Lyon; (2) University of British Columbia
     PDSW'13, November 18th, 2013
  2. Outline
     • Introduction: Data Life Cycle Management, Use-case, Requirements
     • Active Data: principles & features; Example: Globus Online and iRODS
     • Discussion: Advantages, Limitations
     • Conclusion: Related works, Conclusion
  3. Big Data
     Science and industry have become data-intensive. The volume of data produced by science and industry grows exponentially.
     • How to store this deluge of data?
     • How to extract knowledge and sense?
     • How to make data valuable?
     Some examples:
     • CERN's Large Hadron Collider: 1.5 PB/week
     • Large Synoptic Survey Telescope, Chile: 30 TB/night
     • Billion-edge social network graphs
     • Searching and mining the Web
  4. Data Life Cycle
     Stages: Creation/Acquisition, Transfer, Replication, Disposal/Archiving
     Definition: the life cycle is the course of operational stages through which data pass from the time when they enter a system to the time when they leave it.
  5. Data Life Cycle Management
     Complicated scenarios:
     • Execution of workflows
     • Complex interactions between software
     • Need to quickly react to operational events
     Ad-hoc task-centric approaches:
     • Hard to program, maintain and debug
     • No formal specification
     • Complicate interactions between systems
  6. Data Life Cycle Use-case
     Example: the Advanced Photon Source at Argonne National Lab
     • 100 TB of raw data per day
     • Raw data are preprocessed and registered in a Globus dataset catalog
     • Data are analyzed by various applications
     • Results are stored in the dataset catalog and shared
     [Figure: data flow between Instrument (Beamline), Local Storage, Metadata Catalog, Remote Data Center and Academic Cluster. Arrow labels: Transfer, Extract & Register Metadata, Transfer, Analysis, More analysis, Upload result, Register result metadata]
  7. Use-case: Task-Centric vs. Data-Centric
     Points compared on the slide (the original two-column layout is not recoverable from the transcript):
     • Independent scripts
     • Express data dependencies
     • Hard to program, maintain, verify
     • Cross data-center coordination
     • Coarse granularity
     • User-level fault tolerance
     • Incremental processing
  8. Requirements
     Challenges: a perfect system should...
     • Simply represent the life cycle of data distributed across different data centers and systems
     • Simplify data life cycle management (DLM) modeling and reasoning
     • Hide the complexity resulting from using different infrastructures and systems
     • Be easy to integrate with existing systems
  9. Active Data principles
     System programmers expose their system's internal data life cycle with a model based on Petri nets. A life cycle model is made of places and transitions.
     [Figure: Petri net with places Created, Written, Read and Terminated, and transitions t1 to t4; the token (•) is in Created]
     Each token has a unique identifier, corresponding to that of the actual data item.
  10. Active Data principles
      System programmers expose their system's internal data life cycle with a model based on Petri nets. A life cycle model is made of places and transitions.
      [Figure: Petri net with places Created, Written, Read and Terminated, and transitions t1 to t4; the token (•) is in Written]
      A transition is fired whenever a data state changes.
  11. Active Data principles
      System programmers expose their system's internal data life cycle with a model based on Petri nets. A life cycle model is made of places and transitions.
      [Figure: Petri net with places Created, Written, Read and Terminated, and transitions t1 to t4; the token (•) is in Written]
      Handler shown on the slide:
          public void handler() { computeMD5(); }
      Clients may plug code into transitions. It is executed whenever the transition is fired.
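The principle on these slides (a token moves between places, and client code runs when a transition fires) can be illustrated with a small sketch. This is not the Active Data API: the class `LifeCycleSketch`, its method names, and the arcs attached to t1 to t4 are all assumptions made for illustration, since the transcript does not preserve the arcs of the figure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified life cycle model: a single token moving between places. */
class LifeCycleSketch {
    // Transition name -> {input place, output place}
    private final Map<String, String[]> transitions = new HashMap<>();
    // Transition name -> client code plugged into that transition
    private final Map<String, List<Runnable>> handlers = new HashMap<>();
    private String tokenPlace;

    LifeCycleSketch(String initialPlace) {
        this.tokenPlace = initialPlace;
    }

    void addTransition(String name, String from, String to) {
        transitions.put(name, new String[] {from, to});
    }

    /** Plugs client code into a transition; it runs each time the transition fires. */
    void onTransition(String name, Runnable handler) {
        handlers.computeIfAbsent(name, k -> new ArrayList<>()).add(handler);
    }

    /** Fires a transition: legal only if the token sits in the transition's input place. */
    void fire(String name) {
        String[] arc = transitions.get(name);
        if (arc == null || !arc[0].equals(tokenPlace)) {
            throw new IllegalStateException(name + " is not enabled while the token is in " + tokenPlace);
        }
        tokenPlace = arc[1];
        for (Runnable handler : handlers.getOrDefault(name, List.of())) {
            handler.run();
        }
    }

    String tokenPlace() {
        return tokenPlace;
    }

    /** Four-place model named on the slide; every arc below is an assumption. */
    static LifeCycleSketch slideModel() {
        LifeCycleSketch model = new LifeCycleSketch("Created");
        model.addTransition("t1", "Created", "Written");    // assumed arc
        model.addTransition("t2", "Written", "Read");       // assumed arc
        model.addTransition("t3", "Read", "Terminated");    // assumed arc
        model.addTransition("t4", "Written", "Terminated"); // assumed arc
        return model;
    }

    public static void main(String[] args) {
        LifeCycleSketch model = slideModel();
        model.onTransition("t1", () -> System.out.println("computeMD5() would run here"));
        model.fire("t1");
        System.out.println("Token is now in: " + model.tokenPlace());
    }
}
```

Firing a transition that is not enabled throws an exception in this sketch, which is one simple way to keep the model and the real data state from silently diverging.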
  12. Active Data features
      The Active Data programming model and runtime environment:
      • Lets clients react to life cycle progression
      • Transparently exposes distributed data sets
      • Can be integrated with existing systems
      • Has scalable performance and minimal overhead over existing systems
  13. Implementation
      • Prototype implemented in Java (2,800 LOC)
      • Client/service communication is publish/subscribe
      • Two types of subscription:
        every transition for a given data item;
        every data item for a given transition
      [Figure: two clients subscribe to the Active Data Service]
  14. Implementation
      Several ways to publish transitions:
      • Instrument the code
      • Read the logs
      • Rely on an existing notification system
      The service orders transitions by time of arrival.
      [Figure: clients publish transitions to the Active Data Service; other clients subscribe to it]
  15. Implementation
      Clients run transition handler code locally. Transition handlers are executed:
      • Serially
      • In a blocking way
      • In the order transitions were published
      [Figure: clients publish transitions to the Active Data Service, which notifies the subscribed clients]
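The publish/subscribe behavior described on the three implementation slides can be sketched in a single process. The names `ActiveDataServiceSketch`, `TransitionHandler`, `subscribeToDataItem` and `subscribeToTransition` are hypothetical and do not come from the prototype. In the prototype handlers run on remote clients; here they are plain callbacks, which is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-process stand-in for the Active Data Service. */
class ActiveDataServiceSketch {
    /** Hypothetical callback type: receives the data item id and the transition name. */
    interface TransitionHandler {
        void handle(String dataId, String transition);
    }

    private final Map<String, List<TransitionHandler>> byDataItem = new HashMap<>();
    private final Map<String, List<TransitionHandler>> byTransition = new HashMap<>();

    /** Subscription type 1: every transition for a given data item. */
    synchronized void subscribeToDataItem(String dataId, TransitionHandler handler) {
        byDataItem.computeIfAbsent(dataId, k -> new ArrayList<>()).add(handler);
    }

    /** Subscription type 2: every data item for a given transition. */
    synchronized void subscribeToTransition(String transition, TransitionHandler handler) {
        byTransition.computeIfAbsent(transition, k -> new ArrayList<>()).add(handler);
    }

    /**
     * Publishes one transition. The method is synchronized, so publications are
     * processed one at a time in arrival order, and each handler call blocks
     * until it returns before the next handler runs.
     */
    synchronized void publish(String dataId, String transition) {
        // Iterate over copies so a handler may subscribe without breaking the loop.
        for (TransitionHandler h : new ArrayList<>(byDataItem.getOrDefault(dataId, List.of()))) {
            h.handle(dataId, transition);
        }
        for (TransitionHandler h : new ArrayList<>(byTransition.getOrDefault(transition, List.of()))) {
            h.handle(dataId, transition);
        }
    }

    public static void main(String[] args) {
        ActiveDataServiceSketch service = new ActiveDataServiceSketch();
        service.subscribeToDataItem("file-1", (id, t) -> System.out.println(id + " fired " + t));
        service.subscribeToTransition("t2", (id, t) -> System.out.println("t2 fired for " + id));
        service.publish("file-1", "t1");
        service.publish("file-2", "t2");
    }
}
```

Serial, blocking delivery keeps handler ordering easy to reason about, at the cost of a slow handler delaying every later notification.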
  16. Performance evaluation: Throughput
      [Figure: Average number of transitions per second handled by the Active Data Service. x-axis: # clients (10 to 550); y-axis: transitions per second (5,000 to 35,000)]
      Clients publish 10,000 transitions in a row without pausing.
  17. Performance evaluation: Throughput
      [Figure: Average number of transitions per second handled by the Active Data Service. x-axis: # clients (10 to 550); y-axis: transitions per second (5,000 to 35,000)]
      The prototype scales up to 30,000 transitions per second.
  18. Example: Data Provenance
      Definition: the complete history of data life cycle derivations and operations.
      • Assess the quality of data
      • Keep track of the origin of data over time
      • Specialized Provenance Aware Storage Systems
  19. Example: Data Provenance
      Definition: the complete history of data life cycle derivations and operations.
      • Assess the quality of data
      • Keep track of the origin of data over time
      • Specialized Provenance Aware Storage Systems
      → What about heterogeneous systems?
  20. Example: Data Provenance
      Definition: the complete history of data life cycle derivations and operations.
      • Assess the quality of data
      • Keep track of the origin of data over time
      • Specialized Provenance Aware Storage Systems
      → What about heterogeneous systems?
      Example with Globus Online and iRODS:
      • Globus Online: file transfer service
      • iRODS: data store and metadata catalog
  21. Example: Globus Online and iRODS
      Data events coming from Globus Online and iRODS
      [Figure: two life cycle models. iRODS: places Created, Get, Put and Terminated; transitions t5 to t10. Globus Online: places Created, Succeeded, Failed and Terminated; transitions t1 to t4. The token (•) is in the Globus Online place Created.]
      Token Id: {GO: 7b9e02c4-925d-11e2}
  22. Example: Globus Online and iRODS
      Data events coming from Globus Online and iRODS
      [Figure: same two life cycle models; the token (•) is in the Globus Online place Succeeded.]
      Handler shown on the slide:
          public void handler() { iput(...); }
  23. Example: Globus Online and iRODS
      Data events coming from Globus Online and iRODS
      [Figure: same two life cycle models; the token (•) is in the iRODS place Created.]
      Token Id: {GO: 7b9e02c4-925d-11e2, iRODS: 10032}
  24. Example: Globus Online and iRODS
      Data events coming from Globus Online and iRODS
      [Figure: same two life cycle models; the token (•) is in the iRODS place Put.]
      Handler shown on the slide:
          public void handler() { annotate(); }
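In this sequence the token first carries only a Globus Online identifier and later carries an iRODS identifier as well. One way to picture such a composite identifier is a map from system name to local id. The class `CompositeIdSketch` and its methods are hypothetical; the transcript does not show how Active Data represents identifiers internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical composite identifier: one local id per system that knows the data item. */
class CompositeIdSketch {
    // LinkedHashMap keeps systems in the order they were linked.
    private final Map<String, String> idsBySystem = new LinkedHashMap<>();

    CompositeIdSketch(String system, String id) {
        idsBySystem.put(system, id);
    }

    /** Records that another system knows the same data item under its own id. */
    void link(String system, String id) {
        idsBySystem.put(system, id);
    }

    /** True if the given system refers to this data item by the given id. */
    boolean matches(String system, String id) {
        return id.equals(idsBySystem.get(system));
    }

    @Override
    public String toString() {
        StringJoiner joiner = new StringJoiner(", ", "{", "}");
        for (Map.Entry<String, String> entry : idsBySystem.entrySet()) {
            joiner.add(entry.getKey() + ": " + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Identifier values are the ones displayed on the slides.
        CompositeIdSketch id = new CompositeIdSketch("GO", "7b9e02c4-925d-11e2");
        System.out.println(id);
        id.link("iRODS", "10032"); // after the file is stored in iRODS
        System.out.println(id);
    }
}
```

With both ids attached to one token, an event published by either system can be matched to the same data item.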
  25. Example: Globus Online and iRODS
      $ imeta ls -d test/out_test_4628
      AVUs defined for dataObj test/out_test_4628:
      attribute: GO_FAULTS
      value: 0
      ----
      attribute: GO_COMPLETION_TIME
      value: 2013-03-21 19:28:41Z
      ----
      attribute: GO_REQUEST_TIME
      value: 2013-03-21 19:28:17Z
      ----
      attribute: GO_TASK_ID
      value: 7b9e02c4-925d-11e2-97ce-123139404f2e
      ----
      attribute: GO_SOURCE
      value: go#ep1/~/test
      ----
      attribute: GO_DESTINATION
      value: asimonet#fraise/~/out_test_4628
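The slides do not show how the annotation handler writes these attributes. A minimal sketch, assuming the handler shells out to the iRODS `imeta add -d <object> <attribute> <value>` icommand, could build one argument list per attribute as below. The class `ProvenanceAnnotationSketch` and the method `imetaAddCommands` are hypothetical, and the sketch only builds the commands without running them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that turns transfer metadata into imeta argument lists. */
class ProvenanceAnnotationSketch {
    /**
     * Builds one "imeta add -d" argument list per attribute. Returning argument
     * lists rather than a single string avoids quoting problems with values that
     * contain spaces, such as timestamps.
     */
    static List<List<String>> imetaAddCommands(String dataObject, Map<String, String> attributes) {
        List<List<String>> commands = new ArrayList<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            commands.add(List.of("imeta", "add", "-d", dataObject, entry.getKey(), entry.getValue()));
        }
        return commands;
    }

    public static void main(String[] args) {
        // Attribute names and values are taken from the imeta listing on the slide.
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("GO_FAULTS", "0");
        attributes.put("GO_TASK_ID", "7b9e02c4-925d-11e2-97ce-123139404f2e");
        attributes.put("GO_SOURCE", "go#ep1/~/test");
        for (List<String> command : imetaAddCommands("test/out_test_4628", attributes)) {
            System.out.println(String.join(" ", command));
        }
    }
}
```

Each argument list could be handed to `java.lang.ProcessBuilder` on a machine where the iRODS icommands are installed and configured.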
  26. Advantages
      • Simple and graphical way to program DLM operations
      • Allows formal verification of some properties of data life cycles
      • Easy coordination between systems
      • Easy to scale
      • Easy to debug
      • Easy fault tolerance
      • Fine-grained interaction with the data life cycle
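The slides do not say which properties can be verified or how. As one illustration of the kind of check a formal life cycle model makes possible, the sketch below tests that a terminal place stays reachable from every place the token can reach. The class `LifeCycleCheckSketch`, the example arcs, and the choice of property are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical checker for single-token life cycle models. */
class LifeCycleCheckSketch {
    // Place -> places reachable by firing one transition
    private final Map<String, List<String>> successors = new HashMap<>();

    void addTransition(String from, String to) {
        successors.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    /** Breadth-first search over places, starting from (and including) start. */
    Set<String> reachableFrom(String start) {
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String place = queue.remove();
            for (String next : successors.getOrDefault(place, List.of())) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return seen;
    }

    /** True if, from every place reachable from initial, terminal can still be reached. */
    boolean canAlwaysTerminate(String initial, String terminal) {
        for (String place : reachableFrom(initial)) {
            if (!reachableFrom(place).contains(terminal)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        LifeCycleCheckSketch model = new LifeCycleCheckSketch();
        model.addTransition("Created", "Written");    // assumed arc
        model.addTransition("Written", "Read");       // assumed arc
        model.addTransition("Read", "Terminated");    // assumed arc
        model.addTransition("Written", "Terminated"); // assumed arc
        System.out.println("Can always terminate: " + model.canAlwaysTerminate("Created", "Terminated"));
    }
}
```

A model that fails this check contains a place where a data item could be stranded with no path to termination.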
  27. Limitations
      • Reasoning in terms of life cycle events is complex
      • Lack of a standard
  28. Related works
      Data-centric parallel processing:
      • Programming models: MapReduce and higher-level abstractions (PigLatin, Twister)
      • Incremental systems: MapReduce-Online, Percolator, Chimera, Nephele
      • Other models with implicit parallelism: Swift, Dryad, Allpairs
      Storage systems:
      • BitDew
      • MosaStore
      • Provenance Aware Storage Systems
      • Active Storage
  29. Conclusion
      Active Data is...
      • Data-centric & event-driven
      • System-level data integration
      What's next?
      • Advanced representation of operations that consume and produce data: represent data derivation
      • Data collection abilities
      • Distributed implementation of the publish/subscribe layer
  30. Thank you! Questions?
      Inria booth #2116
