Big Data, Beyond the Data Center
Increasingly, the next scientific discoveries and the next industrial breakthroughs will depend on the capacity to extract knowledge and sense from gigantic amounts of information. Examples range from processing data provided by scientific instruments such as CERN's LHC; collecting data from large-scale sensor networks; grabbing, indexing and nearly instantaneously mining and searching the Web; building and traversing billion-edge social network graphs; and anticipating market and customer trends through multiple channels of information. Collecting information from various sources, recognizing patterns and distilling insights constitutes what is called the Big Data challenge. However, as the volume of data grows exponentially, managing these data becomes proportionally more complex. A key challenge is to handle the complexity of data management on hybrid distributed infrastructures, i.e. assemblages of Clouds, Grids and Desktop Grids. In this talk, I will give an overview of our work in this research area, starting with BitDew, a middleware for large-scale data management on Clouds and Desktop Grids. Then I will present our approach to enabling MapReduce on Desktop Grids. Finally, I will present our latest results around Active Data, a programming model for managing the data life cycle on heterogeneous systems and infrastructures.
1. Big Data, Beyond the Data Center
Gilles Fedak
Gilles.Fedak@inria.fr
INRIA, University of Lyon, France
Cluj Economics and Business Seminar Series (CEBSS)
University Babes-Bolyai Faculty of Economics and Business
Administration
Cluj-Napoca, Romania
6/11/2014
2. AVALON Team
I Located in Lyon, France
I Joint Research Group
I INRIA : French National Institute for Research in Informatics
I ENS-Lyon : École Normale Supérieure
I University of Lyon
G. Fedak(INRIA/Avalon Team) BitDew/Active Data 6/11/2014
3. AVALON Members
Avalon Members @ April 1st, 2014
Faculty Members (8)
(4 INRIA, 1 CNRS, 2 UCBL, 1 ENSL)
• Eddy Caron, MCF ENS Lyon, HDR (80%)
• Frédéric Desprez, DR INRIA, HDR (30%)
• Gilles Fedak, CR INRIA
• Jean-Patrick Gelas, MCF UCBL
• Olivier Glück, MCF UCBL
• Laurent Lefèvre, CR INRIA, HDR
• Christian Perez, DR INRIA, HDR, Project leader
• Frédéric Suter, CR CNRS
PhD students (6)
• Maurice-Djibril Faye, ENS-Lyon / Université
Gaston Berger (Sénégal)
• Sylvain Gault, MapReduce, INRIA
• Anthony Simonet, MapReduce, INRIA
• Vincent Lanore, ENSL
• Arnaud Lefray, SEED4C, ENSIB
• Daniel Balouek, CIFRE New Generation SR
Engineers (3+4+1)
• Simon Delamare, IR CNRS (80%)
• Jean-Christophe Mignot, IR CNRS (20%)
• Matthieu Imbert, INRIA SED (40%)
• Sylvain Bernard, CloudPower
• François Rossigneux, XCLOUD
• Guillaume Verger, SEED4C
• Yulin Zhang Huaxi, SEED4C
• Laurent Pouilloux (AE Héméra)
Postdoc
• Jonathan Rouzaud-Cornabas, CNRS
Temporary Teacher-Researcher
• Ghislain Landry Tsafack, UCBL
Assistant
• Evelyne Blesle, INRIA
Avalon Team Overview
4. AVALON Topics
Avalon: Research Activities
Super-computers
(Exascale)
Desktop
Grids
Clouds
(IaaS, PaaS)
Grids
(EGI)
Energy Application Profiling and Modeling
• Large Scale Energy Consumption Analysis for
Physical and Virtual Resources
• Energy Efficiency of Next Generation Large Scale
Platforms
Data-intensive Application Profiling, Modeling,
and Management
• Performance Prediction of Parallel Regular
Applications
• Modeling Large Scale Storage Infrastructure
• Data Management for Hybrid Computing
Infrastructures
Resource Agnostic Application Description
Model
• Moldable Application Description Model
• Dynamic Adaptation of the Application Structure
Application Mapping and Scheduling
• Application Mapping and Software Deployment
• Non-Deterministic Workflow Scheduling
• Security Management in Cloud Infrastructure
5. Big Data ...
I Huge and growing volume of information originating from
multiple sources.
I Impacts many scientific instruments (LSST, LHC, OOI), but not only (sequencing machines)
I Internet and Social Networks (Google, Facebook, Twitter, etc.)
I Open Data (Open Library, Governmental, Genomics)
→ Impacts the whole process of scientific discovery (4th paradigm of science)
9. ... or Big Bottlenecks ?
I Big Data creates several challenges:
I How to scale the infrastructure? End-to-end performance improvement, inter-system optimization.
I How to improve the productivity of data-intensive scientists? Workflows, programming languages, quality of data provenance.
I How to enable collaborative data science? Incentives for data publication, data-set sharing, collaborative workflows.
I New models and software are needed to represent and manipulate large and distributed scientific data-sets.
11. BitDew: Large Scale Data
Management
Haiwu He (CAS/CNIC), Franck Cappello (ANL, UIUC)
I G. Fedak, H. He, and F. Cappello. BitDew: A Programmable Environment for Large-Scale Data Management and Distribution. In Proceedings of the ACM/IEEE SuperComputing Conference (SC08), pages 1-12, Austin, USA, November 2008.
I G. Fedak, H. He, and F. Cappello. BitDew: A Data Management and Distribution Service with Multi-Protocol and Reliable File Transfer. Journal of Network and Computer Applications, 32(5):961-975, 2009.
I H. He, G. Fedak, B. Tran, and F. Cappello. BLAST Application with Data-aware Desktop Grid Middleware. In Proceedings of the 9th IEEE International Symposium on Cluster Computing and the Grid (CCGRID'09), pages 284-291, Shanghai, China, May 2009.
13. Towards Data Desktop Grid
Desktop Grids or Volunteer Computing Systems
I High-throughput computing over large sets of idle desktop computers
I Mature technology
I EU support : European Desktop Grid Infrastructures
But ...
I High number of resources
I Volatility
I Lack of trust
I Owned by volunteers
I Scalable, but mainly for embarrassingly parallel applications with few I/O requirements
I Enabling data-intensive applications
I Bridge with Cloud and Grid infrastructures
14. Large Scale Data Management
BitDew : a Programmable Environment for Large Scale Data
Management
Key Idea 1: provides an API and a runtime environment which
integrates several P2P technologies in a consistent
way
Key Idea 2: relies on metadata (Data Attributes) to transparently drive data management operations: replication, fault tolerance, distribution, placement, life cycle.
15. BitDew : the Big Cloudy Picture
I Aggregates storage in a
single Data Space:
I Clients put and get data
from the data space
I Clients define data attributes
[Figure: clients put and get data from the Data Space; a datum carries the attribute REPLICA = 3.]
17. BitDew : the Big Cloudy Picture
I Distinguishes service nodes
(stable), client and Worker
nodes (volatile)
I Service : ensure fault
tolerance, indexing and
scheduling of data to
Worker nodes
I Worker : stores data on
Desktop PCs
I push/pull protocol between client, service and Worker nodes
[Figure: clients put/get data to the Data Space through the stable Service Nodes; the volatile Reservoir Nodes pull data from them.]
18. Data Attributes
replica : indicates how many occurrences of data should be available at
the same time in the system
resilience : controls the resilience of data in the presence of machine crashes
lifetime : is a duration, absolute or relative to the existence of other data, which indicates when a datum becomes obsolete
affinity : drives movement of data according to dependency rules
transfer protocol : gives the runtime environment hints about the file transfer protocol appropriate to distribute the data
distribution : indicates the maximum number of pieces of Data with the same Attribute that should be sent to a particular node
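As a concrete illustration of attribute-driven data management, the sketch below models an attribute set and a placement decision driven by the replica attribute. This is a hypothetical toy, not the BitDew API: the names Attribute and placeReplicas are illustrative only.

```java
// Hypothetical sketch (NOT the real BitDew API): a minimal attribute
// record mirroring the slide, and a placement decision that honours
// the "replica" attribute.
import java.util.ArrayList;
import java.util.List;

public class AttributeSketch {
    // Mirrors the attributes listed on the slide.
    static class Attribute {
        int replica = 1;           // desired simultaneous copies
        boolean resilient = false; // re-schedule on machine crash
        long lifetime = -1;        // milliseconds; -1 = unbounded
        String affinity = null;    // id of data this datum should follow
        String protocol = "http";  // hint: http, ftp, bittorrent
        int distribution = 1;      // max pieces per node
    }

    // The runtime replicates a datum until "replica" copies exist.
    static List<String> placeReplicas(Attribute attr, List<String> nodes) {
        List<String> placement = new ArrayList<>();
        for (int i = 0; i < attr.replica && i < nodes.size(); i++)
            placement.add(nodes.get(i));
        return placement;
    }

    public static void main(String[] args) {
        Attribute a = new Attribute();
        a.replica = 3;              // like REPLICA = 3 on the earlier slide
        a.protocol = "bittorrent";
        List<String> placement = placeReplicas(a, List.of("n1", "n2", "n3", "n4"));
        System.out.println(placement); // three nodes receive a copy
    }
}
```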
20. Architecture Overview
[Figure: BitDew runtime environment. APIs (Active Data, Transfer Manager, Master/Worker, Command-line Tool) are built on services (Data Catalog, Data Repository, Data Scheduler, Data Transfer, Service Container) and back-ends (File System, HTTP, FTP, BitTorrent, SQL Server, DHT, Storage).]
I Programming APIs to create Data and Attributes, manage file transfers and program applications
I Services (DC, DR, DT, DS) to store, index, distribute, schedule, transfer and provide resiliency to the Data
I Several information storage back-ends
23. Examples of BitDew Applications
I Data-Intensive Application
I DI-BOT : data-driven master-worker Arabic characters
recognition (M. Labidi, University of Sfax)
I MapReduce vs Hadoop, (X. Shi, L. Lu HUST, Wuhan China)
I Data Management Utilities
I File Sharing for Social Network (N. Kourtellis, Univ. Florida)
I Distributed Checkpoint Storage (F. Bouabache, Univ. Paris XI)
I Grid Data Stagging (IEP, Chinese Academy of Science)
24. MapReduce for Hybrid Distributed
Computing Infrastructures
Haiwu He (CAS/CSNET), Bing Tang (WUST), Xuanhua Shi, Lu Lu (HUST), Mircea Moca, Gheorghe Silaghi (Univ. Babes-Bolyai)
I B. Tang, M. Moca, S. Chevalier, H. He, and G. Fedak. Towards MapReduce for Desktop Grid Computing. In Fifth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC'10), Fukuoka, Japan, November 2010.
I M. Moca, G. C. Silaghi and G. Fedak. Distributed Results Checking for MapReduce on Volunteer Computing. In 4th Workshop on Desktop Grids and Volunteer Computing Systems (PCGrid 2010), IPDPS'2011, Anchorage, Alaska.
I L. Lu, H. Jin, X. Shi and G. Fedak. Assessing MapReduce for Internet Computing: a Comparison of Hadoop and BitDew-MapReduce. In the 13th ACM/IEEE International Conference on Grid Computing (Grid 2012), Beijing, China, 2012.
I H. Lin, W.-C. Feng and G. Fedak. Data-Intensive Computing on Desktop Grids. Book chapter in Desktop Grid Computing, CRC Press, 2012.
I B. Tang, H. He, and G. Fedak. Parallel Data Processing in Dynamic Hybrid Computing Environment Using MapReduce. In 4th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP'14), LNCS/Springer, August 24-27, Dalian, China, 2014.
25. What is MapReduce ?
I Programming model for data-intensive applications
I Proposed by Google in 2004
I Simple, inspired by functional programming
I The programmer simply defines Map and Reduce tasks
I Building block for other parallel programming tools
I Strong open-source implementation: Hadoop
I Highly scalable
I Accommodates large-scale clusters: faulty and unreliable resources
MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, in OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
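The model can be illustrated with a minimal word count in plain Java (no Hadoop): map emits a (word, 1) pair for every word, and reduce sums the counts per key.

```java
// Minimal word-count in the MapReduce style, in plain Java:
// map emits (word, 1) pairs, reduce sums the counts per word.
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    static Map<String, Integer> mapReduce(String[] documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents)                 // map phase: one doc per task
            for (String word : doc.split("\\s+"))    // emit (word, 1)
                counts.merge(word, 1, Integer::sum); // reduce phase: sum per key
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> r = mapReduce(new String[]{"big data", "big deal"});
        System.out.println(r.get("big")); // 2
    }
}
```

In a real MapReduce runtime the map calls run in parallel on many nodes and the per-key sums are merged by the reducers; the sequential loop above only shows the two roles.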
28. Challenge of MapReduce over the Internet
[Figure: MapReduce dataflow: input files are split and sent to Mappers; intermediate results are combined by Reducers into output files and a final output.]
I No shared file system nor direct communication between hosts
I Faults and host churn
I Result certification of intermediate data
I Collective operations (scatter + gather/reduction)
31. Implementing MapReduce over BitDew
Latency Hiding
I Multi-threaded workers to overlap communication and computation.
I The maximum number of concurrent Map and Reduce tasks can be configured, as well as the minimum number of tasks in the queue before computations can start.
Barrier-free computation
I Reducers detect duplication of intermediate results (which happens because of faults and/or lags).
I Early reduction : process intermediate results (IR) as they arrive → allowed us to remove the barrier between Map and Reduce tasks.
I But ... IR are not sorted.
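The barrier-free scheme can be sketched as follows; EarlyReducer and its method names are illustrative, not the actual implementation. The reducer folds each intermediate result into a running value as it arrives and drops duplicates caused by faults or lagging hosts.

```java
// Sketch of barrier-free "early reduction": fold each intermediate
// result (IR) into a running value on arrival, skipping duplicated IRs
// produced by replicated or lagging Map tasks.
import java.util.HashSet;
import java.util.Set;

public class EarlyReducer {
    private final Set<String> seen = new HashSet<>(); // IR ids already reduced
    private int accumulator = 0;                      // running reduction value

    // Returns true if the IR was new and folded in, false if duplicate.
    boolean onIntermediateResult(String irId, int value) {
        if (!seen.add(irId)) return false; // duplicate from a replica: ignore
        accumulator += value;              // reduce without waiting for a barrier
        return true;
    }

    int result() { return accumulator; }

    public static void main(String[] args) {
        EarlyReducer r = new EarlyReducer();
        r.onIntermediateResult("map-1", 4);
        r.onIntermediateResult("map-1", 4); // duplicated replica, ignored
        r.onIntermediateResult("map-2", 6);
        System.out.println(r.result()); // 10
    }
}
```

Note the trade-off stated on the slide: because IRs are folded in arrival order, they are not sorted, unlike in classic MapReduce.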
33. Scheduling and Fault-tolerance
2-level scheduling
1. Data placement is ensured by the BitDew scheduler, which is mainly guided by the data attributes.
2. Workers periodically report the state of their ongoing computation to the MR-scheduler running on the master node.
3. The MR-scheduler determines whether there are more nodes available than tasks to execute, which can avoid the lagger effect.
Fault tolerance
I In Desktop Grids, computing resources have high failure rates:
→ during the computation (execution of either Map or Reduce tasks);
→ during the communication, that is, file upload and download.
I MapInput data and the ReduceToken have the resilient attribute enabled.
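The second scheduling level can be reduced to one decision; this is an illustrative sketch with hypothetical names, not the MR-scheduler's actual code: when more workers are idle than tasks remain, replicate running tasks to dodge laggers.

```java
// Illustrative sketch of the MR-scheduler decision described above:
// when available nodes outnumber remaining tasks, the surplus nodes
// get replicas of running tasks to avoid the lagger effect.
public class MRSchedulerSketch {
    // Number of running tasks worth replicating onto idle nodes.
    static int tasksToReplicate(int availableNodes, int remainingTasks) {
        return Math.max(0, availableNodes - remainingTasks);
    }

    public static void main(String[] args) {
        // 8 workers report in, only 5 tasks left: replicate 5 of the
        // running tasks onto up to 3 idle nodes.
        System.out.println(tasksToReplicate(8, 5)); // 3
    }
}
```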
35. MapReduce Evaluation
[Figure 6: Scalability evaluation on the WordCount application (2.7 GB input): the y axis presents the throughput in MB/s and the x axis the number of nodes varying from 1 to 512.]
Table II: Evaluation of the performance according to the number of mappers and reducers.
#Mappers   4   8  16  32  32  32  32
#Reducers  1   1   1   1   4   8  16
36. Data Security and Privacy
Distributed Result Checking
I Traditional DG or VC projects implement result checking on the server.
I Intermediate results (IR) are too large to be sent back to the server
→ distributed result checking:
I replicate the MapInput and the Reducers
I select correct results by majority voting
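A minimal sketch of the majority-voting rule, assuming (as in the cited papers) that a reducer accepts an intermediate result once rm/2 + 1 identical versions out of rm Map replicas have arrived; the class and method names are illustrative.

```java
// Hedged sketch of distributed result checking by majority voting:
// a reducer accepts an intermediate result once floor(rm/2) + 1
// identical versions (out of rm replicas of the Map task) are received.
import java.util.HashMap;
import java.util.Map;

public class MajorityVote {
    private final int rm;                                        // Map replication factor
    private final Map<String, Integer> votes = new HashMap<>();  // result hash -> count

    MajorityVote(int rm) { this.rm = rm; }

    // Feed the hash of one received version; returns true once a
    // majority of identical versions has been reached.
    boolean accept(String resultHash) {
        int n = votes.merge(resultHash, 1, Integer::sum);
        return n >= rm / 2 + 1;
    }

    public static void main(String[] args) {
        MajorityVote v = new MajorityVote(3);  // rm = 3, majority = 2
        System.out.println(v.accept("h(ir)")); // false: only 1 version so far
        System.out.println(v.accept("h(ir)")); // true: 2 identical versions
    }
}
```

Comparing hashes of the IRs rather than the IRs themselves keeps the check cheap, which matters precisely because the IRs are too large to ship to the server.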
[Figure 2: Dataflows in the MapReduce implementation: (a) no replication; (b) replication of the Map tasks; (c) replication of both Map and Reduce tasks.]
With replicated Map tasks, the reducer receives a set of rm versions of the intermediate files corresponding to a map input fi. After receiving rm/2 + 1 identical versions, the reducer considers the respective result correct and accepts it as input for a Reduce task; Figure 2b illustrates the dataflow of a MapReduce execution where rm = 3.
Ensuring Data Privacy
I Use a hybrid infrastructure composed of private and public resources
I Use an IDA (Information Dispersal Algorithms) approach to distribute and store the data securely
37. Active Data: Data Life-Cycle
Management
Anthony Simonet (INRIA), Matei Ripeanu (UBC), Samer Al-Kiswany (UBC)
I A. Simonet, G. Fedak, M. Ripeanu and S. Al-Kiswany. Active Data: A Data-Centric Approach to Data Life-Cycle Management. 8th Parallel Data Storage Workshop (PDSW'13), Proceedings of SC13 workshops, Denver, November 2013 (position paper, 5 pages).
I A. Simonet, G. Fedak and M. Ripeanu. Active Data: A Programming Model for Data Life-Cycle Management on Heterogeneous Systems and Infrastructures. Technical report, under evaluation.
38. Focus on Data Life-Cycle
Data Life Cycle: the course of operational stages through which
data pass from the time when they enter a system to the time
when they leave it.
39. Use Case : The Advanced Photon Source
I 3 to 5 TB of data per week on this detector
I Raw data are pre-processed and registered in the Globus
Catalog :
I Data are curated by several applications
I Data are shared amongst scientific users
[Figure: APS dataflow: the Instrument (Beamline) transfers raw data to Local Storage, where metadata are extracted and registered in the Metadata Catalog; data are then transferred to a Remote Data Center and to an Academic Cluster for analysis and more analysis, before results are uploaded and result metadata registered.]
41. Objectives
We are aiming at:
I A model to capture the essential life-cycle stages and properties: creation, deletion, faults, replication, error checking . . .
I Allowing legacy systems to expose their intrinsic data life cycle.
I Allowing reasoning about data sets handled by heterogeneous software and infrastructures.
I Simplifying the programming of applications that implement data life-cycle management.
42. Active Data principles
System programmers expose their system's internal data life cycle with a model based on Petri Nets.
A life cycle model is made of Places and Transitions; here, the places Created, Written, Read and Terminated are linked by the transitions t1 to t4.
Each token has a unique identifier, corresponding to the actual data item's.
A transition is fired whenever a data state changes.
Code may be plugged by clients to transitions; it is executed whenever the transition is fired:
public void handler () {
compressFile ();
}
48. Active Data features
The Active Data programming model and runtime environment:
I Allows reacting to life-cycle progression
I Transparently exposes distributed data sets
I Can be integrated with existing systems
I Has scalable performance and minimal overhead over existing systems
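The principles above can be sketched as a toy life-cycle model in which client handlers run when a transition fires. All names here (LifeCycleSketch, Handler, fire) are illustrative, not the Active Data API.

```java
// Hypothetical sketch of the Active Data idea: a tiny life-cycle model
// (places + transitions) where client handlers run when a transition
// fires for a token. Not the real API.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LifeCycleSketch {
    interface Handler { void onFire(String tokenId); }

    private final Map<String, String> tokenPlace = new HashMap<>();       // token -> current place
    private final Map<String, List<Handler>> handlers = new HashMap<>();  // transition -> handlers

    // A new token starts its life in the Created place.
    void create(String tokenId) { tokenPlace.put(tokenId, "Created"); }

    // Clients plug code to a named transition (e.g. "t1").
    void subscribe(String transition, Handler h) {
        handlers.computeIfAbsent(transition, k -> new ArrayList<>()).add(h);
    }

    // Fire a transition for one token: move it and notify subscribers.
    void fire(String transition, String tokenId, String toPlace) {
        tokenPlace.put(tokenId, toPlace);
        for (Handler h : handlers.getOrDefault(transition, List.of()))
            h.onFire(tokenId);
    }

    String placeOf(String tokenId) { return tokenPlace.get(tokenId); }

    public static void main(String[] args) {
        LifeCycleSketch model = new LifeCycleSketch();
        model.create("data-1");
        model.subscribe("t1", id -> System.out.println("t1 fired for " + id));
        model.fire("t1", "data-1", "Written"); // prints "t1 fired for data-1"
    }
}
```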
49. Integration with Data Management Systems
[Figure 3: Data life cycle models for four data management systems, expressed as Petri nets: (a) BitDew Scheduler, (b) BitDew File Transfer, (c) inotify, (d) iRODS, (e) Globus Online.]
Reading the source code of BitDew, we observe that data items are managed by instances of the Data class, and that this class has a status variable which holds the data item's state. Therefore, we simply deduce the set of corresponding places in the Petri Net from the enumeration of the possible values of status (see Figures 2a and 2b). By further analyzing the source code, we construct the model and summarize how high-level DLM features are modeled with Active Data. Scheduling and replication: part of the complexity of the data life cycle in BitDew comes from the Data Scheduler.
I BitDew (INRIA), programmable environment for data management.
I inotify, Linux kernel subsystem: notification of write, movement and deletion.
I iRODS (DICE, Univ. North Carolina), rule-oriented data management system.
I Globus Online (ANL) offers a fast, simple and reliable service to transfer large volumes of data.
53. Implementation
I Prototype implemented in Java (~2,800 LOC)
I Client/Service communication is Publish/Subscribe
I 2 types of subscription:
I Every transition for a given data item
I Every data item for a given transition
[Figure: clients subscribe to the Active Data Service.]
54. Implementation
I Several ways to publish transitions:
I Instrument the code
I Read the logs
I Rely on an existing notification system
I The service orders transitions by time of arrival
[Figure: clients publish transitions to the Active Data Service, which notifies subscribed clients.]
56. Implementation
I Clients run transition handler code locally
I Transition handlers are executed:
I serially
I in a blocking way
I in the order transitions were published
[Figure: publishers send transitions to the Active Data Service; subscribed clients are notified.]
57. Data Surveillance FrameWork for APS
Anthony Simonet (INRIA), Kyle Chard (ANL), Ian Foster (ANL/UC)
I A. Simonet, K. Chard, G. Fedak, I. Foster. Active Data to Provide Smart Data Surveillance to E-Science Users. In Proceedings of Euromicro PDP'15, Turku, Finland, March 4-6, 2015.
58. Problems with APS
[Figure: Detector → 1. local transfer to Local Storage → 2. extract metadata into the Globus Catalog → 3. Globus transfer → 4. Swift parallel analysis on the Compute Cluster.]
What is inefficient in this workflow?
I Many error-prone tasks are performed manually
I Users can't monitor the whole process at once
I Small failures are difficult to detect
I A system alone can't recover from failures caused outside its scope
59. Data Surveillance Framework
4 goals (that would otherwise require a lot of scripting and hacking):
I Monitoring data set progress
I Better automation
I Sharing and notification
I Error discovery and recovery
61. Active Data Advanced Features
A. Progress Monitoring
I Associate Tags to Data
I Install Taggers on Transitions
I Guarded Transitions: handlers only execute on tokens which carry specific tags
Scientists require mechanisms to monitor their workflows from a high level, generate reports on progress, and identify potential errors without painstakingly auditing every dataset. Monitoring is not limited to estimating completion time, but also: i) receiving a single relevant notification when several related events occurred in different systems; ii) quickly noticing that an operation failed within the mass of operations that completed normally; iii) identifying steps that take longer to run than usual, backtracking the chain of causality, fixing the problem at runtime and optimizing the workflow for future executions; iv) accelerating data sharing with the community by pushing notifications to collaborators and colleagues.
B. Automation
I Handlers: Push.co, Twitter, gdoc, ifttt, etc.
The APS workflow, like many scientific workflows, requires explicit human intervention to progress between stages and to recover from unexpected events. Such interventions include running scripts on generated datasets on the shared cluster, registering datasets in the Globus catalog, and executing Swift analysis scripts on the compute cluster. Such interventions cannot be easily integrated in a traditional workflow system, because they reside a level of abstraction above the workflow system. In fact, they are the operations that start the workflow systems.
C. Sharing and Notification
Scientific sharing can be made more efficient by allowing other scientists to be notified of new dataset availability (in the catalog), with powerful filters to extract only the notifications they need, and even to start processes as soon as files are available. We believe the best way for scientists to automatically integrate new datasets in their workflows is to rely on widely used dissemination mechanisms.
[Fig. 2: Data surveillance framework design: a life cycle view over files, datasets, file transfers and metadata, with guards, code execution, tagged tokens and notification.]
IV. SYSTEM DESIGN
We next present the data surveillance framework that we designed to satisfy the APS users' needs presented above. We also elaborate on the design of specific features.
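The guarded-transition idea above can be sketched as follows; GuardedTransition and its methods are hypothetical names, not the framework's API. A handler runs only for tokens carrying the guard's tag.

```java
// Hypothetical sketch of a "guarded transition": the handler code runs
// only for tokens that carry a required tag.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class GuardedTransition {
    private final Map<String, Set<String>> tags = new HashMap<>(); // token -> tags
    private final StringBuilder log = new StringBuilder();         // stands in for handler work

    // A tagger attaches a tag to a token.
    void tag(String token, String tag) {
        tags.computeIfAbsent(token, k -> new HashSet<>()).add(tag);
    }

    // Fire the transition: run the handler only if the guard tag is present.
    void fire(String token, String guardTag) {
        if (tags.getOrDefault(token, Set.of()).contains(guardTag))
            log.append("handled:").append(token);
    }

    String log() { return log.toString(); }

    public static void main(String[] args) {
        GuardedTransition g = new GuardedTransition();
        g.tag("d1", "important");
        g.fire("d1", "important"); // guard satisfied, handler runs
        g.fire("d2", "important"); // no tag, handler skipped
        System.out.println(g.log()); // handled:d1
    }
}
```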
64. APS Data Life Cycle Model
Use-case: APS data life cycle model
[Figure: composed Petri nets for the six systems involved: Detector, Globus transfer, Shared storage, Globus Catalog, a second Globus transfer, and Swift.]
Data life cycle model composed of 6 systems.
66. Example: Data Provenance
Definition: complete history of data derivations and operations
I Assess dataset quality
I Records the context of data acquisition and transformation
I PASS: Provenance Aware Storage Systems
→ What about heterogeneous systems?
Example with Globus Online (file transfer service) and iRODS (data store and metadata catalog)
71. Provenance Scenario with Active Data
Data events coming from Globus Online and iRODS.
[Figure: composed life cycle model; the Globus Online net (Created, Succeeded, Failed, Terminated) and the iRODS net (Created, Put, Get, Terminated) are traversed by the same token.]
I The token is first known under its Globus Online identifier: Id: {GO: 7b9e02c4-925d-11e2}
I A handler plugged on a Globus Online transition pushes the transferred file into iRODS:
public void handler () {
iput (...) ;
}
I After the Put, the token carries both identifiers: Id: {GO: 7b9e02c4-925d-11e2, iRODS: 10032}
I A second handler annotates the iRODS object with provenance metadata:
public void handler () {
annotate ();
}
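A hedged sketch of the annotation step: the handler gathers the Globus Online transfer metadata as attribute/value pairs (AVUs) matching the imeta output shown below. In a real handler these pairs would be attached to the iRODS object (e.g. via imeta); the class and method names here are illustrative only.

```java
// Hypothetical sketch of the provenance handler: build the AVU
// (attribute/value) pairs describing a Globus Online transfer, to be
// attached to the matching iRODS data object.
import java.util.LinkedHashMap;
import java.util.Map;

public class ProvenanceHandler {
    static Map<String, String> buildAvus(String taskId, String source,
                                         String destination, int faults) {
        Map<String, String> avus = new LinkedHashMap<>(); // keep insertion order
        avus.put("GO_TASK_ID", taskId);
        avus.put("GO_SOURCE", source);
        avus.put("GO_DESTINATION", destination);
        avus.put("GO_FAULTS", Integer.toString(faults));
        return avus;
    }

    public static void main(String[] args) {
        // Values taken from the imeta listing on the next slide.
        System.out.println(buildAvus("7b9e02c4-925d-11e2", "go#ep1/~/test",
                "asimonet#fraise/~/out_test_4628", 0));
    }
}
```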
75. iRODS Provenance Result
$ imeta ls -d test/out_test_4628
AVUs defined for dataObj test/out_test_4628:
attribute: GO_FAULTS
value: 0
----
attribute: GO_COMPLETION_TIME
value: 2013-03-21 19:28:41Z
----
attribute: GO_REQUEST_TIME
value: 2013-03-21 19:28:17Z
----
attribute: GO_TASK_ID
value: 7b9e02c4-925d-11e2-97ce-123139404f2e
----
attribute: GO_SOURCE
value: go#ep1/~/test
----
attribute: GO_DESTINATION
value: asimonet#fraise/~/out_test_4628
76. Conclusion
We proposed several approaches to handle Big Data on hybrid distributed computing infrastructures: data management, data processing and programming models.
Next
I Data collection and data streams
I Incentive systems for collaborative data science
Want to learn more?
I Book on Desktop Grid Computing, ed. C. Cerin and G. Fedak, CRC Press
I Home page for the Big Data class: http://graal.ens-lyon.fr/gfedak/pmwiki-lri/pmwiki.php/Main/MarpReduceClass
I Our websites: http://www.bitdew.net and http://www.xtremweb-hep.org