Big Data, Beyond the Data Center 
Gilles Fedak 
Gilles.Fedak@inria.fr 
INRIA, University of Lyon, France 
Cluj Economics and Business Seminar Series (CEBSS) 
University Babes-Bolyai Faculty of Economics and Business 
Administration 
Cluj-Napoca, Romania 
6/11/2014
AVALON Team 
I Located in Lyon, France 
I Joint Research Group 
I INRIA : French National Institute for Research in Informatics 
I ENS-Lyon : Ecole Normale Supérieure
I University of Lyon 
G. Fedak(INRIA/Avalon Team) BitDew/Active Data 6/11/2014
AVALON Members 
Avalon Members @ April 1st, 2014 
Faculty Members (8) 
(4 INRIA, 1 CNRS, 2 UCBL, 1 ENSL) 
• Eddy Caron, MCF ENS Lyon, HDR (80%) 
• Frédéric Desprez, DR INRIA, HDR (30%) 
• Gilles Fedak, CR INRIA 
• Jean-Patrick Gelas, MCF UCBL 
• Olivier Glück, MCF UCBL 
• Laurent Lefèvre, CR INRIA, HDR 
• Christian Perez, DR INRIA, HDR, Project leader 
• Frédéric Suter, CR CNRS 
PhD students (6) 
• Maurice-Djibril Faye, ENS-Lyon / Université 
Gaston Berger (Sénégal) 
• Sylvain Gault, MapReduce, INRIA 
• Anthony Simonet, MapReduce, INRIA 
• Vincent Lanore, ENSL 
• Arnaud Lefray, SEED4C, ENSIB 
• Daniel Balouek, CIFRE New Generation SR 
Engineers (3+4+1) 
• Simon Delamare, IR CNRS (80%) 
• Jean-Christophe Mignot, IR CNRS (20%) 
• Matthieu Imbert, INRIA SED (40%) 
• Sylvain Bernard, CloudPower 
• François Rossigneux, XCLOUD 
• Guillaume Verger, SEED4C 
• Yulin Zhang Huaxi, SEED4C 
• Laurent Pouilloux (AE Héméra) 
Postdoc 
• Jonathan Rouzaud-Cornabas, CNRS 
Temporary Teacher-Researcher 
• Ghislain Landry Tsafack, UCBL 
Assistant 
• Evelyne Blesle, INRIA 
Avalon Team Overview 
AVALON Topics 
Avalon: Research Activities 
[Figure: target platforms: Super-computers (Exascale), Desktop Grids, Clouds (IaaS, PaaS), Grids (EGI)]
Energy Application Profiling and Modelization 
• Large Scale Energy Consumption Analysis for 
Physical and Virtual Resources 
• Energy Efficiency of Next Generation Large Scale 
Platforms 
Data-intensive Application Profiling, Modeling, 
and Management 
• Performance Prediction of Parallel Regular 
Applications 
• Modeling Large Scale Storage Infrastructure 
• Data Management for Hybrid Computing 
Infrastructures 
Resource Agnostic Application Description 
Model 
• Moldable Application Description Model 
• Dynamic Adaptation of the Application Structure 
Application Mapping and Scheduling 
• Application Mapping and Software Deployment 
• Non-Deterministic Workflow Scheduling 
• Security Management in Cloud Infrastructure 
Big Data ... 
I Huge and growing volume of information originating from 
multiple sources. 
I Impacts many scientific disciplines and industry branches
I Large Scientific Instruments (LSST, LHC, OOOI), but not only (Sequencing machines)
I Internet and Social Network (Google, Facebook, Twitter, etc.)
I Open Data (Open Library, Governmental, Genomics)
→ impacts the whole process of scientific discovery (4th paradigm of science)
... or Big Bottlenecks ? 
I Big Data creates several challenges : 
I how to scale the infrastructure ? 
I end-to-end performance improvement, inter-system 
optimization. 
I how to improve productivity of data-intensive scientists ?
I workflow, programming language, quality of data provenance.
I how to enable collaborative data science ?
I incentive for data publication, data-sets sharing, collaborative workflow.
I New models and software are needed to represent and manipulate large and distributed scientific data-sets.
BitDew: Large Scale Data 
Management 
Haiwu He (CAS/CNIC), Franck Cappello (ANL, UIUC) 
I G. Fedak, H. He, and F. Cappello. BitDew: A Programmable Environment for Large-Scale Data Management and Distribution. In Proceedings of the ACM/IEEE SuperComputing Conference (SC08), pages 1-12, Austin, USA, November 2008.
I G. Fedak, H. He, and F. Cappello. BitDew: A Data Management and Distribution Service with Multi-Protocol and Reliable File Transfer. Journal of Network and Computer Applications, 32(5):961-975, 2009.
I H. He, G. Fedak, B. Tran, and F. Cappello. BLAST Application with Data-aware Desktop Grid Middleware. In Proceedings of 9th IEEE International Symposium on Cluster Computing and the Grid (CCGRID09), pages 284-291, Shanghai, China, May 2009.
Towards Data Desktop Grid 
Desktop Grid or Volunteer Computing Systems
I High Throughput Computing over Large Sets of Idle Desktop 
Computers 
I Mature technology 
I EU support : European Desktop Grid Infrastructures 
But ... 
I High number of resources 
I Volatility 
I Lack of trust 
I Owned by volunteers
I Scalable but mainly for embarrassingly parallel applications 
with few I/O requirements 
I Enabling data-intensive applications 
I Bridge with Cloud and Grid infrastructures 
Large Scale Data Management 
BitDew : a Programmable Environment for Large Scale Data 
Management 
Key Idea 1: provides an API and a runtime environment which 
integrates several P2P technologies in a consistent 
way 
Key Idea 2: relies on metadata (Data Attributes) to transparently drive
data management operations: replication, fault-tolerance,
distribution, placement, life-cycle.
BitDew : the Big Cloudy Picture 
I Aggregates storage in a 
single Data Space: 
I Clients put and get data 
from the data space 
I Clients define data attributes
[Figure: a client puts data into and gets data from the Data Space; example attribute REPLICAT = 3]
BitDew : the Big Cloudy Picture 
I Distinguishes service nodes 
(stable), client and Worker 
nodes (volatile) 
I Service : ensure fault 
tolerance, indexing and 
scheduling of data to 
Worker nodes 
I Worker : stores data on 
Desktop PCs 
I push/pull protocol between client → service ← Worker
[Figure: the client puts and gets data in the Data Space held by the Service Nodes; Reservoir Nodes pull data from it]
Data Attributes 
replica : indicates how many occurrences of data should be available at 
the same time in the system 
resilience : controls the resilience of data in presence of machine crash 
lifetime : is a duration, absolute or relative to the existence of other data, 
which indicates when a datum is obsolete 
affinity : drives movement of data according to dependency rules 
transfer protocol : gives the runtime environment hints about the file transfer protocol
appropriate to distribute the data
distribution : indicates the maximum number of pieces of Data with the
same Attribute that should be sent to a particular node.
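One way to picture the attributes listed above is as a small value object attached to each datum. The sketch below is illustrative only: the class `DataAttributeSketch`, its field names, and the `needsMoreCopies` helper are hypothetical and chosen to mirror the slide, not taken from the BitDew API.

```java
import java.time.Duration;

// Hypothetical value object mirroring the attributes on the slide.
// This is NOT the BitDew API; every name here is an assumption.
final class DataAttributeSketch {
    final int replica;             // copies that should be available at the same time
    final boolean resilient;       // reschedule the datum when its host crashes
    final Duration lifetime;       // relative lifetime; null means no expiry
    final String affinity;         // id of the datum to be placed with, or null
    final String transferProtocol; // hint, e.g. "http", "ftp" or "bittorrent"
    final int distribution;        // max data with this attribute sent to one node

    DataAttributeSketch(int replica, boolean resilient, Duration lifetime,
                        String affinity, String transferProtocol, int distribution) {
        this.replica = replica;
        this.resilient = resilient;
        this.lifetime = lifetime;
        this.affinity = affinity;
        this.transferProtocol = transferProtocol;
        this.distribution = distribution;
    }

    // A scheduler could use this check to decide whether to place another copy.
    boolean needsMoreCopies(int liveCopies) {
        return liveCopies < replica;
    }

    public static void main(String[] args) {
        DataAttributeSketch attr = new DataAttributeSketch(
                3, true, Duration.ofHours(24), null, "bittorrent", 1);
        System.out.println(attr.needsMoreCopies(2)); // two live copies, three requested
    }
}
```

The point of the design is that the client only declares what it wants (three copies, resilient, one day of lifetime) and leaves the how to the runtime.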
Architecture Overview 
[Figure: BitDew architecture in layers. Applications: Command-line Tool, Storage, Master/Worker. API: Active Data, BitDew, Transfer Manager. Services (in a Service Container): Data Catalog, Data Repository, Data Scheduler, Data Transfer. Back-ends: SQL Server, DHT, File System, Http, Ftp, Bittorrent]
BitDew Runtime Environment
I Programming APIs to create Data, Attributes, manage file
transfers and program applications
I Services (DC, DR, DT, DS) to store, index, distribute,
schedule, transfer and provide resiliency to the Data
I Several information storage backends and file transfer
protocols
Examples of BitDew Applications 
I Data-Intensive Application 
I DI-BOT : data-driven master-worker Arabic characters 
recognition (M. Labidi, University of Sfax) 
I MapReduce vs Hadoop, (X. Shi, L. Lu HUST, Wuhan China) 
I Data Management Utilities 
I File Sharing for Social Network (N. Kourtellis, Univ. Florida) 
I Distributed Checkpoint Storage (F. Bouabache, Univ. Paris XI) 
I Grid Data Staging (IEP, Chinese Academy of Science)
MapReduce for Hybrid Distributed 
Computing Infrastructures 
Haiwu He (CAS/CSNET), Bing Tang (WUST), Xuanhua Shi & Lu Lu
(HUST), Mircea Moca & Gheorghe Silaghi (Univ. Babes-Bolyai)
I Towards MapReduce for Desktop Grid Computing. B. Tang, M. Moca, S. Chevalier, H. He, and G. Fedak. In Fifth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC'10), Fukuoka, Japan, November 2010.
I Distributed Results Checking for MapReduce on Volunteer Computing. Mircea Moca, Gheorghe Cosmin Silaghi and Gilles Fedak, in 4th Workshop on Desktop Grids and Volunteer Computing Systems (PCGrid 2010) IPDPS'2011, Anchorage, Alaska.
I Assessing MapReduce for Internet Computing: a Comparison of Hadoop and BitDew-MapReduce. Lu Lu, Hai Jin, Xuanhua Shi and Gilles Fedak, in the 13th ACM/IEEE International Conference on Grid Computing (Grid 2012), Beijing, China, 2012.
I Data-Intensive Computing on Desktop Grids. H. Lin, W.-C. Feng and G. Fedak. Book chapter in Desktop Grid Computing Book, CRC Press, 2012.
I Parallel Data Processing in Dynamic Hybrid Computing Environment Using MapReduce. Bing Tang, Haiwu He, Gilles Fedak, 4th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP'14), LNCS/Springer Verlag, August 24-27, Dalian, China, 2014.
What is MapReduce ? 
I Programming Model for data-intensive applications
I Proposed by Google in 2004
I Simple, inspired by functional programming
I programmer simply defines Map and Reduce tasks
I Building block for other parallel programming tools
I Strong open-source implementation: Hadoop
I Highly scalable
I Accommodate large scale clusters: faulty and unreliable
resources
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay
Ghemawat, in OSDI'04: Sixth Symposium on Operating System Design and
Implementation, San Francisco, CA, December, 2004.
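To make "the programmer simply defines Map and Reduce tasks" concrete, here is a minimal word count sketch. It assumes a sequential, in-memory driver (`run`) standing in for the distributed runtime; the class and method names are hypothetical and belong neither to Hadoop nor to BitDew.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a sequential stand-in for a MapReduce runtime.
final class WordCountSketch {
    // Map task: emit a (word, 1) pair for every word of one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce task: sum every count emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: group the intermediate pairs by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("big data", "Big bottlenecks")));
    }
}
```

Only `map` and `reduce` are application code; grouping, distribution and fault handling are what a real runtime provides.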
Challenge of MapReduce over the Internet 
[Figure: MapReduce dataflow: Input Files → Data Split → Mappers → Intermediate Results → Combine → Reducers → Output Files (Output1, Output2, Output3) → Final Output]
I no shared file system nor direct communication between hosts
I Faults and hosts churn
I Result Certification of Intermediate Data
I Collective Operation (scatter + gather/reduction)
Implementing MapReduce over BitDew 
Latency Hiding 
I Multi-threaded worker to overlap
communication and computation.
I The number of maximum concurrent
Map and Reduce tasks can be
configured, as well as the minimum
number of tasks in the queue before
computations can start.
Barrier-free computation 
I Reducers detect duplication of intermediate results (that happen because
of faults and/or lags).
I Early reduction : process IR as they arrive → allowed us to remove the
barrier between Map and Reduce tasks.
I But ... IR are not sorted.
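The barrier-free idea above can be sketched as a reducer that folds each intermediate result (IR) in as soon as it arrives and silently drops copies it has already seen. The sketch assumes a word count style reduction (a commutative, associative sum), which is what makes arrival order irrelevant; `EarlyReducerSketch` and the `chunkId` identifier are hypothetical names, not the BitDew-MapReduce implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: early reduction with duplicate detection.
final class EarlyReducerSketch {
    private final Set<String> seenChunks = new HashSet<>();
    private final Map<String, Integer> totals = new HashMap<>();

    // Called whenever an IR arrives, in any order.
    // chunkId (assumed) identifies the map input that produced the IR.
    // Returns false when the IR is a duplicate and was ignored.
    boolean accept(String chunkId, Map<String, Integer> partialCounts) {
        if (!seenChunks.add(chunkId)) {
            return false; // already reduced: a re-executed or late Map task
        }
        for (Map.Entry<String, Integer> e : partialCounts.entrySet()) {
            totals.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return true;
    }

    Map<String, Integer> totals() {
        return totals;
    }

    public static void main(String[] args) {
        EarlyReducerSketch reducer = new EarlyReducerSketch();
        Map<String, Integer> ir = new HashMap<>();
        ir.put("data", 2);
        System.out.println(reducer.accept("chunk-0", ir)); // first copy is reduced
        System.out.println(reducer.accept("chunk-0", ir)); // replayed copy is ignored
        System.out.println(reducer.totals());
    }
}
```

A reduction that needs its input sorted by key could not be written this way, which is the cost noted in the last bullet.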
Scheduling and Fault-tolerance 
2-level scheduling 
1. Data placement is ensured by the BitDew scheduler, which is mainly 
guided by the data attribute. 
2. Workers periodically report the state of their ongoing computation to
the MR-scheduler, which runs on the master node.
3. The MR-scheduler determines if there are more nodes available than
tasks to execute, which can avoid the lagger effect.
Fault-tolerance
I In Desktop Grid, computing resources have high failure rates :
→ during the computation (either execution of Map or Reduce
tasks)
→ during the communication, that is file upload and download.
I MapInput data and ReduceToken tokens have the resilient attribute
enabled.
MapReduce Evaluation 
Figure: Scalability evaluation on the WordCount application: the y axis
presents the throughput in MB/s and the x axis the number of nodes
varying from 1 to 512.
Table II: Evaluation of the performance according to the number of Mappers and Reducers.
#Mappers   4  8  16  32  32  32  32
#Reducers  1  1  1   1   4   8   16
Data Security and Privacy 
Distributed Result Checking 
I Traditional DG or VC projects implement result checking on
the server.
I IR are too large to be sent back to the server
→ distributed result checking
I replicate the MapInput and Reducers
I select correct results by majority voting
[Figure 2: Dataflows in the MapReduce implementation: (a) No replication, (b) Replication of the Map tasks, (c) Replication of both Map and Reduce tasks]
... distinct workers as input for Map task. Consequently, the reducer receives a set of rm versions of intermediate files corresponding to a map input fi. After receiving rm/2 + 1 identical versions, the reducer considers the respective result correct and further, accepts it as input for a Reduce task. Figure 2b illustrates the dataflow corresponding to a MapReduce execution where rm = 3.
In order to activate the checking mechanism for the results received from reducers, the master node schedules the ... received intermediate result is present in its own codes list K, ignoring failed results.
IV. EVALUATION OF THE RESULTS CHECKING MECHANISM
In this section we describe the model for characterizing the errors produced by the MapReduce algorithm in Volunteer Computing. We assume that workers sabotage independently of one another, thus we do not take into ...
Ensuring Data Privacy
I Use a hybrid infrastructure : composed of private and public resources
I Use IDA (Information Dispersal Algorithms) approach to distribute and store securely the data
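The majority voting rule for replicated results can be sketched as follows. The sketch assumes each replica's result is summarized by a digest string and that a result is accepted once rm/2 + 1 identical versions have arrived (integer division), as in the excerpt above; `MajorityVoteSketch` and its `accept` method are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative only: accept a replicated result by majority voting.
final class MajorityVoteSketch {
    // digests: one digest per replica result received so far (assumed representation).
    // rm: replication factor of the task.
    // Returns the digest seen at least rm/2 + 1 times, if there is one yet.
    static Optional<String> accept(List<String> digests, int rm) {
        int threshold = rm / 2 + 1;
        Map<String, Integer> votes = new HashMap<>();
        for (String digest : digests) {
            int count = votes.merge(digest, 1, Integer::sum);
            if (count >= threshold) {
                return Optional.of(digest);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // rm = 3: two identical versions are enough to accept.
        System.out.println(accept(Arrays.asList("abc", "bad", "abc"), 3));
        // Only two differing versions so far: no decision yet.
        System.out.println(accept(Arrays.asList("abc", "bad"), 3));
    }
}
```

Because the vote runs on the reducer that receives the versions, the large intermediate files never have to travel back to the server.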
Active Data: Data Life-Cycle 
Management 
Anthony Simonet (INRIA), Matei Ripeanu (UCB), Samer Al-Kiswany 
(UCB) 
I Active Data: A Data-Centric Approach to Data Life-Cycle Management Anthony Simonet, Gilles Fedak, 
Matei Ripeanu and Samer Al-Kiswany. 8th Parallel Data Storage Workshop (PDSW'13), Proceedings of 
SC13 workshops, Denver, November, 2013 (position paper 5 pages) 
I Active Data: A Programming Model for Data Life-Cycle Management on Heterogeneous Systems and 
Infrastructures. Anthony Simonet, Gilles Fedak and Matei Ripeanu. Technical Report under evaluation. 
Focus on Data Life-Cycle 
Data Life Cycle: the course of operational stages through which 
data pass from the time when they enter a system to the time 
when they leave it. 
Use Case : The Advanced Photon Source 
I 3 to 5 TB of data per week on this detector 
I Raw data are pre-processed and registered in the Globus 
Catalog : 
I Data are curated by several applications 
I Data are shared amongst scientific users
[Figure: APS data flow: Instrument (Beamline) → Local Storage; extract and register metadata in the Metadata Catalog; transfer to a Remote Data Center; transfer to an Academic Cluster for analysis and more analysis; upload result and register result metadata]
Objectives 
We're aiming at : 
I A model to capture the essential life cycle stages and 
properties: creation, deletion, faults, replication, error 
checking . . . 
I Allow legacy systems to expose their intrinsic data life cycle.
I Allow reasoning about data sets handled by heterogeneous
software and infrastructures.
I Simplify the programming of applications that implement data
life cycle management.
Active Data principles 
System programmers expose their system's internal data life cycle 
with a model based on Petri Nets. 
A life cycle model is made of Places and Transitions
[Figure: Petri net with places Created, Written, Read, Terminated and transitions t1 to t4]
Each token has a unique identifier, corresponding to the actual
data item's.
A transition is fired whenever a data state changes.
public void handler() {
    compressFile();
}
Code may be plugged by clients to transitions.
It is executed whenever the transition is fired.
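The principles above (places, transitions, tokens identified by data item, and client code plugged onto transitions) can be sketched in a few lines. This is a simplified illustration, not the Active Data API: `LifeCycleSketch` and its methods are hypothetical, each transition is assumed to have exactly one input and one output place, and the arcs used in `main` (t1 from Created to Written, t2 from Written to Read) are an assumption about the figure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Illustrative only: a tiny life cycle model with pluggable handlers.
final class LifeCycleSketch {
    private static final class Transition {
        final String from;
        final String to;
        final List<Consumer<String>> handlers = new ArrayList<>();

        Transition(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    private final Map<String, Transition> transitions = new HashMap<>();
    private final Map<String, String> placeOfToken = new HashMap<>(); // token id -> place

    void addTransition(String name, String from, String to) {
        transitions.put(name, new Transition(from, to));
    }

    // Plug client code onto a transition; it receives the token id.
    void subscribe(String transitionName, Consumer<String> handler) {
        lookup(transitionName).handlers.add(handler);
    }

    void createToken(String tokenId, String initialPlace) {
        placeOfToken.put(tokenId, initialPlace);
    }

    // Fire a transition for one token: move it, then run the handlers.
    void fire(String transitionName, String tokenId) {
        Transition t = lookup(transitionName);
        if (!t.from.equals(placeOfToken.get(tokenId))) {
            throw new IllegalStateException(tokenId + " is not in place " + t.from);
        }
        placeOfToken.put(tokenId, t.to);
        for (Consumer<String> handler : t.handlers) {
            handler.accept(tokenId);
        }
    }

    String placeOf(String tokenId) {
        return placeOfToken.get(tokenId);
    }

    private Transition lookup(String name) {
        Transition t = transitions.get(name);
        if (t == null) {
            throw new IllegalArgumentException("unknown transition " + name);
        }
        return t;
    }

    public static void main(String[] args) {
        LifeCycleSketch lifeCycle = new LifeCycleSketch();
        lifeCycle.addTransition("t1", "Created", "Written");
        lifeCycle.addTransition("t2", "Written", "Read");
        lifeCycle.subscribe("t1", id -> System.out.println("compress " + id));
        lifeCycle.createToken("file-42", "Created");
        lifeCycle.fire("t1", "file-42");
        System.out.println(lifeCycle.placeOf("file-42"));
    }
}
```

The handler never polls the data management system; it only reacts when the system reports that a transition fired.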
Active Data features 
The Active Data programming model and runtime environment: 
I Allows reacting to life cycle progression
I Exposes transparently distributed data sets
I Can be integrated with existing systems
I Has scalable performance and minimum overhead over
existing systems
Integration with Data Management Systems 
[Figure 3: Data life cycle models for four data management systems: (a) BitDew Scheduler, (b) BitDew File Transfer, (c) inotify, (d) iRODS, (e) Globus Online]
structed from its documentation.
Reading the source code of BitDew, we observe that data items are managed by instances of the Data class, and this class has the status variable which holds the data item state. Therefore, we simply deduce from the enumeration of the possible values of status the set of corresponding places in the Petri Net (see Figure 2a and 2b). By further analyzing the source code, we construct the model and summarize how high level DLM features are modeled using the Active Data model:
Scheduling and replication: Part of the complexity of the data life cycle in BitDew comes from the Data Scheduler
I BitDew (INRIA), programmable 
environment for data management. 
I inotify Linux kernel subsystem: 
noti

More Related Content

What's hot

GrenchMark at CCGrid, May 2006.
GrenchMark at CCGrid, May 2006.GrenchMark at CCGrid, May 2006.
GrenchMark at CCGrid, May 2006.Alexandru Iosup
 
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC Geoffrey Fox
 
Brief on Linked Data at U.S. EPA to Chief Data Scientist
Brief on Linked Data at U.S. EPA to Chief Data ScientistBrief on Linked Data at U.S. EPA to Chief Data Scientist
Brief on Linked Data at U.S. EPA to Chief Data ScientistBernadette Hyland-Wood
 
US EPA Resource Conservation and Recovery Act published as Linked Open Data
US EPA Resource Conservation and Recovery Act published as Linked Open DataUS EPA Resource Conservation and Recovery Act published as Linked Open Data
US EPA Resource Conservation and Recovery Act published as Linked Open Data3 Round Stones
 
HathiTrust Research Center Secure Commons
HathiTrust Research Center Secure CommonsHathiTrust Research Center Secure Commons
HathiTrust Research Center Secure CommonsBeth Plale
 
Research data management & planning: an introduction
Research data management & planning: an introductionResearch data management & planning: an introduction
Research data management & planning: an introductionMaggie Neilson
 
MACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORKMACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORKAbhi Jit
 
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital TextsCase Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital TextsBeth Plale
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
 
Information Systems - Lecture A
Information Systems - Lecture AInformation Systems - Lecture A
Information Systems - Lecture ACMDLearning
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challengesijcisjournal
 
Big data service architecture: a survey
Big data service architecture: a surveyBig data service architecture: a survey
Big data service architecture: a surveyssuser0191d4
 
Open Access: Open Access Looking for ways to increase the reach and impact of...
Open Access: Open Access Looking for ways to increase the reach and impact of...Open Access: Open Access Looking for ways to increase the reach and impact of...
Open Access: Open Access Looking for ways to increase the reach and impact of...librarianrafia
 
xldb2012_wed_0950_TimFrazier
xldb2012_wed_0950_TimFrazierxldb2012_wed_0950_TimFrazier
xldb2012_wed_0950_TimFrazierTim Frazier
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph MaintenancePaul Groth
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?Robert Grossman
 
Research Data Curation _ Grad Humanities Class
Research Data Curation _ Grad Humanities ClassResearch Data Curation _ Grad Humanities Class
Research Data Curation _ Grad Humanities ClassAaron Collie
 

What's hot (20)

GrenchMark at CCGrid, May 2006.
GrenchMark at CCGrid, May 2006.GrenchMark at CCGrid, May 2006.
GrenchMark at CCGrid, May 2006.
 
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
 
Brief on Linked Data at U.S. EPA to Chief Data Scientist
Brief on Linked Data at U.S. EPA to Chief Data ScientistBrief on Linked Data at U.S. EPA to Chief Data Scientist
Brief on Linked Data at U.S. EPA to Chief Data Scientist
 
US EPA Resource Conservation and Recovery Act published as Linked Open Data
US EPA Resource Conservation and Recovery Act published as Linked Open DataUS EPA Resource Conservation and Recovery Act published as Linked Open Data
US EPA Resource Conservation and Recovery Act published as Linked Open Data
 
HathiTrust Research Center Secure Commons
HathiTrust Research Center Secure CommonsHathiTrust Research Center Secure Commons
HathiTrust Research Center Secure Commons
 
Research data management & planning: an introduction
Research data management & planning: an introductionResearch data management & planning: an introduction
Research data management & planning: an introduction
 
OpenData Public Research, University of Toronto, Open Access Week, 25/11/2011
OpenData Public Research, University of Toronto, Open Access Week, 25/11/2011OpenData Public Research, University of Toronto, Open Access Week, 25/11/2011
OpenData Public Research, University of Toronto, Open Access Week, 25/11/2011
 
End-to-End eScience
End-to-End eScienceEnd-to-End eScience
End-to-End eScience
 
MACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORKMACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORK
 
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital TextsCase Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
Information Systems - Lecture A
Information Systems - Lecture AInformation Systems - Lecture A
Information Systems - Lecture A
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challenges
 
Big data service architecture: a survey
Big data service architecture: a surveyBig data service architecture: a survey
Big data service architecture: a survey
 
Open Access: Open Access Looking for ways to increase the reach and impact of...
Open Access: Open Access Looking for ways to increase the reach and impact of...Open Access: Open Access Looking for ways to increase the reach and impact of...
Open Access: Open Access Looking for ways to increase the reach and impact of...
 
Virtualization for HPC at NCI
Virtualization for HPC at NCIVirtualization for HPC at NCI
Virtualization for HPC at NCI
 
xldb2012_wed_0950_TimFrazier
xldb2012_wed_0950_TimFrazierxldb2012_wed_0950_TimFrazier
xldb2012_wed_0950_TimFrazier
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph Maintenance
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?
 
Research Data Curation _ Grad Humanities Class
Research Data Curation _ Grad Humanities ClassResearch Data Curation _ Grad Humanities Class
Research Data Curation _ Grad Humanities Class
 

Viewers also liked

Mapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, OptimizationsMapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, OptimizationsGilles Fedak
 
Active Data PDSW'13
Active Data PDSW'13Active Data PDSW'13
Active Data PDSW'13Gilles Fedak
 
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...Gilles Fedak
 
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...Gilles Fedak
 
The iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and PerspectivesThe iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and PerspectivesGilles Fedak
 
iExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud ComputingiExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud ComputingGilles Fedak
 
How Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the InternetHow Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the InternetGilles Fedak
 

Viewers also liked (7)

Mapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, OptimizationsMapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, Optimizations
 
Active Data PDSW'13
Active Data PDSW'13Active Data PDSW'13
Active Data PDSW'13
 
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
 
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
 
The iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and PerspectivesThe iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and Perspectives
 
iExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud ComputingiExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud Computing
 
How Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the InternetHow Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the Internet
 

Big Data, Beyond the Data Center

  • 1. Big Data, Beyond the Data Center Gilles Fedak Gilles.Fedak@inria.fr INRIA, University of Lyon, France Cluj Economics and Business Seminar Series (CEBSS) University Babes-Bolyai Faculty of Economics and Business Administration Cluj-Napoca, Romania 6/11/2014
  • 2. AVALON Team
      • Located in Lyon, France
      • Joint Research Group
          • INRIA: French National Institute for Research in Informatics
          • ENS-Lyon: École Normale Supérieure
          • University of Lyon
      G. Fedak(INRIA/Avalon Team) BitDew/Active Data 6/11/2014
  • 3. AVALON Members (as of April 1st, 2014)
      Faculty Members (8) (4 INRIA, 1 CNRS, 2 UCBL, 1 ENSL)
      • Eddy Caron, MCF ENS Lyon, HDR (80%)
      • Frédéric Desprez, DR INRIA, HDR (30%)
      • Gilles Fedak, CR INRIA
      • Jean-Patrick Gelas, MCF UCBL
      • Olivier Glück, MCF UCBL
      • Laurent Lefèvre, CR INRIA, HDR
      • Christian Perez, DR INRIA, HDR, Project leader
      • Frédéric Suter, CR CNRS
      PhD students (6)
      • Maurice-Djibril Faye, ENS-Lyon / Université Gaston Berger (Sénégal)
      • Sylvain Gault, MapReduce, INRIA
      • Anthony Simonet, MapReduce, INRIA
      • Vincent Lanore, ENSL
      • Arnaud Lefray, SEED4C, ENSIB
      • Daniel Balouek, CIFRE New Generation SR
      Engineers (3+4+1)
      • Simon Delamare, IR CNRS (80%)
      • Jean-Christophe Mignot, IR CNRS (20%)
      • Matthieu Imbert, INRIA SED (40%)
      • Sylvain Bernard, CloudPower
      • François Rossigneux, XCLOUD
      • Guillaume Verger, SEED4C
      • Yulin Zhang Huaxi, SEED4C
      • Laurent Pouilloux (AE Héméra)
      Postdoc
      • Jonathan Rouzaud-Cornabas, CNRS
      Temporary Teacher-Researcher
      • Ghislain Landry Tsafack, UCBL
      Assistant
      • Evelyne Blesle, INRIA
  • 4. AVALON Topics: Research Activities
      [Figure labels: Super-computers (Exascale), Desktop Grids, Clouds (IaaS, PaaS), Grids (EGI)]
      Energy Application Profiling and Modelization
      • Large Scale Energy Consumption Analysis for Physical and Virtual Resources
      • Energy Efficiency of Next Generation Large Scale Platforms
      Data-intensive Application Profiling, Modeling, and Management
      • Performance Prediction of Parallel Regular Applications
      • Modeling Large Scale Storage Infrastructure
      • Data Management for Hybrid Computing Infrastructures
      Resource Agnostic Application Description Model
      • Moldable Application Description Model
      • Dynamic Adaptation of the Application Structure
      Application Mapping and Scheduling
      • Application Mapping and Software Deployment
      • Non-Deterministic Workflow Scheduling
      • Security Management in Cloud Infrastructure
  • 5. Big Data ...
      • Huge and growing volume of information originating from multiple sources.
      • Impacts many scientific disciplines and industry branches
          • Large Scientific Instruments (LSST, LHC, OOOI), but not only (Sequencing machines)
          • Internet and Social Network (Google, Facebook, Twitter, etc.)
          • Open Data (Open Library, Governmental, Genomics)
      → impacts the whole process of scientific discovery (4th paradigm of science)
  • 9. ... or Big Bottlenecks?
      • Big Data creates several challenges:
          • how to scale the infrastructure? End-to-end performance improvement, inter-system optimization.
          • how to improve the productivity of data-intensive scientists? Workflow, programming language, quality of data provenance.
          • how to enable collaborative data science? Incentive for data publication, data-sets sharing, collaborative workflow.
      • New models and software are needed to represent and manipulate large and distributed scientific data-sets.
  • 11. BitDew: Large Scale Data Management
      Haiwu He (CAS/CNIC), Franck Cappello (ANL, UIUC)
      • G. Fedak, H. He, and F. Cappello. BitDew: A Programmable Environment for Large-Scale Data Management and Distribution. In Proceedings of the ACM/IEEE SuperComputing Conference (SC08), pages 1-12, Austin, USA, November 2008.
      • G. Fedak, H. He, and F. Cappello. BitDew: A Data Management and Distribution Service with Multi-Protocol and Reliable File Transfer. Journal of Network and Computer Applications, 32(5):961-975, 2009.
      • H. He, G. Fedak, B. Tran, and F. Cappello. BLAST Application with Data-aware Desktop Grid Middleware. In Proceedings of 9th IEEE International Symposium on Cluster Computing and the Grid (CCGRID09), pages 284-291, Shanghai, China, May 2009.
  • 13. Towards Data Desktop Grid
      Desktop Grid or Volunteer Computing Systems
      • High Throughput Computing over Large Sets of Idle Desktop Computers
      • Mature technology
      • EU support: European Desktop Grid Infrastructures
      But ...
      • High number of resources
      • Volatility
      • Lack of trust
      • Owned by volunteers
      • Scalable but mainly for embarrassingly parallel applications with few I/O requirements
      • Enabling data-intensive applications
      • Bridge with Cloud and Grid infrastructures
  • 14. Large Scale Data Management
      BitDew: a Programmable Environment for Large Scale Data Management
      Key Idea 1: provides an API and a runtime environment which integrates several P2P technologies in a consistent way
      Key Idea 2: relies on metadata (Data Attributes) to transparently drive data management operations: replication, fault-tolerance, distribution, placement, life-cycle.
  • 15. BitDew: the Big Cloudy Picture
      • Aggregates storage in a single Data Space:
          • Clients put and get data from the data space
          • Clients define data attributes
      [Figure: a client puts and gets data from the Data Space, with attribute REPLICAT = 3]
  • 17. BitDew: the Big Cloudy Picture
      • Distinguishes service nodes (stable) from client and Worker nodes (volatile)
      • Service: ensures fault tolerance, indexing and scheduling of data to Worker nodes
      • Worker: stores data on Desktop PCs
      • push/pull protocol between client → service ← Worker
      [Figure: the client puts and gets data from the Data Space hosted on Service Nodes; Reservoir Nodes pull data]
  • 18. Data Attributes
      • replica: indicates how many occurrences of data should be available at the same time in the system
      • resilience: controls the resilience of data in presence of machine crash
      • lifetime: a duration, absolute or relative to the existence of other data, which indicates when a datum is obsolete
      • affinity: drives movement of data according to dependency rules
      • transfer protocol: gives the runtime environment hints about the file transfer protocol appropriate to distribute the data
      • distribution: indicates the maximum number of pieces of Data with the same Attribute that should be sent to a particular node
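The attributes listed on this slide can be pictured as a small record that a data scheduler consults. The following is a minimal Python sketch under stated assumptions, not BitDew's actual (Java) API: the class `DataAttributes`, its field names, and the helpers `missing_replicas` and `can_place` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataAttributes:
    """Hypothetical record mirroring the attributes named on the slide."""
    replica: int = 1                      # copies that should exist at the same time
    resilience: bool = False              # re-schedule the data if its host crashes
    lifetime_s: Optional[float] = None    # seconds until obsolete (None = no limit)
    affinity: Optional[str] = None        # id of a datum this one should follow
    protocol: str = "http"                # hint for the file transfer protocol
    distribution: Optional[int] = None    # max data with this attribute per node


def missing_replicas(attr: DataAttributes, live_holders: set) -> int:
    """How many more copies a scheduler would need to create."""
    return max(0, attr.replica - len(live_holders))


def can_place(attr: DataAttributes, per_node_count: dict, node: str) -> bool:
    """Check the 'distribution' cap before sending one more datum to a node."""
    if attr.distribution is None:
        return True
    return per_node_count.get(node, 0) < attr.distribution
```

In this sketch the application only declares the attributes; decisions such as where to create a new replica stay with the scheduler, which is the point of driving data management through metadata.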
  • 20. Architecture Overview
      [Figure: BitDew Runtime Environment, with layers API, Service and Back-ends. Labelled components: Applications, Command-line Tool, Master/Worker, Active Data, BitDew, Transfer Manager, Service Container, Data Catalog, Data Repository, Data Transfer, Data Scheduler, Storage, SQL Server, DHT, File System, Http, Ftp, Bittorrent]
      • Programming APIs to create Data and Attributes, manage file transfers and program applications
      • Services (DC, DR, DT, DS) to store, index, distribute, schedule, transfer and provide resiliency to the Data
      • Several information storage backends and file transfer protocols
  • 23. Examples of BitDew Applications
      • Data-Intensive Application
          • DI-BOT: data-driven master-worker Arabic characters recognition (M. Labidi, University of Sfax)
          • MapReduce vs Hadoop (X. Shi, L. Lu, HUST, Wuhan, China)
      • Data Management Utilities
          • File Sharing for Social Network (N. Kourtellis, Univ. Florida)
          • Distributed Checkpoint Storage (F. Bouabache, Univ. Paris XI)
          • Grid Data Staging (IEP, Chinese Academy of Science)
  • 24. MapReduce for Hybrid Distributed Computing Infrastructures
      Haiwu He (CAS/CSNET), Bing Tang (WUST), Xuanhua Shi and Lu Lu (HUST), Mircea Moca and Gheorghe Silaghi (Univ Babes-Bolyai)
      • Towards MapReduce for Desktop Grid Computing. B. Tang, M. Moca, S. Chevalier, H. He, and G. Fedak. In Fifth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC'10), Fukuoka, Japan, November 2010.
      • Distributed Results Checking for MapReduce on Volunteer Computing. Mircea Moca, Gheorghe Cosmin Silaghi and Gilles Fedak. In 4th Workshop on Desktop Grids and Volunteer Computing Systems (PCGrid 2010), IPDPS'2011, Anchorage, Alaska.
      • Assessing MapReduce for Internet Computing: a Comparison of Hadoop and BitDew-MapReduce. Lu Lu, Hai Jin, Xuanhua Shi and Gilles Fedak. In the 13th ACM/IEEE International Conference on Grid Computing (Grid 2012), Beijing, China, 2012.
      • Data-Intensive Computing on Desktop Grids. H. Lin, W.-C. Feng and G. Fedak. Book chapter in Desktop Grid Computing, CRC Press, 2012.
      • Parallel Data Processing in Dynamic Hybrid Computing Environment Using MapReduce. Bing Tang, Haiwu He and Gilles Fedak. 4th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP'14), LNCS/Springer Verlag, August 24-27, Dalian, China, 2014.
  • 25. What is MapReduce?
      • Programming Model for data-intensive applications
      • Proposed by Google in 2004
      • Simple, inspired by functional programming
          • programmer simply defines Map and Reduce tasks
      • Building block for other parallel programming tools
      • Strong open-source implementation: Hadoop
      • Highly scalable
          • Accommodates large scale clusters: faulty and unreliable resources
      MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. In OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
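To make "the programmer simply defines Map and Reduce tasks" concrete, here is a minimal single-process Python sketch of the programming model on a word-count job. It is only an illustration of the model: the function names `map_words`, `reduce_counts` and `run_mapreduce` are hypothetical, and none of the distribution, scheduling or fault tolerance of Hadoop or BitDew-MapReduce is represented.

```python
from collections import defaultdict


def map_words(line):
    """Map task: emit a (word, 1) pair for every word of an input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce task: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(lines):
    """Sequential stand-in for the runtime: map, group by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_words(line):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())
```

Only `map_words` and `reduce_counts` are application code; everything in `run_mapreduce` is what a MapReduce runtime would take over and parallelize.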
  • 28. Challenge of MapReduce over the Internet
      [Figure: MapReduce dataflow with labels Input Files, Data Split, Mapper, Intermediate Results, Reducer, Output Files (Output1, Output2, Output3), Combine, Final Output]
      • No shared file system nor direct communication between hosts
      • Faults and host churn
      • Result Certification of Intermediate Data
      • Collective Operation (scatter + gather/reduction)
  • 31. Implementing MapReduce over BitDew
      Latency Hiding
      • Multi-threaded worker to overlap communication and computation.
      • The maximum number of concurrent Map and Reduce tasks can be configured, as well as the minimum number of tasks in the queue before computations can start.
      Barrier-free computation
      • Reducers detect duplication of intermediate results (IR), which happens because of faults and/or lags.
      • Early reduction: process IR as they arrive → allowed us to remove the barrier between Map and Reduce tasks.
      • But ... IR are not sorted.
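The barrier-free idea (reduce intermediate results as they arrive, and ignore duplicates caused by faults or lagging workers) can be sketched as follows. This is a hedged Python illustration, not the BitDew-MapReduce code: the class `EarlyReducer` and the identifier scheme (`"map-1"`, `"map-2"`) are hypothetical, and the sketch assumes a reduce function that is commutative and associative (a sum here), which is what makes unsorted, out-of-order arrival harmless.

```python
class EarlyReducer:
    """Folds intermediate results (IR) into running totals as they arrive."""

    def __init__(self):
        self.seen = set()     # ids of IR already reduced
        self.totals = {}      # key -> running sum

    def accept(self, ir_id, pairs):
        """Reduce one IR; return False if it is a duplicate and was skipped."""
        if ir_id in self.seen:
            return False      # same IR delivered twice (re-executed map task)
        self.seen.add(ir_id)
        for key, value in pairs:
            self.totals[key] = self.totals.get(key, 0) + value
        return True
```

Because each arrival is folded in immediately, the reducer never has to wait for every Map task to finish; the price, as the slide notes, is that IR are not sorted, so a reduce function that depends on key order would not fit this scheme.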
  • 33. Scheduling and Fault-tolerance
      2-level scheduling
      1. Data placement is ensured by the BitDew scheduler, which is mainly guided by the data attribute.
      2. Workers periodically report the state of their ongoing computation to the MR-scheduler, which runs on the master node.
      3. The MR-scheduler determines if there are more nodes available than tasks to execute, which can avoid the lagger effect.
      Fault-tolerance
      • In Desktop Grid, computing resources have high failure rates:
          → during the computation (either execution of Map or Reduce tasks)
          → during the communication, that is, file upload and download.
      • MapInput data and ReduceToken token have the resilient attribute enabled.
  • 35. MapReduce Evaluation
      [Figure 6: Scalability evaluation on the WordCount application: the y axis presents the throughput in MB/s and the x axis the number of nodes varying from 1 to 512.]
      Table II: Evaluation of the performance according to the number of Mappers and Reducers.
      #Mappers    4   8  16  32  32  32  32
      #Reducers   1   1   1   1   4   8  16
  • 36. Data Security and Privacy
      Distributed Result Checking
      • Traditional DG or VC projects implement result checking on the server.
      • IR are too large to be sent back to the server → distributed result checking
      • replicates the MapInput and Reducers
      • selects correct results by majority voting
      [Figure 2: Dataflows in the MapReduce implementation: (a) No replication, (b) Replication of the Map tasks, (c) Replication of both Map and Reduce tasks]
      [Paper excerpt shown on the slide:] "[...] distinct workers as input for Map task. Consequently, the reducer receives a set of $r_m$ versions of intermediate files corresponding to a map input $f_i$. After receiving $r_m/2 + 1$ identical versions, the reducer considers the respective result correct and further, accepts it as input for a Reduce task. Figure 2b illustrates the dataflow corresponding to a MapReduce execution where $r_m = 3$. In order to activate the checking mechanism for the results received from reducers, the master node schedules the [...] received intermediate result is present in its own codes list K, ignoring failed results. IV. Evaluation of the Results Checking Mechanism. In this section we describe the model for characterizing the errors produced by the MapReduce algorithm in Volunteer Computing. We assume that workers sabotage independently of one another, thus we do not take into [...]"
      Ensuring Data Privacy
      • Use a hybrid infrastructure: composed of private and public resources
      • Use IDA (Information Dispersal Algorithms) approach to distribute and store securely the data
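Selecting correct results by majority voting among replicated Map outputs can be sketched as below. This is a hedged Python illustration rather than the authors' implementation: the function `accept_result`, the use of SHA-256 digests to compare versions, and the reading of the acceptance threshold as the integer `rm // 2 + 1` (a strict majority of the `rm` replicas) are all assumptions of this sketch.

```python
import hashlib
from collections import Counter


def digest(blob: bytes) -> str:
    """Fingerprint used to compare result versions without a byte-wise diff."""
    return hashlib.sha256(blob).hexdigest()


def accept_result(versions, rm):
    """Return the result backed by a strict majority of the rm replicas,
    or None when no version has reached the threshold yet."""
    threshold = rm // 2 + 1                  # e.g. 2 identical copies when rm = 3
    counts = Counter(digest(v) for v in versions)
    if not counts:
        return None
    best, votes = counts.most_common(1)[0]
    if votes < threshold:
        return None                          # keep waiting for more replicas
    return next(v for v in versions if digest(v) == best)
```

With `rm = 3`, a reducer applying this rule could accept an intermediate result as soon as two identical versions have arrived, without waiting for the third replica.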
  • 37. Active Data: Data Life-Cycle Management
      Anthony Simonet (INRIA), Matei Ripeanu (UCB), Samer Al-Kiswany (UCB)
      • Active Data: A Data-Centric Approach to Data Life-Cycle Management. Anthony Simonet, Gilles Fedak, Matei Ripeanu and Samer Al-Kiswany. 8th Parallel Data Storage Workshop (PDSW'13), Proceedings of SC13 workshops, Denver, November 2013 (position paper, 5 pages).
      • Active Data: A Programming Model for Data Life-Cycle Management on Heterogeneous Systems and Infrastructures. Anthony Simonet, Gilles Fedak and Matei Ripeanu. Technical Report under evaluation.
  • 38. Focus on Data Life-Cycle
      Data Life Cycle: the course of operational stages through which data pass from the time when they enter a system to the time when they leave it.
  • 39. Use Case: The Advanced Photon Source
      • 3 to 5 TB of data per week on this detector
      • Raw data are pre-processed and registered in the Globus Catalog:
          • Data are curated by several applications
          • Data are shared amongst scientific users
      [Figure: workflow with labels Instrument (Beamline), Transfer, Local Storage, Extract, Register Metadata, Metadata Catalog, Transfer, Remote Data Center, Academic Cluster, Analysis, More analysis, Upload result, Register result metadata]
  • 41. Objectives
      We are aiming at:
      • A model to capture the essential life cycle stages and properties: creation, deletion, faults, replication, error checking ...
      • Allowing legacy systems to expose their intrinsic data life cycle.
      • Allowing reasoning about data sets handled by heterogeneous software and infrastructures.
      • Simplifying the programming of applications that implement data life cycle management.
  • 42. Active Data principles
      System programmers expose their system's internal data life cycle with a model based on Petri Nets.
      A life cycle model is made of Places and Transitions.
      [Figure: Petri Net with places Created, Written, Read, Terminated and transitions t1, t2, t3, t4]
      • Each token has a unique identifier, corresponding to the actual data item's.
      • A transition is fired whenever a data state changes.
      • Code may be plugged by clients to transitions. It is executed whenever the transition is fired. Example handler: public void handler() { compressFile(); }
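The three principles (tokens identified by the data item's id, transitions fired on state changes, client code attached to transitions) can be sketched in a few lines. This is a hedged Python illustration and not the Active Data API: the class `LifeCycle`, its methods, and in particular the arc wiring in `ARCS` (t1: Created to Written, t2: Written to Read, t3: Read to Terminated) are assumptions, since the slide only names the places and transitions; t4 is left out because its source and destination cannot be read from the figure.

```python
# Assumed arcs: transition name -> (source place, destination place).
ARCS = {
    "t1": ("Created", "Written"),
    "t2": ("Written", "Read"),
    "t3": ("Read", "Terminated"),
}


class LifeCycle:
    """Petri-Net-like life cycle: one token per data item, handlers on transitions."""

    def __init__(self, arcs):
        self.arcs = arcs
        self.tokens = {}      # data id -> current place
        self.handlers = {}    # transition name -> list of callables

    def create(self, data_id, place="Created"):
        self.tokens[data_id] = place

    def on(self, transition, handler):
        """Plug client code into a transition."""
        self.handlers.setdefault(transition, []).append(handler)

    def fire(self, transition, data_id):
        """Move the item's token along the transition, then run its handlers."""
        src, dst = self.arcs[transition]
        if self.tokens.get(data_id) != src:
            raise ValueError(f"{transition} is not enabled for {data_id}")
        self.tokens[data_id] = dst
        for handler in self.handlers.get(transition, []):
            handler(data_id)
```

A client that wants files compressed once they are written would register its handler with `on("t1", ...)`; the storage system itself only has to call `fire` when the state of a data item changes.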
  • 48. Active Data features
      The Active Data programming model and runtime environment:
      • Allows reacting to life cycle progression
      • Transparently exposes distributed data sets
      • Can be integrated with existing systems
      • Has scalable performance and minimum overhead over existing systems
  • 49. Integration with Data Management Systems
      [Figure 3: Data life cycle models for four data management systems. Panels and their places: (a) BitDew Scheduler: Created, To Place, Placed, Loop, Lost, Deleted; (b) BitDew File Transfer: Created, Ready, Started, Completed, Invalid, Deleted; (c) inotify: IN_CREATE, IN_MOVED_FROM, IN_MOVED_TO, IN_CLOSE_WRITE, CREATED, DELETED; (d) iRODS: Created, Put, Get, Deleted; (e) Globus Online: Created, Succeeded, Failed, Deleted]
      [Paper excerpt shown on the slide:] "[...] structed from its documentation. Reading the source code of BitDew, we observe that data items are managed by instances of the Data class, and this class has the status variable which holds the data item state. Therefore, we simply deduce from the enumeration of the possible value of status the set of corresponding places in the Petri Net (see Figure 2a and 2b). By further analyzing the source code, we construct the model and summarize how high level DLM features are modelized using Active Data model: Scheduling and replication. Part of the complexity of the data life cycle in BitDew comes from the Data Scheduler [...]"
      • BitDew (INRIA), programmable environment for data management.
      • inotify Linux kernel subsystem: notification [...] modification, write, movement and deletion.
      • iRODS (DICE, Univ. North Carolina), rule-oriented data management system.
      • Globus Online (ANL) offers fast, simple and reliable service to transfer large volumes of data.
  • 53. Implementation
      • Prototype implemented in Java (≈ 2,800 LOC)
      • Client/Service communication is Publish/Subscribe
      • 2 types of subscription:
          • Every transition for a given data item
          • Every data item for a given transition
      • [Figure: clients subscribe to the Active Data Service]
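The two subscription types can be sketched as two lookup tables inside a publish/subscribe service. This is only an illustration under assumptions: ActiveDataServiceSketch and its method names are hypothetical and do not come from the Active Data prototype, and everything runs in one process rather than over a network.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical in-process stand-in for the service; clients are callbacks (dataId, transition). */
final class ActiveDataServiceSketch {
    // Subscription type 1: every transition for a given data item.
    private final Map<String, List<BiConsumer<String, String>>> byDataItem = new HashMap<>();
    // Subscription type 2: every data item for a given transition.
    private final Map<String, List<BiConsumer<String, String>>> byTransition = new HashMap<>();

    void subscribeToDataItem(String dataId, BiConsumer<String, String> client) {
        byDataItem.computeIfAbsent(dataId, k -> new ArrayList<>()).add(client);
    }

    void subscribeToTransition(String transition, BiConsumer<String, String> client) {
        byTransition.computeIfAbsent(transition, k -> new ArrayList<>()).add(client);
    }

    /** Notifies the clients whose subscription matches the published transition. */
    void publish(String dataId, String transition) {
        List<BiConsumer<String, String>> itemClients =
                byDataItem.getOrDefault(dataId, Collections.emptyList());
        for (BiConsumer<String, String> client : itemClients) {
            client.accept(dataId, transition);
        }
        List<BiConsumer<String, String>> transitionClients =
                byTransition.getOrDefault(transition, Collections.emptyList());
        for (BiConsumer<String, String> client : transitionClients) {
            client.accept(dataId, transition);
        }
    }
}

class SubscriptionDemo {
    public static void main(String[] args) {
        ActiveDataServiceSketch service = new ActiveDataServiceSketch();
        service.subscribeToDataItem("file-42",
                (id, t) -> System.out.println("file-42 watcher saw " + t));
        service.subscribeToTransition("t2",
                (id, t) -> System.out.println("t2 watcher saw data item " + id));
        service.publish("file-42", "t1"); // matches the data item subscription only
        service.publish("file-7", "t2");  // matches the transition subscription only
    }
}
```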
  • 54. Implementation
      • Several ways to publish transitions:
          • Instrument the code
          • Read the logs
          • Rely on an existing notification system
      • The service orders transitions by time of arrival
      • [Figure: transitions are published to the Active Data Service; clients subscribe to it]
  • 56. Implementation
      • Clients run transition handler code locally
      • Transition handlers are executed:
          • Serially
          • In a blocking way
          • In the order transitions were published
      • [Figure: transitions are published to the Active Data Service, which notifies subscribed clients]
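One way to obtain serial, in-order handler execution on the client side is a single worker thread fed by a FIFO queue. The sketch below assumes that approach; it uses the standard java.util.concurrent single-thread executor, and the SerialHandlerClient class is hypothetical rather than taken from the prototype.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical client: handlers run one at a time, in the order notifications arrive. */
final class SerialHandlerClient {
    // A single worker thread executes queued tasks sequentially, in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Called once per notification; the next handler starts only after this one returns. */
    void onNotification(Runnable handler) {
        worker.execute(handler);
    }

    /** Waits for all queued handlers to finish, then stops the worker. */
    void close() {
        worker.shutdown();
        try {
            if (!worker.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("handlers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}

class SerialHandlersDemo {
    static List<String> runDemo() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        SerialHandlerClient client = new SerialHandlerClient();
        for (String transition : new String[] {"t1", "t2", "t3"}) {
            client.onNotification(() -> log.add("handled " + transition));
        }
        client.close();
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

The trade-off of this design is that a slow handler delays every later one, which matches the blocking behaviour stated on the slide.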
  • 57. Data Surveillance Framework for APS
      Anthony Simonet (INRIA), Kyle Chard (ANL), Ian Foster (ANL/UC)
      • A. Simonet, K. Chard, G. Fedak, I. Foster. Active Data to Provide Smart Data Surveillance to E-Science Users. In Proceedings of EuromicroPDP'15, Turku, Finland, March 4-6, 2015.
  • 58. Problems with APS
      • [Figure: APS workflow. Labels: Globus Catalog, Globus, Detector, Local Storage, Compute Cluster. Steps: 1. Local Transfer; 2. Extract Metadata; 3. Globus Transfer; 4. Swift Parallel Analysis]
      What is inefficient in this workflow?
      • Many error-prone tasks are performed manually
      • Users can't monitor the whole process at once
      • Small failures are difficult to detect
      • A system alone can't recover from failures caused outside its scope
  • 59. Data Surveillance Framework
      4 goals (that would otherwise require a lot of scripting and hacking):
      • Monitoring Data Set Progress
      • Better Automation
      • Sharing and Notification
      • Error Discovery and Recovery
  • 61. Active Data Advanced Features
      • Associate Tags to Data
      • Install Taggers on Transitions
      • Guarded Transitions: execute only on tokens that have specific tags
      • Notification Handlers: Push.co, Twitter, gdoc, ifttt, etc.
      Paper excerpt shown on the slide:
      A. Progress Monitoring. Scientists require mechanisms to monitor their workflows from a high level, generate reports on progress, and identify potential errors without painstakingly auditing every dataset. Monitoring is not limited to estimating completion time, but also: i) receiving a single relevant notification when several related events occurred in different systems; ii) quickly noticing that an operation failed within the mass of operations that completed normally; iii) identifying steps that take longer to run than usual, backtracking the chain of causality, fixing the problem at runtime and optimizing the workflow for future executions; iv) accelerating data sharing with the community by pushing notifications to collaborators and colleagues.
      B. Automation. The APS workflow, like many scientific workflows, requires explicit human intervention to progress between stages and to recover from unexpected events. Such interventions include running scripts on generated datasets on the shared cluster, registering datasets in the Globus catalog, and executing Swift analysis scripts on the compute cluster. Such interventions cannot be easily integrated in a traditional workflow system, because they reside a level of abstraction above the workflow system. In fact, they are the operations that start the workflow systems.
      C. Sharing and Notification. Scientific sharing can be made more efficient by allowing other scientists to be notified of the availability of new datasets (in the catalog), with powerful filters to extract only the notifications they need, and even to start processes as soon as files are available. We believe the best way for scientists to automatically integrate new datasets in their workflows is to rely on widely used dissemination mechanisms, such as [...]
      [Fig. 2: Data surveillance framework design. Labels: File, Dataset, File transfer, Metadata, Life Cycle View, Guard, Code Execution, Tagged Tokens, Notification]
      IV. SYSTEM DESIGN. We next present the data surveillance framework that we designed to satisfy the APS users' needs presented above. We also elaborate on the design of specific features.
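Tags and guarded transitions can be sketched as a filter placed in front of a handler: the action runs only for tokens carrying the required tags. The names TaggedToken and GuardedHandler, and the tag "aps", are hypothetical and chosen for illustration; the actual Active Data interfaces are not shown in the slides.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

/** A token standing for one data item, carrying free-form tags. */
final class TaggedToken {
    final String id;
    final Set<String> tags = new HashSet<>();

    TaggedToken(String id, String... tags) {
        this.id = id;
        this.tags.addAll(Arrays.asList(tags));
    }
}

/** Handler wrapped in a guard: the action runs only when the token has every required tag. */
final class GuardedHandler {
    private final Set<String> requiredTags;
    private final Consumer<TaggedToken> action;

    GuardedHandler(Set<String> requiredTags, Consumer<TaggedToken> action) {
        this.requiredTags = new HashSet<>(requiredTags);
        this.action = action;
    }

    /** Returns true if the guard let the token through and the action ran. */
    boolean onTransition(TaggedToken token) {
        if (!token.tags.containsAll(requiredTags)) {
            return false;
        }
        action.accept(token);
        return true;
    }
}

class GuardDemo {
    public static void main(String[] args) {
        GuardedHandler notifier = new GuardedHandler(
                new HashSet<>(Arrays.asList("aps")),
                token -> System.out.println("notify collaborators about " + token.id));
        // In the framework a tagger would add tags on a transition; here they are set by hand.
        boolean first = notifier.onTransition(new TaggedToken("dataset-1", "aps"));
        boolean second = notifier.onTransition(new TaggedToken("dataset-2"));
        System.out.println("dataset-1 passed guard: " + first);
        System.out.println("dataset-2 passed guard: " + second);
    }
}
```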
  • 64. APS Data Life Cycle Model
      • Use case: APS data life cycle model
      • [Figure: data life cycle model composed of 6 systems. Labels recoverable from the figure include Detector, Globus transfer, Shared storage, Globus Catalog, Extract and Swift; places include Created, Failed, Succeeded and Terminated; transitions include Start transfer, End transfer, Failure, Success, Update, Initialize, Set, Derive, Remove and End.]
  • 66. Example: Data Provenance
      Definition: complete history of data derivations and operations
      • Assess dataset quality
      • Records the context of data acquisition and transformation
      • PASS: Provenance Aware Storage Systems
      → What about heterogeneous systems?
      Example with Globus Online and iRODS:
      • Globus Online: file transfer service
      • iRODS: data store and metadata catalog
  • 71. Provenance Scenario with Active Data
      • Data events coming from Globus Online and iRODS
      • Token Id: {GO: 7b9e02c4-925d-11e2}
      • [Figure: two life cycle models. Globus Online: places Created, Succeeded, Failed, Terminated; transitions t1 to t4. iRODS: places Created, Get, Put, Terminated; transitions t5 to t10.]
  • 72. Provenance Scenario with Active Data
      • Data events coming from Globus Online and iRODS
      • Handler plugged onto a transition: public void handler() { iput(...); }
      • [Figure: same Globus Online and iRODS life cycle models as on the previous slide]
  • 73. Provenance Scenario with Active Data
      • Data events coming from Globus Online and iRODS
      • Token Id: {GO: 7b9e02c4-925d-11e2, iRODS: 10032}
      • [Figure: same Globus Online and iRODS life cycle models as on the previous slide]
  • 74. Provenance Scenario with Active Data
      • Data events coming from Globus Online and iRODS
      • Handler plugged onto a transition: public void handler() { annotate(); }
      • [Figure: same Globus Online and iRODS life cycle models as on the previous slide]
  • 75. iRODS Provenance Result
      $ imeta ls -d test/out_test_4628
      AVUs defined for dataObj test/out_test_4628:
      attribute: GO_FAULTS
      value: 0
      ----
      attribute: GO_COMPLETION_TIME
      value: 2013-03-21 19:28:41Z
      ----
      attribute: GO_REQUEST_TIME
      value: 2013-03-21 19:28:17Z
      ----
      attribute: GO_TASK_ID
      value: 7b9e02c4-925d-11e2-97ce-123139404f2e
      ----
      attribute: GO_SOURCE
      value: go#ep1/~/test
      ----
      attribute: GO_DESTINATION
      value: asimonet#fraise/~/out_test_4628
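The annotation step that produces such a listing can be sketched as a mapping from a transfer record to attribute/value pairs. The attribute names and sample values are copied from the imeta output above; the TransferRecord and ProvenanceSketch classes are hypothetical, and the sketch makes no Globus Online or iRODS call, it only builds the pairs an annotate() handler might attach.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical record of one finished Globus Online transfer. */
final class TransferRecord {
    final String taskId;
    final String source;
    final String destination;
    final String requestTime;
    final String completionTime;
    final int faults;

    TransferRecord(String taskId, String source, String destination,
                   String requestTime, String completionTime, int faults) {
        this.taskId = taskId;
        this.source = source;
        this.destination = destination;
        this.requestTime = requestTime;
        this.completionTime = completionTime;
        this.faults = faults;
    }
}

final class ProvenanceSketch {
    /** Builds the attribute/value pairs to attach to the iRODS data object. */
    static Map<String, String> toAttributes(TransferRecord r) {
        Map<String, String> avus = new LinkedHashMap<>(); // keeps insertion order for printing
        avus.put("GO_FAULTS", Integer.toString(r.faults));
        avus.put("GO_COMPLETION_TIME", r.completionTime);
        avus.put("GO_REQUEST_TIME", r.requestTime);
        avus.put("GO_TASK_ID", r.taskId);
        avus.put("GO_SOURCE", r.source);
        avus.put("GO_DESTINATION", r.destination);
        return avus;
    }

    public static void main(String[] args) {
        TransferRecord record = new TransferRecord(
                "7b9e02c4-925d-11e2-97ce-123139404f2e",
                "go#ep1/~/test",
                "asimonet#fraise/~/out_test_4628",
                "2013-03-21 19:28:17Z",
                "2013-03-21 19:28:41Z",
                0);
        toAttributes(record).forEach((name, value) ->
                System.out.println("attribute: " + name + " value: " + value));
    }
}
```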
  • 76. Conclusion
      We proposed several approaches to handle Big Data on hybrid distributed computing infrastructures: data management, data processing and programming model.
      Next:
      • Data Collection, Data Stream
      • Incentive systems for collaborative data science
      Want to learn more?
      • Book on Desktop Grid Computing, ed. C. Cerin and G. Fedak, CRC Press
      • Home page for the Big Data class: http://graal.ens-lyon.fr/~gfedak/pmwiki-lri/pmwiki.php/Main/MarpReduceClass
      • Our websites: http://www.bitdew.net and http://www.xtremweb-hep.org