Active Data is a data-centric approach to data life-cycle management that uses a Petri net-based model to represent data states and transitions between systems. It exposes distributed data sets and allows clients to react to life cycle events in a scalable way. A prototype implemented the publish-subscribe model and demonstrated handling over 30,000 transitions per second. Active Data provides advantages like formal verification and fault tolerance but requires more work to standardize and represent complex data operations.
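The Petri net idea above can be sketched in a few lines: data items are tokens in places (life-cycle states), transitions move tokens between places, and subscribers are notified when a transition fires. This is an illustrative toy, not the actual Active Data API; all names are invented for the example.

```python
from collections import defaultdict

class DataLifeCycle:
    """Toy Petri-net life-cycle model in the spirit of Active Data
    (the class and method names are illustrative, not the real library)."""
    def __init__(self, places, transitions):
        # transitions: name -> (source place, target place)
        self.places = set(places)
        self.transitions = dict(transitions)
        self.subscribers = defaultdict(list)   # transition name -> callbacks
        self.tokens = defaultdict(set)         # place -> ids of data items

    def create(self, item, place):
        self.tokens[place].add(item)

    def subscribe(self, transition, callback):
        self.subscribers[transition].append(callback)

    def fire(self, transition, item):
        src, dst = self.transitions[transition]
        if item not in self.tokens[src]:
            raise ValueError(f"{item} is not in place {src!r}")
        self.tokens[src].remove(item)          # consume token from source place
        self.tokens[dst].add(item)             # produce token in target place
        for cb in self.subscribers[transition]:
            cb(item, transition)               # publish the life-cycle event

log = []
lc = DataLifeCycle(
    places=["created", "replicated", "archived"],
    transitions={"replicate": ("created", "replicated"),
                 "archive": ("replicated", "archived")})
lc.subscribe("replicate", lambda item, t: log.append((t, item)))
lc.create("dataset-42", "created")
lc.fire("replicate", "dataset-42")
```

The point of the formal model is that the token game makes invalid sequences (e.g. archiving a file that was never replicated) detectable by construction: `fire` refuses a transition whose source place does not hold the item.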
This document describes two case studies of health and status monitoring systems used to monitor large, complex datasets and detect anomalies. In the first case study, a system monitored thousands of servers in a data center and detected dead or slow nodes that reduced overall performance. The second case study monitored billions of payment card transactions and developed over 15,500 statistical models to detect data quality issues and interoperability problems, improving approval rates and saving millions. Both cases highlighted the importance of executive support, dashboards, governance programs, and developing numerous statistical models tailored to different data segments.
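The "many small statistical models per data segment" strategy from the second case study can be illustrated with the simplest possible model, a per-segment mean and standard deviation with a z-score threshold. The segment names and threshold below are invented for the sketch; the case study's actual models are not described at this level of detail.

```python
import statistics

def fit_models(history):
    """Fit one simple (mean, stdev) model per data segment.
    Illustrative stand-in for the thousands of per-segment models
    described in the case study."""
    return {seg: (statistics.mean(vals), statistics.stdev(vals))
            for seg, vals in history.items()}

def is_anomalous(models, segment, value, z_threshold=3.0):
    """Flag a new observation whose z-score exceeds the threshold
    for its own segment's model."""
    mean, stdev = models[segment]
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > z_threshold

# Hypothetical daily approval rates, segmented by region.
history = {"approvals/US": [0.91, 0.92, 0.90, 0.93, 0.91],
           "approvals/EU": [0.85, 0.86, 0.84, 0.85, 0.86]}
models = fit_models(history)
```

A sudden drop in one segment's approval rate trips only that segment's model, which is why segment-tailored models catch issues a single global model would average away.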
The Open Science Data Cloud (OSDC) is a non-profit consortium that manages cloud computing infrastructure to support scientific research. It operates multiple clouds with thousands of nodes and petabytes of storage across four data centers. The OSDC supports various scientific projects involving astronomical, biological, and networking data as well as image processing. Its goals are to provide open, interoperable infrastructure at scale to enable large-scale scientific experiments and to preserve research data long-term like libraries preserve books. The OSDC is working on various technical challenges around data migration, virtual networking standards, and finding sustainable business models.
4 TeraGrid Sites Have Focal Points:
- SDSC – The Data Place: large-scale, high-performance data analysis and handling; every cluster node is directly attached to the SAN.
- NCSA – The Compute Place: large-scale, high-FLOPS computation.
- Argonne – The Viz Place: scalable viz walls.
- Caltech – The Applications Place: data and FLOPS for applications, especially some of the GriPhyN apps.
Specific machine configurations reflect these focal points.
Capturing Interactive Data Transformation Operations using Provenance Workflows – Andre Freitas
The ready availability of data creates increasing opportunities to re-use it in new applications and analyses. Most of this data is not in the format users want, is usually heterogeneous and highly dynamic, and therefore requires data transformation effort before it can be re-purposed. Interactive data transformation (IDT) tools are becoming widely available and lower these barriers. This paper describes a principled way to capture the data lineage of interactive data transformation processes. We provide a formal model of IDT, its mapping to a provenance representation, and its implementation and validation on Google Refine. Recording the sequence of data transformation operations allows assessment of data quality and enables portability between IDT and other data transformation platforms. The proposed model showed a high level of coverage against a set of requirements used for evaluating provenance management systems.
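The core mechanism, wrapping each interactive operation so its application is recorded as a provenance entry, can be sketched as below. The entry schema (operation name, parameters, input/output sizes, timestamp) is a simplified assumption for illustration, not the paper's provenance model.

```python
import datetime

class ProvenanceLog:
    """Record each interactive transformation as a provenance entry,
    in the spirit of mapping IDT operations to a provenance representation.
    The entry schema here is an illustrative simplification."""
    def __init__(self):
        self.entries = []

    def apply(self, data, operation, name, **params):
        result = operation(data, **params)
        self.entries.append({
            "operation": name,
            "params": params,
            "input_size": len(data),
            "output_size": len(result),
            "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        })
        return result

# Two toy transformations of the kind an IDT tool applies to tabular data.
def trim_cells(rows):
    return [[c.strip() for c in row] for row in rows]

def drop_empty_rows(rows):
    return [row for row in rows if any(c for c in row)]

prov = ProvenanceLog()
rows = [[" a ", "b"], ["", ""], ["c", " d "]]
rows = prov.apply(rows, trim_cells, "trim_cells")
rows = prov.apply(rows, drop_empty_rows, "drop_empty_rows")
```

Because the log captures the full operation sequence, another transformation platform could replay it, which is the portability property the abstract claims.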
The document discusses big data visualization and visual analysis, focusing on the challenges and opportunities. It begins with an overview of visualization and then discusses several challenges in big data visualization, including integrating heterogeneous data from different sources and scales, dealing with data and task complexity, limited interaction capabilities for large data, scalability for both data and users, and the need for domain and development libraries/tools. It then provides examples of visualizing taxi GPS data and traffic patterns in Beijing to identify traffic jams.
This document discusses the concept of a Science DMZ, which consists of three key components: 1) a dedicated "friction-free" network path with high-performance networking devices located near the site perimeter to facilitate science data transfer, 2) dedicated high-performance data transfer nodes optimized for data transfer tools, and 3) a performance measurement/test node. It contrasts this approach with the typical ad-hoc deployment of a data transfer node wherever space allows, which often fails to provide necessary performance. Details of an example Science DMZ deployment at Lawrence Berkeley National Laboratory are provided.
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Computing Infrastructures
Simon Delamare, Gilles Fedak, Derrick Kondo, Oleg Lodygensky
High-Performance Parallel and Distributed Computing, 2012
Big Data, Beyond the Data Center
Increasingly, the next scientific discoveries and the next industrial breakthroughs will depend on the capacity to extract knowledge and sense from gigantic amounts of information. Examples range from processing data produced by scientific instruments such as CERN’s LHC; collecting data from large-scale sensor networks; grabbing, indexing and nearly instantaneously mining and searching the Web; building and traversing billion-edge social network graphs; to anticipating market and customer trends through multiple channels of information. Collecting information from various sources, recognizing patterns and distilling insights constitutes what is called the Big Data challenge. However, as the volume of data grows exponentially, managing that data becomes proportionally more complex. A key challenge is to handle the complexity of data management on hybrid distributed infrastructures, i.e. assemblages of Cloud, Grid or Desktop Grid resources. In this talk, I will give an overview of our work in this research area, starting with BitDew, a middleware for large-scale data management on Clouds and Desktop Grids. Then I will present our approach to enabling MapReduce on Desktop Grids. Finally, I will present our latest results around Active Data, a programming model for managing the data life cycle on heterogeneous systems and infrastructures.
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures
The Big Data challenge consists of managing, storing, analyzing and visualizing these huge and ever-growing data sets to extract sense and knowledge. As the volume of data grows exponentially, managing that data becomes proportionally more complex.
A key point is to handle the complexity of the 'Data Life Cycle', i.e. the various operations performed on data: transfer, archiving, replication, deletion, etc. Indeed, data-intensive applications span a large variety of devices and e-infrastructures, which implies that many systems are involved in data management and processing.
''Active Data'' is a new approach to automating and improving the expressiveness of data management applications. It consists of:
* a 'formal model' of the Data Life Cycle, based on Petri nets, that makes it possible to describe and expose data life cycles across heterogeneous systems and infrastructures;
* a 'programming model' that allows code execution at each stage of the data life cycle: routines provided by programmers are executed when a set of events (creation, replication, transfer, deletion) happens to any data.
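The programming model described above, user routines triggered by life-cycle events on any data, can be sketched as a small event dispatcher. The API below is hypothetical, not the actual Active Data implementation.

```python
class ActiveDataClient:
    """Minimal sketch of the programming model: routines registered by the
    programmer run whenever a life-cycle event occurs on any data item.
    (Hypothetical API, not the real Active Data library.)"""
    EVENTS = {"creation", "replication", "transfer", "deletion"}

    def __init__(self):
        self.handlers = {e: [] for e in self.EVENTS}

    def on(self, event, routine):
        if event not in self.EVENTS:
            raise ValueError(f"unknown life-cycle event: {event}")
        self.handlers[event].append(routine)

    def notify(self, event, data_id, **info):
        # In the real system this would be driven by the infrastructure;
        # here we invoke handlers directly for illustration.
        for routine in self.handlers[event]:
            routine(data_id, **info)

client = ActiveDataClient()
seen = []
# Example routine: record every replication and the site it happened on.
client.on("replication",
          lambda data_id, **info: seen.append((data_id, info.get("site"))))
client.notify("replication", "file-001", site="grid5000")
```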
Dr. Frank Würthwein of the University of California, San Diego, presented at the International Supercomputing Conference on Big Data, 2013, US. Until recently, the large CERN experiments, ATLAS and CMS, owned and controlled the computing infrastructure they operated on in the US, and accessed data only when it was locally available on the hardware they operated. However, Würthwein explains, with data-taking rates set to increase dramatically by the end of LS1 in 2015, the current operational model is no longer viable for satisfying peak processing needs. Instead, he argues, large-scale processing centers need to be created dynamically to cope with spikes in demand. To this end, Würthwein and colleagues carried out a successful proof-of-concept study in which the Gordon Supercomputer at the San Diego Supercomputer Center was dynamically and seamlessly integrated into the CMS production system to process a 125-terabyte data set.
Science Gateways: one portal, many e-Infrastructures and related services – riround
Riccardo Rotondo from the Consortium GARR presents on science gateways and the Catania Science Gateway Framework. The framework provides a single portal for users to access many e-infrastructures and related services through standards-based science gateways. It utilizes federated authentication, a job management system, modular e-infrastructure and data services, and cloud integration. The goal is to increase accessibility and reuse of applications across infrastructures. Training materials and an open source codebase are also provided to encourage community involvement.
Slides for my Associate Professor (oavlönad docent) lecture.
The lecture is about Data Streaming (its evolution and basic concepts) and also contains an overview of my research.
Stefano Nativi presents the RDA.
Workshop title: Organising high-quality research data management services
Workshop abstract:
Open science needs high-quality data management, where researchers can create, use and share data according to well-defined standards and practices; this is one of the pillars of Open Science. In the data management landscape we find quite a few organisations that aim to achieve this; however, to get it right, a collaboration is called for in which all can play a suitable role and present this in a consistent way to the researcher.
The proposed workshop brings together representatives of standards organisations (RDA), e-Infrastructures (EUDAT) and libraries (LIBER) that together can organise high-quality data management for research.
DAY 1 - PARALLEL SESSION 2
http://opensciencefair.eu/workshops/organising-high-quality-research-data-management-services
The document discusses the development of a Hydrologic Community Modeling System (HCMS) using a workflow engine. It describes introducing various libraries into the HCMS, including a Data Access Library, Data Processing Library, Hydrologic Model Library, and Post-Analysis & Utilities Library. These libraries would allow for accessing and processing data from various sources, running hydrologic models in a modular way, and analyzing model results. The document also discusses using the TRIDENT workflow engine to facilitate composing hydrologic models and workflows with swappable components in a flexible manner.
Working with the Mainstay team, the Cisco IoT Manufacturing Marketing team combined research from manufacturing trade associations, management consulting research and an internal benchmarking project to create an Executive Briefing Presentation to educate CxOs on the opportunities IoT can provide. This content was also repurposed into a manufacturing IoT whitepaper, an asset to entice prospective customers to consider Cisco’s IoT offerings.
Get Cloud Resources to the IoT Edge with Fog Computing – Biren Gandhi
Fog Computing as a foundational architectural concept for Internet of Things (IoT) and Internet of Everything (IoE).
Embedded devices in the IoT are hampered by the compute, storage, and service limitations of living life on the edge. As IoT edge devices comprise broader sensor networks for industrial automation, transportation, and other safety-critical applications, their high uptime requirements are non-negotiable and service latencies must be kept within real-time or near-real-time parameters. However, the size, weight, power, and cost constraints of edge platforms also inhibit the on-device resources available for executing such functions. In this session, Gandhi will introduce Fog Computing, a new paradigm for the IoT that extends compute, storage, and application resources from the cloud to the network edge. Beyond the interplay between Fog and Cloud, Gandhi will show how Fog services can be leveraged across a range of heterogeneous platforms – from end-user devices and access points to edge routers and switches – through software technology that facilitates the collection, storage, analysis, and fusion of data to drive success in your next IoT device deployment.
Cari presentation maurice-tchoupe-joskelngoufo – Mokhtar SELLAMI
A publish/subscribe approach for implementing GAG’s distributed collaborative business processes with high data availability
Maurice Tchoupé Tchendji and Joskel Ngoufo Tagueu
The document discusses grid computing and the speaker's background in the topic. It provides key takeaways about understanding the evolution of technologies like grid computing and envisioning upcoming trends. It then discusses what a grid is, including early definitions, and elements of grid computing like resource sharing, coordinated problem solving, and dynamic virtual organizations. The document also outlines attributes of grid computing related to virtualization, dynamic provisioning, resource pooling, and self-adaptive software. It provides examples of how grids are used and lists common grid components.
OGF Standards Overview – Globus World 2013 – Alan Sill
The document discusses the Open Grid Forum (OGF), an organization that develops standards for distributed computing including cloud and grid computing. It provides an overview of OGF's history, standards, and engagement with other standards bodies. Key points include:
1) OGF has developed many standards that form the basis for major science and business distributed computing, including standards for resource management, data transfer, and service agreements.
2) OGF actively collaborates with other standards organizations through cooperative agreements and events like the Cloud Plugfest to promote interoperability.
3) OGF standards see significant adoption and are used in large-scale infrastructures worldwide, such as the Worldwide LHC Computing Grid, which utilizes over 450,000 CPUs.
MT85 Challenges at the Edge: Dell Edge Gateways – Dell EMC World
- Dell Technologies presented their IoT infrastructure portfolio, which spans from the edge to the core to the cloud. This includes embedded gateways, on-premise appliances, data center infrastructure, and cloud and application integration capabilities.
- Falling sensor costs, power efficiencies, ubiquity of mobility, growth of cloud computing, and other modern technologies are fueling more IoT solutions. However, enterprises face challenges like security, data volume/analytics complexity, and lack of standards.
- Dell's portfolio is designed to address these challenges and unlock IoT potential. Their edge gateways feature diverse connectivity, data protocol support, security, and manageability. This infrastructure, combined with partners, allows customers to gather and analyze data.
Linking HPC to Data Management – EUDAT Summer School (Giuseppe Fiameni, CINECA) – EUDAT
EUDAT and PRACE joined forces to help research communities gain access to high-quality managed e-Infrastructures whose resources can be connected together to enable cross-utilization use cases and made accessible without any technical barrier. The capability to couple data and compute resources is considered one of the key factors for accelerating scientific innovation and advancing research frontiers. The goal of this session was to present the EUDAT services and the results of the collaboration activity achieved so far, and to deliver a hands-on session on how to write a Data Management Plan, or DMP. The DMP is a useful instrument for researchers to reflect on and communicate the way they will deal with their data. It prompts them to think about how they will generate, analyse and share data during their research project and afterwards.
Visit: https://www.eudat.eu/eudat-summer-school
Advanced Analytics and Machine Learning with Data Virtualization – Denodo
Watch full webinar here: https://bit.ly/32c6TnG
Advanced data science techniques, like machine learning, have proven an extremely useful tool for deriving valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python and Scala, put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
- How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
- How you can use the Denodo Platform with large data volumes in an efficient way
- About the success McCormick has had as a result of seasoning the Machine Learning and Blockchain Landscape with data virtualization
The document discusses data streaming in IoT and big data analytics. It begins with an introduction to data streaming and the need for streaming techniques due to the complexity of analyzing large volumes of IoT data. It then covers the data streaming processing paradigm, including continuous queries, stateless and stateful operators, and windows. Challenges and research questions in data streaming are also discussed, such as distributed deployment, parallelism, and fault tolerance. The document concludes that data streaming is well-suited for real-time analysis of IoT data due to its ability to perform online, parallel and distributed processing.
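The stateful-operator-with-windows idea summarized above can be illustrated with a tumbling-window average: the operator keeps state (the current window's buffer) and emits one aggregate per completed window. This is a toy sketch, not any particular streaming engine's API.

```python
class TumblingWindowAverage:
    """Stateful streaming operator: emits the average of each fixed-size
    (tumbling) window of readings. Illustrative sketch of the windowed
    processing paradigm, not a real stream-processing engine."""
    def __init__(self, size):
        self.size = size
        self.buffer = []          # operator state carried across elements

    def process(self, value):
        """Feed one element; return the window average when the window
        closes, otherwise None (no output for this element)."""
        self.buffer.append(value)
        if len(self.buffer) == self.size:
            avg = sum(self.buffer) / self.size
            self.buffer = []      # tumbling window: state resets
            return avg
        return None

op = TumblingWindowAverage(size=3)
stream = [1, 2, 3, 10, 20, 30]
outputs = [r for r in map(op.process, stream) if r is not None]
```

A stateless operator (e.g. a filter) would need no `buffer` at all, which is exactly the stateless/stateful distinction the summary draws; windows are what make aggregates computable over an unbounded stream.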
MAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdf – Gary Mazzaferro
This document outlines deliverable suggestions for working groups on a general reference architecture for big data. It includes definitions of key terms, requirements, reference architectures, capabilities and gaps, and an orphaned stakeholder taxonomy. The working groups are tasked with defining requirements, developing reference architectures, identifying capabilities and gaps, and classifying stakeholders.
Watch full webinar here: https://bit.ly/2Y0vudM
What is Data Virtualization, and why do I care? In this webinar we intend to help you understand not only what Data Virtualization is, but also why it is a critical component of any organization's data fabric and how it fits. Data virtualization liberates and empowers your business users via data discovery and data wrangling, through to the generation of reusable reporting objects and data services. Digital transformation demands that we empower all consumers of data within the organization, and it demands agility too. Data Virtualization gives you meaningful access to information that can be shared by a myriad of consumers.
Register to attend this session to learn:
- What is Data Virtualization?
- Why do I need Data Virtualization in my organization?
- How do I implement Data Virtualization in my enterprise?
Which Computing Infrastructure for the Decentralized World? – Gilles Fedak
This document discusses iExec, a company that provides a decentralized computing infrastructure using blockchain technology. Key points:
- iExec operates a marketplace that allows providers to offer computing resources like servers, applications, and datasets, and users to access these resources using the RLC token.
- The infrastructure supports a variety of applications including blockchain DApps, HPC, and emerging distributed applications in areas like IoT, edge computing, and machine learning.
- iExec is developing technologies like trusted execution environments and sidechains to improve privacy, performance, and scalability of the decentralized computing platform.
iExec allows for scalable, efficient, and virtualized off-chain execution of arbitrary applications. It provides a global marketplace and uses Ethereum to advertise and provision cloud computing resources like servers, applications, and datasets from various providers in a decentralized peer-to-peer manner without a central authority. iExec's RLC token is used to access this decentralized cloud and incentivizes providers who are paid in RLC for their resources. The iExec software development kit allows decentralized applications to access these distributed cloud computing resources for off-chain computation and data needs.
- Falling sensor costs, power efficiencies, ubiquity of mobility, growth of cloud computing, and other modern technologies are fueling more IoT solutions. However, enterprises face challenges like security, data volume/analytics complexity, and lack of standards.
- Dell's portfolio is designed to address these challenges and unlock IoT potential. Their edge gateways feature diverse connectivity, data protocol support, security, and manageability. This infrastructure combined with partners allows customers to gather and analyze data and
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)EUDAT
EUDAT and PRACE joined forces to help research communities gain access to high quality managed e-Infrastructures whose resources can be connected together to enable cross-utilization use cases and make them accessible without any technical barrier. The capability to couple data and compute resources together is considered one of the key factors to accelerate scientific innovation and advance research frontiers. The goal of this session was to present the EUDAT services, the results of the collaboration activity achieved so far and delivers a hands-on on how to write a Data Management Plan or DMP. The DMP is a useful instrument for researchers to reflect on and communicate about the way they will deal with their data. It prompts them to think about how they will generate, analyse and share data during their research project and afterwards.
Visit: https://www.eudat.eu/eudat-summer-school
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
Watch full webinar here: https://bit.ly/32c6TnG
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python and Scala put advanced techniques at the fingertips of the data scientists. However, these data scientists spent most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
- How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
- How you can use the Denodo Platform with large data volumes in an efficient way
- About the success McCormick has had as a result of seasoning the Machine Learning and Blockchain Landscape with data virtualization
The document discusses data streaming in IoT and big data analytics. It begins with an introduction to data streaming and the need for streaming techniques due to the complexity of analyzing large volumes of IoT data. It then covers the data streaming processing paradigm, including continuous queries, stateless and stateful operators, and windows. Challenges and research questions in data streaming are also discussed, such as distributed deployment, parallelism, and fault tolerance. The document concludes that data streaming is well-suited for real-time analysis of IoT data due to its ability to perform online, parallel and distributed processing.
MAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdfGary Mazzaferro
This document outlines deliverable suggestions for working groups on a general reference architecture for big data. It includes definitions of key terms, requirements, reference architectures, capabilities and gaps, and an orphaned stakeholder taxonomy. The working groups are tasked with defining requirements, developing reference architectures, identifying capabilities and gaps, and classifying stakeholders.
Watch full webinar here: https://bit.ly/2Y0vudM
What is Data Virtualization and why do I care? In this webinar we intend to help you understand not only what Data Virtualization is but why it's a critical component of any organization's data fabric and how it fits. How data virtualization liberates and empowers your business users via data discovery, data wrangling to generation of reusable reporting objects and data services. Digital transformation demands that we empower all consumers of data within the organization, it also demands agility too. Data Virtualization gives you meaningful access to information that can be shared by a myriad of consumers.
Register to attend this session to learn:
- What is Data Virtualization?
- Why do I need Data Virtualization in my organization?
- How do I implement Data Virtualization in my enterprise?
Which Computing Infrastructure for the Decentralized World ?Gilles Fedak
This document discusses iExec, a company that provides a decentralized computing infrastructure using blockchain technology. Key points:
- iExec operates a marketplace that allows providers to offer computing resources like servers, applications, and datasets, and users to access these resources using the RLC token.
- The infrastructure supports a variety of applications including blockchain DApps, HPC, and emerging distributed applications in areas like IoT, edge computing, and machine learning.
- iExec is developing technologies like trusted execution environments and sidechains to improve privacy, performance, and scalability of the decentralized computing platform.
iExec allows for scalable, efficient, and virtualized off-chain execution of arbitrary applications. It provides a global marketplace and uses Ethereum to advertise and provision cloud computing resources like servers, applications, and datasets from various providers in a decentralized peer-to-peer manner without a central authority. iExec's RLC token is used to access this decentralized cloud and incentivizes providers who are paid in RLC for their resources. The iExec software development kit allows decentralized applications to access these distributed cloud computing resources for off-chain computation and data needs.
The iEx.ec Distributed Cloud: Latest Developments and PerspectivesGilles Fedak
The document discusses the iEx.ec Distributed Cloud, which allows blockchain applications to access off-chain computing resources through a market network built on the Ethereum blockchain. Key points include:
- iEx.ec creates a decentralized marketplace where computing resources like servers, apps, and data can be advertised and provisioned directly through smart contracts.
- This provides transparency, security, and no single point of failure compared to traditional clouds.
- The technology builds on decades of research in desktop grid computing and volunteer computing to execute tasks in a highly secure and scalable way.
- An initial proof-of-concept allows generation of custom Bitcoin addresses through parallel processing of tasks.
Talk at the Etherreum Developper Conference. Presents our approach to build a fully decentralized Cloud Infrastructure based on the Ethereum blockchain and Desktop Grid middleware.
How Blockchain and Smart Buildings can Reshape the InternetGilles Fedak
This document discusses how blockchain and smart buildings can reshape distributed cloud computing and the internet. It describes how blockchain technologies like Ethereum allow for distributed applications running on smart contracts. The iEx.ec project aims to provide a blockchain-based distributed cloud computing platform that gives applications access to computing resources like services, data, and infrastructure in a low-cost, secure, on-demand and fully distributed manner. This builds upon prior work in desktop grid computing and could make cloud computing more efficient and greener by better utilizing idle computing resources.
The document discusses MapReduce runtime environments, including their design, performance optimizations, and applications. It provides an overview of MapReduce, describing the programming model and key-value data processing. It also discusses the design of MapReduce execution runtimes, including their use of distributed file systems and handling of parallelization, load balancing, and failures. Finally, it outlines areas of ongoing research to improve MapReduce performance and applicability.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
20240609 QFM020 Irresponsible AI Reading List May 2024
Active Data PDSW'13
1. Introduction Active Data Discussion Conclusion
Active Data
A Data-Centric Approach to Data Life-Cycle Management
Anthony Simonet1 Gilles Fedak1
Matei Ripeanu2 Samer Al-Kiswany2
1Inria, ENS Lyon, University of Lyon 2University of British Columbia
November 18th, 2013
A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 1/20
Outline
Introduction
Data Life Cycle Management
Use-case
Requirements
Active Data
Active Data: principles & features
Example: Globus Online and iRODS
Discussion
Advantages
Limitations
Conclusion
Related works
Conclusion
Big Data
Science and Industry have become data-intensive
Volume of data produced by science and industry grows exponentially
How to store this deluge of data?
How to extract knowledge and sense?
How to make data valuable?
Some examples
CERN’s Large Hadron Collider: 1.5PB/week
Large Synoptic Survey Telescope, Chile: 30 TB/night
Billion edge social network graphs
Searching and mining the Web
Data Life Cycle
Creation/Acquisition
Transfer
Replication
Disposal/Archiving
Definition
The life cycle is the course of operational stages through which data pass from the time when they enter a system to the time when they leave it.
Data Life Cycle Management
Complicated scenarios
Execution of workflows
Complex interactions between software systems
Need to react quickly to operational events
Ad hoc task-centric approaches
Hard to program, maintain, and debug
No formal specification
Complicate interactions between systems
Data Life Cycle Use-case
Example: the Advanced Photon Source at Argonne National Lab
100 TB of raw data per day
Raw data are preprocessed and registered in a Globus dataset catalog
Data are analyzed by various applications
Results are stored in the dataset catalog and shared
Figure: data flows from the instrument (beamline) to local storage; metadata are extracted and registered in a metadata catalog; data are transferred to a remote data center and on to an academic cluster for analysis and further analysis; results are uploaded and their metadata registered in the catalog.
Use-case
Task centric vs. data centric
Task-centric: independent scripts; hard to program, maintain, and verify; coarse granularity.
Data-centric: expresses data dependencies; cross-data-center coordination; user-level fault tolerance; incremental processing.
Requirements
Challenges: a perfect system should...
Simply represent the life cycle of data distributed across different data centers and systems
Simplify DLM modeling and reasoning
Hide the complexity resulting from using different infrastructures and systems
Be easy to integrate with existing systems
Active Data principles
System programmers expose their system's internal data life cycle with a model based on Petri nets.
A life-cycle model is made of Places and Transitions.
Figure: life-cycle model with places Created, Written, Read, and Terminated linked by transitions t1-t4; the token (•) is in the Created place.
Each token has a unique identifier corresponding to that of the actual data item.
Figure: the same life-cycle model, with the token now in the Written place.
A transition fires whenever a data item's state changes.
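Such a Petri-net life-cycle model can be sketched in a few lines of Java: places hold one token per data item, and firing a transition moves a token from the transition's source place to its target. All class and method names here are invented for illustration; this is not the Active Data API.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal Petri-net life-cycle model: places, transitions, and one token
// per data item. Hypothetical sketch, not the Active Data API.
class LifeCycleModel {
    // transition name -> { source place, target place }
    private final Map<String, String[]> transitions = new HashMap<>();
    // token id -> current place
    private final Map<String, String> tokens = new HashMap<>();

    void addTransition(String name, String from, String to) {
        transitions.put(name, new String[] { from, to });
    }

    // A new data item starts its life in the Created place.
    void createToken(String id) { tokens.put(id, "Created"); }

    // Fire a transition for a token: legal only if the token currently
    // sits in the transition's source place.
    boolean fire(String transition, String tokenId) {
        String[] t = transitions.get(transition);
        if (t == null || !t[0].equals(tokens.get(tokenId))) return false;
        tokens.put(tokenId, t[1]);
        return true;
    }

    String placeOf(String tokenId) { return tokens.get(tokenId); }
}
```

Firing t1 moves a token from Created to Written; firing a transition whose source place does not hold the token is rejected, which is what makes illegal life-cycle steps detectable.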
public void handler() {
    computeMD5();
}
Clients may plug code into transitions; it is executed whenever the transition fires.
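Plugging client code into transitions can be sketched as a listener list per transition: the runtime invokes every plugged handler when that transition fires. Names are hypothetical; the actual Active Data API differs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of transition handlers: clients register code on a transition;
// the runtime invokes each registered handler when that transition fires.
class TransitionBus {
    private final Map<String, List<Runnable>> handlers = new HashMap<>();

    // Plug a piece of client code into a named transition.
    void plug(String transition, Runnable handler) {
        handlers.computeIfAbsent(transition, k -> new ArrayList<>()).add(handler);
    }

    // Fire a transition: run every plugged handler, serially.
    void fire(String transition) {
        for (Runnable h : handlers.getOrDefault(transition, List.of())) {
            h.run();
        }
    }
}
```

A handler such as the computeMD5() example above would be plugged once and then run on every firing of its transition.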
Active Data features
The Active Data programming model and runtime environment:
Allows clients to react to life-cycle progression
Transparently exposes distributed data sets
Can be integrated with existing systems
Has scalable performance and minimal overhead over existing systems
Implementation
Prototype implemented in Java (~2,800 LOC)
Client/Service communication is Publish/Subscribe
Two types of subscription:
All transitions for a given data item
All data items for a given transition
Figure: clients subscribe to the Active Data Service.
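The two subscription types can be sketched as filters over published (transition, data item) events: a subscription either matches one data item across all transitions, or one transition across all data items. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Sketch of the two subscription types: by data item (all transitions of
// one item) or by transition (all items crossing one transition).
class SubscriptionService {
    private static class Sub {
        final String transition;   // null = any transition
        final String dataId;       // null = any data item
        final BiConsumer<String, String> callback;
        Sub(String t, String d, BiConsumer<String, String> cb) {
            transition = t; dataId = d; callback = cb;
        }
    }

    private final List<Sub> subs = new ArrayList<>();

    // Subscribe to every transition of one data item.
    void subscribeToData(String dataId, BiConsumer<String, String> cb) {
        subs.add(new Sub(null, dataId, cb));
    }

    // Subscribe to every data item crossing one transition.
    void subscribeToTransition(String transition, BiConsumer<String, String> cb) {
        subs.add(new Sub(transition, null, cb));
    }

    // Deliver a published (transition, dataId) event to matching subscribers.
    void publish(String transition, String dataId) {
        for (Sub s : subs) {
            boolean match = (s.dataId != null && s.dataId.equals(dataId))
                    || (s.transition != null && s.transition.equals(transition));
            if (match) s.callback.accept(transition, dataId);
        }
    }
}
```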
Several ways to publish transitions
Instrument the code
Read the logs
Rely on an existing notification system
The service orders transitions by time of arrival
Figure: clients publish transitions to the Active Data Service, to which other clients subscribe.
Clients run transition handler code locally
Transition handlers are executed
Serially
In a blocking way
In the order transitions were published
Figure: the service notifies each subscribed client of published transitions.
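These delivery guarantees (serial, blocking, in publication order) can be sketched with a client-side FIFO queue; the prototype's internals may well differ, so treat the names below as invented.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// Sketch of client-side delivery: transitions are queued in the order the
// service published them, and handlers run serially, one at a time.
class ClientDelivery {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Consumer<String> handler;
    final List<String> handled = new ArrayList<>();

    ClientDelivery(Consumer<String> handler) { this.handler = handler; }

    // Called by the service for each published transition, in order.
    void notifyTransition(String transition) { queue.addLast(transition); }

    // Drain the queue: each handler call blocks the next one.
    void run() {
        while (!queue.isEmpty()) {
            String t = queue.removeFirst();
            handler.accept(t);   // blocking: the next transition waits
            handled.add(t);
        }
    }
}
```

Because the queue preserves arrival order and run() drains it one element at a time, handlers observe transitions exactly in the order they were published.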
Performance evaluation: Throughput
Figure: average number of transitions per second handled by the Active Data Service, for 10 to 550 clients.
Clients publish 10,000 transitions in a row without pausing.
The prototype scales up to 30,000 transitions per second.
Example: Data Provenance
Definition
The complete history of data life cycle derivations and operations.
Assess the quality of data
Keep track of the origin of data over time
Specialized Provenance Aware Storage Systems
→ What about heterogeneous systems?
Example with Globus Online and iRODS
Globus Online is a file transfer service; iRODS is a data store and metadata catalog.
Example: Globus Online and iRODS
Data events coming from Globus Online and iRODS
Figure: composed life-cycle models. The iRODS model has places Created, Get, Put, and Terminated (transitions t5-t10); the Globus Online model has places Created, Succeeded, Failed, and Terminated (transitions t1-t4), with its token (•) in Created.
Id: {GO: 7b9e02c4-925d-11e2}
The Globus Online transfer completes: the token moves to the Succeeded place, and a handler plugged on this transition pushes the file into iRODS:
public void handler() {
    iput(...);
}
A token is created in the iRODS model; the data item's identifier now spans both systems:
Id: {GO: 7b9e02c4-925d-11e2, iRODS: 10032}
The token moves to the iRODS Put place, and a handler plugged on this transition annotates the iRODS object with the transfer's metadata:
public void handler() {
    annotate();
}
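The annotate() handler tags the iRODS object with the Globus Online transfer's metadata, which imeta then lists as attribute/value pairs (AVUs, shown on the next slide). A hypothetical helper that builds those pairs is sketched below; a real handler would additionally call iRODS's metadata API to attach them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical annotate() helper: turn Globus Online transfer fields into
// the AVU-style attribute/value pairs listed by `imeta ls`.
class TransferAnnotator {
    static Map<String, String> toAvus(String taskId, String source,
                                      String destination, int faults) {
        Map<String, String> avus = new LinkedHashMap<>();
        avus.put("GO_TASK_ID", taskId);
        avus.put("GO_SOURCE", source);
        avus.put("GO_DESTINATION", destination);
        avus.put("GO_FAULTS", Integer.toString(faults));
        return avus;
    }
}
```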
Example: Globus Online and iRODS
$ imeta ls -d test/out_test_4628
AVUs defined for dataObj test/out_test_4628:
attribute: GO_FAULTS
value: 0
----
attribute: GO_COMPLETION_TIME
value: 2013-03-21 19:28:41Z
----
attribute: GO_REQUEST_TIME
value: 2013-03-21 19:28:17Z
----
attribute: GO_TASK_ID
value: 7b9e02c4-925d-11e2-97ce-123139404f2e
----
attribute: GO_SOURCE
value: go#ep1/~/test
----
attribute: GO_DESTINATION
value: asimonet#fraise/~/out_test_4628
Advantages
A simple, graphical way to program DLM operations
Allows some properties of data life cycles to be formally verified
Easy coordination between systems
Easy to scale
Easy to debug
Easy fault tolerance
Fine-grained interaction with the data life cycle
Limitations
Complexity of reasoning in terms of life-cycle events
Lack of a standard
Related work
Data-centric parallel processing
Programming models:
MapReduce and higher-level abstractions: PigLatin, Twister
Incremental systems: MapReduce-Online, Percolator, Chimera, Nephele
Other models with implicit parallelism: Swift, Dryad, AllPairs
Storage systems
BitDew
MosaStore
Provenance Aware Storage Systems
Active Storage
Conclusion
Active Data is...
Data-centric & Event-driven
System-level data integration
What’s next?
Advanced representation of operations that consume and produce data: represent data derivation
Data collection abilities
Distributed implementation of the Publish/Subscribe layer
Thank you!
Questions?
Inria booth #2116