This presentation discusses a data sharing and synchronization tool implemented using Infinispan. The focus is on medical images and metadata, such as those of The Cancer Imaging Archive (TCIA).
Machine Learning on Distributed Systems by Josh Poduska (Data Con LA)
Abstract: Most real-world data science workflows require more than multiple cores on a single server to meet scale and speed demands, but there is a general lack of understanding of what machine learning on distributed systems looks like in practice. Gartner and Forrester do not consider distributed execution when they score advanced analytics software solutions. Much formal machine learning training occurs on single-node machines with non-distributed algorithms. In this talk we discuss why an understanding of distributed architectures is important for anyone in the analytical sciences. We will cover the current distributed machine learning ecosystem, review common pitfalls when performing machine learning at scale, and discuss architectural considerations for a machine learning program, such as the roles of storage and compute and under what circumstances they should be combined or separated.
Short introduction to ML frameworks on Hadoop (Yuya Takashina)
This document provides a short introduction to machine learning (ML) frameworks built on Hadoop, including Hadoop, Spark, and Petuum. It notes that Hadoop is the de facto standard for distributed storage and processing of big data. Spark is 10x faster than Hadoop for some applications by caching data in memory. Petuum is even faster than Spark for ML by using asynchronous communication to reduce network costs while still guaranteeing convergence, and provides deep learning APIs.
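To make the caching point concrete, here is a minimal PySpark sketch (assuming a local Spark installation and a synthetic dataset; it is illustrative, not the benchmark behind the 10x figure): the first action materializes and caches the data, and later actions are served from executor memory.

```python
# Minimal PySpark caching sketch (illustrative; assumes a local Spark install).
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.range(0, 50_000_000).withColumnRenamed("id", "value")
df.cache()          # keep the dataset in executor memory after first use

t0 = time.time()
df.count()          # first action materializes and caches the data
t1 = time.time()
df.count()          # second action is served from the in-memory cache
t2 = time.time()

print(f"first pass: {t1 - t0:.2f}s, cached pass: {t2 - t1:.2f}s")
spark.stop()
```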
This document discusses the benefits and constraints of traditional enterprise data warehouses (EDW) and Hadoop frameworks. It notes that EDWs are used for reporting and analysis but require expensive ETL processes to load structured data into tables. Hadoop provides linear scalability, lower costs, and supports both SQL and non-SQL queries by keeping metadata and storage separate. The document argues that EDWs and Hadoop can coexist, with Hadoop handling ETL workloads and acting as low-cost storage, while EDWs focus on reporting and analytics using existing BI tools connected to Hadoop.
PostgreSQL Extension APIs are Changing the Face of Relational Databases | PGC... (Teresa Giacomini)
PostgreSQL is becoming the relational database of choice. An important factor in the rising popularity of Postgres is the extension APIs that allow developers to improve any database module’s behavior. As a result, Postgres users have access to hundreds of extensions today.
In this talk, we're going to first describe extension APIs. Then, we’re going to present four popular Postgres extensions, and demo their use.
* PostGIS turns Postgres into a spatial database through adding support for geographic objects.
* HLL & TopN add approximation algorithms to Postgres. These algorithms are used when real-time responses matter more than exact results.
* pg_partman makes managing partitions in Postgres easy. Through partitions, Postgres provides 5-10x higher performance for time-series data.
* Citus transforms Postgres into a distributed database. To do this, Citus shards data, performs distributed deadlock detection, and parallelizes queries.
Finally, we’ll conclude with why we think Postgres sets the way forward for relational databases.
PostgreSQL is becoming the relational database of choice. One important factor in the rising popularity of Postgres is the extension APIs. These APIs allow developers to extend any database sub-module’s behavior for higher performance, security, or new functionality. As a result, Postgres users have access to over a hundred extensions today, with more to come in the future.
In this talk, I’m going to first describe PostgreSQL’s extension APIs. These APIs are unique to Postgres, and have the potential to change the database landscape. Then, we’re going to present the four most popular Postgres extensions, show the use cases where they are applicable, and demo their usage.
PostGIS turns Postgres into a spatial database. It adds support for geographic objects, allowing location queries to be run in SQL.
HyperLogLog (HLL) & TopN add approximation algorithms to Postgres. These sketch algorithms are used in distributed systems when real-time responses to queries matter more than exact results.
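To illustrate the idea behind such sketch algorithms, here is a toy HyperLogLog in plain Python; it mirrors the register-and-estimate scheme conceptually but is not the HLL extension's implementation, and the parameters are illustrative assumptions.

```python
import hashlib
import math

class ToyHLL:
    """Minimal HyperLogLog sketch, for illustration only."""
    def __init__(self, p=12):
        self.p = p                  # index bits
        self.m = 1 << p             # number of registers
        self.reg = [0] * self.m

    @staticmethod
    def _hash64(item):
        # 64-bit hash derived from SHA-256 (illustrative choice)
        return int(hashlib.sha256(str(item).encode()).hexdigest()[:16], 16)

    def add(self, item):
        h = self._hash64(item)
        idx = h & (self.m - 1)      # low p bits choose a register
        w = h >> self.p             # remaining bits
        if w == 0:
            rho = 64 - self.p + 1
        else:
            rho = 1                 # position of the first set bit in w
            while not (w & 1):
                w >>= 1
                rho += 1
        self.reg[idx] = max(self.reg[idx], rho)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if raw <= 2.5 * self.m and zeros:   # small-range (linear counting) correction
            return int(self.m * math.log(self.m / zeros))
        return int(raw)

hll = ToyHLL()
for i in range(100_000):
    hll.add(f"user-{i}")
print(hll.estimate())   # roughly 100,000, typically within a few percent
```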
pg_partman makes creating and managing partitions in Postgres easy. Through careful partition management with pg_partman, Postgres offers 5-10x higher write and query performance for time-series data.
Citus transforms Postgres into a distributed database. Citus transparently shards and replicates data, performs distributed deadlock detection, and parallelizes queries.
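A hedged sketch of the routing idea behind hash-based sharding follows; the shard count and hash function are illustrative assumptions, not Citus internals.

```python
# Toy hash-based sharding: map a distribution-key value to one of N shards.
import zlib

SHARD_COUNT = 32

def shard_for(distribution_key: str) -> int:
    """Map a distribution column value to one of SHARD_COUNT shards."""
    return zlib.crc32(distribution_key.encode()) % SHARD_COUNT

rows = [("tenant-42", "click"), ("tenant-7", "view"), ("tenant-42", "purchase")]
for tenant, event in rows:
    print(tenant, event, "-> shard", shard_for(tenant))
# Rows that share a distribution key land on the same shard,
# which keeps per-tenant joins local to one node.
```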
After demoing these popular extensions, we’ll conclude with why we think the monolithic relational database is dying and how Postgres sets a path for the future. We’ll end the talk with a Q&A.
Greenplum is the first open source Massively Parallel Processing (MPP) data warehouse, built with over two million lines of code. MPP allows a program to run across multiple processors that each use their own memory and operating system. Greenplum was released under the Apache license and differs functionally and architecturally from other open source data systems through its use of MPP to execute complex SQL analytics over large datasets at high speeds. As an open source system, Greenplum assures customers that their software needs will be met long-term.
Building High Performance MySQL Query Systems and Analytic Applications (Calpont)
This presentation describes how to build fast-running MySQL applications that serve read-based systems. It takes a special look at column databases and Calpont's InfiniDB.
This document discusses challenges in large scale machine learning. It begins by discussing why distributed machine learning is necessary when data is too large for one computer to store or when models have too many parameters. It then discusses various challenges that arise in distributed machine learning including scalability issues, class imbalance, the curse of dimensionality, overfitting, and algorithm complexities related to data loading times. Specific examples are provided of distributing k-means clustering and spectral clustering algorithms. Distributed implementations of support vector machines are also discussed. Throughout, it emphasizes the importance of understanding when and where distributed approaches are suitable compared to single machine learning.
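To make the k-means example concrete, here is a minimal data-parallel sketch in Python: each simulated worker computes per-centroid partial sums on its partition, and a driver aggregates them. It is illustrative only, not a distributed implementation.

```python
# Data-parallel k-means sketch: map computes partial sums per partition,
# reduce aggregates them into new centroids (single process, illustrative).
import numpy as np

def assign_and_sum(partition, centroids):
    """Map step on one partition: per-centroid running sums and counts."""
    k, d = centroids.shape
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for x in partition:
        c = np.argmin(np.linalg.norm(centroids - x, axis=1))
        sums[c] += x
        counts[c] += 1
    return sums, counts

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 2)) + rng.choice([-5, 5], size=(10_000, 1))
partitions = np.array_split(data, 4)            # stand-in for 4 workers
centroids = data[rng.choice(len(data), 3, replace=False)]

for _ in range(10):
    partials = [assign_and_sum(p, centroids) for p in partitions]   # map (parallelizable)
    total_sums = sum(s for s, _ in partials)
    total_counts = sum(c for _, c in partials)
    centroids = total_sums / np.maximum(total_counts, 1)[:, None]   # reduce step
print(centroids)
```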
There are several types of databases that can be used depending on needs and priorities. A centralized database stores all data in one location, making organization and backups easier but potentially slowing performance under high usage. Distributed databases split data across multiple locations for faster retrieval from nearby sites, though accessing distant data can be slower and ensuring consistency is important. Horizontal and vertical partitioning further divide distributed databases by specific criteria such as common fields or geographic regions. Replication copies all data to multiple locations so it can be accessed locally, with changes synced to the central database during off-peak times. Central indexes link to the actual data stored elsewhere, reducing updates to the main database but potentially causing delays in retrieving data. Data warehouses and data mining analyze the stored information.
The thinking persons guide to data warehouse design (Calpont)
The document discusses key considerations for designing a data warehouse, including building a logical design, transitioning to a physical design, and monitoring and tuning the design. It recommends using a modeling tool to capture logical designs, manual partitioning in some cases, and letting database engines do the work. It also covers physical design decisions like SQL vs NoSQL, row vs column storage, partitioning, indexing and optimizing data loads. Regular monitoring of workloads, bottlenecks and ratios is advised to tune performance.
Hadoop is one of the booming and innovative data analytics technologies that can effectively handle Big Data problems while achieving data security. It is an open-source, trending technology that covers data collection, data processing, and data analytics using HDFS (Hadoop Distributed File System) and MapReduce algorithms.
MySQL conference 2010 ignite talk on InfiniDB (Calpont)
InfiniDB is a column-oriented database engine that scales up across CPU cores and scales out across multiple nodes. It provides high performance for analytics, data warehousing, and read-intensive applications. Tests showed InfiniDB used less space, loaded data faster, and had significantly faster total and average query times compared to row-oriented databases. InfiniDB also showed predictable linear performance gains as data and nodes were increased.
Machine learning can be distributed across multiple machines to allow for processing of large datasets and complex models. There are three main approaches to distributed machine learning: data parallel, where the data is partitioned across machines and models are replicated; model parallel, where different parts of large models are distributed; and graph parallel, where graphs and algorithms are partitioned. Distributed frameworks use these approaches to efficiently and scalably train machine learning models on big data in parallel.
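A toy sketch of the data-parallel approach follows: the data is split across simulated workers, each computes a local gradient for a linear model, and the driver averages the gradients. The model, learning rate, and partitioning are illustrative assumptions.

```python
# Toy data-parallel training: each "worker" computes a gradient on its own
# partition, and the driver averages the gradients (single process, illustrative).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8_000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=8_000)

def local_gradient(Xp, yp, w):
    """Gradient of mean squared error on one worker's partition."""
    residual = Xp @ w - yp
    return 2 * Xp.T @ residual / len(yp)

partitions = list(zip(np.array_split(X, 4), np.array_split(y, 4)))  # 4 workers
w = np.zeros(3)
for step in range(200):
    grads = [local_gradient(Xp, yp, w) for Xp, yp in partitions]  # map (parallelizable)
    w -= 0.1 * np.mean(grads, axis=0)                             # reduce + update
print(w)   # approaches [2.0, -1.0, 0.5]
```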
Building next generation data warehouses (Alex Meadows)
All Things Open 2016 Talk - discussing technologies used to augment traditional data warehousing. Those technologies are:
* data vault
* anchor modeling
* linked data
* NoSQL
* data virtualization
* textual disambiguation
Data Warehouse Logical Design using Mysql (HAFIZ Islam)
Bert Scalzo discusses optimizing data warehouse performance in MySQL. Key points include:
- Designing star schemas with fact and dimension tables, indexing dimensions, and normalizing dimensions.
- Tuning involves choosing storage engines like MyISAM or InnoDB, configuring MySQL, and designing indexes.
- Data loading requires staging tables and recreating indexes after loads. Analyzing tables updates statistics.
- Query style affects performance - simple joins are best. The explain plan and its cost indicate query efficiency. With the right design, MySQL can effectively handle large data warehouses.
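As a minimal sketch of the fact-to-dimension join pattern described above, the snippet below builds a tiny star schema in SQLite (a stand-in for MySQL); the table and column names are illustrative assumptions.

```python
# Tiny star schema and a simple fact-to-dimension join (illustrative only).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, amount REAL);
INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO dim_product VALUES (10, 'widgets'), (20, 'gadgets');
INSERT INTO fact_sales VALUES (1, 10, 99.0), (1, 20, 25.0), (2, 10, 42.0);
""")

# Join the fact table to its dimensions, aggregated by category and month.
rows = con.execute("""
SELECT p.category, d.month, SUM(f.amount)
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY p.category, d.month
""").fetchall()
print(rows)
```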
OLAP tools are categorized based on how they store and process multi-dimensional data, with the main categories being MOLAP, ROLAP, HOLAP, and DOLAP. MOLAP uses specialized data structures and MDDBMS to organize and analyze aggregated data for optimal query performance. ROLAP uses relational databases with a metadata layer to facilitate multiple views of data. HOLAP combines aspects of MOLAP and ROLAP. DOLAP provides limited analysis directly from databases or via servers to desktops in the form of datacubes for local storage, analysis and maintenance.
This document outlines the steps for building a data warehouse, including: 1) extracting transactional data from various sources, 2) transforming the data to relate tables and columns, 3) loading the transformed data into a dimensional database to improve query performance, 4) building pre-calculated summary values using SQL Server Analysis Services to speed up report generation, and 5) building a front-end reporting tool for end users to easily fetch required information.
The document discusses different types of distributed computing including distributed supercomputing, high-throughput computing, on-demand computing, data-intensive computing, and collaborative computing. It provides examples of tasks for each type and challenges involved such as scheduling resources, scalability, and performance across heterogeneous systems. Specific examples mentioned include climate modeling, computational chemistry, parameter studies, cryptographic problems, high energy physics data analysis, and collaborative exploration of large geophysical datasets.
Working with the vast variety of data out there can be a huge challenge for organizations. We believe that a “one size does not fit all” solution is required to work with such data. The BigDAWG polystore is a federated DB system for multiple, disparate data models. It supports the notions of location transparency and semantic completeness through islands of information which support a data model, query language and candidate set of DB engines. A prototype of the BigDAWG system has shown great promise when applied to diverse medical data.
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ... (Michael Stack)
Huan-Ping Su (蘇桓平), Yi-Sheng Lien (連奕盛) National Cheng Kung University
Track 2: Ecology and Solutions
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
1. Introduction to the Course "Designing Data Bases with Advanced Data Models... (Fabio Fumarola)
Information technology has led us into an era where the production, sharing, and use of information are part of everyday life, often without our being aware of it: it is now almost impossible not to leave a digital trail of many of the actions we perform every day, for example through digital content such as photos, videos, blog posts, and everything that revolves around social networks (Facebook and Twitter in particular). Added to this, with the "Internet of Things" we see an increase in devices such as watches, bracelets, thermostats, and many other items that can connect to the network and therefore generate large data streams. This explosion of data justifies the birth of the term Big Data: it denotes data produced in large quantities, at remarkable speed, and in different formats, which requires processing technologies and resources that go far beyond conventional data management and storage systems. It is immediately clear that 1) data storage models based on the relational model, and 2) processing systems based on stored procedures and computation on grids, are not applicable in these contexts. Regarding point 1, RDBMSs, widely used for a great variety of applications, run into problems when the amount of data grows beyond certain limits. Scalability and implementation cost are only part of the disadvantages: very often, when faced with managing big data, variability, that is, the lack of a fixed structure, also represents a significant problem. This has given a boost to the development of NoSQL databases. The NoSQL Databases website defines NoSQL databases as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable." These databases are distributed, open source, horizontally scalable, schema-free (key-value, column-oriented, document-based, and graph-based), easily replicable, lack ACID guarantees, and can handle large amounts of data. They are typically integrated with processing tools based on the MapReduce paradigm proposed by Google; MapReduce, together with the open source Hadoop framework, represents the new model for distributed processing of large amounts of data and supplants techniques based on stored procedures and computational grids (point 2). The relational model taught in basic database design courses has many limitations compared to the demands posed by new applications that use Big Data and NoSQL databases to store data and MapReduce to process large amounts of data.
Course Website http://pbdmng.datatoknowledge.it/
OLAP (Online Analytical Processing) is a technology that uses a multidimensional view of aggregated data to provide quicker access to strategic information and help with decision making. It has four main characteristics: using multidimensional data analysis techniques, providing advanced database support, offering easy-to-use end user interfaces, and supporting client/server architecture. A key aspect is representing data in a multidimensional structure that allows for consolidation and aggregation of data at different levels.
The document discusses MapReduce and the Hadoop framework. It provides an overview of how MapReduce works, examples of problems it can solve, and how Hadoop implements MapReduce at scale across large clusters in a fault-tolerant manner using the HDFS distributed file system and YARN resource management.
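To ground the MapReduce discussion, here is a single-process word-count sketch of the map, shuffle, and reduce phases; Hadoop runs the same pattern across a cluster with HDFS and YARN, which this toy code does not attempt.

```python
# Single-process sketch of the MapReduce word-count pattern (illustrative only).
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Emit (word, 1) pairs for one input record."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map_phase(l) for l in lines)              # map
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())  # shuffle + reduce
print(counts)   # {'the': 3, 'quick': 1, ...}
```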
SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats (Qian Lin)
1) SciMATE is a novel MapReduce-like framework that supports multiple scientific data formats like NetCDF and HDF5 for in-situ data analysis without needing to reload data.
2) It provides a customizable data adapter layer and optimized data access strategies for patterns like strided, column, and discrete reads to improve performance.
3) An evaluation showed SciMATE has good thread and node scalability for data processing and loading and that contiguous column reads outperform fixed-size column reads.
1) Statistics Denmark provides statistical data through a centralized dissemination system called StatBank, which hosts 1,500 tables covering all subjects. StatBank allows users to access and download data in a variety of formats for free.
2) While some data inputs and management are decentralized, output is centralized through StatBank to ensure coordination of structure, formatting and simultaneous releases.
3) Principles of the dissemination system include prioritizing electronic access over paper, using StatBank as the single source of official statistics for all publications and target audiences. Metadata is also centralized and stored once for reuse across outputs.
A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database (Ishara Amarasekera)
This presentation was prepared by Ishara Amarasekera based on the paper, A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database by Hasso Plattner.
This presentation contains a summary of the content provided in this research paper and was presented as a paper discussion for the course, Advanced Database Systems in Computer Science.
Hadoop is one of the booming and innovative data analytics technologies that can effectively handle Big Data problems while achieving data security. It is an open-source, trending technology that covers data collection, data processing, and data analytics using HDFS (Hadoop Distributed File System) and MapReduce algorithms.
OLAP tools enable interactive analysis of multidimensional data from multiple perspectives. There are three main types of OLAP tools: ROLAP, MOLAP, and HOLAP. ROLAP uses relational databases and SQL queries, while MOLAP pre-computes and stores aggregated data in multidimensional arrays for fast querying. HOLAP is a hybrid that stores some data in ROLAP and some in MOLAP to optimize both query performance and cube processing time.
Visualizing big data in the browser using spark (Databricks)
This document discusses using Spark to enable interactive visualization of big data in the browser. Spark can help address challenges of manipulating large datasets by caching data in memory to reduce latency, increasing parallelism, and summarizing, modeling, or sampling large datasets to reduce the number of data points. The goal is to put visualization back into the normal workflow of data analysis regardless of data size and enable sharing and collaboration through interactive and reproducible visualizations in the browser.
20160331 sa introduction to big data pipelining berlin meetup 0.3 (Simon Ambridge)
This document discusses building data pipelines with Apache Spark and DataStax Enterprise (DSE) for both static and real-time data. It describes how DSE provides a scalable, fault-tolerant platform for distributed data storage with Cassandra and real-time analytics with Spark. It also discusses using Kafka as a messaging queue for streaming data and processing it with Spark. The document provides examples of using notebooks, Parquet, and Akka for building pipelines to handle both large static datasets and fast, real-time streaming data sources.
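A minimal sketch of the Kafka-to-Spark leg of such a pipeline is shown below using Spark Structured Streaming; the broker address and topic name are placeholders, the spark-sql-kafka connector package is assumed to be available, and the Cassandra/DSE sink is omitted.

```python
# Minimal Structured Streaming sketch: consume a Kafka topic and print it.
# Broker address and topic name are placeholders; a real pipeline would write
# to Cassandra/DSE instead of the console.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-pipeline").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder broker
    .option("subscribe", "events")                          # placeholder topic
    .load()
    .select(col("key").cast("string"), col("value").cast("string"))
)

query = (
    events.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```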
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit... (Ilkay Altintas, Ph.D.)
Scientific workflows are used by many scientific communities to capture, automate and standardize computational and data practices in science. Workflow-based automation is often achieved through a craft that combines people, process, computational and Big Data platforms, application-specific purpose and programmability, leading to provenance-aware archival and publication of the results. This talk summarizes varying and changing requirements for distributed workflows influenced by Big Data and heterogeneous computing architectures and presents a methodology for workflow-driven science based on these maturing requirements.
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017 (AWS Chicago)
"Strategies for supporting near real time analytics, OLAP, and interactive data exploration" - Dr. Jeremy Engle, Engineering Manager Data Team at Jellyvision
Transforming Data Architecture Complexity at Sears - StampedeCon 2013 (StampedeCon)
At the StampedeCon 2013 Big Data conference in St. Louis, Justin Sheppard discussed Transforming Data Architecture Complexity at Sears. High ETL complexity and costs, data latency and redundancy, and batch window limits are just some of the IT challenges caused by traditional data warehouses. Gain an understanding of big data tools through the use cases and technology that enables Sears to solve the problems of the traditional enterprise data warehouse approach. Learn how Sears uses Hadoop as a data hub to minimize data architecture complexity – resulting in a reduction of time to insight by 30-70% – and discover “quick wins” such as mainframe MIPS reduction.
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ... (MLconf)
Building a Recommender System for Publications using Vector Space Model and Python: In recent years, it has become very common to have access to a large number of publications on similar or related topics. Recommendation systems for publications are needed to locate appropriate published articles among a large number of publications on the same or similar topics. In this talk, I will describe a recommender system framework for PubMed articles. PubMed is a free search engine that primarily accesses the MEDLINE database of references and abstracts on life-sciences and biomedical topics. The proposed recommender system produces two types of recommendations: (i) content-based recommendations and (ii) recommendations based on similarities with other users’ search profiles. The first type, content-based recommendation, can efficiently search for material that is similar in context or topic to the input publication. The second mechanism generates recommendations using the search history of users whose search profiles match the current user. The content-based recommendation system uses a Vector Space Model to rank PubMed articles based on the similarity of content items. To implement the second recommendation mechanism, we use Python libraries and frameworks: we find the profile similarity of users and recommend additional publications based on the history of the most similar user. In the talk I will present the background and motivation for these recommendation systems, and discuss the implementation of this PubMed recommendation system with examples.
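As a hedged sketch of the content-based part, the snippet below ranks documents against a query by cosine similarity of TF-IDF vectors (a Vector Space Model) using scikit-learn; the toy abstracts are illustrative, not PubMed data or the talk's actual code.

```python
# Content-based ranking with a Vector Space Model: TF-IDF + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "deep learning for tumor segmentation in MRI",
    "statistical methods for clinical trial design",
    "convolutional networks applied to radiology images",
]
query = "neural networks for medical imaging"

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(abstracts)   # documents as TF-IDF vectors
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, doc_matrix).ravel()
ranked = sorted(zip(scores, abstracts), reverse=True)
for score, text in ranked:
    print(f"{score:.2f}  {text}")
```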
This talk will cover, via live demo & code walk-through, the key lessons we’ve learned while building such real-world software systems over the past few years. We’ll incrementally build a hybrid machine learned model for fraud detection, combining features from natural language processing, topic modeling, time series analysis, link analysis, heuristic rules & anomaly detection. We’ll be looking for fraud signals in public email datasets, using Python & popular open-source libraries for data science and Apache Spark as the compute engine for scalable parallel processing.
Deep Learning on Apache® Spark™: Workflows and Best Practices (Databricks)
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
Deep Learning on Apache® Spark™: Workflows and Best Practices (Jen Aman)
This document summarizes a presentation about deep learning workflows and best practices on Apache Spark. It discusses how deep learning fits within broader data pipelines for tasks like training and transformation. It also outlines recurring patterns for integrating Spark and deep learning frameworks, including using Spark for data parallelism and embedding deep learning transforms. The presentation provides tips for developers on topics like using GPUs with PySpark and monitoring deep learning jobs. It concludes by discussing challenges in the areas of distributed deep learning and Spark integration.
WorDS of Data Science in the Presence of Heterogenous Computing Architectures (Ilkay Altintas, Ph.D.)
ISUM 2015 Keynote
Summary: Computational and Data Science is about extracting knowledge from data and modeling. This end goal can only be achieved through a craft that combines people, processes, computational and Big Data platforms, application-specific purpose, and programmability. Publications and the provenance of the data products leading to these publications are also important. With this in mind, this talk defines a terminology for computational and data science applications, and discusses why focusing on these concepts is important for executability and reproducibility in computational and data science.
Haytham ElFadeel presented on next-generation storage systems and key-value stores. He began with an overview of scalable systems and the need for both vertical and horizontal scalability. He discussed the limitations of traditional databases in scaling, including complexity, wasted features, and multi-step query processing. Key-value stores were presented as an alternative, offering simple interfaces and designs optimized for scaling across hundreds of machines. Performance comparisons showed key-value stores significantly outperforming databases. Systems discussed included Amazon Dynamo, Facebook Cassandra, and Redis.
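To illustrate the partitioning idea behind Dynamo-style key-value stores, here is a small consistent-hashing sketch in Python; the node names and virtual-node count are illustrative assumptions, not any specific system's implementation.

```python
# Consistent hashing: the core partitioning scheme behind Dynamo-style stores.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = []                          # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):             # virtual nodes smooth the load
                h = self._hash(f"{node}#{i}")
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise on the ring to the first node at or after the key's hash."""
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
for key in ["user:1", "user:2", "session:xyz"]:
    print(key, "->", ring.node_for(key))
```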
This webinar discusses tools for making big data easy to work with. It covers MetaScale Expertise, which provides Hadoop expertise and case studies. Kognitio Analytics is discussed as a way to accelerate Hadoop for organizations. The webinar agenda includes an introduction, presentations on MetaScale and Kognitio, and a question and answer session. Rethinking data strategies with Hadoop and using in-memory analytics are presented as ways to gain insights from large, diverse datasets.
Parallel Distributed Deep Learning on HPCC Systems (HPCC Systems)
As part of the 2018 HPCC Systems Community Day event:
The training process for modern deep neural networks requires big data and large amounts of computational power. Combining HPCC Systems and Google’s TensorFlow, Robert created a parallel stochastic gradient descent algorithm to provide a basis for future deep neural network research, thereby helping to enhance the distributed neural network training capabilities of HPCC Systems.
Robert Kennedy is a first year Ph.D. student in CS at Florida Atlantic University with research interests in Deep Learning and parallel and distributed computing. His current research is in improving distributed deep learning by implementing and optimizing distributed algorithms.
A podium abstract presented at AMIA 2016 Joint Summits on Translational Science. This discusses Data Café — A Platform For Creating Biomedical Data Lakes.
This document provides an outline and introduction for a lecture on MapReduce and Hadoop. It discusses Hadoop architecture, including HDFS and YARN, and how they work together to provide distributed storage and processing of big data across clusters of machines. It also provides an overview of the MapReduce programming model and how data is processed through the map and reduce phases in Hadoop. It references several books on Hadoop, MapReduce, and big data fundamentals.
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z... (Maurice Nsabimana)
Volunteers around the world increasingly act as human sensors to collect millions of data points. A team from the World Bank trained deep learning models, using Apache Spark and BigDL, to confirm that photos gathered through a crowdsourced data collection pilot matched the goods for which observations were submitted.
In this talk, Maurice Nsabimana, a statistician at the World Bank, and Jiao Wang, a software engineer on the Big Data Technology team at Intel, demonstrate a collaborative project to design and train large-scale deep learning models using crowdsourced images from around the world. BigDL is a distributed deep learning library designed from the ground up to run natively on Apache Spark. It enables data engineers and scientists to write deep learning applications in Scala or Python as standard Spark programs, without having to explicitly manage distributed computations. Attendees of this session will learn how to get started with BigDL, which runs in any Apache Spark environment, whether on-premise or in the cloud.
The webinar discusses how organizations can make big data easy to use with the right tools and talent. It presents on MetaScale's expertise in helping Sears Holdings implement Hadoop and how Kognitio's in-memory analytics platform can accelerate Hadoop for organizations. The webinar agenda includes an introduction, a case study on Sears Holdings' Hadoop implementation, an explanation of how Kognitio's platform accelerates Hadoop, and a Q&A session.
Data Lake Acceleration vs. Data Virtualization - What’s the difference? (Denodo)
Watch full webinar here: https://bit.ly/3hgOSwm
Data Lake technologies have been in constant evolution in recent years, with each iteration promising to fix what previous ones failed to accomplish. Several data lake engines are hitting the market with better ingestion, governance, and acceleration capabilities that aim to create the ultimate data repository. But isn't that the promise of a logical architecture with data virtualization too? So, what’s the difference between the two technologies? Are they friends or foes? This session will explore the details.
This document contains a presentation on using graph databases for recommendations. It begins with an introduction to graphs and graph theory, then discusses what graph databases are and how they are different from relational databases. It explains how graphs are well-suited for complex querying and representing connected data. The presentation describes how recommendation systems work and how graph algorithms and storing recommendation data in a graph structure provide benefits like real-time recommendations, navigating relationships between items, and efficient operations. It concludes with a demonstration, examples, and discussing future events.
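As a toy illustration of the "customers who bought this also bought" traversal that graph databases answer natively, here is a small in-memory sketch in Python; the purchase data is illustrative, and a real graph database would express this as a graph query over stored relationships.

```python
# Tiny in-memory graph and a two-hop "also bought" traversal (illustrative).
from collections import Counter, defaultdict

purchases = [  # (user, item) edges; illustrative data
    ("alice", "book"), ("alice", "lamp"),
    ("bob", "book"), ("bob", "desk"),
    ("carol", "book"), ("carol", "lamp"), ("carol", "chair"),
]

bought_by = defaultdict(set)   # item -> users who bought it
bought = defaultdict(set)      # user -> items they bought
for user, item in purchases:
    bought_by[item].add(user)
    bought[user].add(item)

def also_bought(item, limit=3):
    """Two-hop traversal: item -> users who bought it -> their other items."""
    counts = Counter()
    for user in bought_by[item]:
        for other in bought[user] - {item}:
            counts[other] += 1
    return counts.most_common(limit)

print(also_bought("book"))   # 'lamp' ranks first; tie order of the rest may vary
```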
Similar to Data replication and synchronization tool (20)
Google Summer of Code (GSoC) is a remote open-source internship program funded by Google, in which contributors work with an open source organization (and get paid) over a summer.
https://kkpradeeban.blogspot.com/2022/11/google-summer-of-code-gsoc-2023.html
GSoC 2022 comes with more changes and flexibility. This presentation aims to give an introduction to the contributors and what to expect this summer.
https://kkpradeeban.blogspot.com/2022/01/google-summer-of-code-gsoc-2022.html
This document provides information about Google Summer of Code (GSoC) 2022. It discusses why students should participate in GSoC, the application timeline and process, tips for finding projects and communicating with mentors, expectations during the coding and evaluation periods, and opportunities to continue contributing to open source projects after GSoC. The overall goal is to help potential contributors understand what is required to be accepted into and succeed in GSoC.
Niffler is an efficient DICOM Framework for machine learning pipelines and processing workflows on metadata. It facilitates efficient transfer of DICOM images on-demand and real-time from PACS to the research environments, to run processing workflows and machine learning pipelines.
https://github.com/Emory-HITI/Niffler/
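As a hedged sketch of the kind of metadata extraction such a pipeline performs, the snippet below reads DICOM headers with pydicom; it is not Niffler's own code, and the file path is a placeholder.

```python
# Read DICOM metadata only (no pixel data); "example.dcm" is a placeholder path.
import pydicom

ds = pydicom.dcmread("example.dcm", stop_before_pixels=True)
record = {
    "StudyInstanceUID": ds.get("StudyInstanceUID"),
    "Modality": ds.get("Modality"),
    "StudyDate": ds.get("StudyDate"),
}
print(record)
```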
This is an introductory presentation to GSoC 2021. This year there were a few specific changes to GSoC compared to past years. Specifically, the workload and the student stipend were halved in 2021 compared to previous years.
We propose Niffler (https://github.com/Emory-HITI/Niffler), an open-source ML framework that runs in research clusters by receiving images in real-time using the DICOM protocol from hospitals' PACS.
This presentation aims to introduce GSoC to new mentors and mentoring organizations. More details - https://kkpradeeban.blogspot.com/2019/12/google-summer-of-code-gsoc-2020-for.html
An introductory presentation to Google Summer of Code (GSoC), focusing on the year 2020. More information can be found at https://kkpradeeban.blogspot.com/search/label/GSoC
The diversity of data management systems affords developers the luxury of building heterogeneous architectures to address the unique needs of big data. It allows one to mix-n-match systems that can store, query, update, and process data based on specific use cases. However, this heterogeneity brings with it the burden of developing custom interfaces for each data management system. Existing big data frameworks fall short in mitigating these challenges. In this paper, we present Bindaas, a secure and extensible big data middleware that offers uniform access to diverse data sources. By providing a RESTful web service interface to the data sources, Bindaas exposes query, update, store, and delete functionality of the data sources as data service APIs, while providing turn-key support for standard operations involving access control and audit-trails. The research community has deployed Bindaas in various production environments in healthcare. Our evaluations highlight the efficiency of Bindaas in serving concurrent requests to data source instances with minimal overheads.
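To illustrate the general pattern of exposing a stored query as a data service API, here is a minimal Python/Flask sketch; Bindaas itself is Java-based middleware, and the endpoint, database file, and schema below are illustrative assumptions, not its API.

```python
# Minimal stand-in for a data service API: one stored query behind a REST endpoint.
from flask import Flask, jsonify, request
import sqlite3

app = Flask(__name__)

@app.route("/datasets/patients", methods=["GET"])
def query_patients():
    """Expose a parameterized query as a data service with a simple filter."""
    min_age = int(request.args.get("min_age", 0))
    con = sqlite3.connect("clinical.db")     # placeholder database file and schema
    rows = con.execute(
        "SELECT id, age FROM patients WHERE age >= ?", (min_age,)
    ).fetchall()
    return jsonify([{"id": r[0], "age": r[1]} for r in rows])

if __name__ == "__main__":
    app.run(port=8080)
```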
This is the 2nd defense of my Ph.D. double degree.
More details - https://kkpradeeban.blogspot.com/2019/08/my-phd-defense-software-defined-systems.html
The presentation slides of my Ph.D. thesis. For more information - https://kkpradeeban.blogspot.com/2019/07/my-phd-defense-software-defined-systems.html
The presentation slides of my Ph.D. thesis proposal (known as the "CAT" at my university). I received a score of 18/20.
Supervisors:
Prof. Luís Veiga (IST, ULisboa)
Prof. Peter Van Roy (UCLouvain)
Jury:
Prof. Javid Taheri (Karlstad University)
Prof. Fernando Mira da Silva (IST, ULisboa)
This is my presentation at IFIP Networking 2018 in Zurich.
In this paper, we propose a cloud-assisted network as an alternative connectivity provider.
More details: https://kkpradeeban.blogspot.com/2018/05/moving-bits-with-fleet-of-shared.html
Services that access or process a large volume of data are known as data services. Big data frameworks consist of diverse storage media and heterogeneous data formats. Through their service-based approach, data services offer a standardized execution model to big data frameworks. Software-Defined Networking (SDN) increases the programmability of the network, by unifying the control plane centrally, away from the distributed data plane devices. In this paper, we present Software-Defined Data Services (SDDS), extending the data services with the SDN paradigm. SDDS consists of two aspects. First, it models the big data executions as data services or big services composed of several data services. Then, it orchestrates the services centrally in an interoperable manner, by logically separating the executions from the storage. We present the design of an SDDS orchestration framework for network-aware big data executions in data centers. We then evaluate the performance of SDDS through microbenchmarks on a prototype implementation. By extending SDN beyond data centers, we can deploy SDDS in broader execution environments.
https://kkpradeeban.blogspot.com/2018/04/software-defined-data-services.html
This is my presentation at the DMAH workshop, held in conjunction with VLDB'17. It describes my work during my stay at Emory BMI.
More information: https://kkpradeeban.blogspot.com/2017/08/on-demand-service-based-big-data.html
This is a poster I presented at the ACRO Summer School at Karlstad University. It presents my PhD work.
More details: http://kkpradeeban.blogspot.com/2017/07/my-first-polygonal-journey.html
This is the presentation I gave to the audience of the EMJD-DC Spring Event 2017 in Brussels to discuss my research. http://kkpradeeban.blogspot.be/2017/05/emjd-dc-spring-event-2017.html
This document summarizes the PhD work of Pradeeban Kathiravelu on improving scalability and resilience in multi-tenant distributed clouds. It describes two approaches: 1) SMART uses SDN to provide differentiated quality of service and service level agreements by dynamically diverting and cloning priority network flows. 2) Mayan componentizes big data services as microservices that can be executed in a network-aware and scalable way across distributed clouds. Evaluation shows these approaches improve speedup and ensure SLAs for critical flows compared to network-agnostic distributed execution.
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Westerlund 1 and 2 Open Clusters Survey (Sérgio Sacani)
Context. With a mass exceeding several 10^4 M⊙ and a rich and dense population of massive stars, supermassive young star clusters represent the most massive star-forming environment that is dominated by the feedback from massive stars and gravitational interactions among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate
the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low and high mass stars.
The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically,
the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec.
Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within
and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation
were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a photon flux threshold of approximately 2 × 10^−8 photons cm^−2 s^−1. The X-ray sources exhibit a highly concentrated spatial distribution,
with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known
massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
This presentation gives a brief overview of the structural and functional attributes of nucleotides, the structure and function of genetic material, and the impact of UV rays and pH upon them.
The binding of cosmological structures by massless topological defectsSérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field
equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational
field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin
spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling
concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect
light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is
mitigated, at least in part.
The thematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. This test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results of the test can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills.
Nucleophilic Addition of carbonyl compounds.pptxSSR02
Nucleophilic addition is the most important reaction of carbonyl compounds: not just aldehydes and ketones, but carboxylic acid derivatives in general.
Carbonyls undergo addition reactions with a large range of nucleophiles.
Comparing the relative basicity of the nucleophile and the product is extremely helpful in determining how reversible the addition reaction is. Reactions with Grignards and hydrides are irreversible. Reactions with weak bases like halides and carboxylates generally don’t happen.
Electronic effects (inductive effects, electron donation) have a large impact on reactivity.
Large groups adjacent to the carbonyl will slow the rate of reaction.
Neutral nucleophiles can also add to carbonyls, although their additions are generally slower and more reversible. Acid catalysis is sometimes employed to increase the rate of addition.
This PowerPoint presentation, generated from an MS Word document, covers the major details of the micronucleus test, its significance, and the assays used to conduct it. The test detects micronucleus formation inside the cells of nearly every multicellular organism; micronuclei arise during chromosome segregation, when lagging chromosomes or fragments are left out of the daughter nuclei.
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Invited talk at the Journées Nationales du GDR GPL 2024.
1. Data Replication and Synchronization Tool
Ashish Sharma
Pradeeban Kathiravelu
2. Introduction
• Data is huge.
• Consumers often share a subset of data with others.
– Pointers to the data, actually.
• Medical data is structured in hierarchies.
3. Motivation
• Creating and sharing pointers to interesting subsets of data.
• Data Sharing Synchronization System
– Fault-tolerant.
– In-memory.
• Generic, while targeting medical images and metadata.
– The Cancer Imaging Archive (TCIA).
4. Solution Architecture
• Users create, share, and update replica sets from a data source.
• Infinispan In-Memory Data Grid (version 6.0.2) to store the replica sets (see the sketch below).
Fig 1. Deployment Architecture
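To make the storage concrete, here is a minimal sketch of how replica sets of pointers could be kept in an embedded Infinispan cache. It uses the Infinispan 6.x embedded API; the cache name, the ReplicaSet class, and the pointer values are illustrative assumptions, not code from the tool itself.

import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

import java.io.Serializable;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ReplicaSetStore {

    // Hypothetical value type: a named set of pointers (e.g. DICOM series identifiers)
    // into the data source; the images themselves are never copied into the grid.
    public static class ReplicaSet implements Serializable {
        final String name;
        final Set<String> pointers;
        ReplicaSet(String name, Set<String> pointers) {
            this.name = name;
            this.pointers = pointers;
        }
    }

    public static void main(String[] args) throws Exception {
        // Local, embedded cache manager; a clustered deployment would pass a
        // GlobalConfiguration with a JGroups transport instead (see the later sketch).
        DefaultCacheManager manager = new DefaultCacheManager();
        manager.defineConfiguration("replica-sets",
                new ConfigurationBuilder().clustering().cacheMode(CacheMode.LOCAL).build());

        Cache<String, ReplicaSet> replicaSets = manager.getCache("replica-sets");

        // A publisher creates and shares a replica set: only pointers are stored.
        replicaSets.put("chest-ct-subset",
                new ReplicaSet("chest-ct-subset",
                        new HashSet<>(Arrays.asList("series-0001", "series-0002"))));

        // A consumer looks the replica set up by name and resolves the pointers against TCIA.
        ReplicaSet shared = replicaSets.get("chest-ct-subset");
        System.out.println(shared.pointers);

        manager.stop();
    }
}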
5. Execution Flow
• Publisher-Consumer API to consume the replica sets and Data Provider API to communicate with the data source (sketched below).
Fig 2. Execution Flow
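The slide names the two integration points without showing their shape. The following is a hypothetical sketch of what the Publisher-Consumer and Data Provider APIs might look like; every interface and method name here is an assumption made for illustration, not taken from the tool.

import java.util.Set;
import java.util.function.Consumer;

/** Data Provider API (hypothetical): talks to the underlying data source, e.g. TCIA. */
interface DataProvider {
    /** Return the identifiers (pointers) of items matching a query against the data source. */
    Set<String> query(String collection, String filterExpression);

    /** Fetch the bytes behind a single pointer when a consumer actually needs the data. */
    byte[] retrieve(String pointer);
}

/** Publisher-Consumer API (hypothetical): publish replica sets and subscribe to their updates. */
interface ReplicaSetChannel {
    /** Publish or update a named replica set consisting of pointers into the data source. */
    void publish(String replicaSetName, Set<String> pointers);

    /** Register a consumer that is notified whenever the named replica set changes. */
    void subscribe(String replicaSetName, Consumer<Set<String>> onUpdate);
}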
7. Extensibility
• Not tightly coupled to the technology (see the sketch below).
– Other data grids: Hazelcast, Terracotta BigMemory, Oracle Coherence.
– Persistence: integration with SQL or NoSQL solutions such as MongoDB.
– Data sources other than TCIA.
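One common way to achieve this loose coupling is to hide the data grid behind a small storage interface, so that Infinispan, another grid, or a persistent store can be plugged in without touching the synchronization logic. The interface and in-memory implementation below are a hypothetical illustration, not part of the tool.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical storage abstraction: the synchronization logic would depend only on this
// interface, so Infinispan could be swapped for Hazelcast, Terracotta BigMemory,
// Oracle Coherence, or a persistent store such as MongoDB without touching the rest.
interface ReplicaSetBackend {
    void put(String replicaSetName, Set<String> pointers);
    Set<String> get(String replicaSetName);
}

// Trivial in-memory implementation, handy for tests; a production backend would wrap
// the client API of whichever data grid or database is chosen.
class InMemoryBackend implements ReplicaSetBackend {
    private final Map<String, Set<String>> store = new ConcurrentHashMap<>();
    public void put(String replicaSetName, Set<String> pointers) { store.put(replicaSetName, pointers); }
    public Set<String> get(String replicaSetName) { return store.get(replicaSetName); }
}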
8. What does Infinispan offer?
• High performance and scalability.
• Fault tolerance.
– Multiple nodes with TCP/IP- or multicast-based JGroups clustering configurations (configuration sketched below).
• Distributed execution.
– Optimized both for a single node as a local cache and for multi-node execution.
• MapReduce framework.
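As a rough illustration of the clustering bullet, this sketch configures an embedded Infinispan 6.x cache manager with the default JGroups transport and a distributed, synchronous cache, so replica sets survive the loss of a single node. The cluster name, cache name, and owner count are assumptions for the example.

import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class ClusteredReplicaSets {
    public static void main(String[] args) {
        // Default JGroups transport (UDP multicast). A TCP stack can be selected by
        // pointing the transport at a JGroups configuration file instead, e.g.
        // .transport().addProperty("configurationFile", "jgroups-tcp.xml").
        DefaultCacheManager manager = new DefaultCacheManager(
                new GlobalConfigurationBuilder()
                        .transport().defaultTransport().clusterName("replica-sync-cluster")
                        .build());

        // DIST_SYNC with two owners: each entry lives on two nodes, so losing one node
        // does not lose any replica set.
        manager.defineConfiguration("replica-sets",
                new ConfigurationBuilder()
                        .clustering().cacheMode(CacheMode.DIST_SYNC)
                        .hash().numOwners(2)
                        .build());

        Cache<String, String> cache = manager.getCache("replica-sets");
        cache.put("demo", "pointer-list");   // visible from every node in the cluster
        System.out.println(cache.get("demo"));
        manager.stop();
    }
}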
9. Thank you!