TPC-DI - The First Industry Benchmark for Data Integration - Tilmann Rabl
This presentation was held by Meikel Poess on September 3, 2014 at VLDB 2014 in Hangzhou, China.
Full paper and additional information available at:
http://msrg.org/papers/VLDB2014TPCDI
Abstract:
Historically, the process of synchronizing a decision support system with data from operational systems has been referred to as Extract, Transform, Load (ETL), and the tools supporting this process have been referred to as ETL tools. Recently, ETL was replaced by the more comprehensive term data integration (DI). DI describes the process of extracting and combining data from a variety of data source formats, transforming that data into a unified data model representation, and loading it into a data store. This is done in the context of a variety of scenarios, such as data acquisition for business intelligence, analytics and data warehousing, but also synchronization of data between operational applications, data migrations and conversions, master data management, enterprise data sharing, and delivery of data services in a service-oriented architecture context, amongst others. With these scenarios relying on up-to-date information, it is critical to implement a highly performing, scalable, and easy to maintain data integration system. This is especially important as the complexity, variety, and volume of data are constantly increasing, making the performance of data integration systems ever more critical. Despite the significance of having a highly performing DI system, there has been no industry standard for measuring and comparing their performance. The TPC, acknowledging this void, has released TPC-DI, an innovative benchmark for data integration. This paper motivates the reasons behind its development, describes its main characteristics, including the workload, run rules, and metric, and explains key decisions.
This tutorial was held at IEEE BigData '14 on October 29, 2014 in Bethesda, MD, USA.
Presenters: Chaitan Baru and Tilmann Rabl
More information available at:
http://msrg.org/papers/BigData14-Rabl
Summary:
This tutorial will introduce the audience to the broad set of issues involved in defining big data benchmarks and in creating auditable, industry-standard benchmarks that consider performance as well as price/performance. Big data benchmarks must capture the essential characteristics of big data applications and systems, including heterogeneous data (e.g., structured, semi-structured, unstructured, graphs, and streams); large-scale and evolving system configurations; varying system loads; processing pipelines that progressively transform data; and workloads that include queries as well as data mining and machine learning operations and algorithms. Different benchmarking approaches will be introduced, from micro-benchmarks to application-level benchmarking.
Since May 2012, five workshops on Big Data Benchmarking have been held, with participation from industry and academia. One of the outcomes of these meetings has been the creation of the industry's first big data benchmark, viz., TPCx-HS, the Transaction Processing Performance Council's benchmark for Hadoop Systems. During these workshops, a number of other proposals have been put forward for more comprehensive big data benchmarking. The tutorial will present and discuss salient points and essential features of such benchmarks that have been identified in these meetings by experts in big data as well as benchmarking. Two key approaches are now being pursued: one, called BigBench, is based on extending the TPC Decision Support (TPC-DS) benchmark with big data application characteristics; the other, called the Deep Analytics Pipeline, is based on modeling processing that is routinely encountered in real-life big data applications. Both will be discussed.
We conclude with a discussion of a number of future directions for big data benchmarking.
This presentation was held at ISC 2014 on June 26, 2014 in Leipzig, Germany.
More information available at:
http://msrg.org/papers/ISC2014-Rabl
Abstract:
The Workshops on Big Data Benchmarking (http://clds.sdsc.edu/bdbc/workshops), which have been underway since May 2012, have identified a set of characteristics of big data applications that apply to industry as well as scientific application scenarios: pipelines of processing with steps that include aggregation, cleaning, and annotation of large volumes of data; filtering, integration, fusion, subsetting, and compaction of data; and subsequent analysis, including visualization, data mining, predictive analytics and, eventually, decision making. One of the outcomes of the WBDB workshops has been the formation of a Transaction Processing Performance Council subcommittee on Big Data, which is initially defining a Hadoop systems benchmark, TPCx-HS, based on TeraSort. TPCx-HS would be a simple, functional benchmark that would assist in determining basic resiliency and scalability features of large-scale systems. Other proposals are also under active development, including BigBench, which extends the TPC-DS benchmark for big data scenarios; the Big Decision Benchmark from HP; HiBench from Intel; and the Deep Analytics Pipeline (DAP), which defines a sequence of end-to-end processing steps consisting of some of the operations mentioned above. Pipeline benchmarks reveal the need for different processing modalities and system characteristics at different steps in the pipeline. For example, early processing steps may process very large volumes of data and may benefit from a Hadoop/MapReduce style of computing, while later steps may operate on more structured data and may require, say, SMP-style architectures or very large memory systems. This talk will provide an overview of these benchmark activities and discuss opportunities for collaboration and future work with industry partners.
High Performance Data Analytics with Java on Large Multicore HPC Clusters - Saliya Ekanayake
Within the last few years, there have been significant contributions to Java-based big data frameworks and libraries such as Apache Hadoop, Spark, and Storm. While these systems are rich in interoperability and features, developing high performance big data analytics applications is challenging. Moreover, the study of performance characteristics and high performance optimizations for these applications is lacking in the literature. By contrast, these features are well documented in the High Performance Computing (HPC) domain, and some of those techniques have potential performance benefits in the big data domain as well. This paper presents the implementation of a high performance big data analytics library - SPIDAL Java - with a comprehensive discussion of five performance challenges, their solutions, and speedup results. SPIDAL Java captures a class of global machine learning applications with significant computation and communication that can serve as a yardstick for studying performance bottlenecks in Java big data analytics. The five challenges presented here are the cost of intra-node messaging, inefficient cache utilization, performance costs with threads, the overhead of garbage collection, and the costs of heap-allocated objects. SPIDAL Java presents solutions to these challenges and demonstrates significant performance gains and scalability when running on up to 3072 cores on one of the latest Intel Haswell-based multicore clusters.
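Two of the challenges above, garbage-collection overhead and the cost of heap-allocated objects, come down to keeping boxed numeric types off hot paths. The sketch below is a minimal, hypothetical illustration of that general technique; the class and method names are invented for this example and are not SPIDAL Java's actual API. A primitive `double[]` is a single contiguous allocation with no per-element GC work, whereas a `Double[]` allocates one traceable heap object per element and unboxes on every access.

```java
// Illustrative only: boxed objects vs. primitive arrays for large numeric state.
public class HeapVsPrimitive {
    // Boxed: each Double is a separate heap object the garbage collector
    // must allocate, trace, and eventually reclaim; every read unboxes.
    static double sumBoxed(Double[] xs) {
        double s = 0.0;
        for (Double x : xs) s += x;   // implicit unboxing on each element
        return s;
    }

    // Primitive: one contiguous allocation, cache-friendly, no per-element
    // GC pressure - the pattern large Java analytics codes tend to prefer.
    static double sumPrimitive(double[] xs) {
        double s = 0.0;
        for (double x : xs) s += x;
        return s;
    }
}
```

Both methods compute the same sum; the difference is purely in allocation behavior, which is what matters at the scale of millions of elements per core.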
TPCx-HS is the first vendor-neutral benchmark focused on big data systems – which have become a critical part of the enterprise IT ecosystem.
Watch the video presentation: http://wp.me/p3RLHQ-cLY
Learn more: http://www.tpc.org/tpcx-hs
High Performance Processing of Streaming Data - Geoffrey Fox
Describes two parallel robot planning algorithms implemented with Apache Storm on OpenStack: SLAM (Simultaneous Localization and Mapping) and collision avoidance. Performance (response time) is studied and improved as an example of the HPC-ABDS (High Performance Computing enhanced Apache Big Data Software Stack) concept.
C2MON - A highly scalable monitoring platform for Big Data scenarios @CERN by... - J On The Beach
Developing reliable data acquisition, processing, and control modules for mission-critical systems - such as those that run at CERN - has always been challenging. As both data volumes and rates increase, non-functional requirements such as performance, availability, and maintainability are becoming more important than ever. C2MON is a modular open source Java framework for realising highly available, large industrial monitoring and control solutions. It was initially developed for CERN's demanding infrastructure monitoring needs and is based on more than 10 years of experience with the Technical Infrastructure Monitoring (TIM) systems at CERN. Combining maintainability and high availability within a portable architecture is the focus of this work. Making use of standard Java libraries for in-memory data management, clustering, and data persistence, the platform becomes interesting for many Big Data scenarios.
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes... - Geoffrey Fox
“Next Generation Grid – HPC Cloud” proposes a toolkit capturing current capabilities of Apache Hadoop, Spark, Flink, and Heron as well as MPI and Asynchronous Many Task systems from HPC. This supports a Cloud-HPC-Edge (Fog, Device) Function-as-a-Service architecture. Note that this "new grid" is focused on data and IoT, not computing. It uses interoperable common abstractions but multiple polymorphic implementations.
Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain Project - inside-BigData.com
Adrian Tate from Cray and Stig Telfer from StackHPC gave this talk at the 2018 Swiss HPC Conference.
“The Human Brain Project is an EU flagship project that seeks to “provide researchers worldwide with tools and mathematical models for sharing and analysing large brain data they need for understanding how the human brain works.” The HBP encompasses massively parallel applications in neuro-simulation, with advanced computational and data movement requirements well beyond current technological capabilities. It will therefore drive innovation in HPC. As part of the R&D co-design, the HBP sub-contracted Cray to develop a pilot next-generation supercomputer system and software stack to address cutting-edge brain simulation and analysis activities. This system features a novel memory/storage hierarchy consisting of new memories, NVRAM, and SSDs, accessed through Ceph over the Intel Omni-Path interconnect.
This talk will describe how Cray, StackHPC and the HBP co-designed a next-generation storage system based on Ceph, exploiting complex memory hierarchies and enabling next-generation mixed workload execution. We will describe the challenges, show performance data and detail the ways that a similar storage setup may be used in HPC systems of the future.”
Watch the video: https://insidehpc.com/2018/04/ceph-brain-storage-data-movement-supporting-human-brain-project/
Learn more: http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Matching Data Intensive Applications and Hardware/Software Architectures - Geoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However the same is not so true for data intensive problems even though commercial clouds presumably devote more resources to data analytics than supercomputers devote to simulations. We try to establish some principles that allow one to compare data intensive architectures and decide which applications fit which machines and which software.
We use a sample of over 50 big data applications to identify characteristics of data intensive applications and propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks. We consider hardware from clouds to HPC. Our software analysis builds on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We illustrate issues with examples including kernels like clustering, and multi-dimensional scaling; cyberphysical systems; databases; and variants of image processing from beam lines, Facebook and deep-learning.
Materials Data Facility: Streamlined and automated data sharing, discovery, ... - Ian Foster
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
An effective classification approach for big data with parallel generalized H... - riyaniaes
Advancements in information technology are contributing to the excessive rate of big data generation. Big data refers to datasets that are huge in volume and consume much time and space to process and transmit using the available resources; it also covers data in both unstructured and structured formats. Many agencies are currently funding research on big data analytics owing to the failure of existing data processing techniques to handle the rate at which big data is generated. This paper presents an efficient classification and reduction technique for big data based on the parallel generalized Hebbian algorithm (GHA), one of the commonly used principal component analysis (PCA) neural network (NN) learning algorithms. The new method proposed in this study was compared to existing methods to demonstrate its capabilities in reducing the dimensionality of big data. The proposed method is implemented on the Spark Radoop platform.
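The generalized Hebbian algorithm mentioned above (also known as Sanger's rule) updates a set of weight vectors so that, over many samples, they converge to the leading principal components of the input stream. The following is a minimal, self-contained sketch of one GHA update step, assuming zero-mean inputs; it illustrates the textbook rule only, not the paper's parallel Spark/Radoop implementation, and the class name is invented for this example.

```java
// One update step of Sanger's rule (generalized Hebbian algorithm):
//   dw_ij = lr * y_i * (x_j - sum_{m<=i} y_m * w_mj)
// where y = W x are the component outputs for the current sample x.
public class GhaSketch {
    static void ghaUpdate(double[][] w, double[] x, double lr) {
        int k = w.length, d = x.length;

        // Snapshot the weights so every row updates from the same state.
        double[][] w0 = new double[k][];
        for (int i = 0; i < k; i++) w0[i] = w[i].clone();

        // Component outputs y_i = w_i . x
        double[] y = new double[k];
        for (int i = 0; i < k; i++)
            for (int j = 0; j < d; j++)
                y[i] += w0[i][j] * x[j];

        // Hebbian update with Gram-Schmidt-like deflation term.
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < d; j++) {
                double back = 0.0;                       // sum_{m<=i} y_m * w_mj
                for (int m = 0; m <= i; m++) back += y[m] * w0[m][j];
                w[i][j] += lr * y[i] * (x[j] - back);
            }
        }
    }
}
```

A unit weight vector already aligned with the input direction is a fixed point of this rule: the deflation term cancels the Hebbian term exactly, which is how the rows settle onto orthogonal principal components.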
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain... - Big Data Spain
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmadigital.com
Abstract: http://www.bigdataspain.org/program/thu/slot-7.html
The Matsu Project - Open Source Software for Processing Satellite Imagery Data - Robert Grossman
The Matsu Project is an Open Cloud Consortium project that is developing open source software for processing satellite imagery data using Hadoop, OpenStack and R.
The Aspera solution provides telecommunication companies with a complete portfolio of file transfer, distribution, synchronization, and automation software to:
- Systematically achieve maximum transfer speeds for big data, regardless of network conditions and transfer distance
- Centrally manage, monitor, and control transfer activity, server infrastructure, and bandwidth utilisation
- Automate file transfer workflows and schedule transfer activity and bandwidth availability
- Deliver uncompromising security and reliability
Strong reversal of the lung fibrosis disease signature by autotaxin inhibitor... - Maté Ongenaert
Strong reversal of the lung fibrosis disease signature by autotaxin inhibitor GLPG1690 in a mouse model for IPF
Maté Ongenaert (Mechelen, Belgium), Maté Ongenaert, Sonia Dupont, Roland Blanqué, Reginald Brys, Ellen van der Aar, Bertrand Heckmann
ERS - European Respiratory Society International Congress 2016. Session: Therapeutic horizons: novel targets and pharmacological models
Background and objectives
GLPG1690 is a novel potent autotaxin (ATX) inhibitor shown to be efficacious in the mouse bleomycin (BLM) lung fibrosis model. Here, we analyze the impact of GLPG1690 on the gene expression signature in mouse fibrotic lung tissue.
Methods
Lung fibrosis was induced by intranasal administration of BLM. Animals were treated with GLPG1690 or vehicle. The whole superior right lung was used for RNA extraction. Full transcriptome analysis was performed using the Agilent SurePrint G3 mouse chip. Analysis was performed using empirical Bayes methods and linear models. Public human IPF expression data were re-analyzed.
Results
GLPG1690 strongly reduced lung fibrosis, as shown by reductions in Ashcroft scores and collagen content. Microarray analysis of the lungs revealed that GLPG1690 strongly reversed the impact of BLM on gene expression (367 out of the 2375 probes). As GLPG1690 treatment affects 395 probes in total, this treatment effect is highly relevant in the model. Gene clusters affected by BLM treatment and reverted by GLPG1690 are related to the extracellular matrix (such as Tnc and Spp1), collagen (Col3a1), and cytokines/chemokines (Cxcl12). Several of the affected genes are known to be involved in the development or progression of lung fibrosis in IPF patients.
Conclusions
These data provide further mechanistic understanding of the efficacy of ATX inhibition in a pre-clinical lung fibrosis model, highlighting a role for extracellular matrix and inflammation biology. These data strongly suggest that GLPG1690 may be beneficial in treating IPF patients and support its evaluation in a clinical study.
IBM Managed File Transfer Portfolio - IBM Impact 2014 - Leif Davidsen
The data held in files can represent some of the most valuable assets that a business has. If this data is trapped in a file on a remote system, it loses value. IBM has recently updated its portfolio of Managed File Transfer offerings, simplifying choice for the business user and offering better value for money, while extending access to this valuable data. Hear about different managed file transfer use cases and suggested solutions.
Open human genome data - presentation at the annual TKT/CLIDP doctoral programme symposium 2016 "Open up! – Open Data and Open Access" of the University of Turku.
New Progress in Pyrosequencing for DNA Methylation - QIAGEN
Pyrosequencing is a highly flexible technology that lets you rapidly and quantitatively analyze short- to medium-length sequences with high accuracy. The real-time, high-resolution sequence output makes the technology highly suitable for applications including complex mutation analysis, microbial identification and DNA methylation quantification.
The main bottleneck in Pyrosequencing has been limited sequence length, which is critical for some applications. Our new technology, software, and chemistry overcome this bottleneck and give sequence reads that are typically twice as long as those from previous PyroMark systems. The new PyroMark Q24 Advanced system also reduces background noise, improving quantification even at sites distant from the sequencing start. The new system is ideal for applications requiring analysis of longer sequences, such as DNA methylation analysis in epigenetic research, frequency determination in mutation analysis, and various de novo sequencing applications.
In this presentation, we will discuss the following applications and technology improvements:
• DNA methylation analysis at single base resolution at CpG and CpN sites
• Improved quantification of sequence variations at any sequence position
• Easy and improved base calling functionality
Is your organization considering migrating an existing data center to AWS to reduce cost and improve the reliability, security, and operational performance of your IT operations? If so, join us for a webinar on how to plan and execute your migration to the cloud, from classifying applications and assessing application needs to identifying target applications and other migration strategies.
PCR - From Setup to Cleanup: A Beginner's Guide with Useful Tips and Tricks -... QIAGEN
This End-Point PCR Beginner's Guide will not only give you a comprehensive overview of tools and techniques to help you get the most out of your samples, but also give you information on dedicated solutions and complete workflows for multiplex PCR and PCR fragment analysis.
Microbiome Profiling with the Microbial Genomics Pro Suite - QIAGEN
In this slide deck, we introduce the scientist-friendly Microbial Genomics Pro Suite offering workflows optimized for microbiome profiling, microbial typing and outbreak analysis. The workflows and tools for microbial genomics introduced with this software package are further extending the comprehensive set of genomics, transcriptomics and epigenomics analysis solutions that researchers know from CLC Genomics Workbench.
Clinical Metagenomics for Rapid Detection of Enteric Pathogens and Characteri... - QIAGEN
High-throughput sequencing, combined with high-resolution metagenomic analysis, provides a powerful diagnostic tool for clinical management of enteric disease. Forty-five patient samples of known and unknown disease etiology and 20 samples from healthy individuals were subjected to next-generation sequencing. Subsequent metagenomic analysis identified all microorganisms (bacteria, viruses, fungi and parasites) in the samples, including the expected pathogens in the samples of known etiology. Multiple pathogens were detected in individual samples, providing evidence for polymicrobial infection. Patients were clearly differentiated from healthy individuals based on microorganism abundance and diversity. The speed, accuracy and actionable features of CosmosID bioinformatics and curated GenBook® databases, implemented in the QIAGEN Microbial Genomics Pro Suite, and the functional analysis, leveraging the QIAGEN functional metagenomics workflow, provide a powerful tool contributing to the revolution in clinical diagnostics, prophylactics and therapeutics that is now in progress globally.
Strategies for Hyperscale Data Centers to Approach Net Zero - Mattias Ganslandt
Ganslandt and Bergquist presentation 22 February 2017 at Green Data Center Conference in San Diego and Di Data Center Conference in Stockholm #GDCON #didatacenter #cloudcomputing
This slide deck details two comprehensive informatics solutions, the Biomedical Genomics Workbench and Ingenuity Knowledge Base Variant Analysis platforms. We show the intuitive user interface of CLC Cancer Research Workbench and demonstrate how the rich biological content from the Ingenuity Knowledge Base helps you rapidly identify critical variants in your samples.
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio - Alluxio, Inc.
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data from separate storage such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced, without the network becoming an I/O bottleneck.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
In this talk, we will go over:
- What is Analytics Zoo and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
Network Engineering for High Speed Data Sharing - Globus
These slides were presented by ESnet's Eli Dart at the AGU Fall Meeting 2018 in a session titled "Scalable Data Management Practices in Earth Sciences" convened by Ian Foster, Globus co-founder and director of Argonne's data science and learning division.
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an... - Denodo
Watch Pablo's session from Fast Data Strategy on-demand here: https://goo.gl/1aEBo8
The tide is changing for analytics architectures. Traditional approaches, from the data warehouse to the data lake, implicitly assume that all relevant data can be stored in a single, centralized repository. But this approach is slow and expensive, and sometimes not even feasible: some data sources are too big to be replicated, and data is often too widely distributed, as with cloud data sources, for a “full centralization” strategy to succeed.
Watch this session to learn more about:
• Modern data architectures
• Why logical architectures are the best option when integrating big data
• How Denodo’s parallel in-memory capabilities with dynamic query optimization redefine analytics architectures
This is a talk I gave at a Northwestern University - Complete Genomics Workshop on April 21, 2011 about using clouds to support research in genomics and related areas.
Cloudgene - A MapReduce based Workflow Management System - Lukas Forer
Cloudgene is a freely available platform that improves the usability of MapReduce programs by providing a graphical user interface for execution, data import and export, and reproducibility of workflows on in-house clusters (private clouds) and rented clusters (public clouds).
Have you ever wondered how search works while visiting an e-commerce site, internal website, or searching through other types of online resources? Look no further than this informative session on the ways that taxonomies help end-users navigate the internet! Hear from taxonomists and other information professionals who have first-hand experience creating and working with taxonomies that aid in navigation, search, and discovery across a range of disciplines.
This presentation, created by Syed Faiz ul Hassan, explores the profound influence of media on public perception and behavior. It delves into the evolution of media from oral traditions to modern digital and social media platforms. Key topics include the role of media in information propagation, socialization, crisis awareness, globalization, and education. The presentation also examines media influence through agenda setting, propaganda, and manipulative techniques used by advertisers and marketers. Furthermore, it highlights the impact of surveillance enabled by media technologies on personal behavior and preferences. Through this comprehensive overview, the presentation aims to shed light on how media shapes collective consciousness and public opinion.
Acorn Recovery: Restore IT infra within minutesIP ServerOne
Introducing Acorn Recovery as a Service, a simple, fast, and secure managed disaster recovery (DRaaS) offering by IP ServerOne: a DR solution that helps restore your IT infra within minutes.
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...Orkestra
UIIN Conference, Madrid, 27-29 May 2024
James Wilson, Orkestra and Deusto Business School
Emily Wise, Lund University
Madeline Smith, The Glasgow School of Art
0x01 - Newton's Third Law: Static vs. Dynamic Abusers - OWASP Beja
If you offer a service on the web, odds are that someone will abuse it. Be it an API, a SaaS, a PaaS, or even a static website, someone somewhere will try to figure out a way to use it for their own ends. In this talk we'll compare measures that are effective against static attackers and how to battle a dynamic attacker who adapts to your counter-measures.
About the Speaker
===============
Diogo Sousa, Engineering Manager @ Canonical
An opinionated individual with an interest in cryptography and its intersection with secure software development.
This presentation by Morris Kleiner (University of Minnesota) was made during the discussion “Competition and Regulation in Professions and Occupations” held at the Working Party No. 2 on Competition and Regulation on 10 June 2024. More papers and presentations on the topic can be found at oe.cd/crps.
This presentation was uploaded with the author’s consent.
Best practices at BGI for the Challenges in the Era of Big Genomics Data
1. Xing Xu, Ph.D., Director of Cloud Computing Product
Challenges in the Era of Big Genomic Data and Our Practices at BGI
2. Topics for Today
- About BGI
- Challenges and Solutions
  - Data transfer
  - Cloud Computing
  - Computational Algorithms and Infrastructure
  - Data Storage
3. BGI
- The world's largest genome sequencing center
- Started with the Human Genome Project in 1999 with only a few sequencers
- Now more than 150 sequencers, 6 TB/day sequencing throughput

MODEL:          ABI 3730XL   Roche 454   ABI SOLiD 4   Solexa GA IIx   Illumina HiSeq 2000
INSTALLATIONS:  16           1           27            6               135
4. BGI
- The world's largest genome sequencing center
- The largest computing and storage center for genomics in China
  - 20,000+ CPU cores
  - 19 NVIDIA GPUs
  - 220+ Tflops peak performance
  - 17 PB data storage
  - Storage and computation capability have increased 10,000-fold
  - Still increasing ...
5. BGI
- The world's largest genome sequencing center
- The largest computing and storage center for genomics in China
- One of the world's leading research institutes in genomics
Since 2007:
- 253 papers in high-impact journals, including 47 in Nature and its sub-journals, 9 in Science, 2 in Cell, and 1 in NEJM, with 42 first and/or corresponding authorships
- 369 patent applications
- 254 software authorships
6. BGI
- The world's largest genome sequencing center
- The largest computing and storage center for genomics in China
- One of the world's leading research institutes in genomics
BGI has the sequencing capacity, hardware resources and software proficiency to be one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis and data interpretation.
9. Challenges for Handling Big Data
- Exponential growth of data volume
- Complicated data analysis processes
- Widely distributed data
(Images from omicsmaps.com)
10. Challenges and Solutions
- Data transfer
- Cloud Computing
- Computational Algorithms and Infrastructure
- Data Management
11. Solutions for data transfer
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
12. Solutions for data transfer
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
13. Solutions for data transfer: High speed data transfer
Demonstrated 10 Gbps ultra high speed data exchange with UC Davis and NCBI in June 2012.
14. Solutions for data transfer: High speed data transfer
Demonstrated 10 Gbps ultra high speed data exchange with UC Davis and NCBI in June 2012.
A 24 GB file was transferred from China to the US in 30 seconds (~8 Gbit/s).
- Right software: Aspera fasp data transfer protocol
- Right infrastructure: 10 Gb link between the US and China
- Right technology: RAM disk, IPv6
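The transfer figures above can be sanity-checked with simple arithmetic. This is a sketch of the average rate over the whole transfer; the ~8 Gbit/s quoted on the slide is presumably closer to the peak rate on the 10 Gb link:

```python
def throughput_gbps(size_gb: float, seconds: float) -> float:
    """Average throughput in gigabits per second for a transfer
    of size_gb gigabytes completed in the given number of seconds."""
    return size_gb * 8 / seconds

# 24 GB moved in 30 s averages 6.4 Gbit/s on a 10 Gbit/s link
print(throughput_gbps(24, 30))  # 6.4
```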
15. Solutions for data transfer
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
[Diagram: an Aspera Server at BGI serving multiple Aspera Clients (client software is free). Drawbacks noted: software license cost, expensive physical bandwidth, a bottleneck on the client side, and not a good solution for sharing.]
16. Solutions for data transfer
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don't move the data (Cloud Computing)
17. Solutions for cloud
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don't move the data (Cloud Computing)
Cloud Computing
- EasyGenomics, a Software as a Service (SaaS) platform for NGS data analysis
18. EasyGenomics™
EasyGenomics is a Software as a Service (SaaS) bioinformatics platform for research and applications.
[Diagram: a web portal with a simple UI, connected via a high speed connection to algorithms, workflows and reports; computational resources; and databases with data management.]
19. A typical user case
[Diagram: a typical use case combining EasyGenomics with customers' local computational resources. Double-line items are customers' data or resources; single-line items are results and data within BGI and the EasyGenomics platform. Arrow widths represent relative data flow sizes (not to scale).]
20. Bioinformatics Workflow
Four steps: Upload, Create a Sample, Perform Analyses, Download Results
Algorithms: carefully chosen, tested and optimized
Workflows: Whole Genome Resequencing, Exome Resequencing, RNA-Seq, small RNA, ncRNA, and De novo Assembly
26. What’s new?
- An internal version of EasyGenomics is running automatically as a production system.
- It integrates the new data delivery portal of the sequencing service.
- Aspera fasp download
- Accessible to all workflows on EasyGenomics
27. You can choose to deliver data to the EasyGenomics platform
[Screenshot: configuration file]
30. Solutions for cloud
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don't move the data (Cloud Computing)
Cloud Computing
- EasyGenomics, a SaaS platform for NGS data analysis
- Two paths for the future cloud solution
31. Two paths for the future cloud solution
Software as a Service (SaaS) to Platform as a Service (PaaS)
- To give flexibility to research users:
  - Add their own tools (any tools)
  - Integrate their own workflows (different combinations of modules)
One-click SaaS solution
- To give an automated solution to clinical users:
  - Automated solution for repetitive work
  - Fulfills very specific functions
32. Solutions for Algorithm and Infrastructure
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don't move the data (Cloud Computing)
Cloud Computing
- EasyGenomics, a SaaS platform for NGS data analysis
- Two paths for the future cloud solution
Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce: Hecate (de novo assembly tool), Gaea (resequencing pipeline)
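The Hadoop/MapReduce approach listed above can be illustrated with a Hadoop Streaming-style mapper/reducer pair (a minimal sketch, not BGI's actual Gaea code): the mapper emits `chromosome<TAB>1` for each tab-separated alignment record, and the reducer sums counts per chromosome. Because Hadoop Streaming simply pipes sorted text between stages, the same functions can be exercised locally:

```python
from itertools import groupby

def mapper(lines):
    """Emit 'chrom\t1' for each tab-separated alignment record (read, chrom, pos)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield f"{fields[1]}\t1"

def reducer(lines):
    """Sum counts per chromosome. Input must be sorted by key, which the
    Hadoop shuffle phase guarantees between the map and reduce stages."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for chrom, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{chrom}\t{sum(int(count) for _, count in group)}"

# Local simulation of map -> shuffle (sort) -> reduce:
reads = ["r1\tchr1\t100", "r2\tchr1\t200", "r3\tchr2\t50"]
print(list(reducer(sorted(mapper(reads)))))  # ['chr1\t2', 'chr2\t1']
```

In a real deployment the two functions would be wired to stdin/stdout and passed to `hadoop jar hadoop-streaming.jar -mapper ... -reducer ...`.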
36. Fast and Scalable
[Chart: analysis time for a human 60X whole genome re-sequencing run. Old pipeline: two weeks; cloud-based pipeline: within 15 hrs (120 cores).]
- The Hadoop implementation provides great scalability.
- Simply by providing more resources, the analysis can finish much faster.
37. SOAP-GaeaAlignment (1 human sample from the 1000 Genomes Project)

Software             Mapping Rate   Confident Mapping Rate (MAPQ>=10)
Stampy               85.93%         70.00%
SOAP2                79.14%         79.14%
Novoalign            82.53%         79.74%
BWA                  91.54%         84.78%
Bowtie               81.15%         81.15%
SOAP-GaeaAlignment   91.75%         85.20%

It's not only FAST, but also ACCURATE.
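The two columns in the table follow from an aligner's per-read mapping qualities. A hypothetical helper (names are ours, not from the slide) showing the metric definitions, mapped reads over total, and reads at MAPQ >= 10 over total:

```python
def mapping_rates(mapqs, threshold=10):
    """Given per-read MAPQ values (None for unmapped reads), return
    (mapping rate, confident mapping rate at MAPQ >= threshold) in percent."""
    total = len(mapqs)
    mapped = sum(1 for q in mapqs if q is not None)
    confident = sum(1 for q in mapqs if q is not None and q >= threshold)
    return 100 * mapped / total, 100 * confident / total

# 8 of 10 reads mapped, 6 of them with MAPQ >= 10
qs = [None, None, 5, 9, 10, 20, 30, 40, 50, 60]
print(mapping_rates(qs))  # (80.0, 60.0)
```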
38. SOAP-Hecate: Distributed de novo Genome Assembly
Assembly steps:
- Constructing the de Bruijn graph
- Solving tiny repeats
- Merging bubbles
- Scaffolding
- Merging contigs
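The first step above, de Bruijn graph construction, can be sketched in a few lines (a toy single-machine version; Hecate's contribution is distributing this and the subsequent graph simplification over a cluster). Each read is broken into k-mers, and each k-mer contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each k-mer in a
    read adds an edge from its prefix (k-1)-mer to its suffix (k-1)-mer."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

print(de_bruijn(["ACGT"], 3))  # {'AC': ['CG'], 'CG': ['GT']}
```

Assembly then amounts to walking paths in this graph, with the repeat-solving and bubble-merging steps cleaning up ambiguous branches.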
39. Scalability and Performance
Pipeline: contig extension, scaffolding, gap closing.

                 SOAPdenovo v2   SOAP-Hecate v2.5 (84 cores)   SOAP-Hecate v2.5 (180 cores)
Data Size        670 GB          670 GB                        670 GB
No. of Servers   1               7                             15
Time             59 hours        59 hours                      38 hours
Memory           400 GB x 1      24 GB x 7                     24 GB x 15
Mode             Centralized     Distributed                   Distributed
(*80X human whole genome)

SOAP-Hecate is scalable and uses much less memory.

Scaffold N50 (tested on simulated data from Assemblathon 1; Earl, Bradnam et al.):
SOAP-Hecate       26,570,829
SOAPdenovo        117,000
ALLPATH           211,000
Phusion2, phrap   495,000
Meraculous        486,000
ABySS             144,300
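The Scaffold N50 metric reported above has a standard definition (this is generic code, not BGI's): sort lengths in descending order and take the length at which the cumulative sum first reaches half the total assembly size.

```python
def n50(lengths):
    """N50: the length L such that scaffolds/contigs of length >= L
    together cover at least half of the total assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

print(n50([400, 300, 200, 100]))  # 300 (400 + 300 covers half of 1000)
```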
40. Solutions for Algorithms
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don't move the data (Cloud Computing)
Cloud Computing
- EasyGenomics, a SaaS platform for NGS data analysis
- Two paths for the future cloud solution
Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce: Hecate (de novo assembly tool), Gaea (resequencing pipeline)
- GPU based acceleration: SOAP3 (aligner), GSNP (SNP caller), GAMA (population genetics tool)
41. SOAP3: ~20X speed up from SOAP2
SOAP → SOAP2 (2008): 20-30x faster → SOAP3 (2011, GPU version): a further 10-30x faster.
[Charts comparing SOAP2 and SOAP3 on human and zebrafish data. Total time (seconds): SOAP2 1893.45 (human) / 10671.39 (zebrafish) vs. SOAP3 211.53 (human) / 819.81 (zebrafish). Speedup: 14.12 (human), 14.6 (zebrafish). Alignment ratio (%): SOAP2 84.2 (human) / 64.49 (zebrafish) vs. SOAP3 88.29 (human) / 76.55 (zebrafish).]
Collaboration with the University of Hong Kong.
43. Solutions for Data Management
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don't move the data (Cloud Computing)
Cloud Computing
- EasyGenomics, a SaaS platform for NGS data analysis
- Two paths for the future cloud solution
Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce
- GPU based acceleration
Data Management
- Data management in BGI
44. Paradigm Shift
Traditional Model:
- Business determines what question to ask
- IT structures the data to answer that question
Big Data Model:
- IT delivers a platform to enable creative discovery
- Business explores what questions could be asked
46. BGI Data Pyramid
- iRODS (Data): data preservation, data retrieval, data sharing
- Database (Information): BGI-SNP, BGI-SV, BGI-GaP; Disease: HGVD/PMRD
- Data Mining (Knowledge): systems biology, drug discovery
- Health/Clinical App (Decision): diagnosis of genetic diseases, drug of choice
48. iRODS - integrated Rule-Oriented Data System
Access data with a web-based browser, the iRODS GUI, or command line clients. (renci.org)
49. Data Flow - iRODS
[Diagram: sequencer → raw data → data analysis → analyzed data → data warehousing → personalized analysis → clinical diagnosis, with metadata from LIMS, public resources, the knowledge base, and BGI-DB (variant/gene, disease, drug).]
iRODS-based Data Management
- Contents: raw data, analyzed data and related metadata
- Data backup
- Fully integrated with LIMS
- Able to search and access any data according to metadata from the BGI data standard, e.g. project, sample, cohort, phenotype, QC, etc.
- Federation: integrates separate iRODS zones
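The metadata-driven search described above can be modeled in miniature (a toy in-memory sketch, not the iRODS API; the paths and attribute names are made up for illustration): each data object carries attribute-value metadata, and a query keeps only the objects matching every criterion.

```python
def query(catalog, **criteria):
    """Return paths whose metadata matches all attribute=value criteria,
    mimicking an iRODS-style metadata query (e.g. by project, sample, QC)."""
    return [path for path, meta in catalog.items()
            if all(meta.get(k) == v for k, v in criteria.items())]

catalog = {
    "/bgi/raw/s1.fastq": {"project": "P1", "sample": "S1", "qc": "pass"},
    "/bgi/raw/s2.fastq": {"project": "P1", "sample": "S2", "qc": "fail"},
    "/bgi/raw/s3.fastq": {"project": "P2", "sample": "S3", "qc": "pass"},
}
print(query(catalog, project="P1", qc="pass"))  # ['/bgi/raw/s1.fastq']
```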
50. Data Flow - BGI-DB
[Diagram: the same data flow as before, highlighting BGI-DB (variant/gene, disease, drug).]
BGI-DB
- A locus-specific database (LSDB) for all variants identified by BGI
- Manages all basic information generated from data analysis pipelines
- Links all detailed information about individual samples to each variant
- Easy to query information from samples with a certain commonality (such as same phenotype, same cohort, etc.)
- Provides the raw information for further data mining steps
51. Data Flow - BGI-DW & BGI-KB
[Diagram: the same data flow as before, highlighting data warehousing and the knowledge base.]
BGI Data Warehousing & Knowledge Base
- BGI data warehousing (BGI-DW) consists of a series of secondary databases related to variants, diseases and drugs
- The BGI knowledge base (BGI-KB) stores and manages the knowledge obtained through mining BGI-DB, BGI-DW and other public resources
- Periodically and automatically updated
- Provides APIs for bioinformaticians to query the information and generate individualized reports
52. Data Flow - A Success Story: Diagnosis of Monogenic Disease
[Diagram: the same data flow as before, annotated with the steps below.]
- Group samples into cohorts based on their phenotypes
- Calculate variant frequencies from certain cohorts and save them into the allele frequency database
- Query the allele frequency database to filter out common variants and identify disease-causal variants
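The filtering step in this success story can be sketched as follows (a hypothetical simplification of the described workflow; the variant identifiers are invented): variants whose cohort allele frequency exceeds a rarity threshold are discarded, leaving rare candidates for a monogenic disease.

```python
def candidate_variants(patient_variants, cohort_freqs, max_freq=0.01):
    """Keep only variants rare in the cohort (frequency <= max_freq).
    Variants absent from the frequency database are treated as novel (freq 0)."""
    return [v for v in patient_variants
            if cohort_freqs.get(v, 0.0) <= max_freq]

freqs = {"chr1:12345:A>T": 0.35, "chr7:5500:G>C": 0.002}
patient = ["chr1:12345:A>T", "chr7:5500:G>C", "chrX:900:C>T"]
print(candidate_variants(patient, freqs))  # ['chr7:5500:G>C', 'chrX:900:C>T']
```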
53. Summary of Our Practice in IT Infrastructure
Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don't move the data (Cloud Computing)
Cloud Computing
- EasyGenomics, a SaaS platform for NGS data analysis
- Two paths for the future cloud solution
Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce
- GPU based acceleration
Data Management
- Using the iRODS file system to manage big data
54. Acknowledgement
Development Team
- Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.
- Flex Lab: Yan Li (Hecate), Zhi Zhang (GAEA, iRODS), etc.
- GPU Lab: Bingqiang Wang, etc.
Test & QA Team: Xin Guan, Jingjuan Liu, etc.
PMO & IT Operations: Wenjun Zeng, Litong Lai, Jing Tian, etc.
Product Team: Xing Xu, Jing Guo, Fang Fang, etc.
Other BGI Teams
Collaborators: University of Hong Kong (HKU), Hong Kong University of Science and Technology (HKUST), NVIDIA, Aspera, RENCI, Tianjin Supercomputing Center
Editor's Notes
At the heart of EasyGenomics is our bioinformatics core: 5 workflows with carefully chosen algorithms, tested and optimized. Filtering, QC reports, and alignment, along with other supporting features.
Then we chose Hadoop. The Hadoop streaming framework enabled us to quickly parallelize commonly used algorithms. The distributed storage system HDFS ensured the safety of our data. In addition, the MapReduce framework enabled us to further improve the accuracy and performance of our modules.
We integrated all the standard re-sequencing steps onto the cloud-based framework, which can greatly accelerate the data analysis.
Besides parallelizing existing algorithms, we also designed new approaches based on the distributed frameworks. Benefiting from their capacity for big data processing, we developed new models for mapping and variation detection.
This cloud-based pipeline is already applied in cancer and human disease studies within BGI. It can reduce the analysis time from two weeks to two days. Additionally, the figure on the bottom shows the speedups of our applications, which means we can control the analysis time by choosing the size of the cluster. It could be even faster.
This is the comparison on a human sample from the 1000 Genomes Project. Our mapping tool, GaeaAlignment, returns the highest mapping rate.
Hecate distributes the graph simplification algorithm across an entire cluster of compute nodes, turning memory usage into fast distributed I/O work, which enables the assembly.
Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. (Source: Wikipedia) SOAP: first-generation short read alignment tool. SOAP2 (2008): 20 to 30 times faster than SOAP, with less memory; a collaboration between BGI and the University of Hong Kong; compressed indexing: bidirectional BWT (2BWT). SOAP3 (2011): 10 to 30 times faster than SOAP2; a collaboration with the University of Hong Kong; uses the GPU's parallel processing power; CPU memory increased from a few GB to tens of GB; GPU-based indexing: GPU-2BWT.