This presentation was held at ISC 2014 on June 26, 2014 in Leipzig, Germany.
More information available at:
http://msrg.org/papers/ISC2014-Rabl
Abstract:
The Workshops on Big Data Benchmarking (http://clds.sdsc.edu/bdbc/workshops), which have been underway since May 2012, have identified a set of characteristics of big data applications that apply to industry as well as scientific application scenarios, involving pipelines of processing with steps that include aggregation, cleaning, and annotation of large volumes of data; filtering, integration, fusion, subsetting, and compaction of data; and subsequent analysis, including visualization, data mining, predictive analytics and, eventually, decision making. One of the outcomes of the WBDB workshops has been the formation of a Transaction Processing Performance Council (TPC) subcommittee on Big Data, which is initially defining a Hadoop systems benchmark, TPCx-HS, based on Terasort. TPCx-HS would be a simple, functional benchmark that would assist in determining basic resiliency and scalability features of large-scale systems. Other proposals are also actively under development, including BigBench, which extends the TPC-DS benchmark for big data scenarios; the Big Decision Benchmark from HP; HiBench from Intel; and the Deep Analytics Pipeline (DAP), which defines a sequence of end-to-end processing steps consisting of some of the operations mentioned above. Pipeline benchmarks reveal the need for different processing modalities and system characteristics for different steps in the pipeline. For example, early processing steps may process very large volumes of data and may benefit from a Hadoop and MapReduce style of computing, while later steps may operate on more structured data and may require, say, SMP-style architectures or very large memory systems. This talk will provide an overview of these benchmark activities and discuss opportunities for collaboration and future work with industry partners.
TPC-DI - The First Industry Benchmark for Data Integration (Tilmann Rabl)
This presentation was held by Meikel Poess on September 3, 2014 at VLDB 2014 in Hangzhou, China.
Full paper and additional information available at:
http://msrg.org/papers/VLDB2014TPCDI
Abstract:
Historically, the process of synchronizing a decision support system with data from operational systems has been referred to as Extract, Transform, Load (ETL), and the tools supporting such a process have been referred to as ETL tools. Recently, ETL was replaced by the more comprehensive term data integration (DI). DI describes the process of extracting and combining data from a variety of data source formats, transforming that data into a unified data model representation, and loading it into a data store. This is done in the context of a variety of scenarios, such as data acquisition for business intelligence, analytics and data warehousing, but also synchronization of data between operational applications, data migrations and conversions, master data management, enterprise data sharing, and delivery of data services in a service-oriented architecture context, amongst others. With these scenarios relying on up-to-date information, it is critical to implement a highly performing, scalable, and easy-to-maintain data integration system. This is especially important as the complexity, variety, and volume of data are constantly increasing and the performance of data integration systems is becoming critical. Despite the significance of having a highly performing DI system, there has been no industry standard for measuring and comparing their performance. The TPC, acknowledging this void, has released TPC-DI, an innovative benchmark for data integration. This paper motivates the reasons behind its development, describes its main characteristics, including workload, run rules, and metric, and explains key decisions.
TPCx-HS is the first vendor-neutral benchmark focused on big data systems – which have become a critical part of the enterprise IT ecosystem.
Watch the video presentation: http://wp.me/p3RLHQ-cLY
Learn more: http://www.tpc.org/tpcx-hs
Big Data Testing Approach - Rohit Kharabe
This presentation covers:
1) How to perform big data testing
2) Tools that can be used for testing
3) Different validation stages involved
4) Performance testing
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr... (inside-BigData.com)
DK Panda from Ohio State University presented this deck at the Switzerland HPC Conference.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Mem- cached on modern HPC clusters. An overview of RDMA-based designs for multiple com- ponents of Hadoop (HDFS, MapReduce, RPC and HBase), Spark, and Memcached will be presented. Enhanced designs for these components to exploit in-memory technology and parallel file systems (such as Lustre) will be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video presentation: https://www.youtube.com/watch?v=glf2KITDdVs
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
SQL Server Managing Test Data & Stress Testing, January 2011 - Mark Ginnebaugh
A quick look at some of the available functionality for SQL Server developers who have access to Visual Studio 2010 and SQL-Hero.
With Visual Studio 2010 Premium (and, to a degree, Professional) delivering similar capabilities to what was available in the VS 2008 Database Pro Edition, the ability to generate massive amounts of sample data for your database has only become more accessible with time.
Realizing that other tools exist in this space and not all SQL developers use Visual Studio, we’ll also take a look at the third party data generation facility available in SQL-Hero, seeing how we can create thousands (or millions!) of records very quickly using a powerful rules engine, plus automate this process to support continuous integration strategies.
The core idea behind Hadoop is to distribute both the data and user software on individual shards within the cluster. The Bigdata Replay method is drastically different in that it packs user software into batches on a single multicore machine and uses circuit emulation to maximize throughput when bringing data shards for replay. The effect from hotspots, defined as drastically higher access frequency to a small portion of (popular) data, is different in the two platforms. This paper models the difference numerically, but in a relative form, which makes it possible to compare the two platforms.
High Performance Computing and Big Data (Geoffrey Fox)
We propose a hybrid software stack with large-scale data systems for both research and commercial applications, running on the commodity (Apache) Big Data Stack (ABDS) and using High Performance Computing (HPC) enhancements, typically to improve performance. We give several examples taken from bio- and financial informatics.
We look in detail at parallel and distributed run-times including MPI from HPC and Apache Storm, Heron, Spark and Flink from ABDS stressing that one needs to distinguish the different needs of parallel (tightly coupled) and distributed (loosely coupled) systems.
We also study "Java Grande" or the principles to use to allow Java codes to perform as fast as those written in more traditional HPC languages. We also note the differences between capacity (individual jobs using many nodes) and capability (lots of independent jobs) computing.
We discuss how this HPC-ABDS concept allows one to discuss convergence of Big Data, Big Simulation, Cloud and HPC Systems. See http://hpc-abds.org/kaleidoscope/
Vikram Andem Big Data Strategy @ IATA Technology Roadmap (IT Strategy Group)
Vikram Andem, Senior Manager, United Airlines: A Case for a Big Data Program and Strategy @ IATA Technology Roadmap 2014, October 13, 2014, Montréal, Canada
Data Gloveboxes: A Philosophy of Data Science Data Security (DataWorks Summit)
Data scientists often have access to very sensitive material: data! Today's data scientists need a way to interact with toxic data, where spilling even a small amount could be destructive to a company. Securing compute clusters to be like the nuclear glove boxes of old is one technique to limit data exfiltration and ensure data production is regularized, reliable, and secure.
This talk will cover the philosophy and implementation of:
Data Dropbox: data goes in blindly but can be verified via checksums - data directionality is enforced; using HDFS is a model and the state of HBase is discussed.
Data Glovebox: one can manipulate data as desired but cannot exfiltrate except via very specific, controlled processes; the Oozie Git action is a step in this direction.
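As a rough illustration of the checksum idea behind the Data Dropbox above, here is a minimal Python sketch of verifying blind drops against a manifest; the paths and manifest format are hypothetical, not the implementation from the talk:

import hashlib

def sha256_of(path):
    # Stream the file in 1 MiB chunks so large drops can be verified.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_drop(manifest):
    # manifest maps a dropped file's path to its expected hex digest.
    # Files whose digests do not match are returned for quarantine;
    # only verified files would be released for processing.
    return [path for path, want in manifest.items()
            if sha256_of(path) != want]

# Usage (hypothetical drop directory and digest):
# bad = verify_drop({"/dropbox/incoming/part-0000.avro": "<expected digest>"})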
Big Data HPC Convergence and a bunch of other things (Geoffrey Fox)
This talk supports the Ph.D. in Computational & Data Enabled Science & Engineering at Jackson State University. It describes related educational activities at Indiana University, the Big Data phenomena, jobs and HPC and Big Data computations. It then describes how HPC and Big Data can be converged into a single theme.
How to use flash drives with Apache Hadoop 3.x: Real world use cases and proo... (DataWorks Summit)
Apache Hadoop 3.x ushers in major architectural advances such as erasure coding in HDFS, containerized workload flexibility, GPU resource pooling and a litany of other features. These enhancements help drive real benefits when combined with high-speed, high-capacity solid state drives (SSDs).
Micron is a user of Apache Hadoop as well as an innovator in next-gen IT architecture, pushing the envelope on flash storage with the latest 3D NAND. Micron labs have benchmarks showing that adding a single SSD to existing HDD-based cluster nodes can deliver 200% faster Hadoop, at a fraction of the cost of new nodes.
In this session, Micron and Hortonworks will show real-world results demonstrating the tangible benefits of combining Apache Hadoop 3.x with the latest in non-volatile storage: an updated IT infrastructure with NVMe™ solid state drives in well-designed platforms. We will explore specific workloads and application acceleration, showing how Apache Hadoop 3.x with SSDs can be used to build analytics platforms that provide a sustainable competitive advantage, delivering low-latency, high-performance active archives with better results and reduced storage overhead.
Speakers
Saumitra Buragohain, Hortonworks, Sr. Director, Product Management
Mike Cunliffe, Micron Technology, Data Management Architect
Exploiting machine learning to keep Hadoop clusters healthy (DataWorks Summit)
Oath has one of the largest Hadoop footprints, with tens of thousands of jobs run every day. Reliability and consistency are key here. With 50k+ nodes, a considerable number of nodes will have disk, memory, network, or slowness issues, and any hosts with trouble serving or running jobs can increase the run times of tight-SLA-bound jobs exponentially and frustrate the users and support team debugging them.
We are constantly working to develop systems that work in tandem with Hadoop to quickly identify and single out pressure points. Here we concentrate on disk, since in our experience disks are the biggest troublemakers and the most fragile components, especially high-density disks. Because of the huge scale and the monetary impact of slow-performing disks, we took on the challenge of building a system to predict worn-out disks and take them out before they become a performance bottleneck and hit jobs' SLAs. The task sounds simple: look for symptoms of hard drive failure and take the drives out, right? It is not so straightforward when we are talking about 200k+ disk drives. Just collecting such huge data periodically and reliably is one of the smaller challenges compared to analyzing such huge datasets and predicting bad disks. For each disk we have SMART statistics such as the reallocated sectors count, reported uncorrectable errors, command timeout, and uncorrectable sector count; on top of that, each hard disk model has its own interpretation of the above-mentioned statistics. DHEERAJ KAPUR, Principal Engineer, Oath, and SWETHA BANAGIRI
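To make the per-model interpretation of those disk statistics concrete, here is a minimal Python sketch of threshold-based disk screening; the model names and threshold values are hypothetical illustrations, not the predictive system described in the talk:

# Each disk model interprets its SMART-style counters differently,
# so thresholds are keyed by model (values here are made up).
THRESHOLDS = {
    "HDD-MODEL-A": {"reallocated_sectors": 50, "uncorrectable_errors": 10,
                    "command_timeouts": 100, "uncorrectable_sectors": 20},
    "HDD-MODEL-B": {"reallocated_sectors": 200, "uncorrectable_errors": 5,
                    "command_timeouts": 50, "uncorrectable_sectors": 10},
}

def flag_worn_out(disk):
    # Flag a disk for proactive removal if any counter exceeds its
    # model-specific threshold; unknown models go to manual review.
    limits = THRESHOLDS.get(disk["model"])
    if limits is None:
        return False
    return any(disk.get(attr, 0) > limit for attr, limit in limits.items())

# Usage: scan a fleet snapshot and collect removal candidates.
fleet = [{"model": "HDD-MODEL-A", "reallocated_sectors": 75,
          "uncorrectable_errors": 0, "command_timeouts": 3,
          "uncorrectable_sectors": 0}]
candidates = [d for d in fleet if flag_worn_out(d)]  # flags the disk above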
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p... (Alluxio, Inc.)
Alluxio Community Office Hour
September 29, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speaker(s):
Adit Madan, Alluxio
In this talk, we describe the architecture to migrate analytics workloads incrementally to any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) directly on on-prem data without copying the data to cloud storage.
In this Office Hour:
- We will go over an architecture for running elastic compute clusters in the cloud using on-prem HDFS.
- Have a casual online video chat with Alluxio open-source core maintainers to address any Alluxio-related questions from our community members
My other computer is a datacentre - 2012 edition (Steve Loughran)
An updated version of the "my other computer is a datacentre" talk, presented as an HPC talk at Bristol University.
Because it is targeted at universities, it emphasises some of the interesting problems: the classic CS ones of scheduling, new ones of availability and failure handling within what is now a single computer, and emergent problems of power and heterogeneity. It also includes references, all of which are worth reading and, being mostly Google and Microsoft papers, are free to download without needing ACM or IEEE library access.
Comments welcome.
Current Trends and Challenges in Big Data Benchmarking (eXascale Infolab)
Years ago, it was common to write a for-loop and call it a benchmark. Nowadays, benchmarks are complex pieces of software and specifications. This talk discusses the idea of benchmark engineering, trends in benchmarking research, and current efforts of the SPEC Research Group and the WBDB community focusing on Big Data. The way in which benchmarks are used has changed. Traditionally, they were mostly used for generating throughput numbers. Today, benchmarks are used, e.g., as test frameworks to evaluate different aspects of systems such as scalability or performance. Since benchmarks provide standardized workloads and meaningful metrics, they are increasingly important for research.
The benchmark community is currently focusing on new trends such as cloud computing, big data, power consumption, and large-scale, highly distributed systems. For several of these trends, traditional benchmarking approaches fail: how can we benchmark a highly distributed system with thousands of nodes and data sources? What does a typical Big Data workload look like and how does it scale? How can we benchmark a real-world setup in a realistic way on limited resources? What does performance mean in the context of Big Data? What is the right metric?
Speaker: Kai Sachs is a member of the Lifecycle & Cloud Management group at SAP AG. He received a joint Diploma degree in business administration and computer science as well as a PhD degree from Technische Universität Darmstadt. His PhD thesis was awarded with the SPEC Distinguished Dissertation Award 2011 for outstanding contributions in the area of performance evaluation and benchmarking. His research interests include software performance engineering, capacity planning, cloud computing and benchmarking. He is co-founder of ACM/SPEC International Conference on Performance Engineering (ICPE). He has served as member of several program and organization committees and as reviewer for many conferences and journals. Among others he was the PC Chair of the SPEC Benchmark Workshop 2010, Program Chair of the Workshop on Hot Topics on Cloud Services 2013 and the Industrial PC Chair of the ICPE 2011. Kai Sachs is currently serving on the editorial board of the CSI Transactions on ICT, as vice-chair of the SPEC Research Group, as PC Co-Chair of the ACM/SPEC ICPE 2015 and as Co-Chair of the Workshop on Big Data Benchmarking 2014.
This tutorial was held at IEEE BigData '14 on October 29, 2014 in Bethesda, MD, USA.
Presenters: Chaitan Baru and Tilmann Rabl
More information available at:
http://msrg.org/papers/BigData14-Rabl
Summary:
This tutorial will introduce the audience to the broad set of issues involved in defining big data benchmarks, for creating auditable industry-standard benchmarks that consider performance as well as price/performance. Big data benchmarks must capture the essential characteristics of big data applications and systems, including heterogeneous data, e.g. structured, semi-structured, unstructured, graphs, and streams; large-scale and evolving system configurations; varying system loads; processing pipelines that progressively transform data; workloads that include queries as well as data mining and machine learning operations and algorithms. Different benchmarking approaches will be introduced, from micro-benchmarks to application-level benchmarking.
Since May 2012, five workshops have been held on Big Data Benchmarking, including participation from industry and academia. One of the outcomes of these meetings has been the creation of industry's first big data benchmark, viz., TPCx-HS, the Transaction Processing Performance Council's benchmark for Hadoop Systems. During these workshops, a number of other proposals have been put forward for more comprehensive big data benchmarking. The tutorial will present and discuss salient points and essential features of such benchmarks that have been identified in these meetings by experts in big data as well as benchmarking. Two key approaches are now being pursued: one, called BigBench, is based on extending the TPC Decision Support (TPC-DS) benchmark with big data application characteristics. The other, called the Deep Analytics Pipeline, is based on modeling processing that is routinely encountered in real-life big data applications. Both will be discussed.
We conclude with a discussion of a number of future directions for big data benchmarking.
Data Modeling and Scale Out - ScaleBase + 451 Group webinar, 30.4.2015 (Vladi Vexler)
451 Research:
- Key challenges in the data landscape
- Evolution of distributed database environments
ScaleBase
- Pros and cons of abstracting complex databases topology
- Top strategies of distributed data modeling
- Advanced data modeling and “what-if” simulations with ScaleBase Analysis Genie
- Scaling real apps – From need to deployment
When and How Data Lakes Fit into a Modern Data Architecture (DATAVERSITY)
Whether to take data ingestion cycles off the ETL tool and the data warehouse, or to facilitate competitive data science and algorithm building in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn't have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Build the data lake, not the data swamp! The tool ecosystem is building up around the data lake, and soon many organizations will have both a robust lake and a data warehouse. We will discuss policies to keep them straight, send data to its best platform, and keep up users' confidence in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Agile Big Data Analytics Development: An Architecture-Centric Approach (SoftServe)
Presented at The Hawaii International Conference on System Sciences by Hong-Mei Chen and Rick Kazman (University of Hawaii), Serge Haziyev (SoftServe).
Prof. Dr. Alexander Mädche, University of Mannheim
Dr. Hendrik Meth, BorgWarner IT Services Europa GmbH
Walldorf, September 11th 2015
SAP University Alliance EMEA Conference
Easy SPARQLing for the Building Performance Professional (Martin Kaltenböck)
Slides of Martin Kaltenböck's (SWC) presentation at the SEMANTiCS 2014 conference in Leipzig on 5 September 2014, about the 'Tool for Building Energy Performance Scenarios' of GBPN (Global Buildings Performance Network, http://gbpn.org), which provides a prediction tool for building performance worldwide by making use of Linked Open Data (LOD).
The Maturity Model: Taking the Growing Pains Out of Hadoop (Inside Analysis)
The Briefing Room with Rick van der Lans and Think Big, a Teradata Company
Live Webcast on June 16, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=197f8106531874cc5c14081ca214eaff
Hadoop is arguably one of the most disruptive technologies of the last decade. Once lauded solely for its ability to transform the speed of batch processing, it has marched steadily forward and promulgated an array of performance-enhancing accessories, notably Spark and YARN. Hadoop has evolved into much more than a file system and batch processor, and it now promises to stand as the data management and analytics backbone for enterprises.
Register for this episode of The Briefing Room to learn from veteran Analyst Rick van der Lans, as he discusses the emerging roles of Hadoop within the analytics ecosystem. He’ll be briefed by Ron Bodkin of Think Big, a Teradata Company, who will explore Hadoop’s maturity spectrum, from typical entry use cases all the way up the value chain. He’ll show how enterprises that already use Hadoop in production are finding new ways to exploit its power and build creative, dynamic analytics environments.
Visit InsideAnalysis.com for more information.
Marlabs Capabilities Overview: DWBI, Analytics and Big Data Services (Marlabs)
Marlabs’ Business Intelligence and Analytics practice can support customers’ needs throughout the information management lifecycle. As a vendor-agnostic and holistic service provider with expertise in a range of tools and technologies, we can help clients make informed decisions to employ the right technologies that align with their business needs.
In this slidedeck, Infochimps Director of Product, Tim Gasper, discusses how Infochimps tackles business problems for customers by deploying a comprehensive Big Data infrastructure in days; sometimes in just hours. Tim unlocks how Infochimps is now taking that same aggressive approach to deliver faster time to value by helping customers develop analytic applications with impeccable speed.
Government GraphSummit: And Then There Were 15 Standards (Neo4j)
Todd Pihl, PhD, Technical Project Manager, & Mark Jensen, Director of Data Management and Interoperability, National Institutes of Health, Frederick National Laboratory for Cancer Research
Data repositories such as NCI’s Cancer Research Data Commons receive data that use a variety of data models and vocabularies. This presents a significant obstacle to finding and using the data outside of their original purpose. In this talk we’ll show how using Neo4j allows different data models to be represented and mapped to each other, giving data managers a new way to provide harmonized data to their users.
Crossing the Analytics Chasm and Getting the Models You Developed Deployed (Robert Grossman)
There are two cultures in data science and analytics: those who develop analytic models and those who deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including using analytic engines and analytic containers. We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps, which has borrowed some of the techniques of DevOps.
How to make your data count webinar, 26 Nov 2018 (ARDC)
Slides from the 26 Nov Make your data count webinar. The research community has long grappled with the problem of assessing and tracking the results of scholarship. Research data is no exception. The Make Data Count (MDC) project (https://makedatacount.org/), funded by the Sloan Foundation, has delivered a data usage metric standard (Code of Practice) and a workflow for the retrieval and display of standardised usage and citation metrics in your repository interface.
Listen to this webinar to learn more about the Make Data Count project and the 5 steps you can take to make the data in your repository count. Hear from MDC project team members who have already implemented MDC in the dash (https://dash.ucop.edu) and DataOne (https://search.dataone.org/data) repositories. Learn from their experience, see the results.
Our international speaker line-up includes Daniella Lowenberg (California Digital Library) and Patricia Cruse (DataCite).
Recording available: https://youtu.be/Lkysz0Mc7fo
Crafting Big Data Benchmarks
1. Tilmann Rabl
Middleware Systems Research Group & bankmark UG
ISC’14, June 26, 2014
Crafting Benchmarks for Big Data
MIDDLEWARE SYSTEMS RESEARCH GROUP - MSRG.ORG
2. Outline
•Big Data Benchmarking Community
•Our approach to building benchmarks
•Big Data Benchmarks
•Characteristics
•BigBench
•Big Decisions
•Hammer
•DAP
•Slides borrowed from Chaitan Baru
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 2
3. Big Data Benchmarking Community
•Genesis of the Big Data Benchmarking effort
•Grant from NSF under the Cluster Exploratory (CluE) program (Chaitan Baru, SDSC)
•Chaitan Baru (SDSC), Tilmann Rabl (University of Toronto), Milind Bhandarkar (Pivotal/Greenplum), Raghu Nambiar (Cisco), Meikel Poess (Oracle)
•Launched Workshops on Big Data Benchmarking
•First WBDB: May 2012, San Jose. Hosted by Brocade
•Objectives
•Lay the groundwork for the development of industry standards for measuring the effectiveness of hardware and software technologies dealing with big data
•Exploit synergies between benchmarking efforts
•Offer a forum for presenting and debating platforms, workloads, data sets and metrics relevant to big data
•Big Data Benchmark Community (BDBC)
•Regular conference calls for talks and announcements
•Open to anyone interested, free of charge
•BDBC makes no claims to any developments or ideas
•clds.sdsc.edu/bdbc/community
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 3
4. •Actian
•AMD
•BMMsoft
•Brocade
•CA Labs
•Cisco
•Cloudera
•Convey Computer
•CWI/Monet
•Dell
•EPFL
•Facebook
•Google
•Greenplum
•Hewlett-Packard
•Hortonworks
•Indiana Univ / Hathitrust Research Foundation
•InfoSizing
•Intel
•LinkedIn
•MapR/Mahout
•Mellanox
•Microsoft
•NSF
•NetApp
•NetApp/OpenSFS
•Oracle
•Red Hat
•San Diego Supercomputer Center
•SAS
•Scripps Research Institute
•Seagate
•Shell
•SNIA
•Teradata Corporation
•Twitter
•UC Irvine
•Univ. of Minnesota
•Univ. of Toronto
•Univ. of Washington
•VMware
•WhamCloud
•Yahoo!
1st WBDB: Attendee Organizations
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 4
5. Further Workshops
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 5
2nd WBDB: http://clds.sdsc.edu/wbdb2012.in
3rd WBDB: http://clds.sdsc.edu/wbdb2013.cn
4th WBDB: http://clds.sdsc.edu/wbdb2013.us
5th WBDB: http://clds.sdsc.edu/wbdb2014.de
6. First Outcomes
•Big Data Benchmarking Community (BDBC) mailing list (~200 members from ~80 organizations)
•Organized webinars every other Thursday
•http://clds.sdsc.edu/bdbc/community
•Paper from First WBDB
•“Setting the Direction for Big Data Benchmark Standards,” C. Baru, M. Bhandarkar, R. Nambiar, M. Poess, and T. Rabl, published in Selected Topics in Performance Evaluation and Benchmarking, Springer-Verlag
7. Further Outcomes
•Selected papers published in Springer Lecture Notes in Computer Science (LNCS)
•LNCS 8163: Specifying Big Data Benchmarks (covering the first and second workshops)
•LNCS 8585: Advancing Big Data Benchmarks (covering the third and fourth workshops, in print)
•Papers from the 5th WBDB will appear in Vol. III
•Formation of TPC Subcommittee on Big Data Benchmarking
•Working on TPCx-HS: TPC Express benchmark for Hadoop Systems, based on Terasort
•http://www.tpc.org/tpcbd/
•Formation of a SPEC Research Group on Big Data Benchmarking
•Proposal of BigDataTop100 List
•Specification of BigBench
8. TPC Big Data Subcommittee
•TPCx-HS
•TPC Express for Hadoop Systems
•Based on Terasort
•Teragen, Terasort, Teravalidate
•Database size / Scale Factors
•SF: 1, 3, 10, 30, 100, 300, 1000, 3000, 10000 TB
•Corresponds to: 10B, 30B, 100B, 300B, 1000B, 3000B, 10000B, 30000B, 100000B 100-byte records (B = billion)
•Performance Metric
•HSph@SF = SF / T, where T is the total elapsed time in hours
•Price/Performance
•$/HSph, where $ is the 3-year total cost of ownership
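As a quick illustration of how the metric composes (a sketch with made-up numbers, not an official result):

SF = 1000          # scale factor: 1000 TB sorted
T = 2.5            # total elapsed time of the run, in hours
tco = 1_500_000    # assumed 3-year total cost of ownership, in dollars

hsph = SF / T            # HSph@1000 = 400.0
price_perf = tco / hsph  # $/HSph = 3750.0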
9. Formation of SPEC Research Big Data Working Group
•Mission Statement
The mission of the Big Data (BD) working group is to facilitate research and to engage industry leaders for defining and developing performance methodologies of big data applications. The term “big data” has become a major force of innovation across enterprises of all sizes. New platforms, claiming to be the “big data” platform with increasingly more features for managing big data sets, are being announced almost on a weekly basis. Yet, there is currently a lack of consensus on what constitutes a big data system and any means of comparability among such systems.
•Initial Committee Structure
•Tilmann Rabl (Chair)
•Chaitan Baru (Vice Chair)
•Meikel Poess (Secretary)
•To replace less formal BDBC group
10. Big Data Top100 List
•Modeled after Top500 and Graph500 in HPC community
•Proposal presented at Strata Conference, February 2013
•Based on application-level benchmarking
•Article in inaugural issue of the Big Data Journal
•“Big Data Benchmarking and the Big Data Top100 List,” by Baru, Bhandarkar, Nambiar, Poess, and Rabl, Big Data Journal, Vol. 1, No. 1, pp. 60–64, Mary Ann Liebert Publications.
•In progress
11. Big Data Benchmarks
12. Types of Big Data Benchmarks
•Micro-benchmarks. To evaluate specific lower-level system operations
•E.g., A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Panda et al, OSU
•Functional benchmarks. Specific high-level function.
•E.g. Sorting: Terasort
•E.g. Basic SQL: Individual SQL operations, e.g. Select, Project, Join, Order-By, …
•Genre-specific benchmarks. Benchmarks related to type of data
•E.g. Graph500. Breadth-first graph traversals
•Application-level benchmarks
•Measure system performance (hardware and software) for a given application scenario—with given data and workload
13. Application-Level Benchmark Design Issues from WBDB
•Audience: Who is the audience for the benchmark?
•Marketing (Customers / End users)
•Internal Use (Engineering)
•Academic Use (Research and Development)
•Is the benchmark for innovation or competition?
•If a competitive benchmark is successful, it will be used for innovation
•Application: What type of application should be modeled?
•TPC: schema + transaction/query workload
•Big Data: Abstractions of a data-processing pipeline, e.g. Internet-scale businesses
14. App Level Issues -2
•Component vs. end-to-end benchmark. Is it possible to factor out a set of benchmark “components”, which can be isolated and plugged into an end-to-end benchmark?
•The benchmark should consist of individual components that ultimately make up an end-to-end benchmark
•Single benchmark specification: Is it possible to specify a single benchmark that captures characteristics of multiple applications?
•Maybe: Create a single, multi-step benchmark, with plausible end-to-end scenario
15. App Level Issues -3
•Paper & Pencil vs. Implementation-based. Should the benchmark be specification-driven or implementation-driven?
•Start with an implementation and develop specification at the same time
•Reuse. Can we reuse existing benchmarks?
•Leverage existing work and built-up knowledge base
•Benchmark Data. Where do we get the data from?
•Synthetic data generation: structured, non-structured data
•Verifiability. Should there be a process for verification of results?
•YES!
16. Abstracting the Big Data World
1. Enterprise Data Warehouse + other types of data
•Structured enterprise data warehouse
•Extend to incorporate semi-structured data, e.g. from weblogs, machine logs, clickstreams, customer reviews, …
•“Design time” schemas
2. Collection of heterogeneous data + pipelines of processing
•Enterprise data processing as a pipeline from data ingestion to transformation, extraction, subsetting, machine learning, predictive analytics
•Data from multiple structured and non-structured sources
•“Runtime” schemas – late binding, application-driven schemas (see the sketch below)
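A toy contrast of the two schema styles, as a minimal Python sketch (the field names are invented for illustration):

import json

# "Design time": the schema is fixed up front; every row must conform.
sales_row = (42, "2014-06-26", 19.99)   # (item_id, sale_date, price)

# "Runtime": the schema is applied when the data is read; fields are
# looked up on access (late binding), and records may differ.
log_line = '{"user": 7, "event": "click", "page": "/item/42"}'
record = json.loads(log_line)
page = record.get("page")               # present in some records, absent in others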
17. Other Benchmarks discussed at WBDB
•Big Decision, Jimmy Zhao, HP
•HiBench/Hammer, Lan Yi, Intel
•BigDataBench, Jianfeng Zhan, Chinese Academy of Sciences
•CloudSuite, Onur Kocberber, EPFL
•Genre specific benchmarks
•Microbenchmarks
18. The BigBench Proposal
•End-to-end benchmark
•Application level
•Based on a product retailer (TPC-DS)
•Focused on Parallel DBMS and MR engines
•History
•Launched at 1st WBDB, San Jose
•Published at SIGMOD 2013
•Full spec at WBDB proceedings 2012
•Full kit at WBDB 2014
•Collaboration with Industry & Academia
•First: Teradata, University of Toronto, Oracle, InfoSizing
•Now: UofT, bankmark, Intel, Oracle, Microsoft, UCSD, Pivotal, Cloudera, InfoSizing, SAP, Hortonworks, Cisco, …
19. Data Model
[Diagram: BigBench data model combining structured data (Sales, Customer, Item, Marketprice, Web Page), semi-structured data (Web Log), and unstructured data (Reviews); each part is marked as either adapted from TPC-DS or BigBench-specific.]
20. Data Model – 3 Vs
•Variety
•Different schema parts
•Volume
•Based on scale factor
•Similar to TPC-DS scaling, but continuous
•Weblogs & product reviews also scaled
•Velocity
•Refreshes for all data
•Different velocity for different areas
•V_structured < V_unstructured < V_semi-structured
21. Workload
•Workload Queries
•30 “queries”
•Specified in English (sort of)
•No required syntax
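For a flavor of the style (paraphrased, not the exact specification text): query 1 asks for items that are frequently sold together in a given set of stores; the HiveQL on slide 23 shows one implementation.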
•Business functions (Adapted from McKinsey)
•Marketing
•Cross-selling, Customer micro-segmentation, Sentiment analysis, Enhancing multichannel consumer experiences
•Merchandising
•Assortment optimization, Pricing optimization
•Operations
•Performance transparency, Product return analysis
•Supply chain
•Inventory management
•Reporting (customers and products)
22. SQL-MR Query 1
23. HiveQL Query 1
-- Item pairs frequently sold together: an identity MAP clusters (basket, item)
-- pairs by basket; a Java REDUCE emits the item pairs found in each basket.
SELECT pid1, pid2, COUNT(*) AS cnt
FROM (
  FROM (
    FROM (
      SELECT s.ss_ticket_number AS oid, s.ss_item_sk AS pid
      FROM store_sales s
      INNER JOIN item i ON s.ss_item_sk = i.i_item_sk
      WHERE i.i_category_id IN (1, 2, 3) AND s.ss_store_sk IN (10, 20, 33, 40, 50)
    ) q01_temp_join
    MAP q01_temp_join.oid, q01_temp_join.pid
    USING 'cat'
    AS oid, pid
    CLUSTER BY oid
  ) q01_map_output
  REDUCE q01_map_output.oid, q01_map_output.pid
  USING 'java -cp bigbenchqueriesmr.jar:hive-contrib.jar de.bankmark.bigbench.queries.q01.Red'
  AS (pid1 BIGINT, pid2 BIGINT)
) q01_temp_basket
GROUP BY pid1, pid2
HAVING COUNT(pid1) > 49
ORDER BY pid1, cnt, pid2;
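Note the pattern: USING 'cat' turns the MAP step into an identity transform whose only job is to CLUSTER BY the basket id (oid), so the external Java reducer sees all items of a basket together and can emit the item pairs that the outer GROUP BY then counts.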
24. BigBench Current Status
•All queries are available in Hive/Hadoop
•New data generator (continuous scaling, realistic data) available
•New metric available
•Complete driver available
•Refresh will be done soon
•Full kit at WBDB 2014
•https://github.com/intel-hadoop/Big-Bench
Query Types    Number of Queries    Percentage
Pure HiveQL    14                   46%
Mahout         5                    17%
OpenNLP        5                    17%
Custom MR      6                    20%
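In other words, just under half of the 30 queries stay in pure HiveQL; the remainder exercise Mahout (the machine-learning queries), OpenNLP (presumably the review and sentiment queries), and custom MapReduce jobs.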
25. Big Decision, Jimmy Zhao, HP, 4th WBDB
•Benchmark for DSS/data mining solutions
•Everything running in the same system
•Engine of analytics
•Reflecting the real business model
•Huge data volume
•Data from social networks
•Data from web logs
•Data from comments
•Broader data support
•Semi-structured data
•Un-structured data
•Continuous data integration
•ETL just a normal job of the system
•Data integration whenever there is data
•Big data analytics
Big Decision – Big TPC-DS!
TPC-DS
•Mature and proven workload for BI
•Mixed workloads
•Well-defined scale factors
Semi- + unstructured TPC-DS
•Additional data and dimensions from the new data
•Semi-structured and unstructured data
•TB to PB or even zettabyte support
New TPC-DS generator – Agile ETL
•Continuous data generation and injection
•Considered part of the workloads
New massively parallel processing technologies
•Convert queries to SQL-like queries
•Include interactive & regular queries
•Include machine learning jobs
26. Big Decision Block Diagram
[Diagram: the TPC-DS data (Sales, Item, Customer, Web page, Web log, Reviews) is connected, through an Agile ETL stage (Extraction, Transform, Load), with SNS marketing data: Social Message, Search & Social Advertise, Social Web pages, Social Feedbacks, and Mobile log.]
27. HiBench, Lan Yi, Intel, 4th WBDB
HiBench
Micro Benchmarks
– Sort
– WordCount
– TeraSort
Web Search
– Nutch Indexing
– Page Rank
Machine Learning
– Bayesian Classification
– K-Means Clustering
HDFS
– Enhanced DFSIO
See our paper “The HiBench Suite: Characterization of the MapReduce-Based Data Analysis” in ICDE’10 workshops (WISS’10)
1. Different from GridMix, SWIM?
2. Micro Benchmark?
3. Isolated components?
4. End-2-end Benchmark?
5. We need ETL-Recommendation Pipeline
29. ETL-Recommendation (hammer)
•Task Dependencies
[Diagram: task dependency graph over ETL-logs, ETL-sales, Pref-logs, Pref-sales, Pref-comb, item-based collaborative filtering, and an offline test.]
30. The Deep Analytics Pipeline, Bhandarkar (1st WBDB)
•“User Modeling” pipelines
•Generic use case: Determine user interests or user categories by mining user activities
•Large dimensionality of possible user activities
•Typical user represents a sparse activity vector
•Event attributes change over time
[Diagram: the user-modeling pipeline: Data Acquisition/Normalization/Sessionization → Feature and Target Generation → Model Training → Offline Scoring & Evaluation → Batch Scoring & Upload to Server, aligned with the generic stages Acquisition/Recording → Extraction/Cleaning/Annotation → Integration/Aggregation/Representation → Analysis/Modeling → Interpretation.]
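To make the “sparse activity vector” above concrete, a minimal Python sketch (feature names invented for illustration):

# A user's activities over a huge feature space, kept sparse as {feature: count}.
user_activity = {
    "click:sports_article": 3,
    "search:running_shoes": 1,
    "ad_click:shoe_brand_x": 1,
}

def sparse_dot(u, v):
    # Only features present in both vectors contribute to the score.
    return sum(count * v[feat] for feat, count in u.items() if feat in v)

interest_model = {"click:sports_article": 0.5, "search:running_shoes": 0.8}
score = sparse_dot(user_activity, interest_model)   # 3*0.5 + 1*0.8 = 2.3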
31. Example Application Domains
•Retail
•Events: clicks, purchases, ad clicks, FB likes, …
•Goal: Personalized product recommendations
•Datacenters
•Events: log messages, traffic, communications events, …
•Goal: Predict imminent failures
•Healthcare
•Events: Doctor visits, medical history, medicine refills, …
•Goal: Prevent hospital readmissions
•Telecom
•Events: Calls made, duration, calls dropped, location, social graph, …
•Goal: Reduce customer churn
•Web Ads
•Events: Clicks on content, likes, reposts, search queries, comments, …
•Goal: Increase engagement, increase clicks on revenue-generating content
32. Steps in the Pipeline
•Acquisition and normalization of data
•Collate, consolidate data
•Join targets and features
•Construct targets; filter out user activity without targets; join feature vector with targets
•Model Training
•Multi-model: regressions, Naïve Bayes, decision trees, Support Vector Machines, …
•Offline scoring
•Score features, evaluate metrics
•Batch scoring
•Apply models to all user activity; upload scores
33. Application Classes
•Widely varying number of events per entity
•Multiple classes of applications, based on size, e.g.:
•Tiny (100K entities, 10 events per entity)
•Small (1M entities, 10 events per entity)
•Medium (10M entities, 100 events per entity)
•Large (100M entities, 1000 events per entity)
•Huge (1B entities, 1000 events per entity)
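A back-of-the-envelope reading of these classes (total events = entities × events per entity; illustrative arithmetic only):

classes = {
    "Tiny":   (100_000,        10),
    "Small":  (1_000_000,      10),
    "Medium": (10_000_000,     100),
    "Large":  (100_000_000,    1000),
    "Huge":   (1_000_000_000,  1000),
}
for name, (entities, events) in classes.items():
    print(f"{name}: {entities * events:,} total events")
# Ranges from 1,000,000 (Tiny) to 1,000,000,000,000 (Huge) total events.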
34. Proposal for Pipeline Benchmark Results
•Publish results for every stage in the pipeline
•Data pipelines for different application domains may be constructed by mix and match of various pipeline stages
•Different modeling techniques per class
•So, need to publish performance numbers for every stage
35. Get involved
•Workshop on Big Data Benchmarking (WBDB)
•Fifth workshop: August 6-7, Potsdam, Germany
•clds.ucsd.edu/wbdb2014.de
•Proceedings will be published in Springer LNCS
•Big Data Benchmarking Community
•Biweekly conference calls (sort of)
•Mailing list
•clds.ucsd.edu/bdbc/community
•Coming up next: BDBC@SPEC Research
•We will join forces with SPEC Research
•Try BigBench:
•https://github.com/intel-hadoop/Big-Bench
36. Questions?
Contact:
Tilmann Rabl
tilmann.rabl@bankmark.de
tilmann.rabl@utoronto.ca
Thank You!