In this paper we present the initial results of our work on running BigBench on Spark. First, we evaluated the data scalability behavior of the existing MapReduce implementation of BigBench. Next, we executed the group of 14 pure HiveQL queries on Spark SQL and compared the results with the respective Hive results. Our experiments show that (1) on both MapReduce and Spark SQL, BigBench queries scale on average better than linearly as the data size grows, and (2) the pure HiveQL queries run faster on Spark SQL than on Hive.
http://clds.sdsc.edu/wbdb2015.ca/program
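As a rough illustration of the porting approach, the sketch below submits unchanged HiveQL text to Spark SQL through a Hive-enabled session; the table and query are invented stand-ins, not the actual BigBench queries.

```python
# Minimal sketch: running the same HiveQL text on Spark SQL.
# Table and column names are illustrative, not from the benchmark.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bigbench-hiveql-on-spark")
         .enableHiveSupport()   # reuse the existing Hive metastore
         .getOrCreate())

# A pure HiveQL query (no UDFs, no procedural code) runs as-is:
result = spark.sql("""
    SELECT category, COUNT(*) AS cnt
    FROM web_sales
    GROUP BY category
    ORDER BY cnt DESC
""")
result.show()
```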
BDSE 2015 Evaluation of Big Data Platforms with HiBench (t_ivanov)
The document evaluates and compares the performance of DataStax Enterprise (DSE) and Cloudera Hadoop Distribution (CDH) using the HiBench benchmark suite. It finds that CDH outperforms DSE for CPU-intensive, read-intensive, and mixed workloads, while DSE has better performance for write-intensive workloads. The evaluation was conducted on an 8-node cluster using data sizes from 240GB to 440GB. Ongoing work includes analyzing availability, evaluating different file formats, and comparing graph processing engines.
This document discusses HiBench, a benchmark suite for Hadoop. It provides an overview of HiBench and how it can be used to characterize and evaluate Hadoop deployments. Evaluation results using HiBench show that a newer Intel Xeon server platform provides up to 86% more throughput and is up to 56% faster than an older platform. Evaluations between Hadoop versions 0.19.1 and 0.20.0 show that improvements in the newer version help reduce job completion times. The document concludes by providing suggestions for optimizing Hadoop deployments through hardware and software configurations.
Lessons Learned on Benchmarking Big Data Platforms (t_ivanov)
The document discusses benchmarking different big data platforms and SQL-on-Hadoop engines. It evaluates the performance of Hadoop using the TPCx-HS benchmark with different network configurations. It also compares the performance of SQL query engines like Hive, Spark SQL, Impala, and file formats like ORC and Parquet using the TPC-H benchmark on a 1TB dataset. The results show that a dedicated 1Gb network is 5 times faster than a shared network. For SQL query engines, Hive with ORC format is on average 1.44 times faster than with Parquet. Spark SQL could only run 12 queries and was faster on 5 queries compared to Hive.
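A minimal sketch of how such an engine/format comparison can be staged, assuming a loaded TPC-H schema; the paths and the aggregation are illustrative placeholders. It writes one table in both formats and times the same query on each.

```python
# Sketch: materialize the same data as ORC and Parquet, then time one query.
# Paths, table name, and the query are illustrative placeholders.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
lineitem = spark.table("tpch.lineitem")          # assumes a loaded TPC-H schema

lineitem.write.mode("overwrite").orc("/bench/lineitem_orc")
lineitem.write.mode("overwrite").parquet("/bench/lineitem_parquet")

for fmt, path in [("orc", "/bench/lineitem_orc"),
                  ("parquet", "/bench/lineitem_parquet")]:
    df = spark.read.format(fmt).load(path)
    start = time.time()
    df.groupBy("l_returnflag").count().collect()  # force full execution
    print(fmt, round(time.time() - start, 1), "s")
```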
This tutorial was held at IEEE BigData '14 on October 29, 2014 in Bethesda, MD, USA.
Presenters: Chaitan Baru and Tilmann Rabl
More information available at:
http://msrg.org/papers/BigData14-Rabl
Summary:
This tutorial will introduce the audience to the broad set of issues involved in defining big data benchmarks, for creating auditable industry-standard benchmarks that consider performance as well as price/performance. Big data benchmarks must capture the essential characteristics of big data applications and systems, including heterogeneous data, e.g. structured, semi-structured, unstructured, graphs, and streams; large-scale and evolving system configurations; varying system loads; processing pipelines that progressively transform data; and workloads that include queries as well as data mining and machine learning operations and algorithms. Different benchmarking approaches will be introduced, from micro-benchmarks to application-level benchmarking.
Since May 2012, five workshops have been held on Big Data Benchmarking, including participation from industry and academia. One of the outcomes of these meetings has been the creation of industry’s first big data benchmark, viz., TPCx-HS, the Transaction Processing Performance Council’s benchmark for Hadoop Systems. During these workshops, a number of other proposals have been put forward for more comprehensive big data benchmarking. The tutorial will present and discuss salient points and essential features of such benchmarks that have been identified in these meetings by experts in big data as well as benchmarking. Two key approaches are now being pursued: one, called BigBench, is based on extending the TPC Decision Support (TPC-DS) benchmark with big data application characteristics. The other, called Deep Analytics Pipeline, is based on modeling processing that is routinely encountered in real-life big data applications. Both will be discussed.
We conclude with a discussion of a number of future directions for big data benchmarking.
This document discusses benchmarking Hadoop and big data systems. It provides an overview of common Hadoop benchmarks including microbenchmarks like TestDFSIO, TeraSort, and NNBench which test individual Hadoop components. It also describes BigBench, a benchmark modeled after TPC-DS that aims to test a more complete big data analytics workload using techniques like MapReduce, Hive, and Mahout across structured, semi-structured, and unstructured data. The document emphasizes using Hadoop distributions for administration and both microbenchmarks and full benchmarks like BigBench for evaluation.
sudoers: Benchmarking Hadoop with ALOJA (Nicolas Poggi)
Presentation for the sudoers Barcelona group, Oct 06 2015, on benchmarking Hadoop with the ALOJA open source benchmarking platform. The presentation was mostly a live demo; some slides are posted here for the people who could not attend.
http://lanyrd.com/2015/sudoers-barcelona-october/
TPCx-HS is the first vendor-neutral benchmark focused on big data systems – which have become a critical part of the enterprise IT ecosystem.
Watch the video presentation: http://wp.me/p3RLHQ-cLY
Learn more: http://www.tpc.org/tpcx-hs
Covers different types of big data benchmarking and different suites, with details on TeraSort and a demo with TPCx-HS.
Meetup Details of presentation:
http://www.meetup.com/lspe-in/events/203918952/
Using BigBench to compare Hive and Spark (Long version) (Nicolas Poggi)
BigBench is the brand new standard (TPCx-BB) for benchmarking and testing Big Data systems. The BigBench specification describes several application use cases combining the need for SQL queries, Map/Reduce, user code (UDF), Machine Learning, and even streaming. From the available implementations, we can test different framework combinations, such as Hadoop+Hive (with Mahout) and Spark (SparkSQL+MLlib), in their different versions and configurations, helping us to spot problems and possible optimizations in our data stacks.
This talk first introduces BigBench and the problems it can solve. It then presents benchmark results for both Hive and Spark, in their respective v1 and v2 releases, under distinct configurations including Tez, Mahout, and MLlib. Experiments are run on cloud and on-premises clusters of different node counts and data scales, taking into account interactive and batch usage. Results are further classified by use case, showing where each platform shines (or doesn't) and why, based on performance metrics and log-file analysis. The talk concludes with the main findings and the scalability and limits of each framework.
Originally presented at: https://dataworkssummit.com/munich-2017/sessions/using-bigbench-to-compare-hive-and-spark-versions-and-features/
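To make the workload mix concrete, here is a hedged sketch of the SQL-plus-user-code pattern that BigBench queries exercise, shown on Spark; the toy sentiment UDF and table name are invented for illustration and stand in for the benchmark's real NLP user code.

```python
# Sketch of the SQL + user code (UDF) mix that BigBench workloads exercise.
# The sentiment scorer and table are toy stand-ins, not the real benchmark logic.
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

def naive_sentiment(text):
    # Toy word-count sentiment; BigBench uses real NLP user code here.
    if text is None:
        return 0
    words = text.lower().split()
    return sum(w in {"good", "great"} for w in words) - \
           sum(w in {"bad", "poor"} for w in words)

spark.udf.register("naive_sentiment", naive_sentiment, IntegerType())

spark.sql("""
    SELECT item_id, AVG(naive_sentiment(review_text)) AS avg_score
    FROM product_reviews          -- illustrative table name
    GROUP BY item_id
""").show()
```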
Originally presented at Strata EU 2017: https://conferences.oreilly.com/strata/strata-eu/public/schedule/detail/57631
Cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Spark and Hive come ready to use, with a general-purpose configuration and upgrade management. Over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements in performance and the release of v2, making it challenging to keep production services up to date, both on-premises and in the cloud, while maintaining compatibility and stability.
Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance of major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as the baseline. Nicolas uses BigBench, the brand new standard (TPCx-BB) for big data systems, with both Spark and Hive implementations, for benchmarking the systems. BigBench combines SQL queries, MapReduce, user code (UDF), and machine learning, which makes it ideal to stress Spark libraries (SparkSQL, DataFrames, MLlib, etc.).
The work is framed within the ALOJA research project, which features an open source benchmarking and analysis platform that has recently been extended to support SQL-on-Hadoop engines and BigBench. The ALOJA project aims to lower the total cost of ownership (TCO) of big data deployments and to study their performance characteristics for optimization. Nicolas highlights how advanced users can easily repeat the benchmarks through ALOJA and benefit from BigBench to optimize their Spark clusters. The work is a continuation of a paper to be published at the IEEE Big Data 16 conference. (A preprint copy can be obtained here.)
TPC-DI - The First Industry Benchmark for Data Integration (Tilmann Rabl)
This presentation was held by Meikel Poess on September 3, 2014 at VLDB 2014 in Hangzhou, China.
Full paper and additional information available at:
http://msrg.org/papers/VLDB2014TPCDI
Abstract:
Historically, the process of synchronizing a decision support system with data from operational systems has been referred to as Extract, Transform, Load (ETL), and the tools supporting this process have been referred to as ETL tools. Recently, ETL was replaced by the more comprehensive acronym data integration (DI). DI describes the process of extracting and combining data from a variety of data source formats, transforming that data into a unified data model representation, and loading it into a data store. This is done in the context of a variety of scenarios, such as data acquisition for business intelligence, analytics and data warehousing, but also synchronization of data between operational applications, data migrations and conversions, master data management, enterprise data sharing, and delivery of data services in a service-oriented architecture context, amongst others. With these scenarios relying on up-to-date information, it is critical to implement a highly performing, scalable, and easy-to-maintain data integration system. This is especially important as the complexity, variety, and volume of data are constantly increasing and the performance of data integration systems is becoming critical. Despite the significance of having a highly performing DI system, there has been no industry standard for measuring and comparing their performance. The TPC, acknowledging this void, has released TPC-DI, an innovative benchmark for data integration. This paper motivates the reasons behind its development, describes its main characteristics, including workload, run rules, and metric, and explains key decisions.
The state of Hive and Spark in the Cloud (July 2017) (Nicolas Poggi)
Originally presented at the BDOOP and Spark Barcelona meetup groups: http://meetu.ps/3bwCTM
Cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Spark and Hive come ready to use, with a general-purpose configuration and upgrade management. Over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements in performance and the release of v2, making it challenging to keep production services up to date, both on-premises and in the cloud, while maintaining compatibility and stability. The talk compares:
• The performance of both v1 and v2 for Spark and Hive
• PaaS cloud services: Azure HDInsight, Amazon Web Services EMR, Google Cloud Dataproc
• Out-of-the-box support for Spark and Hive versions from providers
• PaaS reliability, scalability, and price-performance of the solutions
The evaluation uses BigBench, the new Big Data benchmark standard. BigBench combines SQL queries, MapReduce, user code (UDF), and machine learning, which makes it ideal to stress Spark libraries (SparkSQL, DataFrames, MLlib, etc.).
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St... (t_ivanov)
Columnar file formats provide an efficient way to store data to be queried by SQL-on-Hadoop engines. Related works consider the performance of the processing engine and the file format together, which makes it impossible to isolate their individual impact. In this work, we propose an alternative approach: by executing each file format on the same processing engine, we compare the different file formats as well as their different parameter settings. We apply our strategy to two processing engines, Hive and SparkSQL, and evaluate the performance of two columnar file formats, ORC and Parquet. We use BigBench (TPCx-BB), a standardized application-level benchmark for Big Data scenarios. Our experiments confirm that the file format selection and its configuration significantly affect the overall performance. We show that ORC generally performs better on Hive, whereas Parquet achieves the best performance with SparkSQL. Using ZLIB compression brings up to 60.2% improvement with ORC, while Parquet achieves up to 7% improvement with Snappy. Exceptions are the queries involving text processing, which do not benefit from using any compression.
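A Spark-side sketch of the two format/compression combinations the study highlights; the dataset and paths are placeholders, while the `compression` option keys are standard Spark write options.

```python
# Sketch: the compression settings compared in the study, applied per write.
# Data and paths are placeholders; the option keys are standard Spark ones.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000).withColumnRenamed("id", "k")  # toy dataset

# ORC with ZLIB (the combination the study found best on Hive)
df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/t_orc_zlib")

# Parquet with Snappy (the combination the study found best on SparkSQL)
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/t_parquet_snappy")
```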
Current Trends and Challenges in Big Data Benchmarking (eXascale Infolab)
Years ago, it was common to write a for-loop and call it a benchmark. Nowadays, benchmarks are complex pieces of software and specifications. In this talk, the idea of benchmark engineering, trends in benchmarking research, and current efforts of the SPEC Research Group and the WBDB community focusing on Big Data will be discussed. The way in which benchmarks are used has changed. Traditionally, they were mostly used for generating throughput numbers. Today, benchmarks are, e.g., used as test frameworks to evaluate different aspects of systems such as scalability or performance. Since benchmarks provide standardized workloads and meaningful metrics, they are increasingly important for research.
The benchmark community is currently focusing on new trends such as cloud computing, big data, power-consumption and large scale, highly distributed systems. For several of these trends traditional benchmarking approaches fail: how can we benchmark a highly distributed system with thousands of nodes and data sources? What does a typical Big Data workload look like and how does it scale? How can we benchmark a real world setup in a realistic way on limited resources? What does performance mean in the context of Big Data? What is the right metric?
Speaker: Kai Sachs is a member of the Lifecycle & Cloud Management group at SAP AG. He received a joint Diploma degree in business administration and computer science as well as a PhD degree from Technische Universität Darmstadt. His PhD thesis was awarded with the SPEC Distinguished Dissertation Award 2011 for outstanding contributions in the area of performance evaluation and benchmarking. His research interests include software performance engineering, capacity planning, cloud computing and benchmarking. He is co-founder of ACM/SPEC International Conference on Performance Engineering (ICPE). He has served as member of several program and organization committees and as reviewer for many conferences and journals. Among others he was the PC Chair of the SPEC Benchmark Workshop 2010, Program Chair of the Workshop on Hot Topics on Cloud Services 2013 and the Industrial PC Chair of the ICPE 2011. Kai Sachs is currently serving on the editorial board of the CSI Transactions on ICT, as vice-chair of the SPEC Research Group, as PC Co-Chair of the ACM/SPEC ICPE 2015 and as Co-Chair of the Workshop on Big Data Benchmarking 2014.
High Performance Data Analytics with Java on Large Multicore HPC Clusters (Saliya Ekanayake)
Within the last few years, there have been significant contributions to Java-based big data frameworks and libraries such as Apache Hadoop, Spark, and Storm. While these systems are rich in interoperability and features, developing high performance big data analytics applications is challenging. Also, the study of performance characteristics and high performance optimizations is lacking in the literature for these applications. By contrast, these features are well documented in the High Performance Computing (HPC) domain, and some of the techniques have potential performance benefits in the big data domain as well. This paper presents the implementation of a high performance big data analytics library - SPIDAL Java - with a comprehensive discussion of five performance challenges, solutions, and speedup results. SPIDAL Java captures a class of global machine learning applications with significant computation and communication that can serve as a yardstick in studying performance bottlenecks with Java big data analytics. The five challenges presented here are the cost of intra-node messaging, inefficient cache utilization, performance costs with threads, overhead of garbage collection, and the costs of heap-allocated objects. SPIDAL Java presents its solutions to these and demonstrates significant performance gains and scalability when running on up to 3072 cores on one of the latest Intel Haswell-based multicore clusters.
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic... (Databricks)
The physicists at CERN are increasingly turning to Spark to process large physics datasets in a distributed fashion, with the aim of reducing time-to-physics with increased interactivity. The physics data itself is stored in CERN’s mass storage system, EOS, and CERN’s IT department runs an on-premises private cloud based on OpenStack as a way to provide on-demand compute resources to physicists. This presents both opportunities and challenges for the Big Data team at CERN in providing elastic, scalable, reliable Spark-as-a-service on OpenStack.
The talk focuses on the design choices made and challenges faced while developing Spark-as-a-service over Kubernetes on OpenStack to simplify provisioning, automate management, and minimize the operating burden of managing Spark clusters. In addition, the service tooling simplifies submitting applications on behalf of users, mounting user-specified ConfigMaps, copying application logs to S3 buckets for troubleshooting, performance analysis, and accounting of Spark applications, and supports stateful Spark streaming applications. We will also share results from running large-scale sustained workloads over terabytes of physics data.
Big data appliance ecosystem: in-memory DB, Hadoop, analytics, data mining, and business intelligence with multiple data source charts, plus Twitter support and analysis.
Best practices at BGI for the Challenges in the Era of Big Genomics Data (Xing Xu)
BGI is the world's largest genome sequencing center, with over 150 sequencers and a sequencing throughput of 6 TB per day. It also has the largest computing and storage center for genomics in China, with over 20,000 CPU cores, 19 GPUs, 220+ teraflops of peak performance, and 17 petabytes of data storage. BGI faces challenges from the exponential growth of genomic data, complex data analysis processes, and widely distributed data. It addresses these challenges through solutions like high-speed data transfer, cloud computing platforms like EasyGenomics, and distributed algorithms and infrastructure using Hadoop and GPU acceleration.
Build a Time Series Application with Apache Spark and Apache HBase (Carol McDonald)
This document discusses using Apache Spark and Apache HBase to build a time series application. It provides an overview of time series data and requirements for ingesting, storing, and analyzing high volumes of time series data. The document then describes using Spark Streaming to process real-time data streams from sensors and storing the data in HBase. It outlines the steps in the lab exercise, which involves reading sensor data from files, converting it to objects, creating a Spark Streaming DStream, processing the DStream, and saving the data to HBase.
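A condensed sketch of the pipeline steps the lab describes, assuming an invented CSV sensor layout and using the third-party happybase client as a stand-in for the HBase API used in the lab; host and table names are placeholders.

```python
# Sketch of the lab's pipeline: parse sensor lines from a DStream and
# write rows to HBase. Host, table, and the CSV layout are assumptions.
import happybase
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="sensor-timeseries")
ssc = StreamingContext(sc, batchDuration=5)           # 5-second micro-batches

lines = ssc.textFileStream("/stream/sensor-input")    # files dropped here

def parse(line):
    sensor_id, ts, value = line.split(",")
    return sensor_id, ts, float(value)

def save_partition(rows):
    conn = happybase.Connection("hbase-host")         # one connection per partition
    table = conn.table("sensor")
    for sensor_id, ts, value in rows:
        # Row key = sensor id + timestamp, the usual time-series key design
        table.put(f"{sensor_id}_{ts}".encode(), {b"data:value": str(value).encode()})
    conn.close()

lines.map(parse).foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))
ssc.start()
ssc.awaitTermination()
```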
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu (Databricks)
XGBoost (https://github.com/dmlc/xgboost) is a library designed and optimized for tree boosting. XGBoost attracts users from a broad range of organizations in both industry and academia, and more than half of the winning solutions in machine learning challenges hosted at Kaggle adopt XGBoost.
While being one of the most popular machine learning systems, XGBoost is only one component in a complete data analytics pipeline. The data ETL/exploration/serving functionalities are built on top of more general data processing frameworks, like Apache Spark. As a result, users have to build a communication channel between Apache Spark and XGBoost (usually through HDFS) and face difficulties and inconveniences in navigating data and in developing and deploying applications.
We (Distributed (Deep) Machine Learning Community) develop XGBoost4J-Spark (https://github.com/dmlc/xgboost/tree/master/jvm-packages), which seamlessly integrates Apache Spark and XGBoost.
The communication channel between Spark and XGBoost is established based on RDDs/DataFrames/Datasets, all of which are standard data interfaces in Spark. Additionally, XGBoost can be embedded into a Spark MLlib pipeline and tuned through the tools provided by MLlib. In this talk, I will cover the motivation, history, design philosophy, and implementation details as well as the use cases of XGBoost4J-Spark. I expect that this talk will share insights on building a heterogeneous data analytics pipeline based on Spark and other data intelligence frameworks and bring more discussion on this topic.
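XGBoost4J-Spark itself is a JVM (Scala) library; as a rough Python-side analogue of the same embed-in-an-MLlib-pipeline idea, recent XGBoost releases ship a PySpark estimator. The schema and path below are invented for illustration.

```python
# Sketch of the embed-XGBoost-in-an-MLlib-pipeline idea, using the PySpark
# estimator shipped with recent XGBoost releases as a stand-in for XGBoost4J-Spark.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBClassifier   # requires xgboost >= 1.7

spark = SparkSession.builder.getOrCreate()
train = spark.read.parquet("/data/train")      # illustrative path; columns f1, f2, label

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
xgb = SparkXGBClassifier(features_col="features", label_col="label",
                         num_workers=4, max_depth=6)

# The estimator is a regular MLlib stage, so MLlib tuning tools apply to it
model = Pipeline(stages=[assembler, xgb]).fit(train)
model.transform(train).select("label", "prediction").show()
```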
Yahoo migrated most of its Pig workload from MapReduce to Tez to achieve significant performance improvements and resource utilization gains. Some key challenges in the migration included addressing misconfigurations, bad programming practices, and behavioral changes between the frameworks. Yahoo was able to run very large and complex Pig on Tez jobs involving hundreds of vertices and terabytes of data smoothly at scale. Further optimizations are still needed around speculative execution and container reuse to improve utilization even more. The migration to Tez resulted in up to 30% reduction in runtime, memory, and CPU usage for Yahoo's Pig workload.
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ... (Databricks)
Apache Spark performance on SQL and DataFrame/DataSet workloads has made impressive progress, thanks to Catalyst and Tungsten, but there is still a significant gap towards what is achievable by best-of-breed query engines or hand-written low-level C code on modern server-class hardware. This session presents Flare, a new experimental back-end for Spark SQL that yields significant speed-ups by compiling Catalyst query plans to native code.
Flare’s low-level implementation takes full advantage of native execution, using techniques such as NUMA-aware scheduling and data layouts to leverage ‘mechanical sympathy’ and bring execution closer to the metal than current JVM-based techniques on big memory machines. Thus, with available memory increasingly in the TB range, Flare makes scale-up on server-class hardware an interesting alternative to scaling out across a cluster, especially in terms of data center costs. This session will describe the design of Flare, and will demonstrate experiments on standard SQL benchmarks that exhibit order of magnitude speedups over Spark 2.1.
The document discusses big data visualization and visual analysis, focusing on the challenges and opportunities. It begins with an overview of visualization and then discusses several challenges in big data visualization, including integrating heterogeneous data from different sources and scales, dealing with data and task complexity, limited interaction capabilities for large data, scalability for both data and users, and the need for domain and development libraries/tools. It then provides examples of visualizing taxi GPS data and traffic patterns in Beijing to identify traffic jams.
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation (Revolution Analytics)
Slides from Joseph Rickert's presentation at Strata NYC 2013
"Using R and Hadoop for Statistical Computation at Scale"
http://strataconf.com/stratany2013/public/schedule/detail/30632
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang (Databricks)
At Facebook, millions of Hive queries are executed on a daily basis, and the workload contributes to important analytics that drive product decisions and insights. Spark SQL in Apache Spark provides much of the same functionality as Hive query language (HQL) more efficiently, and Facebook is building a framework to migrate existing production Hive workload to Spark SQL with minimal user intervention.
Before Facebook began large-scale migration to SparkSQL, they worked on identifying the gap between HQL and SparkSQL. They built an offline syntax analysis tool that parses, analyzes, optimizes and generates physical plans on daily HQL workload. In this session, they’ll share their results. After finding their syntactic analysis encouraging, they built tooling for offline semantic analysis where they run HQL queries in their Spark shadow cluster and validate the outputs. Output validation is necessary since the runtime behavior in Spark SQL may be different from HQL. They have built a migration framework that supports HQL in both Hive and Spark execution engines, can shadow and validate HQL workloads in Spark, and makes it easy for users to convert their workloads.
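A toy sketch of the output-validation idea, not Facebook's actual framework: run identical query text on a HiveServer2 endpoint and on Spark SQL against the same metastore, then compare the results. Hosts, table, and query are placeholders.

```python
# Toy sketch of shadow validation: execute identical HQL on Hive and on
# Spark SQL, then compare sorted outputs. Host, table, and query are placeholders.
from pyhive import hive                      # third-party HiveServer2 client
from pyspark.sql import SparkSession

QUERY = "SELECT dt, COUNT(*) FROM events GROUP BY dt"

# Hive (production) side
cursor = hive.connect(host="hiveserver2-host").cursor()
cursor.execute(QUERY)
hive_rows = sorted(cursor.fetchall())

# Spark SQL (shadow) side, against the same metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark_rows = sorted(tuple(r) for r in spark.sql(QUERY).collect())

# Real tooling needs tolerant comparison (float rounding, NULL ordering, ...),
# since runtime behavior in Spark SQL may legitimately differ from HQL.
print("outputs match:", hive_rows == spark_rows)
```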
Hyperspace: An Indexing Subsystem for Apache Spark (Databricks)
At Microsoft, we store datasets (both from internal teams and external customers) ranging from a few GBs to 100s of PBs in our data lake. The scope of analytics on these datasets ranges from traditional batch-style queries (e.g., OLAP) to explorative, ‘finding needle in a haystack’ type of queries (e.g., point-lookups, summarization etc.).
This work investigates the performance of Big Data applications in virtualized Hadoop environments. It presents an evaluation and comparison of the performance of applications running on a virtualized Hadoop cluster with separated data and computation layers against a standard Hadoop installation.
http://clds.sdsc.edu/wbdb2014.de/program
Managing Apache Spark Workload and Automatic Optimizing (Databricks)
eBay relies heavily on Spark as one of its most significant data engines. In the data warehouse domain, millions of batch queries run every day against 6000+ key DW tables, which contain over 22 PB of data (compressed) and keep growing every year. In the machine learning domain, Spark is playing a more and more significant role. We presented our migration from an MPP database to Apache Spark at last year's Europe Summit. Beyond that, from the perspective of the entire infrastructure, managing workload and efficiency for all Spark jobs across our data centers remains a big challenge. Our team leads the whole big data platform infrastructure and the management tools on top of it, helping our customers -- not only DW engineers and data scientists but also AI engineers -- to work from the same page. In this session, we will introduce how all of them benefit from a self-service workload management portal/system. First, we will share the basic architecture of this system to illustrate how it collects metrics from multiple data centers and how it detects abnormal workloads in real time; we developed a component called Profiler that extends Spark core to support customized metric collection. Next, we will demonstrate real user stories from eBay showing how the self-service system reduces effort on both the customer side and the infra-team side; that is the highlight on Spark job analysis and diagnosis. Finally, some upcoming advanced features will be introduced, describing an automatic optimizing workflow rather than just alerting.
Speaker: Lantao Jin
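As a toy illustration of the abnormal-workload-detection step described above (not eBay's actual Profiler logic), a z-score check over a job's historical durations might look like this; the metric source and the 3-sigma threshold are invented choices.

```python
# Toy sketch of abnormal-workload detection on collected job metrics:
# flag a run whose duration deviates strongly from its own history (z-score).
# The sample durations and the 3-sigma threshold are illustrative.
import statistics

def is_abnormal(history_secs, latest_secs, threshold=3.0):
    mean = statistics.mean(history_secs)
    std = statistics.stdev(history_secs)
    if std == 0:
        return latest_secs != mean
    return abs(latest_secs - mean) / std > threshold

past_runs = [310, 295, 330, 305, 320]     # previous durations of one batch query
print(is_abnormal(past_runs, 900))        # True: far above threshold, worth alerting
```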
Using Bayesian Optimization to Tune Machine Learning Models (Scott Clark)
1) Bayesian optimization can be used to efficiently tune the hyperparameters of machine learning models, requiring far fewer evaluations than standard random search or grid search methods to find good hyperparameters.
2) It builds a statistical model called a Gaussian process to model the objective function based on previous evaluations, and uses this to select the most promising hyperparameters to evaluate next in order to optimize an objective metric like accuracy.
3) SigOpt is a service that uses Bayesian optimization to tune machine learning models, outperforming expert humans on tasks like classifying images from CIFAR10 and reducing error rates more than standard methods.
Using Bayesian Optimization to Tune Machine Learning Models (SigOpt)
1. Tuning machine learning models is challenging due to the large number of non-intuitive hyperparameters.
2. Traditional tuning methods like grid search are computationally expensive and can find local optima rather than global optima.
3. Bayesian optimization uses Gaussian processes to build statistical models from prior evaluations to determine the most promising hyperparameters to test next, requiring far fewer evaluations than traditional methods to find better-performing models (a minimal sketch of this loop follows).
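A minimal sketch of the loop these summaries describe, assuming NumPy, SciPy, and scikit-learn are available; the objective function is a synthetic stand-in for a real model's validation score, and the learning-rate grid is an illustrative search space, not anything from SigOpt's actual service:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(lr):
    # Hypothetical validation score as a function of the learning rate;
    # a real use would train and evaluate a model here.
    return float(np.exp(-(np.log10(lr) + 2.0) ** 2))

candidates = np.logspace(-5, 0, 200).reshape(-1, 1)   # learning-rate search space
X = np.array([[1e-5], [1.0]])                         # two initial evaluations
y = np.array([objective(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(10):
    gp.fit(np.log10(X), y)                            # model past observations
    mu, sigma = gp.predict(np.log10(candidates), return_std=True)
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - y.max()) / sigma
    ei = (mu - y.max()) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    x_next = candidates[int(np.argmax(ei))]           # most promising point
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

print("best learning rate:", X[int(np.argmax(y))][0], "score:", y.max())
```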
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...huguk
This talk will describe his research into using Hadoop to query and manage big geographic datasets, specifically OpenStreetMap (OSM). OSM is an 'open-source' map of the world, growing at a rapid rate and currently around 5TB of data. The talk will introduce OSM, detail some aspects of the research, and discuss his experiences with using the SpatialHadoop stack on Azure and Google Cloud.
Maximizing Database Tuning in SAP SQL AnywhereSAP Technology
This session illustrates the different tools available in SQL Anywhere to analyze performance issues, as well as describes the most common types of performance problems encountered by database developers and administrators. We also take a look at various tips and techniques that will help boost the performance of your SQL Anywhere database.
Improving Query Processing Time of Olap Cube using Olap OperationsIRJET Journal
This document discusses improving the query processing time of OLAP cubes using OLAP operations. It analyzes the query processing times of OLAP cubes and different OLAP operations (roll-up, drill-down, slice, dice) on sample data. The results show that OLAP operations have significantly faster query processing times (2-5 seconds) compared to plain OLAP cubes (26 seconds). Therefore, applying OLAP operations on cubes can improve query efficiency and response times for analytical queries on multidimensional data.
This document discusses Bayesian global optimization and its application to tuning machine learning models. It begins by outlining some of the challenges of tuning ML models, such as the non-intuitive nature of the task. It then introduces Bayesian global optimization as an approach to efficiently search the hyperparameter space to find optimal configurations. The key aspects of Bayesian global optimization are described, including using Gaussian processes to build models of the objective function from sampled points and finding the next best point to sample via expected improvement. Several examples are provided demonstrating how Bayesian global optimization outperforms standard tuning methods in optimizing real-world ML tasks.
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016MLconf
Using Bayesian Optimization to Tune Machine Learning Models: In this talk we briefly introduce Bayesian Global Optimization as an efficient way to optimize machine learning model parameters, especially when evaluating different parameters is time-consuming or expensive. We will motivate the problem and give example applications.
We will also talk about our development of a robust benchmark suite for our algorithms including test selection, metric design, infrastructure architecture, visualization, and comparison to other standard and open source methods. We will discuss how this evaluation framework empowers our research engineers to confidently and quickly make changes to our core optimization engine.
We will end with an in-depth example of using these methods to tune the features and hyperparameters of a real world problem and give several real world applications.
Sempala - Interactive SPARQL Query Processing on HadoopAlexander Schätzle
Driven by initiatives like Schema.org, the amount of semantically annotated data is expected to grow steadily towards massive scale, requiring cluster-based solutions to query it. At the same time, Hadoop has become dominant in the area of Big Data processing with large infrastructures being already deployed and used in manifold application fields. For Hadoop-based applications, a common data pool (HDFS) provides many synergy benefits, making it very attractive to use these infrastructures for semantic data processing as well.
Indeed, existing SPARQL-on-Hadoop (MapReduce) approaches have already demonstrated very good scalability; however, query runtimes are rather slow due to the underlying batch processing framework. While this is acceptable for data-intensive queries, it is not satisfactory for the majority of SPARQL queries, which are typically much more selective and require only small subsets of the data.
In this paper, we present Sempala, a SPARQL-over-SQL-on-Hadoop approach designed with selective queries in mind. Our evaluation shows performance improvements by an order of magnitude compared to existing approaches, paving the way for interactive-time SPARQL query processing on Hadoop.
This document summarizes a project analyzing procurement data from the city of Los Angeles using Apache Spark. It includes an overview of the project, descriptions of Apache Spark and the Databricks cluster used for analysis. Several Spark SQL queries and visualizations are shown to determine expenses by year, department, item, and other factors. The conclusions recommend reducing transportation costs by building supplier plants closer to Los Angeles.
Auto-Pilot for Apache Spark Using Machine LearningDatabricks
At Qubole, users run Spark at scale in the cloud (900+ concurrent nodes). At that scale, tuning Spark configurations is essential for efficiently running SLA-critical jobs, but it remains a difficult undertaking, largely driven by trial and error. In this talk, we address the problem of auto-tuning SQL workloads on Spark; the same technique can also be adapted for non-SQL Spark workloads. In our earlier work [1], we proposed a model based on simple rules and insights. It was simple yet effective at optimizing queries and finding the right instance types to run them. With respect to auto-tuning Spark configurations, however, we saw room for improvement. On exploration, we found previous work addressing auto-tuning with machine learning techniques. One major drawback of the simple model [1] is that it cannot use multiple runs of a query to improve its recommendations, whereas the major drawback of machine learning techniques is that they lack domain-specific knowledge. Hence, we decided to combine both: our auto-tuner interacts with both models to arrive at good configurations, as sketched below. Once a user selects a query to auto-tune, the next configuration is computed from the models and the query is run with it. Metrics from the run's event log are fed back to the models to obtain the next configuration. The auto-tuner keeps exploring configurations until it exhausts the budget specified by the user. In practice, we found that this method yields much better configurations than those chosen even by experts on real workloads, and converges quickly to an optimal configuration. In this talk, we will present the novel ML model technique and how it was combined with our earlier approach. Results on real workloads will be presented, along with limitations and challenges in productionizing them. [1] Margoor et al., 'Automatic Tuning of SQL-on-Hadoop Engines', 2018, IEEE CLOUD
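A minimal, self-contained sketch of the budgeted tuning loop described above; the rule model, ML model, job runner, and the fake runtime metric are all hypothetical placeholders, not Qubole's actual implementation:

```python
import random

class RuleModel:
    """Hypothetical rule-based model: domain heuristics pick a starting point."""
    def suggest(self, history):
        return {"spark.executor.memory": "4g", "spark.sql.shuffle.partitions": 200}

class MLModel:
    """Hypothetical learned model: perturb the config based on past runs."""
    def refine(self, base, history):
        cfg = dict(base)
        if history:  # explore around the best configuration seen so far
            best_cfg, _ = min(history, key=lambda h: h[1])
            cfg["spark.sql.shuffle.partitions"] = max(
                50, best_cfg["spark.sql.shuffle.partitions"] + random.choice([-50, 0, 50]))
        return cfg

def run_spark_job(query, cfg):
    """Stand-in for launching the query and reading runtime from the event log."""
    return abs(cfg["spark.sql.shuffle.partitions"] - 400) / 10.0 + 60  # fake runtime (sec)

def auto_tune(query, budget_runs=10):
    rules, model, history = RuleModel(), MLModel(), []
    for _ in range(budget_runs):                      # fixed exploration budget
        cfg = model.refine(rules.suggest(history), history)
        history.append((cfg, run_spark_job(query, cfg)))
    return min(history, key=lambda h: h[1])           # best (config, runtime) pair

print(auto_tune("SELECT ..."))
```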
Using SigOpt to Tune Deep Learning Models with Nervana CloudSigOpt
This document discusses using SigOpt to tune deep learning models. It notes that tuning deep learning systems is non-intuitive and expert-intensive using traditional random search or grid search methods. SigOpt provides a more efficient approach using Bayesian optimization to suggest optimal hyperparameters after each trial, reducing wasted expert time and computation. The document provides examples applying SigOpt to tune convolutional neural networks on CIFAR10, demonstrating a 1.6% reduction in error rate over expert tuning with no wasted trials.
How KeyBank Used Elastic to Build an Enterprise Monitoring SolutionElasticsearch
KeyBank is using an iterative design approach to scale their end-to-end enterprise monitoring system with Kafka and Elasticsearch at its core. See how they did it and the lessons learned along the way.
The document summarizes the results of a study that evaluated the performance of different Platform-as-a-Service offerings for running SQL on Hadoop workloads. The study tested Amazon EMR, Google Cloud DataProc, Microsoft Azure HDInsight, and Rackspace Cloud Big Data using the TPC-H benchmark at various data sizes up to 1 terabyte. It found that at 1TB, lower-end systems had poorer performance. In general, HDInsight running on D4 instances and Rackspace Cloud Big Data on dedicated hardware had the best scalability and execution times. The study provides insights into the performance, scalability, and price-performance of running SQL on Hadoop in the cloud.
Adopting software design practices for better machine learningMLconf
Jeff McGehee discusses Google's Rules of Machine Learning and introduces the concept of Speed of Delivery and Quality of Product (SoDQoP) for evaluating machine learning projects over time. He argues that process is the area with the most room for improvement in machine learning. The document provides recommendations for building lean machine learning systems through fast iteration, leveraging existing machine learning APIs, and making it easy to improve models with new data.
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
If you want to do multi-dimensional analysis on large data sets (billion+ rows) with low query latency (sub-second), Kylin is a good option. Kylin also provides seamless integration with existing BI tools (e.g., Tableau).
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
1. Performance Evaluation of
Spark SQL using BigBench
Todor Ivanov and Max-Georg Beer
Frankfurt Big Data Lab
Goethe University Frankfurt am Main, Germany
http://www.bigdata.uni-frankfurt.de/
6th Workshop on Big Data Benchmarking 2015
June 16th – 17th, Toronto, Canada
2. Agenda
• Motivation & Research Objectives
• Towards BigBench on Spark
– Our Experience with BigBench
– Lessons Learned
• Data Scalability Experiments
– Cluster Setup & Configuration
– BigBench on MapReduce
– BigBench on Spark SQL
– Hive & Spark SQL Comparison
• Next Steps
6th Workshop on Big Data Benchmarking 2015 2
3. Motivation
• "Towards A Complete BigBench Implementation" by Tilmann Rabl @ WBDB 2014
– end-to-end, application-level, analytical big data benchmark
– technology agnostic
– based on TPC-DS
– consists of 30 queries
• Implementation for the Hadoop Ecosystem
– https://github.com/intel-hadoop/Big-Bench
What about implementing BigBench on Spark?
6th Workshop on Big Data Benchmarking 2015 3
[Figure: BigBench Logical Data Schema]
4. Research Objectives
• Understand and experiment with BigBench on MapReduce
• Implement & run BigBench on Spark
• Evaluate and compare both BigBench implementations
6th Workshop on Big Data Benchmarking 2015 4
5. Agenda
• Motivation & Research Objectives
• Towards BigBench on Spark
– Our Experience with BigBench
– Lessons Learned
• Data Scalability Experiments
– Cluster Setup & Configuration
– BigBench on MapReduce
– BigBench on Spark SQL
– Hive & Spark SQL Comparison
• Next Steps
6th Workshop on Big Data Benchmarking 2015 5
6. Towards BigBench on Spark
• Analyse the different query groups in BigBench
Evaluate the Data Scalability of the BigBench queries.
• The largest group consists of 14 pure HiveQL queries
• Spark SQL supports the HiveQL syntax
Compare the performance of Hive and Spark SQL using the HiveQL queries (a minimal example of running HiveQL on Spark SQL follows the table below).
6th Workshop on Big Data Benchmarking 2015 6
Query Type                       | Queries                                              | Number of Queries
Pure HiveQL                      | 6, 7, 9, 11, 12, 13, 14, 15, 16, 17, 21, 22, 23, 24  | 14
Java MapReduce with HiveQL       | 1, 2                                                 | 2
Python Streaming MR with HiveQL  | 3, 4, 8, 29, 30                                      | 5
Mahout (Java MR) with HiveQL     | 5, 20, 25, 26, 28                                    | 5
OpenNLP (Java MR) with HiveQL    | 10, 18, 19, 27                                       | 4
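Because Spark SQL accepts HiveQL syntax, the 14 pure HiveQL queries can be submitted unchanged through a HiveContext. A minimal sketch against the Spark 1.x API used in these experiments; the query shown is a generic aggregation over the TPC-DS-style item table, not one of the actual BigBench queries:

```python
# Minimal sketch: running a HiveQL query unchanged on Spark SQL (Spark 1.x API).
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="BigBenchHiveQLOnSpark")
sqlContext = HiveContext(sc)  # reads the same Hive metastore and tables as Hive

result = sqlContext.sql("""
    SELECT i_category, COUNT(*) AS cnt
    FROM item
    GROUP BY i_category
    ORDER BY cnt DESC
""")
result.show()
sc.stop()
```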
7. Lessons Learned
Our BigBench on MapReduce experiments showed:
• The OpenNLP queries (Q19, Q10) scale best with the increase of the data size.
• Q27 (OpenNLP) is not suitable for scalability comparison.
• A subset of the Python Streaming (MR) queries (Q4, Q30, Q3) shows the worst scaling behavior.
Comparing Hive and Spark SQL we observed:
• A group of Spark SQL queries (Q7, Q16, Q21, Q22, Q23 and Q24) does not scale properly with the increase of the data size; a possible reason is join optimization issues.
• For the stable HiveQL queries (Q6, Q9, Q11, Q12, Q13, Q14, Q15 and Q17), Spark SQL performs between 1.5x and 6.3x faster than Hive.
6th Workshop on Big Data Benchmarking 2015 7
8. Our Experience with BigBench
• Validating the Spark SQL query results
– Empty query results
– Non-deterministic end results (OpenNLP and Mahout)
– No reference results are available
• BigBench Setup: https://github.com/BigData-Lab-Frankfurt/Big-Bench-Setup
– Executing single or subset of queries
– Gather execution times, row counts and sample values from result tables (a sketch of this step follows the table below)
6th Workshop on Big Data Benchmarking 2015 8
Query | Row Count SF 100 | Row Count SF 300 | Row Count SF 600 | Row Count SF 1000 | Sample Row
Q1    | 0                | 0                | 0                | 0                 |
Q2    | 1288             | 1837             | 1812             | 1669              | 1415 41 1
Q3    | 131              | 426              | 887              | 1415              | 20 5809 1
Q4    | 73926146         | 233959972        | 468803001        | 795252823         | 0_1199 1
Q5    | logRegResult.txt: AUC = 0.50; confusion: [[0.0, 0.0], [1.0, 3129856.0]]; entropy: [[-0.7, -0.7], [-0.7, -0.7]]
…     | …                | …                | …                | …                 | …
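A hypothetical sketch of the validation step described above: for each query's result table, record the row count and one sample row so runs at different scale factors (and on Hive vs. Spark SQL) can be compared. The result table names are illustrative placeholders, not the actual BigBench table names:

```python
import time
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="BigBenchResultValidation")
sqlContext = HiveContext(sc)

for table in ["q1_result", "q2_result"]:      # hypothetical result table names
    start = time.time()
    df = sqlContext.table(table)
    count = df.count()                        # row count of the result table
    sample = df.first() if count > 0 else None
    print(table, count, sample, "%.1fs" % (time.time() - start))
sc.stop()
```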
9. Agenda
• Motivation & Research Objectives
• Towards BigBench on Spark
– Our Experience with BigBench
– Lessons Learned
• Data Scalability Experiments
– Cluster Setup & Configuration
– BigBench on MapReduce
– BigBench on Spark SQL
– Hive & Spark SQL Comparison
• Next Steps
6th Workshop on Big Data Benchmarking 2015 9
10. Cluster Setup
• Operating System: Ubuntu Server 14.04.1 LTS
• Cloudera’s Hadoop Distribution - CDH 5.2
• Replication Factor of 2 (only 3 worker nodes)
• Hive version 0.13.1
• Spark version 1.4.0-SNAPSHOT (March 27th 2015)
• BigBench & Scripts (https://github.com/BigData-Lab-Frankfurt/Big-Bench-Setup)
• 3 test repetitions
• Performance Analysis Tool (PAT) (https://github.com/intel-hadoop/PAT)
6th Workshop on Big Data Benchmarking 2015 10
Setup Description               | Summary
Total Nodes                     | 4 x Dell PowerEdge T420
Total Processors/Cores/Threads  | 5 CPUs / 30 Cores / 60 Threads
Total Memory                    | 4 x 32GB = 128 GB
Total Number of Disks           | 13 x 1TB, SATA, 3.5 in, 7.2K RPM, 64MB Cache
Total Storage Capacity          | 13 TB
Network                         | 1 GBit Ethernet
11. Cluster Configuration
• Optimizing cluster performance can be a very time-consuming process.
• Following the best practices published by Sandy Ryza (Cloudera):
– “How-to: Tune Your Apache Spark Jobs”, http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
6th Workshop on Big Data Benchmarking 2015 11
Component  | Parameter                             | Value
YARN       | yarn.nodemanager.resource.memory-mb   | 31GB
YARN       | yarn.scheduler.maximum-allocation-mb  | 31GB
YARN       | yarn.nodemanager.resource.cpu-vcores  | 11
Spark      | master                                | yarn
Spark      | num-executors                         | 9
Spark      | executor-cores                        | 3
Spark      | executor-memory                       | 9GB
Spark      | spark.serializer                      | org.apache.spark.serializer.KryoSerializer
MapReduce  | mapreduce.map.java.opts.max.heap      | 2GB
MapReduce  | mapreduce.reduce.java.opts.max.heap   | 2GB
MapReduce  | mapreduce.map.memory.mb               | 3GB
MapReduce  | mapreduce.reduce.memory.mb            | 3GB
Hive       | hive.auto.convert.join (Q9 only)      | true
Hive       | Client Java Heap Size                 | 2GB
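For reference, a sketch of how the Spark settings above map onto the standard configuration keys (equivalently, the --num-executors, --executor-cores and --executor-memory flags of spark-submit on YARN); the application name and client deploy mode are assumptions for illustration:

```python
# Sketch: expressing the table's Spark settings as standard configuration keys.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")                      # "master yarn" in the table
        .set("spark.executor.instances", "9")          # num-executors
        .set("spark.executor.cores", "3")              # executor-cores
        .set("spark.executor.memory", "9g")            # executor-memory
        .set("spark.serializer",
             "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf, appName="BigBenchSparkSQL")
```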
12. Agenda
• Motivation & Research Objectives
• Towards BigBench on Spark
– Our Experience with BigBench
– Lessons Learned
• Data Scalability Experiments
– Cluster Setup & Configuration
– BigBench on MapReduce
– BigBench on Spark SQL
– Hive & Spark SQL Comparison
• Next Steps
6th Workshop on Big Data Benchmarking 2015 12
13. BigBench on MapReduce
• Tested Scale Factors: 100 GB, 300 GB, 600 GB and 1TB
• Times normalized with respect to 100GB SF as baseline.
• Longer normalized times indicate slower execution with the increase of the data size.
• Shorter normalized times indicate better scalability with the increase of the data size.
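A small sketch of the normalization used in these plots: each query's runtime at scale factor SF is divided by its 100GB runtime, and the "Linear" reference is simply SF/100 (3x, 6x, 10x). The runtimes below are illustrative values, not measurements from the experiments:

```python
# Normalized time = T(SF) / T(100GB); compared against the linear reference SF/100.
baseline_sf = 100
times = {100: 120.0, 300: 290.0, 600: 540.0, 1000: 900.0}  # seconds, example only

for sf, t in sorted(times.items()):
    normalized = t / times[baseline_sf]
    linear = sf / baseline_sf
    verdict = "better than linear" if normalized < linear else "worse than linear"
    print(f"SF {sf:4d}GB: normalized {normalized:.2f} vs linear {linear:.1f} -> {verdict}")
```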
6th Workshop on Big Data Benchmarking 2015 13
[Figure: Normalized BigBench times with respect to the baseline 100GB scale factor; series: 300GB, 600GB, 1TB with Linear 300GB, Linear 600GB and Linear 1TB reference lines.]
14. BigBench on MapReduce – worst scalability
• Tested Scale Factors: 100 GB, 300 GB, 600 GB and 1TB
• Times normalized with respect to 100GB SF as baseline.
• Group A: Q4, Q30, Q3 (Python Streaming) and Q5 (Mahout) show the worst scaling behavior.
6th Workshop on Big Data Benchmarking 2015 14
[Figure: Normalized BigBench + MapReduce times with respect to the baseline 100GB SF; series: 300GB, 600GB, 1TB with Linear 300GB, Linear 600GB and Linear 1TB reference lines.]
15. Group A: Analysis of Q4 (Python) & Q5 (Mahout)
Scale Factor: 1TB          | Q4 (Python Streaming)                       | Q5 (Mahout)
Average Runtime (minutes)  | 929                                         | 273
Avg. CPU Utilization %     | 48.82 (User); 3.31 (System); 4.98 (IOwait)  | 51.50 (User); 3.37 (System); 3.65 (IOwait)
Avg. Memory Utilization %  | 95.99                                       | 91.85
6th Workshop on Big Data Benchmarking 2015 15
• Q4 is memory bound with 96% utilization and around 5% IOwait, which means that the CPU is waiting for outstanding disk I/O requests.
• Q5 is memory bound with around 92% utilization. The Mahout execution occupies only the last 18 minutes of the query and utilizes very few resources.
[Figures: CPU utilization % over time for Q4 (Python) and Q5 (Mahout) at 1TB; series: IOwait %, User %, System %. An annotation marks where the Mahout execution starts near the end of the Q5 run.]
16. BigBench on MapReduce – best scalability
• Tested Scale Factors: 100 GB, 300 GB, 600 GB and 1TB
• Times normalized with respect to 100GB SF as baseline.
• Group B: Q27, Q19, Q10 (OpenNLP) and Q23 (HiveQL) show the best scaling behavior.
6th Workshop on Big Data Benchmarking 2015 16
[Figure: Normalized BigBench + MapReduce times with respect to the baseline 100GB SF; series: 300GB, 600GB, 1TB with Linear 300GB, Linear 600GB and Linear 1TB reference lines.]
17. Group B: Analysis of Q27 (OpenNLP)
• Q27 keeps the system underutilized and outputs non-deterministic values.
6th Workshop on Big Data Benchmarking 2015 17
Scale Factor: 1TB                   | Q27 (OpenNLP)
Input Data Size / Number of Tables  | 2GB / 1 table
Average Runtime (minutes)           | 0.7
Avg. CPU Utilization %              | 10.03 (User); 1.94 (System); 1.29 (IOwait)
Avg. Memory Utilization %           | 27.19

Scale Factor                    | 100GB | 300GB | 600GB | 1TB
Number of rows in result table  | 1     | 0     | 3     | 0
Time (minutes)                  | 0.91  | 0.63  | 0.98  | 0.70
[Figures: CPU utilization % and memory utilization % over time for Q27 at 1TB; series: IOwait %, User %, System %.]
18. Group B: Analysis of Q18 (OpenNLP)
• Q18 is memory bound with around 90% utilization and high CPU usage of 56%.
6th Workshop on Big Data Benchmarking 2015 18
Scale Factor: 1TB                   | Q18 (OpenNLP)
Input Data Size / Number of Tables  | 71GB / 3 tables
Average Runtime (minutes)           | 28
Avg. CPU Utilization %              | 55.99 (User); 2.04 (System); 0.31 (IOwait)
Avg. Memory Utilization %           | 90.22
[Figures: CPU utilization % and memory utilization % over time for Q18 at 1TB; series: IOwait %, User %, System %.]
19. Agenda
• Motivation & Research Objectives
• Towards BigBench on Spark
– Our Experience with BigBench
– Lessons Learned
• Data Scalability Experiments
– Cluster Setup & Configuration
– BigBench on MapReduce
– BigBench on Spark SQL
– Hive & Spark SQL Comparison
• Next Steps
6th Workshop on Big Data Benchmarking 2015 19
20. BigBench on Spark SQL – worst scalability
• Test the group of 14 pure HiveQL queries.
• Tested Scale Factors: 100 GB, 300 GB, 600 GB and 1TB
• Times normalized with respect to 100GB SF as baseline.
• Group A: Q24, Q21, Q16 and Q7 achieve the worst data scalability behavior.
• A possible reason for the Group A behavior is reported in SPARK-2211 (join optimization).
6th Workshop on Big Data Benchmarking 2015 20
[Figure: Normalized BigBench + Spark SQL times with respect to the baseline 100GB SF; series: 300GB, 600GB, 1TB with Linear 300GB, Linear 600GB and Linear 1TB reference lines.]
21. BigBench on Spark SQL – best scalability
• Test the group of 14 pure HiveQL queries.
• Tested Scale Factors: 100 GB, 300 GB, 600 GB and 1TB
• Times normalized with respect to 100GB SF as baseline.
• Group B: Q15, Q11, Q9 and Q14 achieve the best data scalability behavior.
6th Workshop on Big Data Benchmarking 2015 21
[Figure: Normalized BigBench + Spark SQL times with respect to the baseline 100GB SF; series: 300GB, 600GB, 1TB with Linear 300GB, Linear 600GB and Linear 1TB reference lines.]
22. Agenda
• Motivation & Research Objectives
• Towards BigBench on Spark
– Our Experience with BigBench
– Lessons Learned
• Data Scalability Experiments
– Cluster Setup & Configuration
– BigBench on MapReduce
– BigBench on Spark SQL
– Hive & Spark SQL Comparison
• Next Steps
6th Workshop on Big Data Benchmarking 2015 22
23. Hive & Spark SQL Comparison (1)
• Calculate the Hive to Spark SQL ratio (%): ((HiveTime * 100) / SparkTime) - 100
• Group 1: Q7, Q16, Q21, Q22, Q23 and Q24 drastically increase their Spark SQL execution time for the larger data sets.
• The complex join issues are described in SPARK-2211 (https://issues.apache.org/jira/browse/SPARK-2211).
6th Workshop on Big Data Benchmarking 2015 23
Scale Factor | Q6  | Q7  | Q9  | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 | Q17 | Q21 | Q22 | Q23 | Q24
100GB        | 150 | 257 | 152 | 148 | 259 | 245 | 156 | 46  | 70  | 387 | 71  | -55 | 9   | 44
300GB        | 204 | 180 | 284 | 234 | 279 | 262 | 251 | 89  | 88  | 398 | -35 | -68 | -24 | -54
600GB        | 246 | 37  | 398 | 344 | 279 | 263 | 328 | 132 | 25  | 402 | -62 | -78 | -55 | -76
1TB          | 279 | 13  | 528 | 443 | 295 | 278 | 389 | 170 | 12  | 423 | -69 | -76 | -64 | -81
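The ratio metric from the table above, checked against the rounded Q7 and Q9 runtimes reported on the later analysis slides (the table's exact percentages come from the unrounded measurements):

```python
def hive_to_spark_ratio(hive_minutes, spark_minutes):
    # Positive values: Spark SQL is faster; negative values: Hive is faster.
    return hive_minutes * 100.0 / spark_minutes - 100.0

print(hive_to_spark_ratio(46, 41))  # Q7 at 1TB: ~12.2, matching the reported ~13%
print(hive_to_spark_ratio(18, 3))   # Q9 at 1TB: 500, i.e. Spark SQL ~6x faster
```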
[Figure: Hive to Spark SQL query time ratio (%) per query and scale factor, defined as ((HiveTime * 100) / SparkTime) - 100; y-axis from -100 to 500.]
24. Group 1: Analysis of Q7 (HiveQL)
Scale Factor: 1TB          | Hive                                        | Spark SQL
Average Runtime (minutes)  | 46                                          | 41
Avg. CPU Utilization %     | 56.97 (User); 3.89 (System); 0.40 (IOwait)  | 16.65 (User); 2.62 (System); 21.28 (IOwait)
Avg. Memory Utilization %  | 94.33                                       | 93.78
6th Workshop on Big Data Benchmarking 2015 24
• Q7 is only 13% slower on Hive compared to Spark SQL.
• In Q7, Spark SQL spends around 21% of the CPU time (IOwait) waiting for outstanding disk I/O requests and efficiently utilizes only around 17% of the CPU.
[Figures: CPU utilization % over time for Q7 on Hive and on Spark SQL at 1TB; series: IOwait %, User %, System %.]
25. Hive & Spark SQL Comparison (2)
• Group 2: Q12, Q13 and Q17 show modest performance improvement with the increase of the data size.
6th Workshop on Big Data Benchmarking 2015 25
[Figure: the same Hive to Spark SQL query time ratio table and chart as on slide 23, here highlighting Group 2.]
26. Hive & Spark SQL Comparison (3)
• Group 3: Q6, Q9, Q11, Q14 and Q15 perform between 46% and 528% faster on Spark SQL than on Hive.
6th Workshop on Big Data Benchmarking 2015 26
[Figure: the same Hive to Spark SQL query time ratio table and chart as on slide 23, here highlighting Group 3.]
27. Group 3: Analysis of Q9 (HiveQL)
• Spark SQL is 6 times faster than Hive.
• Hive utilizes on average 60% CPU and 78% memory, whereas Spark SQL consumes on average 28% CPU and 61% memory.
6th Workshop on Big Data Benchmarking 2015 27
Scale Factor: 1TB          | Hive                                        | Spark SQL
Average Runtime (minutes)  | 18                                          | 3
Avg. CPU Utilization %     | 60.34 (User); 3.44 (System); 0.38 (IOwait)  | 27.87 (User); 2.22 (System); 4.09 (IOwait)
Avg. Memory Utilization %  | 78.87                                       | 61.27
[Figures: CPU utilization % over time for Q9 on Hive and on Spark SQL at 1TB; series: IOwait %, User %, System %.]
29. Acknowledgments
• Fields Institute – Research in Mathematical Sciences
• SPEC Research Big Data Working Group
• Tilmann Rabl (University of Toronto/Bankmark UG)
• John Poelman (IBM)
• Yi Yao Joshua & Bhaskar Gowda (Intel)
• Marten Rosselli, Karsten Tolle, Roberto V. Zicari & Raik Niemann (Frankfurt Big Data Lab)
6th Workshop on Big Data Benchmarking 2015 29