SlideShare a Scribd company logo
1 of 28
Download to read offline
The state of in the cloud
Nicolas Poggi
Oct 2017
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_
/_/
Outline
1. Intro
1. PaaS Cloud
2. BigBench
2. Part I – Scalability
1. Hive vs. Spark
3. Part II – Additional experiments
1. Versions, Concurrency, 10TB
4. Summary
2
Motivation
• 2016 SQL-on-Hadoop paper and presentations
• Focused on Hive, due to SparkSQL not being ready to use in PaaS
• Used benchmark (TPC-H)
• Early 2017, BigBench testing Hive and Spark and TPC-TC paper
• New code available in May for MLlib2 compatibility
• Goals:
Evaluate the current out-of-the-box experience of Spark v2 in PaaS cloud
• Using Hive as baseline
3
Platform-as-a-Service Spark
• Simplified management
• Cloud-based managed Hadoop services
• Ready to use Spark, Hive, …
• Deploys in minutes, on-demand, elastic
• Pay-as-you-go pricing model
• Decoupled compute and storage
• Optimized for general purpose
• Fined tuned to the cloud provider architecture
4
Surveyed PaaS services
• Amazon Elastic Map Reduce (EMR)
• Released: Apr 2009
• OS: Amazon Linux AMI (RHEL-like)
• Spark 2.1.0 and Hive 2.1 (Tez)
• VM: M4.2xlarge (32GB RAM)
• Google Cloud DataProc (GCD)
• Released: Feb 2016
• OS: Debian 8.4
• Spark 2.1.0 (preview), Hive 2.1 (M/R)
• VM: n1-standard-8 (30GB RAM)
• Azure HDInsight (HDI)
• Released: Oct 2013
• OS: Ubuntu 16.04 (HDP-based)
• Spark 2.1.0 and 1.6.3, Hive 1.2 (Tez)
• VM: D4v2 (28GB RAM)
• Target deployment 128-cores:
• 16 data nodes with 8-cores each
• Master node with 16-cores
• Decoupled storage only
• EBS, WASB, GCS
5
What is BigBench (TPCx-BB)
• End-to-end application level benchmark specification
• result of many years of collaboration of industry and academia
• Covers most Big Data Analytical properties (3Vs)
• 30 business use cases for a retailer company
• Merchandizing,
• pricing,
• customers …
• Defines data scale factors
• 1GB to PBs
6
Retailer database
Sequential Hive vs Spark 2.1
Queries 1-30 on Spark 2.1 (power runs)
Query 1 Query 2 …. Query 30
Welcome to
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_ version 2.1.0
/_/
BB 1GB-1TB Scalability Dataproc:
Hive 2.1 (M/R) vs Spark 2.1
BB 1GB-1TB Scalability EMR:
Hive 2.1 (Tez) vs Spark 2.1
BB 1GB-1TB Scalability HDI:
Hive 1.2 (Tez) vs Spark 2.1
BB 1GB-1TB Scalability: Hive vs Spark 2.1
All providers
BB 1TB Power runs : Hive vs Spark 2.1 (ALL)
CPU % Q5 (ML) in Hive and Spark (HDI)
13
• Hive (MLlib2) • Spark (MLlib2)
Time (s) Time (s) - 2X faster
Radar charts – query characterization
• Useful for displaying multivariate
data (5 resources)
• Quickly identify similarities and
differences.
• From example
• Hive and Spark
• Only Disk Write is similar
• Hive consumes more MEM and CPU
• Spark read more from disk (DISK_R)
• And moderately more network
Sample radar chart for Q7 in EMR at 1TB 14
BB 1TB Query 5 (ML) providers comparison
Hive (MLlib2) Spark (MLlib2)
15
Other comparisons:
10TB SQL-Only
2.0.2 vs 2.1.0
1.6.3 vs 2.1.0
MLlib v1 vs v2
16
BB 1GB-10TB Scalability SQL-only queries
Hive
Spark
BigBench 1GB-1TB: Spark 2.0.2 vs 2.1.0 (CDP)
Notes:
Spark 2.1 a bit faster at
small scales, slower at
100 GB and 1 TB on the
UDF/NLP queries
2.1 faster up
to 100GB
Slower at 1TB
BigBench 1GB-1TB: Spark 1.6.3 vs 2.1.0
MLlib 1 vs 2.1 MLlib 2(HDI)
Notes:
• Spark 2.1 is always
faster than 1.6.3 in
HDI
• MLlilb 2 using
dataframes over RDDs
is only slightly faster
than V1.
Concurrency runs (throughput)
SQL-only: 100GB – 1TB 512-core cluster
2020
BB Throughput at 100GB 8 streams SQL-only
(512-cores)
BB Throughput at 1TB 4 streams SQL-only
(512-cores)
Conclusions
• All providers have up to date (2.1.0) and well tuned versions of Spark
• They could run BigBench up to 1TB on medium-sized cluster,
• [Almost] Out-of-the box
• Performance similar among providers for similar cluster types and disk configs
• Difference according to scale (and pricing)
• Spark 2.1.0 is faster than previous versions
• Also MLlib 2 with dataframes
• But improvements within the 30% range
• Hive (+Tez + MLlib) are still slightly faster than Spark at lower scales for sequential
• But Spark significantly faster at high data scales and concurrency
• BigBench has been useful to stress a cluster with different workloads
• Highlights config problems fast and stresses scale limits
• Helpful for tuning the clusters
23
Resources and references
BigBench and ALOJA
• BigBench Spark 2 branch (thanks Christoph
and Michael from bankmark.de):
• https://github.com/carabolic/Big-Data-
Benchmark-for-Big-Bench/tree/spark2
• Original BigBench Implementation
repository
• https://github.com/intel-hadoop/Big-Data-
Benchmark-for-Big-Bench
• ALOJA benchmarking platform
• https://github.com/Aloja/aloja
• ALOJA fork of BigBench (adds support for
HDI and fixes spark)
• https://github.com/Aloja/Big-Data-Benchmark-
for-Big-Bench
Papers and slides
• https://www.slideshare.net/ni_po
• Characterizing TPCx-BB Queries,
Hive, and Spark in Multi-Cloud
Environments – N. Poggi et. Al
• TPC-TC 2017
• The State of SQL-on-Hadoop in the
Cloud – N. Poggi et. al.
• IEEE Big Data 2016
• https://doi.org/10.1109/BigData.2016
.7840751
24
Thanks, questions?
Follow up / feedback : Npoggi@ac.upc.edu
Twitter: ni_po
The state of in the cloud
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_
/_/
Extra slides
26
BB 1TB Query 2 (M/R) providers comparison
Hive Spark
27
Spark config
EMR CDP HDI
Java version OpenJDK 1.8.0_121 OpenJDK 1.8.0_121 OpenJDK 1.8.0_131
Spark version 2.1.0 2.1 2.1.0.2.6.0.2-76
Driver memory 5G 5G 5G
Executor memory 5G 10G 4G
Executor cores 4 4 3
Executor instances Dynamic Dynamic 20
dynamicAllocation
enabled
TRUE TRUE FALSE
Executor
memoryOverhead
Default (384MB) 1,117 MB Default (384MB)
28

More Related Content

What's hot

Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Graham Wihlidal
 
이승재, 마비노기 듀얼: 분산 데이터베이스 트랜잭션 설계와 구현, NDC2015
이승재, 마비노기 듀얼: 분산 데이터베이스 트랜잭션 설계와 구현, NDC2015이승재, 마비노기 듀얼: 분산 데이터베이스 트랜잭션 설계와 구현, NDC2015
이승재, 마비노기 듀얼: 분산 데이터베이스 트랜잭션 설계와 구현, NDC2015devCAT Studio, NEXON
 
Optimize the performance, cost, and value of databases.pptx
Optimize the performance, cost, and value of databases.pptxOptimize the performance, cost, and value of databases.pptx
Optimize the performance, cost, and value of databases.pptxIDERA Software
 
IndirectDraw with unity
IndirectDraw with unityIndirectDraw with unity
IndirectDraw with unityJung Suk Ko
 
딥 러닝 자연어 처리를 학습을 위한 파워포인트. (Deep Learning for Natural Language Processing)
딥 러닝 자연어 처리를 학습을 위한 파워포인트. (Deep Learning for Natural Language Processing)딥 러닝 자연어 처리를 학습을 위한 파워포인트. (Deep Learning for Natural Language Processing)
딥 러닝 자연어 처리를 학습을 위한 파워포인트. (Deep Learning for Natural Language Processing)WON JOON YOO
 
Azure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsThomas Sykes
 
How to Choose the Right Database for Your Workloads
How to Choose the Right Database for Your WorkloadsHow to Choose the Right Database for Your Workloads
How to Choose the Right Database for Your WorkloadsInfluxData
 
Azure Data Factory presentation with links
Azure Data Factory presentation with linksAzure Data Factory presentation with links
Azure Data Factory presentation with linksChris Testa-O'Neill
 
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...StreamNative
 
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - Module 3
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - Module 3Deep Learning A-Z™: Recurrent Neural Networks (RNN) - Module 3
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - Module 3Kirill Eremenko
 
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)Cathrine Wilhelmsen
 
Introduction to NOSQL databases
Introduction to NOSQL databasesIntroduction to NOSQL databases
Introduction to NOSQL databasesAshwani Kumar
 
DDD架構設計
DDD架構設計DDD架構設計
DDD架構設計國昭 張
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Cathrine Wilhelmsen
 
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)Fellowship at Vodafone FutureLab
 
Mengenal Lebih Jauh Tentang DevOps
Mengenal Lebih Jauh Tentang DevOpsMengenal Lebih Jauh Tentang DevOps
Mengenal Lebih Jauh Tentang DevOpsTerabitKomputer
 
Google Kubernetes Engine (GKE)
Google Kubernetes Engine (GKE)Google Kubernetes Engine (GKE)
Google Kubernetes Engine (GKE)Edith Puclla
 

What's hot (20)

Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016
 
Data lineage
Data lineageData lineage
Data lineage
 
이승재, 마비노기 듀얼: 분산 데이터베이스 트랜잭션 설계와 구현, NDC2015
이승재, 마비노기 듀얼: 분산 데이터베이스 트랜잭션 설계와 구현, NDC2015이승재, 마비노기 듀얼: 분산 데이터베이스 트랜잭션 설계와 구현, NDC2015
이승재, 마비노기 듀얼: 분산 데이터베이스 트랜잭션 설계와 구현, NDC2015
 
Optimize the performance, cost, and value of databases.pptx
Optimize the performance, cost, and value of databases.pptxOptimize the performance, cost, and value of databases.pptx
Optimize the performance, cost, and value of databases.pptx
 
Data Federation
Data FederationData Federation
Data Federation
 
IndirectDraw with unity
IndirectDraw with unityIndirectDraw with unity
IndirectDraw with unity
 
딥 러닝 자연어 처리를 학습을 위한 파워포인트. (Deep Learning for Natural Language Processing)
딥 러닝 자연어 처리를 학습을 위한 파워포인트. (Deep Learning for Natural Language Processing)딥 러닝 자연어 처리를 학습을 위한 파워포인트. (Deep Learning for Natural Language Processing)
딥 러닝 자연어 처리를 학습을 위한 파워포인트. (Deep Learning for Natural Language Processing)
 
Azure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data Flows
 
How to Choose the Right Database for Your Workloads
How to Choose the Right Database for Your WorkloadsHow to Choose the Right Database for Your Workloads
How to Choose the Right Database for Your Workloads
 
Azure Data Factory presentation with links
Azure Data Factory presentation with linksAzure Data Factory presentation with links
Azure Data Factory presentation with links
 
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
 
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - Module 3
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - Module 3Deep Learning A-Z™: Recurrent Neural Networks (RNN) - Module 3
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - Module 3
 
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
 
Introduction to NOSQL databases
Introduction to NOSQL databasesIntroduction to NOSQL databases
Introduction to NOSQL databases
 
DDD架構設計
DDD架構設計DDD架構設計
DDD架構設計
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
 
Mengenal Lebih Jauh Tentang DevOps
Mengenal Lebih Jauh Tentang DevOpsMengenal Lebih Jauh Tentang DevOps
Mengenal Lebih Jauh Tentang DevOps
 
Google Kubernetes Engine (GKE)
Google Kubernetes Engine (GKE)Google Kubernetes Engine (GKE)
Google Kubernetes Engine (GKE)
 

Viewers also liked

Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...
Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...
Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...Spark Summit
 
Fast Data with Apache Ignite and Apache Spark with Christos Erotocritou
Fast Data with Apache Ignite and Apache Spark with Christos ErotocritouFast Data with Apache Ignite and Apache Spark with Christos Erotocritou
Fast Data with Apache Ignite and Apache Spark with Christos ErotocritouSpark Summit
 
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...Spark Summit
 
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...Spark Summit
 

Viewers also liked (6)

Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...
Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...
Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...
 
Fast Data with Apache Ignite and Apache Spark with Christos Erotocritou
Fast Data with Apache Ignite and Apache Spark with Christos ErotocritouFast Data with Apache Ignite and Apache Spark with Christos Erotocritou
Fast Data with Apache Ignite and Apache Spark with Christos Erotocritou
 
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
 
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
 

Similar to The State of Spark in the Cloud with Nicolas Poggi

The state of Spark in the cloud
The state of Spark in the cloudThe state of Spark in the cloud
The state of Spark in the cloudNicolas Poggi
 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)Nicolas Poggi
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Nicolas Poggi
 
Ceph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph Community
 
How SQL Server 2016 SP1 Changes the Game
How SQL Server 2016 SP1 Changes the GameHow SQL Server 2016 SP1 Changes the Game
How SQL Server 2016 SP1 Changes the GamePARIKSHIT SAVJANI
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4Michael Kehoe
 
Scalable data pipeline at Traveloka - Facebook Dev Bandung
Scalable data pipeline at Traveloka - Facebook Dev BandungScalable data pipeline at Traveloka - Facebook Dev Bandung
Scalable data pipeline at Traveloka - Facebook Dev BandungRendy Bambang Junior
 
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...Databricks
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2datamantra
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processingconfluent
 
(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol Performance(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol PerformanceBIOVIA
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Petr Novotný
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathChester Chen
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...viirya
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesSeungYong Oh
 

Similar to The State of Spark in the Cloud with Nicolas Poggi (20)

The state of Spark in the cloud
The state of Spark in the cloudThe state of Spark in the cloud
The state of Spark in the cloud
 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)
 
Ceph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der Ster
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
How SQL Server 2016 SP1 Changes the Game
How SQL Server 2016 SP1 Changes the GameHow SQL Server 2016 SP1 Changes the Game
How SQL Server 2016 SP1 Changes the Game
 
Big Data training
Big Data trainingBig Data training
Big Data training
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4
 
Scalable data pipeline at Traveloka - Facebook Dev Bandung
Scalable data pipeline at Traveloka - Facebook Dev BandungScalable data pipeline at Traveloka - Facebook Dev Bandung
Scalable data pipeline at Traveloka - Facebook Dev Bandung
 
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processing
 
(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol Performance(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol Performance
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Available platforms for Big Data 2.0
Available platforms for Big Data 2.0
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 

Recently uploaded

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad EscortsCall girls in Ahmedabad High profile
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 

Recently uploaded (20)

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 

The State of Spark in the Cloud with Nicolas Poggi

  • 1. The state of in the cloud Nicolas Poggi Oct 2017 ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ /_/
  • 2. Outline 1. Intro 1. PaaS Cloud 2. BigBench 2. Part I – Scalability 1. Hive vs. Spark 3. Part II – Additional experiments 1. Versions, Concurrency, 10TB 4. Summary 2
  • 3. Motivation • 2016 SQL-on-Hadoop paper and presentations • Focused on Hive, due to SparkSQL not being ready to use in PaaS • Used benchmark (TPC-H) • Early 2017, BigBench testing Hive and Spark and TPC-TC paper • New code available in May for MLlib2 compatibility • Goals: Evaluate the current out-of-the-box experience of Spark v2 in PaaS cloud • Using Hive as baseline 3
  • 4. Platform-as-a-Service Spark • Simplified management • Cloud-based managed Hadoop services • Ready to use Spark, Hive, … • Deploys in minutes, on-demand, elastic • Pay-as-you-go pricing model • Decoupled compute and storage • Optimized for general purpose • Fined tuned to the cloud provider architecture 4
  • 5. Surveyed PaaS services • Amazon Elastic Map Reduce (EMR) • Released: Apr 2009 • OS: Amazon Linux AMI (RHEL-like) • Spark 2.1.0 and Hive 2.1 (Tez) • VM: M4.2xlarge (32GB RAM) • Google Cloud DataProc (GCD) • Released: Feb 2016 • OS: Debian 8.4 • Spark 2.1.0 (preview), Hive 2.1 (M/R) • VM: n1-standard-8 (30GB RAM) • Azure HDInsight (HDI) • Released: Oct 2013 • OS: Ubuntu 16.04 (HDP-based) • Spark 2.1.0 and 1.6.3, Hive 1.2 (Tez) • VM: D4v2 (28GB RAM) • Target deployment 128-cores: • 16 data nodes with 8-cores each • Master node with 16-cores • Decoupled storage only • EBS, WASB, GCS 5
  • 6. What is BigBench (TPCx-BB) • End-to-end application level benchmark specification • result of many years of collaboration of industry and academia • Covers most Big Data Analytical properties (3Vs) • 30 business use cases for a retailer company • Merchandizing, • pricing, • customers … • Defines data scale factors • 1GB to PBs 6 Retailer database
  • 7. Sequential Hive vs Spark 2.1 Queries 1-30 on Spark 2.1 (power runs) Query 1 Query 2 …. Query 30 Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 2.1.0 /_/
  • 8. BB 1GB-1TB Scalability Dataproc: Hive 2.1 (M/R) vs Spark 2.1
  • 9. BB 1GB-1TB Scalability EMR: Hive 2.1 (Tez) vs Spark 2.1
  • 10. BB 1GB-1TB Scalability HDI: Hive 1.2 (Tez) vs Spark 2.1
  • 11. BB 1GB-1TB Scalability: Hive vs Spark 2.1 All providers
  • 12. BB 1TB Power runs : Hive vs Spark 2.1 (ALL)
  • 13. CPU % Q5 (ML) in Hive and Spark (HDI) 13 • Hive (MLlib2) • Spark (MLlib2) Time (s) Time (s) - 2X faster
  • 14. Radar charts – query characterization • Useful for displaying multivariate data (5 resources) • Quickly identify similarities and differences. • From example • Hive and Spark • Only Disk Write is similar • Hive consumes more MEM and CPU • Spark read more from disk (DISK_R) • And moderately more network Sample radar chart for Q7 in EMR at 1TB 14
  • 15. BB 1TB Query 5 (ML) providers comparison Hive (MLlib2) Spark (MLlib2) 15
  • 16. Other comparisons: 10TB SQL-Only 2.0.2 vs 2.1.0 1.6.3 vs 2.1.0 MLlib v1 vs v2 16
  • 17. BB 1GB-10TB Scalability SQL-only queries Hive Spark
  • 18. BigBench 1GB-1TB: Spark 2.0.2 vs 2.1.0 (CDP) Notes: Spark 2.1 a bit faster at small scales, slower at 100 GB and 1 TB on the UDF/NLP queries 2.1 faster up to 100GB Slower at 1TB
  • 19. BigBench 1GB-1TB: Spark 1.6.3 vs 2.1.0 MLlib 1 vs 2.1 MLlib 2(HDI) Notes: • Spark 2.1 is always faster than 1.6.3 in HDI • MLlilb 2 using dataframes over RDDs is only slightly faster than V1.
  • 20. Concurrency runs (throughput) SQL-only: 100GB – 1TB 512-core cluster 2020
  • 21. BB Throughput at 100GB 8 streams SQL-only (512-cores)
  • 22. BB Throughput at 1TB 4 streams SQL-only (512-cores)
  • 23. Conclusions • All providers have up to date (2.1.0) and well tuned versions of Spark • They could run BigBench up to 1TB on medium-sized cluster, • [Almost] Out-of-the box • Performance similar among providers for similar cluster types and disk configs • Difference according to scale (and pricing) • Spark 2.1.0 is faster than previous versions • Also MLlib 2 with dataframes • But improvements within the 30% range • Hive (+Tez + MLlib) are still slightly faster than Spark at lower scales for sequential • But Spark significantly faster at high data scales and concurrency • BigBench has been useful to stress a cluster with different workloads • Highlights config problems fast and stresses scale limits • Helpful for tuning the clusters 23
  • 24. Resources and references BigBench and ALOJA • BigBench Spark 2 branch (thanks Christoph and Michael from bankmark.de): • https://github.com/carabolic/Big-Data- Benchmark-for-Big-Bench/tree/spark2 • Original BigBench Implementation repository • https://github.com/intel-hadoop/Big-Data- Benchmark-for-Big-Bench • ALOJA benchmarking platform • https://github.com/Aloja/aloja • ALOJA fork of BigBench (adds support for HDI and fixes spark) • https://github.com/Aloja/Big-Data-Benchmark- for-Big-Bench Papers and slides • https://www.slideshare.net/ni_po • Characterizing TPCx-BB Queries, Hive, and Spark in Multi-Cloud Environments – N. Poggi et. Al • TPC-TC 2017 • The State of SQL-on-Hadoop in the Cloud – N. Poggi et. al. • IEEE Big Data 2016 • https://doi.org/10.1109/BigData.2016 .7840751 24
  • 25. Thanks, questions? Follow up / feedback : Npoggi@ac.upc.edu Twitter: ni_po The state of in the cloud ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ /_/
  • 27. BB 1TB Query 2 (M/R) providers comparison Hive Spark 27
  • 28. Spark config EMR CDP HDI Java version OpenJDK 1.8.0_121 OpenJDK 1.8.0_121 OpenJDK 1.8.0_131 Spark version 2.1.0 2.1 2.1.0.2.6.0.2-76 Driver memory 5G 5G 5G Executor memory 5G 10G 4G Executor cores 4 4 3 Executor instances Dynamic Dynamic 20 dynamicAllocation enabled TRUE TRUE FALSE Executor memoryOverhead Default (384MB) 1,117 MB Default (384MB) 28