SlideShare a Scribd company logo
1 of 25
Download to read offline
Harnessing Big Data with Spark
Lawrence Spracklen
Alpine Data
2
Alpine Data
3
Map Reduce
•  Allows the distribution of large data computations
across a cluster
•  Computations typically composed of a sequence of
MR operations
Big	
  
Data	
  
Map()
Output	
  
Reduce()
4
MR Performance
•  Multiple disk interactions required in EACH MR
operation
Map	
   Reduce	
  
5
Performance Hierarchy
0.10GB/s 0.10GB/s 0.60GB/s 80GB/s
100X Read Bandwidth
6
Optimizing MR
•  Many companies have significant legacy MR
code
–  Either direct MR or indirect usage via Pig
•  A variety of techniques to accelerate MR
–  Apache Tez
–  Tachyon or Apache ignite
–  System ML
7
Spark
•  Several significant advancements over MR
–  Generalizes two stage MR into arbitrary DAGs
–  Enables in-memory dataset caching
–  Improved usability
•  Reduced disk read/writes delivers significant
speedups
–  Especially for iterative algorithms like ML
8
Perf comparisons
*http://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce
9
Spark Tuning
•  Increased reliance on memory introduces
greater requirement for tuning
•  Need to understand memory requirements for
caching
•  Significant performance benefits associated
with “getting it right”
•  Auto-tuning is coming….
10
Optimization opportunities
•  Spark delivers improved ML performance using
reduced cluster resources
•  Enables numerous opportunities
–  Reduced time to insights
–  Reduced cluster size
–  Eliminate subsampling
–  AutoML
11
AutoML
•  Data sets increasingly large and complex
•  Increasing difficult to intuitively “know” optimal
–  Feature engineering
–  Choice of algorithm
–  Optimize parameterization of algorithm(s)
•  Significant manual trial-and-error
•  Cult of the algorithm
12
Feature Engineering
•  Essential for model performance, efficacy,
robustness and simplicity
–  Feature extraction
–  Feature selection
–  Feature construction
–  Feature elimination
•  Domain/dataset knowledge is important, but
basic automation feasible
13
Algorithm selection
•  Select dependent column
•  Indicate classification or regression
•  Press “go”
Algorithms run in parallel across cluster
Minimally provides good starting point
Significantly reduces “busy work”
14
Hyperparameter optimization
•  Are the default parameters optimal?
•  How do I adjust intelligently
–  Number of trees? Depth of trees? Splitting
criteria?
•  Tedious trial and error
•  Overfitting danger
•  Intelligent automatic search
15
Algorithm tuning
•  Gradient boosted tree parameterization e.g.
–  # of trees
–  Maximum tree depth
–  Loss function
–  Minimum node split size
–  Bagging rate
–  Shrinkage
16
AutoML
Data	
  Set	
  
Alg	
  #1	
  
Alg	
  #2	
  
Alg	
  #3	
  
Alg	
  #N	
  
Alg	
  #1	
  
Alg	
  #N	
  
1)Investigate N
ML algorithms
2) Tune top
performing
algorithms
Feature	
  
engineering	
  
Alg	
  #2	
  
Alg	
  #1	
  
Alg	
  #N	
  
2) Feature
elimination
17
Spark is for large datasets
*http://datascience.la/benchmarking-random-forest-implementations/
•  If your data fits on a single node….
•  Other high-performance options exist
*http://haifengl.github.io/smile/index.html
Runtime
18
Data set size
•  Large data lakes can
consist of many small
files
•  Memory per node
increasing rapidly
*http://www.kdnuggets.com/2015/11/big-ram-big-data-size-datasets.html
19
NVDIMMS
•  Driving significant increases in node memory
–  Up to 10X increase in density
•  Coming in late 2016…
20
Hybrid operators
•  Time consuming to maintain multiple ML
libraries & manually determine optimal choice
•  Develop hybrid implementations that
automatically choose optimal approach
–  Data set size
–  Cluster size
–  Cluster utilization
21
Single-node performance (1/2)
*http://www.ayasdi.com/blog/LawrenceSpracklen
22
Single-node performance (2/2)
*http://www.ayasdi.com/blog/LawrenceSpracklen
23
Operationalization
•  What happens after the models are created?
•  How does the business benefit from the
insights?
•  Operationalization is frequently the weak link
–  Operationalizing PowerPoint?
–  Hand rolled scoring flows
24
PFA
•  Portable Format for Analytics (PFA)
•  Successor to PMML
•  Significant flexibility in encapsulating complex
data preprocessing
25
Conclusions
•  Spark delivers significant performance
improvements over MR
–  Can introduce more tuning requirements
•  Provides an opportunity for AutoML
–  Automatically determine good solutions
•  Understand when its appropriate
•  Don’t forget about about operationalization

More Related Content

What's hot

Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningSpark Summit
 
Distributed Deep Learning on Spark
Distributed Deep Learning on SparkDistributed Deep Learning on Spark
Distributed Deep Learning on SparkMathieu Dumoulin
 
Introduction to apache horn (incubating)
Introduction to apache horn (incubating)Introduction to apache horn (incubating)
Introduction to apache horn (incubating)Edward Yoon
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkJen Aman
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Summit
 
Distributed Processing Frameworks
Distributed Processing FrameworksDistributed Processing Frameworks
Distributed Processing FrameworksAntonios Katsarakis
 
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16MLconf
 
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)Alexander Ulanov
 
Apache spark presentation
Apache spark presentationApache spark presentation
Apache spark presentationMahboob Hussain
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Mathieu Dumoulin
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Databricks
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
 
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing UnitSlides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing UnitCarlo C. del Mundo
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFramesJen Aman
 

What's hot (20)

Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
 
Distributed Deep Learning on Spark
Distributed Deep Learning on SparkDistributed Deep Learning on Spark
Distributed Deep Learning on Spark
 
Introduction to apache horn (incubating)
Introduction to apache horn (incubating)Introduction to apache horn (incubating)
Introduction to apache horn (incubating)
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
 
Giraph
GiraphGiraph
Giraph
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Distributed Processing Frameworks
Distributed Processing FrameworksDistributed Processing Frameworks
Distributed Processing Frameworks
 
Tensor Processing Unit (TPU)
Tensor Processing Unit (TPU)Tensor Processing Unit (TPU)
Tensor Processing Unit (TPU)
 
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
 
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
 
Apache spark presentation
Apache spark presentationApache spark presentation
Apache spark presentation
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
 
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing UnitSlides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
 

Similar to Harnessing Big Data with Spark

夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架hdhappy001
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Databricks
 
Spark Autotuning - Spark Summit East 2017
Spark Autotuning - Spark Summit East 2017 Spark Autotuning - Spark Summit East 2017
Spark Autotuning - Spark Summit East 2017 Alpine Data
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovMaksud Ibrahimov
 
High Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisHigh Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisMike Pittaro
 
Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis PyData
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tipsSubhas Kumar Ghosh
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance CachingScyllaDB
 
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-PremiseTackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-PremiseDatabricks
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware ProvisioningMongoDB
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukAndrii Vozniuk
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Ahmed Kamal
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesAhsan Javed Awan
 
osdi20-slides_zhao.pptx
osdi20-slides_zhao.pptxosdi20-slides_zhao.pptx
osdi20-slides_zhao.pptxCive1971
 
Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesRomi Kuntsman
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processingprajods
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Adrianos Dadis
 
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Spark Summit
 

Similar to Harnessing Big Data with Spark (20)

夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
 
Spark Autotuning - Spark Summit East 2017
Spark Autotuning - Spark Summit East 2017 Spark Autotuning - Spark Summit East 2017
Spark Autotuning - Spark Summit East 2017
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud Ibrahimov
 
High Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisHigh Performance Hardware for Data Analysis
High Performance Hardware for Data Analysis
 
Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Caching
 
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-PremiseTackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of Techniques
 
osdi20-slides_zhao.pptx
osdi20-slides_zhao.pptxosdi20-slides_zhao.pptx
osdi20-slides_zhao.pptx
 
Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframes
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
 
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
 

More from Alpine Data

Big Data Day LA 2017
Big Data Day LA 2017Big Data Day LA 2017
Big Data Day LA 2017Alpine Data
 
Operationalizing Data Science using Cloud Foundry
Operationalizing Data Science using Cloud FoundryOperationalizing Data Science using Cloud Foundry
Operationalizing Data Science using Cloud FoundryAlpine Data
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like SparkAlpine Data
 
UC Berkeley Data Science Webinar
UC Berkeley Data Science WebinarUC Berkeley Data Science Webinar
UC Berkeley Data Science WebinarAlpine Data
 
Spark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsSpark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsAlpine Data
 
Real Time Visualization with Spark
Real Time Visualization with SparkReal Time Visualization with Spark
Real Time Visualization with SparkAlpine Data
 

More from Alpine Data (6)

Big Data Day LA 2017
Big Data Day LA 2017Big Data Day LA 2017
Big Data Day LA 2017
 
Operationalizing Data Science using Cloud Foundry
Operationalizing Data Science using Cloud FoundryOperationalizing Data Science using Cloud Foundry
Operationalizing Data Science using Cloud Foundry
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like Spark
 
UC Berkeley Data Science Webinar
UC Berkeley Data Science WebinarUC Berkeley Data Science Webinar
UC Berkeley Data Science Webinar
 
Spark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsSpark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System Administrators
 
Real Time Visualization with Spark
Real Time Visualization with SparkReal Time Visualization with Spark
Real Time Visualization with Spark
 

Recently uploaded

PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 

Recently uploaded (20)

PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 

Harnessing Big Data with Spark

  • 1. Harnessing Big Data with Spark Lawrence Spracklen Alpine Data
  • 3. 3 Map Reduce •  Allows the distribution of large data computations across a cluster •  Computations typically composed of a sequence of MR operations Big   Data   Map() Output   Reduce()
  • 4. 4 MR Performance •  Multiple disk interactions required in EACH MR operation Map   Reduce  
  • 5. 5 Performance Hierarchy 0.10GB/s 0.10GB/s 0.60GB/s 80GB/s 100X Read Bandwidth
  • 6. 6 Optimizing MR •  Many companies have significant legacy MR code –  Either direct MR or indirect usage via Pig •  A variety of techniques to accelerate MR –  Apache Tez –  Tachyon or Apache ignite –  System ML
  • 7. 7 Spark •  Several significant advancements over MR –  Generalizes two stage MR into arbitrary DAGs –  Enables in-memory dataset caching –  Improved usability •  Reduced disk read/writes delivers significant speedups –  Especially for iterative algorithms like ML
  • 9. 9 Spark Tuning •  Increased reliance on memory introduces greater requirement for tuning •  Need to understand memory requirements for caching •  Significant performance benefits associated with “getting it right” •  Auto-tuning is coming….
  • 10. 10 Optimization opportunities •  Spark delivers improved ML performance using reduced cluster resources •  Enables numerous opportunities –  Reduced time to insights –  Reduced cluster size –  Eliminate subsampling –  AutoML
  • 11. 11 AutoML •  Data sets increasingly large and complex •  Increasing difficult to intuitively “know” optimal –  Feature engineering –  Choice of algorithm –  Optimize parameterization of algorithm(s) •  Significant manual trial-and-error •  Cult of the algorithm
  • 12. 12 Feature Engineering •  Essential for model performance, efficacy, robustness and simplicity –  Feature extraction –  Feature selection –  Feature construction –  Feature elimination •  Domain/dataset knowledge is important, but basic automation feasible
  • 13. 13 Algorithm selection •  Select dependent column •  Indicate classification or regression •  Press “go” Algorithms run in parallel across cluster Minimally provides good starting point Significantly reduces “busy work”
  • 14. 14 Hyperparameter optimization •  Are the default parameters optimal? •  How do I adjust intelligently –  Number of trees? Depth of trees? Splitting criteria? •  Tedious trial and error •  Overfitting danger •  Intelligent automatic search
  • 15. 15 Algorithm tuning •  Gradient boosted tree parameterization e.g. –  # of trees –  Maximum tree depth –  Loss function –  Minimum node split size –  Bagging rate –  Shrinkage
  • 16. 16 AutoML Data  Set   Alg  #1   Alg  #2   Alg  #3   Alg  #N   Alg  #1   Alg  #N   1)Investigate N ML algorithms 2) Tune top performing algorithms Feature   engineering   Alg  #2   Alg  #1   Alg  #N   2) Feature elimination
  • 17. 17 Spark is for large datasets *http://datascience.la/benchmarking-random-forest-implementations/ •  If your data fits on a single node…. •  Other high-performance options exist *http://haifengl.github.io/smile/index.html Runtime
  • 18. 18 Data set size •  Large data lakes can consist of many small files •  Memory per node increasing rapidly *http://www.kdnuggets.com/2015/11/big-ram-big-data-size-datasets.html
  • 19. 19 NVDIMMS •  Driving significant increases in node memory –  Up to 10X increase in density •  Coming in late 2016…
  • 20. 20 Hybrid operators •  Time consuming to maintain multiple ML libraries & manually determine optimal choice •  Develop hybrid implementations that automatically choose optimal approach –  Data set size –  Cluster size –  Cluster utilization
  • 23. 23 Operationalization •  What happens after the models are created? •  How does the business benefit from the insights? •  Operationalization is frequently the weak link –  Operationalizing PowerPoint? –  Hand rolled scoring flows
  • 24. 24 PFA •  Portable Format for Analytics (PFA) •  Successor to PMML •  Significant flexibility in encapsulating complex data preprocessing
  • 25. 25 Conclusions •  Spark delivers significant performance improvements over MR –  Can introduce more tuning requirements •  Provides an opportunity for AutoML –  Automatically determine good solutions •  Understand when its appropriate •  Don’t forget about about operationalization