SlideShare a Scribd company logo
Harnessing Big Data with Spark
Lawrence Spracklen
Alpine Data
2
Alpine Data
3
Map Reduce
•  Allows the distribution of large data computations
across a cluster
•  Computations typically composed of a sequence of
MR operations
Big	
  
Data	
  
Map()
Output	
  
Reduce()
4
MR Performance
•  Multiple disk interactions required in EACH MR
operation
Map	
   Reduce	
  
5
Performance Hierarchy
0.10GB/s 0.10GB/s 0.60GB/s 80GB/s
100X Read Bandwidth
6
Optimizing MR
•  Many companies have significant legacy MR
code
–  Either direct MR or indirect usage via Pig
•  A variety of techniques to accelerate MR
–  Apache Tez
–  Tachyon or Apache ignite
–  System ML
7
Spark
•  Several significant advancements over MR
–  Generalizes two stage MR into arbitrary DAGs
–  Enables in-memory dataset caching
–  Improved usability
•  Reduced disk read/writes delivers significant
speedups
–  Especially for iterative algorithms like ML
8
Perf comparisons
*http://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce
9
Spark Tuning
•  Increased reliance on memory introduces
greater requirement for tuning
•  Need to understand memory requirements for
caching
•  Significant performance benefits associated
with “getting it right”
•  Auto-tuning is coming….
10
Optimization opportunities
•  Spark delivers improved ML performance using
reduced cluster resources
•  Enables numerous opportunities
–  Reduced time to insights
–  Reduced cluster size
–  Eliminate subsampling
–  AutoML
11
AutoML
•  Data sets increasingly large and complex
•  Increasing difficult to intuitively “know” optimal
–  Feature engineering
–  Choice of algorithm
–  Optimize parameterization of algorithm(s)
•  Significant manual trial-and-error
•  Cult of the algorithm
12
Feature Engineering
•  Essential for model performance, efficacy,
robustness and simplicity
–  Feature extraction
–  Feature selection
–  Feature construction
–  Feature elimination
•  Domain/dataset knowledge is important, but
basic automation feasible
13
Algorithm selection
•  Select dependent column
•  Indicate classification or regression
•  Press “go”
Algorithms run in parallel across cluster
Minimally provides good starting point
Significantly reduces “busy work”
14
Hyperparameter optimization
•  Are the default parameters optimal?
•  How do I adjust intelligently
–  Number of trees? Depth of trees? Splitting
criteria?
•  Tedious trial and error
•  Overfitting danger
•  Intelligent automatic search
15
Algorithm tuning
•  Gradient boosted tree parameterization e.g.
–  # of trees
–  Maximum tree depth
–  Loss function
–  Minimum node split size
–  Bagging rate
–  Shrinkage
16
AutoML
Data	
  Set	
  
Alg	
  #1	
  
Alg	
  #2	
  
Alg	
  #3	
  
Alg	
  #N	
  
Alg	
  #1	
  
Alg	
  #N	
  
1)Investigate N
ML algorithms
2) Tune top
performing
algorithms
Feature	
  
engineering	
  
Alg	
  #2	
  
Alg	
  #1	
  
Alg	
  #N	
  
2) Feature
elimination
17
Spark is for large datasets
*http://datascience.la/benchmarking-random-forest-implementations/
•  If your data fits on a single node….
•  Other high-performance options exist
*http://haifengl.github.io/smile/index.html
Runtime
18
Data set size
•  Large data lakes can
consist of many small
files
•  Memory per node
increasing rapidly
*http://www.kdnuggets.com/2015/11/big-ram-big-data-size-datasets.html
19
NVDIMMS
•  Driving significant increases in node memory
–  Up to 10X increase in density
•  Coming in late 2016…
20
Hybrid operators
•  Time consuming to maintain multiple ML
libraries & manually determine optimal choice
•  Develop hybrid implementations that
automatically choose optimal approach
–  Data set size
–  Cluster size
–  Cluster utilization
21
Single-node performance (1/2)
*http://www.ayasdi.com/blog/LawrenceSpracklen
22
Single-node performance (2/2)
*http://www.ayasdi.com/blog/LawrenceSpracklen
23
Operationalization
•  What happens after the models are created?
•  How does the business benefit from the
insights?
•  Operationalization is frequently the weak link
–  Operationalizing PowerPoint?
–  Hand rolled scoring flows
24
PFA
•  Portable Format for Analytics (PFA)
•  Successor to PMML
•  Significant flexibility in encapsulating complex
data preprocessing
25
Conclusions
•  Spark delivers significant performance
improvements over MR
–  Can introduce more tuning requirements
•  Provides an opportunity for AutoML
–  Automatically determine good solutions
•  Understand when its appropriate
•  Don’t forget about about operationalization

More Related Content

What's hot

Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningSpark Summit
 
Distributed Deep Learning on Spark
Distributed Deep Learning on SparkDistributed Deep Learning on Spark
Distributed Deep Learning on SparkMathieu Dumoulin
 
Introduction to apache horn (incubating)
Introduction to apache horn (incubating)Introduction to apache horn (incubating)
Introduction to apache horn (incubating)Edward Yoon
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkJen Aman
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Summit
 
Distributed Processing Frameworks
Distributed Processing FrameworksDistributed Processing Frameworks
Distributed Processing FrameworksAntonios Katsarakis
 
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16MLconf
 
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)Alexander Ulanov
 
Apache spark presentation
Apache spark presentationApache spark presentation
Apache spark presentationMahboob Hussain
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Mathieu Dumoulin
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Databricks
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
 
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing UnitSlides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing UnitCarlo C. del Mundo
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFramesJen Aman
 

What's hot (20)

Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
 
Distributed Deep Learning on Spark
Distributed Deep Learning on SparkDistributed Deep Learning on Spark
Distributed Deep Learning on Spark
 
Introduction to apache horn (incubating)
Introduction to apache horn (incubating)Introduction to apache horn (incubating)
Introduction to apache horn (incubating)
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
 
Giraph
GiraphGiraph
Giraph
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Distributed Processing Frameworks
Distributed Processing FrameworksDistributed Processing Frameworks
Distributed Processing Frameworks
 
Tensor Processing Unit (TPU)
Tensor Processing Unit (TPU)Tensor Processing Unit (TPU)
Tensor Processing Unit (TPU)
 
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
 
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
 
Apache spark presentation
Apache spark presentationApache spark presentation
Apache spark presentation
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
 
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing UnitSlides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
 

Similar to Harnessing Big Data with Spark

夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架hdhappy001
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Databricks
 
Spark Autotuning - Spark Summit East 2017
Spark Autotuning - Spark Summit East 2017 Spark Autotuning - Spark Summit East 2017
Spark Autotuning - Spark Summit East 2017 Alpine Data
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovMaksud Ibrahimov
 
High Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisHigh Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisMike Pittaro
 
Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis PyData
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tipsSubhas Kumar Ghosh
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance CachingScyllaDB
 
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-PremiseTackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-PremiseDatabricks
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware ProvisioningMongoDB
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukAndrii Vozniuk
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Ahmed Kamal
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesAhsan Javed Awan
 
osdi20-slides_zhao.pptx
osdi20-slides_zhao.pptxosdi20-slides_zhao.pptx
osdi20-slides_zhao.pptxCive1971
 
Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesRomi Kuntsman
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processingprajods
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Adrianos Dadis
 
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Spark Summit
 

Similar to Harnessing Big Data with Spark (20)

夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
 
Spark Autotuning - Spark Summit East 2017
Spark Autotuning - Spark Summit East 2017 Spark Autotuning - Spark Summit East 2017
Spark Autotuning - Spark Summit East 2017
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud Ibrahimov
 
High Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisHigh Performance Hardware for Data Analysis
High Performance Hardware for Data Analysis
 
Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Caching
 
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-PremiseTackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of Techniques
 
osdi20-slides_zhao.pptx
osdi20-slides_zhao.pptxosdi20-slides_zhao.pptx
osdi20-slides_zhao.pptx
 
Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframes
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
 
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
 

More from Alpine Data

Big Data Day LA 2017
Big Data Day LA 2017Big Data Day LA 2017
Big Data Day LA 2017Alpine Data
 
Operationalizing Data Science using Cloud Foundry
Operationalizing Data Science using Cloud FoundryOperationalizing Data Science using Cloud Foundry
Operationalizing Data Science using Cloud FoundryAlpine Data
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like SparkAlpine Data
 
UC Berkeley Data Science Webinar
UC Berkeley Data Science WebinarUC Berkeley Data Science Webinar
UC Berkeley Data Science WebinarAlpine Data
 
Spark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsSpark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsAlpine Data
 
Real Time Visualization with Spark
Real Time Visualization with SparkReal Time Visualization with Spark
Real Time Visualization with SparkAlpine Data
 

More from Alpine Data (6)

Big Data Day LA 2017
Big Data Day LA 2017Big Data Day LA 2017
Big Data Day LA 2017
 
Operationalizing Data Science using Cloud Foundry
Operationalizing Data Science using Cloud FoundryOperationalizing Data Science using Cloud Foundry
Operationalizing Data Science using Cloud Foundry
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like Spark
 
UC Berkeley Data Science Webinar
UC Berkeley Data Science WebinarUC Berkeley Data Science Webinar
UC Berkeley Data Science Webinar
 
Spark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsSpark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System Administrators
 
Real Time Visualization with Spark
Real Time Visualization with SparkReal Time Visualization with Spark
Real Time Visualization with Spark
 

Recently uploaded

一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单ewymefz
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsalex933524
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatheahmadsaood
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBAlireza Kamrani
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationBoston Institute of Analytics
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesStarCompliance.io
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sMAQIB18
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...elinavihriala
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单enxupq
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsCEPTES Software Inc
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单ukgaet
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIAlejandraGmez176757
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单enxupq
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .NABLAS株式会社
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?DOT TECH
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单ewymefz
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Domenico Conte
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP
 

Recently uploaded (20)

一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDB
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 

Harnessing Big Data with Spark

  • 1. Harnessing Big Data with Spark Lawrence Spracklen Alpine Data
  • 3. 3 Map Reduce •  Allows the distribution of large data computations across a cluster •  Computations typically composed of a sequence of MR operations Big   Data   Map() Output   Reduce()
  • 4. 4 MR Performance •  Multiple disk interactions required in EACH MR operation Map   Reduce  
  • 5. 5 Performance Hierarchy 0.10GB/s 0.10GB/s 0.60GB/s 80GB/s 100X Read Bandwidth
  • 6. 6 Optimizing MR •  Many companies have significant legacy MR code –  Either direct MR or indirect usage via Pig •  A variety of techniques to accelerate MR –  Apache Tez –  Tachyon or Apache ignite –  System ML
  • 7. 7 Spark •  Several significant advancements over MR –  Generalizes two stage MR into arbitrary DAGs –  Enables in-memory dataset caching –  Improved usability •  Reduced disk read/writes delivers significant speedups –  Especially for iterative algorithms like ML
  • 9. 9 Spark Tuning •  Increased reliance on memory introduces greater requirement for tuning •  Need to understand memory requirements for caching •  Significant performance benefits associated with “getting it right” •  Auto-tuning is coming….
  • 10. 10 Optimization opportunities •  Spark delivers improved ML performance using reduced cluster resources •  Enables numerous opportunities –  Reduced time to insights –  Reduced cluster size –  Eliminate subsampling –  AutoML
  • 11. 11 AutoML •  Data sets increasingly large and complex •  Increasing difficult to intuitively “know” optimal –  Feature engineering –  Choice of algorithm –  Optimize parameterization of algorithm(s) •  Significant manual trial-and-error •  Cult of the algorithm
  • 12. 12 Feature Engineering •  Essential for model performance, efficacy, robustness and simplicity –  Feature extraction –  Feature selection –  Feature construction –  Feature elimination •  Domain/dataset knowledge is important, but basic automation feasible
  • 13. 13 Algorithm selection •  Select dependent column •  Indicate classification or regression •  Press “go” Algorithms run in parallel across cluster Minimally provides good starting point Significantly reduces “busy work”
  • 14. 14 Hyperparameter optimization •  Are the default parameters optimal? •  How do I adjust intelligently –  Number of trees? Depth of trees? Splitting criteria? •  Tedious trial and error •  Overfitting danger •  Intelligent automatic search
  • 15. 15 Algorithm tuning •  Gradient boosted tree parameterization e.g. –  # of trees –  Maximum tree depth –  Loss function –  Minimum node split size –  Bagging rate –  Shrinkage
  • 16. 16 AutoML Data  Set   Alg  #1   Alg  #2   Alg  #3   Alg  #N   Alg  #1   Alg  #N   1)Investigate N ML algorithms 2) Tune top performing algorithms Feature   engineering   Alg  #2   Alg  #1   Alg  #N   2) Feature elimination
  • 17. 17 Spark is for large datasets *http://datascience.la/benchmarking-random-forest-implementations/ •  If your data fits on a single node…. •  Other high-performance options exist *http://haifengl.github.io/smile/index.html Runtime
  • 18. 18 Data set size •  Large data lakes can consist of many small files •  Memory per node increasing rapidly *http://www.kdnuggets.com/2015/11/big-ram-big-data-size-datasets.html
  • 19. 19 NVDIMMS •  Driving significant increases in node memory –  Up to 10X increase in density •  Coming in late 2016…
  • 20. 20 Hybrid operators •  Time consuming to maintain multiple ML libraries & manually determine optimal choice •  Develop hybrid implementations that automatically choose optimal approach –  Data set size –  Cluster size –  Cluster utilization
  • 23. 23 Operationalization •  What happens after the models are created? •  How does the business benefit from the insights? •  Operationalization is frequently the weak link –  Operationalizing PowerPoint? –  Hand rolled scoring flows
  • 24. 24 PFA •  Portable Format for Analytics (PFA) •  Successor to PMML •  Significant flexibility in encapsulating complex data preprocessing
  • 25. 25 Conclusions •  Spark delivers significant performance improvements over MR –  Can introduce more tuning requirements •  Provides an opportunity for AutoML –  Automatically determine good solutions •  Understand when its appropriate •  Don’t forget about about operationalization