SlideShare a Scribd company logo
Spark Webinar 
October 2nd, 2014 
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved 
Vinay Shukla & Ram Venkatesh
Agenda 
• What is Spark? 
• What have we done with Spark so far 
• Tech Previews 
• Brief on Spark 1.1.0 Tech Preview 
• Multi tenant & multi workload with YARN 
• Introducing Spark-3561 
• Get Involved 
Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Let’s Talk About Apache Spark 
What is Spark? 
• Spark is a general-purpose big data engine that provides simple APIs for data scientists and 
engineers familiar with Scala, Python and Java to build ad-hoc interactive analytics, iterative 
machine-learning, and other use cases well-suited to interactive, in-memory data processing of GB to 
TB sized datasets. 
Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
What is Spark? 
Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved 
(Hadoop|FlatMapped|Filter|MapPartitions|Shuffled)RDD 
stage0: (Hadoop|FlatMapped|Filter|MapPartitions)RDD 
stage1: ShuffledRDD 
ShuffleMapTask: (flatMap | 
map) 
Task Task 
ResultTask: (reduceByKey) 
ShuffleMapTask: (flatMap | 
map) 
Spark API 
Spark 
Compiler / Optimizer 
DAG Runtime 
Execution Engine 
Spark Cluster YARN Mesos 
Client 
Cluster 
DAGScheduler, ActiveJob 
Task 
SparkAM
Let’s Talk About Apache Spark (cont’d) 
What’s Our Spark Strategy? 
• Hortonworks is focused on enabling Spark for Enterprise Hadoop so users can deploy Spark-based 
applications along with their other Hadoop workloads in a consistent, predictable, and robust way. 
– Leverage Scale, Multi-tenancy provided by “YARN” so its memory and CPU intensive apps can work with 
predictable performance 
– Integrate it with HDP’s operations, security, governance, scalability, availability, and multi-tenancy capabilities 
Do We Have a Plan to Support Spark? Yes. 
• Spark is available now as a Technology Preview. 
• We are working our standard process of Tech Preview -> GA. We did this for Storm, Falcon, etc. 
• Spark will be added to our HDP Enterprise Plus subscription when it’s GA ready 
Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Spark Timeline 
Break-down 
Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Spark Roadmap 
2014 JULY SEPT 
1.0.1 TP 
Refresh 
Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved 
1.1.0 TP 
Refresh 
DEC 
1.2.0 GA 
• Hive 13 support 
• Limited ORC support 
• Spark on YARN: Deployment Best Practices 
• Ambari Support for Spark Install/Stop/Config 
• Spark on Kerberized Cluster 
• Authentication against LDAP in Spark UI
What’s in Spark 1.1.0 Tech Preview 
• Upgrades Spark to Hive .13 
• Provides Hive .13 features (new Hive UDFs) in Spark 
• Limited ORC support 
• Ability to manipulate ORC as HadoopRDD 
….. 
val inputRead = 
sc.hadoopFile("hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/orc_table",cl 
assOf[org.apache.hadoop.hive.ql.io.orc.OrcInputFormat],classOf[org.apache.hadoop.io. 
NullWritable],classOf[org.apache.hadoop.hive.ql.io.orc.OrcStruct]) 
val k = inputRead.map(pair => pair._2.toString) 
val c = k.collect 
….. 
Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Spark Enterprise Readiness 
Enhancements 
Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Spark Investment Phases 
• Phase 1 
• Hive 0.13 support 
• Limited ORC support 
• Security: Spark certification on Kerberized Cluster 
• Security: Authentication in Spark UI against LDAP/AD 
• Operations: Ambari Stack Definition: Install/Start/Stop/Config/Quick links to Spark UI 
• Phase 2 
• Improve reliability & Scale of Spark-on-YARN 
• Enhance ORC support 
• Improve Debug Capabilities 
• Security: Wire Encryption and Authorization with XA/Argus 
• Operations: Spark logs published to YARN Application Timeline Service (ATS) 
• Operations: Enhanced workload management 
Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Spark on Hadoop 
October 2nd, 2014 
Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved 
Ram Venkatesh
Spark-on-Hadoop – End User Benefits 
Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved 
© Hortonworks Inc. 2013 
• Developer Productivity 
• Simple, easy to use APIs 
• Direct and elegant representation of the data processing flow 
• Focus on application business logic rather than Hadoop internals 
• Integrated develop-deploy-debug experience through the IDE 
• Multi-tenancy 
• Shared infrastructure across workloads – interactive queries by day, batch ETL at night 
• Better utilization of compute capacity 
• Move the execution to the data tier instead of the other way around 
• Reduced load on distributed filesystem (HDFS) 
• Reduce unnecessary replicated reads and writes 
• Reduced network usage 
• Eliminates the need for data transfer in and out of the cluster 
Page 12
Spark-on-Hadoop – Design considerations 
• Don’t solve problems that have already been solved. 
–Leverage discrete task based compute model for elasticity, scalability and fault tolerance 
–Leverage several man years of work in Hadoop Map-Reduce data shuffling operations 
–Leverage proven resource sharing and multi-tenancy model for Hadoop and YARN 
–Leverage built-in security mechanisms in Hadoop for privacy and isolation 
Page 13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved 
© Hortonworks Inc. 2013 
• Don’t create new problems 
–Preserve the simple developer experience 
–No changes to Spark programs, all programs run unmodified 
–Propose simple, mainstream in-the-community extension to the Apache Spark project 
Page 13 
Look to the Future with an eye on the Past
Spark on Hadoop – From service model to app model 
Spark jobs compile down to a Directed Acyclic Graph (DAG). 
• Vertices in the graph represent user logic 
• Edges represent data movement from producers to consumers 
• Spark DAG executed using Apache Tez at runtime 
Page 14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved 
© Hortonworks Inc. 2013 
Page 14 
Preprocessor Stage 
Partition Stage 
Aggregate Stage 
Sampler 
Task-1 Task-2 
Task-1 Task-2 
Task-1 Task-2 
Samples 
Ranges 
Distributed Sort
Spark-on-Hadoop – Simplifying Operations 
• No deployments to do. No side effects. Easy and safe to try it out! 
• Completely client side application. 
• Simply upload to any accessible FileSystem and point to the cluster through configuration files. 
• Enables running different versions concurrently. Easy to test new functionality while keeping stable 
versions for production. 
• Leverages YARN local resources. 
Spark Client TezTask 
Page 15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved 
TezTask 
© Hortonworks Inc. 2013 
Page 15 
Client 
Machine 
Node 
Manager 
Node 
Manager 
HDFS 
Spark-v1 Spark-v2 
Spark Client 
Client 
Machine
Benefits of native Hadoop execution of Spark DAGs 
• Elastic resource management - dynamic acquisition and release of containers 
•Works with YARN pre-emption, reservation and headroom calculations 
• Auto-parallelism based on sampling – you no longer need to guess no. of reducers 
• Efficient data movement between stages using the Hadoop shuffle 
• Integrates with resource isolation and governance mechanisms in Hadoop 
• Classpath and jarfile management through local resources 
• Detailed job-level metrics through integration with the YARN ATS 
Enables large-scale, multi-tenant batch ETL Spark programs 
Page 16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved 
© Hortonworks Inc. 2013 
Page 16
Introducing SPARK-3561 
Page 17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
DEMO: SPARK-3561 in action 
Page 18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
SPARK-3561 under the hood 
Example program: 
Page 19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
SPARK-3561 Demo – contd. 
Execute program using spark-submit 
spark-submit --class dev.demo.WordCount 
--master execution-context:org.apache.spark.tez.TezJobExecutionContext 
spark-on-hadoop-1.0.jar 1 test.txt 
Execute interactive Spark commands through spark-shell 
spark-shell --master execution-context:org.apache.spark.tez.TezJobExecutionContext 
INFO main spark.SparkContext:59 - Will use custom job execution context org.apache.spark.tez.TezJobExecutionContext 
INFO main adapter.SparkToTezAdapter:59 - Adapting PairRDDFunctions.saveAsNewAPIHadoopDataset for Tez 
INFO main repl.SparkILoop:59 - Created spark context.. 
Spark context available as sc. 
scala> 
Page 20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
SPARK-3561 – feedback requested 
Provide feedback on your ETL/batch scenarios 
Participate in the discussion on the JIRA 
Try it out when it becomes available 
Looking for early adopters to run and validate at scale 
Page 21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Resources 
• Spark Labs Page : http://hortonworks.com/hadoop/spark/ 
• Spark Roadmap Blog : http://hortonworks.com/blog/extending-spark-yarn- 
enterprise-hadoop/ 
• Spark 1.1.0 Tech Preview : http://hortonworks.com/kb/spark-1-1-0- 
technical-preview-hdp-2-1-5/ 
• Public Spark Forums : 
http://hortonworks.com/community/forums/forum/spark/ 
• Spark-3561 : https://issues.apache.org/jira/browse/SPARK-3561 
Page 22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Q&A… 
Discussion 
Page 23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

More Related Content

What's hot

Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
DataWorks Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
DataWorks Summit/Hadoop Summit
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
DataWorks Summit/Hadoop Summit
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
DataWorks Summit/Hadoop Summit
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
DataWorks Summit/Hadoop Summit
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
DataWorks Summit
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
DataWorks Summit/Hadoop Summit
 
The Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral ProcessingThe Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral Processing
DataWorks Summit
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
Hortonworks
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
DataWorks Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
DataWorks Summit
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
DataWorks Summit/Hadoop Summit
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
DataWorks Summit/Hadoop Summit
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouse
DataWorks Summit/Hadoop Summit
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
Hortonworks
 

What's hot (20)

Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
The Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral ProcessingThe Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral Processing
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouse
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 

Viewers also liked

Apache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARNApache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARN
Hortonworks
 
Hortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks
 
Hadoop crashcourse v3
Hadoop crashcourse v3Hadoop crashcourse v3
Hadoop crashcourse v3
Hortonworks
 
Hortonworks technical workshop operations with ambari
Hortonworks technical workshop   operations with ambariHortonworks technical workshop   operations with ambari
Hortonworks technical workshop operations with ambari
Hortonworks
 
Apache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance BenchmarksApache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance BenchmarksHortonworks
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
Hortonworks
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_final
Hortonworks
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
Hortonworks
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks
 
Apache Ambari - What's New in 2.1
Apache Ambari - What's New in 2.1Apache Ambari - What's New in 2.1
Apache Ambari - What's New in 2.1
Hortonworks
 
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks
 
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaSpark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaDatabricks
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
Databricks
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Reactive Programming Meetup - NodeJs on K8s
Reactive Programming Meetup - NodeJs on K8sReactive Programming Meetup - NodeJs on K8s
Reactive Programming Meetup - NodeJs on K8s
Roland Tritsch
 
Sparkstreaming
SparkstreamingSparkstreaming
Sparkstreaming
Marilyn Waldman
 
Devops Spark Streaming
Devops Spark StreamingDevops Spark Streaming
Devops Spark Streaming
Marilyn Waldman
 
Scala training workshop 02
Scala training workshop 02Scala training workshop 02
Scala training workshop 02
Nguyen Tuan
 
IndexedRDD: Efficeint Fine-Grained Updates for RDD's-(Ankur Dave, UC Berkeley)
IndexedRDD: Efficeint Fine-Grained Updates for RDD's-(Ankur Dave, UC Berkeley)IndexedRDD: Efficeint Fine-Grained Updates for RDD's-(Ankur Dave, UC Berkeley)
IndexedRDD: Efficeint Fine-Grained Updates for RDD's-(Ankur Dave, UC Berkeley)
Spark Summit
 
Spark Technology Center IBM
Spark Technology Center IBMSpark Technology Center IBM
Spark Technology Center IBM
DataWorks Summit/Hadoop Summit
 

Viewers also liked (20)

Apache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARNApache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARN
 
Hortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical Applications
 
Hadoop crashcourse v3
Hadoop crashcourse v3Hadoop crashcourse v3
Hadoop crashcourse v3
 
Hortonworks technical workshop operations with ambari
Hortonworks technical workshop   operations with ambariHortonworks technical workshop   operations with ambari
Hortonworks technical workshop operations with ambari
 
Apache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance BenchmarksApache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance Benchmarks
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_final
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
 
Apache Ambari - What's New in 2.1
Apache Ambari - What's New in 2.1Apache Ambari - What's New in 2.1
Apache Ambari - What's New in 2.1
 
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
 
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaSpark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Reactive Programming Meetup - NodeJs on K8s
Reactive Programming Meetup - NodeJs on K8sReactive Programming Meetup - NodeJs on K8s
Reactive Programming Meetup - NodeJs on K8s
 
Sparkstreaming
SparkstreamingSparkstreaming
Sparkstreaming
 
Devops Spark Streaming
Devops Spark StreamingDevops Spark Streaming
Devops Spark Streaming
 
Scala training workshop 02
Scala training workshop 02Scala training workshop 02
Scala training workshop 02
 
IndexedRDD: Efficeint Fine-Grained Updates for RDD's-(Ankur Dave, UC Berkeley)
IndexedRDD: Efficeint Fine-Grained Updates for RDD's-(Ankur Dave, UC Berkeley)IndexedRDD: Efficeint Fine-Grained Updates for RDD's-(Ankur Dave, UC Berkeley)
IndexedRDD: Efficeint Fine-Grained Updates for RDD's-(Ankur Dave, UC Berkeley)
 
Spark Technology Center IBM
Spark Technology Center IBMSpark Technology Center IBM
Spark Technology Center IBM
 

Similar to YARN Ready: Apache Spark

Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
Saptak Sen
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
DataWorks Summit
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
Slim Baltagi
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
DataWorks Summit
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Alex Zeltov
 
Spark and Hadoop Technology
Spark and Hadoop Technology Spark and Hadoop Technology
Spark and Hadoop Technology
Avinash Gautam
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
DataWorks Summit
 
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The Union
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The UnionDataworks Berlin Summit 18' - Apache hadoop YARN State Of The Union
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The Union
Wangda Tan
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
DataWorks Summit
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
Cloudera, Inc.
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Data Con LA
 
Apache spark
Apache sparkApache spark
Apache spark
Prashant Pranay
 
Module01
 Module01 Module01
Module01
NPN Training
 
Getting Started with Spark Scala
Getting Started with Spark ScalaGetting Started with Spark Scala
Getting Started with Spark Scala
Knoldus Inc.
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwords
Szehon Ho
 
Spark + Hadoop Perfect together
Spark + Hadoop Perfect togetherSpark + Hadoop Perfect together
Spark + Hadoop Perfect together
Isheeta Sanghi
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
bigdatagurus_meetup
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
Data Con LA
 
Big Data Processing Using Spark.pptx
Big  Data  Processing  Using  Spark.pptxBig  Data  Processing  Using  Spark.pptx
Big Data Processing Using Spark.pptx
DeekshaM35
 

Similar to YARN Ready: Apache Spark (20)

Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
 
Spark and Hadoop Technology
Spark and Hadoop Technology Spark and Hadoop Technology
Spark and Hadoop Technology
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The Union
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The UnionDataworks Berlin Summit 18' - Apache hadoop YARN State Of The Union
Dataworks Berlin Summit 18' - Apache hadoop YARN State Of The Union
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Apache spark
Apache sparkApache spark
Apache spark
 
Module01
 Module01 Module01
Module01
 
Getting Started with Spark Scala
Getting Started with Spark ScalaGetting Started with Spark Scala
Getting Started with Spark Scala
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwords
 
Spark + Hadoop Perfect together
Spark + Hadoop Perfect togetherSpark + Hadoop Perfect together
Spark + Hadoop Perfect together
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
Big Data Processing Using Spark.pptx
Big  Data  Processing  Using  Spark.pptxBig  Data  Processing  Using  Spark.pptx
Big Data Processing Using Spark.pptx
 

More from Hortonworks

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
Hortonworks
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Hortonworks
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
Hortonworks
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Hortonworks
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
Hortonworks
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Hortonworks
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Hortonworks
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
Hortonworks
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
Hortonworks
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
Hortonworks
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
Hortonworks
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Hortonworks
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Hortonworks
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
Hortonworks
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Hortonworks
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
Hortonworks
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
Hortonworks
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Hortonworks
 

More from Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Recently uploaded

Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 

Recently uploaded (20)

Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 

YARN Ready: Apache Spark

  • 1. Spark Webinar October 2nd, 2014 Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Vinay Shukla & Ram Venkatesh
  • 2. Agenda • What is Spark? • What have we done with Spark so far • Tech Previews • Brief on Spark 1.1.0 Tech Preview • Multi tenant & multi workload with YARN • Introducing Spark-3561 • Get Involved Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 3. Let’s Talk About Apache Spark What is Spark? • Spark is a general-purpose big data engine that provides simple APIs for data scientists and engineers familiar with Scala, Python and Java to build ad-hoc interactive analytics, iterative machine-learning, and other use cases well-suited to interactive, in-memory data processing of GB to TB sized datasets. Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 4. What is Spark? Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved (Hadoop|FlatMapped|Filter|MapPartitions|Shuffled)RDD stage0: (Hadoop|FlatMapped|Filter|MapPartitions)RDD stage1: ShuffledRDD ShuffleMapTask: (flatMap | map) Task Task ResultTask: (reduceByKey) ShuffleMapTask: (flatMap | map) Spark API Spark Compiler / Optimizer DAG Runtime Execution Engine Spark Cluster YARN Mesos Client Cluster DAGScheduler, ActiveJob Task SparkAM
  • 5. Let’s Talk About Apache Spark (cont’d) What’s Our Spark Strategy? • Hortonworks is focused on enabling Spark for Enterprise Hadoop so users can deploy Spark-based applications along with their other Hadoop workloads in a consistent, predictable, and robust way. – Leverage Scale, Multi-tenancy provided by “YARN” so its memory and CPU intensive apps can work with predictable performance – Integrate it with HDP’s operations, security, governance, scalability, availability, and multi-tenancy capabilities Do We Have a Plan to Support Spark? Yes. • Spark is available now as a Technology Preview. • We are working our standard process of Tech Preview -> GA. We did this for Storm, Falcon, etc. • Spark will be added to our HDP Enterprise Plus subscription when it’s GA ready Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 6. Spark Timeline Break-down Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 7. Spark Roadmap 2014 JULY SEPT 1.0.1 TP Refresh Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved 1.1.0 TP Refresh DEC 1.2.0 GA • Hive 13 support • Limited ORC support • Spark on YARN: Deployment Best Practices • Ambari Support for Spark Install/Stop/Config • Spark on Kerberized Cluster • Authentication against LDAP in Spark UI
  • 8. What’s in Spark 1.1.0 Tech Preview • Upgrades Spark to Hive .13 • Provides Hive .13 features (new Hive UDFs) in Spark • Limited ORC support • Ability to manipulate ORC as HadoopRDD ….. val inputRead = sc.hadoopFile("hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/orc_table",cl assOf[org.apache.hadoop.hive.ql.io.orc.OrcInputFormat],classOf[org.apache.hadoop.io. NullWritable],classOf[org.apache.hadoop.hive.ql.io.orc.OrcStruct]) val k = inputRead.map(pair => pair._2.toString) val c = k.collect ….. Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 9. Spark Enterprise Readiness Enhancements Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 10. Spark Investment Phases • Phase 1 • Hive 0.13 support • Limited ORC support • Security: Spark certification on Kerberized Cluster • Security: Authentication in Spark UI against LDAP/AD • Operations: Ambari Stack Definition: Install/Start/Stop/Config/Quick links to Spark UI • Phase 2 • Improve reliability & Scale of Spark-on-YARN • Enhance ORC support • Improve Debug Capabilities • Security: Wire Encryption and Authorization with XA/Argus • Operations: Spark logs published to YARN Application Timeline Service (ATS) • Operations: Enhanced workload management Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 11. Spark on Hadoop October 2nd, 2014 Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Ram Venkatesh
  • 12. Spark-on-Hadoop – End User Benefits Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved © Hortonworks Inc. 2013 • Developer Productivity • Simple, easy to use APIs • Direct and elegant representation of the data processing flow • Focus on application business logic rather than Hadoop internals • Integrated develop-deploy-debug experience through the IDE • Multi-tenancy • Shared infrastructure across workloads – interactive queries by day, batch ETL at night • Better utilization of compute capacity • Move the execution to the data tier instead of the other way around • Reduced load on distributed filesystem (HDFS) • Reduce unnecessary replicated reads and writes • Reduced network usage • Eliminates the need for data transfer in and out of the cluster Page 12
  • 13. Spark-on-Hadoop – Design considerations • Don’t solve problems that have already been solved. –Leverage discrete task based compute model for elasticity, scalability and fault tolerance –Leverage several man years of work in Hadoop Map-Reduce data shuffling operations –Leverage proven resource sharing and multi-tenancy model for Hadoop and YARN –Leverage built-in security mechanisms in Hadoop for privacy and isolation Page 13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved © Hortonworks Inc. 2013 • Don’t create new problems –Preserve the simple developer experience –No changes to Spark programs, all programs run unmodified –Propose simple, mainstream in-the-community extension to the Apache Spark project Page 13 Look to the Future with an eye on the Past
  • 14. Spark on Hadoop – From service model to app model Spark jobs compile down to a Directed Acyclic Graph (DAG). • Vertices in the graph represent user logic • Edges represent data movement from producers to consumers • Spark DAG executed using Apache Tez at runtime Page 14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved © Hortonworks Inc. 2013 Page 14 Preprocessor Stage Partition Stage Aggregate Stage Sampler Task-1 Task-2 Task-1 Task-2 Task-1 Task-2 Samples Ranges Distributed Sort
  • 15. Spark-on-Hadoop – Simplifying Operations • No deployments to do. No side effects. Easy and safe to try it out! • Completely client side application. • Simply upload to any accessible FileSystem and point to the cluster through configuration files. • Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production. • Leverages YARN local resources. Spark Client TezTask Page 15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved TezTask © Hortonworks Inc. 2013 Page 15 Client Machine Node Manager Node Manager HDFS Spark-v1 Spark-v2 Spark Client Client Machine
  • 16. Benefits of native Hadoop execution of Spark DAGs • Elastic resource management - dynamic acquisition and release of containers •Works with YARN pre-emption, reservation and headroom calculations • Auto-parallelism based on sampling – you no longer need to guess no. of reducers • Efficient data movement between stages using the Hadoop shuffle • Integrates with resource isolation and governance mechanisms in Hadoop • Classpath and jarfile management through local resources • Detailed job-level metrics through integration with the YARN ATS Enables large-scale, multi-tenant batch ETL Spark programs Page 16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved © Hortonworks Inc. 2013 Page 16
  • 17. Introducing SPARK-3561 Page 17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 18. DEMO: SPARK-3561 in action Page 18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 19. SPARK-3561 under the hood Example program: Page 19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 20. SPARK-3561 Demo – contd. Execute program using spark-submit spark-submit --class dev.demo.WordCount --master execution-context:org.apache.spark.tez.TezJobExecutionContext spark-on-hadoop-1.0.jar 1 test.txt Execute interactive Spark commands through spark-shell spark-shell --master execution-context:org.apache.spark.tez.TezJobExecutionContext INFO main spark.SparkContext:59 - Will use custom job execution context org.apache.spark.tez.TezJobExecutionContext INFO main adapter.SparkToTezAdapter:59 - Adapting PairRDDFunctions.saveAsNewAPIHadoopDataset for Tez INFO main repl.SparkILoop:59 - Created spark context.. Spark context available as sc. scala> Page 20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 21. SPARK-3561 – feedback requested Provide feedback on your ETL/batch scenarios Participate in the discussion on the JIRA Try it out when it becomes available Looking for early adopters to run and validate at scale Page 21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 22. Resources • Spark Labs Page : http://hortonworks.com/hadoop/spark/ • Spark Roadmap Blog : http://hortonworks.com/blog/extending-spark-yarn- enterprise-hadoop/ • Spark 1.1.0 Tech Preview : http://hortonworks.com/kb/spark-1-1-0- technical-preview-hdp-2-1-5/ • Public Spark Forums : http://hortonworks.com/community/forums/forum/spark/ • Spark-3561 : https://issues.apache.org/jira/browse/SPARK-3561 Page 22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 23. Q&A… Discussion Page 23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved