SlideShare a Scribd company logo
1 of 20
Download to read offline
Installation
Martin Zapletal Cake Solutions
Apache Spark
Apache Spark and Big Data
1) History and market overview
2) Installation
3) MLlib and machine learning on Spark
4) Porting R code to Scala and Spark
5) Concepts - Core, SQL, GraphX, Streaming
6) Spark’s distributed programming model
7) Deployment
Table of Contents
● Spark architecture
● download, versions, install, startup
● Cluster managers
○ Local
○ Standalone
○ Mesos
○ YARN
● Spark shell
● Job deployment
● Streaming job deployment
● Integration with other tools
● after this session you should be able to install Spark, run Spark cluster and deploy
basic jobs
Installation
● prebuilt packages for different versions of Hadoop, CDH (Cloudera’s
distribution of Hadoop), MapR (MapR’s distribution of Hadoop)
○ currently only support for Scala 2.10
● build from source
○ uses mvn, but has a sbt wrapper
○ need to specify Hadoop version build against
○ can be built with Scala 2.11 support
Spark architecture
● cluster persistent, user submits Jobs
● SparkContext (driver) contacts Cluster Manager which assigns cluster resources
● then it sends application code to assigned Executors (distributing computation, not data!)
● finally sends tasks to Executors to run
● each master and worker run a webUI that displays task progress and results
● each application (SparkContext) has its own executors (not shared) living for the whole duration of the program
running in separate JVM using multiple threads
● Cluster Manager agnostic. Spark only needs to acquire executors and have them communicate with each other
Spark streaming
● mostly similar
● Receiver components - consuming from data source
● Receiver sends information to driver program which then schedules tasks (discretized streams,
small batches) to run in the cluster
○ number of assigned cores must be higher than number of Receivers
● different job lifecycle
○ potentially unbounded
○ needs to be stopped by calling sc.stop()
SparkContext
● passing configuration
● accessing cluster
● SparkContext then used to create RDD from input data
○ various sources
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(1))
Spark architecture
● 5 modes:
1. local
2. standalone
3. Yarn
4. Mesos
5. Amazon EC2
Local mode
● for application development purposes, no cluster required
● local
○ Run Spark locally with one worker thread (i.e. no parallelism at all).
● local[K]
○ Run Spark locally with K worker threads (ideally, set this to the number of cores on your
machine).
● local[*]
○ Run Spark locally with as many worker threads as logical cores on your machine.
● example local
Standalone mode
● place compiled version of spark at each node
● deployment scripts
○ sbin/start-master.sh
○ sbin/start-slaves.sh
○ sbin/stop-all.sh
● various settings, e.g. port, webUI port, memory, cores, java opts
● drivers use spark://HOST:PORT as master
● only supports a simple FIFO scheduler
○ application or global config decides how many cores and memory will be assigned to it.
● resilient to Worker failures, Master single point of failure
● supports Zookeeper for multiple Masters, leader election and state recovery. Running applications unaffected
● or local filesystem recovery mode just restarts Master if it goes down. Single node. Better with external monitor
● example 2 start cluster
YARN mode
● yet another resource negotiator
● decouples resource management and scheduler from data processing framework
● exclusive to Hadoop ecosystem
● binary distribution of spark built with YARN support
● uses hadoop configuration HADOOP_CONF_DIR or YARN_CONF_DIR
● master is set to either yarn-client or yarn-cluster
Mesos mode
● Mesos is a cluster operating system
● abstracts CPU, memory, storage and other resources enabling fault tolerant and elastic distriuted system
● can run Spark along with other applications (Hadoop, Kafka, ElasticSearch, Jenkins, ...) and manage resources
and scheduling across the whole cluster and all the applications
● Mesos master replaces Spark Master as Cluster Manager
● Spark binary accessible by Mesos (config)
● mesos://HOST:PORT for single master mesos or mesos://zk://HOST:PORT for multi master mesos using
Zookeeper for failover
● In “fine-grained” mode (default), each Spark task runs as a separate Mesos task. This allows multiple instances
of Spark (and other frameworks) to share machines at a very fine granularity
● The “coarse-grained” mode will instead launch only one long-running Spark task on each Mesos machine, and
dynamically schedule its own “mini-tasks” within it.
● project Myriad
● utility to connect to a cluster/local Spark
● no need to write program
● constructs and provides SparkContext
● similar to Scala console
● example 3 shell
Spark shell
Job deployment
● client or cluster mode
● spark-submit script
● spark driver program
● allows to write same programs, differ in deployment to cluster
Spark submit script
● need to build and submit a jar with all dependencies (the dependencies
need to be available at worker nodes)
● all other jars need to be specified using --jars
● spark and hadoop dependencies can be provided
● ./bin/spark-submit
○ --class <main class>
○ --master <master>
○ --deploy-mode <deploy mode>
○ --conf <key>=<value>
○ <application jar>
○ <application arguments>
Spark submit script
● non trivial automation
● need to build application jar, have it available at driver, submit job with
arguments and collect result
● deployment pipeline necessary
Spark driver program
● can be part of scala/akka application and execute Spark jobs
● needs dependencies, can not be provided
● jars need to be specified using .setJars() method
● running Spark applications, passing parameters and retrieving results
same as just running any other code
● dependency management, versions, compatibility, jar size
● one SparkContext per JVM
● example 3 submit script
Integration
● streaming
○ Kafka, Flume, Kinesis, Twitter, ZeroMQ, MQTT
● batch
○ HDFS, Cassandra, HBase, Amazon S3, …
○ text files, SequenceFiles, any Hadoop InputFormat
○ when loading local file then the file must be present on worker nodes
on given path. You need to either copy it or use dfs
Conclusion
● getting started with Spark is relatively simple
● tools simplifying development (console, local mode)
● cluster deployment fragile and difficult to troubleshoot
● networking using akka remoting
Questions

More Related Content

What's hot

Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabAbhinav Singh
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Spark Summit
 
Realtime olap architecture in apache kylin 3.0
Realtime olap architecture in apache kylin 3.0Realtime olap architecture in apache kylin 3.0
Realtime olap architecture in apache kylin 3.0Shi Shao Feng
 
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - HackacIntro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - HackacApache Apex
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...Chester Chen
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit
 
Productionalizing spark streaming applications
Productionalizing spark streaming applicationsProductionalizing spark streaming applications
Productionalizing spark streaming applicationsRobert Sanders
 
Writing Applications for Scylla
Writing Applications for ScyllaWriting Applications for Scylla
Writing Applications for ScyllaScyllaDB
 
PSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streamsPSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streamsStephane Manciot
 
TeraCache: Efficient Caching Over Fast Storage Devices
TeraCache: Efficient Caching Over Fast Storage DevicesTeraCache: Efficient Caching Over Fast Storage Devices
TeraCache: Efficient Caching Over Fast Storage DevicesDatabricks
 
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Databricks
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured StreamingKnoldus Inc.
 
Deep dive into spark streaming
Deep dive into spark streamingDeep dive into spark streaming
Deep dive into spark streamingTao Li
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafkadatamantra
 
A Travel Through Mesos
A Travel Through MesosA Travel Through Mesos
A Travel Through MesosDatio Big Data
 

What's hot (20)

Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
 
Realtime olap architecture in apache kylin 3.0
Realtime olap architecture in apache kylin 3.0Realtime olap architecture in apache kylin 3.0
Realtime olap architecture in apache kylin 3.0
 
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - HackacIntro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
 
Internals
InternalsInternals
Internals
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan Pu
 
Productionalizing spark streaming applications
Productionalizing spark streaming applicationsProductionalizing spark streaming applications
Productionalizing spark streaming applications
 
Writing Applications for Scylla
Writing Applications for ScyllaWriting Applications for Scylla
Writing Applications for Scylla
 
PSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streamsPSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streams
 
TeraCache: Efficient Caching Over Fast Storage Devices
TeraCache: Efficient Caching Over Fast Storage DevicesTeraCache: Efficient Caching Over Fast Storage Devices
TeraCache: Efficient Caching Over Fast Storage Devices
 
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Deep dive into spark streaming
Deep dive into spark streamingDeep dive into spark streaming
Deep dive into spark streaming
 
Spark Streaming into context
Spark Streaming into contextSpark Streaming into context
Spark Streaming into context
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
A Travel Through Mesos
A Travel Through MesosA Travel Through Mesos
A Travel Through Mesos
 

Viewers also liked

Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Humoyun Ahmedov
 
Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Ed Kohlwey
 
R Data Access from hdfs,spark,hive
R Data Access  from hdfs,spark,hiveR Data Access  from hdfs,spark,hive
R Data Access from hdfs,spark,hivearunkumar sadhasivam
 
Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2Martin Zapletal
 
Large volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformLarge volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformMartin Zapletal
 
Data in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data EfficientlyData in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data EfficientlyMartin Zapletal
 
greenplum installation guide - 4 node VM
greenplum installation guide - 4 node VM greenplum installation guide - 4 node VM
greenplum installation guide - 4 node VM seungdon Choi
 
Manchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra IntegrationManchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra IntegrationChristopher Batey
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARNDataWorks Summit
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosPaco Nathan
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)Spark Summit
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch ProcessingEdureka!
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 

Viewers also liked (20)

Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications
 
Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
R Data Access from hdfs,spark,hive
R Data Access  from hdfs,spark,hiveR Data Access  from hdfs,spark,hive
R Data Access from hdfs,spark,hive
 
Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2
 
Large volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformLarge volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive Platform
 
Data in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data EfficientlyData in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data Efficiently
 
greenplum installation guide - 4 node VM
greenplum installation guide - 4 node VM greenplum installation guide - 4 node VM
greenplum installation guide - 4 node VM
 
Manchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra IntegrationManchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra Integration
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache Mesos
 
Curator intro
Curator introCurator intro
Curator intro
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 

Similar to Install Apache Spark

Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1datamantra
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingDemi Ben-Ari
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Data Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource ManagersData Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource ManagersAnant Corporation
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2datamantra
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internalsAnton Kirillov
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Datio Big Data
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanSpark Summit
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark exampleShidrokhGoudarzi1
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...Akhil Das
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche SparkAlex Thompson
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 

Similar to Install Apache Spark (20)

Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computing
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Data Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource ManagersData Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource Managers
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internals
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
 
Spark core
Spark coreSpark core
Spark core
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark ...
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 

Recently uploaded

HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 

Recently uploaded (20)

HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 

Install Apache Spark

  • 1. Installation Martin Zapletal Cake Solutions Apache Spark
  • 2. Apache Spark and Big Data 1) History and market overview 2) Installation 3) MLlib and machine learning on Spark 4) Porting R code to Scala and Spark 5) Concepts - Core, SQL, GraphX, Streaming 6) Spark’s distributed programming model 7) Deployment
  • 3. Table of Contents ● Spark architecture ● download, versions, install, startup ● Cluster managers ○ Local ○ Standalone ○ Mesos ○ YARN ● Spark shell ● Job deployment ● Streaming job deployment ● Integration with other tools ● after this session you should be able to install Spark, run Spark cluster and deploy basic jobs
  • 4. Installation ● prebuilt packages for different versions of Hadoop, CDH (Cloudera’s distribution of Hadoop), MapR (MapR’s distribution of Hadoop) ○ currently only support for Scala 2.10 ● build from source ○ uses mvn, but has a sbt wrapper ○ need to specify Hadoop version build against ○ can be built with Scala 2.11 support
  • 5. Spark architecture ● cluster persistent, user submits Jobs ● SparkContext (driver) contacts Cluster Manager which assigns cluster resources ● then it sends application code to assigned Executors (distributing computation, not data!) ● finally sends tasks to Executors to run ● each master and worker run a webUI that displays task progress and results ● each application (SparkContext) has its own executors (not shared) living for the whole duration of the program running in separate JVM using multiple threads ● Cluster Manager agnostic. Spark only needs to acquire executors and have them communicate with each other
  • 6. Spark streaming ● mostly similar ● Receiver components - consuming from data source ● Receiver sends information to driver program which then schedules tasks (discretized streams, small batches) to run in the cluster ○ number of assigned cores must be higher than number of Receivers ● different job lifecycle ○ potentially unbounded ○ needs to be stopped by calling sc.stop()
  • 7. SparkContext ● passing configuration ● accessing cluster ● SparkContext then used to create RDD from input data ○ various sources val conf = new SparkConf().setAppName(appName).setMaster(master) val sc = new SparkContext(conf) val conf = new SparkConf().setAppName(appName).setMaster(master) val ssc = new StreamingContext(conf, Seconds(1))
  • 8. Spark architecture ● 5 modes: 1. local 2. standalone 3. Yarn 4. Mesos 5. Amazon EC2
  • 9. Local mode ● for application development purposes, no cluster required ● local ○ Run Spark locally with one worker thread (i.e. no parallelism at all). ● local[K] ○ Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). ● local[*] ○ Run Spark locally with as many worker threads as logical cores on your machine. ● example local
  • 10. Standalone mode ● place compiled version of spark at each node ● deployment scripts ○ sbin/start-master.sh ○ sbin/start-slaves.sh ○ sbin/stop-all.sh ● various settings, e.g. port, webUI port, memory, cores, java opts ● drivers use spark://HOST:PORT as master ● only supports a simple FIFO scheduler ○ application or global config decides how many cores and memory will be assigned to it. ● resilient to Worker failures, Master single point of failure ● supports Zookeeper for multiple Masters, leader election and state recovery. Running applications unaffected ● or local filesystem recovery mode just restarts Master if it goes down. Single node. Better with external monitor ● example 2 start cluster
  • 11. YARN mode ● yet another resource negotiator ● decouples resource management and scheduler from data processing framework ● exclusive to Hadoop ecosystem ● binary distribution of spark built with YARN support ● uses hadoop configuration HADOOP_CONF_DIR or YARN_CONF_DIR ● master is set to either yarn-client or yarn-cluster
  • 12. Mesos mode ● Mesos is a cluster operating system ● abstracts CPU, memory, storage and other resources enabling fault tolerant and elastic distriuted system ● can run Spark along with other applications (Hadoop, Kafka, ElasticSearch, Jenkins, ...) and manage resources and scheduling across the whole cluster and all the applications ● Mesos master replaces Spark Master as Cluster Manager ● Spark binary accessible by Mesos (config) ● mesos://HOST:PORT for single master mesos or mesos://zk://HOST:PORT for multi master mesos using Zookeeper for failover ● In “fine-grained” mode (default), each Spark task runs as a separate Mesos task. This allows multiple instances of Spark (and other frameworks) to share machines at a very fine granularity ● The “coarse-grained” mode will instead launch only one long-running Spark task on each Mesos machine, and dynamically schedule its own “mini-tasks” within it. ● project Myriad
  • 13. ● utility to connect to a cluster/local Spark ● no need to write program ● constructs and provides SparkContext ● similar to Scala console ● example 3 shell Spark shell
  • 14. Job deployment ● client or cluster mode ● spark-submit script ● spark driver program ● allows to write same programs, differ in deployment to cluster
  • 15. Spark submit script ● need to build and submit a jar with all dependencies (the dependencies need to be available at worker nodes) ● all other jars need to be specified using --jars ● spark and hadoop dependencies can be provided ● ./bin/spark-submit ○ --class <main class> ○ --master <master> ○ --deploy-mode <deploy mode> ○ --conf <key>=<value> ○ <application jar> ○ <application arguments>
  • 16. Spark submit script ● non trivial automation ● need to build application jar, have it available at driver, submit job with arguments and collect result ● deployment pipeline necessary
  • 17. Spark driver program ● can be part of scala/akka application and execute Spark jobs ● needs dependencies, can not be provided ● jars need to be specified using .setJars() method ● running Spark applications, passing parameters and retrieving results same as just running any other code ● dependency management, versions, compatibility, jar size ● one SparkContext per JVM ● example 3 submit script
  • 18. Integration ● streaming ○ Kafka, Flume, Kinesis, Twitter, ZeroMQ, MQTT ● batch ○ HDFS, Cassandra, HBase, Amazon S3, … ○ text files, SequenceFiles, any Hadoop InputFormat ○ when loading local file then the file must be present on worker nodes on given path. You need to either copy it or use dfs
  • 19. Conclusion ● getting started with Spark is relatively simple ● tools simplifying development (console, local mode) ● cluster deployment fragile and difficult to troubleshoot ● networking using akka remoting