Spark Architecture and Demo

2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Apache Spark : Fast and Easy Data
Processing
Sujee Maniyam
Founder / Principal
Elephant Scale LLC
sujee@elephantscale.com
http://elephantscale.com

Outline
r  Background
r  Spark Architecture
r  Demo

Spark
r  Fast & Expressive Cluster computing engine
r  Compatible with Hadoop
r  Came out of Berkeley AMP Lab
r  Now Apache project
r  Version 1.2 just released (Dec 2014)

Timeline Hadoop & Spark

Hypo-meter J

Spark Job Trends

Comparison With Hadoop
Hadoop Spark
Distributed Storage + Distributed
Compute
Distributed Compute Only
MapReduce framework Generalized computation
Usually data on disk (HDFS) On disk / in memory
Not ideal for iterative work Great at Iterative workloads
(machine learning ..etc)
Batch process - Upto 2x -10x faster for data on disk
- Upto 100x faster for data in memory
Compact code
Java, Python, Scala supported
Shell for ad-hoc exploration

Hadoop + Yarn : Universal OS for
Cluster Computing
HDFS
YARN
Batch
(mapreduce)
Streaming
(storm, S4)
In-memory
(spark)
Storage
Cluster
Management
Applications

Spark Vs Hadoop
r  Spark is ‘easier’ than Hadoop
r  ‘friendlier’ for data scientists / analysts
r  Interactive shell
r fast development cycles
r adhoc exploration
r  API supports multiple languages
r Java, Scala, Python
r  Great for small (Gigs) to medium (100s of Gigs)
data

Spark Vs. Hadoop
Fast!

Is Spark Replacing Hadoop?
r  Right now, Spark runs on Hadoop / YARN
r Complimentary
r  Can be seen as generic MapReduce
r  Spark is really great if data fits in memory (few
hundred gigs),
r  Spark can be used as compute platform with
various storage types (see next slide)

Spark & Pluggable Storage
Spark
(compute engine)
HDFS Amazon S3 Cassandra ???

Hadoop & Spark Future ???

Outline
r  Background
r  à Spark Architecture
r  Demo

Spark Eco-System
Spark Core
Spark
SQL
Spark
Streaming
ML lib
Schema / sql Real Time
Machine
Learning

Spark Core
r  Distributed compute engine
r  Sort / shuffle algorithms
r  Handles node failures -> re-computes missing
pieces

Spark SQL
r  Adhoc / interactive analytics
r  ETL workflow
r  Reporting
r  Natively understands
r JSON : {“name”: “mark”, “age”: 40}
r Parquet : on disk columnar format, heavily
compressed

Spark SQL Quick Example
{"name":"mark", "age":50, "sex":"M"},
{"name":"mary", "age":45, "sex":"F"},
{"name":"brian", "age":15, "sex":"M"},
{"name":"nancy", "age":17, "sex":"F"}
// … setup …
val teenagers = sqlContext.sql("SELECT name
FROM people WHERE age >= 13 AND age <= 19")
è  [brian, nancy]

Spark Streaming
r  Process data streams in real time, in micro-
batches
r  Low latency, high throughput (1000s events /
sec)
r  Stock ticks / sensor data (connected devices /
IoT – Internet of Things)
Streaming
Sources
Storage

Machine Learning (ML Lib)
r  Machine learning at scale
r  Out of the box ML capabilities !
r  Java / Scala / Python language support
r  Lots of common algorithms are supported
r  Classification / Regressions
r  Linear models (linear R, logistic regression, SVM)
r  Decision trees
r  Collaborative filtering (recommendations)
r  K-Means clustering
r  More to come

Spark Cluster Deployment
1)  Standalone
2)  Yarn
3)  Mesos
Client
Node

Spark Cluster Architecture
r  Multiple ‘applications’ can run at the same time
r  Driver (or ‘main’) launches an application
r  Each application gets its own ‘executor’
r  Isolated (runs in different JVMs)
r  Also means data can not be shared across applications
r  Cluster Managers:
r  multiple cluster managers are supported
r  1) Standalone : simple to setup
r  2) YARN : on top of Hadoop
r  3) Mesos : General cluster manager (AMP lab)

Spark Data Model : RDD
r  Resilient Distributed Dataset (RDD)
r  Can live in
r Memory (best case scenario)
r Or on disk (FS, HDFS, S3 …etc)
r  Each RDD is split into multiple partitions
r  Partitions may live on different nodes
r  Partitions can be computed in parallel on
different nodes

Partitions Explained
1G file
64 M 64 M 64 M 64 M
Task Task Task Task
Result
parallelism

RDD : Loading
r  Use Spark context to load RDDs from disk /
external storage
val sc = new SparkContext(…)
val f = sc.textFile(“/data/input1.txt”) // single file
sc.textFile(“/data/”) // load all files under dir
sc.textFile(“/data/*.log”) // wild card matching

RDD Operations
r  Two kinds of operations on RDDs
r 1) Transformations
r Create a new RDD from existing ones (e.g. Map)
r 2) Actions
r E.g. Returns the results to clients (e.g. Reduce)
r  Transformations are lazy.. Actions force
transformations

Transformations / Actions
RDD 1 RDD 2
Client
RDD 3
Transformation 1
(map)
Action1
(collect)

RDD Transformations
Transformation Description Example
filter Filters through each record
(aka grep)
f.filter( line =>
line.contains(“ERROR”))
union Merges two RDDs rdd1.union(rdd2)
…see docs …

RDD Actions
Action Description Example
count() Counts all records in an rdd f.count()
first() Extract the first record f.first ()
take(n) Take first N lines f.take(10)
collect() Gathers all records for RDD.
All data has to fit in memory of
ONE machine (don’t use for big
data sets)
f.collect()
…. See
documentation
..

RDD : Saving
r  saveAsTextFile () and saveAsSequenceFile()
f.saveAsTextFile(“/output/directory”) // a directory
r  Output usually is a directory
r RDDs will be saved as multiple files in the dir
r Each partition à one output file

Caching of RDDs
r  RDDs can be loaded from disk and computed
r Hadoop mapreduce model
r  Also RDDs can be cached in memory
r  Subsequent operations are much faster
f.persist() // on disk or memory
f.cache() // memory only
r  In memory RDDs are
great for iterative workloads
r Machine learning algorithms

Demo Time !

Demo : Spark-shell
r  Invoke spark shell
r  Load a data set
r  Do basic operations (count / filter)

Demo: RDD Caching
r  From Spark shell
r  Load an RDD
r  Demonstrate the difference between cached and
non-cached

Demo : Map Reduce
r  Quick Word count
val input = sc.textFile(“…”)
val counts = input.flatMap(
line => line.split(“ “)).
map(word => (word, 1)).
reduceByKey(_+_)

Thanks !
Sujee Maniyam
sujee@elephantscale.com
http://elephantscale.com
Expert consulting & training in Big Data

Credits
r  http://spark.apache.org/
r  http://www.strategictechplanning.com
r  Tuningpp.com
r  Kidzworld.com

Spark Architecture and Demo

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (8)

Similar to Spark Architecture and Demo

Similar to Spark Architecture and Demo (20)

More from Sujee Maniyam

More from Sujee Maniyam (6)

Recently uploaded

Recently uploaded (20)

Spark Architecture and Demo