From Hadoop to Spark
Introduction
Hadoop and Spark Comparison
From Hadoop to Spark
Hi, I’m Sujee Maniyam
•  Founder / Principal @ ElephantScale
•  Consulting & training in Big Data
•  Spark / Hadoop / NoSQL / data science
•  Author
  –  “Hadoop Illuminated” (open source book)
  –  “HBase Design Patterns”
•  Open source contributor: github.com/sujee
•  sujee@elephantscale.com
•  www.ElephantScale.com
Spark Training available!
(c) ElephantScale.com 2015
Webinar Audience
•  I am already using Hadoop. Should I move to Spark?
•  I am thinking about Hadoop. Should I skip Hadoop and go straight to Spark?
Webinar Outline
•  Intro: what is Hadoop and what is Spark?
•  Capabilities and advantages of Spark & Hadoop
•  Best use cases for Spark / Hadoop
•  From Hadoop to Spark: how to?
Introduction
Hadoop in 20 Seconds
•  ‘The original’ big data platform
•  Very well field-tested
•  Scales to petabytes of data
•  Enables analytics at massive scale
Hadoop Ecosystem
[Diagram: ecosystem components grouped into Batch and Real Time]
Hadoop Ecosystem – by Function
•  HDFS – provides distributed storage
•  MapReduce – provides distributed computing
•  Pig – high-level MapReduce
•  Hive – SQL layer over Hadoop
•  HBase – NoSQL storage for real-time queries
Hadoop Extended Ecosystem
[Diagram omitted] Source: Hortonworks
Hadoop: Use Cases
•  Two modes: batch & real time
•  Batch use case
  –  Analytics at large scale (terabyte to petabyte scale)
  –  Analysis times can be minutes to hours, depending on
    •  the size of the data being analyzed
    •  the type of query
  –  Examples:
    •  Large ETL workloads
    •  “Analyze clickstream data and calculate top page visits”
    •  “Combine purchase data and click data and figure out which discounts to apply”
Hadoop Use Cases
•  Real-time use cases do not rely on MapReduce
•  Instead we use HBase
  –  A real-time NoSQL datastore built on Hadoop
•  Example: tracking sensor data
  –  Store data from millions of sensors
  –  Could be billions of data points
  –  “Find the latest reading from a sensor”
  –  This query must be answered in real time (milliseconds)
•  “Needle in a haystack” scenarios
  –  We look for one or a few records within billions
Hadoop Reference Architecture (Example)
[Diagram omitted] Source: Hortonworks
Data Spectrum
[Diagram omitted]
Big Data Analytics Evolution (v1)
•  Decision times: batch (hours / days)
•  Use cases:
  –  Modeling
  –  ETL
  –  Reporting
Moving Towards Fast Data (v2)
•  Decision time: (near) real time
  –  Seconds (or milliseconds)
•  Use cases
  –  Alerts (medical / security)
  –  Fraud detection
Current Big Data Processing Challenges
•  Processing needs are outpacing 1st-generation tools
•  Beyond batch
  –  Not everyone has terabytes of data to process
  –  Small-to-medium data sets (a few hundred gigs) are more prevalent
  –  Data may not be on disk
    •  In memory
    •  Coming via streaming channels
•  MapReduce (MR)’s limitations
  –  Batch processing doesn’t fit all needs
  –  Not effective for iterative programming (machine learning algorithms, etc.)
  –  High latency for streaming needs
•  Spark is a 2nd-generation tool addressing these needs
What is Spark?
•  Open source cluster computing engine
  –  Very fast: in-memory ops up to 100x faster than MR; on-disk ops up to 10x faster
  –  General purpose: MR-style batch, SQL, streaming, machine learning, analytics
  –  Compatible: runs over Hadoop, Mesos, YARN, or standalone
    •  Works with HDFS, S3, Cassandra, HBase, …
  –  Easier to code: word count in 2 lines (see the sketch below)
•  Spark’s roots:
  –  Came out of the Berkeley AMP Lab
  –  Now a top-level Apache project
  –  Version 1.5 released in Sept 2015
“First Big Data platform to integrate batch, streaming and interactive computations in a unified framework” – stratio.com
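The two-line word count mentioned above, as a minimal Scala sketch against the RDD API (the input and output paths are placeholders):

// split lines into words, emit (word, 1) pairs, and sum the counts per word
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///data/wordcounts")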
Spark Illustrated
Spark Core, with components layered on top:
  –  Spark SQL (schema / SQL)
  –  Spark Streaming (real time)
  –  MLlib (machine learning)
  –  GraphX (graph processing)
Cluster managers: Standalone, YARN, Mesos
Data storage: HDFS, S3, Cassandra, ???
Spark Core
•  Basic building blocks for a distributed compute engine
  –  Task scheduling and memory management
  –  Fault recovery (recovers missing pieces on node failure)
  –  Storage system interfaces
•  Defines the Spark API and data model
•  Data model: RDD (Resilient Distributed Dataset)
  –  Distributed collection of items
  –  Can be worked on in parallel
  –  Easily created from many data sources (any HDFS InputSource)
•  Spark API: Scala, Python, and Java
  –  Compact API for working with RDDs and interacting with Spark
  –  Much easier to use than the MapReduce API (see the example below)
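A small shell session showing the RDD model in action; the data is generated in place with parallelize, so nothing about storage is assumed:

scala> val nums = sc.parallelize(1 to 1000000)   // distributed collection
scala> val evens = nums.filter(_ % 2 == 0)       // transformation: lazy, returns a new RDD
scala> evens.count()                             // action: triggers the parallel computation
res0: Long = 500000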
Spark Components
•  Spark SQL: structured data
  –  Supports SQL and HQL (Hive Query Language)
  –  Data sources include Hive tables, JSON, CSV, Parquet
•  Spark Streaming: live streams of data in (near) real time (sketch below)
  –  Low latency, high throughput (1000s of events / sec)
  –  Log files, stock ticks, sensor data / IoT (Internet of Things), …
•  MLlib: machine learning at scale
  –  Classification / regression, collaborative filtering, …
  –  Model evaluation and data import
•  GraphX: graph manipulation, graph-parallel computation
  –  Social network friendships, link data, …
  –  Graph operations and common algorithms
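For a flavor of the streaming API, a minimal Spark Streaming word count (Spark 1.x); the socket host and port are placeholders:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))        // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // placeholder text source
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()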
Spark: ‘Unified’ Stack
•  Spark components support multiple programming models
  –  MapReduce-style batch processing
  –  Streaming / real-time processing
  –  Querying via SQL
  –  Machine learning
•  All modules are tightly integrated
  –  Facilitates rich applications
•  Spark can be the only stack you need!
  –  No need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)
Hype-o-meter ☺
[Chart omitted]
Spark Job Trends
[Chart omitted]
Hadoop and Spark Comparison
Spark Benchmarks
[Chart omitted] Source: stratio.com
Spark Code / Activity
[Chart omitted] Source: stratio.com
Timeline: Hadoop & Spark
[Chart omitted]
Hadoop vs. Spark
[Image omitted] Source: http://www.kwigger.com/mit-skifte-til-mac/
Comparison With Hadoop

Hadoop                                    | Spark
Distributed storage + distributed compute | Distributed compute only
MapReduce framework                       | Generalized computation
Usually data on disk (HDFS)               | On disk / in memory
Not ideal for iterative work              | Great at iterative workloads (machine learning, etc.)
Batch processing                          | Up to 10x faster for data on disk; up to 100x faster for data in memory
Mostly Java                               | Compact code; Java, Python, Scala supported
No unified shell                          | Shell for ad-hoc exploration
Spark Is a Better Fit for Iterative Workloads
[Diagram omitted]
Spark Programming Model
•  More generic than MapReduce
[Diagram omitted]
Is Spark Replacing Hadoop?
•  Spark runs on Hadoop / YARN
•  Can access data in HDFS
•  Uses YARN for clustering
•  The Spark programming model is more flexible than MapReduce
•  Spark is really great if data fits in memory (a few hundred gigs)
•  Spark is ‘storage agnostic’ (see next slide)
Spark & Pluggable Storage
Spark (compute engine) runs over pluggable storage: HDFS, Amazon S3, Cassandra, ???
Spark & Hadoop

Use case                        | Hadoop                               | Spark
Batch processing                | Hadoop’s MapReduce (Java, Pig, Hive) | Spark RDDs (Java / Scala / Python)
SQL querying                    | Hive                                 | Spark SQL
Stream / real-time processing   | Storm, Kafka                         | Spark Streaming
Machine learning                | Mahout                               | Spark MLlib
Real-time lookups               | HBase (NoSQL)                        | No native Spark component, but Spark can query data in NoSQL stores
Hadoop + YARN: OS for Distributed Compute
•  Storage: HDFS
•  Cluster management: YARN
•  Applications: batch (MapReduce), streaming (Storm, S4), in-memory (Spark)
(or at least, that’s the idea)
Hadoop & Spark: Future???
[Image omitted]
Going from Hadoop to Spark
Why Move From Hadoop to Spark?
•  Spark is ‘easier’ than Hadoop
•  ‘Friendlier’ for data scientists / analysts
  –  Interactive shell (see the session below)
    •  Fast development cycles
    •  Ad-hoc exploration
•  API supports multiple languages
  –  Java, Scala, Python
•  Great for small (gigs) to medium (100s of gigs) data
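A taste of that interactive cycle in the Scala shell (the log file and its contents are hypothetical):

scala> val logs = sc.textFile("server.log")
scala> logs.filter(_.contains("ERROR")).count()    // ask a quick ad-hoc question
res0: Long = 42
scala> logs.filter(_.contains("ERROR")).take(5)    // eyeball a few matching records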
Spark: ‘Unified’ Stack (recap)
•  One engine covers batch, streaming, SQL, and machine learning, so Spark can be the only stack you need; no need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)
Migrating From Hadoop → Spark

Functionality       | Hadoop                          | Spark
Distributed storage | HDFS; cloud storage (Amazon S3) | HDFS; cloud storage (Amazon S3); distributed file systems (NFS / Ceph); distributed NoSQL (Cassandra); Tachyon (in-memory)
SQL querying        | Hive                            | Spark SQL (DataFrames)
ETL workflow        | Pig                             | Spork (Pig on Spark); mix of Spark SQL + RDD programming
Machine learning    | Mahout                          | MLlib
NoSQL DB            | HBase                           | ???
Things to Consider When Moving From Hadoop to Spark
1.  Data size
2.  File system
3.  Analytics
    A.  SQL
    B.  ETL
    C.  Machine learning
Data Size: “You Don’t Have Big Data”
[Chart omitted]
Data Size (T-shirt Sizing)
[Chart: T-shirt sizes from < a few G through 10 G+, 100 G+, 1 TB+, 100 TB+ and PB+; Spark alone fits the smaller sizes, Hadoop / Spark the larger ones]
Image credit: blog.trumpi.co.za
Data Size
•  A lot of Spark adoption is at small-to-medium scale
  –  Good fit
  –  Data might fit in memory!
•  Applications
  –  Iterative workloads (machine learning, etc.)
  –  Streaming
Decision: Data Size
•  < 1 TB → Spark
•  > 1 TB → Hadoop / Spark
Decision: File System
“What kind of file system do I need for Spark?”
File System
•  Hadoop = storage + compute
•  Spark = compute only
•  Spark needs a distributed FS
•  File system choices for Spark
  –  HDFS (Hadoop Distributed File System)
    •  Reliable
    •  Good performance (data locality)
    •  Field-tested at PB scale
  –  S3 (Amazon)
    •  Reliable cloud storage
    •  Huge scale
  –  NFS: Network File System (a shared FS across machines)
  –  Tachyon (in-memory, experimental)
Spark File Systems
[Diagram omitted]
File Systems for Spark

              | HDFS                   | NFS           | Amazon S3
Data locality | High (best)            | Local enough  | None (ok)
Throughput    | High (best)            | Medium (good) | Low (ok)
Latency       | Low (best)             | Low           | High
Reliability   | Very high (replicated) | Low           | Very high
Cost          | Varies                 | Varies        | $30 / TB / month
File System Throughput Comparison
•  Data: 10 G+ (11.3 G)
•  Each file: ~1+ G (x 10)
•  400 million records total
•  Partition size: 128 M
•  On HDFS & S3
•  Cluster:
  –  8 nodes on Amazon m3.xlarge (4 CPUs, 15 G mem, 40 G SSD)
  –  Hadoop cluster: Hortonworks HDP v2.2
  –  Spark: on the same 8 nodes, standalone, v1.2
HDFS vs. S3 (lower is better)
[Benchmark charts omitted]
HDFS vs. S3: Conclusions

HDFS                                   | S3
Data locality → much higher throughput | Data is streamed → lower throughput
Need to maintain a Hadoop cluster      | No Hadoop cluster to maintain → convenient
Good for large data sets (TB+)         | Good use case: smallish data sets (a few gigs); load once, cache, and re-use (sketch below)
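A sketch of the load-once-and-cache pattern against S3 (the bucket and path are placeholders; s3n:// was the usual URI scheme at the time):

val logs = sc.textFile("s3n://my-bucket/logs/2015/*").cache()
logs.count()                               // first action streams from S3 and fills the cache
logs.filter(_.contains("ERROR")).count()   // later queries are served from memory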
Decision: File Systems
•  Already have Hadoop? YES → use HDFS
•  NO → choose from HDFS, S3, NFS (Ceph), or Cassandra (real time)
Next Decision: SQL
“We use SQL heavily for data mining. We are using Hive / Impala on Hadoop. Is Spark right for us?”
SQL in Hadoop / Spark

                 | Hadoop                                                        | Spark
Engine           | Hive (on MapReduce, or Tez on Hortonworks); Impala (Cloudera) | Spark SQL using DataFrames; HiveContext
Language         | HiveQL                                                        | HiveQL; RDD programming in Java / Python / Scala
Scale            | Terabytes / petabytes                                         | Gigabytes / terabytes / petabytes
Interoperability | Data stored in HDFS                                           | Hive tables; file system
Formats          | CSV, JSON, Parquet                                            | CSV, JSON, Parquet
DataFrames vs. RDDs
•  RDDs have data
•  DataFrames also have a schema
•  DataFrames used to be called ‘SchemaRDD’
•  Unified way to load / save data in multiple formats
•  Provides high-level operations (sketch below)
  –  Count / sum / average
  –  Select columns & filter them

// load JSON data
df = sqlContext.read
  .format("json")
  .load("/data/data.json")

// save as Parquet (faster queries); note save() takes a path, saveAsTable() a table name
df.write
  .format("parquet")
  .save("/data/datap/")
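A sketch of the high-level operations listed above, run against a loaded DataFrame (the column names are hypothetical):

df.select("name", "age").filter(df("age") > 30).show()   // project two columns, keep matching rows
df.groupBy("dept").avg("age").show()                     // aggregate: average per group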
Supported Formats
[Diagram omitted]
Creating a DataFrame From JSON

{"name": "John", "age": 35 }
{"name": "Jane", "age": 40 }
{"name": "Mike", "age": 20 }
{"name": "Sue", "age": 52 }

scala> val peopleDF = sqlContext.read.json("people.json")
peopleDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> peopleDF.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

scala> peopleDF.show()
+---+----+
|age|name|
+---+----+
| 35|John|
| 40|Jane|
| 20|Mike|
| 52| Sue|
+---+----+
Querying Using SQL
•  A DataFrame can be registered as a temporary table
  –  You can then use SQL to query it, as shown below
  –  This is handled similarly to DSL queries: building up an AST and sending it to Catalyst

scala> df.registerTempTable("people")

scala> sqlContext.sql("select * from people").show()
name age
John 35
Jane 40
Mike 20
Sue  52

scala> sqlContext.sql("select * from people where age > 35").show()
name age
Jane 40
Sue  52
Going From Hive → Spark
•  Spark natively supports querying data stored in Hive tables!
•  Handy to use in an existing Hadoop cluster!

HIVE
hive> select customer_id, SUM(cost) as total from billing group by customer_id order by total DESC LIMIT 10;

SPARK
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val top10 = hiveCtx.sql(
  "select customer_id, SUM(cost) as total from billing group by customer_id order by total DESC LIMIT 10")
top10.collect()
Spark SQL vs. Hive
[Benchmark charts omitted: Spark SQL is fast on the same data in HDFS]
Decision: SQL
•  Using Hive? YES → Spark using HiveContext
•  NO → Spark SQL with DataFrames
Next Decision: ETL
“We do a lot of ETL work on our Hadoop cluster, using tools like Pig / Cascading. Can we use Spark?”
ETL on Hadoop / Spark

          | Hadoop                  | Spark
ETL tools | Pig, Cascading, Oozie   | Native RDD programming (Scala, Java, Python); Cascading?
Pig       | High-level ETL workflow | Spork: Pig on Spark
Cascading | High-level ETL          | spark-scalding
Cask      | Works                   | Works
Data Transformation on Spark
•  DataFrames are great for high-level manipulation of data
  –  High-level operations: join / union, etc.
  –  Joining / merging disparate data sets (see the sketch below)
  –  Can read and understand a multitude of data formats (JSON / Parquet, etc.)
  –  Very easy to program
•  RDD APIs allow low-level programming
  –  Complex manipulations
  –  Lookups
  –  Supports multiple languages (Java / Scala / Python)
•  High-level libraries are emerging
  –  Tresata
  –  CASK
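A minimal DataFrame-based transformation sketch (Spark 1.x API; all paths and column names are hypothetical):

// read two disparate data sets in different formats
val users  = sqlContext.read.format("json").load("/data/users.json")
val orders = sqlContext.read.format("parquet").load("/data/orders")

// join them on a shared key and write out the enriched records
val enriched = users.join(orders, users("id") === orders("user_id"))
enriched.write.format("parquet").save("/data/enriched")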
Decisions: ETL
•  Pig → Spork (Pig on Spark), the RDD API, or DataFrames
•  Cascading → Cascading on Spark, the RDD API, or DataFrames
•  Java MapReduce / custom → the RDD API or DataFrames
Decision: Machine Learning
“Can we use Spark for machine learning?” YES
Machine Learning: Hadoop / Spark

                     | Hadoop (Mahout)                                                                              | Spark (MLlib)
API                  | Java                                                                                         | Java / Scala / Python
Iterative algorithms | Slower                                                                                       | Very fast (in memory)
In-memory processing | No                                                                                           | Yes
Notes                | Mahout runs on Hadoop or on Spark; latest news: Mahout only accepts new code that runs on Spark | New and young library
Future?              | Many opinions; Mahout & MLlib both run on Spark
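A small MLlib sketch using the Spark 1.x RDD-based API: k-means clustering. The input file and its comma-separated numeric format are assumptions:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// parse each line into a feature vector; cache because k-means iterates over the data
val points = sc.textFile("/data/points.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations
println(model.clusterCenters.mkString("\n"))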
Decision: In-Memory Processing
“How can we do in-memory processing using Spark?”
Numbers Everyone Should Know – Jeff Dean, Fellow @ Google

Operation                           | Cost (nanoseconds)
L1 cache reference                  | 0.5
Branch mispredict (CPU)             | 5
L2 cache reference                  | 7
Mutex lock/unlock                   | 100
Main memory reference               | 100
Compress 1K bytes with Zippy        | 10,000
Send 2K bytes over 1 Gbps network   | 20,000
Read 1 MB sequentially from memory  | 250,000 (0.25 ms)
Round trip within same datacenter   | 500,000 (0.5 ms)
Disk seek                           | 10,000,000 (10 ms)
Read 1 MB sequentially from network | 10,000,000 (10 ms)
Read 1 MB sequentially from disk    | 30,000,000 (30 ms)
Send packet CA → Netherlands → CA   | 150,000,000 (150 ms)
Spark Caching
•  Caching is pretty effective (small / medium data sets)
•  Cached data cannot be shared across applications (each application executes in its own sandbox)
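Caching is a one-line change; a sketch (the path is hypothetical):

val events = sc.textFile("/data/events").cache()   // mark the RDD for in-memory caching
events.count()   // first action computes the RDD and populates the cache
events.count()   // second action reads from memory and is much faster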
Caching Results
[Benchmark charts omitted: queries against the cached data set run much faster]
Sharing Cached Data
•  By default, Spark applications cannot share cached data
  –  Each runs in isolation
•  Option 1: ‘Spark Job Server’
  –  A multiplexer: all requests are executed through the same ‘context’
  –  Provides a web-service interface
•  Option 2: Tachyon
  –  Distributed in-memory file system
  –  Memory is the new disk!
  –  Out of the AMP Lab, Berkeley
  –  Early stages (very promising)
Spark Job Server
[Architecture diagram omitted]
Spark Job Server
•  Open sourced from Ooyala
•  ‘Spark as a Service’: a simple REST interface to launch jobs
•  Sub-second latency!
•  Pre-load jars for even faster spin-up
•  Share cached RDDs across requests (NamedRDD)
•  https://github.com/spark-jobserver/spark-jobserver

Illustrative pseudocode of two applications sharing a named RDD through the same context:

App1:
sharedCtx.saveRDD("my cached rdd", rdd1)

App2:
val rdd2 = sharedCtx.loadRDD("my cached rdd")
Tachyon + Spark
[Architecture diagram omitted]
How to Get Spark?

Getting Spark
•  Already running Hadoop? YES → install Spark on the Hadoop cluster
•  Not running Hadoop, but need HDFS? → install HDFS + YARN + Spark
•  Neither → choose by environment:
  –  Production → Spark + Mesos + S3
  –  Testing → Spark (standalone) + NFS or S3
Spark Cluster Setup 1: Simple
•  Great for POCs / experimentation
•  No dependencies
•  Uses Spark’s ‘standalone’ cluster manager
Spark Cluster Setup 2: Production
•  Works well with the Hadoop ecosystem (HDFS / Hive, etc.)
•  Best way to adopt Spark on Hadoop
•  Uses YARN as the cluster manager
Spark Cluster Setup 3: Production
•  Uses Mesos as the cluster manager
Hadoop → Spark Case Study
Use Case 1: Moving to the Cloud
[Architecture diagram omitted]
Use Case 1: Lessons Learned
•  Size
  –  Small Hadoop cluster (8 nodes)
  –  Smallish data: 50 G – 300 G
  –  Data for processing: a few gigs per query
•  Good!
  –  Only one moving part: Spark
  –  No Hadoop cluster to maintain
  –  S3 was dependable storage (passive)
  –  Query response time went from minutes to seconds (because we went from MR to Spark)
•  Not so good
  –  We lost the data locality of HDFS (OK for small/medium data sets)
Use Case 2: Persistent Caching in Spark
•  Can we improve latency in this setup?
•  Caching will help
•  However, in Spark cached data cannot be shared across applications
Use Case 2: Persistent Caching in Spark
•  Spark Job Server to the rescue!
[Architecture diagram omitted]
Final Thoughts
•  Already on Hadoop?
  –  Try Spark side by side
  –  Process some data in HDFS
  –  Try Spark SQL on Hive tables
•  Contemplating Hadoop?
  –  Try Spark (standalone)
  –  Choose an NFS or S3 file system
•  Take advantage of caching
  –  Iterative loads
  –  Spark Job Server
  –  Tachyon
Thanks and Questions?
Sujee Maniyam
Founder / Principal @ ElephantScale
Expert consulting + training in Big Data technologies
sujee@elephantscale.com
ElephantScale.com
Sign up for upcoming trainings: ElephantScale.com/training