Webinar: From Hadoop to Spark
Introduction
Hadoop and Spark Comparison
From Hadoop to Spark
www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved.
Webinar Objectives
 Intro: what is Hadoop and what is Spark?
 Spark's capabilities and advantages vs Hadoop
 From Hadoop to Spark – how to?
Introduction
Hadoop in 20 Seconds
 ‘The’ Big data platform
 Very well field tested
 Scales to petabytes of data
 MapReduce : Batch oriented compute
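The MapReduce model named above can be sketched in a few lines of plain Python. This is only a single-machine illustration of the map, shuffle, and reduce phases, not Hadoop's API; all function names here are made up for the example.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Run the three phases in sequence (hypothetical helper for this sketch)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "Big compute"]))
```

In a real cluster each phase runs in parallel across machines and the intermediate pairs are written to disk, which is where the batch-oriented character comes from.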
Hadoop Eco System
(Diagram: ecosystem components grouped into Batch and Real Time)
Hadoop Ecosystem – by function
 HDFS
– provides distributed storage
 Map Reduce
– Provides distributed computing
 Pig
– High level MapReduce
 Hive
– SQL layer over Hadoop
 HBase
– NoSQL storage for real-time queries
Spark in 20 Seconds
 Fast & Expressive Cluster computing engine
 Compatible with Hadoop
 Came out of Berkeley AMP Lab
 Now Apache project
 Version 1.3 just released (April 2015)
“First Big Data platform to integrate batch, streaming and
interactive computations in a unified framework” – stratio.com
Spark Eco-System
 Spark Core
 Spark SQL – schema / SQL
 Spark Streaming – real time
 MLlib – machine learning
 GraphX – graph processing
 Cluster managers – Standalone, YARN, Mesos
Hypo-meter
Spark Job Trends
Spark Benchmarks (Source: stratio.com)
Spark Code / Activity (Source: stratio.com)
Timeline : Hadoop & Spark
Hadoop and Spark Comparison
Hadoop Vs. Spark
(Image comparing Hadoop and Spark. Source : http://www.kwigger.com/mit-skifte-til-mac/)
Comparison With Hadoop
Hadoop                                     | Spark
Distributed storage + distributed compute  | Distributed compute only
MapReduce framework                        | Generalized computation
Usually data on disk (HDFS)                | On disk / in memory
Not ideal for iterative work               | Great at iterative workloads (machine learning, etc.)
Batch process                              | Up to 10x faster for data on disk; up to 100x faster for data in memory
                                           | Compact code
                                           | Java, Python, Scala supported
                                           | Shell for ad-hoc exploration
Hadoop + Yarn : OS for Distributed Compute
Applications: Batch (MapReduce), Streaming (Storm, S4), In-memory (Spark)
Cluster management: YARN
Storage: HDFS
(or at least, that’s the idea)
Spark Is Better Fit for Iterative Workloads
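A back-of-the-envelope cost model shows why iterative workloads favor keeping data in memory. The sketch below assumes a job that re-reads its input from disk on every iteration versus one that reads it once and reuses it; the function names and the timings are hypothetical and chosen only for illustration.

```python
def mapreduce_style_seconds(iterations, read_seconds, compute_seconds):
    """Each iteration re-reads the input from disk before computing."""
    return iterations * (read_seconds + compute_seconds)


def cached_style_seconds(iterations, read_seconds, compute_seconds):
    """The input is read once, kept in memory, and reused by every iteration."""
    return read_seconds + iterations * compute_seconds


if __name__ == "__main__":
    # Hypothetical numbers: 10 iterations, 60 s to read, 5 s to compute.
    iterations, read_s, compute_s = 10, 60, 5
    print(mapreduce_style_seconds(iterations, read_s, compute_s))  # 650
    print(cached_style_seconds(iterations, read_s, compute_s))     # 110
```

The gap widens as the number of iterations grows, which is typical of machine learning algorithms that pass over the same data many times.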
Spark Programming Model
 More generic than MapReduce
Is Spark Replacing Hadoop?
 Spark runs on Hadoop / YARN
– Complementary
 Spark programming model is more flexible than MapReduce
 Spark is really great if data fits in memory (a few hundred gigs)
 Spark is ‘storage agnostic’ (see next slide)
Spark & Pluggable Storage
Spark (compute engine) on top of pluggable storage: HDFS, Amazon S3, Cassandra, ???
Spark & Hadoop
Use Case                                  | Other                                | Spark
Batch processing                          | Hadoop’s MapReduce (Java, Pig, Hive) | Spark RDDs (Java / Scala / Python)
SQL querying                              | Hadoop : Hive                        | Spark SQL
Stream processing / real-time processing  | Storm, Kafka                         | Spark Streaming
Machine learning                          | Mahout                               | Spark MLlib
Real-time lookups                         | NoSQL (HBase, Cassandra, etc.)       | No Spark component, but Spark can query data in NoSQL stores
Hadoop & Spark Future ???
Going from Hadoop to Spark
Why Move From Hadoop to Spark?
 Spark is ‘easier’ than Hadoop
 ‘friendlier’ for data scientists / analysts
– Interactive shell
• fast development cycles
• ad hoc exploration
 API supports multiple languages
– Java, Scala, Python
 Great for small (Gigs) to medium (100s of Gigs) data
Spark : ‘Unified’ Stack
 Spark supports multiple programming models
– Map reduce style batch processing
– Streaming / real time processing
– Querying via SQL
– Machine learning
 All modules are tightly integrated
– Facilitates rich applications
 Spark can be the only stack you need !
– No need to run multiple clusters
(Hadoop cluster, Storm cluster, … etc.)
Migrating From Hadoop to Spark
Functionality        | Hadoop | Spark
Distributed storage  | HDFS   | Cloud storage like Amazon S3, or NFS mounts
SQL querying         | Hive   | Spark SQL
ETL workflow         | Pig    | Spork : Pig on Spark; mix of Spark SQL
Machine learning     | Mahout | MLlib
NoSQL DB             | HBase  | ???
Five Steps of Moving From Hadoop to Spark
1. Data size
2. File System
3. SQL
4. ETL
5. Machine Learning
Data Size : “You Don’t Have Big Data”
1) Data Size (T-shirt sizing)
(Figure: T-shirt sizes for data, from < few G through 10 G+, 100 G+, 1 TB+, 100 TB+ to PB+, with the ranges covered by Spark and by Hadoop marked. Image credit : blog.trumpi.co.za)
1) Data Size
 A lot of Spark adoption at SMALL – MEDIUM scale
– Good fit
– Data might fit in memory !!
– Hadoop may be overkill
 Applications
– Iterative workloads (Machine learning, etc.)
– Streaming
 Hadoop is still the preferred platform for TB + data
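The sizing guidance above can be written down as a tiny rule of thumb. This is a sketch only: the 1 TB cut-off is an assumption read off the "TB +" bullet, 1 TB is taken as 1024 GB, and `suggest_platform` is a hypothetical helper, not part of any tool.

```python
GB_PER_TB = 1024  # assumption: binary units


def suggest_platform(data_size_gb):
    """Return a rough platform suggestion for a data set of the given size in GB.

    Small (gigabytes) to medium (hundreds of gigabytes) data may fit in memory,
    so Spark is a good fit; at a terabyte and beyond the deck still prefers Hadoop.
    """
    if data_size_gb < GB_PER_TB:
        return "Spark"
    return "Hadoop"


if __name__ == "__main__":
    for size_gb in (5, 300, 2 * GB_PER_TB):
        print(size_gb, "GB ->", suggest_platform(size_gb))
```

Real decisions also depend on cluster memory, workload type, and existing infrastructure, so treat the threshold as a starting point rather than a rule.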
2) File System
 Hadoop = Storage + Compute
Spark = Compute only
Spark needs a distributed FS
 File system choices for Spark
– HDFS - Hadoop File System
• Reliable
• Good performance (data locality)
• Field tested for PB of data
– S3 : Amazon
• Reliable cloud storage
• Huge scale
– NFS : Network File System (shared FS across machines)
Spark File Systems
File Systems For Spark
Property       | HDFS                   | NFS           | Amazon S3
Data locality  | High (best)            | Local enough  | None (ok)
Throughput     | High (best)            | Medium (good) | Low (ok)
Latency        | Low (best)             | Low           | High
Reliability    | Very High (replicated) | Low           | Very High
Cost           | Varies                 | Varies        | $30 / TB / Month
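To put the S3 cost figure in perspective, the arithmetic can be sketched as follows. The $30 / TB / month rate is the one quoted in the table (a 2015 figure); treating 1 TB as 1024 GB and the helper name `s3_monthly_cost_usd` are assumptions made for this example.

```python
S3_USD_PER_TB_MONTH = 30.0  # rate quoted in the table above
GB_PER_TB = 1024            # assumption: binary units


def s3_monthly_cost_usd(data_size_gb, usd_per_tb_month=S3_USD_PER_TB_MONTH):
    """Estimate the monthly storage cost in USD for a data set of the given size."""
    return data_size_gb / GB_PER_TB * usd_per_tb_month


if __name__ == "__main__":
    print(s3_monthly_cost_usd(512))   # 15.0
    print(s3_monthly_cost_usd(1024))  # 30.0
```

Storage is only part of the bill; request and data-transfer charges are not modeled here.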
File Systems Throughput Comparison
 Data : 10G + (11.3 G)
 Each file : ~1+ G ( x 10)
 400 million records total
 Partition size : 128 M
 On HDFS & S3
 Cluster :
– 8 Nodes on Amazon m3.xlarge (4 cpu , 15 G Mem, 40G SSD )
– Hadoop cluster, latest Hortonworks HDP v2.2
– Spark : on same 8 nodes, stand-alone, v 1.2
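The partition size in this setup gives a rough idea of how many tasks one pass over the data produces. The sketch below simply divides the total size by the partition size; the actual split count depends on file boundaries and the input format, and `partition_count` is a hypothetical helper.

```python
import math


def partition_count(data_size_mb, partition_size_mb=128):
    """Approximate number of partitions for data of the given size in MB."""
    return math.ceil(data_size_mb / partition_size_mb)


if __name__ == "__main__":
    total_mb = 11.3 * 1024          # 11.3 G from the benchmark setup
    parts = partition_count(total_mb)
    print(parts)                    # 91
    print(400_000_000 // parts)     # approximate records per partition
```

With 8 nodes of 4 CPUs each, about 91 partitions means roughly three waves of tasks if every core processes one partition at a time.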
HDFS Vs. S3 (lower is better)
HDFS Vs. S3 Conclusions
HDFS                                    | S3
Data locality: much higher throughput   | Data is streamed: lower throughput
Need to maintain a Hadoop cluster       | No Hadoop cluster to maintain: convenient
Large data sets (TB +)                  | Good use case: smallish data sets (few gigs); load once, cache, and re-use
3) SQL in Hadoop / Spark
Aspect            | Hadoop             | Spark
Engine            | Hive               | Spark SQL
Language          | HiveQL             | HiveQL; RDD programming in Java / Python / Scala
Scale             | Petabytes          | Terabytes ?
Interoperability  |                    | Can read Hive tables or standalone data
Formats           | CSV, JSON, Parquet | CSV, JSON, Parquet
Spark SQL Vs. Hive
Fast on same HDFS data !
4) ETL on Hadoop / Spark
Aspect     | Hadoop                  | Spark
ETL tools  | Pig, Cascading, Oozie   | Native RDD programming (Scala, Java, Python)
Pig        | High-level ETL workflow | Spork : Pig on Spark
Cascading  | High level              | Spark-scalding
4) ETL On Hadoop / Spark : Conclusions
 Try spork or spark-scalding
– Code re-use
– Not re-writing from scratch
 Program RDDs directly
– More flexible
– Multiple language support : Scala / Java / Python
– Simpler / faster in some cases
 Our experience of porting a financial application
– Tresata vs. RDD
5) Machine Learning : Hadoop / Spark
Aspect                | Hadoop                                          | Spark
Tool                  | Mahout                                          | MLlib
API                   | Java                                            | Java / Scala / Python
Iterative algorithms  | Slower                                          | Very fast (in memory)
In-memory processing  | No                                              | Yes
                      | Mahout runs on Hadoop or on Spark               | New and young lib
Latest news!          | Mahout only accepts new code that runs on Spark | Mahout & MLlib on Spark
Future?               | Many opinions                                   |
Our experience, legal (eDiscovery)
FreeEed (Hadoop)
– Scalable document processing
– All Enron docs in 1 hour (50-node Hadoop)
3VEed (Storm, Spark)
– Allows dynamically adding data sources (use case: more data discovered for the same lawsuit)
– Allows real-time data processing (use case: real-time emails)
– Provides much improved load balancing (example: 10 GB PST mailbox)
– Overall: a much better fit for modern data governance
Final Thoughts
 Already on Hadoop?
– Try Spark side-by-side
– Process some data in HDFS
– Try Spark SQL for Hive tables
 Contemplating Hadoop?
– Try Spark (standalone)
– Choose NFS or S3 file system
 Take advantage of caching
– Iterative loads
– Spark Job servers
– Tachyon
 Build new class of ‘big / medium data’ apps
Thanks !
http://elephantscale.com
Expert consulting & training in Big Data
(Now offering Spark training)
Spark Caching!
 Reading data from remote FS (S3) can be slow
 For small / medium data (10 – 100s of GB) use caching
– Pay read penalty once
– Cache
– Then very high speed computes (in memory)
– Recommended for iterative workloads
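The "pay the read penalty once, then compute from memory" pattern above can be illustrated without a cluster. The sketch below is plain Python, not Spark code: `CachedDataset` and `slow_remote_read` are hypothetical stand-ins for a cached data set and a slow remote read. In Spark itself the corresponding step is calling `cache()` on an RDD.

```python
import time


class CachedDataset:
    """Loads data through a (slow) loader once and serves later reads from memory."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self.load_count = 0  # how many times the slow read actually ran

    def get(self):
        if self._data is None:
            self._data = self._loader()
            self.load_count += 1
        return self._data


def slow_remote_read():
    """Stand-in for reading from a remote file system such as S3."""
    time.sleep(0.05)
    return list(range(1000))


if __name__ == "__main__":
    dataset = CachedDataset(slow_remote_read)
    total = sum(dataset.get())    # first access pays the read penalty
    largest = max(dataset.get())  # later accesses are served from memory
    print(total, largest, dataset.load_count)
```

Every computation after the first reuses the in-memory copy, which is what makes repeated passes over the same data cheap.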
Caching Results
Cached!
Spark Caching
 Caching is pretty effective (small / medium data sets)
 Cached data cannot be shared across applications
(each application executes in its own sandbox)
Sharing Cached Data
 1) ‘spark job server’
– Multiplexer
– All requests are executed through same ‘context’
– Provides web-service interface
 2) Tachyon
– Distributed In-memory file system
– Memory is the new disk!
– Out of AMPLab, Berkeley
– Early stages (very promising)
Spark Job Server
Spark Job Server
 Open sourced from Ooyala
 ‘Spark as a Service’ – simple REST interface to launch jobs
 Sub-second latency !
 Pre-load jars for even faster spinup
 Share cached RDDs across requests (NamedRDD)
App1:
ctx.saveRDD("my cached rdd", rdd1)
App2:
RDD rdd2 = ctx.loadRDD("my cached rdd")
 https://github.com/spark-jobserver/spark-jobserver
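The idea behind sharing cached data by name can be sketched with a toy registry. This is not the spark-jobserver API: `SharedContext`, `save`, and `load` are hypothetical names, and the "datasets" are ordinary Python lists. It only shows that requests routed through one long-lived context can reuse what an earlier request stored.

```python
class SharedContext:
    """Toy stand-in for one long-running context shared by many requests."""

    def __init__(self):
        self._named = {}

    def save(self, name, dataset):
        """Register a dataset under a name so later requests can find it."""
        self._named[name] = dataset

    def load(self, name):
        """Return the dataset registered under the name, or raise KeyError."""
        try:
            return self._named[name]
        except KeyError:
            raise KeyError(f"no cached dataset named {name!r}") from None


if __name__ == "__main__":
    ctx = SharedContext()

    # "App1" stores a dataset it has already computed.
    ctx.save("my cached rdd", [1, 2, 3])

    # "App2" picks it up by name instead of recomputing it.
    print(sum(ctx.load("my cached rdd")))
```

Because both requests go through the same context object, nothing is copied or recomputed; that shared context is the role the job server plays as a multiplexer.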
Tachyon + Spark
Next : New Big Data Applications With Spark
Big Data Applications : Now
 Analysis is done in batch mode (minutes / hours)
 Final results are stored in a real time data store like
Cassandra / HBase
 These results are displayed in a dashboard / web UI
 Doing interactive analysis ????
– Need special BI tools
With Spark…
 Load data set (gigabytes) from S3 and cache it (one time)
 Super fast (sub-second) queries to data
 Response time : seconds (just like a web app !)
Lessons Learned
 Build sophisticated apps !
 Web-response-time (few seconds) !!
 In-depth analytics
– Leverage existing libraries in Java / Scala / Python
 ‘data analytics as a service’
www.synerzip.com
Ashish Shanker
Ashish.Shanker@synerzip.com
469.374.0500
Synerzip in a Nutshell
 Software product development partner for small/mid-sized technology
companies
• Exclusive focus on small/mid-sized technology companies, typically venture-
backed companies in growth phase
• By definition, all Synerzip work is the IP of its respective clients
• Deep experience in full SDLC – design, dev, QA/testing, deployment
 Dedicated team of high caliber software professionals for each client
• Seamlessly extends client’s local team offering full transparency
• Stable teams with very low turn-over
• NOT just “staff augmentation”, but provides full management support
 Actually reduces risk of development/delivery
• Experienced team – uses appropriate level of engineering discipline
• Practices Agile development – responsive yet disciplined
 Reduces cost – dual-site team, 50% cost advantage
 Offers long-term flexibility – allows (facilitates) taking offshore team
captive – aka “BOT” option
Synerzip Clients
Join Us In Person
Agile Texas 2015 Tour
Presented by
Hemant Elhence & Vinayak Joglekar
Next Webinar
7 Sins of Scrum and other Agile Anti-Patterns
Complimentary Webinar:
Tuesday, September 22, 2015 @ Noon CST
Presented by: Todd Little
IHM
Connect with Synerzip
@Synerzip_Agile
linkedin.com/company/synerzip
facebook.com/Synerzip
The Future of Platform Engineering
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 

Insight on "From Hadoop to Spark" by Mark Kerzner

  • 1. Webinar: From Hadoop to Spark Introduction Hadoop and Spark Comparison From Hadoop to Spark
  • 2. Webinar Objectives (www.synerzip.com Webinar Series 2015. Copyright © 2015 Elephant Scale LLC. All rights reserved.)
      - Intro: what is Hadoop and what is Spark?
      - Spark's capabilities and advantages vs. Hadoop
      - From Hadoop to Spark: how to?
  • 3. Introduction Introduction Hadoop and Spark Comparison From Hadoop to Spark
  • 4. Hadoop in 20 Seconds
      - ‘The’ Big Data platform
      - Very well field tested
      - Scales to petabytes of data
      - MapReduce: batch-oriented compute
  • 5. Hadoop Ecosystem (diagram; axis labels: Batch, Real Time)
  • 6. Hadoop Ecosystem by function
      - HDFS: provides distributed storage
      - MapReduce: provides distributed computing
      - Pig: high-level MapReduce
      - Hive: SQL layer over Hadoop
      - HBase: NoSQL storage for real-time queries
  • 7. Spark in 20 Seconds
      - Fast and expressive cluster computing engine
      - Compatible with Hadoop
      - Came out of Berkeley AMP Lab
      - Now an Apache project
      - Version 1.3 just released (April 2015)
      - “First Big Data platform to integrate batch, streaming and interactive computations in a unified framework” (stratio.com)
  • 8. Spark Ecosystem (diagram)
      - Spark Core
      - Spark SQL: schema / SQL
      - Spark Streaming: real time
      - MLlib: machine learning
      - GraphX: graph processing
      - Cluster managers: Standalone, YARN, Mesos
  • 9. Hypo-meter
  • 10. Spark Job Trends
  • 11. Spark Benchmarks (source: stratio.com)
  • 12. Spark Code / Activity (source: stratio.com)
  • 13. Timeline: Hadoop & Spark
  • 14. Hadoop and Spark Comparison (section divider; agenda: Introduction, Hadoop and Spark Comparison, Going from Hadoop to Spark; Session 2: Introduction to Spark)
  • 15. Hadoop vs. Spark (image labeled Hadoop and Spark; source: http://www.kwigger.com/mit-skifte-til-mac/)
  • 16. Comparison With Hadoop
      Hadoop | Spark
      Distributed storage + distributed compute | Distributed compute only
      MapReduce framework | Generalized computation
      Usually data on disk (HDFS) | On disk / in memory
      Not ideal for iterative work | Great at iterative workloads (machine learning, etc.)
      Batch process | Up to 10x faster for data on disk; up to 100x faster for data in memory
      The Spark column also lists: compact code; Java, Python, Scala supported; shell for ad-hoc exploration
  • 17. Hadoop + YARN: OS for Distributed Compute
      - Storage: HDFS
      - Cluster management: YARN
      - Applications: batch (MapReduce), streaming (Storm, S4), in-memory (Spark)
      - (or at least, that’s the idea)
  • 18. Spark Is a Better Fit for Iterative Workloads
  • 19. Spark Programming Model
      - More generic than MapReduce
  • 20. Is Spark Replacing Hadoop?
      - Spark runs on Hadoop / YARN: complementary
      - Spark programming model is more flexible than MapReduce
      - Spark is really great if data fits in memory (few hundred gigs)
      - Spark is ‘storage agnostic’ (see next slide)
  • 21. Spark & Pluggable Storage
      - Spark (compute engine) on top of HDFS, Amazon S3, Cassandra, ???
  • 22. Spark & Hadoop
      Use case | Other | Spark
      Batch processing | Hadoop’s MapReduce (Java, Pig, Hive) | Spark RDDs (Java / Scala / Python)
      SQL querying | Hadoop: Hive | Spark SQL
      Stream processing / real-time processing | Storm, Kafka | Spark Streaming
      Machine learning | Mahout | Spark MLlib
      Real-time lookups | NoSQL (HBase, Cassandra, etc.) | No Spark component, but Spark can query data in NoSQL stores
  • 23. Hadoop & Spark Future ???
  • 24. Going from Hadoop to Spark (section divider; agenda: Introduction, Hadoop and Spark Comparison, Going from Hadoop to Spark; Session 2: Introduction to Spark)
  • 25. Why Move From Hadoop to Spark?
      - Spark is ‘easier’ than Hadoop
      - ‘Friendlier’ for data scientists / analysts
          - Interactive shell: fast development cycles, ad-hoc exploration
      - API supports multiple languages
          - Java, Scala, Python
      - Great for small (gigs) to medium (100s of gigs) data
  • 26. Spark: ‘Unified’ Stack
      - Spark supports multiple programming models
          - MapReduce-style batch processing
          - Streaming / real-time processing
          - Querying via SQL
          - Machine learning
      - All modules are tightly integrated
          - Facilitates rich applications
      - Spark can be the only stack you need!
          - No need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)
      - Image: buymeposters.com
  • 27. Migrating From Hadoop to Spark
      Functionality | Hadoop | Spark
      Distributed storage | HDFS | Cloud storage like Amazon S3, or NFS mounts
      SQL querying | Hive | Spark SQL
      ETL workflow | Pig | Spork (Pig on Spark); mix of Spark SQL
      Machine learning | Mahout | MLlib
      NoSQL DB | HBase | ???
  • 28. Five Steps of Moving From Hadoop to Spark
      1. Data size
      2. File system
      3. SQL
      4. ETL
      5. Machine learning
  • 29. Data Size: “You Don’t Have Big Data”
  • 30. 1) Data Size (T-shirt sizing)
      - Size scale: < few G, 10 G+, 100 G+, 1 TB+, 100 TB+, PB+ (diagram labels: Hadoop, Spark)
      - Image credit: blog.trumpi.co.za
  • 31. 1) Data Size
      - A lot of Spark adoption at SMALL to MEDIUM scale
          - Good fit
          - Data might fit in memory!
          - Hadoop may be overkill
      - Applications
          - Iterative workloads (machine learning, etc.)
          - Streaming
      - Hadoop is still the preferred platform for TB+ data
  • 32. 2) File System
      - Hadoop = storage + compute; Spark = compute only; Spark needs a distributed FS
      - File system choices for Spark
          - HDFS (Hadoop File System): reliable; good performance (data locality); field tested for PB of data
          - S3 (Amazon): reliable cloud storage; huge scale
          - NFS (Network File System): ‘shared’ FS across machines
  • 33. Spark File Systems
  • 34. File Systems For Spark
      Property | HDFS | NFS | Amazon S3
      Data locality | High (best) | Local enough | None (ok)
      Throughput | High (best) | Medium (good) | Low (ok)
      Latency | Low (best) | Low | High
      Reliability | Very high (replicated) | Low | Very high
      Cost | Varies | Varies | $30 / TB / month
  • 35. File Systems Throughput Comparison
      - Data: 10 G+ (11.3 G)
      - Each file: ~1+ G (x 10)
      - 400 million records total
      - Partition size: 128 M
      - On HDFS & S3
      - Cluster:
          - 8 nodes on Amazon m3.xlarge (4 CPU, 15 G memory, 40 G SSD)
          - Hadoop cluster: latest Hortonworks HDP v2.2
          - Spark: on same 8 nodes, standalone, v1.2
  • 36. HDFS vs. S3 (chart; lower is better)
  • 37. HDFS vs. S3 Conclusions
      HDFS | S3
      Data locality → much higher throughput | Data is streamed → lower throughput
      Need to maintain a Hadoop cluster | No Hadoop cluster to maintain → convenient
      Large data sets (TB+) | Good use case: smallish data sets (few gigs); load once, cache and re-use
  • 38. 3) SQL in Hadoop / Spark
      Property | Hadoop | Spark
      Engine | Hive | Spark SQL
      Language | HiveQL | HiveQL; RDD programming in Java / Python / Scala
      Scale | Petabytes | Terabytes ?
      Interoperability | | Can read Hive tables or standalone data
      Formats | CSV, JSON, Parquet | CSV, JSON, Parquet
  • 39. Spark SQL vs. Hive (chart): fast on same HDFS data!
  • 40. 4) ETL on Hadoop / Spark
      Property | Hadoop | Spark
      ETL tools | Pig, Cascading, Oozie | Native RDD programming (Scala, Java, Python)
      Pig | High-level ETL workflow | Spork: Pig on Spark
      Cascading | High level | Spark-scalding
  • 41. 4) ETL on Hadoop / Spark: Conclusions
      - Try Spork or spark-scalding
          - Code re-use
          - Not re-writing from scratch
      - Program RDDs directly
          - More flexible
          - Multiple language support: Scala / Java / Python
          - Simpler / faster in some cases
      - Our experience of porting a financial application
          - Tresata vs. RDD
  • 42. 5) Machine Learning: Hadoop / Spark
      Property | Hadoop | Spark
      Tool | Mahout | MLlib
      API | Java | Java / Scala / Python
      Iterative algorithms | Slower | Very fast (in memory)
      In-memory processing | No | YES
      (unlabeled) | Mahout runs on Hadoop or on Spark | New and young lib
      Latest news! | Mahout only accepts new code that runs on Spark | Mahout & MLlib on Spark
      Future? | Many opinions
  • 43. Our experience, legal (eDiscovery)
      - FreeEed (Hadoop)
          - Scalable document processing
          - All Enron docs in 1 hour (50-node Hadoop)
      - 3VEed (Storm, Spark)
          - Allows dynamically adding data sources (use case: more data discovered for the same lawsuit)
          - Allows real-time data processing (use case: real-time emails)
          - Provides much improved load balancing (example: 10 GB PST mailbox)
          - Overall: a much better fit for modern data governance
  • 44. Final Thoughts
      - Already on Hadoop?
          - Try Spark side-by-side
          - Process some data in HDFS
          - Try Spark SQL for Hive tables
      - Contemplating Hadoop?
          - Try Spark (standalone)
          - Choose NFS or S3 file system
      - Take advantage of caching
          - Iterative loads
          - Spark job servers
          - Tachyon
      - Build a new class of ‘big / medium data’ apps
  • 45. Thanks! http://elephantscale.com Expert consulting & training in Big Data (now offering Spark training)
  • 46. Spark Caching!
      - Reading data from a remote FS (S3) can be slow
      - For small / medium data (10 to 100s of GB) use caching
          - Pay the read penalty once
          - Cache
          - Then very high speed computes (in memory)
          - Recommended for iterative workloads
  • 47. Caching Results (chart annotation: Cached!)
  • 48. Spark Caching
      - Caching is pretty effective (small / medium data sets)
      - Cached data cannot be shared across applications (each application executes in its own sandbox)
  • 49. Sharing Cached Data
      - 1) ‘Spark job server’
          - Multiplexer
          - All requests are executed through the same ‘context’
          - Provides a web-service interface
      - 2) Tachyon
          - Distributed in-memory file system
          - Memory is the new disk!
          - Out of AMP Lab, Berkeley
          - Early stages (very promising)
  • 50. Spark Job Server
  • 51. Spark Job Server
      - Open sourced from Ooyala
      - ‘Spark as a Service’: simple REST interface to launch jobs
      - Sub-second latency!
      - Pre-load jars for even faster spin-up
      - Share cached RDDs across requests (NamedRDD)
          App1: ctx.saveRDD("my cached rdd", rdd1)
          App2: RDD rdd2 = ctx.loadRDD("my cached rdd")
      - https://github.com/spark-jobserver/spark-jobserver
  • 52. Tachyon + Spark
  • 53. Next: New Big Data Applications With Spark
  • 54. Big Data Applications: Now
      - Analysis is done in batch mode (minutes / hours)
      - Final results are stored in a real-time data store like Cassandra / HBase
      - These results are displayed in a dashboard / web UI
      - Doing interactive analysis ????
          - Need special BI tools
  • 55. With Spark…
      - Load data set (gigabytes) from S3 and cache it (one time)
      - Super fast (sub-second) queries to data
      - Response time: seconds (just like a web app!)
  • 56. Lessons Learned
      - Build sophisticated apps!
      - Web response time (few seconds)!
      - In-depth analytics
          - Leverage existing libraries in Java / Scala / Python
      - ‘Data analytics as a service’
  • 57. Ashish Shanker, Ashish.Shanker@synerzip.com, 469.374.0500, www.synerzip.com
  • 58. Synerzip in a Nutshell
      - Software product development partner for small/mid-sized technology companies
          - Exclusive focus on small/mid-sized technology companies, typically venture-backed companies in growth phase
          - By definition, all Synerzip work is the IP of its respective clients
          - Deep experience in full SDLC: design, dev, QA/testing, deployment
      - Dedicated team of high-caliber software professionals for each client
          - Seamlessly extends client’s local team, offering full transparency
          - Stable teams with very low turnover
          - NOT just “staff augmentation”, but provides full management support
      - Actually reduces risk of development/delivery
          - Experienced team: uses appropriate level of engineering discipline
          - Practices Agile development: responsive yet disciplined
      - Reduces cost: dual-site team, 50% cost advantage
      - Offers long-term flexibility: allows (facilitates) taking offshore team captive, aka “BOT” option
  • 59. Synerzip Clients
  • 60. Join Us In Person: Agile Texas 2015 Tour, presented by Hemant Elhence & Vinayak Joglekar
  • 61. Next Webinar: 7 Sins of Scrum and other Agile Anti-Patterns. Complimentary webinar: Tuesday, September 22, 2015 @ Noon CST. Presented by: Todd Little, IHM
  • 62. Ashish Shanker, Ashish.shanker@synerzip.com, 469.374.0500. Connect with Synerzip: @Synerzip_Agile, linkedin.com/company/synerzip, facebook.com/Synerzip
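
Slides 46 to 51 describe paying the slow remote read once, caching the result, and sharing it by name through a single long-lived context. A minimal plain-Python sketch of that idea follows. It is not the spark-jobserver API and does not use Spark at all: `SharedContext`, `load_remote`, `get_or_load`, and the `s3://example-bucket/data` path are hypothetical names chosen only for illustration.

```python
class SharedContext:
    """Toy stand-in for one long-lived context that serves many requests."""

    def __init__(self):
        self._named = {}      # name -> dataset kept in memory
        self.slow_reads = 0   # how many times the "remote store" was read

    def load_remote(self, path):
        # Hypothetical slow read; a real job would pull data from S3 or similar.
        self.slow_reads += 1
        return [f"{path}:record-{i}" for i in range(5)]

    def get_or_load(self, name, path):
        # Pay the read penalty only on the first request for this name.
        if name not in self._named:
            self._named[name] = self.load_remote(path)
        return self._named[name]


ctx = SharedContext()
# Two separate "applications" ask the same context for the same named dataset.
first = ctx.get_or_load("my cached rdd", "s3://example-bucket/data")
second = ctx.get_or_load("my cached rdd", "s3://example-bucket/data")
print(ctx.slow_reads)
```

Because both requests go through the same context, the second one is served from memory. Two independent contexts would each pay the read penalty, which is the sandboxing limitation slide 48 points out.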

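The benchmark setup on slide 35 (11.3 G in 10 files, 128 M partitions, 400 million records) implies a rough partition count. The arithmetic below is only an estimate under stated assumptions: the files are equal in size, each file is split independently, and 1 G means 1024 M. The actual number of partitions Spark creates depends on the input format and configuration; `partitions_for` is a hypothetical helper, not a Spark function.

```python
import math

MB_PER_GB = 1024  # assumption: binary units


def partitions_for(size_gb, partition_mb=128):
    """Number of fixed-size partitions needed to cover size_gb of data."""
    return math.ceil(size_gb * MB_PER_GB / partition_mb)


total_gb = 11.3          # from slide 35
files = 10               # from slide 35
records = 400_000_000    # from slide 35

per_file = partitions_for(total_gb / files)         # each ~1.13 G file
total_partitions = per_file * files                 # files split independently
records_per_partition = records // total_partitions
single_blob = partitions_for(total_gb)              # if treated as one 11.3 G input

print(per_file, total_partitions, records_per_partition, single_blob)
```

Under these assumptions the job works on about 100 partitions of roughly 4 million records each, which helps put the per-node load on the 8-node cluster into perspective.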
Editor's Notes

  1. 1
  2. 2
  3. 3
  4. 14
  5. Hadoop is evolving into a platform for other distributed applications
  6. In Hadoop, data has to be persisted in HDFS between jobs. In Spark, it can be kept in memory.
  7. Spark can work with lots of storage types
  8. 24
  9. You can use Python libraries for machine learning, etc.
  10. It is possible to go from Hadoop to Spark. Consider the alternatives.
  11. TODO: our experience. Ted Dunning: Mahout is true and verified, and focused; MLlib is more of a loose collection. Frank Dai (Spark contributor): Mahout will concentrate on machine learning and have a rich set of algorithms, while MLlib will adopt only the most essential and mature algorithms.
  12. 59
  13. 62
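
Editor's note 6 contrasts persisting data between jobs with keeping it in memory. The sketch below imitates that difference in plain Python, without Hadoop or Spark: a chain of "jobs" either round-trips its state through a file on every iteration or keeps it in a list. The `step` function, the file name `state.json`, and the I/O counters are illustrative assumptions, not measurements of either system.

```python
import json
import os
import tempfile


def step(values):
    # One "job": move every value halfway toward 10.0.
    return [v + (10.0 - v) / 2 for v in values]


def run_with_disk(values, iterations, workdir):
    """Chained jobs: each job reads the previous output and writes its own."""
    io_ops = 0
    path = os.path.join(workdir, "state.json")
    with open(path, "w") as f:
        json.dump(values, f)
    io_ops += 1
    for _ in range(iterations):
        with open(path) as f:
            current = json.load(f)
        io_ops += 1
        with open(path, "w") as f:
            json.dump(step(current), f)
        io_ops += 1
    with open(path) as f:
        result = json.load(f)
    io_ops += 1
    return result, io_ops


def run_in_memory(values, iterations):
    """Same computation, but the working set never leaves memory."""
    current = list(values)
    for _ in range(iterations):
        current = step(current)
    return current, 0


with tempfile.TemporaryDirectory() as tmp:
    disk_result, disk_io = run_with_disk([0.0, 4.0], 5, tmp)
mem_result, mem_io = run_in_memory([0.0, 4.0], 5)
print(disk_result == mem_result, disk_io, mem_io)
```

Both variants compute the same answer; the difference is the number of storage round trips, which grows with the iteration count in the persisted version. That is the intuition behind the claim that iterative workloads benefit most from in-memory processing.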