SlideShare a Scribd company logo
1 of 37
Download to read offline
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Apache Spark : Fast and Easy Data
Processing
Sujee Maniyam
Founder / Principal
Elephant Scale LLC
sujee@elephantscale.com
http://elephantscale.com
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Outline
r  Background
r  Spark Architecture
r  Demo
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark
r  Fast & Expressive Cluster computing engine
r  Compatible with Hadoop
r  Came out of Berkeley AMP Lab
r  Now Apache project
r  Version 1.2 just released (Dec 2014)
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Timeline Hadoop & Spark
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Hypo-meter J
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Job Trends
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Comparison With Hadoop
Hadoop Spark
Distributed Storage + Distributed
Compute
Distributed Compute Only
MapReduce framework Generalized computation
Usually data on disk (HDFS) On disk / in memory
Not ideal for iterative work Great at Iterative workloads
(machine learning ..etc)
Batch process - Upto 2x -10x faster for data on disk
- Upto 100x faster for data in memory
Compact code
Java, Python, Scala supported
Shell for ad-hoc exploration
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Hadoop + Yarn : Universal OS for
Cluster Computing
HDFS
YARN
Batch
(mapreduce)
Streaming
(storm, S4)
In-memory
(spark)
Storage
Cluster
Management
Applications
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Vs Hadoop
r  Spark is ‘easier’ than Hadoop
r  ‘friendlier’ for data scientists / analysts
r  Interactive shell
r fast development cycles
r adhoc exploration
r  API supports multiple languages
r Java, Scala, Python
r  Great for small (Gigs) to medium (100s of Gigs)
data
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Vs. Hadoop
Fast!
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Is Spark Replacing Hadoop?
r  Right now, Spark runs on Hadoop / YARN
r Complimentary
r  Can be seen as generic MapReduce
r  Spark is really great if data fits in memory (few
hundred gigs),
r  Spark can be used as compute platform with
various storage types (see next slide)
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark & Pluggable Storage
Spark
(compute engine)
HDFS Amazon S3 Cassandra ???
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Hadoop & Spark Future ???
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Outline
r  Background
r  à Spark Architecture
r  Demo
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Eco-System
Spark Core
Spark
SQL
Spark
Streaming
ML lib
Schema / sql Real Time
Machine
Learning
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Core
r  Distributed compute engine
r  Sort / shuffle algorithms
r  Handles node failures -> re-computes missing
pieces
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark SQL
r  Adhoc / interactive analytics
r  ETL workflow
r  Reporting
r  Natively understands
r JSON : {“name”: “mark”, “age”: 40}
r Parquet : on disk columnar format, heavily
compressed
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark SQL Quick Example
{"name":"mark", "age":50, "sex":"M"},
{"name":"mary", "age":45, "sex":"F"},
{"name":"brian", "age":15, "sex":"M"},
{"name":"nancy", "age":17, "sex":"F"}
// … setup …
val teenagers = sqlContext.sql("SELECT name
FROM people WHERE age >= 13 AND age <= 19")
è  [brian, nancy]
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Streaming
r  Process data streams in real time, in micro-
batches
r  Low latency, high throughput (1000s events /
sec)
r  Stock ticks / sensor data (connected devices /
IoT – Internet of Things)
Streaming
Sources
Storage
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Machine Learning (ML Lib)
r  Machine learning at scale
r  Out of the box ML capabilities !
r  Java / Scala / Python language support
r  Lots of common algorithms are supported
r  Classification / Regressions
r  Linear models (linear R, logistic regression, SVM)
r  Decision trees
r  Collaborative filtering (recommendations)
r  K-Means clustering
r  More to come
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Cluster Deployment
1)  Standalone
2)  Yarn
3)  Mesos
Client
Node
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Cluster Architecture
r  Multiple ‘applications’ can run at the same time
r  Driver (or ‘main’) launches an application
r  Each application gets its own ‘executor’
r  Isolated (runs in different JVMs)
r  Also means data can not be shared across applications
r  Cluster Managers:
r  multiple cluster managers are supported
r  1) Standalone : simple to setup
r  2) YARN : on top of Hadoop
r  3) Mesos : General cluster manager (AMP lab)
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Data Model : RDD
r  Resilient Distributed Dataset (RDD)
r  Can live in
r Memory (best case scenario)
r Or on disk (FS, HDFS, S3 …etc)
r  Each RDD is split into multiple partitions
r  Partitions may live on different nodes
r  Partitions can be computed in parallel on
different nodes
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Partitions Explained
1G file
64 M 64 M 64 M 64 M
Task Task Task Task
Result
parallelism
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
RDD : Loading
r  Use Spark context to load RDDs from disk /
external storage
val sc = new SparkContext(…)
val f = sc.textFile(“/data/input1.txt”) // single file
sc.textFile(“/data/”) // load all files under dir
sc.textFile(“/data/*.log”) // wild card matching
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
RDD Operations
r  Two kinds of operations on RDDs
r 1) Transformations
r Create a new RDD from existing ones (e.g. Map)
r 2) Actions
r E.g. Returns the results to clients (e.g. Reduce)
r  Transformations are lazy.. Actions force
transformations
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Transformations / Actions
RDD 1 RDD 2
Client
RDD 3
Transformation 1
(map)
Action1
(collect)
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
RDD Transformations
Transformation Description Example
filter Filters through each record
(aka grep)
f.filter( line =>
line.contains(“ERROR”))
union Merges two RDDs rdd1.union(rdd2)
…see docs …
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
RDD Actions
Action Description Example
count() Counts all records in an rdd f.count()
first() Extract the first record f.first ()
take(n) Take first N lines f.take(10)
collect() Gathers all records for RDD.
All data has to fit in memory of
ONE machine (don’t use for big
data sets)
f.collect()
…. See
documentation
..
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
RDD : Saving
r  saveAsTextFile () and saveAsSequenceFile()
f.saveAsTextFile(“/output/directory”) // a directory
r  Output usually is a directory
r RDDs will be saved as multiple files in the dir
r Each partition à one output file
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Caching of RDDs
r  RDDs can be loaded from disk and computed
r Hadoop mapreduce model
r  Also RDDs can be cached in memory
r  Subsequent operations are much faster
f.persist() // on disk or memory
f.cache() // memory only
r  In memory RDDs are
great for iterative workloads
r Machine learning algorithms
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Demo Time !
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Demo : Spark-shell
r  Invoke spark shell
r  Load a data set
r  Do basic operations (count / filter)
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Demo: RDD Caching
r  From Spark shell
r  Load an RDD
r  Demonstrate the difference between cached and
non-cached
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Demo : Map Reduce
r  Quick Word count
val input = sc.textFile(“…”)
val counts = input.flatMap(
line => line.split(“ “)).
map(word => (word, 1)).
reduceByKey(_+_)
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Thanks !
Sujee Maniyam
sujee@elephantscale.com
http://elephantscale.com
Expert consulting & training in Big Data
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Credits
r  http://spark.apache.org/
r  http://www.strategictechplanning.com
r  Tuningpp.com
r  Kidzworld.com

More Related Content

What's hot

Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and CassandraNatalino Busa
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data prajods
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraPatrick McFadin
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEDataWorks Summit/Hadoop Summit
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Databricks
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetupamarsri
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentationargonauts007
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleHelena Edelson
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Databricks
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkMammoth Data
 
The hadoop 2.0 ecosystem and yarn
The hadoop 2.0 ecosystem and yarnThe hadoop 2.0 ecosystem and yarn
The hadoop 2.0 ecosystem and yarnMichael Joseph
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidTony Ng
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016 Hiromitsu Komatsu
 

What's hot (20)

Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetup
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
The hadoop 2.0 ecosystem and yarn
The hadoop 2.0 ecosystem and yarnThe hadoop 2.0 ecosystem and yarn
The hadoop 2.0 ecosystem and yarn
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016
 

Viewers also liked

게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 GamingAmazon Web Services Korea
 
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 IntroAmazon Web Services Korea
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationmattlieber
 
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015Amazon Web Services Korea
 
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016Amazon Web Services Korea
 
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)Amazon Web Services Korea
 
Creando su primera aplicación de Big Data en AWS
Creando su primera aplicación de Big Data en AWSCreando su primera aplicación de Big Data en AWS
Creando su primera aplicación de Big Data en AWSAmazon Web Services LATAM
 

Viewers also liked (8)

게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
 
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluation
 
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015
 
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
 
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)
 
Zookeeper 소개
Zookeeper 소개Zookeeper 소개
Zookeeper 소개
 
Creando su primera aplicación de Big Data en AWS
Creando su primera aplicación de Big Data en AWSCreando su primera aplicación de Big Data en AWS
Creando su primera aplicación de Big Data en AWS
 

Similar to Spark Architecture and Demo

Insight on "From Hadoop to Spark" by Mark Kerzner
Insight on "From Hadoop to Spark" by Mark KerznerInsight on "From Hadoop to Spark" by Mark Kerzner
Insight on "From Hadoop to Spark" by Mark KerznerSynerzip
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonVitthal Gogate
 
Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724sdeeg
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingAll Things Open
 
Survey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and AnalyticsSurvey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and AnalyticsYannick Pouliot
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdfMaheshPandit16
 
Building a Stock Prediction system with Machine Learning using Geode, SpringX...
Building a Stock Prediction system with Machine Learning using Geode, SpringX...Building a Stock Prediction system with Machine Learning using Geode, SpringX...
Building a Stock Prediction system with Machine Learning using Geode, SpringX...William Markito Oliveira
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkNicola Ferraro
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataTaro L. Saito
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogsprateek kumar
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8Janu Jahnavi
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8Janu Jahnavi
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark Hubert Fan Chiang
 

Similar to Spark Architecture and Demo (20)

Insight on "From Hadoop to Spark" by Mark Kerzner
Insight on "From Hadoop to Spark" by Mark KerznerInsight on "From Hadoop to Spark" by Mark Kerzner
Insight on "From Hadoop to Spark" by Mark Kerzner
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
Spark 101
Spark 101Spark 101
Spark 101
 
Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
 
Survey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and AnalyticsSurvey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and Analytics
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Building a Stock Prediction system with Machine Learning using Geode, SpringX...
Building a Stock Prediction system with Machine Learning using Geode, SpringX...Building a Stock Prediction system with Machine Learning using Geode, SpringX...
Building a Stock Prediction system with Machine Learning using Geode, SpringX...
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache Spark
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogs
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 

More from Sujee Maniyam

Reference architecture for Internet of Things
Reference architecture for Internet of ThingsReference architecture for Internet of Things
Reference architecture for Internet of ThingsSujee Maniyam
 
Building secure NoSQL applications nosqlnow_conf_2014
Building secure NoSQL applications nosqlnow_conf_2014Building secure NoSQL applications nosqlnow_conf_2014
Building secure NoSQL applications nosqlnow_conf_2014Sujee Maniyam
 
Hadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confHadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confSujee Maniyam
 
Launching your career in Big Data
Launching your career in Big DataLaunching your career in Big Data
Launching your career in Big DataSujee Maniyam
 
Hadoop security landscape
Hadoop security landscapeHadoop security landscape
Hadoop security landscapeSujee Maniyam
 
Iphone client-server app with Rails backend (v3)
Iphone client-server app with Rails backend (v3)Iphone client-server app with Rails backend (v3)
Iphone client-server app with Rails backend (v3)Sujee Maniyam
 

More from Sujee Maniyam (6)

Reference architecture for Internet of Things
Reference architecture for Internet of ThingsReference architecture for Internet of Things
Reference architecture for Internet of Things
 
Building secure NoSQL applications nosqlnow_conf_2014
Building secure NoSQL applications nosqlnow_conf_2014Building secure NoSQL applications nosqlnow_conf_2014
Building secure NoSQL applications nosqlnow_conf_2014
 
Hadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confHadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA conf
 
Launching your career in Big Data
Launching your career in Big DataLaunching your career in Big Data
Launching your career in Big Data
 
Hadoop security landscape
Hadoop security landscapeHadoop security landscape
Hadoop security landscape
 
Iphone client-server app with Rails backend (v3)
Iphone client-server app with Rails backend (v3)Iphone client-server app with Rails backend (v3)
Iphone client-server app with Rails backend (v3)
 

Recently uploaded

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 

Spark Architecture and Demo

  • 1. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Apache Spark : Fast and Easy Data Processing Sujee Maniyam Founder / Principal Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com
  • 2. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Outline r  Background r  Spark Architecture r  Demo
  • 3. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark r  Fast & Expressive Cluster computing engine r  Compatible with Hadoop r  Came out of Berkeley AMP Lab r  Now Apache project r  Version 1.2 just released (Dec 2014)
  • 4. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Timeline Hadoop & Spark
  • 5. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Hypo-meter J
  • 6. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Job Trends
  • 7. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Comparison With Hadoop Hadoop Spark Distributed Storage + Distributed Compute Distributed Compute Only MapReduce framework Generalized computation Usually data on disk (HDFS) On disk / in memory Not ideal for iterative work Great at Iterative workloads (machine learning ..etc) Batch process - Upto 2x -10x faster for data on disk - Upto 100x faster for data in memory Compact code Java, Python, Scala supported Shell for ad-hoc exploration
  • 8. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Hadoop + Yarn : Universal OS for Cluster Computing HDFS YARN Batch (mapreduce) Streaming (storm, S4) In-memory (spark) Storage Cluster Management Applications
  • 9. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Vs Hadoop r  Spark is ‘easier’ than Hadoop r  ‘friendlier’ for data scientists / analysts r  Interactive shell r fast development cycles r adhoc exploration r  API supports multiple languages r Java, Scala, Python r  Great for small (Gigs) to medium (100s of Gigs) data
  • 10. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Vs. Hadoop Fast!
  • 11. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Is Spark Replacing Hadoop? r  Right now, Spark runs on Hadoop / YARN r Complimentary r  Can be seen as generic MapReduce r  Spark is really great if data fits in memory (few hundred gigs), r  Spark can be used as compute platform with various storage types (see next slide)
  • 12. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark & Pluggable Storage Spark (compute engine) HDFS Amazon S3 Cassandra ???
  • 13. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Hadoop & Spark Future ???
  • 14. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Outline r  Background r  à Spark Architecture r  Demo
  • 15. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Eco-System Spark Core Spark SQL Spark Streaming ML lib Schema / sql Real Time Machine Learning
  • 16. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Core r  Distributed compute engine r  Sort / shuffle algorithms r  Handles node failures -> re-computes missing pieces
  • 17. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark SQL r  Adhoc / interactive analytics r  ETL workflow r  Reporting r  Natively understands r JSON : {“name”: “mark”, “age”: 40} r Parquet : on disk columnar format, heavily compressed
  • 18. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark SQL Quick Example {"name":"mark", "age":50, "sex":"M"}, {"name":"mary", "age":45, "sex":"F"}, {"name":"brian", "age":15, "sex":"M"}, {"name":"nancy", "age":17, "sex":"F"} // … setup … val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") è  [brian, nancy]
  • 19. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Streaming r  Process data streams in real time, in micro- batches r  Low latency, high throughput (1000s events / sec) r  Stock ticks / sensor data (connected devices / IoT – Internet of Things) Streaming Sources Storage
  • 20. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Machine Learning (ML Lib) r  Machine learning at scale r  Out of the box ML capabilities ! r  Java / Scala / Python language support r  Lots of common algorithms are supported r  Classification / Regressions r  Linear models (linear R, logistic regression, SVM) r  Decision trees r  Collaborative filtering (recommendations) r  K-Means clustering r  More to come
  • 21. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Cluster Deployment 1)  Standalone 2)  Yarn 3)  Mesos Client Node
  • 22. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Cluster Architecture r  Multiple ‘applications’ can run at the same time r  Driver (or ‘main’) launches an application r  Each application gets its own ‘executor’ r  Isolated (runs in different JVMs) r  Also means data can not be shared across applications r  Cluster Managers: r  multiple cluster managers are supported r  1) Standalone : simple to setup r  2) YARN : on top of Hadoop r  3) Mesos : General cluster manager (AMP lab)
  • 23. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Data Model : RDD r  Resilient Distributed Dataset (RDD) r  Can live in r Memory (best case scenario) r Or on disk (FS, HDFS, S3 …etc) r  Each RDD is split into multiple partitions r  Partitions may live on different nodes r  Partitions can be computed in parallel on different nodes
  • 24. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Partitions Explained 1G file 64 M 64 M 64 M 64 M Task Task Task Task Result parallelism
  • 25. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD : Loading r  Use Spark context to load RDDs from disk / external storage val sc = new SparkContext(…) val f = sc.textFile(“/data/input1.txt”) // single file sc.textFile(“/data/”) // load all files under dir sc.textFile(“/data/*.log”) // wild card matching
  • 26. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD Operations r  Two kinds of operations on RDDs r 1) Transformations r Create a new RDD from existing ones (e.g. Map) r 2) Actions r E.g. Returns the results to clients (e.g. Reduce) r  Transformations are lazy.. Actions force transformations
  • 27. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Transformations / Actions RDD 1 RDD 2 Client RDD 3 Transformation 1 (map) Action1 (collect)
  • 28. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD Transformations Transformation Description Example filter Filters through each record (aka grep) f.filter( line => line.contains(“ERROR”)) union Merges two RDDs rdd1.union(rdd2) …see docs …
  • 29. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD Actions Action Description Example count() Counts all records in an rdd f.count() first() Extract the first record f.first () take(n) Take first N lines f.take(10) collect() Gathers all records for RDD. All data has to fit in memory of ONE machine (don’t use for big data sets) f.collect() …. See documentation ..
  • 30. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD : Saving r  saveAsTextFile () and saveAsSequenceFile() f.saveAsTextFile(“/output/directory”) // a directory r  Output usually is a directory r RDDs will be saved as multiple files in the dir r Each partition à one output file
  • 31. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Caching of RDDs r  RDDs can be loaded from disk and computed r Hadoop mapreduce model r  Also RDDs can be cached in memory r  Subsequent operations are much faster f.persist() // on disk or memory f.cache() // memory only r  In memory RDDs are great for iterative workloads r Machine learning algorithms
  • 32. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Demo Time !
  • 33. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Demo : Spark-shell r  Invoke spark shell r  Load a data set r  Do basic operations (count / filter)
  • 34. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Demo: RDD Caching r  From Spark shell r  Load an RDD r  Demonstrate the difference between cached and non-cached
  • 35. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Demo : Map Reduce r  Quick Word count val input = sc.textFile(“…”) val counts = input.flatMap( line => line.split(“ “)). map(word => (word, 1)). reduceByKey(_+_)
  • 36. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Thanks ! Sujee Maniyam sujee@elephantscale.com http://elephantscale.com Expert consulting & training in Big Data
  • 37. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Credits r  http://spark.apache.org/ r  http://www.strategictechplanning.com r  Tuningpp.com r  Kidzworld.com