SlideShare a Scribd company logo
1 of 34
A P A C H E S P A R K
S T A M P E D E C O N 2 0 1 4
S T E V E N B O R R E L L I
@stevendborrelli
A S T E R I S
A B O U T M E
F O U N D E R , A S T E R I S ( J A N 2 0 1 4 )
O R G A N I Z E R O F S T L M A C H I N E
L E A R N I N G A N D D O C K E R S T L
S Y S T E M S E N G I N E E R I N G , H P C , B I G
D A T A , & C L O U D
N E X T G E N E R A T I O N I N F R A S T R U C T U R E F O R D E V E L O P E R S
S P A R K I N F I V E S E C O N D S
is a replacement for
WHY DO WE NEED TO REPLACE
MAPREDUCE?
M A P R E D U C E I S A W E S O M E !
Allows us to process
enormous
amounts of data in parallel
M A P R E D U C E
M A P R E D U C E : S I M P L I F I E D D A T A P R O C E S S I N G O N L A R G E C L U S T E R S ( 2 0 0 4 )
J E F F R E Y D E A N A N D S A N J A Y G H E M A W A T
HITTING THE LIMITS OF HADOOP’s
MAPREDUCE
T H E P R O B L E M S W I T H
M A P R E D U C E
API: Low-Level & Complex
M A P R E D U C E I S S U E S
• Latency
• Execution time impacted by “stragglers”
• Lack of in-memory caching
• Intermediate steps persisted to disk
• No shared state
T H E P R O B L E M S W I T H M A P
R E D U C E
Not optimal for:
M A C H I N E L E A R N I N G G R A P H S
S T R E A M
P R O C E S S I N G
I M P R O V I N G M A P R E D U C E
A P A C H E T E Z
• Generalize to different workloads
• Sub-Second Latency
• Scalable and Fault Tolerant
• Easy to use API
N E X T M A P R E D U C E : G O A L S
T O P S P A R K F E A T U R E S
• Fast, fault-tolerant in-memory data structures (RDD)
• Compatibility with Hadoop ecosystem
• Rich, easy-to-use API supports Machine Learning,
Graphs and Streaming
• Interactive Shell
S P A R K S T A C K
S P A R K S T A C K
Integrated platform for disparate workloads
R E S I L I E N T D I S T R I B U T E D
D A T A S E T
• Immutable in-memory collections
• Fast recovery on failure
• Control caching and persistence to memory/disk
• Can partition to avoid shuffles
R D D L I N E A G E
lines = spark.textFile(“hdfs://errors/...”)
errors = lines.filter(_.startsWith(“ERROR”))
messages = errors.map(_.split(‘t’)(2))
L A N G U A G E S U P P O R T
• Spark is written in
• Uses Scala collections & Akka Actors
• Java, Python native support (Python support can lag),
lambda support in Java8/Spark 1.0
• R Bindings through SparkR
• Functional programming paradigm
R D D T R A N S F O R M A T I O N S
Transformations create a new RDD
map
filter
flatMap
sample
union
distinct
groupByKey
reduceByKey
sortByKey
join
cogroup
cartesian
Transformations are evaluated lazily.
R D D A C T I O N S
Actions Return a value
reduce
collect
count
countByKey
countByValue
countApprox
foreach
saveAsSequenceFile
saveAsTextFile
first
take(n)
takeSample
toArray
Invoking an Action will cause all previous Transformations to
be evaluated.
T A S K S C H E D U L E R
H T T P : / / A MP C A M P . B E R K E L E Y . E D U / W P - C O N T E N T / U P L O A D S / 2 0 1 2 / 0 6 / M A T E I - Z A H A R I A - P A R T - 1 - A M P - C A M P - 2 0 1 2 - S P A R K - I N T R O . P D F
• Runs general
task graphs
• Pipelines
functions
where possible
• Cache-aware
data reuse &
locality
• Partitioning-
aware to avoid
shuffles
SPARK ECOSYSTEM
S P A R K S T R E A M I N G
• Micro-Batch: Discretized Stream (DStream)
• ~1 sec latency
• Fault tolerant
• Shares Much of the same code as Batch
T O P 1 0 H A S H T A G S I N L A S T 1 0
M I N
/ Create the stream of tweets
val tweets = ssc.twitterStream(<username>, <password>)
/ Count the tags over a 10 minute window
val tagCounts = tweets.flatMap(statuts => getTags(status))
.countByValueAndWindow(Minutes(10), Second(1))
/ Sort the tags by counts
val sortedTags = tagCounts.map
{ case (tag, count) => (count, tag) }
(_.sortByKey(false))
/ Show the top 10 tags
sortedTags.foreach(showTopTags(10) _)
• 10x + speedup after data is cached
• In-memory materialized views
• Supports HiveQL, UDFs, etc.
• New Catalyst SQL engine coming in 1.0 includes
SchemaRDD to mix & match RDD/SQL in code.
• Implementation of PowerGraph, Pregel on Spark
• .5x the speed of GraphLab, but more fault-tolerant
• Machine Learning library, part of Spark core.
• Uses jblas & gfortran. Python supports NumPy.
• Growing number of algorithms:
SVM, ALS, Naive Bayes, K-Means, Linear & Logistic
Regression. (SVD/PCA, CART, L-BGFS coming in 1.x)
M L L I B
• MLI: Higher level library to support Tables
(dataframes), Linear Algebra, Optimizers.
• MLI: alpha software, limited activity
• Can use Scikit-Learn or SparkR to run models on
Spark.
M L L I B +
MOMENTUM
C O M M U N I T Y
0
50
100
150
200
250
Patches
MapReduce
Storm
Yarn
Spark
0
10000
20000
30000
40000
50000
Lines Added
MapReduce
Storm
Yarn
Spark
0
3500
7000
10500
14000
17500
Lines Removed
MapReduce
Storm
Yarn
Spark
S P A R K M O M E N T U M
• 1.0 is imminent (in 1.0 RC testing right now)
• Databricks investment $14MM Andreessen Horowitz
• Partnerships with DataStax, Cloudera, MapR,
PivotalHD
Q & A
T H A N K S !
steve@aster.is @stevendborrelli

More Related Content

What's hot

[Perforce] Tasks - The Holy Hand Grenade of Branching
[Perforce] Tasks - The Holy Hand Grenade of Branching[Perforce] Tasks - The Holy Hand Grenade of Branching
[Perforce] Tasks - The Holy Hand Grenade of BranchingPerforce
 
Extracting data from text documents using the regex
Extracting data from text documents using the regexExtracting data from text documents using the regex
Extracting data from text documents using the regexSteve Mylroie
 
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...Lucidworks
 
Gossip & Key Value Store
Gossip & Key Value StoreGossip & Key Value Store
Gossip & Key Value StoreSajeev P
 
Repl internals
Repl internalsRepl internals
Repl internalsMongoDB
 
Monitoring and Logging in Wonderland
Monitoring and Logging in WonderlandMonitoring and Logging in Wonderland
Monitoring and Logging in WonderlandPaul Seiffert
 
MOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PI
MOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PIMOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PI
MOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PIMonica Li
 

What's hot (9)

[Perforce] Tasks - The Holy Hand Grenade of Branching
[Perforce] Tasks - The Holy Hand Grenade of Branching[Perforce] Tasks - The Holy Hand Grenade of Branching
[Perforce] Tasks - The Holy Hand Grenade of Branching
 
Meeting20150109 v1
Meeting20150109 v1Meeting20150109 v1
Meeting20150109 v1
 
Extracting data from text documents using the regex
Extracting data from text documents using the regexExtracting data from text documents using the regex
Extracting data from text documents using the regex
 
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
 
Gossip & Key Value Store
Gossip & Key Value StoreGossip & Key Value Store
Gossip & Key Value Store
 
Repl internals
Repl internalsRepl internals
Repl internals
 
Monitoring and Logging in Wonderland
Monitoring and Logging in WonderlandMonitoring and Logging in Wonderland
Monitoring and Logging in Wonderland
 
Jenkins 20
Jenkins 20Jenkins 20
Jenkins 20
 
MOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PI
MOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PIMOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PI
MOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PI
 

Viewers also liked

LSE Enterprise Annual Report 2015
LSE Enterprise Annual Report 2015LSE Enterprise Annual Report 2015
LSE Enterprise Annual Report 2015LSE Enterprise
 
How Big Data Will Save Planet Earth - StampedeCon 2015
How Big Data Will Save Planet Earth - StampedeCon 2015How Big Data Will Save Planet Earth - StampedeCon 2015
How Big Data Will Save Planet Earth - StampedeCon 2015StampedeCon
 
Economics and finance of Europe: An intensive ten-week programme
Economics and finance of Europe: An intensive ten-week programmeEconomics and finance of Europe: An intensive ten-week programme
Economics and finance of Europe: An intensive ten-week programmeLSE Enterprise
 
Carol Varney's 1 Resume 2015
Carol Varney's 1 Resume 2015Carol Varney's 1 Resume 2015
Carol Varney's 1 Resume 2015Carol Varney
 
Estrategias pedagógicas para la enseñanza de una cultura
Estrategias pedagógicas para la enseñanza de una culturaEstrategias pedagógicas para la enseñanza de una cultura
Estrategias pedagógicas para la enseñanza de una culturaoscar daniel naranjo aristizabal
 
Curriculum Vitae_Samdoo JUNG
Curriculum Vitae_Samdoo JUNGCurriculum Vitae_Samdoo JUNG
Curriculum Vitae_Samdoo JUNGSamdoo Jung
 
PatrickJacks_Resume2
PatrickJacks_Resume2PatrickJacks_Resume2
PatrickJacks_Resume2Patrick Jacks
 
Cloud-Friendly Hadoop and Hive - StampedeCon 2013
Cloud-Friendly Hadoop and Hive - StampedeCon 2013Cloud-Friendly Hadoop and Hive - StampedeCon 2013
Cloud-Friendly Hadoop and Hive - StampedeCon 2013StampedeCon
 
Listening for Insights: The Power of Social Media Listening - StampedeCon 2012
Listening for Insights: The Power of Social Media Listening - StampedeCon 2012Listening for Insights: The Power of Social Media Listening - StampedeCon 2012
Listening for Insights: The Power of Social Media Listening - StampedeCon 2012StampedeCon
 
Lifting the hood on spark streaming - StampedeCon 2015
Lifting the hood on spark streaming - StampedeCon 2015Lifting the hood on spark streaming - StampedeCon 2015
Lifting the hood on spark streaming - StampedeCon 2015StampedeCon
 
4 types of train accidents for which victims can claim compensation
4 types of train accidents for which victims can claim compensation4 types of train accidents for which victims can claim compensation
4 types of train accidents for which victims can claim compensationfrekhtmanassociates
 
From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...
From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...
From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...StampedeCon
 
2 kien pham cv en vn with project experience
2 kien pham cv en  vn with project experience2 kien pham cv en  vn with project experience
2 kien pham cv en vn with project experienceKien Pham
 
Estrous synchronization
Estrous synchronizationEstrous synchronization
Estrous synchronizationArmia Naguib
 

Viewers also liked (16)

Cattalugue2016
Cattalugue2016Cattalugue2016
Cattalugue2016
 
LSE Enterprise Annual Report 2015
LSE Enterprise Annual Report 2015LSE Enterprise Annual Report 2015
LSE Enterprise Annual Report 2015
 
How Big Data Will Save Planet Earth - StampedeCon 2015
How Big Data Will Save Planet Earth - StampedeCon 2015How Big Data Will Save Planet Earth - StampedeCon 2015
How Big Data Will Save Planet Earth - StampedeCon 2015
 
Economics and finance of Europe: An intensive ten-week programme
Economics and finance of Europe: An intensive ten-week programmeEconomics and finance of Europe: An intensive ten-week programme
Economics and finance of Europe: An intensive ten-week programme
 
Carol Varney's 1 Resume 2015
Carol Varney's 1 Resume 2015Carol Varney's 1 Resume 2015
Carol Varney's 1 Resume 2015
 
Estrategias pedagógicas para la enseñanza de una cultura
Estrategias pedagógicas para la enseñanza de una culturaEstrategias pedagógicas para la enseñanza de una cultura
Estrategias pedagógicas para la enseñanza de una cultura
 
CORRUPTION IN INDIA
CORRUPTION IN INDIACORRUPTION IN INDIA
CORRUPTION IN INDIA
 
Curriculum Vitae_Samdoo JUNG
Curriculum Vitae_Samdoo JUNGCurriculum Vitae_Samdoo JUNG
Curriculum Vitae_Samdoo JUNG
 
PatrickJacks_Resume2
PatrickJacks_Resume2PatrickJacks_Resume2
PatrickJacks_Resume2
 
Cloud-Friendly Hadoop and Hive - StampedeCon 2013
Cloud-Friendly Hadoop and Hive - StampedeCon 2013Cloud-Friendly Hadoop and Hive - StampedeCon 2013
Cloud-Friendly Hadoop and Hive - StampedeCon 2013
 
Listening for Insights: The Power of Social Media Listening - StampedeCon 2012
Listening for Insights: The Power of Social Media Listening - StampedeCon 2012Listening for Insights: The Power of Social Media Listening - StampedeCon 2012
Listening for Insights: The Power of Social Media Listening - StampedeCon 2012
 
Lifting the hood on spark streaming - StampedeCon 2015
Lifting the hood on spark streaming - StampedeCon 2015Lifting the hood on spark streaming - StampedeCon 2015
Lifting the hood on spark streaming - StampedeCon 2015
 
4 types of train accidents for which victims can claim compensation
4 types of train accidents for which victims can claim compensation4 types of train accidents for which victims can claim compensation
4 types of train accidents for which victims can claim compensation
 
From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...
From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...
From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...
 
2 kien pham cv en vn with project experience
2 kien pham cv en  vn with project experience2 kien pham cv en  vn with project experience
2 kien pham cv en vn with project experience
 
Estrous synchronization
Estrous synchronizationEstrous synchronization
Estrous synchronization
 

Similar to Apache Spark: the next big thing? - StampedeCon 2014

php[world] 2016 - You Don’t Need Node.js - Async Programming in PHP
php[world] 2016 - You Don’t Need Node.js - Async Programming in PHPphp[world] 2016 - You Don’t Need Node.js - Async Programming in PHP
php[world] 2016 - You Don’t Need Node.js - Async Programming in PHPAdam Englander
 
Zend con 2016 - Asynchronous Prorgamming in PHP
Zend con 2016 - Asynchronous Prorgamming in PHPZend con 2016 - Asynchronous Prorgamming in PHP
Zend con 2016 - Asynchronous Prorgamming in PHPAdam Englander
 
MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...
MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...
MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...MongoDB
 
Consistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your ChoiceConsistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your ChoiceAndrea Giuliano
 
Meteor - not just for rockstars
Meteor - not just for rockstarsMeteor - not just for rockstars
Meteor - not just for rockstarsStephan Hochhaus
 
Toying with spark
Toying with sparkToying with spark
Toying with sparkRaymond Tay
 
Microservices With Spring Boot and Spring Cloud Netflix
Microservices With Spring Boot and Spring Cloud NetflixMicroservices With Spring Boot and Spring Cloud Netflix
Microservices With Spring Boot and Spring Cloud NetflixKrzysztof Sobkowiak
 
Angular server side rendering with NodeJS - In Pursuit Of Speed
Angular server side rendering with NodeJS - In Pursuit Of SpeedAngular server side rendering with NodeJS - In Pursuit Of Speed
Angular server side rendering with NodeJS - In Pursuit Of SpeedIlia Idakiev
 
Web enabling your survey business
Web enabling your survey businessWeb enabling your survey business
Web enabling your survey businessRudy Stricklan
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersMarina Kolpakova
 
Using MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content RepositoryUsing MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content RepositoryNuxeo
 
Scaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowScaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowDatabricks
 
Model Serving via Pulsar Functions
Model Serving via Pulsar FunctionsModel Serving via Pulsar Functions
Model Serving via Pulsar FunctionsArun Kejariwal
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Holden Karau
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Holden Karau
 
Strangler Pattern in practice @PHPers Day 2019
Strangler Pattern in practice @PHPers Day 2019Strangler Pattern in practice @PHPers Day 2019
Strangler Pattern in practice @PHPers Day 2019Michał Kurzeja
 

Similar to Apache Spark: the next big thing? - StampedeCon 2014 (20)

php[world] 2016 - You Don’t Need Node.js - Async Programming in PHP
php[world] 2016 - You Don’t Need Node.js - Async Programming in PHPphp[world] 2016 - You Don’t Need Node.js - Async Programming in PHP
php[world] 2016 - You Don’t Need Node.js - Async Programming in PHP
 
Zend con 2016 - Asynchronous Prorgamming in PHP
Zend con 2016 - Asynchronous Prorgamming in PHPZend con 2016 - Asynchronous Prorgamming in PHP
Zend con 2016 - Asynchronous Prorgamming in PHP
 
MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...
MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...
MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...
 
Consistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your ChoiceConsistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your Choice
 
Meteor - not just for rockstars
Meteor - not just for rockstarsMeteor - not just for rockstars
Meteor - not just for rockstars
 
Toying with spark
Toying with sparkToying with spark
Toying with spark
 
Microservices With Spring Boot and Spring Cloud Netflix
Microservices With Spring Boot and Spring Cloud NetflixMicroservices With Spring Boot and Spring Cloud Netflix
Microservices With Spring Boot and Spring Cloud Netflix
 
Angular server side rendering with NodeJS - In Pursuit Of Speed
Angular server side rendering with NodeJS - In Pursuit Of SpeedAngular server side rendering with NodeJS - In Pursuit Of Speed
Angular server side rendering with NodeJS - In Pursuit Of Speed
 
Web enabling your survey business
Web enabling your survey businessWeb enabling your survey business
Web enabling your survey business
 
Witchcraft
WitchcraftWitchcraft
Witchcraft
 
Apache spark
Apache sparkApache spark
Apache spark
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
 
Using MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content RepositoryUsing MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content Repository
 
Scaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowScaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlow
 
Model Serving via Pulsar Functions
Model Serving via Pulsar FunctionsModel Serving via Pulsar Functions
Model Serving via Pulsar Functions
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
 
Everybody Lies
Everybody LiesEverybody Lies
Everybody Lies
 
GraphQL, l'avenir du REST ?
GraphQL, l'avenir du REST ?GraphQL, l'avenir du REST ?
GraphQL, l'avenir du REST ?
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
 
Strangler Pattern in practice @PHPers Day 2019
Strangler Pattern in practice @PHPers Day 2019Strangler Pattern in practice @PHPers Day 2019
Strangler Pattern in practice @PHPers Day 2019
 

More from StampedeCon

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...StampedeCon
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017StampedeCon
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017StampedeCon
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...StampedeCon
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017StampedeCon
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017StampedeCon
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017StampedeCon
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...StampedeCon
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...StampedeCon
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017StampedeCon
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017StampedeCon
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017StampedeCon
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017StampedeCon
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017StampedeCon
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017StampedeCon
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...StampedeCon
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...StampedeCon
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016StampedeCon
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016StampedeCon
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016StampedeCon
 

More from StampedeCon (20)

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
 

Recently uploaded

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Apache Spark: the next big thing? - StampedeCon 2014

  • 1. A P A C H E S P A R K S T A M P E D E C O N 2 0 1 4 S T E V E N B O R R E L L I @stevendborrelli A S T E R I S
  • 2. A B O U T M E F O U N D E R , A S T E R I S ( J A N 2 0 1 4 ) O R G A N I Z E R O F S T L M A C H I N E L E A R N I N G A N D D O C K E R S T L S Y S T E M S E N G I N E E R I N G , H P C , B I G D A T A , & C L O U D N E X T G E N E R A T I O N I N F R A S T R U C T U R E F O R D E V E L O P E R S
  • 3. S P A R K I N F I V E S E C O N D S is a replacement for
  • 4. WHY DO WE NEED TO REPLACE MAPREDUCE?
  • 5. M A P R E D U C E I S A W E S O M E ! Allows us to process enormous amounts of data in parallel
  • 6. M A P R E D U C E M A P R E D U C E : S I M P L I F I E D D A T A P R O C E S S I N G O N L A R G E C L U S T E R S ( 2 0 0 4 ) J E F F R E Y D E A N A N D S A N J A Y G H E M A W A T
  • 7. HITTING THE LIMITS OF HADOOP’s MAPREDUCE
  • 8. T H E P R O B L E M S W I T H M A P R E D U C E API: Low-Level & Complex
  • 9. M A P R E D U C E I S S U E S • Latency • Execution time impacted by “stragglers” • Lack of in-memory caching • Intermediate steps persisted to disk • No shared state
  • 10. T H E P R O B L E M S W I T H M A P R E D U C E Not optimal for: M A C H I N E L E A R N I N G G R A P H S S T R E A M P R O C E S S I N G
  • 11. I M P R O V I N G M A P R E D U C E A P A C H E T E Z
  • 12. • Generalize to different workloads • Sub-Second Latency • Scalable and Fault Tolerant • Easy to use API N E X T M A P R E D U C E : G O A L S
  • 13. T O P S P A R K F E A T U R E S • Fast, fault-tolerant in-memory data structures (RDD) • Compatibility with Hadoop ecosystem • Rich, easy-to-use API supports Machine Learning, Graphs and Streaming • Interactive Shell
  • 14. S P A R K S T A C K
  • 15. S P A R K S T A C K Integrated platform for disparate workloads
  • 16. R E S I L I E N T D I S T R I B U T E D D A T A S E T • Immutable in-memory collections • Fast recovery on failure • Control caching and persistence to memory/disk • Can partition to avoid shuffles
  • 17. R D D L I N E A G E lines = spark.textFile(“hdfs://errors/...”) errors = lines.filter(_.startsWith(“ERROR”)) messages = errors.map(_.split(‘t’)(2))
  • 18. L A N G U A G E S U P P O R T • Spark is written in • Uses Scala collections & Akka Actors • Java, Python native support (Python support can lag), lambda support in Java8/Spark 1.0 • R Bindings through SparkR • Functional programming paradigm
  • 19. R D D T R A N S F O R M A T I O N S Transformations create a new RDD map filter flatMap sample union distinct groupByKey reduceByKey sortByKey join cogroup cartesian Transformations are evaluated lazily.
  • 20. R D D A C T I O N S Actions Return a value reduce collect count countByKey countByValue countApprox foreach saveAsSequenceFile saveAsTextFile first take(n) takeSample toArray Invoking an Action will cause all previous Transformations to be evaluated.
  • 21. T A S K S C H E D U L E R H T T P : / / A MP C A M P . B E R K E L E Y . E D U / W P - C O N T E N T / U P L O A D S / 2 0 1 2 / 0 6 / M A T E I - Z A H A R I A - P A R T - 1 - A M P - C A M P - 2 0 1 2 - S P A R K - I N T R O . P D F • Runs general task graphs • Pipelines functions where possible • Cache-aware data reuse & locality • Partitioning- aware to avoid shuffles
  • 23. S P A R K S T R E A M I N G • Micro-Batch: Discretized Stream (DStream) • ~1 sec latency • Fault tolerant • Shares Much of the same code as Batch
  • 24. T O P 1 0 H A S H T A G S I N L A S T 1 0 M I N / Create the stream of tweets val tweets = ssc.twitterStream(<username>, <password>) / Count the tags over a 10 minute window val tagCounts = tweets.flatMap(statuts => getTags(status)) .countByValueAndWindow(Minutes(10), Second(1)) / Sort the tags by counts val sortedTags = tagCounts.map { case (tag, count) => (count, tag) } (_.sortByKey(false)) / Show the top 10 tags sortedTags.foreach(showTopTags(10) _)
  • 25.
  • 26. • 10x + speedup after data is cached • In-memory materialized views • Supports HiveQL, UDFs, etc. • New Catalyst SQL engine coming in 1.0 includes SchemaRDD to mix & match RDD/SQL in code.
  • 27. • Implementation of PowerGraph, Pregel on Spark • .5x the speed of GraphLab, but more fault-tolerant
  • 28. • Machine Learning library, part of Spark core. • Uses jblas & gfortran. Python supports NumPy. • Growing number of algorithms: SVM, ALS, Naive Bayes, K-Means, Linear & Logistic Regression. (SVD/PCA, CART, L-BGFS coming in 1.x) M L L I B
  • 29. • MLI: Higher level library to support Tables (dataframes), Linear Algebra, Optimizers. • MLI: alpha software, limited activity • Can use Scikit-Learn or SparkR to run models on Spark. M L L I B +
  • 31. C O M M U N I T Y 0 50 100 150 200 250 Patches MapReduce Storm Yarn Spark 0 10000 20000 30000 40000 50000 Lines Added MapReduce Storm Yarn Spark 0 3500 7000 10500 14000 17500 Lines Removed MapReduce Storm Yarn Spark
  • 32. S P A R K M O M E N T U M • 1.0 is imminent (in 1.0 RC testing right now) • Databricks investment $14MM Andreessen Horowitz • Partnerships with DataStax, Cloudera, MapR, PivotalHD
  • 33. Q & A
  • 34. T H A N K S ! steve@aster.is @stevendborrelli