APACHE SPARK
STAMPEDECON 2014
STEVEN BORRELLI
@stevendborrelli
ASTERIS
ABOUT ME
FOUNDER, ASTERIS (JAN 2014)
ORGANIZER OF STL MACHINE LEARNING AND DOCKER STL
SYSTEMS ENGINEERING, HPC, BIG DATA, & CLOUD
NEXT GENERATION INFRASTRUCTURE FOR DEVELOPERS
SPARK IN FIVE SECONDS
Spark is a replacement for MapReduce
WHY DO WE NEED TO REPLACE
MAPREDUCE?
MAPREDUCE IS AWESOME!
Allows us to process enormous amounts of data in parallel
MAPREDUCE
MAPREDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS (2004)
JEFFREY DEAN AND SANJAY GHEMAWAT
HITTING THE LIMITS OF HADOOP'S MAPREDUCE
THE PROBLEMS WITH MAPREDUCE
API: Low-Level & Complex
MAPREDUCE ISSUES
• Latency
• Execution time impacted by “stragglers”
• Lack of in-memory caching
• Intermediate steps persisted to disk
• No shared state
THE PROBLEMS WITH MAPREDUCE
Not optimal for:
• Machine Learning
• Graphs
• Stream Processing
IMPROVING MAPREDUCE
APACHE TEZ
NEXT MAPREDUCE: GOALS
• Generalize to different workloads
• Sub-second latency
• Scalable and fault tolerant
• Easy-to-use API
TOP SPARK FEATURES
• Fast, fault-tolerant in-memory data structures (RDD)
• Compatibility with Hadoop ecosystem
• Rich, easy-to-use API supports Machine Learning,
Graphs and Streaming
• Interactive Shell
SPARK STACK
Integrated platform for disparate workloads
RESILIENT DISTRIBUTED DATASET
• Immutable in-memory collections
• Fast recovery on failure
• Control caching and persistence to memory/disk
• Can partition to avoid shuffles
RDD LINEAGE
// Load the log, keep the error lines, then extract the third tab-separated field
val lines = spark.textFile("hdfs://errors/...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
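The lineage idea behind these three lines can be sketched without Spark: each dataset remembers its parent and the function that derives it, so a lost result can be rebuilt by replaying those steps from the source. This is a minimal sketch in plain Scala collections, not Spark's API; `Dataset`, `Source`, `Derived`, and the sample log lines are hypothetical names made up for illustration.

```scala
// Plain-Scala sketch of lineage (not Spark's API): a dataset is either a
// source or a parent plus the function that derives it from that parent.
object LineageSketch {
  sealed trait Dataset[A] {
    // Recompute the contents by replaying the recorded lineage.
    def compute(): Seq[A]
    def filter(p: A => Boolean): Dataset[A] = Derived[A, A](this, _.filter(p), "filter")
    def map[B](f: A => B): Dataset[B] = Derived[A, B](this, _.map(f), "map")
    // Human-readable chain of steps back to the source.
    def lineage: String
  }

  final case class Source[A](name: String, load: () => Seq[A]) extends Dataset[A] {
    def compute(): Seq[A] = load()
    def lineage: String = name
  }

  final case class Derived[A, B](parent: Dataset[A], step: Seq[A] => Seq[B], label: String)
      extends Dataset[B] {
    def compute(): Seq[B] = step(parent.compute())
    def lineage: String = s"${parent.lineage} -> $label"
  }

  // Hypothetical tab-separated log lines: level, timestamp, message.
  val sampleLog: Seq[String] = Seq(
    "ERROR\t10:01\tdisk full",
    "INFO\t10:02\tjob started",
    "ERROR\t10:03\ttimeout"
  )

  val lines: Dataset[String] = Source("log", () => sampleLog)
  val errors: Dataset[String] = lines.filter(_.startsWith("ERROR"))
  val messages: Dataset[String] = errors.map(_.split('\t')(2))
}
```

Calling `LineageSketch.messages.compute()` replays the filter and map from the source each time, which is the property that makes recovery cheap: nothing but the recipe has to be stored.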
LANGUAGE SUPPORT
• Spark is written in Scala
• Uses Scala collections & Akka Actors
• Java, Python native support (Python support can lag),
lambda support in Java8/Spark 1.0
• R Bindings through SparkR
• Functional programming paradigm
RDD TRANSFORMATIONS
Transformations create a new RDD
map
filter
flatMap
sample
union
distinct
groupByKey
reduceByKey
sortByKey
join
cogroup
cartesian
Transformations are evaluated lazily.
RDD ACTIONS
Actions return a value
reduce
collect
count
countByKey
countByValue
countApprox
foreach
saveAsSequenceFile
saveAsTextFile
first
take(n)
takeSample
toArray
Invoking an Action will cause all previous Transformations to
be evaluated.
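The split between lazy transformations and eager actions can be illustrated with a plain Scala `Iterator`, which is also lazy until consumed. This is only an analogy under that assumption, not Spark code; `LazySketch`, `pipeline`, `runAction`, and `demo` are hypothetical names.

```scala
// Plain-Scala analogy for lazy transformations and eager actions.
object LazySketch {
  // Counts how many times the map function has actually run.
  var evaluations = 0

  val numbers: Seq[Int] = Seq(1, 2, 3, 4, 5)

  // "Transformations": building the pipeline runs nothing yet.
  def pipeline(): Iterator[Int] =
    numbers.iterator
      .map { n => evaluations += 1; n * n }
      .filter(_ % 2 == 1)

  // "Action": consuming the iterator forces every pending step.
  def runAction(it: Iterator[Int]): Int = it.sum

  // Returns (evaluations before the action, result, evaluations after).
  def demo(): (Int, Int, Int) = {
    evaluations = 0
    val pending = pipeline()
    val before = evaluations
    val result = runAction(pending)
    (before, result, evaluations)
  }
}
```

In `demo()`, the counter is still zero after the pipeline is built and only reaches five once the sum is requested, mirroring how an action triggers the work described by the earlier transformations.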
TASK SCHEDULER
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-part-1-amp-camp-2012-spark-intro.pdf
• Runs general task graphs
• Pipelines functions where possible
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles
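Why partitioning awareness can avoid a shuffle is easier to see in a small sketch: if two keyed datasets are split with the same hash rule, records with equal keys land at the same partition index, so a join can run partition by partition. The code below is plain Scala written for illustration, not the scheduler's implementation; `PartitionSketch`, `partitionOf`, `partitionBy`, and `localJoin` are hypothetical names.

```scala
// Plain-Scala sketch of co-partitioned joins (not Spark's implementation).
object PartitionSketch {
  // Assign a key to one of numPartitions buckets (non-negative modulo of hashCode).
  def partitionOf(key: String, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Split keyed records into numPartitions buckets using the rule above.
  def partitionBy[V](records: Seq[(String, V)], numPartitions: Int): Vector[Seq[(String, V)]] =
    Vector.tabulate(numPartitions) { p =>
      records.filter { case (k, _) => partitionOf(k, numPartitions) == p }
    }

  // Join two co-partitioned datasets one partition at a time: matching keys
  // are guaranteed to sit at the same index, so nothing crosses partitions.
  def localJoin[A, B](left: Vector[Seq[(String, A)]],
                      right: Vector[Seq[(String, B)]]): Seq[(String, (A, B))] =
    left.zip(right).flatMap { case (l, r) =>
      for ((k, a) <- l; (k2, b) <- r if k == k2) yield (k, (a, b))
    }
}
```

The sketch only holds when both sides use the same partition count and the same hash rule; if either differs, records would have to be moved first, which is the shuffle the slide refers to.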
SPARK ECOSYSTEM
SPARK STREAMING
• Micro-Batch: Discretized Stream (DStream)
• ~1 sec latency
• Fault tolerant
• Shares much of the same code as batch
TOP 10 HASHTAGS IN LAST 10 MIN
// Create the stream of tweets
val tweets = ssc.twitterStream(<username>, <password>)
// Count the tags over a 10 minute window, sliding every second
val tagCounts = tweets.flatMap(status => getTags(status))
  .countByValueAndWindow(Minutes(10), Seconds(1))
// Sort the tags by counts, highest first
val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
  .transform(_.sortByKey(false))
// Show the top 10 tags
sortedTags.foreach(showTopTags(10) _)
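What this pipeline computes for a single window can be sketched on an ordinary collection: keep the events inside the window, count per tag, then sort by count. This is a plain-Scala sketch that assumes timestamps in seconds, not the streaming API; `WindowSketch`, `Tagged`, and `topTags` are hypothetical names.

```scala
// Plain-Scala sketch of a windowed top-N count (not the Spark Streaming API).
object WindowSketch {
  // A hashtag observed at a given time (seconds since the stream started).
  final case class Tagged(timeSec: Long, tag: String)

  // Count tags seen in the window (now - windowSec, now] and return the
  // topN (count, tag) pairs, highest count first.
  def topTags(events: Seq[Tagged], now: Long, windowSec: Long, topN: Int): Seq[(Int, String)] =
    events
      .filter(e => e.timeSec > now - windowSec && e.timeSec <= now)
      .groupBy(_.tag)
      .toSeq // leave the Map before mapping, so equal counts do not collide as keys
      .map { case (tag, hits) => (hits.size, tag) }
      .sortBy { case (count, tag) => (-count, tag) } // ties broken by tag name
      .take(topN)
}
```

A streaming engine would recompute this result each time the window slides; the sketch evaluates one window position only.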
SHARK
• 10x+ speedup after data is cached
• In-memory materialized views
• Supports HiveQL, UDFs, etc.
• New Catalyst SQL engine coming in 1.0 includes
SchemaRDD to mix & match RDD/SQL in code.
GRAPHX
• Implementation of PowerGraph, Pregel on Spark
• 0.5x the speed of GraphLab, but more fault-tolerant
MLLIB
• Machine Learning library, part of Spark core.
• Uses jblas & gfortran. Python supports NumPy.
• Growing number of algorithms:
SVM, ALS, Naive Bayes, K-Means, Linear & Logistic
Regression. (SVD/PCA, CART, L-BFGS coming in 1.x)
MLLIB +
• MLI: Higher level library to support Tables
(dataframes), Linear Algebra, Optimizers.
• MLI: alpha software, limited activity
• Can use Scikit-Learn or SparkR to run models on
Spark.
MOMENTUM
COMMUNITY
[Charts: patches, lines added, and lines removed for MapReduce, Storm, YARN, and Spark]
SPARK MOMENTUM
• 1.0 is imminent (in 1.0 RC testing right now)
• Databricks investment: $14MM from Andreessen Horowitz
• Partnerships with DataStax, Cloudera, MapR,
PivotalHD
Q & A
THANKS!
steve@aster.is @stevendborrelli

More Related Content

What's hot

[Perforce] Tasks - The Holy Hand Grenade of Branching
[Perforce] Tasks - The Holy Hand Grenade of Branching[Perforce] Tasks - The Holy Hand Grenade of Branching
[Perforce] Tasks - The Holy Hand Grenade of Branching
Perforce
 
Meeting20150109 v1
Meeting20150109 v1Meeting20150109 v1
Meeting20150109 v1
Jean-Baptiste Poullet
 
Extracting data from text documents using the regex
Extracting data from text documents using the regexExtracting data from text documents using the regex
Extracting data from text documents using the regex
Steve Mylroie
 
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
Lucidworks
 
Gossip & Key Value Store
Gossip & Key Value StoreGossip & Key Value Store
Gossip & Key Value Store
Sajeev P
 
Repl internals
Repl internalsRepl internals
Repl internals
MongoDB
 
Monitoring and Logging in Wonderland
Monitoring and Logging in WonderlandMonitoring and Logging in Wonderland
Monitoring and Logging in Wonderland
Paul Seiffert
 
Jenkins 20
Jenkins 20Jenkins 20
Jenkins 20
Alex Soto
 
MOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PI
MOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PIMOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PI
MOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PI
Monica Li
 

What's hot (9)

[Perforce] Tasks - The Holy Hand Grenade of Branching
[Perforce] Tasks - The Holy Hand Grenade of Branching[Perforce] Tasks - The Holy Hand Grenade of Branching
[Perforce] Tasks - The Holy Hand Grenade of Branching
 
Meeting20150109 v1
Meeting20150109 v1Meeting20150109 v1
Meeting20150109 v1
 
Extracting data from text documents using the regex
Extracting data from text documents using the regexExtracting data from text documents using the regex
Extracting data from text documents using the regex
 
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
100 Billion Documents And Counting: Rebuilding Message Search at Slack - Josh...
 
Gossip & Key Value Store
Gossip & Key Value StoreGossip & Key Value Store
Gossip & Key Value Store
 
Repl internals
Repl internalsRepl internals
Repl internals
 
Monitoring and Logging in Wonderland
Monitoring and Logging in WonderlandMonitoring and Logging in Wonderland
Monitoring and Logging in Wonderland
 
Jenkins 20
Jenkins 20Jenkins 20
Jenkins 20
 
MOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PI
MOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PIMOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PI
MOUG17: Visualizing Air Traffic with Oracle APEX and Raspberry PI
 

Viewers also liked

Cattalugue2016
Cattalugue2016Cattalugue2016
Cattalugue2016
WsolutionSteel
 
LSE Enterprise Annual Report 2015
LSE Enterprise Annual Report 2015LSE Enterprise Annual Report 2015
LSE Enterprise Annual Report 2015
LSE Enterprise
 
How Big Data Will Save Planet Earth - StampedeCon 2015
How Big Data Will Save Planet Earth - StampedeCon 2015How Big Data Will Save Planet Earth - StampedeCon 2015
How Big Data Will Save Planet Earth - StampedeCon 2015
StampedeCon
 
Economics and finance of Europe: An intensive ten-week programme
Economics and finance of Europe: An intensive ten-week programmeEconomics and finance of Europe: An intensive ten-week programme
Economics and finance of Europe: An intensive ten-week programme
LSE Enterprise
 
Carol Varney's 1 Resume 2015
Carol Varney's 1 Resume 2015Carol Varney's 1 Resume 2015
Carol Varney's 1 Resume 2015
Carol Varney
 
Estrategias pedagógicas para la enseñanza de una cultura
Estrategias pedagógicas para la enseñanza de una culturaEstrategias pedagógicas para la enseñanza de una cultura
Estrategias pedagógicas para la enseñanza de una cultura
oscar daniel naranjo aristizabal
 
CORRUPTION IN INDIA
CORRUPTION IN INDIACORRUPTION IN INDIA
CORRUPTION IN INDIA
Santhosh Kumar
 
Curriculum Vitae_Samdoo JUNG
Curriculum Vitae_Samdoo JUNGCurriculum Vitae_Samdoo JUNG
Curriculum Vitae_Samdoo JUNG
Samdoo Jung
 
PatrickJacks_Resume2
PatrickJacks_Resume2PatrickJacks_Resume2
PatrickJacks_Resume2
Patrick Jacks
 
Cloud-Friendly Hadoop and Hive - StampedeCon 2013
Cloud-Friendly Hadoop and Hive - StampedeCon 2013Cloud-Friendly Hadoop and Hive - StampedeCon 2013
Cloud-Friendly Hadoop and Hive - StampedeCon 2013
StampedeCon
 
Listening for Insights: The Power of Social Media Listening - StampedeCon 2012
Listening for Insights: The Power of Social Media Listening - StampedeCon 2012Listening for Insights: The Power of Social Media Listening - StampedeCon 2012
Listening for Insights: The Power of Social Media Listening - StampedeCon 2012
StampedeCon
 
Lifting the hood on spark streaming - StampedeCon 2015
Lifting the hood on spark streaming - StampedeCon 2015Lifting the hood on spark streaming - StampedeCon 2015
Lifting the hood on spark streaming - StampedeCon 2015
StampedeCon
 
4 types of train accidents for which victims can claim compensation
4 types of train accidents for which victims can claim compensation4 types of train accidents for which victims can claim compensation
4 types of train accidents for which victims can claim compensation
frekhtmanassociates
 
From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...
From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...
From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...
StampedeCon
 
2 kien pham cv en vn with project experience
2 kien pham cv en  vn with project experience2 kien pham cv en  vn with project experience
2 kien pham cv en vn with project experience
Kien Pham
 
Estrous synchronization
Estrous synchronizationEstrous synchronization
Estrous synchronization
Armia Naguib
 

Viewers also liked (16)

Cattalugue2016
Cattalugue2016Cattalugue2016
Cattalugue2016
 
LSE Enterprise Annual Report 2015
LSE Enterprise Annual Report 2015LSE Enterprise Annual Report 2015
LSE Enterprise Annual Report 2015
 
How Big Data Will Save Planet Earth - StampedeCon 2015
How Big Data Will Save Planet Earth - StampedeCon 2015How Big Data Will Save Planet Earth - StampedeCon 2015
How Big Data Will Save Planet Earth - StampedeCon 2015
 
Economics and finance of Europe: An intensive ten-week programme
Economics and finance of Europe: An intensive ten-week programmeEconomics and finance of Europe: An intensive ten-week programme
Economics and finance of Europe: An intensive ten-week programme
 
Carol Varney's 1 Resume 2015
Carol Varney's 1 Resume 2015Carol Varney's 1 Resume 2015
Carol Varney's 1 Resume 2015
 
Estrategias pedagógicas para la enseñanza de una cultura
Estrategias pedagógicas para la enseñanza de una culturaEstrategias pedagógicas para la enseñanza de una cultura
Estrategias pedagógicas para la enseñanza de una cultura
 
CORRUPTION IN INDIA
CORRUPTION IN INDIACORRUPTION IN INDIA
CORRUPTION IN INDIA
 
Curriculum Vitae_Samdoo JUNG
Curriculum Vitae_Samdoo JUNGCurriculum Vitae_Samdoo JUNG
Curriculum Vitae_Samdoo JUNG
 
PatrickJacks_Resume2
PatrickJacks_Resume2PatrickJacks_Resume2
PatrickJacks_Resume2
 
Cloud-Friendly Hadoop and Hive - StampedeCon 2013
Cloud-Friendly Hadoop and Hive - StampedeCon 2013Cloud-Friendly Hadoop and Hive - StampedeCon 2013
Cloud-Friendly Hadoop and Hive - StampedeCon 2013
 
Listening for Insights: The Power of Social Media Listening - StampedeCon 2012
Listening for Insights: The Power of Social Media Listening - StampedeCon 2012Listening for Insights: The Power of Social Media Listening - StampedeCon 2012
Listening for Insights: The Power of Social Media Listening - StampedeCon 2012
 
Lifting the hood on spark streaming - StampedeCon 2015
Lifting the hood on spark streaming - StampedeCon 2015Lifting the hood on spark streaming - StampedeCon 2015
Lifting the hood on spark streaming - StampedeCon 2015
 
4 types of train accidents for which victims can claim compensation
4 types of train accidents for which victims can claim compensation4 types of train accidents for which victims can claim compensation
4 types of train accidents for which victims can claim compensation
 
From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...
From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...
From Smart Buildings to Smart Cities: An Industry in the Midst of Big Data - ...
 
2 kien pham cv en vn with project experience
2 kien pham cv en  vn with project experience2 kien pham cv en  vn with project experience
2 kien pham cv en vn with project experience
 
Estrous synchronization
Estrous synchronizationEstrous synchronization
Estrous synchronization
 

Similar to Apache Spark: the next big thing? - StampedeCon 2014

php[world] 2016 - You Don’t Need Node.js - Async Programming in PHP
php[world] 2016 - You Don’t Need Node.js - Async Programming in PHPphp[world] 2016 - You Don’t Need Node.js - Async Programming in PHP
php[world] 2016 - You Don’t Need Node.js - Async Programming in PHP
Adam Englander
 
Zend con 2016 - Asynchronous Prorgamming in PHP
Zend con 2016 - Asynchronous Prorgamming in PHPZend con 2016 - Asynchronous Prorgamming in PHP
Zend con 2016 - Asynchronous Prorgamming in PHP
Adam Englander
 
MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...
MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...
MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...
MongoDB
 
Consistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your ChoiceConsistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your Choice
Andrea Giuliano
 
Meteor - not just for rockstars
Meteor - not just for rockstarsMeteor - not just for rockstars
Meteor - not just for rockstars
Stephan Hochhaus
 
Toying with spark
Toying with sparkToying with spark
Toying with spark
Raymond Tay
 
Microservices With Spring Boot and Spring Cloud Netflix
Microservices With Spring Boot and Spring Cloud NetflixMicroservices With Spring Boot and Spring Cloud Netflix
Microservices With Spring Boot and Spring Cloud Netflix
Krzysztof Sobkowiak
 
Angular server side rendering with NodeJS - In Pursuit Of Speed
Angular server side rendering with NodeJS - In Pursuit Of SpeedAngular server side rendering with NodeJS - In Pursuit Of Speed
Angular server side rendering with NodeJS - In Pursuit Of Speed
Ilia Idakiev
 
Web enabling your survey business
Web enabling your survey businessWeb enabling your survey business
Web enabling your survey business
Rudy Stricklan
 
Witchcraft
WitchcraftWitchcraft
Witchcraft
Brooklyn Zelenka
 
Apache spark
Apache sparkApache spark
Apache spark
sivachandra mandalapu
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
Marina Kolpakova
 
Data Modelling at Scale
Data Modelling at ScaleData Modelling at Scale
Data Modelling at Scale
David Simons
 
Using MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content RepositoryUsing MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content Repository
Nuxeo
 
Scaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowScaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlow
Databricks
 
Model Serving via Pulsar Functions
Model Serving via Pulsar FunctionsModel Serving via Pulsar Functions
Model Serving via Pulsar Functions
Arun Kejariwal
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Holden Karau
 
Everybody Lies
Everybody LiesEverybody Lies
Everybody Lies
Tomasz Kowalczewski
 
GraphQL, l'avenir du REST ?
GraphQL, l'avenir du REST ?GraphQL, l'avenir du REST ?
GraphQL, l'avenir du REST ?
Francois Zaninotto
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
Holden Karau
 

Similar to Apache Spark: the next big thing? - StampedeCon 2014 (20)

php[world] 2016 - You Don’t Need Node.js - Async Programming in PHP
php[world] 2016 - You Don’t Need Node.js - Async Programming in PHPphp[world] 2016 - You Don’t Need Node.js - Async Programming in PHP
php[world] 2016 - You Don’t Need Node.js - Async Programming in PHP
 
Zend con 2016 - Asynchronous Prorgamming in PHP
Zend con 2016 - Asynchronous Prorgamming in PHPZend con 2016 - Asynchronous Prorgamming in PHP
Zend con 2016 - Asynchronous Prorgamming in PHP
 
MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...
MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...
MongoDB Europe 2016 - Using MongoDB to Build a Fast and Scalable Content Repo...
 
Consistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your ChoiceConsistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your Choice
 
Meteor - not just for rockstars
Meteor - not just for rockstarsMeteor - not just for rockstars
Meteor - not just for rockstars
 
Toying with spark
Toying with sparkToying with spark
Toying with spark
 
Microservices With Spring Boot and Spring Cloud Netflix
Microservices With Spring Boot and Spring Cloud NetflixMicroservices With Spring Boot and Spring Cloud Netflix
Microservices With Spring Boot and Spring Cloud Netflix
 
Angular server side rendering with NodeJS - In Pursuit Of Speed
Angular server side rendering with NodeJS - In Pursuit Of SpeedAngular server side rendering with NodeJS - In Pursuit Of Speed
Angular server side rendering with NodeJS - In Pursuit Of Speed
 
Web enabling your survey business
Web enabling your survey businessWeb enabling your survey business
Web enabling your survey business
 
Witchcraft
WitchcraftWitchcraft
Witchcraft
 
Apache spark
Apache sparkApache spark
Apache spark
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
 
Data Modelling at Scale
Data Modelling at ScaleData Modelling at Scale
Data Modelling at Scale
 
Using MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content RepositoryUsing MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content Repository
 
Scaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowScaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlow
 
Model Serving via Pulsar Functions
Model Serving via Pulsar FunctionsModel Serving via Pulsar Functions
Model Serving via Pulsar Functions
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
 
Everybody Lies
Everybody LiesEverybody Lies
Everybody Lies
 
GraphQL, l'avenir du REST ?
GraphQL, l'avenir du REST ?GraphQL, l'avenir du REST ?
GraphQL, l'avenir du REST ?
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
 

More from StampedeCon

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
StampedeCon
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
StampedeCon
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
StampedeCon
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
StampedeCon
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
StampedeCon
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
StampedeCon
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
StampedeCon
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
StampedeCon
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
StampedeCon
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
StampedeCon
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
StampedeCon
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017
StampedeCon
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017
StampedeCon
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
StampedeCon
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
StampedeCon
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
StampedeCon
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
StampedeCon
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
StampedeCon
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
StampedeCon
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
StampedeCon
 

More from StampedeCon (20)

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Apache Spark: the next big thing? - StampedeCon 2014

  • 1. APACHE SPARK. STAMPEDECON 2014. STEVEN BORRELLI, @stevendborrelli, ASTERIS
  • 2. ABOUT ME
      • FOUNDER, ASTERIS (JAN 2014)
      • ORGANIZER OF STL MACHINE LEARNING AND DOCKER STL
      • SYSTEMS ENGINEERING, HPC, BIG DATA, & CLOUD
      • NEXT GENERATION INFRASTRUCTURE FOR DEVELOPERS
  • 3. SPARK IN FIVE SECONDS: Spark is a replacement for MapReduce.
  • 4. WHY DO WE NEED TO REPLACE MAPREDUCE?
  • 5. MAPREDUCE IS AWESOME! Allows us to process enormous amounts of data in parallel.
  • 6. MAPREDUCE. MAPREDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS (2004), JEFFREY DEAN AND SANJAY GHEMAWAT
  • 7. HITTING THE LIMITS OF HADOOP’S MAPREDUCE
  • 8. THE PROBLEMS WITH MAPREDUCE. API: Low-Level & Complex
  • 9. MAPREDUCE ISSUES
      • Latency
      • Execution time impacted by “stragglers”
      • Lack of in-memory caching
      • Intermediate steps persisted to disk
      • No shared state
  • 10. THE PROBLEMS WITH MAPREDUCE. Not optimal for: MACHINE LEARNING, GRAPHS, STREAM PROCESSING
  • 11. IMPROVING MAPREDUCE: APACHE TEZ
  • 12. NEXT MAPREDUCE: GOALS
      • Generalize to different workloads
      • Sub-second latency
      • Scalable and fault tolerant
      • Easy-to-use API
  • 13. TOP SPARK FEATURES
      • Fast, fault-tolerant in-memory data structures (RDD)
      • Compatibility with Hadoop ecosystem
      • Rich, easy-to-use API supports Machine Learning, Graphs and Streaming
      • Interactive Shell
  • 14. SPARK STACK
  • 15. SPARK STACK. Integrated platform for disparate workloads
  • 16. RESILIENT DISTRIBUTED DATASET
      • Immutable in-memory collections
      • Fast recovery on failure
      • Control caching and persistence to memory/disk
      • Can partition to avoid shuffles
  • 17. RDD LINEAGE
        lines = spark.textFile("hdfs://errors/...")
        errors = lines.filter(_.startsWith("ERROR"))
        messages = errors.map(_.split('\t')(2))
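The lineage idea behind slides 16 and 17 can be sketched without Spark. The toy model below is an assumption made for illustration, not Spark's implementation: `Dataset`, `Source`, `Filtered`, and `Mapped` are hypothetical names, and the sample log lines are made up. Each dataset remembers its parent and the step that derived it, so a lost result can be rebuilt by replaying the chain from the source.

```scala
// Toy model of lineage (hypothetical types, not the Spark API).
// Every dataset knows how to recompute itself from its parent.
sealed trait Dataset[A] {
  def compute(): Seq[A]      // rebuild the data by replaying the lineage
  def lineage: List[String]  // human-readable chain of derivation steps
}

final case class Source[A](name: String, load: () => Seq[A]) extends Dataset[A] {
  def compute(): Seq[A] = load()
  def lineage: List[String] = List(s"source($name)")
}

final case class Filtered[A](parent: Dataset[A], keep: A => Boolean) extends Dataset[A] {
  def compute(): Seq[A] = parent.compute().filter(keep)
  def lineage: List[String] = parent.lineage :+ "filter"
}

final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
  def compute(): Seq[B] = parent.compute().map(f)
  def lineage: List[String] = parent.lineage :+ "map"
}

// Made-up sample data standing in for the HDFS file on the slide.
val lines = Source("logs", () => Seq(
  "ERROR\t2014-05-29\tdisk full",
  "INFO\t2014-05-29\tjob started",
  "ERROR\t2014-05-29\ttimeout"
))

// Same chain as the slide: keep error lines, then take the third tab-separated field.
val errors = Filtered(lines, (s: String) => s.startsWith("ERROR"))
val messages = Mapped(errors, (s: String) => s.split('\t')(2))

println(messages.lineage.mkString(" -> "))
println(messages.compute())
```

Because nothing is stored except the recipe, calling `compute()` again after a failure yields the same data. This is the property the "fast recovery on failure" bullet relies on.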
  • 18. LANGUAGE SUPPORT
      • Spark is written in Scala
      • Uses Scala collections & Akka Actors
      • Native Java and Python support (Python support can lag); lambda support with Java 8 in Spark 1.0
      • R bindings through SparkR
      • Functional programming paradigm
  • 19. RDD TRANSFORMATIONS
      Transformations create a new RDD: map, filter, flatMap, sample, union, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup, cartesian.
      Transformations are evaluated lazily.
  • 20. RDD ACTIONS
      Actions return a value: reduce, collect, count, countByKey, countByValue, countApprox, foreach, saveAsSequenceFile, saveAsTextFile, first, take(n), takeSample, toArray.
      Invoking an Action will cause all previous Transformations to be evaluated.
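The split between lazy transformations and eager actions can be seen with plain Scala collection views. This is a sketch under the assumption that a view is a fair stand-in for an RDD: `map` and `filter` on a view only record work, and a forcing call such as `toList` plays the role of an action. The counter `evaluated` is added purely to make the evaluation visible.

```scala
// Plain Scala views stand in for RDDs here; no Spark dependency.
var evaluated = 0  // counts how many times the map function actually runs

val numbers = (1 to 5).view

// "Transformations": these only describe the computation.
val doubled = numbers.map { n => evaluated += 1; n * 2 }
val large = doubled.filter(_ > 4)

val before = evaluated   // still 0: nothing has been computed yet

// "Action": forcing the view runs every recorded step.
val result = large.toList

val after = evaluated    // the map function has now run once per element

println(s"before=$before after=$after result=$result")
```

A view recomputes its steps each time it is forced, much like an RDD that has not been cached. This is why the deck lists control over caching as an RDD feature.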
  • 21. TASK SCHEDULER (http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-part-1-amp-camp-2012-spark-intro.pdf)
      • Runs general task graphs
      • Pipelines functions where possible
      • Cache-aware data reuse & locality
      • Partitioning-aware to avoid shuffles
  • 23. SPARK STREAMING
      • Micro-batch: Discretized Stream (DStream)
      • ~1 sec latency
      • Fault tolerant
      • Shares much of the same code as batch
  • 24. TOP 10 HASHTAGS IN LAST 10 MIN
        // Create the stream of tweets
        val tweets = ssc.twitterStream(<username>, <password>)
        // Count the tags over a 10 minute window
        val tagCounts = tweets.flatMap(status => getTags(status))
                              .countByValueAndWindow(Minutes(10), Seconds(1))
        // Sort the tags by counts
        val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
                                  .transform(_.sortByKey(false))
        // Show the top 10 tags
        sortedTags.foreach(showTopTags(10) _)
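The logic of slide 24 (count tags inside a time window, sort by count, keep the top few) can be sketched on an ordinary in-memory collection. Everything here is an assumption for illustration: `Status`, `getTags`, and `topTags` are hypothetical helpers, the tweets are invented, and a real DStream would apply the window incrementally per micro-batch rather than filtering a full list.

```scala
// Plain Scala sketch of a windowed hashtag count; no Spark Streaming dependency.
final case class Status(timestampSec: Long, text: String)

// Hypothetical helper: treat every whitespace-separated word starting with '#' as a tag.
def getTags(status: Status): Seq[String] =
  status.text.split("\\s+").filter(_.startsWith("#")).toSeq

// Count tags seen in the window (nowSec - windowSec, nowSec] and return the n most frequent.
def topTags(statuses: Seq[Status], nowSec: Long, windowSec: Long, n: Int): Seq[(String, Int)] =
  statuses
    .filter(s => s.timestampSec > nowSec - windowSec && s.timestampSec <= nowSec)
    .flatMap(getTags)
    .groupBy(identity)
    .map { case (tag, occurrences) => (tag, occurrences.size) }
    .toSeq
    .sortBy { case (tag, count) => (-count, tag) }  // highest count first, ties by tag name
    .take(n)

// Invented sample data: three recent tweets and one outside the 10 minute window.
val statuses = Seq(
  Status(990L, "loving #spark at #stampedecon"),
  Status(995L, "#spark streaming demo"),
  Status(100L, "old tweet about #hadoop"),
  Status(999L, "#scala and #spark")
)

val top = topTags(statuses, nowSec = 1000L, windowSec = 600L, n = 2)
println(top)
```

The tweet timestamped at 100 seconds falls outside the 600 second window, so `#hadoop` never reaches the count step.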
  • 25.
  • 26.
      • 10x+ speedup after data is cached
      • In-memory materialized views
      • Supports HiveQL, UDFs, etc.
      • New Catalyst SQL engine coming in 1.0 includes SchemaRDD to mix & match RDD/SQL in code
  • 27.
      • Implementation of PowerGraph, Pregel on Spark
      • 0.5x the speed of GraphLab, but more fault-tolerant
  • 28. MLLIB
      • Machine Learning library, part of Spark core
      • Uses jblas & gfortran. Python supports NumPy.
      • Growing number of algorithms: SVM, ALS, Naive Bayes, K-Means, Linear & Logistic Regression (SVD/PCA, CART, L-BFGS coming in 1.x)
  • 29. MLLIB +
      • MLI: higher-level library to support Tables (dataframes), Linear Algebra, Optimizers
      • MLI: alpha software, limited activity
      • Can use Scikit-Learn or SparkR to run models on Spark
  • 31. COMMUNITY [Charts: patches, lines added, and lines removed for MapReduce, Storm, YARN, and Spark]
  • 32. SPARK MOMENTUM
      • 1.0 is imminent (in 1.0 RC testing right now)
      • Databricks investment: $14MM from Andreessen Horowitz
      • Partnerships with DataStax, Cloudera, MapR, PivotalHD
  • 33. Q & A
  • 34. THANKS! steve@aster.is @stevendborrelli