APACHE SPARK
STAMPEDECON 2014
STEVEN BORRELLI
@stevendborrelli
ASTERIS
ABOUT ME
FOUNDER, ASTERIS (JAN 2014)
ORGANIZER OF STL MACHINE LEARNING AND DOCKER STL
SYSTEMS ENGINEERING, HPC, BIG DATA, & CLOUD
NEXT GENERATION INFRASTRUCTURE FOR DEVELOPERS
SPARK IN FIVE SECONDS
Spark is a replacement for MapReduce
WHY DO WE NEED TO REPLACE MAPREDUCE?
MAPREDUCE IS AWESOME!
Allows us to process enormous amounts of data in parallel
MAPREDUCE
MAPREDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS (2004)
JEFFREY DEAN AND SANJAY GHEMAWAT
HITTING THE LIMITS OF HADOOP'S MAPREDUCE
THE PROBLEMS WITH MAPREDUCE
API: Low-Level & Complex
MAPREDUCE ISSUES
• Latency
• Execution time impacted by “stragglers”
• Lack of in-memory caching
• Intermediate steps persisted to disk
• No shared state
THE PROBLEMS WITH MAPREDUCE
Not optimal for:
• Machine Learning
• Graphs
• Stream Processing
IMPROVING MAPREDUCE
APACHE TEZ
NEXT MAPREDUCE: GOALS
• Generalize to different workloads
• Sub-Second Latency
• Scalable and Fault Tolerant
• Easy to use API
TOP SPARK FEATURES
• Fast, fault-tolerant in-memory data structures (RDD)
• Compatibility with Hadoop ecosystem
• Rich, easy-to-use API supports Machine Learning, Graphs and Streaming
• Interactive Shell
SPARK STACK
Integrated platform for disparate workloads
RESILIENT DISTRIBUTED DATASET
• Immutable in-memory collections
• Fast recovery on failure
• Control caching and persistence to memory/disk
• Can partition to avoid shuffles
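The partitioning point can be illustrated without Spark. What follows is a minimal pure-Python sketch, not Spark code: `partition_of`, `partition_by`, and `local_join` are hypothetical helpers, and the sample data is made up. The idea it shows is that when two keyed datasets are split with the same partitioning function, matching keys land at the same partition index, so a join can run partition by partition without moving records between partitions.

```python
NUM_PARTITIONS = 4  # assumed partition count for this illustration


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Deterministic toy partitioner: sum of character codes modulo the partition count."""
    return sum(ord(c) for c in key) % num_partitions


def partition_by(pairs, num_partitions=NUM_PARTITIONS):
    """Split (key, value) pairs into a list of per-partition lists."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


def local_join(left_part, right_part):
    """Join two co-located partitions; no record leaves its partition."""
    right_index = {}
    for key, value in right_part:
        right_index.setdefault(key, []).append(value)
    return [(k, (lv, rv)) for k, lv in left_part for rv in right_index.get(k, [])]


# Made-up sample data: both datasets are keyed by user name.
users = [("ann", "St. Louis"), ("bob", "Chicago"), ("cy", "Austin")]
visits = [("ann", "/home"), ("cy", "/docs"), ("ann", "/blog")]

user_parts = partition_by(users)
visit_parts = partition_by(visits)

# Because both sides used the same partitioner, partition i only needs partition i.
joined = []
for left_part, right_part in zip(user_parts, visit_parts):
    joined.extend(local_join(left_part, right_part))
```

If the two sides were partitioned differently, every partition of one side could hold keys from any partition of the other, and records would have to be redistributed first. That redistribution is the shuffle the bullet refers to.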
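RDD LINEAGE
// Each RDD records how it was derived from its parent
val lines = spark.textFile("hdfs://errors/...")
val errors = lines.filter(_.startsWith("ERROR"))
// Keep the third tab-separated field of each error line
val messages = errors.map(_.split('\t')(2))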
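The lineage idea behind this snippet can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation: `ToyRDD` and its methods are hypothetical, the log lines are made up, and the whole dataset is recomputed here, whereas Spark tracks lineage per partition. Each dataset remembers its parent and the function that derives it, so a lost in-memory copy can be rebuilt by replaying the chain.

```python
class ToyRDD:
    """Toy stand-in for an RDD: remembers its parent and how it is derived."""

    def __init__(self, source=None, parent=None, op=None, name=""):
        self.source = source  # base data, only set on the root dataset
        self.parent = parent  # dataset this one was derived from
        self.op = op          # function: iterable of rows -> iterable of rows
        self.name = name
        self.cached = None    # in-memory copy, which may be lost

    def filter(self, pred, name="filter"):
        return ToyRDD(parent=self, op=lambda rows: (r for r in rows if pred(r)), name=name)

    def map(self, fn, name="map"):
        return ToyRDD(parent=self, op=lambda rows: (fn(r) for r in rows), name=name)

    def lineage(self):
        """Return the chain of dataset names from the root to this dataset."""
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def compute(self):
        """Return the rows, replaying the lineage if no cached copy exists."""
        if self.cached is not None:
            return self.cached
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = list(self.op(self.parent.compute()))
        self.cached = rows
        return rows


# Made-up log lines with three tab-separated fields.
log = ["INFO\tweb\tstarted", "ERROR\tdb\ttimeout", "ERROR\tweb\t500"]
lines = ToyRDD(source=log, name="lines")
errors = lines.filter(lambda l: l.startswith("ERROR"), name="errors")
messages = errors.map(lambda l: l.split("\t")[2], name="messages")

first = messages.compute()
messages.cached = None  # simulate losing the in-memory copies
errors.cached = None
recovered = messages.compute()  # rebuilt by replaying filter and map
```

Nothing has to be replicated up front for this to work: the recipe is small, and only the lost data is recomputed from its surviving ancestors.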
LANGUAGE SUPPORT
• Spark is written in Scala
• Uses Scala collections & Akka Actors
• Native Java and Python support (Python support can lag); lambda support in Java 8 with Spark 1.0
• R bindings through SparkR
• Functional programming paradigm
RDD TRANSFORMATIONS
Transformations create a new RDD
map
filter
flatMap
sample
union
distinct
groupByKey
reduceByKey
sortByKey
join
cogroup
cartesian
Transformations are evaluated lazily.
RDD ACTIONS
Actions return a value
reduce
collect
count
countByKey
countByValue
countApprox
foreach
saveAsSequenceFile
saveAsTextFile
first
take(n)
takeSample
toArray
Invoking an Action will cause all previous Transformations to
be evaluated.
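The split between lazy transformations and eager actions can be shown with a small pure-Python model. This is a sketch under stated assumptions, not Spark's API: `LazyDataset` is a hypothetical class, and only `map`, `filter`, `collect`, and `count` are modeled. Transformations only record a step; an action replays all recorded steps over the data.

```python
class LazyDataset:
    """Toy dataset: transformations record steps, actions run them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    # Transformations: return a new dataset, evaluate nothing.
    def map(self, fn):
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def _run(self):
        """Chain the recorded steps into one iterator over the source data."""
        rows = iter(self._data)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return rows

    # Actions: force evaluation and return a value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every element the mapped function actually sees


def square(x):
    calls.append(x)
    return x * x


squares = LazyDataset(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: no transformation has run yet
result = squares.collect()        # the action triggers map and filter
```

In this model each action re-runs the full chain, so calling `count()` afterwards would invoke `square` again for every element. That is the cost that caching an intermediate dataset is meant to avoid.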
TASK SCHEDULER
Source: http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-part-1-amp-camp-2012-spark-intro.pdf
• Runs general task graphs
• Pipelines functions where possible
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles
SPARK ECOSYSTEM
SPARK STREAMING
• Micro-Batch: Discretized Stream (DStream)
• ~1 sec latency
• Fault tolerant
• Shares much of the same code as batch
TOP 10 HASHTAGS IN LAST 10 MIN
// Create the stream of tweets
val tweets = ssc.twitterStream(<username>, <password>)
// Count the tags over a 10-minute window, sliding every second
val tagCounts = tweets.flatMap(status => getTags(status))
  .countByValueAndWindow(Minutes(10), Seconds(1))
// Sort the tags by count, descending
val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
  .transform(_.sortByKey(false))
// Show the top 10 tags
sortedTags.foreach(showTopTags(10) _)
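The windowed count in this example can be approximated in plain Python to show what the micro-batch model computes. This is a sketch under stated assumptions, not Spark Streaming code: `get_tags` and `WindowedTagCounter` are hypothetical, the tweets are made up, and the window is measured in batches rather than minutes. Each incoming batch is reduced to tag counts, and the window keeps only the most recent batches.

```python
import re
from collections import Counter, deque


def get_tags(status):
    """Extract lowercase hashtags (without the '#') from a status text."""
    return re.findall(r"#(\w+)", status.lower())


class WindowedTagCounter:
    """Counts hashtags over a sliding window of the last N micro-batches."""

    def __init__(self, window_batches):
        # deque with maxlen drops the oldest batch automatically
        self.window = deque(maxlen=window_batches)

    def add_batch(self, statuses):
        """Reduce one micro-batch of statuses to per-tag counts and add it to the window."""
        self.window.append(Counter(tag for s in statuses for tag in get_tags(s)))

    def top(self, n):
        """Return the n most frequent tags in the window as (tag, count) pairs."""
        total = Counter()
        for batch in self.window:
            total.update(batch)
        # Sort by count descending, then by tag name for a stable order.
        return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


counter = WindowedTagCounter(window_batches=2)
counter.add_batch(["Learning #Spark at #StampedeCon", "#spark is fast"])
counter.add_batch(["#Hadoop and #Spark"])
counter.add_batch(["#Hadoop again"])  # the first batch slides out of the window
top_tags = counter.top(2)
```

With a one-second batch interval, a 10-minute window would correspond to `window_batches=600` in this model.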
• 10x+ speedup after data is cached
• In-memory materialized views
• Supports HiveQL, UDFs, etc.
• New Catalyst SQL engine coming in 1.0 includes SchemaRDD to mix & match RDD/SQL in code.
• Implementation of PowerGraph, Pregel on Spark
• 0.5x the speed of GraphLab, but more fault-tolerant
MLLIB
• Machine Learning library, part of Spark core.
• Uses jblas & gfortran. Python supports NumPy.
• Growing number of algorithms: SVM, ALS, Naive Bayes, K-Means, Linear & Logistic Regression. (SVD/PCA, CART, L-BFGS coming in 1.x)
MLLIB +
• MLI: Higher level library to support Tables (dataframes), Linear Algebra, Optimizers.
• MLI: alpha software, limited activity
• Can use Scikit-Learn or SparkR to run models on Spark.
MOMENTUM
COMMUNITY
[Bar charts comparing MapReduce, Storm, Yarn, and Spark on three measures: Patches, Lines Added, Lines Removed]
SPARK MOMENTUM
• 1.0 is imminent (in 1.0 RC testing right now)
• Databricks investment: $14MM from Andreessen Horowitz
• Partnerships with DataStax, Cloudera, MapR, PivotalHD
Q & A
THANKS!
steve@aster.is @stevendborrelli

 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Recently uploaded (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Apache Spark: the next big thing? - StampedeCon 2014

  • 1. Apache Spark. StampedeCon 2014. Steven Borrelli, @stevendborrelli, Asteris
  • 2. About Me: Founder, Asteris (Jan 2014). Organizer of STL Machine Learning and Docker STL. Systems engineering, HPC, big data, & cloud. Next generation infrastructure for developers.
  • 3. Spark in Five Seconds: Spark is a replacement for MapReduce
  • 4. WHY DO WE NEED TO REPLACE MAPREDUCE?
  • 5. MapReduce Is Awesome! Allows us to process enormous amounts of data in parallel
  • 6. MapReduce. "MapReduce: Simplified Data Processing on Large Clusters" (2004), Jeffrey Dean and Sanjay Ghemawat
  • 7. HITTING THE LIMITS OF HADOOP’s MAPREDUCE
  • 8. The Problems with MapReduce. API: Low-Level & Complex
  • 9. MapReduce Issues • Latency • Execution time impacted by “stragglers” • Lack of in-memory caching • Intermediate steps persisted to disk • No shared state
  • 10. The Problems with MapReduce. Not optimal for: machine learning, graphs, stream processing
  • 11. Improving MapReduce: Apache Tez
  • 12. Next MapReduce: Goals • Generalize to different workloads • Sub-second latency • Scalable and fault tolerant • Easy-to-use API
  • 13. Top Spark Features • Fast, fault-tolerant in-memory data structures (RDD) • Compatibility with Hadoop ecosystem • Rich, easy-to-use API supports Machine Learning, Graphs and Streaming • Interactive Shell
  • 14. Spark Stack
  • 15. Spark Stack. Integrated platform for disparate workloads
  • 16. Resilient Distributed Dataset • Immutable in-memory collections • Fast recovery on failure • Control caching and persistence to memory/disk • Can partition to avoid shuffles
  • 17. RDD Lineage
      val lines = spark.textFile("hdfs://errors/...")
      val errors = lines.filter(_.startsWith("ERROR"))
      val messages = errors.map(_.split('\t')(2))
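
The lineage idea on slide 17 can be sketched without a cluster. The following is a toy model in plain Scala, not Spark's implementation: `Dataset`, `Source`, `Filtered`, `Mapped` and the sample log lines are hypothetical names and data invented for illustration. Each derived dataset stores only its parent and the function applied, so its contents can be recomputed from the source whenever they are lost.

```scala
// Toy lineage model (an assumption for illustration, not Spark internals).
sealed trait Dataset[A] {
  def compute(): Seq[A]   // rebuild the contents by replaying the lineage
  def lineage: String     // human-readable chain of steps
}

// Root of the lineage: knows how to (re)load its data.
final case class Source[A](name: String, load: () => Seq[A]) extends Dataset[A] {
  def compute(): Seq[A] = load()
  def lineage: String = name
}

// Derived dataset: remembers its parent and the predicate, not the data.
final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
  def compute(): Seq[A] = parent.compute().filter(p)
  def lineage: String = parent.lineage + " -> filter"
}

// Derived dataset: remembers its parent and the mapping function.
final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
  def compute(): Seq[B] = parent.compute().map(f)
  def lineage: String = parent.lineage + " -> map"
}

object LineageDemo {
  // Hypothetical tab-separated log lines standing in for the HDFS file.
  val lines = Source("textFile", () => Seq("ERROR\tdisk\tfull", "INFO\tok\tfine", "ERROR\tnet\tdown"))
  val errors = Filtered[String](lines, _.startsWith("ERROR"))
  val messages = Mapped[String, String](errors, _.split('\t')(2))
}
```

Calling `LineageDemo.messages.compute()` replays the filter and map steps from the source each time, which is the property that makes recovery after a failure possible in this model.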
  • 18. Language Support • Spark is written in Scala • Uses Scala collections & Akka Actors • Java, Python native support (Python support can lag), lambda support in Java 8/Spark 1.0 • R bindings through SparkR • Functional programming paradigm
  • 19. RDD Transformations. Transformations create a new RDD: map, filter, flatMap, sample, union, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup, cartesian. Transformations are evaluated lazily.
  • 20. RDD Actions. Actions return a value: reduce, collect, count, countByKey, countByValue, countApprox, foreach, saveAsSequenceFile, saveAsTextFile, first, take(n), takeSample, toArray. Invoking an action will cause all previous transformations to be evaluated.
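
The split between lazy transformations and eager actions can be illustrated with plain Scala collection views. This is only an analogy, assuming Scala 2.13 or later, and it does not use Spark: `LazyDemo`, `evaluated` and `run` are names made up for the sketch. A view defers `map` and `filter` the way RDD transformations are deferred, and a terminal call such as `sum` plays the role of an action.

```scala
// Analogy only: Scala views stand in for RDD transformations.
object LazyDemo {
  var evaluated = 0  // counts how many elements map has actually processed

  val numbers = (1 to 5).view                               // lazy wrapper, nothing computed yet
  val doubled = numbers.map { n => evaluated += 1; n * 2 }  // "transformation": only recorded
  val bigOnes = doubled.filter(_ > 4)                       // still lazy

  // "Action": forces the whole chain to run and returns a value.
  def run(): Int = bigOnes.sum
}
```

Until `run()` is called, `evaluated` stays at zero. Because a view does not memoize, calling `run()` a second time repeats the work, which loosely mirrors why an RDD that is reused is worth caching.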
  • 21. Task Scheduler • Runs general task graphs • Pipelines functions where possible • Cache-aware data reuse & locality • Partitioning-aware to avoid shuffles. Source: http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-part-1-amp-camp-2012-spark-intro.pdf
  • 23. Spark Streaming • Micro-batch: Discretized Stream (DStream) • ~1 sec latency • Fault tolerant • Shares much of the same code as batch
  • 24. Top 10 Hashtags in Last 10 Min
      // Create the stream of tweets
      val tweets = ssc.twitterStream(<username>, <password>)
      // Count the tags over a 10 minute window
      val tagCounts = tweets.flatMap(status => getTags(status)).countByValueAndWindow(Minutes(10), Seconds(1))
      // Sort the tags by counts
      val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }.transform(_.sortByKey(false))
      // Show the top 10 tags
      sortedTags.foreach(showTopTags(10) _)
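
The windowed count on slide 24 can be mimicked on an in-memory list to show the logic of one window. This is a minimal sketch in plain Scala, not the Spark Streaming API: `Tweet`, `getTags` and `topTags` are hypothetical helpers, timestamps are assumed to be in seconds, and ties are broken alphabetically by tag as an arbitrary choice.

```scala
// Batch-style sketch of "top N hashtags in the last window" (no Spark involved).
object TopTags {
  final case class Tweet(timestampSec: Long, text: String)

  // Extract whitespace-separated words that look like hashtags.
  def getTags(text: String): Seq[String] =
    text.split("\\s+").toSeq.filter(w => w.startsWith("#") && w.length > 1)

  // Count tags for tweets in the half-open window (nowSec - windowSec, nowSec].
  def topTags(tweets: Seq[Tweet], nowSec: Long, windowSec: Long, n: Int): Seq[(String, Int)] =
    tweets
      .filter(t => t.timestampSec > nowSec - windowSec && t.timestampSec <= nowSec)
      .flatMap(t => getTags(t.text))
      .groupBy(identity)
      .map { case (tag, occurrences) => (tag, occurrences.size) }
      .toSeq
      .sortBy { case (tag, count) => (-count, tag) }  // highest count first
      .take(n)
}
```

A streaming engine would re-evaluate this as the window slides; here `nowSec` is passed in explicitly so one window can be inspected at a time, for example `TopTags.topTags(tweets, nowSec = 600, windowSec = 600, n = 10)`.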
  • 25.
  • 26. • 10x+ speedup after data is cached • In-memory materialized views • Supports HiveQL, UDFs, etc. • New Catalyst SQL engine coming in 1.0 includes SchemaRDD to mix & match RDD/SQL in code.
  • 27. • Implementation of PowerGraph, Pregel on Spark • .5x the speed of GraphLab, but more fault-tolerant
  • 28. MLlib • Machine learning library, part of Spark core. • Uses jblas & gfortran. Python supports NumPy. • Growing number of algorithms: SVM, ALS, Naive Bayes, K-Means, Linear & Logistic Regression. (SVD/PCA, CART, L-BFGS coming in 1.x)
  • 29. MLlib + • MLI: Higher-level library to support Tables (dataframes), Linear Algebra, Optimizers. • MLI: alpha software, limited activity • Can use Scikit-Learn or SparkR to run models on Spark.
  • 31. Community [bar charts comparing MapReduce, Storm, Yarn, and Spark by patches, lines added, and lines removed]
  • 32. Spark Momentum • 1.0 is imminent (in 1.0 RC testing right now) • Databricks investment $14MM Andreessen Horowitz • Partnerships with DataStax, Cloudera, MapR, PivotalHD
  • 33. Q & A
  • 34. Thanks! steve@aster.is @stevendborrelli