Lightning Fast Big Data Analytics with Apache Spark
Andy Petrella (@noootsab), Big Data Hacker
Gerard Maas (@maasg), Data Processing Team Lead
#devoxx #sparkvoxx @noootsab @maasg
What is Spark? 
Spark Foundation: The RDD 
Memory Network 
(and don’t forget to throw some disks in the mix) 
What is Spark? 
Spark is a fast and general engine for large-scale distributed data processing. 
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
Fast Functional 
Spark: A Strong Open Source Project 
27/02: Apache top-level project
30/05: Spark 1.0.0 release
11/09: Spark 1.1.0 release
42 contributors, 118 contributors, 176 contributors
#Commits. src:
Compared to Map-Reduce 
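import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { // emits (word, 1) per token
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { // sums the counts of one word
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory, must not exist yet
    job.waitForCompletion(true);
  }
}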
The same job in Spark:
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
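What the three Spark steps compute can be checked without a cluster. The following is only a sketch with plain Scala collections, not Spark itself: the object `LocalWordCount` and its `reduceByKey` helper are names invented for this illustration, and everything runs in a single JVM.

```scala
object LocalWordCount {
  // Local stand-in for the reduceByKey step: merge all values sharing a key with `f`.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  def wordCount(lines: Seq[String]): Map[String, Int] = {
    val words = lines.flatMap(line => line.split(" ")) // one element per word
    val pairs = words.map(word => (word, 1))           // (word, 1) pairs
    reduceByKey(pairs)(_ + _)                          // sum the 1s per word
  }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or not to be", "to do"))
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

On a cluster the same three steps run per partition, and only the reduceByKey step has to move data between nodes.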
The Big Idea... 
Express computations in terms of operations on a data set. 
Spark Core Concept: RDD => Resilient Distributed Dataset 
Think of an RDD as an immutable, distributed collection of objects 
• Resilient => Can be reconstructed in case of failure 
• Distributed => Transformations are parallelizable operations 
• Dataset => Data loaded and partitioned across cluster nodes (executors) 
RDDs are memory-intensive. Caching behavior is controllable. 
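The "distributed" and "immutable" parts of the definition can be pictured with a toy model. This is a sketch under loud assumptions: `PartitionSketch`, `partition` and `mapPartitions` are hypothetical names, the "executors" are just in-memory vectors, and none of this is the Spark API.

```scala
object PartitionSketch {
  // Split the data into at most `n` chunks, the way a dataset is spread over executors.
  def partition[A](data: Vector[A], n: Int): Vector[Vector[A]] = {
    require(n > 0, "need at least one partition")
    val size = math.max(1, math.ceil(data.size.toDouble / n).toInt)
    data.grouped(size).toVector
  }

  // A transformation works on each partition independently and returns new
  // partitions; the input partitions are never modified.
  def mapPartitions[A, B](parts: Vector[Vector[A]])(f: A => B): Vector[Vector[B]] =
    parts.map(_.map(f))

  def main(args: Array[String]): Unit = {
    val parts   = partition((1 to 10).toVector, 3)
    val doubled = mapPartitions(parts)(_ * 2)
    println(parts)   // the original partitions, untouched
    println(doubled) // new partitions produced by the transformation
  }
}
```

Because each partition is processed on its own, the work parallelizes naturally, and a lost partition can be rebuilt by re-running the same function on its input.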
Spark Cluster 
.textFile("...") RDD 
.textFile("...").flatMap(l => l.split(" "))
.textFile("...").flatMap(l => l.split(" ")).map(w => (w,1))
.textFile("...").flatMap(l => l.split(" ")).map(w => (w,1))
.reduceByKey(_ + _)
The Spark Lingo 
.textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) 
.reduceByKey(_ + _) 
Spark: RDD Operations 
Inner Manipulations 
> map, flatMap, filter, distinct 
Cross RDD 
> union, subtract, intersection, join, cartesian 
Structural reorganization (Expensive) 
> groupBy, aggregate, sort 
> coalesce, repartition 
Fetch Data 
> collect, take, first, takeSample 
Aggregate Results 
> reduce, count, countByKey 
> foreach, foreachPartition, save* 
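In Spark, operations of the first kind (map, flatMap, filter, ...) are lazy transformations, while operations such as collect, take or count are actions that trigger the actual computation. The sketch below is only a local analogy using Scala's lazy `Iterator`, not Spark; `LazySketch` and the `evaluated` counter are made up for the illustration.

```scala
object LazySketch {
  // Returns (elements evaluated before the "action", elements evaluated after it, result).
  def run(): (Int, Int, List[Int]) = {
    var evaluated = 0
    // "Transformations": nothing is computed yet, only a recipe is built.
    val planned = (1 to 5).iterator
      .map { x => evaluated += 1; x * 2 }
      .filter(_ > 4)
    val before = evaluated
    // "Action": asking for the result forces the whole pipeline to run.
    val result = planned.toList
    (before, evaluated, result)
  }

  def main(args: Array[String]): Unit = {
    val (before, after, result) = run()
    println(s"evaluated before action: $before, after action: $after, result: $result")
  }
}
```

The counter stays at zero until the result is requested, which is the behaviour to expect from a chain of transformations followed by a single action.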
RDD Lineage 
Each RDD keeps track of its parent.
This is the basis for DAG scheduling
and fault recovery.
val file = spark.textFile("hdfs://...")
val wordsRDD = file.flatMap(line => line.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
val scoreRDD = wordsRDD.map { case (k, v) => (v, k) }
wordsRDD: MapPartitionsRDD
scoreRDD: MappedRDD
rdd.toDebugString is your friend 
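The idea of lineage can be sketched in a few lines of plain Scala. This is a toy model, not how Spark implements RDDs: `Node`, `Source`, `Derived` and `lineage` are hypothetical names, and the data lives in local sequences. Each node remembers its parent and the function that derives it, so a lost result can be recomputed by walking back up the chain.

```scala
object LineageSketch {
  // A dataset node knows how to (re)compute itself and where it comes from.
  sealed trait Node[A] {
    def name: String
    def compute(): Seq[A]     // recompute from scratch by walking up the lineage
    def lineage: List[String] // this node first, then its ancestors
  }

  final case class Source[A](name: String, load: () => Seq[A]) extends Node[A] {
    def compute(): Seq[A] = load()
    def lineage: List[String] = List(name)
  }

  final case class Derived[A, B](name: String, parent: Node[A], f: Seq[A] => Seq[B]) extends Node[B] {
    def compute(): Seq[B] = f(parent.compute())
    def lineage: List[String] = name :: parent.lineage
  }

  // A three-step pipeline mirroring the first steps of the word count.
  def example: Node[(String, Int)] = {
    val file  = Source("textFile", () => Seq("a b", "b"))
    val words = Derived("flatMap", file, (ls: Seq[String]) => ls.flatMap(_.split(" ")))
    Derived("map", words, (ws: Seq[String]) => ws.map(w => (w, 1)))
  }

  def main(args: Array[String]): Unit = {
    println(example.lineage.mkString(" <- ")) // chain of parents, newest first
    println(example.compute())                // rebuilt entirely from the source
  }
}
```

In this model `lineage` plays the role that `toDebugString` plays for a real RDD: it shows which chain of parents a dataset was derived from.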
Spark has Support for... 
[Figure: languages supported by Spark (Scala and R among them) and the interfaces available for each: API, Shell, Notebook]
The Spark Shell is the best way to start exploring Spark 
Exploring and transforming data with the Spark Shell
Book data provided by Project Gutenberg
Cluster computing resources provided by
Now we know what Spark is!
At least, we know its Core, let’s say its SDK.
Thanks to its great and enthusiastic community,
Spark Core has been used in an ever-growing number of fields.
Hence the ecosystem is evolving fast.
Higher level primitives ... 
… or APIs 
… or the rise of the popolo 
If Spark Core is the fold of distributed computing 
Then we’re going to look at the map, filter, countBy, groupBy, ... 
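The fold analogy can be made concrete on ordinary lists: map, filter, countBy and groupBy can all be written in terms of a fold. A minimal sketch in plain Scala, with no Spark involved; the object name `FoldDerived` is made up for the illustration.

```scala
object FoldDerived {
  // map: rebuild the list, applying f to every element.
  def map[A, B](xs: List[A])(f: A => B): List[B] =
    xs.foldRight(List.empty[B])((x, acc) => f(x) :: acc)

  // filter: rebuild the list, keeping only elements that satisfy p.
  def filter[A](xs: List[A])(p: A => Boolean): List[A] =
    xs.foldRight(List.empty[A])((x, acc) => if (p(x)) x :: acc else acc)

  // countBy: accumulate a counter per key.
  def countBy[A, K](xs: List[A])(key: A => K): Map[K, Int] =
    xs.foldLeft(Map.empty[K, Int])((acc, x) => acc.updated(key(x), acc.getOrElse(key(x), 0) + 1))

  // groupBy: accumulate the list of elements per key, preserving order.
  def groupBy[A, K](xs: List[A])(key: A => K): Map[K, List[A]] =
    xs.foldRight(Map.empty[K, List[A]])((x, acc) => acc.updated(key(x), x :: acc.getOrElse(key(x), Nil)))

  def main(args: Array[String]): Unit = {
    println(map(List(1, 2, 3))(_ * 2))
    println(filter(List(1, 2, 3, 4))(_ % 2 == 0))
    println(countBy(List("a", "bb", "cc"))(_.length))
    println(groupBy(List("a", "bb", "cc"))(_.length))
  }
}
```

In the same spirit, the libraries on the following slides are higher-level operations expressed on top of the core primitive.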
Spark Streaming 
When you have big fat streams behaving as one single collection 
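Spark Streaming treats a stream as a sequence of small batches, each processed like an ordinary collection. The sketch below imitates that idea locally, under stated assumptions: `Event`, `batches` and `countsPerBatch` are hypothetical names, the "stream" is a finite sequence of timestamped lines, and no Spark Streaming API is used.

```scala
object MicroBatchSketch {
  final case class Event(timestampMs: Long, line: String)

  // Cut the stream into consecutive batches of `intervalMs` milliseconds.
  def batches(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events.groupBy(_.timestampMs / intervalMs).toSeq.sortBy(_._1)

  // The same word count as for a static data set.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Apply the batch logic to every micro-batch of the stream.
  def countsPerBatch(events: Seq[Event], intervalMs: Long): Seq[(Long, Map[String, Int])] =
    batches(events, intervalMs).map { case (id, es) => id -> wordCount(es.map(_.line)) }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(0L, "a b"), Event(500L, "a"), Event(1200L, "b"))
    countsPerBatch(events, 1000L).foreach(println)
  }
}
```

The point of the design is reuse: the per-batch function is exactly the code one would write for a single collection.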
Spark Streaming 
Spark SQL 
From SQL to NoSQL to SQL … to NoSQL
Structured Query Language
We’re not really querying, we’re processing
SQL provides the mathematical structures (abstractions) to manipulate data
We can optimize: Spark has Catalyst
Spark SQL 
MLlib
“The library to teach them all”
SciPy, scikit-learn, R, MATLAB and co. → learn on one machine
(sadly often, one core)
SVM, lm, K-Means, ALS
GraphX
Connecting the dots
Graph processing at scale.
> Take edges
> Add some nodes
> Combine = Send messages (Pregel)
Connecting the dots 
Graph processing at scale. 
> Take edges 
> Link nodes 
> Combine/Send messages 
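The "send messages, then combine" loop of a Pregel-style computation can be sketched locally. This is a toy, single-machine illustration and not the GraphX API: `PregelSketch` and `connectedComponents` are hypothetical names, and the sketch assumes every edge endpoint is listed in `vertices`. Each vertex starts with its own id as label and repeatedly adopts the smallest label received from a neighbour, which yields connected components.

```scala
object PregelSketch {
  type VertexId = Long

  // Propagate the smallest vertex id through the graph until no label changes.
  def connectedComponents(vertices: Set[VertexId],
                          edges: Seq[(VertexId, VertexId)]): Map[VertexId, VertexId] = {
    var labels: Map[VertexId, VertexId] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      // Send: along every edge, each endpoint sends its label to the other one.
      val messages: Seq[(VertexId, VertexId)] = edges.flatMap { case (a, b) =>
        Seq(b -> labels(a), a -> labels(b))
      }
      // Combine: keep only the smallest message per destination vertex.
      val combined: Map[VertexId, VertexId] =
        messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).min }
      // Vertex program: adopt the smaller of the current label and the message.
      val next = labels.map { case (v, l) => v -> math.min(l, combined.getOrElse(v, l)) }
      changed = next != labels
      labels = next
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    val result = connectedComponents(Set(1L, 2L, 3L, 4L, 5L), Seq((1L, 2L), (2L, 3L), (4L, 5L)))
    result.toSeq.sortBy(_._1).foreach(println)
  }
}
```

Each pass of the loop corresponds to one superstep; the computation stops once a superstep changes nothing.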
The new kid on the block in the Spark community (with the uncovered Thunder)
Game-changing library for processing DNA, genotypes, variants and co.
Comes with the right stack for processing …
… a huge bunch of vital legacy data
Tooling (NoIDE) 
Besides the classical Eclipse, IntelliJ IDEA, NetBeans, Sublime Text and family!
An IDE is not enough, because we craft more than software or services.
Spark is for data analysis, and data scientists need
> interactivity (exploration) 
> reproducibility (environment, data and logic) 
> shareability (results) 
Spark-Shell backend for IPython (Worksheet for data analysts) 
Well-shaped notebook based on Kibana, offering Spark-dedicated features
> Multiple languages (Scala, SQL, Markdown, shell)
> Dynamic forms (generating inputs)
> Data visualization (and export)
Check the website!
Spark Notebook 
Scala-Notebook fork, enhanced for Spark peculiarities.
Full Scala, Akka and RxScala.
Features include:
> Multiple languages (Scala, SQL, Markdown, JavaScript)
> Data visualization 
> Spark work tracking 
Try it: 
curl | bash -s dev 
Databricks Cloud 
The amazing product crafted by the company behind Spark! 
Cannot say more than that this product will be amazing.
Fully collaborative, with dashboard creation and publication.
Register for a beta account (still eagerly waiting for mine)
Go there
Mining DNA 
Mining Geodata 
Dallas / Seattle: divergence of 18.4
Mining Texts 
A small project, just for fun
Process a Wikipedia XML dump stored in HDFS
Convert the (multi-line) XML to CSV
Push to S3
Compute some stats: TF-IDF 
Train a NaiveBayes classifier 
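The TF-IDF statistic used above weighs a term by how often it occurs in a document and how rare it is across the corpus. The sketch below uses one common smoothed variant, idf = log((N + 1) / (df + 1)), as an assumption; it is plain single-machine Scala, and `TfIdfSketch` and its helper names are hypothetical, not the code from the talk.

```scala
object TfIdfSketch {
  type Doc = Seq[String] // a document as a sequence of tokens

  // Term frequency: occurrences of the term divided by the document length.
  def tf(term: String, doc: Doc): Double =
    if (doc.isEmpty) 0.0 else doc.count(_ == term).toDouble / doc.size

  // Inverse document frequency, smoothed as log((N + 1) / (df + 1)),
  // where N is the corpus size and df the number of documents containing the term.
  def idf(term: String, corpus: Seq[Doc]): Double = {
    val df = corpus.count(_.contains(term))
    math.log((corpus.size + 1.0) / (df + 1.0))
  }

  def tfIdf(term: String, doc: Doc, corpus: Seq[Doc]): Double =
    tf(term, doc) * idf(term, corpus)

  def main(args: Array[String]): Unit = {
    val corpus: Seq[Doc] = Seq(Seq("spark", "is", "fast"), Seq("hadoop", "is", "slow"))
    println(tfIdf("is", corpus.head, corpus))    // term present in every document
    println(tfIdf("spark", corpus.head, corpus)) // term specific to one document
  }
}
```

A term that appears in every document gets a weight of zero, while a term specific to one document gets a positive weight; these weights are the kind of features a classifier can then be trained on.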
See what the machine can say 
But… quite some data 
A Word of Advice 
Spark's beautiful simplicity is often overshadowed by the complexity of building
and maintaining a working distributed system.
Sharpen up your Ops skills… 
… or ooops 
Project website: 
Spark presentations: 
Starting Questions: 
More Advanced Questions: 
Source Code: 
Getting involved: 
Devoxx ! 
Virdata → Shell Demo cluster 
NextLab → Wikipedia ML Cluster 
Rand Hindi (Snips) → Geodata example 
Xavier Tordoir (SilicoCloud) → DNA example 
Distributed machine learning 101 using apache spark from the browserDistributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browser
Liège créative: Open Science
Liège créative: Open ScienceLiège créative: Open Science
Liège créative: Open Science
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at ScaleBioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark

Spark devoxx2014

  • 1. Lightning Fast Big Data Analytics with Apache Spark. Andy Petrella (@noootsab), Gerard Maas (@maasg). Big Data Hacker, Data Processing Team Lead. #devoxx #sparkvoxx @noootsab @maasg
  • 2. Agenda: What is Spark? · Spark Foundation: The RDD · Demo · Ecosystem · Examples · Resources
  • 3. Memory, Network, CPUs (and don’t forget to throw some disks in the mix)
  • 4. What is Spark? Spark is a fast and general engine for large-scale distributed data processing.
        val file = spark.textFile("hdfs://...")
        val counts = file.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://...")
    Fast · Functional · Growing Ecosystem
  • 5. Spark: A Strong Open Source Project. 27/02: Apache top-level project. 30/05: Spark 1.0.0 release. 11/09: Spark 1.1.0 release. 42 contributors, 118 contributors, 176 contributors. #Commits. src:
  • 6. Compared to Map-Reduce
        public class WordCount {
          public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();
            public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
              }
            }
          }
          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();
              }
              context.write(key, new IntWritable(sum));
            }
          }
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
          }
        }
    Spark:
        val file = spark.textFile("hdfs://...")
        val counts = file.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://...")
  • 7. The Big Idea... Express computations in terms of operations on a data set. Spark Core Concept: RDD => Resilient Distributed Dataset. Think of an RDD as an immutable, distributed collection of objects.
      • Resilient => Can be reconstructed in case of failure
      • Distributed => Transformations are parallelizable operations
      • Dataset => Data loaded and partitioned across cluster nodes (executors)
    RDDs are memory-intensive. Caching behavior is controllable.
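The RDD properties on this slide can be sketched in a few lines. This is only an illustration, assuming `spark-core` is on the classpath and a `local[2]` master; the object name `RddBasics` and the method `evenSquares` are hypothetical and not from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper, for illustration only.
object RddBasics {
  def evenSquares(sc: SparkContext): Array[Int] = {
    // Dataset: the data is split into 4 partitions.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)
    // Immutable: map returns a NEW RDD, `numbers` itself is unchanged.
    val squares = numbers.map(n => n * n)
    // Controllable caching: keep the result in memory once it has been computed.
    val evens = squares.filter(_ % 2 == 0).cache()
    evens.count()   // first action computes and caches the partitions
    evens.collect() // second action reuses the cached partitions
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(evenSquares(sc).mkString(", "))
    finally sc.stop()
  }
}
```

If a cached partition is lost, Spark recomputes it from `numbers` by replaying the `map` and `filter` steps, which is what "resilient" refers to.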
  • 8. RDDs (diagram: executors of a Spark cluster on top of HDFS)
  • 9. RDDs: .textFile("...") (diagram: one RDD made of partitions)
  • 10. RDDs: .textFile("...").flatMap(l => l.split(" "))
  • 11. RDDs: .textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) (diagram: every word paired with a 1 in each partition)
  • 12. RDDs: .textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)).reduceByKey(_ + _) (diagram: partial counts per partition: 2 4 1 1, 3 1 2 1, 2 2 2 1)
  • 13. RDDs: same pipeline (diagram: partial counts shuffled by key into 7 5, 7 and 3)
  • 14. RDDs: same pipeline (diagram: final counts 7, 7, 5, 3)
  • 15. The Spark Lingo: .textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)).reduceByKey(_ + _) (diagram labels: Cluster, Executor, RDD, Partition, Job, Stage, Task)
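The word count animated on slides 9 to 15 can be reproduced locally to attach the lingo to concrete calls. A minimal sketch, assuming Spark 1.3 or later (so that `reduceByKey` needs no extra import) and a local master; `WordCountLingo` and `wordCounts` are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper, for illustration only.
object WordCountLingo {
  def wordCounts(sc: SparkContext, lines: Seq[String], partitions: Int): Map[String, Int] = {
    val rdd = sc.parallelize(lines, partitions) // one RDD, split into `partitions` partitions
    val counts = rdd
      .flatMap(_.split(" "))  // narrow transformation: stays within each partition
      .map(w => (w, 1))       // narrow transformation: same stage as flatMap
      .reduceByKey(_ + _)     // needs a shuffle: a new stage starts here
    // The action below submits one job; each stage runs one task per partition.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lingo").setMaster("local[2]"))
    try println(wordCounts(sc, Seq("a b a", "b c"), partitions = 2))
    finally sc.stop()
  }
}
```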
  • 16. Spark: RDD Operations. Input data: HDFS, text/sequence file. SparkContext. RDD → RDD. Output data: HDFS, text/sequence file, Cassandra.
  • 17. Transformations. Inner manipulations > map, flatMap, filter, distinct. Cross RDD > union, subtract, intersection, join, cartesian. Structural reorganization (expensive) > groupBy, aggregate, sort. Tuning > coalesce, repartition.
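A few of the transformations listed above, sketched on toy data. The sample values and the names `TransformationsSketch` and `run` are made up for illustration; the sketch assumes Spark 1.3 or later and a local master. Transformations only describe a new RDD, so nothing is computed until the `collect()` calls at the end.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper, for illustration only.
object TransformationsSketch {
  def run(sc: SparkContext): (Seq[Int], Seq[(String, (Int, String))], Int) = {
    val a = sc.parallelize(Seq(1, 2, 3, 4), 2)
    val b = sc.parallelize(Seq(3, 4, 5), 2)

    // Cross RDD + inner manipulation: union keeps duplicates, distinct removes them.
    val merged = a.union(b).distinct()

    // Cross RDD on key/value pairs: join matches elements by key.
    val ids = sc.parallelize(Seq(("ann", 1), ("bob", 2)))
    val cities = sc.parallelize(Seq(("ann", "Liège"), ("bob", "Antwerp")))
    val joined = ids.join(cities)

    // Tuning: shrink to a single partition without a full shuffle.
    val single = merged.coalesce(1)

    // Only here does Spark actually run the computations.
    (merged.collect().sorted.toSeq, joined.collect().sortBy(_._1).toSeq, single.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("transformations").setMaster("local[2]"))
    try println(run(sc))
    finally sc.stop()
  }
}
```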
  • 18. Actions. Fetch data > collect, take, first, takeSample. Aggregate results > reduce, count, countByKey. Output > foreach, foreachPartition, save*.
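Actions are what bring results back to the driver (or write them out). A small sketch on made-up pairs, assuming Spark 1.3 or later and a local master; `ActionsSketch` and `run` are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper, for illustration only.
object ActionsSketch {
  def run(sc: SparkContext): (Int, Long, Map[String, Long], Seq[(String, Int)]) = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

    val total = pairs.map(_._2).reduce(_ + _) // aggregate: sum of all values
    val size = pairs.count()                  // aggregate: number of elements
    val perKey = pairs.countByKey().toMap     // aggregate: number of elements per key
    val firstTwo = pairs.take(2).toSeq        // fetch: first two elements, in partition order

    (total, size, perKey, firstTwo)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("actions").setMaster("local[2]"))
    try println(run(sc))
    finally sc.stop()
  }
}
```

`collect` and `countByKey` materialize everything on the driver, so on real data `take` or one of the `save*` methods is usually the safer choice.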
  • 19. RDD Lineage. Each RDD keeps track of its parent. This is the basis for DAG scheduling and fault recovery.
        val file = spark.textFile("hdfs://...")
        val wordsRDD = file.flatMap(line => line.split(" "))
                           .map(word => (word, 1))
                           .reduceByKey(_ + _)
        val scoreRDD = wordsRDD.map { case (k, v) => (v, k) }
    Lineage: HadoopRDD → MappedRDD → FlatMappedRDD → MappedRDD → MapPartitionsRDD → ShuffledRDD → MapPartitionsRDD (wordsRDD) → MappedRDD (scoreRDD). rdd.toDebugString is your friend.
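To look at a lineage yourself, `toDebugString` can be called on any RDD. A minimal sketch, assuming Spark 1.3 or later and a local master, with an in-memory collection standing in for the HDFS file; `LineageSketch` and `lineage` are hypothetical names. The exact RDD class names printed differ between Spark versions, so the output is not reproduced here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper, for illustration only.
object LineageSketch {
  def lineage(sc: SparkContext): String = {
    val words = sc.parallelize(Seq("to be or not to be"), 2)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    val score = words.map { case (k, v) => (v, k) }
    // One line per parent RDD; indentation marks the shuffle (stage) boundary.
    score.toDebugString
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[2]"))
    try println(lineage(sc))
    finally sc.stop()
  }
}
```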
  • 20. Spark has Support for... Java, Scala, Python and R (table of API, Shell and Notebook availability per language). The Spark Shell is the best way to start exploring Spark.
  • 21. Demo: Exploring and transforming data with the Spark Shell. Acknowledgments: Book data provided by Project Gutenberg ( through Cluster computing resources provided by
  • 23. Agenda: What is Spark? · Spark Foundation: The RDD · Demo · Ecosystem · Examples · Resources
  • 24. Ecosystem. Now we know what Spark is! At least, we know its Core, let’s say the SDK. Thanks to its great and enthusiastic community, Spark Core has been used in an ever-growing number of fields. Hence the ecosystem is evolving fast.
  • 25. Higher level primitives … or APIs … or the rise of the popolo. If Spark Core is the fold of distributed computing, then we’re going to look at the map, filter, countBy, groupBy, ...
  • 26. Spark Streaming. When you have big fat streams behaving as one single collection. (diagram: along the time axis t, a DStream[T] is a sequence of RDD[T])
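The "stream as a sequence of RDDs" idea can be tried without any external source by feeding a DStream from a queue of RDDs. This is a rough sketch, assuming `spark-streaming` is on the classpath, Spark 1.3 or later, and a local master; `StreamingSketch` and `countBatches` are hypothetical names, and the timeout is a crude way to let the batches run.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

// Hypothetical helper, for illustration only.
object StreamingSketch {
  def countBatches(batches: Seq[Seq[String]]): List[Map[String, Int]] = {
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // one RDD per 1-second batch

    // Each queued RDD becomes the content of one batch of the DStream.
    val queue = mutable.Queue[RDD[String]]()
    batches.foreach(b => queue += ssc.sparkContext.parallelize(b))

    val results = mutable.ArrayBuffer[Map[String, Int]]()
    ssc.queueStream(queue, oneAtATime = true)
      .flatMap(_.split(" "))   // same operators as on an RDD ...
      .map(w => (w, 1))
      .reduceByKey(_ + _)      // ... applied to every batch in turn
      .foreachRDD { rdd =>
        if (!rdd.isEmpty()) results.synchronized { results += rdd.collect().toMap }
      }

    ssc.start()
    ssc.awaitTerminationOrTimeout(1000L * (batches.size + 3))
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    results.synchronized { results.toList }
  }

  def main(args: Array[String]): Unit =
    countBatches(Seq(Seq("a b a"), Seq("b"))).foreach(println)
}
```

In a real job the queue would be replaced by a receiver-based or direct source; the word-count operators would stay the same.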
  • 27. Spark Streaming
  • 28. Spark SQL. From SQL to noSQL to SQL … to noSQL. Structured Query Language: we’re not really querying but we’re processing. SQL provides the mathematical (abstraction) structures to manipulate data. We can optimize: Spark has Catalyst.
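A short sketch of running SQL over distributed data and asking Catalyst for its plan. Note the assumption: it uses the `SparkSession` API of Spark 2.0 and later, which postdates this talk (the 1.1 release shown earlier used `SQLContext`). It also assumes `spark-sql` on the classpath; the table, column names and data are made up, and `SqlSketch` and `frequentWords` are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper, for illustration only; assumes Spark 2.0+.
object SqlSketch {
  def frequentWords(minCount: Int): Seq[String] = {
    val spark = SparkSession.builder().appName("sql-sketch").master("local[2]").getOrCreate()
    import spark.implicits._ // enables .toDF on local collections
    try {
      val counts = Seq(("spark", 7), ("rdd", 5), ("disk", 3)).toDF("word", "n")
      counts.createOrReplaceTempView("counts")

      val result = spark.sql(s"SELECT word FROM counts WHERE n >= $minCount ORDER BY word")
      result.explain() // prints the physical plan produced by the Catalyst optimizer
      result.collect().map(_.getString(0)).toSeq
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    println(frequentWords(5))
}
```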
  • 29. Spark SQL
  • 30. MLlib: “The library to teach them all”. SciPy, scikit-learn, R, MATLAB and co. → learn on one machine (sadly often, one core). Algorithms shown: SVM, lm, NaiveBayes, PCA, K-Means, ALS, SVD.
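As one example from the list above, K-Means can be trained on an RDD of vectors with the RDD-based MLlib API. A minimal sketch on four made-up points, assuming `spark-mllib` on the classpath, Spark 1.4 or later (for `setSeed`) and a local master; `KMeansSketch` and `clusterOf` are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical helper, for illustration only.
object KMeansSketch {
  // Returns the cluster index assigned to a point near each of the two groups.
  def clusterOf(sc: SparkContext): (Int, Int) = {
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1), // group near the origin
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)  // group far away
    )).cache() // the algorithm is iterative, so cache the training data

    val model = new KMeans().setK(2).setMaxIterations(20).setSeed(42L).run(points)
    (model.predict(Vectors.dense(0.05, 0.05)), model.predict(Vectors.dense(9.05, 9.05)))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans").setMaster("local[2]"))
    try println(clusterOf(sc))
    finally sc.stop()
  }
}
```

The training runs across partitions, which is the contrast the slide draws with single-machine libraries.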
  • 31. GraphX: Connecting the dots. Graph processing at scale. > Take edges > Add some nodes > Combine = Send messages (Pregel)
  • 32. GraphX: Connecting the dots. Graph processing at scale. > Take edges > Link nodes > Combine/Send messages
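The "take edges, link nodes, send messages" recipe can be sketched with GraphX's built-in connected components, which is implemented on top of its Pregel operator. The edge data and the names `GraphSketch` and `components` are made up; the sketch assumes `spark-graphx` on the classpath and a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical helper, for illustration only.
object GraphSketch {
  // Maps every vertex id to the smallest vertex id of its connected component.
  def components(sc: SparkContext): Map[Long, Long] = {
    // Take edges: two separate chains, 1-2-3 and 4-5.
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(4L, 5L, "follows")))
    // Link nodes: vertices are derived from the edges, with a default attribute.
    val graph = Graph.fromEdges(edges, defaultValue = "user")
    // Combine / send messages: ids are propagated along edges until stable.
    graph.connectedComponents().vertices.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx").setMaster("local[2]"))
    try println(components(sc))
    finally sc.stop()
  }
}
```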
  • 33. ADAM: The new kid on the block in the Spark community (with the uncovered Thunder). Game-changing library for processing DNA, genotypes, variants and co. Comes with the right stack for processing … a huge bunch of vital legacy data.
  • 34. Tooling (NoIDE). Besides the classical Eclipse, IntelliJ IDEA, NetBeans, Sublime Text and family! An IDE is not enough, because not only software or services are crafted. Spark is for data analysis, and data scientists need > interactivity (exploration) > reproducibility (environment, data and logic) > shareability (results)
  • 35. ISpark: Spark-Shell backend for IPython (worksheet for data analysts)
  • 36. Zeppelin: Well-shaped notebook based on Kibana, offering Spark-dedicated features > Multi-language (Scala, SQL, Markdown, shell) > Dynamic forms (generating inputs) > Data visualization (and export). Check the website!
  • 37. Spark Notebook: Scala-Notebook fork, enhanced for Spark peculiarities. Full Scala, Akka and RxScala. Features including: > Multi-language (Scala, SQL, Markdown, JavaScript) > Data visualization > Spark work tracking. Try it: curl | bash -s dev
  • 38. Databricks Cloud: The amazing product crafted by the company behind Spark! Cannot say more than that this product will be amazing. Fully collaborative, dashboard creation and publication. Register for a beta account (still eagerly waiting for mine). Go there
  • 39. Examples
  • 40. Mining DNA
  • 42. Mining Geodata
  • 43. Dallas, Seattle: divergence of 18.4
  • 44. Mining Texts
  • 45. A small project just for the fun. Process Wikipedia XML dump put in HDFS. Convert XML (multi-line) to CSV. Push to S3. Sampling.
  • 46. A small project just for the fun. Compute some stats: TF-IDF. Train a NaiveBayes classifier.
  • 47. A small project just for the fun. See what the machine can say.
  • 48. A small project just for the fun. But… quite some data.
  • 49. A Word of Advice. Spark’s beautiful simplicity is often overshadowed by the complexity of building and maintaining a working distributed system. Sharpen up your Ops skills… or ooops.
  • 50. Resources. Project website: Spark presentations: Starting Questions: More Advanced Questions: Source Code: Getting involved:
  • 51. Acknowledgments. Devoxx! Virdata → Shell Demo cluster. NextLab → Wikipedia ML Cluster. Rand Hindi (Snips) → Geodata example. Xavier Tordoir (SilicoCloud) → DNA example.
  • 52. Answers!