SlideShare a Scribd company logo
CONFIDENTIAL - RESTRICTED 
Introduction to Spark 
LV Big Data Monthly Meetup #2 
December 3rd 2014 
Maxime Dumas 
Systems Engineer, Cloudera
Thirty Seconds About Max 
• Systems Engineer 
• aka Sales Engineer 
• SoCal, AZ, NV 
• former coder of PHP 
• teaches meditation + yoga 
• from Montreal, Canada 
2
What Does Cloudera Do? 
• product 
• distribution of Hadoop components, Apache licensed 
• enterprise tooling 
• support 
• training 
• services (aka consulting) 
• community 
3
4 
But first… how did we get here?
What does Hadoop look like? 
5 
HDFS HDFS 
… 
master 
worker 
(“NN”) 
(“DN”) 
MR 
worker 
(“TT”) 
HDFS 
worker 
(“DN”) 
MR 
worker 
(“TT”) 
HDFS 
worker 
(“DN”) 
MR 
worker 
(“TT”) 
HDFS 
worker 
(“DN”) 
MR 
worker 
(“TT”) 
HDFS 
worker 
(“DN”) 
MR 
worker 
(“TT”) 
MR 
master 
(“JT”) 
Standby 
master
But I want MORE! 
6 
HDFS 
worker 
HDFS 
worker 
HDFS 
worker 
HDFS 
worker 
HDFS 
worker 
… 
MapReduce 
HDFS 
master 
(“NN”) 
MR 
master 
(“JT”) 
Standby 
master
Hadoop as an Architecture 
The Old Way 
Network 
Expensive, Special purpose, “Reliable” Servers 
Expensive Licensed Software 
• Hard to scale 
• Network is a bottleneck 
• Only handles relational data 
• Difficult to add new fields & data types 
Expensive & Unattainable 
$30,000+ per TB 
Data Storage 
(SAN, NAS) 
Compute 
(RDBMS, EDW) 
The Hadoop Way 
Commodity “Unreliable” Servers 
Hybrid Open Source Software 
• Scales out forever 
• No bottlenecks 
• Easy to ingest any data 
• Agile data access 
Affordable & Attainable 
$300-$1,000 per TB 
Compute 
(CPU) 
Memory Storage 
(Disk) 
z 
z
CDH: the App Store for Hadoop 
8 
Resource Management 
Storage 
Integration 
Metadata 
NoSQL 
DBMS 
… 
Analytic 
MPP 
DBMS 
Search 
Engine 
In- 
Memory 
Batch 
Processing 
Support 
Data 
Management 
System 
Management 
Security 
Machine 
Learning 
MapReduce
9 
Introduction to Apache Spark 
Credits: 
• Ben White 
• Todd Lipcon 
• Ted Malaska 
• Jairam Ranganathan 
• Jayant Shekhar 
• Sandy Ryza
Can we improve on MR? 
• Problems with MR: 
• Very low-level: requires a lot of code to do simple 
things 
• Very constrained: everything must be described as 
“map” and “reduce”. Powerful but sometimes 
difficult to think in these terms. 
10
Can we improve on MR? 
• Two approaches to improve on MapReduce: 
1. Special purpose systems to solve one problem domain 
well. 
• Giraph / Graphlab (graph processing) 
• Storm (stream processing) 
• Impala (real-time SQL) 
2. Generalize the capabilities of MapReduce to 
provide a richer foundation to solve problems. 
• Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs) 
Both are viable strategies depending on the problem! 
11
What is Apache Spark? 
Spark is a general purpose computational framework 
Retains the advantages of MapReduce: 
• Linear scalability 
• Fault-tolerance 
• Data Locality based computations 
…but offers so much more: 
• Leverages distributed memory for better performance 
• Supports iterative algorithms that are not feasible in MR 
• Improved developer experience 
• Full Directed Graph expressions for data parallel computations 
• Comes with libraries for machine learning, graph analysis, etc. 
12
What is Apache Spark? 
Run programs up to 100x faster than Hadoop 
MapReduce in memory, or 10x faster on disk. 
One of the largest open source projects in big data: 
• 170+ developers contributing 
• 30+ companies contributing 
• 400+ discussions per month on the mailing list 
13
Popular project 
14
Getting started with Spark 
• Java API 
• Interactive shells: 
• Scala (spark-shell) 
• Python (pyspark) 
15
Execution modes 
16
Execution modes 
• Standalone Mode 
• Dedicated master and worker daemons 
• YARN Client Mode 
• Launches a YARN application with the 
driver program running locally 
• YARN Cluster Mode 
• Launches a YARN application with the 
driver program running in the YARN 
ApplicationMaster 
17 
Dedicated Spark 
runtime with static 
resource limits 
Dynamic resource 
management 
between Spark, 
MR, Impala…
Spark Concepts 
18
RDD – Resilient Distributed Dataset 
• Collections of objects partitioned across a cluster 
• Stored in RAM or on Disk 
• You can control persistence and partitioning 
• Created by: 
• Distributing local collection objects 
• Transformation of data in storage 
• Transformation of RDDs 
• Automatically rebuilt on failure (resilient) 
• Contains lineage to compute from storage 
• Lazy materialization 
20
RDD transformations 
21
Operations on RDDs 
Transformations lazily transform a RDD 
to a new RDD 
• map 
• flatMap 
• filter 
• sample 
• join 
• sort 
• reduceByKey 
• … 
Actions run computation to return a 
value 
• collect 
• reduce(func) 
• foreach(func) 
• count 
• first, take(n) 
• saveAs 
• … 
22
Fault Tolerance 
• RDDs contain lineage. 
• Lineage – source location and list of transformations 
• Lost partitions can be re-computed from source data 
23 
msgs = textFile.filter(lambda s: s.startsWith(“ERROR”)) 
.map(lambda s: s.split(“t”)[2]) 
HDFS File Filtered RDD Mapped RDD 
filter 
(func = startsWith(…)) 
map 
(func = split(...))
24 
Examples
Word Count in MapReduce 
package org.myorg; 
import java.io.IOException; 
import java.util.*; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.conf.*; 
import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapreduce.*; 
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 
public class WordCount { 
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { 
private final static IntWritable one = new IntWritable(1); 
private Text word = new Text(); 
public void map(LongWritable key, Text value, Context context) throws IOException, 
InterruptedException { 
String line = value.toString(); 
StringTokenizer tokenizer = new StringTokenizer(line); 
while (tokenizer.hasMoreTokens()) { 
word.set(tokenizer.nextToken()); 
context.write(word, one); 
} 
} 
} 
25 
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { 
public void reduce(Text key, Iterable<IntWritable> values, Context context) 
throws IOException, InterruptedException { 
int sum = 0; 
for (IntWritable val : values) { 
sum += val.get(); 
} 
context.write(key, new IntWritable(sum)); 
} 
} 
public static void main(String[] args) throws Exception { 
Configuration conf = new Configuration(); 
Job job = new Job(conf, "wordcount"); 
job.setOutputKeyClass(Text.class); 
job.setOutputValueClass(IntWritable.class); 
job.setMapperClass(Map.class); 
job.setReducerClass(Reduce.class); 
job.setInputFormatClass(TextInputFormat.class); 
job.setOutputFormatClass(TextOutputFormat.class); 
FileInputFormat.addInputPath(job, new Path(args[0])); 
FileOutputFormat.setOutputPath(job, new Path(args[1])); 
job.waitForCompletion(true); 
} 
}
Word Count in Spark 
sc.textFile(“words”) 
.flatMap(line => line.split(" ")) 
.map(word=>(word,1)) 
.reduceByKey(_+_).collect() 
26
Logistic Regression 
• Read two sets of points 
• Looks for a plane W that separates them 
• Perform gradient descent: 
• Start with random W 
• On each iteration, sum a function of W over the data 
• Move W in a direction that improves it 
27
Intuition 
28
Logistic Regression 
29
Logistic Regression Performance 
30
31 
Spark and Hadoop: 
a Framework within a Framework
32
33 
Map 
Reduce 
Resource Management 
Storage 
Integration 
Metadata 
HBase Impala Solr Spark … 
Support 
Data 
Management 
System 
Management 
Security
Spark Streaming 
• Takes the concept of RDDs and extends it to DStreams 
• Fault-tolerant like RDDs 
• Transformable like RDDs 
• Adds new “rolling window” operations 
• Rolling averages, etc. 
• But keeps everything else! 
• Regular Spark code works in Spark Streaming 
• Can still access HDFS data, etc. 
• Example use cases: 
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS. 
• Detecting anomalous behavior and triggering alerts. 
• Continuous reporting of summary metrics for incoming data. 
34
Micro-batching for on the fly ETL 
35
What about SQL? 
36 
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html 
http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
Fault Recovery Recap 
• RDDs store dependency graph 
• Because RDDs are deterministic: 
Missing RDDs are rebuilt in parallel on other nodes 
• Stateful RDDs can have infinite lineage 
• Periodic checkpoints to disk clears lineage 
• Faster recovery times 
• Better handling of stragglers vs row-by-row streaming 
37
Why Spark? 
• Flexible like MapReduce 
• High performance 
• Machine learning, 
iterative algorithms 
• Interactive data 
explorations 
• Concise, easy API for 
developer productivity 
38
39 
Demo Time! 
• Log file Analysis 
• Machine Learning 
• Spark Streaming
What’s Next? 
• Download Hadoop! 
• CDH available at www.cloudera.com 
• Try it online: Cloudera Live 
• Cloudera provides pre-loaded VMs 
• http://tiny.cloudera.com/quickstartvm 
40
41 
Questions? 
Preferably related to the talk… or not.
42 
Thank You! 
Maxime Dumas 
mdumas@cloudera.com 
We’re hiring.
43

More Related Content

What's hot

Apache Spark
Apache SparkApache Spark
Apache Spark
Uwe Printz
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
cdmaxime
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
Aakashdata
 
Analytics with Spark and Cassandra
Analytics with Spark and CassandraAnalytics with Spark and Cassandra
Analytics with Spark and Cassandra
DataStax Academy
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Home
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
Andrii Gakhov
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
Xuan-Chao Huang
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
MapR Technologies
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
Taras Matyashovsky
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
Ryan Bosshart
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
Hubert Fan Chiang
 
Big data processing with apache spark
Big data processing with apache sparkBig data processing with apache spark
Big data processing with apache spark
sarith divakar
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark Training
Spark Summit
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Charles Givre
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
Ayyappan Paramesh
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
MapR Technologies
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
cdmaxime
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
Shivaji Dutta
 

What's hot (20)

Apache Spark
Apache SparkApache Spark
Apache Spark
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Analytics with Spark and Cassandra
Analytics with Spark and CassandraAnalytics with Spark and Cassandra
Analytics with Spark and Cassandra
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Big data processing with apache spark
Big data processing with apache sparkBig data processing with apache spark
Big data processing with apache spark
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark Training
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 

Similar to Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014

Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014
cdmaxime
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
20140614 introduction to spark-ben white
20140614 introduction to spark-ben white20140614 introduction to spark-ben white
20140614 introduction to spark-ben white
Data Con LA
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
Manish Gupta
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
Mark Kromer
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Mark Rittman
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
Fabio Fumarola
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
Rahul Borate
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
IT Event
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
MaheshPandit16
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
Steve Min
 
Big data clustering
Big data clusteringBig data clustering
Big data clustering
Jagadeesan A S
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
prajods
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 

Similar to Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014 (20)

Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
20140614 introduction to spark-ben white
20140614 introduction to spark-ben white20140614 introduction to spark-ben white
20140614 introduction to spark-ben white
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 
Big data clustering
Big data clusteringBig data clustering
Big data clustering
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 

Recently uploaded

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 

Recently uploaded (20)

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 

Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014

  • 1. CONFIDENTIAL - RESTRICTED Introduction to Spark LV Big Data Monthly Meetup #2 December 3rd 2014 Maxime Dumas Systems Engineer, Cloudera
  • 2. Thirty Seconds About Max • Systems Engineer • aka Sales Engineer • SoCal, AZ, NV • former coder of PHP • teaches meditation + yoga • from Montreal, Canada 2
  • 3. What Does Cloudera Do? • product • distribution of Hadoop components, Apache licensed • enterprise tooling • support • training • services (aka consulting) • community 3
  • 4. 4 But first… how did we get here?
  • 5. What does Hadoop look like? 5 HDFS HDFS … master worker (“NN”) (“DN”) MR worker (“TT”) HDFS worker (“DN”) MR worker (“TT”) HDFS worker (“DN”) MR worker (“TT”) HDFS worker (“DN”) MR worker (“TT”) HDFS worker (“DN”) MR worker (“TT”) MR master (“JT”) Standby master
  • 6. But I want MORE! 6 HDFS worker HDFS worker HDFS worker HDFS worker HDFS worker … MapReduce HDFS master (“NN”) MR master (“JT”) Standby master
  • 7. Hadoop as an Architecture The Old Way Network Expensive, Special purpose, “Reliable” Servers Expensive Licensed Software • Hard to scale • Network is a bottleneck • Only handles relational data • Difficult to add new fields & data types Expensive & Unattainable $30,000+ per TB Data Storage (SAN, NAS) Compute (RDBMS, EDW) The Hadoop Way Commodity “Unreliable” Servers Hybrid Open Source Software • Scales out forever • No bottlenecks • Easy to ingest any data • Agile data access Affordable & Attainable $300-$1,000 per TB Compute (CPU) Memory Storage (Disk) z z
  • 8. CDH: the App Store for Hadoop 8 Resource Management Storage Integration Metadata NoSQL DBMS … Analytic MPP DBMS Search Engine In- Memory Batch Processing Support Data Management System Management Security Machine Learning MapReduce
  • 9. 9 Introduction to Apache Spark Credits: • Ben White • Todd Lipcon • Ted Malaska • Jairam Ranganathan • Jayant Shekhar • Sandy Ryza
  • 10. Can we improve on MR? • Problems with MR: • Very low-level: requires a lot of code to do simple things • Very constrained: everything must be described as “map” and “reduce”. Powerful but sometimes difficult to think in these terms. 10
  • 11. Can we improve on MR? • Two approaches to improve on MapReduce: 1. Special purpose systems to solve one problem domain well. • Giraph / Graphlab (graph processing) • Storm (stream processing) • Impala (real-time SQL) 2. Generalize the capabilities of MapReduce to provide a richer foundation to solve problems. • Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs) Both are viable strategies depending on the problem! 11
  • 12. What is Apache Spark? Spark is a general purpose computational framework Retains the advantages of MapReduce: • Linear scalability • Fault-tolerance • Data Locality based computations …but offers so much more: • Leverages distributed memory for better performance • Supports iterative algorithms that are not feasible in MR • Improved developer experience • Full Directed Graph expressions for data parallel computations • Comes with libraries for machine learning, graph analysis, etc. 12
  • 13. What is Apache Spark? Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. One of the largest open source projects in big data: • 170+ developers contributing • 30+ companies contributing • 400+ discussions per month on the mailing list 13
  • 15. Getting started with Spark • Java API • Interactive shells: • Scala (spark-shell) • Python (pyspark) 15
  • 17. Execution modes • Standalone Mode • Dedicated master and worker daemons • YARN Client Mode • Launches a YARN application with the driver program running locally • YARN Cluster Mode • Launches a YARN application with the driver program running in the YARN ApplicationMaster 17 Dedicated Spark runtime with static resource limits Dynamic resource management between Spark, MR, Impala…
  • 19. RDD – Resilient Distributed Dataset • Collections of objects partitioned across a cluster • Stored in RAM or on Disk • You can control persistence and partitioning • Created by: • Distributing local collection objects • Transformation of data in storage • Transformation of RDDs • Automatically rebuilt on failure (resilient) • Contains lineage to compute from storage • Lazy materialization 20
  • 21. Operations on RDDs Transformations lazily transform a RDD to a new RDD • map • flatMap • filter • sample • join • sort • reduceByKey • … Actions run computation to return a value • collect • reduce(func) • foreach(func) • count • first, take(n) • saveAs • … 22
  • 22. Fault Tolerance • RDDs contain lineage. • Lineage – source location and list of transformations • Lost partitions can be re-computed from source data 23 msgs = textFile.filter(lambda s: s.startsWith(“ERROR”)) .map(lambda s: s.split(“t”)[2]) HDFS File Filtered RDD Mapped RDD filter (func = startsWith(…)) map (func = split(...))
  • 24. Word Count in MapReduce package org.myorg; import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.conf.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapreduce.*; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; public class WordCount { public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } } 25 public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); } }
  • 25. Word Count in Spark sc.textFile(“words”) .flatMap(line => line.split(" ")) .map(word=>(word,1)) .reduceByKey(_+_).collect() 26
  • 26. Logistic Regression • Read two sets of points • Looks for a plane W that separates them • Perform gradient descent: • Start with random W • On each iteration, sum a function of W over the data • Move W in a direction that improves it 27
  • 30. 31 Spark and Hadoop: a Framework within a Framework
  • 31. 32
  • 32. 33 Map Reduce Resource Management Storage Integration Metadata HBase Impala Solr Spark … Support Data Management System Management Security
  • 33. Spark Streaming • Takes the concept of RDDs and extends it to DStreams • Fault-tolerant like RDDs • Transformable like RDDs • Adds new “rolling window” operations • Rolling averages, etc. • But keeps everything else! • Regular Spark code works in Spark Streaming • Can still access HDFS data, etc. • Example use cases: • “On-the-fly” ETL as data is ingested into Hadoop/HDFS. • Detecting anomalous behavior and triggering alerts. • Continuous reporting of summary metrics for incoming data. 34
  • 34. Micro-batching for on the fly ETL 35
  • 35. What about SQL? 36 http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
  • 36. Fault Recovery Recap • RDDs store dependency graph • Because RDDs are deterministic: Missing RDDs are rebuilt in parallel on other nodes • Stateful RDDs can have infinite lineage • Periodic checkpoints to disk clears lineage • Faster recovery times • Better handling of stragglers vs row-by-row streaming 37
  • 37. Why Spark? • Flexible like MapReduce • High performance • Machine learning, iterative algorithms • Interactive data explorations • Concise, easy API for developer productivity 38
  • 38. 39 Demo Time! • Log file Analysis • Machine Learning • Spark Streaming
  • 39. What’s Next? • Download Hadoop! • CDH available at www.cloudera.com • Try it online: Cloudera Live • Cloudera provides pre-loaded VMs • http://tiny.cloudera.com/quickstartvm 40
  • 40. 41 Questions? Preferably related to the talk… or not.
  • 41. 42 Thank You! Maxime Dumas mdumas@cloudera.com We’re hiring.
  • 42. 43

Editor's Notes

  1. Similar to the Red Hat model. Hadoop elephant logo licensed for public use via Apache license: Apache Software Foundation, http://www.apache.org/foundation/marks/
  2. What’s the best bang-per-buck you can get for computational capacity? Intel boxes running Linux! Sharded databases (and shared-nothing architectures like Teradata) look like this Let’s just get a ton of them, and figure out the rest with software (fault tolerance, distributed process management, job control, etc). And the rest is history - this is how Google, Facebook, Yahoo – this is how these guys build out their data centers.
  3. So does this approach work for more than just MapReduce? Sure! You’ve got all this CPU and memory sitting here – let’s spin up more than just MapReduce. Let’s add an agent for a distributed SQL query engine, a distributed NoSQL database, distributed search…
  4. Pricing Data: Cloudera: HW + SW per-year list prices for Basic thru EDH at various configs Old Way: Various sources. One of note: - Cowen / Goldmacher coverage initiation of Teradata, June 17, 2013 - List price of high-end appliance (which he thinks is more comparable to our solution) is $57K/TB + maintenance for an annual cost of $39K/TB - Prices have likely decreased, but we estimate they are still in excess of $30K/TB/year - List price of their low-end appliance is $12K/TB + maint or $8K per year