Introduction to Apache Spark Developer Training

Introduction to Spark Developer Training
Diana Carroll | Senior Curriculum Developer

Agenda
• Cloudera's Learning Path for Developers
• Target Audience and Prerequisites
• Course Outline
• Short Presentation Based on Actual Course Material
• Question and Answer Session

Learning Path: Developers
Create Powerful New Data Processing Tools
Learn to code and write MapReduce programs for production
Master advanced API topics required for real-world data analysis
Design schemas to minimize latency on massive data sets
Scale hundreds of thousands of operations per second
Implement recommenders and data experiments
Draw actionable insights from analysis of disparate data
Build converged applications using multiple processing engines
Develop enterprise solutions using components across the EDH
Combine batch and stream processing with interactive analytics
Optimize applications for speed, ease of use, and sophistication
[Diagram: Developer Training, Intro to Data Science, HBase Training, Spark Training, Big Data Applications]
Aaron T. Myers
Software Engineer

Why Cloudera Training?
Aligned to Best Practices and the Pace of Change
1. Broadest Range of Courses
–Developer, Admin, Analyst, HBase, Data Science
2. Most Experienced Instructors
–More than 20,000 students trained since 2009
3. Leader in Certification
–Over 8,000 accredited Cloudera professionals
4. Trusted Source for Training
–100,000+ people have attended online courses
5. State of the Art Curriculum
–Courses updated as Hadoop evolves
6. Widest Geographic Coverage
–Most classes offered: 50 cities worldwide plus online
7. Most Relevant Platform & Community
–CDH deployed more than all other distributions combined
8. Depth of Training Material
–Hands-on labs and VMs support live instruction
9. Ongoing Learning
–Video tutorials and e-learning complement training
10. Commitment to Big Data Education
–University partnerships to teach Hadoop in the classroom

Cloudera Developer Training for Apache Spark
About the Course

Target Audience
• Intended for people who write code, such as
–Software Engineers
–Data Engineers
–ETL Developers

Course Prerequisites
• No prior knowledge of Spark, Hadoop or distributed programming concepts is required

• Requirements
–Basic familiarity with Linux or Unix
–Intermediate-level programming skills in either Scala or Python
$ mkdir /data
$ cd /data
$ rm /home/johndoe/salesreport.txt

Example of Required Scala Skill Level
• Do you understand the following code? Could you write something similar?
object Maps {
  val colors = Map("red" -> 0xFF0000,
                   "turquoise" -> 0x00FFFF,
                   "black" -> 0x000000,
                   "orange" -> 0xFF8040,
                   "brown" -> 0x804000)

  def main(args: Array[String]) {
    for (name <- args) println(
      colors.get(name) match {
        case Some(code) =>
          name + " has code: " + code
        case None =>
          "Unknown color: " + name
      }
    )
  }
}

Example of Required Python Skill Level
• Do you understand the following code? Could you write something similar?
import sys

def parsePurchases(s):
    # Split a comma-separated string such as "apple,banana" into a list
    return s.split(',')

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: SumPrices <products>")
        sys.exit(-1)
    prices = {'apple': 0.40, 'banana': 0.50, 'orange': 0.10}
    # Look up the price of each purchased fruit and add them up
    total = sum(prices[fruit]
                for fruit in parsePurchases(sys.argv[1]))
    print('Total: $%.2f' % total)

Practicing Scala or Python
• Getting started with Scala
–www.scala-lang.org
• Getting started with Python
–python.org
–developers.google.com/edu/python
–and many more

Course Outline
1. Introduction
2. What is Spark?
3. Spark Basics
4. Working with RDDs
5. The Hadoop Distributed File System
6. Running Spark on a Cluster
7. Parallel Programming with Spark
8. Caching and Persistence
9. Writing Spark Applications
10. Spark Streaming
11. Common Patterns in Spark Programming
12. Improving Spark Performance
13. Spark, Hadoop and the Enterprise Data Center
14. Conclusion

Course Excerpt
• Based on
–Chapter 3: Spark Basics
–Chapter 4: Working with RDDs
• Topics
–What is Spark?
–The components of a distributed data processing system
–Intro to the Spark Shell
–Resilient Distributed Datasets
–RDD operations
–Example: WordCount

What is Apache Spark?
• Apache Spark is a fast, general engine for large-scale data processing and analysis
–Open source, developed at UC Berkeley
• Written in Scala
–Functional programming language that runs in a JVM
• Key Concepts
–Avoid the data bottleneck by distributing data when it is stored
–Bring the processing to the data
–Data stored in memory

Distributed Processing with the Spark Framework
• API: Spark
• Cluster Computing: Spark Standalone, YARN, Mesos
• Storage: HDFS (Hadoop Distributed File System)

• Spark Shell
–Interactive REPL – for learning or data exploration
–Python or Scala
• Spark Applications
–For large scale data processing
–Python, Java or Scala
Python Shell
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 0.9.1
      /_/
Using Python version 2.6.6 (r266:84292, Jan 22 2014 09:42:36)
Spark context available as sc.
>>>
Scala Shell
$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 0.9.1
      /_/
Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Created spark context..
Spark context available as sc.
scala>

Spark Context
• Every Spark application requires a Spark Context
–The main entry point to the Spark API
• Spark Shell provides a preconfigured Spark Context called sc
Python Shell
>>> sc.appName
u'PySparkShell'
Scala Shell
scala> sc.appName
res0: String = Spark shell

RDD (Resilient Distributed Dataset)
• RDD (Resilient Distributed Dataset)
–Resilient – if data in memory is lost, it can be recreated
–Distributed – stored in memory across the cluster
–Dataset – initial data can come from a file or be created programmatically
• RDDs are the fundamental unit of data in Spark
• Most of Spark programming is performing operations on RDDs
[Diagram: an RDD containing data, data, data, data…]

Example: A File-based RDD
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
RDD: mydata
> mydata = sc.textFile("purplecow.txt")
> mydata.count()
4

RDD Operations
• Two types of RDD operations
–Actions – return values
–count
–take(n)
–reduce
–Transformations – define new RDDs based on the current one
–filter
–map
[Diagram: an action turns an RDD into a value; a transformation turns a base RDD into a new RDD]

Example: map and filter Transformations
Python: map(lambda line: line.upper())
Scala: map(line => line.toUpperCase())
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I'D RATHER SEE THAN BE ONE.
Python: filter(lambda line: line.startswith('I'))
Scala: filter(line => line.startsWith("I"))
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.
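The two transformations above can be tried without a cluster. The following is a minimal plain-Python sketch, not Spark code: it uses Python's built-in map and filter on an ordinary list to model what the RDD operations compute. The variable names (poem, upper_lines, i_lines) are made up for illustration.

```python
# Plain-Python model of the map and filter steps; no Spark is involved.
poem = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

# map: apply a function to every element, producing a new collection
upper_lines = list(map(lambda line: line.upper(), poem))

# filter: keep only the elements for which the predicate is true
i_lines = list(filter(lambda line: line.startswith('I'), upper_lines))

for line in i_lines:
    print(line)
```

The line beginning with "BUT" is dropped by the filter; the other three lines are kept.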

RDDs
• RDDs can hold any type of element
–Primitive types: integers, characters, booleans, strings, etc.
–Sequence types: lists, arrays, tuples, dicts, etc. (including nested)
–Scala/Java Objects (if serializable)
–Mixed types
• Some types of RDDs have additional functionality
–Double RDDs – RDDs consisting of numeric data
–Pair RDDs – RDDs consisting of Key-Value pairs

Pair RDDs
• Pair RDDs are a special form of RDD
–Each element must be a key-value pair (a two-element tuple)
–Keys and values can be any type
• Why?
–Use with Map-Reduce algorithms
–Many additional functions are available for common data processing needs
–E.g. sorting, joining, grouping, counting, etc.
Pair RDD
(key1,value1)
(key2,value2)
(key3,value3)
…
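To make the key-value idea concrete, here is a small plain-Python sketch of sorting, grouping, and counting by key. It is an illustration on an ordinary list of two-element tuples, not the Spark API, and the sample data and variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (key, value) pairs standing in for the elements of a Pair RDD
pairs = [("cat", 3), ("aardvark", 1), ("cat", 5), ("mat", 2)]

# Sorting: order the pairs by key
sorted_pairs = sorted(pairs, key=lambda kv: kv[0])

# Grouping: collect all values that share a key
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Counting: number of elements per key
counts = {key: len(values) for key, values in grouped.items()}

print(sorted_pairs)
print(dict(grouped))
print(counts)
```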

MapReduce
• MapReduce is a common programming model
–Two phases
–Map – process each element in a data set
–Reduce – aggregate or consolidate the data
–Easily applicable to distributed processing of large data sets
• Hadoop MapReduce is the major implementation
–Limited
–Each job has one Map phase and one Reduce phase
–Job output saved to files
• Spark implements MapReduce with much greater flexibility
–Map and Reduce functions can be interspersed
–Results stored in memory
–Operations can be chained easily
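The two phases can be sketched in a few lines of plain Python. This is only a single-machine model of the programming idea, using the built-in map and functools.reduce; it is not Hadoop or Spark code, and the input data and variable names are assumptions made for the example.

```python
from functools import reduce

# Hypothetical input: two lines of text
lines = ["the cat sat on the mat", "the aardvark sat on the sofa"]

# Map phase: process each element independently
# (here, count the words in each line)
words_per_line = list(map(lambda line: len(line.split()), lines))

# Reduce phase: consolidate the mapped values into a single result
total_words = reduce(lambda a, b: a + b, words_per_line)

print(words_per_line, total_words)
```

Because the map step treats each element on its own, the elements could be processed on different machines before the reduce step combines the partial results.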

MapReduce Example: Word Count
Input Data
the cat sat on the mat
the aardvark sat on the sofa
Result
aardvark 1
cat 1
mat 1
on 2
sat 2
sofa 1
the 4

Example: Word Count
Each call is chained onto the previous one; the data after each step is shown beneath it.
> counts = sc.textFile(file)
the cat sat on the mat
the aardvark sat on the sofa
.flatMap(lambda line: line.split())
the
cat
sat
on
the
mat
the
aardvark
sat
…
.map(lambda word: (word,1))
Key-Value Pairs
(the, 1)
(cat, 1)
(sat, 1)
(on, 1)
(the, 1)
(mat, 1)
(the, 1)
(aardvark, 1)
(sat, 1)
…
.reduceByKey(lambda v1,v2: v1+v2)
(aardvark, 1)
(cat, 1)
(mat, 1)
(on, 2)
(sat, 2)
(sofa, 1)
(the, 4)
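The word count pipeline can be modeled in plain Python to check the intermediate results by hand. This is a sketch under stated assumptions: flat_map and reduce_by_key are hypothetical helper functions written for this example to mimic what the Spark operations compute on a single machine; they are not part of Spark.

```python
from itertools import chain

lines = ["the cat sat on the mat", "the aardvark sat on the sofa"]

def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return list(chain.from_iterable(func(item) for item in items))

def reduce_by_key(func, pairs):
    """Combine the values of pairs that share a key, two values at a time."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Step 1: split each line into words
words = flat_map(lambda line: line.split(), lines)

# Step 2: turn each word into a (word, 1) pair
pairs = [(word, 1) for word in words]

# Step 3: add up the 1s for each distinct word
counts = reduce_by_key(lambda v1, v2: v1 + v2, pairs)

for word in sorted(counts):
    print(word, counts[word])
```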

ReduceByKey
• ReduceByKey functions must be
–Binary – combines two values for the same key
–Commutative – x+y = y+x
–Associative – (x+y)+z = x+(y+z)
Input pairs
(the,1)
(cat,1)
(sat,1)
(on,1)
(the,1)
(mat,1)
(the,1)
(aardvark,1)
(sat,1)
(on,1)
(the,1)
…
Combining the values for key "the", two at a time
(the,2)
(the,3)
(the,4)
Result
(aardvark, 1)
(cat, 1)
(mat, 1)
(on, 2)
(sat, 2)
(sofa, 1)
(the, 4)
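The commutative and associative requirements matter because the values for a key may be combined in different groupings. The plain-Python sketch below illustrates this; combine_in_partitions is a hypothetical helper that splits the values into two made-up "partitions", reduces each, and merges the partial results. It is a model of the idea, not how Spark is implemented.

```python
from functools import reduce

values = [1, 1, 1, 1]  # the four counts for the key "the"

def combine_in_partitions(func, values, split_at):
    """Reduce two hypothetical partitions separately, then merge the partial results."""
    left = reduce(func, values[:split_at])
    right = reduce(func, values[split_at:])
    return func(left, right)

add = lambda v1, v2: v1 + v2        # commutative and associative
subtract = lambda v1, v2: v1 - v2   # neither commutative nor associative

# Try every way of splitting the four values into two non-empty partitions
add_results = {combine_in_partitions(add, values, i) for i in (1, 2, 3)}
sub_results = {combine_in_partitions(subtract, values, i) for i in (1, 2, 3)}

print(add_results)   # one answer regardless of the split
print(sub_results)   # the answer depends on the split
```

Addition gives the same total however the values are grouped, which is why it is a safe function to pass to reduceByKey; subtraction does not.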

Example: Word Count
> counts.saveAsTextFile(output)
(aardvark,1)
(cat,1)
(mat,1)
(on,2)
(sat,2)
(sofa,1)
(the,4)

Spark v. Hadoop MapReduce
• Spark takes the concepts of MapReduce to the next level
–Higher level API = faster, easier development
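Word Count in Hadoop MapReduce:
// Imports used across the three classes (each public class goes in its own file)
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// WordCount.java: the driver that configures and submits the job
public class WordCount {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}

// WordMapper.java: emits (word, 1) for every word in a line
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    // Split on runs of non-word characters
    for (String word : line.split("\\W+")) {
      if (word.length() > 0)
        context.write(new Text(word), new IntWritable(1));
    }
  }
}

// SumReducer.java: adds up the counts for each word
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    context.write(key, new IntWritable(wordCount));
  }
}
Word Count in Spark:
> counts = sc.textFile(file)
.flatMap(lambda line: line.split())
.map(lambda word: (word,1))
.reduceByKey(lambda v1,v2: v1+v2)
> counts.saveAsTextFile(output)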

–Low latency = near real-time processing
–In-memory data storage = up to 100x performance improvement
[Chart: Logistic Regression]

Thank you for attending!
• Submit questions in the Q&A panel
• Follow Cloudera University @ClouderaU
• Follow Diana on GitHub: https://github.com/dianacarroll
• Follow the Developer learning path: http://university.cloudera.com/developers
• Learn about the enterprise data hub: http://tinyurl.com/edh-webinar
• Join the Cloudera user community: http://community.cloudera.com/
Register now for Cloudera training at http://university.cloudera.com
Use discount code Spark_10 to save 10% on new enrollments in Spark Developer Training classes delivered by Cloudera until October 3, 2014*
Use discount code 15off2 to save 15% on enrollments in two or more training classes delivered by Cloudera until October 3, 2014*
* Excludes classes sold or delivered by Cloudera partners
