2. Two meetup groups
Close, but different topics
Run by me
I don’t have to be a presenter all the time
Propose your agenda
3. Not a Big Data introduction
Visit our next Big Data Essentials meetup instead:
http://www.meetup.com/Big-Data-Developers-in-Slovenia/events/223871144/
Not for people without a technical background (sorry)
Not a thorough use case discussion
Just a technical overview of the technology for beginners
4. General purpose distributed computing engine suitable for large-scale machine learning and data processing tasks
5. NOT SO GOOD:
Not the first computing engine (MapReduce, MPI)
Not one of a kind (Flink)
Not so old (mature)
GOOD:
Developing very fast
Rapidly growing community
Backed by major vendors
Innovation
Designed for iterative data analysis on a large scale (supersedes MR)
6. In-Memory Performance
Ease of Development
Combine Workflows
Unlimited Scale
Enterprise Platform
Wide Range of Data Formats
7. A Big Data/DWH developer
A Data Scientist
An Analytics Architect
A CxO of an IT company
A Statistician
A Business Analyst
A Software Engineer
8. IT WORDS:
Data processing/Transformation
Machine Learning
Social Network Analysis
Streaming/Microbatching
BUSINESS WORDS:
Segmentation
Campaign response prediction
Churn avoidance
CTR prediction
Behavioral analysis
Genomics
…
11. Open Source SystemML
Educate One Million Data Professionals
Establish Spark Technology Center
Founding Member of AMPLab
Contributing to the Core
Port many existing applications onto Spark
Develop applications using Spark
13. Distributed platform for thousands of nodes
Data storage and computation framework
Open source
Runs on commodity hardware
Flexible – everything is loosely coupled
14. Driving principles
Files are stored across the entire cluster
Programs are brought to the data, not the data to the program
Distributed file system (DFS) stores blocks across the whole cluster
Blocks of a single file are distributed across the cluster
A given block is typically replicated as well for resiliency
Just like a regular file system, the contents of a file are up to the application
Unlike a regular file system, you can ask it “where does each block of my file live?”
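To make that last point concrete, a minimal Scala sketch of asking the DFS that question, assuming the Hadoop client libraries are on the classpath; the file path is only an example:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
// Connect to the default file system configured in core-site.xml
val fs = FileSystem.get(new Configuration())
val status = fs.getFileStatus(new Path("/sparkdata/sparkQuotes.txt"))
// Ask where each block of the file lives
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
blocks.foreach(b =>
  println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}"))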
15. map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

Content of input documents:
Hello World Bye World
Hello IBM
Map 1 emits:
<Hello, 1>
<World, 1>
<Bye, 1>
<World, 1>
Map 2 emits:
<Hello, 1>
<IBM, 1>
Reduce (final output):
<Bye, 1>
<IBM, 1>
<Hello, 2>
<World, 2>
16. Spark brings two significant value-adds:
Bring to MapReduce the same added value that databases (and parallel databases) brought to query processing:
Let the app developer focus on the WHAT (they need to ask) and let the system figure out HOW (it should be done)
Enable faster, higher-level application development through higher-level constructs and concepts (the RDD concept)
Let the system deal with performance (as part of the HOW):
Leveraging memory (bufferpools, caching RDDs in memory)
Maintaining sets of dedicated worker processes ready to go (subagents in a DBMS, Executors in Spark)
Enabling interactive processing (CLP, SQL*Plus, spark-shell, etc.)
Be one general purpose engine for multiple types of workloads (SQL, Streaming, Machine Learning, etc.)
17. Apache Spark is a fast, general purpose, easy-to-use cluster computing system for large-scale data processing
Fast
Leverages aggressively cached in-memory distributed computing and dedicated App Executor processes even when no jobs are running
Faster than MapReduce
General purpose
Covers a wide range of workloads
Provides SQL, streaming and complex analytics
Flexible and easier to use than MapReduce
Spark is written in Scala, an object-oriented, functional programming language
Scala, Python and Java APIs
Scala and Python interactive shells
Runs on Hadoop, Mesos, standalone or cloud
(charts: logistic regression in Hadoop and Spark; the Spark stack)
WordCount:
val wordCounts = sc.textFile("README.md")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
18. Spark is versatile and flexible:
Can run on YARN / HDFS but also standalone or on MESOS
The Spark engine can be exploited from multiple “entry points”: SQL, Streaming, Machine Learning, Graph Processing
19. Normally you code stuff up in one of the languages:
Scala
Python
Java
I like Python, but in some cases it is slower
With DataFrames, no difference (more later)
20. One of the shells
Scala shell (spark-shell)
Python shell
Code it in the editor and submit with spark-submit
Use a “notebook” (Jupyter, Zeppelin)
My preferred method. More later
Enable your IDE to run Spark
PyCharm
IntelliJ IDEA
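For the spark-submit route, a minimal sketch of what the packaged app could look like; the object name and input path are just examples:
import org.apache.spark.{SparkConf, SparkContext}
object QuoteCount {
  def main(args: Array[String]): Unit = {
    // Unlike the shells, a standalone app creates its own context
    val sc = new SparkContext(new SparkConf().setAppName("QuoteCount"))
    val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
    println(s"lines: ${quotes.count()}")
    sc.stop()
  }
}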
22. Jupyter
Zeppelin
Scala
Incubated
Many others:
Spark Notebook
ISpark
DataBricks Cloud
IBM Spark aaS
IBM Data Scientist Workbench
23. Initialize context
Read data
Run stuff
Transformations
Actions
Caching
More later
24. GOOD STUFF:
Full API exposed
Concise language
Documentation is way better
Faster if you use plain RDDs
Build tools and dependency tracking
NOT SO GOOD STUFF:
Not so many additional libraries compared to Python (Pandas, Matplotlib)
Harder to run in a “notebook” (at the moment)
Harder to learn
25. Scala Crash Course
Holden Karau, DataBricks
http://lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course.pdf
Martin Odersky’s “Functional Programming in Scala” course
Books
Scala for the Impatient
Scala by Example
26. GOOD STUFF:
Clean & clear language
Easy to learn
Lots of libraries (Pandas, Scikit, matplotlib)
Easy to run in a “notebook”
NOT SO GOOD STUFF:
Slower (interpreted language)
Not all API functions exposed (Streaming)
Sometimes behaves differently
27. I think coding in Java for Spark is terrible
But if you like it messy, there is nobody to stop you
29. A way to connect to the Spark engine
Initialized with all runtime parameters
For example, memory parameters
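A minimal sketch of doing that initialization by hand (the shells and notebooks create sc for you); the master URL and memory value are only examples:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("local[2]")               // e.g. run locally with 2 threads
  .set("spark.executor.memory", "2g")  // a runtime memory parameter
val sc = new SparkContext(conf)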
30. Resilient Distributed Dataset
An abstraction over a generic data collection
Integers
Strings
PairRDD: <key, value> pairs (support additional operations)
A single logical entity, but under the hood a distributed collection
(diagram: an RDD “Names” holding Mokhtar, Jacques, Dirk, Cindy, Dan, Susan, Dirk, Frank and Jacques, spread across Partition 1, Partition 2 and Partition 3)
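A minimal shell sketch of that picture: one logical RDD of names, physically split into three partitions:
val names = sc.parallelize(
  Seq("Mokhtar", "Jacques", "Dirk", "Cindy", "Dan",
      "Susan", "Dirk", "Frank", "Jacques"), 3)
names.partitions.size  // 3
names.glom().collect() // one array per partition, showing the split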
31. You have to pay attention to what kind of operation you are running
Transformations
Do not do anything until an action is called
Actions
Kick off computation
Results can be persisted to memory (cache) or to disk (more later)
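A quick shell sketch of the difference:
val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2) // transformation: nothing runs yet
doubled.count()               // action: this kicks off the computation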
32. Three methods for creation
Distributing a collection of objects from the driver program (using the parallelize method of the Spark context)
val rddNumbers = sc.parallelize(1 to 10)
val rddLetters = sc.parallelize(List("a", "b", "c", "d"))
Loading an external dataset (file)
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
Transformation from another existing RDD
val rddNumbers2 = rddNumbers.map(x => x + 1)
33. Transformations are lazy evaluations
Each returns a pointer to the transformed RDD
Pair RDDs (K,V) have extra functions for MapReduce-style transformations
map
filter
flatMap
reduceByKey
sortByKey
join
See the docs for the full list
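For example, a small pair RDD join (the data is made up):
val ages  = sc.parallelize(Seq(("Dirk", 42), ("Cindy", 35)))
val towns = sc.parallelize(Seq(("Dirk", "Ljubljana"), ("Cindy", "Maribor")))
val joined = ages.join(towns) // still a transformation: (name, (age, town))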
34. Kick off the computation
Transformations are lazily evaluated
collect()
count()
take()
reduce()
first()
saveAsTextFile()
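A few of these in the shell (the output path is only an example):
val rdd = sc.parallelize(1 to 5)
rdd.count()       // 5
rdd.first()       // 1
rdd.take(2)       // Array(1, 2)
rdd.reduce(_ + _) // 15
rdd.saveAsTextFile("hdfs:/tmp/numbers") // writes one part file per partition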
35. Each node stores in memory any partitions of the cached dataset that it computes
Reuses them in other actions on that dataset (or datasets derived from it)
Future actions are much faster (often by more than 10x)
Two methods for RDD persistence: persist() and cache()
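A minimal caching sketch, reusing the quotes file from earlier:
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
val spark  = quotes.filter(_.contains("Spark")).cache() // mark for in-memory storage
spark.count() // first action computes the RDD and caches its partitions
spark.count() // later actions reuse the cached partitions: much faster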
36. Example job: rdd1.join(rdd2).groupBy(…).filter(…)
RDD Objects: build the operator DAG
DAGScheduler: splits the DAG into stages of tasks and submits each stage as ready; agnostic to operators!
TaskScheduler: launches the tasks of each TaskSet via the cluster manager and retries failed or straggling tasks; doesn’t know about stages; reports back to the DAGScheduler when a stage fails
Worker: threads execute the tasks; the Block manager stores and serves blocks
(diagram from DataBricks)
41. Read the Fine Manual
https://spark.apache.org/docs/latest/index.html
Take the course
Big Data University
https://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/
edX – edx.org, search for Spark
If you’re stuck
Try the user lists: https://spark.apache.org/community.html
42. Questions?
Topic for the next meetup?
Your experiences?
Want to be a presenter?
43. Some slides and text graphics were borrowed from the following sources:
Vincent Poncet, IBM France
Jacques Roy, IBM US
Daniel Kikuchi, IBM US
Mokhtar Kandil, IBM US
DataBricks
Spark Docs
I completely lost track of which slides I copied from which source. I apologize.