© 2015 IBM Corporation
Introduction to Apache Spark
Vincent Poncet
IBM Software Big Data Technical Sales
02/07/2015
Credits
 This presentation draws upon previous work and slides by IBM
colleagues from the WW Software Big Data Organization: Daniel
Kikuchi, Jacques Roy and Mokhtar Kandil
 I used several materials from Databricks and the Apache Spark
documentation
Agenda
Introduction and background
Spark Core API
Spark Execution Model
Spark Shell & Application Deployment
Spark Extensions (SparkSQL, MLlib, Spark Streaming)
Spark Future
Introduction and background
 Apache Spark is a fast, general purpose,
easy-to-use cluster computing system for
large-scale data processing
– Fast
• Leverages aggressively cached in-memory
distributed computing and dedicated Executor
processes even when no jobs are running
• Faster than MapReduce
– General purpose
• Covers a wide range of workloads
• Provides SQL, streaming and complex
analytics
– Flexible and easier to use than MapReduce
• Spark is written in Scala, an object oriented,
functional programming language
• Scala, Python and Java APIs
• Scala and Python interactive shells
• Runs on Hadoop, Mesos, standalone or cloud
[Figure: Logistic regression in Hadoop and Spark]
[Figure: Spark Stack]
WordCount
val wordCounts = sc.textFile("README.md")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
Brief History of Spark
 2002 – MapReduce @ Google
 2004 – MapReduce paper
 2006 – Hadoop @ Yahoo
 2008 – Hadoop Summit
 2010 – Spark paper
 2013 – Spark 0.7 Apache Incubator
 2014 – Apache Spark top-level
 2014 – 1.2.0 release in December
 2015 – 1.3.0 release in March
 2015 – 1.4.0 release in June
 Spark is HOT!!!
 Most active project in Hadoop
ecosystem
 One of top 3 most active Apache
projects
 Databricks founded by the creators
of Spark from UC Berkeley’s
AMPLab
Activity for 6 months in 2014
(from Matei Zaharia – 2014 Spark Summit)
DataBricks
In June 2015, code base was about 400K lines
DataBricks / Spark Summit 2015
Large Scale Usage
DataBricks / Spark Summit 2015
Spark ecosystem
 Spark is quite versatile and flexible:
– Can run on YARN / HDFS but also standalone or on Mesos
– The general processing capabilities of the Spark engine can be exploited from
multiple “entry points”: SQL, Streaming, Machine Learning, Graph Processing
Spark in the Hadoop ecosystem
 Currently, Spark is a general-purpose parallel processing engine
which integrates with YARN alongside the rest of the Hadoop frameworks
[Diagram: Hadoop stack showing HDFS, YARN, MapReduce 2, Hive, Pig, Spark, HBase, BigSQL and Impala]
Future of Spark’s role in Hadoop ?
 The Spark Core engine is a well-performing replacement for
MapReduce:
[Diagram: stack showing HDFS, YARN and Spark Core, together with BigSQL, Spark SQL, Spark MLlib, Spark Streaming, Hive, custom code and HBase]
Spark Core API
Resilient Distributed Dataset (RDD): definition
 An RDD is a distributed collection of Scala/Python/Java objects of
the same type:
– RDD of strings
– RDD of integers
– RDD of (key, value) pairs
– RDD of instances of a Java/Python/Scala class
 An RDD is physically distributed across the cluster, but manipulated
as one logical entity:
– Spark will “distribute” any required processing to all partitions where the RDD
exists and perform necessary redistributions and aggregations as well.
– Example: Consider a distributed RDD “Names” made of names
Names
Partition 1: Mokhtar, Jacques, Dirk
Partition 2: Cindy, Dan, Susan
Partition 3: Dirk, Frank, Jacques
Resilient Distributed Dataset: definition
 Suppose we want to know the number of names in the RDD “Names”
 User simply requests: Names.count()
– Spark will “distribute” count processing to all partitions so as to obtain:
• Partition 1: Mokhtar (1), Jacques (1), Dirk (1) → 3
• Partition 2: Cindy (1), Dan (1), Susan (1) → 3
• Partition 3: Dirk (1), Frank (1), Jacques (1) → 3
– Local counts are subsequently aggregated: 3+3+3=9
 To look up the first element in the RDD: Names.first()
 To display all elements of the RDD: Names.collect() (careful with this: it brings every element back to the driver)
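The count described above can be mimicked with plain Scala collections. This is only a sketch: the object name PartitionedCount and its members are made up for illustration, the three partitions are ordinary local sequences, and no Spark API is involved.

```scala
object PartitionedCount {
  // The three partitions of the "Names" example, modelled as local sequences.
  val partitions: Seq[Seq[String]] = Seq(
    Seq("Mokhtar", "Jacques", "Dirk"),
    Seq("Cindy", "Dan", "Susan"),
    Seq("Dirk", "Frank", "Jacques")
  )

  // Step 1: every partition counts its own elements independently.
  def localCounts: Seq[Int] = partitions.map(_.size)

  // Step 2: the partial counts are aggregated into one result.
  def count: Long = localCounts.map(_.toLong).sum

  // first() only needs the first element of the first partition.
  def first: String = partitions.head.head

  // collect() gathers every element in one place, which is why it is costly on large data.
  def collect: Seq[String] = partitions.flatten
}
```

In Spark, step 1 would run on the executors that hold each partition and only the small partial counts would travel to the driver.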
Resilient Distributed Datasets: Creation and Manipulation
 Three methods for creation
– Distributing a collection of objects from the driver program (using the
parallelize method of the spark context)
val rddNumbers = sc.parallelize(1 to 10)
val rddLetters = sc.parallelize(List("a", "b", "c", "d"))
– Loading an external dataset (file)
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
– Transformation from another existing RDD
val rddNumbers2 = rddNumbers.map(x=> x+1)
 Dataset from any storage supported by Hadoop
– HDFS, Cassandra, HBase, Amazon S3
– Others
 File types supported
– Text files, SequenceFiles, Parquet, JSON
– Hadoop InputFormat
Resilient Distributed Datasets: Properties
 Immutable
 Two types of operations
– Transformations ~ DDL (Create View V2 as…)
• val rddNumbers = sc.parallelize(1 to 10): Numbers from 1 to 10
• val rddNumbers2 = rddNumbers.map (x => x+1): Numbers from 2 to 11
• The LINEAGE on how to obtain rddNumbers2 from rddNumbers is recorded
• It’s a Directed Acyclic Graph (DAG)
• No actual data processing takes place: lazy evaluation
– Actions ~ DML (Select * From V2…)
• rddNumbers2.collect(): Array [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
• Performs the recorded transformations, then the action
• Returns a value (or writes to a file)
 Fault tolerance
– If data in memory is lost it will be recreated from lineage
 Caching, persistence (memory, spilling, disk) and check-pointing
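Lazy evaluation can be observed locally with a Scala collection view, which behaves in a loosely similar way to an uncached RDD. This is a sketch under that analogy only: LazyLineage and the evaluations counter are hypothetical names, and a view is not an RDD.

```scala
object LazyLineage {
  // Counts how many times the mapping function has actually been executed.
  var evaluations = 0

  val numbers = 1 to 10

  // "Transformation": the view only records what has to be done, nothing runs yet.
  val numbers2 = numbers.view.map { x => evaluations += 1; x + 1 }

  // "Action": forces the recorded computation and returns a concrete result.
  def collect(): List[Int] = numbers2.toList
}
```

Calling collect() a second time runs the mapping function again, which is the behaviour that caching or persisting an RDD is meant to avoid.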
RDD Transformations
 Transformations are lazily evaluated
 Each returns a reference to the transformed RDD
 Pair RDD (K,V) functions for MapReduce style transformations
Transformation | Meaning
map(func) | Return a new dataset formed by passing each element of the source through a function func.
filter(func) | Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items. So func should return a Seq rather than a single item.
join(otherDataset, [numTasks]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
reduceByKey(func) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func.
sortByKey([ascending], [numTasks]) | When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order.
combineByKey[C](createCombiner, mergeValue, mergeCombiners) | Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C. createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C
Full documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
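The three functions of combineByKey are easier to follow on a small local model. The sketch below is an assumption-laden illustration, not Spark's implementation: partitions are plain sequences, CombineByKeySketch and averages are hypothetical names, and the example computes a per-key average with a (sum, count) combiner.

```scala
object CombineByKeySketch {
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Inside each partition: the first value of a key creates a combiner,
    // later values of the same key are merged into it.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        val combined = acc.get(k) match {
          case Some(existing) => mergeValue(existing, v)
          case None           => createCombiner(v)
        }
        acc.updated(k, combined)
      }
    }
    // Across partitions: combiners of the same key are merged together.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, partial) =>
      partial.foldLeft(acc) { case (merged, (k, c)) =>
        merged.updated(k, merged.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }

  // Per-key average, using (sum, count) as the combined type C.
  def averages(partitions: Seq[Seq[(String, Double)]]): Map[String, Double] =
    combineByKey[String, Double, (Double, Int)](
      partitions,
      v => (v, 1),
      (c, v) => (c._1 + v, c._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2)
    ).map { case (k, (sum, n)) => k -> sum / n }
}
```

reduceByKey can be seen as the special case where C is the same type as V and one function plays the role of both mergeValue and mergeCombiners.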
RDD Actions
 Actions return values or save an RDD to storage
Action | Meaning
collect() | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of data.
count() | Return the number of elements in a dataset.
first() | Return the first element of the dataset.
take(n) | Return an array with the first n elements of the dataset.
foreach(func) | Run a function func on each element of the dataset.
saveAsTextFile(path) | Save the RDD as a text file.
Full documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
RDD Persistence
 Each node stores any partitions of the cache that it computes in memory
 Reuses them in other actions on that dataset (or datasets derived from it)
– Future actions are much faster (often by more than 10x)
 Two methods for RDD persistence: persist() and cache()
Storage Level | Meaning
MEMORY_ONLY | Store as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed when needed. This is the default. The cache() method uses this.
MEMORY_AND_DISK | Same, except that partitions which do not fit in memory are stored on disk and read from there when needed.
MEMORY_ONLY_SER | Store as serialized Java objects (one byte array per partition). Space efficient, but more CPU intensive to read.
MEMORY_AND_DISK_SER | Similar to MEMORY_AND_DISK but stored as serialized objects.
DISK_ONLY | Store only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. | Same as above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental) | Store RDD in serialized format in Tachyon.
Scala
 Scala Crash Course
 Holden Karau, DataBricks
http://lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course.pdf
Code Execution (1)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
File: sparkQuotes.txt
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
 ‘spark-shell’ provides the Spark context as ‘sc’
Code Execution (2)
The same code, step by step:
File: sparkQuotes.txt → HadoopRDD → RDD: quotes
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
RDD: danQuotes
DAN Spark is cool
DAN Scala is awesome
RDD: danSpark
Spark
Scala
Result of count(): 1
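The walkthrough can be replayed on a local list to check each intermediate result. This is a sketch only: QuotesPipeline is a made-up name, the five lines of sparkQuotes.txt are inlined instead of being read from HDFS, and ordinary List operations stand in for the RDD operations of the same name.

```scala
object QuotesPipeline {
  // Contents of sparkQuotes.txt, inlined for the example.
  val quotes: List[String] = List(
    "DAN Spark is cool",
    "BOB Spark is fun",
    "BRIAN Spark is great",
    "DAN Scala is awesome",
    "BOB Scala is flexible"
  )

  // Keep only the lines spoken by DAN.
  val danQuotes: List[String] = quotes.filter(_.startsWith("DAN"))

  // Split each line into words and keep the second word.
  val danSpark: List[String] = danQuotes.map(_.split(" ")).map(x => x(1))

  // Count the remaining words that contain "Spark".
  val sparkCount: Int = danSpark.count(_.contains("Spark"))
}
```

Unlike the RDD version, every val here is computed eagerly; in Spark nothing would run before the final count().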
DataFrames
 A DataFrame is a distributed collection of data organized into named columns. It is
conceptually equivalent to a table in a relational database or a data frame in R or Python (pandas),
but distributed, with query optimization and predicate pushdown to the
underlying storage.
 DataFrames can be constructed from a wide array of sources such as: structured data files,
tables in Hive, external databases, or existing RDDs.
 Released in Spark 1.3
DataBricks / Spark Summit 2015
DataFrames Examples
// Create the DataFrame
val df = sqlContext.read.parquet("examples/src/main/resources/people.parquet")
// Show the content of the DataFrame
df.show()
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Select only the "name" column
df.select("name").show()
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// Select people older than 21
df.filter(df("age") > 21).show()
// Count people by age
df.groupBy("age").count().show()
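For readers who think in collections, the DataFrame operations above have rough local counterparts. The sketch below is only an analogy: DataFrameOpsSketch, Person and the three rows are hypothetical, and nothing here is distributed or optimized the way a DataFrame query is.

```scala
object DataFrameOpsSketch {
  final case class Person(name: String, age: Long)

  // Hypothetical rows, standing in for the content of people.parquet.
  val people: Seq[Person] = Seq(Person("Andy", 30), Person("Justin", 19), Person("Ann", 30))

  // df.select("name")
  def names: Seq[String] = people.map(_.name)

  // df.select(df("name"), df("age") + 1)
  def agedByOne: Seq[(String, Long)] = people.map(p => (p.name, p.age + 1))

  // df.filter(df("age") > 21)
  def olderThan21: Seq[Person] = people.filter(_.age > 21)

  // df.groupBy("age").count()
  def countByAge: Map[Long, Int] =
    people.groupBy(_.age).map { case (age, group) => age -> group.size }
}
```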
Spark Execution Model
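Components
[Diagram (source: Databricks): your program (sc = new SparkContext; f = sc.textFile("…"); f.filter(…).count(); ...) runs in the Spark client (app master), which holds the RDD graph, the scheduler, the block tracker and the shuffle tracker. The client works through a cluster manager with Spark workers, each running task threads and a block manager, over storage such as HDFS, HBase, …]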
Scheduling Process
[Diagram (source: Databricks), four steps from left to right]
1. RDD Objects: build the operator DAG, e.g. rdd1.join(rdd2).groupBy(…).filter(…)
2. DAGScheduler: receives the DAG, splits the graph into stages of tasks and submits each stage as ready; agnostic to operators
3. TaskScheduler: receives each TaskSet, launches tasks via the cluster manager and retries failed or straggling tasks; doesn’t know about stages
4. Worker: executes tasks in threads; its block manager stores and serves blocks
A failed stage is reported back to the DAGScheduler.
Scheduler Optimizations
 Pipelines narrow operations within a stage
 Picks join algorithms based on partitioning (minimizes shuffles)
 Reuses previously cached data
[Diagram (source: Databricks): RDDs A to G connected by groupBy, map, union and join and grouped into Stage 1, Stage 2 and Stage 3; shaded boxes mark previously computed partitions; one task is highlighted]
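The idea of pipelining narrow operations within a stage can be sketched for the simplest case, a linear chain of operations. This is a deliberately simplified model and not the DAGScheduler's algorithm: StageSplitSketch, Op and the wide flag are hypothetical, and real lineage graphs can branch and reuse cached partitions.

```scala
object StageSplitSketch {
  // wide = true marks an operation that needs a shuffle (e.g. reduceByKey, groupBy).
  final case class Op(name: String, wide: Boolean)

  // Walk the chain in order: narrow operations join the current stage,
  // a wide operation starts a new stage.
  def stages(chain: Seq[Op]): Seq[Seq[String]] =
    chain.foldLeft(Vector.empty[Vector[String]]) { (acc, op) =>
      if (acc.isEmpty || op.wide) acc :+ Vector(op.name)
      else acc.init :+ (acc.last :+ op.name)
    }
}
```

For a chain such as textFile, filter, map, reduceByKey, map, this model yields two stages, split at reduceByKey.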
Directed Acyclic Graph (DAG)
 View the lineage
scala> danSpark.toDebugString
res1: String =
(2) MappedRDD[4] at map at <console>:16
| MappedRDD[3] at map at <console>:16
| FilteredRDD[2] at filter at <console>:14
| hdfs:/sparkdata/sparkQuotes.txt MappedRDD[1] at textFile at <console>:12
| hdfs:/sparkdata/sparkQuotes.txt HadoopRDD[0] at textFile at <console>:12
 Could be issued in a continuous line
val danSpark = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt").
filter(_.startsWith("DAN")).
map(_.split(" ")).
map(x => x(1)).
filter(_.contains("Spark"))
danSpark.count()
Showing Multiple Apps
[Diagram (source: Databricks): the driver program with its SparkContext talks to the cluster manager and to two worker nodes; each worker node runs an executor holding a cache and several tasks for the app]
 Each Spark application runs as a set of processes coordinated by the
Spark context object (driver program)
– Spark context connects to the Cluster Manager (standalone, Mesos or YARN)
– Spark context acquires executors (JVM instances) on worker nodes
– Spark context sends tasks to the executors
Spark Terminology
 Context (Connection):
– Represents a connection to the Spark cluster. The application which initiated
the context can submit one or several jobs, sequentially or in parallel, in batch or
interactively, or run as a long-running server continuously serving requests.
 Driver (Coordinator agent)
– The program or process running the Spark context. Responsible for running
jobs over the cluster and converting the App into a set of tasks
 Job (Query / Query plan):
– A piece of logic (code) which will take some input from HDFS (or the local
filesystem), perform some computations (transformations and actions) and
write some output back.
 Stage (Subplan)
– Jobs are divided into stages
 Tasks (Sub section)
– Each stage is made up of tasks. One task per partition. One task is executed
on one partition (of data) by one executor
 Executor (Sub agent)
– The process responsible for executing a task on a worker node
 Resilient Distributed Dataset
Spark Shell & Application Deployment
Spark’s Scala and Python Shell
 Spark comes with two shells
– Scala
– Python
 APIs available for Scala, Python and Java
 Appropriate versions for each Spark release
 Spark’s native language is Scala, so it is more natural to write Spark
applications in Scala.
 This presentation will focus on code examples in Scala
Spark’s Scala and Python Shell
 Powerful tool to analyze data interactively
 The Scala shell runs on the Java VM
– Can leverage existing Java libraries
 Scala:
– To launch the Scala shell (from Spark home directory):
./bin/spark-shell
– To read in a text file:
scala> val textFile = sc.textFile("README.txt")
 Python:
– To launch the Python shell (from Spark home directory):
./bin/pyspark
– To read in a text file:
>>> textFile = sc.textFile("README.txt")
SparkContext in Applications
 The main entry point for Spark functionality
 Represents the connection to a Spark cluster
 Create RDDs, accumulators, and broadcast variables on that
cluster
 In the Spark shell, the SparkContext, sc, is automatically initialized
for you to use
 In a Spark program, import some classes and implicit conversions
into your program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
A Spark Standalone Application in Scala
[Code screenshot, not recoverable as text, annotated in three parts: import statements; SparkConf and SparkContext; transformations and actions]
Running Standalone Applications
 Define the dependencies
– Scala: simple.sbt
 Create the typical directory structure with the files
Scala:
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
 Create a JAR package containing the application’s code.
– Scala: sbt package
 Use spark-submit to run the program
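The slide does not show the contents of simple.sbt. A minimal sketch, in the style of the Spark quick-start guide, could look like the following; the project name, the version and the Scala and Spark version numbers are assumptions and should be matched to the cluster in use.

```scala
// simple.sbt (hypothetical project name and version)
name := "Simple Project"

version := "1.0"

// Assumption: a Spark 1.4.x build for Scala 2.10
scalaVersion := "2.10.4"

// Spark core is the only dependency the application needs to compile against
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"
```

With this file in place, sbt package should produce a JAR under target/ that can then be passed to spark-submit.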
Spark Properties
 Set application properties via the SparkConf object
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
.set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
 Dynamically setting Spark properties
– SparkContext with an empty conf
val sc = new SparkContext(new SparkConf())
– Supply the configuration values at launch time
./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
– conf/spark-defaults.conf
 Application web UI
http://<driver>:4040
Spark Configuration
 Three locations for configuration:
– Spark properties
– Environment variables
conf/spark-env.sh
– Logging
log4j.properties
 Override default configuration directory (SPARK_HOME/conf)
– SPARK_CONF_DIR
• spark-defaults.conf
• spark-env.sh
• log4j.properties
• etc.
Spark Monitoring
 Three ways to monitor Spark applications
1. Web UI
• Default port 4040
• Available for the duration of the application
2. Metrics
• Based on the Coda Hale Metrics Library
• Report to a variety of sinks (HTTP, JMX, and CSV)
• /conf/metrics.properties
3. External instrumentations
• Ganglia
• OS profiling tools (dstat, iostat, iotop)
• JVM utilities (jstack, jmap, jstat, jconsole)
Running Spark Examples
 Spark samples available in the examples directory
 Run the examples (from Spark home directory):
./bin/run-example SparkPi
where SparkPi is the name of the sample application
Spark Extensions
Spark Extensions
 Extensions to the core Spark API
 Improvements made to the core are passed to these libraries
 Little overhead to use with the Spark core
from http://spark.apache.org
Spark SQL
 Process relational queries expressed in SQL (HiveQL)
 Seamlessly mix SQL queries with Spark programs
 In Spark since 1.0, refactored on top of DataFrames since 1.3
 Provide a single interface for efficiently working with structured
data including Apache Hive, Parquet and JSON files
 Leverages Hive frontend and metastore
– Compatibility with Hive data, queries
and UDFs
– HiveQL limitations may apply
– Not ANSI SQL compliant
– Little to no query rewrite optimization,
automatic memory management or
sophisticated workload management
 Graduated from alpha status with Spark 1.3
 Standard connectivity through JDBC/ODBC
Spark SQL - Getting Started
 SQLContext created from SparkContext
// An existing SparkContext, sc
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 HiveContext created from SparkContext
// An existing SparkContext, sc
val sqlContext = new
org.apache.spark.sql.hive.HiveContext(sc)
 Import a library to convert an RDD to a DataFrame
– Scala:
import sqlContext.implicits._
 DataFrame data sources
– Inferring the schema using reflection
– Programmatic interface
Spark SQL - Inferring the Schema Using Reflection
 The case class in Scala defines the schema of the table
case class Person(name: String, age: Int)
 The arguments of the case class become the names of the columns
 Create an RDD of Person objects and convert it to a DataFrame
val people = sc.textFile("examples/src/main/resources/people.txt").
map(_.split(",")).
map(p => Person(p(0), p(1).trim.toInt)).toDF()
 Register the DataFrame as a table
people.registerTempTable("people")
 Run SQL statements using the sql method provided by the
SQLContext
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
 The results of the queries are DataFrames and support all the normal
RDD operations
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
Spark SQL - Programmatic Interface
 Use when you cannot define the case classes ahead of time
 Three steps to create the DataFrame
1. Encode the schema as a String and import the Spark SQL Row and Struct types
val schemaString = "name age"
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
2. Create the schema represented by a StructType matching the structure of
the Rows that will be built in step 3.
val schema = StructType(schemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)))
3. Convert the source RDD to an RDD of Rows, then apply the schema using the createDataFrame method.
// people must be an RDD of text lines here, not the DataFrame from the previous slide
val people = sc.textFile("examples/src/main/resources/people.txt")
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
 Then register peopleDataFrame as a table
peopleDataFrame.registerTempTable("people")
 Run the sql statements using the sql method:
val results = sqlContext.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)
SparkSQL - DataSources
Before: Spark 1.2.x
 ParquetFile
– val parquetFile = sqlContext.parquetFile("people.parquet")
 JSON
– val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
Spark 1.3.x
 Generic Load/Save
– val df = sqlContext.load("<filename>", "<datasource type>")
– df.save("<filename>", "<datasource type>")
 ParquetFile
– val df = sqlContext.load("people.parquet") // parquet unless otherwise configured by spark.sql.sources.default
– df.select("name", "age").save("namesAndAges.parquet")
 JSON
– val df = sqlContext.load("people.json", "json")
– df.select("name", "age").save("namesAndAges.json", "json")
 CSV (external package)
– val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
– df.select("year", "model").save("newcars.csv", "com.databricks.spark.csv")
Spark 1.4.x
 Generic Load/Save
– val df = sqlContext.read.format("<datasource type>").load("<filename>")
– df.write.format("<datasource type>").save("<filename>")
 ParquetFile
– val df = sqlContext.read.load("people.parquet") // parquet unless otherwise configured by spark.sql.sources.default
– df.select("name", "age").write.save("namesAndAges.parquet")
 JSON
– val df = sqlContext.read.format("json").load("people.json")
– df.select("name", "age").write.format("json").save("namesAndAges.json")
 CSV (external package)
– val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
– df.select("year", "model").write.format("com.databricks.spark.csv").save("newcars.csv")
The Data Sources API provides generic methods to
manage connectors to any data source (file, JDBC,
Cassandra, MongoDB, etc.). From Spark 1.3, the
Data Sources API provides predicate pushdown
capabilities to leverage the performance of the
backend. Most connectors are available at
http://spark-packages.org/
Spark Streaming
 Scalable, high-throughput, fault-tolerant stream processing of live
data streams
 Write Spark streaming applications like Spark applications
 Recovers lost work and operator state (sliding windows) out of the box
 Uses HDFS and Zookeeper for high availability
 Data sources also include TCP sockets, ZeroMQ or other customized
data sources
Spark Streaming - Internals
 The input stream goes into Spark Streaming
 Spark Streaming breaks it up into batches of input data
 The batches are fed into the Spark engine for processing
 The Spark engine generates the final results as a stream of batches
 DStream - Discretized Stream
– Represents a continuous stream of data created from the input streams
– Internally, represented as a sequence of RDDs
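The discretization step can be sketched on a local list of timestamped lines. This is an illustration under stated assumptions, not Spark Streaming's implementation: MicroBatchSketch, Event and the millisecond timestamps are hypothetical, and unlike a DStream the sketch produces no batch for an interval that received no data.

```scala
object MicroBatchSketch {
  final case class Event(timeMs: Long, line: String)

  // Group events into consecutive batches of batchMs milliseconds, in time order.
  def batches(events: Seq[Event], batchMs: Long): Seq[Seq[String]] =
    events
      .groupBy(_.timeMs / batchMs)
      .toSeq
      .sortBy(_._1)
      .map(_._2.map(_.line))

  // Each batch is then processed like a small, ordinary dataset: here, a word count.
  def wordCounts(batch: Seq[String]): Map[String, Int] =
    batch
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

The design point is that the per-batch logic is plain batch code, which is why streaming applications can be written like other Spark applications.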
Spark Streaming - Getting Started
 Count the number of words coming in from the TCP socket
 Import the Spark Streaming classes
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
 Create the StreamingContext object
val conf =
new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
 Create a DStream
val lines = ssc.socketTextStream("localhost", 9999)
 Split the lines into words
val words = lines.flatMap(_.split(" "))
 Count the words
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
 Print to the console
wordCounts.print()
Spark Streaming - Continued
 No real processing happens until you start the StreamingContext
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
 Code and application can be found in the NetworkWordCount
example
 To run the example:
– Invoke netcat to start the data stream (for example: nc -lk 9999)
– In a different terminal, run the application
./bin/run-example streaming.NetworkWordCount localhost 9999
Spark MLlib
 Spark MLlib is Spark’s machine learning library
 Since Spark 0.8
 Provides common algorithms and
utilities
• Classification
• Regression
• Clustering
• Collaborative filtering
• Dimensionality reduction
 Leverages in-memory cache of Spark
to speed up iteration processing
Spark MLlib - Getting Started
 Use k-means clustering for a set of latitudes and longitudes
 Import the Spark MLlib classes
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
 Create the SparkContext object
val conf = new SparkConf().setAppName("KMeans")
val sc = new SparkContext(conf)
 Create a data RDD
val taxifile = sc.textFile("user/spark/sparkdata/nyctaxisub/*")
 Create Vectors for input to algorithm
val taxi =
taxifile.map{line=>Vectors.dense(line.split(",").slice(3,5).map(_.toDouble))}
 Run the k-means algorithm with 3 clusters and 10 iterations
val model = KMeans.train(taxi, 3, 10)
val clusterCenters = model.clusterCenters.map(_.toArray)
 Print to the console
clusterCenters.foreach(lines=>println(lines(0),lines(1)))
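What the training call iterates on can be sketched with the basic k-means loop (Lloyd's algorithm) on local points. This is not MLlib's implementation: KMeansSketch and its helpers are hypothetical names, the initial centers are passed in explicitly rather than chosen by an initialization step, and everything runs on one machine.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of the same dimension.
  def dist2(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => dist2(p, centers(i)))

  // Coordinate-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / col.size).toVector

  // Repeat: assign each point to its closest center, then move every center
  // to the mean of its points. A center with no points stays where it is.
  def train(points: Seq[Point], initial: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(initial) { (centers, _) =>
      val assigned = points.groupBy(closest(_, centers))
      centers.indices.map(i => assigned.get(i).map(mean).getOrElse(centers(i)))
    }
}
```

Each iteration reads the whole dataset again, which is why keeping the input cached in memory speeds up this kind of algorithm.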
SparkML
 SparkML provides an API to build ML pipeline (since Spark 1.3)
 Similar to Python scikit-learn
 SparkML provides abstraction for all steps of an ML workflow
[Figures: Generic ML Workflow; Real Life ML Workflow]
 Transformer: A Transformer is an algorithm which can transform
one DataFrame into another DataFrame. E.g., an ML model is a
Transformer which transforms a DataFrame with features into a
DataFrame with predictions.
 Estimator: An Estimator is an algorithm which can be fit on a
DataFrame to produce a Transformer. E.g., a learning algorithm
is an Estimator which trains on a dataset and produces a model.
 Pipeline: A Pipeline chains multiple Transformers and Estimators
together to specify an ML workflow.
 Param: All Transformers and Estimators now share a common
API for specifying parameters.
Source: Xebia HUG France 06/2015
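The relationship between the three abstractions can be sketched with a few plain Scala traits. This is a toy model and not the spark.ml API: PipelineSketch, Scale and Center are hypothetical, and a sequence of Map rows stands in for a DataFrame.

```scala
object PipelineSketch {
  type Row = Map[String, Double]
  type Data = Seq[Row] // stand-in for a DataFrame

  trait Transformer { def transform(data: Data): Data }
  trait Estimator { def fit(data: Data): Transformer }

  // A Transformer: derives a new column from an existing one.
  final case class Scale(in: String, out: String, factor: Double) extends Transformer {
    def transform(data: Data): Data = data.map(r => r + (out -> r(in) * factor))
  }

  // An Estimator: learns the mean of a column; the fitted model subtracts it.
  final case class Center(col: String) extends Estimator {
    def fit(data: Data): Transformer = {
      val mean = data.map(_(col)).sum / data.size
      new Transformer {
        def transform(d: Data): Data = d.map(r => r + (col -> (r(col) - mean)))
      }
    }
  }

  // A Pipeline is itself an Estimator: fitting walks the stages in order,
  // fits each Estimator on the data produced so far, and returns one Transformer.
  final case class Pipeline(stages: Seq[Either[Transformer, Estimator]]) extends Estimator {
    def fit(data: Data): Transformer = {
      val (fitted, _) = stages.foldLeft((Vector.empty[Transformer], data)) {
        case ((done, current), stage) =>
          val t = stage match {
            case Left(transformer) => transformer
            case Right(estimator)  => estimator.fit(current)
          }
          (done :+ t, t.transform(current))
      }
      new Transformer {
        def transform(d: Data): Data = fitted.foldLeft(d)((acc, t) => t.transform(acc))
      }
    }
  }
}
```

Fitting Pipeline(Seq(Left(Scale("x", "y", 2.0)), Right(Center("y")))) on some data returns a single Transformer that applies both steps to new data.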
Spark GraphX
 Flexible Graphing
–GraphX unifies ETL, exploratory analysis, and iterative graph
computation
–You can view the same data as both graphs and collections,
transform and join graphs with RDDs efficiently, and write custom
iterative graph algorithms with the API
 Speed
–Comparable performance to the fastest specialized graph
processing systems.
 Algorithms
–Choose from a growing library of graph algorithms
–In addition to a highly flexible API, GraphX comes
with a variety of graph algorithms
Spark R
 Spark R is an R package that provides a light-weight front-end to use
Apache Spark from R
 Spark R exposes the Spark API through the RDD class and allows
users to interactively run jobs from the R shell on a cluster.
 Goal
– Make Spark R production ready
– Integration with MLlib
– Consolidations to the DataFrames and RDD concepts
 First release in Spark 1.4.0 :
– Support of DataFrames
 Spark 1.5
– Support of MLlib
Spark internals refactoring : Project Tungsten
 Memory Management and Binary Processing:
leverage application semantics to manage memory
explicitly and eliminate the overhead of JVM object
model and garbage collection
 Cache-aware computation: algorithms and data
structures to exploit memory hierarchy
 Code generation: exploit modern compilers and
CPUs: allow efficient operation directly on binary data
DataBricks / Spark Summit 2015
Spark: Final Thoughts
 Spark is a good replacement for MapReduce
– Higher performance
– Framework is easier to use than MapReduce (M/R)
– Powerful RDD & DataFrames concepts
– Rich higher-level libraries: SparkSQL, MLlib/ML, Streaming, GraphX
– Broad ecosystem adoption
 This is a very fast-paced environment, so keep up!
– Many new features in each release (a major release every 3 months)
– Spark is currently the latest and best offering, but things may change again
Resources
 The Learning Spark O’Reilly book
 Lab(s) this afternoon
 The following course on big data university
Introduction to Apache Spark

  • 1. © 2015 IBM Corporation Introduction to Apache Spark Vincent Poncet IBM Software Big Data Technical Sale 02/07/2015
  • 2. 2 © 2015 IBM Corporation Credits  This presentations draws upon previous work / slides by IBM colleagues from WW Software Big Data Organization : Daniel Kikuchi, Jacques Roy and Mokhtar Kandil  I used several materials from DataBricks and Apache Spark documentation
  • 3. 3 © 2015 IBM Corporation Introduction and background Spark Core API Spark Execution Model Spark Shell & Application Deployment Spark Extensions (SparkSQL, MLlib, Spark Streaming) Spark Future Agenda
  • 4. 4 © 2015 IBM Corporation Introduction and background
  • 5. 5 © 2015 IBM Corporation  Apache Spark is a fast, general purpose, easy-to-use cluster computing system for large-scale data processing – Fast • Leverages aggressively cached in-memory distributed computing and dedicated Executor processes even when no jobs are running • Faster than MapReduce – General purpose • Covers a wide range of workloads • Provides SQL, streaming and complex analytics – Flexible and easier to use than Map Reduce • Spark is written in Scala, an object oriented, functional programming language • Scala, Python and Java APIs • Scala and Python interactive shells • Runs on Hadoop, Mesos, standalone or cloud Logistic regression in Hadoop and Spark Spark Stack val wordCounts = sc.textFile("README.md").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) WordCount
  • 6. 6 © 2015 IBM Corporation Brief History of Spark  2002 – MapReduce @ Google  2004 – MapReduce paper  2006 – Hadoop @ Yahoo  2008 – Hadoop Summit  2010 – Spark paper  2013 – Spark 0.7 Apache Incubator  2014 – Apache Spark top-level  2014 – 1.2.0 release in December  2015 – 1.3.0 release in March  2015 – 1.4.0 release in June  Spark is HOT!!!  Most active project in Hadoop ecosystem  One of top 3 most active Apache projects  Databricks founded by the creators of Spark from UC Berkeley’s AMPLab Activity for 6 months in 2014 (from Matei Zaharia – 2014 Spark Summit) DataBricks In June 2015, code base was about 400K lines
  • 7. 7 © 2015 IBM Corporation DataBricks / Spark Summit 2015
  • 8. 8 © 2015 IBM Corporation Large Scale Usage DataBricks / Spark Summit 2015
  • 9. 9 © 2015 IBM Corporation Spark ecosystem  Spark is quite versatile and flexible: – Can run on YARN / HDFS but also standalone or on MESOS – The general processing capabilities of the Spark engine can be exploited from multiple “entry points”: SQL, Streaming, Machine Learning, Graph Processing
  • 10. 10 © 2015 IBM Corporation Spark in the Hadoop ecosystem  Currently, Spark is a general purpose parallel processing engine which integrates with YARN along the rest of the Hadoop frameworks YARN HDFS Map/ Reduce 2 HivePig Spark HBase BigSQL Impala
  • 11. 11 © 2015 IBM Corporation Future of Spark’s role in Hadoop ?  The Spark Core engine is a good performant replacement for Map Reduce: YARN HDFS Spark Core BigSQL Spark SQL Spark MLlib Spark Streaming Hive Custom code HBase
  • 12. 12 © 2015 IBM Corporation Spark Core API
  • 13. 13 © 2015 IBM Corporation  An RDD is a distributed collection of Scala/Python/Java objects of the same type: – RDD of strings – RDD of integers – RDD of (key, value) pairs – RDD of class Java/Python/Scala objects  An RDD is physically distributed across the cluster, but manipulated as one logical entity: – Spark will “distribute” any required processing to all partitions where the RDD exists and perform necessary redistributions and aggregations as well. – Example: Consider a distributed RDD “Names” made of names Resilient Distributed Dataset (RDD): definition Mokhtar Jacques Dirk Cindy Dan Susan Dirk Frank Jacques Partition 1 Partition 2 Partition 3 Names
  • 14. 14 © 2015 IBM Corporation  Suppose we want to know the number of names in the RDD “Names”  User simply requests: Names.count() – Spark will “distribute” count processing to all partitions so as to obtain: • Partition 1: Mokhtar(1), Jacques (1), Dirk (1)  3 • Partition 2: Cindy (1), Dan (1), Susan (1)  3 • Partition 3: Dirk (1), Frank (1), Jacques (1)  3 – Local counts are subsequently aggregated: 3+3+3=9  To lookup the first element in the RDD: Names.first()  To display all elements of the RDD: Names.collect() (careful with this) Resilient Distributed Dataset: definition Mokhtar Jacques Dirk Cindy Dan Susan Dirk Frank Jacques Partition 1 Partition 2 Partition 3 Names
  • 15. 15 © 2015 IBM Corporation Resilient Distributed Datasets: Creation and Manipulation  Three methods for creation – Distributing a collection of objects from the driver program (using the parallelize method of the spark context) val rddNumbers = sc.parallelize(1 to 10) val rddLetters = sc.parallelize (List(“a”, “b”, “c”, “d”)) – Loading an external dataset (file) val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") – Transformation from another existing RDD val rddNumbers2 = rddNumbers.map(x=> x+1)  Dataset from any storage supported by Hadoop – HDFS, Cassandra, HBase, Amazon S3 – Others  File types supported – Text files, SequenceFiles, Parquet, JSON – Hadoop InputFormat
  • 16. 16 © 2015 IBM Corporation Resilient Distributed Datasets: Properties  Immutable  Two types of operations – Transformations ~ DDL (Create View V2 as…) • val rddNumbers = sc.parallelize(1 to 10): Numbers from 1 to 10 • val rddNumbers2 = rddNumbers.map (x => x+1): Numbers from 2 to 11 • The LINEAGE on how to obtain rddNumbers2 from rddNumber is recorded • It’s a Directed Acyclic Graph (DAG) • No actual data processing does take place  Lazy evaluations – Actions ~ DML (Select * From V2…) • rddNumbers2.collect(): Array [2, 3, 4, 5, 6, 7, 8, 9, 10, 11] • Performs transformations and action • Returns a value (or write to a file)  Fault tolerance – If data in memory is lost it will be recreated from lineage  Caching, persistence (memory, spilling, disk) and check-pointing
  • 17. 17 © 2015 IBM Corporation RDD Transformations  Transformations are lazy evaluations  Returns a pointer to the transformed RDD  Pair RDD (K,V) functions for MapReduce style transformations Transformation Meaning map(func) Return a new dataset formed by passing each element of the source through a function func. filter(func) Return a new dataset formed by selecting those elements of the source on which func returns true. flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items. So func should return a Seq rather than a single item Full documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package join(otherDataset, [numTasks]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. reduceByKey(func) When called on a dataset of (K, V) pairs, returns a dataset of (K,V) pairs where the values for each key are aggregated using the given reduce function func sortByKey([ascendin g],[numTasks]) When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K,V) pairs sorted by keys in ascending or descending order. combineByKey[C}(cr eateCombiner, mergeValue, mergeCombiners)) Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C. createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C)
  • 18. 18 © 2015 IBM Corporation RDD Actions  Actions return values or save an RDD to disk Action Meaning collect() Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of data. count() Return the number of elements in a dataset. first() Return the first element of the dataset. take(n) Return an array with the first n elements of the dataset. foreach(func) Run a function func on each element of the dataset. saveAsTextFile(path) Save the RDD as a text file. Full documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
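The actions in the table can be exercised on a tiny in-memory dataset. This is a sketch under the assumption of a local master; the word list is made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; in spark-shell, `sc` is already provided.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("ActionsSketch"))

// Hypothetical sample data.
val words = sc.parallelize(Seq("spark", "scala", "hadoop", "spark", "mesos"))

val total = words.count()                              // number of elements
val firstWord = words.first()                          // first element of the dataset
val firstTwo = words.take(2)                           // first n elements as an array
val sparkOnly = words.filter(_ == "spark").collect()   // collect after a filter keeps the result small

// foreach runs on the executors; the print order is not guaranteed.
words.foreach(w => println(w))

sc.stop()
```

`collect()` copies the whole dataset to the driver, so it is best kept for results known to be small.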
  • 19. 19 © 2015 IBM Corporation RDD Persistence  Each node stores any partitions of the cache that it computes in memory  Reuses them in other actions on that dataset (or datasets derived from it) – Future actions are much faster (often by more than 10x)  Two methods for RDD persistence: persist() and cache() Storage Level Meaning MEMORY_ONLY Store as deserialized Java objects in the JVM. If the RDD does not fit in memory, part of it will be cached. The rest will be recomputed as needed. This is the default. The cache() method uses this. MEMORY_AND_DISK Same except also store on disk if it doesn’t fit in memory. Read from memory and disk when needed. MEMORY_ONLY_SER Store as serialized Java objects (one byte array per partition). Space efficient, but more CPU intensive to read. MEMORY_AND_DISK_SER Similar to MEMORY_AND_DISK but stored as serialized objects. DISK_ONLY Store only on disk. MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. Same as above, but replicate each partition on two cluster nodes. OFF_HEAP (experimental) Store RDD in serialized format in Tachyon.
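A short sketch of choosing a storage level explicitly. The dataset and the local master are assumptions for illustration; `cache()` would be equivalent to `persist(StorageLevel.MEMORY_ONLY)`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Local context for illustration only; in spark-shell, `sc` is already provided.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("PersistSketch"))

val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)

// Mark the RDD for persistence; nothing is stored until the first action runs.
squares.persist(StorageLevel.MEMORY_AND_DISK)

val n = squares.count()          // first action: computes the partitions and stores them
val sum = squares.reduce(_ + _)  // second action: reuses the stored partitions

val level = squares.getStorageLevel
println(level.description)

// Release the stored partitions once they are no longer needed.
squares.unpersist()
sc.stop()
```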
  • 20. 20 © 2015 IBM Corporation Scala  Scala Crash Course  Holden Karau, DataBricks http://lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course.pdf
  • 21. 21 © 2015 IBM Corporation Code Execution (1) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt  ‘spark-shell’ provides Spark context as ‘sc’
  • 22. 22 © 2015 IBM Corporation Code Execution (2) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt RDD: quotes DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible
  • 23. 23 © 2015 IBM Corporation Code Execution (3) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt RDD: quotes RDD: danQuotes DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible DAN Spark is cool DAN Scala is awesome
  • 24. 24 © 2015 IBM Corporation Code Execution (4) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt RDD: quotes RDD: danQuotes RDD: danSpark DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible DAN Spark is cool DAN Scala is awesome Spark Scala
  • 25. 25 © 2015 IBM Corporation Code Execution (5) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt HadoopRDD DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible RDD: quotes DAN Spark is cool DAN Scala is awesome RDD: danQuotes Spark Scala RDD: danSpark 1
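The walkthrough above can be reproduced without HDFS. In this sketch the five lines of sparkQuotes.txt are placed in an in-memory collection, which is an assumption made so the example runs locally; the transformations and the action are the ones from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; in spark-shell, `sc` is already provided.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("QuotesSketch"))

// Stand-in for sc.textFile("hdfs:/sparkdata/sparkQuotes.txt").
val quotes = sc.parallelize(Seq(
  "DAN Spark is cool",
  "BOB Spark is fun",
  "BRIAN Spark is great",
  "DAN Scala is awesome",
  "BOB Scala is flexible"))

// Transformations: nothing runs yet.
val danQuotes = quotes.filter(_.startsWith("DAN"))          // the two DAN lines
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))   // "Spark", "Scala"

// Action: triggers the whole chain.
val sparkCount = danSpark.filter(_.contains("Spark")).count()
println(sparkCount)  // 1

sc.stop()
```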
  • 26. 26 © 2015 IBM Corporation DataFrames  A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database, an R dataframe or Python Pandas, but in a distributed manner and with query optimizations and predicate pushdown to the underlying storage.  DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.  Released in Spark 1.3 DataBricks / Spark Summit 2015
  • 27. 27 © 2015 IBM Corporation DataFrames Examples // Create the DataFrame val df = sqlContext.read.parquet("examples/src/main/resources/people.parquet") // Show the content of the DataFrame df.show() // Print the schema in a tree format df.printSchema() // root // |-- age: long (nullable = true) // |-- name: string (nullable = true) // Select only the "name" column df.select("name").show() // Select everybody, but increment the age by 1 df.select(df("name"), df("age") + 1).show() // Select people older than 21 df.filter(df("age") > 21).show() // Count people by age df.groupBy("age").count().show()
  • 28. 28 © 2015 IBM Corporation Spark Execution Model
  • 29. 29 © 2015 IBM Corporation Components [Diagram, DataBricks] Your program: sc = new SparkContext; f = sc.textFile("…"); f.filter(…).count() ... Spark client (app master): RDD graph, Scheduler, Block tracker, Shuffle tracker. Cluster manager. Spark worker: Task threads, Block manager. Storage: HDFS, HBase, …
  • 30. 30 © 2015 IBM Corporation Scheduling Process [Diagram, DataBricks] RDD Objects: rdd1.join(rdd2).groupBy(…).filter(…), build operator DAG → (DAG) → DAGScheduler: split graph into stages of tasks, submit each stage as ready, agnostic to operators! → (TaskSet) → TaskScheduler: launch tasks via cluster manager, retry failed or straggling tasks, doesn’t know about stages, reports back "stage failed" → (Task) → Worker: threads execute tasks, block manager stores and serves blocks
  • 31. 31 © 2015 IBM Corporation Scheduler Optimizations  Pipelines narrow ops. within a stage  Picks join algorithms based on partitioning (minimize shuffles)  Reuses previously cached data [Diagram, DataBricks: RDDs A to G connected by groupBy, map, union and join operations and grouped into Stage 1, Stage 2 and Stage 3; each box is a task, shaded boxes are previously computed partitions]
  • 32. 32 © 2015 IBM Corporation Directed Acyclic Graph (DAG)  View the lineage  Could be issued in a continuous line scala> danSpark.toDebugString res1: String = (2) MappedRDD[4] at map at <console>:16 | MappedRDD[3] at map at <console>:16 | FilteredRDD[2] at filter at <console>:14 | hdfs:/sparkdata/sparkQuotes.txt MappedRDD[1] at textFile at <console>:12 | hdfs:/sparkdata/sparkQuotes.txt HadoopRDD[0] at textFile at <console>:12 val danSpark = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt"). filter(_.startsWith("DAN")). map(_.split(" ")). map(x => x(1)). filter(_.contains("Spark")) danSpark.count()
  • 33. 33 © 2015 IBM Corporation Showing Multiple Apps [Diagram, DataBricks: an App's Driver Program (SparkContext) connected to the Cluster Manager and to two Worker Nodes, each running an Executor with Tasks and a Cache]  Each Spark application runs as a set of processes coordinated by the Spark context object (driver program) – Spark context connects to Cluster Manager (standalone, Mesos/Yarn) – Spark context acquires executors (JVM instance) on worker nodes – Spark context sends tasks to the executors
  • 34. 34 © 2015 IBM Corporation Spark Terminology  Context (Connection): – Represents a connection to the Spark cluster. The Application which initiated the context can submit one or several jobs, sequentially or in parallel, batch or interactively, or long running server continuously serving requests.  Driver (Coordinator agent) – The program or process running the Spark context. Responsible for running jobs over the cluster and converting the App into a set of tasks  Job (Query / Query plan): – A piece of logic (code) which will take some input from HDFS (or the local filesystem), perform some computations (transformations and actions) and write some output back.  Stage (Subplan) – Jobs are divided into stages  Tasks (Sub section) – Each stage is made up of tasks. One task per partition. One task is executed on one partition (of data) by one executor  Executor (Sub agent) – The process responsible for executing a task on a worker node  Resilient Distributed Dataset
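The relation between job, stages, tasks and partitions can be observed on a small example. This sketch assumes a local master and arbitrary partition counts (4 and 2) chosen for illustration: the single action starts one job, the shuffle introduced by `reduceByKey` splits it into two stages, and each stage runs one task per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair RDD implicits, required before Spark 1.3

// Local context for illustration only; in spark-shell, `sc` is already provided.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("StagesSketch"))

// 4 partitions: the first stage runs 4 tasks.
val pairs = sc.parallelize(1 to 100, 4).map(x => (x % 10, 1))
val mapSidePartitions = pairs.partitions.length

// reduceByKey needs a shuffle, so a second stage starts here, with 2 tasks.
val counts = pairs.reduceByKey(_ + _, 2)
val reduceSidePartitions = counts.partitions.length

// The indentation in the lineage output marks the shuffle boundary between the stages.
println(counts.toDebugString)

// One action = one job, made of the two stages above.
val countsMap = counts.collect().toMap

sc.stop()
```

While the application runs, the same job, stage and task breakdown is visible in the application web UI.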
  • 35. 35 © 2015 IBM Corporation Spark Shell & Application Deployment
  • 36. 36 © 2015 IBM Corporation Spark’s Scala and Python Shell  Spark comes with two shells – Scala – Python  APIs available for Scala, Python and Java  Appropriate versions for each Spark release  Spark’s native language is Scala, so it is more natural to write Spark applications using Scala.  This presentation will focus on code examples in Scala
  • 37. 37 © 2015 IBM Corporation Spark’s Scala and Python Shell  Powerful tool to analyze data interactively  The Scala shell runs on the Java VM – Can leverage existing Java libraries  Scala: – To launch the Scala shell (from Spark home directory): ./bin/spark-shell – To read in a text file: scala> val textFile = sc.textFile("README.txt")  Python: – To launch the Python shell (from Spark home directory): ./bin/pyspark – To read in a text file: >>> textFile = sc.textFile("README.txt")
  • 38. 38 © 2015 IBM Corporation SparkContext in Applications  The main entry point for Spark functionality  Represents the connection to a Spark cluster  Create RDDs, accumulators, and broadcast variables on that cluster  In the Spark shell, the SparkContext, sc, is automatically initialized for you to use  In a Spark program, import some classes and implicit conversions into your program: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf
  • 39. 39 © 2015 IBM Corporation A Spark Standalone Application in Scala Import statements SparkConf and SparkContext Transformations and Actions
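The three parts named on this slide (import statements, SparkConf and SparkContext, transformations and actions) could look like the following sketch of SimpleApp.scala. It is modelled on the Spark quick start rather than taken from the slide; the helper `countContaining` and the use of the first command-line argument as input path are assumptions made for illustration.

```scala
/* SimpleApp.scala */
// Import statements
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object SimpleApp {
  // Hypothetical helper: counts the lines that contain the given text.
  def countContaining(lines: RDD[String], needle: String): Long =
    lines.filter(line => line.contains(needle)).count()

  def main(args: Array[String]): Unit = {
    // Assumption: the input file path is passed as the first argument.
    val inputPath = args(0)

    // SparkConf and SparkContext; the master is supplied by spark-submit.
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Transformations and actions
    val logData = sc.textFile(inputPath, 2).cache()
    val numAs = countContaining(logData, "a")
    val numBs = countContaining(logData, "b")
    println("Lines with a: " + numAs + ", lines with b: " + numBs)

    sc.stop()
  }
}
```

The input is cached because two actions run over it; without `cache()` the file would be read twice.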
  • 40. 40 © 2015 IBM Corporation Running Standalone Applications  Define the dependencies – Scala: simple.sbt  Create the typical directory structure with the files – Scala: ./simple.sbt ./src ./src/main ./src/main/scala ./src/main/scala/SimpleApp.scala  Create a JAR package containing the application’s code. – Scala: sbt package  Use spark-submit to run the program
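A possible simple.sbt for the dependency step is sketched below. The project name and the Scala and Spark versions are assumptions matching a Spark 1.4.0 / Scala 2.10 installation and should be adapted to the installed release.

```scala
// simple.sbt: build definition for the standalone application.
// All values below are assumptions for a Spark 1.4.0 / Scala 2.10 setup.
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

// Spark core is the only dependency needed for a basic RDD application.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"
```

With these assumed settings, `sbt package` would produce a JAR under target/scala-2.10/, which can then be passed to spark-submit together with `--class SimpleApp` and a `--master` value such as `local[4]`.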
  • 41. 41 © 2015 IBM Corporation Spark Properties  Set application properties via the SparkConf object val conf = new SparkConf() .setMaster("local") .setAppName("CountingSheep") .set("spark.executor.memory", "1g") val sc = new SparkContext(conf)  Dynamically setting Spark properties – SparkContext with an empty conf val sc = new SparkContext(new SparkConf()) – Supply the configuration values during runtime ./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar – conf/spark-defaults.conf  Application web UI http://<driver>:4040
  • 42. 42 © 2015 IBM Corporation Spark Configuration  Three locations for configuration: – Spark properties – Environment variables conf/spark-env.sh – Logging log4j.properties  Override default configuration directory (SPARK_HOME/conf) – SPARK_CONF_DIR • spark-defaults.conf • spark-env.sh • log4j.properties • etc.
  • 43. 43 © 2015 IBM Corporation Spark Monitoring  Three ways to monitor Spark applications 1. Web UI • Default port 4040 • Available for the duration of the application 2. Metrics • Based on the Coda Hale Metrics Library • Report to a variety of sinks (HTTP, JMX, and CSV) • /conf/metrics.properties 3. External instrumentations • Ganglia • OS profiling tools (dstat, iostat, iotop) • JVM utilities (jstack, jmap, jstat, jconsole)
  • 44. 44 © 2015 IBM Corporation Running Spark Examples  Spark samples available in the examples directory  Run the examples (from Spark home directory): ./bin/run-example SparkPi where SparkPi is the name of the sample application
  • 45. 45 © 2015 IBM Corporation Spark Extensions
  • 46. 46 © 2015 IBM Corporation Spark Extensions  Extensions to the core Spark API  Improvements made to the core are passed to these libraries  Little overhead to use with the Spark core from http://spark.apache.org
  • 47. 47 © 2015 IBM Corporation Spark SQL  Processes relational queries expressed in SQL (HiveQL)  Seamlessly mixes SQL queries with Spark programs  In Spark since 1.0, refactored on top of DataFrames since 1.3  Provides a single interface for efficiently working with structured data including Apache Hive, Parquet and JSON files  Leverages Hive frontend and metastore – Compatibility with Hive data, queries and UDFs – HiveQL limitations may apply – Not ANSI SQL compliant – Little to no query rewrite optimization, automatic memory management or sophisticated workload management  Graduated from alpha status with Spark 1.3  Standard connectivity through JDBC/ODBC
  • 48. 48 © 2015 IBM Corporation Spark SQL - Getting Started  SQLContext created from SparkContext // An existing SparkContext, sc val sqlContext = new org.apache.spark.sql.SQLContext(sc)  HiveContext created from SparkContext // An existing SparkContext, sc val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)  Import a library to convert an RDD to a DataFrame – Scala: import sqlContext.implicits._  DataFrame data sources – Inferring the schema using reflection – Programmatic interface
  • 49. 49 © 2015 IBM Corporation Spark SQL - Inferring the Schema Using Reflection  The case class in Scala defines the schema of the table case class Person(name: String, age: Int)  The arguments of the case class become the names of the columns  Create an RDD of Person objects and convert it to a DataFrame val people = sc.textFile("examples/src/main/resources/people.txt"). map(_.split(",")). map(p => Person(p(0), p(1).trim.toInt)).toDF()  Register the DataFrame as a table people.registerTempTable("people")  Run SQL statements using the sql method provided by the SQLContext val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")  The results of the queries are DataFrames and support all the normal RDD operations teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
  • 50. 50 © 2015 IBM Corporation Spark SQL - Programmatic Interface  Use when you cannot define the case classes ahead of time  Three steps to create the DataFrame 1. Load the source RDD, encode the schema as a String and import the Spark SQL Row and Struct types val people = sc.textFile("examples/src/main/resources/people.txt") val schemaString = "name age" import org.apache.spark.sql.Row; import org.apache.spark.sql.types.{StructType,StructField,StringType}; 2. Create the schema represented by a StructType matching the structure of the Rows built in step 3. val schema = StructType( schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true))) 3. Convert the RDD to an RDD of Rows and apply the schema using the createDataFrame method. val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim)) val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)  Then register peopleDataFrame as a table peopleDataFrame.registerTempTable("people")  Run the SQL statements using the sql method: val results = sqlContext.sql("SELECT name FROM people") results.map(t => "Name: " + t(0)).collect().foreach(println)
  • 51. 51 © 2015 IBM Corporation SparkSQL - DataSources Before : Spark 1.2.x  ParquetFile – val parquetFile = sqlContext.parquetFile("people.parquet")  JSON : – val df = sqlContext.jsonFile("examples/src/main/resources/people.json") Spark 1.3.x  Generic Load/Save – val df = sqlContext.load("<filename>", "<datasource type>") – df.save("<filename>", "<datasource type>")  ParquetFile – val df = sqlContext.load("people.parquet") // (parquet unless otherwise configured by spark.sql.sources.default) – df.select("name", "age").save("namesAndAges.parquet")  JSON – val df = sqlContext.load("people.json", "json") – df.select("name", "age").save("namesAndAges.json", "json")  CSV (external package) – val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true")) – df.select("year", "model").save("newcars.csv", "com.databricks.spark.csv") Spark 1.4.x  Generic Load/Save – val df = sqlContext.read.load("<filename>", "<datasource type>") – df.write.save("<filename>", "<datasource type>")  ParquetFile – val df = sqlContext.read.load("people.parquet") // (parquet unless otherwise configured by spark.sql.sources.default) – df.select("name", "age").write.save("namesAndAges.parquet")  JSON – val df = sqlContext.read.load("people.json", "json") – df.select("name", "age").write.save("namesAndAges.json", "json")  CSV (external package) – val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv") – df.select("year", "model").write.format("com.databricks.spark.csv").save("newcars.csv") The DataSource APIs provide generic methods to manage connectors to any datasource (file, jdbc, cassandra, mongodb, etc…). From Spark 1.3, the DataSource APIs provide predicate pushdown capabilities to leverage the performance of the backend. Most connectors are available at http://spark-packages.org/
  • 52. 52 © 2015 IBM Corporation Spark Streaming  Scalable, high-throughput, fault-tolerant stream processing of live data streams  Write Spark streaming applications like Spark applications  Recovers lost work and operator state (sliding windows) out-of-the-box  Uses HDFS and Zookeeper for high availability  Data sources also include TCP sockets, ZeroMQ or other customized data sources
  • 53. 53 © 2015 IBM Corporation Spark Streaming - Internals  The input stream goes into Spark Streaming  Breaks it up into batches of input data  Feeds them into the Spark engine for processing  Generates the final results in streams of batches  DStream - Discretized Stream – Represents a continuous stream of data created from the input streams – Internally, represented as a sequence of RDDs
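The idea that a DStream is internally a sequence of RDDs can be made visible with a queue-backed stream. In this sketch three pre-built RDDs stand in for three batches of live input; the data, the one-second batch interval and the six-second timeout are assumptions chosen for illustration, and the timing-dependent result is why no exact output is promised.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Local context with a 1 second batch interval (illustrative values).
val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamSketch")
val ssc = new StreamingContext(conf, Seconds(1))

// Three pre-built RDDs play the role of three batches of input data.
val queue = mutable.Queue[RDD[Int]](
  ssc.sparkContext.parallelize(1 to 3),
  ssc.sparkContext.parallelize(4 to 6),
  ssc.sparkContext.parallelize(7 to 9))

// By default one queued RDD is consumed per batch interval.
val stream = ssc.queueStream(queue)

// foreachRDD exposes the RDD behind each batch; the sums are kept on the driver.
val batchSums = mutable.ArrayBuffer[Int]()
stream.foreachRDD { rdd =>
  val values = rdd.collect()
  if (values.nonEmpty) batchSums.synchronized { batchSums += values.sum }
}

ssc.start()
ssc.awaitTerminationOrTimeout(6000)  // let a few batch intervals elapse
ssc.stop()

println(batchSums.mkString(", "))
```

Each entry of `batchSums` corresponds to one batch, that is, to one RDD of the DStream.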
  • 54. 54 © 2015 IBM Corporation Spark Streaming - Getting Started  Count the number of words coming in from the TCP socket  Import the Spark Streaming classes import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._  Create the StreamingContext object val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1))  Create a DStream val lines = ssc.socketTextStream("localhost", 9999)  Split the lines into words val words = lines.flatMap(_.split(" "))  Count the words val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _)  Print to the console wordCounts.print()
  • 55. 55 © 2015 IBM Corporation Spark Streaming - Continued  No real processing happens until you tell it ssc.start() // Start the computation ssc.awaitTermination() // Wait for the computation to terminate  Code and application can be found in the NetworkWordCount example  To run the example: – Invoke netcat to start the data stream – In a different terminal, run the application ./bin/run-example streaming.NetworkWordCount localhost 9999
  • 56. 56 © 2015 IBM Corporation Spark MLlib  MLlib is Spark’s machine learning library  Since Spark 0.8  Provides common algorithms and utilities • Classification • Regression • Clustering • Collaborative filtering • Dimensionality reduction  Leverages in-memory cache of Spark to speed up iteration processing
  • 57. 57 © 2015 IBM Corporation Spark MLlib - Getting Started  Use k-means clustering for a set of latitudes and longitudes  Import the Spark MLlib classes import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.mllib.linalg.Vectors  Create the SparkContext object val conf = new SparkConf().setAppName("KMeans") val sc = new SparkContext(conf)  Create a data RDD val taxifile = sc.textFile("user/spark/sparkdata/nyctaxisub/*")  Create Vectors for input to algorithm val taxi = taxifile.map{line=>Vectors.dense(line.split(",").slice(3,5).map(_.toDouble))}  Run the k-means algorithm with 3 clusters and 10 iterations val model = KMeans.train(taxi,3,10) val clusterCenters = model.clusterCenters.map(_.toArray)  Print to the console clusterCenters.foreach(lines=>println(lines(0),lines(1)))
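For trying k-means without the taxi data set, here is a self-contained sketch. The six (latitude, longitude) points, the choice of 2 clusters and the local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Local context for illustration only; in spark-shell, `sc` is already provided.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("KMeansSketch"))

// Hypothetical (latitude, longitude) points forming two well-separated groups.
val points = sc.parallelize(Seq(
  Vectors.dense(40.70, -74.00), Vectors.dense(40.71, -74.01), Vectors.dense(40.72, -74.02),
  Vectors.dense(48.85, 2.35), Vectors.dense(48.86, 2.34), Vectors.dense(48.87, 2.36))).cache()

// k = 2 clusters, at most 10 iterations.
val model = KMeans.train(points, 2, 10)

// Assign a point from each group to its nearest cluster center.
val clusterOfFirst = model.predict(Vectors.dense(40.70, -74.00))
val clusterOfLast = model.predict(Vectors.dense(48.87, 2.36))

val centers = model.clusterCenters.map(_.toArray)
centers.foreach(c => println(c(0) + ", " + c(1)))

sc.stop()
```

The input RDD is cached because k-means iterates over it several times, which is where Spark's in-memory cache helps.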
  • 58. 58 © 2015 IBM Corporation SparkML  SparkML provides an API to build ML pipelines (since Spark 1.3)  Similar to Python scikit-learn  SparkML provides abstractions for all steps of an ML workflow Generic ML Workflow Real Life ML Workflow  Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.  Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a dataset and produces a model.  Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.  Param: All Transformers and Estimators now share a common API for specifying parameters. Xebia HUG France 06/2015
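The four concepts above fit together as in the following sketch, modelled on the text classification example of the Spark ML guide. The tiny training set, the column names and the parameter values are assumptions for illustration, and the API shown is the SQLContext-based one of the Spark 1.4 era.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Local context for illustration only; in spark-shell, `sc` and `sqlContext` are already provided.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("PipelineSketch"))
val sqlContext = new SQLContext(sc)

// Hypothetical labelled documents: label 1.0 when the text mentions spark.
val training = sqlContext.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Transformers: text -> words -> feature vectors.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol).setOutputCol("features")

// Estimator: a learning algorithm, configured through Params.
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

// Pipeline: chains the stages; fit() returns a PipelineModel, itself a Transformer.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

// Apply the fitted pipeline to unlabelled documents.
val test = sqlContext.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n")
)).toDF("id", "text")
val predicted = model.transform(test).select("id", "text", "prediction").collect()
predicted.foreach(println)

sc.stop()
```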
  • 59. 59 © 2015 IBM Corporation Spark GraphX  Flexible Graphing –GraphX unifies ETL, exploratory analysis, and iterative graph computation –You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms with the API  Speed –Comparable performance to the fastest specialized graph processing systems.  Algorithms –Choose from a growing library of graph algorithms –In addition to a highly flexible API, GraphX comes with a variety of graph algorithms
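A minimal sketch of viewing the same data as a graph and as collections. The three users, the "follows" edges and the PageRank tolerance are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Local context for illustration only; in spark-shell, `sc` is already provided.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("GraphXSketch"))

// Hypothetical data: users as vertices, "follows" relations as edges.
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "ann"), (2L, "bob"), (3L, "cid")))
val edges: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(1L, 3L, "follows")))

val graph = Graph(vertices, edges)
val numEdges = graph.numEdges

// Graph view: number of incoming edges per vertex (vertices without any are omitted).
val inDegrees = graph.inDegrees.collect().toMap

// Collection view: triplets are an ordinary RDD that can be mapped like any other.
val facts = graph.triplets.map(t => t.srcAttr + " " + t.attr + " " + t.dstAttr).collect()
facts.foreach(println)

// Built-in algorithm: PageRank run until the given tolerance is reached.
val ranks = graph.pageRank(0.001).vertices.collect()

sc.stop()
```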
  • 60. 60 © 2015 IBM Corporation Spark R  Spark R is an R package that provides a light-weight front-end to use Apache Spark from R  Spark R exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster.  Goal – Make Spark R production ready – Integration with MLlib – Consolidations to the DataFrames and RDD concepts  First release in Spark 1.4.0 : – Support of DataFrames  Spark 1.5 – Support of MLlib
  • 61. 61 © 2015 IBM Corporation Spark internals refactoring : Project Tungsten  Memory Management and Binary Processing: leverage application semantics to manage memory explicitly and eliminate the overhead of JVM object model and garbage collection  Cache-aware computation: algorithms and data structures to exploit memory hierarchy  Code generation: exploit modern compilers and CPUs: allow efficient operation directly on binary data DataBricks / Spark Summit 2015
  • 62. 62 © 2015 IBM Corporation Spark: Final Thoughts  Spark is a good replacement for MapReduce – Higher performance – Framework is easier to use than MapReduce (M/R) – Powerful RDD & DataFrames concepts – Rich higher-level libraries : SparkSQL, MLlib/ML, Streaming, GraphX – Broad ecosystem adoption  This is a very fast paced environment, so keep up ! – Lots of new features in each new release (a major release every 3 months) – Spark has the latest / best offer but things may change again
  • 63. 63 © 2015 IBM Corporation Resources  The Learning Spark O’Reilly book  Lab(s) this afternoon  The following course on big data university