Apache Spark :
What? Why? When?
Massimo Schenone
Sr Consultant
Big Data Scenario
Data is growing faster than
computation speeds
⇾ Web apps, mobile, social media,
scientific, …
Requires large clusters to analyze
Programming clusters is hard
⇾ Failures, placement, load balancing
Challenges of Data Science
«The vast majority of work that goes into conducting successful
analyses lies in preprocessing data. Data is messy, and cleansing,
munging, fusing, mushing, and many other verbs are prerequisites
to doing anything useful with it.»
«Iteration is a fundamental part of data science. Modeling and
analysis typically require multiple passes over the same data.»
Advanced Analytics with Spark
Sandy Ryza, Uri Laserson, Sean Owen & Josh Wills
Overview
The Story of Today
Motivations Internals Deploy SQL Streaming
If you are immune to boredom, there is literally nothing you cannot accomplish.
—David Foster Wallace
What is Apache Spark?
Apache Spark is a cluster computing platform designed to
be fast, easy to use and general-purpose.
Run workloads 100x faster
df = spark.read.json("logs.json")
df.where("age > 21")
.select("name.first").show()
Write applications quickly in Java, Scala,
Python, R, and SQL.
Spark's Python DataFrame API
Combine SQL, streaming,
and complex analytics.
Project Goals
Extend the MapReduce model to better support two
common classes of analytics apps:
● Iterative algorithms (machine learning, graphs)
● Interactive data mining
Enhance programmability:
● Integrate into the Scala programming language
● Allow interactive use from the Scala interpreter
Matei Zaharia, Spark project creator
Performance and Productivity
MapReduce Data Flow
● map function: processes data and generates a set of intermediate
key/value pairs.
● reduce function: merges all intermediate values associated with the
same intermediate key.
MapReduce: Word Count
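To make the two phases concrete, here is a minimal word-count sketch written with plain Scala collections rather than the Hadoop API; the input lines are made up for illustration.

// "map" phase: each line is turned into (word, 1) pairs
val lines = List("to be or not to be", "to spark or not to spark")
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// "reduce" phase: all values sharing the same intermediate key are merged
val counts = pairs.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
// counts: Map(to -> 4, be -> 2, or -> 2, not -> 2, spark -> 2)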
MapReduce Execution Model
Motivations to move forward
● MapReduce greatly simplified big data analysis on large, unreliable clusters
● It provides fault-tolerance, but also has drawbacks:
– the M/R programming model was not designed for complex operations
– iterative computation: hard to reuse intermediate results across multiple
computations
– efficiency: the only way to share data across jobs is stable storage, which
is slow
Solution
● Extends MapReduce with more operators
● Support for advanced data flow graphs.
● In-memory and out-of-core processing.
The Scala Programming Language
Scala combines object-oriented and functional programming in one concise, high-level language.
Scala Crash Course
// Simple function
def sum(a:Int, b:Int) = a+b
// Higher-order function
def calc(a:Int, b:Int, op: (Int, Int) => Int) = op(a,b)
// Passing function sum as argument
calc(3, 5, sum) // res: Int = 8
// Passing an inlined function
calc(3, 5, _ + _) // res: Int = 8
Scala Crash Course (cont.)
// Tuples
// Immutable lists
val captainStuff = ("Picard", "Enterprise-D", "NCC-1701-D")
//> captainStuff : (String, String, String) = ...
// Lists
// Like a tuple with more functionality, but it cannot hold items of different types.
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine") //> shipList : List[String]
// Access individual members using () with ZERO-BASED index
println(shipList(1)) //> Defiant
// Let's apply a function literal to a list. map() can be used to apply any function to every item in a collection.
val backwardShips = shipList.map( (ship: String) => {ship.reverse} )
//> backwardShips : List[String] = List(esirpretnE, tnaifeD, regayoV, eniN ecapS peeD)
for (ship <- backwardShips) { println(ship) } //> esirpretnE
//| tnaifeD
//| regayoV
//| eniN ecapS peeD
Scala Crash Course (cont.)
// reduce() can be used to combine together all the items in a collection using some function.
val numberList = List(1, 2, 3, 4, 5) //> numberList : List[Int] = List(1, 2, 3, 4, 5)
val sum = numberList.reduce( (x: Int, y: Int) => x + y )
//> sum : Int = 15
println(sum) //> 15
// filter() can remove stuff you don't want. Here we'll introduce wildcard syntax while we're at it.
val iHateFives = numberList.filter( (x: Int) => x != 5 )
//> iHateFives : List[Int] = List(1, 2, 3, 4)
val iHateThrees = numberList.filter(_ != 3) //> iHateThrees : List[Int] = List(1, 2, 4, 5)
And in the End it was ...
Spark Core
The Spark Core itself has two parts:
● A Computation engine which provides some basic functionalities
like memory management, task scheduling, fault recovery and most
importantly interacting with the cluster manager and storage system
(HDFS, Amazon S3, Google Cloud storage, Cassandra, Hive, etc.)
● Spark Core APIs (available in Scala, Python, Java, and R):
– Unstructured APIs : RDDs, Accumulators and Broadcast variables
– Structured APIs : DataFrames and DataSets
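As a taste of the unstructured shared-variable APIs mentioned above, here is a minimal sketch (assuming a spark-shell session where sc is already available): an accumulator is a write-only counter that tasks update and the driver reads back, while a broadcast variable ships a read-only value to every executor once.

val badRecords = sc.longAccumulator("badRecords")
val countryNames = sc.broadcast(Map("IT" -> "Italy", "DE" -> "Germany"))

sc.parallelize(Seq("IT", "XX", "DE")).foreach { code =>
  if (!countryNames.value.contains(code)) badRecords.add(1)  // updated inside tasks
}
badRecords.value  // read back on the driver: 1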
Resilient Distributed Datasets (RDDs)
A distributed memory abstraction
● Immutable collections of objects spread across a cluster
● An RDD is divided into a number of partitions,
which are atomic pieces of information
● Built through parallel transformations from:
– data in stable storage (fs, HDFS, S3, via JDBC, etc.)
– existing RDDs
RDD Operators
Higher-order functions:
● Transformations: lazy operators that create new RDDs ( map, filter,
groupBy, join, etc.). Their result RDD is not immediately computed.
● Actions: launch a computation and return a value (non-RDD) to
the program or write data to the external storage ( count, take,
collect, save, etc.). Data is sent from executors to the driver.
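A minimal sketch of the distinction, assuming a spark-shell session and a hypothetical file.txt: nothing runs until the first action is invoked.

val lines = sc.textFile("file.txt")              // transformation: nothing is read yet
val errors = lines.filter(_.contains("ERROR"))   // transformation: still lazy
val nErrors = errors.count()                     // action: a job is launched
val firstFive = errors.take(5)                   // action: results are sent to the driver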
Creating RDDs
The SparkContext is our handle to the Spark cluster. It defines a handful
of methods which can be used to create and populate an RDD:
● Turn a collection into an RDD
● Load text file from local FS, HDFS, or S3
val rdd = sc.parallelize(Array(1, 2, 3))
val a = sc.textFile("file.txt")
val b = sc.textFile("directory/*.txt")
val c = sc.textFile("hdfs://namenode:9000/path/file")
RDD Transformations - map
● Passing each element through a function
● All items are independently processed.
val nums = sc.parallelize(Array(1, 2, 3))
val squares = nums.map(x => x * x)
// {1, 4, 9}
RDD Transformations - groupBy
● Pairs with identical key are grouped.
● Groups are independently processed.
val schools = sc.parallelize(Seq(("sics", 1), ("kth", 1), ("sics", 2)))
schools.groupByKey()
// {("sics", (1, 2)), ("kth", (1))}
schools.reduceByKey((x, y) => x + y)
// {("sics", 3), ("kth", 1)}
Basic RDD Actions
● Return all the elements of the RDD as an array.
● Return an array with the first n elements of the RDD.
● Return the number of elements in the RDD.
val nums = sc.parallelize(Array(1, 2, 3))
nums.collect() // Array(1, 2, 3)
nums.take(2) // Array(1, 2)
nums.count() // 3
Fault Tolerance
Transformations on RDDs are represented as a lineage graph, a DAG
representing the computations done on the RDD.
The RDD itself contains all the dependency information needed to recreate each of its partitions.
val rdd = sc.textFile(...)
val filtered = rdd.map(...).filter(...)
val count = filtered.count()
val reduced = filtered.reduce(...)
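The lineage can be inspected from the shell with toDebugString; a small sketch follows (output abbreviated and indicative only).

val filtered = sc.textFile("file.txt").map(_.toLowerCase).filter(_.contains("error"))
println(filtered.toDebugString)
// (2) MapPartitionsRDD[3] at filter ...
//  |  MapPartitionsRDD[2] at map ...
//  |  file.txt MapPartitionsRDD[1] at textFile ...
//  |  file.txt HadoopRDD[0] at textFile ...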
Checkpointing
● It prevents the RDD lineage graph from growing too large.
● RDD is saved to a file inside the checkpointing directory.
● All references to its parent RDDs are removed.
● Done lazily, saved to disk the first time it is computed.
● You can force it: rdd.checkpoint()
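A minimal sketch, assuming a spark-shell session and a hypothetical HDFS path: a checkpoint directory must be set before checkpoint() is called, and the data is written when the first action runs.

sc.setCheckpointDir("hdfs://namenode:9000/checkpoints")
val cleaned = sc.textFile("file.txt").map(_.trim).filter(_.nonEmpty)
cleaned.checkpoint()   // marked for checkpointing (lazy)
cleaned.count()        // first action materializes the RDD and saves it to the directory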
Caching
There are many ways to configure how the data is persisted:
● MEMORY_ONLY (default): in memory as regular Java objects (just like a regular Java
program - least-recently-used partitions are evicted when memory runs short)
● DISK_ONLY: on disk as regular Java objects
● MEMORY_ONLY_SER: in memory as serialized Java objects (more compact, since it uses byte arrays)
● MEMORY_AND_DISK: both in memory and on disk (spills over to disk to avoid re-computation)
● MEMORY_AND_DISK_SER: in memory and on disk as serialized Java objects (more compact, since it uses byte arrays)
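A short sketch of picking a storage level explicitly (cache() is shorthand for persist() with MEMORY_ONLY); the log path is hypothetical.

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://namenode:9000/logs/*.log")
logs.persist(StorageLevel.MEMORY_AND_DISK)     // spill to disk instead of recomputing
logs.count()                                   // first action fills the cache
logs.filter(_.contains("ERROR")).count()       // reuses the persisted partitions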
RDD Transformations (revisited)
● Narrow Dependencies: an output RDD has partitions that originate from a
single partition in the parent RDD (e.g. map, filter)
● Wide Dependencies: the data required to compute the records in a single
partition may reside in many partitions on the parent RDD (e.g. groupByKey,
reduceByKey)
Spark Application Tree
Spark groups narrow transformations into a single stage; this is called pipelining.
At a high level, one stage can be
thought of as the set of computations
(tasks) that can each be computed on
one executor without communication
with other executors or with the driver.
Stage boundaries
# stage 0
counts = (sc.textFile("/path/to/input/")
    .flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    # stage 1
    .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("/path/to/output/")
In general, a new stage begins whenever network
communication between workers is required
(for instance, in a shuffle).
Spark Programming Model
● Spark expresses computation by defining RDDs
● Based on parallelizable operators: higher-order functions
that execute user-defined functions in parallel
● A data flow is composed of any number of data sources
and operators:
How do RDDs evolve into tasks?
Spark Execution Model
An application maps to a single driver process and a set of executor
processes distributed across the hosts in a cluster.
The executors are responsible for
performing work, in the form of tasks,
as well as for storing any data.
Invoking an action triggers the
launch of a job to fulfill it.
A stage is a collection of tasks
that run the same code, each on
a different subset of the data.
Parallelism
more partitions = more parallelism
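A small sketch of how partition counts can be inspected and changed (the path and numbers are illustrative):

val rdd = sc.textFile("hdfs://namenode:9000/path/file", minPartitions = 8)
rdd.getNumPartitions              // number of tasks a stage over this RDD will run
val wider = rdd.repartition(16)   // full shuffle, increases parallelism
val narrower = rdd.coalesce(4)    // merges partitions, avoids a full shuffle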
How to run Spark
● Interactive Mode: spark-shell or Spark Notebook
● Batch Mode: spark-submit
Deployment Modes:
● Local
● Standalone
● YARN Cluster
● Mesos Cluster
● Kubernetes
Runs Everywhere
Interactive Spark application
$ bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.1.137:4041
Spark context available as 'sc' (master = local[*], app id = local-1534089873554).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@281963c
scala>
Standalone Application
You need to import the Spark packages in your program and create a
SparkContext (driver program):
● Initializing Spark in Python
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf = conf)
● Initializing Spark in Scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local").setAppName("My App")
val sc = new SparkContext(conf)
Batch Mode
spark-submit
--master MASTER_URL (spark://host:port, mesos://host:port, yarn, or local)
--deploy-mode DEPLOY_MODE (client or cluster)
--name NAME (name of the application)
--jars JARS (list of jars to be added to the classpath of the driver and executors)
--conf PROP=VALUE (Spark configuration properties)
--driver-memory MEM (memory for the driver program, e.g. 300M or 1G)
--executor-memory MEM (memory per executor, e.g. 500M or 2G)
--driver-cores NUM (cores for the driver – applicable to YARN and standalone)
--executor-cores NUM (cores per executor – applicable to YARN and standalone)
--num-executors NUM (number of executors to launch, default: 2)
Launch the application via the spark-submit command:
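A hedged example invocation on YARN (the jar, main class and arguments below are hypothetical):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name "My App" \
  --class com.example.MyApp \
  --driver-memory 1G \
  --executor-memory 2G \
  --executor-cores 2 \
  --num-executors 4 \
  myapp.jar /path/to/input /path/to/output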
Spark job running on HDFS
Spark running on Hadoop
Advantages of running on Hadoop
● YARN resource manager, which takes responsibility for
scheduling tasks across available nodes in the cluster.
● Hadoop Distributed File System, which stores data when
the cluster runs out of free memory, and which persistently
stores historical data when Spark is not running.
● Disaster Recovery capabilities, inherent to Hadoop, which
enable recovery of data when individual nodes fail.
● Data Security, which becomes increasingly important as
Spark tackles production workloads in regulated industries
such as healthcare and financial services. Projects like
Apache Knox and Apache Ranger offer data security
capabilities that augment Hadoop.
SCALING
Thanks to Holden Karau and her lesson in unintended consequences.
Scaling tips
● Spark only “understands” the program up to the point where an action
occurs (it is not a compiler)
● If you are going to re-use an RDD, it is better to cache it in memory or
persist it at another level (MEMORY_AND_DISK, ..)
● In a shared environment checkpointing can help
● Persist before checkpointing (gets rid of the lineage)
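A minimal sketch of the last tip: persisting before checkpointing avoids computing the RDD twice, once for the action and once when the checkpoint is written (the parsing step is a placeholder).

val parsed = sc.textFile("file.txt").map(line => line.split(","))  // placeholder parsing
parsed.persist()
parsed.checkpoint()
parsed.count()   // computed once; the checkpoint writer reads from the cache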
Key-Value Data Scaling
● What does the distribution of keys look like?
● What type of aggregations do we need to do?
● What’s the partition structure?
● ...
Key Skew
Keys not evenly distributed (e.g. zip code, null values)
● Straggler: a task which takes much longer to complete than the other ones
● The function groupByKey groups all of the records with the same key into a
single record. When called all the key-value pairs are shuffled around (data is
transferred over the network). By default, Spark uses hash partitioning to
determine which key-value pair should be sent to which machine.
● Spark flushes the data to disk one key at a time, so a single key together with all of its
grouped values can be too big to fit in memory (OOM)
● If we have enough key skew, sortByKey will explode too. Partitioning by a skewed key can
put most of the records in the same partition.
groupByKey vs reduceByKey
val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD
.reduceByKey(_ + _)
.collect()
val wordCountsWithGroup = wordPairsRDD
.groupByKey()
.map(t => (t._1, t._2.sum))
.collect()
groupByKey vs reduceByKey
By reducing the dataset first, the amount of
data sent over the network during the shuffle
is greatly reduced
Shuffle explosion (sortByKey)
All examples come from Holden Karau’s lessons in unintended consequences.
Shuffle “caching”
● Spark keeps shuffle files so that they can be re-used
● References to shuffle files live in the driver program’s memory until a GC is triggered
● You need to trigger a GC event to free the memory, or call the API function
to clean up shuffle files
● Enable off-heap memory: shuffle data structures are allocated outside the JVM heap
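A sketch of the off-heap setting expressed as Spark configuration (values are illustrative; the same keys can also be passed with --conf to spark-submit):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")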
Summary
Spark offers an innovative, efficient model of parallel computing that centers on
lazily evaluated, immutable, distributed datasets, known as RDDs.
RDD methods can be used without any knowledge of their implementation - but
having an understanding of the details will help you write more performant code.
Structured vs Unstructured Data
Spark and RDDs don't know anything about the schema of the data
they're dealing with.
Spark vs Databases
In Spark:
● we do functional transformations
on data
● we pass user-defined function
literals to higher-order functions
like map, flatMap, filter
In Database/Hive:
● we do declarative transformations
on data
● Specialized, structured, pre-defined
operations
e.g. SELECT … FROM … WHERE …
Spark SQL: DataFrames, Datasets
Like RDDs, DataFrames and Datasets represent distributed collections,
with additional schema information not found in RDDs.
This additional schema information is used to provide a more efficient
storage layer (Tungsten) and by the optimizer (Catalyst) to perform
additional optimizations.
● DataFrames and Datasets have a specialized representation and
columnar cache format.
● Instead of specifying arbitrary functions, which the optimizer is unable to
introspect, you use a restricted expression syntax so the optimizer can
have more information.
DataFrames
A DataFrame is a distributed collection of data organized into named columns.
DataFrames can be created from different data sources such as:
• existing RDDs
• structured data files
• JSON datasets
• Hive tables
• external databases (via JDBC)
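A short sketch of a few of these sources, assuming a SparkSession named spark (as in the shell) and hypothetical file, table and case class names:

case class Employee(id: Int, fname: String, lname: String, age: Int, city: String)
import spark.implicits._

val fromJson = spark.read.json("employees.json")      // structured data file
val fromRdd = sc.parallelize(Seq(Employee(12, "Joe", "Smith", 38, "New York"))).toDF()
val fromJdbc = spark.read.format("jdbc")              // external database
  .option("url", "jdbc:postgresql://dbhost/hr")
  .option("dbtable", "employees")
  .load()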
DataFrame API example
val employeeDF = sc.textFile(...).map(...).toDF("id", "fname", "lname", "age", "city")
employeeDF.show()
// employeeDF:
// +---+-----+-------+---+--------+
// | id|fname| lname |age| city |
// +---+-----+-------+---+--------+
// | 12| Joe| Smith| 38|New York|
// |563|Sally| Owens| 48|New York|
// |645|Slate|Markham| 28| Sydney|
// |221|David| Walker| 21| Sydney|
// +---+-----+-------+---+--------+
val sydneyEmployeesDF = employeeDF
.select("id", "lname")
.where("city = 'Sydney'")
.orderBy("id")
// sydneyEmployeesDF:
// +---+-------+
// | id| lname|
// +---+-------+
// |221| Walker|
// |645|Markham|
// +---+-------+
RDD versus DataFrame storage size
DataFrames vs DataSets
DataFrames
● Relational flavour
● Lack of compile-time type
checking
● DataFrames are a specialized
version of Datasets that operate
on generic Row objects
DataSets
● Mix of relational and functional
transformations
● Compile-time type checking
● Can be used when you know the
type information at compile time
● Datasets can be easily converted
to/from DataFrames and RDDs
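A minimal sketch of moving between the two, reusing the hypothetical Employee case class and employeeDF from the DataFrame example:

import spark.implicits._

val employeeDS = employeeDF.as[Employee]        // DataFrame -> Dataset[Employee]
val adults = employeeDS.filter(_.age > 21)      // functional style, checked at compile time
val backToDF = adults.toDF()                    // Dataset -> DataFrame
val fromRDD = sc.parallelize(Seq(Employee(7, "Ann", "Lee", 30, "Rome"))).toDS()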
DataFrames/DataSets vs RDDs
DataFrames/DataSets
● Catalyst Optimizer
● Efficient storage format
● Restrict subset of data types
● DataFrames are not strongly
typed
● Dataset API is continuing to
evolve
RDDs
● Unstructured data
● Wider variety of data types
● Not primarily relational
transformations
● Number of partitions needed for
different parts of your pipeline
changes
User-Defined Functions and Aggregate
Functions (UDFs, UDAFs)
User-defined functions and user-defined aggregate functions provide you with
ways to extend the DataFrame and SQL APIs with your own custom code while
keeping the Catalyst optimizer.
If most of your work is in Python but you want to access some UDFs without
the performance penalty, you can write your UDFs in Scala and register them
for use in Python.
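A minimal sketch of a Scala UDF used from both the DataFrame and SQL APIs, assuming the employeeDF from the earlier example and registering it as a temporary view:

import org.apache.spark.sql.functions.{udf, col}

val initial = udf((s: String) => s.take(1))                       // plain Scala function as a UDF
employeeDF.select(col("fname"), initial(col("fname")).as("initial")).show()

spark.udf.register("initial", (s: String) => s.take(1))           // register for SQL
employeeDF.createOrReplaceTempView("employees")
spark.sql("SELECT fname, initial(fname) AS initial FROM employees").show()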
Physical Execution Comparison
Spark Streaming
Spark Streaming is an extension of the core Spark API that makes it easy
to build fault-tolerant processing of real-time data streams.
It works by dividing the live stream of data into batches (called micro-
batches) of a pre-defined interval (N seconds) and then treating each
batch of data as a RDD.
Each RDD contains only a little chunk of incoming data.
Spark Streaming
With Spark Streaming’s micro-batch approach, we can use other
Spark libraries (core, ML, SQL) with the Spark Streaming API in the
same application.
DStream
DStream (short for “discretized stream”) is the basic abstraction in
Spark Streaming and represents a continuous stream of data.
Internally, a DStream is represented as a sequence of RDD objects:
Similar to the transformation and action operations on RDDs,
DStreams support the following operations: map, flatMap, filter,
count, reduce, countByValue, reduceByKey, join, updateStateByKey
Netcat Streaming Example
import org.apache.spark.streaming.{StreamingContext, Seconds}
val ssc = new StreamingContext(sc, Seconds(10))
// This listens to log data sent into port 9999, ten seconds at a time
val lines = ssc.socketTextStream("localhost", 9999)
// Wordcount
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
// You need to kick off the job explicitly
ssc.start()
ssc.awaitTermination()
Netcat Streaming Example
...
-------------------------------------------
Time: 1535630570000 ms
-------------------------------------------
(how,1)
(into,1)
(go,1)
(what,1)
(program,,1)
(want,1)
(looks,1)
(program,1)
(Spark,2)
(a,4)
...
-------------------------------------------
Time: 1535630580000 ms
-------------------------------------------
[sparkuser@horst ~]$ nc -lk 9999
...
Spark Streaming is a special SparkContext
that you can use for processing data quickly
in near-time. It’s similar to the standard
SparkContext, which is geared toward batch
operations. Spark Streaming uses a little
trick to create small batch windows (micro
batches) that offer all of the advantages of
Spark: safe, fast data handling and lazy
evaluation combined with real-time
processing. It’s a combination of both batch
and interactive processing.
...
Twitter Example
val ssc = new StreamingContext(conf, Seconds(1))
// Get a Twitter stream and extract just the messages themselves
val tweets = TwitterUtils.createStream(ssc, None)
val statuses = tweets.map(_.getText())
// Create a new DStream that has every individual word as its own entry
val tweetwords = statuses.flatMap(_.split(" "))
// Eliminate anything that's not a hashtag
val hashtags = tweetwords.filter(_.startsWith("#"))
// Convert RDD to key/value pairs
val hashtagKeyValues = hashtags.map(hashtag => (hashtag, 1))
// Count up the results over a sliding window
val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow(_ + _, _ - _, Seconds(300), Seconds(1))
// Sort and output the results
val sortedResults = hashtagCounts.transform(rdd => rdd.sortBy(x => x._2, false))
sortedResults.print()
Real Use Cases
• Uber, the ride-sharing service, uses Spark Streaming in their continuous-
streaming ETL pipeline to collect terabytes of event data every day from their
mobile users for real-time telemetry analysis.
• Pinterest uses Spark Streaming, MemSQL, and Apache Kafka technologies
to provide real-time insight into how their users are engaging with pins across
the globe.
• Netflix uses Kafka and Spark Streaming to build a real-time online movie
recommendation and data-monitoring solution that processes billions of
events received per day from different data sources.
Conclusions
A lightning fast cluster
computing framework
Apache Spark can help you to address the challenges of Data Science….
A unified engine supporting diverse workloads &
environments. Fault-tolerant and Scalable.
From simple ETL to complex
Machine Learning jobs
You won’t be a Spark superhero, but...
Thanks!
mschenone@sorint.it
More Related Content

What's hot

No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionChetan Khatri
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Sparkjlacefie
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xincaidezhi655
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to sparkJavier Arrieta
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsZubair Nabi
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Martin Zapletal
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMartin Zapletal
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Rohit Agrawal
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...CloudxLab
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerSachin Aggarwal
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Sparkdatamantra
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...CloudxLab
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Robert Metzger
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 

What's hot (20)

No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to spark
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce Applications
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 
Distributed computing with spark
Distributed computing with sparkDistributed computing with spark
Distributed computing with spark
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Spark
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 

Similar to Apache Spark: What? Why? When?

Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEPaco Nathan
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Sameer Farooqui
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkReynold Xin
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkVincent Poncet
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemAdarsh Pannu
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Updatevithakur
 

Similar to Apache Spark: What? Why? When? (20)

Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Spark core
Spark coreSpark core
Spark core
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating System
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
 

Recently uploaded

Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 

Recently uploaded (20)

Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 

Apache Spark: What? Why? When?

  • 1. Apache Spark : What? Why? When? Massimo Schenone Sr Consultant
  • 2. Big Data Scenario Data is growing faster than computation speeds ⇾ Web apps, mobile, social media, scientific, … Requires large clusters to analyze Programming clusters is hard ⇾ Failures, placement, load balancing
  • 3. Challenges of Data Science «The vast majority of work that goes into conducting successful analyses lies in preprocessing data. Data is messy, and cleansing, munging, fusing, mushing, and many other verbs are prerequisites to doing anything useful with it.» «Iteration is a fundamental part of data science. Modeling and analysis typically require multiple passes over the same data.» Advanced Analytics with Spark Sandy Riza, Uri Laserson, Sean Owen & Josh Wills
  • 4. Overview The Story of Today Motivations Internals Deploy SQL Streaming If you are immune to boredom, there is literally nothing you cannot accomplish. —David Foster Wallace
  • 5. What is Apache Spark? Apache Spark is a cluster computing platform designed to be fast, easy to use and general-purpose. Run workloads 100x faster df = spark.read.json("logs.json") df.where("age > 21") .select("name.first").show() Write applications quickly in Java, Scala, Python, R, and SQL. Spark's Python DataFrame API Combine SQL, streaming, and complex analytics.
  • 6. Project Goals Extend the MapReduce model to better support two common classes of analytics apps: ● Iterative algorithms (machine learning, graphs) ● Interactive data mining Enhance programmability: ● Integrate into Scala programming language ● Allow interactive use from Scala interpreter Matei Zaharia, Spark project creator Performance and Productivity
  • 7. MapReduce Data Flow ● map function: processes data and generates a set of intermediate key/value pairs. ● reduce function: merges all intermediate values associated with the same intermediate key.
  • 10. Motivations to move forward ● MapReduce greatly simplified big data analysis on large, unreliable clusters ● It provides fault-tolerance, but also has drawbacks: – M/R programming model has not been designed for complex operations – iterative computation: hard to reuse intermediate results across multiple computations – efficiency: the only way to share data across jobs is stable storage, which is slow
  • 11. Solution ● Extends MapReduce with more operators ● Support for advanced data flow graphs. ● In-memory and out-of-core processing.
  • 12. The Scala Programming Language Scala combines object-oriented and functional programming in one concise, high-level language.
  • 13. Scala Crash Course // Simple function def sum(a:Int, b:Int) = a+b // High Order Function def calc(a:Int, b:Int, op: (Int, Int) => Int) = op(a,b) // Passing function sum as argument calc(3, 5, sum) // res: Int = 8 // Passing an inlined function calc(3, 5, _ + _) // res: Int = 8
  • 14. Scala Crash Course (cont.) // Tuples // Immutable lists val captainStuff = ("Picard", "Enterprise­D", "NCC­1701­D") //> captainStuff : (String, String, String) = ... // Lists // Like a tuple with more functionality, but it cannot hold items of different types. val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine") //> shipList : List[String] // Access individual members using () with ZERO­BASED index println(shipList(1)) //> Defiant // Let's apply a function literal to a list. map() can be used to apply any function to every item in a collection. val backwardShips = shipList.map( (ship: String) => {ship.reverse} ) //> backwardShips : List[String] = ... //| pS peeD) for (ship <­backwardShips) { println(ship) } //> esirpretnE //| tnaifeD //| regayoV //| eniN ecapS peeD
  • 15. Scala Crash Course (cont.) // reduce() can be used to combine together all the items in a collection using some function. val numberList = List(1, 2, 3, 4, 5) //> numberList : List[Int] = List(1, 2, 3, 4, 5) val sum = numberList.reduce( (x: Int, y: Int) => x + y ) //> sum : Int = 15 println(sum) //> 15 // filter() can remove stuff you don't want. Here we'll introduce wildcard syntax while we're at it. val iHateFives = numberList.filter( (x: Int) => x != 5 ) //> iHateFives : List[Int] = List(1, 2, 3, 4) val iHateThrees = numberList.filter(_ != 3) //> iHateThrees : List[Int] = List(1, 2, 4, 5)
  • 16. And in the End it was ...
  • 17. Spark Core The Spark Core itself has two parts: ● A Computation engine which provides some basic functionalities like memory management, task scheduling, fault recovery and most importantly interacting with the cluster manager and storage system (HDFS, Amazon S3, Google Cloud storage, Cassandra, Hive, etc.) ● Spark Core APIs (available in Scala, Python, Java, and R): – Unstructured APIs : RDDs, Accumulators and Broadcast variables – Structured APIs : DataFrames and DataSets
  • 18. Resilient Distributed Datasets (RDDs) A distributed memory abstraction ● Immutable collections of objects spread across a cluster ● An RDD is divided into a number of partitions, which are atomic pieces of information ● Built through parallel transformations from: – data in stable storage (fs, HDFS, S3, via JDBC, etc.) – existing RDDs
  • 19. RDD Operators High-order functions: ● Transformations: lazy operators that create new RDDs ( map, filter, groupBy, join, etc.). Their result RDD is not immediately computed. ● Actions: launch a computation and return a value (non-RDD) to the program or write data to the external storage ( count, take, collect, save, etc.). Data is sent from executors to the driver.
  • 20. Creating RDDs The SparkContext is our handle to the Spark cluster. It defines a handful of methods which can be used to create and populate a RDD: ● Turn a collection into an RDD ● Load text file from local FS, HDFS, or S3 val rdd = sc.parallelize(Array(1, 2, 3)) val a = sc.textFile("file.txt") val b = sc.textFile("directory/*.txt") val c = sc.textFile("hdfs://namenode:9000/path/file")
  • 21. RDD Transformations - map ● Passing each element through a function ● All items are independently processed. val nums = sc.parallelize(Array(1, 2, 3)) val squares = nums.map(x => x * x) // {1, 4, 9}
  • 22. RDD Transformations - groupBy ● Pairs with identical key are grouped. ● Groups are independently processed. val schools = sc.parallelize(Seq(("sics", 1), ("kth", 1), ("sics", 2))) schools.groupByKey() // {("sics", (1, 2)), ("kth", (1))} schools.reduceByKey((x, y) => x + y) // {("sics", 3), ("kth", 1)}
  • 23. Basic RDD Actions ● Return all the elements of the RDD as an array. ● Return an array with the first n elements of the RDD. ● Return the number of elements in the RDD. val nums = sc.parallelize(Array(1, 2, 3)) nums.collect() // Array(1, 2, 3) nums.take(2) // Array(1, 2) nums.count() // 3
  • 24. Fault Tolerance Transformations on RDDs are represented as a lineage graph, a DAG representing the computations done on the RDD. RDD itself contains all the dependency informa‐ tion needed to recreate each of its partitions. val rdd = sc.textFile(...) val filtered = rdd.map(...).filter(...) val count = filtered.count() val reduced = filtered.reduce()
  • 25. Checkpointing ● It prevents RDD graph from growing too large. ● RDD is saved to a file inside the checkpointing directory. ● All references to its parent RDDs are removed. ● Done lazily, saved to disk the first time it is computed. ● You can force it: rdd.checkpoint()
  • 26. Caching There are many ways to configure how the data is persisted: ● MEMORY_ONLY (default): in memory as regular java objects (just like a regular Java program - least used elements are evacuated by JVM) ● DISK_ONLY: on disk as regular java objects ● MEMORY_ONLY_SER: in memory as serialize Java objects (more compact since uses byte arrays) ● MEMORY_AND_DISK: both in memory and on disk (spill over to disk to avoid re-computation) ● MEMORY_AND_DISK_SER: on disk as serialize Java objects (more compact since uses byte arrays)
  • 27. RDD Transformations (revisited) ● Narrow Dependencies: an output RDD has partitions that originate from a single partition in the parent RDD (e.g. map, filter) ● Wide Dependencies: the data required to compute the records in a single partition may reside in many partitions on the parent RDD (e.g. groupByKey, reduceByKey)
  • 28. Spark Application Tree Spark groups narrow transformations as a stage which is called pipelining. At a high level, one stage can be thought of as the set of computations (tasks) that can each be computed on one executor without communication with other executors or with the driver.
  • 29. Stage boundaries //stage 0 counts = sc.textFile("/path/to/input/") .flatMap(lambda line: line.split(" ")) .map(lambda word: (word, 1)) //stage 1 .reduceByKey(lambda a, b: a + b) counts.saveAsTextFile("/path/to/output/") In general, a new stage begins whenever network communication between workers is required (for instance, in a shuffle).
  • 30. Spark Programming Model ● Spark expresses computation by defining RDDs ● Based on parallelizable operators: higher-order functions that execute user defined functions in parallel ● A data flow is composed of any number of data sources and operators:
  • 31. How do RDDs evolve into tasks?
  • 32. Spark Execution Model An application maps to a single driver process and a set of executor processes distributed across the hosts in a cluster. The executors are responsible for performing work, in the form of tasks, as well as for storing any data. Invoking an action triggers the launch of a job to fulfill it. A stage is a collection of tasks that run the same code, each on a different subset of the data. task result
  • 33. Parallelism more partitions = more parallelism
  • 34. How to run Spark ● Interactive Mode: spark-shell or Spark Notebook ● Batch Mode: spark-submit Deployment Modes : ● Local ● Standalone ● YARN Cluster ● Mesos Cluster ● Kubernetes Runs Everywhere
  • 35. Interactive Spark application $ bin/spark­shell Using Spark's default log4j profile: org/apache/spark/log4j­defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://192.168.1.137:4041 Spark context available as 'sc' (master = local[*], app id = local­1534089873554). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 2.1.1 /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64­Bit Server VM, Java 1.8.0_181) Type in expressions to have them evaluated. Type :help for more information. scala> sc res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@281963c scala>
  • 36. Standalone Application You need to import the Spark packages in your program and create a SparkContext (driver program): ● Initializing Spark in Python from pyspark import SparkConf, SparkContext conf = SparkConf().setMaster(“local”).setAppName(“My App”) sc = SparkContext(conf = conf) ● Initializing Spark in Scala import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ val conf = new SparkConf().setMaster(“local”).setAppName(“My App”) val sc = new SparkContext(conf)
  • 37. Batch Mode spark­submit ­­master MASTER_URL spark://host:port, mesos://host:port, yarn, or local ­­deploy­mode DEPLOY_MODE (client or cluster) ­­name NAME (name of the application) ­­jars JARS (list of jars to be added to classpath of the driver and executors) ­­conf PROP=VALUE (spark configurations) ­­driver­memory MEM (Memory for the driver program. Format 300M or 1G) ­­executor­memory MEM (Memory for executor. Format 500M or 2G) ­­driver­cores NUM (Cores for drivers – applicable for YARN and standalone) ­­executor­cores NUM (Cores for executors – applicable for YARN and standalone) ­­num­executors NUM (Number of executors to launch (Default: 2)) Launch the application via spark-submit command:
  • 38. Spark job running on HDFS
  • 40. Advantages of running on Hadoop ● YARN resource manager, which takes responsibility for scheduling tasks across available nodes in the cluster. ● Hadoop Distributed File System, which stores data when the cluster runs out of free memory, and which persistently stores historical data when Spark is not running. ● Disaster Recovery capabilities, inherent to Hadoop, which enable recovery of data when individual nodes fail. ● Data Security, which becomes increasingly important as Spark tackles production workloads in regulated industries such as healthcare and financial services. Projects like Apache Knox and Apache Ranger offer data security capabilities that augment Hadoop.
  • 41. SCALING Thanks to Holden Karau and her lesson in unintended consequences.
  • 42. Scaling tips ● Spark only “understands” the program up to the point where an action is reached (it is not like a compiler) ● If we are going to re-use an RDD, it is better to cache it in memory or persist it at another storage level (MEMORY_AND_DISK, ...) ● In a shared environment, checkpointing can help ● Persist before checkpointing (checkpointing gets rid of the lineage) — see the sketch below
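A minimal sketch of these tips, assuming an existing SparkContext sc and hypothetical input and checkpoint paths:
import org.apache.spark.storage.StorageLevel

// Hypothetical checkpoint directory, for illustration only
sc.setCheckpointDir("/tmp/spark-checkpoints")

val cleaned = sc.textFile("/path/to/input/")
  .filter(_.nonEmpty)

// Re-used RDD: keep it around, spilling to disk if it does not fit in memory
cleaned.persist(StorageLevel.MEMORY_AND_DISK)

// Checkpointing writes the data out and truncates the lineage;
// persisting first avoids recomputing the RDD when the checkpoint is materialized
cleaned.checkpoint()

println(cleaned.count())   // first action: materializes the cache and the checkpoint
println(cleaned.count())   // served from the persisted/checkpointed data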
  • 43. Key-Value Data Scaling ● What does the distribution of keys look like? ● What type of aggregations do we need to do? ● What is the partition structure? ● ...
  • 44. Key Skew Keys are not evenly distributed (e.g. zip codes, null values) ● Straggler: a task which takes much longer to complete than the others ● The function groupByKey groups all of the records with the same key into a single record. When it is called, all the key-value pairs are shuffled around (data is transferred over the network). By default, Spark uses hash partitioning to determine which machine each key-value pair is sent to. ● Spark flushes the data to disk one key at a time, so a single key's values can be too big to fit in memory (OOM) ● With enough key skew, sortByKey will explode too: sorting by key can put a large share of the records in the same partition.
  • 45. groupByKey vs reduceByKey
val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))

val wordCountsWithReduce = wordPairsRDD
  .reduceByKey(_ + _)
  .collect()

val wordCountsWithGroup = wordPairsRDD
  .groupByKey()
  .map(t => (t._1, t._2.sum))
  .collect()
  • 46. groupByKey vs reduceByKey By reducing the dataset first, the amount of data sent over the network during the shuffle is greatly reduced
  • 47. Shuffle explosion (sortByKey) All examples are from Holden Karau's lessons in unintended consequences
  • 48. Shuffle “caching” ● Spark keeps shuffle files on the workers' disks so that they can be re-used ● References to those shuffle files live in the driver's memory; the files are only cleaned up when the corresponding RDDs are garbage collected on the driver ● You may need to trigger a GC event, or explicitly clean up the shuffle data, to free the space ● Enable off-heap memory so that shuffle data structures are allocated outside the JVM heap (see the configuration sketch below)
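A minimal configuration sketch for the off-heap option; the property names are standard Spark settings, while the application name and sizes are arbitrary example values:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

val conf = new SparkConf()
  .setAppName("OffHeapExample")
  // Allocate execution/storage data structures outside the JVM heap (Tungsten off-heap mode)
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")   // example value: must be set when off-heap is enabled

val sc = new SparkContext(conf)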
  • 49. Summary Spark offers an innovative, efficient model of parallel computing that centers on lazily evaluated, immutable, distributed datasets, known as RDDs. RDD methods can be used without any knowledge of their implementation - but having an understanding of the details will help you write more performant code.
  • 51. Structured vs Unstructured Data Spark and RDDs don't know anything about the schema of the data they are dealing with.
  • 52. Spark vs Databases In Spark: ● we do functional transformations on data ● we pass user-defined function literals to higher-order functions like map, flatMap, filter In a Database/Hive: ● we do declarative transformations on data ● specialized, structured, pre-defined operations, e.g. SELECT * FROM * WHERE * (see the sketch below)
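A minimal sketch of the contrast, using a hypothetical people dataset; the SQL side uses Spark SQL purely to illustrate the declarative style:
// Functional style on an RDD: we pass function literals to higher-order functions
val adults = sc.parallelize(Seq(("Joe", 38), ("Sally", 17), ("David", 21)))
  .filter { case (_, age) => age >= 18 }
  .map { case (name, _) => name }

// Declarative style: we describe *what* we want and let the engine decide how
val peopleDF = spark.createDataFrame(Seq(("Joe", 38), ("Sally", 17), ("David", 21)))
  .toDF("name", "age")
peopleDF.createOrReplaceTempView("people")
val adultsDF = spark.sql("SELECT name FROM people WHERE age >= 18")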
  • 53. Spark SQL: DataFrames, Datasets Like RDDs, DataFrames and Datasets represent distributed collections, with additional schema information not found in RDDs. This additional schema information is used to provide a more efficient storage layer (Tungsten), and in the optimizer (Catalyst) to perform additional optimizations. ● DataFrames and Datasets have a specialized representation and columnar cache format. ● Instead of specifying arbitrary functions, which the optimizer is unable to introspect, you use a restricted expression syntax so the optimizer can have more information.
  • 54. DataFrames A DataFrame is a distributed collection of data organized into named columns. DataFrames can be created from different data sources such as: • existing RDDs • structured data files • JSON datasets • Hive tables • external databases (via JDBC)
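A minimal sketch of a few of these creation paths; the paths, table name and connection details are hypothetical placeholders:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DataFrameSources").getOrCreate()
import spark.implicits._

// From an existing RDD (of tuples here), via toDF
val rdd = spark.sparkContext.parallelize(Seq((12, "Joe", "Smith"), (563, "Sally", "Owens")))
val fromRDD = rdd.toDF("id", "fname", "lname")

// From a JSON dataset (hypothetical path)
val fromJson = spark.read.json("/path/to/employees.json")

// From an external database via JDBC (hypothetical URL and table)
val fromJdbc = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/hr")
  .option("dbtable", "employees")
  .option("user", "reporting")
  .option("password", "secret")
  .load()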
  • 55. DataFrame API example
val employeeDF = sc.textFile(...).toDF
employeeDF.show()
// employeeDF:
// +---+-----+-------+---+--------+
// | id|fname| lname |age| city   |
// +---+-----+-------+---+--------+
// | 12| Joe | Smith | 38|New York|
// |563|Sally| Owens | 48|New York|
// |645|Slate|Markham| 28| Sydney |
// |221|David| Walker| 21| Sydney |
// +---+-----+-------+---+--------+

val sydneyEmployeesDF = employeeDF.select("id", "lname")
  .where("city = 'Sydney'")
  .orderBy("id")
// sydneyEmployeesDF:
// +---+-------+
// | id| lname |
// +---+-------+
// |221| Walker|
// |645|Markham|
// +---+-------+
  • 56. RDD versus DataFrame storage size
  • 57. DataFrames vs DataSets DataFrames ● Relational flavour ● Lack of compile-time type checking ● DataFrames are a specialized version of Datasets that operate on generic Row objects DataSets ● Mix of relational and functional transformations ● Compile-time type checking ● Can be used when you know the type information at compile time ● Datasets can be easily converted to/from DataFrames and RDDs
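A minimal sketch of moving between the three abstractions, reusing a hypothetical Employee schema and assuming a SparkSession named spark:
import spark.implicits._   // provides the Encoder for Employee

case class Employee(id: Int, fname: String, lname: String, age: Int, city: String)

// DataFrame: rows are untyped, so a misspelled column name fails only at runtime
val employeeDF = spark.read.json("/path/to/employees.json")

// Dataset[Employee]: typed, so field names and types are checked at compile time
val employeeDS = employeeDF.as[Employee]
val adults = employeeDS.filter(_.age >= 18)   // functional transformation, compile-time checked

// Converting back is straightforward
val backToDF = employeeDS.toDF()
val asRDD    = employeeDS.rdd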
  • 58. DataFrames/DataSets vs RDDs DataFrames/DataSets ● Catalyst Optimizer ● Efficient storage format ● Restricted subset of data types ● DataFrames are not strongly typed ● Dataset API is continuing to evolve RDDs ● Unstructured data ● Wider variety of data types ● Not primarily relational transformations ● Useful when the number of partitions needed for different parts of your pipeline changes
  • 59. User-Defined Functions and Aggregate Functions (UDFs, UDAFs) User-defined functions and user-defined aggregate functions provide you with ways to extend the DataFrame and SQL APIs with your own custom code while keeping the Catalyst optimizer. If most of your work is in Python but you want to access some UDFs without the performance penalty, you can write your UDFs in Scala and register them for use in Python.
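A minimal sketch of defining and registering a Scala UDF; the DataFrame, function and names are hypothetical, chosen only to illustrate the API:
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical DataFrame, for illustration only
val employees = spark.createDataFrame(Seq((12, "Joe", "New York "), (645, "Slate", "sydney")))
  .toDF("id", "fname", "city")

// Define the UDF: normalize city names
val normalizeCity = udf((city: String) => if (city == null) "unknown" else city.trim.toLowerCase)

// Use it through the DataFrame API
val normalized = employees.withColumn("city_norm", normalizeCity(col("city")))

// Or register it by name so it can be called from SQL (and, once registered, from Python)
spark.udf.register("normalizeCity", (city: String) => if (city == null) "unknown" else city.trim.toLowerCase)
employees.createOrReplaceTempView("employees")
spark.sql("SELECT id, normalizeCity(city) AS city_norm FROM employees").show()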
  • 62. Spark Streaming Spark Streaming is an extension of the core Spark API that makes it easy to build fault-tolerant processing of real-time data streams. It works by dividing the live stream of data into batches (called micro-batches) of a pre-defined interval (N seconds) and then treating each batch of data as an RDD. Each RDD contains only a small chunk of the incoming data.
  • 63. Spark Streaming With Spark Streaming’s micro-batch approach, we can use other Spark libraries (core, ML, SQL) with the Spark Streaming API in the same application.
  • 64. DStream DStream (short for “discretized stream”) is the basic abstraction in Spark Streaming and represents a continuous stream of data. Internally, a DStream is represented as a sequence of RDD objects. Similar to the transformation and action operations on RDDs, DStreams support the following operations: map, flatMap, filter, count, reduce, countByValue, reduceByKey, join, updateStateByKey
  • 65. Netcat Streaming Example
import org.apache.spark.streaming.{StreamingContext, Seconds}

val ssc = new StreamingContext(sc, Seconds(10))

// This listens to text data sent to port 9999, ten seconds at a time
val lines = ssc.socketTextStream("localhost", 9999)

// Wordcount
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()

// You need to kick off the job explicitly
ssc.start()
ssc.awaitTermination()
  • 66. Netcat Streaming Example
...
-------------------------------------------
Time: 1535630570000 ms
-------------------------------------------
(how,1)
(into,1)
(go,1)
(what,1)
(program,,1)
(want,1)
(looks,1)
(program,1)
(Spark,2)
(a,4)
...
-------------------------------------------
Time: 1535630580000 ms
-------------------------------------------

[sparkuser@horst ~]$ nc -lk 9999
...
Spark Streaming is a special SparkContext that you can use for processing data quickly in near-time. It's similar to the standard SparkContext, which is geared toward batch operations. Spark Streaming uses a little trick to create small batch windows (micro batches) that offer all of the advantages of Spark: safe, fast data handling and lazy evaluation combined with real-time processing. It's a combination of both batch and interactive processing.
...
  • 67. Twitter Example
import org.apache.spark.streaming.{StreamingContext, Seconds}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(conf, Seconds(1))

// Get a Twitter stream and extract just the messages themselves
val tweets = TwitterUtils.createStream(ssc, None)
val statuses = tweets.map(_.getText())

// Create a new DStream that has every individual word as its own entry
val tweetwords = statuses.flatMap(_.split(" "))

// Eliminate anything that's not a hashtag
val hashtags = tweetwords.filter(_.startsWith("#"))

// Convert the DStream to key/value pairs
val hashtagKeyValues = hashtags.map(hashtag => (hashtag, 1))

// Count up the results over a sliding window
val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow(_ + _, _ - _, Seconds(300), Seconds(1))

// Sort and output the results
val sortedResults = hashtagCounts.transform(rdd => rdd.sortBy(x => x._2, false))
sortedResults.print()
  • 68. Real Use Cases • Uber, the ride-sharing service, uses Spark Streaming in their continuous streaming ETL pipeline to collect terabytes of event data every day from their mobile users for real-time telemetry analysis. • Pinterest uses Spark Streaming, MemSQL, and Apache Kafka technologies to provide real-time insight into how their users are engaging with pins across the globe. • Netflix uses Kafka and Spark Streaming to build a real-time online movie recommendation and data-monitoring solution that processes billions of events received per day from different data sources.
  • 69. Conclusions Apache Spark, a lightning-fast cluster computing framework, can help you address the challenges of Data Science: a unified engine supporting diverse workloads and environments, fault-tolerant and scalable, from simple ETL to complex Machine Learning jobs.
  • 70. You won’t be a Spark superhero, but...