An Introduction to
Apache Spark
Anastasios Skarlatidis
@anskarl
Software Engineer/Researcher
IIT, NCSR "Demokritos"
Outline
• Part I: Getting to know Spark
• Part II: Basic programming
• Part III: Spark under the hood
• Part IV: Advanced features
Part I:
Getting to know Spark
Spark in a Nutshell
• General cluster computing platform:
• Distributed in-memory computational framework.
• SQL, Machine Learning, Stream Processing, etc.
• Easy to use, powerful, high-level API:
• Scala, Java, Python and R.
Unified Stack
• Spark Core (runs on the Standalone Scheduler, YARN or Mesos)
• Spark SQL
• Spark Streaming (real-time processing)
• MLlib (Machine Learning)
• GraphX (graph processing)
High Performance
• In-memory cluster computing.
• Ideal for iterative algorithms.
• Faster than Hadoop MapReduce:
• 10x on disk.
• 100x in memory.
Brief History
• Originally developed in 2009, UC Berkeley AMP Lab.
• Open-sourced in 2010.
• As of 2014, Spark is a top-level Apache project.
• Fastest open-source engine for sorting 100 TB:
• Won the 2014 Daytona GraySort contest.
• Throughput: 4.27 TB/min
Who uses Spark,
and for what?
A. Data Scientists:
• Analyze and model data.
• Data transformations and prototyping.
• Statistics and Machine Learning.
B. Software Engineers:
• Implement production data processing systems.
• Require a reasonable API for distributed processing.
• Reliable, high performance, easy to monitor platform.
Resilient Distributed Dataset
RDD is an immutable and partitioned collection:
• Resilient: it can be recreated when data in memory is lost.
• Distributed: stored in memory across the cluster.
• Dataset: data that comes from a file or is created programmatically.
(Figure: an RDD and its partitions)
Resilient Distributed Datasets
• Feels like coding with typical Scala collections.
• An RDD can be built:
1. directly from a datasource (e.g., text file, HDFS, etc.),
2. or by applying a transformation to other RDD(s).
• Main features:
• RDDs are computed lazily (see the sketch below).
• Automatically rebuilt on failure.
• Persistence for reuse (RAM and/or disk).
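
A minimal sketch of lazy evaluation (the file name is a placeholder): nothing is read or computed until the final action runs.

val lines   = sc.textFile("data.txt")  // no file access yet
val lengths = lines.map(_.length)      // still nothing computed
lengths.reduce(_ + _)                  // action: Spark now reads the file and computes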
Part II:
Basic programming
Spark Shell

$ cd spark
$ ./bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.

Spark context available as sc.

scala>
Standalone Applications

Sbt:

    "org.apache.spark" %% "spark-core" % "1.2.1"

Maven:

    groupId: org.apache.spark
    artifactId: spark-core_2.10
    version: 1.2.1
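
For sbt, a minimal build.sbt might look as follows (a sketch; the project name is arbitrary, and the Scala version must match the Spark build, here 2.10.x):

name := "simple-app"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1"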
Initiate Spark Context

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp extends App {
  val conf = new SparkConf().setAppName("Hello Spark")
  val sc   = new SparkContext(conf)
}
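
Once packaged (e.g., with sbt package), the application can be launched with spark-submit. A sketch, assuming a local master and the default sbt jar path:

$ ./bin/spark-submit \
    --class SimpleApp \
    --master local[4] \
    target/scala-2.10/simple-app_2.10-0.1.jar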
Rich, High-level API

map        reduce         sample
filter     count          take
sort       fold           first
groupBy    reduceByKey    partitionBy
union      groupByKey     mapWith
join       cogroup        pipe
…          zip            save
           …              …
Loading and Saving
• File Systems: Local FS, Amazon S3 and HDFS.
• Supported formats: text files, JSON, Hadoop sequence files,
Parquet files, protocol buffers and object files (see the sketch below).
• Structured data with Spark SQL: Hive, JSON, JDBC,
Cassandra, HBase and ElasticSearch.
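
For example, any RDD can be round-tripped through an object file (a sketch; the path is a placeholder, and Java serialization is used under the hood):

val nums = sc.parallelize(1 to 100)
nums.saveAsObjectFile("path/to/out")          // write serialized elements
val back = sc.objectFile[Int]("path/to/out")  // read them back as RDD[Int]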
Create RDDs

// sc: SparkContext instance

// Scala List to RDD
val rdd0 = sc.parallelize(List(1, 2, 3, 4))

// Load lines of a text file
val rdd1 = sc.textFile("path/to/filename.txt")

// Load a file from HDFS
val rdd2 = sc.hadoopFile("hdfs://master:port/path")

// Load lines of a compressed text file
val rdd3 = sc.textFile("file:///path/to/compressedText.gz")

// Load lines of multiple files
val rdd4 = sc.textFile("s3n://log-files/2014/*.log")
RDD Operations

1. Transformations: define a new RDD based on the current one(s), e.g., filter, map, reduce, groupBy, etc. (RDD → new RDD)
2. Actions: return values, e.g., count, sum, collect, etc. (RDD → value)
Transformations (I): basics

val nums = sc.parallelize(List(1, 2, 3))

// Pass each element through a function
val squares = nums.map(x => x * x)  // {1, 4, 9}

// Keep elements passing a predicate
val even = squares.filter(_ % 2 == 0)  // {4}

// Map each element to zero or more others
val mn = nums.flatMap(x => 1 to x)  // {1, 1, 2, 1, 2, 3}
Transformations (I): illustrated

nums (ParallelCollectionRDD)
 ├─ nums.map(x => x * x)           → squares (MappedRDD)
 │    └─ squares.filter(_ % 2 == 0) → even (FilteredRDD)
 └─ nums.flatMap(x => 1 to x)      → mn (FlatMappedRDD)
Transformations (II): key - value

// Pairs of (key, value)
val pets = sc.parallelize(List(("cat", 1), ("dog", 1), ("cat", 2)))

pets.filter{case (k, v) => k == "cat"}
// {(cat,1), (cat,2)}

pets.map{case (k, v) => (k, v + 1)}
// {(cat,2), (dog,2), (cat,3)}

pets.mapValues(v => v + 1)
// {(cat,2), (dog,2), (cat,3)}
Transformations (II): key - value

val pets = sc.parallelize(List(("cat", 1), ("dog", 1), ("cat", 2)))

// Aggregation
pets.reduceByKey((l, r) => l + r)  // {(cat,3), (dog,1)}

// Grouping
pets.groupByKey()  // {(cat, Seq(1, 2)), (dog, Seq(1))}

// Sorting
pets.sortByKey()   // {(cat,1), (cat,2), (dog,1)}
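
Note that for aggregations, reduceByKey is usually preferable to groupByKey, since values are combined on each partition before the shuffle. A sketch of the two equivalent computations:

// Combines map-side, so less data is shuffled:
pets.reduceByKey(_ + _)

// Ships every value across the network before summing:
pets.groupByKey().mapValues(_.sum)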
Transformations (III): key - value

// RDD[(URL, page_name)] tuples
val names = sc.textFile("names.txt").map(…)…

// RDD[(URL, visit_counts)] tuples
val visits = sc.textFile("counts.txt").map(…)…

// RDD[(URL, (visit_counts, page_name))]
val joined = visits.join(names)
Basics: Actions

val nums = sc.parallelize(List(1, 2, 3))

// Count number of elements
nums.count()  // = 3

// Merge with an associative function
nums.reduce((l, r) => l + r)  // = 6

// Write elements to a text file
nums.saveAsTextFile("path/to/filename.txt")
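
Two more common actions, collect and take, return data to the driver (a sketch; beware of collect on large RDDs, it fetches everything into driver memory):

nums.collect()  // = Array(1, 2, 3): the whole RDD, on the driver
nums.take(2)    // = Array(1, 2): only the first two elements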
Workflow

data → transformation → … → transformation → action → result
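
A minimal sketch of this workflow, assuming a log file named server.log:

val lines  = sc.textFile("server.log")          // data
val errors = lines.filter(_.contains("ERROR"))  // transformation (lazy)
val n      = errors.count()                     // action: triggers execution, returns a value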
Part III:
Spark under the hood
Units of Execution Model

1. Job: work required to compute an RDD.
2. Each job is divided into stages.
3. Task:
• Unit of work within a stage.
• Corresponds to one RDD partition (see the sketch below).

Job
 ├─ Stage 0: Task 0, Task 1, …
 ├─ Stage 1: Task 0, Task 1, …
 └─ …
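
Since each task corresponds to one partition, the number of tasks per stage can be inspected through the RDD's partitions (a sketch, assuming hamlet.txt exists):

val lines = sc.textFile("hamlet.txt", 4)  // request at least 4 partitions
lines.partitions.length                   // number of partitions = tasks per stage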
Execution Model

val lines        = sc.textFile("README.md")
val countedLines = lines.count()

Driver Program (SparkContext)
 ├─ Worker Node: Executor running Task, Task
 └─ Worker Node: Executor running Task, Task
Example: word count

val lines  = sc.textFile("hamlet.txt")
val counts = lines.flatMap(_.split(" "))   // (a)
                  .map(word => (word, 1))  // (b)
                  .reduceByKey(_ + _)      // (c)

Input "to be or not to be", split over two partitions:

"to be or"  ─(a)→ "to", "be", "or"   ─(b)→ ("to",1), ("be",1), ("or",1)   ─(c)→ ("be",2), ("not",1)
"not to be" ─(a)→ "not", "to", "be"  ─(b)→ ("not",1), ("to",1), ("be",1)  ─(c)→ ("or",1), ("to",2)
Visualize an RDD

12: val lines  = sc.textFile("hamlet.txt")    // HadoopRDD[0], MappedRDD[1]
13:
14: val counts = lines.flatMap(_.split(" "))  // FlatMappedRDD[2]
15:                   .map(word => (word, 1)) // MappedRDD[3]
16:                   .reduceByKey(_ + _)     // ShuffledRDD[4]
17:
18: counts.toDebugString

res0: String =
(2) ShuffledRDD[4] at reduceByKey at <console>:16 []
 +-(2) MappedRDD[3] at map at <console>:15 []
    |  FlatMappedRDD[2] at flatMap at <console>:14 []
    |  hamlet.txt MappedRDD[1] at textFile at <console>:12 []
    |  hamlet.txt HadoopRDD[0] at textFile at <console>:12 []
Lineage Graph

val lines  = sc.textFile("hamlet.txt")    // MappedRDD[1], HadoopRDD[0]
val counts = lines.flatMap(_.split(" "))  // FlatMappedRDD[2]
                  .map(word => (word, 1)) // MappedRDD[3]
                  .reduceByKey(_ + _)     // ShuffledRDD[4]

HadoopRDD[0] → MappedRDD[1] → FlatMappedRDD[2] → MappedRDD[3] → ShuffledRDD[4]
Execution Plan

val lines  = sc.textFile("hamlet.txt")    // MappedRDD[1], HadoopRDD[0]
val counts = lines.flatMap(_.split(" "))  // FlatMappedRDD[2]
                  .map(word => (word, 1)) // MappedRDD[3]
                  .reduceByKey(_ + _)     // ShuffledRDD[4]

Stage 1: HadoopRDD → MappedRDD → FlatMappedRDD → MappedRDD (pipelining)
Stage 2: ShuffledRDD

The narrow transformations up to the map are pipelined into a single stage; the shuffle required by reduceByKey marks the boundary where Stage 2 begins.
Part IV:
Advanced Features
Persistence
• When we use the same RDD multiple times:
• Spark will recompute the RDD.
• This is expensive for iterative algorithms.
• Spark can persist RDDs, avoiding recomputations.
Levels of persistence

val result = input.map(expensiveComputation)
result.persist(LEVEL)

LEVEL                   Space consumption   CPU time   In memory   On disk
MEMORY_ONLY (default)   High                Low        Y           N
MEMORY_ONLY_SER         Low                 High       Y           N
MEMORY_AND_DISK         High                Medium     Some        Some
MEMORY_AND_DISK_SER     Low                 High       Some        Some
DISK_ONLY               Low                 High       N           Y
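
The levels live in org.apache.spark.storage.StorageLevel. A sketch of persisting with an explicit level and releasing it afterwards (input and expensiveComputation as above):

import org.apache.spark.storage.StorageLevel

val result = input.map(expensiveComputation)
result.persist(StorageLevel.MEMORY_AND_DISK)
result.count()      // first action: computes and caches the RDD
result.count()      // second action: served from the cache
result.unpersist()  // manually evict when no longer needed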
Persistence Behaviour
• Each node will store its computed partition.
• In case of a failure, Spark recomputes the
missing partitions.
• Least Recently Used cache policy:
• Memory-only levels: evicted partitions are recomputed when needed again.
• Memory-and-disk levels: evicted partitions are written to disk.
• Manually remove from cache: unpersist()
Shared Variables
1. Accumulators: aggregate values from worker
nodes back to the driver program.
2. Broadcast variables: distribute values to all
worker nodes.
Accumulator Example

val input = sc.textFile("input.txt")

// Driver only: initialize the accumulators
val sum   = sc.accumulator(0)
val count = sc.accumulator(0)

input
  .filter(line => line.size > 0)
  .flatMap(line => line.split(" "))
  .map(word => word.size)
  .foreach{ size =>
    sum   += size  // increment accumulator
    count += 1     // increment accumulator
  }

val average = sum.value.toDouble / count.value
Accumulators and Fault Tolerance

• Safe: updates inside actions are applied only once.
• Unsafe: updates inside transformations may be applied more than once!
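
A sketch of the difference (nums as in the earlier examples):

val acc  = sc.accumulator(0)
val nums = sc.parallelize(1 to 10)

// Safe: foreach is an action, each update is applied exactly once.
nums.foreach(x => acc += 1)

// Unsafe: map is a transformation; if a partition is recomputed
// (e.g., after a failure), these updates may run again.
val tagged = nums.map { x => acc += 1; x }
tagged.count()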
Broadcast Variables

• Closures and the variables they use are sent separately to each task.
• We may want to share some variable (e.g., a Map) across tasks/operations.
• This can be done efficiently with broadcast variables.
Example without broadcast variables

// RDD[(String, String)]
val names = …  // load (URL, page name) tuples

// RDD[(String, Int)]
val visits = …  // load (URL, visit counts) tuples

// Map[String, String]
val pageMap = names.collect.toMap

val joined = visits.map{
  case (url, counts) =>
    (url, (pageMap(url), counts))
}

// pageMap is sent along with every task.
Example with broadcast variables

// RDD[(String, String)]
val names = …  // load (URL, page name) tuples

// RDD[(String, Int)]
val visits = …  // load (URL, visit counts) tuples

// Map[String, String]
val pageMap = names.collect.toMap

// Broadcast variable
val bcMap = sc.broadcast(pageMap)

val joined = visits.map{
  case (url, counts) =>
    (url, (bcMap.value(url), counts))
}

// pageMap is sent to each node only once.
Appendix

Staging
(Figure: a job graph of map, filter, groupBy and join operations; one panel marks cached RDDs, and the final panel shows the graph divided into Stage 1, Stage 2 and Stage 3.)