Apache Spark
Lightning-fast cluster computing
What is Apache Spark?
A cluster-computing platform designed to be fast and general-purpose.
Fast
General-purpose
Highly accessible
Unified stack
Comparison with MapReduce (MR)
// Word count in Spark: a handful of lines vs. dozens in classic MapReduce
// (here "spark" is a SparkContext)
val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Examples: Word Count
val sc = new SparkContext(...)
val textFile = sc.textFile("hdfs://...")                // one element per line
val counts = textFile.flatMap(line => line.split(" "))  // split lines into words
                     .map(word => (word, 1))            // pair each word with a 1
                     .reduceByKey(_ + _)                 // sum the counts per word
counts.saveAsTextFile("hdfs://...")
Examples: Log Mining
val sc = new SparkContext(...)
val inputRDD = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(line => line.contains("error"))
val warningsRDD = inputRDD.filter(line => line.contains("warning"))
val badLinesRDD = errorsRDD.union(warningsRDD)
badLinesRDD.persist()   // keep the result around for the two actions below
badLinesRDD.count()     // first action: triggers the actual computation
badLinesRDD.collect()   // second action: served from the persisted copy
How does it work?
RDD
Resilient
Distributed
Dataset
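As a minimal sketch of creating RDDs (assuming the same SparkContext sc used in the earlier examples):

val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))  // RDD from a local collection, split into partitions
val lines = sc.textFile("hdfs://...")             // RDD from external storage, one element per line
val doubled = numbers.map(_ * 2)                  // transformations are lazy: nothing runs until an action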
Advanced: RDD as interface
Example: Hadoop RDD
partitions = one per HDFS block
dependencies = none
compute = read the corresponding block
preferredLocations = HDFS block locations
partitioner = none
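A simplified Scala sketch of this five-part interface (names follow the slide; Spark's real RDD abstract class is richer, and Partition, Dependency, and Partitioner are stubbed here purely for illustration):

trait Partition
trait Dependency
trait Partitioner

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                      // Hadoop RDD: one per HDFS block
  def dependencies: Seq[Dependency]                   // Hadoop RDD: none (it is a source)
  def compute(p: Partition): Iterator[T]              // Hadoop RDD: read the corresponding block
  def preferredLocations(p: Partition): Seq[String]   // Hadoop RDD: the block's locations
  def partitioner: Option[Partitioner]                // Hadoop RDD: none
}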
Directed Acyclic Graph (DAG)
[Diagram: lineage of the log-mining example. hadoopRDD feeds two filter transformations producing errorsRDD and warningsRDD; their union produces badLinesRDD.]
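The lineage can also be inspected from code: toDebugString prints an RDD's chain of dependencies (using badLinesRDD from the log-mining example above):

println(badLinesRDD.toDebugString)   // shows union <- two filters <- the source text file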
RDD Transformations
Function Name    Purpose                                                       Example
map()            Apply a function to each element in the RDD and return
                 an RDD of the results.                                        rdd.map(x => x + 1)
flatMap()        Apply a function to each element in the RDD and return
                 an RDD of the contents of the iterators returned. Often
                 used to extract words.                                        rdd.flatMap(x => x.to(3))
filter()         Return an RDD consisting of only the elements that pass
                 the condition passed to filter().                             rdd.filter(x => x != 1)
distinct()       Remove duplicates.                                            rdd.distinct()
union()          Produce an RDD containing elements from both RDDs.            rdd.union(other)
intersection()   RDD containing only elements found in both RDDs.              rdd.intersection(other)
join()           Perform an inner join between two RDDs.                       rdd.join(other)
groupByKey()     Group values with the same key.                               rdd.groupByKey()
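A small runnable illustration of a few of these transformations, again assuming the sc from the earlier examples:

val nums = sc.parallelize(Seq(1, 2, 2, 3))
nums.map(x => x + 1)                    // 2, 3, 3, 4
nums.filter(x => x != 1)                // 2, 2, 3
nums.distinct()                         // 1, 2, 3
val pairs = nums.map(x => (x % 2, x))   // key each number by parity
pairs.groupByKey()                      // (0, [2, 2]) and (1, [1, 3])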
RDD Actions
Function Name               Purpose                                         Example
count()                     Number of elements in the RDD.                  rdd.count()
collect()                   Return all elements from the RDD.               rdd.collect()
saveAsTextFile()            Save RDD elements to an external storage
                            system.                                         rdd.saveAsTextFile("hdfs://...")
take(num)                   Return num elements from the RDD.               rdd.take(10)
reduce(func)                Combine the elements of the RDD in parallel
                            (e.g., sum).                                    rdd.reduce((x, y) => x + y)
takeOrdered(num)(ordering)  Return the first num elements according to
                            the provided ordering.                          rdd.takeOrdered(2)(myOrdering)
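And a few of the actions in use (each action triggers a job on the cluster):

val data = sc.parallelize(Seq(5, 3, 1, 4))
data.count()                   // 4
data.take(2)                   // Array(5, 3)
data.reduce((x, y) => x + y)   // 13
data.takeOrdered(2)            // Array(1, 3), using the natural ordering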
RDD Caching
Level                Space Used  CPU Time  In Memory  On Disk  Comments
MEMORY_ONLY          High        Low       Y          N
MEMORY_ONLY_SER      Low         High      Y          N
MEMORY_AND_DISK      High        Medium    Some       Some     Spills to disk if there is too much data to fit in memory.
MEMORY_AND_DISK_SER  Low         High      Some       Some     Spills to disk if there is too much data to fit in memory. Stores a serialized representation in memory.
DISK_ONLY            Low         High      N          Y
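A sketch of choosing a level explicitly (persist() with no argument defaults to MEMORY_ONLY; badLinesRDD is from the log-mining example):

import org.apache.spark.storage.StorageLevel

badLinesRDD.persist(StorageLevel.MEMORY_AND_DISK)
badLinesRDD.count()      // first action computes the RDD and caches it
badLinesRDD.collect()    // second action is served from the cache
badLinesRDD.unpersist()  // release the cached copy when done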
How does it work?
Driver: the main program, which controls the flow of the application
Executors: the worker-node processes that execute the tasks
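The driver is ordinary user code that builds a SparkContext; a minimal sketch (the app name and master URL are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")              // hypothetical application name
  .setMaster("spark://host:7077")   // placeholder cluster master URL
val sc = new SparkContext(conf)     // the driver; executors on the workers run its tasks
// ... define RDDs and call actions ...
sc.stop()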
How does it work?
DAG Scheduler: coordinates the work between RDDs, the driver, and the nodes
What is a Spark Application?
Advanced Topics: Stages
Advanced Topics: Shuffling
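As a rough code illustration of both topics: transformations with narrow dependencies (such as mapValues) stay within a partition and share a stage, while a key-based transformation such as reduceByKey must move all values for a key to one partition, causing a shuffle and a stage boundary:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val mapped = pairs.mapValues(v => v * 2)   // narrow dependency: no shuffle, same stage
val summed = mapped.reduceByKey(_ + _)     // wide dependency: shuffles data by key, new stage
summed.collect()                           // Array(("a", 4), ("b", 2))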
Spark Stack
Spark SQL
Spark Streaming
MLlib (machine learning)
GraphX (graph processing)
If not… DEMO, everyone?
?

Apache Spark - Aram Mkrtchyan
