Apache Spark Overview part1 (20161107)

Introduction to Apache Spark
Part1
2016.11.07
민형기

Contents
• MapReduce
• Apache Spark
• Spark SQL

MapReduce History
• 1979 – Stanford, MIT, CMU, etc
• set/list operations in LISP, Prolog, etc. for parallel processing
• 2004 – Google
• MapReduce(2004): Simplified Data Processing on Large Clusters
• http://research.google.com/archive/mapreduce.html
• 2006 – Apache Hadoop: http://hadoop.apache.org/
• Hadoop, originating from the Nutch Project, Doug Cutting
• 2008 – Yahoo
• Web scale search indexing
• Hadoop Summit, HUG, etc
• 2009 – Amazon AWS
• Elastic MapReduce
• Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc
• 2012.01 – Apache Hadoop 1.0
• MapReduce 1.0: cluster resource management & data processing
• 2013.10 – Apache Hadoop 2.2
• MapReduce 2.0: data processing
• YARN: cluster resource management
Jeff Dean
Doug Cutting
29 Facts About Jeff Dean: http://ppss.kr/archives/16672

MapReduce Motivation
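• Processing the data that Google uses requires many machines.
• In particular, because the input data is large, the computation must be distributed across many machines to finish in a reasonable time.
• Distributed processing is also needed when a huge number of web pages must be processed to build the web page index.
• The kinds of data processing keep growing
• Computing the search index (inverted index), various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequently requested queries on a given day, etc.
• Most of these are conceptually simple, but handling distribution (parallelizing the work, distributing the data, handling failures, etc.) makes the code complex
• A distributed data processing framework is needed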
http://research.google.com/archive/mapreduce.html

MapReduce Programming Model
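• Map and Reduce are terms that originate from functional languages such as Lisp
• Map: applies a function to a collection of data to produce a new collection
• Reduce: applies a function to a collection of data to combine it into a single result
• Map: <key, value> → <key', value'>*
• Reduce: <key', value'*> → value''*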
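The Map and Reduce signatures above can be sketched in plain Python. This is a single-process illustration only, not Hadoop or Spark code; the names map_fn, reduce_fn, and run_mapreduce are hypothetical and chosen for this example, and the in-memory grouping step stands in for the distributed shuffle.

```python
from collections import defaultdict

def map_fn(key, value):
    # Map: <key, value> -> <key', value'>*
    # key is a record id (unused here), value is a line of text.
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # Reduce: <key', value'*> -> value''*
    return [sum(values)]

def run_mapreduce(records, map_fn, reduce_fn):
    # Group every intermediate value by its key (stand-in for the shuffle).
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    # Apply the reduce function once per intermediate key.
    return {k2: reduce_fn(k2, vs) for k2, vs in sorted(groups.items())}

records = [
    ("line1", "the cat sat on the mat"),
    ("line2", "the aardvark sat on the sofa"),
]
word_counts = run_mapreduce(records, map_fn, reduce_fn)
print(word_counts)
```

In a real cluster the map calls and the reduce calls would run on different machines; only the two user functions keep the same shape.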

MapReduce Design Pattern
• Basic MapReduce Patterns
• Counting, Summing
• Collating
• Filtering(“Grepping”), Parsing, and Validation
• Distributed Task Execution
• Sorting
• Not-So-Basic MapReduce Patterns
• Iterative Message Passing(Graph Processing)
• Distinct Values(Unique Items Counting)
• Cross-Correlation
• Relational MapReduce Patterns
• Selection
• Projection
• Union
• Intersection
• Difference
• GroupBy and Aggregation
• Joining
• Machine Learning and Math MapReduce Algorithms
https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

MapReduce Limitations
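• Programming directly in MR is hard.
• MR is difficult, takes a lot of development effort, and makes performance hard to guarantee.
• Performance varies with the developer's skill level
• Productivity is much lower than with existing SQL implementations
• MapReduce performs well for one-pass computation but is inefficient for multi-pass algorithms.
• Optimized for disk IO / makes poor use of memory
• Iterative algorithms are inefficient because they keep generating disk IO
• MR is not optimized for many kinds of computation.
• Specialized systems are needed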
https://stanford.edu/~rezab/sparkclass/slides/reza_introtalk.pdf

MapReduce Limitations
• General batch processing: MapReduce
• Specialized systems (iterative, interactive and streaming apps): Storm, Giraph, Drill, Tez, Impala, Tajo, Druid, Presto, …

What is Apache Spark?
• Fast and general engine for large-scale data processing.
• Features
• Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Ease of use: applications are easy to write in Java, Scala, Python, and R
• Generality: Batch, Streaming, iterative, interactive
• Runs Everywhere: Hadoop, Mesos, Standalone
• ‘09 UC Berkeley AMPLab, open sourced in ‘10

Spark Stack(Unified Platform)
Libraries: Spark SQL | Spark Streaming (Streaming) | MLlib (Machine Learning) | GraphX (graph)
Engine: Spark Core / RDD
Cluster managers: Standalone | YARN | Mesos
Languages: Scala | Java | Python | R

Benefit for Users
• The same engine can perform data extraction, model training, and interactive queries.
Separate engines: DFS read → parse → DFS write → DFS read → train → DFS write → DFS read → query → DFS write → …
Spark: HDFS → DFS read → parse → train → query (interactive analysis)
https://spark-summit.org/2013/zaharia-the-state-of-spark-and-where-were-going/

Spark History
• 2009 – Development started at the UC Berkeley RAD Lab (AMP Lab)
• 2010 – Open sourced
• 2010 – Spark: Cluster Computing with Working Sets
• 2012 – Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
• 2013 – Became an Apache project
• 2014 – Apache Top-Level Project
• 2014 – World record for large-scale sorting with Spark (Databricks)
• May 2014 – 1.0 release
• July 2016 – 2.0 release

Spark - Motivation
• MapReduce made big data analysis easy
• However, it is only suited to one-directional (acyclic) data flow models
• Where MapReduce falls short
• Iterative jobs: machine learning, graph processing
• Interactive analytics: ad-hoc queries (Hive, Pig)
→ Data sharing is slow
• How can this be improved?
• Fast data sharing
• General DAGs
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
http://www.slideshare.net/yongho/rdd-paper-review

Operations in MapReduce
• Data sharing in MR is slow because of replication, serialization, and disk IO
• Most MR jobs spend about 90% of their time on HDFS reads and writes
• Iterative Operations • Interactive Operations
https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm

Operations in Spark RDD
• Iterative Operations • Interactive Operations
https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm
• RDD: data is shared in memory
• Sharing data through memory is 10 to 100 times faster than through the network or disk

Apache Spark – Time to Sort 100TB
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east

Scala, Java, Python, R
// Scala:
val lines = sc.textFile(…)
val pairs = lines.map( s => (s, 1) )
val counts = pairs.reduceByKey( (a,b) => a + b)
// Java:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
// Python:
lines = sc.textFile(…)
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a+b)

Spark Context
• Every Spark application needs a Spark Context
• Main entry point for the Spark API
• Represents the connection to a Spark cluster
• The Spark shell provides a preconfigured Spark Context named sc
• Scala (spark-shell):
• Python (pyspark):

Master
• The master parameter of SparkContext determines which cluster to use
master | description
local | run Spark locally with one worker thread (no parallelism)
local[K] | run Spark locally with K worker threads (ideally set to # cores)
spark://host:port | connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://host:port | connect to a Mesos cluster; PORT depends on config (5050 by default)
yarn | connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode; the cluster location is found from the HADOOP_CONF_DIR or YARN_CONF_DIR variable

Master
http://spark.apache.org/docs/latest/cluster-overview.html
1. Connect to the Cluster Manager to allocate resources for the application
2. Acquire executors that will run tasks on the cluster
3. Send the application code to the executors
4. Send tasks to the executors and run them

Master – YARN vs. Standalone
• Comparison by master type (YARN vs. Standalone)
Mode | YARN Cluster | YARN Client | Spark Standalone
Driver runs in: | Application Master | Client | Client
Who requests resources? | Application Master | Application Master | Client
Who starts executor processes? | YARN NM | YARN NM | Spark Workers
Persistent services | YARN RM / NM | YARN RM / NM | Spark Master / Workers
Supports Spark Shell? | No | Yes | Yes
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

Resilient Distributed Datasets(RDD)
• Primary abstraction in Spark
• An immutable collection of objects that can be operated on in parallel
• RDD
• Resilient: data held in memory is rebuilt if it is lost
• Distributed: the data is stored in memory across the cluster
• Main idea: Resilient Distributed Datasets
• Immutable collections of objects, spread across the cluster
• Users can control the partitioning and persistence (memory, disk, etc.) of a collection
• RDD creation: only storage → RDD or RDD → RDD
• Statically typed: RDD[T] has objects of type T
• Fault-tolerant: only the lineage of the data is recorded
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
https://gist.github.com/hellerbarde/2843375
http://www.slideshare.net/yongho/rdd-paper-review

RDD - Operations
• Two types: transformations and actions
• Transformation operation
• Creates a new RDD through a transformation, e.g. rdd.map(…)
• Lazy operation
• Recorded in the lineage
• Action operation
• Returns or stores the computed result, e.g. rdd.count()
• Executed immediately
• Computes an execution plan from the information in the lineage (the transformation operations)
• Executed along the optimal path
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
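The split between lazy transformations and eager actions can be illustrated with a toy class. This is a minimal sketch in plain Python, assuming a single in-memory list as the data source; MiniRDD and its methods are hypothetical names for this example and are not the Spark API.

```python
class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = source            # base data, a plain list
        self.lineage = tuple(lineage)    # recorded (name, function) steps

    # Transformations: only record the step and return a new object.
    def map(self, f):
        return MiniRDD(self._source, self.lineage + (("map", f),))

    def filter(self, f):
        return MiniRDD(self._source, self.lineage + (("filter", f),))

    def _compute(self):
        # Replay the recorded lineage over the source data.
        data = iter(self._source)
        for name, f in self.lineage:
            data = map(f, data) if name == "map" else filter(f, data)
        return data

    # Actions: trigger the computation.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

calls = []

def square(x):
    calls.append(x)      # record that the function really ran
    return x * x

rdd = MiniRDD([1, 2, 3, 4]).map(square).filter(lambda x: x > 4)
calls_before_action = len(calls)   # nothing has been computed yet
result = rdd.collect()             # the action runs the whole lineage
print(calls_before_action, result, [name for name, _ in rdd.lineage])
```

Because each transformation returns a new object and leaves the original untouched, the sketch also reflects the immutability of RDDs.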

RDD – Transformations
Transformation | Meaning
map(f: T → U) | RDD[T] → RDD[U]
filter(f: T → Bool) | RDD[T] → RDD[T]
flatMap(f: T → Seq[U]) | RDD[T] → RDD[U]
mapPartitions(f: Iterator[T] → Iterator[U]) | RDD[T] → RDD[U], runs separately on each partition (block)
mapPartitionsWithIndex(f: (Int, Iterator[T]) → Iterator[U]) | RDD[T] → RDD[U], the integer value is the partition index
sample(withReplacement, fraction, seed) | RDD[T] → RDD[T], samples the given fraction of the data
union(otherDataset) | (RDD[T], RDD[T]) → RDD[T], A ∪ B
intersection(otherDataset) | (RDD[T], RDD[T]) → RDD[T], A ∩ B
distinct([numTasks]) | RDD[T] → RDD[T], returns the distinct elements of the source dataset
groupByKey([numTasks]) | RDD[(K,V)] → RDD[(K, Iterable[V])]
reduceByKey(f: (V,V) → V, [numTasks]) | RDD[(K,V)] → RDD[(K,V)], aggregates the values for each key
sortByKey([ascending], [numTasks]) | RDD[(K,V)] → RDD[(K,V)], sorted by key
http://spark.apache.org/docs/1.6.2/programming-guide.html

RDD – Transformations
Transformation | Meaning
join(otherDataset, [numTasks]) | (RDD[(K,V)], RDD[(K,W)]) → RDD[(K,(V,W))], all (v,w) pairs for each key k; see also leftOuterJoin, rightOuterJoin, fullOuterJoin
cogroup(otherDataset, [numTasks]) | (RDD[(K,V)], RDD[(K,W)]) → RDD[(K, (Iterable[V], Iterable[W]))]; alias: groupWith
cartesian(otherDataset) | (RDD[T], RDD[U]) → RDD[(T,U)], cartesian product of the two RDDs: all (a,b) with a in RDD[T], b in RDD[U]
pipe(command, [envVars]) | RDD[T] → RDD[String], runs a shell command: elements are written to the process's stdin as lines, and the lines of its stdout are returned as RDD[String]; http://blog.madhukaraphatak.com/pipe-in-spark/
coalesce(numPartitions) | RDD[T] → RDD[T], reduces the number of partitions of the RDD to the given number
repartition(numPartitions) | RDD[T] → RDD[T], reshuffles the data into the given number of partitions (more or fewer); always shuffles all data over the network
repartitionAndSortWithinPartitions(partitioner) | RDD[(K,V)] → RDD[(K,V)], repartitions with the given partitioner and sorts the records within each resulting partition; more efficient than calling repartition and then sorting

RDD - Transformations
Scala:
val distFile = sc.textFile("README.md")
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()
Python:
distFile = sc.textFile("README.md")
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()
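The difference between map and flatMap in the examples above is easiest to see on a tiny input. The following is a plain-Python sketch with no Spark involved: a list comprehension stands in for map, and itertools.chain stands in for flatMap. The two sample lines are made up for illustration.

```python
from itertools import chain

lines = ["the cat sat", "on the mat"]

# map: one output element per input element, so each line becomes a list.
mapped = [line.split(" ") for line in lines]

# flatMap: the per-line lists are flattened into one sequence of words.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

With map the result keeps one entry per line (a list of lists); with flatMap the result has one entry per word, which is the shape a word count needs.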

RDD – Actions
Action | Meaning
reduce(f: (T,T) → T) | RDD[T] → T, aggregates all elements of the dataset using f and returns the result
collect() | RDD[T] → Array[T], returns all elements of the dataset as an array
count() | RDD[T] → Long, returns the number of elements in the dataset
first() | RDD[T] → T, returns the first element
take(n) | RDD[T] → Array[T], returns the first n elements
takeSample(withReplacement, num, [seed]) | RDD[T] → Array[T], returns a random sample of num elements
takeOrdered(n, [ordering]) | RDD[T] → Array[T], returns the first n elements in sorted order
saveAsTextFile(path) | RDD[T] → Unit, writes all elements as text files to the local filesystem, HDFS, etc.
saveAsSequenceFile(path) | RDD[(K,V)] → Unit, writes all elements as a Hadoop SequenceFile
saveAsObjectFile(path) | RDD[T] → Unit, writes all elements in a simple format using Java serialization
countByKey() | RDD[(K,V)] → Map[K, Long], returns the count for each key
foreach(f: T → Unit) | RDD[T] → Unit, runs the function f on each element
saveAsNewAPIHadoopDataset | RDD[(K,V)] → Unit, saves to any Hadoop-supported storage using a Hadoop API 'OutputFormat' (mapreduce.OutputFormat), run as an MR job; used for HBase BulkLoad

RDD - Actions
Scala:
val f = sc.textFile("README.md")
val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))
words.reduceByKey(_ + _).collect
Python:
from operator import add
f = sc.textFile("README.md")
words = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
words.reduceByKey(add).collect()

RDD - Persistence
• Unlike MapReduce, Spark can persist (or cache) a dataset.
• Each node stores the partitions it computes in memory or on storage so that other RDD operations (transformations/actions) can reuse them
• Roughly a 10x speedup
• One of the most important Spark features
val wordCounts = rdd.flatMap(x => x.split(" "))
.map(s => (s, 1))
.reduceByKey((a,b) => a + b)
.cache()

RDD - Persistence
Storage Level | Meaning
MEMORY_ONLY | Stores the RDD as deserialized Java objects in the JVM heap. This is the default level. If the RDD does not fit in memory, some partitions are not cached and are recomputed when needed.
MEMORY_AND_DISK | Stores the RDD as deserialized Java objects in the JVM heap. If the RDD does not fit in memory, the partitions that do not fit are stored on disk and read from there when needed.
MEMORY_ONLY_SER | Stores the RDD as serialized Java objects (one byte array per partition) in the JVM heap. More space-efficient, but CPU-intensive to read.
MEMORY_AND_DISK_SER | Stores the RDD as serialized Java objects (one byte array per partition) in the JVM heap. If the RDD does not fit in memory, the partitions that do not fit are stored on disk and read from there when needed.
DISK_ONLY | Stores the RDD on disk only.
MEMORY_ONLY_2 | Same as MEMORY_ONLY, but each partition is stored on 2 nodes.
MEMORY_AND_DISK_2 | Same as MEMORY_AND_DISK, but each partition is stored on 2 nodes.
OFF_HEAP | Stores the RDD in a serialized format suited to Tachyon. Compared with MEMORY_ONLY_SER, GC overhead is reduced. Effective with large heaps and many concurrent applications. Cached data is not lost even if an executor crashes.

RDD - Persistence
Scala:
val f = sc.textFile("README.md")
val w = f.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()
w.reduceByKey(_ + _).collect.foreach(println)
Python:
from operator import add
f = sc.textFile("README.md")
w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).cache()
w.reduceByKey(add).collect()
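Why cache() helps when a dataset feeds more than one action can be shown with a small counter. This is a plain-Python sketch, not Spark: CountingPipeline is a hypothetical class that counts how often the (word, 1) pairs are recomputed, and the two input lines are borrowed from the word count example in this deck.

```python
class CountingPipeline:
    """Toy pipeline that counts recomputations (hypothetical, not Spark)."""

    def __init__(self, data):
        self.data = data
        self.computations = 0    # how many times the pairs were rebuilt
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Mark the intermediate result for reuse, like rdd.cache().
        self._use_cache = True
        return self

    def _pairs(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        pairs = [(w, 1) for line in self.data for w in line.split(" ")]
        if self._use_cache:
            self._cached = pairs
        return pairs

    # Two "actions" that both depend on the same intermediate pairs.
    def count(self):
        return len(self._pairs())

    def distinct_words(self):
        return sorted({w for w, _ in self._pairs()})

lines = ["the cat sat on the mat", "the aardvark sat on the sofa"]

uncached = CountingPipeline(lines)
uncached.count()
uncached.distinct_words()

cached = CountingPipeline(lines).cache()
total = cached.count()
words = cached.distinct_words()

print(uncached.computations, cached.computations)
```

Without the cache every action rebuilds the pairs from the input; with it the second action reuses the stored result, which is the effect persistence has on iterative and interactive workloads.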

RDD - Fault Tolerance
The State of Spark, and Where We're Going Next
- Matei Zaharia, Spark Summit (2013)
youtu.be/nU6vO2EJAb4
• An RDD records the lineage of each transformation, so lost data can be recovered.
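Lineage-based recovery can be sketched as replaying the recorded functions over the source data of the lost partition only. The snippet below is plain Python under simplifying assumptions: two in-memory partitions, a lineage of two element-wise functions, and hypothetical names (base_partitions, compute_partition) that do not come from Spark.

```python
# Source data, split into two partitions (assumed to be re-readable).
base_partitions = {0: [1, 2, 3], 1: [4, 5, 6]}

# Recorded lineage: the element-wise functions applied in order.
lineage = [lambda x: x * 10, lambda x: x + 1]

def compute_partition(index):
    # Rebuild one partition by replaying the lineage over its source data.
    data = base_partitions[index]
    for f in lineage:
        data = [f(x) for x in data]
    return data

materialized = {i: compute_partition(i) for i in base_partitions}
expected = dict(materialized)

del materialized[1]                        # simulate losing partition 1
materialized[1] = compute_partition(1)     # recompute only what was lost

print(materialized)
```

Only the lost partition is recomputed; partition 0 is never touched, which is why recovery is cheap when dependencies are narrow.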

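RDD – Narrow vs. Wide Dependencies
• Narrow Dependencies
• Single node
• Uses memory only
• Fast
• Recovering a few lost partitions is also fast
• Wide Dependencies
• Multiple nodes
• Causes a shuffle
• Uses the network
• Recovery takes a long time
• Checkpointing recommended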

RDD – Job Scheduling
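• Computation follows the direction of the DAG
• Stages are built to run locally where possible (so that they contain only narrow dependencies)
• A new stage starts wherever a shuffle is required
• The node on which a partition runs is chosen with data locality in mind (HDFS)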
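The rule that a shuffle starts a new stage can be sketched as a simple pass over a linear chain of operations. This is a simplification written in plain Python: the WIDE set and split_into_stages are hypothetical names, real Spark schedules a DAG rather than a list, and an operation such as join is not always wide (for example when both inputs are already co-partitioned).

```python
# Operations assumed to need a shuffle in this simplified model.
WIDE = {"groupByKey", "reduceByKey", "sortByKey", "join", "repartition"}

def split_into_stages(operations):
    # Start a new stage whenever an operation needs shuffled input.
    stages = [[]]
    for op in operations:
        if op in WIDE and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages

word_count_ops = ["textFile", "flatMap", "map", "reduceByKey", "saveAsTextFile"]
stages = split_into_stages(word_count_ops)
print(stages)
```

For the word count chain this yields two stages: the narrow operations before the shuffle are pipelined together, and reduceByKey begins the second stage.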

Examples – Word Count
Input Data
the cat sat on the mat
the aardvark sat on the sofa
Result
aardvark 1
cat 1
mat 1
on 2
sat 2
sofa 1
the 4
http://www.slideshare.net/cloudera/spark-devwebinarslides-final

Examples – Word Count
val f = sc.textFile(file)
val w = f.flatMap(l => l.split(" ")).map(word => (word, 1))
val counts = w.reduceByKey(_ + _)
counts.saveAsTextFile(output)
RDD at each step:
HadoopRDD (textFile): the cat sat on the mat / the aardvark sat on the sofa
MapPartitionsRDD (flatMap): the, cat, sat, on, the, mat, the, aardvark, sat, …
MapPartitionsRDD (map): (the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1), (the, 1), (aardvark, 1), (sat, 1), …
ShuffledRDD (reduceByKey): (aardvark, 1), (cat, 1), (mat, 1), (on, 2), (sat, 2), (sofa, 1), (the, 4)
Array (output): (aardvark, 1), (cat, 1), (mat, 1), (on, 2), (sat, 2), (sofa, 1), (the, 4)

Examples - Estimate Pi
• Computes the value of Pi using the Monte Carlo method
• ./bin/run-example SparkPi 2 local
• Algorithm
1. Draw a square, then inscribe a circle within it.
2. Uniformly scatter objects of uniform size over the square.
3. Count the number of objects inside the circle and the
total number of objects.
4. The ratio of the two counts is an estimate of the ratio of
the two areas, which is π/4. Multiply the result by 4 to
estimate π.
https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
http://demonstrations.wolfram.com/MonteCarloEstimateForPi/
https://en.wikipedia.org/wiki/Monte_Carlo_method
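The four steps above can be sketched on a single machine in plain Python. This is not the SparkPi example itself: it samples the unit square and counts points inside the quarter circle of radius 1, whose area ratio is also π/4. The function name estimate_pi, the sample count, and the fixed seed are assumptions made for this illustration.

```python
import random

def estimate_pi(num_samples, seed=42):
    # Scatter points uniformly over the unit square [0, 1) x [0, 1).
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # Count the points that fall inside the quarter circle.
        if x * x + y * y <= 1.0:
            inside += 1
    # inside / num_samples approximates pi / 4.
    return 4.0 * inside / num_samples

pi_estimate = estimate_pi(200_000)
print(pi_estimate)
```

In Spark the sampling loop would be spread across partitions with a map, and the count of hits would be combined with a reduce; the arithmetic stays the same.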

Examples – Estimate Pi
Code figure annotations: Base RDD → transformed RDD → action
https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf

Spark SQL
• Spark module for structured data processing (e.g. DB tables, JSON
files)
• Adding Schema to RDDs
• Three ways to manipulate data:
• SQL (2014.05, Spark 1.0)
• DataFrame (2015.03, Spark 1.3)
• Datasets (2016.01, Spark 1.6)
• Same execution engine for all three
• Spark SQL interfaces provide more information about both
structure and computation being performed than basic Spark RDD
API

Spark SQL Motivation
• Create and Run Spark Programs Faster
• Write less code
• Read less data
• Let the optimizer do the hard work
• Limitations of Shark
• Limited integration with Spark programs
• Hive optimizer not designed for Spark
Spark SQL reuses the best parts of Shark
http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014

SQL
• Execute SQL queries written using either a basic SQL syntax
or HiveQL
• When running SQL from within another programming
language the results will be returned as a DataFrame.
• Interact with the SQL interface using the CLI or JDBC/ODBC

DataFrames
• Distributed collection of data organized into named columns.
• Conceptually equivalent to a table in relational DB or data
frame in R/Python
• API available in Scala, Java, Python, and R
• Richer optimizations(significantly faster than RDDs)
• Can be constructed from a wide array of sources
• data files, tables in Hive, external databases, existing RDDs

DataFrames
http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
• Constructed from a wide array of sources

Datasets
• New experimental interface added in Spark 1.6
• Tries to provide the benefits of RDDs(strong typing, ability to
use powerful lambda functions) with the benefits of Spark
SQL’s optimized execution engine.
• Unified Dataset API can be used both in Scala and Java.

SQL Context and Hive Context
• SQLContext
• Entry point into all functionality in Spark SQL
• Wraps / extends existing spark context
• HiveContext
• Superset of functionality provided by basic SQLContext
• Read data from Hive tables
• Access to Hive Functions -> UDFs
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new SQLContext(sc)
val hc = new HiveContext(sc)

DataFrame Example
• Reading Data From Table
val df = sqlContext.table("flightsTbl")
df.select("Origin", "Dest", "DepDelay").show(5)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 8|
| IAD| TPA| 19|
| IND| BWI| 8|
| IND| BWI| -4|
| IND| BWI| 34|
+------+----+--------+

DataFrame Example
• Using DataFrame API to Filter Data(show delays more than
15min)
df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 19|
| IND| BWI| 34|
| IND| JAX| 25|
| IND| LAS| 67|
| IND| MCO| 94|
+------+----+--------+

SQL Example
• Using SQL to Query and Filter Data(again, show delays more
than 15 min)
// Register Temporary Table
df.registerTempTable("flights")
// Use SQL to Query Dataset
sqlContext.sql("""SELECT Origin, Dest, DepDelay
FROM flights
WHERE DepDelay > 15 LIMIT 5""").show
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 19|
| IND| BWI| 34|
| IND| JAX| 25|
| IND| LAS| 67|
| IND| MCO| 94|
+------+----+--------+

RDD vs. DataFrame
• RDD
• Lower-level API (more control)
• Lots of existing code & users
• Compile-time type-safety
• DataFrame
• Higher-level API(faster development)
• Faster sorting, hashing, and serialization
• More opportunities for automatic optimization
• Lower memory pressure

DataFrames Are Intuitive
dept name age
Bio H Smith 48
CS A Turing 54
Bio B Jones 43
Phys E Witten 61
Find average age by department?
RDD Example
Data Frame Example
SQL Example
sqlContext.sql("SELECT avg(age) FROM data GROUP BY dept")
http://www.slideshare.net/hortonworks/intro-to-spark-with-zeppelin
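What the "average age by department" question computes can be checked by hand with the four rows in the table. The sketch below is plain Python, not Spark: it mirrors the RDD-style approach of carrying a (sum, count) pair per key and dividing at the end, which is the bookkeeping the DataFrame and SQL versions hide.

```python
from collections import defaultdict

# The rows from the table above: (dept, name, age).
rows = [
    ("Bio", "H Smith", 48),
    ("CS", "A Turing", 54),
    ("Bio", "B Jones", 43),
    ("Phys", "E Witten", 61),
]

# Accumulate [sum of ages, number of people] for each department.
sums = defaultdict(lambda: [0, 0])
for dept, _name, age in rows:
    sums[dept][0] += age
    sums[dept][1] += 1

avg_age = {dept: total / count for dept, (total, count) in sums.items()}
print(avg_age)
```

The manual sum-and-count step is what makes the RDD version longer than a one-line group-by and average.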

Spark SQL Optimizations
• Spark SQL uses an underlying optimization engine(Catalyst)
• Catalyst can perform intelligent optimization since it understands the schema
• Spark SQL does not materialize all the columns(as with RDD) only what’s needed
http://www.slideshare.net/hortonworks/intro-to-spark-with-zeppelin

Plan Optimization & Execution
• Spark SQL uses an underlying optimization engine(Catalyst)

An example query
SELECT name
FROM (
SELECT id, name
FROM People) p
WHERE p.id = 1
Logical Plan (root to leaf)
Project name → Filter id = 1 → Project id,name → People

Optimizing with Rules
Each plan is listed from root to leaf:
Original Plan: Project name → Filter id = 1 → Project id,name → People
Filter Push-Down: Project name → Project id,name → Filter id = 1 → People
Combine Projection: Project name → Filter id = 1 → People
Physical Plan: IndexLookup id = 1, return: name
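The two rewrite rules in this slide can be sketched on a toy plan tree. This is an illustration in plain Python, not Catalyst: plans are nested tuples, projections only select columns (no expressions), and the function names push_filter_down and combine_projections are hypothetical.

```python
# A plan node is one of:
#   ("Scan", table)
#   ("Filter", (column, value), child)
#   ("Project", columns, child)

def push_filter_down(plan):
    # Move a Filter below a Project that sits directly under it.
    kind = plan[0]
    if kind == "Scan":
        return plan
    if kind == "Filter":
        _, cond, child = plan
        if child[0] == "Project":
            _, cols, grandchild = child
            return ("Project", cols, push_filter_down(("Filter", cond, grandchild)))
        return ("Filter", cond, push_filter_down(child))
    _, cols, child = plan
    return ("Project", cols, push_filter_down(child))

def combine_projections(plan):
    # Collapse Project over Project when the outer columns are a subset.
    kind = plan[0]
    if kind == "Scan":
        return plan
    if kind == "Project":
        _, cols, child = plan
        child = combine_projections(child)
        if child[0] == "Project" and set(cols) <= set(child[1]):
            return ("Project", cols, child[2])
        return ("Project", cols, child)
    _, cond, child = plan
    return ("Filter", cond, combine_projections(child))

original = ("Project", ("name",),
            ("Filter", ("id", 1),
             ("Project", ("id", "name"),
              ("Scan", "People"))))

pushed = push_filter_down(original)
optimized = combine_projections(pushed)
print(optimized)
```

Applying the rules in this order reproduces the sequence on the slide: the filter moves next to the table first, and the two projections then collapse into one.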

References
Papers
• Spark: Cluster Computing with Working Sets: http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster
Computing : http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
• Spark SQL: Relational Data Processing in Spark: https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
RDD
• Spark 의 핵심은 무엇인가? RDD! (RDD paper review): http://www.slideshare.net/yongho/rdd-paper-review
• Apache Spark RDDs: http://www.slideshare.net/deanchen11/scala-bay-spark-talk
Stanford materials
• Intro to Apache Spark: https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
• Distributed Computing with Spark: https://stanford.edu/~rezab/sparkclass/slides/reza_introtalk.pdf

References
• Apache Spark Overview - MapR: http://www.slideshare.net/caroljmcdonald/apache-spark-overview-52602792
• Introduction to Apache Spark Developer Training - Cloudera: http://www.slideshare.net/cloudera/spark-devwebinarslides-final
• Apache Spark Overview: http://www.slideshare.net/VadimYBichutskiy/apache-spark-overview
• Simplifying Big Data Analytics with Apache Spark: http://www.slideshare.net/databricks/bdtc2
• Intro to Spark with Zeppelin: http://www.slideshare.net/hortonworks/intro-to-spark-with-zeppelin
• Introduction to Big Data Analytics using Apache Spark and Zeppelin: http://www.slideshare.net/alexzeltov/introduction-to-big-data-analytics-using-apache-spark-and-zeppelin-on-hdinsights-on-azure-saas-andor-hdp-on-azurepaas
• Spark overview: http://www.slideshare.net/LisaHua/spark-overview-37479609
• Introduction to real time big data with Apache Spark: http://www.slideshare.net/tmatyashovsky/introduction-to-realtime-big-data-with-apache-spark
• Spark은 왜 이렇게 유명해 지고 있을까?: http://www.slideshare.net/KSLUG/ss-47355270
• Apache Spark Briefing: http://www.slideshare.net/ThomasWDinsmore/apache-spark-briefing-12062013
• Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106: http://www.slideshare.net/sanghoonlee982/spark-overview-20141106
• Zeppelin(Spark)으로 데이터 분석하기: http://www.slideshare.net/sangwookimme/zeppelinspark-41329473
• Lightening Fast Big Data Analytics using Apache Spark: http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark
• Apache Hive on Apache Spark: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Apache Spark: The Next Gen toolset for Big Data Processing: http://www.slideshare.net/prajods/apache-spark-the-next-gen-toolset-for-big-data-processing
• Intro to Spark and Spark SQL: http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014
• Spark SQL Deep Dive @ Melbourne Spark Meetup: http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
• A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
• Apache Spark (big Data) DataFrame - Things to know: https://www.linkedin.com/pulse/apache-spark-big-data-dataframe-things-know-abhishek-choudhary

References
• Spark SQL - Quick Guide: https://www.tutorialspoint.com/spark_sql/spark_sql_quick_guide.htm
• Big Data Processing with Apache Spark - Part 2: Spark SQL: https://www.infoq.com/articles/apache-spark-sql
• Analytics with Apache Spark Tutorial Part 2: Spark SQL: https://dzone.com/articles/analytics-with-apache-spark-tutorial-part-2-spark
• Apache Spark Resource Management and YARN App Models: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
• Why Spark Is the Next Top (Compute) Model: http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model-39976454
• Introduction to Apache Spark: http://www.slideshare.net/datamantra/introduction-to-apache-spark-45062010
• Apache Spark & Hadoop: http://www.slideshare.net/MapRTechnologies/spark-overviewjune2014
• Spark와 Hadoop, 완벽한 조합 (Korean): http://www.slideshare.net/pudidic/spark-hadoop
• Big Data visualization with Apache Spark and Zeppelin: http://www.slideshare.net/prajods/big-data-visualization-with-apache-spark-and-zeppelin
• latency: https://gist.github.com/hellerbarde/2843375
• Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San Jose 2015: http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
• Apache Spark Architecture: http://www.slideshare.net/AGrishchenko/apache-spark-architecture

Google Big Data Technologies
Technology Year Description
GFS 2003 Google File System: A Distributed Storage
MapReduce 2004 Simplified Data Processing on Large Clusters
Sawzall 2005 Interpreting the Data: Parallel Analysis with Sawzall
Chubby 2006 The Chubby Lock Service for Loosely-Coupled Distributed Systems
BigTable 2006 A Distributed Storage System for Structured Data
Paxos 2007 Paxos Made Live - An Engineering Perspective
Colossus 2009 GFS II
Percolator 2010 Large-scale Incremental Processing Using Distributed Transactions and Notifications
Pregel 2010 A System for Large-Scale Graph Processing
Dremel 2010 Interactive Analysis of Web-Scale Datasets
Tenzing 2011 A SQL Implementation On The MapReduce Framework
Megastore 2011 Providing Scalable, Highly Available Storage for Interactive Services
Spanner 2012 Google's Globally-Distributed Database
F1 2012 The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business

GFS Motivation
• More than 15,000 commodity-class PC's.
• Fault-tolerance provided in software
• More cost-effective solution
• Multiple clusters distributed worldwide
• One query reads 100’s of MB of data
• One query consumes 10’s of billions of CPU cycles
• Thousands of queries served per second.
• Google stores dozens of copies of the entire Web!
• Conclusion: Need Large, distributed, highly fault-
tolerant file system
http://www.cs.brandeis.edu/~dilant/WebPage_TA160/The%20Google%20File%20System.pdf

GFS Assumptions
• High component failure rates
• Monitoring, fault tolerance, and failure recovery must be built in.
• A "modest" number of HUGE files
• Just a few million
• Each is 100MB or larger; multi-GB files typical
• Files are written once and mostly appended to
• Perhaps concurrently
• Large streaming reads
• High sustained throughput matters more than low latency
http://research.google.com/archive/gfs.html
https://www.cs.umd.edu/class/spring2011/cmsc818k/Lectures/gfs-hdfs.pdf

GFS Architecture
http://research.google.com/archive/gfs.html
https://www.cs.umd.edu/class/spring2011/cmsc818k/Lectures/gfs-hdfs.pdf
<GFS Architecture>
<GFS file storage structure>

SQL JOINS
http://amirulkamil.com/best-describe-join/

YARN Architecture
https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html

YARN Features
Feature | Description
Multi-tenancy | YARN allows multiple access engines (either open-source or proprietary) to use Hadoop as the common standard for batch, interactive and real-time engines that can simultaneously access the same data set. Multi-tenant data processing improves an enterprise's return on its Hadoop investments.
Cluster utilization | YARN's dynamic allocation of cluster resources improves utilization over more static MapReduce rules used in early versions of Hadoop.
Scalability | Data center processing power continues to rapidly expand. YARN's ResourceManager focuses exclusively on scheduling and keeps pace as clusters expand to thousands of nodes managing petabytes of data.
Compatibility | Existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to existing processes that already work.

Hive Overview
• Invented at Facebook. Open sourced to Apache in 2008
• A database/data warehouse on top of Hadoop
• Structured data similar to relational schema
• Tables, columns, rows and partitions
• SQL like query language (HiveQL)
• A subset of SQL with many traditional features
• It is possible to embed MR scripts in HiveQL
• Queries are compiled into MR jobs that are executed on Hadoop.
Source: http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf

Hive Motivation(Facebook)
• Problem: Data growth was exponential
• 200GB per day in March 2008
• 2+TB(compressed) raw data / day in April 2009
• 4+TB(compressed) raw data / day in Nov. 2009
• 12+TB(compressed) raw data / day today(2010)
• The Hadoop Experiment
• Much superior to availability and scalability of commercial DBs
• Efficiency not that great, but throw more hardware
• Partial Availability/resilience/scale more important than ACID
• Problem: Programmability and Metadata
• MapReduce hard to program (users know sql/bash/python)
• Need to publish data in well known schemas
• Solution: SQL + MapReduce = HIVE (2007)
Sources: 1) http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation
2) http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop

Hive – Data Flow of Facebook
Sources: http://borthakur.com/ftp/hadoopmicrosoft.pdf, http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf

Hive - Architecture
https://cwiki.apache.org/confluence/display/Hive/Design

Hive - Query Execution and MR Jobs
Source: Ysmart(Yet Another SQL-to-MapReduce Translator), http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf

Hive - Limitations
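• Performance (mainly up to version 0.12)
• For simple queries, Hive performance is comparable with hand-coded MR jobs
• The execution time is much longer for complex queries
• An MR job runs at every operation step, so heavy IO creates a performance bottleneck
• Each step incurs inefficient data scans and transfers
• A weak optimizer produces inefficient execution plans
• The Stinger Initiative improved performance dramatically over earlier versions (introduction of Tez and ORC, optimizer improvements)
• Usable only in a limited way, for data warehousing.
• Limited for streaming, graph, ML, and similar workloads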

RDD Operations in paper

Spark Master - YARN
• YARN allows you to dynamically share and centrally configure the same pool of cluster
resources between all frameworks that run on YARN. You can throw your entire cluster
at a MapReduce job, then use some of it on an Impala query and the rest on Spark
application, without any changes in configuration.
• You can take advantage of all the features of YARN schedulers for categorizing,
isolating, and prioritizing workloads.
• Spark standalone mode requires each application to run an executor on every node in
the cluster, whereas with YARN, you choose the number of executors to use.
• Finally, YARN is the only cluster manager for Spark that supports security. With YARN,
Spark can run against Kerberized Hadoop clusters and uses secure authentication
between its processes.

Spark Master - YARN
yarn-client mode / yarn-cluster mode

Spark 2.0 DataSets
Typed and Un-typed APIs
Language | Main Abstraction
Scala | Dataset[T] & DataFrame (alias for Dataset[Row])
Java | Dataset[T]
Python* | DataFrame
R* | DataFrame
Note: Since Python and R have no compile-time type-safety, we only have untyped APIs, namely DataFrames.
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

Static-typing and runtime type-safety

High-level abstraction and custom view
1. Spark reads the JSON, infers the schema, and creates a collection of DataFrames
2. At this point, Spark converts your data into DataFrame=Dataset[Row], a collection of
generic Row object, since it does not know the exact type.
3. Now, Spark converts the Dataset[Row]-> Dataset[DeviceIoTData] type-specific Scala
JVM Object, as dictated by the class DeviceIoTData
case class DeviceIoTData (battery_level: Long, c02_level: Long, cca2: String, cca3: String, cn: String, device_id: Long,
device_name: String, humidity: Long, ip: String, latitude: Double, lcd: String, longitude: Double, scale:String, temp: Long, timestamp: Long)
{"device_id": 198164, "device_name": "sensor-pad-198164owomcJZ", "ip": "80.55.20.25", "cca2": "PL", "cca3": "POL", "cn": "Poland",
"latitude": 53.080000, "longitude": 18.620000, "scale": "Celsius", "temp": 21, "humidity": 65, "battery_level": 8, "c02_level": 1408, "lcd":
"red", "timestamp" :1458081226051}
// read the json file and create the dataset from the
// case class DeviceIoTData
// ds is now a collection of JVM Scala objects DeviceIoTData
val ds = spark.read.json("/databricks-public-datasets/data/iot/iot_devices.json").as[DeviceIoTData]

Ease-of-use of APIs with structure
• Although structure may limit control over what your Spark program can do with data, most computations can be accomplished with Dataset's high-level APIs.
• For example, it's much simpler to perform agg, select, sum, avg, map, filter, or groupBy
// Use filter(), map(), groupBy() country, and compute avg()
// for temperatures and humidity. This operation results in
// another immutable Dataset. The query is simpler to read,
// and expressive
val dsAvgTmp = ds.filter(d => {d.temp > 25}).map(d => (d.temp, d.humidity, d.cca3)).groupBy($"_3").avg()
//display the resulting dataset
display(dsAvgTmp)

Performance and Optimization
• DataFrame and Dataset APIs are built on top of the Spark
SQL engine.
• It uses Catalyst to generate an optimized logical and physical query plan.
• With Tungsten's Dataset encoders, Spark can serialize and deserialize JVM objects efficiently and generate compact bytecode, which gives a speed advantage.

Use DataFrames or Datasets?
• If you want rich semantics, high-level abstractions, and domain specific APIs → use DataFrame or Dataset.
• If your processing demands high-level expressions, filters, maps, aggregation, averages, sum, SQL queries, columnar access and use of lambda functions on semi-structured data → DataFrame or Dataset.
• If you want higher degree of type-safety at compile time, want typed JVM objects, take advantage of Catalyst optimization, and benefit from Tungsten's efficient code generation → Dataset.
• If you want unification and simplification of APIs across Spark Libraries → DataFrame or Dataset.
• If you are an R user → DataFrames.
• If you are a Python user → DataFrames, and resort back to RDDs if you need more control.

Reliability Models
Guarantee | Core Storm | Storm Trident | Spark Streaming
At Most Once | Yes | Yes | No
At Least Once | Yes | Yes | No*
Once and Only Once (Exactly Once) | No | Yes | Yes*
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming

Programming Model
Aspect | Core Storm | Storm Trident | Spark Streaming
Stream Primitive | Tuple | Tuple, Tuple Batch, Partition | DStream
Stream Sources | Spouts | Spouts, Trident Spouts | HDFS, Network
Computation/Transformation | Bolts | Filters, Functions, Aggregations, Joins | Transformation, Window Operations
Stateful Operation | No (roll your own) | Yes | Yes
Output/Persistence | Bolts | State, MapState | foreachRDD
2014, http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming

Performance
• Storm capped at 10k msgs/sec/node?
• Spark Streaming 40x faster than Storm?
System Performance
Storm(Twitter) 10,000 records/s/node
Spark Streaming 400,000 records/s/node
Apache S4 7,000 records/s/node
Other Commercial Systems 100,000 records/s/node
2014, http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
