Introduction to Apache Spark
Part 1
2016.11.07
민형기
Contents
• MapReduce
• Apache Spark
• Spark SQL
Brief History
MapReduce
MapReduce History
• 1979 – Stanford, MIT, CMU, etc
• set/list operations in LISP, Prolog, etc. for parallel processing
• 2004 – Google
• MapReduce(2004): Simplified Data Processing on Large Clusters
• http://research.google.com/archive/mapreduce.html
• 2006 – Apache Hadoop: http://hadoop.apache.org/
• Hadoop, originating from the Nutch Project, Doug Cutting
• 2008 – Yahoo
• Web scale search indexing
• Hadoop Summit, HUG, etc
• 2009 – Amazon AWS
• Elastic MapReduce
• Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc
• 2012.01 – Apache Hadoop 1.0
• MapReduce 1.0: cluster resource management & data processing
• 2013.10 – Apache Hadoop 2.2
• MapReduce 2.0: data processing
• YARN: cluster resource management
Jeff Dean
Doug Cutting
29 facts about Jeff Dean: http://ppss.kr/archives/16672
MapReduce Motivation
• Processing the data used at Google requires many machines.
• In particular, when the input data is large, the computation has to be distributed across many machines to finish in a reasonable time.
• Distributed processing is also needed to build the web index, which means processing a huge number of web pages.
• The variety of data-processing jobs keeps growing
• Computing the search index (inverted index), various representations of the web graph, per-host summaries of crawled pages, the set of most frequent queries for a given day, etc.
• Most of these are conceptually simple, but the distributed-processing concerns (parallelizing the work, distributing the data, handling failures, etc.) make the code complex
• A distributed data processing framework is needed
http://research.google.com/archive/mapreduce.html
MapReduce Programming Model
• Map and Reduce are terms borrowed from functional languages such as Lisp
• Map: apply a function to a collection of data and produce a new collection
• Reduce: apply a function to a collection of data and fold it into a single result
• Map: <key, value> → <key', value'>*
• Reduce: <key', value'*> → value''*
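The same shape can be sketched on plain Scala collections (an illustration of the model only, not the Hadoop API; the input lines are made up):
Scala:
// Map: emit (key, value) pairs; Reduce: fold the values that share a key
val lines = Seq("the cat sat", "the dog sat")
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))      // map phase
val reduced = mapped.groupBy(_._1)                                   // group (shuffle) by key
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }         // reduce phase
// reduced: Map(the -> 2, cat -> 1, sat -> 2, dog -> 1)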
MapReduce Process
MapReduce Design Pattern
• Basic MapReduce Patterns
• Counting, Summing
• Collating
• Filtering(“Grepping”), Parsing, and Validation
• Distributed Task Execution
• Sorting
• Not-So-Basic MapReduce Patterns
• Iterative Message Passing(Graph Processing)
• Distinct Values(Unique Items Counting)
• Cross-Correlation
• Relational MapReduce Patterns
• Selection
• Projection
• Union
• Intersection
• Difference
• GroupBy and Aggregation
• Joining
• Machine Learning and Math MapReduce Algorithms
https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
MapReduce Limitations
• Programming directly against MR is hard.
• MR is difficult, takes a lot of development effort, and performance is hard to guarantee.
• Performance varies widely with the developer's skill
• Much less productive than an equivalent SQL implementation
• MapReduce performs well for one-pass computations, but it is inefficient for multi-pass algorithms.
• Optimized for disk IO / makes poor use of memory
• Iterative algorithms keep generating disk IO, so they are inefficient
MR is not optimized for many other kinds of computation.
• Specialized systems are needed
https://stanford.edu/~rezab/sparkclass/slides/reza_introtalk.pdf
MapReduce Limitations
General batch processing: MapReduce
Specialized systems (iterative, interactive and streaming apps): Storm, Giraph, Drill, Tez, Impala, Tajo, Druid, Presto, …
Apache Spark
What is Apache Spark?
• Fast and general engine for large-scale data processing.
• Features
• Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Ease of use: easy to write in Java, Scala, Python, and R
• Generality: batch, streaming, iterative, interactive
• Runs everywhere: Hadoop, Mesos, Standalone
• ‘09 UC Berkeley AMPLab, open sourced in ‘10
Spark Stack(Unified Platform)
Spark Core / RDD
Spark Streaming
(Streaming)
GraphX
(graph)
Spark SQL
MLlib
(Machine Learning)
Standalone YARN Mesos
Scala Java Python R
Benefit for Users
The same engine can be used for data extraction, model training, and interactive queries.
Separate engines: each step (parse, train, query) must read its input from and write its output back to the DFS (HDFS).
Spark: parse, train, and query share data in memory after a single DFS read, enabling interactive analysis.
https://spark-summit.org/2013/zaharia-the-state-of-spark-and-where-were-going/
Spark History
• 2009 – development starts at the UC Berkeley RAD Lab (AMP Lab)
• 2010 – open sourced
• 2010 – Spark: Cluster Computing with Working Sets
• 2012 – Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
• 2013 – moved to the Apache Incubator
• 2014 – Apache Top-Level Project
• 2014 – world record for large-scale sorting with Spark (Databricks)
• May 2014 – 1.0 release
• July 2016 – 2.0 release
Spark – Motivation
• MapReduce made big data analysis easy
• But it only fits acyclic (directed) data flow models
• What MapReduce lacks
• Iterative jobs: machine learning, graph processing
• Interactive analytics: ad-hoc queries (Hive, Pig)
→ Data sharing is slow
• How can we improve this?
• Fast data sharing
• General DAGs
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
http://www.slideshare.net/yongho/rdd-paper-review
Operations in MapReduce
• In MR, data sharing is slow because of replication, serialization, and disk IO
• Most MR jobs spend about 90% of their time on HDFS read-write
• Iterative Operations • Interactive Operations
https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm
Operations in Spark RDD
• Iterative Operations • Interactive Operations
https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm
• RDD: data is shared in memory
• Sharing data through memory is 10–100x faster than over the network or disk
Apache Spark – Time to Sort 100TB
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Scala, Java, Python, R
// Scala:
val lines = sc.textFile(…)
val pairs = lines.map( s => (s, 1) )
val counts = pairs.reduceByKey( (a,b) => a + b)
// Java:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
// Python:
lines = sc.textFile(…)
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a+b)
Spark Context
• Every Spark application needs a Spark Context
• Main entry point for the Spark API
• Represents the connection to a Spark cluster
• The Spark shell provides a preconfigured Spark Context named sc
• Scala (spark-shell):
• Python (pyspark):
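For a standalone application the context is created explicitly; a minimal sketch (the app name, master URL, and file path are placeholder values):
Scala:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("MyApp")       // shown in the web UI / cluster manager
  .setMaster("local[2]")     // see the master values on the next slide
val sc = new SparkContext(conf)
val rdd = sc.textFile("README.md")   // SparkContext is the entry point for creating RDDs
println(rdd.count())
sc.stop()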
Master
• The master parameter of SparkContext decides which cluster to use
master description
local run Spark locally with one worker thread
(no parallelism)
local[K] run Spark locally with K worker threads
(ideally set to # cores)
spark://host:port connect to a Spark standalone cluster;
PORT depends on config (7077 by default)
mesos://host:port connect to a Mesos cluster;
PORT depends on config (5050 by default)
yarn connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode.
The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
Master
http://spark.apache.org/docs/latest/cluster-overview.html
1. Connect to the cluster manager to allocate resources for the application
2. Acquire executors that will run the application's tasks on the cluster
3. Send the application code to the executors
4. Send tasks to the executors and run them
Master – YARN vs. Standalone
• Comparison by master type (YARN vs. Standalone)
YARN Cluster YARN Client Spark Standalone
Driver runs in: Application Master Client Client
Who requests resources? Application Master Application Master Client
Who starts executor processes? YARN NM YARN NM Spark Workers
Persistent services YARN RM / NM YARN RM / NM Spark Master / Workers
Supports Spark Shell? No Yes Yes
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Resilient Distributed Datasets(RDD)
• Primary abstraction in Spark
• An Immutable collections of objects that can be operated on in parallel
• RDD
• Resilient: if data held in memory is lost, it can be rebuilt
• Distributed: stored in memory across the cluster
• Main idea: Resilient Distributed Datasets
• Immutable collections of objects, spread across a cluster
• Users can control the partitioning and persistence (memory, disk, etc.) of a collection
• RDD creation: only storage → RDD or RDD → RDD is possible
• Statically typed: RDD[T] has objects of type T
• Fault-tolerant: only the lineage of the data is recorded
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
https://gist.github.com/hellerbarde/2843375
http://www.slideshare.net/yongho/rdd-paper-review
RDD - Operations
• Two types: transformations and actions
• Transformation operations
• Produce a new RDD from an existing one, e.g. rdd.map(…)
• Lazy operations
• Recorded in the lineage
• Action operations
• Return or save the computed result, e.g. rdd.count()
• Executed immediately
• Use the information in the lineage (the transformation operations) to compute an execution plan
• Executed along the optimal path
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
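A minimal sketch of the lazy/eager split (the file name is a placeholder):
Scala:
val lines  = sc.textFile("data.txt")            // transformation: lazy, only recorded in the lineage
val errors = lines.filter(_.contains("ERROR"))  // transformation: lazy
val pairs  = errors.map(line => (line, 1))      // transformation: lazy
val n = errors.count()                          // action: triggers the actual job
errors.take(5).foreach(println)                 // action: runs another job over the same lineage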
RDD – Transformations
Transformation Meaning
map(f: T → U) RDD[T] → RDD[U]
filter(f: T → Bool) RDD[T] → RDD[T]
flatMap(f: T → Seq[U]) RDD[T] → RDD[U]
mapPartitions(f: Iterator[T] → Iterator[U]) RDD[T] → RDD[U], runs separately on each partition block
mapPartitionsWithIndex(f: (Int, Iterator[T]) → Iterator[U]) RDD[T] → RDD[U], the integer value is the partition index
sample(withReplacement, fraction, seed) RDD[T] → RDD[T], samples a fraction of the data
union(otherDataset) (RDD[T], RDD[T]) → RDD[T], A ∪ B
intersection(otherDataset) (RDD[T], RDD[T]) → RDD[T], A ∩ B
distinct([numTasks]) RDD[T] → RDD[T], returns the distinct elements of the source dataset
groupByKey([numTasks]) RDD[(K,V)] → RDD[(K, Iterable[V])]
reduceByKey(f: (V,V) → V, [numTasks]) RDD[(K,V)] → RDD[(K,V)], aggregates the values for each key
sortByKey([ascending], [numTasks]) RDD[(K,V)] → RDD[(K,V)], sorts by key
http://spark.apache.org/docs/1.6.2/programming-guide.html
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
RDD – Transformations
Transformation Meaning
join(otherDataset, [numTasks]) (RDD[(K,V)], RDD[(K,W)]) → RDD[(K,(V,W))], all (v,w) pairs for each key k
leftOuterJoin, rightOuterJoin, fullOuterJoin
cogroup(otherDataset, [numTasks]) (RDD[(K,V)], RDD[(K,W)]) → RDD[(K, (Iterable[V], Iterable[W]))]
alias: groupWith
cartesian(otherDataset) (RDD[T], RDD[U]) → RDD[(T,U)], cartesian product of the two RDDs:
all (a,b) elements, a in RDD[T], b in RDD[U]
pipe(command, [envVars]) String → RDD[String], runs a shell command and returns the result as an RDD;
lines are fed to the process's stdin and its stdout is returned as RDD[String]
http://blog.madhukaraphatak.com/pipe-in-spark/
coalesce(numPartitions) RDD[T] → RDD[T], decreases the number of partitions of the RDD to the given number
repartition(numPartitions) RDD[T] → RDD[T], reshuffles the data in the RDD into the given number of partitions;
the whole dataset is always shuffled over the network
repartitionAndSortWithinPartitions(partitioner) RDD[(K,V)] → RDD[(K,V)],
repartitions using the given partitioner and sorts within each resulting partition;
more efficient than calling repartition and then sorting
http://spark.apache.org/docs/1.6.2/programming-guide.html
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
RDD - Transformations
Scala:
Python:
val distFile = sc.textFile("README.md")
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()
distFile = sc.textFile("README.md")
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()
RDD – Actions
Action Meaning
reduce(f: (T,T) → T) RDD[T] → T, aggregates all elements of the dataset using f
collect() RDD[T] → Array[T], returns all elements of the dataset as an array
count() RDD[T] → Long, returns the number of elements in the dataset
first() RDD[T] → T, returns the first element
take(n) RDD[T] → Array[T], returns the first n elements
takeSample(withReplacement, num, [seed]) RDD[T] → Array[T], returns a random sample of num elements
takeOrdered(n, [ordering]) RDD[T] → Array[T], returns the first n elements in sorted order
saveAsTextFile(path) RDD[T] → Unit, writes all elements as a text file on the local filesystem, HDFS, etc.
saveAsSequenceFile(path) RDD[T] → Unit, writes all elements as a Hadoop SequenceFile
saveAsObjectFile(path) RDD[T] → Unit, writes all elements in a simple format using Java serialization
countByKey() RDD[(K,V)] → Map[K, Long], returns the count for each key
foreach(f: T → Unit) RDD[T] → Unit, runs the function f on each element
saveAsNewAPIHadoopDataset RDD[T] → Unit, writes to HDFS (as an MR job) using a Hadoop 'OutputFormat'
(mapreduce.OutputFormat); used for HBase bulk load
http://spark.apache.org/docs/1.6.2/programming-guide.html
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
RDD - Actions
Scala:
Python:
val f = sc.textFile("README.md")
val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))
words.reduceByKey(_ + _).collect
from operator import add
f = sc.textFile("README.md")
words = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
words.reduceByKey(add).collect()
RDD - Persistence
• Unlike MapReduce, Spark can persist (or cache) a dataset.
• Each node stores the partitions it computes in memory or on storage so they can be reused by other RDD operations (transformations/actions)
• Often around a 10x speedup
• One of the most important Spark features
val wordCounts = rdd.flatMap(x => x.split(" "))
.map(s => (s, 1))
.reduceByKey((a,b) => a + b)
.cache()
RDD - Persistence
Storage Level Meaning
MEMORY_ONLY Stores the RDD as deserialized Java objects on the JVM heap. The default level.
If the RDD does not fit in memory, some partitions are not stored and are recomputed when needed.
MEMORY_AND_DISK Stores the RDD as deserialized Java objects on the JVM heap.
If the RDD does not fit in memory, the remaining partitions are stored on disk and read from there when needed.
MEMORY_ONLY_SER Stores the RDD as serialized Java objects (one byte array per partition) on the JVM heap.
More space-efficient, but CPU-intensive to read.
MEMORY_AND_DISK_SER Stores the RDD as serialized Java objects (one byte array per partition) on the JVM heap.
If the RDD does not fit in memory, the remaining partitions are stored on disk and read from there when needed.
DISK_ONLY Stores the RDD partitions only on disk.
MEMORY_ONLY_2 Same as MEMORY_ONLY, but replicates each partition on 2 nodes.
MEMORY_AND_DISK_2 Same as MEMORY_AND_DISK, but replicates each partition on 2 nodes.
OFF_HEAP Stores the RDD in a serialized format suited to Tachyon.
Compared to MEMORY_ONLY_SER, GC overhead is reduced.
Useful with large heaps or many concurrent applications.
Cached data is not lost even if an executor crashes.
RDD - Persistence
Scala:
Python:
val f = sc.textFile("README.md")
val w = f.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()
w.reduceByKey(_ + _).collect.foreach(println)
from operator import add
f = sc.textFile("README.md")
w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).cache()
w.reduceByKey(add).collect()
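Beyond cache() (which is shorthand for persist(StorageLevel.MEMORY_ONLY)), a storage level from the table above can be chosen explicitly; a hedged sketch:
Scala:
import org.apache.spark.storage.StorageLevel
val f = sc.textFile("README.md")
val w = f.flatMap(_.split(" ")).map(word => (word, 1))
  .persist(StorageLevel.MEMORY_AND_DISK_SER)   // spill serialized partitions to disk if needed
w.reduceByKey(_ + _).collect()                 // first action materializes and stores w
w.countByKey()                                 // second action reuses the persisted partitions
w.unpersist()                                  // release the storage when done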
RDD - Fault Tolerance
The State of Spark, and Where We're Going Next
- Matei Zaharia, Spark Summit (2013)
youtu.be/nU6vO2EJAb4
• RDDs record the lineage of each transformation, so lost data can be recovered.
• Narrow dependencies
• Single node
• Memory only
• Fast
• Recovering individual partitions is also fast
• Wide dependencies
• Multiple nodes
• Cause a shuffle
• Use the network
• Recovery takes a long time
• Checkpointing is recommended (see the sketch below)
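A sketch of checkpointing an iterative job whose lineage grows through wide (shuffle) dependencies; the checkpoint directory and the toy computation are placeholders:
Scala:
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
var rdd = sc.parallelize(1 to 1000000).map(i => (i % 100, 1L))
for (i <- 1 to 10) {
  rdd = rdd.reduceByKey(_ + _).mapValues(_ + 1)   // a wide dependency per iteration
  if (i % 3 == 0) {
    rdd.checkpoint()   // the lineage up to this point is replaced by the saved data
    rdd.count()        // an action is needed to actually write the checkpoint
  }
}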
RDD – Narrow vs. Wide Dependencies
RDD – Job Scheduling
• Computed along the DAG
• Stages are built so that they can run locally where possible (i.e. with narrow dependencies)
• A new stage starts wherever a shuffle is needed
• The node a partition runs on is chosen with data locality in mind (HDFS)
Examples – Word Count
Input Data
the cat sat on the mat
the aardvark sat on the sofa
Result
aardvark 1
cat 1
mat 1
on 2
sat 2
sofa 1
the 4
http://www.slideshare.net/cloudera/spark-devwebinarslides-final
Examples – Word Count
(Dataflow shown on the slide: the input lines are split into words by flatMap, each word is mapped to a (word, 1) pair, and reduceByKey sums the counts per word to give the result above.)
val f = sc.textFile(file)
val w = f.flatMap(l => l.split(" ")).map(word => (word, 1))
val counts = w.reduceByKey(_ + _)
counts.saveAsTextFile(output)
RDD lineage: HadoopRDD → MapPartitionsRDD → MapPartitionsRDD → ShuffledRDD → Array
Examples - Estimate Pi
• Estimating the value of Pi with the Monte Carlo method
• ./bin/run-example SparkPi 2 local
• Algorithm
1. Draw a square, then inscribe a circle within it.
2. Uniformly scatter objects of uniform size over the square.
3. Count the number of objects inside the circle and the
total number of objects.
4. The ratio of the two counts is an estimate of the ratio of
the two areas, which is π/4. Multiply the result by 4 to
estimate π.
https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
http://demonstrations.wolfram.com/MonteCarloEstimateForPi/
https://en.wikipedia.org/wiki/Monte_Carlo_method
Examples – Estimate Pi
(The code on the slide is annotated as: base RDD → transformed RDD → action.)
https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
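A sketch in the spirit of the bundled SparkPi example, following the algorithm above (the sample count is arbitrary):
Scala:
val numSamples = 1000000
val count = sc.parallelize(1 to numSamples).map { _ =>
  val x = math.random * 2 - 1                  // random point in the square [-1, 1] x [-1, 1]
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0             // 1 if the point falls inside the unit circle
}.reduce(_ + _)                                // action: sum over all partitions
println(s"Pi is roughly ${4.0 * count / numSamples}")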
Spark SQL
Spark SQL
• Spark module for structured data processing (e.g. DB tables, JSON
files)
• Adding Schema to RDDs
• Three ways to manipulate data:
• SQL (2014.05, Spark 1.0)
• DataFrame (2015.03, Spark 1.3)
• Datasets (2016.01, Spark 1.6)
• Same execution engine for all three
• Spark SQL interfaces provide more information about both
structure and computation being performed than basic Spark RDD
API
Spark SQL Motivation
• Create and Run Spark Programs Faster
• Write less code
• Read less data
• Let the optimizer do the hard work
• Limitations of Shark
• Limited integration with Spark programs
• Hive optimizer not designed for Spark
Spark SQL reuses the best parts of Shark
http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014
SQL
• Execute SQL queries written using either a basic SQL syntax
or HiveQL
• When running SQL from within another programming
language the results will be returned as a DataFrame.
• Interact with the SQL interface using the CLI or JDBC/ODBC
DataFrames
• Distributed collection of data organized into named columns.
• Conceptually equivalent to a table in relational DB or data
frame in R/Python
• API available in Scala, Java, Python, and R
• Richer optimizations (significantly faster than RDDs)
• Can be constructed from a wide array of sources
• data files, tables in Hive, external databases, existing RDDs
DataFrames
http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
• Constructed from a wide array of sources
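Hedged sketches of a few construction paths (Spark 1.x style; the file paths, the table name, and the pre-existing RDD rdd are placeholders):
Scala:
val df1 = sqlContext.read.json("people.json")          // JSON file
val df2 = sqlContext.read.parquet("events.parquet")    // Parquet file
val df3 = sqlContext.table("some_hive_table")          // Hive table (with a HiveContext)
val df4 = rdd.toDF("name", "age")                      // existing RDD of pairs (needs import sqlContext.implicits._)
df1.printSchema()
df1.show(5)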
Datasets
• New experimental interface added in Spark 1.6
• Tries to provide the benefits of RDDs(strong typing, ability to
use powerful lambda functions) with the benefits of Spark
SQL’s optimized execution engine.
• Unified Dataset API can be used both in Scala and Java.
SQL Context and Hive Context
• SQLContext
• Entry point into all functionality in Spark SQL
• Wraps / extends existing spark context
• HiveContext
• Superset of functionality provided by basic SQLContext
• Read data from Hive tables
• Access to Hive Functions -> UDFs
val sqlContext = new SQLContext(sc)
val hc = new HiveContext(sc)
DataFrame Example
• Reading Data From Table
val df = sqlContext.table("flightsTbl")
df.select("Origin", "Dest", "DepDelay").show(5)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 8|
| IAD| TPA| 19|
| IND| BWI| 8|
| IND| BWI| -4|
| IND| BWI| 34|
+------+----+--------+
DataFrame Example
• Using the DataFrame API to filter data (show delays of more than 15 min)
df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 19|
| IND| BWI| 34|
| IND| JAX| 25|
| IND| LAS| 67|
| IND| MCO| 94|
+------+----+--------+
SQL Example
• Using SQL to Query and Filter Data(again, show delays more
than 15 min)
// Register Temporary Table
df.registerTempTable("flights")
// Use SQL to Query Dataset
sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 19|
| IND| BWI| 34|
| IND| JAX| 25|
| IND| LAS| 67|
| IND| MCO| 94|
+------+----+--------+
RDD vs. DataFrame
• RDD
• Lower-level API (more control)
• Lots of existing code & users
• Compile-time type-safety
• DataFrame
• Higher-level API(faster development)
• Faster sorting, hashing, and serialization
• More opportunities for automatic optimization
• Lower memory pressure
DataFrames are intuitive
dept name age
Bio H Smith 48
CS A Turing 54
Bio B Jones 43
Phys E Witten 61
Find average age by department?
RDD Example
DataFrame Example
SQL Example (the RDD and DataFrame versions are sketched below)
sqlContext.sql("SELECT avg(age) FROM data GROUP BY dept")
http://www.slideshare.net/hortonworks/intro-to-spark-with-zeppelin
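The RDD and DataFrame versions shown as images on the slide can be sketched as follows; data is assumed to be an RDD of (dept, name, age) tuples and df a DataFrame with those columns:
Scala:
// RDD API: build (dept, (age, 1)) pairs, sum both parts, then divide
val avgAgeByDeptRdd = data
  .map { case (dept, _, age) => (dept, (age.toDouble, 1)) }
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }
// DataFrame API: declarative, the optimizer decides how to run it
import org.apache.spark.sql.functions.avg
val avgAgeByDeptDf = df.groupBy("dept").agg(avg("age"))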
Spark SQL Optimizations
• Spark SQL uses an underlying optimization engine (Catalyst)
• Catalyst can perform intelligent optimization since it understands the schema
• Spark SQL does not materialize all the columns (as it would with an RDD), only what's needed
http://www.slideshare.net/hortonworks/intro-to-spark-with-zeppelin
Plan Optimization & Execution
• Spark SQL uses an underlying optimization engine (Catalyst)
http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
An example query
SELECT name
FROM (
SELECT id, name
FROM People) p
WHERE p.id = 1
Logical Plan:
  Project [name]
    Filter [id = 1]
      Project [id, name]
        People
http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
Optimizing with Rules
Original plan:
  Project [name] → Filter [id = 1] → Project [id, name] → People
After filter push-down:
  Project [name] → Project [id, name] → Filter [id = 1] → People
After combining projections:
  Project [name] → Filter [id = 1] → People
Physical plan:
  IndexLookup [id = 1], return: name
http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
References
Papers
• Spark: Cluster Computing with Working Sets: http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster
Computing : http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
• Spark SQL: Relational Data Processing in Spark: https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
RDD
• Spark 의 핵심은 무엇인가? RDD! (RDD paper review): http://www.slideshare.net/yongho/rdd-paper-review
• Apache Spark RDDs: http://www.slideshare.net/deanchen11/scala-bay-spark-talk
Stanford materials
• Intro to Apache Spark: https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
• Distributed Computing with Spark: https://stanford.edu/~rezab/sparkclass/slides/reza_introtalk.pdf
References
• Apache Apache Spark Overview - MapR: http://www.slideshare.net/caroljmcdonald/apache-spark-overview-52602792
• Introduction to Apache Spark Developer Training - Cloudera: http://www.slideshare.net/cloudera/spark-devwebinarslides-final
• Apache Spark Overview: http://www.slideshare.net/VadimYBichutskiy/apache-spark-overview
• Simplifying Big Data Analytics with Apache Spark: http://www.slideshare.net/databricks/bdtc2
• Intro to Spark with Zeppelin: http://www.slideshare.net/hortonworks/intro-to-spark-with-zeppelin
• Introduction to Big Data Analytics using Apache Spark and Zeppelin: http://www.slideshare.net/alexzeltov/introduction-to-big-data-analytics-using-apache-spark-and-zeppelin-on-hdinsights-on-
azure-saas-andor-hdp-on-azurepaas
• Spark overview: http://www.slideshare.net/LisaHua/spark-overview-37479609
• Introduction to real time big data with Apache Spark: http://www.slideshare.net/tmatyashovsky/introduction-to-realtime-big-data-with-apache-spark
• Spark은 왜 이렇게 유명해 지고 있을까?: http://www.slideshare.net/KSLUG/ss-47355270
• Apache Spark Briefing: http://www.slideshare.net/ThomasWDinsmore/apache-spark-briefing-12062013
• Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106: http://www.slideshare.net/sanghoonlee982/spark-overview-20141106
• Zeppelin(Spark)으로 데이터 분석하기: http://www.slideshare.net/sangwookimme/zeppelinspark-41329473
• Lightening Fast Big Data Analytics using Apache Spark: http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark
• Apache Hive on Apache Spark: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Apache Spark: The Next Gen toolset for Big Data Processing: http://www.slideshare.net/prajods/apache-spark-the-next-gen-toolset-for-big-data-processing
• Intro to Spark and Spark SQL: http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014
• Spark SQL Deep Dive @ Melbourne Spark Meetup: http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
• A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
• Apache Spark (big Data) DataFrame - Things to know: https://www.linkedin.com/pulse/apache-spark-big-data-dataframe-things-know-abhishek-choudhary
References
• Spark SQL - Quick Guide: https://www.tutorialspoint.com/spark_sql/spark_sql_quick_guide.htm
• Big Data Processing with Apache Spark - Part 2: Spark SQL: https://www.infoq.com/articles/apache-spark-sql
• Analytics with Apache Spark Tutorial Part 2: Spark SQL: https://dzone.com/articles/analytics-with-apache-spark-tutorial-part-2-spark
• Apache Spark Resource Management and YARN App Models: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-
and-yarn-app-models/
• Why Spark Is the Next Top (Compute) Model: http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model-39976454
• Introduction to Apache Spark:http://www.slideshare.net/datamantra/introduction-to-apache-spark-45062010
• Apache Spark & Hadoop:http://www.slideshare.net/MapRTechnologies/spark-overviewjune2014
• Spark와 Hadoop, 완벽한 조합 (한국어): http://www.slideshare.net/pudidic/spark-hadoop
• Big Data visualization with Apache Spark and Zeppelin: http://www.slideshare.net/prajods/big-data-visualization-with-apache-spark-and-zeppelin
• latency: https://gist.github.com/hellerbarde/2843375
• Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San Jose 2015: http://www.slideshare.net/databricks/strata-sj-everyday-
im-shuffling-tips-for-writing-better-spark-programs
• Apache Spark Architecture: http://www.slideshare.net/AGrishchenko/apache-spark-architecture
Appendix
Latency numbers
Google Big Data Technologies
기술 연도 내용
GFS 2003 Google File System: A Distributed Storage
MapReduce 2004 Simplified Data Processing on Large Clusters
Sawzall 2005 Interpreting the Data: Parallel Analysis with Sawzall
Chubby 2006 The Chubby Lock Service for Loosely-Coupled Distributed Systems
BigTable 2006 A Distributed Storage System for Structured Data
Paxos 2007 Paxos Made Live - An Engineering Perspective
Colossus 2009 GFS II
Percolator 2010 Large-scale Incremental Processing Using Distributed Transactions and Notifications
Pregel 2010 A System for Large-Scale Graph Processing
Dremel 2010 Interactive Analysis of Web-Scale Datasets
Tenzing 2011 A SQL Implementation On The MapReduce Framework
Megastore 2011 Providing Scalable, Highly Available Storage for Interactive Services
Spanner 2012 Google's Globally-Distributed Database
F1 2012 The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business
GFS Motivation
• More than 15,000 commodity-class PC's.
• Fault-tolerance provided in software
• More cost-effective solution
• Multiple clusters distributed worldwide
• One query reads 100’s of MB of data
• One query consumes 10’s of billions of CPU cycles
• Thousands of queries served per second.
• Google stores dozens of copies of the entire Web!
• Conclusion: Need Large, distributed, highly fault-
tolerant file system
http://www.cs.brandeis.edu/~dilant/WebPage_TA160/The%20Google%20File%20System.pdf
GFS Assumptions
• High component failure rates
• Must be prepared for failure monitoring, fault tolerance, failure recovery, etc.
• A "modest" number of HUGE files
• Just a few million
• Each is 100MB or larger; multi-GB files typical
• Files are written once and mostly appended to
• Perhaps concurrently
• Large streaming reads
• High sustained throughput matters more than low latency
http://research.google.com/archive/gfs.html
https://www.cs.umd.edu/class/spring2011/cmsc818k/Lectures/gfs-hdfs.pdf
GFS Architecture
http://research.google.com/archive/gfs.html
https://www.cs.umd.edu/class/spring2011/cmsc818k/Lectures/gfs-hdfs.pdf
<GFS Architecture>
<GFS file storage structure>
SQL JOINS
http://amirulkamil.com/best-describe-join/
YARN Architecture
https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html
YARN Features
Feature Description
Multi-tenancy YARN allows multiple access engines (either open-source or proprietary) to use Hadoop as the common standard for batch, interactive and real-time engines that can simultaneously access the same data set. Multi-tenant data processing improves an enterprise's return on its Hadoop investments.
Cluster utilization YARN's dynamic allocation of cluster resources improves utilization over the more static MapReduce rules used in early versions of Hadoop.
Scalability Data center processing power continues to rapidly expand. YARN's ResourceManager focuses exclusively on scheduling and keeps pace as clusters expand to thousands of nodes managing petabytes of data.
Compatibility Existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to existing processes that already work.
Hive
Hive Overview
• Invented at Facebook. Open sourced to Apache in 2008
• A database/data warehouse on top of Hadoop
• Structured data similar to relational schema
• Tables, columns, rows and partitions
• SQL like query language (HiveQL)
• A subset of SQL with many traditional features
• It is possible to embed MR scripts in HiveQL
• Queries are compiled into MR jobs that are executed on Hadoop.
Source: http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf
Hive Motivation(Facebook)
• Problem: Data growth was exponential
• 200GB per day in March 2008
• 2+TB(compressed) raw data / day in April 2009
• 4+TB(compressed) raw data / day in Nov. 2009
• 12+TB(compressed) raw data / day today(2010)
• The Hadoop Experiment
• Much superior to availability and scalability of commercial DBs
• Efficiency not that great, but throw more hardware
• Partial Availability/resilience/scale more important than ACID
• Problem: Programmability and Metadata
• MapReduce hard to program (users know sql/bash/python)
• Need to publish data in well known schemas
• Solution: SQL + MapReduce = HIVE (2007)
Source: 1) http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation
2) http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop
Hive – Data Flow of Facebook
Source: http://borthakur.com/ftp/hadoopmicrosoft.pdf, http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf
Hive - Architecture
https://cwiki.apache.org/confluence/display/Hive/Design
Hive - Query Execution and MR Jobs
Source: YSmart (Yet Another SQL-to-MapReduce Translator), http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf
Hive - Limitations
• Performance (mainly Hive ~0.12)
• For simple queries, Hive performance is comparable with hand-coded MR jobs
• The execution time is much longer for complex queries
• An MR job is launched for every stage of the computation, so heavy IO becomes a performance bottleneck
• Each stage incurs inefficient data scans and transfers
• A weak optimizer produces inefficient execution plans
• With the Stinger Initiative, performance improved dramatically over earlier versions (Tez, ORC, optimizer improvements)
• Still usable only for DW-style workloads.
• Streaming, graph, and ML workloads are not well supported
Spark
RDD Operations in paper
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
Spark Master - YARN
• YARN allows you to dynamically share and centrally configure the same pool of cluster
resources between all frameworks that run on YARN. You can throw your entire cluster
at a MapReduce job, then use some of it on an Impala query and the rest on Spark
application, without any changes in configuration.
• You can take advantage of all the features of YARN schedulers for categorizing,
isolating, and prioritizing workloads.
• Spark standalone mode requires each application to run an executor on every node in
the cluster, whereas with YARN, you choose the number of executors to use.
• Finally, YARN is the only cluster manager for Spark that supports security. With YARN,
Spark can run against Kerberized Hadoop clusters and uses secure authentication
between its processes.
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Spark Master - YARN
yarn-cluster mode / yarn-client mode
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Spark 2.0 Datasets
Spark 2.0 DataSets
Language Main Abstraction
Scala Dataset[T] & DataFrame (alias for
Dataset[Row])
Java Dataset[T]
Python* DataFrame
R* DataFrame
Note: Since Python and R have no compile-time type-safety, we
only have untyped APIs, namely DataFrames.
Typed and Un-typed APIs
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Static-typing and runtime type-safety
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
High-level abstraction and custom view
1. Spark reads the JSON, infers the schema, and creates a collection of DataFrames
2. At this point, Spark converts your data into DataFrame = Dataset[Row], a collection of
generic Row objects, since it does not know the exact type.
3. Now, Spark converts Dataset[Row] → Dataset[DeviceIoTData], type-specific Scala
JVM objects, as dictated by the class DeviceIoTData
case class DeviceIoTData (battery_level: Long, c02_level: Long, cca2: String, cca3: String, cn: String, device_id: Long,
device_name: String, humidity: Long, ip: String, latitude: Double, lcd: String, longitude: Double, scale:String, temp: Long, timestamp: Long)
{"device_id": 198164, "device_name": "sensor-pad-198164owomcJZ", "ip": "80.55.20.25", "cca2": "PL", "cca3": "POL", "cn": "Poland",
"latitude": 53.080000, "longitude": 18.620000, "scale": "Celsius", "temp": 21, "humidity": 65, "battery_level": 8, "c02_level": 1408, "lcd":
"red", "timestamp" :1458081226051}
// read the json file and create the dataset from the
// case class DeviceIoTData
// ds is now a collection of JVM Scala objects DeviceIoTData
val ds = spark.read.json("/databricks-public-datasets/data/iot/iot_devices.json").as[DeviceIoTData]
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Ease-of-use of APIs with structure
• Although structure may limit control in what your Spark program can
do with data
• Most computations can be accomplished with Dataset’s high-level APIs.
For example, it’s much simpler to perform agg, select, sum, avg, map,
filter, or groupBy
// Use filter(), map(), groupBy() country, and compute avg()
// for temperatures and humidity. This operation results in
// another immutable Dataset. The query is simpler to read,
// and expressive
val dsAvgTmp = ds.filter(d => {d.temp > 25}).map(d => (d.temp, d.humidity, d.cca3)).groupBy($"_3").avg()
//display the resulting dataset
display(dsAvgTmp)
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Performance and Optimization
• DataFrame and Dataset APIs are built on top of the Spark
SQL engine.
• it uses Catalyst to generate an optimized logical and physical query
plan.
• When Spark uses the Dataset's Tungsten Encoders, serialization/deserialization is done through generated bytecode against a compact binary format, which gives a speed advantage.
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
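A hedged sketch of the encoder machinery behind this (the spark session and the DeviceIoTData case class are the ones from the previous slide; the sample record uses the values from the JSON above):
Scala:
import org.apache.spark.sql.{Dataset, Encoders}
import spark.implicits._                              // implicit encoders for case classes
val explicitEnc = Encoders.product[DeviceIoTData]     // what the implicit resolves to
val ds: Dataset[DeviceIoTData] = spark.createDataset(Seq(
  DeviceIoTData(8, 1408, "PL", "POL", "Poland", 198164L, "sensor-pad-198164owomcJZ",
    65, "80.55.20.25", 53.08, "red", 18.62, "Celsius", 21, 1458081226051L)
))
ds.filter(_.temp > 20).show()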
Use DataFrames or Datasets?
• If you want rich semantics, high-level abstractions, and domain specific APIs → use DataFrames or Datasets.
• If your processing demands high-level expressions, filters, maps, aggregation, averages, sums, SQL queries, columnar access and use of lambda functions on semi-structured data → DataFrames or Datasets.
• If you want a higher degree of type-safety at compile time, want typed JVM objects, take advantage of Catalyst optimization, and benefit from Tungsten's efficient code generation → Datasets.
• If you want unification and simplification of APIs across Spark libraries → DataFrames or Datasets.
• If you are an R user → DataFrames.
• If you are a Python user → DataFrames, and fall back to RDDs if you need more control.
Spark Streaming vs. Storm
Reliability Models
Core Storm / Storm Trident / Spark Streaming
At Most Once: Yes / Yes / No
At Least Once: Yes / Yes / No*
Once and Only Once (Exactly Once): No / Yes / Yes*
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
Programing Model
Core Storm / Storm Trident / Spark Streaming
Stream Primitive: Tuple / Tuple, Tuple Batch, Partition / DStream
Stream Sources: Spouts / Spouts, Trident Spouts / HDFS, Network
Computation / Transformation: Bolts / Filters, Functions, Aggregations, Joins / Transformations, Window Operations
Stateful Operation: No (roll your own) / Yes / Yes
Output / Persistence: Bolts / State, MapState / foreachRDD
2014, http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
Performance
• Storm capped at 10k msgs/sec/node?
• Spark Streaming 40x faster than Storm?
System Performance
Storm(Twitter) 10,000 records/s/node
Spark Streaming 400,000 records/s/node
Apache S4 7,000 records/s/node
Other Commercial Systems 100,000 records/s/node
2014, http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
 

Apache Spark Overview part1 (20161107)

Spark History
• 2009 – Development begins at the UC Berkeley RAD Lab (AMP Lab)
• 2010 – Open sourced
• 2010 – Paper: Spark: Cluster Computing with Working Sets
• 2012 – Paper: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
• 2013 – Becomes an Apache Incubator project
• 2014 – Becomes an Apache Top-Level Project
• 2014 – Spark sets the large-scale sorting world record (Databricks)
• 2014.05 – 1.0 release
• 2016.07 – 2.0 release

Spark – Motivation
• MapReduce made big data analysis easy
  • But it only fits acyclic, one-directional data flow models
• What MapReduce lacks
  • Iterative jobs: machine learning, graph processing
  • Interactive analytics: ad-hoc queries (Hive, Pig)
  → Data sharing is slow
• How can this be improved?
  • Fast data sharing
  • General DAGs
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
http://www.slideshare.net/yongho/rdd-paper-review

Operations in MapReduce
• Data sharing in MR is slow because of replication, serialization, and disk IO
• Most MR jobs spend roughly 90% of their time on HDFS reads and writes
• Iterative operations
• Interactive operations
https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm

Operations in Spark RDD
• RDD: data is shared in memory
• Sharing data through memory is 10–100x faster than over the network or from disk
• Iterative operations (a sketch of an iterative pass over a cached RDD follows below)
• Interactive operations
https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm
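The following is a minimal sketch (not from the slides; the file path and iteration count are assumptions) of the fast-data-sharing idea above: the input is read and parsed once, cached in executor memory, and then reused across several passes without touching HDFS again.

// read and parse once, then keep the parsed records in memory
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()

var threshold = 0.0
for (i <- 1 to 10) {
  // each pass reuses the cached partitions instead of re-reading from HDFS
  val above = points.filter(p => p.sum > threshold).count()
  println(s"iteration $i: $above records above $threshold")
  threshold += 1.0
}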
Apache Spark – Time to Sort 100TB
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east

Scala, Java, Python, R
// Scala:
val lines = sc.textFile(…)
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)

// Java:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

// Python:
lines = sc.textFile(…)
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
Spark Context
• Every Spark application needs a Spark Context
• The main entry point for the Spark API
• Represents the connection to the Spark cluster
• The Spark shell provides a preconfigured Spark Context called sc
  • Scala (spark-shell)
  • Python (pyspark)
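Outside the shell there is no preconfigured sc, so an application builds one itself. A minimal sketch (the app name, master value, and sample computation are assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MyApp")      // name shown in the cluster UI
      .setMaster("local[4]")    // see the master values on the next slide
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())   // trivial job to confirm the context works
    sc.stop()
  }
}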
Master
• The master parameter of SparkContext determines which cluster to use

master              description
local               run Spark locally with one worker thread (no parallelism)
local[K]            run Spark locally with K worker threads (ideally set to # cores)
spark://host:port   connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://host:port   connect to a Mesos cluster; PORT depends on config (5050 by default)
yarn                connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode; the cluster location is found from the HADOOP_CONF_DIR or YARN_CONF_DIR variable

Master
1. Connect to the cluster manager to request resources for the application
2. Acquire executors, which run the tasks on the cluster
3. Ship the application code to the executors
4. Send tasks to the executors and run them
http://spark.apache.org/docs/latest/cluster-overview.html

Master – YARN vs. Standalone
• Comparison by master type (YARN vs. Standalone)

                                YARN Cluster         YARN Client          Spark Standalone
Driver runs in:                 Application Master   Client               Client
Who requests resources?         Application Master   Application Master   Client
Who starts executor processes?  YARN NM              YARN NM              Spark Workers
Persistent services             YARN RM / NM         YARN RM / NM         Spark Master / Workers
Supports Spark Shell?           No                   Yes                  Yes
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Resilient Distributed Datasets (RDD)
• The primary abstraction in Spark
• An immutable collection of objects that can be operated on in parallel
• RDD
  • Resilient: if data held in memory is lost, it is rebuilt
  • Distributed: the data is held in memory across the cluster
• Main idea: Resilient Distributed Datasets
  • Immutable collections of objects, spread across the cluster
  • The user can control the collection's partitioning and persistence (memory, disk, etc.)
  • RDD creation: only storage → RDD and RDD → RDD are possible
  • Statically typed: RDD[T] has objects of type T
  • Fault-tolerant: only the lineage of the data is recorded
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
https://gist.github.com/hellerbarde/2843375
http://www.slideshare.net/yongho/rdd-paper-review

RDD – Operations
• Two types: transformations and actions
• Transformation operations
  • Create a new RDD from an existing one, e.g. rdd.map(…)
  • Lazy operations
  • Recorded in the lineage
• Action operations
  • Return or store a computed result, e.g. rdd.count()
  • Executed immediately
  • The transformations recorded in the lineage are used to compute an execution plan
  • Executed along the optimal path
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
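A minimal sketch (the file name and filter condition are assumptions) of the lazy/eager split described above: the transformations only extend the lineage, and nothing runs until the count() action.

val lines  = sc.textFile("README.md")            // transformation: lazy, nothing is read yet
val errors = lines.filter(_.contains("ERROR"))   // transformation: lazy, only added to the lineage
val numErrors = errors.count()                   // action: the file is read and the filter applied now
println(numErrors)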
RDD – Transformations
Transformation                                               Meaning
map(f: T → U)                                                RDD[T] → RDD[U]
filter(f: T → Bool)                                          RDD[T] → RDD[T]
flatMap(f: T → Seq[U])                                       RDD[T] → RDD[U]
mapPartitions(f: Iterator[T] → Iterator[U])                  RDD[T] → RDD[U], run separately on each partition block
mapPartitionsWithIndex(f: (Int, Iterator[T]) → Iterator[U])  RDD[T] → RDD[U], the integer value is the partition index
sample(withReplacement, fraction, seed)                      RDD[T] → RDD[T], samples a fraction of the data
union(otherDataset)                                          (RDD[T], RDD[T]) → RDD[T], A ∪ B
intersection(otherDataset)                                   (RDD[T], RDD[T]) → RDD[T], A ∩ B
distinct([numTasks])                                         RDD[T] → RDD[T], returns the distinct elements of the source dataset
groupByKey([numTasks])                                       RDD[(K,V)] → RDD[(K, Iterable[V])]
reduceByKey(f: (V,V) → V, [numTasks])                        RDD[(K,V)] → RDD[(K,V)], aggregates the values for each key
sortByKey([ascending], [numTasks])                           RDD[(K,V)] → RDD[(K,V)], sorted by key
http://spark.apache.org/docs/1.6.2/programming-guide.html
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

RDD – Transformations
Transformation                                   Meaning
join(otherDataset, [numTasks])                   (RDD[(K,V)], RDD[(K,W)]) → RDD[(K,(V,W))], all (v, w) pairs for each key k; also leftOuterJoin, rightOuterJoin, fullOuterJoin
cogroup(otherDataset, [numTasks])                (RDD[(K,V)], RDD[(K,W)]) → RDD[(K, (Iterable[V], Iterable[W]))], alias: groupWith
cartesian(otherDataset)                          (RDD[T], RDD[U]) → RDD[(T,U)], cartesian product of the two RDDs: every (a, b) with a in RDD[T] and b in RDD[U]
pipe(command, [envVars])                         String → RDD[String], runs a shell command and returns the result as an RDD; elements are written to the process stdin and its stdout lines become RDD[String] (http://blog.madhukaraphatak.com/pipe-in-spark/)
coalesce(numPartitions)                          RDD[T] → RDD[T], decreases the number of partitions in the RDD to the given number
repartition(numPartitions)                       RDD[T] → RDD[T], reshuffles the data in the RDD into the given number of partitions; the entire dataset is always shuffled over the network
repartitionAndSortWithinPartitions(partitioner)  RDD[(K,V)] → RDD[(K,V)], repartitions by the given partitioner and sorts within each resulting partition; more efficient than repartition followed by sorting
http://spark.apache.org/docs/1.6.2/programming-guide.html
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
RDD – Transformations
Scala:
val distFile = sc.textFile("README.md")
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()

Python:
distFile = sc.textFile("README.md")
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()

RDD – Actions
Action                                    Meaning
reduce(f: (T,T) → T)                      RDD[T] → T, aggregates all elements of the dataset using f
collect()                                 RDD[T] → Array[T], returns all elements of the dataset as an array
count()                                   RDD[T] → Long, returns the number of elements in the dataset
first()                                   RDD[T] → T, returns the first element
take(n)                                   RDD[T] → Array[T], returns the first n elements
takeSample(withReplacement, num, [seed])  RDD[T] → Array[T], returns a random sample of num elements
takeOrdered(n, [ordering])                RDD[T] → Array[T], returns the first n elements in sorted order
saveAsTextFile(path)                      RDD[T] → Unit, saves all elements as a text file on the local filesystem, HDFS, etc.
saveAsSequenceFile(path)                  RDD[T] → Unit, saves all elements as a Hadoop SequenceFile
saveAsObjectFile(path)                    RDD[T] → Unit, saves all elements in a simple format using Java serialization
countByKey()                              RDD[(K,V)] → Map[K, Long], returns the count of elements for each key
foreach(f: T → Unit)                      RDD[T] → Unit, runs the function f on each element
saveAsNewAPIHadoopDataset                 RDD[T] → Unit, writes to HDFS (as an MR job) through the Hadoop mapreduce.OutputFormat API; used for HBase bulk load
http://spark.apache.org/docs/1.6.2/programming-guide.html
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

RDD – Actions
Scala:
val f = sc.textFile("README.md")
val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))
words.reduceByKey(_ + _).collect()

Python:
from operator import add
f = sc.textFile("README.md")
words = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
words.reduceByKey(add).collect()
RDD – Persistence
• Unlike MapReduce, Spark can persist (or cache) a dataset
• To reuse it in later RDD operations (transformations/actions), each node keeps its partitions in memory or on storage
• Roughly a 10x speed-up
• One of the most important Spark features

val wordCounts = rdd.flatMap(x => x.split(" "))
  .map(s => (s, 1))
  .reduceByKey((a, b) => a + b)
  .cache()

RDD – Persistence
Storage Level        Meaning
MEMORY_ONLY          Store the RDD as deserialized Java objects on the JVM heap. The default level. If the RDD does not fit in memory, some partitions are not stored and are recomputed when needed.
MEMORY_AND_DISK      Store the RDD as deserialized Java objects on the JVM heap. Partitions that do not fit in memory are stored on disk and read from there when needed.
MEMORY_ONLY_SER      Store the RDD as serialized Java objects (one byte array per partition) on the JVM heap. Space-efficient, but CPU-intensive to read.
MEMORY_AND_DISK_SER  Store the RDD as serialized Java objects (one byte array per partition) on the JVM heap. Partitions that do not fit in memory are stored on disk and read from there when needed.
DISK_ONLY            Store the partitions only on disk.
MEMORY_ONLY_2        Same as MEMORY_ONLY, but each partition is replicated on two nodes.
MEMORY_AND_DISK_2    Same as MEMORY_AND_DISK, but each partition is replicated on two nodes.
OFF_HEAP             Store the RDD in a serialized format suited to Tachyon. Compared with MEMORY_ONLY_SER, GC overhead is reduced; useful with large heaps and many concurrent applications. Cached data is not lost even if an executor crashes.
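A minimal sketch (the input path and filter are assumptions) of choosing an explicit storage level from the table above instead of the MEMORY_ONLY default that cache() uses:

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs:///logs/access.log")
  .filter(_.contains("ERROR"))
  .persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized in memory, spilled to disk if needed

println(errors.count())   // the first action materializes and caches the partitions
println(errors.count())   // later actions reuse the persisted partitions
errors.unpersist()        // release the cached partitions when no longer needed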
RDD – Persistence
Scala:
val f = sc.textFile("README.md")
val w = f.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()
w.reduceByKey(_ + _).collect.foreach(println)

Python:
from operator import add
f = sc.textFile("README.md")
w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).cache()
w.reduceByKey(add).collect()

RDD – Fault Tolerance
• An RDD records the lineage of every transformation, so lost data can be recomputed
The State of Spark, and Where We're Going Next – Matei Zaharia, Spark Summit (2013), youtu.be/nU6vO2EJAb4
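A minimal sketch (file name assumed): toDebugString prints the lineage that Spark recorded for an RDD, which is exactly what it replays to rebuild lost partitions. The printed plan shown in the comments is only indicative.

val counts = sc.textFile("README.md")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

println(counts.toDebugString)
// prints something like:
// (2) ShuffledRDD[4] at reduceByKey ...
//  +-(2) MapPartitionsRDD[3] at map ...
//     |  MapPartitionsRDD[2] at flatMap ...
//     |  README.md MapPartitionsRDD[1] at textFile ...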
RDD – Narrow vs. Wide Dependencies
• Narrow dependencies
  • Stay on a single node
  • Use only memory
  • Fast
  • Recovering a few partitions is also fast
• Wide dependencies
  • Span multiple nodes
  • Cause a shuffle
  • Use the network
  • Recovery takes much longer
  • Checkpointing is recommended

RDD – Job Scheduling
• Computation follows the DAG
• Stages are built so that, where possible, work runs locally (with narrow dependencies)
• A new stage is cut wherever a shuffle is required
• The node a partition runs on is chosen with data locality in mind (HDFS)
Examples – Word Count
Input Data:
the cat sat on the mat
the aardvark sat on the sofa
Result:
aardvark 1
cat 1
mat 1
on 2
sat 2
sofa 1
the 4
http://www.slideshare.net/cloudera/spark-devwebinarslides-final

Examples – Word Count
val f = sc.textFile(file)                  // HadoopRDD
val w = f.flatMap(l => l.split(" "))       // MapPartitionsRDD
         .map(word => (word, 1))           // MapPartitionsRDD
val counts = w.reduceByKey(_ + _)          // ShuffledRDD
counts.saveAsTextFile(output)

the cat sat on the mat / the aardvark sat on the sofa …
→ (the, 1) (cat, 1) (sat, 1) (on, 1) (the, 1) (mat, 1) (the, 1) (aardvark, 1) (sat, 1) …
→ (aardvark, 1) (cat, 1) (mat, 1) (on, 2) (sat, 2) (sofa, 1) (the, 4)

Examples – Estimate Pi
• Estimating Pi with the Monte Carlo method
• ./bin/run-example SparkPi 2 local
• Algorithm
  1. Draw a square, then inscribe a circle within it.
  2. Uniformly scatter objects of uniform size over the square.
  3. Count the number of objects inside the circle and the total number of objects.
  4. The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiply the result by 4 to estimate π.
https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
http://demonstrations.wolfram.com/MonteCarloEstimateForPi/
https://en.wikipedia.org/wiki/Monte_Carlo_method

Examples – Estimate Pi
• Base RDD → transformed RDD → action (a full sketch of the computation follows below)
https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
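The original slide shows the code only as an image; the following is a minimal sketch of the same Monte Carlo estimate (the sample count is an arbitrary choice), in the spirit of the bundled SparkPi example:

val numSamples = 1000000

val inside = sc.parallelize(1 to numSamples)   // base RDD
  .map { _ =>                                  // transformed RDD
    val x = math.random * 2 - 1                // random point in the square [-1, 1] x [-1, 1]
    val y = math.random * 2 - 1
    if (x * x + y * y <= 1) 1 else 0           // 1 if the point falls inside the unit circle
  }
  .reduce(_ + _)                               // action: count the hits

println(s"Pi is roughly ${4.0 * inside / numSamples}")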
Spark SQL
• Spark module for structured data processing (e.g. DB tables, JSON files)
• Adds a schema to RDDs
• Three ways to manipulate data:
  • SQL (2014.05, Spark 1.0)
  • DataFrame (2015.03, Spark 1.3)
  • Datasets (2016.01, Spark 1.6)
• Same execution engine for all three
• The Spark SQL interfaces provide more information about both the structure of the data and the computation being performed than the basic Spark RDD API

Spark SQL Motivation
• Create and run Spark programs faster
  • Write less code
  • Read less data
  • Let the optimizer do the hard work
• Limitations of Shark
  • Limited integration with Spark programs
  • Hive optimizer not designed for Spark
→ Spark SQL reuses the best parts of Shark
http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014

SQL
• Execute SQL queries written in either basic SQL syntax or HiveQL
• When running SQL from another programming language, the results are returned as a DataFrame
• Interact with the SQL interface using the CLI or JDBC/ODBC

DataFrames
• A distributed collection of data organized into named columns
• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
• API available in Scala, Java, Python, and R
• Richer optimizations (significantly faster than RDDs)
• Can be constructed from a wide array of sources (see the sketch below)
  • data files, tables in Hive, external databases, existing RDDs
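A minimal spark-shell sketch (Spark 1.6-style API; the file path and sample records are assumptions) of building DataFrames from two of the sources listed above, a JSON file and an existing RDD:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// from a data file
val flights = sqlContext.read.json("/data/flights.json")
flights.printSchema()

// from an existing RDD of case classes
case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("A Turing", 54), Person("H Smith", 48))).toDF()
people.show()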
Datasets
• New experimental interface added in Spark 1.6
• Tries to combine the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine
• The unified Dataset API can be used in both Scala and Java

SQL Context and Hive Context
• SQLContext
  • Entry point into all functionality in Spark SQL
  • Wraps / extends the existing Spark context
• HiveContext
  • Superset of the functionality provided by the basic SQLContext
  • Read data from Hive tables
  • Access to Hive functions → UDFs

val sqlContext = new SQLContext(sc)
val hc = new HiveContext(sc)
DataFrame Example
• Reading data from a table

val df = sqlContext.table("flightsTbl")
df.select("Origin", "Dest", "DepDelay").show(5)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|       8|
|   IAD| TPA|      19|
|   IND| BWI|       8|
|   IND| BWI|      -4|
|   IND| BWI|      34|
+------+----+--------+

DataFrame Example
• Using the DataFrame API to filter data (show delays of more than 15 minutes)

df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

SQL Example
• Using SQL to query and filter data (again, show delays of more than 15 minutes)

// Register Temporary Table
df.registerTempTable("flights")

// Use SQL to Query Dataset
sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
RDD vs. DataFrame
• RDD
  • Lower-level API (more control)
  • Lots of existing code & users
  • Compile-time type safety
• DataFrame
  • Higher-level API (faster development)
  • Faster sorting, hashing, and serialization
  • More opportunities for automatic optimization
  • Lower memory pressure

DataFrames Are Intuitive
dept  name      age
Bio   H Smith   48
CS    A Turing  54
Bio   B Jones   43
Phys  E Witten  61

Find the average age by department? (RDD and DataFrame versions are sketched below)
SQL Example:
sqlContext.sql("SELECT avg(age) FROM data GROUP BY dept")
http://www.slideshare.net/hortonworks/intro-to-spark-with-zeppelin
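The RDD and DataFrame versions appear only as images on the original slide; the following is a minimal spark-shell sketch (the case class and sample rows mirror the table above, and the sqlContext from the earlier slide is assumed) of the same average-age-by-department query in both styles:

case class Employee(dept: String, name: String, age: Int)
val data = Seq(
  Employee("Bio", "H Smith", 48), Employee("CS", "A Turing", 54),
  Employee("Bio", "B Jones", 43), Employee("Phys", "E Witten", 61))

// RDD version: build (sum, count) per department, then divide
val avgByDeptRdd = sc.parallelize(data)
  .map(e => (e.dept, (e.age.toLong, 1L)))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum.toDouble / count }
avgByDeptRdd.collect().foreach(println)

// DataFrame version: the grouping and averaging are a single expression
import sqlContext.implicits._
val df = sc.parallelize(data).toDF()
df.groupBy("dept").avg("age").show()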
Spark SQL Optimizations
• Spark SQL uses an underlying optimization engine (Catalyst)
• Catalyst can perform intelligent optimization because it understands the schema
• Spark SQL does not materialize all the columns (as with RDDs), only the ones that are needed
http://www.slideshare.net/hortonworks/intro-to-spark-with-zeppelin

Plan Optimization & Execution
• Spark SQL uses an underlying optimization engine (Catalyst)
http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune

An Example Query
SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

Logical plan: Project name ← Filter id = 1 ← Project id, name ← People
http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune

Optimizing with Rules
• Original plan: Project name ← Filter id = 1 ← Project id, name ← People
• Filter push-down: Project name ← Project id, name ← Filter id = 1 ← People
• Combine projection: Project name ← Filter id = 1 ← People
• Physical plan: IndexLookup id = 1, return: name
http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
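As a minimal sketch of how to see these rules at work (it assumes a registered table named People and the sqlContext from the earlier slide), explain(true) prints the parsed, analyzed, and optimized logical plans along with the physical plan that Catalyst produced:

import sqlContext.implicits._

val people = sqlContext.table("People")
val query  = people.select("id", "name").filter($"id" === 1).select("name")
query.explain(true)   // shows the plan before and after rules such as filter push-down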
References
Papers
• Spark: Cluster Computing with Working Sets: http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing: http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
• Spark SQL: Relational Data Processing in Spark: https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
RDD
• Spark 의 핵심은 무엇인가? RDD! (RDD paper review): http://www.slideshare.net/yongho/rdd-paper-review
• Apache Spark RDDs: http://www.slideshare.net/deanchen11/scala-bay-spark-talk
Stanford materials
• Intro to Apache Spark: https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
• Distributed Computing with Spark: https://stanford.edu/~rezab/sparkclass/slides/reza_introtalk.pdf

References
• Apache Spark Overview - MapR: http://www.slideshare.net/caroljmcdonald/apache-spark-overview-52602792
• Introduction to Apache Spark Developer Training - Cloudera: http://www.slideshare.net/cloudera/spark-devwebinarslides-final
• Apache Spark Overview: http://www.slideshare.net/VadimYBichutskiy/apache-spark-overview
• Simplifying Big Data Analytics with Apache Spark: http://www.slideshare.net/databricks/bdtc2
• Intro to Spark with Zeppelin: http://www.slideshare.net/hortonworks/intro-to-spark-with-zeppelin
• Introduction to Big Data Analytics using Apache Spark and Zeppelin: http://www.slideshare.net/alexzeltov/introduction-to-big-data-analytics-using-apache-spark-and-zeppelin-on-hdinsights-on-azure-saas-andor-hdp-on-azurepaas
• Spark overview: http://www.slideshare.net/LisaHua/spark-overview-37479609
• Introduction to real time big data with Apache Spark: http://www.slideshare.net/tmatyashovsky/introduction-to-realtime-big-data-with-apache-spark
• Spark은 왜 이렇게 유명해 지고 있을까?: http://www.slideshare.net/KSLUG/ss-47355270
• Apache Spark Briefing: http://www.slideshare.net/ThomasWDinsmore/apache-spark-briefing-12062013
• Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106: http://www.slideshare.net/sanghoonlee982/spark-overview-20141106
• Zeppelin(Spark)으로 데이터 분석하기: http://www.slideshare.net/sangwookimme/zeppelinspark-41329473
• Lightening Fast Big Data Analytics using Apache Spark: http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark
• Apache Hive on Apache Spark: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Apache Spark: The Next Gen toolset for Big Data Processing: http://www.slideshare.net/prajods/apache-spark-the-next-gen-toolset-for-big-data-processing
• Intro to Spark and Spark SQL: http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014
• Spark SQL Deep Dive @ Melbourne Spark Meetup: http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
• A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
• Apache Spark (big Data) DataFrame - Things to know: https://www.linkedin.com/pulse/apache-spark-big-data-dataframe-things-know-abhishek-choudhary

References
• Spark SQL - Quick Guide: https://www.tutorialspoint.com/spark_sql/spark_sql_quick_guide.htm
• Big Data Processing with Apache Spark - Part 2: Spark SQL: https://www.infoq.com/articles/apache-spark-sql
• Analytics with Apache Spark Tutorial Part 2: Spark SQL: https://dzone.com/articles/analytics-with-apache-spark-tutorial-part-2-spark
• Apache Spark Resource Management and YARN App Models: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
• Why Spark Is the Next Top (Compute) Model: http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model-39976454
• Introduction to Apache Spark: http://www.slideshare.net/datamantra/introduction-to-apache-spark-45062010
• Apache Spark & Hadoop: http://www.slideshare.net/MapRTechnologies/spark-overviewjune2014
• Spark와 Hadoop, 완벽한 조합 (한국어): http://www.slideshare.net/pudidic/spark-hadoop
• Big Data visualization with Apache Spark and Zeppelin: http://www.slideshare.net/prajods/big-data-visualization-with-apache-spark-and-zeppelin
• latency: https://gist.github.com/hellerbarde/2843375
• Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San Jose 2015: http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
• Apache Spark Architecture: http://www.slideshare.net/AGrishchenko/apache-spark-architecture
Google Big Data Technologies
Technology  Year  Description
GFS         2003  Google File System: A Distributed Storage
MapReduce   2004  Simplified Data Processing on Large Clusters
Sawzall     2005  Interpreting the Data: Parallel Analysis with Sawzall
Chubby      2006  The Chubby Lock Service for Loosely-Coupled Distributed Systems
BigTable    2006  A Distributed Storage System for Structured Data
Paxos       2007  Paxos Made Live - An Engineering Perspective
Colossus    2009  GFS II
Percolator  2010  Large-scale Incremental Processing Using Distributed Transactions and Notifications
Pregel      2010  A System for Large-Scale Graph Processing
Dremel      2010  Interactive Analysis of Web-Scale Datasets
Tenzing     2011  A SQL Implementation On The MapReduce Framework
Megastore   2011  Providing Scalable, Highly Available Storage for Interactive Services
Spanner     2012  Google's Globally-Distributed Database
F1          2012  The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business

GFS Motivation
• More than 15,000 commodity-class PCs
• Fault tolerance provided in software
• More cost-effective solution
• Multiple clusters distributed worldwide
• One query reads 100's of MB of data
• One query consumes 10's of billions of CPU cycles
• Thousands of queries served per second
• Google stores dozens of copies of the entire Web!
• Conclusion: need a large, distributed, highly fault-tolerant file system
http://www.cs.brandeis.edu/~dilant/WebPage_TA160/The%20Google%20File%20System.pdf

GFS Assumptions
• High component failure rates
  • Monitoring, fault tolerance, and recovery must be designed in
• A "modest" number of HUGE files
  • Just a few million
  • Each is 100MB or larger; multi-GB files typical
• Files are written once and mostly appended to afterwards
  • Perhaps concurrently
• Large streaming reads
• High sustained throughput is more important than low latency
http://research.google.com/archive/gfs.html
https://www.cs.umd.edu/class/spring2011/cmsc818k/Lectures/gfs-hdfs.pdf
YARN Features
Feature              Description
Multi-tenancy        YARN allows multiple access engines (either open-source or proprietary) to use Hadoop as the common standard for batch, interactive and real-time engines that can simultaneously access the same data set. Multi-tenant data processing improves an enterprise's return on its Hadoop investments.
Cluster utilization  YARN's dynamic allocation of cluster resources improves utilization over the more static MapReduce rules used in early versions of Hadoop.
Scalability          Data center processing power continues to rapidly expand. YARN's ResourceManager focuses exclusively on scheduling and keeps pace as clusters expand to thousands of nodes managing petabytes of data.
Compatibility        Existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to existing processes that already work.
Hive

Hive Overview
• Invented at Facebook; open sourced to Apache in 2008
• A database/data warehouse on top of Hadoop
• Structured data similar to a relational schema
  • Tables, columns, rows and partitions
• SQL-like query language (HiveQL)
  • A subset of SQL with many traditional features
  • It is possible to embed MR scripts in HiveQL
• Queries are compiled into MR jobs that are executed on Hadoop
Source: http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf

Hive Motivation (Facebook)
• Problem: data growth was exponential
  • 200GB per day in March 2008
  • 2+TB (compressed) raw data / day in April 2009
  • 4+TB (compressed) raw data / day in Nov. 2009
  • 12+TB (compressed) raw data / day today (2010)
• The Hadoop experiment
  • Much superior to the availability and scalability of commercial DBs
  • Efficiency not that great, but throw more hardware at it
  • Partial availability/resilience/scale more important than ACID
• Problem: programmability and metadata
  • MapReduce hard to program (users know sql/bash/python)
  • Need to publish data in well-known schemas
• Solution: SQL + MapReduce = HIVE (2007)
Sources: 1) http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation 2) http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop

Hive – Data Flow at Facebook
Sources: http://borthakur.com/ftp/hadoopmicrosoft.pdf, http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf

Hive – Query Execution and MR Jobs
Source: YSmart (Yet Another SQL-to-MapReduce Translator), http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf

Hive – Limitations
• Performance (mainly up to ~0.12)
  • For simple queries, Hive performance is comparable with hand-coded MR jobs
  • The execution time is much longer for complex queries
  • Every computation stage runs as an MR job, so heavy IO becomes a performance bottleneck
  • Each stage performs inefficient data scans and transfers
  • A weak optimizer produces inefficient execution plans
  • The Stinger Initiative improved performance dramatically over earlier releases (Tez, ORC, optimizer improvements)
• Usable only for DW-style workloads
  • Streaming, graph, and ML workloads are limited
Spark

RDD Operations in the Paper
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

Spark Master – YARN
• YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. You can throw your entire cluster at a MapReduce job, then use some of it on an Impala query and the rest on a Spark application, without any changes in configuration.
• You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
• Spark standalone mode requires each application to run an executor on every node in the cluster, whereas with YARN, you choose the number of executors to use.
• Finally, YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against Kerberized Hadoop clusters and uses secure authentication between its processes.
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Spark Master – YARN
• yarn-cluster mode vs. yarn-client mode (how each is launched is sketched below)
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
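As an illustration only (the example class is the bundled SparkPi, and the jar path is an assumption that varies by distribution), the same application can be submitted in either mode with standard spark-submit options:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster lib/spark-examples.jar 10
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client lib/spark-examples.jar 10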
Spark 2.0 Datasets – Typed and Untyped APIs
Language  Main Abstraction
Scala     Dataset[T] & DataFrame (alias for Dataset[Row])
Java      Dataset[T]
Python*   DataFrame
R*        DataFrame
Note: since Python and R have no compile-time type safety, only the untyped API (DataFrames) is available there.
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

Static Typing and Runtime Type Safety
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

High-Level Abstraction and Custom View
1. Spark reads the JSON, infers the schema, and creates a collection of DataFrames.
2. At this point, Spark converts the data into DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.
3. Now Spark converts Dataset[Row] → Dataset[DeviceIoTData], type-specific Scala JVM objects, as dictated by the case class DeviceIoTData.

case class DeviceIoTData (battery_level: Long, c02_level: Long, cca2: String, cca3: String, cn: String, device_id: Long, device_name: String, humidity: Long, ip: String, latitude: Double, lcd: String, longitude: Double, scale: String, temp: Long, timestamp: Long)

{"device_id": 198164, "device_name": "sensor-pad-198164owomcJZ", "ip": "80.55.20.25", "cca2": "PL", "cca3": "POL", "cn": "Poland", "latitude": 53.080000, "longitude": 18.620000, "scale": "Celsius", "temp": 21, "humidity": 65, "battery_level": 8, "c02_level": 1408, "lcd": "red", "timestamp": 1458081226051}

// read the json file and create the dataset from the
// case class DeviceIoTData
// ds is now a collection of JVM Scala objects DeviceIoTData
val ds = spark.read.json("/databricks-public-datasets/data/iot/iot_devices.json").as[DeviceIoTData]

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Ease of Use of APIs with Structure
• Structure may limit what your Spark program can do with the data
• But most computations can be accomplished with the Dataset's high-level APIs; for example, it is much simpler to perform agg, select, sum, avg, map, filter, or groupBy

// Use filter(), map(), groupBy() country, and compute avg()
// for temperatures and humidity. This operation results in
// another immutable Dataset. The query is simpler to read,
// and expressive
val dsAvgTmp = ds.filter(d => d.temp > 25)
  .map(d => (d.temp, d.humidity, d.cca3))
  .groupBy($"_3")
  .avg()

// display the resulting dataset
display(dsAvgTmp)
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

Performance and Optimization
• The DataFrame and Dataset APIs are built on top of the Spark SQL engine
• Catalyst is used to generate an optimized logical and physical query plan
• With the Dataset's Tungsten encoders, serialization and deserialization use compact bytecode, which gives a speed advantage
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

Use DataFrames or Datasets?
• If you want rich semantics, high-level abstractions, and domain-specific APIs → use DataFrames or Datasets
• If your processing demands high-level expressions, filters, maps, aggregation, averages, sum, SQL queries, columnar access and use of lambda functions on semi-structured data → DataFrames or Datasets
• If you want a higher degree of type safety at compile time, typed JVM objects, the benefit of Catalyst optimization, and Tungsten's efficient code generation → Datasets
• If you want unification and simplification of APIs across Spark libraries → DataFrames or Datasets
• If you are an R user → DataFrames
• If you are a Python user → DataFrames, falling back to RDDs when you need more control
Reliability Models
                                   Core Storm  Storm Trident  Spark Streaming
At Most Once                       Yes         Yes            No
At Least Once                      Yes         Yes            No*
Once and Only Once (Exactly Once)  No          Yes            Yes*
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming

Programming Model
                             Core Storm          Storm Trident                            Spark Streaming
Stream Primitive             Tuple               Tuple, Tuple Batch, Partition            DStream
Stream Sources               Spouts              Spouts, Trident Spouts                   HDFS, Network
Computation/Transformation   Bolts               Filters, Functions, Aggregations, Joins  Transformations, Window Operations
Stateful Operation           No (roll your own)  Yes                                      Yes
Output/Persistence           Bolts               State, MapState                          foreachRDD
2014, http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming

Performance
• Storm capped at 10k msgs/sec/node?
• Spark Streaming 40x faster than Storm?
System                    Performance
Storm (Twitter)           10,000 records/s/node
Spark Streaming           400,000 records/s/node
Apache S4                 7,000 records/s/node
Other Commercial Systems  100,000 records/s/node
2014, http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming