Apache Spark Overview part1 (20161107)

Apache Spark Overview - part1
- spark core (rdd)
- spark sql (dataframe)
  1. 1. Introduction to Apache Spark (아파치 스파크 소개) Part1 2016.11.07 민형기
  2. 2. Contents • MapReduce • Apache Spark • Spark SQL
  3. 3. Brief History
  4. 4. MapReduce
  5. 5. MapReduce History • 1979 – Stanford, MIT, CMU, etc • set/list operations in LISP, Prolog, etc. for parallel processing • 2004 – Google • MapReduce(2004): Simplified Data Processing on Large Clusters • http://research.google.com/archive/mapreduce.html • 2006 – Apache Hadoop: http://hadoop.apache.org/ • Hadoop, originating from the Nutch Project, Doug Cutting • 2008 – Yahoo • Web scale search indexing • Hadoop Summit, HUG, etc • 2009 – Amazon AWS • Elastic MapReduce • Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc • 2012.01 – Apache Hadoop 1.0 • MapReduce 1.0: cluster resource management & data processing • 2013.10 – Apache Hadoop 2.2 • MapReduce 2.0: data processing • YARN: cluster resource management Jeff Dean Doug Cutting 제프 딘의 29가지 진실: http://ppss.kr/archives/16672
  6. 6. MapReduce Motivation • 구글에서 사용중인 데이터를 가공하기 위해서는 많은 머신이 필요함. • 특히, 입력 데이터가 크고, 적절한 시간 내에 완료되려면 컴퓨테이션이 많은 장비에 분산되어 야 한다. • 웹 페이지의 인덱스를 생성하는 과정에서 방대한 양의 웹 페이지를 처리해야 할 때도 분산처 리가 필요함. • 데이터 가공의 종류는 지속적으로 증가함 • 검색 색인(역 인덱스) 계산, 웹 문서의 그래프 구조의 다양한 표현, Host별로 크롤된 페이지의 수의 Summary, 해당 일자의 가장 많이 요청된 쿼리 셋 등 • 대부분은 개념적으로 어렵지 않으나, 분산처리 고려(작업 병렬화, 데이터 분산, 실패 처리 등) 로 인하여 코드가 복잡해 짐 • 분산 데이터 처리 Framework 필요 http://research.google.com/archive/mapreduce.html
  7. 7. MapReduce Programming Model • Map and Reduce are terms borrowed from functional languages such as Lisp • Map: apply a function to a collection of data to produce a new collection • Reduce: apply a function to a collection of data to fold it into a single result • Map: <key, value> → <key', value'>* • Reduce: <key', value'*> → value''*
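A minimal sketch of the functional map/reduce idea the slide refers to, using a plain Scala collection (not Hadoop MapReduce; the input strings are made up for illustration):

  // Word count in the <key, value> style above, on an in-memory collection.
  val docs = List("the cat sat", "the dog sat")
  val mapped  = docs.flatMap(_.split(" ")).map(w => (w, 1))                              // Map: emit <word, 1>
  val reduced = mapped.groupBy(_._1).map { case (w, ones) => (w, ones.map(_._2).sum) }   // Reduce: sum per key
  // reduced == Map(the -> 2, cat -> 1, sat -> 2, dog -> 1)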
  8. 8. MapReduce Process
  9. 9. MapReduce Design Pattern • Basic MapReduce Patterns • Counting, Summing • Collating • Filtering(“Grepping”), Parsing, and Validation • Distributed Task Execution • Sorting • Not-So-Basic MapReduce Patterns • Iterative Message Passing(Graph Processing) • Distinct Values(Unique Items Counting) • Cross-Correlation • Relational MapReduce Patterns • Selection • Projection • Union • Intersection • Difference • GroupBy and Aggregation • Joining • Machine Learning and Math MapReduce Algorithms https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  10. 10. MapReduce Limitations • Programming directly against MR is hard. • MR is difficult, takes a lot of development effort, and makes performance hard to guarantee. • Performance varies with the skill of the developer • Productivity is much lower than with existing SQL implementations • MapReduce performs well for one-pass computations, but is inefficient for multi-pass algorithms. • Optimized for disk IO / makes poor use of memory • Iterative algorithms keep hitting disk IO, so they are inefficient → MR is not optimized for the many different kinds of computation → specialized systems are needed https://stanford.edu/~rezab/sparkclass/slides/reza_introtalk.pdf
  11. 11. MapReduce Limitations MapReduce Storm Giraph Drill Tez Impala … Specialized systems (iterative, interactive and streaming apps) General batch processing Tajo Druid Presto
  12. 12. Apache Spark
  13. 13. What is Apache Spark? • Fast and general engine for large-scale data processing. • Characteristics • Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. • Ease of use: easy to write in Java, Scala, Python, R • Generality: Batch, Streaming, iterative, interactive • Runs Everywhere: Hadoop, Mesos, Standalone • '09 UC Berkeley AMPLab, open sourced in '10
  14. 14. Spark Stack(Unified Platform) Spark Core / RDD Spark Streaming (Streaming) GraphX (graph) Spark SQL MLlib (Machine Learning) Standalone YARN Mesos Scala Java Python R
  15. 15. Separate engines vs. a unified engine: Benefit for Users. The same engine can perform data extraction, model training, and interactive queries. (Diagram: with separate engines each step is a DFS read, parse/train/query, DFS write cycle, while Spark reads from HDFS once and runs parse, train, and query interactively.) https://spark-summit.org/2013/zaharia-the-state-of-spark-and-where-were-going/
  16. 16. 스파크 히스토리 • 2009년 – UC Berkeley RAD Lab(AMP Lab)에서 개발시작 • 2010년 – Open Source화 • 2010년 - Spark: Cluster Computing with Working Sets • 2012년 – Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing • 2013년 – 아파치 프로젝트로 전환 • 2014년 – 아파치 최상위 프로젝트(Top-Level Project) • 2014년 – 스파크로 Large scale Sorting 세계기록(Databricks) • 2014년 5월 – 1.0 release • 2016년 7월 – 2.0 release
  17. 17. Spark - Motivation • MapReduce made big-data analysis easy • But it only fits acyclic, directed data-flow models • What MapReduce lacks • Iterative jobs: machine learning, graph processing • Interactive analytics: ad-hoc queries (Hive, Pig) → Data Sharing is Slow • How can we do better? • Fast data sharing • General DAGs https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf http://www.slideshare.net/yongho/rdd-paper-review
  18. 18. Operations in MapReduce • In MR, data sharing is slow because of replication, serialization, and disk IO • Most MR jobs spend about 90% of their time on HDFS reads and writes • Iterative Operations • Interactive Operations https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm
  19. 19. Operations in Spark RDD • Iterative Operations • Interactive Operations https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm • RDD: data sharing happens in memory • Sharing data through memory is 10~100x faster than over the network or from disk
  20. 20. Apache Spark – Time to Sort 100TB http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
  21. 21. Scala, Java, Python, R
  // Scala:
  val lines = sc.textFile(…)
  val pairs = lines.map( s => (s, 1) )
  val counts = pairs.reduceByKey( (a,b) => a + b)
  // Java:
  JavaRDD<String> lines = sc.textFile("data.txt");
  JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
  JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
  // Python:
  lines = sc.textFile(…)
  pairs = lines.map(lambda s: (s, 1))
  counts = pairs.reduceByKey(lambda a, b: a+b)
  22. 22. Spark Context • Every Spark application needs a Spark Context • Main entry point for the Spark API • Represents the connection to a Spark cluster • The Spark shell provides a preconfigured Spark Context named sc • Scala (spark-shell): • Python (pyspark):
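A minimal Scala sketch of creating a SparkContext in a standalone application (the app name and master value are illustrative assumptions; spark-shell and pyspark already provide sc):

  import org.apache.spark.{SparkConf, SparkContext}

  // Configure and create the context; setMaster picks the cluster (see the next slide).
  val conf = new SparkConf().setAppName("MyApp").setMaster("local[2]")
  val sc = new SparkContext(conf)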
  23. 23. Master • The master parameter of SparkContext decides which cluster to use
  • local: run Spark locally with one worker thread (no parallelism)
  • local[K]: run Spark locally with K worker threads (ideally set to # cores)
  • spark://host:port: connect to a Spark standalone cluster; PORT depends on config (7077 by default)
  • mesos://host:port: connect to a Mesos cluster; PORT depends on config (5050 by default)
  • yarn: connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode; the cluster location is found from the HADOOP_CONF_DIR or YARN_CONF_DIR variable
  24. 24. Master http://spark.apache.org/docs/latest/cluster-overview.html 1. Connect to the Cluster Manager to allocate resources for the application 2. Acquire executors on the cluster that will run its tasks 3. Ship the application code to the executors 4. Send tasks to the executors and run them
  25. 25. Master – YARN vs. Standalone • Comparison by master type (YARN vs. Standalone)
  • Driver runs in: YARN Cluster = Application Master, YARN Client = Client, Spark Standalone = Client
  • Who requests resources?: YARN Cluster = Application Master, YARN Client = Application Master, Spark Standalone = Client
  • Who starts executor processes?: YARN Cluster = YARN NM, YARN Client = YARN NM, Spark Standalone = Spark Workers
  • Persistent services: YARN Cluster = YARN RM / NM, YARN Client = YARN RM / NM, Spark Standalone = Spark Master / Workers
  • Supports Spark Shell?: YARN Cluster = No, YARN Client = Yes, Spark Standalone = Yes
  http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
  26. 26. Resilient Distributed Datasets(RDD) • Primary abstraction in Spark • An immutable collection of objects that can be operated on in parallel • RDD • Resilient: even if data held in memory is lost, it can be rebuilt • Distributed: the memory is spread across the cluster • Main idea: Resilient Distributed Datasets • Immutable collections of objects, spread across the cluster • Users can control the partitioning and persistence (memory, disk, etc.) of a collection • RDD creation: only storage → RDD and RDD → RDD are possible • Statically typed: RDD[T] has objects of type T • Fault-tolerant: only the lineage of the data is recorded https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf https://gist.github.com/hellerbarde/2843375 http://www.slideshare.net/yongho/rdd-paper-review
  27. 27. RDD - Operations • Two types: transformations and actions • Transformation operation • Produces a new RDD from an existing one, e.g. rdd.map(…) • Lazy operation • Recorded in the lineage • Action operation • Returns or stores a computed result, e.g. rdd.count() • Runs immediately • Uses the information in the lineage (the transformation operations) to compute an Execution Plan • Executed along the optimal path http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
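A minimal sketch of the lazy-transformation / eager-action distinction (the file path is an illustrative assumption):

  val lines = sc.textFile("README.md")            // transformation: nothing executes yet
  val errors = lines.filter(_.contains("ERROR"))  // transformation: only recorded in the lineage
  val n = errors.count()                          // action: triggers the actual job on the cluster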
  28. 28. RDD – Transformations
  • map(f: T → U): RDD[T] → RDD[U]
  • filter(f: T → Bool): RDD[T] → RDD[T]
  • flatMap(f: T → Seq[U]): RDD[T] → RDD[U]
  • mapPartitions(f: Iterator[T] → Iterator[U]): RDD[T] → RDD[U], runs separately on each partition block
  • mapPartitionsWithIndex(f: (Int, Iterator[T]) → Iterator[U]): RDD[T] → RDD[U], the integer value is the partition index
  • sample(withReplacement, fraction, seed): RDD[T] → RDD[T], samples a fraction of the data
  • union(otherDataset): (RDD[T], RDD[T]) → RDD[T], A ∪ B
  • intersection(otherDataset): (RDD[T], RDD[T]) → RDD[T], A ∩ B
  • distinct([numTasks]): RDD[T] → RDD[T], returns the distinct elements of the source dataset
  • groupByKey([numTasks]): RDD[(K,V)] → RDD[(K, Iterable[V])]
  • reduceByKey(f: (V,V) → V, [numTasks]): RDD[(K,V)] → RDD[(K,V)], aggregates the values for each key
  • sortByKey([ascending], [numTasks]): RDD[(K,V)] → RDD[(K,V)], sorts by key
  http://spark.apache.org/docs/1.6.2/programming-guide.html http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
  29. 29. RDD – Transformations
  • join(otherDataset, [numTasks]): (RDD[(K,V)], RDD[(K,W)]) → RDD[(K,(V,W))], all pairs (v,w) for each key k; also leftOuterJoin, rightOuterJoin, fullOuterJoin
  • cogroup(otherDataset, [numTasks]): (RDD[(K,V)], RDD[(K,W)]) → RDD[(K, (Iterable[V], Iterable[W]))], alias: groupWith
  • cartesian(otherDataset): (RDD[T], RDD[U]) → RDD[(T,U)], cartesian product of the RDDs: every (a,b) with a in RDD[T], b in RDD[U]
  • pipe(command, [envVars]): String → RDD[String], runs a shell command and returns its output as an RDD; partition elements go to the process's stdin and its stdout comes back as RDD[String] http://blog.madhukaraphatak.com/pipe-in-spark/
  • coalesce(numPartitions): RDD[T] → RDD[T], reduces the number of partitions in the RDD to the given number
  • repartition(numPartitions): RDD[T] → RDD[T], reshuffles the data in the RDD into the given number of partitions; always shuffles all data over the network
  • repartitionAndSortWithinPartitions(partitioner): RDD[(K,V)] → RDD[(K,V)], repartitions with the given partitioner and sorts within each resulting partition; more efficient than repartition followed by sorting
  http://spark.apache.org/docs/1.6.2/programming-guide.html http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
  30. 30. RDD - Transformations
  Scala:
  val distFile = sc.textFile("README.md")
  distFile.map(l => l.split(" ")).collect()
  distFile.flatMap(l => l.split(" ")).collect()
  Python:
  distFile = sc.textFile("README.md")
  distFile.map(lambda x: x.split(' ')).collect()
  distFile.flatMap(lambda x: x.split(' ')).collect()
  31. 31. RDD – Actions
  • reduce(f: (T,T) → T): RDD[T] → T, aggregates all elements of the dataset using f
  • collect(): RDD[T] → Array[T], returns all elements of the dataset as an array
  • count(): RDD[T] → Long, returns the number of elements in the dataset
  • first(): RDD[T] → T, returns the first element
  • take(n): RDD[T] → Array[T], returns the first n elements
  • takeSample(withReplacement, num, [seed]): RDD[T] → Array[T], returns a random sample of num elements
  • takeOrdered(n, [ordering]): RDD[T] → Array[T], returns the first n elements in sorted order
  • saveAsTextFile(path): RDD[T] → Unit, saves all elements as a text file on the local filesystem, HDFS, etc.
  • saveAsSequenceFile(path): RDD[T] → Unit, saves all elements as a Hadoop SequenceFile
  • saveAsObjectFile(path): RDD[T] → Unit, saves all elements in a simple format using Java serialization
  • countByKey(): RDD[(K,V)] → Map[K, Long], returns the count for each key
  • foreach(f: T → Unit): RDD[T] → Unit, runs the function f on each element
  • saveAsNewAPIHadoopDataset: RDD[T] → Unit, writes to HDFS (as an MR job) using a Hadoop API 'OutputFormat' (mapreduce.OutputFormat); used for HBase BulkLoad
  http://spark.apache.org/docs/1.6.2/programming-guide.html http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
  32. 32. RDD - Actions
  Scala:
  val f = sc.textFile("README.md")
  val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))
  words.reduceByKey(_ + _).collect()
  Python:
  from operator import add
  f = sc.textFile("README.md")
  words = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
  words.reduceByKey(add).collect()
  33. 33. RDD - Persistence • Unlike MapReduce, Spark can persist (or cache) a dataset. • Each node stores the partitions it computes in memory or on storage so they can be reused by other RDD operations (transformations/actions) • Up to 10x speedup • One of the most important Spark features
  val wordCounts = rdd.flatMap(x => x.split(" "))
    .map(s => (s, 1))
    .reduceByKey((a,b) => a + b)
    .cache()
  34. 34. RDD - Persistence
  • MEMORY_ONLY: store the RDD as deserialized Java objects on the JVM heap; the default level. If the RDD does not fit in memory, some partitions are not stored and are recomputed when needed.
  • MEMORY_AND_DISK: store the RDD as deserialized Java objects on the JVM heap; partitions that do not fit in memory spill to disk and are read from there when needed.
  • MEMORY_ONLY_SER: store the RDD as serialized Java objects (one byte array per partition) on the JVM heap; space-efficient, but CPU-intensive to read.
  • MEMORY_AND_DISK_SER: like MEMORY_ONLY_SER, but partitions that do not fit in memory spill to disk and are read from there when needed.
  • DISK_ONLY: store only on disk.
  • MEMORY_ONLY_2: same as MEMORY_ONLY, but each partition is replicated on 2 nodes.
  • MEMORY_AND_DISK_2: same as MEMORY_AND_DISK, but each partition is replicated on 2 nodes.
  • OFF_HEAP: store the RDD in a serialized format in Tachyon; lower GC overhead than MEMORY_ONLY_SER, useful with large heaps and many concurrent applications; cached data survives an executor crash.
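A minimal sketch of picking an explicit storage level from the list above (the file path is an illustrative assumption; cache() is shorthand for MEMORY_ONLY):

  import org.apache.spark.storage.StorageLevel

  val f = sc.textFile("README.md")
  val cached = f.persist(StorageLevel.MEMORY_AND_DISK)
  cached.count()  // first action materializes the persisted partitions
  cached.count()  // later actions reuse them instead of re-reading the file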
  35. 35. RDD - Persistence
  Scala:
  val f = sc.textFile("README.md")
  val w = f.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()
  w.reduceByKey(_ + _).collect.foreach(println)
  Python:
  from operator import add
  f = sc.textFile("README.md")
  w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).cache()
  w.reduceByKey(add).collect()
  36. 36. RDD - Fault Tolerance • An RDD records the lineage of each transformation, so lost data can be recomputed. The State of Spark, and Where We're Going Next - Matei Zaharia, Spark Summit (2013) youtu.be/nU6vO2EJAb4
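A minimal sketch of inspecting the lineage Spark records (the file path is an illustrative assumption):

  val f = sc.textFile("README.md")
  val counts = f.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
  println(counts.toDebugString)  // prints the RDD dependency graph used for recovery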
  37. 37. RDD – Narrow vs. Wide Dependencies • Narrow Dependencies • single node • uses only memory • fast • recovering a few partitions is also fast • Wide Dependencies • multiple nodes • causes a shuffle • uses the network • recovery takes a long time • checkpointing recommended
  38. 38. RDD – Job Scheduling • Computation follows the direction of the DAG • Stages are built so they can run locally where possible (i.e. with narrow dependencies) • A new stage starts wherever a shuffle is needed • The node where a partition runs is chosen with data locality in mind (HDFS)
  39. 39. Examples – Word Count • Input Data: "the cat sat on the mat" / "the aardvark sat on the sofa" • Result: aardvark 1, cat 1, mat 1, on 2, sat 2, sofa 1, the 4 http://www.slideshare.net/cloudera/spark-devwebinarslides-final
  40. 40. Examples – Word Count
  (Diagram: lines → words → (word, 1) pairs → per-key counts, i.e. HadoopRDD → MapPartitionsRDD → MapPartitionsRDD → ShuffledRDD → Array)
  val f = sc.textFile(file)
  val w = f.flatMap(l => l.split(" ")).map(word => (word, 1))
  val counts = w.reduceByKey(_ + _)
  counts.saveAsTextFile(output)
  41. 41. Examples - Estimate Pi • Estimating Pi with the Monte Carlo method • ./bin/run-example SparkPi 2 local • Algorithm 1. Draw a square, then inscribe a circle within it. 2. Uniformly scatter objects of uniform size over the square. 3. Count the number of objects inside the circle and the total number of objects. 4. The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiply the result by 4 to estimate π. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf http://demonstrations.wolfram.com/MonteCarloEstimateForPi/ https://en.wikipedia.org/wiki/Monte_Carlo_method
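A minimal sketch of this algorithm in Spark, in the spirit of the bundled SparkPi example (the sample count is an arbitrary assumption):

  import scala.math.random

  val n = 1000000
  val inside = sc.parallelize(1 to n).map { _ =>
    val x = random * 2 - 1            // random point in the 2x2 square
    val y = random * 2 - 1
    if (x * x + y * y <= 1) 1 else 0  // 1 if it falls inside the unit circle
  }.reduce(_ + _)
  println(s"Pi is roughly ${4.0 * inside / n}")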
  42. 42. Examples – Estimate Pi (Diagram: base RDD → transformed RDD → action for the Pi example) https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
  43. 43. Spark SQL
  44. 44. Spark SQL • Spark module for structured data processing (e.g. DB tables, JSON files) • Adding Schema to RDDs • Three ways to manipulate data: • SQL (2014.05, Spark 1.0) • DataFrame (2015.03, Spark 1.3) • Datasets (2016.01, Spark 1.6) • Same execution engine for all three • Spark SQL interfaces provide more information about both structure and computation being performed than basic Spark RDD API
  45. 45. Spark SQL Motivation • Create and Run Spark Programs Faster • Write less code • Read less data • Let the optimizer do the hard work • Limitations of Shark • Limited integration with Spark programs • Hive optimizer not designed for Spark • Spark SQL reuses the best parts of Shark http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014
  46. 46. SQL • Execute SQL queries written using either a basic SQL syntax or HiveQL • When running SQL from within another programming language the results will be returned as a DataFrame. • Interact with the SQL interface using the CLI or JDBC/ODBC
  47. 47. DataFrames • Distributed collection of data organized into named columns. • Conceptually equivalent to a table in a relational DB or a data frame in R/Python • API available in Scala, Java, Python, and R • Richer optimizations (significantly faster than RDDs) • Can be constructed from a wide array of sources • data files, tables in Hive, external databases, existing RDDs
  48. 48. DataFrames http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune • Constructed from a wide array of sources
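A minimal Spark 1.x sketch of building DataFrames from two of those sources (the file path, case class, and data are illustrative assumptions):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // From a JSON data file: the schema is inferred from the records.
  val fromJson = sqlContext.read.json("people.json")
  fromJson.printSchema()

  // From an existing RDD of case-class objects.
  case class Person(name: String, age: Int)
  val fromRdd = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25))).toDF()
  fromRdd.show()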
  49. 49. Datasets • New experimental interface added in Spark 1.6 • Tries to provide the benefits of RDDs(strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. • Unified Dataset API can be used both in Scala and Java.
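A minimal Spark 1.6 sketch of the typed Dataset API (the case class and values are illustrative assumptions; sqlContext and its implicits are assumed to be in scope as above):

  case class Person(name: String, age: Long)

  val ds = Seq(Person("Andy", 32), Person("Lee", 29)).toDS()  // Dataset[Person]
  val names = ds.filter(_.age > 30).map(_.name).collect()     // typed lambdas, checked at compile time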
  50. 50. SQL Context and Hive Context • SQLContext • Entry point into all functionality in Spark SQL • Wraps / extends the existing spark context • HiveContext • Superset of functionality provided by the basic SQLContext • Read data from Hive tables • Access to Hive Functions -> UDFs
  val sqlContext = new SQLContext(sc)
  val hc = new HiveContext(sc)
  51. 51. DataFrame Example • Reading Data From Table
  val df = sqlContext.table("flightsTbl")
  df.select("Origin", "Dest", "DepDelay").show(5)
  +------+----+--------+
  |Origin|Dest|DepDelay|
  +------+----+--------+
  |   IAD| TPA|       8|
  |   IAD| TPA|      19|
  |   IND| BWI|       8|
  |   IND| BWI|      -4|
  |   IND| BWI|      34|
  +------+----+--------+
  52. 52. DataFrame Example • Using the DataFrame API to Filter Data (show delays of more than 15 min)
  df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)
  +------+----+--------+
  |Origin|Dest|DepDelay|
  +------+----+--------+
  |   IAD| TPA|      19|
  |   IND| BWI|      34|
  |   IND| JAX|      25|
  |   IND| LAS|      67|
  |   IND| MCO|      94|
  +------+----+--------+
  53. 53. SQL Example • Using SQL to Query and Filter Data (again, show delays more than 15 min)
  // Register Temporary Table
  df.registerTempTable("flights")
  // Use SQL to Query Dataset
  sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show
  +------+----+--------+
  |Origin|Dest|DepDelay|
  +------+----+--------+
  |   IAD| TPA|      19|
  |   IND| BWI|      34|
  |   IND| JAX|      25|
  |   IND| LAS|      67|
  |   IND| MCO|      94|
  +------+----+--------+
  54. 54. RDD vs. DataFrame • RDD • Lower-level API (more control) • Lots of existing code & users • Compile-time type-safety • DataFrame • Higher-level API(faster development) • Faster sorting, hashing, and serialization • More opportunities for automatic optimization • Lower memory pressure
  55. 55. DataFrames are intuitive
  • Example data (dept, name, age): (Bio, H Smith, 48), (CS, A Turing, 54), (Bio, B Jones, 43), (Phys, E Witten, 61)
  • Find average age by department? (The slide compares an RDD example, a DataFrame example, and a SQL example.)
  • SQL Example: sqlContext.sql("SELECT avg(age) FROM data GROUP BY dept")
  http://www.slideshare.net/hortonworks/intro-to-spark-with-zeppelin
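A minimal sketch of the RDD vs. DataFrame comparison the slide alludes to (the sample data mirrors the table above; sqlContext and its implicits are assumed to be in scope):

  val data = sc.parallelize(Seq(
    ("Bio", "H Smith", 48), ("CS", "A Turing", 54),
    ("Bio", "B Jones", 43), ("Phys", "E Witten", 61)))

  // RDD version: aggregate (sum, count) per key by hand.
  val rddAvg = data
    .map { case (dept, _, age) => (dept, (age, 1)) }
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .map { case (dept, (sum, cnt)) => (dept, sum.toDouble / cnt) }

  // DataFrame version: declarative, optimized by Catalyst.
  import sqlContext.implicits._
  val df = data.toDF("dept", "name", "age")
  val dfAvg = df.groupBy("dept").avg("age")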
  56. 56. Spark SQL Optimizations • Spark SQL uses an underlying optimization engine(Catalyst) • Catalyst can perform intelligent optimization since it understands the schema • Spark SQL does not materialize all the columns(as with RDD) only what’s needed http://www.slideshare.net/hortonworks/intro-to-spark-with-zeppelin
  57. 57. Plan Optimization & Execution • Spark SQL uses an underlying optimization engine(Catalyst) http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
  58. 58. An example query
  SELECT name
  FROM (
    SELECT id, name
    FROM People) p
  WHERE p.id = 1
  Logical Plan: Project(name) <- Filter(id = 1) <- Project(id, name) <- People
  http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
  59. 59. Optimizing with Rules
  • Original Plan: Project(name) <- Filter(id = 1) <- Project(id, name) <- People
  • Filter Push-Down: Project(name) <- Project(id, name) <- Filter(id = 1) <- People
  • Combine Projection: Project(name) <- Filter(id = 1) <- People
  • Physical Plan: IndexLookup(id = 1, return: name)
  http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
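A minimal sketch of watching Catalyst do this on your own query (assumes the "flights" temporary table registered on the earlier slide):

  val q = sqlContext.sql("SELECT Origin FROM flights WHERE DepDelay > 15")
  q.explain(true)  // extended = true prints the parsed, analyzed, optimized, and physical plans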
  60. 60. References Papers • Spark: Cluster Computing with Working Sets: http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing : http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf • Spark SQL: Relational Data Processing in Spark: https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf RDD • Spark 의 핵심은 무엇인가? RDD! (RDD paper review): http://www.slideshare.net/yongho/rdd-paper-review • Apache Spark RDDs: http://www.slideshare.net/deanchen11/scala-bay-spark-talk Stanford 자료 • Intro to Apache Spark: https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf • Distributed Computing with Spark: https://stanford.edu/~rezab/sparkclass/slides/reza_introtalk.pdf
  61. 61. References • Apache Apache Spark Overview - MapR: http://www.slideshare.net/caroljmcdonald/apache-spark-overview-52602792 • Introduction to Apache Spark Developer Training - Cloudera: http://www.slideshare.net/cloudera/spark-devwebinarslides-final • Apache Spark Overview: http://www.slideshare.net/VadimYBichutskiy/apache-spark-overview • Simplifying Big Data Analytics with Apache Spark: http://www.slideshare.net/databricks/bdtc2 • Intro to Spark with Zeppelin: http://www.slideshare.net/hortonworks/intro-to-spark-with-zeppelin • Introduction to Big Data Analytics using Apache Spark and Zeppelin: http://www.slideshare.net/alexzeltov/introduction-to-big-data-analytics-using-apache-spark-and-zeppelin-on-hdinsights-on- azure-saas-andor-hdp-on-azurepaas • Spark overview: http://www.slideshare.net/LisaHua/spark-overview-37479609 • Introduction to real time big data with Apache Spark: http://www.slideshare.net/tmatyashovsky/introduction-to-realtime-big-data-with-apache-spark • Spark은 왜 이렇게 유명해 지고 있을까?: http://www.slideshare.net/KSLUG/ss-47355270 • Apache Spark Briefing: http://www.slideshare.net/ThomasWDinsmore/apache-spark-briefing-12062013 • Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106: http://www.slideshare.net/sanghoonlee982/spark-overview-20141106 • Zeppelin(Spark)으로 데이터 분석하기: http://www.slideshare.net/sangwookimme/zeppelinspark-41329473 • Lightening Fast Big Data Analytics using Apache Spark: http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark • Apache Hive on Apache Spark: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/ • Apache Spark: The Next Gen toolset for Big Data Processing: http://www.slideshare.net/prajods/apache-spark-the-next-gen-toolset-for-big-data-processing • Intro to Spark and Spark SQL: http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014 • Spark SQL Deep Dive @ Melbourne Spark Meetup: http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune • A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html • Apache Spark (big Data) DataFrame - Things to know: https://www.linkedin.com/pulse/apache-spark-big-data-dataframe-things-know-abhishek-choudhary
  62. 62. References • Spark SQL - Quick Guide: https://www.tutorialspoint.com/spark_sql/spark_sql_quick_guide.htm • Big Data Processing with Apache Spark - Part 2: Spark SQL: https://www.infoq.com/articles/apache-spark-sql • Analytics with Apache Spark Tutorial Part 2: Spark SQL: https://dzone.com/articles/analytics-with-apache-spark-tutorial-part-2-spark • Apache Spark Resource Management and YARN App Models: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management- and-yarn-app-models/ • Why Spark Is the Next Top (Compute) Model: http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model-39976454 • Introduction to Apache Spark:http://www.slideshare.net/datamantra/introduction-to-apache-spark-45062010 • Apache Spark & Hadoop:http://www.slideshare.net/MapRTechnologies/spark-overviewjune2014 • Spark와 Hadoop, 완벽한 조합 (한국어): http://www.slideshare.net/pudidic/spark-hadoop • Big Data visualization with Apache Spark and Zeppelin: http://www.slideshare.net/prajods/big-data-visualization-with-apache-spark-and-zeppelin • latency: https://gist.github.com/hellerbarde/2843375 • Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San Jose 2015: http://www.slideshare.net/databricks/strata-sj-everyday- im-shuffling-tips-for-writing-better-spark-programs • Apache Spark Architecture: http://www.slideshare.net/AGrishchenko/apache-spark-architecture
  63. 63. Appendix
  64. 64. Latency numbers
  65. 65. Google big-data technologies (technology / year / description)
  • GFS (2003): Google File System: A Distributed Storage
  • MapReduce (2004): Simplified Data Processing on Large Clusters
  • Sawzall (2005): Interpreting the Data: Parallel Analysis with Sawzall
  • Chubby (2006): The Chubby Lock Service for Loosely-Coupled Distributed Systems
  • BigTable (2006): A Distributed Storage System for Structured Data
  • Paxos (2007): Paxos Made Live - An Engineering Perspective
  • Colossus (2009): GFS II
  • Percolator (2010): Large-scale Incremental Processing Using Distributed Transactions and Notifications
  • Pregel (2010): A System for Large-Scale Graph Processing
  • Dremel (2010): Interactive Analysis of Web-Scale Datasets
  • Tenzing (2011): A SQL Implementation On The MapReduce Framework
  • Megastore (2011): Providing Scalable, Highly Available Storage for Interactive Services
  • Spanner (2012): Google's Globally-Distributed Database
  • F1 (2012): The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business
  66. 66. GFS Motivation • More than 15,000 commodity-class PC's. • Fault-tolerance provided in software • More cost-effective solution • Multiple clusters distributed worldwide • One query reads 100’s of MB of data • One query consumes 10’s of billions of CPU cycles • Thousands of queries served per second. • Google stores dozens of copies of the entire Web! • Conclusion: Need Large, distributed, highly fault- tolerant file system http://www.cs.brandeis.edu/~dilant/WebPage_TA160/The%20Google%20File%20System.pdf
  67. 67. GFS Assumptions • 높은 컴포넌트 장애율 • 장애에 대한 모니터링/감시, 장애 내성, 장애 복구 등의 준비가 필요하다. • “적당한” 규모의 큰(HUGE) 파일들 • Just a few million • Each is 100MB or larger; multi-GB files typical • 파일은 한번 쓰고, 대부분은 추가된다. • Perhaps concurrently • 큰 순차 읽기(Large Streaming Reads) • 높은 지속적인 처리량(throughput)이 저 지연(low latency)보다 중 요 http://research.google.com/archive/gfs.html https://www.cs.umd.edu/class/spring2011/cmsc818k/Lectures/gfs-hdfs.pdf
  68. 68. GFS Architecture http://research.google.com/archive/gfs.html https://www.cs.umd.edu/class/spring2011/cmsc818k/Lectures/gfs-hdfs.pdf <GFS Architecture> <GFS 파일 저장 구조>
  69. 69. SQL JOINS http://amirulkamil.com/best-describe-join/
  70. 70. YARN Architecture https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html
  71. 71. YARN Features
  • Multi-tenancy: YARN allows multiple access engines (either open-source or proprietary) to use Hadoop as the common standard for batch, interactive and real-time engines that can simultaneously access the same data set. Multi-tenant data processing improves an enterprise's return on its Hadoop investments.
  • Cluster utilization: YARN's dynamic allocation of cluster resources improves utilization over the more static MapReduce rules used in early versions of Hadoop.
  • Scalability: Data center processing power continues to rapidly expand. YARN's ResourceManager focuses exclusively on scheduling and keeps pace as clusters expand to thousands of nodes managing petabytes of data.
  • Compatibility: Existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to existing processes that already work.
  72. 72. Hive
  73. 73. Hive Overview • Invented at Facebook. Open sourced to Apache in 2008 • A database/data warehouse on top of Hadoop • Structured data similar to relational schema • Tables, columns, rows and partitions • SQL like query language (HiveQL) • A subset of SQL with many traditional features • It is possible to embedded MR script in HiveQL • Queries are compiled into MR jobs that are executed on Hadoop. 출처: http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf
  74. 74. Hive Motivation(Facebook) • Problem: Data growth was exponential • 200GB per day in March 2008 • 2+TB(compressed) raw data / day in April 2009 • 4+TB(compressed) raw data / day in Nov. 2009 • 12+TB(compressed) raw data / day today(2010) • The Hadoop Experiment • Much superior to availability and scalability of commercial DBs • Efficiency not that great, but throw more hardware • Partial Availability/resilience/scale more important than ACID • Problem: Programmability and Metadata • MapReduce hard to program (users know sql/bash/python) • Need to publish data in well known schemas • Solution: SQL + MapReduce = HIVE (2007) 출처: 1) http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation 2) http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop
  75. 75. Hive – Data Flow of Facebook 출처: http://borthakur.com/ftp/hadoopmicrosoft.pdf, http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf
  76. 76. Hive - Architecture https://cwiki.apache.org/confluence/display/Hive/Design
  77. 77. Hive - Query Execution and MR Jobs 출처: Ysmart(Yet Another SQL-to-MapReduce Translator), http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf
  78. 78. Hive - Limitations • Performance (주로 ~0.12) • For simple queries, HIVE performance is comparable with hand- coded MR jobs • The execution time is much longer for complex queries • 연산단계마다 MR잡이 실행되기 때문에 많은 IO로 인한 성능 병목 발생 • 각 단계마다, 비효율적인 데이터 스캔 및 전송이 발생함 • 약한 Optimizer로 인한 비효율적인 실행계획 • 스팅어 계획(Stinger Initiative)로 성능은 이전에 비해 비약적으 로 향상됨(Tez, Orc 도입, Optimizer개선) • DW용으로만 한정적으로 사용될 수 있음. • Streaming, Graph, ML등의 작업에는 제한이 있음
  79. 79. Spark
  80. 80. RDD Operations in paper http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
  81. 81. Spark Master - YARN • YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. You can throw your entire cluster at a MapReduce job, then use some of it on an Impala query and the rest on Spark application, without any changes in configuration. • You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads. • Spark standalone mode requires each application to run an executor on every node in the cluster, whereas with YARN, you choose the number of executors to use. • Finally, YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against Kerberized Hadoop clusters and uses secure authentication between its processes. http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
  82. 82. Spark Master - YARN (Diagrams: yarn-cluster mode vs. yarn-client mode) http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
  83. 83. Spark 2.0 Datasets
  84. 84. Spark 2.0 DataSets – Typed and Un-typed APIs (language / main abstraction)
  • Scala: Dataset[T] & DataFrame (alias for Dataset[Row])
  • Java: Dataset[T]
  • Python*: DataFrame
  • R*: DataFrame
  Note: Since Python and R have no compile-time type-safety, we only have untyped APIs, namely DataFrames.
  https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
  85. 85. Static-typing and runtime type-safety https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
  86. 86. High-level abstraction and custom view
  1. Spark reads the JSON, infers the schema, and creates a collection of DataFrames
  2. At this point, Spark converts your data into DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.
  3. Now, Spark converts the Dataset[Row] into a Dataset[DeviceIoTData] of type-specific Scala JVM objects, as dictated by the class DeviceIoTData
  case class DeviceIoTData (battery_level: Long, c02_level: Long, cca2: String, cca3: String, cn: String, device_id: Long, device_name: String, humidity: Long, ip: String, latitude: Double, lcd: String, longitude: Double, scale: String, temp: Long, timestamp: Long)
  {"device_id": 198164, "device_name": "sensor-pad-198164owomcJZ", "ip": "80.55.20.25", "cca2": "PL", "cca3": "POL", "cn": "Poland", "latitude": 53.080000, "longitude": 18.620000, "scale": "Celsius", "temp": 21, "humidity": 65, "battery_level": 8, "c02_level": 1408, "lcd": "red", "timestamp": 1458081226051}
  // read the json file and create the dataset from the case class DeviceIoTData
  // ds is now a collection of JVM Scala objects DeviceIoTData
  val ds = spark.read.json("/databricks-public-datasets/data/iot/iot_devices.json").as[DeviceIoTData]
  https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
  87. 87. Ease-of-use of APIs with structure • Although structure may limit control in what your Spark program can do with data • Most computations can be accomplished with Dataset's high-level APIs. For example, it's much simpler to perform agg, select, sum, avg, map, filter, or groupBy
  // Use filter(), map(), groupBy() country, and compute avg()
  // for temperatures and humidity. This operation results in
  // another immutable Dataset. The query is simpler to read,
  // and expressive
  val dsAvgTmp = ds.filter(d => {d.temp > 25}).map(d => (d.temp, d.humidity, d.cca3)).groupBy($"_3").avg()
  // display the resulting dataset
  display(dsAvgTmp)
  https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
  88. 88. Performance and Optimization • DataFrame and Dataset APIs are built on top of the Spark SQL engine. • It uses Catalyst to generate an optimized logical and physical query plan. • With the Dataset's Tungsten Encoders, Spark generates compact bytecode for serialization and deserialization, which gives it a speed advantage. https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
  89. 89. Use DataFrames or Datasets? • If you want rich semantics, high-level abstractions, and domain specific APIs → use DataFrame or Dataset. • If your processing demands high-level expressions, filters, maps, aggregation, averages, sum, SQL queries, columnar access and use of lambda functions on semi-structured data → DataFrame or Dataset. • If you want a higher degree of type-safety at compile time, want typed JVM objects, take advantage of Catalyst optimization, and benefit from Tungsten's efficient code generation → Dataset. • If you want unification and simplification of APIs across Spark Libraries → DataFrame or Dataset. • If you are an R user → DataFrames. • If you are a Python user → DataFrames, and resort back to RDDs if you need more control.
  90. 90. Spark Streaming vs. Storm
  91. 91. Reliability Models (Core Storm / Storm Trident / Spark Streaming)
  • At Most Once: Yes / Yes / No
  • At Least Once: Yes / Yes / No*
  • Once and Only Once (Exactly Once): No / Yes / Yes*
  http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
  92. 92. Programming Model (Core Storm / Storm Trident / Spark Streaming)
  • Stream Primitive: Tuple / Tuple, Tuple Batch, Partition / DStream
  • Stream Sources: Spouts / Spouts, Trident Spouts / HDFS, Network
  • Computation/Transformation: Bolts / Filters, Functions, Aggregations, Joins / Transformation, Window Operations
  • Stateful Operation: No (roll your own) / Yes / Yes
  • Output/Persistence: Bolts / State, MapState / foreachRDD
  2014, http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
  93. 93. Performance • Storm capped at 10k msgs/sec/node? • Spark Streaming 40x faster than Storm? (System / Performance)
  • Storm (Twitter): 10,000 records/s/node
  • Spark Streaming: 400,000 records/s/node
  • Apache S4: 7,000 records/s/node
  • Other Commercial Systems: 100,000 records/s/node
  2014, http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
