[SSA] Big Data Analytics

SQL on Hadoop

(A New Generation of Analytic Databases)

Hyounggi Min
hg.min@samsung.com
February 5, 2014
Contents
I. Big Data Analytics
II. Approaches to Big Data Analytics
III. Analytic Database at Google
IV. Types of SQL on Hadoop
I. SQL to MapReduce
II. SQL natively on Hadoop
III. Split SQL Execution
1
Big Data Analytics

Source: http://blog.brainhost.com/server-breakthroughs-that-revolutionized-web-hosting/

2
Data Analysis

[Diagram: data → analysis → information → value]

3
Big Data Components

Source: http://smartdatacollective.com/yellowfin/75616/why-big-data-and-business-intelligence-one-direction

4
Big Data Processing Flow

Source: Gruter, Architecture and Technologies for Big Data (2011)

5
Analytical Complexity

The evolution of data analysis (faster, deeper insight)

[Chart: analytical complexity vs. technology/automation, with stages ranging from reporting & BPM and campaign management, through fraud detection, clustering, statistical analysis, dynamic interaction, and behaviour modelling, up to machine learning algorithms and dynamic simulation]

Source: http://strataconf.com/stratany2013/public/schedule/detail/31731

6
Approaches to Big Data Analytics
 Traditional Analytic Database
 MapReduce
 SQL on Hadoop

7
Motivation
 Large internet companies
– e.g., Amazon, Google, Yahoo, Facebook, Twitter, Microsoft, Netflix

They collect terabytes of data every day
Ad-hoc analysis is needed to turn that data into information
– It must be analyzed together with past trends
– e.g., engineers need trend analysis of search logs to improve a search engine's ranking algorithm
8
Solutions

Traditional Analytic Database
MapReduce on Hadoop
SQL on Hadoop

9
Solution 1 - Analytic Database Problems
 Cost
– Traditional analytic databases have too high a TCO
– e.g., per-CPU/core pricing in particular is too burdensome for an internet company

 Scalability / availability
– Commercial databases fall short on scalability and availability
– Scalability and cost constraints force storing only pre-aggregated data
– At web scale, availability, scalability, and elasticity matter more than ACID

 Architecture
– Traditional analytic databases require separate storage for log analysis

 Vendors
– Teradata, IBM Netezza, Greenplum, Oracle Exadata, HP Vertica, Sybase IQ, etc.
Source: http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf

10
Solution 2 - MapReduce Problems
 Difficulty of MapReduce development
– MapReduce is hard, takes substantial development effort, and offers no easy performance guarantees
– Performance varies with developer skill
– Productivity is far lower than with an existing SQL implementation

 Missing features: schemas, an optimizer, indexes, views, etc.
 Limits of the processing model (not designed for relational processing)
– Fixed data flow
– Shuffle: merge sort → hashing → merge sort
– Overhead of data exchange between jobs

Source: http://www.slideshare.net/Hadoop_Summit/steinbach-june26-405pmroom230av2

11
Solution 3 - SQL on Hadoop
Refers to a new generation of Hadoop-based analytic engines
 Standard SQL support
– Easy to integrate with or replace existing systems
– Gentler learning curve than MapReduce

 High processing performance
– Distributed processing frameworks that overcome MapReduce's limits
– Execution models that make better use of CPU and memory

 Examples
– Hive (Stinger), Impala, Presto, Drill, Tajo, Hadapt, etc.

Source: http://www.slideshare.net/deview/deview2013-tajo

12
Analytic Database at Google
 Dremel
 Tenzing
 F1

13
Dremel
(Interactive Analysis of WebScale Datasets)

14
Google Dremel System
 Trillion-record, multi-terabyte datasets at interactive speed
– Scales to thousands of nodes
– Fault and straggler tolerant execution

 Nested data model
– Complex datasets; normalization is prohibitive
– Columnar storage and processing

 Tree architecture (as in web search)
 Interoperates with Google's data mgmt tools
– In situ data access (e.g., GFS, Bigtable)
– MapReduce pipelines
Source: Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10

15
Google Dremel - Widely used inside Google
 Analysis of crawled web
documents
 Tracking install data for
applications on Android Market
 Crash reporting for Google
products
 OCR results from Google Books
 Spam analysis
 Debugging of map tiles on
Google Maps

 Tablet migrations in
managed Bigtable instances
 Results of tests run on
Google's distributed build
system
 Disk I/O statistics for
hundreds of thousands of
disks
 Resource monitoring for
jobs run in Google's data
centers
 Symbols and dependencies
in Google's codebase
16
Google Dremel - Storage Format
DocId: 10
Links
  Forward: 20
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Url: 'http://A'
Name
  Url: 'http://B'

[Figure: the sample record stored record-oriented (r1, r2) vs. column-striped (one stripe per field): the columnar layout reads less data and decompresses more cheaply]
Challenge: preserve structure, reconstruct from a subset of fields
Source: Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10

17
Google Dremel - Nested Data Model
[Figure: schema and sample records annotated with repetition and definition levels]

r: at what repeated field in the field's path the value has repeated
d: how many fields in the path that could be undefined (optional or repeated) are actually present

Source: Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10
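To make the two levels concrete, here is a small Python sketch (my own illustration, not from the deck) over the Name.Language.Code stripe from the paper's sample records; splitting on r == 0 recovers record boundaries:

```python
# Column stripe for Name.Language.Code from the Dremel paper's sample data:
# each entry is (value, repetition_level, definition_level).
code_column = [
    ("en-us", 0, 2),  # r=0: starts a new record (r1)
    ("en",    2, 2),  # r=2: repeats at the Language level, same Name
    (None,    1, 1),  # r=1: new Name; d=1 means Language is absent
    ("en-gb", 1, 2),  # r=1: new Name carrying a Code
    (None,    0, 1),  # r=0: starts record r2; no Language at all
]

def split_records(column):
    """Group column entries into records: r == 0 marks a record boundary."""
    records = []
    for value, r, d in column:
        if r == 0:
            records.append([])
        records[-1].append((value, r, d))
    return records

records = split_records(code_column)
# Two records; the first has three defined Code values, the second none.
counts = [sum(1 for v, r, d in rec if v is not None) for rec in records]
print(len(records), counts)  # 2 [3, 0]
```
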

18
Google Dremel - Column-striped representation
DocId           Name.Url           Links.Forward   Links.Backward
value  r  d     value     r  d     value  r  d     value  r  d
10     0  0     http://A  0  2     20     0  2     NULL   0  1
20     0  0     http://B  1  2     40     1  2     10     0  2
                NULL      1  1     60     1  2     30     1  2
                http://C  0  2     80     0  2

Name.Language.Code     Name.Language.Country
value  r  d            value  r  d
en-us  0  2            us     0  3
en     2  2            NULL   2  2
NULL   1  1            NULL   1  1
en-gb  1  2            gb     1  3
NULL   0  1            NULL   0  1
19
Google Dremel - Record assembly FSM

[Figure: record assembly automaton with states DocId, Links.Backward, Links.Forward, Name.Language.Code, Name.Language.Country, and Name.Url; transitions labeled with repetition levels (0, 1, 2)]
For record-oriented data processing (e.g., MapReduce)

20
Google Dremel - Reading two fields

[Figure: reduced automaton for reading only DocId and Name.Language.Country, yielding the partial records s1 and s2 below]

DocId: 10
Name
  Language
    Country: 'us'
  Language
Name
Name
  Language
    Country: 'gb'
DocId: 20
Name

Structure of parent fields is preserved.
Useful for queries like /Name[3]/Language[1]/Country

21
Google Dremel - SQL dialect for nested data
SELECT DocId AS Id,
COUNT(Name.Language.Code) WITHIN Name AS Cnt,
Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;
Output table (t1):
Id: 10
Name
  Cnt: 2
  Language
    Str: 'http://A,en-us'
    Str: 'http://A,en'
Name
  Cnt: 0

Output schema:
message QueryResult {
  required int64 Id;
  repeated group Name {
    optional uint64 Cnt;
    repeated group Language {
      optional string Str;
    }
  }
}
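As an illustration of the query's semantics (my own sketch, not Dremel code; the Str projection is omitted for brevity), the same WITHIN-aggregation can be run in Python over the paper's sample record r1:

```python
import re

# Sample record r1 from the Dremel paper, as nested Python data.
r1 = {
    "DocId": 10,
    "Name": [
        {"Url": "http://A",
         "Language": [{"Code": "en-us", "Country": "us"},
                      {"Code": "en"}]},
        {"Url": "http://B", "Language": []},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}

def query(record):
    """Emulate: SELECT DocId AS Id, COUNT(Name.Language.Code) WITHIN Name AS Cnt
       FROM t WHERE REGEXP(Name.Url, '^http') AND DocId < 20."""
    if not record["DocId"] < 20:
        return None
    out = {"Id": record["DocId"], "Name": []}
    for name in record["Name"]:
        url = name.get("Url")
        if url is None or not re.match(r"^http", url):
            continue  # WHERE filter on Name.Url drops the third Name
        cnt = sum(1 for lang in name["Language"] if "Code" in lang)
        out["Name"].append({"Cnt": cnt})
    return out

result = query(r1)
print(result)  # {'Id': 10, 'Name': [{'Cnt': 2}, {'Cnt': 0}]}
```

This reproduces the Cnt values (2 and 0) shown in the output table above.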
22
Google Dremel - Serving Tree
[Dean WSDM'09]

client

• Parallelizes scheduling
and aggregation
• Fault tolerance
• Designed for "small"
results (<1M records)

histogram of
response times

23
Google Dremel - Example: count()
SELECT A, COUNT(B) FROM T GROUP BY A
T = {/gfs/1, /gfs/2, …, /gfs/100000}

Level 0 (root server):
SELECT A, SUM(c)
FROM (R11 UNION ALL … UNION ALL R110)
GROUP BY A

Level 1 (intermediate servers, R11 … R110):
SELECT A, COUNT(B) AS c
FROM T11 GROUP BY A
T11 = {/gfs/1, …, /gfs/10000}

SELECT A, COUNT(B) AS c
FROM T12 GROUP BY A
T12 = {/gfs/10001, …, /gfs/20000}

Level 3 (leaf servers, data access ops):
SELECT A, COUNT(B) AS c
FROM T31 GROUP BY A
T31 = {/gfs/1}
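The rewriting above, in which leaves compute partial COUNTs and upper levels SUM them, can be sketched in Python (an illustration with made-up tablet data, not from the slides):

```python
from collections import Counter

# Made-up tablet data: each leaf server scans one tablet of (A, B) rows.
tablets = [
    [("x", 1), ("y", 2), ("x", 3)],   # T31
    [("x", 4), ("z", 5)],             # T32
    [("y", 6), ("y", 7), ("z", 8)],   # T33
]

def leaf_count(tablet):
    """Leaf: SELECT A, COUNT(B) AS c FROM tablet GROUP BY A."""
    return Counter(a for a, _ in tablet)

def root_sum(partials):
    """Root: SELECT A, SUM(c) FROM (R1 UNION ALL ...) GROUP BY A."""
    total = Counter()
    for p in partials:
        total.update(p)
    return dict(total)

result = root_sum(leaf_count(t) for t in tablets)
print(result)  # {'x': 3, 'y': 3, 'z': 2}
```
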
24
Google Dremel - Experiments
• 1 PB of real data
(uncompressed, non-replicated)
• 100K-800K tablets per table
• Experiments run during business hours
Table  Number of     Size (unrepl.,   Number of   Data     Repl.
name   records       compressed)      fields      center   factor
T1     85 billion    87 TB            270         A        3×
T2     24 billion    13 TB            530         A        3×
T3     4 billion     70 TB            1200        A        3×
T4     1+ trillion   105 TB           50          B        3×
T5     1+ trillion   20 TB            30          B        2×

25
Google Dremel - Read from disk

26
Google Dremel - MR and Dremel execution

27
Google Dremel - Impact of serving tree depth

28
Google Dremel - Scalability

29
Google Dremel - Interactive speed

30
Google Dremel - Observations
Possible to analyze large disk-resident datasets
interactively on commodity hardware
–1T records, 1000s of nodes

MR can benefit from columnar storage just like a
parallel DBMS
–But record assembly is expensive
–Interactive SQL and MR can be complementary

Parallel DBMSes may benefit from serving tree
architecture just like search engines

31
Tenzing
(A SQL Implementation On
the MapReduce Framework)

32
Google Tenzing - Overview
 SQL Query Engine Built on top of MapReduce for ad-hoc analysis of Google data.
 Features
– Complete SQL implementation (with several extensions) – SQL92, some SQL99

 Key characteristics
– Heterogeneity
– High performance
– Scalability
– Reliability
– Metadata awareness
– Low latency
– Storage extensions: columnar storage, structured data support
– Easy extensibility

 Status (as of 2011)
– 10,000+ queries/day
– 1.5 PB of compressed data
Source: VLDB 2011, http://www.vldb.org/pvldb/vol4/p1318-chattopadhyay.pdf

33
Google Tenzing - Motivation
 In 2008, Google's commercial DW for ads data ran into problems
– Rising cost of scaling (at PB levels)
– Rapidly growing load times
– SQL limitations and poor support for diverse data sources constrained complex analysis

 The solution: Tenzing
– Scalability: handle thousands of cores, hundreds of users, petabytes of data
– High reliability: runs on commodity hardware
– Performance: on par with or better than the commercial DW
– Minimal ETL: must read data from Google systems directly
– Standard SQL support: reduce the analysts' learning curve
– UDF support: applicable to prediction and mining as well
Source: [VLDB2011]

34
Google Tenzing - Architecture

35
Google Tenzing - SQL Features
 Projection and Filtering
– Standard SQL operations: IN, LIKE, BETWEEN, CASE etc
– Built-in Sawzall function

 Aggregation
– Standard aggregate functions: SUM, COUNT, MIN, MAX etc.
– Hash Based Aggregation

 Joins
– Inner, left, right, cross, and full outer joins; equi, semi-equi, non-equi, function-based joins, etc.
– Distributed implementations: broadcast joins, remote lookup joins, distributed sort-merge joins, distributed hash joins

 Analytic functions: RANK, SUM, MIN, MAX, LEAD, LAG, NTILE
 Set operations: UNION, UNION ALL, MINUS, MINUS ALL
 Nested queries and subqueries
 Handling structured data
 Views, DDL, DML, table-valued functions, data formats

Source: http://www.vldb.org/pvldb/vol4/p1318-chattopadhyay.pdf

36
Google Tenzing - Hash based Aggregation
 Tenzing query:

SELECT dept_id, COUNT(1)
FROM Employee
/*+ HASH */ GROUP BY 1;

 MapReduce pseudo-code: [figure]

Source: http://www.vldb.org/pvldb/vol4/p1318-chattopadhyay.pdf
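The pseudo-code from the slide can be approximated with a small Python sketch (my own, under the assumption that the HASH hint makes the mapper pre-aggregate in an in-memory hash table instead of sorting):

```python
from collections import defaultdict

# Hypothetical Employee rows: (dept_id, emp).
employees = [(10, "ann"), (20, "bob"), (10, "carl"), (10, "dee"), (20, "eve")]

def mapper(rows):
    """Map phase with hash-based pre-aggregation: emit (dept_id, partial_count)."""
    table = defaultdict(int)
    for dept_id, _ in rows:
        table[dept_id] += 1   # hash aggregate instead of emitting one pair per row
    yield from table.items()

def reducer(pairs):
    """Reduce phase: sum the partial counts per dept_id."""
    table = defaultdict(int)
    for dept_id, c in pairs:
        table[dept_id] += c
    return dict(table)

result = reducer(mapper(employees))
print(result)  # {10: 3, 20: 2}
```
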

37
Google Tenzing - Analytic Functions
 Tenzing query:

SELECT dept, emp, salary,
  RANK() OVER (
    PARTITION BY dept
    ORDER BY salary DESC) AS salary_rank
FROM Employee;

 MapReduce pseudo-code: [figure]

Source: http://www.vldb.org/pvldb/vol4/p1318-chattopadhyay.pdf
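A hedged Python sketch of how such a window function maps onto MapReduce (the PARTITION BY key drives the shuffle; each reducer sorts its partition and assigns ranks — my illustration, not Tenzing's actual code):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical Employee rows: (dept, emp, salary).
rows = [("eng", "ann", 120), ("eng", "bob", 150), ("eng", "carl", 120),
        ("hr", "dee", 90), ("hr", "eve", 100)]

def rank_over_partition(rows):
    """Emulate RANK() OVER (PARTITION BY dept ORDER BY salary DESC)."""
    out = []
    # Shuffle: group rows by the PARTITION BY key.
    for dept, grp in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0)):
        part = sorted(grp, key=itemgetter(2), reverse=True)  # ORDER BY salary DESC
        rank = 0
        for i, (d, emp, sal) in enumerate(part):
            # RANK semantics: ties share a rank, the next distinct value skips ahead.
            if i == 0 or sal != part[i - 1][2]:
                rank = i + 1
            out.append((d, emp, sal, rank))
    return out

for row in rank_over_partition(rows):
    print(row)
# ('eng', 'bob', 150, 1)
# ('eng', 'ann', 120, 2)
# ('eng', 'carl', 120, 2)
# ('hr', 'eve', 100, 1)
# ('hr', 'dee', 90, 2)
```
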

38
Google Tenzing - Performance Enhancement
 MapReduce Enhancement
– Worker Pool: master watcher, master pool, worker pool
– Streaming & In-memory Chaining
– Sort Avoidance
– Block shuffle
– Local Execution

 Query Engine Enhancement
– 1st impl: SQL translated to Sawzall code (Sawzall JIT compiler)
 Inefficient due to the cost of translating into the Sawzall system

– 2nd impl: Dremel's SQL expression evaluation engine
 Interpreted and row-based, hence somewhat slow

– 3rd impl: LLVM-based native code generation
 Row-major, block-based intermediate data
 Column-major, vector-based processing with columnar intermediate storage
Source: http://www.vldb.org/pvldb/vol4/p1318-chattopadhyay.pdf

39
F1
(A Distributed SQL Database That Scales)

40
Google F1 - Overview
 A distributed relational database system built to support Google's AdWords business
 A hybrid database built on Google Spanner
– NoSQL traits: high availability, scalability
– Traditional SQL database traits: consistency, usability

 Design goals
– Scalability: auto-sharded storage
– Availability & consistency: synchronous replication
– High commit latency: can be hidden
 Hierarchical schema
 Protocol buffer column types
 Efficient client code
Source: [VLDB2013], [SIGMOD2012]

41
Google F1 - AdWords Ecosystem

Source: [VLDB2011]

42
Google F1 - Motivation (1/2)
 Legacy DB: sharded MySQL
– Sharding strategy
 Sharded by customer
 Apps optimized using shard awareness

 Limitations
– Availability
 Master/slave replication -> downtime during failover
 Schema changes -> downtime for table locking
– Scalability
 Grow by adding shards
 Rebalancing shards is extremely difficult and risky
 Therefore, limit size and growth of data stored in the database
– Functionality
 Can't do cross-shard transactions or joins
Source: http://research.google.com/pubs/pub38125.html

43
Google F1 - Motivation (2/2)
 Critical applications driving Google's core ad business
– 24/7 availability, even with datacenter outages
– Consistency required
 Can't afford to process inconsistent data
 Eventual consistency too complex and painful
– Scale: 10s of TB, replicated to 1000s of machines

 Shared schema
– Dozens of systems sharing one database
– Constantly evolving - multiple schema changes per week

 SQL Query
– Query without code

 The solution: Google F1
– built from scratch,
– designed to operate at Google scale,
– without compromising on RDBMS features.

Source: http://research.google.com/pubs/pub38125.html

44
Google F1 - Features
 Implements additional features on top of Google Spanner (NewSQL)
– Combines OLTP and OLAP characteristics

 Provided by Spanner: extremely scalable data storage, synchronous replication, strong consistency and ordering properties
 Additional features
– Distributed SQL queries, including joining data from external data sources
– Transactionally consistent secondary indexes
– Asynchronous schema changes, including database reorganizations
– Optimistic transactions
– Automatic change history recording and publishing

 Deployment: managing AdWords ad campaign data since early 2012
– 100s of applications, 1000s of users, all sharing the same database
– 100TB+, hundreds of thousands of requests/sec, scanning tens of trillions of data rows per day
– 99.999% availability; observable latency did not increase even during unplanned outages

45
Google F1 - Architecture
 Architecture
– Sharded Spanner servers
 data on GFS and in memory
– Stateless F1 server
– Slave Pool for query execution

 Features
– Relational schema
 Extensions for hierarchy and rich data
types
 Non-blocking schema changes
– Consistent indexes
– Parallel reads with SQL or Map-Reduce

46
Google F1 - Schema & Clustered Storage
 Explicit table hierarchies. Example:
– Customer (root table): PK (CustomerId)
– Campaign (child): PK (CustomerId, CampaignId)
– AdGroup (child): PK (CustomerId, CampaignId, AdGroupId)
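The point of this hierarchy is that child rows cluster physically under their root row, because rows are ordered by the composite primary key. A small Python sketch (illustrative, with made-up IDs):

```python
# Made-up rows keyed by F1-style hierarchical primary keys.
rows = {
    (1,):      "Customer 1",
    (1, 3):    "Campaign 1.3",
    (1, 3, 7): "AdGroup 1.3.7",
    (2,):      "Customer 2",
    (1, 4):    "Campaign 1.4",
    (2, 5):    "Campaign 2.5",
}

# Ordering by composite key interleaves each customer's children directly
# after its root row, so one customer's data is a single contiguous scan.
clustered = [rows[k] for k in sorted(rows)]
print(clustered)
# ['Customer 1', 'Campaign 1.3', 'AdGroup 1.3.7', 'Campaign 1.4',
#  'Customer 2', 'Campaign 2.5']
```
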

47
Google F1 - Data Model
 Hierarchical Schema
– Customer (root table): PK (CustomerId)
– Campaign (child): PK (CustomerId, CampaignId)
– AdGroup (child): PK (CustomerId, CampaignId, AdGroupId)

 Table Column Types: Schema, Protocol Buffers(Binary encoding format)

Source: http://research.google.com/pubs/pub41344.html

48
Google F1 - SQL Query
 Parallel query engine implemented from scratch
 Fully functional SQL, joins to external sources
 Language extensions for protocol buffers

 Making queries fast
– Hide RPC latency
– Parallel and batch execution
– Hierarchical joins

Source: http://research.google.com/pubs/pub38125.html, http://research.google.com/pubs/pub41344.html

49
SQL on Hadoop

50
SQL on Hadoop
Refers to Hadoop-based analytic databases that support SQL
 SQL language support
– Easy to integrate with or replace existing systems
– Provides SQL-oriented features such as an optimizer and indexes

 High processing performance
– Distributed processing that overcomes MapReduce's limits

51
Why SQL on Hadoop?
 Staffing and productivity
– MapReduce has a steep learning curve → staffing problem
– Implementing on MapReduce takes substantial effort → productivity problem

 Performance guarantees and protection from human error
– Performance depends on the developer's skill
– Performance tuning takes substantial effort
– High likelihood of bugs

 Problems of running a separate DB alongside for ad-hoc queries
– Complex architecture and management burden
– Rising costs: additional DBMS licenses and storage required
– Data exchange problem: HDFS ↔ DBMS
52
History of SQL on Hadoop
 SQL has been ruling since 1970!!
 Hadoop came…But little traction…
 Facebook open-sourced HIVE in 2008.. Hadoop takes the
next leap in adoption
 RDBMS and MPP Vendors brought Hadoop Connectors
 Niche players used SQL engine to run Distributed Query on
Hadoop
 2010.10: Google Dremel paper opened
 2012.10: Cloudera Impala sets the trend for Real time
Query over Hadoop
 2013.03: Apache Tajo Project enters incubation.
 2013.11: Facebook open sourced Presto
Source: http://www.slideshare.net/SameerMandal1/sql-over-hadoop-ver-3

53
SQL on Hadoop Products

Source: http://www.slideshare.net/SameerMandal1/sql-over-hadoop-ver-3

54
SQL on Hadoop Taxonomies (1/3)
 451 Group
– SQL in Hadoop: Hive
– SQL on Hadoop: Impala, Drill, HAWQ, JethroData, Big SQL
– SQL and Hadoop: Hadapt, RainStor, PolyBase, Citus Data, SQL-H

 Gruter
– Data Warehouse System: Hive(Stinger), Tajo
– Query Engine: Impala, Drill, Presto

Source: 1) http://blogs.the451group.com/information_management/2013/10/09/7-hadoop-questions-q5-sql-in-hadoop-sql-on-hadoop-or-sql-and-hadoop/
2) http://www.slideshare.net/hyunsikchoi/sqlonhadoop-tajo-tech-planet-2013

55
SQL on Hadoop Taxonomies (2/3)
 Datascientist.info
–SQL natively on Hadoop: Stinger, Impala, Drill, Presto
–DBMS on Hadoop: Hadapt, CitusDB, Tajo

 Hadapt
– SQL translated to MapReduce jobs over a Hadoop cluster
 Hive, Stinger(without Tez)

– SQL processed by a specialized(Google-inspired) SQL engine that sits on
a Hadoop cluster
 Impala(F1), Drill(Dremel)

– Processing of SQL queries are split between MapReduce and storage that
natively speaks SQL
 Hadapt, Polybase

Source: http://datascientists.info/sql-and-hadoop/, http://hadapt.com/blog/2013/10/02/classifying-the-sql-on-hadoop-solutions/

56
SQL on Hadoop Taxonomies (3/3)
 SQL to MapReduce: Hive(Stinger)
 SQL natively on Hadoop: Tajo, Impala, Drill, Presto
 Split SQL execution: Hadapt, CitusDB, Polybase
 Others: Shark, Phoenix, Cascading Lingual, Vertica

57
SQL to MapReduce
 Hive
 Stinger Initiative

58
59
Hive - Overview
 Invented at Facebook. Open sourced to Apache in 2008.
 ASF project since 2008; version 0.12 released 2013-10-15
 A database/data warehouse on top of Hadoop
– Structured data similar to relational schema
 Tables, columns, rows and partitions

– SQL like query language (HiveQL)
 A subset of SQL with many traditional features
 It is possible to embed MR scripts in HiveQL

– Queries are compiled into MR jobs that are executed on Hadoop.

 Key Building Principles:
– SQL as a familiar data warehousing tool
– Extensibility – Types, Functions, Formats, Scripts
– Scalability and Performance
– Interoperability
Source: http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf

60
Hive - Motivation(Facebook)
 Problem: Data growth was exponential
– 200GB per day in March 2008
– 2+TB (compressed) raw data / day in April 2009
– 4+TB (compressed) raw data / day in Nov. 2009
– 12+TB (compressed) raw data / day today (2010)

 The Hadoop Experiment
– Much superior to availability and scalability of commercial DBs
– Efficiency not that great, but throw more hardware
– Partial Availability/resilience/scale more important than ACID

 Problem: Programmability and Metadata
– MapReduce hard to program (users know sql/bash/python)
– Need to publish data in well known schemas

 Solution: SQL + MapReduce = HIVE (2007)
Source: 1) http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation
2) http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop

61
Hive - Data Flow Architecture of Facebook
Realtime
Hadoop
Cluster
Scribe MidTier

Web Servers

Oracle RAC

Hadoop Hive Warehouse

Online querying of the summary data

Scribe Writers

MySQL

Offline batch processing

Source: http://borthakur.com/ftp/hadoopmicrosoft.pdf, http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf

62
Hive - Architecture

Source: http://cs.brown.edu/~debrabant/cis570-website/papers/hive.pdf

63
Hive - Data Model
 Re-used from Relational Database
– Database, Table, Partition, Row, Column

 Tables
– Typed Columns (int, float, string, date, boolean)
– Also, array/list/map/struct for JSON-like data

 Partitions
– E.g., to range-partition tables by date

 Buckets
– Hash partitions within ranges (useful for sampling, join optimization)

 Column Data Type
CREATE TABLE t (
  s STRING,
  f FLOAT,
  a ARRAY<MAP<STRING, STRUCT<p1:INT, p2:INT>>>
);
SELECT s, f, a[0]['foobar'].p2 FROM t;

Source: http://www.slideshare.net/cwsteinbach/hive-quick-start-tutorial

64
Hive - HiveQL (Hive Query Language)
 Basic SQL
– FROM clause subqueries
– ANSI JOIN (equi-join only)
– Multi-table insert
– Multi group-by
– Sampling
– Object traversal

 Extensibility
– Pluggable map-reduce scripts using TRANSFORM

 Limitations
– Subset of SQL
– Meta-data queries
– Limited equality and join predicates
– No inserts into existing tables (to preserve the WORM property); can overwrite an entire table

65
Hive - Query Execution and MR Jobs

Query plan with 3 map-reduce jobs for multi-table insert query
Source: YSmart (Yet Another SQL-to-MapReduce Translator), http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf

66
Hive - Problems
 Performance Gap
– For simple queries, HIVE performance is comparable with hand-coded MR jobs
– The execution time is much longer for complex queries
 HiveQL allows for arbitrary number of embedded subqueries.
 They are converted to MapReduce jobs respectively
 The simple conversion may involve many unnecessary data scan and transfer.

 Missing Features
– ANSI SQL
– Cost-based optimizer
– UDFs
– Data types

[Figure: for a simple aggregation query, a hand-coded MR program requires only 2 jobs, while the automatically generated MR code uses 6 jobs]

Source: http://hortonworks.com/blog/100x-faster-hive/, http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf

67
The Stinger Initiative

68
Stinger - Overview
An initiative, not a project or product
Includes changes to Hive and a new project Tez
Two main goals
–Improve Hive performance 100x over Hive 0.10
–Extend Hive SQL to include features needed for analytics

Hive will support:
–BI tools connecting to Hadoop
–Analysts performing ad-hoc, interactive queries
–Still excellent at the large batch jobs it is used for today

Source: http://www.slideshare.net/alanfgates/stinger-hadoop-summit-june-2013

69
Stinger - Milepost

Source: http://www.slideshare.net/alanfgates/stinger-hadoop-summit-june-2013, http://hortonworks.com/labs/stinger/

70
Stinger - Hive 0.12 (HDP 2.0) Feature

Source: http://hortonworks.com/labs/stinger/

71
Stinger - Performance (1/2)

Source: http://www.slideshare.net/pudidic/hive-28625667, http://www.slideshare.net/alanfgates/stinger-hadoop-summit-june-2013

72
Stinger - Performance (2/2)
 Query 27: Pricing Analytics using Star Schema Join
 Query 28: Inventory Analytics Joining 2 Large Fact Tables

Source: http://www.slideshare.net/alanfgates/stinger-hadoop-summit-june-2013, http://hortonworks.com/labs/stinger/

73
Stinger - SQL Semantics

Source: http://hortonworks.com/labs/stinger/

74
Stinger - Scale
 Sustained Query Times: Apache Hive 0.12 provides sustained acceptable
query times even at petabyte scale.
 Smaller Footprint: Better encoding with ORCFile in Apache Hive 0.12
reduces resource requirements for a cluster.

Source: http://hortonworks.com/labs/stinger/

75
SQL natively on Hadoop





Impala
Drill
Presto
Tajo

76
77
Impala - Overview
In-memory, distributed SQL query engine (no Map/Reduce)
 Inspired by Google Dremel, VLDB 2010
 Native code (C++); version 1.2.3 as of 2014-01-22
 Distributed on HDFS data nodes
 Interactive SQL
– Typically 5-65x faster than Hive (observed up to 100x faster)
– Responses in seconds instead of minutes (sometimes sub-second)

 Nearly ANSI-92 standard SQL queries with Hive SQL
– Compatible SQL interface for existing Hadoop/CDH applications
– Based on industry standard SQL

 Natively on Hadoop/Hbase storage and metadata
– Flexibility, scale, and cost advantages of Hadoop
– No duplication/synchronization of data and metadata
– Local processing to avoid network bottlenecks

 Separate runtime from MapReduce
– MapReduce is designed for and great at batch processing
– Impala is purpose-built for low-latency SQL queries on Hadoop
Source: http://www.slideshare.net/insideHPC/impala-overview

78
Impala - Architecture

Source: http://www.slideshare.net/insideHPC/impala-overview

79
80
Drill - Overview
 Interactive analysis of Big Data using standard SQL
 Fast
– Low latency
– Columnar execution (inspired by Google Dremel / BigQuery)
– Complements native interfaces and MapReduce/Hive/Pig
 Open
– Community driven open source project
– Under Apache Software Foundation
 Modern
– Standard ANSI SQL: 2003 (select/into)
– Nested / hierarchical data support
– Schema is optional
– Supports RDBMS, Hadoop and NoSQL
Source: http://www.slideshare.net/tdunning/apache-drill-16513485

81
Drill - How Does it Work?
• Drillbits run on each node, designed to
maximize data locality
• Processing is done outside MapReduce
paradigm (but possibly within YARN)
• Queries can be fed to any Drillbit
• Coordination, query planning, optimization,
scheduling, and execution are distributed

Source: http://www.slideshare.net/tdunning/apache-drill-16513485

SELECT * FROM
oracle.transactions,
mongo.users,
hdfs.events
LIMIT 1

82
Drill - Architecture

83
Drill - Status
 Heavy active development by multiple organizations
 Available
– Logical plan syntax and interpreter
– Reference interpreter

 In progress
– SQL interpreter
– Storage engine implementations for Accumulo, Cassandra, HBase and various file formats

 Significant community momentum
– Over 200 people on the Drill mailing list
– Over 200 members of the Bay Area Drill User Group
– Drill meetups across the US and Europe
– OpenDremel team joined Apache Drill

 Anticipated schedule:
– Prototype: Q1
– Alpha: Q2
– Beta: Q3

Source: http://www.slideshare.net/tdunning/apache-drill-16513485

84
Google Dremel

Source: http://research.google.com/pubs/pub36632.html

85
86
Presto - Overview
 Developed at Facebook and released as open source (2013-11-06)
 Designed for query workloads that take seconds to minutes
 No fault tolerance
 Fast responses and online query processing
– Early results are returned as soon as they are produced

 Approximate queries are supported for some query types
 Developed as a complement to Hive
 All operators are pipelined, and data transfer is streamed
 Facebook's deployment
– 300 PB accumulated in the data warehouse; 30k queries per day; 1 PB processed per day

Source: http://www.slideshare.net/hyunsikchoi/sqlonhadoop-tajo-tech-planet-2013

87
Presto - Architecture(1/2)

Source: https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920

88
Presto - Architecture (2/2)

Source: https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920

89
Presto - Query Execution Flow

Source: http://www.slideshare.net/frsyuki/hadoop-source-code-reading-15-in-japan-presto

90
91
Tajo - Overview
 Tajo
– A Hadoop-based large-scale data warehouse
– Started in 2010 as a research prototype at Korea University
– Entered Apache incubation in March 2013
– 0.2-incubating released November 2013

 Features
– Compatibility
 Standard SQL support, HiveQL support, UDF support
 JDBC
– High performance and low latency
 Flexible, efficient distributed execution engine
 Cost-based optimizer
 JIT query compilation and a vectorized query engine
Source: http://tajo.incubator.apache.org/, http://www.slideshare.net/deview/deview2013-tajo

92
Tajo - Architecture
 Master-worker model
– RPC based: Protocol Buffers, Netty

 Tajo Master
– Handles client and application requests
– Catalog server
 Table schemas, physical information, various statistics
 Can use an external RDBMS as its store via JDBC
– Query parser, planner, optimizer, cluster resource management, QueryMaster management

 Query Master
– Runs per query
– Controls execution blocks (query execution stages)
– Task scheduling

 Tajo Worker
– Storage manager
– Local query engine
Source: http://www.slideshare.net/deview/deview2013-tajo

93
Tajo - Architecture

Source: http://www.slideshare.net/hyunsikchoi/sqlonhadoop-tajo-tech-planet-2013

94
Tajo - Query Planning Process

Source: http://www.slideshare.net/hyunsikchoi/sqlonhadoop-tajo-tech-planet-2013

95
Tajo - Distributed Execution Plan

Source: http://www.slideshare.net/hyunsikchoi/sqlonhadoop-tajo-tech-planet-2013

96
Tajo - Benchmark Results
Tajo vs. Hive on TPC-H 1TB
 Experimental environment
– 1TB TPC-H data set; some of the TPC-H queries
– 32 cluster nodes: Intel i5, 16GB memory, 1TB HDD x 2, 1G Ethernet
– Hadoop 0.20.2-cdh3u3

Source: http://www.slideshare.net/hyunsikchoi/tajo-intro

97
Split SQL Execution
 Hadapt

98
99
Hadapt - Overview
 A commercialized version of Daniel Abadi's HadoopDB project.
 HadoopDB project: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
– Hadoop as communication layer above multiple nodes running single-node DBMS
instances

 Full open-source solution:
– PostgreSQL as DB Layer
– Hadoop as communication layer
– Hive as translation layer

 Hadapt(HadoopDB) = Hadoop + PostgreSQL

100
Hadapt - HadoopDB
 Recent study at Yale University, Database Research Dep.
 Hybrid architecture of parallel databases and MapReduce
system
 The idea is to combine the best qualities of both technologies
 Multiple single-node databases are connected using Hadoop
as the task coordinator and network communication layer
 Queries are distributed across the nodes by MapReduce
framework, but as much work as possible is done in the
database node

Source: https://wiki.aalto.fi/download/attachments/37102509/HadoopDB%2Bproject.ppt

101
Hadapt - Architecture

Source: http://www.slideshare.net/DavidPortnoy/hybrid-data-warehouse-hadoop-implementations

102
Hadapt - HadoopDB Benchmark (1/2)

Source: https://wiki.aalto.fi/download/attachments/37102509/HadoopDB%2Bproject.ppt

103
Hadapt - HadoopDB Benchmark (2/2)

Source: https://wiki.aalto.fi/download/attachments/37102509/HadoopDB%2Bproject.ppt

104
Appendix

105
Hadoop
An open-source framework for the distributed processing of large-scale data:
HDFS (data storage) + MapReduce (distributed processing)

 Flexibility
– Supports diverse data types (structured and unstructured data, text, etc.)
– Supports general-purpose programming languages

 Scalability
– Performance and capacity scale linearly as nodes are added

 Cost
– Designed to run on clusters of many commodity servers
– Adding nodes is an inexpensive way to grow capacity and processing performance

106
How MapReduce Works

Source: http://horicky.blogspot.kr/2010/08/designing-algorithmis-for-map-reduce.html
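As a minimal illustration of the map → shuffle → reduce flow shown in the figure (a toy word count, not tied to any particular framework):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) for every word in every input split."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key before the reduce phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data on hadoop", "sql on hadoop"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 1, 'data': 1, 'on': 2, 'hadoop': 2, 'sql': 1}
```
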

107
Star Schema (1/3)
Star schema:
– A database design
– Separates dimensional data from fact/event data

Two major types of components
– Fact tables and dimension tables
– In a simple star schema, a fact table is surrounded by dimension tables
– Each dimension table has a one-to-many relationship to the central fact table

Source: http://www.fbe.hku.hk/~is/busi0092/Notes/t1_dataWarehouse_full_v3.pdf
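A tiny runnable sketch of a star-schema join (made-up product/sales tables, using Python's built-in sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: descriptive attributes of a business subject.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
# Fact table: quantitative events, keyed into the dimension (1:N).
cur.execute("CREATE TABLE fact_sales (product_id INTEGER, amount REAL)")

cur.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(1, "widget"), (2, "gadget")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(1, 10.0), (1, 5.0), (2, 7.5)])

# Typical star-schema query: join facts to a dimension and aggregate.
cur.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.name ORDER BY p.name
""")
result = cur.fetchall()
print(result)  # [('gadget', 7.5), ('widget', 15.0)]
```
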

108
Star Schema (2/3) - Component of Star Schema
Fact tables contain
factual or quantitative
data

1:N relationship
between dimension
tables and fact
tables

Dimension tables contain
descriptions about the
subjects of the business

Source: http://www.fbe.hku.hk/~is/busi0092/Notes/t1_dataWarehouse_full_v3.pdf

109
Star Schema (3/3) - Sample Data

Source: http://www.fbe.hku.hk/~is/busi0092/Notes/t1_dataWarehouse_full_v3.pdf

110
SQL on Hadoop Products by MapR

Source: http://www.mapr.com/products/sql-on-hadoop

111
Potential Big Data Use Cases

Source: SAS and IDC, http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/

112
BigData Software Stack (Hadoop)

113
BDAS(Berkeley Data Analytics Stack)

Source: https://amplab.cs.berkeley.edu/software/

114
History of Databases

Source: http://www.benstopford.com/2012/06/30/thoughts-on-big-data-technologies-part-1/

115
Thank you.

116

Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data Omid Vahdaty
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCCloudera, Inc.
 

What's hot (20)

Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQL
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internals
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
 
Microsoft R - Data Science at Scale
Microsoft R - Data Science at ScaleMicrosoft R - Data Science at Scale
Microsoft R - Data Science at Scale
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
 
Zeppelin and spark sql demystified
Zeppelin and spark sql demystifiedZeppelin and spark sql demystified
Zeppelin and spark sql demystified
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
 

Viewers also liked

An introduction to hadoop
An introduction to hadoopAn introduction to hadoop
An introduction to hadoopMinJae Kang
 
Cloud Music v1.0
Cloud Music v1.0Cloud Music v1.0
Cloud Music v1.0Steve Min
 
NewSQL Database Overview
NewSQL Database OverviewNewSQL Database Overview
NewSQL Database OverviewSteve Min
 
Apache Htrace overview (20160520)
Apache Htrace overview (20160520)Apache Htrace overview (20160520)
Apache Htrace overview (20160520)Steve Min
 
빅데이터_ISP수업
빅데이터_ISP수업빅데이터_ISP수업
빅데이터_ISP수업jrim Choi
 
Cloud Computing v1.0
Cloud Computing v1.0Cloud Computing v1.0
Cloud Computing v1.0Steve Min
 
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)Matthew (정재화)
 
[SSA] 03.newsql database (2014.02.05)
[SSA] 03.newsql database (2014.02.05)[SSA] 03.newsql database (2014.02.05)
[SSA] 03.newsql database (2014.02.05)Steve Min
 
Scala overview
Scala overviewScala overview
Scala overviewSteve Min
 
Expanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with TajoExpanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with TajoMatthew (정재화)
 
[SSA] 01.bigdata database technology (2014.02.05)
[SSA] 01.bigdata database technology (2014.02.05)[SSA] 01.bigdata database technology (2014.02.05)
[SSA] 01.bigdata database technology (2014.02.05)Steve Min
 
Gruter TECHDAY 2014 MelOn BigData
Gruter TECHDAY 2014 MelOn BigDataGruter TECHDAY 2014 MelOn BigData
Gruter TECHDAY 2014 MelOn BigDataGruter
 

Viewers also liked (16)

An introduction to hadoop
An introduction to hadoopAn introduction to hadoop
An introduction to hadoop
 
Cloud Music v1.0
Cloud Music v1.0Cloud Music v1.0
Cloud Music v1.0
 
Html5 video
Html5 videoHtml5 video
Html5 video
 
NewSQL Database Overview
NewSQL Database OverviewNewSQL Database Overview
NewSQL Database Overview
 
Apache Htrace overview (20160520)
Apache Htrace overview (20160520)Apache Htrace overview (20160520)
Apache Htrace overview (20160520)
 
빅데이터_ISP수업
빅데이터_ISP수업빅데이터_ISP수업
빅데이터_ISP수업
 
Cloud Computing v1.0
Cloud Computing v1.0Cloud Computing v1.0
Cloud Computing v1.0
 
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
 
vertica_tmp_4.5
vertica_tmp_4.5vertica_tmp_4.5
vertica_tmp_4.5
 
BigData, Hadoop과 Node.js
BigData, Hadoop과 Node.jsBigData, Hadoop과 Node.js
BigData, Hadoop과 Node.js
 
[SSA] 03.newsql database (2014.02.05)
[SSA] 03.newsql database (2014.02.05)[SSA] 03.newsql database (2014.02.05)
[SSA] 03.newsql database (2014.02.05)
 
Scala overview
Scala overviewScala overview
Scala overview
 
RESTful API Design, Second Edition
RESTful API Design, Second EditionRESTful API Design, Second Edition
RESTful API Design, Second Edition
 
Expanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with TajoExpanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with Tajo
 
[SSA] 01.bigdata database technology (2014.02.05)
[SSA] 01.bigdata database technology (2014.02.05)[SSA] 01.bigdata database technology (2014.02.05)
[SSA] 01.bigdata database technology (2014.02.05)
 
Gruter TECHDAY 2014 MelOn BigData
Gruter TECHDAY 2014 MelOn BigDataGruter TECHDAY 2014 MelOn BigData
Gruter TECHDAY 2014 MelOn BigData
 

Similar to [SSA] 04.sql on hadoop(2014.02.05)

Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill MapR Technologies
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudJaipaul Agonus
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemGregg Barrett
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelEditor IJCATR
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesDenodo
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesDataWorks Summit
 
The elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudThe elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudKhazret Sapenov
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGuang Xu
 
Webinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick DatabaseWebinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick DatabaseMongoDB
 
Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Vicente Orjales
 

Similar to [SSA] 04.sql on hadoop(2014.02.05) (20)

Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Big data
Big dataBig data
Big data
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data Lakes
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Gcp data engineer
Gcp data engineerGcp data engineer
Gcp data engineer
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
The elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudThe elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloud
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
 
Webinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick DatabaseWebinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick Database
 
Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.
 

Recently uploaded

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 

Recently uploaded (20)

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

[SSA] 04.sql on hadoop(2014.02.05)

  • 1. [SSA] Big Data Analytics SQL on Hadoop (A New Generation of Analytic Databases) Hyounggi Min hg.min@samsung.com 2014. 2. 5.
  • 2. Contents I. Big Data Analytics II. Approaches to Big Data Analytics III. Analytic Database at Google IV. Types of SQL on Hadoop: I. SQL to MapReduce II. SQL natively on Hadoop III. Split SQL Execution 1
  • 6. Big Data Processing Flow Source: Gruter, Architecture and Technology for BigData (2011) 5
  • 7. Analytical Complexity The evolution of data analysis (faster, deeper insight): plotted as analytical complexity vs. technology/automation, spanning reporting & BPM, campaign management, fraud detection, clustering, statistical analysis, behaviour modelling, machine learning algorithms, dynamic simulation, and dynamic interaction. Source: http://strataconf.com/stratany2013/public/schedule/detail/31731 6
  • 8. Approaches to Big Data Analytics  Traditional Analytic Database  MapReduce  SQL on Hadoop 7
  • 9. Motivation  Large-scale Internet companies – e.g., Amazon, Google, Yahoo, Facebook, Twitter, Microsoft, Netflix They collect terabytes of data every day, and need ad-hoc analysis to turn that data into information – analyzed together with historical trends – e.g., an engineer needs trend analysis over search logs to improve the search engine's ranking algorithm 8
  • 11. Solution 1 - Analytic Database Problem  Cost – traditional analytic databases carry a very high TCO – per-CPU/core pricing in particular is too expensive for an Internet company to run at scale  Scalability / availability – commercial databases scale poorly and offer limited availability – scalability and cost constraints force storing only pre-aggregated data – at web scale, availability, scalability, and elasticity matter more than ACID  Architecture – traditional analytic databases need a separate store for log analysis  Vendors – Teradata, IBM Netezza, Greenplum, Oracle Exadata, HP Vertica, Sybase IQ, etc. Source: http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf 10
  • 12. Solution 2 - MapReduce Problem  Difficulty of MapReduce development – MapReduce is hard, takes significant development effort, and makes performance hard to guarantee – performance varies with developer skill – productivity is far lower than an equivalent SQL implementation  Missing features: schemas, an optimizer, indexes, views, etc.  Limits of the processing model (it was not designed for relational processing) – fixed data flow – Shuffle: merge sort  hashing  merge sort – data-exchange overhead between jobs Source: http://www.slideshare.net/Hadoop_Summit/steinbach-june26-405pmroom230av2 11
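To make the productivity gap concrete, here is a minimal Python sketch (illustrative only, not Hadoop code) of what a one-line SQL aggregation like `SELECT word, COUNT(*) ... GROUP BY word` costs in the MapReduce model the slide criticizes: a hand-written map phase, an explicit shuffle (grouping), and a reduce phase, wired together in a fixed data flow.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (key, 1) pairs for every word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(groups):
    # Reduce: sum the grouped values per key.
    return {key: sum(values) for key, values in groups}

counts = reduce_phase(shuffle(map_phase(["a b a", "b"])))
print(counts)  # {'a': 2, 'b': 2}
```

Every phase here is boilerplate a SQL engine would generate from the query text, which is the productivity argument behind SQL on Hadoop.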
  • 13. Solution 3 - SQL on Hadoop Refers to the new generation of Hadoop-based analytic engines  Standard SQL support – easy to integrate with, or replace, existing systems – a smaller learning curve than MapReduce  High processing performance – a distributed processing framework that overcomes MapReduce's limits – execution that makes better use of CPU and memory  Examples – Hive (Stinger), Impala, Presto, Drill, Tajo, Hadapt, etc. Source: http://www.slideshare.net/deview/deview2013-tajo 12
  • 14. Analytic Database at Google  Dremel  Tenzing  F1 13
  • 15. Dremel (Interactive Analysis of Web-Scale Datasets) 14
  • 16. Google Dremel System  Trillion-record, multi-terabyte datasets at interactive speed – Scales to thousands of nodes – Fault and straggler tolerant execution  Nested data model – Complex datasets; normalization is prohibitive – Columnar storage and processing  Tree architecture (as in web search)  Interoperates with Google's data mgmt tools – In situ data access (e.g., GFS, Bigtable) – MapReduce pipelines 출처: Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 15
  • 17. Google Dremel - Widely used inside Google  Analysis of crawled web documents  Tracking install data for applications on Android Market  Crash reporting for Google products  OCR results from Google Books  Spam analysis  Debugging of map tiles on Google Maps  Tablet migrations in managed Bigtable instances  Results of tests run on Google's distributed build system  Disk I/O statistics for hundreds of thousands of disks  Resource monitoring for jobs run in Google's data centers  Symbols and dependencies in Google's codebase 16
  • 18. Google Dremel - Storage Format (Figure: record-oriented vs. column-striped layout of a nested record with DocId: 10; Links.Forward: 20; Name.Language.Code: 'en-us'; Name.Language.Country: 'us'; Name.Url: 'http://A'; Name.Url: 'http://B'.) Columnar storage reads less data and decompresses more cheaply. Challenge: preserve structure, reconstruct from a subset of fields Source: Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 17
  • 19. Google Dremel - Nested Data Model Schema and records are annotated with repetition and definition levels. r (repetition level): at what repeated field in the field's path the value has repeated. d (definition level): how many fields in the path that could be undefined (optional or repeated) are actually present. Source: Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 18
  • 20. Google Dremel - Column-striped representation Each column stores (value, r, d) triples: DocId: (10, 0, 0), (20, 0, 0). Name.Url: ('http://A', 0, 2), ('http://B', 1, 2), (NULL, 1, 1), ('http://C', 0, 2). Links.Forward: (20, 0, 2), (40, 1, 2), (60, 1, 2), (80, 0, 2). Links.Backward: (NULL, 0, 1), (10, 0, 2), (30, 1, 2). Name.Language.Code: ('en-us', 0, 2), ('en', 2, 2), (NULL, 1, 1), ('en-gb', 1, 2), (NULL, 0, 1). Name.Language.Country: ('us', 0, 3), (NULL, 2, 2), (NULL, 1, 1), ('gb', 1, 3), (NULL, 0, 1). 19
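The column-striped triples can be decoded with a small sketch: since a repetition level of 0 marks the start of a new record, grouping entries at r == 0 reconstructs which values belong to which record. This is a simplified reading, not Dremel's actual assembly algorithm (which uses the FSM on the next slide); the Name.Language.Code data below is the paper's example column.

```python
# Column entries as (value, repetition_level, definition_level) triples.
code_column = [
    ("en-us", 0, 2), ("en", 2, 2), (None, 1, 1), ("en-gb", 1, 2), (None, 0, 1),
]

def split_into_records(column):
    """Group column entries into records: r == 0 starts a new record."""
    records = []
    for value, r, d in column:
        if r == 0:
            records.append([])
        if value is not None:  # a d below the field's full depth encodes an absent value
            records[-1].append(value)
    return records

print(split_into_records(code_column))  # [['en-us', 'en', 'en-gb'], []]
```

The first record carries the three language codes; the second record has a Name but no Code, which matches the NULL entry with r = 0, d = 1.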
  • 21. Google Dremel - Record assembly FSM (Figure: a finite state machine over the fields DocId, Links.Backward, Links.Forward, Name.Language.Code, Name.Language.Country, and Name.Url, with transitions labeled by repetition levels.) For record-oriented data processing (e.g., MapReduce) 20
  • 22. Google Dremel - Reading two fields (Figure: a reduced FSM over just DocId and Name.Language.Country assembles partial records such as DocId: 10 with Name.Language.Country 'us' and 'gb', and DocId: 20.) Structure of parent fields is preserved. Useful for queries like /Name[3]/Language[1]/Country 21
  • 23. Google Dremel - SQL dialect for nested data SELECT DocId AS Id, COUNT(Name.Language.Code) WITHIN Name AS Cnt, Name.Url + ',' + Name.Language.Code AS Str FROM t WHERE REGEXP(Name.Url, '^http') AND DocId < 20; Output table Id: 10 Name Cnt: 2 Language Str: 'http://A,en-us' Str: 'http://A,en' Name Cnt: 0 t1 Output schema message QueryResult { required int64 Id; repeated group Name { optional uint64 Cnt; repeated group Language { optional string Str; } } } 22
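The WITHIN Name aggregation on this slide can be mimicked on a plain nested Python dict: COUNT(...) WITHIN Name counts per Name element rather than per record. The field names mirror the slide, but the dict layout and code are illustrative, not Dremel's API.

```python
# A nested record shaped like the slide's example (illustrative layout).
record = {
    "DocId": 10,
    "Name": [
        {"Url": "http://A", "Language": [{"Code": "en-us"}, {"Code": "en"}]},
        {"Url": "http://B", "Language": []},
    ],
}

# COUNT(Name.Language.Code) WITHIN Name: one count per Name, not per record.
cnts = [sum(1 for lang in name["Language"] if "Code" in lang)
        for name in record["Name"]]
print(cnts)  # [2, 0]
```

This reproduces the slide's output shape: the first Name gets Cnt: 2, the second Cnt: 0.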
  • 24. Google Dremel - Serving Tree [Dean WSDM'09] (Figure: client queries enter at the root and fan out through intermediate servers to leaf servers; histogram of response times.) • Parallelizes scheduling and aggregation • Fault tolerance • Designed for "small" results (<1M records) 23
  • 25. Google Dremel - Example: count() 0 SELECT A, COUNT(B) FROM T GROUP BY A T = {/gfs/1, /gfs/2, …, /gfs/100000} SELECT A, SUM(c) FROM (R11.. UNION ALL R110) GROUP BY A R11 1 ... 3 R12 SELECT A, COUNT(B) AS c FROM T11 GROUP BY A T11 = {/gfs/1, …, /gfs/10000} SELECT A, COUNT(B) AS c FROM T12 GROUP BY A T12 = {/gfs/10001, …, /gfs/20000} SELECT A, COUNT(B) AS c FROM T31 GROUP BY A T31 = {/gfs/1} ... ... Data access ops 24
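The rewrite on this slide (leaf servers compute partial GROUP BY counts over their tablets, and upper levels of the tree sum the partial results) can be sketched in Python; the tablet data is made up for illustration.

```python
from collections import Counter

def leaf_count(tablet):
    # Leaf level: SELECT A, COUNT(B) AS c FROM T_i GROUP BY A
    return Counter(row["A"] for row in tablet if row.get("B") is not None)

def merge(partials):
    # Upper levels: SELECT A, SUM(c) FROM (R_1 UNION ALL ... R_n) GROUP BY A
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

tablets = [
    [{"A": "x", "B": 1}, {"A": "y", "B": 2}],  # hypothetical tablet 1
    [{"A": "x", "B": 3}],                      # hypothetical tablet 2
]
print(merge(leaf_count(t) for t in tablets))  # Counter({'x': 2, 'y': 1})
```

Because COUNT decomposes into SUM of partial counts, each intermediate server can merge its children's results without seeing raw rows, which is what lets the tree parallelize aggregation.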
  • 26. Google Dremel - Experiments
    • 1 PB of real data (uncompressed, non-replicated)
    • 100K-800K tablets per table
    • Experiments run during business hours
    Table | Number of records | Size (unrepl., compressed) | Number of fields | Data center | Repl. factor
    T1    | 85 billion        | 87 TB                      | 270              | A           | 3×
    T2    | 24 billion        | 13 TB                      | 530              | A           | 3×
    T3    | 4 billion         | 70 TB                      | 1200             | A           | 3×
    T4    | 1+ trillion       | 105 TB                     | 50               | B           | 3×
    T5    | 1+ trillion       | 20 TB                      | 30               | B           | 2×
  • 27. Google Dremel - Read from disk 26
  • 28. Google Dremel - MR and Dremel execution 27
  • 29. Google Dremel - Impact of serving tree depth 28
  • 30. Google Dremel - Scalability 29
  • 31. Google Dremel - Interactive speed 30
  • 32. Google Dremel - Observations
    • Possible to analyze large disk-resident datasets interactively on commodity hardware (1T records, 1000s of nodes)
    • MR can benefit from columnar storage just like a parallel DBMS, but record assembly is expensive; interactive SQL and MR can be complementary
    • Parallel DBMSes may benefit from a serving tree architecture just like search engines
  • 33. Tenzing (A SQL Implementation On the MapReduce Framework) 32
  • 34. Google Tenzing - Overview
    SQL query engine built on top of MapReduce for ad-hoc analysis of Google data
    Features
      – Complete SQL implementation (with several extensions): SQL92, some SQL99
    Key characteristics
      – Heterogeneity, high performance, scalability, reliability
      – Metadata awareness, low latency
      – Storage extensions: columnar storage, support for structured data
      – Easy extensibility
    Status (2011)
      – 10,000+ queries/day
      – 1.5 PB of compressed data
    Source: VLDB2011, http://www.vldb.org/pvldb/vol4/p1318-chattopadhyay.pdf
  • 35. Google Tenzing - Motivation
    In 2008 the commercial DW holding Google's ads data ran into problems
      – Rising scaling costs (PB scale)
      – Rapidly growing loading times
      – SQL limitations and poor support for diverse data sources constrained complex analysis
    The solution: Tenzing
      – Scalability: handle thousands of cores, hundreds of users, petabytes of data
      – High reliability: run on commodity hardware
      – Performance: match or beat the existing commercial DW
      – Minimal ETL: read data from Google systems directly
      – Standard SQL support: flatten the analysts' learning curve
      – UDF support: applicable to prediction and mining as well
    Source: [VLDB2011]
  • 36. Google Tenzing - Architecture 35
  • 37. Google Tenzing - SQL Features
    Projection and filtering
      – Standard SQL operations: IN, LIKE, BETWEEN, CASE, etc.
      – Built-in Sawzall functions
    Aggregation
      – Standard aggregate functions: SUM, COUNT, MIN, MAX, etc.
      – Hash-based aggregation
    Joins
      – Inner, left, right, cross, full outer joins; equi, semi-equi, non-equi, function-based joins, etc.
      – Distributed implementations: broadcast joins, remote lookup joins, distributed sort-merge joins, distributed hash joins
    Analytic functions: RANK, SUM, MIN, MAX, LEAD, LAG, NTILE
    Set operations: UNION, UNION ALL, MINUS, MINUS ALL
    Nested queries and subqueries
    Handling structured data
    Views, DDL, DML, table-valued functions, data formats
    Source: http://www.vldb.org/pvldb/vol4/p1318-chattopadhyay.pdf
  • 38. Google Tenzing - Hash-based Aggregation
    Tenzing query (the corresponding MapReduce pseudo-code is an image on the slide):
      SELECT dept_id, COUNT(1)
      FROM Employee /*+ HASH */
      GROUP BY 1;
    Source: http://www.vldb.org/pvldb/vol4/p1318-chattopadhyay.pdf
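Since the slide's pseudo-code is lost to the image, here is a rough Python sketch of the same idea, with assumed shapes rather than Tenzing's actual code: the mapper hash-aggregates locally and emits one partial count per distinct key instead of one record per input row, and the reducer sums the partial counts.

```python
from collections import Counter

def map_phase(rows):
    # Hash-aggregate inside the mapper: emit one (dept_id, partial_count)
    # pair per distinct key, not one pair per input row.
    return list(Counter(dept_id for dept_id, *_ in rows).items())

def reduce_phase(mapper_outputs):
    # Sum the partial counts per dept_id -> COUNT(1) ... GROUP BY dept_id
    total = Counter()
    for pairs in mapper_outputs:
        for dept_id, partial in pairs:
            total[dept_id] += partial
    return dict(total)

shards = [[(1, "a"), (1, "b"), (2, "c")], [(2, "d")]]
counts = reduce_phase(map_phase(s) for s in shards)   # {1: 2, 2: 2}
```

The win is shuffle volume: with hash aggregation the mapper ships at most one pair per distinct `dept_id`, however many input rows it scanned.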
  • 39. Google Tenzing - Analytic Functions
    Tenzing query (the corresponding MapReduce pseudo-code is an image on the slide):
      SELECT dept, emp, salary,
             RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
      FROM Employee;
    Source: http://www.vldb.org/pvldb/vol4/p1318-chattopadhyay.pdf
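Again the pseudo-code on the slide is an image; as a hedged sketch of how RANK() maps onto MapReduce (invented names, not Tenzing's code), the shuffle partitions rows by `dept` so one reducer receives a whole PARTITION BY group, sorts it by salary descending, and assigns ranks, with ties sharing a rank.

```python
def reduce_rank(dept, rows):
    # rows: (emp, salary) pairs for one PARTITION BY dept group.
    ranked, prev_salary, rank = [], None, 0
    for pos, (emp, salary) in enumerate(
            sorted(rows, key=lambda r: -r[1]), start=1):
        if salary != prev_salary:        # RANK(): ties keep the same rank
            rank, prev_salary = pos, salary
        ranked.append((dept, emp, salary, rank))
    return ranked

out = reduce_rank("eng", [("alice", 100), ("bob", 200), ("carol", 100)])
# bob gets rank 1; alice and carol tie at rank 2
```

Because each partition is independent, departments can be ranked by different reducers in parallel.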
  • 40. Google Tenzing - Performance Enhancement
    MapReduce enhancements
      – Worker pool: master watcher, master pool, worker pool
      – Streaming & in-memory chaining
      – Sort avoidance
      – Block shuffle
      – Local execution
    Query engine enhancements
      – 1st impl: SQL translated to Sawzall code (Sawzall JIT compiler) → inefficient due to the cost of translating into the Sawzall system
      – 2nd impl: Dremel's SQL expression evaluation engine → interpreted, row-based processing, somewhat slow
      – 3rd impl: LLVM-based native code generation
          Row-major block-based intermediate data
          Column-major vector-based processing with columnar intermediate storage
    Source: http://www.vldb.org/pvldb/vol4/p1318-chattopadhyay.pdf
  • 41. F1 (A Distributed SQL Database That Scales) 40
  • 42. Google F1 - Overview
    A distributed relational database system built to support Google's AdWords business
    A hybrid database, built on top of Google Spanner
      – NoSQL traits: high availability, scalability
      – Traditional SQL database traits: consistency, usability
    Design goals
      – Scalability: auto-sharded storage
      – Availability & consistency: synchronous replication
      – High commit latency: can be hidden via hierarchical schema, protocol buffer column types, efficient client code
    Source: [VLDB2013], [SIGMOD2012]
  • 43. Google F1 - AdWords Ecosystem  Source: [VLDB2011]
  • 44. Google F1 - Motivation (1/2)
    Legacy DB: sharded MySQL
      – Sharding strategy: sharded by customer; apps optimized using shard awareness
    Limitations
      – Availability: master/slave replication → downtime during failover; schema changes → downtime for table locking
      – Scalability: grow by adding shards; rebalancing shards is extremely difficult and risky; therefore, limit the size and growth of data stored in the database
      – Functionality: can't do cross-shard transactions or joins
    Source: http://research.google.com/pubs/pub38125.html
  • 45. Google F1 - Motivation (2/2)
    Critical applications driving Google's core ad business
      – 24/7 availability, even with datacenter outages
      – Consistency required: can't afford to process inconsistent data; eventual consistency too complex and painful
      – Scale: 10s of TB, replicated to 1000s of machines
    Shared schema
      – Dozens of systems sharing one database
      – Constantly evolving - multiple schema changes per week
    SQL query
      – Query without code
    The solution: Google F1
      – Built from scratch, designed to operate at Google scale, without compromising on RDBMS features
    Source: http://research.google.com/pubs/pub38125.html
  • 46. Google F1 - Features
    Additional functionality implemented on top of Google Spanner (NewSQL)
      – Combines OLTP and OLAP characteristics
    Provided by Spanner: extremely scalable data storage, synchronous replication, strong consistency and ordering properties
    Added by F1
      – Distributed SQL queries, including joining data from external data sources
      – Transactionally consistent secondary indexes
      – Asynchronous schema changes, including database reorganizations
      – Optimistic transactions
      – Automatic change history recording and publishing
    Deployment: managing AdWords ad campaign data since early 2012
      – 100s of applications, 1000s of users, all sharing the same database
      – 100+ TB, hundreds of thousands of requests/sec, scans tens of trillions of data rows per day
      – 99.999% availability; no observable latency increase even during unplanned outages
  • 47. Google F1 - Architecture  Architecture – Sharded Spanner servers  data on GFS and in memory – Stateless F1 server – Slave Pool for query execution  Features – Relational schema  Extensions for hierarchy and rich data types  Non-blocking schema changes – Consistent indexes – Parallel reads with SQL or Map-Reduce 46
  • 48. Google F1 - Schema & Clustered Storage  Explicit table hierarchies. Example: – Customer (root table): PK (CustomerId) – Campaign (child): PK (CustomerId, CampaignId) – AdGroup (child): PK (CustomerId, CampaignId, AdGroupId) 47
  • 49. Google F1 - Data Model
    Hierarchical schema
      – Customer (root table): PK (CustomerId)
      – Campaign (child): PK (CustomerId, CampaignId)
      – AdGroup (child): PK (CustomerId, CampaignId, AdGroupId)
    Table column types: schema, protocol buffers (binary encoding format)
    Source: http://research.google.com/pubs/pub41344.html
  • 50. Google F1 - SQL Query  Parallel query engine implemented from scratch  Fully functional SQL, joins to external sources  Language extensions for protocol buffers  Making queries fast – Hide RPC latency – Parallel and batch execution – Hierarchical joins 출처: http://research.google.com/pubs/pub38125.html, http://research.google.com/pubs/pub41344.html 49
  • 52. SQL on Hadoop
    Refers to Hadoop-based analytic databases that support SQL
    SQL language support
      – Easy to integrate with or replace existing systems
      – Provides SQL-oriented features such as an optimizer and indexes
    High processing performance
      – Distributed processing that overcomes the limitations of MapReduce
  • 53. Why SQL on Hadoop?
    Staffing and productivity
      – MapReduce has a steep learning curve → a hiring problem
      – Implementing on MapReduce takes substantial effort → a productivity problem
    Guaranteeing performance and preventing human error
      – Performance depends on the developer's skill
      – Performance tuning takes substantial effort
      – High likelihood of bugs
    Running a DB alongside Hadoop for ad-hoc queries
      – Complex architecture and management burden
      – Higher cost: additional DBMS licenses and storage required
      – Data exchange problem: HDFS ↔ DBMS
  • 54. SQL on Hadoop 역사  SQL has been ruling since 1970!!  Hadoop came…But little traction…  Facebook open-sourced HIVE in 2008.. Hadoop takes the next leap in adoption  RDBMS and MPP Vendors brought Hadoop Connectors  Niche players used SQL engine to run Distributed Query on Hadoop  2010.10: Google Dremel paper opened  2012.10: Cloudera Impala sets the trend for Real time Query over Hadoop  2013.03: Apache Tajo Project enters incubation.  2013.11: Facebook open sourced Presto 출처: http://www.slideshare.net/SameerMandal1/sql-over-hadoop-ver-3 53
  • 55. SQL on Hadoop Products 출처: http://www.slideshare.net/SameerMandal1/sql-over-hadoop-ver-3 54
  • 56. SQL on Hadoop Classification (1/3)
    451 Group
      – SQL in Hadoop: Hive
      – SQL on Hadoop: Impala, Drill, HAWQ, JethroData, Big SQL
      – SQL and Hadoop: Hadapt, RainStor, PolyBase, Citus Data, SQL-H
    Gruter
      – Data warehouse system: Hive (Stinger), Tajo
      – Query engine: Impala, Drill, Presto
    Sources: 1) http://blogs.the451group.com/information_management/2013/10/09/7-hadoop-questions-q5-sql-in-hadoop-sql-on-hadoop-or-sql-and-hadoop/ 2) http://www.slideshare.net/hyunsikchoi/sqlonhadoop-tajo-tech-planet-2013
  • 57. SQL on Hadoop Classification (2/3)
    Datascientist.info
      – SQL natively on Hadoop: Stinger, Impala, Drill, Presto
      – DBMS on Hadoop: Hadapt, CitusDB, Tajo
    Hadapt
      – SQL translated to MapReduce jobs over a Hadoop cluster: Hive, Stinger (without Tez)
      – SQL processed by a specialized (Google-inspired) SQL engine that sits on a Hadoop cluster: Impala (F1), Drill (Dremel)
      – Processing of SQL queries split between MapReduce and storage that natively speaks SQL: Hadapt, PolyBase
    Sources: http://datascientists.info/sql-and-hadoop/, http://hadapt.com/blog/2013/10/02/classifying-the-sql-on-hadoop-solutions/
  • 58. SQL on Hadoop Classification (3/3)
    SQL to MapReduce: Hive (Stinger)
    SQL natively on Hadoop: Tajo, Impala, Drill, Presto
    Split SQL execution: Hadapt, CitusDB, PolyBase
    Others: Shark, Phoenix, Cascading Lingual, Vertica
  • 59. SQL to MapReduce  Hive  Stinger Initiative 58
  • 61. Hive - Overview
    Invented at Facebook; open sourced to Apache in 2008
    ASF project since 2008; version 0.12 released 2013.10.15
    A database/data warehouse on top of Hadoop
      – Structured data similar to a relational schema: tables, columns, rows and partitions
      – SQL-like query language (HiveQL): a subset of SQL with many traditional features; MR scripts can be embedded in HiveQL
      – Queries are compiled into MR jobs that are executed on Hadoop
    Key building principles
      – SQL as a familiar data warehousing tool
      – Extensibility: types, functions, formats, scripts
      – Scalability and performance
      – Interoperability
    Source: http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf
  • 62. Hive - Motivation (Facebook)
    Problem: data growth was exponential
      – 200GB per day in March 2008
      – 2+TB (compressed) raw data/day in April 2009
      – 4+TB (compressed) raw data/day in Nov. 2009
      – 12+TB (compressed) raw data/day today (2010)
    The Hadoop experiment
      – Availability and scalability much superior to commercial DBs
      – Efficiency not that great, but throw more hardware at it
      – Partial availability/resilience/scale more important than ACID
    Problem: programmability and metadata
      – MapReduce hard to program (users know sql/bash/python)
      – Need to publish data in well-known schemas
    Solution: SQL + MapReduce = HIVE (2007)
    Sources: 1) http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation 2) http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop
  • 63. Hive - Data Flow Architecture of Facebook
    [Diagram: Web Servers, Scribe MidTier, Scribe Writers, realtime Hadoop cluster, Hadoop Hive Warehouse (offline batch processing), Oracle RAC and MySQL (online querying of the summary data)]
    Sources: http://borthakur.com/ftp/hadoopmicrosoft.pdf, http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf
  • 64. Hive - Architecture 출처: http://cs.brown.edu/~debrabant/cis570-website/papers/hive.pdf 63
  • 65. Hive - Data Model
    Re-used from relational databases: database, table, partition, row, column
    Tables
      – Typed columns (int, float, string, date, boolean)
      – Also array/list/map/struct for JSON-like data
    Partitions
      – E.g., to range-partition tables by date
    Buckets
      – Hash partitions within ranges (useful for sampling, join optimization)
    Column data types:
      CREATE TABLE t (
        s STRING,
        f FLOAT,
        a ARRAY<MAP<STRING, STRUCT<p1:INT, p2:INT>>>
      );
      SELECT s, f, a[0]['foobar'].p2 FROM t;
    Source: http://www.slideshare.net/cwsteinbach/hive-quick-start-tutorial
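The bucketing idea fits in two lines. A simplified sketch, with the caveat that Hive uses its own deterministic hash of the CLUSTERED BY columns; Python's built-in `hash` merely stands in here, and the DDL in the comment is a hypothetical example.

```python
NUM_BUCKETS = 4  # e.g. CLUSTERED BY (user_id) INTO 4 BUCKETS (hypothetical DDL)

def bucket_of(user_id):
    # The same key always lands in the same bucket, which is what makes
    # table sampling and bucketed map-side joins possible.
    return hash(user_id) % NUM_BUCKETS
```

Two bucketed tables with the same key and bucket count can be joined bucket-by-bucket, without a full shuffle.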
  • 66. Hive - HiveQL (Hive Query Language)
    Basic SQL
      – FROM clause subqueries
      – ANSI JOIN (equi-join only)
      – Multi-table insert
      – Multi group-by
      – Sampling
      – Object traversal
    Extensibility
      – Pluggable map-reduce scripts using TRANSFORM
    Limitations
      – Subset of SQL
      – Limited meta-data queries
      – Limited equality and join predicates
      – No inserts on existing tables (to preserve the WORM property); can overwrite an entire table
  • 67. Hive - Query Execution and MR Jobs Query plan with 3 map-reduce jobs for multi-table insert query 출처: Ysmart(Yet Another SQL-to-MapReduce Translator), http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf 66
  • 68. Hive - Problems
    Performance gap
      – For simple queries, Hive performance is comparable with hand-coded MR jobs
      – Execution time is much longer for complex queries: HiveQL allows an arbitrary number of embedded subqueries, each converted to its own MapReduce job, and this naive conversion may involve much unnecessary data scanning and transfer
      – Example (simple aggregation query): a hand-coded MR program requires only 2 jobs, while the automatically generated MR code uses 6 jobs
    Missing features
      – ANSI SQL
      – Cost-based optimizer
      – UDFs
      – Data types
    Sources: http://hortonworks.com/blog/100x-faster-hive/, http://sydney.edu.au/engineering/it/~zhouy/info5011/doc/08_DataAnalytics.pdf
  • 70. Stinger - Overview An initiative, not a project or product Includes changes to Hive and a new project Tez Two main goals –Improve Hive performance 100x over Hive 0.10 –Extend Hive SQL to include features needed for analytics Hive will support: –BI tools connecting to Hadoop –Analysts performing ad-hoc, interactive queries –Still excellent at the large batch jobs it is used for today 출처: http://www.slideshare.net/alanfgates/stinger-hadoop-summit-june-2013 69
  • 71. Stinger - Milepost 출처: http://www.slideshare.net/alanfgates/stinger-hadoop-summit-june-2013, http://hortonworks.com/labs/stinger/ 70
  • 72. Stinger - Hive 0.12 (HDP 2.0) Feature 출처: http://hortonworks.com/labs/stinger/ 71
  • 73. Stinger - Performance (1/2) 출처: http://www.slideshare.net/pudidic/hive-28625667, http://www.slideshare.net/alanfgates/stinger-hadoop-summit-june-2013 72
  • 74. Stinger - Performance (2/2)  Query 27: Pricing Analytics using Star Schema Join  Query 28: Inventory Analytics Joining 2 Large Fact Tables 출처: http://www.slideshare.net/alanfgates/stinger-hadoop-summit-june-2013, http://hortonworks.com/labs/stinger/ 73
  • 75. Stinger - SQL Semantics 출처: http://hortonworks.com/labs/stinger/ 74
  • 76. Stinger - Scale  Sustained Query Times: Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale.  Smaller Footprint: Better encoding with ORCFile in Apache Hive 12 reduces resource requirements for a cluster. 출처: http://hortonworks.com/labs/stinger/ 75
  • 77. SQL natively on Hadoop     Impala Drill Presto Tajo 76
  • 79. Impala - Overview
    In-memory, distributed SQL query engine (no Map/Reduce)
    Inspired by Google Dremel, VLDB 2010
    Native code (C++); version 1.2.3 as of 2014.01.22
    Distributed on HDFS data nodes
    Interactive SQL
      – Typically 5-65x faster than Hive (observed up to 100x faster)
      – Responses in seconds instead of minutes (sometimes sub-second)
    Nearly ANSI-92 standard SQL queries with Hive SQL
      – Compatible SQL interface for existing Hadoop/CDH applications
      – Based on industry-standard SQL
    Natively on Hadoop/HBase storage and metadata
      – Flexibility, scale, and cost advantages of Hadoop
      – No duplication/synchronization of data and metadata
      – Local processing to avoid network bottlenecks
    Separate runtime from MapReduce
      – MapReduce is designed and great for batch
      – Impala is purpose-built for low-latency SQL queries on Hadoop
    Source: http://www.slideshare.net/insideHPC/impala-overview
  • 80. Impala - Architecture 출처: http://www.slideshare.net/insideHPC/impala-overview 79
  • 82. Drill - Overview  Interactive analysis of Big Data using standard SQL  Fast – Low latency – Columnar execution  Inspired by Google Dremel / BigQuery – Complement native interfaces and MapReduce/Hive/Pig  Open – Community driven open source project – Under Apache Software Foundation  Modern – Standard ANSI SQL: 2003 (select/into) – Nested / hierarchical data support – Schema is optional – Supports RDBMS, Hadoop and NoSQL 출처: http://www.slideshare.net/tdunning/apache-drill-16513485 81
  • 83. Drill - How Does it Work?
    • Drillbits run on each node, designed to maximize data locality
    • Processing is done outside the MapReduce paradigm (but possibly within YARN)
    • Queries can be fed to any Drillbit
    • Coordination, query planning, optimization, scheduling, and execution are distributed
    Example query spanning multiple data sources:
      SELECT * FROM oracle.transactions, mongo.users, hdfs.events LIMIT 1
    Source: http://www.slideshare.net/tdunning/apache-drill-16513485
  • 85. Drill - Status  Heavy active development by multiple organizations  Available – Logical plan syntax and interpreter – Reference interpreter  In progress – SQL interpreter – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats  Significant community momentum – – – – Over 200 people on the Drill mailing list Over 200 members of the Bay Area Drill User Group Drill meetups across the US and Europe OpenDremel team joined Apache Drill  Anticipated schedule: – Prototype: Q1 – Alpha: Q2 – Beta: Q3 출처: http://www.slideshare.net/tdunning/apache-drill-16513485 84
  • 88. Presto - Overview
    Developed at Facebook and released as open source (2013.11.6)
    Designed for query types that take seconds to minutes
    No fault tolerance
    Fast responses and online query processing
      – Results are returned as soon as they are produced
    Supports approximate queries for some query types
    Developed as a complementary system to Hive
    All operators are pipelined and data transfer is streamed
    Facebook numbers
      – 300 PB accumulated data warehouse, 30k queries/day, 1 PB processed per day
    Source: http://www.slideshare.net/hyunsikchoi/sqlonhadoop-tajo-tech-planet-2013
  • 89. Presto - Architecture(1/2) 출처: https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920 88
  • 90. Presto - Architecture (2/2) 출처: https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920 89
  • 91. Presto - Query Execution Flow 출처: http://www.slideshare.net/frsyuki/hadoop-source-code-reading-15-in-japan-presto 90
  • 93. Tajo - Overview
    Tajo
      – A large-scale data warehouse on Hadoop
      – Development started in 2010 as a research prototype at Korea University
      – Entered Apache incubation in 2013.03
      – 0.2-incubating released 2013.11
    Features
      – Compatibility: standard SQL support, HiveQL support, UDF support, JDBC
      – High performance and low latency: flexible and efficient distributed processing engine, cost-based optimizer, JIT query compilation and vectorized query engine
    Sources: http://tajo.incubator.apache.org/, http://www.slideshare.net/deview/deview2013-tajo
  • 94. Tajo - Architecture
    Master-worker model
      – RPC-based: Protocol Buffers, Netty
    Tajo Master
      – Handles client and application requests
      – Catalog server: table schemas, physical information, various statistics; an external RDBMS can be used as its store via JDBC
      – Query parser, planner, optimizer, cluster resource management, QueryMaster management
    QueryMaster
      – One per query
      – Controls Execution Blocks (query execution stages)
      – Task scheduling
    Tajo Worker
      – Storage manager
      – Local query engine
    Source: http://www.slideshare.net/deview/deview2013-tajo
  • 95. Tajo - Architecture 출처: http://www.slideshare.net/hyunsikchoi/sqlonhadoop-tajo-tech-planet-2013 94
  • 96. Tajo - Query Planning Process 출처: http://www.slideshare.net/hyunsikchoi/sqlonhadoop-tajo-tech-planet-2013 95
  • 97. Tajo - Distributed Execution Plan 출처: http://www.slideshare.net/hyunsikchoi/sqlonhadoop-tajo-tech-planet-2013 96
  • 98. Tajo - Benchmark Results
    Tajo vs. Hive on TPC-H 1TB
    Experimental environment
      – 1TB TPC-H data set; a subset of the TPC-H queries
      – Per node: Intel i5, 16GB memory, 2× 1TB HDD, 1G Ethernet
      – 32 cluster nodes
      – Hadoop 0.20.2-cdh3u3
    Source: http://www.slideshare.net/hyunsikchoi/tajo-intro
  • 101. Hadapt - Overview
    A commercialized version of Daniel Abadi's HadoopDB project
    HadoopDB project: "An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads"
      – Hadoop as a communication layer above multiple nodes running single-node DBMS instances
    Full open-source solution
      – PostgreSQL as the DB layer
      – Hadoop as the communication layer
      – Hive as the translation layer
    Hadapt (HadoopDB) = Hadoop + PostgreSQL
  • 102. Hadapt - HadoopDB
    A recent study from Yale University's database research department
    A hybrid architecture of parallel databases and MapReduce systems; the idea is to combine the best qualities of both technologies
    Multiple single-node databases are connected using Hadoop as the task coordinator and network communication layer
    Queries are distributed across the nodes by the MapReduce framework, but as much work as possible is done inside the database nodes
    Source: https://wiki.aalto.fi/download/attachments/37102509/HadoopDB%2Bproject.ppt
  • 103. Hadapt - Architecture 출처: http://www.slideshare.net/DavidPortnoy/hybrid-data-warehouse-hadoop-implementations 102
  • 104. Hadapt - HadoopDB Benchmark (1/2) 출처: https://wiki.aalto.fi/download/attachments/37102509/HadoopDB%2Bproject.ppt 103
  • 105. Hadapt - HadoopDB Benchmark (2/2) 출처: https://wiki.aalto.fi/download/attachments/37102509/HadoopDB%2Bproject.ppt 104
  • 107. Hadoop
    An open-source framework for distributed processing of large-scale data: HDFS (data storage) + MapReduce (distributed processing)
    Flexibility
      – Supports diverse data types (structured and unstructured data, text, etc.)
      – Supports general-purpose programming languages
    Scalability
      – Performance and capacity scale linearly as nodes are added
    Cost
      – Designed to run on clusters of many commodity servers
      – Adding nodes is a cheap way to grow capacity and improve processing performance
  • 109. Star Schema (1/3)
    Star schema
      – A database design that separates dimensional data from fact/event data
    Two major types of components
      – Fact tables and dimension tables
      – A fact table is surrounded by dimension tables in a simple star schema
      – Each dimension table has a one-to-many relationship to the central fact table
    Source: http://www.fbe.hku.hk/~is/busi0092/Notes/t1_dataWarehouse_full_v3.pdf
  • 110. Star Schema (2/3) - Components of a Star Schema
    – Fact tables contain factual or quantitative data
    – Dimension tables contain descriptions about the subjects of the business
    – 1:N relationship between dimension tables and fact tables
    Source: http://www.fbe.hku.hk/~is/busi0092/Notes/t1_dataWarehouse_full_v3.pdf
  • 111. Star Schema (3/3) - Sample Data 출처: http://www.fbe.hku.hk/~is/busi0092/Notes/t1_dataWarehouse_full_v3.pdf 110
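A star-schema query boils down to joining a large fact table against small dimension lookup tables and aggregating the measures. A toy sketch with hypothetical product/store dimensions (not the slide's sample data):

```python
from collections import defaultdict

product_dim = {1: "shoes", 2: "hats"}   # dimension: descriptive attributes
store_dim = {10: "Seoul", 20: "Busan"}
fact_sales = [                          # fact: (product_id, store_id, amount)
    (1, 10, 100), (2, 10, 50), (1, 20, 70),
]

# Roughly: SELECT p.name, s.city, SUM(f.amount)
#          FROM fact_sales f JOIN product_dim p JOIN store_dim s
#          GROUP BY p.name, s.city
totals = defaultdict(int)
for product_id, store_id, amount in fact_sales:
    totals[(product_dim[product_id], store_dim[store_id])] += amount
```

Because the dimensions are tiny relative to the fact table, engines typically broadcast them to every node and stream the fact rows through hash lookups, the "star join" pattern several of the benchmarks above exercise.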
  • 112. SQL on Hadoop Products by MapR 출처: http://www.mapr.com/products/sql-on-hadoop 111
  • 113. Potential Big Data Use Cases  Source: SAS and IDC, http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/
  • 114. BigData Software Stack (Hadoop) 113
  • 115. BDAS(Berkeley Data Analytics Stack) 출처: https://amplab.cs.berkeley.edu/software/ 114