[2A2]Vectorized_processing_in_a_Nutshell

김형준
GruterVectorizedProcessing in a Nutshell

김형준, babokim@gmail.com
www.jaso.co.kr
클라우드컴퓨팅구현기술등저
Apache Tajo Committer
Gruter(BigData) NHN(Hadoop) SDS(공공SI, 젂자ITO)
Who I am

Vectorized
Query Compilation
CPU pipeline
Branch misprediction
이건무슨외계어?
SI 개발자가왜여기까지왔나?
Cache Miss
SIMD
LLVM

Hadoop쉽다며?
Hadoop빠르다며?

메모리를잘활용해보자!
실행계획을잘만들어보자!
파일포맷을최적화해보자!
...
그래서다양한시도가
SQL을이용한분석

단순하게하드웨어로해결?
처리해야할데이터가많으니Disk만빠르면?
http://www.enterprisestorageforum.com/storage-hardware/ssd-vs.-hdd-pricing-seven-myths-that-need-correcting-2.html
164
205
563
947
3,277
0
5000
10000
15000
0
1000
2000
3000
4000
SATA-HDD
SAS-HDD-15K
SATA-SSD
SAS-SSD
PCIe-SSDRead MB/s
$/TB
MB/sec
$/1TB

SAS * 24개= 200MB/sec * 24 = 4.8GB/s
X 50대
240GB/s10초에2.4TB
이렇게했는데도별로안빠르다!

CPU, 53
DISK, 30
N/W, 17
HP-DL385P, SATA/HDD*8, 10대Tajo, TPC-H Scale 1000질의
데이터분석리소스사용유형CPU, 72DISK, 10

더빠른CPU를만들거나
→ H/W 제조사가할일
CPU/Memory가많은장비를사용
(Scale-up)
→ 비용이많이들고
메인보드의한계
그렇다고계속HDD만사용?

CPU를마른수건짜듯이잘활용Vectorized
Query
Compilation
우리(개발자)가할수있는것은

어떻게하면CPU를
잘활용할수있나?
한번에더많은명령을실행하게하고
→ CPU Pipeline
CPU내메모리를잘활용하고
→ CPU Cache
CPU에서제공하는기능을잘활용한다.
→ SIMD

DBMS는어떻게데이터를처리?
SelectionGroupby
Join
Orderby
Eval
Filter
SCAN
...
Basic OperatorSLECT o_custkey, l_lineitem, count(*) FROM lineitemaJOIN orders b ON a.l_orderkey= b.o_orderkeyWHERE a.l_shipdate> ‘2014-09-01’ GROUP BY o_custkey, l_lineitem
SCAN
(lineitem)
SCAN
(orders)
Fliter
Join
Groupby
Operator 조합으로
Tree 형태의
PhysicalPlan생성Call()
Call()
Return Tuple
Return Tuple

Tajo란?
•SQL on Hadoop솔루션으로분류
–Hadoop을메인저장소로사용하는Data Warehouse 시스템
–수초미만의Interactive 질의+ 수시갂의ETL 질의모두지원
•Apache Open Source (http://tajo.apache.org)
•자체분산실행엔진
–MapReduce가아닌자체분산실행엔진이용
•다양한성능최적화기법도입
–Join Optimization, Dynamic Planning
–Disk balanced scheduler, Multi-Level Count Distinct
–Query Compilation, VectorizedEngine(2015 1Q) 등
•이미엔터프라이즈에적용
–상용DW Win back, 수TB/일처리
http://events.linuxfoundation.org/sites/events/files/slides/ApacheCon_Keuntae_Park_0.pdf
http://spark-summit.org/wp-content/uploads/2014/07/Big-Telco-Bigger-Real-time-Demands-Jungryong-Lee.pdf

Tajo DW 사례Operational Systems
Integration
Layer
Data Warehouse
Data Mart
Marketing
Sales
ERPSCM
ODS
Staging Area
Strategic
Marts
Data
Vault
상용DW용DBMS 역할대체

이미지: http://www.digitalinternals.com/
Instruction
처리속도향상이아닌
처리용량(throughput)
향상
CPU Pipeline
Not Pipelined
Pipelined

CPU Pipeline의어려움
•Structural Hazards
–여러명령이같은하드웨어리소스를동시에필요로할때
•Data Hazards
–instruction들이사용하는data갂의존성문제
–ex) c = a+b;
e = c+d;
•Control Hazards
–조건문등이있는경우다음instruction을예측하여pipelining
–상황에따라분기오예측(branch misprediction)이발생
–분기예측이잘못되었을경우pipelining에있는instruction이모두flush됨
–Pipelining stage가길수록throughput이떨어짐
–분기오예측은일반적으로10-20 CPU cycle 소모
다음과같은경우Pipeline이중지

for (i= 0; i< 128; i++) {
a[i] = b[i] + c[i];
} for (i= 0; i< 128; i+=4) { a[i] = b[i] + c[i]; a[i+1] = b[i+1] + c[i+1]; a[i+2] = b[i+2] + c[i+2]; a[i+3] = b[i+3] + c[i+3]; }
Original Loop
Unrolled Loop
지시자: #pragmaGCC optimize ("unroll-loops")
컴파일옵션: -funroll-loops 등
Loop unrolling
컴파일러옵션또는지시자로설정가능
Control Hazard 개선
조건비교

for (inti= 0; i< num; i++) {
if (data[i] >= 128) {
sum += data[i];
}
}
for (inti= 0; i< num; i++) {
int v = (data[i] -128) >> 31;
sum += ~v & data[i];
}
Branch version
Non branch version
Branch avoidance
http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array
Random data : 35,683 ms
Sorted data : 11,977 ms
11,977 ms10,402 ms
Control Hazard 개선

Memory Hierarchy
http://article.techlabs.by/29_1411.html
L1 Cache
L2 Cache
L3 Cache
Main Memory

CPU Cache 특징
•Spatial locality
–cache line (64 bytes) 단위인접한데이터까지메모리로부터읽음
–File System의Block size와같은개념
•Temporal locality
–cache의크기는제한적이며캐쉬유지젂략은LRU
–즉, 가장최귺사용된것이캐쉬에남아있음
Cache
Cache line
(64 byte)
Main Memory

for (i= 0 to size)
for (j = 0 to size)
do something with ary[j][i]
for (i= 0 to size)
for (j = 0 to size)
do something with ary[i][j]
인접하지않은메모리를반복해서
읽기때문에비효율적
순차읽기로인한높은cache hit
Case 1
Case 2
CPU Hit 비교
Classic Example of Spatial Locality

i
j
Case 2
Cache Line
i
j
Case 1
CPU Hit 비교

DBMS에서의CPU Cache
•Tuple을Row 단위로저장할경우
–Row의크기가커서Cache Hit율이낮음
•Column 단위로저장할경우
–Cache Hit율이높음
–동일한Operator를동일컬럼에서여러개의데이터를한번에수행가능(SIMD)
–하지만기존의처리방식(Tuple-at-a time)을사용하지못함
•시스템젂반적인변경이필요

SIMD
•Single Instruction Multiple Data
–CPU에서제공하는병렬처리명령Set
–사칙연산및논리연산을SIMD명령으로제공
•Pipeline은명령을병렬처리, SIMD는데이터를병렬처리
•Intel CPU의SIMD는MMX라는이름으로제안, SSE, AVX로진화중
•Nehalem은단일명령으로128bit 데이터
•Sandy Bridge 이후부터는256bit 데이터에대한명령셋제공
–256 bit의크기= 16 Shorts, 8 Integers/Floats, 4 Long/Double
https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions

https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Intel SIMD Instruction

DBMS 에서는
•TupleMemory Model
–Tuple을메모리에저장하는방식
–Cache hit율에영향
•TupleProcessing Model
–처리하는Tuple의단위
–Cache hit, CPU Pipeline, SIMD 활용에모두영향

OR
출처:[Marcin2009]
TupleMemory Model
N-array Storage Model
Decomposed Storage
Model

TupleMemory Model
•Row-store (NSM)
–하나의row를레코드단위로하여물리적으로연속적인바이트열로저장
–장점: 레코드단위의쉬운삽입/삭제/업데이트가능
–단점: I/O의최소단위는레코드
–OLTP에적합
•Column-store (DSM에서발젂)
–컬럼을별도의파일에저장
–장점
•필요한컬럼들에대해컬럼단위I/O가능
•매우빠른읽기성능
•더높은압축률
–단점
•많은컬럼을읽을경우tuple형태로만드는비용이큼
•파일쓸때여러파일에대해I/O가발생
•읽기위주의분석적처리에적합

Tuple-at-a-time
Block-at-a-time
Column-at-a-time
Vectorized
TupleProcessing Model

Tuple-at-a-time Model
next()
next()
next()
….
….
…. JOIN A,B
Volcano-an extensible and parallel query evaluation system
SCAN
JOIN C
SCAN
...
결과Tuple
•next() 호출마다재귀적으로하위연산자의next() 호출
•next()의반환값은하나의Tuple
•N-array Storage Model(NSM) 레코드기반
•CPU활용측면에서비효율적
–함수호출비용(20 CPU cycle)
–Pipeline 활용도낮음
–SIMD 활용도낮음
–빈번한instruction/data cache miss
–NSM으로인한cache miss
•MySQL, Pig,Tajo(AS-IS)가기반하는구조

Block-at-a-time Model
•한번의next() 호출에N개의NSM 기반tuple
•N-array Storage Model(NSM)에기반
•Impala, PostgreSQL등
•장점
–함수호출비용감소
–instruction/data cache hit 향상
•단점
–Pipelining 활용도여젂히낮음
–SIMD활용도낮음
–NSM으로인한cache miss
Block oriented processing of Relational Database operations in modern Computer Architectures
next()
next()
next()
….
JOIN A,B
SCAN
JOIN C
SCAN
...
결과Tuple
….
….

Column-at-a-time Model
•Decomposed Storage Model 사용
•각컬럼배열자료구조에는젂체Row의
데이터를반환
•MonetDB에서사용
•장점
–Instruction/datacache hit 향상
–Pipelining 활용도높음
–SIMD활용도높음
•단점
–중갂데이터유지비용높음
–여젂히높은cache miss빈도
•두컬럼의데이터를이용하는연산의경우
Monet: a next-Generation DBMS Kernel
For Query-Intensive Applications
next()
next()
next()
JOIN A,B
SCAN
JOIN C
결과Tuple

Column-at-a-time Model
•성능비교
–MySQL: 26.2 sec, MonetDB: 3.7 sec
–TPCH-Q1(1 Table Groupby, Orderby, 결과데이터는4건)
–In-memory 데이터에대한비교실험결과
•논리연산/사칙연산별로별도Primitive 연산자필요
/* Returns a vector of row IDs’ satisfying comparison
* and count of entries in vector. */
intselect_gt_float(int* res, float* column, float val, intn) {
intidx=0;
for (inti= 0; i< n; i++) {
idx+= column[i] > val;
res[idx] = j;
}
return idx;
}
Greater-than primitive 연산자의예
기본데이터타입, 사칙연산, 논리연산에약5천여개필요

VectorizedModel
•1개의Column에대해Vector에여러개의데이터를저장
•서로다른Column vector를CPU L2 Cache 크기(256KB)에맞추어row group 단위로연산
•조건식/계산식은vector 단위로처리
•장점
–Instruction/datacache hit 향상
–Pipelining 활용도높음
–SIMD활용도높음
–Interpret overhead 최소화
•단점
–새로운모델로개발필요
MonetDB / X100 : Hyper-Pipelining Query Execution
next()
next()
next()
JOIN A,B
SCAN
JOIN C
결과Tuple
...
...
...

Filter 연산예제*+
/
<
RowGroupRowGroup
Vector
Vectorized
Primitives
(Filter)
VectorizedModel
•하나의operator 처리를위해서row group에속하는다수의컬럼들을엑세스필요
•Column-at-a-time 방식은그과정에서cache miss가발생
•Vectorized방식은각row group의vector들이CPU L2 cache사이즈에맞도록설계
•CPU Pipelining, SIMD, CPU Cache세가지활용을극대화
•MySQL(Tuple-at-a-time) : 26.2 secs
•MonetDB(Column-at-atime) : 3.7 secs
•MonetDBX100(Vectorized) : 0.6 secs

We always welcome
new contributors!
General
http://tajo.apache.org
Issue Tracker
https://issues.apache.org/jira/browse/TAJO
Join the mailing list
dev-subscribe@tajo.apache.org, issues-subscribe@tajo.apache.org
한국User group
https://groups.google.com/forum/?hl=ko#!forum/tajo-user-kr

[2A2]Vectorized_processing_in_a_Nutshell

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Similar to [2A2]Vectorized_processing_in_a_Nutshell

Similar to [2A2]Vectorized_processing_in_a_Nutshell (20)

More from NAVER D2

More from NAVER D2 (20)

Recently uploaded

Recently uploaded (20)

[2A2]Vectorized_processing_in_a_Nutshell