김형준 
Gruter: Vectorized Processing in a Nutshell
김형준, babokim@gmail.com 
www.jaso.co.kr 
Co-author of "Cloud Computing Implementation Technology" and other books
Apache Tajo Committer
Gruter (BigData), NHN (Hadoop), SDS (public-sector SI, electronics ITO)
Who I am
Vectorized 
Query Compilation 
CPU pipeline 
Branch misprediction 
What on earth do these words mean?
How did an SI developer end up here?
Cache Miss 
SIMD 
LLVM
Didn't they say Hadoop was easy?
Didn't they say Hadoop was fast?
Use memory well!
Build execution plans well!
Optimize the file format!
...
So many different approaches have been tried for
analytics using SQL
Can we simply solve it with hardware?
There is a lot of data to process, so is a fast disk enough?
http://www.enterprisestorageforum.com/storage-hardware/ssd-vs.-hdd-pricing-seven-myths-that-need-correcting-2.html 
[Chart: Read MB/s vs. $/TB for SATA-HDD, SAS-HDD-15K, SATA-SSD, SAS-SSD, and PCIe-SSD.
Read throughput: 164, 205, 563, 947, and 3,277 MB/sec respectively; price axis up to $15,000 per TB.]
24 SAS drives = 200 MB/sec * 24 = 4.8 GB/s
x 50 machines
= 240 GB/s, i.e. 2.4 TB in 10 seconds
Even after all this, it is still not that fast!
[Charts: resource usage profile of data analytics.
HP-DL385P, 8x SATA/HDD, 10 machines, Tajo, TPC-H scale 1000 queries: CPU 53, DISK 30, N/W 17.
A second workload: CPU 72, DISK 10.]
Build faster CPUs
→ that is the hardware vendors' job
Use machines with more CPU/Memory
(Scale-up)
→ expensive, and limited by
what a mainboard can hold
So do we just keep using HDDs?
Squeeze every last drop out of the CPU:
Vectorized
Query Compilation
What we (developers) can do
How can we make
good use of the CPU?
Execute more instructions at once
→ CPU Pipeline
Make good use of the CPU's internal memory
→ CPU Cache
Use the features the CPU provides
→ SIMD
How does a DBMS process data?
Selection, Groupby
Join
Orderby
Eval
Filter
SCAN
...
Basic Operators

SELECT o_custkey, l_lineitem, count(*)
FROM lineitem a
JOIN orders b ON a.l_orderkey = b.o_orderkey
WHERE a.l_shipdate > '2014-09-01'
GROUP BY o_custkey, l_lineitem

[Diagram: operator tree built from SCAN (lineitem), SCAN (orders), Filter, Join, and Groupby.
Operators are combined into a tree-shaped physical plan; each Call() returns a Tuple.]
What is Tajo?
•Classified as a SQL-on-Hadoop solution
–A Data Warehouse system that uses Hadoop as its main storage
–Supports both interactive queries (a few seconds) and ETL queries (several hours)
•Apache Open Source (http://tajo.apache.org)
•Its own distributed execution engine
–Uses its own engine rather than MapReduce
•Adopts various performance optimization techniques
–Join Optimization, Dynamic Planning
–Disk balanced scheduler, Multi-Level Count Distinct
–Query Compilation, Vectorized Engine (2015 1Q), etc.
•Already deployed in the enterprise
–Commercial DW win-backs, several TB/day processed
http://events.linuxfoundation.org/sites/events/files/slides/ApacheCon_Keuntae_Park_0.pdf
http://spark-summit.org/wp-content/uploads/2014/07/Big-Telco-Bigger-Real-time-Demands-Jungryong-Lee.pdf
Tajo DW Case Study
[Diagram: Operational Systems (Marketing, Sales, ERP, SCM) → Integration Layer
(ODS, Staging Area, Data Vault) → Data Warehouse → Data Marts (Strategic Marts)]
Replacing the role of a commercial DW DBMS
Now, on to the main topic...
Image: http://www.digitalinternals.com/
An improvement in processing
capacity (throughput), not in
per-instruction latency
CPU Pipeline
Not Pipelined
Pipelined
Difficulties with the CPU Pipeline
•Structural Hazards
–Multiple instructions need the same hardware resource at the same time
•Data Hazards
–Dependencies between the data used by instructions
–ex) c = a+b;
e = c+d;
•Control Hazards
–With conditionals, the next instruction must be predicted to keep the pipeline full
–Depending on the data, branch mispredictions occur
–On a misprediction, every instruction in the pipeline is flushed
–The longer the pipeline, the more throughput drops
–A branch misprediction typically costs 10-20 CPU cycles
The pipeline stalls in cases like these
for (i = 0; i < 128; i++) {
  a[i] = b[i] + c[i];
}

for (i = 0; i < 128; i += 4) {
  a[i]   = b[i]   + c[i];
  a[i+1] = b[i+1] + c[i+1];
  a[i+2] = b[i+2] + c[i+2];
  a[i+3] = b[i+3] + c[i+3];
}

Original Loop
Unrolled Loop
Directive: #pragma GCC optimize ("unroll-loops")
Compile option: -funroll-loops, etc.
Loop unrolling
Configurable via compiler option or directive
Mitigating Control Hazards
Fewer condition checks
for (int i = 0; i < num; i++) {
  if (data[i] >= 128) {
    sum += data[i];
  }
}

for (int i = 0; i < num; i++) {
  int v = (data[i] - 128) >> 31;
  sum += ~v & data[i];
}

Branch version
Non-branch version
Branch avoidance
http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array
Branch version: random data 35,683 ms / sorted data 11,977 ms
Non-branch version: 10,402 ms (vs. 11,977 ms sorted with branches)
Mitigating Control Hazards
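The non-branch version works because an arithmetic right shift turns the sign bit of (data[i] - 128) into an all-ones or all-zero mask. A self-contained sketch, assuming the arithmetic-shift behavior for signed ints that GCC and Clang provide (the function names are illustrative):

```c
#include <assert.h>

/* Branchless conditional sum, as on the slide: for signed 32-bit ints,
 * (data[i] - 128) >> 31 is all-ones when data[i] < 128 and zero otherwise,
 * so ~mask & data[i] keeps only values >= 128 without any branch. */
int sum_ge_128_branchless(const int *data, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int mask = (data[i] - 128) >> 31;  /* -1 if data[i] < 128, else 0 */
        sum += ~mask & data[i];
    }
    return sum;
}

/* Reference branch version, for comparison. */
int sum_ge_128_branch(const int *data, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        if (data[i] >= 128) sum += data[i];
    return sum;
}
```

Both versions compute the same result; the branchless one simply gives the predictor nothing to mispredict, which is why its running time does not depend on whether the input is sorted.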
Memory Hierarchy 
http://article.techlabs.by/29_1411.html 
L1 Cache 
L2 Cache 
L3 Cache 
Main Memory
CPU Cache characteristics
•Spatial locality
–Data adjacent within a cache line (64 bytes) is read from memory together
–The same idea as a file system's block size
•Temporal locality
–Cache capacity is limited; the retention policy is LRU
–i.e., the most recently used data stays in the cache
Cache
Cache line
(64 byte)
Main Memory
for (i = 0 to size)
  for (j = 0 to size)
    do something with ary[j][i]

for (i = 0 to size)
  for (j = 0 to size)
    do something with ary[i][j]

Inefficient: repeatedly reads
non-adjacent memory
High cache hit rate thanks to sequential reads
Case 1
Case 2
Comparing cache hit rates
Classic Example of Spatial Locality
[Diagram: Case 1 strides across cache lines, touching a new 64-byte line on each access,
while Case 2 consumes each cache line sequentially before moving to the next.]
Comparing cache hit rates
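The pseudocode above can be made concrete in C. Both traversals compute the same sum, but Case 2 (ary[i][j]) walks memory sequentially while Case 1 (ary[j][i]) jumps a full row stride per access; SIZE and the function names here are illustrative:

```c
#include <assert.h>

#define SIZE 64
static int ary[SIZE][SIZE];

/* Case 2: inner index varies fastest, so the scan walks memory
 * sequentially and reuses each 64-byte cache line for 16 ints. */
long sum_row_major(void) {
    long s = 0;
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            s += ary[i][j];
    return s;
}

/* Case 1: each access jumps SIZE * sizeof(int) bytes, so once the
 * array exceeds the cache, (almost) every access misses. */
long sum_col_major(void) {
    long s = 0;
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            s += ary[j][i];
    return s;
}
```

With a SIZE large enough that the array does not fit in cache, timing the two functions reproduces the gap the slide describes.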
CPU Cache in a DBMS
•When tuples are stored row by row
–Rows are large, so the cache hit rate is low
•When stored column by column
–High cache hit rate
–The same operator can process many values of the same column at once (SIMD)
–But the existing Tuple-at-a-time processing model no longer applies
•Requires changes across the whole system
SIMD
•Single Instruction Multiple Data
–A parallel-processing instruction set provided by the CPU
–Arithmetic and logical operations are offered as SIMD instructions
•The pipeline parallelizes instructions; SIMD parallelizes data
•Intel's SIMD debuted as MMX and has evolved through SSE and AVX
•Nehalem handles 128-bit data with a single instruction
•Sandy Bridge and later provide instructions for 256-bit data
–256 bits = 16 shorts, 8 integers/floats, or 4 longs/doubles
https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions
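As a concrete sketch of a single SIMD instruction at work, the code below uses SSE2 (the 128-bit baseline available on every x86-64 CPU) to add four 32-bit integers per instruction; it assumes an x86-64 target, and the function name add_i32_simd is illustrative:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics, baseline on x86-64 */

/* Adds two int arrays four lanes at a time: one _mm_add_epi32 on
 * 128-bit registers replaces four scalar additions. For simplicity,
 * this sketch requires n to be a multiple of 4. */
void add_i32_simd(const int *a, const int *b, int *out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
}
```

With AVX2 the same loop can step 8 integers at a time (_mm256_add_epi32), which is the 256-bit width the slide mentions for Sandy Bridge and later.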
SIMD
https://software.intel.com/sites/landingpage/IntrinsicsGuide/ 
Intel SIMD Instruction
In a DBMS
•Tuple Memory Model
–How tuples are laid out in memory
–Affects the cache hit rate
•Tuple Processing Model
–The unit in which tuples are processed
–Affects cache hits, the CPU pipeline, and SIMD usage
[Diagram: the two layouts side by side (source: [Marcin2009])]
Tuple Memory Model
N-ary Storage Model
Decomposed Storage Model
Tuple Memory Model
•Row-store (NSM)
–Each row is stored as one record, a physically contiguous byte sequence
–Pros: easy record-level insert/delete/update
–Cons: the minimum unit of I/O is a record
–Well suited to OLTP
•Column-store (evolved from DSM)
–Each column is stored in its own file
–Pros
•Column-level I/O for just the columns needed
•Very fast reads
•Higher compression ratios
–Cons
•Reassembling many columns into tuples is expensive
•Writes touch many files
•Well suited to read-heavy analytical processing
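The two layouts amount to array-of-structs vs. struct-of-arrays. A minimal sketch (the example schema and function names are illustrative, not any real engine's format): scanning one column in the NSM layout drags every row's unrelated fields through the cache, while the DSM layout reads only the bytes it needs.

```c
#include <assert.h>

#define N 4

/* N-ary Storage Model (row-store): each record is one contiguous
 * struct, so a single-column scan still loads whole rows. */
struct row { int id; double price; int qty; };

/* Decomposed Storage Model (column-store): one contiguous array
 * per column, so a single-column scan is a dense sequential read. */
struct columns { int id[N]; double price[N]; int qty[N]; };

double sum_price_nsm(const struct row *t, int n) {
    double s = 0;
    for (int i = 0; i < n; i++) s += t[i].price;  /* strided access */
    return s;
}

double sum_price_dsm(const struct columns *t, int n) {
    double s = 0;
    for (int i = 0; i < n; i++) s += t->price[i]; /* sequential access */
    return s;
}
```

The dense column array is also what makes compression and SIMD evaluation straightforward, which the later slides build on.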
Tuple-at-a-time 
Block-at-a-time 
Column-at-a-time 
Vectorized 
Tuple Processing Model
Tuple-at-a-time Model
[Diagram: operator tree (SCAN → JOIN A,B → JOIN C → ...); each next() call
recursively pulls one tuple from its child and returns one result tuple.]
Volcano: an extensible and parallel query evaluation system
•Each next() call recursively calls the child operator's next()
•next() returns a single tuple
•Based on N-ary Storage Model (NSM) records
•Inefficient use of the CPU
–Function call overhead (20 CPU cycles)
–Low pipeline utilization
–Low SIMD utilization
–Frequent instruction/data cache misses
–Cache misses caused by NSM
•The structure underlying MySQL, Pig, and Tajo (as-is)
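The pull-based protocol described above can be sketched as a minimal iterator in C. The struct layout and operator names (scan_next, filter_next) are illustrative, not Volcano's or Tajo's actual API; the point is that producing every tuple costs at least one indirect function call per operator:

```c
#include <assert.h>
#include <stdbool.h>

/* Tuple-at-a-time operator: next() yields ONE tuple per call. */
typedef struct op {
    bool (*next)(struct op *self, int *tuple_out);
    struct op *child;
    const int *data; int pos, n;   /* scan state */
    int threshold;                 /* filter state */
} op_t;

/* Leaf operator: returns the next value from a backing array. */
bool scan_next(op_t *self, int *out) {
    if (self->pos >= self->n) return false;
    *out = self->data[self->pos++];
    return true;
}

/* Filter operator: pulls tuples from its child one at a time,
 * paying an indirect call per tuple, until one passes. */
bool filter_next(op_t *self, int *out) {
    int t;
    while (self->child->next(self->child, &t))
        if (t > self->threshold) { *out = t; return true; }
    return false;
}
```

Driving the root operator in a while loop reproduces the recursive Call()/Return Tuple pattern from the earlier operator-tree slide.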
Block-at-a-time Model
•One next() call returns N NSM-based tuples
•Based on the N-ary Storage Model (NSM)
•Impala, PostgreSQL, etc.
•Pros
–Lower function call overhead
–Better instruction/data cache hit rates
•Cons
–Pipeline utilization is still low
–Low SIMD utilization
–Cache misses caused by NSM
Block oriented processing of Relational Database operations in modern Computer Architectures
[Diagram: the same operator tree, but each next() call passes a block of tuples.]
Column-at-a-time Model
•Uses the Decomposed Storage Model
•Each next() call returns a column array holding
the data for all rows
•Used by MonetDB
•Pros
–Lower function call overhead
–Better instruction/data cache hit rates
–High pipeline utilization
–High SIMD utilization
•Cons
–High cost of materializing intermediate data
–Still a high cache miss rate
•e.g. for operations that combine two columns
Monet: a next-Generation DBMS Kernel
For Query-Intensive Applications
[Diagram: the operator tree again; each next() call passes entire column arrays.]
Column-at-a-time Model
•Performance comparison
–MySQL: 26.2 sec, MonetDB: 3.7 sec
–TPCH-Q1 (one table, Groupby, Orderby, 4 result rows)
–Measured against in-memory data
•A separate primitive operator is needed per logical/arithmetic operation

/* Returns a vector of row IDs satisfying the comparison
 * and the count of entries in the vector. */
int select_gt_float(int* res, float* column, float val, int n) {
  int idx = 0;
  for (int i = 0; i < n; i++) {
    res[idx] = i;            /* record candidate row ID */
    idx += column[i] > val;  /* keep it only if the predicate holds */
  }
  return idx;
}

Example of a greater-than primitive operator
Roughly 5,000 primitives are needed to cover the basic data types, arithmetic, and logical operations
Vectorized Model
•Each column stores multiple values in a vector
•Different column vectors are processed in row-group units sized to the CPU L2 cache (256KB)
•Predicates/expressions are evaluated per vector
•Pros
–Lower function call overhead
–Better instruction/data cache hit rates
–High pipeline utilization
–High SIMD utilization
–Minimal interpretation overhead
•Cons
–Requires building a new model
MonetDB/X100: Hyper-Pipelining Query Execution
[Diagram: the operator tree; each next() call passes one vector per column.]
Filter operation example
[Diagram: row groups of column vectors flowing through vectorized
primitives (*, +, /, <) that implement the filter.]
Vectorized Model
•Processing one operator requires access to many columns within a row group
•The Column-at-a-time approach incurs cache misses in that process
•The Vectorized approach sizes each row group's vectors to fit the CPU L2 cache
•Maximizes all three: CPU pipelining, SIMD, and the CPU cache
•MySQL (Tuple-at-a-time): 26.2 secs
•MonetDB (Column-at-a-time): 3.7 secs
•MonetDB/X100 (Vectorized): 0.6 secs
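A vectorized filter can be sketched as a pair of primitives: one produces a selection vector of qualifying positions (in the spirit of the select_gt_float primitive shown earlier), and the next primitive consumes only those positions. The names and VECTOR_SIZE here are illustrative:

```c
#include <assert.h>

#define VECTOR_SIZE 1024  /* chosen so the working vectors fit in L2 cache */

/* Vectorized greater-than primitive: consumes a whole column vector per
 * call and emits a selection vector of qualifying row positions. The
 * tight, branch-free loop is what lets the compiler pipeline and
 * auto-vectorize it. */
int select_gt_int(int *sel, const int *col, int val, int n) {
    int k = 0;
    for (int i = 0; i < n; i++) {
        sel[k] = i;           /* record candidate position */
        k += col[i] > val;    /* keep it only if the predicate holds */
    }
    return k;
}

/* A downstream primitive then touches only the selected positions. */
long sum_selected(const int *col, const int *sel, int k) {
    long s = 0;
    for (int i = 0; i < k; i++) s += col[sel[i]];
    return s;
}
```

One interpreted next() call now amortizes over VECTOR_SIZE values instead of one tuple, which is where the interpretation-overhead savings in the 0.6-second X100 figure come from.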
[Chart: Vectorized Model performance as a function of vector size]
We always welcome 
new contributors! 
General 
http://tajo.apache.org 
Issue Tracker 
https://issues.apache.org/jira/browse/TAJO 
Join the mailing list 
dev-subscribe@tajo.apache.org, issues-subscribe@tajo.apache.org 
Korean user group
https://groups.google.com/forum/?hl=ko#!forum/tajo-user-kr
Q&A
THANK YOU
