© 2015 Pivotal Software, Inc. All rights reserved.
Introduction to Pivotal HAWQ
Seungdon Choi
Field Engineer
Pivotal Korea
Agenda
•  Overview
•  Architecture
•  Machine Learning using HAWQ
•  Roadmap
•  Appendix: HAWQ vs Hive
Pivotal HAWQ is

An enterprise platform that provides the fewest barriers and lowest risk, and the most cost-effective and fastest way to enter into big data analytics on Hadoop.
So what exactly is HAWQ?
Combining SQL with Hadoop is key for analytics; SQL remains the #1 choice for data science.
•  Massively Parallel Processing (MPP) RDBMS on Hadoop
•  ANSI SQL on Hadoop
•  Extremely high performance for analytics (unlike Hive)
•  Stores all data directly on HDFS
•  Open source
•  Runs on ODP-core-based Hadoop distributions (PHD, HDP, IBM, ...)
Why SQL on Hadoop?
1.  Problems with MapReduce
    1) Limits of MapReduce: slow performance, dependence on developer skill, and potential for bugs
    2) Steep learning curve
    3) Compatibility problems with legacy systems and applications
    4) Ad-hoc query performance forces continued use of a DBMS alongside Hadoop
2.  Why use SQL on Hadoop
    1) ANSI SQL support
       - Easy to integrate with or replace existing systems; shorter development time
       - Low learning curve (convenient for existing developers)
    2) High processing throughput: overcomes the limits of MapReduce
    3) Low response time
    4) Compatible with legacy systems and apps (existing BI tools such as SAS and Tableau can be reused)
    5) Interactive queries
       - Higher data-analysis productivity → faster decision making
Pivotal HAWQ
•  Uses the Greenplum database engine, proven in the enterprise market for over 15 years
   –  Partitioning, compression, resource management
•  100% ANSI SQL compliant: existing BI and SAS tools can be reused
•  Real-time queries
   –  Accesses distributed data directly, without MapReduce
•  Integrates HDFS, HBase, Hive, and other data sources via PXF external tables
•  Faster than stock HDFS thanks to an improved libhdfs (Java → C)
•  Supports analytics packages such as PL/R and MADlib
•  Security, user privilege management, and encryption
HAWQ Benefits
•  Out-of-the-box SQL for Hadoop
   –  Run analytics with SQL alone, without the MapReduce programming learning curve
•  PXF external tables provide SQL access to Hadoop
   –  A unified interface to HDFS, HBase, Hive, and other data sources
•  Broad data access, integration, and portability
•  Performance and scalability: run big data projects the way you would build a DW
   –  Parallel everything
   –  Dynamic pipelining
   –  High-speed interconnect
   –  Optimized HDFS access with libhdfs3
   –  Co-located joins and data locality
   –  Partition elimination
   –  Higher cluster utilization
   –  Concurrency control
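Partition elimination, one of the features listed above, can be sketched in a few lines: the planner compares the query's predicate against each partition's value range and scans only the partitions that can contain matching rows. This is an illustrative toy, not HAWQ's planner code; the table and partition names are hypothetical.

```python
from datetime import date

# Hypothetical range-partitioned table: partition name -> [start, end)
partitions = {
    "sales_2012_q1": (date(2012, 1, 1), date(2012, 4, 1)),
    "sales_2012_q2": (date(2012, 4, 1), date(2012, 7, 1)),
    "sales_2012_q3": (date(2012, 7, 1), date(2012, 10, 1)),
}

def eliminate(partitions, lo, hi):
    """Keep only partitions whose [start, end) range overlaps [lo, hi)."""
    return [name for name, (start, end) in partitions.items()
            if start < hi and lo < end]

# A query filtered to February touches only the Q1 partition;
# the other partitions are never read.
feb_only = eliminate(partitions, date(2012, 2, 1), date(2012, 3, 1))
```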
Architecture
Basic Architecture

[Diagram: a HAWQ Master (catalog, local TM, parser, query optimizer, dispatch, execution coordination) and a HAWQ Standby Master sit alongside the HDFS NameNode and Secondary NameNode; each segment host runs one or more segments with a query executor, PXF, local temp storage, and an HDFS DataNode, all connected by the interconnect.]
HAWQ Master
•  Receives SQL requests from clients, parses them, dispatches work to each segment node, and returns the results to the client
•  Holds no user data; keeps the global system catalog that stores system metadata
•  A standby master (warm standby) takes over the role on hardware failure
•  In production systems, typically installed on a separate server from the Hadoop NameNode
HAWQ Segments
•  A HAWQ segment within a segment host is an HDFS client that runs on a DataNode
•  A single segment host/DataNode runs multiple segments
•  Segment = the basic unit of parallelism
   –  Multiple segments work together to form a single parallel query processing system
•  Operations (scans, joins, aggregations, sorts, etc.) execute in parallel across all segments simultaneously
•  libhdfs3 (rewritten by Pivotal) delivers faster HDFS read/write speeds
HAWQ Interconnect: Performance and Scalability
•  Inter-process communication between segments
   –  Standard Ethernet switching fabric
•  Uses UDP (User Datagram Protocol)
   –  Improves performance and scalability
•  Performs the additional packet verification and checking that UDP itself does not
   –  Reliability equivalent to TCP
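The idea behind the interconnect can be sketched with plain sockets: send over UDP for speed, but have the application prepend a sequence number so the receiver can detect drops and reordering. This is a minimal illustration on the loopback interface, not HAWQ's actual wire protocol.

```python
import socket
import struct

# Receiver: plain UDP socket on any free loopback port.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))
recv.settimeout(5)
addr = recv.getsockname()

# Sender: prepend a 4-byte big-endian sequence number to each datagram,
# playing the role of the application-level checking UDP omits.
send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for seq in range(3):
    send.sendto(struct.pack("!I", seq) + b"tuple-batch", addr)

received = []
for _ in range(3):
    pkt, _ = recv.recvfrom(1500)
    received.append(struct.unpack("!I", pkt[:4])[0])

# With the sequence numbers, the receiver can verify that nothing
# was dropped or duplicated and can reorder if needed.
send.close()
recv.close()
```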
HAWQ Dynamic Pipelining™
•  Differentiating competitive advantage
•  Core execution technology from GPDB
•  Parallel data flow using the high-speed UDP interconnect
•  No materialization of intermediate results
   –  Unlike MapReduce
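The "no materialization" point above can be made concrete with Python generators: each operator streams tuples to the next instead of writing its full output somewhere, the way a MapReduce stage would. The operator names are illustrative only.

```python
# Toy pipelined execution: scan -> filter -> project, chained as
# generators so no stage materializes its intermediate result.

def scan(rows):
    for r in rows:                    # produce tuples one at a time
        yield r

def filter_city(rows, city):
    for r in rows:
        if r["city"] == city:
            yield r

def project(rows, *cols):
    for r in rows:
        yield tuple(r[c] for c in cols)

rows = [{"city": "SF", "price": 5},
        {"city": "NY", "price": 7},
        {"city": "SF", "price": 6}]

# Tuples flow through the whole chain; only the final result is collected.
result = list(project(filter_city(scan(rows), "SF"), "price"))
```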
HAWQ Parser
•  Enforces syntax and semantics
•  Converts a SQL query into a parse-tree data structure describing the details of the query
HAWQ Parallel Query Optimizer

[Diagram: an example parallel plan over lineitem, orders, customer, and nation, with operators Gather Motion, Sort, HashAggregate, HashJoin, Redistribute Motion, HashJoin, Seq Scan on lineitem, Hash, Seq Scan on orders, Hash, HashJoin, Seq Scan on customer, Hash, Broadcast Motion, Seq Scan on nation.]
HAWQ Dispatch and Query Executor
1.  Dispatch communicates the query plan to the segments
2.  The Query Executor executes the physical steps in the plan

[Diagram: the same plan (Scan Bars b, Scan Sells s, Filter b.city = 'San Francisco', HashJoin b.name = s.bar, Project s.beer, s.price, Redistribute Motion on b.name, Gather Motion) runs on every segment.]
Pivotal Query Optimizer (PQO)
For HAWQ and Greenplum Database: turns a SQL query into an execution plan.
•  First cost-based optimizer for big data
•  Applies all possible optimizations at the same time
•  New extensible code base
•  Rapid adoption of emerging technologies
HAWQ Transactions
•  DataNodes in HDFS do not know what is visible
   –  They have no idea what data they hold; visibility is defined by the NameNode
•  Likewise, segment nodes do not know what is visible
   –  Visibility is defined by the HAWQ Master
•  No distributed transaction management
   –  No UPDATE or DELETE
•  TRUNCATE is implemented to support rollback of failed transactions
•  Transaction logs exist only on the HAWQ Master
   –  For inserts, a single-phase commit is performed on the HAWQ Master
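The append-only model above can be sketched simply: an insert appends rows, and a failed transaction is rolled back by truncating the data back to its pre-transaction length, which is why UPDATE and DELETE are not needed for recovery. This is a simplification, not HAWQ's on-disk format.

```python
# Toy append-only segment: inserts append, rollback truncates.

class AppendOnlySegment:
    def __init__(self):
        self.rows = []

    def insert_transaction(self, new_rows, fail=False):
        mark = len(self.rows)        # remember pre-transaction length
        self.rows.extend(new_rows)   # append-only insert
        if fail:
            del self.rows[mark:]     # "truncate" back: rollback
            return False
        return True                  # single-phase commit

seg = AppendOnlySegment()
seg.insert_transaction([1, 2])               # commits
seg.insert_transaction([3, 4], fail=True)    # rolled back by truncation
```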
HAWQ Fault Tolerance
•  Fault tolerance is provided by HDFS replication
•  The replication factor is decided when creating the HDFS-backed filespace and tablespace
   –  Default is 3
•  When a segment server goes down, its shard is accessible from another node
   –  No data is stored for mirrors
•  Segments are recovered with the regular gprecoverseg utility
HAWQ Availability
•  Replication is embedded in HDFS, so GPDB file replication is not needed
•  When a segment fails, its shard is accessible from another node: the HDFS NameNode locates the DataNode to which the shard was replicated

[Diagram: a master host with the HDFS NameNode serving segments 1-3 on HDFS DataNodes.]
Pivotal HAWQ: Polymorphic AO Storage
The same flexible row/column-based table and partition layout as GPDB optimizes both performance and storage space.
•  Columnar storage is well suited to scanning a large percentage of the data
•  Row storage excels at small lookups
•  Most systems need to do both
•  Row and column orientation can be mixed within a table or database
•  Both types can be dramatically more efficient with compression
•  Compression is definable column by column:
   –  Blockwise: gzip (levels 1-9) and QuickLZ
   –  Streamwise: Run-Length Encoding (RLE, levels 1-4)
•  Flexible indexing and partitioning enable more granular control and true ILM

[Diagram: table 'SALES' partitioned by month (Mar-Nov); row-oriented partitions for small scans, column-oriented partitions for full scans.]
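The row-vs-column trade-off above can be illustrated with a toy layout: in a columnar store, a full-column aggregate touches only that column's block, while a row store must read every record. Block counts here are illustrative, not real I/O measurements.

```python
# Toy row vs. columnar layouts for the same four records.
rows = [{"id": i, "amount": i * 10, "region": "KR"} for i in range(4)]

# Row orientation: one record per "block" (good for point lookups).
row_store = [(r["id"], r["amount"], r["region"]) for r in rows]

# Column orientation: one column per "block" (good for full scans).
col_store = {
    "id":     [r["id"] for r in rows],
    "amount": [r["amount"] for r in rows],
    "region": [r["region"] for r in rows],
}

# SELECT SUM(amount): columnar reads 1 column block,
# while the row store reads all 4 record blocks.
total = sum(col_store["amount"])
blocks_scanned_columnar = 1
blocks_scanned_row = len(row_store)
```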
HAWQ Master Mirroring
•  The standby master node runs on separate hardware from the master node
•  Transaction logs are replicated in real time to guarantee data consistency (warm standby); on master failure, the standby takes over the role
•  System catalogs are synchronized

[Diagram: the master host's global system catalog and transaction logs are synchronized to the standby master host.]
HAWQ Storage Options
•  A table in HAWQ can be:
   –  Distributed
   –  Partitioned (range/list)
   –  Polymorphic storage
   –  Row- or column-oriented
   –  Compressed (zlib, QuickLZ, RLE, ...)

[Diagram: table A distributed across segments SEG-1..SEG-N; each partition can independently be row or columnar, compressed, and sub-partitioned.]
Parallel Loading/Unloading at Scale
•  gpload, gpfdist, external tables: flat files, CSV, delimited, ...
•  PXF (native Hadoop files): HDFS flat files (CSV, delimited, ...), Hive, HBase (with predicate push-down), Avro, RCFile, SeqFile; open extendable API with connectors available for Accumulo, JSON, ...
•  Spring XD: existing RDBMS systems; web tables, JSON, XML, HTML; executing scripts; streaming or batch mode; Java development framework
Pivotal eXtension Framework (PXF)
•  Provides an external table interface for querying the various data stores in the Hadoop ecosystem
•  Load data from Hadoop into HAWQ, or query it in place
•  Enables combining HAWQ data and Hadoop data in a single query
•  Supports connectors for HDFS, HBase, and Hive
•  Provides an extensible framework API to enable custom connectors
•  Additional connectors available on GitHub: JSON, Accumulo, S3, ...
•  HAWQ MapReduce RecordReader

Industry differentiators:
•  Low latency on large data sets
•  Extensible and customizable
•  Considers the cost model of federated sources
PXF Features
•  Predicate push-down of filter conditions when connecting to HBase and Hive
•  Hive table partition exclusion
•  Statistics collection on HDFS data for optimized query plans
•  An extensible framework Java API makes custom development for other data sources (e.g., Oracle DB) and formats easy
•  HDFS block locality to the HAWQ processing segment
•  Fast parallel optimizer (ORCA)
•  Example uses:
   1) Join a HAWQ dimension table with an HBase fact table
   2) Quickly load HDFS, Hive, and HBase data into HAWQ for unified management
   3) Use as a federated query engine over data in various formats and stores, without materialization
PXF External Table Examples

•  Simple HDFS text:

CREATE EXTERNAL TABLE jan_2012_sales (
    id int,
    total int,
    comments varchar
)
LOCATION ('pxf://10.76.72.26:50070/sales/2012/01/items_*.csv?profile=HdfsTextSimple')
FORMAT 'TEXT' (delimiter ',');

•  HBase table:

CREATE EXTERNAL TABLE hbase_sales (
    recordkey bytea,
    "cf1:saleid" int,
    "cf8:comments" varchar
)
LOCATION ('pxf://10.76.72.26:50070/sales?profile=HBase')
FORMAT 'custom' (formatter='gpxfwritable_import');

•  Export to HDFS using writable PXF:

CREATE WRITABLE EXTERNAL TABLE ...
LOCATION ('pxf://<host:port>/sales?profile=HdfsTextSimple&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
FORMAT 'text' (delimiter ',');
Data Distribution
•  Data is distributed by a specific column, a column set, or randomly
•  Tables distributed similarly are co-located
•  The distribution scheme is modifiable through ALTER TABLE

Advantages:
•  Co-located joins
•  No data movement on joins or aggregates
•  Improved performance on complex queries
•  Query engine optimization

[Diagram: tables A (X=1..5) and B (Y=1..3) distributed across DataNodes DN1-DN3, serving queries such as SELECT X FROM A, B WHERE A.X = B.Y and SELECT SUM(X) FROM A GROUP BY A.X.]
HAWQ Distribution vs. Hive Partitioning
•  In Hive, partitions are organized into folders
•  Folders are spread across the entire HDFS cluster
•  Similar data is not co-located; data location is lost
•  Data movement is required for large joins and aggregates
•  Hive partitions help only with sequential scans of the original table

[Diagram: table A (X=1..5) and table B (Y=1..3) stored as folders spread across DN1-DN3 on HDFS, so there are no co-located joins and no co-located aggregates.]
HAWQ Resource Management
•  Effective mixed-workload management through query prioritization
•  Queue-based control over the number of active queries, memory, CPU, and disk I/O
•  Flexible SLAs and dynamic queue reconfiguration (weekly/daily/hourly)
   –  Limit on the number of concurrent queries
   –  Max-cost and min-cost thresholds
   –  Query priorities
   –  Pre-emptive rejection of queries exceeding the max-cost threshold
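The queue knobs listed above can be sketched as a tiny admission controller: a cap on active statements and a max-cost threshold that rejects oversized queries outright. Purely illustrative; not the HAWQ resource queue implementation.

```python
# Toy resource queue mirroring the controls described on this slide.

class ResourceQueue:
    def __init__(self, active_limit, max_cost):
        self.active_limit = active_limit   # max concurrent statements
        self.max_cost = max_cost           # pre-filter threshold
        self.active = 0

    def submit(self, cost):
        if cost > self.max_cost:
            return "rejected"              # blocked before it ever runs
        if self.active >= self.active_limit:
            return "queued"                # waits for a running slot
        self.active += 1
        return "running"

q = ResourceQueue(active_limit=2, max_cost=100.0)
statuses = [q.submit(10), q.submit(20), q.submit(30), q.submit(500)]
```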
Machine Learning on HDFS Using HAWQ
MADlib Advantages
•  Better parallelism
   –  Algorithms designed to leverage MPP and Hadoop architecture
•  Better scalability
   –  Algorithms scale as your data set scales
•  Better predictive accuracy
   –  Can use all data, not a sample
•  Open source
   –  Available for customization and optimization by the user if desired
Functions (Oct 2014)

Predictive Modeling Library
•  Linear systems: sparse and dense solvers; linear algebra
•  Matrix factorization: Singular Value Decomposition (SVD); low rank
•  Generalized linear models: linear regression; logistic regression; multinomial logistic regression; Cox proportional hazards regression; elastic net regularization; robust variance (Huber-White), clustered variance, marginal effects
•  Other machine learning algorithms: Principal Component Analysis (PCA); association rules (Apriori); topic modeling (parallel LDA); decision trees; random forest; Support Vector Machines (SVM); Conditional Random Fields (CRF); clustering (k-means); cross validation; naive Bayes

Descriptive Statistics
•  Sketch-based estimators: CountMin (Cormode-Muthukrishnan); FM (Flajolet-Martin); MFV (most frequent values)
•  Correlation; summary

Inferential Statistics
•  Hypothesis tests
•  Time series: ARIMA

Support Modules
•  Array operations; sparse vectors; random sampling; probability functions; data preparation; PMML export; conjugate gradient
Calling MADlib Functions: Fast Training and Scoring

SELECT madlib.linregr_train(
    'houses',                       -- table containing training data
    'houses_linregr',               -- table in which to save results
    'price',                        -- column containing the dependent variable
    'ARRAY[1, tax, bath, size]');   -- features included in the model

•  MADlib allows users to easily create models without moving data out of the system
   –  Model generation
   –  Model validation
   –  Scoring (evaluation of) new data
•  All the data can be used in one model
•  Built-in functionality to create multiple smaller models (e.g., classification grouped by feature)
•  Open source lets you tweak and extend methods, or build your own
UDF (PL/X): Use a Variety of Analytics Languages
•  User-defined functions in R, Python, Java, C, Perl, and PL/pgSQL
•  Python extensions such as NumPy, NLTK, scikit-learn, and SciPy
•  Fast analytics performance using the data parallelism of the MPP architecture

[Diagram: SQL submitted to the master host (with standby master) fans out over the interconnect to segments on every segment host.]
PivotalR: Bringing MADlib and HAWQ to a Familiar R Interface
•  Challenge: harness the familiarity of R's interface together with the performance and scalability benefits of in-database analytics
•  Simple solution: translate R code into SQL

PivotalR:
d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ tax
                            + bath
                            + size, data=d)

Generated SQL:
SELECT madlib.linregr_train( 'houses',
                             'houses_linregr',
                             'price',
                             'ARRAY[1, tax, bath, size]');

https://github.com/pivotalsoftware/PivotalR
PivotalR Design Overview
1.  R is translated to SQL by PivotalR (no data on the client)
2.  The SQL is sent for execution via RPostgreSQL
3.  Computation results are returned; the data lives in the database/Hadoop with MADlib
•  Call MADlib's in-database machine learning functions directly from R
•  Syntax is analogous to native R functions
•  Data doesn't need to leave the database
•  All heavy lifting, including model estimation and computation, is done in the database
Security & Authorization
•  Role-based security
•  Users and groups
•  Access granularity on connections, databases, schemas, tables, views, ...
•  Inheritance: inherit security privileges from other users or groups for easy administration
•  Assign groups and users to resource queues
•  Secure connections between HAWQ processes
•  Built-in column encryption (pgcrypto)
HAWQ Client Program: pgAdmin
A client tool for HAWQ and GPDB.
HAWQ Client Program: Aginity Workbench
Aginity Workbench for EMC Greenplum
•  A client program for DBAs and developers using HAWQ / Greenplum database
•  Supports Korean and many other languages
Pivotal HAWQ
New/Enhanced Features & Roadmap (Apr 2015)
PHD 3.0 & HAWQ 1.3.x (H1 2015)
All items below ship in PHD 3.0 / HAWQ 1.3.
•  Management: ODP core (Hadoop based on the industry-standard ODP core: HDFS, YARN, MapReduce, Ambari)
•  Management: Ambari adoption (stronger PHD management, operations, monitoring with Ganglia, and alerting with Nagios)
•  Security: improved management (Ranger, Ambari), authentication (Kerberos, Knox), authorization (ACLs, AD/LDAP, Ranger), auditing (Ranger), and data protection (encryption)
•  Ecosystem: latest versions and ecosystem support (based on Hadoop 2.6; includes the Spark stack; adds Knox and Ranger)
PHD 3.x & HAWQ 2.x Roadmap (H2 2015)
All items below target HAWQ 2.x.
•  Performance: materialized views (in-memory based); improved multilevel partitioning performance
•  Management: hierarchical resource management; YARN integration (HAWQ's resource management plugs into YARN so system resources are managed centrally by YARN)
•  Features: integrated management of HAWQ and HCatalog
•  Compatibility: EMC Isilon support (Hadoop clusters on scale-out NAS storage; effective for HDFS deployments of 100 TB and above)
What's in HAWQ 1.3
•  New Ambari installation experience
•  Enhancements to the query optimizer and query execution
•  Incremental ANALYZE on tables
•  HAWQ 1.3.0.1 support for HDP 2.4.2.2
•  libhdfs3 updates and HDFS support for the truncate patch
•  HAWQ 1.3.0.2 support for SLES
•  Documentation enhancements on administration, etc.
HAWQ Roadmap
•  First half 2015 (1.3.x):
   –  Ambari 2.0: advanced monitoring & alerting, StackAdvisor
   –  Migration from the 1.2 line to 1.3
   –  Isilon DA support
•  Second half 2015 (2.x):
   –  Isilon support
   –  Elastic runtime (NxM): performance, higher concurrency, cloud optimized
   –  Advanced resource manager: hierarchical, highly multi-tenant, YARN
   –  HCatalog integration
   –  AWS enablement
   –  Improved support for multilevel partitioning
   –  Open sourcing into the ASF
Appendix: HAWQ vs Hive
Advantages over Apache Hive (Apr 2015)
HAWQ Advantage 1: Performance
•  The MPP parallel execution engine delivers fast analytics and interactive queries
•  A cost-based optimizer built for big data: PQO (Pivotal Query Optimizer)
•  Direct query processing, rather than translation to MapReduce as in Hive, supports better query concurrency
•  Handles more user queries with fewer server resources
HAWQ Advantage 2: 100% ANSI SQL Support
•  100% ANSI SQL syntax support
•  Supports complex joins and analytic queries
•  Existing BI tools work without changes
•  No learning curve, enabling faster development
•  TPC-DS queries: HAWQ runs all 111 queries without modification (Stinger: 20, Impala: 31, Presto: 12)
HAWQ Advantage 3: Rich Analytics
•  Provides MADlib, PivotalR, PL/R, PL/Python, PL/Java, and other open-source analytics and statistics tools for data scientists
•  Analyzes the full data set in parallel, not a sample, for fast results
•  In-database analytics means no data movement

[Diagram: SQL submitted to the master host (with standby master) fans out over the interconnect to segments on every segment host.]
HAWQ Advantage 4: Extensibility to Many Sources
•  PXF (Pivotal eXtension Framework)
•  Federated queries joining HAWQ with various source data (HDFS, Hive, Avro, HBase, ...)
•  An extensible API framework enables connections to even more data sources (e.g., Oracle, DB2, JSON, ...)
•  Parallel processing of external tables for fast data retrieval
HAWQ Advantage 5: Integrated Monitoring and Management
•  PHD 3.0 integrates with open-source Ambari
•  Easy installation and management tools
•  Monitoring integrated with other Hadoop products
•  The shared ODP (Open Data Platform) core provides compatibility: runs unmodified on the Hadoop distributions of 12+ partner vendors
Pivotal Technologies for Different Workloads
•  Batch SQL (minutes to hours; I/O-heavy; less complex): Hive, HAWQ
•  Interactive SQL (seconds to minutes; joins; extensibility): HAWQ
•  OLAP SQL (seconds; very complex; BI tools): HAWQ
•  Streaming SQL (in-memory; small data sets): SparkSQL, Spring XD
Summary: Apache Hive vs. Pivotal HAWQ

•  Complex join support: Hive lacks it; HAWQ handles even complex join conditions quickly
•  Compatibility with existing BI tools: Hive is incompatible with many tools, increasing investment; HAWQ guarantees compatibility, so no additional investment is needed
•  Interactive queries: Hive has performance issues and is optimized only for batch jobs; HAWQ runs fast interactive queries over large data sets
•  Ad-hoc queries: Hive has performance issues; HAWQ ships a cost-based optimizer tuned for ad-hoc queries
•  ANSI SQL: Hive's limited ANSI SQL support causes compatibility problems; HAWQ offers 100% SQL compliance
•  Query concurrency: Hive struggles to process queries concurrently; HAWQ provides query concurrency for mixed workloads
Summary: HAWQ Business Benefits

•  Rich, compatible SQL dialect: powerful and portable SQL apps; leverage large SQL-based ecosystems
•  TPC-DS compliance: applicable to more use cases, guaranteed compatibility with existing BI tools, and stable operations
•  Linear scalability with flexible, efficient join support: offload EDW workloads at very low cost
•  Deep analytics + machine learning: predictive/advanced learning use cases at scale
•  Data federation: query diverse external data in place, without moving it
•  High availability: critical workloads can be migrated from the EDW to Hadoop
•  Native Hadoop file format support: reduced ETL and data movement = lower costs
Spark–HAWQ Integration
Spark Approaches to Reading HAWQ Data
•  Spark JDBC (JdbcRDD, DBInputFormat)
•  Spark with HAWQInputFormat (AO, Parquet)
•  Shared Parquet storage
•  Apache Crunch on Spark (HAWQInputFormat2)

Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
Alexey Grishchenko
 
[2016 데이터 그랜드 컨퍼런스] 2 3(빅데이터). 엑셈 빅데이터 적용 사례 및 플랫폼 구현
[2016 데이터 그랜드 컨퍼런스] 2 3(빅데이터). 엑셈 빅데이터 적용 사례 및 플랫폼 구현[2016 데이터 그랜드 컨퍼런스] 2 3(빅데이터). 엑셈 빅데이터 적용 사례 및 플랫폼 구현
[2016 데이터 그랜드 컨퍼런스] 2 3(빅데이터). 엑셈 빅데이터 적용 사례 및 플랫폼 구현
K data
 
PXF BDAM 2016
PXF BDAM 2016PXF BDAM 2016
PXF BDAM 2016
Shivram Mani
 
gsoc_mentor for Shivram Mani
gsoc_mentor for Shivram Manigsoc_mentor for Shivram Mani
gsoc_mentor for Shivram ManiShivram Mani
 
Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16 Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16
PivotalOpenSourceHub
 
PXF HAWQ Unmanaged Data
PXF HAWQ Unmanaged DataPXF HAWQ Unmanaged Data
PXF HAWQ Unmanaged Data
Shivram Mani
 
Hawq Hcatalog Integration
Hawq Hcatalog IntegrationHawq Hcatalog Integration
Hawq Hcatalog Integration
Shivram Mani
 
Apache HAWQ : An Introduction
Apache HAWQ : An IntroductionApache HAWQ : An Introduction
Apache HAWQ : An Introduction
Sandeep Kunkunuru
 
DLAB company info and big data case studies
DLAB company info and big data case studiesDLAB company info and big data case studies
DLAB company info and big data case studies
DLAB
 
Enterprise Data Classification and Provenance
Enterprise Data Classification and ProvenanceEnterprise Data Classification and Provenance
Enterprise Data Classification and Provenance
DataWorks Summit/Hadoop Summit
 
Pivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ LaunchPivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ Launch
VMware Tanzu
 
Pivotal HAWQ - High Availability (2014)
Pivotal HAWQ - High Availability (2014)Pivotal HAWQ - High Availability (2014)
Pivotal HAWQ - High Availability (2014)
saravana krishnamurthy
 
Build & test Apache Hawq
Build & test Apache Hawq Build & test Apache Hawq
Build & test Apache Hawq
PivotalOpenSourceHub
 
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQMassively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
InMobi Technology
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to Apache
PivotalOpenSourceHub
 
Modernise your EDW - Data Lake
Modernise your EDW - Data LakeModernise your EDW - Data Lake
Modernise your EDW - Data Lake
DataWorks Summit/Hadoop Summit
 
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Artem Ervits
 
오픈소스 프로젝트 따라잡기_공개
오픈소스 프로젝트 따라잡기_공개오픈소스 프로젝트 따라잡기_공개
오픈소스 프로젝트 따라잡기_공개
Hyoungjun Kim
 
Phd tutorial hawq_v0.1
Phd tutorial hawq_v0.1Phd tutorial hawq_v0.1
Phd tutorial hawq_v0.1
seungdon Choi
 
SQL On Hadoop
SQL On HadoopSQL On Hadoop
SQL On Hadoop
Muhammad Ali
 

Viewers also liked (20)

Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
 
[2016 데이터 그랜드 컨퍼런스] 2 3(빅데이터). 엑셈 빅데이터 적용 사례 및 플랫폼 구현
[2016 데이터 그랜드 컨퍼런스] 2 3(빅데이터). 엑셈 빅데이터 적용 사례 및 플랫폼 구현[2016 데이터 그랜드 컨퍼런스] 2 3(빅데이터). 엑셈 빅데이터 적용 사례 및 플랫폼 구현
[2016 데이터 그랜드 컨퍼런스] 2 3(빅데이터). 엑셈 빅데이터 적용 사례 및 플랫폼 구현
 
PXF BDAM 2016
PXF BDAM 2016PXF BDAM 2016
PXF BDAM 2016
 
gsoc_mentor for Shivram Mani
gsoc_mentor for Shivram Manigsoc_mentor for Shivram Mani
gsoc_mentor for Shivram Mani
 
Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16 Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16
 
PXF HAWQ Unmanaged Data
PXF HAWQ Unmanaged DataPXF HAWQ Unmanaged Data
PXF HAWQ Unmanaged Data
 
Hawq Hcatalog Integration
Hawq Hcatalog IntegrationHawq Hcatalog Integration
Hawq Hcatalog Integration
 
Apache HAWQ : An Introduction
Apache HAWQ : An IntroductionApache HAWQ : An Introduction
Apache HAWQ : An Introduction
 
DLAB company info and big data case studies
DLAB company info and big data case studiesDLAB company info and big data case studies
DLAB company info and big data case studies
 
Enterprise Data Classification and Provenance
Enterprise Data Classification and ProvenanceEnterprise Data Classification and Provenance
Enterprise Data Classification and Provenance
 
Pivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ LaunchPivotal Strata NYC 2015 Apache HAWQ Launch
Pivotal Strata NYC 2015 Apache HAWQ Launch
 
Pivotal HAWQ - High Availability (2014)
Pivotal HAWQ - High Availability (2014)Pivotal HAWQ - High Availability (2014)
Pivotal HAWQ - High Availability (2014)
 
Build & test Apache Hawq
Build & test Apache Hawq Build & test Apache Hawq
Build & test Apache Hawq
 
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQMassively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to Apache
 
Modernise your EDW - Data Lake
Modernise your EDW - Data LakeModernise your EDW - Data Lake
Modernise your EDW - Data Lake
 
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
 
오픈소스 프로젝트 따라잡기_공개
오픈소스 프로젝트 따라잡기_공개오픈소스 프로젝트 따라잡기_공개
오픈소스 프로젝트 따라잡기_공개
 
Phd tutorial hawq_v0.1
Phd tutorial hawq_v0.1Phd tutorial hawq_v0.1
Phd tutorial hawq_v0.1
 
SQL On Hadoop
SQL On HadoopSQL On Hadoop
SQL On Hadoop
 

Similar to Pivotal HAWQ 소개

VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld
 
Hawq meets Hive - DataWorks San Jose 2017
Hawq meets Hive - DataWorks San Jose 2017Hawq meets Hive - DataWorks San Jose 2017
Hawq meets Hive - DataWorks San Jose 2017
Alex Diachenko
 
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
EMC
 
5. pivotal hd 2013
5. pivotal hd 20135. pivotal hd 2013
5. pivotal hd 2013
Chiou-Nan Chen
 
HDFS presented by VIJAY
HDFS presented by VIJAYHDFS presented by VIJAY
HDFS presented by VIJAYthevijayps
 
SQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopSQL and Machine Learning on Hadoop
SQL and Machine Learning on Hadoop
Mukund Babbar
 
Hawq wp 042313_final
Hawq wp 042313_finalHawq wp 042313_final
Hawq wp 042313_finalEMC
 
Pivotal hawq internals
Pivotal hawq internalsPivotal hawq internals
Pivotal hawq internals
Alexey Grishchenko
 
Vmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanVmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps Ironfan
Jim Kaskade
 
Hp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar SlidesHp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar Slides
Hortonworks
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
sudhakara st
 
Chicago Data Summit: Geo-based Content Processing Using HBase
Chicago Data Summit: Geo-based Content Processing Using HBaseChicago Data Summit: Geo-based Content Processing Using HBase
Chicago Data Summit: Geo-based Content Processing Using HBase
Cloudera, Inc.
 
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
Rajit Saha
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiency
Alluxio, Inc.
 
Data Orchestration Platform for the Cloud
Data Orchestration Platform for the CloudData Orchestration Platform for the Cloud
Data Orchestration Platform for the Cloud
Alluxio, Inc.
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Hortonworks
 
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon
 
Big Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big DataBig Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big DataPentaho
 
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
NoSQLmatters
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configurationprabakaranbrick
 

Similar to Pivotal HAWQ 소개 (20)

VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
 
Hawq meets Hive - DataWorks San Jose 2017
Hawq meets Hive - DataWorks San Jose 2017Hawq meets Hive - DataWorks San Jose 2017
Hawq meets Hive - DataWorks San Jose 2017
 
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
 
5. pivotal hd 2013
5. pivotal hd 20135. pivotal hd 2013
5. pivotal hd 2013
 
HDFS presented by VIJAY
HDFS presented by VIJAYHDFS presented by VIJAY
HDFS presented by VIJAY
 
SQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopSQL and Machine Learning on Hadoop
SQL and Machine Learning on Hadoop
 
Hawq wp 042313_final
Hawq wp 042313_finalHawq wp 042313_final
Hawq wp 042313_final
 
Pivotal hawq internals
Pivotal hawq internalsPivotal hawq internals
Pivotal hawq internals
 
Vmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanVmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps Ironfan
 
Hp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar SlidesHp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar Slides
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Chicago Data Summit: Geo-based Content Processing Using HBase
Chicago Data Summit: Geo-based Content Processing Using HBaseChicago Data Summit: Geo-based Content Processing Using HBase
Chicago Data Summit: Geo-based Content Processing Using HBase
 
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiency
 
Data Orchestration Platform for the Cloud
Data Orchestration Platform for the CloudData Orchestration Platform for the Cloud
Data Orchestration Platform for the Cloud
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
 
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
 
Big Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big DataBig Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big Data
 
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 

Recently uploaded

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
UiPathCommunity
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
Globus
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 

Recently uploaded (20)

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 

Introduction to Pivotal HAWQ

  • 1.
  • 2. © 2015 Pivotal Software, Inc. All rights reserved. Introduction to Pivotal HAWQ. Seungdon Choi, Field Engineer, Pivotal Korea
  • 3. Agenda: Overview; Architecture; Machine Learning using HAWQ; Roadmap; Appendix: HAWQ vs Hive
  • 4. Pivotal HAWQ is an enterprise platform that provides the fewest barriers, the lowest risk, and the most cost-effective and fastest way to enter into big data analytics on Hadoop.
  • 5. So what exactly is HAWQ? Combining SQL with Hadoop is key for analytics, and SQL remains the #1 choice for data science.
    - Massively Parallel Processing (MPP) RDBMS on Hadoop
    - ANSI SQL on Hadoop
    - Extremely high performance for analytics (unlike Hive)
    - Stores all data directly on HDFS
    - Open source
    - Runs on ODP-core-based Hadoop distributions (PHD, HDP, IBM, ...)
  • 6. Why SQL on Hadoop?
    1. Problems with MapReduce:
      1) Limits of MapReduce: slow performance, heavy dependence on developer skill, and bug-prone code
      2) Steep learning curve
      3) Compatibility problems with legacy systems and applications
      4) Ad-hoc query performance forces parallel use of a separate DBMS
    2. Reasons to use SQL on Hadoop:
      1) ANSI SQL support: easy to integrate with or replace existing systems, shorter development time, and a low learning curve (familiar to existing developers)
      2) High processing performance: overcomes the limits of MapReduce
      3) Low response time
      4) Compatible with legacy systems and applications (BI tools such as SAS and Tableau can be reused)
      5) Interactive queries: higher data-analysis productivity → faster decision making
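As a sketch of the interactive, ANSI-SQL style of analysis the slide describes, the query below could be run from psql or any JDBC/ODBC client against HAWQ; the table and column names are hypothetical, not from the deck:

```sql
-- Ad-hoc analytic query in plain ANSI SQL; no MapReduce code is written.
-- The sales table and its columns are illustrative placeholders.
SELECT region,
       date_trunc('month', sold_at) AS month,
       count(*)                     AS orders,
       sum(amount)                  AS revenue
FROM   sales
WHERE  sold_at >= date '2015-01-01'
GROUP  BY region, date_trunc('month', sold_at)
ORDER  BY region, month;
```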
  • 7. Pivotal HAWQ
    - Uses the Greenplum database engine, proven in the enterprise market for over 15 years: partitioning, compression, and resource management
    - 100% ANSI SQL compliant: existing BI and SAS tools can be reused
    - Real-time queries: direct access to distributed data without MapReduce
    - PXF external tables integrate HDFS, HBase, Hive, and other data sources
    - Improved libhdfs (Java → C) gives faster access than stock HDFS
    - Supports analytics packages such as PL/R and MADlib
    - Security, user privilege management, and encryption
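One common use of the PXF integration mentioned above is a read-only external table over files already in HDFS. A minimal sketch, assuming HAWQ 1.x PXF syntax; the host, port, path, and columns are placeholders:

```sql
-- Read-only external table over tab-delimited text in HDFS via PXF.
-- 'namenode', port 51200, and the path are placeholder values.
CREATE EXTERNAL TABLE ext_weblogs (
    ts      timestamp,
    userid  int,
    url     text
)
LOCATION ('pxf://namenode:51200/data/weblogs?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER E'\t');

-- Queried like any other table, and joinable with HAWQ-managed tables:
SELECT userid, count(*) FROM ext_weblogs GROUP BY userid;
```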
  • 8. HAWQ Benefits
    - Out-of-the-box SQL for Hadoop: run analytics in SQL alone, without the MapReduce programming learning curve
    - PXF external tables provide SQL access to Hadoop: a unified interface to HDFS, HBase, Hive, and other data sources
    - Broad data access, integration, and portability
    - Performance and scalability; run big data projects the way you would build a data warehouse:
      - Parallel everything
      - Dynamic pipelining
      - High-speed interconnect
      - Optimized HDFS access with libhdfs3
      - Co-located joins and data locality
      - Partition elimination
      - Higher cluster utilization
      - Concurrency control
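Partition elimination, listed above, can be sketched with a range-partitioned table in Greenplum/HAWQ DDL; the table and column names are hypothetical:

```sql
-- Monthly range partitions; a WHERE clause on the partition key lets the
-- planner scan only the matching partitions (partition elimination).
CREATE TABLE clicks (
    click_date date,
    userid     int,
    url        text
)
DISTRIBUTED BY (userid)
PARTITION BY RANGE (click_date)
(
    START (date '2015-01-01') INCLUSIVE
    END   (date '2016-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);

-- Only the March 2015 partition should be scanned:
SELECT count(*) FROM clicks
WHERE  click_date >= date '2015-03-01'
AND    click_date <  date '2015-04-01';
```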
  • 9. Architecture
  • 10. Basic Architecture (diagram): a HAWQ Master (Parser, Query Optimizer, Dispatch, Local TM, Catalog, Execution Coordination) with a HAWQ Standby Master, alongside the HDFS NameNode and Secondary NameNode; multiple Segment Hosts, each running a Query Executor, PXF, one or more Segments, and an HDFS DataNode with local temp storage; all connected by the Interconnect.
  • 11. HAWQ Master
    - Receives SQL requests from clients, parses them, dispatches work to each segment node, then collects the results and returns them to the client
    - Holds no user data; keeps only system metadata in the Global System Catalog
    - A Standby Master (warm standby) is configured to take over on hardware failure
    - In production, typically installed on a separate server from the Hadoop NameNode
  • 12. HAWQ Segments
    - A HAWQ segment within a Segment Host is an HDFS client that runs on a DataNode
    - Multiple segments run on a single Segment Host/DataNode
    - Segment = the basic unit of parallelism: multiple segments work together to form a single parallel query processing system
    - Operations (scans, joins, aggregations, sorts, etc.) execute in parallel across all segments simultaneously
    - libhdfs3 (rewritten by Pivotal) gives faster HDFS read/write speed
  • 13. HAWQ Interconnect: performance and scalability
    - Inter-process communication between segments over a standard Ethernet switching fabric
    - Uses UDP (User Datagram Protocol) for better performance and scalability
    - Adds packet verification and checking that UDP does not perform, for reliability equivalent to TCP
  • 14. HAWQ Dynamic Pipelining
    - Differentiating competitive advantage; core execution technology from GPDB
    - Parallel data flow using the high-speed UDP interconnect
    - No materialization of intermediate results (unlike MapReduce)
  • 15. HAWQ Parser
  • Enforces syntax and semantics
  • Converts a SQL query into a parse tree data structure describing the details of the query
  [Diagram: clients submit SQL over JDBC to the HAWQ Master (Parser, Query Optimizer, Dispatch); the plan fans out over the interconnect to segment hosts and their DataNodes]
  • 16. HAWQ Parallel Query Optimizer
  [Diagram: an example parallel plan — Gather Motion, Sort, HashAggregate, and HashJoins fed by Redistribute and Broadcast Motions over sequential scans of lineitem, orders, customer, and nation]
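The motion nodes pictured in the plan can be inspected directly with EXPLAIN. A minimal sketch, assuming TPC-H-style tables already exist in HAWQ (table and column names are illustrative):

```sql
-- EXPLAIN shows how the optimizer parallelizes a query:
-- Redistribute/Broadcast Motions move rows between segments,
-- and a final Gather Motion collects results on the master.
EXPLAIN
SELECT c.c_name, SUM(o.o_totalprice)
FROM   customer c
JOIN   orders   o ON o.o_custkey = c.c_custkey
GROUP  BY c.c_name;
```

Reading the plan bottom-up shows which operators run on every segment and where data is exchanged over the interconnect.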
  • 17. HAWQ Dispatch and Query Executor
  1. Dispatch communicates the query plan to the segments
  2. Query Executor executes the physical steps in the plan
  [Diagram: the same plan slice runs on every segment — Scan Bars b and Scan Sells s, HashJoin on b.name = s.bar, Filter b.city = 'San Francisco', Project s.beer, s.price, Motion Redist(b.name), Motion Gather]
  • 18. Pivotal Query Optimizer (PQO)
  For HAWQ and Greenplum Database — turns a SQL query into an execution plan
  • First cost-based optimizer for big data
  • Applies all possible optimizations at the same time
  • New extensible code base
  • Rapid adoption of emerging technologies
  PIVOTAL VALUE-ADDED FUNCTIONALITY
  • 19. HAWQ Transactions
  • DataNodes in HDFS do not know what is visible
  – They have no idea what data they hold; visibility is defined by the NameNode
  • Likewise, segment nodes do not know what is visible
  – Visibility is defined by the HAWQ Master
  • No distributed transaction management
  – No UPDATE or DELETE
  • TRUNCATE is implemented to support rollback of failed transactions
  • Transaction logs are present only on the HAWQ Master
  – For inserts, a single-phase commit is performed on the HAWQ Master
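Because UPDATE and DELETE are unavailable, changes are typically applied by rebuilding a table. A minimal sketch of this append-only pattern (table and column names are illustrative, not from the deck):

```sql
-- HAWQ tables are append-only: rewrite changed rows into a new
-- table, then swap it in place of the original.
CREATE TABLE sales_new AS
  SELECT id, amount * 1.1 AS amount   -- the "update"
  FROM   sales;

DROP TABLE sales;
ALTER TABLE sales_new RENAME TO sales;
```

The whole rewrite is a single INSERT-style load, so it fits the single-phase commit model described above.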
  • 20. HAWQ Fault Tolerance
  • Fault tolerance is provided by HDFS replication
  • The replication factor is decided when creating the HDFS-backed filespace and tablespace
  – Default is 3
  • When a segment server goes down, its shard is accessible from another node
  – No data is stored for mirrors
  • Segments are recovered through the regular gprecoverseg utility
  • 21. HAWQ Availability
  • Replication is embedded in HDFS, so GPDB file replication is not needed
  • When a segment fails, the shard is accessible from another node: the HDFS NameNode redirects to the DataNode where the shard was replicated
  [Diagram: Master Host with HDFS NameNode; Segments 1–3 on HDFS DataNodes]
  • 22. Pivotal HAWQ – Polymorphic AO Storage
  As in GPDB, flexible row/column-based tables and partitions optimize performance and storage space.
  • Columnar storage is well suited to scanning a large percentage of the data
  • Row storage excels at small lookups
  • Most systems need to do both
  • Row and column orientation can be mixed within a table or database
  • Both types can be dramatically more efficient with compression
  • Compression is definable column by column:
  – Blockwise: gzip (levels 1–9) and QuickLZ
  – Streamwise: Run-Length Encoding (RLE, levels 1–4)
  • Flexible indexing and partitioning enable more granular control and true ILM
  [Diagram: TABLE 'SALES' partitioned Mar–Nov — column-oriented partitions for full scans, row-oriented partitions for small scans]
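Column-by-column compression is expressed with per-column ENCODING clauses. A minimal sketch, assuming GPDB/HAWQ append-only DDL (table and column names are illustrative):

```sql
-- Each column can carry its own compression settings:
-- RLE suits low-cardinality or sorted columns, zlib suits text.
CREATE TABLE events (
    ts    timestamp ENCODING (compresstype=rle_type),
    level int       ENCODING (compresstype=rle_type),
    body  text      ENCODING (compresstype=zlib, compresslevel=5)
)
WITH (appendonly=true, orientation=column);
```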
  • 23. HAWQ Master Mirroring
  • The Standby Master is configured on separate hardware from the Master node
  • Transaction logs are replicated in real time to guarantee consistency (warm standby); on Master node failure, the standby takes over the role
  • System catalogs are kept synchronized
  [Diagram: Master Host and Standby Host, each with HDFS NameNode, Global System Catalog, and Transaction Logs, linked by a synchronization process]
  • 24. HAWQ Storage Options
  • Tables in HAWQ can be:
  – Distributed
  – Partitioned (range/list)
  – Polymorphic storage
  – Row- or column-oriented
  – Compressed (zlib, QuickLZ, RLE, …)
  [Diagram: TABLE A spread across segments SEG-1…SEG-N (distribution); range partitions mixing row, columnar, and compressed sub-partitions (polymorphic storage)]
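The options above combine in a single CREATE TABLE. A hedged sketch of a distributed, range-partitioned table with per-partition polymorphic storage (names, dates, and limits are illustrative):

```sql
CREATE TABLE sales (
    id      int,
    sdate   date,
    amount  numeric
)
WITH (appendonly=true, orientation=column, compresstype=quicklz)
DISTRIBUTED BY (id)
PARTITION BY RANGE (sdate)
(
    -- recent partition row-oriented for small lookups
    PARTITION recent  START ('2015-10-01') END ('2015-12-01')
        WITH (appendonly=true, orientation=row),
    -- older data column-oriented and compressed for full scans
    PARTITION history START ('2015-01-01') END ('2015-10-01')
        WITH (appendonly=true, orientation=column,
              compresstype=zlib, compresslevel=5)
);
```

Each partition's WITH clause overrides the table-level defaults, which is what "polymorphic storage" means in practice.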
  • 25. Loading Data into HAWQ — parallel loading/unloading at scale
  • gpload, gpfdist, external tables: flat files, CSV, delimited; web tables, JSON, XML, HTML; executing scripts, …
  • PXF (native Hadoop files): HDFS flat files, CSV, delimited; Hive; HBase (with predicate push-down); Avro, RCFile, SequenceFile; open extensible API (available: Accumulo, JSON, …)
  • Spring XD: existing RDBMS systems; streaming and batch mode; Java development framework
  • 26. Pivotal eXtension Framework (PXF)
  • Provides an external table interface for querying the various data stores of the Hadoop ecosystem
  • Load Hadoop data into HAWQ, or query it in place
  • Enables combining HAWQ data and Hadoop data in a single query
  • Supports connectors for HDFS, HBase, and Hive
  • Provides an extensible framework API to enable custom connectors
  • Available on GitHub: JSON, Accumulo, S3, …
  • HAWQ MapReduce RecordReader
  Industry differentiators:
  • Low latency on large data sets
  • Extensible and customizable
  • Considers the cost model of federated sources
  • 27. PXF Features
  • Predicate push-down of filter conditions when querying HBase and Hive
  • Hive table partition exclusion
  • Statistics collected on HDFS data for optimized query plans
  • Extensible framework Java API makes it easy to build custom connectors for other data sources (e.g. Oracle DB) and formats
  • HDFS block locality to the HAWQ processing segment
  • Fast parallel optimizer (ORCA)
  • Example uses:
  (1) Join a HAWQ dimension table with an HBase fact table
  (2) Quickly load HDFS, Hive, and HBase data into HAWQ for consolidated management
  (3) Act as a federated query engine over data in diverse formats and stores, without materialization
  • 28. PXF External Table Examples
  • Simple HDFS text:
    CREATE EXTERNAL TABLE jan_2012_sales (id int, total int, comments varchar)
    LOCATION ('pxf://10.76.72.26:50070/sales/2012/01/items_*.csv?profile=HdfsTextSimple')
    FORMAT 'TEXT' (delimiter ',');
  • HBase table:
    CREATE EXTERNAL TABLE hbase_sales (recordkey bytea, "cf1:saleid" int, "cf8:comments" varchar)
    LOCATION ('pxf://10.76.72.26:50070/sales?profile=HBase')
    FORMAT 'custom' (formatter='gpxfwritable_import');
  • Export to HDFS using writable PXF:
    CREATE WRITABLE EXTERNAL TABLE ...
    LOCATION ('pxf://<host:port>/sales?profile=HdfsTextSimple&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
    FORMAT 'text' (delimiter ',');
  • 29. Data Distribution
  • Data is distributed by a chosen column, column set, or randomly
  • Tables distributed similarly are co-located
  • The distribution scheme is modifiable through ALTER TABLE
  Advantages:
  • Co-located joins
  • No data movement on joins or aggregates
  • Improved performance on complex queries
  • Query engine optimization
  [Diagram: SELECT X FROM A, B WHERE A.X = B.Y and SELECT SUM(X) FROM A GROUP BY A.X over tables A and B hash-distributed across DataNodes DN1–DN3]
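Co-location follows directly from distributing both tables on the join key. A minimal sketch (table names are illustrative):

```sql
-- Rows with equal x and y hash to the same segment,
-- so the join below needs no data movement.
CREATE TABLE a (x int, payload text) DISTRIBUTED BY (x);
CREATE TABLE b (y int, payload text) DISTRIBUTED BY (y);

-- Co-located join: each segment joins only its local rows.
SELECT a.x FROM a JOIN b ON a.x = b.y;

-- The distribution scheme can be changed later:
ALTER TABLE a SET DISTRIBUTED RANDOMLY;
```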
  • 30. HAWQ Distribution vs Hive Partitioning
  • In Hive, partitions are organized into folders
  • Folders are spread across the entire HDFS
  • Similar data is not co-located; data location is lost
  • Data movement is required for large joins and aggregates
  • Hive partitions only help with sequential scans of the original table
  [Diagram: tables A and B spread over HDFS folders across DN1–DN3 — no co-located joins, no co-located aggregates]
  • 31. HAWQ Resource Management
  • Effective mixed-workload management through query prioritization
  • Queue management for the number of active queries, memory, CPU, and disk I/O
  • Flexible SLA settings; queue configuration can be changed dynamically (weekly/daily/hourly)
  • Controls: number of concurrent queries, maximum cost, minimum cost, query priority, and pre-emptive blocking of queries above the maximum cost
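These controls map onto resource queue DDL inherited from GPDB. A hedged sketch (queue and role names and the limits are illustrative):

```sql
-- Cap concurrency and plan cost for ad-hoc users,
-- and give their queries low priority.
CREATE RESOURCE QUEUE adhoc_queue WITH (
    ACTIVE_STATEMENTS=10,     -- at most 10 concurrent queries
    MAX_COST=100000000.0,     -- block plans costed above this
    PRIORITY=LOW
);

-- Queries from this role are then governed by the queue.
ALTER ROLE analyst RESOURCE QUEUE adhoc_queue;
```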
  • 32. Machine Learning on HDFS Using HAWQ
  • 33. MADlib Advantages
  • Better parallelism
  – Algorithms designed to leverage MPP and Hadoop architecture
  • Better scalability
  – Algorithms scale as your data set scales
  • Better predictive accuracy
  – Can use all data, not a sample
  • Open source
  – Available for customization and optimization by the user if desired
  • 34. Functions — Predictive Modeling Library (Oct 2014)
  • Linear systems: sparse and dense solvers; linear algebra
  • Matrix factorization: singular value decomposition (SVD); low rank
  • Generalized linear models: linear regression; logistic regression; multinomial logistic regression; Cox proportional hazards regression; elastic net regularization; robust variance (Huber-White), clustered variance, marginal effects
  • Other machine learning algorithms: principal component analysis (PCA); association rules (Apriori); topic modeling (parallel LDA); decision trees; random forest; support vector machines (SVM); conditional random fields (CRF); clustering (k-means); cross validation; naïve Bayes
  • Descriptive statistics — sketch-based estimators: CountMin (Cormode-Muthukrishnan); FM (Flajolet-Martin); MFV (most frequent values); correlation; summary
  • Inferential statistics: hypothesis tests
  • Time series: ARIMA
  • Support modules: array operations; sparse vectors; random sampling; probability functions; data preparation; PMML export; conjugate gradient
  • 35. Calling MADlib Functions: Fast Training, Scoring
    SELECT madlib.linregr_train(
        'houses',                      -- table containing training data
        'houses_linregr',              -- table in which to save results
        'price',                       -- column containing the dependent variable
        'ARRAY[1, tax, bath, size]');  -- features included in the model
  • MADlib lets users easily create models without moving data out of the system
  – Model generation
  – Model validation
  – Scoring (evaluation of) new data
  • All the data can be used in one model
  • Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
  • Open source lets you tweak and extend methods, or build your own
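Scoring uses the coefficients that the training call saved. A sketch following MADlib's linear-regression interface (the `id` column is an assumed identifier, not shown on the slide):

```sql
-- Predict a price for every row using the fitted model.
SELECT h.id,
       madlib.linregr_predict(
           m.coef,
           ARRAY[1, h.tax, h.bath, h.size]) AS predicted_price
FROM   houses h, houses_linregr m;
```

Because the model table holds a single row of coefficients, the cross join simply attaches them to every observation.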
  • 36. UDF – PL/X: Use a Variety of Analytics Languages
  • User-defined functions can be written in R, Python, Java, C, Perl, and PL/pgSQL
  • Python extensions such as NumPy, NLTK, scikit-learn, and SciPy can be used
  • The data parallelism of the MPP architecture delivers fast analytic performance
  [Diagram: SQL entering the Master Host and fanning out over the interconnect to segments on each segment host]
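A PL/Python UDF runs on every segment against local data. A toy sketch, assuming the plpythonu language is installed in the database (the function and its word list are invented for illustration):

```sql
-- A trivial scorer: fraction of "positive" words in a text.
-- The Python body executes in parallel on each segment.
CREATE OR REPLACE FUNCTION sentiment_score(body text)
RETURNS float8 AS $$
    positive = {'good', 'great', 'excellent'}
    words = (body or '').lower().split()
    return sum(w in positive for w in words) / float(len(words) or 1)
$$ LANGUAGE plpythonu;

-- Applied across a table, the work is spread over all segments:
-- SELECT sentiment_score(comments) FROM jan_2012_sales;
```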
  • 37. PivotalR: Bringing MADlib and HAWQ to a Familiar R Interface
  • Challenge: harness the familiarity of R's interface together with the performance and scalability of in-database analytics
  • Simple solution: translate R code into SQL
  PivotalR:
    d <- db.data.frame("houses")
    houses_linregr <- madlib.lm(price ~ tax + bath + size, data=d)
  Generated SQL:
    SELECT madlib.linregr_train(
        'houses', 'houses_linregr', 'price', 'ARRAY[1, tax, bath, size]');
  https://github.com/pivotalsoftware/PivotalR
  • 38. PivotalR Design Overview
  1. R → SQL (PivotalR; no data in R)
  2. SQL to execute (via RPostgreSQL)
  3. Computation results returned to R (data lives in the database/Hadoop with MADlib)
  • Call MADlib's in-database machine learning functions directly from R
  • Syntax is analogous to native R functions
  • Data doesn't need to leave the database
  • All heavy lifting, including model estimation and computation, is done in the database
  • 39. Security & Authorization
  • Role-based security
  • Users and groups available
  • Access granularity at the connection, database, schema, table, and view level, …
  • Inheritance:
  – Inherit security privileges from other users or groups for easy administration
  • Assign groups and users to resource queues
  • Secure connections between HAWQ processes
  • Built-in column encryption (pgcrypto)
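The role model and pgcrypto combine in standard PostgreSQL-style DDL. A minimal sketch (role, table, and key names are illustrative; manage real keys outside SQL text):

```sql
-- Group role with an inheriting member:
CREATE ROLE analysts;
CREATE ROLE alice LOGIN IN ROLE analysts;  -- alice inherits analysts' grants

GRANT SELECT ON sales TO analysts;

-- Column-level symmetric encryption via pgcrypto:
SELECT pgp_sym_encrypt('4111-1111-1111-1111', 'my-secret-key');
SELECT pgp_sym_decrypt(card_enc, 'my-secret-key') FROM payments;
```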
  • 40. HAWQ Client Program – pgAdmin
  A client tool for HAWQ and GPDB
  • 41. HAWQ Client Program – Aginity Workbench
  Aginity Workbench for EMC Greenplum
  – A client program for DBAs and developers using HAWQ / Greenplum database
  – Supports Korean and many other languages
  • 42. Pivotal HAWQ New/Enhanced Features & Roadmap — Apr 2015
  • 43. PHD 3.0 & HAWQ 1.3.x (H1 2015)
  • ODP core: Hadoop based on the industry-standard ODP core — HDFS, YARN, MapReduce, Ambari (PHD 3.0 / HAWQ 1.3)
  • Ambari adoption: stronger PHD management, operations, monitoring (Ganglia), and alerting (Nagios) (PHD 3.0 / HAWQ 1.3)
  • Security improvements: administration (Ranger, Ambari), authentication (Kerberos, Knox), authorization (ACLs, AD/LDAP, Ranger), audit (Ranger), data protection (encryption) (PHD 3.0 / HAWQ 1.3)
  • Latest versions and ecosystem support: based on Hadoop 2.6; includes the Spark stack; adds Knox and Ranger (PHD 3.0 / HAWQ 1.3)
  • 44. PHD 3.x & HAWQ 2.x Roadmap (H2 2015)
  • Performance — materialized views: in-memory materialized views (HAWQ 2.x)
  • Performance — partitioning: improved multilevel partitioning performance (HAWQ 2.x)
  • Management — resource management: hierarchical resource management (HAWQ 2.x)
  • Management — YARN: HAWQ resource management plugs into YARN so that YARN manages system resources centrally (HAWQ 2.x)
  • Features — HCatalog: integrated management of HAWQ and HCatalog (HAWQ 2.x)
  • Compatibility — Isilon support: support for EMC Isilon — Hadoop clusters on scale-out NAS storage, effective for HDFS deployments over 100 TB (HAWQ 2.x)
  • 45. What's in HAWQ 1.3
  • New Ambari installation experience
  • Enhancements to the query optimizer and query execution
  • Incremental ANALYZE on tables
  • HAWQ 1.3.0.1 support for HDP 2.4.2.2
  • libhdfs3 updates and HDFS support for the truncate patch
  • HAWQ 1.3.0.2 support for SLES
  • Documentation enhancements on administration, etc.
  • 46. HAWQ Roadmap
  • First half 2015 (1.3.x):
  – Ambari 2.0: advanced monitoring and alerting, StackAdvisor
  – Migration from the 1.2 line to 1.3
  – Isilon DA support
  • Second half 2015 (2.x):
  – Isilon support
  – Elastic runtime (NxM): performance, higher concurrency, cloud optimized
  – Advanced resource manager: hierarchical, highly multi-tenant, YARN
  – HCatalog integration
  – AWS enablement
  – Improved support for multilevel partitioning
  – Open sourcing into the ASF
  • 47. Appendix: HAWQ vs Hive — Advantages over Apache Hive, Apr 2015
  • 48. HAWQ Advantage 1: Performance
  • MPP parallel processing engine delivers fast analytics and interactive queries
  • Cost-based optimizer tuned for big data: PQO (Pivotal Query Optimizer)
  • Direct query processing — not MapReduce translation as in Hive — enables better query concurrency
  • Handles more user queries with fewer server resources
  • 49. HAWQ Advantage 2: 100% ANSI SQL Support
  • 100% ANSI SQL syntax support
  • Supports complex joins and analytic queries
  • Existing BI tools work without changes
  • No learning curve means faster development and deployment
  TPC-DS queries runnable without modification — HAWQ: all 111; Stinger: 20; Impala: 31; Presto: 12
  • 50. HAWQ Advantage 3: Rich Analytics
  • Open-source analytics and statistics tools for data scientists: MADlib, PivotalR, PL/R, PL/Python, PL/Java, and more
  • Analyzes the full data set in parallel — not a sample — for fast results
  • In-database analytics: no data movement required
  [Diagram: SQL entering the Master Host and fanning out over the interconnect to segments]
  • 51. HAWQ Advantage 4: Extensibility to Many Sources
  • PXF (Pivotal eXtension Framework)
  • Federated queries joining HAWQ with diverse source data (HDFS, Hive, Avro, HBase, …)
  • Extensible API framework allows connecting to even more data sources (e.g. Oracle, DB2, JSON, …)
  • Parallel access to external tables for fast reads
  • 52. HAWQ Advantage 5: Integrated Monitoring and Management
  • PHD 3.0 is integrated with open-source Ambari
  • Easy installation and management tools
  • Monitoring integrated with the other Hadoop components
  • ODP (Open Data Platform) compatibility: runs without modification on the Hadoop distributions of 12+ partner vendors
  • 53. Pivotal Technologies for Various Workloads
  • Batch SQL (minutes, hours; IO-heavy; less complex): Hive, HAWQ
  • Interactive SQL (seconds, minutes; joins; extensibility): HAWQ
  • OLAP SQL (seconds; very complex; BI tools): HAWQ
  • Streaming SQL (in-memory; small data sets): SparkSQL, SpringXD
  • 54. Summary: Apache Hive vs Pivotal HAWQ
  • Complex joins — Hive: not supported; HAWQ: processes complex join conditions quickly
  • Compatibility with existing BI tools — Hive: many incompatible tools drive up investment; HAWQ: guaranteed compatibility, no additional investment
  • Interactive queries — Hive: performance issues, optimized only for batch jobs; HAWQ: fast interactive queries over large data sets
  • Ad-hoc queries — Hive: performance issues; HAWQ: cost-based optimizer tuned for ad-hoc queries
  • ANSI SQL — Hive: limited support causes compatibility problems; HAWQ: 100% SQL compliance
  • Concurrent queries — Hive: concurrency is difficult; HAWQ: query concurrency for mixed workloads
  • 55. Summary: HAWQ Business Benefits
  • Rich, compatible SQL dialect → powerful and portable SQL apps; leverage large SQL-based ecosystems
  • TPC-DS compliance → applicable to more use cases, guaranteed compatibility with existing BI tools, stable operations
  • Linear scalability with flexible, efficient joins → offload EDW workloads at very low cost
  • Deep analytics + machine learning → predictive/advanced learning use cases at scale
  • Data federation → query diverse external data in place, without moving it
  • High availability → migrate critical workloads from the EDW to Hadoop
  • Native Hadoop file format support → reduced ETL and data movement = lower costs
  • 56. Spark–HAWQ Integration
  • 57. Spark Approaches to Read HAWQ Data
  • Spark JDBC (JdbcRDD, DBInputFormat)
  • Spark with HAWQInputFormat (AO, Parquet)
  • Shared Parquet storage
  • Apache Crunch on Spark (HAWQInputFormat2)