Druid+superset

What is Druid?
2018-05-13

What is OLAP?
OLAP (Online Analytical Processing)
OLAP can be defined as the process in which end users access multidimensional
information directly, query it interactively, and use the answers for
decision making.

The purpose of OLAP
The ultimate goal of OLAP is to build an OLAP cube and use it to obtain computed values for specific features.

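Challenges in building OLAP
1. The volume of data can become very large.
2. Analysis requires being able to slice and dice the data.
3. High dimensionality.
4. Query results must include data that is being updated right now, not only
historical data.
5. Ad hoc query performance depends on precomputing, but the number of combinations
that must be precomputed is very large.
6. Many users must be able to run queries concurrently.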
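
To make item 5 concrete, the sketch below counts how many distinct groupings a full cube would have to precompute. The dimension names are hypothetical and chosen only for illustration; the point is that the count doubles with every added dimension.

```python
from itertools import combinations

def groupby_combinations(dimensions):
    """Return every subset of dimensions that a full cube would pre-aggregate."""
    subsets = []
    for size in range(len(dimensions) + 1):
        subsets.extend(combinations(dimensions, size))
    return subsets

# Hypothetical dimension names, not taken from the slides.
dims = ["country", "device", "browser", "campaign"]
cuboids = groupby_combinations(dims)
print(len(cuboids))  # 2 ** len(dims) groupings
```

With 20 dimensions the same calculation gives over a million groupings, before even considering the number of distinct values inside each one.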

What is Druid?
Druid is an open-source data store designed for sub-second queries on real-time and
historical data. It is primarily used for business intelligence (OLAP) queries on event
data. Druid provides low latency (real-time) data ingestion, flexible data exploration, and
fast data aggregation. Existing Druid deployments have scaled to trillions of events and
petabytes of data. Druid is most commonly used to power user-facing analytic
applications.
Druid is a high-performance, column-oriented, distributed
database.

Why Druid?
1. Column-oriented data store
2. Low-latency ingestion from streams
3. Ad hoc queries
4. Exact and approximate algorithms
5. Keeps a lot of history
6. Designed for interactive applications
7. Supports real-time and historical data at the same time

Druid Architecture – Real-time Node
1. Data ingestion
2. Query data as soon as it is ingested
3. Buffer data in memory
4. Periodically "hand off" data

Druid Architecture – Historical Node
1. Main workhorse of a Druid cluster
2. Load historical data
3. Respond to queries

Druid Architecture – Broker Node
1. Know which nodes hold what data
2. Query scatter/gather
3. Caching
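
The scatter/gather step can be sketched as follows. This is a toy model, not Druid's implementation: each "node" is just a list of `(country, views)` rows, and the function names are hypothetical.

```python
from collections import Counter

def query_node(segment_rows):
    """Compute a partial aggregate on one node: total views per country."""
    partial = Counter()
    for country, views in segment_rows:
        partial[country] += views
    return partial

def scatter_gather(nodes):
    """Send the query to every node, then merge the partial results."""
    merged = Counter()
    for rows in nodes:                   # scatter: one sub-query per node
        merged.update(query_node(rows))  # gather: add partial counts together
    return dict(merged)

# Hypothetical data held by two nodes.
nodes = [
    [("KR", 3), ("US", 1)],
    [("KR", 2)],
]
result = scatter_gather(nodes)
```

Because sums and counts can be merged from partial results, the broker never needs the raw rows, only each node's aggregate.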

Druid Architecture – Batch Ingestion

Druid Architecture – Streaming Ingestion

Druid Data
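• Timestamp Column
The column that holds the timestamp. Every row has a timestamp.
• Dimension Columns
Columns that serve as the criteria for measurement. They are the axes along which data is sliced in OLAP.
• Metric Columns
Columns used for aggregation and computation. They hold OLAP measures such as counts, sums, and averages.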

Roll-Up
Roll-up is one of Druid's core features.
It is easiest to think of it as individual events being aggregated by specific dimensions.
It reduces the data size, but there is no guarantee that the original data is preserved intact.
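
A minimal sketch of the roll-up idea, assuming hourly truncation of the timestamp and two metrics (`count` and `bytes`). The event fields and values are hypothetical; Druid itself configures roll-up through its ingestion spec rather than code like this.

```python
from collections import defaultdict
from datetime import datetime

def roll_up(events):
    """Aggregate events that share the same hour and dimension values."""
    rolled = defaultdict(lambda: {"count": 0, "bytes": 0})
    for ts, page, country, nbytes in events:
        # Truncate the timestamp to the hour (assumed roll-up granularity).
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        key = (hour.isoformat(), page, country)
        rolled[key]["count"] += 1
        rolled[key]["bytes"] += nbytes
    return dict(rolled)

# Hypothetical raw events: (timestamp, page, country, bytes).
events = [
    ("2018-05-13T10:01:00", "/home", "KR", 100),
    ("2018-05-13T10:15:00", "/home", "KR", 250),
    ("2018-05-13T10:40:00", "/home", "US", 80),
    ("2018-05-13T11:05:00", "/home", "KR", 120),
]
rolled = roll_up(events)
```

Four raw events become three stored rows. The two 10:00 events from KR are merged, so their individual timestamps and byte sizes can no longer be recovered, which is the trade-off noted above.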

Segmentation
The segment is the unit of data that Druid serves in response to a query.
Data ingested into Druid is partitioned into segments by timestamp and then indexed.
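
The following sketch shows time-based partitioning under the assumption of one segment per hour, and how a query interval limits which segments need to be read. Function names and sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SEGMENT_WIDTH = timedelta(hours=1)  # assumed segment granularity

def segment_start(ts):
    """Truncate an ISO timestamp to the start of its hourly segment."""
    return datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)

def build_segments(rows):
    """Partition (timestamp, value) rows into segments keyed by interval start."""
    segments = defaultdict(list)
    for ts, value in rows:
        segments[segment_start(ts)].append((ts, value))
    return dict(segments)

def segments_for_query(segments, start, end):
    """Return the starts of segments that overlap the interval [start, end)."""
    return sorted(s for s in segments if s < end and s + SEGMENT_WIDTH > start)

rows = [
    ("2018-05-13T10:05:00", 1),
    ("2018-05-13T10:50:00", 2),
    ("2018-05-13T11:20:00", 3),
    ("2018-05-13T13:00:00", 4),
]
segments = build_segments(rows)
hit = segments_for_query(
    segments, datetime(2018, 5, 13, 11), datetime(2018, 5, 13, 12)
)
```

Only the 11:00 segment overlaps the queried hour, so the other two segments are never touched.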

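Druid Data Scan #1
The Real Time Node, the agent responsible for real-time processing,
serves and indexes data in memory at the same time.
Once a certain amount of data has been loaded in memory, it is moved
to an off-heap area, and the index data in the off-heap area is
periodically flushed to disk.
The data flushed at this point is the segment mentioned earlier.
When a query on real-time data arrives, results are returned from the
index held in memory, so the response is fast.
Metadata about the data is managed by Zookeeper and the Coordinator Node,
and the Broker Node looks up that metadata to respond.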
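
The buffer-then-flush behaviour can be illustrated with a heavily simplified model. The class below is hypothetical: it uses a row-count threshold in place of Druid's actual persist settings, and a Python list in place of the off-heap area and the disk.

```python
class RealtimeBuffer:
    """Toy model of a real-time node: buffer rows, flush them in batches."""

    def __init__(self, max_rows):
        self.max_rows = max_rows  # assumed flush threshold
        self.in_memory = []       # rows still being indexed in memory
        self.persisted = []       # stands in for segments flushed to disk

    def ingest(self, row):
        self.in_memory.append(row)
        if len(self.in_memory) >= self.max_rows:
            # Flush the current batch and start a fresh in-memory buffer.
            self.persisted.append(list(self.in_memory))
            self.in_memory.clear()

    def query(self, predicate):
        """Rows are queryable immediately, whether flushed or still in memory."""
        flushed = [row for batch in self.persisted for row in batch]
        return [row for row in flushed + self.in_memory if predicate(row)]

buf = RealtimeBuffer(max_rows=2)
for value in range(5):
    buf.ingest(value)
```

After five rows with a threshold of two, two batches have been flushed and one row is still in memory, yet a query sees all five.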

Druid manages the values of each column separately and uses dictionary encoding to
assign an integer ID to each value. Based on these integer IDs, a binary array (bitmap)
marking the rows that contain each value is stored in the dictionary.
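
As a rough illustration, the sketch below dictionary-encodes one string column and builds one binary array per distinct value. The column values are sample data, and plain Python lists stand in for the real storage format.

```python
def encode_column(values):
    """Map each distinct value to an integer ID and rewrite the column as IDs."""
    dictionary = {v: i for i, v in enumerate(sorted(set(values)))}
    encoded = [dictionary[v] for v in values]
    return dictionary, encoded

def build_bitmaps(dictionary, encoded):
    """For each value, build a binary array with 1 at the rows holding it."""
    bitmaps = {}
    for value, value_id in dictionary.items():
        bitmaps[value] = [1 if row_id == value_id else 0 for row_id in encoded]
    return bitmaps

# Sample column; one entry per row.
page = ["Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]
dictionary, encoded = encode_column(page)
bitmaps = build_bitmaps(dictionary, encoded)
```

Here `encoded` stores small integers instead of repeated strings, and `bitmaps["Ke$ha"]` answers "which rows have this value?" without scanning the column.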

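A data scan does not scan the entire data set.
Instead, the dictionary stored earlier is used to find where the required data is located and to select rows.
For query conditions such as OR and AND, Boolean operations between the binary arrays determine which binary array
represents the requested range, and that array is mapped back through the dictionary to obtain the values.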

This idea is the basis for performing Boolean operations on bitmap sets. Building on it,
Druid compresses the multidimensional binary arrays with Concise compression.
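
A small sketch of filtering with Boolean operations on binary arrays. The bitmaps and column values are hypothetical, and the arrays are left uncompressed; Concise compression is not implemented here.

```python
def bitmap_and(a, b):
    """Rows that satisfy both conditions."""
    return [x & y for x, y in zip(a, b)]

def bitmap_or(a, b):
    """Rows that satisfy at least one condition."""
    return [x | y for x, y in zip(a, b)]

def matching_rows(bitmap):
    """Translate a binary array back into row numbers."""
    return [row for row, bit in enumerate(bitmap) if bit]

# Hypothetical per-value bitmaps for two columns of a four-row table.
page_bitmaps = {"Justin Bieber": [1, 1, 0, 0], "Ke$ha": [0, 0, 1, 1]}
city_bitmaps = {"San Francisco": [1, 0, 0, 0], "Waterloo": [0, 1, 0, 0],
                "Calgary": [0, 0, 1, 1]}

# WHERE page = 'Justin Bieber' AND city = 'San Francisco'
and_rows = matching_rows(
    bitmap_and(page_bitmaps["Justin Bieber"], city_bitmaps["San Francisco"])
)
# WHERE page = 'Ke$ha' OR city = 'Waterloo'
or_rows = matching_rows(
    bitmap_or(page_bitmaps["Ke$ha"], city_bitmaps["Waterloo"])
)
```

Only the rows named in `and_rows` or `or_rows` need to be read from the metric columns, which is why the filter can avoid a full scan.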

SQL on Druid
Queries are sent in JSON format via HTTP POST.
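
A hedged sketch of building such a request with the Python standard library. It assumes Druid's SQL endpoint at `/druid/v2/sql/` on a broker, with the SQL text carried in a `query` field of the JSON body. The host, port, and the `wikipedia` datasource name are assumptions for illustration, and the request is only constructed here, not sent.

```python
import json
from urllib.request import Request

def build_sql_request(sql, broker_url="http://localhost:8082"):
    """Build (but do not send) an HTTP POST carrying a SQL query as JSON."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return Request(
        url=f"{broker_url}/druid/v2/sql/",  # assumed broker address
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical datasource name.
req = build_sql_request("SELECT COUNT(*) AS cnt FROM wikipedia")
# To actually send it: urllib.request.urlopen(req).read()
```

Keeping request construction separate from sending makes the payload easy to inspect before pointing it at a real cluster.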

Visualization tools for Druid
Airbnb Superset: https://github.com/apache/incubator-superset
Grafana Druid Plugin: https://grafana.com/plugins/abhisant-druid-datasource/installation
Metabase: https://github.com/metabase/metabase

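Limitations of Druid
1. Cannot use YARN resources.
2. Joins are only partially supported (large-scale joins are not possible).
3. Depends on Zookeeper. If Zookeeper goes down, data already in memory can still be read, but data belonging to
new segments cannot.
4. The source data used for Druid ingestion must be stored separately.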

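Appendix - Druid performance comparison
https://www.popit.kr/druid-spark-performance/
A direct comparison is difficult, but for workloads that
roll up and aggregate timeseries data, Druid responds more
than 10 times faster than Spark, whereas it looks somewhat
weak on top-n queries over high-cardinality columns.
With Spark, converting the data to a DataFrame and caching it
in memory gives quite good responsiveness for scan queries,
and GROUP BY queries are less affected by cardinality than in Druid.
If the requirement is fast response on timeseries data and the
per-node runtime environment is configured to match, Druid is
expected to be a very attractive solution.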

Appendix - Reference
1. Druid paper
http://static.druid.io/docs/druid.pdf
2. Concise compression paper
https://pdfs.semanticscholar.org/e660/5a8c82e93c0809720e0927972f23e62c94b3.pdf
3. Korean-language posts about Druid
https://www.popit.kr/?s=time+series+olap
4. Druid ingestion using Hive
https://cwiki.apache.org/confluence/display/Hive/Druid+Integration
https://hortonworks.com/blog/apache-hive-druid-part-1-3/
5. Airbnb | DataEngConf SF '17 talk video
https://www.youtube.com/watch?v=W_Sp4jo1ACg
