Intro to Hadoop ecosystem and Apache Kylin

Intro to Hadoop Ecosystem
and Apache Kylin
Chase Zhang
Strikingly
November 14, 2017

Disclaimer
▶ Pardon my English; feel free to interrupt
▶ I’m not yet an expert in this area; feel free to point out my mistakes

Questions
▶ Why do we need a data platform?
▶ What technologies are we using?
▶ Why have we chosen these technologies?

Outline
Prerequisite
The Hadoop Ecosystem
  Map Reduce
  Intro to Hadoop
  Other projects
Apache Kylin
  Introduction
  Concepts
Comparisons
  Our Project
  Hive/SparkSQL
  Druid

Prerequisite
ETL: Extract-Transform-Load [1]
OLAP: Online Analytical Processing [2]
TSDB: Time-Series Database [3]
[1] https://en.wikipedia.org/wiki/Extract,_transform,_load
[2] https://en.wikipedia.org/wiki/Online_analytical_processing
[3] https://en.wikipedia.org/wiki/Time_series_database

Map Reduce
Fact
A single computing node can no longer handle today’s huge volumes of data
1. The data is too large to fit in a single node’s storage or memory
2. Processing all the data on a single computing node takes too long
3. Supercomputers are too expensive for most individuals and enterprises

Map Reduce
For some particular problems, we can apply a simple idea which is very common in
practical algorithms: Divide and Conquer
[Figure: Map Reduce. Diagram labels: Input, Map, Reduce, Output]
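
The Divide and Conquer idea can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce phases on a single machine, not Hadoop's API; the function names and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of input into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk could be mapped on a different node; here we simply loop.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, already split into two chunks.
chunks = ["big data is big", "data is everywhere"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently and each key is reduced independently, both phases can in principle be spread over many nodes.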

Map Reduce
Map Reduce is more than the simple Divide and Conquer paradigm; it also comes with a
series of techniques and algorithms to achieve
▶ Fault tolerance and error recovery
▶ Job scheduling and supervising
▶ Resource management and arrangement

Map Reduce
▶ Map Reduce is not a new idea in the context of functional programming
▶ In 2004, Google published the famous paper MapReduce: Simplified Data
Processing on Large Clusters [4]
▶ This paper is widely regarded as the starting point of Big Data’s rise in popularity
▶ By 2014, Google no longer used MapReduce as its primary data processing method [5]
▶ Although very successful in practice, MapReduce is usually considered inflexible and
slow compared to the latest computing models
[4] https://research.google.com/archive/mapreduce.html
[5] https://en.wikipedia.org/wiki/MapReduce

Intro to Hadoop
▶ In 2003, Google published the paper The Google File System [6], after which the
ideas behind the Hadoop project began to take shape
▶ In 2006, the Hadoop project was officially created; it implemented ideas inspired
by both GFS and MapReduce from Google
▶ Currently, the Hadoop project contains four main components:
Common: common library and utilities
HDFS: distributed file system inspired by GFS
YARN: job scheduling and resource management platform
MapReduce: implementation of the MapReduce computing model
[6] http://research.google.com/archive/gfs.html

Intro to Hadoop: HDFS
▶ One of the main components of Hadoop
▶ It implements ideas inspired by GFS
▶ It runs on commodity machines and provides high availability
▶ HDFS is not POSIX compliant and performs poorly for random reads and writes
▶ AWS’s Simple Storage Service (S3) might be regarded as a counterpart of HDFS

Intro to Hadoop: Hive
▶ We have HDFS as our platform for storing data to analyze
▶ We have MapReduce or other computing models on Hadoop to process the data
▶ Why not build something that lets us perform analysis without writing raw job
programs?
▶ One of the best choices is SQL

Intro to Hadoop: Hive
▶ Hive is a data warehouse built on top of Hadoop that provides data processing,
aggregation and analysis through a SQL interface
▶ Hive was initially developed by Facebook and later donated to the Apache Software
Foundation
▶ Hive initially ran on the MapReduce engine, but it now supports other execution
engines such as Spark and Tez
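
To show what a SQL interface buys us over hand-written jobs, here is a minimal sketch that uses Python's built-in sqlite3 module purely as a stand-in for Hive. The page_views table, its columns and its rows are hypothetical; a real Hive query would run over data in HDFS and be compiled into distributed jobs.

```python
import sqlite3

# In-memory SQLite is only a stand-in for Hive here; the table and its
# rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (site TEXT, country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("a.example", "US", 3),
        ("a.example", "CN", 5),
        ("b.example", "US", 2),
        ("b.example", "US", 4),
    ],
)

# One declarative statement replaces a hand-written map/shuffle/reduce job:
# GROUP BY plays the role of the shuffle, SUM the role of the reducer.
query = """
    SELECT country, SUM(views) AS total_views
    FROM page_views
    GROUP BY country
    ORDER BY country
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The analyst only states what result is wanted; how the work is split into tasks is left to the engine.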

Intro to Hadoop: HBase
▶ HDFS is good for high-throughput reads and writes but poor at lookups by key
▶ Hive is good for data aggregation and analysis, but not fast enough for key-value
pair retrieval
▶ HBase is a Hadoop ecosystem component intended to provide fast key-value access

Intro to Hadoop: HBase
▶ In 2006, Google published the paper Bigtable: A Distributed Storage System for
Structured Data [7]
▶ Apache HBase is an open-source implementation of the Bigtable design
▶ HBase is a distributed key-value store that runs on top of HDFS
▶ HBase offers fast reads and writes, with high throughput and low latency
▶ HBase implements a Log-Structured Merge Tree [8], which enables high write
throughput while keeping reads reasonably fast
▶ AWS’s DynamoDB is quite similar to HBase
[7] https://research.google.com/archive/bigtable.html
[8] https://en.wikipedia.org/wiki/Log-structured_merge-tree
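
The Log-Structured Merge Tree idea can be illustrated with a toy, in-memory sketch. This is not HBase's implementation: the class name, the tiny flush threshold and the use of Python lists in place of on-disk files are all assumptions made for the example.

```python
import bisect


class ToyLSMTree:
    """Toy LSM tree: a mutable memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}   # recent writes, kept in memory
        self.runs = []       # flushed sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch the memtable, which is why they are cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into a sorted, immutable run.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            # (key,) sorts just before any (key, value) pair, so this finds
            # the position of the key without comparing values.
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one, keeping only the newest value per key.
        merged = {}
        for run in self.runs:   # oldest first, so newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = ToyLSMTree(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)    # memtable is full, so it is flushed to a run
db.put("a", 10)   # newer value for "a"
db.put("c", 3)    # second flush
db.compact()      # the two runs are merged into one
print(db.get("a"), db.get("b"), db.get("z"))
```

Writes stay sequential and cheap, while reads may have to consult several runs; compaction keeps the number of runs, and therefore the read cost, bounded.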

Other projects
MapReduce is no longer popular, but Hadoop is still the base infrastructure for
many data components, including:
Apache Spark: A newer computing model built around Directed Acyclic Graphs (DAGs)
and in-memory computing, which provides much more flexibility and better
performance than MapReduce. Spark's counterpart to Hive is Spark SQL
Apache Storm: Distributed stream processing framework with a DAG-based design
similar to Spark's
Apache Kylin: Distributed analytics engine for multi-dimensional analysis on large
datasets

Apache Kylin
Introduction
▶ Apache Kylin is an OLAP system
▶ It was initially developed by eBay Shanghai
▶ The basic idea of Kylin is to perform ETL automatically for us
▶ It has a SQL interface and can serve sub-second queries to multiple concurrent
users

Apache Kylin
Introduction
[Figure: Apache Kylin. Diagram labels: Apache Kylin, Hive, HBase, MapReduce / Spark, HDFS]

Apache Kylin
Concepts
With Apache Kylin, we do not need to specify the ETL process programmatically; we only
need to express our business model in terms of several Kylin concepts:
▶ Model
▶ Cube
▶ Job
▶ Segment

Apache Kylin
Concepts
A Model defines the table structure. It typically contains one Fact table and several
Dimension tables. A Range Key must be specified for data separation.
[Figure: Kylin Concepts: Model. Diagram labels: Model, Fact, Dimension (x3)]

Apache Kylin
Concepts
A Cube specifies the query patterns and the build requirements. Pick the dimensions to
aggregate on and the measures to compute, and Kylin will work out how to transform the data.
[Figure: Kylin Concepts: Cube. Diagram labels: Cube, COUNT, SUM, MAX]

Apache Kylin
Concepts
A Job controls the build process. It manages the entire procedure: the extraction,
transformation, storage and cleanup steps.
[Figure: Kylin Concepts: Job. Diagram labels: Job, Hadoop, Hive, HBase]

Apache Kylin
Concepts
A Segment is the result of a Job’s build. It contains the transformed data for the
range of the Range Key that the corresponding Job built.
[Figure: Kylin Concepts: Segment. Diagram labels: Segment, HBase]

Apache Kylin
Concepts
Once a Segment is ready, data whose Range Key falls within the corresponding range can
be loaded and queried efficiently.
[Figure: Model, Cube, Job and Segment. Diagram labels: Model, Cube, Job (x3), Segment]
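
How a Range Key ties queries to Segments can be sketched as follows. This is a simplified illustration, not Kylin's data model or API: the Segment class, its single pre-aggregated total_views field and the weekly ranges are hypothetical, and a real Segment stores many pre-aggregated rows rather than one number.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a built Segment covering [start, end)."""
    start: date
    end: date
    total_views: int  # pre-aggregated measure for this range


def segments_for(segments, query_start, query_end):
    """Return the segments whose range overlaps [query_start, query_end)."""
    return [s for s in segments if s.start < query_end and query_start < s.end]


# Three weekly segments, each produced by one (hypothetical) build job.
segments = [
    Segment(date(2017, 11, 1), date(2017, 11, 8), total_views=700),
    Segment(date(2017, 11, 8), date(2017, 11, 15), total_views=900),
    Segment(date(2017, 11, 15), date(2017, 11, 22), total_views=400),
]

# A query over the last two weeks only needs to read two of the segments.
hit = segments_for(segments, date(2017, 11, 8), date(2017, 11, 22))
print(len(hit), sum(s.total_views for s in hit))
```

The query in this sketch is aligned with segment boundaries; answering a query that cuts through a segment would need the finer-grained rows stored inside it.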

Apache Kylin
Concepts
The principle behind Apache Kylin’s transform logic is simple: enumerate every possible
aggregation and measure result across every combination of the selected dimensions.
The problem is still very hard, however, because the job has to be performed efficiently
and expressed in terms of the MapReduce/Spark computing model.
[Figure: Cuboid]
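
The enumeration described above can be sketched on a single machine. This is a naive illustration of the idea, not Kylin's build algorithm: the fact rows, the dimension names and the build_cuboids helper are hypothetical, and every cuboid is computed by a full scan of the rows.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: three dimensions and one measure (views).
rows = [
    {"country": "US", "device": "mobile", "browser": "chrome", "views": 3},
    {"country": "US", "device": "desktop", "browser": "chrome", "views": 5},
    {"country": "CN", "device": "mobile", "browser": "safari", "views": 2},
]
dimensions = ("country", "device", "browser")


def build_cuboids(rows, dimensions, measure):
    """Pre-aggregate COUNT, SUM and MAX for every subset of the dimensions."""
    cuboids = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            table = defaultdict(lambda: {"count": 0, "sum": 0, "max": None})
            for row in rows:
                key = tuple(row[d] for d in dims)
                cell = table[key]
                value = row[measure]
                cell["count"] += 1
                cell["sum"] += value
                cell["max"] = value if cell["max"] is None else max(cell["max"], value)
            cuboids[dims] = dict(table)
    return cuboids


cuboids = build_cuboids(rows, dimensions, "views")
print(len(cuboids))                     # one cuboid per subset: 2 ** 3
print(cuboids[("country",)][("US",)])   # aggregates grouped by country
print(cuboids[()][()])                  # grand total over all rows
```

With n dimensions there are 2^n cuboids, which is why an efficient build matters: the cost of this naive version doubles with every dimension added.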

Comparisons
Our Project
▶ A service similar to Google Analytics
▶ Multi-dimensional data analysis with time as the primary key
▶ Multi-tenant use case: the service is exposed to our customers

Comparisons
Hive/SparkSQL
Good points
▶ Both Hive and SparkSQL support SQL
▶ Hive/SparkSQL are more flexible than Kylin as they do not need a pre-defined ETL model
Bad points
▶ Query time of Hive/SparkSQL is not guaranteed as they have no pre-calculation or
caching mechanisms
▶ Hive/SparkSQL are not suitable for the multi-tenant use case as they cannot provide
sub-second queries for multiple concurrent users

Comparisons
Druid
▶ Druid is a time series database designed especially for situations like ours
▶ Druid draws on two papers from Google: Dremel: Interactive Analysis of
Web-Scale Datasets (2010) [9] and Processing a Trillion Cells per Mouse Click (2012) [10]
▶ Druid is comparatively more flexible than Kylin, but not as flexible as Hive/SparkSQL
▶ Druid makes use of complex, cutting-edge algorithms to achieve fast queries, while
Kylin caches precomputed results in HBase
▶ Druid handles high availability and job scheduling itself, while Kylin makes use of
components from the Hadoop ecosystem
▶ Druid is quite promising, but we still need to investigate it further
[9] https://research.google.com/pubs/pub36632.html
[10] https://research.google.com/pubs/pub40465.html

Thank you!
