The Hadoop Ecosystem
MapReduce
Fact
A single computing node can no longer handle today's data volumes:
1. The data is too large to fit in a single node's storage or memory
2. Processing all the data on a single node takes too long
3. Supercomputers are too expensive for most individuals and enterprises
MapReduce
▶ MapReduce is not a new idea: map and reduce come from functional programming
▶ In 2004, Google published the famous paper MapReduce: Simplified Data
Processing on Large Clusters [4]
▶ This paper is widely regarded as the starting point of the Big Data movement
▶ By 2014, Google no longer used MapReduce as its primary data processing model [5]
▶ Although very successful in practice, MapReduce is usually considered inflexible and
slow compared with more recent computing models
[4] https://research.google.com/archive/mapreduce.html
[5] https://en.wikipedia.org/wiki/MapReduce
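The computing model described above can be illustrated with a word count, the example most often used to introduce MapReduce. This is a minimal single-process sketch in plain Python, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration. In a real cluster the framework runs the map and reduce steps on many nodes and performs the shuffle over the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document stands in for an input split handled by one mapper.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input splits, for illustration only.
splits = ["big data is big", "data is everywhere"]
counts = word_count(splits)
print(counts)
```

Because each call to `map_phase` only sees its own split and each call to `reduce_phase` only sees one key, both steps can in principle run in parallel on separate nodes, which is what makes the model attractive for data that does not fit on a single machine.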
Intro to Hadoop
▶ In 2003, Google published the paper The Google File System [6], after which the
ideas behind the Hadoop project began to take shape
▶ In 2006, the Hadoop project was officially created, implementing ideas inspired
by both GFS and MapReduce from Google
▶ The Hadoop project currently contains four main components:
Common: common libraries and utilities
HDFS: distributed file system inspired by GFS
YARN: job scheduling and resource management platform
MapReduce: implementation of the MapReduce computing model
[6] http://research.google.com/archive/gfs.html
Intro to Hadoop: Hive
▶ Hive is a data warehouse built on top of Hadoop that provides data processing,
aggregation, and analysis through a SQL interface
▶ Hive was initially developed by Facebook and later donated to the Apache Software Foundation
▶ Hive initially ran only on the MapReduce engine; it now also supports other execution
engines such as Spark and Tez
Intro to Hadoop: HBase
▶ In 2006, Google published the paper Bigtable: A Distributed Storage System for
Structured Data [7]
▶ Apache HBase is the open-source implementation of the Bigtable design
▶ HBase is a distributed key-value store that runs on top of HDFS
▶ HBase offers fast reads and writes, with high throughput and low latency
▶ HBase uses a log-structured merge-tree [8], which sustains high write throughput
while keeping reads efficient
▶ AWS's DynamoDB is quite similar to HBase
[7] https://research.google.com/archive/bigtable.html
[8] https://en.wikipedia.org/wiki/Log-structured_merge-tree
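The log-structured merge-tree idea mentioned above can be sketched in a few lines. This is a toy in-memory model, not HBase's implementation: the class name `ToyLSM`, the `memtable_limit` parameter, and the use of Python lists in place of on-disk files are all assumptions made for illustration. Writes go to an in-memory table, full tables are flushed as immutable sorted runs, reads check the newest data first, and compaction merges the runs.

```python
import bisect


class ToyLSM:
    """Toy LSM tree: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}          # recent writes, cheap to update
        self.runs = []              # flushed sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A write only touches memory; no existing run is modified.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one sorted, immutable run.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: memtable first, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one; newer values overwrite older ones.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = ToyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)      # memtable is full, so it is flushed as the first run
db.put("a", 3)      # newer value for "a" shadows the flushed one
db.put("c", 4)      # second flush
print(db.get("a"), db.get("b"), len(db.runs))
db.compact()
print(db.runs)
```

The sketch shows why writes are cheap (they never rewrite existing runs) and why reads may have to consult several runs until compaction merges them.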
Other projects
MapReduce is no longer popular, but Hadoop is still the base infrastructure for
many data components, including:
Apache Spark: A newer computing model built on directed acyclic graphs (DAGs) and
in-memory computing, which provides much more flexibility and better
performance than MapReduce. Spark's counterpart to Hive is Spark SQL
Apache Storm: Distributed stream processing framework with a DAG-based design
similar to Spark's
Apache Kylin: Distributed analytics engine for multi-dimensional analysis on large
datasets
Comparisons
Druid
▶ Druid is a time-series database designed for use cases like ours
▶ Druid draws on two Google papers: Dremel: Interactive Analysis of
Web-Scale Datasets (2010) [9] and Processing a Trillion Cells per Mouse Click (2012) [10]
▶ Druid is more flexible than Kylin, but less flexible than Hive/Spark SQL
▶ Druid relies on complex, cutting-edge algorithms to achieve fast queries, while
Kylin caches results in HBase
▶ Druid handles high availability and job scheduling itself, while Kylin relies on
components from the Hadoop ecosystem
▶ Druid is quite promising, but we still need to investigate it further
[9] https://research.google.com/pubs/pub36632.html
[10] https://research.google.com/pubs/pub40465.html
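The comparison above notes that Kylin answers queries from cached results. One way to picture that approach is precomputing an aggregate for every combination of dimensions, so that a query becomes a single lookup. The sketch below is a simplified illustration, not Kylin's actual code or storage layout: `build_cube`, `query`, the sample rows, and the plain dictionary standing in for HBase are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Precompute the sum of `measure` for every combination of dimensions."""
    dimensions = sorted(dimensions)  # fixed order so query keys match
    cube = defaultdict(float)
    for row in rows:
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row[measure]
    return dict(cube)


def query(cube, **filters):
    """Answer an aggregate query with one lookup in the precomputed cube."""
    dims = tuple(sorted(filters))
    return cube.get((dims, tuple(filters[d] for d in dims)), 0.0)


# Hypothetical fact table, for illustration only.
rows = [
    {"country": "US", "device": "mobile", "clicks": 3},
    {"country": "US", "device": "desktop", "clicks": 5},
    {"country": "DE", "device": "mobile", "clicks": 2},
]
cube = build_cube(rows, ["country", "device"], "clicks")

print(query(cube))                                  # total over all rows
print(query(cube, country="US"))                    # one dimension fixed
print(query(cube, country="US", device="mobile"))   # both dimensions fixed
```

The trade-off is visible even at this scale: lookups are constant-time, but the number of stored entries grows with every added dimension, and queries outside the precomputed combinations cannot be answered from the cache.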