Low Latency OLAP with Hadoop and HBase

Low-Latency “OLAP” with Hadoop and HBase
Andrei Dragomir | Software Engineer

© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Synopsis

§  What are we trying to solve
§  Description of our system
§  How it works
§  Minimizing Latency

In a nutshell

Low-latency OLAP system
Hadoop DFS stores the input data (i.e. log files or
HBase tables).
The processing loop takes a cube description and
pre-aggregates it using Hadoop Map/Reduce.
The output is written to a statistics HBase table.
To get the data, users query a server, which scans
the HBase table, applies filters, roll-ups, or
drill-downs, and returns the result.

Vocabulary

Date        Country       City       OS      Browser     Sales
2012-05-12  USA           NY         Win     FF          $ 0.0
2012-05-12  USA           NY         Win     FF          $ 10.0
2012-05-13  USA           SF         OSX     Chrome      $ 25.0
2012-05-13  Canada        Ontario    Linux   Chrome      $ 0.0
2012-05-14  USA           Chicago    OSX     Safari      $ 15.0
...         ...           ...        ...     ...         ...
5 Visits    2 Countries:  4 Cities:  3 OS:   3 Browser:  $50.0
3 Days      USA: 4        NY: 2      Win: 2  FF: 2       3 sales
            Canada: 1     SF: 1      OSX: 2  Chrome: 2


§  We want to get (mostly) numeric data: metrics
§  These metrics have a set of labels (dimensions)
§  We want to view the metrics by any combination of
dimensions

OLAP Queries

§  Rolling up to country level

      SELECT COUNT(visits), SUM(sales)
      GROUP BY country

      Country  visits  sales
      USA      4       $50
      Canada   1       0

§  “Slicing” by browser

      SELECT COUNT(visits), SUM(sales)
      GROUP BY country
      HAVING browser = "FF"

      Country  visits  sales
      USA      2       $10
      Canada   0       0

§  Top browsers by sales

      SELECT SUM(sales), COUNT(visits)
      GROUP BY browser
      ORDER BY sales

      Browser  sales  visits
      Chrome   $25    2
      Safari   $15    1
      FF       $10    2
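
The three queries above can be reproduced on the sample rows from the Vocabulary slide. What follows is a minimal in-memory sketch, not the system's implementation: `aggregate`, `FIELDS` and `ROWS` are names invented for this illustration, and every input row is assumed to count as one visit.

```python
from collections import defaultdict

# Sample rows from the Vocabulary slide.
FIELDS = ("date", "country", "city", "os", "browser", "sales")
ROWS = [
    ("2012-05-12", "USA", "NY", "Win", "FF", 0.0),
    ("2012-05-12", "USA", "NY", "Win", "FF", 10.0),
    ("2012-05-13", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-13", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-14", "USA", "Chicago", "OSX", "Safari", 15.0),
]


def aggregate(rows, group_by, where=None):
    """Return {group value: (visits, sales)}; each row counts as one visit."""
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        record = dict(zip(FIELDS, row))
        if where is not None and not where(record):
            continue  # the filter is applied before grouping
        bucket = totals[record[group_by]]
        bucket[0] += 1
        bucket[1] += record["sales"]
    return {key: (visits, sales) for key, (visits, sales) in totals.items()}


# Roll-up to country level.
by_country = aggregate(ROWS, "country")

# "Slicing" by browser.
ff_by_country = aggregate(ROWS, "country", where=lambda r: r["browser"] == "FF")

# Top browsers by sales (highest first).
top_browsers = sorted(
    aggregate(ROWS, "browser").items(), key=lambda kv: kv[1][1], reverse=True
)
```

One difference from the slide: because the filter runs before grouping, a country with no matching rows (Canada in the "FF" slice) is absent from the result instead of appearing with zeros.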


Looking inside – physical diagram


Looking inside – logical diagram


Simplifying assumptions: pre-aggregation

§  In most cases...
§  Data needs to be summarized – hard to
draw 1B data points
§  You don’t need to look at all dimensions at
the same time – hard to correlate
§  Not all queries are used with the same
frequency


A timeless CS problem: Optimize...

Time
§  Pre-aggregation
§  Fast
§  Efficient reads – O(1)
§  Inflexible
§  Processing latency
§  Combinatorial explosion

Space
§  Runtime aggregation
§  Flexible
§  I/O, CPU intensive
§  Slow – always need to look at all the data
§  Low throughput

Solution ?

§  Just do both !
§  Can tune: pre-aggregate more, or rely on
runtime aggregation
§  Ingestion + process speed vs Query speed

§  Works just like normal queries +
materialized views


Solution ?

§  Process: pre-aggregate all the report
definitions, create an indexed HBase table.
§  Query: use the indexes to get the data
fast. Perform extra aggregation, filtering if
needed at runtime.
§  Platform strengths
§  Parallelism in M/R
§  Fast access and natural key ordering in
HBase

Minimal HBase details

§  Data is stored in tables
§  Each row has a key, and any number of
columns (long & wide)
§  Ordered by row keys: clustered indexes
built-in
§  Sparse tables. NULLs are free.

Row Key  Columns...
u1       v1   v2   v3
u2       v    X    ...
u3       v    x    ...
u4       x    v2   ...
u5       ...  v3   ...
u6       ...  v5   ...
u7       ...  ...  ...
u8       ...  ...  ...


Minimal HBase details

§  Operations use row key: get(), put()
§  Can scan a range of rows: [start, end)
§  We can use the row key as a built-in
indexing mechanism

Row key  Column...
aaa      v1
aab      v2
aac      v3   ←
aad      v4   ←
aae      v5   ←
aaf      v6   ←
aba      ...
abb      ...
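
To illustrate why sorted row keys act as a built-in index, here is a toy in-memory model of `get()`, `put()` and a `[start, end)` range scan. It is a sketch only: `SortedTable` is a hypothetical stand-in written for this example and does not use the HBase client API, and the values stored under `aba` and `abb` are placeholders.

```python
from bisect import bisect_left


class SortedTable:
    """Toy stand-in for an HBase table: rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []    # row keys, always in sorted order
        self._values = {}  # row key -> value

    def put(self, key, value):
        if key not in self._values:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


table = SortedTable()
# Insertion order does not matter; the table keeps keys sorted.
for key, value in [("aba", "v7"), ("aad", "v4"), ("aaa", "v1"), ("aaf", "v6"),
                   ("aab", "v2"), ("abb", "v8"), ("aae", "v5"), ("aac", "v3")]:
    table.put(key, value)

scanned = table.scan("aac", "aag")
```

Both ends of the scan are located by binary search, so the cost depends on the number of rows returned, not on the size of the table.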


SaasBase vs. SQL Views Comparison


Reports configuration

§  List of Dimensions (with custom classes,
arguments, etc)
§  List of Metrics (with custom classes, arguments,
etc)
§  List of Reports, each containing
§  Dimensions (subset)
§  Metrics (subset)
§  Sorting, etc
§  The reports configuration is used in the
entire system: import, process, query

Solution ?

Date        Country  City  Sales
2012-05-12  USA      NY    3
2012-05-12  USA      NY    10
2012-05-13  USA      SF    25
2012-05-13  CAN      ON    0
2012-05-14  USA      CH    15


visits_by_city: {
  dimensions: [country, city],
  metrics: [visits]
},
daily_sales: {
  dimensions: [year, month, day, country],
  metrics: [sales]
}


Statistics HBase Output Table

ROWKEY                        VALUE
daily_sales/2012+05+12+USA    $13
daily_sales/2012+05+13+CAN    $0
daily_sales/2012+05+13+USA    $25
daily_sales/2012+05+14+USA    $15
visits_by_city/CAN+ON         1
visits_by_city/USA+CH         1
visits_by_city/USA+NY         2
visits_by_city/USA+SF         1
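
The step from the input rows and the two report definitions to the statistics table can be sketched as below. This is an illustration in plain Python, not the actual Map/Reduce job: the helper names (`dimension_value`, `metric_value`, `pre_aggregate`) are invented here, and the sketch assumes one metric per report so that each row key maps to a single value.

```python
from collections import defaultdict

# Input rows from the slide above.
ROWS = [
    {"date": "2012-05-12", "country": "USA", "city": "NY", "sales": 3},
    {"date": "2012-05-12", "country": "USA", "city": "NY", "sales": 10},
    {"date": "2012-05-13", "country": "USA", "city": "SF", "sales": 25},
    {"date": "2012-05-13", "country": "CAN", "city": "ON", "sales": 0},
    {"date": "2012-05-14", "country": "USA", "city": "CH", "sales": 15},
]

# The two report definitions from the slide above.
REPORTS = {
    "visits_by_city": {"dimensions": ["country", "city"], "metrics": ["visits"]},
    "daily_sales": {"dimensions": ["year", "month", "day", "country"],
                    "metrics": ["sales"]},
}


def dimension_value(row, name):
    """Derive year/month/day from the date; other dimensions are read as-is."""
    if name in ("year", "month", "day"):
        year, month, day = row["date"].split("-")
        return {"year": year, "month": month, "day": day}[name]
    return row[name]


def metric_value(row, name):
    """Each input row counts as one visit; other metrics come from the row."""
    return 1 if name == "visits" else row[name]


def pre_aggregate(rows, reports):
    """Build {row key: value}, sorted by row key as HBase would store it."""
    table = defaultdict(int)
    for row in rows:
        for report, spec in reports.items():
            dims = "+".join(dimension_value(row, d) for d in spec["dimensions"])
            for metric in spec["metrics"]:
                table[report + "/" + dims] += metric_value(row, metric)
    return dict(sorted(table.items()))


stats = pre_aggregate(ROWS, REPORTS)
```

Sorting the keys reproduces the order of the output table: all `daily_sales` rows in date order, then all `visits_by_city` rows.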


HBase natural order: hierarchical filtering


Sorting

§  Add the metrics that you want to sort by to the
row key...
§  In a way that preserves the ordering
§  ORDER BY metric DESC == Long.MAX_VALUE – metric

2012+05+USA+0000000000+
2012+05+USA+4294961296+SF = 1000 visits
2012+05+USA+4294961396+NY = 900 visits
...
2012+05+USA+9999999999+
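
The ordering trick can be checked with a short sketch. It assumes the key stores `Long.MAX_VALUE – metric` as a zero-padded, fixed-width decimal string, so that lexicographic key order matches descending metric order; the 19-digit width, the function names and the third sample row (CH, 25 visits) are choices made for this illustration, and the resulting digits differ from the 10-digit values shown on the slide.

```python
LONG_MAX = 2**63 - 1        # value of Java's Long.MAX_VALUE
WIDTH = len(str(LONG_MAX))  # 19 digits; fixed width keeps string order numeric


def descending_key(prefix, metric, suffix):
    """Build a row key that sorts larger metrics first."""
    if not 0 <= metric <= LONG_MAX:
        raise ValueError("metric must fit in a non-negative Java long")
    return f"{prefix}+{LONG_MAX - metric:0{WIDTH}d}+{suffix}"


def metric_from_key(key):
    """Recover the original metric from the second-to-last key component."""
    return LONG_MAX - int(key.split("+")[-2])


keys = sorted([
    descending_key("2012+05+USA", 900, "NY"),
    descending_key("2012+05+USA", 25, "CH"),
    descending_key("2012+05+USA", 1000, "SF"),
])
ranking = [(key.split("+")[-1], metric_from_key(key)) for key in keys]
```

Here `ranking` lists SF first, then NY, then CH: a plain forward scan over the sorted keys returns the rows already ordered by metric, descending.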


Minimizing Latency


Minimizing Import Latency

§  Only import the minimal set of changes
§  Map/Reduce input filters:
§  c.a.s.a.i.FileCache – checks if file already
processed
§  c.a.s.a.i.FileDateFilter – checks if a date in
the file path falls within a specified interval
§  process files from 3 days ago up until now,
once
§  HBase scan (from import table) start and stop row
§  Minimize map-task overhead – stitch input splits
§  for 400000 files -> 400000 Map Tasks, slow reduce-copy
phase
§  o.a.h.m.i.CombineFileInputFormat – make 2GB
splits
§  c.a.s.a.m.i.FixedMappersTableInputFormat –
stitches multiple HBase regions in the same
map task
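
The two input-side ideas above, filtering files by the date in their path and stitching many small inputs into fewer large splits, can be sketched as follows. This is not the code of `FileDateFilter` or `CombineFileInputFormat`: the path layout (`/logs/YYYY-MM-DD/...`), the function names and the greedy packing strategy are all assumptions made for this example.

```python
import re
from datetime import date, timedelta

GB = 1024 ** 3
DATE_IN_PATH = re.compile(r"(\d{4})-(\d{2})-(\d{2})")  # assumed path layout


def in_interval(path, start, end):
    """True if the first date found in the path lies in [start, end]."""
    match = DATE_IN_PATH.search(path)
    if match is None:
        return False
    return start <= date(*map(int, match.groups())) <= end


def stitch(files, max_split_bytes):
    """Greedily pack (path, size) pairs, in order, into splits.

    A split is closed when adding the next file would exceed the limit;
    a single file larger than the limit gets a split of its own.
    """
    splits, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        splits.append(current)
    return splits


today = date(2012, 5, 14)
files = [
    ("/logs/2012-05-09/a", 1 * GB),  # older than 3 days: skipped
    ("/logs/2012-05-12/a", 1 * GB),
    ("/logs/2012-05-12/b", 1 * GB),
    ("/logs/2012-05-13/a", 1 * GB),
    ("/logs/2012-05-14/a", 3 * GB),
]
recent = [f for f in files if in_interval(f[0], today - timedelta(days=3), today)]
splits = stitch(recent, 2 * GB)
```

With these inputs, four files survive the filter and are packed into three splits, so three map tasks run instead of four; the saving grows with the number of small files.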



§  If warehousing in HBase, use
o.a.h.h.m.HFileOutputFormat

§  ~ 100 times faster than using the API
§  No shuffle step! You must use a global-order partitioner
§  Problem: data grows over time
§  Solution: estimate output partitions based on input data
size, and make partitions (regions) using this heuristic
§  c.a.s.a.m.FileSizeDatePartitioner – inject input files
size and dates and rebalance regions based on these,
and a fixed size (2GB)
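
The partition heuristic described above can be sketched as follows. This is not the code of `FileSizeDatePartitioner`: it assumes that row keys start with a date, that input size per day is known up front, and that a new region starts at the first day after the current one has reached the fixed size. All names are hypothetical.

```python
from bisect import bisect_right

GB = 1024 ** 3


def split_points(sizes_by_day, region_bytes):
    """Return the days at which a new output region (partition) starts."""
    points, filled = [], 0
    for day in sorted(sizes_by_day):
        if filled >= region_bytes:
            points.append(day)  # current region is full: start a new one here
            filled = 0
        filled += sizes_by_day[day]
    return points


def partition_for(day, points):
    """Map a day to its partition index; later days never map to lower indexes."""
    return bisect_right(points, day)


sizes = {
    "2012-05-11": 1 * GB,
    "2012-05-12": 3 * GB // 2,
    "2012-05-13": GB // 2,
    "2012-05-14": 2 * GB,
    "2012-05-15": 1 * GB,
}
points = split_points(sizes, 2 * GB)
assignment = {day: partition_for(day, points) for day in sorted(sizes)}
```

Because `partition_for` is monotonic in the key, the partitions are globally ordered, and recomputing the split points from the current input sizes lets the number of regions grow as the data grows.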


Minimizing Processing Latency

§  Processing involves reading the input (files, tables,
events), pre-aggregating it (reducing cardinality) and
generating tables that can be queried in real-time
§  Processing does GROUP BY, COUNT/SUM/AVG, ORDER
BY
§  Minimize each M/R step: read, map, partition, combine,
copy, sort, reduce, write
§  Read
§  Filter input data (incremental processing) – differentiate
between OPEN and CLOSED data
§  HBase Scan options: caching, batching, etc
§  Ensure HBase table regions are distributed in the cluster

Minimizing Processing Latency

§  c.a.s.a.m.j.SuperProcessor

§  One shot M/R job: for all data, for all reports, emit the
pre-aggregated values in 1 map() call
§  no allocations
§  Simple and tight
§  no system calls (avoid context switches)
§  no String <> byte[] transformations
§  minimize Map > Combine > Reduce I/O
§  NO ALLOCATIONS


Minimizing Query Latency

§  c.a.s.a.m.t.ReportHandler

§  Simple Thrift server
§  Data is already processed and pre-aggregated
§  Query time does HAVING/WHERE (filters), extra
GROUP BY (roll-ups)
§  Calculate an optimal set of HBase scan()s

§  single / multiple scans
§  start / stop rows (prefixes, index positions)
§  Perform extra roll-ups / sorting
§  Assorted sundries: paging, display-time ser/des, etc
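
A query over the pre-aggregated table then reduces to a prefix scan plus a little runtime work. The sketch below operates on an in-memory copy of the statistics table shown earlier and is not the `ReportHandler` implementation: `prefix_scan` and `run_query` are hypothetical helpers, and the stop row is derived by incrementing the last character of the prefix.

```python
from bisect import bisect_left
from collections import defaultdict

# Pre-aggregated statistics table from the earlier slide, sorted by row key.
STATS = sorted({
    "daily_sales/2012+05+12+USA": 13,
    "daily_sales/2012+05+13+CAN": 0,
    "daily_sales/2012+05+13+USA": 25,
    "daily_sales/2012+05+14+USA": 15,
    "visits_by_city/CAN+ON": 1,
    "visits_by_city/USA+CH": 1,
    "visits_by_city/USA+NY": 2,
    "visits_by_city/USA+SF": 1,
}.items())


def prefix_scan(rows, prefix):
    """Return all rows whose key starts with prefix, via [start, stop) bounds."""
    keys = [key for key, _ in rows]
    stop = prefix[:-1] + chr(ord(prefix[-1]) + 1)  # smallest key past the prefix
    return rows[bisect_left(keys, prefix):bisect_left(keys, stop)]


def run_query(rows, report, leading, where, group_by):
    """Scan by leading dimension values, then filter and roll up at runtime."""
    prefix = report + "/" + "".join(value + "+" for value in leading)
    totals = defaultdict(int)
    for key, value in prefix_scan(rows, prefix):
        dims = key.split("/", 1)[1].split("+")
        if where(dims):
            totals[tuple(dims[i] for i in group_by)] += value
    return dict(totals)


# May 2012 sales for the USA, rolled up from day level to month level.
usa_monthly = run_query(STATS, "daily_sales", ["2012", "05"],
                        where=lambda d: d[3] == "USA", group_by=(0, 1))

# May 2012 sales per country, rolled up across days.
by_country = run_query(STATS, "daily_sales", ["2012", "05"],
                       where=lambda d: True, group_by=(3,))
```

The leading dimensions (year, month) narrow the scan through the row key, while the trailing dimension (country) has to be filtered at runtime; this is why the order of dimensions in a report definition matters for query latency.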


Flexible

§  Report configuration – the core of the system
§  c.a.s.a.e.Dimension, c.a.s.a.e.Metric

§  Can override ser/des, aggregate functions (for metrics)
§  Can override behavior (only add 1 if X...)
§  Emergent patterns are rolled-up in the reporting core
§  The entire processing loop can be written outside of
M/R for realtime
§  Storm ?
§  Applied in 4 use-cases right now, easy to extend
§  Some programming required

Thank you

adragomi@adobe.com / @adragomir
http://hstack.org

Our team: Adrian Muraru, Andrei Dulvac, Bogdan Dragu,
Bogdan Drutu, Cosmin Lehene, Raluca Podiuc, Tudor Scurtu

