The challenge of serving massive batch-computed data sets online
by David Gruzman
Today we will discuss the case where we have a multi-terabyte
dataset which is periodically recalculated and has to be
served in real time.
SimilarWeb allowed us to reveal the internals of their data flow.
SimilarWeb data flow – the context
The company assembles billions of events from its panel daily.
A fast-growing Hadoop cluster is used to process this data with
various kinds of statistical analysis and machine learning.
The data model is “web scale”. The data derived from the raw events
is processed into “top pages”, “demography”, “keywords” and many
other metrics the company assembles.
Problem dimensionality: per domain, per day, per country. More
dimensions may appear.
How data is calculated
Data is imported into HDFS from a farm of application servers.
A set of MR jobs as well as Hive scripts is used to do the processing.
The result data has a common “key-value” structure,
where the key is our dimensions or a subset of them. For example –
Value: “Top pages: Page1, …; statistics: …”
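To make the key-value shape concrete, here is a minimal sketch of packing the dimensions into one sortable key. The field names and the separator are illustrative assumptions, not SimilarWeb's actual encoding:

```java
import java.util.Map;
import java.util.TreeMap;

public class CompositeKey {
    // Encode (domain, day, country) into a single lexicographically
    // sortable string key; "|" is an arbitrary separator choice.
    static String encode(String domain, String day, String country) {
        return domain + "|" + day + "|" + country;
    }

    public static void main(String[] args) {
        Map<String, String> store = new TreeMap<>();
        store.put(encode("example.com", "2013-06-01", "US"), "top_pages=[...]");
        store.put(encode("example.com", "2013-06-01", "DE"), "top_pages=[...]");
        // Lexicographic order groups all rows of one domain/day together,
        // which is what makes range scans over a dimension prefix possible.
        System.out.println(store.keySet());
    }
}
```

Because the key is ordered, adding a dimension later only extends the suffix; existing prefix scans keep working.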
Abstract schema of the relevant part of the pipeline:
[Diagram: Hadoop Map Reduce → Hadoop HBase stage]
HBase under heavy inserts
First of all – it does work.
The question is what was done to make it work...
HBase: split storms
When you insert data evenly into many regions, all of them
start splitting at roughly the same time. HBase does not
like it... It becomes unavailable, the insertion job
fails, leases expire, etc.
Solution: pre-split the table and disable automatic splitting.
Price: it is hard to achieve an even distribution of the
data among regions, so hotspots are possible...
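As a sketch of what pre-splitting means in practice: assuming row keys start with a uniformly distributed hash prefix (an assumption, not SimilarWeb's actual key design), one can compute evenly spaced split points and pass them to HBase's `Admin.createTable(descriptor, splitKeys)`:

```java
public class PreSplit {
    // Compute (numRegions - 1) evenly spaced split points over a
    // 2-byte key prefix, covering the range 0x0000..0xffff.
    static byte[][] splitPoints(int numRegions) {
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            int boundary = (int) ((0x10000L * i) / numRegions); // 0..65535
            splits[i - 1] = new byte[] { (byte) (boundary >> 8), (byte) boundary };
        }
        return splits;
    }

    public static void main(String[] args) {
        // With HBase these points would be passed to
        // admin.createTable(tableDescriptor, splitPoints(16)).
        for (byte[] s : splitPoints(4))
            System.out.printf("%02x%02x%n", s[0] & 0xff, s[1] & 0xff);
    }
}
```

With the table pre-split this way and automatic splitting disabled, the "everyone splits at once" storm cannot happen; the price, as noted above, is that the chosen boundaries must actually match the key distribution.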
Under heavy load to all regions, all of them
start minor compactions at the same time.
The results are similar to the split storm...
Inherent problem – delayed work
HBase does not do ALL the work required during the insert;
part of the work is delayed until compaction.
A system that delays work is inherently
problematic under prolonged high load.
It is good for spikes of activity, not for a
steady heavy load.
Massive insert problem
There is a lot of overhead in randomly inserting data.
What happens is that MapReduce produces already-sorted
data, and HBase sorts it again.
HBase sorts data constantly, while MR does it in
batch, which is inherently more efficient.
HBase is a strongly consistent system, and under heavy
load all kinds of problems (lease-related) happen.
HBase snapshots – come to the rescue
A snapshot is the capability to get a “point in time” state of a table.
Technically, a snapshot is the list of files which constitute the table,
so taking a snapshot is a pure metadata operation.
When files of the table are to be deleted, they are moved to
an archive directory instead.
Thus all operations like clone and restore are just file renames
and metadata changes.
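A toy sketch of that idea on a plain filesystem (NOT HBase code): the snapshot records only file names, "deletion" becomes a move into an archive directory, and restore is a rename back:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SnapshotToy {
    static String roundTrip() throws IOException {
        Path table = Files.createTempDirectory("table");
        Path archive = Files.createTempDirectory("archive");
        Files.write(table.resolve("hfile-1"), "row data".getBytes());

        // Taking the snapshot: pure metadata, no data is copied.
        List<Path> snapshot = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(table)) {
            for (Path p : ds) snapshot.add(p.getFileName());
        }

        // Dropping the table moves its files into the archive, not away.
        Files.move(table.resolve("hfile-1"), archive.resolve("hfile-1"));

        // Restore from the snapshot: again just renames / metadata changes.
        for (Path name : snapshot)
            Files.move(archive.resolve(name), table.resolve(name));
        return new String(Files.readAllBytes(table.resolve("hfile-1")));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip());
    }
}
```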
HBase – snapshot export
[Diagram: table files before export; move / rename into the archive]
HBase – snapshot export
There is an additional capability of snapshots – export.
Technically it is like DistCp and does not even
require a live HBase cluster on the destination side;
only HDFS has to be operational.
What we gain: DistCp speed and scalability.
What happens: the files are copied into the archive
directory on the destination, and HBase uses that
structure to reconstruct the table.
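For reference, with the stock HBase tooling the flow looks roughly like this (table, snapshot, and cluster names are placeholders):

```
# In the hbase shell: take a point-in-time snapshot of the table
snapshot 'metrics_table', 'metrics_snap'

# Export the snapshot files to the destination cluster's HDFS with
# DistCp-style parallel copies; no live HBase needed on the destination
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot metrics_snap \
  -copy-to hdfs://prod-nn:8020/hbase \
  -mappers 16

# In the hbase shell on the destination: materialize the table
clone_snapshot 'metrics_snap', 'metrics_table'
```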
So how do snapshots help us?
As you remember, SimilarWeb has several
HBase clusters: one is used as the company data
warehouse and two are used to serve production.
So we prepare the data on one cluster, where we
can afford long timeouts, and then move it to the
production clusters using snapshots.
So we get to the following solution:
[Diagram: Hadoop Map Reduce → snapshot export → Hadoop HBase stage]
Is it ideal?
We effectively minimized the impact on HBase.
But we are left with the HBase high-availability problem.
Currently we have two HBase clusters to serve production.
It is working, but it is far from ideal hardware utilization.
In production we do not need strong consistency, and in
terms of the CAP theorem we pay for it under partitions;
in practice it is an availability problem.
We do not need random writes, yet most of
HBase is built for them.
We actually have a more complex system than we need.
BigTable vs Dynamo
There are two kinds of NoSQL systems – those modeled after
BigTable (HBase, Hypertable) and those modeled after
Dynamo (Cassandra, Voldemort, …).
BigTable-style systems are good for a data warehouse, where
the capability to scan data ranges is important.
Dynamo-style systems are good for online serving, since
they are more highly available.
We decided to research which system better fits our case.
The need was formulated as: “to be able to prepare
data files offline and copy them into the system”.
In addition, high availability is a must, so
systems built around the consistent hashing idea were preferred.
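A minimal sketch of the consistent hashing idea behind the Dynamo family (the virtual-node count and hash choice here are arbitrary, not taken from any of the systems above):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.TreeMap;

public class ConsistentHash {
    // Ring position -> physical node name.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Hash a string to a position on the ring (first 8 bytes of MD5).
    static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Each physical node owns many virtual points, smoothing the load.
    public void addNode(String node) {
        for (int v = 0; v < 100; v++) ring.put(hash(node + "#" + v), node);
    }

    public void removeNode(String node) {
        ring.values().removeIf(n -> n.equals(node));
    }

    // A key belongs to the first node clockwise from its hash position.
    public String nodeFor(String key) {
        Long h = ring.ceilingKey(hash(key));
        if (h == null) h = ring.firstKey(); // wrap around the ring
        return ring.get(h);
    }

    public static void main(String[] args) {
        ConsistentHash ch = new ConsistentHash();
        ch.addNode("node-a"); ch.addNode("node-b"); ch.addNode("node-c");
        System.out.println("example.com -> " + ch.nodeFor("example.com"));
    }
}
```

The high-availability property follows from the placement rule: when a node dies, only the keys it owned move (to the next node clockwise); every other key keeps its owner.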
ElephantDB
This is a system created exactly for this use case.
It is capable of serving data from an index prepared offline.
It is very simple – about 5K lines of code.
Main drawback – it is largely unknown... Very few known usages.
Berkeley DB Java Edition is used to serve the local
indexes. This is common with Voldemort, which also
has such an option.
An MR job (Cascading) is used to prepare the indexes.
The indexes are cached locally by the serving nodes.
There is an MR job for incremental changes of the data.
ElephantDB – batch read
Having the data sitting in the DFS in an MR-friendly
format enables us to do scans right there.
Opposite example – we usually scan an HBase
table to process it using MR. When there is no
filtering / predicate push-down, it is a serious
waste of resources.
ElephantDB – drawbacks
The first one is rare use; we already mentioned it.
It is read-only. In case we also need random
writes, we will need to deploy another NoSQL.
How building the data works
The job gets the cluster layout as a parameter.
Therefore it can build data specific to each node.
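A sketch of that idea: given the list of nodes, the build job routes each record to the shard of the node that will serve it. The modulo scheme below is illustrative only; ElephantDB and Voldemort each define their own partitioning:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShardRouter {
    // Deterministically map a key to one of numNodes shards.
    static int shardFor(String key, int numNodes) {
        // Math.floorMod keeps the result non-negative for any hashCode.
        return Math.floorMod(key.hashCode(), numNodes);
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node-0", "node-1", "node-2");
        Map<String, List<String>> shards = new HashMap<>();
        // In the real job, each shard would be written as a separate
        // index file that the matching node later pulls.
        for (String key : Arrays.asList("a.com", "b.com", "c.com", "d.com")) {
            String node = nodes.get(shardFor(key, nodes.size()));
            shards.computeIfAbsent(node, n -> new ArrayList<>()).add(key);
        }
        System.out.println(shards);
    }
}
```

Because routing is deterministic, the serving node can answer a lookup from its local shard alone, with no cross-node traffic at query time.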
Pull vs Push
It was an interesting decision by the LinkedIn
engineers to implement pull.
The explanation is that Voldemort, as the serving
system, should be able to throttle the data load in
order to prevent performance degradation.
We tested on a 3-node dedicated cluster with SSDs.
Throughput: 5-6K reads per second barely
change the CPU level. The documentation talks about
20K requests per node.
Latency: 10-15 milliseconds on non-cached data.
We are researching this number – it sounds too
high for SSD.
1-1.5 milliseconds for cached data.
Voldemort (as well as MongoDB) does not develop its
own caching mechanism but offloads it to the OS.
This is done by MMAPing the data files.
In my opinion it is an inferior approach, since the OS
does not have application-specific statistics and it adds
unneeded context switches.
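For illustration, this is roughly what serving reads through memory-mapped files looks like in Java (a generic sketch, not Voldemort's code): the OS page cache, not the application, decides which pages stay in memory.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapRead {
    // Map the whole file read-only and copy its bytes out; repeated
    // reads of hot regions are served from the OS page cache.
    static String readMapped(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes);
            return new String(bytes);
        }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("store", ".dat");
        Files.write(f, "hello".getBytes());
        System.out.println(readMapped(f));
    }
}
```

The upside is simplicity, and that cache memory is shared with other processes; the downside, as argued above, is that the OS evicts pages without knowing the application's access statistics.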
Easy to install – it took 2 hours to build the cluster,
even without an installer.
Pluggable storage engines.
Support for efficient import of batch-computed data.
There is a limit to the pre-computing approach as the number
of dimensions grows.
What we are doing – we have a proprietary layer
built on LINQ and C# which computes the missing combinations.
We are also evaluating JethroData, which can do it on the fly.
It is an RDBMS engine running on top of HDFS that
gives full indexing with join and group-by capability.
ElephantDB information used