Core concepts and Key technologies - Big Data Analytics
The Big Data industry has largely solved the problem of 'big data acquisition and persistence' using daily ETL and batch analysis through the Hadoop ecosystem.
Now let's see how the Big Data market evolved beyond batch processing (the Hadoop and RDBMS world) to extract intelligence from global data streams in real time. Tremendous research work has been underway over the last few years to challenge conventional wisdom and reshape the landscape of Big Data analytics. 'Dynamic decision making' is no longer driven by traditional business intelligence alone; it involves fast exploration of data patterns and complex deterministic or stochastic approximations alongside accurate queries.
Let's glance through some of the core concepts and key technologies that are driving this transformation.
Need to preserve Data Locality
Traditional Hadoop MR does not preserve data locality during the map-reduce transition or between iterations. To pass data to the next job in an MR workflow, a job must store its output in HDFS, incurring communication overhead and extra processing time. The Bulk Synchronous Parallel (BSP) model was implemented in Pregel, Giraph and Hama to solve this very important MR problem.
Hama initiates peer-to-peer communication only when necessary, and peers focus on keeping locally processed data on local nodes. BSP manages synchronization and communication in a middle layer, as opposed to a file-system-based parallel random access pattern; Hama's examples include a K-Means clustering implementation built on BSP. Apache Hama provides a stable reference implementation for analyzing streaming events or big data with graph/network structure by implementing a deadlock-free 'message passing interface' and 'barrier synchronization' (significantly reducing network overhead).
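The superstep cycle at the heart of BSP (local compute, message exchange, global barrier) can be sketched in a few lines of plain Python. The `Peer` and `run_superstep` names below are illustrative, not Hama's API; this is a single-process sketch of the model, not a distributed implementation.

```python
# Minimal BSP sketch: each peer computes on its local data, exchanges
# messages only when needed, then all peers reach a barrier before the
# next superstep begins.

class Peer:
    def __init__(self, pid, local_data):
        self.pid = pid
        self.local = local_data      # stays on this "node" across supersteps
        self.inbox = []

    def compute(self, peers):
        # local computation: fold incoming messages into local state
        self.local += sum(self.inbox)
        self.inbox = []
        # communicate only when necessary: send partial result to peer 0
        if self.pid != 0:
            peers[0].inbox.append(self.local)

def run_superstep(peers):
    for p in peers:
        p.compute(peers)
    # barrier: no peer starts the next superstep until all have finished

peers = [Peer(i, data) for i, data in enumerate([1, 2, 3])]
run_superstep(peers)   # peers 1 and 2 send their values to peer 0
run_superstep(peers)   # peer 0 folds them in: 1 + 2 + 3
print(peers[0].local)  # -> 6
```

Note how the locally processed data never leaves its peer unless a message is explicitly sent, which is exactly the locality property the paragraph above describes.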
Need for Real-time processing and streaming ETL
Hadoop was purpose-built for distributed batch processing with static input, output and processor configuration. But fast ad-hoc machine learning queries require real-time distributed processing and real-time updates based on dynamically changing configuration, without any code change. Saffron Memory Base (SMB) is one technology offering real-time analytics on hybrid data. The SQLstream Connector for Hadoop provides bi-directional, continuous integration with Hadoop HBase, and DataTorrent Apex is another front-runner in stream processing.
With SciDB, one can run a query the moment it occurs to the user; by contrast, Hadoop arguably imposes a heavy burden of infrastructure setup, data preparation, map-reduce configuration and architectural coding. Both SciDB and SMB position themselves as complete replacements for Hadoop MR when it comes to complex data analysis.
Need for complex analytic functions
Hadoop MR is not suitable for increasingly complex mathematical and graph functions like PageRank, BFS and matrix multiplication, which require repetitive MR jobs. Many interesting research projects have spawned in recent years to address this: BSP, Twister, HaLoop and others.
Need for rich Data Model and rich Query syntax
The existing MR query APIs have limited syntax for relational joins and group-bys; they do not directly support iteration or recursion in declarative form, and they cannot handle complex semi-structured nested scientific data. Here comes MRQL (Map-Reduce Query Language) to the rescue of Hadoop MR, supporting nested collections, trees, arbitrary query nesting, and user-defined types and functions.
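To make "nested collections and arbitrary query nesting" concrete, here is the kind of query such a language expresses, sketched with plain Python comprehensions rather than MRQL syntax. The record layout (departments containing employee lists) is an illustrative example, not an MRQL schema.

```python
# A nested query over nested data: for each department, the names of
# employees earning over 100 -- returned as a nested result, with no
# manual map-reduce plumbing. Data and field names are illustrative.

departments = [
    {"name": "physics", "employees": [{"name": "a", "salary": 120},
                                      {"name": "b", "salary": 90}]},
    {"name": "math",    "employees": [{"name": "c", "salary": 110}]},
]

result = [
    {"dept": d["name"],
     "high_earners": [e["name"] for e in d["employees"] if e["salary"] > 100]}
    for d in departments
]
print(result)
# -> [{'dept': 'physics', 'high_earners': ['a']},
#     {'dept': 'math', 'high_earners': ['c']}]
```

Expressing the same thing over raw MR key-value pairs would require flattening the nesting by hand, which is precisely the gap MRQL aims to close.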
Impala reads directly from HDFS and HBase data, and will add a columnar storage engine, a cost-based optimizer and other distinctly database-like features.
Need to optimize data flow and query execution
All the big data analytics datastores, both proprietary and open-source, are trying their best to redefine traditional Hadoop MR. MapReduce simply does not make sense as an engine for interactive querying! So here comes Shark, a distributed in-memory MR framework with great query speed.
Need for speed and versatility of MR Query
Apache Hadoop is designed for very high throughput, but not for the sub-second latency needed for interactive data analysis and exploration. Here come Google Dremel and Apache Drill: a columnar query execution engine offering low-latency interactive reporting through DrQL, while encompassing a broad range of low-latency frameworks like the Mongo query language, Cascading and Plume. Drill adheres to the Hadoop philosophy of connecting to multiple storage systems, but broadens the scope and introduces enormous flexibility by supporting multiple query languages, data formats and data sources. Shark extends Hive syntax but employs a very efficient column store with an interactive multi-stage computing flow.
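The reason columnar execution engines win on analytic queries is worth making concrete. The toy sketch below (illustrative data, not any engine's internals) contrasts the two layouts: a query over one column touches only that column's contiguous array instead of walking every full row.

```python
# Columnar vs row layout in miniature: summing ages in a row store
# walks every full row; in a column store it reads one compact array.

rows = [("alice", 30, "NY"), ("bob", 25, "SF"), ("carol", 35, "NY")]

# Row store: scanning ages still passes over names and cities.
row_sum = sum(r[1] for r in rows)

# Column store: the same data pivoted into one array per column.
columns = {"name": ["alice", "bob", "carol"],
           "age":  [30, 25, 35],
           "city": ["NY", "SF", "NY"]}
col_sum = sum(columns["age"])   # reads only the 'age' array

print(row_sum, col_sum)  # -> 90 90
```

Same answer either way; the difference is how many bytes the scan has to move, which is what drives the latency gap the paragraph above describes.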
Ability to cache and reuse intermediate map outputs
HaLoop introduces recursive joins for effective iterative data analysis by caching and reusing loop-invariant data across iterations, a significant improvement over conventional general-purpose MapReduce.
Twister offers a modest mechanism for managing configurable and cacheable MR tasks, implements effective pub/sub-based communication, and of course provides special support for iterative MR computations.
Leverage CPU cache and distributed memory access patterns
There are quite a few frameworks that store data in distributed memory instead of HDFS, such as GridGain, Hazelcast, Spark's RDD and Piccolo. Hadoop was not designed to facilitate interactive analytics, so it took a few game changers to exploit the CPU cache and distributed memory.
Single-bus Multi-core CPU
It is well known how a single-bus multi-core CPU offers simultaneous multi-threading that significantly reduces latency for certain types of algorithms and data structures. So DBMSs are being re-architected from the ground up to leverage the CPU-bound partitioning-phase hash join, as opposed to the memory-bound hash join. It is the ever-increasing speed of CPU caches and TLBs that allows blazing-fast computation and retrieval of hashed results. It is also noteworthy how modern multi-core CPUs and GPGPUs offer cheap compression schemes at virtually no CPU cost, while access to main memory keeps getting slower relative to the ever-galloping CPU.
ElasticCube from SiSense leverages query plans optimized for fast response time and parallel execution on multiple cores, along with continuous 'instruction recycling' for reusing pre-computed results. A few other business analytics databases and tools, like VectorWise, the Tableau Data Engine and SalesEdge, also make great use of the CPU cache to offer blazing-fast ad-hoc queries.
Implementing parallel vectorization of compressed data through SIMD (Single Instruction, Multiple Data)
VectorWise efficiently utilizes the techniques of vectorization, CPU-side compression and 'using the CPU cache as execution memory'. This is also a core technology behind many leading analytics column stores.
Driving Positional-Delta-Tree (PDT)
The PDT keeps both the position and the delta in memory and merges them efficiently with the base data during optimized query execution. Ad-hoc querying in most cases is about identifying the 'difference', and VectorWise makes effective use of the PDT; more can be found in its white paper.
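The gist of positional deltas can be shown in a few lines. This is a deliberately simplified sketch, not VectorWise's actual structure (the real PDT is a tree keeping deltas ordered by position for efficient merge): updates live in memory as position-to-value entries and are merged into the scan on the fly, so the read-optimized base storage is never rewritten.

```python
# Positional-delta sketch: updates are kept as {position: new value}
# in memory and merged with the immutable base column during the scan.

base = [10, 20, 30, 40]          # read-optimized, immutable column
deltas = {1: 25, 3: 41}          # in-memory updates, keyed by position

def scan():
    for pos, value in enumerate(base):
        yield deltas.get(pos, value)   # apply the delta, if any, on the fly

print(list(scan()))  # -> [10, 25, 30, 41]
```

The scan stays sequential and cache-friendly while still reflecting every update, which is why the approach suits update-heavy analytic stores.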
Need to directly query compressed data residing in heavily indexed columnar files
SSDs and flash storage will keep getting cheaper (which means cheaper cloud services) with innovative compression and de-duplication at the file-system level. All the analytics datastores are gearing up to make the most of this feature. Watch out for Pure Storage.
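One way to see what "querying compressed data directly" buys is run-length encoding: a count or filter can be answered one entry per run, without ever decompressing back to one entry per row. The column below is illustrative; real engines combine RLE with dictionary and bit-packing schemes.

```python
# Answering a query on run-length-encoded data without decompressing:
# the count touches one entry per run instead of one per row.

rle = [("NY", 4), ("SF", 2), ("NY", 3)]   # (value, run_length) pairs

def count_eq(runs, value):
    return sum(length for v, length in runs if v == value)

print(count_eq(rle, "NY"))  # -> 7
```

Nine stored rows, three comparisons; on repetitive analytic columns the ratio is far more dramatic, which is why compressed execution and cheap flash reinforce each other.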
Need for Dynamic Memory Computation and Resource Allocation
The output of a batch job has to be dumped to secondary storage. A better approach is to continuously compute the data size, spawn peer processors accordingly, and make sure the working set does not exceed the collective memory available. So it is important to understand that Hadoop MR is not the best fit for processing all types of data structures. The BSP model should be adopted for massive graph processing, where the bulk of the static data can remain in the filesystem while dynamic data is processed by peers and results are kept in memory.
Usage of Fractal Tree Indexes
Local processing is the key! Keep enough buffers and pivots in the tree node itself to avoid frequent, costly round trips along the tree for individual items; in other words, keep filling up the local buffer and then do a bulk flush. New-age drives love bulk updates (more changes per write), which avoids fragmentation. TokuDB replaces the B-tree implementations of MySQL and MongoDB with fractal trees and achieves massive performance gains.
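The buffering idea reduces to something very small when stripped to one level. The sketch below is not TokuDB's implementation (a real fractal tree cascades buffers down every level of the tree); it only shows how accumulating writes in a node-local buffer turns many small random writes into a few bulk ones.

```python
# One-level sketch of fractal-tree-style write buffering: inserts
# accumulate in the node's buffer and are pushed to the "disk-resident"
# leaf in bulk, so the expensive write happens rarely.

class BufferedNode:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []      # pending inserts, kept in the node
        self.leaf = []        # sorted, "on-disk" data
        self.flushes = 0      # count of bulk writes actually performed

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        self.leaf = sorted(self.leaf + self.buffer)  # one bulk write
        self.buffer = []
        self.flushes += 1

node = BufferedNode(capacity=4)
for k in [9, 3, 7, 1, 8, 2, 6, 4]:
    node.insert(k)

print(node.leaf, node.flushes)  # -> [1, 2, 3, 4, 6, 7, 8, 9] 2
```

Eight inserts cost only two bulk writes; a per-item B-tree update would have paid a random write for each.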
Distributed Shared Memory Abstraction
It provides a radical performance improvement over disk-based MR. A partial DAG (Directed Acyclic Graph) execution model describes parallel processing for in-memory computation (aggregate result sets that fit in memory, e.g. intermediate map outputs). Spark uses the 'Resilient Distributed Dataset' architecture to convert a query into an operator tree. Shark keeps re-optimizing a running query after the first few stages of the task DAG, thereby selecting a better join strategy and the right degree of parallelism; it also offers co-partitioning of multiple tables on a common key for faster joins. It leverages SSDs, CPU cores and main memory to the fullest extent!
It is worth mentioning how DDF (Distributed DataFrame, a next-generation extension of the Spark RDD), the H2O DataFrame and Dato's SFrame are expanding the horizons of machine learning at scale.
Parallel Array Computation
It is a simple yet powerful mathematical approach to embed big math functions directly inside the database engine. SciDB has mastered this concept by embedding statistical computation in distributed, multidimensional arrays.
Semi-computation and instant approximation
Fast responses to ad-hoc queries come from continuous learning from experience and instant approximation, as opposed to waiting for the end of processing to compute a final result. A bunch of analytics products are coming to market with built-in 'data science capabilities', for example H2O from 0xdata.
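The simplest instance of "instant approximation" is a running estimate that can be read at any point in the stream, rather than only after a full batch pass. The toy below uses a running mean; real systems apply the same pattern to far richer sketches and models.

```python
# A running estimate: an answer is available after every element,
# not only once the whole dataset has been processed.

def running_mean(stream):
    total, n = 0.0, 0
    for x in stream:
        total, n = total + x, n + 1
        yield total / n          # readable right now, mid-stream

estimates = list(running_mean([10, 20, 30, 40]))
print(estimates)  # -> [10.0, 15.0, 20.0, 25.0]
```

A user querying after the second element gets 15.0 immediately; a batch system would make them wait for 25.0 at the end.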
Need for Push-based Map Reduce Resource Management
Though Hadoop's main strength is distributed processing over clusters, at peak load utilization drops due to scheduling overhead. It is well known that a map-reduce cluster is divided into a fixed number of processor slots based on static configuration. So Facebook introduced Corona, a push-based scheduler in which a cluster manager tracks nodes and dynamically allocates slots, making it easier to utilize all the slots based on cluster workload, for both map-reduce and non-MR applications.
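The contrast with fixed slots can be sketched abstractly: a manager that tracks free capacity per node can push each task to whichever node has room. This is an illustrative toy, not Corona's scheduler; node and task names are made up.

```python
# Push-based scheduling in miniature: the manager knows each node's
# free capacity and pushes tasks to the node with the most room,
# instead of tasks waiting on a statically partitioned slot pool.

def push_schedule(tasks, capacity):
    # capacity: node -> free slots, decremented as tasks are placed
    placement = {}
    for task in tasks:
        node = max(capacity, key=capacity.get)   # most free capacity
        if capacity[node] == 0:
            break                                # cluster saturated
        placement[task] = node
        capacity[node] -= 1
    return placement

placement = push_schedule(["t1", "t2", "t3"], {"nodeA": 1, "nodeB": 2})
print(placement)  # -> {'t1': 'nodeB', 't2': 'nodeA', 't3': 'nodeB'}
```

With static slots, a task typed as "map" could starve while "reduce" slots sit idle; tracking capacity centrally avoids that mismatch.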
Topological Data Analysis
It is a giant leap forward: it treats the data model as a topology of nodes and discovers patterns and results by measuring similarity. Ayasdi has pioneered this idea to build the first 'query-free exploratory analytics tool', a true example of analytics based on unsupervised learning without requiring an a-priori algebraic model. Ayasdi Iris is a mind-boggling insight discovery tool!
Learn from KnowledgeBase
Infobright offers fast analytics based on a knowledge base built by continuously updating metadata about the data, data access patterns, queries and aggregate results. This innovative 'dynamic introspection' helps the columnar storage avoid the need for indexing and costly sub-selects, and allows it to decompress only the required data!
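A concrete flavor of "metadata about data" is per-chunk min/max statistics: a range filter can often skip, or fully count, a whole chunk from its metadata alone, decompressing only the chunks it must inspect. The sketch below is modeled on that knowledge-grid idea with made-up pack contents, not Infobright's actual format.

```python
# Per-pack min/max metadata lets a range query skip packs entirely,
# open only the packs the predicate straddles, and count fully
# relevant packs without touching their values.

packs = [
    {"min": 1,  "max": 9,  "values": [1, 4, 9]},
    {"min": 10, "max": 19, "values": [10, 15, 19]},
    {"min": 20, "max": 29, "values": [20, 25]},
]

def count_greater(packs, threshold):
    total, opened = 0, 0
    for p in packs:
        if p["max"] <= threshold:
            continue                       # irrelevant: skipped via metadata
        if p["min"] > threshold:
            total += len(p["values"])      # fully relevant: count only
            continue
        opened += 1                        # only this pack is decompressed
        total += sum(1 for v in p["values"] if v > threshold)
    return total, opened

print(count_greater(packs, 14))  # -> (4, 1)
```

Eight stored rows, one pack actually opened; the metadata does the rest, which is how such stores stay fast without conventional indexes.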
Reduction in Cluster size
'Data analysis' processors do not need the same number of machines as Hadoop nodes. For example, thanks to advanced storage schema optimization, ParAccel can handle in a single node an analysis that takes Hadoop (on average) 8 nodes. Enhanced file systems like QFS offer a much lower replication factor (1.5) and higher throughput.
Avoid Data duplication
Both ParAccel and Hadapt share a similar vision: analyze the data as close to the data node as possible, without moving it to a separate BI layer. As opposed to the Hadoop connector strategies of MPP analytic datastores, Hadapt processes data in an RDBMS layer sitting close to HDFS and load-balances queries across a virtualized environment of adaptive query engines.
Ensure Fault-tolerance and Reliability
Apache Hama ensures reliability through process communication and barrier synchronization: each peer uses checkpoint recovery, occasionally flushing the volatile part of its state to the DFS, and rolls back to the last checkpoint in the event of failure. Per the Shark documentation, "RDDs track the series of transformations used to build them (their lineage) to recompute lost data". Shark/Spark, with its compact Resilient Distributed Dataset-based distributed in-memory computing, offers great hope to startups and open-source enthusiasts for building a lightning-fast data warehouse system.
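The lineage idea quoted above fits in a few lines: a dataset records which transformation built it and from what parent, so a lost partition is recomputed on demand instead of being restored from a replica. This is a minimal sketch of the concept, not the Spark RDD API; the `Dataset` class is made up for illustration.

```python
# Lineage-based recovery in miniature: if the materialized data is
# lost, it is rebuilt by replaying the recorded transformation on
# the parent dataset, rather than restored from a stored replica.

class Dataset:
    def __init__(self, parent, transform, data=None):
        self.parent = parent        # lineage: where this came from
        self.transform = transform  # lineage: how it was derived
        self.data = data            # may be lost (eviction, node failure)

    def compute(self):
        if self.data is None:       # lost partition: replay the lineage
            self.data = [self.transform(x) for x in self.parent.compute()]
        return self.data

base = Dataset(None, None, data=[1, 2, 3])
doubled = Dataset(base, lambda x: x * 2)
doubled.compute()
doubled.data = None                 # simulate losing the partition
print(doubled.compute())            # -> [2, 4, 6], rebuilt from lineage
```

Because recovery is recomputation, fault tolerance costs nothing until a failure actually happens, which is what makes the approach attractive for in-memory systems.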
References:
Single-Bus-Multi-Core CPU : http://pages.cs.wisc.edu/~jignesh/publ/hashjoin.pdf
Bulk Synchronous Parallel : http://www.staff.science.uu.nl/~bisse101/Book/PSC/psc1_2.pdf
Spark : http://spark-project.org/research/
Shark : https://github.com/amplab/shark
Pregel : http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-
Giraph : http://giraph.apache.org/
Hama : http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
Saffron Memory Base : http://www.slideshare.net/paulhofmann/big-data-and-saffron
SQLstream : http://www.sqlstream.com/applications/hadoop/
RHadoop : https://github.com/RevolutionAnalytics/RHadoop/wiki
MRQL : http://lambda.uta.edu/mrql/
Apache Drill : http://incubator.apache.org/drill/
HaLoop : http://code.google.com/p/haloop/wiki/UserManual
Twister : http://www.iterativemapreduce.org/
BSP : http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
Corona : https://github.com/facebook/hadoop-20/tree/master/src/contrib/corona
ParAccel : http://img.en25.com/Web/ParAccel/%7Be72a7284-edb0-4e58-bb75-ff1145717d2b
QFS : https://github.com/quantcast/qfs/wiki/Performance-Comparison-to-HDFS
Hadapt : http://hadapt.com/assets/Hadapt-Product-Overview1.pdf
ElasticCube : http://pages.sisense.com/elasticube-whitepaper.html?src=bottom
VectorWise : http://fastreporting.files.wordpress.com/2011/03/vectorwise-whitepaper.pdf