Modern Big Data Analytics Tools: An Overview

Milind Bhandarkar

Chief Scientist, Pivotal

(Twitter: @techmilind)

(All Images Courtesy Flickr, Creative Commons Licensed)

About Me
• http://www.linkedin.com/in/milindb

• Founding member of Hadoop team at Yahoo! [2005-2010]

• Contributor to Apache Hadoop since v0.1

• Built and led Grid Solutions Team at Yahoo! [2007-2010]

• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)

• Center for Development of Advanced Computing (C-DAC),
National Center for Supercomputing Applications (NCSA), Center
for Simulation of Advanced Rockets, Siebel Systems (acquired by
Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and
Pivotal (formerly Greenplum)

Once upon a time, in a
land far far away…

And, then…
[Diagram: HDFS and MapReduce. Legend: ASF Projects, FLOSS Projects, Pivotal Products]

In a blink of an eye…
[Diagram: the Hadoop ecosystem on HDFS and YARN. Components shown: MapReduce, Tez, Pig, Hive, Crunch, Mahout, Giraph, HBase, Phoenix, SolrCloud, Sqoop, Flume, coordination and workflow management (Zookeeper, Oozie), Hadoop UI (Hue), Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, GemFire XD, Spring XD, MADlib, PivotalR, Hamster, Command Center. Legend: ASF Projects, FLOSS Projects, Pivotal Products]

W-1-W
•WebMap: Graph processing for WWW

•Dreadnaught: Infrastructure for WebMap

•W-1-W: WebMap In One Week

•Juggernaut: Infrastructure for W-1-W

•JFS, JMR, Condor: Abandoned for Hadoop

MapReduce is the Revenge of
System Programmers on
Database community.
- Anonymous at XLDB, Stanford, 2010

Who Uses Hadoop?
(From Hadoop Summit 2010)

Big Data Landscape - July 2012
http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/

Hadoop Ecosystem (Jan 2013)
http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html

Game Changing Hadoop Economics
[Chart: Big Data Platform Price/TB, 2008 to 2013, y-axis from $0 to $80,000. Series: Big Data DB, Hadoop]

Hadoop Maturity
• ETL Offload: Accommodate massive data growth with existing EDW investments

• Data Lakes: Unify Unstructured and Structured Data Access

• Big Data Apps: Build analytic-led applications impacting top line revenue

• Data-Driven Enterprise: App Dev and Operational Management on HDFS Data Architecture

The Big Gap (Average Enterprises)
• 70% of data generated by customers

• 80% of data being stored

• 3% being prepared for analysis

• 0.5% being analyzed

• <0.5% being operationalized

Storage Options
•HDFS, MapR, Quantcast QFS

•EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS,
Lustre

•Amazon S3, EMC Atmos, OpenStack Swift

•GlusterFS, Ceph

•EMC ViPR

SQL-on-Hadoop
•Pivotal HAWQ

•Cloudera Impala, Facebook Presto, Apache
Drill, Cascading Lingual, Optiq, Hortonworks
Stinger

•Hadapt, Jethrodata, IBM BigSQL, Microsoft
PolyBase

•More to come...

HAWQ & HDFS
[Diagram: master servers (planning & dispatch) and segment servers (query execution) joined by a network interconnect, on top of storage (HDFS, HBase …)]
[Diagram: a master host (meta ops) and several segment hosts, each running multiple segments, linked by the HAWQ interconnect. Segments read/write to HDFS Datanodes, which are spread across Rack1 and Rack2 with replication and coordinated by the Namenode]

In-Database Analytics
Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
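To make "data-parallel" concrete, here is a minimal sketch of the general pattern: each data partition computes partial sums independently, the partial results are merged, and a final step turns the merged sums into a model. It fits a one-variable least-squares line in plain Python. The partition layout and the names `transition`, `merge` and `final` are illustrative assumptions, not MADlib code.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Stats:
    """Sufficient statistics for a one-variable least-squares fit."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0


def transition(rows):
    """Runs independently on each partition: accumulate partial sums."""
    s = Stats()
    for x, y in rows:
        s.n += 1
        s.sx += x
        s.sy += y
        s.sxx += x * x
        s.sxy += x * y
    return s


def merge(a, b):
    """Combine the partial sums of two partitions."""
    return Stats(a.n + b.n, a.sx + b.sx, a.sy + b.sy,
                 a.sxx + b.sxx, a.sxy + b.sxy)


def final(s):
    """Turn the merged sums into (intercept, slope)."""
    slope = (s.n * s.sxy - s.sx * s.sy) / (s.n * s.sxx - s.sx ** 2)
    intercept = (s.sy - slope * s.sx) / s.n
    return intercept, slope


# Hypothetical data spread over three partitions; every row satisfies y = 1 + 2x.
segments = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]

intercept, slope = final(reduce(merge, map(transition, segments)))
print(intercept, slope)
```

Because `merge` is associative, the partitions can be processed in any order and on any number of workers without changing the result.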

MADlib Functions
• Linear Regression

• Logistic Regression

• Multinomial Logistic
Regression

• K-Means

• Association Rules

• Latent Dirichlet Allocation

• Naïve Bayes

• Elastic Net Regression

• Decision Trees / Random
Forest

• Support Vector Machines

• Cox Proportional Hazards
Regression

• Descriptive Statistics

• ARIMA

k-Means Usage
SELECT * FROM madlib.kmeanspp(
    'customers', -- name of the input table
    'features',  -- name of the feature array column
    2            -- k: number of clusters
);

centroids | objective_fn | frac_reassigned | …
------------------------------------------------------------------------+------------------+-----------------+ …
{{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001 | …
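For readers unfamiliar with the output columns above, the following is a small Python sketch of k-means with k-means++ seeding that returns the same three quantities: the centroids, an objective value, and the fraction of points reassigned in the last iteration. It is an illustration only, not MADlib's implementation; the sum of squared distances as the objective, the stopping threshold, and the sample points are all assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def seed_kmeanspp(points, k, rng):
    """k-means++ seeding: pick each new seed with probability
    proportional to its squared distance from the nearest seed so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(points, k, max_iter=20, min_frac_reassigned=0.001, seed=0):
    rng = random.Random(seed)
    centroids = seed_kmeanspp(points, k, rng)
    assignment = [None] * len(points)
    frac_reassigned = 1.0
    for _ in range(max_iter):
        # Assignment step: attach every point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        changed = sum(a != b for a, b in zip(assignment, new_assignment))
        frac_reassigned = changed / len(points)
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if frac_reassigned <= min_frac_reassigned:
            break
    objective = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, objective, frac_reassigned


# Hypothetical feature vectors forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, objective_fn, frac_reassigned = kmeans(points, k=2)
print(sorted(centroids), objective_fn, frac_reassigned)
```

In the in-database version the assignment and update steps run where the data lives; only the small table of centroids is passed between iterations.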

PivotalR
•Interface is R client

•Execution is in database

•Parallelism handled by PivotalR

•Supports a portion of R
R> x = db.data.frame("t1")
R> l = madlib.lm(interlocks ~ assets + nation, data = x)

A wrapper of MADlib
• Linear regression
• Logistic regression
• Elastic Net
• ARIMA
• Table summary
• Categorical variable: as.factor()
• predict

And more ... (SQL wrapper)
• $ [ [[ $<- [<- [[<-
• is.na
• + - * / %% %/% ^
• & | !
• == != > < >= <=
• merge
• by
• db.data.frame
• as.db.data.frame
• preview
• sort
• c mean sum sd var min max length colMeans colSums
• db.connect db.disconnect db.list db.objects db.existsObject delete
• dim names
• content

In-Database Execution
•All data stays in DB: R objects merely point
to DB objects

•All model estimation and heavy lifting done
in DB by MADlib

•R → SQL translation done in the R client

•Only strings of SQL and model output
transferred across ODBC/DBI
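The idea in the bullets above can be sketched in a few lines of Python: client-side objects hold no data, only SQL text, and operators on them build up a larger SQL string that would then be sent to the database. The classes `Expr` and `DbFrame` and the method `mean_sql` are hypothetical names for illustration and are not PivotalR's API; the table and column names are borrowed from the earlier example.

```python
class Expr:
    """A column expression that exists only as SQL text on the client."""

    def __init__(self, sql):
        self.sql = sql

    def _binary(self, op, other):
        # Numbers are inlined as literals; other Expr objects contribute their SQL.
        rhs = other.sql if isinstance(other, Expr) else repr(other)
        return Expr(f"({self.sql} {op} {rhs})")

    def __add__(self, other):
        return self._binary("+", other)

    def __mul__(self, other):
        return self._binary("*", other)

    def __gt__(self, other):
        return self._binary(">", other)


class DbFrame:
    """Points at a database table; no rows are ever pulled to the client."""

    def __init__(self, table, where=None):
        self.table = table
        self.where = where

    def __getitem__(self, key):
        if isinstance(key, str):
            return Expr(f'"{key}"')      # column reference
        return DbFrame(self.table, key)  # row filter given as an Expr

    def mean_sql(self, expr):
        """Build the SQL that would compute the mean inside the database."""
        sql = f"SELECT avg({expr.sql}) FROM {self.table}"
        if self.where is not None:
            sql += f" WHERE {self.where.sql}"
        return sql


t = DbFrame("t1")
big = t[t["assets"] > 1000]
sql = big.mean_sql(t["assets"] * 2 + t["interlocks"])
print(sql)
# SELECT avg((("assets" * 2) + "interlocks")) FROM t1 WHERE ("assets" > 1000)
```

Only the generated string and the single-row result would cross the connection, which is why the approach scales with the database rather than with client memory.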

Hadoop 1.0
[Diagram: isolated single-app clusters: three BATCH apps on HDFS, one INTERACTIVE app, one ONLINE app]
(Image Courtesy Arun Murthy, Hortonworks)

MapReduce 1.0

Hadoop 2.0
HADOOP 1.0
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)
• Pig (data flow), Hive (sql), Others (cascading)

HADOOP 2.0
• HDFS2 (redundant, reliable storage)
• YARN (cluster resource management)
• Tez (execution engine)
• Pig (data flow), Hive (sql), Others (cascading)
• MR (batch)
• RT Stream, Graph (Storm, Giraph)
• Services (HBase)

YARN Platform
Applications Run Natively IN Hadoop
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• ONLINE (HBase)
• OTHER (Search) (Weave…)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)

YARN Architecture
[Diagram: a ResourceManager with its Scheduler. Clients (Client2 shown) submit applications. Four NodeManagers host the application masters AM 1 and AM2 and their containers: Container 1.1 to 1.3 for AM 1, Container 2.1 to 2.4 for AM2]

YARN
•Yet Another Resource Negotiator

•Resource Manager

•Node Managers

•Application Masters

•Specific to paradigm, e.g. MR Application
master (aka JobTracker)
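The division of labor among the three roles can be illustrated with a toy Python model: Node Managers advertise free memory, a Resource Manager grants containers, and each Application Master asks for what its own job needs. This is a simplified sketch under stated assumptions, not the YARN API: the class and method names are hypothetical, memory is the only resource, the "most free memory first" policy is invented for the example, and the container that would run the Application Master itself is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


@dataclass
class ResourceManager:
    nodes: list

    def allocate(self, app_id, count, mb):
        """Grant up to `count` containers of `mb` each; return the node names used."""
        granted = []
        for _ in range(count):
            # Toy policy: place the container on the node with the most free memory.
            node = max(self.nodes, key=lambda n: n.free_mb)
            if node.free_mb < mb:
                break  # cluster is full; the request is only partly satisfied
            node.free_mb -= mb
            node.containers.append((app_id, mb))
            granted.append(node.name)
        return granted


class ApplicationMaster:
    """Paradigm-specific logic lives here; the ResourceManager only hands out containers."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def run(self, tasks, mb_per_task):
        return self.rm.allocate(self.app_id, tasks, mb_per_task)


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])
placed1 = ApplicationMaster("app1", rm).run(tasks=3, mb_per_task=1024)
placed2 = ApplicationMaster("app2", rm).run(tasks=4, mb_per_task=1024)
print(placed1, placed2)
```

The second application asks for four containers but receives three, because the two nodes together hold only six 1024 MB slots.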

Beyond MapReduce
•Apache Giraph - BSP & Graph Processing

•Storm on YARN - Streaming Computation

•HOYA - HBase on YARN

•Hamster - MPI on Hadoop

•More to come ...

Hamster
• Hadoop and MPI on the same
cluster

• Open MPI Runtime on
Hadoop YARN

• Hadoop Provides: Resource
Scheduling, Process
monitoring, Distributed File
System

• Open MPI Provides: Process
launching, Communication, I/O
forwarding

GraphLab + Hamster
on Hadoop

About GraphLab
•Graph-based, High-Performance distributed
computation framework

•Started by Prof. Carlos Guestrin at CMU in
2009

•GraphLab Inc. recently founded to
commercialize GraphLab.org

GraphLab Features
•Topic Modeling (e.g. LDA)

•Graph Analytics (PageRank, Triangle counting)

•Clustering (K-Means)

•Collaborative Filtering

•Linear Solvers

•etc...

Graphs Alone are not
Enough
•Full data processing workflow requires ETL/
Postprocessing, Visualization, Data Wrangling,
Serving

•MapReduce excels at data wrangling

•OLTP/NoSQL Row-Based stores excel at
Serving

•GraphLab should co-exist with other Hadoop
frameworks

Data Platform of the Future?
[Diagram labels: Analytic Data Marts, SQL Services, Operational Intelligence, In-Memory Database, Run-Time Applications, Data Staging Platform, Data Mgmt. Services, Stream Ingestion, Streaming Services, Software-Defined Datacenter, New Data-fabrics, In-Memory Grid, ...ETC]
