Modern Big Data Analytics Tools: An Overview
Great Wide Open 2014 - Day 1
Milind Bhandarkar - Pivotal
3:30 PM - Operations 2 (Big Data)

Published in: Technology

  1. Modern Big Data Analytics Tools: An Overview. Milind Bhandarkar, Chief Scientist, Pivotal (Twitter: @techmilind). (All Images Courtesy Flickr, Creative Commons Licensed)
  2. About Me: • http://www.linkedin.com/in/milindb • Founding member of Hadoop team at Yahoo! [2005-2010] • Contributor to Apache Hadoop since v0.1 • Built and led Grid Solutions Team at Yahoo! [2007-2010] • Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu) • Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
  3. Hadoop Midwife :-)
  4. Once upon a time, in a land far far away…
  5. Fast forward 15 years...
  6. What Happened?
  7. And, then… HDFS, MapReduce. [Legend: ASF Projects, FLOSS Projects, Pivotal Products]
  8. In a blink of an eye… HDFS, Pig, Sqoop, Flume, Coordination and workflow management, Zookeeper, Command Center, GemFire XD, Oozie, MapReduce, Hive, Tez, Giraph, Hadoop UI, Hue, SolrCloud, Phoenix, HBase, Crunch, Mahout, Spark, Shark, Streaming, MLlib, GraphX, Impala, HAWQ, Spring XD, MADlib, Hamster, PivotalR, YARN. [Legend: ASF Projects, FLOSS Projects, Pivotal Products]
  9. History (2003-2010)
  10. Google Papers
  11. Yahoo! Search + =
  12. W-1-W: • WebMap: Graph processing for WWW • Dreadnaught: Infrastructure for WebMap • W-1-W: WebMap In One Week • Juggernaut: Infrastructure for W-1-W • JFS, JMR, Condor: Abandoned for Hadoop
  13. Lucene, Nutch
  14. Kryptonite
  15. Major Step Backwards?
  16. "MapReduce is the Revenge of System Programmers on Database community." - Anonymous at XLDB, Stanford, 2010
  17. O’Reilly Books 2013
  18. Who Uses Hadoop? (From Hadoop Summit 2010)
  19. Big Data Landscape - July 2012 http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/
  20. Hadoop Ecosystem (Jan 2013) http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html
  21. Game Changing Hadoop Economics. [Chart: Big Data Platform Price/TB, 2008 to 2013, price axis from $0 to $80,000; series: Big Data DB, Hadoop]
  22. Hadoop Maturity: • ETL Offload: Accommodate massive data growth with existing EDW investments • Data Lakes: Unify Unstructured and Structured Data Access • Big Data Apps: Build analytic-led applications impacting top line revenue • Data-Driven Enterprise: App Dev and Operational Management on HDFS Data Architecture
  23. The Big Gap (Average Enterprises): • 70% of data generated by customers • 80% of data being stored • 3% being prepared for analysis • 0.5% being analyzed • <0.5% being operationalized
  24. Storage Options: • HDFS, MapR, Quantcast QFS • EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre • Amazon S3, EMC Atmos, OpenStack Swift • GlusterFS, Ceph • EMC ViPR
  25. SQL-on-Hadoop: • Pivotal HAWQ • Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger • Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase • More to come...
  26. HAWQ & HDFS: • Network Interconnect • Master Servers: Planning & dispatch • Segment Servers: Query execution • Storage: HDFS, HBase …
  27. [Diagram labels: Namenode, B replication, Rack1, Rack2, Datanode, Read/Write, Master host, Meta Ops, Segment host, Segment, HAWQ Interconnect]
  28. HAWQ vs Hive. [Chart: Lower is Better]
  29. In-Database Analytics: Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
  30. MADlib Algorithms
  31. MADlib Functions: • Linear Regression • Logistic Regression • Multinomial Logistic Regression • K-Means • Association Rules • Latent Dirichlet Allocation • Naïve Bayes • Elastic Net Regression • Decision Trees / Random Forest • Support Vector Machines • Cox Proportional Hazards Regression • Descriptive Statistics • ARIMA
  32. k-Means Usage:
      SELECT * FROM madlib.kmeanspp(
          'customers',  -- name of the input table
          'features',   -- name of the feature array column
          2             -- k: number of clusters
      );
      Output:
      centroids | objective_fn | frac_reassigned | …
      {{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001 | …
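
To give a feel for what a k-means++ call like the one on the k-Means Usage slide computes, here is a minimal single-machine sketch in plain Python. It is not MADlib's implementation: the function names, the squared-Euclidean objective, and the stopping rule are assumptions chosen for illustration, and MADlib runs the equivalent steps data-parallel inside the database. The three returned values loosely mirror the `centroids`, `objective_fn`, and `frac_reassigned` columns in the slide's output.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng):
    """k-means++ seeding: pick each new centroid with probability
    proportional to its squared distance from the nearest chosen centroid.
    Assumes the input has at least k distinct points."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(points, k, max_iter=20, seed=0):
    """Lloyd's iterations after k-means++ seeding (illustrative only)."""
    rng = random.Random(seed)
    centroids = kmeanspp_seed(points, k, rng)
    assignment = [None] * len(points)
    frac_reassigned = 1.0
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        changed = sum(a != b for a, b in zip(assignment, new_assignment))
        frac_reassigned = changed / len(points)
        assignment = new_assignment
        if changed == 0:
            break
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    objective = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, objective, frac_reassigned


# Hypothetical toy data standing in for the 'features' column of 'customers'.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, objective, frac_reassigned = kmeans(points, k=2)
print(centroids, objective, frac_reassigned)
```

The seeding step is what distinguishes k-means++ from plain k-means: spreading the initial centroids apart usually cuts the number of Lloyd iterations needed.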
  33. Accessing HAWQ Through R
  34. PivotalR: • Interface is R client • Execution is in database • Parallelism handled by PivotalR • Supports a portion of R
      R> x = db.data.frame("t1")
      R> l = madlib.lm(interlocks ~ assets + nation, data = x)
  35. A wrapper of MADlib: • Linear regression • Logistic regression • Elastic Net • ARIMA • Table summary
  36. A wrapper of MADlib: • Linear regression • Logistic regression • Elastic Net • ARIMA • Table summary • $ [ [[ $<- [<- [[<- • is.na • + - * / %% %/% ^ • & | ! • == != > < >= <= • merge • by • db.data.frame • as.db.data.frame • preview • sort • c mean sum sd var min max length colMeans colSums • db.connect db.disconnect db.list db.objects db.existsObject delete • dim names • content • predict • And more ... (SQL wrapper)
  37. A wrapper of MADlib: • Linear regression • Logistic regression • Elastic Net • ARIMA • Table summary • Categorical variable: as.factor() • $ [ [[ $<- [<- [[<- • is.na • + - * / %% %/% ^ • & | ! • == != > < >= <= • merge • by • db.data.frame • as.db.data.frame • preview • sort • c mean sum sd var min max length colMeans colSums • db.connect db.disconnect db.list db.objects db.existsObject delete • dim names • content • predict • And more ... (SQL wrapper)
  38. In-Database Execution: • All data stays in DB: R objects merely point to DB objects • All model estimation and heavy lifting done in DB by MADlib • R → SQL translation done in the R client • Only strings of SQL and model output transferred across ODBC/DBI
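
The in-database execution pattern described on this slide can be sketched in a few lines of Python. This is an illustration of the idea only, not PivotalR's code: `DbTableRef` is a hypothetical class, the generated SQL is an assumed shape, and an in-memory SQLite database stands in for HAWQ. The client object holds just a table name, translates a call into a SQL string, and receives only the result back.

```python
import sqlite3


class DbTableRef:
    """Hypothetical client-side handle: stores a table name, never the rows."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table

    def mean_sql(self, column):
        # Client-side translation step: a method call becomes a SQL string.
        # (Sketch only: identifiers are interpolated without validation.)
        return f'SELECT avg("{column}") FROM "{self.table}"'

    def mean(self, column):
        # Only the SQL string goes to the database, and only the scalar
        # result comes back; the table contents stay where they are.
        return self.conn.execute(self.mean_sql(column)).fetchone()[0]


# SQLite in memory is an assumption standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (assets REAL, interlocks REAL)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)", [(10.0, 1.0), (20.0, 3.0), (30.0, 5.0)]
)

t1 = DbTableRef(conn, "t1")
print(t1.mean_sql("assets"))
print(t1.mean("assets"))
```

The same split applies to model fitting: the client would emit a SQL call to a database-side routine and read back only the fitted coefficients.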
  39. Beyond MapReduce with YARN
  40. Hadoop 1.0. [Diagram labels: Single App BATCH, HDFS; Single App INTERACTIVE; Single App BATCH, HDFS; Single App BATCH, HDFS; Single App ONLINE] (Image Courtesy Arun Murthy, Hortonworks)
  41. MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks)
  42. Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks). HADOOP 1.0: HDFS (redundant, reliable storage); MapReduce (cluster resource management & data processing); Pig (data flow), Hive (sql), Others (cascading). HADOOP 2.0: HDFS2 (redundant, reliable storage); YARN (cluster resource management); Tez (execution engine); Pig (data flow), Hive (sql), Others (cascading); MR (batch); RT Stream, Graph (Storm, Giraph); Services (HBase)
  43. YARN Platform: Applications Run Natively IN Hadoop. HDFS2 (Redundant, Reliable Storage); YARN (Cluster Resource Management); BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), OTHER (Search) (Weave…) (Image Courtesy Arun Murthy, Hortonworks)
  44. YARN Architecture. [Diagram labels: ResourceManager, Scheduler, NodeManager (×12), AM 1 with Containers 1.1 to 1.3, AM2 with Containers 2.1 to 2.4, Client2] (Image Courtesy Arun Murthy, Hortonworks)
  45. YARN: • Yet Another Resource Negotiator • Resource Manager • Node Managers • Application Masters • Specific to paradigm, e.g. MR Application master (aka JobTracker)
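
The division of labor among the three YARN roles listed on this slide can be illustrated with a toy model in Python. Everything here is a simplifying assumption rather than the YARN API: the class and method names are hypothetical, resources are reduced to integer slots, scheduling is "pick the node with the most free slots", and the detail that a real ApplicationMaster itself runs inside a container is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent: tracks free capacity and the containers it hosts."""
    name: str
    free_slots: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Cluster-wide arbiter: hands out containers, knows nothing about tasks."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, count):
        granted = []
        for _ in range(count):
            # Toy scheduler: choose the node with the most free slots.
            node = max(self.nodes, key=lambda n: n.free_slots)
            if node.free_slots == 0:
                break  # cluster is full; grant fewer than requested
            node.free_slots -= 1
            container = f"{app_id}.{len(granted) + 1}@{node.name}"
            node.containers.append(container)
            granted.append(container)
        return granted


class ApplicationMaster:
    """Per-application, paradigm-specific logic: decides what runs where."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def run(self, tasks):
        containers = self.rm.allocate(self.app_id, len(tasks))
        # Tasks beyond the granted containers would wait in a real system.
        return dict(zip(containers, tasks))


nodes = [NodeManager("nodeA", 2), NodeManager("nodeB", 2)]
rm = ResourceManager(nodes)
am1 = ApplicationMaster("1", rm)
am2 = ApplicationMaster("2", rm)
placement1 = am1.run(["map-0", "map-1", "reduce-0"])
placement2 = am2.run(["vertex-0", "vertex-1", "vertex-2"])
print(placement1)
print(placement2)
```

Because the ResourceManager only grants containers, a MapReduce master, a graph-processing master, or an MPI launcher can share the same cluster without the scheduler knowing their semantics.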
  46. Beyond MapReduce: • Apache Giraph - BSP & Graph Processing • Storm on YARN - Streaming Computation • HOYA - HBase on YARN • Hamster - MPI on Hadoop • More to come ...
  47. Hamster: • Hadoop and MPI on the same cluster • OpenMPI Runtime on Hadoop YARN • Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System • Open MPI Provides: Process launching, Communication, I/O forwarding
  48. GraphLab + Hamster on Hadoop
  49. About GraphLab: • Graph-based, High-Performance distributed computation framework • Started by Prof. Carlos Guestrin at CMU in 2009 • Recently founded GraphLab Inc. to commercialize GraphLab.org
  50. GraphLab Features: • Topic Modeling (e.g. LDA) • Graph Analytics (PageRank, Triangle counting) • Clustering (K-Means) • Collaborative Filtering • Linear Solvers • etc...
  51. Only Graphs are not Enough: • Full Data processing workflow requires ETL/Postprocessing, Visualization, Data Wrangling, Serving • MapReduce excels at data wrangling • OLTP/NoSQL Row-Based stores excel at Serving • GraphLab should co-exist with other Hadoop frameworks
  52. Data Platform of the Future? [Diagram labels: Analytic Data Marts, SQL Services, Operational Intelligence, In-Memory Database, Run-Time Applications, Data Staging Platform, Data Mgmt. Services, Stream Ingestion, Streaming Services, Software-Defined Datacenter, New Data-fabrics, In-Memory Grid, ...ETC]
  53. Questions?