Modern Big Data Analytics Tools: An Overview

Great Wide Open 2014 - Day 1

Milind Bhandarkar - Pivotal
3:30 PM - Operations 2 (Big Data)

Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Modern Big Data Analytics Tools: An Overview • Milind Bhandarkar, Chief Scientist, Pivotal (Twitter: @techmilind) (All Images Courtesy Flickr, Creative Commons Licensed)
    • About Me • http://www.linkedin.com/in/milindb • Founding member of Hadoop team at Yahoo! [2005-2010] • Contributor to Apache Hadoop since v0.1 • Built and led Grid Solutions Team at Yahoo! [2007-2010] • Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu) • Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
    • Hadoop Midwife :-)
    • Once upon a time, in a land far far away…
    • Fast forward 15 years…
    • What Happened?
    • And then… HDFS, MapReduce (Legend: ASF Projects, FLOSS Projects, Pivotal Products)
    • In a blink of an eye… HDFS, YARN, MapReduce, Pig, Hive, Tez, Giraph, Crunch, Mahout, Sqoop, Flume, HBase, Phoenix, SolrCloud, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, GemFire XD, SpringXD, MADlib, Hamster, PivotalR, Command Center • Coordination and workflow management: ZooKeeper, Oozie • Hadoop UI: Hue (Legend: ASF Projects, FLOSS Projects, Pivotal Products)
    • History (2003-2010)
    • Google Papers
    • Yahoo! Search + =
    • W-1-W • WebMap: Graph processing for WWW • Dreadnaught: Infrastructure for WebMap • W-1-W: WebMap In One Week • Juggernaut: Infrastructure for W-1-W • JFS, JMR, Condor: Abandoned for Hadoop
    • Lucene, Nutch
    • Kryptonite
    • Major Step Backwards?
    • MapReduce is the Revenge of System Programmers on Database community. - Anonymous at XLDB, Stanford, 2010
    • O’Reilly Books 2013
    • Who Uses Hadoop? (From Hadoop Summit 2010)
    • Big Data Landscape - July 2012 http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/
    • Hadoop Ecosystem (Jan 2013) http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html
    • Game Changing Hadoop Economics [Chart: Big Data Platform Price/TB, 2008 to 2013, axis from $0 to $80,000; series: Big Data DB, Hadoop]
    • Hadoop Maturity • ETL Offload: Accommodate massive data growth with existing EDW investments • Data Lakes: Unify Unstructured and Structured Data Access • Big Data Apps: Build analytic-led applications impacting top line revenue • Data-Driven Enterprise: App Dev and Operational Management on HDFS Data Architecture
    • The Big Gap (Average Enterprises) • 70% of data generated by customers • 80% of data being stored • 3% being prepared for analysis • 0.5% being analyzed • <0.5% being operationalized
    • Storage Options • HDFS, MapR, Quantcast QFS • EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre • Amazon S3, EMC Atmos, OpenStack Swift • GlusterFS, Ceph • EMC ViPR
    • SQL-on-Hadoop • Pivotal HAWQ • Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger • Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase • More to come...
    • HAWQ & HDFS • Master Servers: Planning & dispatch • Network Interconnect • Segment Servers: Query execution • Storage: HDFS, HBase …
    • [Diagram: HAWQ on HDFS. A master host (meta ops) and segment hosts, each running several segments co-located with a Datanode, spread across Rack1 and Rack2; the HAWQ interconnect links the segments; the Namenode and Datanodes handle replication and read/write]
    • HAWQ vs Hive (Lower is Better)
    • In-Database Analytics: Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
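As a rough illustration of what "data-parallel" can mean for a statistical method, the sketch below fits a one-variable least-squares line by computing partial sums on each data segment and merging them, so no segment ever ships its raw rows. This is not MADlib code: the function names (`segment_stats`, `merge`, `finalize`) and the toy data are assumptions made for the example.

```python
from functools import reduce

def segment_stats(rows):
    """Partial sums for simple least squares, computed on one segment's (x, y) rows."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return (n, sx, sy, sxx, sxy)

def merge(a, b):
    """Combine the partial sums of two segments (associative and commutative)."""
    return tuple(u + v for u, v in zip(a, b))

def finalize(stats):
    """Turn the merged sums into (intercept, slope) of the fitted line."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Hypothetical data spread over three segments; every point lies on y = 2x + 1.
segments = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
intercept, slope = finalize(reduce(merge, map(segment_stats, segments)))
print(intercept, slope)
```

Because `merge` is associative, the per-segment work can run in any order or in parallel; only five numbers per segment need to travel to the coordinator.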
    • MADlib Algorithms
    • MADlib Functions • Linear Regression • Logistic Regression • Multinomial Logistic Regression • K-Means • Association Rules • Latent Dirichlet Allocation • Naïve Bayes • Elastic Net Regression • Decision Trees / Random Forest • Support Vector Machines • Cox Proportional Hazards Regression • Descriptive Statistics • ARIMA
    • k-Means Usage
      SELECT * FROM madlib.kmeanspp (
          'customers',  -- name of the input table
          'features',   -- name of the feature array column
          2             -- k : number of clusters
      );

      centroids | objective_fn | frac_reassigned | …
      ------------------------------------------------------------------------+------------------+-----------------+ …
      {{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001 | …
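To make the three output columns of the k-means slide concrete, here is a minimal single-machine sketch of Lloyd's k-means iteration in plain Python. It is not MADlib's implementation: it starts from caller-supplied centroids instead of k-means++ seeding, and it assumes the objective is the sum of squared distances to the assigned centroid and that `frac_reassigned` is the share of points whose cluster changed in the last iteration. All names and data are hypothetical.

```python
def squared_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def closest(point, centroids):
    """Index of the centroid nearest to the point."""
    return min(range(len(centroids)), key=lambda i: squared_dist(point, centroids[i]))

def kmeans(points, initial_centroids, max_iter=20, min_frac_reassigned=0.001):
    centroids = list(initial_centroids)
    assignment = [None] * len(points)
    frac_reassigned = 1.0
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_assignment = [closest(p, centroids) for p in points]
        changed = sum(1 for old, new in zip(assignment, new_assignment) if old != new)
        frac_reassigned = changed / len(points)
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
        if frac_reassigned <= min_frac_reassigned:
            break
    objective = sum(squared_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, objective, frac_reassigned

# Hypothetical toy data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids, objective_fn, frac_reassigned = kmeans(points, [(0.0, 0.0), (10.0, 0.0)])
print(centroids, objective_fn, frac_reassigned)
```

The loop stops once fewer than the given fraction of points change cluster, which is one common convergence test for this algorithm.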
    • Accessing HAWQ Through R
    • PivotalR • Interface is R client • Execution is in database • Parallelism handled by PivotalR • Supports a portion of R
      R> x = db.data.frame("t1")
      R> l = madlib.lm(interlocks ~ assets + nation, data = x)
    • A wrapper of MADlib • Linear regression • Logistic regression • Elastic Net • ARIMA • Table summary • Categorical variable: as.factor() • $ [ [[ $<- [<- [[<- • is.na • + - * / %% %/% ^ • & | ! • == != > < >= <= • merge • by • db.data.frame • as.db.data.frame • preview • sort • c mean sum sd var min max length colMeans colSums • db.connect db.disconnect db.list db.objects db.existsObject delete • dim names • content • And more ... (SQL wrapper) • predict
    • In-Database Execution • All data stays in DB: R objects merely point to DB objects • All model estimation and heavy lifting done in DB by MADlib • R → SQL translation done in the R client • Only strings of SQL and model output transferred across ODBC/DBI
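The idea that only SQL strings cross the wire can be sketched as follows. The slides do not show the SQL that PivotalR actually generates, so everything here is an assumption for illustration: the helper `formula_to_sql` is hypothetical, the output table name `t1_model` is made up, and the generated statement targets MADlib's `linregr_train` with an intercept term as one plausible shape of such a translation.

```python
def formula_to_sql(formula, table, out_table):
    """Translate a simple 'y ~ a + b' formula into a SQL string (illustrative only)."""
    response, rhs = (side.strip() for side in formula.split("~"))
    terms = [term.strip() for term in rhs.split("+")]
    # The leading 1 stands for the intercept column.
    independent = "ARRAY[1, " + ", ".join(terms) + "]"
    return (
        "SELECT madlib.linregr_train("
        f"'{table}', '{out_table}', '{response}', '{independent}');"
    )

# The client builds a string; the database does the model fitting.
sql = formula_to_sql("interlocks ~ assets + nation", "t1", "t1_model")
print(sql)
```

A real client would also have to handle categorical columns, interaction terms, and quoting of identifiers; the sketch skips all of that to keep the client/database split visible.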
    • Beyond MapReduce with YARN
    • Hadoop 1.0 [Diagram: separate single-application silos (BATCH, INTERACTIVE, ONLINE), each on its own HDFS] (Image Courtesy Arun Murthy, Hortonworks)
    • MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks)
    • Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks) • HADOOP 1.0: HDFS (redundant, reliable storage); MapReduce (cluster resource management & data processing); Pig (data flow), Hive (sql), Others (cascading) • HADOOP 2.0: HDFS2 (redundant, reliable storage); YARN (cluster resource management); Tez (execution engine); MR (batch); RT Stream, Graph (Storm, Giraph); Services (HBase); Pig (data flow), Hive (sql), Others (cascading)
    • YARN Platform (Image Courtesy Arun Murthy, Hortonworks) • Applications Run Natively IN Hadoop • BATCH (MapReduce) • INTERACTIVE (Tez) • STREAMING (Storm, S4, …) • GRAPH (Giraph) • IN-MEMORY (Spark) • HPC MPI (OpenMPI) • ONLINE (HBase) • OTHER (Search) (Weave…) • YARN (Cluster Resource Management) • HDFS2 (Redundant, Reliable Storage)
    • YARN Architecture (Image Courtesy Arun Murthy, Hortonworks) [Diagram: a client (Client2) submits to the ResourceManager (Scheduler); NodeManagers host the containers; AM 1 runs with Containers 1.1 to 1.3 and AM2 with Containers 2.1 to 2.4]
    • YARN • Yet Another Resource Negotiator • Resource Manager • Node Managers • Application Masters • Specific to paradigm, e.g. MR Application master (aka JobTracker)
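The division of labor between the three YARN roles can be pictured with a toy model. This is not the YARN API: the classes below are hypothetical, memory is the only resource tracked, and first-fit placement is an assumption standing in for YARN's pluggable schedulers.

```python
class NodeManager:
    """Tracks the free memory of one worker node."""
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

class ResourceManager:
    """Cluster-wide arbiter: hands out containers, knows nothing about the job logic."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        # First-fit placement (an assumption made for this sketch).
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return (app_id, node.name, memory_mb)
        return None

class ApplicationMaster:
    """Per-application, paradigm-specific coordinator that negotiates for containers."""
    def __init__(self, app_id, resource_manager):
        self.app_id = app_id
        self.rm = resource_manager

    def request_containers(self, count, memory_mb):
        granted = []
        for _ in range(count):
            container = self.rm.allocate(self.app_id, memory_mb)
            if container is None:
                break  # cluster is full; a real AM would wait and retry
            granted.append(container)
        return granted

rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
am1_containers = ApplicationMaster("app1", rm).request_containers(3, 2048)
am2_containers = ApplicationMaster("app2", rm).request_containers(1, 1024)
print(am1_containers, am2_containers)
```

The point of the split is that the ResourceManager only arbitrates resources, while each ApplicationMaster carries the logic of its own paradigm (MapReduce, MPI, graph processing, and so on).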
    • Beyond MapReduce • Apache Giraph - BSP & Graph Processing • Storm on YARN - Streaming Computation • HOYA - HBase on YARN • Hamster - MPI on Hadoop • More to come ...
    • Hamster • Hadoop and MPI on the same cluster • OpenMPI Runtime on Hadoop YARN • Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System • Open MPI Provides: Process launching, Communication, I/O forwarding
    • GraphLab + Hamster on Hadoop!
    • About GraphLab • Graph-based, High-Performance distributed computation framework • Started by Prof. Carlos Guestrin at CMU in 2009 • Recently founded GraphLab Inc to commercialize Graphlab.org
    • GraphLab Features • Topic Modeling (e.g. LDA) • Graph Analytics (PageRank, Triangle counting) • Clustering (K-Means) • Collaborative Filtering • Linear Solvers • etc...
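For readers unfamiliar with the graph analytics named above, the following is a small single-machine PageRank by power iteration. It does not use GraphLab or its API; the function, the damping factor of 0.85, the fixed iteration count, and the three-node graph are all assumptions chosen for illustration. Every link target must also appear as a key of the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of outgoing links."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the same base share from random jumps.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```

A distributed framework partitions the nodes and exchanges the per-edge shares between machines; the arithmetic per node is the same as in this loop.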
    • Only Graphs are not Enough • Full Data processing workflow requires ETL/Postprocessing, Visualization, Data Wrangling, Serving • MapReduce excels at data wrangling • OLTP/NoSQL Row-Based stores excel at Serving • GraphLab should co-exist with other Hadoop frameworks
    • Data Platform of the Future? • Analytic Data Marts • SQL Services • Operational Intelligence • In-Memory Database • Run-Time Applications • Data Staging Platform • Data Mgmt. Services • Stream Ingestion • Streaming Services • Software-Defined Datacenter • New Data-fabrics • In-Memory Grid • ...ETC
    • Questions?