Modern Big Data Analytics Tools: An Overview

Great Wide Open 2014 - Day 1
Milind Bhandarkar - Pivotal
3:30 PM - Operations 2 (Big Data)

Published in: Technology

  1. 1. Modern Big Data Analytics Tools: An Overview Milind Bhandarkar Chief Scientist, Pivotal (Twitter: @techmilind) (All Images Courtesy Flickr, Creative Commons Licensed)
  2. 2. About Me • http://www.linkedin.com/in/milindb • Founding member of Hadoop team at Yahoo! [2005-2010] • Contributor to Apache Hadoop since v0.1 • Built and led Grid Solutions Team at Yahoo! [2007-2010] • Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu) • Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
  3. 3. Hadoop Midwife :-)
  4. 4. Once upon a time, in a land far far away…
  5. 5. Fast forward 15 years..
  6. 6. What Happened?
  7. 7. And, then… HDFS, MapReduce (Legend: ASF Projects, FLOSS Projects, Pivotal Products)
  8. 8. In a blink of an eye… HDFS, Pig, Sqoop, Flume, Coordination and workflow management, Zookeeper, Command Center, GemFire XD, Oozie, MapReduce, Hive, Tez, Giraph, Hadoop UI, Hue, SolrCloud, Phoenix, HBase, Crunch, Mahout, Spark, Shark, Streaming, MLlib, GraphX, Impala, HAWQ, SpringXD, MADlib, Hamster, PivotalR, YARN (Legend: ASF Projects, FLOSS Projects, Pivotal Products)
  9. 9. History (2003-2010)
  10. 10. Google Papers
  11. 11. Yahoo! Search + =
  12. 12. W-1-W •WebMap: Graph processing for WWW •Dreadnaught: Infrastructure for WebMap •W-1-W: WebMap In One Week •Juggernaut: Infrastructure for W-1-W •JFS, JMR, Condor: Abandoned for Hadoop
  13. 13. Lucene, Nutch
  14. 14. Kryptonite
  15. 15. Major Step Backwards?
  16. 16. MapReduce is the Revenge of System Programmers on Database community. - Anonymous at XLDB, Stanford, 2010
  17. 17. O’Reilly Books 2013
  18. 18. Who Uses Hadoop? (From Hadoop Summit 2010)
  19. 19. Big Data Landscape - July 2012 http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/
  20. 20. Hadoop Ecosystem (Jan 2013) http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html
  21. 21. Game Changing Hadoop Economics [Chart: Big Data Platform Price/TB, 2008 to 2013, $0 to $80,000; series: Big Data DB, Hadoop]
  22. 22. Hadoop Maturity • ETL Offload: Accommodate massive data growth with existing EDW investments • Data Lakes: Unify Unstructured and Structured Data Access • Big Data Apps: Build analytic-led applications impacting top line revenue • Data-Driven Enterprise: App Dev and Operational Management on HDFS Data Architecture
  23. 23. The Big Gap (Average Enterprises) • 70% of data generated by customers • 80% of data being stored • 3% being prepared for analysis • 0.5% being analyzed • <0.5% being operationalized
  24. 24. Storage Options •HDFS, MapR, Quantcast QFS •EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre •Amazon S3, EMC Atmos, OpenStack Swift •GlusterFS, Ceph •EMC ViPR
  25. 25. SQL-on-Hadoop •Pivotal HAWQ •Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger •Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase •More to come...
  26. 26. HAWQ & HDFS [Diagram: Master Servers (planning & dispatch), Network Interconnect, Segment Servers (query execution), Storage (HDFS, HBase …)]
  27. 27. [Diagram labels: Master host, Meta Ops, Namenode, HAWQ Interconnect, Segment hosts (several Segments each), Datanodes, Read/Write, replication across Rack1 and Rack2]
  28. 28. HAWQ vs Hive Lower is Better
  29. 29. Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data. In-Database Analytics
  30. 30. MADlib Algorithms
  31. 31. MADlib Functions • Linear Regression • Logistic Regression • Multinomial Logistic Regression • K-Means • Association Rules • Latent Dirichlet Allocation • Naïve Bayes • Elastic Net Regression • Decision Trees / Random Forest • Support Vector Machines • Cox Proportional Hazards Regression • Descriptive Statistics • ARIMA
  32. 32. k-Means Usage
      SELECT * FROM madlib.kmeanspp (
          'customers', -- name of the input table
          'features',  -- name of the feature array column
          2            -- k : number of clusters
      );

      centroids                                                               | objective_fn     | frac_reassigned | …
      ------------------------------------------------------------------------+------------------+-----------------+ …
      {{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001           | …
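
To make the three output columns on the k-Means slide concrete, here is a minimal single-machine sketch in Python of k-means with k-means++ seeding. It is not MADlib's implementation: the function names, the default thresholds, and the choice of summed squared Euclidean distance as the objective are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, k, rng):
    """k-means++ seeding: each new centroid is drawn with probability
    proportional to its squared distance from the nearest chosen centroid."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # All points coincide with a centroid; fall back to a uniform pick.
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(points, k, max_iter=20, min_frac_reassigned=0.001, seed=0):
    """Lloyd iterations after k-means++ seeding (hypothetical helper).

    Returns (centroids, objective, frac_reassigned), loosely mirroring
    the columns shown on the slide.
    """
    rng = random.Random(seed)
    centroids = kmeanspp_seed(points, k, rng)
    assignment = [None] * len(points)
    frac_reassigned = 1.0
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        changed = sum(a != b for a, b in zip(assignment, new_assignment))
        frac_reassigned = changed / len(points)
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
        if frac_reassigned <= min_frac_reassigned:
            break
    objective = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, objective, frac_reassigned
```

In this sketch the loop stops once the fraction of points that changed cluster falls to the threshold, which is one plausible reading of the `frac_reassigned` column.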
  33. 33. Accessing HAWQ Through R
  34. 34. Pivotal R •Interface is R client •Execution is in database •Parallelism handled by PivotalR •Supports a portion of R
      R> x = db.data.frame("t1")
      R> l = madlib.lm(interlocks ~ assets + nation, data = x)
  35. 35. A wrapper of MADlib • Linear regression • Logistic regression • Elastic Net • ARIMA • Table summary
  36. 36. A wrapper of MADlib • Linear regression • Logistic regression • Elastic Net • ARIMA • Table summary • $ [ [[ $<- [<- [[<- • is.na • + - * / %% %/% ^ • & | ! • == != > < >= <= • merge • by • db.data.frame • as.db.data.frame • preview • sort • c mean sum sd var min max length colMeans colSums • db.connect db.disconnect db.list db.objects db.existsObject delete • dim names • content And more ... (SQL wrapper) • predict
  37. 37. A wrapper of MADlib • Linear regression • Logistic regression • Elastic Net • ARIMA • Table summary • Categorical variable as.factor() • $ [ [[ $<- [<- [[<- • is.na • + - * / %% %/% ^ • & | ! • == != > < >= <= • merge • by • db.data.frame • as.db.data.frame • preview • sort • c mean sum sd var min max length colMeans colSums • db.connect db.disconnect db.list db.objects db.existsObject delete • dim names • content And more ... (SQL wrapper) • predict
  38. 38. In-Database Execution •All data stays in DB: R objects merely point to DB objects •All model estimation and heavy lifting done in DB by MADlib •R → SQL translation done in the R client •Only strings of SQL and model output transferred across ODBC/DBI
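
The in-database pattern on this slide can be sketched in Python, with the standard-library `sqlite3` module standing in for the database. The `DbFrame` class and its methods are hypothetical and are not the PivotalR API. The handle stores only a table name, the client builds a SQL string, the database computes the aggregates, and only five numbers come back. One simplification to note: the last bit of arithmetic happens client-side here, whereas the slide has MADlib doing the model estimation in the database.

```python
import sqlite3


class DbFrame:
    """Client-side handle (hypothetical): holds a connection and a table
    name, never the rows themselves."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table

    def lm_sql(self, y, x):
        """Translate a one-variable regression request into a SQL string.
        Column and table names are assumed to be trusted identifiers."""
        return (
            f'SELECT COUNT(*), SUM("{x}"), SUM("{y}"), '
            f'SUM("{x}" * "{y}"), SUM("{x}" * "{x}") FROM "{self.table}"'
        )

    def lm(self, y, x):
        """Fit y = intercept + slope * x from aggregates computed in the DB."""
        # Only the SQL string goes out; only five aggregate values come back.
        n, sx, sy, sxy, sxx = self.conn.execute(self.lm_sql(y, x)).fetchone()
        slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        intercept = (sy - slope * sx) / n
        return {"intercept": intercept, "slope": slope}
```

For example, `DbFrame(conn, "t1").lm("y", "x")` would return the fitted coefficients without the client ever fetching the rows of `t1`.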
  39. 39. Beyond MapReduce with YARN
  40. 40. Hadoop 1.0 [Diagram: separate single-app silos: three Single App BATCH clusters, each on its own HDFS, plus Single App INTERACTIVE and Single App ONLINE] (Image Courtesy Arun Murthy, Hortonworks)
  41. 41. MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks)
  42. 42. Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks) • HADOOP 1.0: HDFS (redundant, reliable storage); MapReduce (cluster resource management & data processing); Pig (data flow), Hive (sql), Others (cascading) • HADOOP 2.0: HDFS2 (redundant, reliable storage); YARN (cluster resource management); Tez (execution engine); MR (batch); RT Stream, Graph (Storm, Giraph); Services (HBase); Pig (data flow), Hive (sql), Others (cascading)
  43. 43. YARN Platform: Applications Run Natively IN Hadoop • HDFS2 (Redundant, Reliable Storage) • YARN (Cluster Resource Management) • BATCH (MapReduce) • INTERACTIVE (Tez) • STREAMING (Storm, S4, …) • GRAPH (Giraph) • IN-MEMORY (Spark) • HPC MPI (OpenMPI) • ONLINE (HBase) • OTHER (Search) (Weave…) (Image Courtesy Arun Murthy, Hortonworks)
  44. 44. YARN Architecture [Diagram: ResourceManager with Scheduler, Client2, and NodeManagers hosting AM 1 with Containers 1.1, 1.2, 1.3 and AM2 with Containers 2.1, 2.2, 2.3, 2.4] (Image Courtesy Arun Murthy, Hortonworks)
  45. 45. YARN •Yet Another Resource Negotiator •Resource Manager •Node Managers •Application Masters •Specific to paradigm, e.g. MR Application master (aka JobTracker)
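
The division of labour between the three YARN roles can be illustrated with a toy Python model. This is not the YARN API: every class and method name is an assumption, scheduling is reduced to "pick the node with the most free memory", and real details such as queues, heartbeats, and the application master itself running in a container are left out.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """One worker node: tracks free memory and the containers it hosts."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Cluster-wide arbiter: decides where each requested container runs."""

    def __init__(self, nodes):
        self.nodes = nodes
        self._per_app_count = {}

    def allocate(self, app_id, mb):
        """Grant one container of `mb` megabytes, or None if nothing fits."""
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < mb:
            return None
        node.free_mb -= mb
        seq = self._per_app_count.get(app_id, 0) + 1
        self._per_app_count[app_id] = seq
        container_id = f"{app_id}.{seq}"  # e.g. "1.2", as in the diagram labels
        node.containers.append(container_id)
        return container_id


class ApplicationMaster:
    """Per-application coordinator: negotiates containers for one job."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm
        self.containers = []

    def request(self, count, mb):
        """Ask the ResourceManager for up to `count` containers."""
        for _ in range(count):
            granted = self.rm.allocate(self.app_id, mb)
            if granted is None:
                break  # cluster is full; keep what was granted so far
            self.containers.append(granted)
        return self.containers
```

The point of the sketch is the split the slide describes: the ResourceManager knows about cluster capacity but nothing about the job, while each ApplicationMaster carries the paradigm-specific logic.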
  46. 46. Beyond MapReduce •Apache Giraph - BSP & Graph Processing •Storm on Yarn - Streaming Computation •HOYA - HBase on Yarn •Hamster - MPI on Hadoop •More to come ...
  47. 47. Hamster • Hadoop and MPI on the same cluster • OpenMPI Runtime on Hadoop YARN • Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System • Open MPI Provides: Process launching, Communication, I/O forwarding
  48. 48. GraphLab + Hamster on Hadoop
  49. 49. About GraphLab •Graph-based, High-Performance distributed computation framework •Started by Prof. Carlos Guestrin at CMU in 2009 •Recently founded GraphLab Inc to commercialize Graphlab.org
  50. 50. GraphLab Features •Topic Modeling (e.g. LDA) •Graph Analytics (PageRank, Triangle counting) •Clustering (K-Means) •Collaborative Filtering •Linear Solvers •etc...
  51. 51. Only Graphs are not Enough •Full Data processing workflow requires ETL/Postprocessing, Visualization, Data Wrangling, Serving •MapReduce excels at data wrangling •OLTP/NoSQL Row-Based stores excel at Serving •GraphLab should co-exist with other Hadoop frameworks
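
As a small illustration of the graph analytics named on the GraphLab Features slide, here is PageRank written as a plain single-machine power iteration in Python. It does not use GraphLab or its vertex-program model. The function name, damping factor, and iteration count are assumptions chosen for the sketch.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges.

    Returns a dict mapping each node to its rank; ranks sum to 1.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "teleport" share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out_links[n]:
                share = damping * rank[n] / len(out_links[n])
                for m in out_links[n]:
                    new_rank[m] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for m in nodes:
                    new_rank[m] += damping * rank[n] / len(nodes)
        rank = new_rank
    return rank
```

On a three-node cycle every node ends up with rank 1/3, and a node with no incoming links keeps only the teleport share.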
  52. 52. Data Platform of the Future? [Diagram labels: Analytic Data Marts, SQL Services, Operational Intelligence, In-Memory Database, Run-Time Applications, Data Staging Platform, Data Mgmt. Services, Stream Ingestion, Streaming Services, Software-Defined Datacenter, New Data-fabrics, In-Memory Grid, ...ETC]
  53. 53. Questions?