Hadoop World 2011: The Blind Men and the Elephant - Matthew Aslett - The 451 Group

Who is contributing to the Hadoop ecosystem, what are they contributing, and why? Who are the vendors that are supplying Hadoop-related products and services and what do they want from Hadoop? How is the expanding ecosystem benefiting or damaging the Apache Hadoop project? What are the emerging alternatives to Hadoop and what chance do they have? In this session, the 451 Group will seek to answer these questions based on their latest research and present their perspective of where Hadoop fits in the total data management landscape.

Published in: Technology, Business


  1. The Blind Men and the Elephant. Matthew Aslett, Senior Analyst, The 451 Group. Hadoop World, 8 November 2011. © 2011 by The 451 Group. All rights reserved
  2. Agenda: Introduction and family history; The Blind Men and the Elephant; What is the point of Hadoop?; Adoption trends; Big data, total data; Exploratory analytics; Hadoop-related business strategies; Contributors and their contributions; A cautionary tale
  3. The 451 Group. 451 Research is focused on the business of enterprise IT innovation. The company’s analysts provide critical and timely insight into the competitive dynamics of innovation in emerging technology segments. Tier1 Research is a single-source research and advisory firm covering the multi-tenant datacenter, hosting, IT and cloud-computing sectors, blending the best of industry and financial research. The Uptime Institute is ‘The Global Data Center Authority’ and a pioneer in the creation and facilitation of end-user knowledge communities to improve reliability and uninterruptible availability in datacenter facilities. TheInfoPro is a leading IT advisory and research firm that provides real-world perspectives on the customer and market dynamics of the enterprise information technology landscape, harnessing the collective knowledge and insight of leading IT organizations worldwide. ChangeWave Research is a research firm that identifies and quantifies ‘change’ in consumer spending behavior, corporate purchasing, and industry, company and technology trends.
  4. 451 Research. Matthew Aslett: senior analyst, enterprise software; with The 451 Group since 2007; www.twitter.com/maslett. Information Management: operational databases, data warehousing, data caching, event processing; Hadoop first properly covered in March 2009 report covering the formation of Apache Hadoop distributor Cloudera. Commercial Adoption of Open Source (CAOS): open source projects, adoption of open source software, vendor strategies; Hadoop first covered February 2008 as part of coverage of emerging open source data management projects.
  5. A family history?
  6. The Blind Men and the Elephant. “It was six men of Indostan / To learning much inclined, / Who went to see the Elephant / (Though all of them were blind), / That each by observation / Might satisfy his mind.” John Godfrey Saxe (1872)
  7. The Blind Men and the Elephant. “After Hadoop finishes filtering the data, the place you want to put that data is in Oracle Database.” Larry Ellison (2011)
  8. Oracle Big Data Appliance (diagram). Labels: Apache Hadoop; NoSQL Database; Oracle Tools; Oracle Database; Data Integrator for Oracle Database; Data Loader; R distribution; Big data integration.
  9. What is the point of Hadoop? Big data storage; big data integration; big data analytics. Yes, depending on who you ask (and when).
  10. Example deployment: Orbitz. Processes millions of searches and transactions a day, resulting in hundreds of GBs of log data. Big data storage: early Hadoop adopter for long-term storage and processing of un/semi-structured data; too much data to store and process in data warehouse due to cost and space considerations. Big data analytics: adopted Hive for SQL-like query capabilities; also, machine learning to automate hotel ranking based on user behavior. Big data integration: Hadoop provided repository to store and query search logs and MapReduce a more efficient data extraction process; creating data exports to R, and aggregating data to data warehouse.
  11. Vendor timeline – 451 Research coverage (timeline figure spanning MAR 09 to OCT 11). Vendors shown: MicroStrategy, Quest Software, Opera, EMC, NetApp, Dell, Pervasive, Platfora, Jaspersoft, Revolution, Hadapt, SAS, Appistry, Informatica, MapR, Amazon, SnapLogic, Cloudera, Tableau, Oracle, Pentaho, IBM, Hortonworks, Karmasphere, Kitenga, Talend, Microsoft, Datameer, DataStax, RainStor, Platform, ZettaSet, Gluster, Teradata, Composite.
  12. The Apache Hadoop ecosystem. Big data analytics: Microsoft, IBM, Revolution, Platfora, Karmasphere, ZettaSet, MicroStrategy, Tableau, Pentaho, Kitenga, Datameer, Jaspersoft, Opera, SAS. Big data integration: RainStor, Platform, Pervasive, Informatica, Composite, Talend, IBM, Quest, Hadapt, SnapLogic, Oracle, Teradata, Microsoft, Cloudera. Hadoop distributors: Cloudera, Hortonworks, Microsoft, DataStax, IBM, MapR, EMC, Amazon. Big data storage: Appistry, EMC, Dell, IBM, Gluster, NetApp.
  13. Current data management trends. The amount of data to be stored, managed and analyzed is growing rapidly. Data processing capabilities have never been better. The value of data has never been better understood. Preliminary survey results – for illustration purposes. 2013 vs. 2011 % change: Enterprise Data Warehouse 198%; Regional/Departmental Data Marts 169%; Exploratory Analytics Platform 183%; Hadoop Cluster 115%; Data Archive 394%; Operational Databases 703%; Searchable Data Platform 259%; Total Data Growth 2011-2013 180%. RISK / OPPORTUNITY: The data deluge problem is also a big data opportunity.
  14. What is Big Data? More than just rising data volumes: Big Data ≠ Volume
  15. What is Big Data? Also variety of data types/sources and velocity of data updates: Big Data = Volume Variety Velocity. Preliminary survey results – for illustrative purposes: “My organization’s existing data management architecture is suitable to meet its future demands for business intelligence”: Strongly Agree/Agree 29%; Neutral 34%; Disagree/Strongly Disagree 37%.
  16. Current data management trends. The volume, variety and velocity of data is growing rapidly. ‘Big Data’ covers a diverse set of products that can be applied to different problems. Data processing capabilities have never been better. The value of data has never been better understood. RISK / OPPORTUNITY: ‘Big Data’ highlights the problem – volume/variety/velocity, and promises a solution – value, but doesn’t provide a path in between.
  17. What is Total Data? Not just another name for Big Data. Inspired by ‘Total Football’ – a new approach to soccer that emerged in the late 1960s. If your data is big, the way you manage it should be total. Total Data is making the most efficient use of existing and new data management resources to deliver value from data.
  18. What is Total Data? Also the desire of the user to store and process all their data: Value = (Volume Variety Velocity) x Totality. Big data storage.
  19. What is Total Data? Within tolerable time frames: Value = (Volume Variety Velocity) x Totality Time. Stream processing; S4; Hadoop; Storm; Percolator.
  20. What is Total Data? And the desire to explore data for new value: Value = (Volume Variety Velocity) x (Totality + Exploration) Time. Big data analytics.
  21. Data exploration (diagram). Schema on write: Application; Schema; RDBMS; SQL. Schema on read: Application; Hadoop; Schema; MapReduce.
  22. Data exploration: Exploratory Analytics Platform (diagram). Labels: RDBMS + UDFs; SQL-MapReduce; Application; Splunk; HPCC Systems; Loose schema; Hadoop; Dryad; Tenzing; RDBMS; Dremel; Schema; Piccolo; Analytics; MapReduce.
  23. Data platforms for different data types. Preliminary survey results – for illustrative purposes. Series: Enterprise Data Warehouse, Exploratory Analytics Platform, Hadoop. Customer Data: 59%, 5%, 11%; Transactional Data: 51%, 8%, 11%; Domain-specific Application Data: 46%, 14%, 14%; Online Transaction Data: 46%, 11%, 11%; Application Log Data: 41%, 16%, 14%; Other Documents/Content: 35%, 16%, 16%; Audio/Video/Graphics: 30%, 14%, 24%; Network Log Data: 30%, 16%, 22%; Search Log: 27%, 19%, 22%; Other Log Files: 27%, 16%, 24%; Web Log Data: 27%, 19%, 22%; Social Media/Online Data: 24%, 22%, 24%.
  24. Data platforms for different application workloads. Preliminary survey results – for illustrative purposes. Series: Enterprise Data Warehouse, Exploratory Analytics Platform, Hadoop. Data Consolidation: 49%, 11%, 14%; Data Storage for Compliance: 49%, 11%, 16%; Financial Forecasting: 49%, 16%, 8%; Decision Support: 49%, 22%, 8%; Data Sandboxing: 43%, 16%, 11%; Trend Analysis: 43%, 19%, 19%; Data Indexing/Search: 41%, 16%, 19%; Ad Hoc, Iterative Analysis: 41%, 22%, 16%; Customer Analysis: 38%, 22%, 14%; IT Data Analysis: 35%, 22%, 16%; Clickstream Analysis: 30%, 22%, 19%.
  25. eBay’s Singularity platform. Analyze & Report; Discover & Explore. Data warehouse: 6+PB Teradata EDW, structured SQL analysis, 500+ concurrent users. Singularity: 40+PB Teradata appliance, semi-structured SQL, 150+ concurrent users. Hadoop: 20+PB Hadoop cluster, unstructured analysis, 5-10 concurrent users. Singularity features: ‘soft data projection’ – apply structural patterns as the data is analyzed; support for user-defined functions that go beyond standard SQL; a SQL interface familiar to existing analysts.
  26. What is Total Data? While maximizing the investment in existing skills and resources: Value = (Volume Variety Velocity) x (Totality + Exploration) (Time x Skills and Resources). Big data integration.
  27. What is Total Data? While maximizing the investment in existing skills and resources: Value = (Volume Variety Velocity) x (Totality + Exploration) (Time x Skills and Resources). Total Data is making the most efficient use of existing and new data management resources to deliver value from data. Inspired by ‘Total Football’.
  28. The old way (diagram). Labels: App; Relational database; Data cleansing/MDM; EDW; Data mart; Data archive; Reporting/BI.
  29. The old way: Data archive; Operational database; Analytic database; Business intelligence.
  30. The new way (diagram). Labels: App; Stream processing; Cache; Relational database; NoSQL database; NewSQL database; Non-relational database; Hadoop; Datastructure; “Data Hub”; Big data integration; Big data storage; Big data analytics; EDW; Data mart; Exploratory analytics platform; Queryable archive; Reporting/BI.
  31. The new way: Data archive; Exploratory analytics; Data cache/grid; ‘Data Hub’; Non-relational database; Hadoop; Datastructure; NoSQL database; Data warehouse; Event stream processing; Relational database.
  32. Relevant reports. Total Data: explaining the total data management approach to dealing with the impact of big data on the data management landscape. Coming late 2011. sales@the451group.com. Free copy for completing our Total Data survey: www.bit.ly/451data
  33. The Blind Men and the Elephant
  34. The Apache Hadoop ecosystem. Big data analytics: Microsoft, IBM, Revolution, Platfora, Karmasphere, ZettaSet, MicroStrategy, Tableau, Pentaho, Kitenga, Datameer, Jaspersoft, Opera, SAS. Big data integration: RainStor, Platform, Pervasive, Informatica, Composite, Talend, IBM, Quest, Hadapt, SnapLogic, Oracle, Teradata, Microsoft, Cloudera. Hadoop distributors: Cloudera, Hortonworks, Microsoft, DataStax, IBM, MapR, EMC, Amazon. Big data storage: Appistry, EMC, Dell, IBM, Gluster, NetApp.
  35. Hadoop-related business strategies (diagram). Support subscription: Hortonworks; Cloudera CDH; IBM BigInsights Community. Components: Chukwa, Sqoop, ZooKeeper, Pig, HBase, Avro, Mahout, Flume, MapReduce, Whirr, Hama, HDFS, Hive, Hadoop Common.
  36. Hadoop-related business strategies (diagram). Management: Cloudera Enterprise; IBM BigInsights Enterprise; Hortonworks Data Platform. Support subscription. Components: Chukwa, Sqoop, ZooKeeper, Pig, HBase, Avro, Mahout, Flume, MapReduce, Whirr, Hama, HDFS, Hive, Hadoop Common.
  37. Apache Hadoop contributors. Source: Datameer blog. http://datameer.com/blog/uncategorized/whose-hadoop-is-bigger-really-2.html
  38. Key contributors. Source: Hortonworks blog. http://www.hortonworks.com/reality-check-contributions-to-apache-hadoop/
  39. Key contributors. Source: Cloudera blog. http://www.cloudera.com/blog/2011/10/the-community-effect/
  40. Key contributors. Source: Cloudera blog. http://www.cloudera.com/blog/2011/10/the-community-effect/
  41. Hadoop-related business strategies (diagram). Management; Support subscription. Components: Chukwa, Sqoop, ZooKeeper, Pig, HBase, Avro, Mahout, Flume, MapReduce, Whirr, Hama, HDFS, Hive, Hadoop Common. Default alternatives: MapR/EMC – Direct Access NFS; DataStax – CassandraFS. Optional alternatives: IBM – GPFS; Appistry – CloudIQ; Gluster – GlusterFS.
  42. Hadoop-related business strategies (diagram). Management; Support subscription. Components: Chukwa, Sqoop, ZooKeeper, Pig, HBase, Avro, Mahout, Flume, MapReduce, Whirr, Hama, HDFS, Hive, Hadoop Common. Default alternatives: MapR/EMC – JobTracker HA; Platform – Platform MapReduce.
  43. Hadoop component alternatives. Concerns about JobTracker and NameNode as SPOF. MapReduce (diagram): one JobTracker, six TaskTrackers. HDFS (diagram): one NameNode, six DataNodes.
  44. Apache Hadoop 0.23 and beyond. NextGen MapReduce splits JobTracker into resource management and application lifecycle management. NextGen MapReduce (diagram): one Resource Manager, five Node Managers, five App Masters.
  45. Apache Hadoop 0.23 and beyond. NextGen MapReduce splits JobTracker into resource management and application lifecycle management. NextGen MapReduce (diagram): one Resource Manager, five Node Managers, five App Masters. NameNode HA adds a standby NameNode to enable warm and hot standby for both planned and unplanned downtime. NameNode HA (diagram): Active NameNode, Standby NameNode, four DataNodes. Does not preclude the use of alternatives, but does raise the bar for ‘enterprise-level’ capabilities in Apache Hadoop.
  46. A cautionary tale?
  47. Survey details: http://bit.ly/451data ; matthew.aslett@the451group.com ; www.twitter.com/maslett
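
Slide 21 contrasts schema on write (the RDBMS side) with schema on read (the Hadoop side). The following is a minimal Python sketch of that contrast, not code from the talk. The `SCHEMA` mapping, the field names, and both helper functions are hypothetical, and plain Python lists stand in for the RDBMS table and the raw Hadoop store.

```python
# Illustrative only: SCHEMA, the field names, and both helpers are assumptions.
SCHEMA = {"user_id": int, "query": str}


def write_with_schema(table, record):
    """Schema on write: validate and coerce before storing; reject bad records."""
    row = {}
    for field, caster in SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        row[field] = caster(record[field])
    table.append(row)


def read_with_schema(raw_lines, fields):
    """Schema on read: raw text is stored as-is; structure is applied at query time."""
    for line in raw_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < len(fields):
            continue  # a line that does not fit this view is skipped, not rejected at load
        yield dict(zip(fields, parts))


# Schema on write: the malformed record never reaches the table.
table = []
write_with_schema(table, {"user_id": "42", "query": "hotels in paris"})
try:
    write_with_schema(table, {"query": "no user id"})
except ValueError:
    pass

# Schema on read: every line is kept; each reader chooses its own view later.
raw_store = [
    "42\thotels in paris\t2011-11-08",
    "garbled line",
    "7\tflights to nyc\t2011-11-08",
]
rows = list(read_with_schema(raw_store, ["user_id", "query"]))
```

In the sketch, the cost of structure is paid at load time in the first case and at query time in the second, which is one way to read the slide's pairing of schema on read with exploratory use of the data.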

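
Slides 43 to 45 describe the NameNode as a single point of failure and NameNode HA as adding a standby NameNode. The toy Python model below illustrates only the general active/standby pattern. The `NameNode` and `HAPair` classes and their methods are hypothetical names chosen for this sketch, and it does not reflect how Apache Hadoop implements failover.

```python
class NameNode:
    """Toy stand-in for a NameNode: a name, a health flag, and a namespace map."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.namespace = {}  # path -> list of DataNode ids holding the file's blocks


class HAPair:
    """One active and one standby node; the standby mirrors every update."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def add_file(self, path, datanodes):
        # Apply the update to both nodes so the standby is ready to take over.
        self.active.namespace[path] = list(datanodes)
        self.standby.namespace[path] = list(datanodes)

    def failover(self):
        if not self.standby.healthy:
            raise RuntimeError("no healthy NameNode available")
        self.active, self.standby = self.standby, self.active

    def locate(self, path):
        # Switch to the standby first if the active node is down.
        if not self.active.healthy:
            self.failover()
        return self.active.namespace[path]


pair = HAPair(NameNode("nn1"), NameNode("nn2"))
pair.add_file("/logs/search.tsv", ["dn1", "dn3"])
pair.active.healthy = False  # simulate unplanned downtime of nn1
blocks = pair.locate("/logs/search.tsv")
```

After the simulated outage, the lookup is served by the former standby, so the single NameNode is no longer a single point of failure in this toy model.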