Architecting Virtualized Infrastructure for Big Data


Slides from Strata 2012 for Architecting Virtualized Platforms for Big Data.

Published in: Technology, Business

  1. 1. Architecting Virtualized Infrastructure for Big Data
     Richard McDougall (@richardmcdougll), CTO, Application Infrastructure, Big Data Lead, VMware, Inc.
     © 2009 VMware Inc. All rights reserved.
  2. 2. Cloud: Big Shifts in Simplification and Optimization
     • 1. Reduce the Complexity, to simplify operations and maintenance
     • 2. Dramatically Lower Costs, to redirect investment into value-add opportunities
     • 3. Enable Flexible, Agile IT Service Delivery, to meet and anticipate the needs of the business
  3. 3. Infrastructure, Apps and now Data…
     • Simplify Infrastructure With Cloud
     • Simplify App Platform Through PaaS
     • Simplify Data
     • Diagram labels: Build, Run, Manage; Private, Public
  4. 4. Trend 1/3: New Data Growing at 60% Y/Y
     • Exabytes of information stored: 20 Zetta by 2015, 1 Yotta by 2030
     • Yes, you are part of the yotta generation…
     • Data sources: audio, digital tv, digital photos, camera phones, rfid, medical imaging, sensors, satellite images, logs, scanners, twitter, cad/cam, appliances, machine data, digital movies
     • Source: The Information Explosion, 2009
  5. 5. Data Growth in the Enterprise
  6. 6. Trend 2/3: Big Data – Driven by Real-World Benefit
  7. 7. Trend 3/3: Value from Data Exceeds Hardware Cost
     • Value from the intelligence of data analytics now outstrips the cost of hardware
       • Hadoop enables the use of 10x lower cost hardware
       • Hardware cost halving every 18 months
     • Chart (value vs. cost): Big Iron: $40k/CPU; Commodity Cluster: $1k/CPU
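The "halving every 18 months" claim on this slide can be turned into a quick projection. What follows is a minimal sketch, assuming smooth exponential decay (the slide states only the halving period). The function name and the sample horizons are illustrative and do not come from the talk; the two starting prices are the per-CPU figures quoted on the slide.

```python
def projected_cost(cost_now: float, months: float, halving_months: float = 18.0) -> float:
    """Return the cost after `months`, assuming it halves every `halving_months`.

    Smooth exponential decay is an assumption; the slide gives only the halving period.
    """
    return cost_now * 0.5 ** (months / halving_months)


BIG_IRON_COST_PER_CPU = 40_000.0   # $40k/CPU, as quoted on the slide
COMMODITY_COST_PER_CPU = 1_000.0   # $1k/CPU, as quoted on the slide

if __name__ == "__main__":
    # Illustrative horizons: today, one halving period, two halving periods.
    for months in (0, 18, 36):
        cost = projected_cost(COMMODITY_COST_PER_CPU, months)
        print(f"after {months} months: ${cost:,.0f} per CPU")
```

Under this assumption a $1k CPU would cost about $500 after 18 months and about $250 after 36 months, while the value extracted from the data is unaffected by the price drop.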
  8. 8. A Holistic View of a Big Data System
     Diagram components:
     • Real Time Streams
     • Real-Time Processing (s4, storm)
     • Analytics
     • ETL
     • Real Time Structured Database (hBase, Gemfire, Cassandra)
     • Big SQL (Greenplum, AsterData, Etc…)
     • Batch Processing
     • Unstructured Data (HDFS)
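  9. 9. Big Data Frameworks and Characteristics
     • File System (Gluster, Isilon, etc.): scale of data 10s PB; scale of cluster 100s; computable data: some; local disks: yes, for cost
     • Map-reduce (Hadoop): scale of data 100s PB; scale of cluster 1,000s; computable data: yes; local disks: yes, for cost, bandwidth and availability
     • Big-SQL (Greenplum, Aster Data, Netezza, …): scale of data PBs; scale of cluster 100s; computable data: some; local disks: yes, for cost and bandwidth
     • No-SQL (Cassandra, hBase, …): scale of data trillions of rows; scale of cluster 100s; computable data: some; local disks: yes, for cost and availability
     • In-Memory (Redis, Gemfire, Membase, …): scale of data billions of rows; scale of cluster 10s-100s; computable data: yes; local disks: primarily memory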
  10. 10. The Unified Analytics Cloud Platform
     Diagram layers, top to bottom:
     • Analytics Tools: Madlib, Karmasphere, Datameer, Tableau
     • Developer Frameworks / PaaS: Hadoop, Spring, Python, Cloudfoundry
     • Database/DataStore: Cassandra, hBase, HDFS, Greenplum, Voldemort
     • Data Platform / Data PaaS: Data-Director, EMC Chorus
     • Cloud Infrastructure: vSphere (Private, Public)
  11. 11. Unifying the Big Data Platform using Virtualization
     • Goals
       • Make it fast and easy to provision new data Clusters on Demand
       • Allow Mixing of Workloads
       • Leverage virtual machines to provide isolation (esp. for Multi-tenant)
       • Optimize data performance based on virtual topologies
       • Make the system reliable based on virtual topologies
     • Leveraging Virtualization
       • Elastic scale
       • Use high-availability to protect key services, e.g., Hadoop’s namenode/job tracker
       • Resource controls and sharing: re-use underutilized memory, cpu
       • Prioritize Workloads: limit or guarantee resource usage in a mixed environment
     • Diagram labels: Cloud Infrastructure; Private, Public
  12. 12. A Unified Analytics Cloud Significantly Simplifies
     • Simplify
       • Single Hardware Infrastructure
       • Faster/Easier provisioning
     • Optimize
       • Shared Resources = higher utilization
       • Elastic resources = faster on-demand access
     • Diagram: separate SQL, NoSQL, Hadoop and Decision Support clusters consolidated onto one Unified Analytics Infrastructure (Big SQL, NoSQL, Hadoop; Private, Public)
  13. 13. Use Local Disk where it’s Needed
     • SAN Storage: $2 - $10/Gigabyte. $1M gets: 0.5 Petabytes, 200,000 IOPS, 1 Gbyte/sec
     • NAS Filers: $1 - $5/Gigabyte. $1M gets: 1 Petabyte, 400,000 IOPS, 2 Gbyte/sec
     • Local Storage: $0.05/Gigabyte. $1M gets: 20 Petabytes, 10,000,000 IOPS, 800 Gbytes/sec
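The capacity figures on this slide follow from dividing the budget by the price per gigabyte. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 1,000,000 GB) and the lowest $/GB figure quoted for each tier; the function and dictionary names are illustrative, not from the talk.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB


def petabytes_for_budget(budget_usd: float, usd_per_gb: float) -> float:
    """Return how many petabytes a budget buys at a given price per gigabyte."""
    return budget_usd / usd_per_gb / GB_PER_PB


# Lowest $/GB figure quoted for each storage tier on the slide.
COST_PER_GB = {"SAN": 2.00, "NAS": 1.00, "Local": 0.05}

if __name__ == "__main__":
    budget = 1_000_000  # the slide's $1M budget
    for tier, price in COST_PER_GB.items():
        print(f"{tier}: {petabytes_for_budget(budget, price):.1f} PB")
```

At those prices the calculation gives 0.5 PB, 1 PB and 20 PB for SAN, NAS and local storage, which matches the slide. Using the upper end of the SAN and NAS price ranges would shrink their capacity by a further factor of five.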
  14. 14. VMware is Committed to be the Best Virtual Platform for Hadoop
     • Performance Studies and Best Practices
       • Studies through 2010-2011 of Hadoop 0.20 on vSphere 5
       • White paper, including detailed configurations and recommendations
     • Making Hadoop run well on vSphere
       • Performance optimizations in vSphere releases
       • VMware engagement in Hadoop Community effort
       • Supporting key partners with their distributions on vSphere
       • Contributing enhancements to Hadoop
     • Hadoop Framework Integration
       • Spring Hadoop: Enabling Spring to simplify Map-Reduce Jobs
       • Spring Batch: Sophisticated batch management (Oozie on steroids)
  15. 15. Extend Virtual Storage Architecture to Include Local Disk
     • Shared Storage: SAN or NAS
       • Easy to provision
       • Automated cluster rebalancing
     • Hybrid Storage
       • SAN for boot images, VMs, other workloads
       • Local disk for Hadoop & HDFS
       • Scalable Bandwidth, Lower Cost/GB
     • Diagram: hosts running Hadoop VMs alongside other VMs
  16. 16. Performance Analysis of Big Data (Hadoop) on Virtualization
     • Chart: ratio of time taken, lower is better. Y-axis: Ratio to Native (0 to 1.2). Series: 1 VM, 2 VMs.
     • Tested on vSphere 5.0
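The chart's metric is elapsed time under virtualization divided by elapsed time on native hardware, so 1.0 means parity and lower is better. A minimal sketch of that ratio follows; the function name is illustrative and the timings are hypothetical placeholders, not measurements from the talk.

```python
def ratio_to_native(virtual_seconds: float, native_seconds: float) -> float:
    """Return elapsed time in the virtualized run divided by the native run.

    1.0 means parity; values below 1.0 mean the virtualized run finished faster.
    """
    if native_seconds <= 0:
        raise ValueError("native_seconds must be positive")
    return virtual_seconds / native_seconds


if __name__ == "__main__":
    # Hypothetical timings for illustration only; not results from the talk.
    native = 100.0
    for label, virtual in (("1 VM", 104.0), ("2 VMs", 98.0)):
        print(f"{label}: ratio to native = {ratio_to_native(virtual, native):.2f}")
```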
  17. 17. Simplify Heterogeneous Data Management via Data PaaS
     • Data systems: File-system, Large-Scale NoSQL, In-Memory, Big SQL
     • Data PaaS – Common Data Management Layer: Provisioning, Multi-tenancy, Import/Export, Management, Data Discovery
     • Stack layers: Analytics Tools, Developer, Databases, Data Platform, Cloud Infrastructure
  18. 18. vFabric Data Director Powers Database-as-a-Service
     • Serves Existing Applications and New Applications
     • Automation and Self-Service (DBA, App Dev): Provisioning, Backup/Restore, Clone, One click HA
     • Policy Based Control (DBA, IT Admin): Resource Mgmt, Security Mgmt, Database Templates, Monitor
     • Runs on VMware vSphere
  19. 19. Data Systems: Databases, file systems
     • Spectrum from Unstructured to Structured: File-system, Large-Scale NoSQL, In-Memory, Big SQL
     • Stack layers: Analytics Tools, Developer, Databases, Data Platform, Cloud Infrastructure
  20. 20. Technology: Databases and Data Stores for Big Data
     Ordered from Unstructured to Structured:
     • File-system
       • Types of Data: Log files, machine generated data, documents, device data, etc…
       • Technologies: NAS, HDFS, Blob (S3, Atmos, etc.)
       • Values: Store any data, easy to scale-out, can optimize for cost
     • Large-Scale NoSQL
       • Types of Data: Loosely typed device data, records, events, statistics, complex relations/graphs
       • Technologies: Cassandra, hBase, Voldemort
       • Values: Easy to scale-out, flexible and dynamic schemas
     • In-Memory
       • Types of Data: Structured, partitionable data
       • Technologies: Gemfire, Redis, Membase
       • Values: High Throughput, low latency
     • Big SQL
       • Types of Data: Structured data
       • Technologies: Greenplum, Sybase IQ, Aster Data, etc.
       • Values: High performance for repetitive queries. Ease of query language.
  21. 21. Simplified Developer Experience through PaaS
     • Stack layers: Analytics Tools, Developer, Databases, Data Platform, Cloud Infrastructure
     • Highlighted: Platform as a Service
  22. 22. Spring Big Data Integrations
     • NoSQL Integration
       • Spring Data for MongoDB, Gemfire, Riak, Neo4j, Blob, Cassandra
     • Spring Hadoop
       • Announced this week at Strata!
       • Provides support for developing applications based on Hadoop technologies by leveraging the capabilities of the Spring ecosystem.
     • Spring Batch
       • Integration allows Hadoop jobs and HDFS operations as part of workflow
  23. 23. The Unified Analytics Cloud Platform (repeat of the platform diagram from slide 10)
  24. 24. Summary
     • Revolution in Big Data is under way
       • Data centric applications are now critical
     • Hadoop on Virtualization
       • Proven performance
       • Cloud/Virtualization values apparent for Hadoop use
     • Simplify through a Unified Analytics Cloud
       • One Platform for today’s and future big-data systems
       • Better Utilization
       • Faster deployment, elastic resources
       • Secure, Isolated, Multi-tenant capability for Analytics
  25. 25. References
     • Twitter
       • @richardmcdougll
     • My CTO Blog
     • Hadoop on vSphere
       • Talk @ Hadoop World
       • Performance Paper
     • Spring Hadoop