VMware Serengeti - Based on Infochimps Ironfan


VMware's virtualized Hadoop, based on the Infochimps open source project, Ironfan.


Presentation Transcript

  • © 2012 VMware Inc. All rights reserved Confidential Hadoop-as-a-Service CXO Big Data Seminar September 26, 2012
  • 2 Confidential Agenda ▪ VMware Data Portfolio ▪ Big Data and Virtualization Trends ▪ Enterprise Hadoop Needs ▪ Virtualized Hadoop for the Enterprise ▪ Summary
  • 3 Confidential Trends Driving Change in Enterprise IT Cloud • Offered “as-a-Service” • Virtualization New Application Types • Mobile, SaaS, social • Apps released early and often Frameworks • New application frameworks driving increase in application development Data Disruption • Web orientation drives exponential data volumes • Reduced latency and new types of data
  • 4 Confidential The Database is Being Stretched. Big Data: ▪ Petabytes vs. Gigabytes ▪ Democratize BI. Cloud Delivery: ▪ Virtualized ▪ Offered “-as-a-Service”. Flexible Data: ▪ Multi-structured data ▪ Developer productivity. Fast Data: ▪ Global access patterns ▪ Mobile app proliferation
  • 5 Confidential Big, Fast and Flexible Data Flexible Big Big Data Processing Big Data Analytics Serengeti Fast OLTP workloads Analytic workloads Cloud Delivery Model Data as a service for private and public clouds OSS Relational Document Object Key / Value GemFire vPostgres GemFire GemFire
  • 6 Confidential Agenda ▪ VMware Data Portfolio ▪ Big Data and Virtualization Trends ▪ Enterprise Hadoop Needs ▪ Virtualized Hadoop for the Enterprise ▪ Summary
  • 7 Confidential Data is exploding & Hadoop is driving growth. Unstructured data driving growth: [Chart: structured vs. unstructured data, 2011 to 2020] Complex unstructured data forecasted to outpace structured relational data by 10x by 2020. Hadoop adoption is ramping: Evaluating 53%, In production 23%, Piloting 18%, Testing 2%, Don't know 2%, Other 2% (Source: Forrester Survey of 60 CIOs, September 2011). • Unstructured data explosion and Hadoop capabilities causing CIOs to reconsider enterprise data strategy • Gartner predicts +800% data growth over next 5 years • Hadoop’s ability to process raw data at cost presents intriguing value prop for CIOs
  • 8 Confidential Broad Application of Hadoop technology. Horizontal Use Cases: Log Processing / Click Stream Analytics; Machine Learning / sophisticated data mining; Web crawling / text processing; Extract Transform Load (ETL) replacement; Image / XML message processing; General archiving / compliance. Vertical Use Cases: Financial Services; Mobile / Telecom; Internet Retailer; Scientific Research; Pharmaceutical / Drug Discovery; Social Media. Hadoop’s ability to handle large unstructured data affordably and efficiently makes it a valuable tool kit for enterprises across a number of applications and fields.
  • 9 Confidential The Future of Virtualization VDC Software-defined Datacenter Services. Time to Provision New Services: 2008 Weeks; 2012 Days/Hours; FUTURE Minutes/Seconds. Workloads Virtualized: 2008 25%; 2012 60%+; FUTURE >90%
  • 10 Confidential Virtualization enables a Common Infrastructure for Big Data. Single-purpose clusters for various business applications lead to cluster sprawl. Virtualization Platform ▪ Simplify • Single Hardware Infrastructure • Unified operations ▪ Optimize • Shared Resources = higher utilization • Elastic resources = faster on-demand access [Diagram: Cluster Sprawl (separate MPP DB, Hadoop, and HBase clusters) vs. Cluster Consolidation (MPP DB, Hadoop, and HBase on one Virtualization Platform)]
  • 11 Confidential Agenda ▪ VMware Data Portfolio ▪ Big Data and Virtualization Trends ▪ Enterprise Hadoop Needs ▪ Virtualized Hadoop for the Enterprise ▪ Summary
  • 12 Confidential Hadoop Users Data scientists, analysts, developers • Line of business users • Intimate with data and analysis, not IT • Tasked with providing actionable intelligence that impacts the business Concerns • Obtain a Hadoop cluster on demand • Minimize time to insight • Require reasonable performance from Hadoop cluster
  • 13 Confidential The IT Guy Admins, architects, CIO • Responsible for technology infrastructure, compliance, budget management • Evaluates new technologies and recommends best practices Concerns • Keeping up with demands of the business • Cost savings and consolidation • Reliability • Complexity of running and tuning Hadoop clusters • Shortage of skills to do the above
  • 14 Confidential Hadoop Journey in Enterprises 20 3000 node Integrated Scale
  • 15 Confidential Agenda ▪ VMware Data Portfolio ▪ Big Data and Virtualization Trends ▪ Enterprise Hadoop Needs ▪ Virtualized Hadoop for the Enterprise ▪ Summary
  • 16 Confidential Why Virtualize Hadoop? Elasticity & Multi-tenancy: ▪ Shrink and expand cluster on demand ▪ Independent scaling of compute and data ▪ Strong multi-tenancy. High Availability: ▪ High availability for entire Hadoop stack ▪ One click to set up ▪ Battle-tested. Operational Simplicity: ▪ Rapid deployment ▪ One stop command center ▪ Easy to configure/reconfigure
  • 17 Confidential Project Serengeti ▪ Open source project launched in June 2012 ▪ Toolkit that leverages virtualization to simplify Hadoop deployment and operations ▪ To learn more, visit projectserengeti.org. Serengeti: Deploy a Hadoop cluster in 10 minutes; Customize Hadoop cluster; Use your favorite Hadoop distribution; One stop command center
  • 18 Confidential Rapid Deployment of a Hadoop Cluster with Serengeti. Step 1: Deploy the Serengeti virtual appliance on vSphere. Step 2: Run a few simple commands to stand up a Hadoop cluster. Done.
  • 19 Confidential A Walk Through Serengeti
  • 20 Confidential A Walk Through Serengeti
  • 21 Confidential A Walk Through Serengeti Scaling out a cluster Advanced cluster creation
  • 22 Confidential Customizing Your Hadoop Cluster ▪ Choice of distros ▪ Storage configuration • Choice of shared storage or local disk ▪ Resource configuration ▪ High availability option ▪ # of nodes ▪ Also used to tune Hadoop config … "distro":"apache", "groups":[ { "name": "master", "roles":[ "hadoop_namenode", "hadoop_jobtracker"], "storage": { "type": "SHARED", "sizeGB": 20}, "instanceType": "MEDIUM", "instanceNum": 1, "haFlag": "on"}, {"name": "worker", "roles":[ "hadoop_datanode", "hadoop_tasktracker" ], "instanceType": "SMALL", "instanceNum": 5, "haFlag": "off" …
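The cluster specification fragment on this slide can be assembled and sanity-checked programmatically before it is handed to a deployment tool. The following is a minimal Python sketch that uses only the field names and values visible on the slide; the slide elides the rest of the file, so this is not a complete specification. The helpers `node_group` and `check_spec` are hypothetical names introduced for illustration, and the checks are assumptions, not Serengeti's own validation rules.

```python
import json


def node_group(name, roles, instance_type, instance_num, ha_flag, storage=None):
    """Build one entry of the "groups" list using the field names from the slide."""
    group = {
        "name": name,
        "roles": list(roles),
        "instanceType": instance_type,
        "instanceNum": instance_num,
        "haFlag": ha_flag,
    }
    if storage is not None:
        group["storage"] = storage
    return group


def check_spec(spec):
    """Hypothetical sanity checks (assumed rules, not Serengeti's validation)."""
    problems = []
    for group in spec["groups"]:
        if group["instanceNum"] < 1:
            problems.append(f"{group['name']}: instanceNum must be at least 1")
        if group["haFlag"] not in ("on", "off"):
            problems.append(f"{group['name']}: haFlag must be 'on' or 'off'")
        if not group["roles"]:
            problems.append(f"{group['name']}: at least one role is required")
    return problems


# Only the fields shown on the slide; the original file contains more.
spec = {
    "distro": "apache",
    "groups": [
        node_group(
            "master",
            ["hadoop_namenode", "hadoop_jobtracker"],
            "MEDIUM",
            1,
            "on",
            storage={"type": "SHARED", "sizeGB": 20},
        ),
        node_group(
            "worker",
            ["hadoop_datanode", "hadoop_tasktracker"],
            "SMALL",
            5,
            "off",
        ),
    ],
}

spec_json = json.dumps(spec, indent=2)  # valid JSON with straight quotes
total_nodes = sum(group["instanceNum"] for group in spec["groups"])
problems = check_spec(spec)
```

Generating the file from code keeps the quoting valid JSON and makes it easy to change the number of worker nodes in one place.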
  • 23 Confidential Freedom of Choice and Open Source Community Projects Distributions • Flexibility to choose from major distributions • Support for multiple projects (work in progress) • Open architecture to welcome industry participation • Contributing Hadoop Virtualization Extensions (HVE) to open source community
  • 24 Confidential Use Local Disk where it’s Needed. SAN Storage: $2 - $10/Gigabyte; $1M gets 0.5 Petabytes, 200,000 IOPS, 8 Gbyte/sec. NAS Filers: $1 - $5/Gigabyte; $1M gets 1 Petabyte, 200,000 IOPS, 10 Gbyte/sec. Local Storage: $0.05/Gigabyte; $1M gets 10 Petabytes, 400,000 IOPS, 250 Gbytes/sec
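The capacity figures on this slide can be checked with simple arithmetic. The sketch below divides a $1M budget by the lowest per-gigabyte price quoted for each tier, assuming decimal units (1 PB = 1,000,000 GB). It reproduces the slide's 0.5 PB for SAN and 1 PB for NAS. For local disk the raw arithmetic gives 20 PB, twice the slide's 10 PB; the slide does not explain how its figure was derived, so the code does not try to account for the difference. The function name `raw_capacity_pb` is introduced here for illustration.

```python
GB_PER_PB = 1_000_000  # decimal petabyte, an assumption about the slide's units


def raw_capacity_pb(budget_usd, price_per_gb):
    """Petabytes of raw capacity a budget buys at a flat price per gigabyte."""
    return budget_usd / price_per_gb / GB_PER_PB


# Lowest per-gigabyte price quoted on the slide for each tier.
low_end_price = {"SAN": 2.00, "NAS": 1.00, "Local": 0.05}

budget = 1_000_000
capacity = {
    tier: raw_capacity_pb(budget, price) for tier, price in low_end_price.items()
}

# How many times more raw capacity local disk buys than SAN at these prices.
local_vs_san = capacity["Local"] / capacity["SAN"]

for tier, petabytes in capacity.items():
    print(f"{tier}: {petabytes:.1f} PB for ${budget:,}")
```

Even with the slide's more conservative 10 PB figure, local disk buys an order of magnitude more capacity per dollar than either shared option, which is the point the slide makes.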
  • 25 Confidential Virtual Storage Architecture Includes Local Disk ▪ Shared Storage: SAN or NAS • Easy to provision • Automated cluster rebalancing • Leverage high availability protection ▪ Local Storage: Local Disks • Local disk for Hadoop • Scalable bandwidth, lower cost/GB [Diagram: hosts running Hadoop VMs and other VMs, on shared storage and on local storage]
  • 26 Confidential Hadoop Runs Well on Virtualization [Chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort, and TeraValidate; series: Native, 1 VM, 2 VMs, 4 VMs] Source: http://www.vmware.com/files/pdf/techpaper/VMW-Hadoop-Performance-vSphere5.pdf
  • 27 Confidential Why Virtualize Hadoop? Elasticity & Multi-tenancy: ▪ Shrink and expand cluster on demand ▪ Independent scaling of compute and data ▪ Strong multi-tenancy. High Availability: ▪ High availability for entire Hadoop stack ▪ One click to set up ▪ Battle-tested. Operational Simplicity: ▪ Rapid deployment ▪ One stop command center ▪ Easy to configure/reconfigure
  • 28 Confidential High Availability for the Hadoop Stack [Diagram: HDFS (Hadoop Distributed File System), HBase (Key-Value store), MapReduce (Job Scheduling/Execution System), Pig (Data Flow), Hive (SQL), BI Reporting, ETL Tools, Management Server, ZooKeeper (Coordination), HCatalog, RDBMS; servers: NameNode, JobTracker, Hive MetaDB, HCatalog MDB] HA for the Hadoop stack is more than NameNode HA
  • 29 Confidential vMotion Reduces Planned Downtime Description: Enables the live migration of virtual machines from one host to another with continuous service availability. Benefits: • Revolutionary technology that is the basis for automated virtual machine movement • Meets service level and performance goals
  • 30 Confidential Hadoop Aware HA - Protection Against Unplanned Downtime. Overview: • Protection against host and VM failures • Added application-aware HA for Hadoop NameNode (NN) and JobTracker (JT), protecting against NN and JT failures • Automatic failure detection and virtual machine restart within minutes, on any available host in the cluster • In-progress Hadoop jobs pause and resume when the NameNode is back up
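The pause-and-resume behavior described on this slide can be illustrated with a small polling loop. This is a generic Python sketch of the idea only; it is not how Hadoop or vSphere HA implements it. The function `run_job` and its parameters are hypothetical names, and the outage is simulated with a fixed sequence of health-check results.

```python
import time


def run_job(tasks, service_is_up, poll_seconds=1.0, max_polls=600, sleep=time.sleep):
    """Run tasks in order, pausing while the service is down and resuming after.

    Illustrative only: a generic wait loop, not Hadoop's actual mechanism.
    """
    completed = []
    for task in tasks:
        polls = 0
        # Pause before each task until the service reports healthy again.
        while not service_is_up():
            if polls >= max_polls:
                raise TimeoutError("service did not come back in time")
            sleep(poll_seconds)
            polls += 1
        completed.append(task())
    return completed


# Simulated outage: up for the first check, down for three checks, then up.
status = iter([True, False, False, False, True, True])
waits = []  # records each pause instead of actually sleeping

results = run_job(
    tasks=[lambda: "map", lambda: "shuffle", lambda: "reduce"],
    service_is_up=lambda: next(status, True),
    sleep=waits.append,
)
```

In this simulation the job completes all three tasks in order and records three pauses, which mirrors the slide's claim that work already completed is not lost while the restarted virtual machine comes back.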
  • 31 Confidential vSphere Fault Tolerance Provides Continuous Protection [Diagram: App/OS virtual machines on two VMware ESX hosts, protected by HA and FT] Overview: • Single identical VMs running in lockstep on separate hosts • Zero downtime, zero data loss failover for all virtual machines in case of hardware failures • Integrated with VMware HA/DRS • No complex clustering or specialized hardware required • Single common mechanism for all applications and operating systems. Zero downtime for NameNode, JobTracker and other components in Hadoop clusters
  • 32 Confidential Achieve HA for the Entire Hadoop Stack [Diagram: HDFS (Hadoop Distributed File System), HBase (Key-Value store), MapReduce (Job Scheduling/Execution System), Pig (Data Flow), Hive (SQL), BI Reporting, ETL Tools, Management Server, ZooKeeper (Coordination), HCatalog, RDBMS; servers: NameNode, JobTracker, Hive MetaDB, HCatalog MDB] • Battle-tested high availability technology • Single mechanism to achieve HA for the entire Hadoop stack • One click to enable HA and/or FT
  • 33 Confidential Why Virtualize Hadoop? Elasticity & Multi-tenancy: ▪ Shrink and expand cluster on demand ▪ Independent scaling of compute and data ▪ Strong multi-tenancy. High Availability: ▪ High availability for entire Hadoop stack ▪ One click to set up ▪ Battle-tested. Operational Simplicity: ▪ Rapid deployment ▪ One stop command center ▪ Easy to configure/reconfigure
  • 34 Confidential Evolution of Hadoop on VMs. Current Hadoop: Combined Storage/Compute. Hadoop in VM - VM lifecycle determined by Datanode - Limited elasticity - Limited to Hadoop Multi-Tenancy. Separate Storage - Separate compute from data - Elastic compute - Enable shared workloads - Raise utilization. Separate Compute Clusters - Separate virtual clusters per tenant - Stronger VM-grade security and resource isolation - Enable deployment of multiple Hadoop runtime versions. [Diagram: slave node VMs with compute and storage, first combined and then separated, for tenants T1 and T2]
  • 35 Confidential In-house Hadoop as a Service “Enterprise EMR” – (Hadoop + Hadoop) [Diagram: compute layer with ad hoc data mining, production recommendation engine, and production ETL of log files; data layer with HDFS; hosts on a virtualization platform]
  • 36 Confidential Integrated Big Data Production – (Hadoop + other big data) [Diagram: compute layer with Hadoop batch analysis, HBase real-time queries, NoSQL (Cassandra key-value store), and MPP DBMS (analysis of structured data); data layer with HDFS; hosts on a virtualization platform]
  • 37 Confidential Integrated Hadoop and Webapps – (Hadoop + Other Workloads) [Diagram: compute layer with a short-lived Hadoop compute cluster, a Hadoop compute cluster, and web servers for an ecommerce site; data layer with HDFS; hosts on a virtualization platform]
  • 38 Confidential Agenda ▪ VMware Data Portfolio ▪ Big Data and Virtualization Trends ▪ Enterprise Hadoop Needs ▪ Virtualized Hadoop for the Enterprise ▪ Summary
  • 39 Confidential Simple, Reliable, Elastic Hadoop on Demand. Elasticity & Multi-tenancy: ▪ Shrink and expand cluster on demand ▪ Independent scaling of compute and data ▪ Strong multi-tenancy. High Availability: ▪ High availability for entire Hadoop stack ▪ One click to set up ▪ Battle-tested. Operational Simplicity: ▪ Rapid deployment ▪ One stop command center ▪ Easy to configure/reconfigure. Hadoop-as-a-Service (Enterprise Grade EMR)
  • 40 Confidential Virtualization Benefits across Hadoop Maturity Spectrum 20 3000 node Integrated Scale
  • 41 Confidential Serengeti Resources ▪ Download and try Serengeti • projectserengeti.org ▪ VMware Hadoop site • vmware.com/hadoop ▪ Hadoop performance on vSphere • vmware.com/files/pdf/VMW-Hadoop-Performance-vSphere5.pdf ▪ Hadoop High Availability solution • vmware.com/files/pdf/Apache-Hadoop-VMware-HA-solution.pdf
  • 42 Confidential Thank You!