Pivotal: Virtualize Big Data to Make the Elephant Dance

  • 1,116 views
Uploaded on

Big Data and virtualization are two of the hottest trends in the industry today, yet the full potential for bringing the two together has not been fully realized. In this session, learn how …

Big Data and virtualization are two of the hottest trends in the industry today, yet the full potential for bringing the two together has not been fully realized. In this session, learn how virtualization brings the advantages of greater elasticity, stronger isolation for multi-tenancy, and a click HA protection to Hadoop, while maintaining the comparable performance to Hadoop on physical machines.


Objective 1: Understand the benefits of virtualizing Hadoop.
After this session you will be able to:
Objective 2: Understand how to get started with Pivotal HD Hadoop .
Objective 3: Understand where to find more information.

More in: Technology , Business
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
1,116
On Slideshare
0
From Embeds
0
Number of Embeds
1

Actions

Shares
Downloads
46
Comments
0
Likes
3

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Virtualize Big Data to Make the Elephant Dance June Yang, Senior Director of Product Management, VMWare Dan Baskett, Senior Consultant Technologist, Pivotal © Copyright 2013 EMC Corporation. All rights reserved. 1
  • 2. Unstructured Data is exploding… Hadoop is driving growth Hadoop adoption is ramping Unstructured data driving growth Don't know Other 2% 2% Testing 2% Complex unstructured data forecasted to outpace structured relational data by 10x by 2020 Piloting 18% Inproduction 23% 2011 2012 2013 2014 2015 2016 Structured 2017 2018 Unstructured 2019 Evaluating 53% 2020 Source: Forrester Survey of 60 CIOs , September 2011 • Unstructured data explosion and Hadoop capabilities causing CIOs to reconsider Enterprise data strategy • • Gartner predicts +800% data growth over next 5 years Hadoop’s ability to process raw data at cost presents intriguing value prop for CIOs © Copyright 2013 EMC Corporation. All rights reserved. 2
  • 3. Broad Application of Hadoop Technology Use Cases Vertical Industries Log Processing / Click Stream Analytics Financial Services Machine Learning / sophisticated data mining Internet Retailer Web crawling / text processing Pharmaceutical / Drug Discovery Extract Transform Load (ETL) replacement Mobile / Telecom Image / XML message processing Scientific Research General archiving / compliance Social Media Hadoop is a platform that will revolutionize how Enterprises handle data © Copyright 2013 EMC Corporation. All rights reserved. 3
  • 4. The Big Data Journey in the Enterprise Integrated Stage 3: Cloud Analytics Platform • Serve many departments • Often part of mission critical workflow • Fully integrated with analytics/BI tools Stage 2: Hadoop Production • Serve a few departments • More use cases • Growing # and size of clusters • Core Hadoop + components Stage1: Hadoop Piloting • Often start with line of business • Try 1 or 2 use cases to explore the value of Hadoop 0 node © Copyright 2013 EMC Corporation. All rights reserved. 10’s 100’s Scale 4
  • 5. Deploy Hadoop Clusters in Minutes © Copyright 2013 EMC Corporation. All rights reserved. 5
  • 6. One click to scale out your cluster on the fly © Copyright 2013 EMC Corporation. All rights reserved. 6
  • 7. Customize your Hadoop/Hbase Cluster Customize with Cluster Specification File © Copyright 2013 EMC Corporation. All rights reserved. 7
  • 8. Cluster Spec File Details Storage configuration Choice of shared storage or Local disk High availability option # of Hadoop nodes Resource configuration © Copyright 2013 EMC Corporation. All rights reserved. Cluster Specification File "groups":[ { "name":"master", "roles":[ "hadoop_namenode", "hadoop_jobtracker”], "storage": { "type": "SHARED”, sizeGB": 20}, "instance_type":MEDIUM, "instance_num":1, "ha":true}, {"name":"worker", "roles":[ "hadoop_datanode", "hadoop_tasktracker" ], "instance_type":SMALL, "instance_num":5, "ha":false … 8
  • 9. Your Choice of Hadoop Distributions and Tools Distributions Community Projects • Flexibility to choose and try out major distributions • Support for multiple projects • Open architecture to welcome industry participation • Contributing Hadoop Virtualization Extensions (HVE) to open source community © Copyright 2013 EMC Corporation. All rights reserved. 9
  • 10. Proactive monitoring with VCOPs  Proactively monitoring through VCOPs  Gain comprehensive visibility  Eliminate manual processes with intelligent automation  Proactively manage operations  Alternatively, use monitoring tools like Nagios, Ganglia © Copyright 2013 EMC Corporation. All rights reserved. 10
  • 11. Beyond day 1 - Automation of Hadoop Cluster lifecycle management … Deploy Custo mize Scaling Tune config uration Load data Execut e jobs © Copyright 2013 EMC Corporation. All rights reserved. 11
  • 12. The Big Data Journey in the Enterprise Integrated Stage 2: Hadoop Production • Serve a few departments • More use cases • Growing # and size of clusters • Core Hadoop + components Stage1: Hadoop Piloting  Rapid deployment  On the fly cluster resizing  Choice of Hadoop distros  Automation of cluster lifecycle 0 node © Copyright 2013 EMC Corporation. All rights reserved. 10’s 100’s Scale 12
  • 13. Achieve HA for the Entire Hadoop Stack Zookeepr (Coordination) Pig (Data Flow) BI Reporting Hive (SQL) RDBMS Hive MetaDB HCatalog Hcatalog MDB MapReduce (Job Scheduling/Execution System) HBase (Key-Value store) HDFS (Hadoop Distributed File System) Jobtracker Namenode Management Server ETL Tools Server • vSphere HA is battle-tested high availability technology • Single mechanism to achieve HA for the entire Hadoop stack • One click to enable HA and/or FT © Copyright 2013 EMC Corporation. All rights reserved. 13
  • 14. Challenges of Running Hadoop in Enterprises Dept A: recommendation engine Production Production Test Log files Experimentation Transaction data Dept B: ad targeting Social data © Copyright 2013 EMC Corporation. All rights reserved. On the horizon… NoSQL Real time SQL … Test Experimentation Historical cust behavior Pain Points: 1. Cluster sprawling 2. Redundant common data in separate clusters 3. Difficult use the right tool for the right problem 4. Peak compute and I/O resource is limited to number of nodes in each independent cluster 14
  • 15. What if you can… Recommendation engine Ad targeting Production Production Test Experimentation Test Experimentation © Copyright 2013 EMC Corporation. All rights reserved. One physical platform to support multiple virtual big data clusters Experimentation Production recommendation engine Test/Dev Production Ad Targeting 15
  • 16. Bigger is Better  Hadoop is linearly scalable, more nodes, better performance, for the same job, it will take – 2 hour to complete on a 50 node cluster – 1 hour to complete on a 100 node cluster – 30 min to complete on a 200 node cluster © Copyright 2013 EMC Corporation. All rights reserved. 16
  • 17. You may ask  What about differentiated SLAs – –  For production Hadoop jobs, need to ensure high priority Lower priority of experimental Hadoop jobs. Will I have a noisy neighbor problems with shared infrastructure approach? © Copyright 2013 EMC Corporation. All rights reserved. 17
  • 18. VM Containers with Isolation are a Tried and Tested Approach Reckless Workload 2 Hungry Workload 1 Noisy Workload 3 VMware vSphere + Serengeti Host Host © Copyright 2013 EMC Corporation. All rights reserved. Host Host Host Host Host 18
  • 19. Shared infrastructure: Three big types of Isolation are Required  Resource Isolation • Control the greedy noisy neighbor • Reserve resources to meet needs  Version Isolation • Allow concurrent OS, App, Distro versions  Security Isolation • Provide privacy between users/groups • Runtime and data privacy required VMware vSphere + Serengeti Host Host © Copyright 2013 EMC Corporation. All rights reserved. Host Host Host Host Host 19
  • 20. With virtualization, you can have your cake and eat it too  One physical platform to support multiple virtual big data clusters Experimentation Compute layer Data layer Production recommendation engine Test/Dev Production Ad Targeting VMware vSphere + Serengeti – – Low Priority High Priority – – Share data to minimize copying Single infrastructure to maintain Bigger cluster for better performance Share hardware resource to achieve higher utilization  Virtualization ensures strong isolation between clusters. – – – – © Copyright 2013 EMC Corporation. All rights reserved. Resource isolation. Failure isolation Configure isolation Security isolation 20
  • 21. Elastic Hadoop with Virtualization VM Hadoop Node Combined Storage/Com pute Unmodified Hadoop node in a VM  VM lifecycle determined by Datanode  Limited elasticity © Copyright 2013 EMC Corporation. All rights reserved. VM VM T1 Compute VM Storage Separate Compute from Storage  Separate compute from data  Stateless compute  Elastic compute VM VM T2 Storage Separate Virtual Compute Clusters per tenant  Separate virtual compute  Compute cluster per tenant  Stronger VM-grade security and resource isolation 21
  • 22. Scale in/out Hadoop dynamically  Deploy separate compute clusters for different tenants sharing HDFS.  Commission/decommission task trackers according to priority and available resources Job Tracker Job Tracker Compute layer Compute VM Compute VM Dynamic resourcepool Experimentation Experimentation Compute VM Compute VM Compute VM Compute VM Compute VM Compute VM Production recommendation engine Production VMware vSphere + Serengeti Data layer © Copyright 2013 EMC Corporation. All rights reserved. 22
  • 23. The Big Data Journey in the Enterprise Integrated Stage 3: Cloud Analytics Platform • Serve many departments • Often part of mission critical workflow • Fully integrated with analytics/BI tools Stage 2: Hadoop Production  High Availability  Consolidation  Differentiated SLAs  Elastic Scaling Stage1: Hadoop Piloting  Rapid deployment  On the fly cluster resizing  Choice of Hadoop distros  Automation of cluster lifecycle 0 node © Copyright 2013 EMC Corporation. All rights reserved. 10’s 100’s Scale 23
  • 24. Business Intelligence Cloud Analytics Platform Machine Learning Real Time Streams CETAS Automated Models Stream Processing E T L Data Visualization … Real Time Structured Database Data Warehouse Unstructured and Batch Processing HDFS Compute © Copyright 2013 EMC Corporation. All rights reserved. Cloud Infrastructure Storage Networking 24
  • 25. Big Data Tools and Characteristics Framework Scale of data Scale of Cluster Computable Data? Local Disks? Map-reduce: 100s PB 10s to 1,000s Yes Yes, for cost, bandwidth and availability Big-SQL: PB’s 10s to 100s Some Yes, for cost and bandwidth No-SQL: Cassandra, hBase, … Trilions Of rows 10s to 100s Some Yes, for cost and availability In-Memory: Billions of rows 10s-100s Yes Primarily Memory Hadoop HawQ,, Aster Data, Impala, … Redis, Gemfire, Membase, … © Copyright 2013 EMC Corporation. All rights reserved. 25
  • 26. Choose a platform that… Allows user to pick the right tools at the right time Put resources where needed based on SLA policy © Copyright 2013 EMC Corporation. All rights reserved. 26
  • 27. In-house Hadoop as a Service – (Hadoop + Hadoop) Production ETL of log files Ad hoc data mining Compute layer Data layer Production recommendation engine HDFS HDFS VMware vSphere + Serengeti Host © Copyright 2013 EMC Corporation. All rights reserved. Host Host Host Host Host 27
  • 28. Integrated Big Data Production – (Mixed big data workloads) Hadoop batch analysis Compute layer Data layer HBase real-time queries HDFS NoSQL – Cassandra key-value store MPP DBMS – Analysis of structured data VMware vSphere + Serengeti Host © Copyright 2013 EMC Corporation. All rights reserved. Host Host Host Host Host 28
  • 29. Integrated Hadoop and Webapps – (Big Data + Other Workloads) Short-lived Hadoop compute cluster Compute layer Data layer Hadoop compute cluster Web servers for ecommerce site HDFS VMware vSphere + Serengeti Host © Copyright 2013 EMC Corporation. All rights reserved. Host Host Host Host Host 29
  • 30. The Big Data Journey in the Enterprise Stage 3: Cloud Analytics Platform  Mixed workloads  Right tool at the right time  Flexible and elastic infrastrure Integrated Stage 2: Hadoop Production  High Availability  Consolidation  Differentiated SLAs  Elastic Scaling Stage1: Hadoop Piloting  Rapid deployment  On the fly cluster resizing  Choice of Hadoop distros  Automation of cluster lifecycle 0 node © Copyright 2013 EMC Corporation. All rights reserved. 10’s 100’s Scale 30
  • 31. Learn More  Download and try Serengeti – projectserengeti.org • VMware Hadoop site – vmware.com/hadoop • Hadoop performance on vSphere white paper – http://www.vmware.com/files/pdf/techpaper /hadoop-vsphere51-32hosts.pdf • Hadoop virtualization extensions (HVE) Whitepaper – © Copyright 2013 EMC Corporation. All rights reserved. http://www.vmware.com/files/pdf/techpaper /hadoop-vsphere51-32hosts.pdf 31
  • 32. Thank You! June Yang Senior Director, VMware juneyang@vmware.com © Copyright 2013 EMC Corporation. All rights reserved. Dan Baskette Senior Consultant Technologist dan.baskette@emc.com 32
  • 33. Pivotal Sessions at EMC World Session Presenter Dates/Times The Pivotal Platform: A Purpose-Built Platform for Big-DataDriven Applications Josh Klahr Tue 5:30 - 6:30, Palazzo E Wed 11:30 - 12:30, Delfino 4005 Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action Noelle Sio Tue 10:00 - 11:00, Lando 4205 Thu 8:30 - 9:30, Palazzo F Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench Clinton Ooi Bhavin Modi Tue 11:30 - 12:30, Palazzo L Thu 10:00- 11:00 am, Delfino 4001A Pivotal: for Powerful Processing of Unstructured Data For Valuable Insights SK Krishnamurthy Mon 4:00 - 5:00, Lando 4201 A Tue 4:00 - 5:00, Palazzo M Pivotal: Big & Fast data – merging real-time data and deep analytics Michael Crutcher Mon 1:00 - 2:00, Lando 4201 A Wed 10:00 - 11:00, Palazzo M Pivotal: Virtualize Big Data to Make The Elephant Dance June Yang Dan Baskette Mon 11:30 - 12:30, Marcello 4401A Wed 4:00 - 5:00, Palazzo E Hadoop Design Patterns Don Miner Mon 2:30 - 3:30, Palazzo F Wed 8:30 - 9:30, Delfino 4005 © Copyright 2013 EMC Corporation. All rights reserved. 33