Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?

This session discusses the rationale behind the Greenplum Analytics Workbench initiative: its goals, current status, and the roadmap for this first-of-its-kind initiative. Enterprises learn how a Hadoop cloud can help unlock revenue opportunities from the data within the cluster.

Published in: Technology, Business

Transcript

  • 1. Greenplum Analytics Workbench. Apurva Desai. © Copyright 2012 EMC Corporation. All rights reserved.
  • 2. Overview
  • 3. What is Hadoop?
      What is Hadoop?
      – Distributed computing paradigm
      – File system: HDFS
      – Processing framework: MapReduce
      – Languages: Pig, Hive
      – Key-value store: HBase
      Why is it important?
      – Big Data is everywhere
      – Big Data is mostly unstructured
      – Need affordable, scalable NoSQL processing
  • 4. Analytics Workbench - Motivation
      Open source
      – Hadoop industry is nascent
      – Big Data development needs scale
      Greenplum
      – Innovation & experimentation platform
      – Contribute to the community
      – GPDB & GPHD: mixed-mode environment
  • 5. Greenplum Vision
  • 6. Buildout Prerequisites
      – Hardware systems integration
      – Hadoop experience
      – Program management
      – Partner ecosystem
      Greenplum has in-house expertise
  • 7. Team Introduction
      – System Integration: Greg, Eric, Don, Dave, Patrick
      – Program Management: Mike, Joe
      – Hadoop: Apurva, Judes, Clinton, Chandra, Ashwin
  • 8. Partners
      – Intel: 2,000 Westmere CPUs
      – Mellanox: 1,000+ NICs, 72 IB switches
      – Micron: 6,000 8GB DRAM
      – Seagate: 12,000 2TB drives
      – Supermicro: 1,000 chassis/MB
  • 9. Partners
      – Switch: hosting facilities
      – VMware: operational support, Rubicon
  • 10. Peek @ the Cluster
  • 11. Cluster Statistics
      Largest cluster for Apache Hadoop validation!
      – # of physical hosts: > 1,000 (> 10,000 with VMs)
      – # of racks: 54 (50 just for the DataNodes)
      – # of processors: > 24,000
      – Amount of RAM: > 48TB
      – Amount of disk capacity: > 24PB, "equivalent to nearly half of the entire written works of mankind from the beginning of recorded history"
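
The headline totals on slide 11 can be roughly cross-checked against the component counts on the Partners slide. The following is only a back-of-the-envelope sketch, not something from the deck: the function names are hypothetical, decimal units are used, the figure of 12 hardware threads per CPU is an assumption (6 cores with 2 threads each), and the replication factor of 3 is the HDFS default rather than a setting confirmed for this cluster.

```python
# Component counts quoted on the Partners slide.
DRIVES = 12_000        # Seagate drives
DRIVE_TB = 2           # capacity per drive, in TB
DIMMS = 6_000          # Micron DRAM parts
DIMM_GB = 8            # capacity per DRAM part, in GB
CPUS = 2_000           # Intel Westmere CPUs

# ASSUMPTION (not in the slides): 6 cores x 2 hardware threads per CPU.
THREADS_PER_CPU = 12


def raw_disk_pb(drives=DRIVES, drive_tb=DRIVE_TB):
    """Raw disk capacity in PB (decimal units: 1 PB = 1000 TB)."""
    return drives * drive_tb / 1000


def ram_tb(dimms=DIMMS, dimm_gb=DIMM_GB):
    """Total RAM in TB (decimal units: 1 TB = 1000 GB)."""
    return dimms * dimm_gb / 1000


def logical_processors(cpus=CPUS, threads_per_cpu=THREADS_PER_CPU):
    """Hardware threads across all CPUs, under the assumption above."""
    return cpus * threads_per_cpu


def usable_hdfs_pb(raw_pb, replication=3):
    """Capacity left for unique data if every block is stored `replication` times.

    ASSUMPTION: 3 is the HDFS default; the deck does not state the value used.
    """
    return raw_pb / replication


print(raw_disk_pb())            # 24.0 PB raw
print(ram_tb())                 # 48.0 TB
print(logical_processors())     # 24000
print(usable_hdfs_pb(24.0))     # 8.0 PB of unique data at 3x replication
```

Under these assumptions the component counts reproduce the 24PB, 48TB, and 24,000-processor figures as lower bounds, which matches the ">" signs on the slide.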
  • 12. Namenode
  • 13. Job Tracker
  • 14. CPU
  • 15. Use Cases
  • 16. Hadoop Review
  • 17. Hadoop Shuffle
  • 18. Initial Use Cases
      – Apache Hadoop validation
      – Mellanox UDA
      – TeraSort benchmark
  • 19. Apache Hadoop Validation
      Purpose
      – Run Apache Hadoop validation at scale
      – Validate cluster configuration
      Various configurations validated
      – Standard out-of-the-box configs
      – Configs modified for IO-intensive processing
  • 20. Apache Hadoop Preliminary Results
      [Chart: Apache Hadoop-1.0.0 validation; y-axis: Execution Time (Min); series: 1000 Nodes]
  • 21. Apache Hadoop Findings
      – Apache BigTop for integration tests
      – Functional validation passed as expected
      Next steps
      – Identify integration cases
      – Contribute back to BigTop
      – Stabilize Hadoop 0.23
  • 22. Mellanox UDA - Overview
      – RDMA in the Hadoop shuffle stage
      – Register map and reduce task buffers
      – Hadoop JT (JobTracker) for task completion
      – Copy sorted map task output to reduce input
      – Perform in-memory merge at reduce
      – Avoid disk spills for large inputs
      – Reduce CPU load for sort and merge
      – Greenplum + Mellanox collaboration: open-sourcing UDA
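
The "in-memory merge at reduce" step on slide 22 can be illustrated with a small k-way merge. This is not UDA code and involves no RDMA; it is a minimal sketch, with hypothetical function names and made-up sample data, of how a reducer might merge already-sorted map outputs held in memory and then reduce them without spilling to disk.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_map_outputs(sorted_runs):
    """Lazily merge already-sorted (key, value) runs into one sorted stream.

    Each run stands in for the sorted output of one map task.
    """
    return heapq.merge(*sorted_runs, key=itemgetter(0))


def reduce_counts(merged):
    """Group the merged stream by key and sum the values for each key."""
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(merged, key=itemgetter(0))
    }


# Illustrative sample data only: three sorted map outputs.
runs = [
    [("apple", 1), ("cherry", 2)],
    [("apple", 3), ("banana", 1)],
    [("banana", 4), ("cherry", 1)],
]

result = reduce_counts(merge_map_outputs(runs))
print(result)  # {'apple': 4, 'banana': 5, 'cherry': 3}
```

Because every run is already sorted, the merge only has to compare the head of each run, which is why keeping the runs in memory avoids a separate sort pass on the reduce side.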
  • 23. Mellanox UDA Preliminary Results
      – Preliminary UDA results provided by Mellanox
      – Show improvement with UDA vs. vanilla Hadoop: better CPU utilization, reduced execution time
      Next steps
      – Run on Analytics Workbench, scheduled for June 2012
      – Configuration on the workbench to turn it on/off
  • 24. TeraSort Benchmark Industry standard benchmark Good validation of configuration 3 Steps – Teragen – Generate 1TB of data – Terasort – Sort generated data – Teravalidate – Validate the sort Measure time for each step© Copyright 2012 EMC Corporation. All rights reserved. 24
  • 25. TeraSort Benchmark Preliminary Results
      [Chart: Apache Hadoop-1.0.0 validation - TeraSort; y-axis: Execution Time in Sec; x-axis: # of TB Generated and Sorted (1 TB, 10 TB); series: TeraGen, TeraSort]
  • 26. TeraSort Benchmark Findings
      – Minimal tuning of configuration
      – Results are within expected range
      Next steps
      – Tune the cluster for optimal performance
      – Use the benchmark for every new release
  • 27. Lessons Learnt
  • 28. Buildout Progress
      [Chart: y-axis: Number of nodes; x-axis: Month (Dec 11 to April 12); series: racked, ready]
  • 29. "Real" Hadoop Cluster
  • 30. Categories
      – Racking & stacking
      – Networking
      – Non-Hadoop hosts
      – Base OS setup
      – Hadoop deployment
      – Post-deployment
      – Process
  • 31. In Closing
  • 32. Upcoming Work
      Workbench tasks
      – Load various data sets
      – Load GPDB, Hive, HBase, ZooKeeper, etc.
      – Load Chorus, Command Center, UAP stack
      – VM provisioning
      – Various audits
      On-boarding candidates
      – HD education
      – Apache Hadoop build & validate
      – Mellanox UDA
      – Intel HiBench
      – Big Data benchmarking
      – High-resolution image processing, etc.
  • 33. A day in the life @ Switch
  • 34. Q&A
  • 35. Other Relevant Greenplum Sessions (Session; Presenter; Times)
      – Unified Analytics Platform Introduction; Brian Wilson; Tues 10:00-11:00, Thurs 1:00-2:00
      – Greenplum Database Overview; Michael Crutcher; Mon 8:30-9:30, Wed 10:00-11:00
      – Greenplum Hadoop Overview; Susheel Kaushik; Mon 10:00-11:00, Wed 4:15-5:15
      – Greenplum DCA Overview; Hanxi Chen; Mon 4:00-5:00, Thurs 10:00-11:00
      – Greenplum Analytics Workbench; Apurva Desai; Wed 8:30-9:30, Thurs 10:00-11:00
      – Analytics on Hadoop; Don Miner; Tues 11:30-12:30, Thurs 8:30-9:30
      – Optimizing Greenplum Database on VMware Virtualized Infrastructure; Kevin O’Leary; Mon 4:00-5:00, Tues 4:15-5:15
      – Big Data Driven Businesses in Action: Creating Real Business Value Using Greenplum UAP (Panel w/4 Customers); Mike Maxey; Wed 4:15-5:15, Thurs 11:30-12:30
      – Analytics for Business Value: Collaboration; Josh Klahr; Mon 10:00-11:00, Wed 2:45-3:45
      – Disruptive Data Science: How Data Science and Big Data are Transforming Business, IT and People; Annika Jimenez, David Dietrich; Tues 4:15-5:15, Thurs 11:30-12:30
