Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?


This session discusses the rationale behind the Greenplum Analytics Workbench initiative: its goals, current status, and the roadmap for this first-of-a-kind initiative. Enterprises learn how a Hadoop cloud can help unlock revenue opportunities from the data within the cluster.


Transcript

  • 1. Greenplum Analytics Workbench, Apurva Desai. © Copyright 2012 EMC Corporation. All rights reserved.
  • 2. Overview
  • 3. What is Hadoop?
    What is Hadoop?
      – Distributed computing paradigm
      – File system: HDFS
      – Processing framework: MapReduce
      – Languages: Pig, Hive
      – Key-value store: HBase
    Why is it important?
      – BIG Data is everywhere
      – BIG Data is mostly unstructured
      – Need affordable, scalable NoSQL processing
  • 4. Analytics Workbench - Motivation
    Open source
      – Hadoop industry is nascent
      – BIG Data development needs scale
    Greenplum
      – Innovation & experimentation platform
      – Contribute to the community
      – GPDB & GPHD: mixed-mode environment
  • 5. Greenplum Vision
  • 6. Buildout Pre-requisites
      – Hardware systems integration
      – Hadoop experience
      – Program management
      – Partner ecosystem
    Greenplum has in-house expertise
  • 7. Team Introduction
      – System Integration: Greg, Eric, Don, Dave, Patrick
      – Program Management: Mike, Joe
      – Hadoop: Apurva, Judes, Clinton, Chandra, Ashwin
  • 8. Partners
      – Intel: 2,000 Westmere CPUs
      – Mellanox: 1,000+ NICs, 72 IB switches
      – Micron: 6,000 8GB DRAM
      – Seagate: 12,000 2TB drives
      – Supermicro: 1,000 chassis/MB
  • 9. Partners
      – Switch: Hosting facilities
      – VMware: Operational support, Rubicon
  • 10. Peek @ the Cluster
  • 11. Cluster Statistics
    Largest cluster for Apache Hadoop validation!
      – # of physical hosts: > 1,000 (> 10,000 with VMs)
      – # of racks: 54 (50 just for the DataNodes)
      – # of processors: > 24,000
      – Amount of RAM: > 48TB
      – Amount of disk capacity: > 24PB
        “Equivalent to nearly half of the entire written works of mankind from the beginning of recorded history”
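    The headline figures on this slide can be roughly cross-checked against the component counts on the Partners slide. The sketch below is only a sanity check, not an official sizing: it assumes decimal units (1 TB = 1,000 GB, 1 PB = 1,000 TB) and, for the processor count, assumes each Westmere CPU contributes 12 hardware threads (6 cores with Hyper-Threading). Neither assumption is stated in the slides.

```python
# Component counts as listed on the Partners slide.
DRAM_MODULES = 6_000
DRAM_GB_EACH = 8
DRIVES = 12_000
DRIVE_TB_EACH = 2
CPUS = 2_000

# ASSUMPTION (not in the slides): 12 hardware threads per Westmere CPU,
# i.e. 6 cores with Hyper-Threading enabled.
THREADS_PER_CPU = 12


def cluster_totals():
    """Return aggregate RAM (TB), disk (PB) and processor count.

    Uses decimal units (ASSUMPTION): 1 TB = 1,000 GB and 1 PB = 1,000 TB.
    """
    ram_tb = DRAM_MODULES * DRAM_GB_EACH / 1_000
    disk_pb = DRIVES * DRIVE_TB_EACH / 1_000
    processors = CPUS * THREADS_PER_CPU
    return {"ram_tb": ram_tb, "disk_pb": disk_pb, "processors": processors}


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value}")
```

    Under these assumptions the totals come out at 48 TB of RAM, 24 PB of raw disk and 24,000 processors, which lines up with the lower bounds quoted on the slide.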
  • 12. Namenode
  • 13. Job Tracker
  • 14. CPU
  • 15. Use Cases
  • 16. Hadoop Review
  • 17. Hadoop Shuffle
  • 18. Initial Use Cases
      – Apache Hadoop validation
      – Mellanox UDA
      – TeraSort benchmark
  • 19. Apache Hadoop Validation
    Purpose
      – Run Apache Hadoop validation at scale
      – Validate cluster configuration
    Various configurations validated
      – Standard out-of-the-box configs
      – Configs modified for IO-intensive processing
  • 20. Apache Hadoop Preliminary Results
    [Chart: Apache Hadoop-1.0.0 validation; y-axis: Execution Time (Min); series: 1000 Nodes]
  • 21. Apache Hadoop Findings
      – Apache BigTop for integration tests
      – Functional validation passed as expected
    Next Steps
      – Identify integration cases
      – Contribute back to BigTop
      – Stabilize Hadoop 0.23
  • 22. Mellanox UDA - Overview
      – RDMA in Hadoop Shuffle stage
      – Register Map & Reduce task buffer
      – Hadoop JT for task completion
      – Copy sorted map task output → reduce input
      – Perform in-memory merge at reduce
      – Avoid disk spills for large inputs
      – Reduce CPU load for sort & merge
      – GP + Mellanox collaboration: open sourcing UDA
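    The "in-memory merge at reduce" idea can be illustrated with a small Python sketch. This is not UDA's implementation and involves no RDMA: it only shows the general technique of merging already-sorted map outputs as a stream, without re-sorting or spilling to disk, and then reducing per key. The function names `merge_map_outputs` and `reduce_sum` are hypothetical, chosen for illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_map_outputs(sorted_runs):
    """K-way merge of key-sorted (key, value) runs, entirely in memory.

    Each run stands in for the sorted output of one map task. heapq.merge
    consumes the runs lazily, so nothing is re-sorted from scratch.
    """
    return heapq.merge(*sorted_runs, key=itemgetter(0))


def reduce_sum(merged):
    """Sum the values of each key over a key-sorted stream."""
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(merged, key=itemgetter(0))
    ]


if __name__ == "__main__":
    # Three hypothetical map task outputs, each already sorted by key.
    runs = [
        [("a", 1), ("c", 2)],
        [("a", 3), ("b", 1)],
        [("b", 4), ("c", 1)],
    ]
    print(reduce_sum(merge_map_outputs(runs)))
```

    Because every run arrives sorted, the reduce side only needs a heap with one entry per run, which is why a merge of this kind can lower CPU load compared with a full sort.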
  • 23. Mellanox UDA Preliminary Results
    Preliminary UDA results provided by Mellanox show improvement with UDA vs. vanilla Hadoop:
      – Better CPU utilization
      – Reduced execution time
    Next Steps
      – Run on the Analytics Workbench, scheduled for June 2012
      – Configuration on the workbench to turn it on/off
  • 24. TeraSort Benchmark Industry standard benchmark Good validation of configuration 3 Steps – Teragen – Generate 1TB of data – Terasort – Sort generated data – Teravalidate – Validate the sort Measure time for each step© Copyright 2012 EMC Corporation. All rights reserved. 24
  • 25. TeraSort Benchmark Preliminary Results
    [Chart: Apache Hadoop-1.0.0 validation - TeraSort; y-axis: Execution Time in Sec; x-axis: # of TB Generated and Sorted (1 TB, 10 TB); series: TeraGen, TeraSort]
  • 26. TeraSort Benchmark Findings
      – Minimal tuning of configuration
      – Results are within expected range
    Next Steps
      – Tune the cluster for optimal performance
      – Use the benchmark for every new release
  • 27. Lessons Learnt
  • 28. Buildout Progress
    [Chart: y-axis: Number of nodes (0 to 1200); x-axis: Month (Dec 11 to April 12); series: racked, ready]
  • 29. “Real” Hadoop Cluster
  • 30. Categories
      – Racking & Stacking
      – Networking
      – Non-Hadoop Hosts
      – Base OS Setup
      – Hadoop Deployment
      – Post-deployment
      – Process
  • 31. In Closing
  • 32. Upcoming work
    Workbench Tasks
      – Load various data sets
      – Load GPDB, Hive, HBase, ZooKeeper, etc.
      – Load Chorus, Command Center, UAP stack
      – VM provisioning
      – Various audits
    On-boarding candidates
      – HD Education
      – Apache Hadoop Build & Validate
      – Mellanox UDA
      – Intel HiBench – Big data benchmarking
      – Hi-resolution image processing, etc.
  • 33. A day in the life @ Switch
  • 34. Q&A
  • 35. Other Relevant Greenplum Sessions (Session | Presenter | Times)
      – Unified Analytics Platform Introduction | Brian Wilson | Tues 10:00-11:00, Thurs 1:00-2:00
      – Greenplum Database Overview | Michael Crutcher | Mon 8:30-9:30, Wed 10:00-11:00
      – Greenplum Hadoop Overview | Susheel Kaushik | Mon 10:00-11:00, Wed 4:15-5:15
      – Greenplum DCA Overview | Hanxi Chen | Mon 4:00-5:00, Thurs 10:00-11:00
      – Greenplum Analytics Workbench | Apurva Desai | Wed 8:30-9:30, Thurs 10:00-11:00
      – Analytics on Hadoop | Don Miner | Tues 11:30-12:30, Thurs 8:30-9:30
      – Optimizing Greenplum Database on VMware Virtualized Infrastructure | Kevin O’Leary | Mon 4:00-5:00, Tues 4:15-5:15
      – Big Data Driven Businesses in Action: Creating Real Business Value Using Greenplum UAP (Panel w/4 Customers) | Mike Maxey | Wed 4:15-5:15, Thurs 11:30-12:30
      – Analytics for Business Value: Collaboration | Josh Klahr | Mon 10:00-11:00, Wed 2:45-3:45
      – Disruptive Data Science: How Data Science and Big Data are Transforming Business, IT and People | Annika Jimenez, David Dietrich | Tues 4:15-5:15, Thurs 11:30-12:30