Hadoop on OpenStack - Sahara @DevNation 2014

Data analysis is hard enough, don't get bogged down managing Hadoop...

Published in: Software, Technology


Transcript

  • 1. Big data processing with Hadoop on OpenStack Matthew Farrellee (@spinningmatt) Red Hat
  • 2. Here for a talk about Savanna? Oops, this talk is about Sahara. The good news is they’re the same thing: Savanna was renamed to Sahara for trademark reasons. You have to go to page 10 of the Google results to find out why: https://www.google.com/search?q=savanna+hadoop&start=90
  • 3. In brief - what is Hadoop ● Narrow - Apache Hadoop - a specific Apache project originally from Yahoo!, based on papers published by Google ● Broad - an ecosystem of projects, mostly Apache, that integrate in some way with Apache Hadoop ● The broad definition is the more common one
  • 4. Hadoop from Hortonworks (+ others) ● Multiple projects ○ Workload management ○ Resource management ○ System management ○ Data ingest & storage ○ Compute frameworks ○ Domain languages ● Data storage and processing focused
  • 5. In brief - what is OpenStack OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.
  • 6. An ecosystem of projects ● Compute - Nova ● Networking - Neutron ● Object Storage - Swift ● Block Storage - Cinder ● Identity - Keystone ● Image Service - Glance ● Dashboard - Horizon ● Telemetry - Ceilometer ● Orchestration - Heat ● Data Processing - Sahara
  • 7. Longer comments on big data Choose your own adventure… ● Go to the next slide and get the day over sooner ● See some shoegazing followed by a rant and have the day last longer
  • 8. Interest (via Google Trends) [Chart: search interest in Hadoop, EC2, and OpenStack] www.google.com/trends/explore#q=hadoop,ec2,openstack
  • 9. Interest (via Google Trends) [Chart: search interest in Hadoop, EC2, and OpenStack, annotated with the EC2 beta on Aug 25 2006] www.google.com/trends/explore#q=hadoop,ec2,openstack (EC2 beta: http://aws.typepad.com/aws/2006/08/amazon_ec2_beta.html)
  • 10. Data analysis is hard
  • 11. Analysis - have a question ● Even this alone is hard to come up with ● The question you answer won’t be the question you set out to ask ● You’ll have to iterate and refine ● Example: Can I predict doctor specialty from what procedures they perform?
  • 12. Analysis - finding the data ● Publicly - ○ Tons of data repositories ○ No consistency, even within a specific repository ● Privately - ○ Data often hidden in silos ○ Even less consistency ● Avoid datasets that don’t come with a dictionary ○ Data w/o a dictionary is like code w/o comments
  • 13. Analysis - acceptable use ● Publicly - ○ Data sets often have associated licenses ○ Yes, even public (government) sets ○ You may have to find an alternative set ● Privately - ○ Often tightly controlled, considered sensitive business data ○ If you can use it, maybe only in a specific place ○ Likely no alternatives
  • 14. Analysis - explore / clean the data ● The story of Stephen Glasser and Cheryl Palma ● Two of the oldest people in the medical profession working with Medicare ● Stephen Glasser graduated in 1773 ● Cheryl Palma graduated in 1776
  • 15. Analysis - finally ● You got some answer to a question you approximately asked ● You must refine the question and process ● Repeat ● This is hard enough without having to manage tools and infrastructure!
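The graduation years on slide 14 are the kind of record an exploration pass is meant to surface. A minimal sketch of such a check follows, assuming a hypothetical record layout (`name`, `graduation_year`) and an arbitrary 80-year career cutoff; neither comes from the talk, and the third row is made up for contrast.

```python
from datetime import date

# Hypothetical records. The first two names and years are the ones quoted
# on the slide; the third row is invented purely for contrast.
providers = [
    {"name": "Stephen Glasser", "graduation_year": 1773},
    {"name": "Cheryl Palma", "graduation_year": 1776},
    {"name": "Example Provider", "graduation_year": 1998},
]


def implausible_graduation_years(records, max_career_years=80, today=None):
    """Return records whose graduation year is in the future or too far back.

    max_career_years is an assumed cutoff, not a rule from the talk.
    """
    current_year = (today or date.today()).year
    earliest = current_year - max_career_years
    return [
        r for r in records
        if not earliest <= r["graduation_year"] <= current_year
    ]


suspects = implausible_graduation_years(providers)
for r in suspects:
    print(r["name"], r["graduation_year"])
```

A check like this only flags rows for review; whether to drop, correct, or keep them still depends on the data dictionary.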
  • 16. Sahara’s goal Make managing Hadoop+ infrastructure and tools so simple that doing so never gets in your way
  • 17. Sahara is ● An OpenStack project in the Data Processing program ● Started one year ago (Summit in Portland) ● Incubated in Icehouse (6 months ago) ● Integrated for Juno (6 months from now)
  • 18. Sahara’s architecture [Diagram. Labels shown: Horizon (Sahara Pages); Sahara Python Client; REST API; Keystone; Auth; Data Access Layer; DB; Cluster Configuration Manager; Resources Orchestration Manager; Vendors Plugins; Job Manager; Job Sources; Data Sources; Swift; Heat; Nova; Glance; Cinder; Neutron; Trove; Hadoop VM (×4); Sahara Service]
  • 19. Sahara’s plugin architecture ● This is important! ● It’s where Hadoop distribution vendors integrate their management software ● It’s how users pick different software versions ● Currently: Vanilla (reference impl. w/ Apache versions), HDP (via Ambari), and IDH (via Intel Manager); CDH and Spark are under review
  • 20. Sahara lets you ● Create and manage clusters ● Define and run analysis jobs ● All through a programmatic interface ● Or a web console
  • 21. Sahara’s REST API
  • 22. API v1 (Cluster operations) ● http://bit.ly/1hRXrVX ● Plugins ○ list - comes from configuration ○ get - provides capabilities of a plugin, e.g. services ● Images ○ register - provide basic metadata, username - going away w/ heat ○ tag/untag - associate image w/ a plugin
  • 23. API v1 (Cluster operations) (cont) ● http://bit.ly/1hRXrVX ● Templates ○ node groups ○ clusters ● Clusters ○ Instances of templates
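The relationship between node group templates, cluster templates, and clusters on slides 22 and 23 can be sketched as plain data classes. This is an illustration of the idea only: the class and field names are hypothetical and are not Sahara's actual resource schema, and the plugin, flavor, and image names are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeGroupTemplate:
    """Describes one kind of node (hypothetical fields)."""
    name: str
    flavor: str
    node_processes: List[str]


@dataclass
class ClusterTemplate:
    """Combines node group templates with instance counts (hypothetical fields)."""
    name: str
    plugin: str
    version: str
    node_groups: Dict[str, int]  # node group template name -> instance count


@dataclass
class Cluster:
    """A cluster is an instance of a cluster template."""
    name: str
    template: ClusterTemplate
    image: str  # an image registered and tagged for the plugin

    def instance_count(self) -> int:
        return sum(self.template.node_groups.values())


master = NodeGroupTemplate("master", "m1.medium", ["namenode", "jobtracker"])
worker = NodeGroupTemplate("worker", "m1.small", ["datanode", "tasktracker"])
template = ClusterTemplate(
    "small-hadoop", "vanilla", "1.2.1",
    node_groups={master.name: 1, worker.name: 3},
)
cluster = Cluster("analysis-1", template, image="hadoop-image")
print(cluster.instance_count())
```

Keeping the templates separate from the cluster is what lets the same definition be launched repeatedly, for example once per analysis run.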
  • 24. API v1.1 (Elastic Data Processing) ● http://bit.ly/1kXGjGj ● Data Source ○ Input and output locations (Swift/HDFS urls) ● Job Binaries ○ Often JARs or scripts stored in Swift or ... ● Jobs ○ Templates for a job with missing parameters ● Job executions ○ Instances of templates with parameters provided
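Slide 24 describes a job as a template with missing parameters and a job execution as that template with the parameters filled in. A minimal sketch of that split follows; the class names, fields, and URLs are hypothetical placeholders, not the v1.1 request format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    """A reusable job definition: type and binaries, but no inputs yet."""
    name: str
    job_type: str
    binaries: List[str]  # e.g. JARs or scripts stored in Swift


@dataclass
class JobExecution:
    """One run of a Job with the missing parameters supplied."""
    job: Job
    cluster: str
    input_url: str
    output_url: str
    configs: Dict[str, str] = field(default_factory=dict)

    def describe(self) -> str:
        return "{} on {}: {} -> {}".format(
            self.job.name, self.cluster, self.input_url, self.output_url
        )


# Placeholder names and URLs, for illustration only.
job = Job("transaction-analysis", "Pig", ["swift://binaries/analysis.pig"])
monday = JobExecution(job, "analysis-1", "swift://data/monday", "swift://out/monday")
tuesday = JobExecution(job, "analysis-1", "swift://data/tuesday", "swift://out/tuesday")
print(monday.describe())
print(tuesday.describe())
```

The same job backs both executions; only the data sources differ between runs.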
  • 25. API v2 (future) ● Consistent, stable, and clean evolution of v1 & v1.1 ○ Image handling in v1 wasn’t RESTful ○ Reduce use of internally stored binaries ○ Jobs & job executions weren’t RESTful ○ Resource naming wasn’t consistent (clusters vs. job-executions & cluster-templates vs. jobs) ○ Prune unused operations, e.g. status-refresh ○ Align resource lifecycle, e.g. terminate = stop & delete vs. terminate = stop
  • 26. Sahara’s Plugin API
  • 27. Sahara’s Plugin API ● http://bit.ly/1h4MiAW ● get_versions ● get_configs(version) ● get_node_processes(version) ● get_required_image_tags(version) ● validate(cluster) ● configure_cluster(cluster) ● start_cluster(cluster) ● scale_cluster(cluster) ● ...
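To make the shape of the plugin interface concrete, here is a sketch of an abstract base class using the method names listed on slide 27. Only the method names come from the slide; the base class name, the return values, the dictionary used for `cluster`, and the call order in `provision` are assumptions for illustration, not Sahara's implementation.

```python
from abc import ABC, abstractmethod


class ProvisioningPlugin(ABC):
    """Hypothetical base class; only the method names come from the slide."""

    @abstractmethod
    def get_versions(self): ...

    @abstractmethod
    def get_configs(self, version): ...

    @abstractmethod
    def get_node_processes(self, version): ...

    @abstractmethod
    def get_required_image_tags(self, version): ...

    @abstractmethod
    def validate(self, cluster): ...

    @abstractmethod
    def configure_cluster(self, cluster): ...

    @abstractmethod
    def start_cluster(self, cluster): ...

    @abstractmethod
    def scale_cluster(self, cluster): ...


class ToyPlugin(ProvisioningPlugin):
    """Made-up plugin that records calls instead of touching real VMs."""

    def __init__(self):
        self.calls = []

    def get_versions(self):
        return ["1.0"]

    def get_configs(self, version):
        return {}

    def get_node_processes(self, version):
        return {"toy": ["master", "worker"]}

    def get_required_image_tags(self, version):
        return ["toy", version]

    def validate(self, cluster):
        if cluster["version"] not in self.get_versions():
            raise ValueError("unsupported version: " + cluster["version"])
        self.calls.append("validate")

    def configure_cluster(self, cluster):
        self.calls.append("configure_cluster")

    def start_cluster(self, cluster):
        self.calls.append("start_cluster")

    def scale_cluster(self, cluster):
        self.calls.append("scale_cluster")


def provision(plugin, cluster):
    """Assumed call order: validate, then configure, then start."""
    plugin.validate(cluster)
    plugin.configure_cluster(cluster)
    plugin.start_cluster(cluster)


plugin = ToyPlugin()
provision(plugin, {"name": "demo", "version": "1.0"})
print(plugin.calls)
```

The point of the split is that the service can drive any vendor's distribution through the same handful of calls, while each vendor decides what those calls do.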
  • 28. Roadmap ● I mentioned a couple things, but this is a community project ● The Icehouse release is tomorrow ● Design summit, where developers & users & business get together to define the roadmap, is May 13-16 in Atlanta
  • 29. Demo with bigpetstore ● http://jayunit100.github.io/bigpetstore/slides ● Bigpetstore (by @jayunit100) ○ A full stack hadoop application ○ Uses the main players in the hadoop ecosystem ○ To demonstrate a single domain ○ Just accepted into the Bigtop project!
  • 30. Demo with bigpetstore...live (cont) We’re going to perform petstore transaction analysis - 1. Generate data from a model 2. Transform data for processing 3. Process w/ pig or mahout, we’ll do pig 4. Visualize results in web app
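The four demo steps can be mimicked in a few lines of plain Python to show how the stages hand data to each other. This is a stand-in only: it is not bigpetstore code, it replaces Pig with an in-memory count, and the states, products, and record layout are invented for illustration.

```python
import random
from collections import Counter

# Invented values; bigpetstore's real data model is richer than this.
PRODUCTS = ["dog food", "cat litter", "fish flakes"]
STATES = ["AZ", "CO", "OK"]


def generate(n, seed=0):
    """Step 1: generate raw transaction lines from a (trivial) model."""
    rng = random.Random(seed)
    return [
        "{},{},{}".format(i, rng.choice(STATES), rng.choice(PRODUCTS))
        for i in range(n)
    ]


def transform(lines):
    """Step 2: turn raw lines into structured records."""
    for line in lines:
        txn_id, state, product = line.split(",")
        yield {"id": int(txn_id), "state": state, "product": product}


def process(records):
    """Step 3: count purchases per (state, product); a stand-in for the Pig job."""
    return Counter((r["state"], r["product"]) for r in records)


def visualize(counts):
    """Step 4: print a text bar chart in place of the web app."""
    for (state, product), n in sorted(counts.items()):
        print("{:2} {:12} {}".format(state, product, "#" * n))


counts = process(transform(generate(30)))
visualize(counts)
```

In the demo itself each stage runs as a job on the cluster, with the data held in cluster storage between steps rather than in memory.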
  • 31. Demo video... https://www.youtube.com/watch?v=vmry_kXqn4c