Hadoop on OpenStack - Sahara @DevNation 2014

Big data processing with
Hadoop on OpenStack
Matthew Farrellee
(@spinningmatt)
Red Hat

Here for a talk about Savanna?
Oops, this talk is about Sahara.
Good news is they’re the same thing.
Savanna was renamed for trademark reasons to Sahara.
You have to go to page 10 of the Google results to find out why:
https://www.google.com/search?q=savanna+hadoop&start=90

In brief - what is Hadoop
● Narrow - Apache Hadoop - a specific
Apache project originally from Yahoo!, based
on papers published by Google
● Broad - an ecosystem of projects, mostly
Apache, that integrate in some way with
Apache Hadoop
● Most common to use the broad definition

Hadoop from Hortonworks (+ others)
● Multiple projects
○ Workload management
○ Resource management
○ System management
○ Data ingest & storage
○ Compute frameworks
○ Domain languages
● Data storage and
processing focused

In brief - what is OpenStack
OpenStack is a cloud operating system that
controls large pools of compute, storage, and
networking resources throughout a datacenter,
all managed through a dashboard that gives
administrators control while empowering their
users to provision resources through a web
interface.

An ecosystem of projects
● Compute - Nova
● Networking - Neutron
● Object Storage - Swift
● Block Storage - Cinder
● Identity - Keystone
● Image Service - Glance
● Dashboard - Horizon
● Telemetry - Ceilometer
● Orchestration - Heat
● Data Processing - Sahara

Longer comments on big data
Choose your own adventure…
Go to the next slide and get the day over
sooner
See some shoegazing followed by a rant and
have the day last longer

Interest (via Google Trends)
Hadoop
EC2
OpenStack
www.google.com/trends/explore#q=hadoop,ec2,openstack
EC2 beta Aug 25 2006 (http://aws.typepad.com/aws/2006/08/amazon_ec2_beta.html)

Analysis - have a question
● Even this alone is hard to come up with
● The question you answer won’t be the
question you set out to ask
● You’ll have to iterate and refine
Can I predict doctor specialty from what
procedures they perform?

Analysis - finding the data
● Publicly -
○ Tons of data repositories
○ No consistency, even within a specific repository
● Privately -
○ Data often hidden in silos
○ Even less consistency
● Avoid datasets that don’t come with a
dictionary
○ Data w/o a dictionary is like code w/o comments

Analysis - acceptable use
● Publicly -
○ Data sets often have associated licenses
○ Yes, even public (government) sets
○ You may have to find an alternative set
● Privately -
○ Often tightly controlled, considered sensitive
business data
○ If you can use it, maybe only in a specific place
○ Likely no alternatives

Analysis - explore / clean the data
● The story of Stephen Glasser and Cheryl
Palma
● Two of the oldest people in the medical
profession working with Medicare
● Stephen Glasser graduated in 1773
● Cheryl Palma graduated in 1776
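
The graduation years above are the kind of thing a quick sanity check catches during exploration. A minimal sketch, assuming each record is a dict with hypothetical "name" and "graduation_year" fields and that an 80-year career is a generous upper bound (both are assumptions, not part of the dataset's dictionary):

```python
from datetime import date

def implausible_graduations(records, max_career_years=80, today=None):
    """Return records whose graduation year is in the future or too far in the past."""
    today = today or date.today()
    earliest = today.year - max_career_years
    return [r for r in records
            if not earliest <= r["graduation_year"] <= today.year]

providers = [
    {"name": "Stephen Glasser", "graduation_year": 1773},
    {"name": "Cheryl Palma", "graduation_year": 1776},
    {"name": "Example Provider", "graduation_year": 1995},  # hypothetical record
]

# Pin "today" so the result does not depend on when the check runs.
flagged = implausible_graduations(providers, today=date(2014, 4, 16))
for record in flagged:
    print(record["name"], record["graduation_year"])
```

Flagged rows still need a human decision: fix, drop, or go back to whoever owns the data.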

Analysis - finally
● You got some answer to a question you
approximately asked
● You must refine the question and process
● Repeat
This is hard enough without having to manage
tools and infrastructure!

Sahara’s goal
Make managing Hadoop+ infrastructure and
tools so simple that doing so never gets in your
way

Sahara is
● An OpenStack project in the Data
Processing program
● Started one year ago (Summit in Portland)
● Incubated in Icehouse (6 months ago)
● Integrated for Juno (6 months from now)

Sahara’s architecture
[Architecture diagram. Labels recovered from the diagram: Horizon, Sahara Pages, Sahara Python Client, REST API, Auth, Keystone, Cluster Configuration Manager, Resources Orchestration Manager, Job Manager, Job Sources, Data Sources, Data Access Layer, Vendors Plugins, Swift, Heat, Nova, Glance, Cinder, Neutron, Trove DB, Sahara Service, and four Hadoop VMs]

Sahara’s plugin architecture
● This is important!
● It’s where Hadoop distribution vendors
integrate their management software
● It’s how users pick different software
versions
● Currently: Vanilla (reference impl. w/ Apache
versions), HDP (via Ambari), IDH (via Intel
Manager) and under review CDH and Spark

Sahara lets you
● Create and manage clusters
● Define and run analysis jobs
● All through a programmatic interface
● Or a web console

API v1 (Cluster operations)
● http://bit.ly/1hRXrVX
● Plugins
○ list - comes from configuration
○ get - provides capabilities of a plugin, e.g. services
● Images
○ register - provide basic metadata, username - going
away w/ Heat
○ tag/untag - associate image w/ a plugin

API v1 (Cluster operations) (cont)
● http://bit.ly/1hRXrVX
● Templates
○ node groups
○ clusters
● Clusters
○ Instances of templates
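
The template-then-instance relationship can be sketched as plain request bodies. This is only an illustration: the field names are assumptions modeled on the v1 API and should be checked against the reference linked above, the process names and version are example values, and every ID is a placeholder. Nothing here talks to a real service.

```python
import json

def node_group_template(name, flavor_id, node_processes,
                        plugin_name="vanilla", hadoop_version="1.2.1"):
    """One kind of node: its size and the processes it runs (assumed fields)."""
    return {
        "name": name,
        "plugin_name": plugin_name,
        "hadoop_version": hadoop_version,
        "flavor_id": flavor_id,
        "node_processes": list(node_processes),
    }

def cluster_template(name, groups, plugin_name="vanilla", hadoop_version="1.2.1"):
    """Node groups plus counts. `groups` maps group name -> (template id, count)."""
    return {
        "name": name,
        "plugin_name": plugin_name,
        "hadoop_version": hadoop_version,
        "node_groups": [
            {"name": group, "node_group_template_id": template_id, "count": count}
            for group, (template_id, count) in groups.items()
        ],
    }

def cluster(name, cluster_template_id, image_id,
            plugin_name="vanilla", hadoop_version="1.2.1"):
    """A cluster is an instance of a cluster template on a registered image."""
    return {
        "name": name,
        "plugin_name": plugin_name,
        "hadoop_version": hadoop_version,
        "cluster_template_id": cluster_template_id,
        "default_image_id": image_id,
    }

# Placeholder values throughout; a real service would hand back the IDs.
master = node_group_template("master", "2", ["namenode", "jobtracker"])
worker = node_group_template("worker", "2", ["datanode", "tasktracker"])
template = cluster_template("small", {
    "master": ("MASTER-TEMPLATE-ID", 1),
    "worker": ("WORKER-TEMPLATE-ID", 3),
})
request = cluster("demo", "CLUSTER-TEMPLATE-ID", "IMAGE-ID")
print(json.dumps(request, indent=2))
```

The point of the split is reuse: the same cluster template can back many short-lived clusters.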

API v1.1 (Elastic Data Processing)
● http://bit.ly/1kXGjGj
● Data Source
○ Input and output locations (Swift/HDFS urls)
● Job Binaries
○ Often JARs or scripts stored in Swift or ...
● Jobs
○ Templates for a job with missing parameters
● Job executions
○ Instances of templates with parameters provided
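
The four EDP resources fit together roughly as below. This is a toy model of the concepts, not the Sahara client or its wire format: the class and field names are hypothetical, and the Swift URLs and cluster ID are placeholders.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DataSource:
    """An input or output location (a Swift or HDFS URL)."""
    name: str
    url: str

@dataclass(frozen=True)
class JobBinary:
    """A JAR or script stored somewhere the service can fetch it."""
    name: str
    url: str

@dataclass(frozen=True)
class JobExecution:
    """A job template with its missing parameters filled in."""
    job: "Job"
    cluster_id: str
    input: DataSource
    output: DataSource
    configs: dict = field(default_factory=dict)

@dataclass(frozen=True)
class Job:
    """A template: what to run, but not where or on which data."""
    name: str
    job_type: str
    mains: tuple

    def execute(self, cluster_id, input, output, configs=None):
        return JobExecution(self, cluster_id, input, output, dict(configs or {}))

script = JobBinary("analysis.pig", "swift://demo-container/analysis.pig")
job = Job("analysis", "Pig", (script,))

# One template, two executions over different data.
monday = job.execute("CLUSTER-ID",
                     DataSource("mon-in", "swift://demo-container/mon"),
                     DataSource("mon-out", "swift://demo-container/mon-out"))
tuesday = job.execute("CLUSTER-ID",
                      DataSource("tue-in", "swift://demo-container/tue"),
                      DataSource("tue-out", "swift://demo-container/tue-out"))
print(monday.input.url, "->", monday.output.url)
```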

API v2 (future)
Consistent, stable, and clean evolution of v1 & v1.1
○ Image handling in v1 wasn’t RESTful
○ Reduce use of internally stored binaries
○ Jobs & job executions weren’t RESTful
○ Resource naming wasn’t consistent (clusters vs job-executions
& cluster-templates vs jobs)
○ Prune unused operations, e.g. status-refresh
○ Align resource lifecycle, e.g. terminate = stop&delete
vs terminate = stop

Sahara’s Plugin API
● http://bit.ly/1h4MiAW
● get_versions
● get_configs(version)
● get_node_processes(version)
● get_required_image_tags(version)
● validate(cluster)
● configure_cluster(cluster)
● start_cluster(cluster)
● scale_cluster(cluster)
● ...
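
To make the shape of that interface concrete, here is a sketch of an abstract base class carrying the method names listed above, plus a toy plugin. Only the method names come from the slide. The class names, return types, the dict-based cluster, and the validate / configure / start call order in `provision` are all assumptions for illustration; the real definitions are in the linked plugin documentation.

```python
from abc import ABC, abstractmethod

class PluginSketch(ABC):
    """Hypothetical base class with the method names from the slide."""
    @abstractmethod
    def get_versions(self): ...
    @abstractmethod
    def get_configs(self, version): ...
    @abstractmethod
    def get_node_processes(self, version): ...
    @abstractmethod
    def get_required_image_tags(self, version): ...
    @abstractmethod
    def validate(self, cluster): ...
    @abstractmethod
    def configure_cluster(self, cluster): ...
    @abstractmethod
    def start_cluster(self, cluster): ...
    @abstractmethod
    def scale_cluster(self, cluster): ...

class ToyPlugin(PluginSketch):
    """A stand-in vendor plugin; all values are made up."""
    def get_versions(self):
        return ["1.0"]
    def get_configs(self, version):
        return {"replication": 3}
    def get_node_processes(self, version):
        return {"HDFS": ["namenode", "datanode"]}
    def get_required_image_tags(self, version):
        return ["toy", version]
    def validate(self, cluster):
        # Assumed rule: exactly one namenode across all node groups.
        namenodes = sum(g["count"] for g in cluster["node_groups"]
                        if "namenode" in g["node_processes"])
        if namenodes != 1:
            raise ValueError(f"expected 1 namenode, found {namenodes}")
    def configure_cluster(self, cluster):
        cluster["status"] = "configured"
    def start_cluster(self, cluster):
        cluster["status"] = "active"
    def scale_cluster(self, cluster):
        # Assumes the cluster dict already holds the new counts.
        self.validate(cluster)
        cluster["status"] = "active"

def provision(plugin, cluster):
    """Assumed call order for bringing a cluster up."""
    plugin.validate(cluster)
    plugin.configure_cluster(cluster)
    plugin.start_cluster(cluster)
    return cluster

demo = {"node_groups": [
    {"name": "master", "count": 1, "node_processes": ["namenode"]},
    {"name": "worker", "count": 3, "node_processes": ["datanode"]},
]}
print(provision(ToyPlugin(), demo)["status"])
```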

Roadmap
● I mentioned a couple things, but this is a
community project
● The Icehouse release is tomorrow
● Design summit, where developers & users &
business get together to define the roadmap,
is May 13-16 in Atlanta

Demo with bigpetstore
● http://jayunit100.github.io/bigpetstore/slides
● Bigpetstore (by @jayunit100)
○ A full stack Hadoop application
○ Uses the main players in the Hadoop ecosystem
○ To demonstrate a single domain
○ Just accepted into the Bigtop project!

Demo with bigpetstore...live (cont)
We’re going to perform petstore transaction
analysis -
1. Generate data from a model
2. Transform data for processing
3. Process w/ Pig or Mahout, we’ll do Pig
4. Visualize results in web app
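
The four steps above can be mimicked on a laptop to see how the stages hand off to each other. This is not bigpetstore code: the states, products, record layout, and the per-state count (standing in for what a Pig GROUP BY might compute) are all hypothetical, and the "visualization" is just text.

```python
import random
from collections import Counter

STATES = ["AZ", "CA", "CO"]                            # hypothetical model inputs
PRODUCTS = ["dog-food", "cat-litter", "fish-tank"]

def generate(n, seed=0):
    """Step 1: generate raw transaction lines from a (trivial) model."""
    rng = random.Random(seed)
    return [f"{rng.choice(STATES)},{rng.choice(PRODUCTS)},{rng.randint(1, 50)}"
            for _ in range(n)]

def transform(lines):
    """Step 2: parse raw lines into records ready for processing."""
    records = []
    for line in lines:
        state, product, price = line.split(",")
        records.append({"state": state, "product": product, "price": int(price)})
    return records

def process(records):
    """Step 3: count transactions per state."""
    return Counter(r["state"] for r in records)

def visualize(counts):
    """Step 4: render one bar per state."""
    return "\n".join(f"{state} {'#' * n}" for state, n in sorted(counts.items()))

counts = process(transform(generate(20)))
print(visualize(counts))
```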

Demo video...
https://www.youtube.com/watch?v=vmry_kXqn4c
