SlideShare a Scribd company logo
1 of 32
Download to read offline
Big data processing with
Hadoop on OpenStack
Matthew Farrellee
(@spinningmatt)
Red Hat
Here for a talk about Savanna?
Oops, this talk is about Sahara.
Good news is they’re the same thing.
Savanna was renamed for trademark reasons to Sahara.
You have to go to page 10 of google results to find out why:
https://www.google.com/search?q=savanna+hadoop&start=90
In brief - what is Hadoop
● Narrow - Apache Hadoop - a specific
Apache project originally from Yahoo!, based
on papers published from Google
● Broad - an ecosystem of projects, mostly
Apache, that integrate in some way with
Apache Hadoop
● Most common to use the broad definition
Hadoop from Hortonworks (+ others)
● Multiple projects
○ Workload management
○ Resource management
○ System management
○ Data ingest & storage
○ Compute frameworks
○ Domain languages
● Data storage and
processing focused
In brief - what is OpenStack
OpenStack is a cloud operating system that
controls large pools of compute, storage, and
networking resources throughout a datacenter,
all managed through a dashboard that gives
administrators control while empowering their
users to provision resources through a web
interface.
An ecosystem of projects
● Compute - Nova
● Networking - Neutron
● Object Storage - Swift
● Block Storage - Cinder
● Identity - Keystone
● Image Service - Glance
● Dashboard - Horizon
● Telemetry - Ceilometer
● Orchestration - Heat
● Data Processing - Sahara
Longer comments on big data
Choose your own adventure…
Go to the next slide and get the day over
sooner
See some shoegazing followed by a rant and
have the day last longer
Interest (via Google Trends)
Hadoop
EC2
OpenStack
www.google.com/trends/explore#q=hadoop,ec2,openstack
Interest (via Google Trends)
Hadoop
EC2
OpenStack
www.google.com/trends/explore#q=hadoop,ec2,openstack
EC2 beta Aug 25 2006 (http://aws.typepad.
com/aws/2006/08/amazon_ec2_beta.html)
Data analysis is hard
Analysis - have a question
● Even this alone is hard to come up with
● The question you answer won’t be the
question you set out to ask
● You’ll have to iterate and refine
Can I predict doctor specialty from what
procedures they perform?
Analysis - finding the data
● Publically -
○ Tons of data repositories
○ No consistency, even within a specific repository
● Privately -
○ Data often hidden in silos
○ Even less consistency
● Avoid datasets that don’t come with a
dictionary
○ Data w/o a dictionary is like code w/o comments
Analysis - acceptable use
● Publically -
○ Data sets often have associated licenses
○ Yes, even public (government) sets
○ You may have to find an alternative set
● Privately -
○ Often tightly controlled, considered sensitive
business data
○ If you can use it, maybe only in a specific place
○ Likely no alternatives
● The story of Stephen Glasser and Cheryl
Palma
● Two of the oldest people in the medical
profession working with medicare
● Stephen Glasser graduated in 1773
● Cheryl Palma graduated in 1776
Analysis - explore / clean the data
Analysis - finally
● You got some answer to a question you
approximately asked
● You must refine the question and process
● Repeat
This is hard enough without having to manage
tools and infrastructure!
Sahara’s goal
Make managing Hadoop+ infrastructure and
tools so simple that doing so never gets in your
way
Sahara is
● An OpenStack project in the Data
Processing program
● Started one year ago (Summit in Portland)
● Incubated in Icehouse (6 months ago)
● Integrated for Juno (6 months from now)
Sahara’s architecture
Data
Sources
Sahara
Python
Client
RESTAPI
Cluster
Configuration
Manager
Horizon
Keystone
Auth
Data
Access
Layer
Swift
Sahara
Pages
Hadoop
VM
Vendors
Plugins
Hadoop
VM
Hadoop
VM
Hadoop
VM
Resources
Orchestration
Manager
Job
Sources Job
Manager
Heat
Nova
Glance
Cinder
Neutron
Trove DB
Sahara Service
Sahara’s plugin architecture
● This is important!
● It’s where Hadoop distribution vendors
integrate their management software
● It’s how users pick different software
versions
● Currently: Vanilla (reference impl. w/ Apache
versions), HDP (via Ambari), IDH (via Intel
Manager) and under review CDH and Spark
Sahara lets you
● Create and manage clusters
● Define and run analysis jobs
● All through a programmatic interface
● Or a web console
Sahara’s REST API
API v1 (Cluster operations)
● http://bit.ly/1hRXrVX
● Plugins
○ list - comes from configuration
○ get - provides capabilities of a plugin, e.g. services
● Images
○ register - provide basic metadata, username - going
away w/ heat
○ tag/untag - associate image w/ a plugin
API v1 (Cluster operations) (cont)
● http://bit.ly/1hRXrVX
● Templates
○ node groups
○ clusters
● Clusters
○ Instances of templates
API v1.1 (Elastic Data Processing)
● http://bit.ly/1kXGjGj
● Data Source
○ Input and output locations (Swift/HDFS urls)
● Job Binaries
○ Often JARs or scripts stored in Swift or ...
● Jobs
○ Templates for a job with missing parameters
● Job executions
○ Instances of templates with parameters provided
API v2 (future)
Consistent, stable, and clean evolution of v1 & v1.1
○ Image handling in v1 wasn’t RESTful
○ Reduce use of internally stored binaries
○ Jobs & job executions weren’t RESTful
○ Resource naming wasn’t consistent (clusters v job-
executions & cluster-templates v jobs)
○ Prune unused operations, e.g status-refresh
○ Align resource lifecycle, e.g. terminate = stop&delete
vs terminate = stop
Sahara’s Plugin API
Sahara’s Plugin API
● http://bit.ly/1h4MiAW
● get_versions
● get_configs(version)
● get_node_processes(version)
● get_required_image_tags(version)
● validate(cluster)
● configure_cluster(cluster)
● start_cluster(cluster)
● scale_cluster(cluster)
● ...
Roadmap
● I mentioned a couple things, but this is a
community project
● The Icehouse release is tomorrow
● Design summit, where developers & users &
business get together to define the roadmap,
is May 13-16 in Atlanta
Demo with bigpetstore
● http://jayunit100.github.io/bigpetstore/slides
● Bigpetstore (by @jayunit100)
○ A full stack hadoop application
○ Uses the main players in the hadoop ecosystem
○ To demonstrate a single domain
○ Just accepted into the Bigtop project!
Demo with bigpetstore...live (cont)
We’re going to perform petstore transaction
analysis -
1. Generate data from a model
2. Transform data for processing
3. Process w/ pig or mahout, we’ll do pig
4. Visualize results in web app
Demo video...
https://www.youtube.com/watch?v=vmry_kXqn4c
Hadoop on OpenStack - Sahara @DevNation 2014

More Related Content

What's hot

20150314 sahara intro and the future plan for open stack meetup
20150314 sahara intro and the future plan for open stack meetup20150314 sahara intro and the future plan for open stack meetup
20150314 sahara intro and the future plan for open stack meetupWei Ting Chen
 
Hong Kong OpenStack Summit: Savanna - Hadoop on OpenStack
Hong Kong OpenStack Summit: Savanna - Hadoop on OpenStackHong Kong OpenStack Summit: Savanna - Hadoop on OpenStack
Hong Kong OpenStack Summit: Savanna - Hadoop on OpenStackSergey Lukjanov
 
Savanna - Elastic Hadoop on OpenStack
Savanna - Elastic Hadoop on OpenStackSavanna - Elastic Hadoop on OpenStack
Savanna - Elastic Hadoop on OpenStackSergey Lukjanov
 
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...Sergey Lukjanov
 
Hello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopHello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopDataWorks Summit
 
20151027 sahara + manila final
20151027 sahara + manila final20151027 sahara + manila final
20151027 sahara + manila finalWei Ting Chen
 
OpenStack Data Processing ("Sahara") project update - December 2014
OpenStack Data Processing ("Sahara") project update - December 2014OpenStack Data Processing ("Sahara") project update - December 2014
OpenStack Data Processing ("Sahara") project update - December 2014Sergey Lukjanov
 
Lessons Learned from Building an Enterprise Big Data Platform from the Ground...
Lessons Learned from Building an Enterprise Big Data Platform from the Ground...Lessons Learned from Building an Enterprise Big Data Platform from the Ground...
Lessons Learned from Building an Enterprise Big Data Platform from the Ground...DataWorks Summit
 
Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLliuknag
 
Savanna project update Jan 2014
Savanna project update Jan 2014Savanna project update Jan 2014
Savanna project update Jan 2014Sergey Lukjanov
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersDataWorks Summit/Hadoop Summit
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
 
Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...
Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...
Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...DataStax
 
Sahara presentation latest - Codemotion Rome 2015
Sahara presentation latest - Codemotion Rome 2015Sahara presentation latest - Codemotion Rome 2015
Sahara presentation latest - Codemotion Rome 2015Codemotion
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconYiwei Ma
 
IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015Yousun Jeong
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Guy Harrison
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick viewRajesh Nadipalli
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Databricks
 

What's hot (20)

20150314 sahara intro and the future plan for open stack meetup
20150314 sahara intro and the future plan for open stack meetup20150314 sahara intro and the future plan for open stack meetup
20150314 sahara intro and the future plan for open stack meetup
 
Hong Kong OpenStack Summit: Savanna - Hadoop on OpenStack
Hong Kong OpenStack Summit: Savanna - Hadoop on OpenStackHong Kong OpenStack Summit: Savanna - Hadoop on OpenStack
Hong Kong OpenStack Summit: Savanna - Hadoop on OpenStack
 
Savanna - Elastic Hadoop on OpenStack
Savanna - Elastic Hadoop on OpenStackSavanna - Elastic Hadoop on OpenStack
Savanna - Elastic Hadoop on OpenStack
 
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
 
Hello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopHello OpenStack, Meet Hadoop
Hello OpenStack, Meet Hadoop
 
20151027 sahara + manila final
20151027 sahara + manila final20151027 sahara + manila final
20151027 sahara + manila final
 
OpenStack Data Processing ("Sahara") project update - December 2014
OpenStack Data Processing ("Sahara") project update - December 2014OpenStack Data Processing ("Sahara") project update - December 2014
OpenStack Data Processing ("Sahara") project update - December 2014
 
Lessons Learned from Building an Enterprise Big Data Platform from the Ground...
Lessons Learned from Building an Enterprise Big Data Platform from the Ground...Lessons Learned from Building an Enterprise Big Data Platform from the Ground...
Lessons Learned from Building an Enterprise Big Data Platform from the Ground...
 
Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQL
 
Savanna project update Jan 2014
Savanna project update Jan 2014Savanna project update Jan 2014
Savanna project update Jan 2014
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...
Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...
Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...
 
Sahara presentation latest - Codemotion Rome 2015
Sahara presentation latest - Codemotion Rome 2015Sahara presentation latest - Codemotion Rome 2015
Sahara presentation latest - Codemotion Rome 2015
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
 
IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
 

Viewers also liked

Hadoop on OpenStack - Trove Day 2014
Hadoop on OpenStack - Trove Day 2014Hadoop on OpenStack - Trove Day 2014
Hadoop on OpenStack - Trove Day 2014Tesora
 
Hadoop on OpenStack
Hadoop on OpenStackHadoop on OpenStack
Hadoop on OpenStackSandeep Raju
 
Hipi: Computer Vision at Large Scale
Hipi: Computer Vision at Large ScaleHipi: Computer Vision at Large Scale
Hipi: Computer Vision at Large ScaleLiu Liu
 
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...Cloudera, Inc.
 
Apache Ambari BOF - OpenStack - Hadoop Summit 2013
Apache Ambari BOF - OpenStack - Hadoop Summit 2013Apache Ambari BOF - OpenStack - Hadoop Summit 2013
Apache Ambari BOF - OpenStack - Hadoop Summit 2013Hortonworks
 
15 minute presentation about Thesis
15 minute presentation about Thesis15 minute presentation about Thesis
15 minute presentation about ThesisSven Meys
 
از نماینده ایران در WSIS Prizes 2016 حمایت کنید ... متشکریم ...
از نماینده ایران در WSIS Prizes 2016 حمایت کنید ... متشکریم ...از نماینده ایران در WSIS Prizes 2016 حمایت کنید ... متشکریم ...
از نماینده ایران در WSIS Prizes 2016 حمایت کنید ... متشکریم ...Leila Esmaeili
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weitingWei Ting Chen
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and DeploymentCisco Canada
 
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010Yahoo Developer Network
 
Webinar | Using Hadoop Analytics to Gain a Big Data Advantage
Webinar | Using Hadoop Analytics to Gain a Big Data AdvantageWebinar | Using Hadoop Analytics to Gain a Big Data Advantage
Webinar | Using Hadoop Analytics to Gain a Big Data AdvantageCloudera, Inc.
 
آشنایی با جرم‌یابی قانونی رایانه‌ای
آشنایی با جرم‌یابی قانونی رایانه‌ایآشنایی با جرم‌یابی قانونی رایانه‌ای
آشنایی با جرم‌یابی قانونی رایانه‌ایRamin Najjarbashi
 
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsA Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsDataWorks Summit
 
Cloud Security and Risk Management
Cloud Security and Risk ManagementCloud Security and Risk Management
Cloud Security and Risk ManagementMorteza Javan
 

Viewers also liked (19)

Hadoop on OpenStack - Trove Day 2014
Hadoop on OpenStack - Trove Day 2014Hadoop on OpenStack - Trove Day 2014
Hadoop on OpenStack - Trove Day 2014
 
Hadoop on OpenStack
Hadoop on OpenStackHadoop on OpenStack
Hadoop on OpenStack
 
2012 09-08-josug-jeff
2012 09-08-josug-jeff2012 09-08-josug-jeff
2012 09-08-josug-jeff
 
Hadoop For OpenStack Log Analysis
Hadoop For OpenStack Log AnalysisHadoop For OpenStack Log Analysis
Hadoop For OpenStack Log Analysis
 
Hipi: Computer Vision at Large Scale
Hipi: Computer Vision at Large ScaleHipi: Computer Vision at Large Scale
Hipi: Computer Vision at Large Scale
 
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...
 
Apache Ambari BOF - OpenStack - Hadoop Summit 2013
Apache Ambari BOF - OpenStack - Hadoop Summit 2013Apache Ambari BOF - OpenStack - Hadoop Summit 2013
Apache Ambari BOF - OpenStack - Hadoop Summit 2013
 
15 minute presentation about Thesis
15 minute presentation about Thesis15 minute presentation about Thesis
15 minute presentation about Thesis
 
از نماینده ایران در WSIS Prizes 2016 حمایت کنید ... متشکریم ...
از نماینده ایران در WSIS Prizes 2016 حمایت کنید ... متشکریم ...از نماینده ایران در WSIS Prizes 2016 حمایت کنید ... متشکریم ...
از نماینده ایران در WSIS Prizes 2016 حمایت کنید ... متشکریم ...
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting
 
Using MapReduce for Large–scale Medical Image Analysis
Using MapReduce for Large–scale Medical Image AnalysisUsing MapReduce for Large–scale Medical Image Analysis
Using MapReduce for Large–scale Medical Image Analysis
 
Virtualizing Hadoop
Virtualizing HadoopVirtualizing Hadoop
Virtualizing Hadoop
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
 
Sahara Updates - Kilo Edition
Sahara Updates - Kilo EditionSahara Updates - Kilo Edition
Sahara Updates - Kilo Edition
 
Webinar | Using Hadoop Analytics to Gain a Big Data Advantage
Webinar | Using Hadoop Analytics to Gain a Big Data AdvantageWebinar | Using Hadoop Analytics to Gain a Big Data Advantage
Webinar | Using Hadoop Analytics to Gain a Big Data Advantage
 
آشنایی با جرم‌یابی قانونی رایانه‌ای
آشنایی با جرم‌یابی قانونی رایانه‌ایآشنایی با جرم‌یابی قانونی رایانه‌ای
آشنایی با جرم‌یابی قانونی رایانه‌ای
 
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsA Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
 
Cloud Security and Risk Management
Cloud Security and Risk ManagementCloud Security and Risk Management
Cloud Security and Risk Management
 

Similar to Hadoop on OpenStack - Sahara @DevNation 2014

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to SchoolAdam Doyle
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchEdward Capriolo
 
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero Downtime
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero DowntimeHow to Upgrade Your Hadoop Stack in 1 Step -- with Zero Downtime
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero DowntimeIan Lumb
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingYahoo Developer Network
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsGuy Harrison
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: RevealedSachin Holla
 
Intro to Apache Hadoop
Intro to Apache HadoopIntro to Apache Hadoop
Intro to Apache HadoopSufi Nawaz
 

Similar to Hadoop on OpenStack - Sahara @DevNation 2014 (20)

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batch
 
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero Downtime
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero DowntimeHow to Upgrade Your Hadoop Stack in 1 Step -- with Zero Downtime
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero Downtime
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data Processing
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other tools
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: Revealed
 
Intro to Apache Hadoop
Intro to Apache HadoopIntro to Apache Hadoop
Intro to Apache Hadoop
 

Recently uploaded

(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxnada99848
 

Recently uploaded (20)

(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptx
 

Hadoop on OpenStack - Sahara @DevNation 2014

  • 1. Big data processing with Hadoop on OpenStack Matthew Farrellee (@spinningmatt) Red Hat
  • 2. Here for a talk about Savanna? Oops, this talk is about Sahara. Good news is they’re the same thing. Savanna was renamed for trademark reasons to Sahara. You have to go to page 10 of google results to find out why: https://www.google.com/search?q=savanna+hadoop&start=90
  • 3. In brief - what is Hadoop ● Narrow - Apache Hadoop - a specific Apache project originally from Yahoo!, based on papers published from Google ● Broad - an ecosystem of projects, mostly Apache, that integrate in some way with Apache Hadoop ● Most common to use the broad definition
  • 4. Hadoop from Hortonworks (+ others) ● Multiple projects ○ Workload management ○ Resource management ○ System management ○ Data ingest & storage ○ Compute frameworks ○ Domain languages ● Data storage and processing focused
  • 5. In brief - what is OpenStack OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.
  • 6. An ecosystem of projects ● Compute - Nova ● Networking - Neutron ● Object Storage - Swift ● Block Storage - Cinder ● Identity - Keystone ● Image Service - Glance ● Dashboard - Horizon ● Telemetry - Ceilometer ● Orchestration - Heat ● Data Processing - Sahara
  • 7. Longer comments on big data Choose your own adventure… Go to the next slide and get the day over sooner See some shoegazing followed by a rant and have the day last longer
  • 8. Interest (via Google Trends) Hadoop EC2 OpenStack www.google.com/trends/explore#q=hadoop,ec2,openstack
  • 9. Interest (via Google Trends) Hadoop EC2 OpenStack www.google.com/trends/explore#q=hadoop,ec2,openstack EC2 beta Aug 25 2006 (http://aws.typepad. com/aws/2006/08/amazon_ec2_beta.html)
  • 11. Analysis - have a question ● Even this alone is hard to come up with ● The question you answer won’t be the question you set out to ask ● You’ll have to iterate and refine Can I predict doctor specialty from what procedures they perform?
  • 12. Analysis - finding the data ● Publically - ○ Tons of data repositories ○ No consistency, even within a specific repository ● Privately - ○ Data often hidden in silos ○ Even less consistency ● Avoid datasets that don’t come with a dictionary ○ Data w/o a dictionary is like code w/o comments
  • 13. Analysis - acceptable use ● Publically - ○ Data sets often have associated licenses ○ Yes, even public (government) sets ○ You may have to find an alternative set ● Privately - ○ Often tightly controlled, considered sensitive business data ○ If you can use it, maybe only in a specific place ○ Likely no alternatives
  • 14. ● The story of Stephen Glasser and Cheryl Palma ● Two of the oldest people in the medical profession working with medicare ● Stephen Glasser graduated in 1773 ● Cheryl Palma graduated in 1776 Analysis - explore / clean the data
  • 15. Analysis - finally ● You got some answer to a question you approximately asked ● You must refine the question and process ● Repeat This is hard enough without having to manage tools and infrastructure!
  • 16. Sahara’s goal Make managing Hadoop+ infrastructure and tools so simple that doing so never gets in your way
  • 17. Sahara is ● An OpenStack project in the Data Processing program ● Started one year ago (Summit in Portland) ● Incubated in Icehouse (6 months ago) ● Integrated for Juno (6 months from now)
  • 19. Sahara’s plugin architecture ● This is important! ● It’s where Hadoop distribution vendors integrate their management software ● It’s how users pick different software versions ● Currently: Vanilla (reference impl. w/ Apache versions), HDP (via Ambari), IDH (via Intel Manager) and under review CDH and Spark
  • 20. Sahara lets you ● Create and manage clusters ● Define and run analysis jobs ● All through a programmatic interface ● Or a web console
  • 22. API v1 (Cluster operations) ● http://bit.ly/1hRXrVX ● Plugins ○ list - comes from configuration ○ get - provides capabilities of a plugin, e.g. services ● Images ○ register - provide basic metadata, username - going away w/ heat ○ tag/untag - associate image w/ a plugin
  • 23. API v1 (Cluster operations) (cont) ● http://bit.ly/1hRXrVX ● Templates ○ node groups ○ clusters ● Clusters ○ Instances of templates
  • 24. API v1.1 (Elastic Data Processing) ● http://bit.ly/1kXGjGj ● Data Source ○ Input and output locations (Swift/HDFS urls) ● Job Binaries ○ Often JARs or scripts stored in Swift or ... ● Jobs ○ Templates for a job with missing parameters ● Job executions ○ Instances of templates with parameters provided
  • 25. API v2 (future) Consistent, stable, and clean evolution of v1 & v1.1 ○ Image handling in v1 wasn’t RESTful ○ Reduce use of internally stored binaries ○ Jobs & job executions weren’t RESTful ○ Resource naming wasn’t consistent (clusters v job- executions & cluster-templates v jobs) ○ Prune unused operations, e.g status-refresh ○ Align resource lifecycle, e.g. terminate = stop&delete vs terminate = stop
  • 27. Sahara’s Plugin API ● http://bit.ly/1h4MiAW ● get_versions ● get_configs(version) ● get_node_processes(version) ● get_required_image_tags(version) ● validate(cluster) ● configure_cluster(cluster) ● start_cluster(cluster) ● scale_cluster(cluster) ● ...
  • 28. Roadmap ● I mentioned a couple things, but this is a community project ● The Icehouse release is tomorrow ● Design summit, where developers & users & business get together to define the roadmap, is May 13-16 in Atlanta
  • 29. Demo with bigpetstore ● http://jayunit100.github.io/bigpetstore/slides ● Bigpetstore (by @jayunit100) ○ A full stack hadoop application ○ Uses the main players in the hadoop ecosystem ○ To demonstrate a single domain ○ Just accepted into the Bigtop project!
  • 30. Demo with bigpetstore...live (cont) We’re going to perform petstore transaction analysis - 1. Generate data from a model 2. Transform data for processing 3. Process w/ pig or mahout, we’ll do pig 4. Visualize results in web app