4. OpenStack Data Processing: Sahara
Mission: To provide a scalable data processing
stack and associated management interfaces.
• provision and operate Hadoop clusters
• schedule and operate Hadoop jobs
7. Use cases
• Self-service provisioning of Hadoop clusters
• Utilization of unused compute capacity for
bursty workloads
• Dev -> Stage -> Prod lifecycle
• Run Hadoop workloads in a few clicks without
expertise in Hadoop ops
20. Icehouse release
• HBase (and Sqoop) available via HDP plugin
• Spark images w/ diskimage-builder (full plugin in review)
• Heat for provisioning
• i18n translation started
• Neutron namespaces w/ rootwrap
• Guest agent implementation started
21. Elastic Data Processing (EDP) is Sahara’s take on
data processing workflow management.
Goal - let end users (those w/ high value questions
to answer) get answers about data without having
to know a single thing about cluster management.
“Customers launch millions of Amazon EMR clusters every year.”
http://aws.amazon.com/elasticmapreduce/
Elastic Data Processing update
26. Command line interface overview
If you can do it with the Dashboard, you
can do it from the command-line
Blueprint: python-savannaclient-cli
27. Command line interface overview
Image management
$ sahara
...
Positional arguments:
<subcommand>
image-add-tag Add a tag to an image.
image-list Print a list of available images.
image-register Register an image from the Image index.
image-remove-tag Remove a tag from an image.
image-show Show details of an image.
image-unregister Unregister an image.
28. Command line interface overview
Node group, cluster and job templates
$ sahara
node-group-template-create Create a node group...
node-group-template-delete Delete a node group...
node-group-template-list Print a list of available...
node-group-template-show Show details of a node...
cluster-template-create Create a cluster template.
cluster-template-delete Delete a cluster template.
cluster-template-list Print a list of available...
cluster-template-show Show details of a cluster...
job-template-create Create a job template.
job-template-delete Delete a job template.
job-template-list Print a list of job...
job-template-show Show details of a job...
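A minimal sketch of how node group templates might combine into a cluster template, written as plain Python dictionaries. The field names (`plugin_name`, `hadoop_version`, `flavor_id`, `node_processes`, `node_groups`, `count`), the process names, and the helper `cluster_template` are assumptions for illustration only; the slides do not show the actual schema, so check the Sahara documentation before relying on any of them.

```python
import json

# Hypothetical node group templates: each one describes a kind of node.
master = {
    "name": "hdp-master",
    "plugin_name": "hdp",
    "hadoop_version": "1.3.2",
    "flavor_id": "2",  # placeholder flavor ID
    "node_processes": ["NAMENODE", "JOBTRACKER", "AMBARI_SERVER"],
}
worker = {
    "name": "hdp-worker",
    "plugin_name": "hdp",
    "hadoop_version": "1.3.2",
    "flavor_id": "2",
    "node_processes": ["DATANODE", "TASKTRACKER"],
}


def cluster_template(name, groups):
    """Build a cluster template from (node_group_template, count) pairs."""
    plugins = {(g["plugin_name"], g["hadoop_version"]) for g, _ in groups}
    if len(plugins) != 1:
        raise ValueError("all node groups must share one plugin and version")
    plugin_name, version = plugins.pop()
    return {
        "name": name,
        "plugin_name": plugin_name,
        "hadoop_version": version,
        "node_groups": [{"name": g["name"], "count": n} for g, n in groups],
    }


template = cluster_template("hdp-small", [(master, 1), (worker, 3)])
print(json.dumps(template, indent=2))
```

Keeping node groups separate from the cluster template means the same worker definition can be reused at different instance counts across Dev, Stage and Prod.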
29. Command line interface overview
Data sources and job binaries
$ sahara
...
<subcommand>
data-source-create Create a data source that provides
job input or receives job output.
data-source-delete Delete a data source.
data-source-list Print a list of available data...
data-source-show Show details of a data source.
job-binary-create Record a job binary.
job-binary-delete Delete a job binary.
job-binary-list Print a list of job binaries.
job-binary-show Show details of a job binary.
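One way to picture how these objects relate: a job run reads from one data source, writes to another, and executes code recorded as a job binary. The sketch below models that relationship only. The class names, fields, and `swift://` URLs are hypothetical and are not taken from the Sahara API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    type: str  # assumed values, e.g. "swift" or "hdfs"
    url: str


@dataclass(frozen=True)
class JobBinary:
    name: str
    url: str


@dataclass(frozen=True)
class JobRun:
    binary: JobBinary
    input: DataSource
    output: DataSource

    def validate(self):
        """Reject a run that would write its output over its own input."""
        if self.input.url == self.output.url:
            raise ValueError("input and output data sources must differ")
        return True


logs = DataSource("raw-logs", "swift", "swift://demo/logs")
report = DataSource("report", "swift", "swift://demo/report")
script = JobBinary("wordcount.pig", "swift://demo/wordcount.pig")

run = JobRun(binary=script, input=logs, output=report)
print(run.validate())
```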
30. Command line interface overview
Clusters and jobs
$ sahara
...
<subcommand>
cluster-create Create a cluster.
cluster-delete Delete a cluster.
cluster-list Print a list of available clusters.
cluster-show Show details of a cluster.
job-create Create a job.
job-delete Delete a job.
job-list Print a list of jobs.
job-show Show details of a job.
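The subcommands listed on these slides follow a `<resource>-<action>` naming pattern. As a small illustration of that pattern (not part of the client itself), the hypothetical helper `split_subcommand` below separates a subcommand name into its resource and action, using only the resource names that appear in the listings above.

```python
# Resource names taken from the subcommand listings on the slides.
RESOURCES = [
    "node-group-template",
    "cluster-template",
    "job-template",
    "job-binary",
    "data-source",
    "cluster",
    "image",
    "job",
]


def split_subcommand(name):
    """Split e.g. 'cluster-template-list' into ('cluster-template', 'list')."""
    # Try longer resource names first so that 'job-template-list' is not
    # mistaken for the resource 'job' with the action 'template-list'.
    for resource in sorted(RESOURCES, key=len, reverse=True):
        prefix = resource + "-"
        if name.startswith(prefix):
            return resource, name[len(prefix):]
    raise ValueError("unknown subcommand: " + name)


for cmd in ["cluster-create", "job-template-list", "image-add-tag"]:
    print(cmd, "->", split_subcommand(cmd))
```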
32. HDP Plugin Overview
• Full support for all Sahara functionality
• Nova and Neutron networking
• Cluster Scaling
• Scale Up
• Swift Integration
• Cinder Support
• Data Locality
• EDP
• Apache Ambari REST APIs used for cluster
provisioning
• Monitoring/Management of clusters via Ambari
• Full support for multiple HDP stacks
• HDP pre-installed or generic VM images
33. HDP Plugin Stack Support
HDP 1.3.2 (Available)
● NameNode
● Secondary NameNode
● DataNode
● HDFS
● ZooKeeper
● Ambari Server/Agent
● HCatalog
● Sqoop
● Job Tracker
● Task Tracker
● MapReduce
● Hive
● MySQL
● Pig
● WebHCat Server
● Oozie
● Ganglia
● Nagios
● HBase
HDP 2.0.6 (Available)
● History Server
● MapReduce 2 / YARN
● Resource Manager
● YARN Client
HDP 2.1 (Coming Soon!)
● Storm
● Falcon
HDP 2.1 + (Roadmap)
● SOLR
● Cascading
34. HDP Disk Images
• Disk Image Builder offers a consistent approach for image creation
• HDP Plugin provides images and scripts for (CentOS, RHEL):
• Plain
• 1.3.2
• 2.0.6
• 2.1 (coming soon)
• Pre-Packaged images (1.3.2, 2.0.6) provide images with HDP packages
pre-installed for accelerated provisioning and reduced network traffic
• Image Build Scripts allow images to be customized
• Security
• Custom Packages
• O/S Settings
35. Ambari Blueprints
• Two primary goals of Ambari Blueprints
• Ability to export a complete description of a
running cluster
• Provide API-based cluster installations from
a self-contained cluster description
• Blueprints contain cluster topology and configuration
information
• Enables interesting use cases spanning physical and
virtual deployments, including OpenStack/Sahara
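To make "cluster topology and configuration information" more concrete, here is a rough sketch of the shape a blueprint might take, built as a Python dictionary and serialized to JSON. The key names (`Blueprints`, `host_groups`, `components`, `cardinality`, `configurations`) follow the Ambari blueprint format as commonly documented, but treat them, the component names, and the stack version as assumptions rather than a verified schema. The helper `components_by_group` is hypothetical.

```python
import json

# Illustrative blueprint: topology (host groups) plus configuration overrides.
blueprint = {
    "Blueprints": {"stack_name": "HDP", "stack_version": "2.0.6"},
    "host_groups": [
        {
            "name": "master",
            "cardinality": "1",
            "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}],
        },
        {
            "name": "worker",
            "cardinality": "3",
            "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}],
        },
    ],
    "configurations": [],  # no overrides in this sketch
}


def components_by_group(bp):
    """Map each host group name to the list of components it runs."""
    return {
        group["name"]: [c["name"] for c in group["components"]]
        for group in bp["host_groups"]
    }


print(json.dumps(components_by_group(blueprint), indent=2))
```

Because the description is self-contained, the same document could in principle describe a physical cluster or one provisioned on OpenStack through Sahara.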
37. Juno roadmap
• Further integration with OpenStack ecosystem:
• Distributed architecture
• Guest agents
• EDP enhancements
• Merge dashboard to Horizon
To be discussed and confirmed at the Design Summit