1
2
Ashwin Kumar
Pivotal
Pivotal Hadoop on Cloud Foundry
3
- Open source software for reliable, scalable, distributed computing
- HDFS – A distributed file system for "large" I/O
- YARN – A framework for resource scheduling and management
- MapReduce – Popular paradigm for parallel batch processing
Apache Hadoop
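To make the MapReduce bullet concrete, here is a minimal in-process word-count sketch of the map and reduce phases. This is an illustration of the paradigm only, not Hadoop's actual Java API; the function and data names are ours.

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Reducer: sum the counts for each distinct key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = reduce_phase(map_phase(["the quick fox", "the lazy dog"]))
```

In a real Hadoop job the mapper and reducer run in parallel across many nodes, with the framework handling the shuffle of keys between the two phases.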
4
- Enterprise-grade Hadoop distribution
  • Cluster Management and Monitoring
  • Bulk Data Loader
  • Extensions for Virtualization
- Advanced Database Services
  • World's Fastest SQL on Hadoop
  • 100% SQL Compliance
Pivotal Hadoop
5
- Provision Hadoop resources to power data-centric Cloud Foundry Apps
  • Park unstructured data on HDFS.
  • Execute batch processing via MapReduce.
  • Perform deep, complex analytics in SQL using HAWQ.
Pivotal HD for Cloud Foundry
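As a concrete example of "parking data on HDFS": HDFS exposes a WebHDFS REST interface, so a bound app can read files over plain HTTP. A minimal sketch of constructing such a read request URL — the hostname, port, and file path here are hypothetical placeholders:

```python
def webhdfs_open_url(namenode, port, path):
    # WebHDFS file reads use the OPEN operation under the /webhdfs/v1/ prefix.
    return "http://{}:{}/webhdfs/v1{}?op=OPEN".format(namenode, port, path)

url = webhdfs_open_url("namenode.example.com", 50070, "/user/app/data.csv")
```

An app would issue an HTTP GET against this URL (following the redirect WebHDFS returns to the DataNode serving the block) to stream the file contents.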
6
Extensibility in PaaS
Pivotal CF
7
Extensibility in PaaS
Pivotal CF Pivotal HD
8
Cloud Foundry Service API
[Diagram: the Cloud Controller (Pivotal CF) communicating with the PHD Service Broker (Pivotal HD)]
9
Pivotal HD Service
[Diagram: PHD Service Broker fronting a shared cluster of nine slave nodes running HDFS, Hive, YARN, HBase, HAWQ, and ZooKeeper]
10
Pivotal HD Service
[Diagram: PHD Service Broker and the shared cluster of nine slave nodes running HDFS, Hive, YARN, HBase, HAWQ, and ZooKeeper]
11
Pivotal HD Service
[Diagram: PHD Service Broker and the shared cluster of nine slave nodes running HDFS, Hive, YARN, HBase, HAWQ, and ZooKeeper]
12
- Shared Clusters
- Bare Metal Installs
- Negotiation across Multiple Clusters
- Exclusive Clusters
- Dynamic Provisioning
Pivotal HD Service
13
- Shared Clusters
- Bare Metal Installs
- Negotiation across Multiple Clusters
- Exclusive Clusters
- Dynamic Provisioning
Pivotal HD Service
14
Pivotal HD Service
[Diagram: PHD Service Broker and the shared cluster of nine slave nodes running HDFS, Hive, YARN, HBase, HAWQ, and ZooKeeper]
15
- Shared Clusters
- Bare Metal Installs
- Negotiation across Multiple Clusters
- Exclusive Clusters
- Dynamic Provisioning
Pivotal HD Service
16
Pivotal HD Service
[Diagram: PHD Service Broker negotiating across three clusters, each running HDFS, YARN, and HAWQ]
17
Pivotal HD Service
[Diagram: PHD Service Broker negotiating across three clusters, each running HDFS, YARN, and HAWQ]
18
- Shared Clusters
- Bare Metal Installs
- Negotiation across Multiple Clusters
- Exclusive Clusters
- Dynamic Provisioning
Pivotal HD Service
19
Pivotal HD Service
[Diagram: PHD Service Broker with three clusters, each running HDFS, YARN, and HAWQ]
20
Pivotal HD Service
[Diagram: PHD Service Broker with three clusters, each running HDFS, YARN, and HAWQ]
21
- Shared Clusters
- Bare Metal Installs
- Negotiation across Multiple Clusters
- Exclusive Clusters
- Dynamic Provisioning
Pivotal HD Service
22
Pivotal HD Service
[Diagram: PHD Service Broker with two clusters, each running HDFS, YARN, and HAWQ]
23
Pivotal HD Service
[Diagram: PHD Service Broker with three clusters, each running HDFS, YARN, and HAWQ]
24
- Pivotal HD
  • Deployable by BOSH
  • Exposed as a Cloud Foundry service
- Data-intensive apps are coming
- Only possible through the extensibility of Cloud Foundry
Conclusion
25
Pivotal HD as a Cloud Foundry Service

Editor's Notes
  • #7 As a user of Cloud Foundry, you're probably aware of openness as it pertains to avoiding vendor lock-in. But similarly fundamental is the notion that the Cloud Foundry PaaS itself can be extended to enhance the PaaS value proposition.
  • #8 It's through exactly this extensibility that Pivotal CF allows Hadoop to exist as a complementary service to application developers. With this "enhanced" PaaS, the application developer gets, besides hosting support, domain names, and single-node services like Postgres or MySQL, the ability to leverage large Hadoop clusters for analytics. But how exactly does this work? What is this extensibility we're talking about?
  • #9 At the core of this extensibility is a communication between the Cloud Controller and a Service Broker, whose responsibility is to negotiate service capabilities on behalf of the other nodes comprising the service, whatever that service may be. This communication establishes a few exchanges: Catalog Management – declaring what service is available and what variants of it can be requested by CF administrators (think shared MySQL servers versus dedicated MySQL servers); Provisioning – the act of reserving resources on the cluster; and Binding – the act of enabling access of particular apps to the cluster. What's key to note about this communication is the flexibility of the protocol, which allows the provisioning to be service-defined. This will be critical as we start to look at what it means to treat Hadoop as a service: for such a complex, distributed service, there are many configurations and different use cases for how a typical deployment exists in an enterprise. We'll start with what's likely the most accessible and straightforward approach to provisioning Hadoop.
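The catalog, provisioning, and binding exchanges described in note #9 map onto REST endpoints of the Cloud Foundry Service Broker API (GET /v2/catalog, PUT /v2/service_instances/:id, PUT /v2/service_instances/:id/service_bindings/:id). A minimal sketch of the JSON bodies a Hadoop broker might return — every id, name, and credential value below is invented for illustration, not taken from the actual PHD broker:

```python
def catalog():
    # GET /v2/catalog: advertise the service and its plans to the Cloud Controller.
    return {"services": [{
        "id": "phd-service-id",          # hypothetical service id
        "name": "pivotal-hd",
        "description": "Hadoop (HDFS, YARN, HAWQ) as a service",
        "bindable": True,
        "plans": [{"id": "shared-plan-id", "name": "shared",
                   "description": "Space on a shared cluster"}],
    }]}

def provision(instance_id):
    # PUT /v2/service_instances/<id>: reserve resources, e.g. an HDFS subfolder.
    return {"dashboard_url": None}

def bind(instance_id, binding_id):
    # PUT /v2/service_instances/<id>/service_bindings/<id>: hand out credentials
    # scoped to the resources that provisioning reserved.
    return {"credentials": {"hdfs_uri": "hdfs://namenode:8020/cf/" + instance_id}}
```

The flexibility mentioned in the note lives in these response bodies: the broker decides what "provision" and "bind" mean for its service, and the Cloud Controller only relays the results.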
  • #10 Our first of many variants of Hadoop-as-a-service consists of a shared, static HDFS cluster that gets BOSH-deployed, along with the service broker, on the same infrastructure that your Cloud Foundry PaaS was deployed upon.
  • #11 In this model, the provision request will be received by the Service Broker and propagated to the various sub-components of the cluster.
  • #12 Ultimately, the act of provisioning will have reserved resources on each of the Hadoop components. For example, on HDFS, some amount of space will have been reserved on the filesystem; and with HAWQ, a database will have been created to house SQL data. The ensuing bind requests will allow apps to gain access to the HDFS subfolder, to that HAWQ database, and so on.
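Once bound, a Cloud Foundry app sees the reserved HDFS folder and HAWQ database through its VCAP_SERVICES environment variable. A sketch of what parsing that might look like — the service label and credential key names here are assumptions for illustration, not the actual broker contract:

```python
import json
import os

# Hypothetical VCAP_SERVICES payload, as Cloud Foundry would inject it
# into the environment of an app bound to the service.
os.environ["VCAP_SERVICES"] = json.dumps({
    "pivotal-hd": [{"credentials": {
        "hdfs_uri": "hdfs://namenode:8020/cf/instance-123",
        "hawq_database": "cf_instance_123",
    }}]
})

# The app reads its credentials back out at startup.
services = json.loads(os.environ["VCAP_SERVICES"])
creds = services["pivotal-hd"][0]["credentials"]
```

With these credentials in hand, the app can write files under its HDFS subfolder and open SQL connections to its HAWQ database.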
  • #15 Shared cluster that is BOSH-deployed side-by-side with your CF.
  • #17 Shared cluster that is BOSH-deployed side-by-side with your CF.
  • #18 Shared cluster that is BOSH-deployed side-by-side with your CF.
  • #20 Shared cluster that is BOSH-deployed side-by-side with your CF.
  • #21 Shared cluster that is BOSH-deployed side-by-side with your CF.
  • #23 Shared cluster that is BOSH-deployed side-by-side with your CF.
  • #24 A stepping stone to dynamic MapReduce queries: the ability, through a simple API, to spin up a cluster, send the MapReduce job, execute it, return the analysis, and tear down the cluster.
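From the app developer's side, that ephemeral workflow would look roughly like the following sequence of cf CLI calls. The service and plan names are hypothetical (the real names would come from the broker's catalog); this sketch only assembles the command sequence rather than executing it:

```python
def ephemeral_job_commands(instance, app):
    # Spin up a dedicated cluster, bind it, run the job, then tear it all down.
    return [
        ["cf", "create-service", "pivotal-hd", "dedicated", instance],
        ["cf", "bind-service", app, instance],
        ["cf", "restage", app],  # restart the app so it picks up VCAP_SERVICES
        # ...in between, the app submits its MapReduce job and collects results...
        ["cf", "unbind-service", app, instance],
        ["cf", "delete-service", instance, "-f"],
    ]

cmds = ephemeral_job_commands("mr-cluster", "analytics-app")
```

Because create-service and delete-service flow through the same broker protocol described earlier, the cluster's entire lifecycle stays within Cloud Foundry's existing service abstraction.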