Building a cloud based managed BigData platform for the enterprise

These are slides I presented at the BigData Conclave event in Bangalore, December 2013. The talk focused on sharing experiences building a managed BigData platform on top of the Amazon AWS infrastructure, and on how such a platform adds value to enterprises.

Published in: Technology, Business
Transcript of "Building a cloud based managed BigData platform for the enterprise"

  1. Building a cloud based managed BigData platform
     Hemanth Yamijala, Lead Consultant - ThoughtWorks
     yhemanth@thoughtworks.com, @yhemanth
  2. Pillars in a BigData Solution
     Diagram labels: BigData, Infrastructure, Data, Process
  3. A Managed Platform
     Diagram labels: BigData, Services, Infrastructure, Services, Data, Process
  4. Reuse infrastructure
     • Consolidate cluster resources
     • Saves capacity cost
     • Saves operational cost
     • Enforce common access control and security measures
  5. Reuse Data
     • Democratize data assets
     • Assist self service, discoverability
     • Conventions based organization of data
     • Enforce access policies
  6. Reuse process
     • Build common processing frameworks or libraries
     • Ingest and Extract can be centralized services
     • Frameworks can be developed for ETL processes, workflows, etc.
     • Save time in building analytical solutions
  7. Other Reasons
     • Develop and leverage skill set of people
     • Separating concerns of running applications vs running infrastructure
     • Evaluate and adopt new developments in the space
  8. Flavors of managed BigData platforms
     • Physical data centers
     • Private or Public clouds
     • Infrastructure Providers: Amazon Web Services, Google Compute Engine, Microsoft Azure, IBM, OpenStack, Rackspace
     • Platform Providers: Qubole, Xurmo
     • In-House: Netflix, ...
  9. Architectural Layers
     • Enterprise User Data / Workloads
     • User Data / Workloads
     • Enterprise Managed BigData Services (E.g. Netflix Genie)
     • Managed BigData Services (E.g. EMR, Savanna, Redshift)
     • Cloud Storage (E.g. S3, Swift)
     • Virtualized Compute (E.g. EC2, Nova)
  10. Components in a managed platform
     • Presentation: Command Line Tools, API, Analytics Workbench
     • Data analytics: Data Catalog, Query, Aggregates, ETL
     • Platform: Ingest, FileSystem, Workflow, Provisioning, Scheduler, Job Management, Extract, Access Control, Eventing
     • Infrastructure: Redshift and S3 (Data), EMR (Compute), IAM (Identity), SNS
  11. Elastic MapReduce - 101
     • Provision a Hadoop cluster of given size, using given type of instances
     • Support for most of the ecosystem - Hive, Pig, HBase, etc.
     • Can scale up and down nodes for a cluster on demand
     • User submits ‘jobflows’ - a sequence of Hadoop jobs
     • Integrates with S3 as permanent store of data
     • Integrates with other Amazon services
     • Cost = Std. EC2 instance cost + extra + Std. S3 ops etc.
  12. Reasons for having Enterprise Tier on EMR
     • Improve usability by providing better abstractions, necessary automation
     • Improve cost utilization by reusing infrastructure
     • Improve performance by providing system level optimizations
  13. Improving Usability
     • EMR API expects some repetitive setup steps as part of job submission. E.g. Hive setup for all Hive jobs
     • Provide a service API with a simpler interface that automates the setup.
  14. Improving Usability
     EMR steps (cut off on the slide):
       {"steps": [
         {
           "stepActionOnFailure": "CONTINUE",
           "stepName": "Setup Hive",
           "stepArgs": [
             "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
             "--base-path", "s3://us-east-1.elasticmapreduce/libs/hive/",
             "--install-hive",
             "--hive-versions", "latest"
           ],
           "stepJar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
         },
         {
           "stepActionOnFailure": "CONTINUE",
           "stepName": "Install Hive Site Configuration",
           "stepArgs": [
             "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
             "--base-path", "s3://us-east-1.elasticmapreduce/libs/hive/",
             "--install-hive-site",
             "--hive-site=s3://com.x.y.z/security/configs/hive-site.xml",
             "--hive-versions", "latest"
           ],
           "stepJar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
         },
         {
           "stepActionOnFailure": "TERMINATE_JOB_FLOW",
           "stepName": "Run Hive Script",
           "stepArgs": [
             "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
             "--base-path", "s3://us-east-1.elasticmapreduce/libs/hive/",
     Simplified job request:
       jobs: [{
         "name": "hive-query",
         "type": "hive",
         "args": [ "-hiveconf", "hive.cli.print.header=true" ],
         "script": "select * from table;"
       }]
  15. Improving Usability
     • Separate cluster management from job management.
     • EMR expects users to know the cluster sizes when launching jobs
     • Have the system (or administrators) launch clusters on behalf of users
     • Users will either not know how to launch clusters, or will launch incorrectly sized ones.
     • Have the system submit jobs to appropriate clusters
     • Scale them according to the needs of the jobs automatically or administratively
  16. Improve cost utilization
     • Different cluster types in EMR: ephemeral (default) and static
     • Ephemeral clusters can be a huge cost drain. Note: the minimum charge is for one hour
     • Static clusters can also waste money (if unused)
     • Go with a hybrid model:
       • Launch clusters on demand, but get the most use out of each billed hour: keep them alive for at least an hour
       • Reuse them for other jobs transparently
       • Shut down if not used anymore
     • Saved $3000 in a month with this strategy
  17. Job Management System Design
     Components: Job Management Service, Job Executor, Resource Estimator, Cluster Manager
  18. Job Management System Design
     Manages provisioning, monitoring and terminating clusters. Matches job requests to suitable clusters based on policy.
  19. Job Management System Design
     Pool of clusters brought up either on demand or predetermined, based on resource requirements, longevity, etc.
  20. Job Management System Design
     Has knowledge of how to convert a user jobflow to an EMR jobflow. Also knows how to submit jobflows to clusters identified by the cluster manager.
  21. Job Management System Design
     Monitors running jobs on clusters using CloudWatch (or a similar system), and determines whether to add nodes to or remove nodes from a cluster.
  22. Job Management System Design
     Front-end service API for users to submit their jobs.
  23. Thank you!
     http://www.thoughtworks.com/insights/bigdata-analytics
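
The hybrid cluster model from the "Improve cost utilization" slide can be sketched in a few lines. This is a minimal illustration, not the platform's actual code: the `Cluster` and `ClusterPool` names, the 5-minute grace window and the in-memory bookkeeping are all assumptions, and no EMR API is called. It only shows the policy the slide describes: launch on demand, reuse an idle cluster for the next job, and shut a cluster down only when it is idle and its already-paid hour is about to run out.

```python
from dataclasses import dataclass
from typing import List

# The slide notes that the minimum charge is for one hour.
BILLING_UNIT_SECONDS = 3600


@dataclass
class Cluster:
    """Hypothetical bookkeeping record for one running cluster."""
    cluster_id: str
    launched_at: float       # launch time, in seconds
    busy_until: float = 0.0  # time at which the current job finishes

    def is_idle(self, now: float) -> bool:
        return now >= self.busy_until

    def paid_until(self, now: float) -> float:
        """End of the billing hour that contains `now`."""
        elapsed = max(now - self.launched_at, 0.0)
        hours_started = int(elapsed // BILLING_UNIT_SECONDS) + 1
        return self.launched_at + hours_started * BILLING_UNIT_SECONDS


class ClusterPool:
    """Hypothetical pool implementing the launch / reuse / shut down policy."""

    def __init__(self) -> None:
        self.clusters: List[Cluster] = []
        self._launched = 0

    def acquire(self, now: float, job_seconds: float) -> Cluster:
        """Reuse an idle cluster if one exists, otherwise launch a new one."""
        for cluster in self.clusters:
            if cluster.is_idle(now):
                cluster.busy_until = now + job_seconds
                return cluster
        self._launched += 1
        cluster = Cluster(f"cluster-{self._launched}", now, now + job_seconds)
        self.clusters.append(cluster)
        return cluster

    def reap(self, now: float, grace_seconds: float = 300.0) -> List[str]:
        """Terminate idle clusters whose paid hour ends within the grace window."""
        keep: List[Cluster] = []
        terminated: List[str] = []
        for cluster in self.clusters:
            expiring = cluster.paid_until(now) - now <= grace_seconds
            if cluster.is_idle(now) and expiring:
                terminated.append(cluster.cluster_id)
            else:
                keep.append(cluster)
        self.clusters = keep
        return terminated
```

With this sketch, a second job arriving while the first cluster is idle reuses it instead of paying for a new hour, and `reap` leaves an idle cluster alone until shortly before its current billed hour ends.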