The document discusses managing a multi-tenant data lake at Comcast over time. It began as an experiment in 2013 with 10 nodes and has grown significantly to over 1500 nodes currently. Governance was instituted to manage the diverse user community and workloads. Tools like the Command Center were developed to provide monitoring, alerting and visualization of the large Hadoop environment. SLA management, support processes, and ongoing training are needed to effectively operate the multi-tenant data lake at scale.
This document provides an overview of NEC's Heterogeneous Mixture Learning (HML) technology and its implementation on Apache Spark. It introduces the speakers and their backgrounds working on distributed computing and machine learning. The agenda discusses HML, applications of HML, the HML algorithm, and benchmark performance evaluations showing HML achieves competitive prediction accuracy compared to other Spark ML algorithms while maintaining good scalability. Distributed HML on Spark aims to enable fast, large-scale machine learning by balancing work across executors and leveraging high-performance matrix libraries.
This document discusses building a scalable data science platform with R. It describes R as a popular statistical programming language with over 2.5 million users. It notes that while R is widely used, its open source nature means it lacks enterprise capabilities for large-scale use. The document then introduces Microsoft R Server as a way to bring enterprise capabilities like scalability, efficiency, and support to R in order to make it suitable for production use on big data problems. It provides examples of using R Server with Hadoop and HDInsight on the Azure cloud to operationalize advanced analytics workflows from data cleaning and modeling to deployment as web services at scale.
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop that was created by eBay and later open sourced as an Apache Incubator project. It provides security for Hadoop systems by instantly identifying access to sensitive data, recognizing attacks/malicious activity, and blocking access in real time through complex policy definitions and stream processing. Eagle was designed to handle the huge volume of metrics and logs generated by large-scale Hadoop deployments through its distributed architecture and use of technologies like Apache Storm and Kafka.
Presented by Jack Norris, SVP Data & Applications, at Gartner Symposium 2016.
Jack presents how companies from TransUnion to Uber use event-driven processing to transform their business with agility, scale, robustness, and efficiency advantages.
More info: https://www.mapr.com/company/press-releases/mapr-present-gartner-symposiumitxpo-and-other-notable-industry-conferences
Testistanbul 2016 - Keynote: "Enterprise Challenges of Test Data" by Rex Black (Turkish Testing Board)
If you are testing a simple mobile app, you may find it relatively easy to obtain representative test data. But what if you are testing enterprise-scale applications? In the enterprise data center, one hundred or more applications of varying size, complexity, and criticality coexist, operating on various data repositories, some of them shared. In some cases, disparate data repositories hold related data, and the ability to test integration across the applications that access these data sets is critical. In this keynote speech, Rex Black will talk about the challenges his clients face as they deal with these testing problems. You’ll go away with a better understanding of the nature of the challenges, as well as ideas on how to handle them, grounded in lessons Rex has learned in over 30 years of software engineering and testing.
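One pattern that comes up repeatedly for the shared-repository problem is generating synthetic but referentially consistent records once, then loading the same records into every application under test. A minimal sketch, assuming the Python Faker library and an invented customer layout (an illustration, not something prescribed in the keynote):

```python
from faker import Faker

fake = Faker()
fake.seed_instance(42)  # reproducible data matters for regression testing

def make_customers(n):
    """Build customer rows that several applications can share."""
    return [
        {
            "customer_id": i,
            "name": fake.name(),
            "email": fake.email(),
            "address": fake.address().replace("\n", ", "),
        }
        for i in range(n)
    ]

# Each application's loader consumes the same list, so integration tests
# across applications see consistent keys in every repository.
customers = make_customers(1000)
```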
Insight Platforms Accelerate Digital Transformation (MapR Technologies)
Many organizations have invested in big data technologies such as Hadoop and Spark. But these investments only address how to gain deeper insights from more diverse data. They do not address how to create action from those insights.
Forrester has identified an emerging class of software—insight platforms—that combine data, analytics, and insight execution to drive action using a big data fabric.
In this presentation, our guest, Forrester Research VP and Principal Analyst, Brian Hopkins, will:
o Present Forrester's recent research on insight platforms and big data fabrics.
o Provide strategies for getting more value from your big data investments.
MapR will share:
o Examples of leading companies and best practices for creating modern applications.
o How to combine analytics and operations to accelerate digital transformation and create competitive advantage.
451 Research is a leading IT research and advisory company founded in 2000 with over 250 employees including over 100 analysts. It provides research and data through fifteen channels to over 1,000 clients on technology and service providers. The document discusses the evolution of the meaning of "Hadoop" from referring originally to specific Apache projects like HDFS and MapReduce to becoming a catch-all term for the distributed data processing ecosystem, and how different Hadoop distributions combine various related Apache projects in their offerings. It also examines how data platforms are converging, with various databases, analytics engines, and streaming platforms increasingly supporting common workloads and data models.
This workshop will provide a hands-on introduction to basic Machine Learning techniques with Apache Spark ML using the cloud.
Format: A short introductory lecture on selected supervised and unsupervised Machine Learning techniques, followed by a demo, lab exercises, and a Q&A session, with lab time afterwards to work through the exercises and ask questions.
Objective: To provide a quick and short hands-on introduction to Machine Learning with Spark ML. In the lab, you will use the following components: Apache Zeppelin (a “Modern Data Science Toolbox”) and Apache Spark. You will learn how to analyze the data, structure the data, train Machine Learning models and apply them to answer real-world questions.
Pre-requisites: Registrants must bring a laptop that can run the Hortonworks Data Cloud.
At this Crash Course everyone will have a cluster assigned to them to try several workloads using Machine Learning, Spark and Zeppelin on the cloud.
Speakers: Robert Hryniewicz
MapR on Azure: Getting Value from Big Data in the Cloud (MapR Technologies)
Public cloud adoption is exploding and big data technologies are rapidly becoming an important driver of this growth. According to Wikibon, big data public cloud revenue will grow from 4.4% in 2016 to 24% of all big data spend by 2026. Digital transformation initiatives are now a priority for most organizations, with data and advanced analytics at the heart of enabling this change. This is key to driving competitive advantage in every industry.
There is nothing better than a real-world customer use case to help you understand how to get value from big data in the cloud and apply the learnings to your business. Join Microsoft, MapR, and Sullexis on November 10th to:
Hear from Sullexis on the business use case and technical implementation details of one of their oil & gas customers
Understand the integration points of the MapR Platform with other Azure services and why they matter
Know how to deploy the MapR Platform on the Azure cloud and get started easily
You will also get to hear about customer use cases of the MapR Converged Data Platform on Azure in other verticals such as real estate and retail.
Speakers
Rafael Godinho
Technical Evangelist
Microsoft Azure
Tim Morgan
Managing Director
Sullexis
Common and unique use cases for Apache Hadoop (Brock Noland)
The document provides an overview of Apache Hadoop and common use cases. It describes how Hadoop is well-suited for log processing due to its ability to handle large amounts of data in parallel across commodity hardware. Specifically, it allows processing of log files to be distributed per unit of data, avoiding bottlenecks that can occur when trying to process a single large file sequentially.
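A minimal sketch of that pattern, assuming the mrjob library and a common access-log layout (the summary names no specific framework): each mapper processes its own slice of the input, so no single reader becomes a bottleneck.

```python
from mrjob.job import MRJob

class StatusCodeCount(MRJob):
    """Count HTTP status codes across arbitrarily large log files."""

    def mapper(self, _, line):
        # Hypothetical access-log layout: the status code is field 9.
        fields = line.split()
        if len(fields) > 8:
            yield fields[8], 1

    def reducer(self, status, counts):
        yield status, sum(counts)

if __name__ == "__main__":
    # e.g. python status_count.py -r hadoop hdfs:///logs/
    StatusCodeCount.run()
```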
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic (DataWorks Summit)
The document summarizes Mayo Clinic's implementation of a big data platform to process and analyze large volumes of daily healthcare data, including HL7 messages, for enterprise-wide clinical and non-clinical usage. The platform, built on Hadoop and using technologies like Storm and Elasticsearch, reliably handles 20-50 times more data than their current daily volumes. It provides ultra-fast free text search capabilities. The system supports applications like processing data for colorectal surgery, exceeding requirements and outperforming previous RDBMS-only systems. Ongoing work involves further enhancing capabilities and integrating with additional components as part of a unified data platform.
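As a hedged sketch of one step in such a pipeline, the snippet below parses an inbound HL7 message and indexes it for free-text search. The python-hl7 and elasticsearch client packages stand in for the Storm/Elasticsearch stack named above; the index name and field positions are illustrative assumptions.

```python
import hl7
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed local test node

def index_hl7_message(raw_message: str) -> None:
    """Parse one HL7 message and make it searchable."""
    parsed = hl7.parse(raw_message)
    doc = {
        "message_type": str(parsed.segment("MSH")[9]),  # MSH-9, assumed
        "raw": raw_message,
    }
    es.index(index="hl7-messages", body=doc)
```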
Real World Use Cases: Hadoop and NoSQL in Production (Codemotion)
"Real World Use Cases: Hadoop and NoSQL in Production" by Tugdual Grall.
What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real-world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency, and biometric databases are some of the examples covered in this presentation.
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi (Felicia Haggarty)
The document discusses challenges with building operational data applications on Hadoop and introduces the Cask Data Application Platform (CDAP) as a solution. It provides an agenda that covers data applications, challenges, CDAP motivation and goals, use cases, and an introduction and architecture overview of CDAP. The document aims to demonstrate how CDAP provides a unified platform that simplifies application development and lifecycle while supporting reusable data and processing patterns.
Ted Dunning is the Chief Applications Architect at MapR Technologies and a committer for Apache Drill, Zookeeper, and other projects. The document discusses goals around real-time or near-time processing and microservices. It describes how to design microservices for isolation using self-describing data, private databases, and shared storage only where necessary. Various scenarios involving fraud detection, IoT data aggregation, and global data recovery are presented. Lessons focus on decoupling services, propagating events rather than table updates, and how data architecture should reflect business structure.
The document discusses how big data has enabled new opportunities by changing scaling laws and problem landscapes. Specifically, linearly scaling costs with big data now make it feasible to process large amounts of data, opening up many problems that were previously impossible or too difficult. This has created many "green field" opportunities where simple approaches can solve important problems. Two examples discussed are using log analysis to detect security threats and using transaction histories to find a common point of compromise for a data breach.
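The common-point-of-compromise search reduces to counting merchant overlap across the compromised cards. A toy version with an assumed transaction layout:

```python
from collections import Counter

def common_point_of_compromise(transactions, compromised_cards):
    """transactions: iterable of (card_id, merchant_id, timestamp) tuples."""
    merchant_hits = Counter()
    for card in compromised_cards:
        merchants = {m for c, m, _ in transactions if c == card}
        merchant_hits.update(merchants)  # count each merchant once per card
    # A merchant seen by (nearly) all compromised cards is the best suspect.
    return merchant_hits.most_common(5)
```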
This document discusses scaling machine learning models from a laboratory setting to production. It proposes using a standardized representation called PMML to capture models produced by R and Scikit-Learn. PMML allows models to be deployed across different frameworks and languages. The document outlines APIs for evaluating, maintaining, and integrating models as reusable functions within data pipelines in Hadoop ecosystems like Spark, Pig, and Cascading. The goal is a portable, platform-agnostic architecture for operationalizing machine learning based on open standards.
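A minimal sketch of the Scikit-Learn side of that hand-off, using the sklearn2pmml package as one common exporter (the document itself does not prescribe a specific tool):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_iris(return_X_y=True)
pipeline = PMMLPipeline([("classifier", DecisionTreeClassifier())])
pipeline.fit(X, y)

# The resulting file can be loaded by any PMML-aware scoring engine,
# e.g. a JVM evaluator embedded in a Spark, Pig, or Cascading pipeline.
sklearn2pmml(pipeline, "iris_tree.pmml")
```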
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa... (Data Con LA)
Today’s Software Defined environments attempt to remove the weaknesses of computing hardware from the operational equation. There is no doubt that this is a natural progression away from overpriced, proprietary compute and storage layers. However, even at the heart of any Software Defined universe is an underlying hardware stack that must be robust, reliable, and cost-effective. Our 20+ years of experience delivering over 2000 clusters and clouds has taught us how to properly design and engineer the right hardware solution for Big Data, cluster, and cloud environments. This presentation shares that knowledge, allowing users to make better design decisions for any deployment.
Using Hadoop to Offload Data Warehouse Processing and More - Brad Anserson (MapR Technologies)
The document discusses using Hadoop to optimize an enterprise data warehouse. It describes offloading some ETL and long-term storage tasks to Hadoop which provides significant cost savings over a traditional data warehouse. The hybrid solution leverages both Hadoop and the data warehouse for optimized querying, presentation and analytics. Examples are provided of real-time and operational applications that can be built using Hadoop technologies.
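A hedged PySpark sketch of that offload pattern: the heavy nightly aggregation runs on the cluster, and only the small rollup goes back to the warehouse. Paths and column names here are invented.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dw-offload").getOrCreate()

raw = spark.read.parquet("hdfs:///landing/sales/2016-11-10/")
daily = (raw.groupBy("store_id", "product_id")
            .agg(F.sum("amount").alias("revenue"),
                 F.count("*").alias("transactions")))

# Long-term detail stays cheap on HDFS; the warehouse only ingests the rollup.
daily.write.mode("overwrite").parquet("hdfs:///curated/sales_daily/")
```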
The document discusses MapR Streams, a global publish/subscribe event streaming system. It provides converged, continuous, and global capabilities. MapR Streams allows producers to publish billions of messages per second to topics, and guarantees immediate and reliable delivery to consumers. It also enables tying together geo-dispersed clusters globally. The document demonstrates MapR Streams capabilities with a live demo and discusses use cases for event streaming across various industries.
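Because MapR Streams exposes a Kafka-compatible API, a producer sketch looks like ordinary Kafka code. The snippet below uses the generic kafka-python client against an assumed endpoint; with MapR's native client the topic would be addressed by a path such as /telemetry:events. All names are illustrative.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("events", {"sensor": "pump-7", "temp_c": 81.4})
producer.flush()  # block until the broker has acknowledged delivery
```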
Performance and Scale Options for R with Hadoop: A comparison of potential ar... (Revolution Analytics)
R and Hadoop go together. In fact, they go together so well that the number of options available can be confusing to IT and data science teams seeking solutions under varying performance and operational requirements.
Which configuration is faster for big files? Which is faster for sharing data and servers among groups? Which eliminates data movement? Which is easiest to manage? Which works best with iterative and multistep algorithms? What are the hardware requirements of each alternative?
This webinar is intended to help new users of R with Hadoop select their best architecture for integrating Hadoop and R, by explaining the benefits of several popular configurations, their performance potential, workload handling and programming model and administrative characteristics.
Presenters from Revolution Analytics will describe the options for using Revolution R Open and Revolution R Enterprise with Hadoop, including servers, edge nodes, rHadoop, and ScaleR. We’ll then compare the characteristics of each configuration: performance, programming model, administration, data movement, ease of scaling, mixed-workload handling, and performance for large individual analyses versus mixed workloads.
Real life use cases from across Europe (Walid Aoudi - Cognizant)
This presentation covers the return on experience of several Cognizant Big Data clients in continental Europe and the UK. The main focus is on use cases, presented through the business drivers behind these projects. Key highlights of the big data architectures and solution approaches will be presented. Finally, the business outcomes, in terms of the ROI delivered by the implemented solutions, will be discussed.
The document discusses how machine data from various sources such as IoT devices, industrial systems, mobile devices, and other systems can be collected and analyzed using Splunk software. Splunk provides capabilities for data ingestion, indexing, searching, analyzing, and visualizing large amounts of machine data. It also discusses how Splunk has been used by companies in various industries to gain insights from their machine data to improve operations, security, customer experience, and business outcomes. Specific use cases highlighted include predictive maintenance, anomaly detection, supply chain optimization, and understanding customer behavior.
Evolving Beyond the Data Lake: A Story of Wind and Rain (MapR Technologies)
This document discusses how companies are increasingly investing in next-generation technologies like big data, cloud computing, and software/hardware related to these areas. It notes that 90% of data will be on next-gen technologies within four years. It then discusses how a converged data platform can help organizations gain insights from both historical and real-time data through applications that combine operational and analytical uses. Key benefits include the ability to seamlessly access and analyze both types of data.
The document discusses resource tracking for Hadoop and Storm clusters at Yahoo. It describes how Yahoo developed tools over three years to track resource usage at the application, cluster, queue, user and project levels. This includes capturing CPU and memory usage for Hadoop YARN applications and Storm topologies. The data is stored and made available through dashboards and APIs. Yahoo also calculates total cost of ownership for Hadoop and converts resource usage to estimated monthly costs for projects. This visibility into usage and costs helps with capacity planning, operational efficiency, and ensuring fairness across grid users.
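The usage-to-cost conversion is plain arithmetic once per-unit rates are fixed. A sketch with invented rates (not Yahoo's actual TCO figures):

```python
VCORE_HOUR_COST = 0.02  # assumed $/vcore-hour
GB_HOUR_COST = 0.005    # assumed $/GB-hour of memory

def monthly_cost(vcore_hours: float, memory_gb_hours: float) -> float:
    """Estimate a project's monthly charge from aggregated YARN usage."""
    return vcore_hours * VCORE_HOUR_COST + memory_gb_hours * GB_HOUR_COST

# A project that consumed 120,000 vcore-hours and 480,000 GB-hours:
print(f"${monthly_cost(120_000, 480_000):,.2f}")  # -> $4,800.00
```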
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS), by Abdelkrim Hadjidj
The architecture of modern enterprise data lakes is based on multiple Hadoop clusters. Several clusters are used to separate environments such as dev, test, production, and DR. In some organizations, each business line has its own dedicated cluster to comply with legal or internal constraints; this is often the case in financial services. Hybrid cloud deployment is another example of multi-cluster deployment that we see in several verticals, such as manufacturing. In this presentation, we will introduce DataPlane Service (DPS), a global data management platform that enables organizations to operate, secure, and govern multiple clusters from a single pane of glass. We will show how DPS is a foundation for any service that needs to operate on multiple clusters. Finally, we will present three services built on top of DPS, focusing on Data Replication and Disaster Recovery.
Predicting failure in power networks, detecting fraudulent activities in payment card transactions, and identifying next logical products targeted at the right customer at the right time all require machine learning around massive data sets. This form of artificial intelligence requires complex self-learning algorithms, rapid data iteration for advanced analytics and a robust big data architecture that’s up to the task.
Learn how you can quickly exploit your existing IT infrastructure and scale operations in line with your budget to enjoy advanced data modeling, without having to invest in a large data science team.
Rigorous and Multi-tenant HBase Performance (Cloudera, Inc.)
The document discusses techniques for rigorously measuring Apache HBase performance in both standalone and multi-tenant environments. It introduces the Yahoo! Cloud Serving Benchmark (YCSB) and best practices for cluster setup, workload generation, data loading, and measurement. These include pre-splitting tables, warming caches, setting target throughput, and using appropriate workload distributions. The document also covers challenges in achieving good multi-tenant performance across HBase, MapReduce and Apache Solr.
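A sketch of driving YCSB along those lines: a load phase against a pre-split table, then a measured run at a fixed target throughput with a zipfian request distribution. The -P, -p, -threads, and -target flags are standard YCSB options; the hbase10 binding and the numbers are assumptions.

```python
import subprocess

def ycsb(phase, extra=()):
    """Invoke one YCSB phase against the assumed HBase binding."""
    subprocess.run(["bin/ycsb", phase, "hbase10",
                    "-P", "workloads/workloada",
                    "-threads", "40", *extra], check=True)

ycsb("load")                          # bulk-load a pre-split, warmed table
ycsb("run", ["-target", "10000",      # fixed offered throughput (ops/sec)
             "-p", "requestdistribution=zipfian"])
```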
Multi-tier, multi-tenant, multi-problem Kafka (Todd Palino)
At LinkedIn, the Kafka infrastructure is run as a service: the Streaming team develops and deploys Kafka, but is not the producer or consumer of the data that flows through it. With multiple datacenters, and numerous applications sharing these clusters, we have developed an architecture with multiple pipelines and multiple tiers. Most days, this works out well, but it has led to many interesting problems. Over the years we have worked to develop a number of solutions, most of them open source, to make it possible for us to reliably handle over a trillion messages a day.
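A toy rendition of the tiered-pipeline flow: a mirror process consumes from a local-datacenter cluster and republishes into an aggregate tier. LinkedIn's production mirroring uses dedicated tooling; this kafka-python sketch only illustrates the data path, and the endpoints are invented.

```python
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("page-views",
                         bootstrap_servers="local-tier:9092",
                         group_id="aggregate-mirror")
producer = KafkaProducer(bootstrap_servers="aggregate-tier:9092")

for record in consumer:
    # Preserve the key so partitioning stays stable across tiers.
    producer.send("page-views", key=record.key, value=record.value)
```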
Solving Multi-tenancy and G1GC in Apache HBase (HBaseCon)
This document discusses tuning the Garbage First Garbage Collector (G1GC) for HBase clusters. Out of the box, G1GC can hurt performance with long GC pauses. The key tuning parameters are heap size, initiating heap occupancy percentage, Eden size percentage, and HBase memory configuration caps. Tuning involves setting these parameters based on historical maximums for block cache size, memstore size, and static index size, plus a buffer. Tuning Eden size also considers the percentage of time spent in GC and average young GC pause times. Adjustments may be needed over time based on cluster usage. Suboptimal client usage can also impact GC and requires fixing. Monitoring GC metrics helps evaluate tuning effectiveness.
Lily for the Bay Area HBase UG - NYC edition (NGDATA)
The document discusses Lily, an open source content application developed by Outerthought that uses HBase for scalable storage and SOLR for search. It provides a high-level overview of Lily's architecture, which maps content to HBase, indexes it in SOLR, and uses a queue implemented on HBase to connect updates between the systems. Future plans for Lily include a 1.0 release with additional features like user management and a UI framework.
Spark Summit - Watson Analytics for Social Media: From single tenant Hadoop t... (Behar Veliqi)
- WHAT IS WATSON ANALYTICS FOR SOCIAL MEDIA
- PREVIOUS ARCHITECTURE ON HADOOP
- THOUGHT PROCESS TOWARDS MULTITENANCY
- NEW ARCHITECTURE ON TOP OF APACHE SPARK
- LESSONS LEARNED
This document discusses using event streams as the system of record for data, rather than traditional databases. It argues that streams can serve as the single source of truth for data, providing benefits like data lineage, auditing, and integrity. It also describes how healthcare company Liaison uses a streaming platform from MapR to power their data integration platform, gaining the advantages of streams while meeting various compliance requirements.
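The system-of-record claim rests on one property: current state can always be rebuilt by replaying the stream from the start. A minimal, framework-free illustration (event shapes assumed):

```python
def replay(events):
    """Fold an ordered event stream into current account balances."""
    balances = {}
    for event in events:
        delta = event["amount"] if event["type"] == "deposit" else -event["amount"]
        balances[event["account"]] = balances.get(event["account"], 0) + delta
    return balances

events = [
    {"type": "deposit", "account": "a1", "amount": 100},
    {"type": "withdrawal", "account": "a1", "amount": 30},
]
assert replay(events) == {"a1": 70}  # state is derived; the stream is truth
```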
Treasure Data provides a big data analytics platform that runs on Hadoop in the cloud. It aims to simplify big data and make it accessible for more users ("Big Data for the Rest of Us"). Treasure Data collects and stores data from various sources in its cloud-based columnar datastore and allows querying and analysis of data through SQL, REST APIs and other tools. It handles all the operational complexities of Hadoop and provides a simple interface for users.
This document provides an overview of distributed databases and the Yahoo! Cloud Serving Benchmark (YCSB). It discusses NoSQL databases Cassandra and HBase and how YCSB can be used to benchmark their performance. Experiments were conducted on Amazon EC2 using YCSB to load data and run workloads on Cassandra and HBase clusters. The results showed Cassandra had lower latency and higher throughput than HBase. YCSB provides a way to compare the performance of different databases.
Managing multi tenant resource toward Hive 2.0 (Kai Sasaki)
This document discusses Treasure Data's migration architecture for managing resources across multiple clusters when upgrading from Hive 1.x to Hive 2.0. It introduces components like PerfectQueue and Plazma that enable blue-green deployment without downtime. It also describes how automatic testing and validation are done to prevent performance degradation. Resource management is discussed, with resources defined per account across different job queues and Hadoop clusters. Brief performance comparisons show improvements from Hive 2.x features like Tez and vectorization.
At the StampedeCon 2015 Big Data Conference: YARN enables Hadoop to move beyond just pure batch processing. With that multiple workloads and tenants now must be able to share a single infrastructure for data processing. Features of the Capacity Scheduler enable resource sharing among multiple tenants in a fair manner with elastic queues to maximize utilization. This talk will focus on the features of the Capacity Scheduler that enable Multi-Tenancy and how resource sharing can be rebalanced using features like Preemption.
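The elasticity rule is easy to state in code: a queue is guaranteed its configured capacity, may borrow idle capacity up to its max-capacity, and preemption claws borrowed resources back when the owning queue returns. A toy model in percentages of one cluster, with invented numbers:

```python
def allocation(demand, capacity, max_capacity, idle):
    """How much a queue actually receives right now (all values in %)."""
    guaranteed = min(demand, capacity)
    borrowed = min(max(demand - capacity, 0), idle, max_capacity - capacity)
    return guaranteed + borrowed

# Queue A (capacity 40%, max-capacity 80%) wants 70% while queue B is idle:
print(allocation(demand=70, capacity=40, max_capacity=80, idle=60))  # 70
# When B wakes up, preemption pushes A back toward its 40% guarantee.
```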
Hortonworks Technical Workshop: What's New in HDP 2.3 (Hortonworks)
Hortonworks Data Platform (HDP) 2.3 includes several new capabilities:
1) It improves the user experience with more guided configuration, customizable dashboards, and improved workload management.
2) It enhances security with new data encryption at rest and extends data governance.
3) It adds proactive cluster monitoring through Hortonworks SmartSense to enhance support.
This document discusses strategies for filling a data lake by improving the process of data onboarding. It advocates using a template-based approach to streamline data ingestion from various sources and reduce dependence on hardcoded procedures. The key aspects are managing ELT templates and metadata through automated metadata extraction. This allows generating integration jobs dynamically based on metadata passed at runtime, providing flexibility to handle different source data with one template. It emphasizes reducing the risks associated with large data onboarding projects by maintaining a standardized and organized data lake.
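A minimal sketch of the template idea: one generic ingestion job parameterized entirely by metadata, instead of a hardcoded procedure per source. The metadata fields and SQL shape are assumptions:

```python
INGEST_TEMPLATE = (
    "INSERT INTO {lake_schema}.{target_table} "
    "SELECT {columns} FROM staging.{source_table}"
)

def render_ingest_job(meta):
    """Generate an ELT statement from source metadata at runtime."""
    return INGEST_TEMPLATE.format(
        lake_schema=meta["lake_schema"],
        target_table=meta["target_table"],
        columns=", ".join(meta["columns"]),
        source_table=meta["source_table"],
    )

meta = {"lake_schema": "raw", "target_table": "orders",
        "source_table": "erp_orders", "columns": ["id", "sku", "qty"]}
print(render_ingest_job(meta))
# INSERT INTO raw.orders SELECT id, sku, qty FROM staging.erp_orders
```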
Multi-Tenant Operations with Cloudera 5.7 & BT (Cloudera, Inc.)
One benefit of Apache Hadoop is the ability to power multiple workloads, across many different users and departments, all within a single, shared cluster. Hear how BT is doing this today and learn about new features in Cloudera Manager to provide better visibility for multi-tenant operations.
Hortonworks Technical Workshop - Operational Best Practices Workshop (Hortonworks)
Hortonworks Data Platform is a key component of a Modern Data Architecture. Organizations rely on HDP for mission-critical business functions and expect the system to be constantly available and performant. In this session we will cover operational best practices for administering the Hortonworks Data Platform, including initial setup and ongoing maintenance.
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse (Josh Elser)
An overview of Apache Phoenix and Apache HBase from the angle of a traditional data warehousing solution. This talk focuses on where this open-source architecture fits into the market, outlines the features and integrations of the product, and shows that it is a viable alternative to traditional data warehousing solutions.
Presented by Mark Miller, Software Developer, Cloudera
Apache Lucene/Solr committer Mark Miller talks about how Solr has been integrated into the Hadoop ecosystem to provide full text search at "Big Data" scale. This talk will give an overview of how Cloudera has tackled integrating Solr into the Hadoop ecosystem and highlights some of the design decisions and future plans. Learn how Solr is getting 'cozy' with Hadoop, which contributions are going to what project, and how you can take advantage of these integrations to use Solr efficiently at "Big Data" scale. Learn how you can run Solr directly on HDFS, build indexes with Map/Reduce, load Solr via Flume in 'Near Realtime' and much more.
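For a taste of the application side, here is a hedged pysolr snippet; the talk concerns the Hadoop-side integration, and with indexes stored on HDFS the client-facing API below is unchanged. The URL and schema are invented.

```python
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/logs", timeout=10)

solr.add([{"id": "doc-1", "body": "checksum error on datanode 17"}])
solr.commit()

for hit in solr.search("body:checksum"):
    print(hit["id"])
```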
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity (HBaseCon)
Speakers: Dheeraj Kapur, Rajiv Chittajallu & Anish Mathew (Yahoo!)
In early 2013, Yahoo! introduced multi-tenancy to HBase to offer it as a platform service for all Hadoop users. A certain degree of customization per tenant (a user or a project) was achieved through RegionServer groups, namespaces, and customized configs for each tenant. This talk covers how to accommodate diverse needs to individual tenants on the cluster, as well as operational tips and techniques that allow Yahoo! to automate the management of multi-tenant clusters at petabyte scale without errors.
The NameNode was experiencing high load and instability after being restarted. Graphs showed unknown high load between checkpoints on the NameNode. DataNode logs showed repeated 60000 millisecond timeouts in communication with the NameNode. Thread dumps revealed NameNode server handlers waiting on the same lock, indicating a bottleneck. Source code analysis pointed to repeated block reports from DataNodes to the NameNode as the likely cause of the high load.
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database (Edureka!)
NoSQL encompasses a wide range of database technologies that were developed in response to the surging volume of stored data. Relational databases are not capable of coping with this huge volume and face agility challenges. This is where NoSQL databases come into play, and they are popular because of their features. The session covers the following topics to help you choose the right NoSQL database:
Traditional databases
Challenges with traditional databases
CAP Theorem
NoSQL to the rescue
A BASE system
Choose the right NoSQL database
Accelerating Innovation with Hybrid Cloud (Jeff Jakubiak)
1) The document discusses IBM's hybrid cloud portfolio and how it can help organizations accelerate innovation through hybrid cloud.
2) IBM's hybrid cloud portfolio spans infrastructure, platform and application services across public, private and dedicated cloud environments to provide flexibility.
3) Key benefits highlighted include accelerating digital transformation, increasing operational speed and flexibility, and unlocking existing data and applications through hybrid integration.
Mainframe IMS assets can be integrated into the API economy, the hybrid cloud, and virtual data lake repositories. Just imagine any business use case and ask for guidance.
Cloud computing is a style of computing in which scalable and elastic IT capabilities are delivered as a service using internet technologies. Key aspects include resources that are scalable and metered by use, and may be single-tenant or multi-tenant and hosted remotely or on-premises. Self-service interfaces like web UIs and APIs are exposed directly to customers. Cloud services can provide software, platforms, infrastructure, integration capabilities, and everything as a service (XaaS). The major cloud providers offer various capabilities and are competing on features, price, and services beyond basic compute and storage.
Moving from the Web era to PaaS requires careful planning. This presentation simplifies the process by outlining 7 basic steps an enterprise has to consider as it moves to PaaS.
Monitor your applications and get into a framework of proactive application fixing instead of reactive firefighting. And with IBM, reduce your outages with the help of predictive insights.
The intersection of Traditional IT and New-Generation IT (Kangaroot)
Keynote from Franz Meyer - VP, EMEA Strategic Business Development Red Hat about "The intersection of Traditional IT and New-Generation IT : the Red Hat Open Hybrid Journey". This presentation was given during the Open Source Cloud Day of Kangaroot & Red Hat.
This document discusses cloud computing and how it can empower development teams. It begins with an overview of cloud service models like Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). It then discusses how the cloud can increase agility for development teams through virtualization, standardization, and automation. The document also provides examples of how tools like IBM Eclipse Tools for Bluemix and containers can help development teams deploy applications to the cloud more efficiently.
High Value Business Intelligence for IBM Platform compute environments (Gabor Samu)
IBM Platform Analytics is an advanced analysis and visualization tool for analyzing workload data from IBM Platform LSF and IBM Platform Symphony clusters. It allows organizations to correlate workload, resource and license data from multiple clusters for data-driven decision making.
The document discusses major app development trends for 2014, including mobile, social, cloud, and big data. It notes that mobile apps and advertising revenues are rapidly increasing. Popular mobile platforms include Android and iOS, while hybrid and cross-platform SDKs are gaining popularity for app development. Social media integration is also a significant trend, with apps adopting more social features. Cloud computing, especially personal cloud storage and Infrastructure as a Service (IaaS), is seeing greater adoption. Finally, big data solutions around Hadoop, NoSQL databases, and data analysis are increasingly important.
The document provides an overview of Software as a Service (SaaS) including:
- SaaS is a software delivery model that provides remote access to software via the web for a recurring fee, enabling users to access functionality hosted by the provider.
- SaaS is a subset of cloud computing where resources are provided as a service over the internet.
- Major benefits of SaaS include lower costs, quick access to updates, and reduced need for infrastructure management.
Migration to cloud is no easy task. Start small and learn the core technologies before leveraging the advanced features of the cloud. The cultural change will affect the whole organization from development to business management and sales.
Cloud native applications are the future of software. Modern software is stateless, provided from cloud to heterogeneous clients on demand and designed to be scalable and resilient.
Accelerating the Path to Digital with a Cloud Data Strategy (MongoDB)
This document discusses accelerating digital transformation through a cloud data strategy using MongoDB.
It begins by outlining MongoDB's capabilities as a cloud data platform, including its use by over 3000 enterprises. The document then discusses how time to market has replaced cost as the primary driver for cloud adoption. It also outlines considerations for choosing a cloud data platform like deployment flexibility, reducing complexity, agility, resiliency, scalability, cost, and security.
The document then provides an overview of MongoDB's cloud offerings, including MongoDB Atlas on public clouds, MongoDB Ops Manager for private clouds, and MongoDB Stitch for backend services. It also discusses best practices for replatforming applications from relational databases to MongoDB in the cloud.
Towards Application Portability in Platform as a Service (Stefan Kolb)
Get the book "On the Portability of Applications in Platform as a Service" at https://www.amazon.de/dp/3863096312
Presentation from IEEE SOSE 2014. Full paper at http://bit.ly/paaspaper
1. The document discusses DevOps and hybrid cloud, with DevOps being an approach combining culture, processes, and technologies to continuously deliver applications and innovation.
2. APIs are key to hybrid cloud and DevOps, allowing components and services to be developed and reused across teams and cloud environments.
3. IBM recommends organizations build a common toolchain including tools for development, testing, deployment, and monitoring to facilitate DevOps practices and hybrid cloud deployments.
The document discusses cloud adoption patterns for integrating cloud solutions into an organization's IT strategy. It outlines different types of application migrations like lift-and-shift, cloud tuning, and cloud-centric design. It also covers design principles for cloud-native applications like microservices and stateless runtimes. Various DevOps patterns are presented, such as continuous integration/delivery pipelines, functional testing, and log aggregation. The goal is to provide guidance on architectural approaches and best practices for developing and deploying applications in the cloud.
Originally Published on Sep 23, 2014
IBM InfoSphere BigInsights, an enterprise-ready distribution of Hadoop, is designed to address the challenges of big data and modern IT by analyzing larger volumes of data more cost-effectively. Deployed on the cloud, it enables rapid deployment of clusters and real-time analytics.
FYI: The value of Hadoop and many more questions will be pondered at this year’s Strata/Hadoop World event in NYC (October 15-17, 2014) and certainly at IBM Insight (October 26-30, 2014).
This document discusses cloud adoption patterns to help organizations integrate cloud solutions into their IT strategies. It introduces the concept of patterns and pattern languages as solutions to problems in context. The document outlines categories of cloud adoption patterns and provides examples of patterns for application architecture, deployment styles, data caching, and more. It also discusses considerations for migrating applications to the cloud through lift and shift, cloud tuning, or cloud-centric redesign. The goal is to provide guidance to organizations on evaluating workloads and adopting cloud technologies.
Build end-to-end solutions with Bluemix, Avi Vizel & Ziv Dai, IBM (Codemotion Tel Aviv)
The document discusses IBM's cloud platform Bluemix. It provides an overview of Bluemix, describing it as an open platform for developing and hosting applications that simplifies tasks associated with managing infrastructure at internet scale. Bluemix is built on IBM's Cloud Operating Environment architecture using Cloud Foundry as an open source PaaS. It enables developers to rapidly build, deploy, and manage cloud applications while tapping into available services and runtimes provided by IBM and other ecosystem partners. The document outlines some key Bluemix concepts and components such as applications, services, organizations/spaces, and buildpacks.
Learn about IBM's Hadoop offering called BigInsights. We will look at the new features in version 4 (including a discussion on the Open Data Platform), review a couple of customer examples, talk about the overall offering and differentiators, and then provide a brief demonstration on how to get started quickly by creating a new cloud instance, uploading data, and generating a visualization using the built-in spreadsheet tooling called BigSheets.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe (Paige Cruz)
Monitoring and observability aren't traditionally found in software curriculums, so many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company's observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share the foundational concepts to build on.
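As a starting point, here is a minimal, vendor-neutral sketch of two of those foundational signals, structured logs and a latency metric, using only the Python standard library; a real stack would emit these through an SDK such as OpenTelemetry:

```python
# Vendor-neutral sketch: a structured log line carrying a trace id (for
# correlation) and a request latency measurement (a basic metric).
import json, logging, time, uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

def handle_request():
    trace_id = str(uuid.uuid4())      # correlates all events for one request
    start = time.perf_counter()
    # ... business logic would run here ...
    latency_ms = (time.perf_counter() - start) * 1000
    log.info(json.dumps({
        "event": "request_handled",
        "trace_id": trace_id,
        "latency_ms": round(latency_ms, 2),
    }))

handle_request()
```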
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
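For a sense of scale, a back-of-envelope calculation (not Adtran's actual method, which must also account for drift, aging, and temperature) shows how tight the oscillator requirement is:

```python
# Back-of-envelope only: how small a constant fractional frequency offset
# must be to accumulate <= 100 ns of time error over 100 days of holdover,
# ignoring drift/aging and temperature effects.
ERROR_BUDGET_S = 100e-9          # 100 ns
HOLDOVER_S = 100 * 86400         # 100 days in seconds

max_fractional_offset = ERROR_BUDGET_S / HOLDOVER_S
print(f"{max_fractional_offset:.2e}")  # ~1.16e-14
```

Even under these idealized assumptions, the frequency must hold to roughly one part in 1e14, which is why parametric holdover techniques are needed.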
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
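One common shape for this, sketched below under an assumed airport schema (the Restaurant and Terminal nodes, credentials, and query are hypothetical, not CAG's actual model), is to answer from facts retrieved by a Cypher query rather than from the model's parametric memory:

```python
# Sketch of graph-grounded retrieval: facts come from the graph, not the
# LLM's memory, which is one way to curb hallucinated search results.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

def grounded_facts(cuisine: str) -> list[str]:
    query = (
        "MATCH (r:Restaurant {cuisine: $cuisine})-[:LOCATED_IN]->(t:Terminal) "
        "RETURN r.name AS name, t.name AS terminal"
    )
    with driver.session() as session:
        return [f"{rec['name']} ({rec['terminal']})"
                for rec in session.run(query, cuisine=cuisine)]

# The retrieved facts would then be placed into the LLM prompt as context.
print(grounded_facts("Japanese"))
```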
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... (Neo4j)
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver the security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
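As a minimal illustration of the last point, here is a sketch of an automated policy gate over a vulnerability report; the JSON shape is a simplified stand-in, not Anchore's actual report format:

```python
# Simplified policy gate: fail the CI stage if the report contains any
# finding at a blocking severity. Report format is a hypothetical stand-in.
import json, sys

BLOCKING = {"Critical", "High"}

def gate(report_path: str) -> int:
    with open(report_path) as f:
        findings = json.load(f)["findings"]
    blockers = [v for v in findings if v["severity"] in BLOCKING]
    for v in blockers:
        print(f"BLOCK: {v['id']} ({v['severity']}) in {v['package']}")
    return 1 if blockers else 0  # nonzero exit fails the pipeline stage

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```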
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communications Mining, why it matters, and an overview of the platform. You will acquire a good understanding of the phases in Communications Mining as we walk through the platform together. Topics covered:
• Communication Mining Overview
• Why is it important?
• How it can help today's businesses, and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
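As a conceptual sketch only, not Microsoft's actual protocol, the core idea behind Double Key Encryption can be simulated by layering two symmetric keys, so that ciphertext is unreadable unless both key holders cooperate:

```python
# Conceptual illustration only (NOT Microsoft's implementation): data is
# protected under two keys held by different parties; neither party alone
# can recover the plaintext.
from cryptography.fernet import Fernet

customer_key = Fernet(Fernet.generate_key())   # held only by the customer
service_key = Fernet(Fernet.generate_key())    # held by the service provider

ciphertext = service_key.encrypt(customer_key.encrypt(b"sensitive document"))

# Decryption requires both keys, applied in reverse order.
plaintext = customer_key.decrypt(service_key.decrypt(ciphertext))
assert plaintext == b"sensitive document"
```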
Threats to mobile devices are increasingly prevalent and growing in scope and complexity. Users want to take full advantage of the features
available on their devices, but many of those features trade security for convenience and capability. This best practices guide outlines steps users can take to better protect their personal devices and information.
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover test automation with generative AI and OpenAI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into integrating generative AI with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics include the integration process, practical use cases, and the benefits of AI-driven automation for UiPath testing initiatives. Testers and automation professionals who attend will gain valuable insights into harnessing AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding of how this integration enhances test automation within the UiPath platform.
3. Practical demonstrations.
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath.
Topics covered:
What is generative AI?
Test automation with generative AI and OpenAI
UiPath integration with generative AI
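To ground the idea, here is an illustrative sketch (not UiPath's actual integration; the model name and prompt are assumptions) of the core pattern: asking an LLM to draft test cases from a requirement, for a tester to review before automating:

```python
# Illustrative only: an LLM drafts candidate test cases from a requirement.
# A tester reviews the output before turning it into automated tests.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

requirement = "The login form locks the account after 5 failed attempts."
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model choice
    messages=[{
        "role": "user",
        "content": f"List 3 concise test cases for this requirement: {requirement}",
    }],
)
print(response.choices[0].message.content)
```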
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GridMate - End to end testing is a critical piece to ensure quality and avoid... (ThomasParaiso2)
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 (Neo4j)
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI (Vladimir Iglovikov, Ph.D.)
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
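The "seamless adoption" point is easy to see in practice: a working augmentation pipeline takes only a few lines (a random array stands in for real image data below):

```python
# Minimal Albumentations quick-start: compose a pipeline, apply it to an image.
import albumentations as A
import numpy as np

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
])

image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
augmented = transform(image=image)["image"]
print(augmented.shape)  # (224, 224, 3)
```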
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster and ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize our carbon footprint, but we can also have a carbon handprint: a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
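As a back-of-envelope sketch of those levers (all numbers are illustrative assumptions, not figures from the talk), the footprint of a CI test environment scales directly with run count, run length, power draw, and grid carbon intensity:

```python
# Illustrative estimate only: every factor here is an assumed value, and each
# one is a lever the talk mentions (fewer runs, smaller/on-demand
# environments, shorter runs).
RUNS_PER_DAY = 200
MINUTES_PER_RUN = 15
AVG_POWER_KW = 0.4            # assumed draw of the test environment
GRID_KG_CO2_PER_KWH = 0.35    # assumed grid carbon intensity

kwh_per_day = RUNS_PER_DAY * (MINUTES_PER_RUN / 60) * AVG_POWER_KW
kg_co2_per_year = kwh_per_day * GRID_KG_CO2_PER_KWH * 365
print(f"{kg_co2_per_year:.0f} kg CO2/year")  # ~2555 kg with these assumptions
```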