Experiences in running Hadoop As A Service 
chaiken@altiscale.com = #HadoopSherpa 
DAVID CHAIKEN • 21 NOVEMBER 2014
Talk Outline 
Altiscale Company Introduction and Perspective 
Altiscale Architecture 
Use Cases: Performance, Job Analysis, Scheduling 
Infinite Hadoop 
Challenges to the Hadoop Community 
Copyright © 2014 Altiscale, Inc.
Corporate Background 
Hadoop-as-a-Service (HaaS) innovator 
Company founded in 2012 (Palo Alto & Chennai) 
Founding team from Yahoo 
• Raymie Stata, CEO, Former CTO 
• David Chaiken, CTO, Former Chief Architect 
• Charles Wimmer, Head of Operations, Former SRE 
Employees from Yahoo, Google, Netflix, LinkedIn, VMware and others
Top-tier investors 
Altiscale Chennai 
Long-term colleagues from Yahoo and before 
IIT Madras Research Park (back gate of IIT-M) 
Architecture, Core Development, Test (Apache Bigtop) 
Control Plane agile development, 2-week sprints 
Next: Test++, Customer Support, Operations 
Everybody Loves Hadoop But… 
Significant capital expenditure on infrastructure
Complex to manage and maintain
Time to get a cluster up and running is long
Capacity planning is difficult
Skillset is difficult to recruit, train and retain
What about the cloud?
True Hadoop-as-a-Service 
Altiscale is the industry’s first purpose-built, petabyte-scale Hadoop cloud
• Altiscale operates Hadoop for you
• Infrastructure optimized to run Hadoop fast and reliably
• Pay for Hadoop service, not infrastructure
We Team With You To Help Deliver Insights 
Customer: potential insights from a flood of data generated by the connected world
+
Altiscale: our Operations Team and Hadoop Cloud help realize those insights
Customers 
How We Do It 
Virtual Hadoop Cluster: Pre-configured Apps (Hive, Pig, Oozie), More Apps, YARN Service, HDFS Service
Data Connect: File Transfer, Kafka, Flume
• We optimize the job to complete fast and cost-effectively
• Our Hadoop Helpdesk gives you access to Hadoop experts
• Your data is migrated to HDFS and a virtual Hadoop cluster in our cloud
• Our Hadoop Operations Team maintains the cluster and plans the job
• Our team monitors and manages the job through to completion
• We provide an uptime SLA so our Hadoop cloud is always available
Altiscale Architecture: Data and Control Planes 
Altiscale Architecture: Data and Control Planes 
Altiscale Architecture: Customer Environments 
Altiscale Architecture: O&O Hadoop Cluster 
Altiscale Architecture: Host Components 
Altiscale Architecture: Workbenches 
Altiscale Architecture: Data Transfer 
Altiscale Architecture: Portal and REST API 
Altiscale Architecture: Control Plane Databases 
Altiscale Architecture: Control Plane Services 
Altiscale Architecture: Hadoop-Based Analysis 
Hadoop as a Service Offering 
1. Data is migrated to our HDFS service
2. Terminal access to Hadoop cluster and associated apps
3. Portal provides job status, billing and support information
HDFS Service, Data Connectors
Foundry Apps: Apache Mahout, Cascading, Revolution R, Kafka/Camus, Avro, Pentaho Kettle, Matlab, Spark, Sqoop, H2O
Core Apps: Apache Hive, Apache Pig, Apache Oozie, Apache HCatalog, Apache Flume, R, JDK/JRE, Python, HttpFS, FUSE, LZOP, Snappy, gzip
Challenges… 
Performance Challenges… 
Disks: Configuration, Controllers, Density, Cost 
Network: Jumbo Packet MTU 
Memory: disable transparent huge pages (path shown is for RHEL 6 kernels):
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
Network: When does locality matter? 
Flash: When to use SSD? 
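The memory item above turns off transparent huge pages. The kernel reports the active mode in that control file by wrapping it in brackets, for example `always madvise [never]`. A minimal Python sketch for checking which mode is active follows; the function name `active_thp_mode` is hypothetical and not part of the talk.

```python
import re

def active_thp_mode(sysfs_text):
    """Return the bracketed (active) mode from a THP 'enabled' file, or None."""
    match = re.search(r"\[(\w+)\]", sysfs_text)
    return match.group(1) if match else None

# Sample file contents; on a real host, read the path given on the slide.
print(active_thp_mode("always madvise [never]\n"))
```

On a host, the text would come from reading the control file; kernels other than RHEL 6 may expose it under a different path.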
Customer Case Study: Analyze Query 
Customer provided Hive query + data sets (100GBs to ~5 TBs)
Needed help optimizing the query
Didn’t rewrite query immediately
Wanted to characterize query performance and isolate bottlenecks first
Analyze and Tune Execution 
Ran original query on the datasets in our environment:
• Two M/R Stages: Stage-1, Stage-2
Long-running reducers run out of memory
• set mapreduce.reduce.memory.mb=5120
• Side effect: fewer reduce slots and longer reduce time
Query fails to launch Stage-2 with an out-of-memory error
• set HADOOP_HEAPSIZE=1024 (environment variable, in MB) on client machine
Query has 250,000 Mappers in Stage-2, which causes failure
• set mapred.max.split.size=5368709120 (5 GiB) to reduce Mappers
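To see why a larger maximum split size lowers the mapper count, here is a rough Python sketch. It assumes one map task per maximum-size split, which is only an approximation: the real count also depends on the input format, file boundaries and block size. The function name and the 5 TiB input size are illustrative assumptions.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

def estimate_mappers(total_input_bytes, max_split_bytes):
    """Rough estimate of map tasks: one task per maximum-size split."""
    if max_split_bytes <= 0:
        raise ValueError("max_split_bytes must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (total_input_bytes + max_split_bytes - 1) // max_split_bytes

# The value used on the slide is exactly 5 GiB.
slide_split_size = 5368709120
print(slide_split_size == 5 * GIB)

# Hypothetical 5 TiB input with the slide's split size.
print(estimate_mappers(5 * TIB, slide_split_size))
```

Under these assumptions the estimate is on the order of a thousand mappers, far below the 250,000 that caused the failure.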
Analysis: Job Execution Characteristics 
Next challenge - how to visualize job execution? 
Existing hadoop/hive logs not sufficient for this task 
Wrote internal tools 
• parse job history files 
• plot mapper and reducer execution
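The internal tools are not shown in the talk. As a minimal sketch of the plotting step, the Python below assumes task records have already been extracted from the job history files as `(task_id, start_seconds, end_seconds)` tuples with integer times. That tuple layout and the function name are hypothetical, and parsing is omitted because the history file format varies by Hadoop version.

```python
def concurrency_profile(tasks, bucket_seconds):
    """Count how many tasks overlap each time bucket.

    tasks: iterable of (task_id, start_seconds, end_seconds), integer times.
    """
    tasks = list(tasks)
    if not tasks:
        return []
    origin = min(start for _, start, _ in tasks)
    horizon = max(end for _, _, end in tasks)
    # Ceiling division, with at least one bucket.
    n_buckets = max(1, -(-(horizon - origin) // bucket_seconds))
    counts = [0] * n_buckets
    for _, start, end in tasks:
        first = min((start - origin) // bucket_seconds, n_buckets - 1)
        last = max(first, (end - origin - 1) // bucket_seconds)
        for bucket in range(first, last + 1):
            counts[bucket] += 1
    return counts

# Hypothetical records: two mappers followed by one long reducer.
sample = [("m1", 0, 10), ("m2", 0, 20), ("r1", 20, 60)]
for index, count in enumerate(concurrency_profile(sample, 10)):
    print(f"{index * 10:>5}s {'#' * count}")
```

Each output row is one time bucket, so a long run of single `#` marks at the end of a stage points to a straggler.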
Analysis: Map (Stage-1)
Analysis: Reduce (Stage-1) Long Tail 
Single reduce task
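A long tail like this can also be flagged numerically. The Python sketch below marks tasks whose duration exceeds a multiple of the median. The factor of 3 and the sample durations are arbitrary assumptions, not values from the talk.

```python
from statistics import median

def long_tail_tasks(durations, factor=3.0):
    """Return task ids whose duration exceeds factor x the median duration."""
    if not durations:
        return []
    typical = median(durations.values())
    return sorted(task for task, secs in durations.items() if secs > factor * typical)

# Hypothetical reducer durations in seconds.
reducers = {"r0": 60, "r1": 65, "r2": 58, "r3": 900}
print(long_tail_tasks(reducers))
```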
Analysis: Map (Stage-2)
Analysis: Reduce (Stage-2)
Analysis Execution: Findings 
Lone, long running reducer in first stage of query 
Analyzed input data: 
• Query split input data by userId 
• Bucketizing input data by userId 
• One very large bucket: “invalid” userId 
• Discussed “invalid” userId with customer
An error value is a common pattern!
• Need to differentiate between “don’t know and don’t care” and “don’t know and do care.”
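The skew behind the lone reducer can be confirmed by counting rows per key before running the query. Here is a minimal Python sketch over an in-memory sample; the function name, the 50% threshold and the sample keys are assumptions. On a full data set the same check would normally be a grouped count in Hive.

```python
from collections import Counter

def hot_keys(keys, share_threshold=0.5):
    """Return (key, share) pairs for keys holding more than share_threshold of rows."""
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(key, n / total) for key, n in counts.most_common()
            if n / total > share_threshold]

# Hypothetical sample in which an error value dominates the userId column.
sample_user_ids = ["invalid"] * 8 + ["u1", "u2"]
print(hot_keys(sample_user_ids))
```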
Interactive (DRAM-centric) Processing Systems 
Loading data into DRAM makes processing fast! 
Examples: Spark, Impala, 0xdata, …, [SAP HANA], … 
Streaming systems (Storm, DataTorrent) may be similar 
Need to increase YARN container memory size
Hive + Interactive: Watch Out for Container Size 
Caution: larger YARN container settings for interactive jobs may not be right for batch systems like Hive
Container size needs to combine vcores and memory:
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores ...
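The interaction between memory and vcores can be sketched as follows. The Python below assumes the scheduler enforces both resources, which depends on the configured resource calculator. The node and container sizes are hypothetical.

```python
def containers_per_node(node_memory_mb, node_vcores,
                        container_memory_mb, container_vcores):
    """Containers that fit on one node when both memory and vcores are enforced."""
    if container_memory_mb <= 0 or container_vcores <= 0:
        raise ValueError("container size must be positive")
    by_memory = node_memory_mb // container_memory_mb
    by_vcores = node_vcores // container_vcores
    # Whichever resource runs out first sets the limit.
    return min(by_memory, by_vcores)

# Hypothetical node: 64 GB of container memory and 16 vcores.
print(containers_per_node(65536, 16, 2048, 1))    # small batch containers
print(containers_per_node(65536, 16, 16384, 4))   # large interactive containers
```

With these numbers the batch setting is limited by vcores and the interactive setting by both resources equally. This is one reason a single container size rarely suits both workloads.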
Hive + Interactive: Watch Out for Fragmentation 
Scheduling interactive systems alongside batch systems like Hive may result in fragmentation
Interactive systems may require all-or-nothing scheduling
Batch jobs with many small tasks may starve interactive jobs
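Fragmentation here means that the cluster has enough free memory in total, but not in pieces large enough for an all-or-nothing request. Here is a minimal Python sketch under simplifying assumptions (memory only, hypothetical numbers, no preemption):

```python
def gang_fits(free_mb_per_node, container_mb, containers_needed):
    """True if every container of an all-or-nothing request fits right now."""
    slots = sum(free // container_mb for free in free_mb_per_node)
    return slots >= containers_needed

# Hypothetical free memory per node, in MB, after small batch tasks are placed.
free = [3072, 3072, 3072, 3072]
print(sum(free))                   # total free memory across the cluster
print(gang_fits(free, 4096, 3))    # three 4 GB interactive containers
print(gang_fits(free, 1024, 12))   # twelve 1 GB batch containers
```

The cluster has 12 GB free, which matches the 12 GB the interactive request needs, yet the request cannot start because no node has 4 GB free. The small batch containers still fit.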
Hive + Interactive: Watch Out for Fragmentation 
Solutions for fragmentation… 
Reserve interactive nodes before starting batch jobs 
Reduce interactive container size (if the algorithm permits) 
Node labels (YARN-796) and gang scheduling (YARN-624)
Altiscale: Hadoop Storage and Compute 
Altiscale’s point of view on Hadoop as a Service:
• sell HDFS in increments of 10 TB
• sell compute in increments of 10K TaskHours/Month
We market Infinite Hadoop, and provide services so that customers need not worry about cluster nodes.
But Apache Hadoop user interfaces provide a node-oriented view of clusters…
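Selling in fixed increments means a customer's usage is rounded up to whole units. The Python sketch below shows only that rounding; the usage figures are hypothetical and the talk does not describe Altiscale's actual billing rules.

```python
def plan_units(usage, increment):
    """Smallest number of whole increments that covers the usage."""
    if increment <= 0:
        raise ValueError("increment must be positive")
    # Ceiling division on integers.
    return max(0, -(-usage // increment))

# Hypothetical customer: 37 TB in HDFS and 25,000 TaskHours per month.
print(plan_units(37, 10))            # HDFS units of 10 TB
print(plan_units(25_000, 10_000))    # compute units of 10K TaskHours/Month
```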
ResourceManager User Interface 
ResourceManager User Interface 
NameNode User Interface 
NameNode User Interface 
Feedback from Customers 
Storage plan is normally easy to estimate
Compute plan is hard to estimate
• Customer pain point: meeting computation needs sometimes requires more peak compute capacity than the nodes required for storage can provide
• Opportunity: average compute often requires fewer nodes than storage requires
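The gap between storage-driven and compute-driven sizing can be illustrated with a small calculation. Every number below is a hypothetical assumption chosen to show the pattern, not a figure from the talk.

```python
import math

def nodes_for_storage(data_tb, usable_tb_per_node):
    """Nodes needed to hold the data set."""
    return math.ceil(data_tb / usable_tb_per_node)

def nodes_for_compute(concurrent_tasks, tasks_per_node):
    """Nodes needed to run the given number of tasks at once."""
    return math.ceil(concurrent_tasks / tasks_per_node)

storage_nodes = nodes_for_storage(200, 10)    # 200 TB at 10 TB usable per node
peak_nodes = nodes_for_compute(600, 16)       # 600 tasks at peak, 16 per node
average_nodes = nodes_for_compute(120, 16)    # 120 tasks on average

print(storage_nodes, peak_nodes, average_nodes)
```

In this example the peak needs more nodes than storage does, while the average needs fewer. That gap is what makes sharing compute across customers attractive.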
Solution: Change Altiscale’s Product! 
Make “Infinite” computation available to customers
Multitenancy implementation phases, each of which includes a milestone with production deliverables
0. Automation for burn/add/remove nodes
1. Deploy Linux containers using Docker
2. Decouple compute/storage + manual bursting
3. Automation: orchestrate add/remove nodes according to allocation plan from the capacity team
4. Optimized: predictive allocation, economic incentives
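Phase 3 above turns an allocation plan into add and remove operations. The talk does not describe the control plane's interfaces, so the Python below is only a sketch. It assumes the plan and the current state are dictionaries mapping a customer name to a node count, and it lists removals first so that freed nodes can be reused.

```python
def plan_node_changes(current, target):
    """Compare per-customer node counts and return (action, customer, count) steps."""
    steps = []
    for customer in sorted(set(current) | set(target)):
        delta = target.get(customer, 0) - current.get(customer, 0)
        if delta > 0:
            steps.append(("add", customer, delta))
        elif delta < 0:
            steps.append(("remove", customer, -delta))
    # Stable sort: removals first, customer order preserved within each group.
    steps.sort(key=lambda step: step[0] != "remove")
    return steps

# Hypothetical current allocation and plan from the capacity team.
current = {"a": 10, "b": 4}
target = {"a": 6, "b": 7, "c": 2}
for step in plan_node_changes(current, target):
    print(step)
```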
Physical Cluster per Customer 
NM and DN in Docker Containers 
Decouple Compute/Storage 
What Customers Get 
On-demand access to “Infinite” Computation
Ability to handle unexpected needs without contacting Altiscale
“Access to a $10M cluster for just $1M”
Future…
Ability to package Hadoop job environment using Docker (YARN-1964)
Challenges to the Hadoop Community 
Hive + Hadoop debugging can get very complex 
• Sifting through many logs and screens 
• Automatic transmission versus manual transmission 
Static partitioning induced by the Java Virtual Machine has benefits but also creates challenges.
Where there are difficulties, there’s opportunity: 
• Better tooling, instrumentation, integration of logs/metrics 
YARN still evolving into an operating system 
Just starting to build real multitenancy into Hadoop. 
Hadoop as a Service: aggregate and share expertise
Running Hadoop as Service in AltiScale Platform

More Related Content

What's hot

Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
DataWorks Summit
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Cloudera, Inc.
 

What's hot (20)

Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Provisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & AmbariProvisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & Ambari
 
A deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloudA deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloud
 
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
 
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopDiscover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Hadoop crashcourse v3
Hadoop crashcourse v3Hadoop crashcourse v3
Hadoop crashcourse v3
 
Hadoop and Spark – Perfect Together
Hadoop and Spark – Perfect TogetherHadoop and Spark – Perfect Together
Hadoop and Spark – Perfect Together
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud?
 
Cloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsCloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerations
 
Data Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the Enterprise
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
 
Road to Cloudera certification
Road to Cloudera certificationRoad to Cloudera certification
Road to Cloudera certification
 
Enabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduEnabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache Kudu
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Hybrid is the New Normal
Hybrid is the New NormalHybrid is the New Normal
Hybrid is the New Normal
 
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudPart 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
 

Similar to Running Hadoop as Service in AltiScale Platform

The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
Joey Echeverria
 
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUGReal-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
skumpf
 

Similar to Running Hadoop as Service in AltiScale Platform (20)

Feb 2024 Apache Hudi Community Sync with Daniel Ford
Feb 2024 Apache Hudi Community Sync with Daniel FordFeb 2024 Apache Hudi Community Sync with Daniel Ford
Feb 2024 Apache Hudi Community Sync with Daniel Ford
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
 
Munich HUG 21.11.2013
Munich HUG 21.11.2013Munich HUG 21.11.2013
Munich HUG 21.11.2013
 
Vmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanVmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps Ironfan
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
Robin_Hadoop
Robin_HadoopRobin_Hadoop
Robin_Hadoop
 
Discover.hdp2.2.ambari.final[1]
Discover.hdp2.2.ambari.final[1]Discover.hdp2.2.ambari.final[1]
Discover.hdp2.2.ambari.final[1]
 
Oracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleOracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by Example
 
Talend for big_data_intorduction
Talend for big_data_intorductionTalend for big_data_intorduction
Talend for big_data_intorduction
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Hortonworks.bdb
Hortonworks.bdbHortonworks.bdb
Hortonworks.bdb
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUGReal-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
 
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 

More from InMobi Technology

More from InMobi Technology (20)

Optimizer Hints
Optimizer HintsOptimizer Hints
Optimizer Hints
 
Case Studies on PostgreSQL
Case Studies on PostgreSQLCase Studies on PostgreSQL
Case Studies on PostgreSQL
 
PostgreSQL 9.5 - Major Features
PostgreSQL 9.5 - Major FeaturesPostgreSQL 9.5 - Major Features
PostgreSQL 9.5 - Major Features
 
Toro DB- Open-source, MongoDB-compatible database, built on top of PostgreSQL
Toro DB- Open-source, MongoDB-compatible database,  built on top of PostgreSQLToro DB- Open-source, MongoDB-compatible database,  built on top of PostgreSQL
Toro DB- Open-source, MongoDB-compatible database, built on top of PostgreSQL
 
Building Spark as Service in Cloud
Building Spark as Service in CloudBuilding Spark as Service in Cloud
Building Spark as Service in Cloud
 
Building Machine Learning Pipelines
Building Machine Learning PipelinesBuilding Machine Learning Pipelines
Building Machine Learning Pipelines
 
Ensemble Methods for Algorithmic Trading
Ensemble Methods for Algorithmic TradingEnsemble Methods for Algorithmic Trading
Ensemble Methods for Algorithmic Trading
 
Backbone & Graphs
Backbone & GraphsBackbone & Graphs
Backbone & Graphs
 
24/7 Monitoring and Alerting of PostgreSQL
24/7 Monitoring and Alerting of PostgreSQL24/7 Monitoring and Alerting of PostgreSQL
24/7 Monitoring and Alerting of PostgreSQL
 
Reflective and Stored XSS- Cross Site Scripting
Reflective and Stored XSS- Cross Site ScriptingReflective and Stored XSS- Cross Site Scripting
Reflective and Stored XSS- Cross Site Scripting
 
Introduction to Threat Modeling
Introduction to Threat ModelingIntroduction to Threat Modeling
Introduction to Threat Modeling
 
HTTP Basics Demo
HTTP Basics DemoHTTP Basics Demo
HTTP Basics Demo
 
The Synapse IoT Stack: Technology Trends in IOT and Big Data
The Synapse IoT Stack: Technology Trends in IOT and Big DataThe Synapse IoT Stack: Technology Trends in IOT and Big Data
The Synapse IoT Stack: Technology Trends in IOT and Big Data
 
Attacking Web Proxies
Attacking Web ProxiesAttacking Web Proxies
Attacking Web Proxies
 
Security News Bytes Null Dec Meet Bangalore
Security News Bytes Null Dec Meet BangaloreSecurity News Bytes Null Dec Meet Bangalore
Security News Bytes Null Dec Meet Bangalore
 
Matriux blue
Matriux blueMatriux blue
Matriux blue
 
PCI DSS v3 - Protecting Cardholder data
PCI DSS v3 - Protecting Cardholder dataPCI DSS v3 - Protecting Cardholder data
PCI DSS v3 - Protecting Cardholder data
 
Shodan- That Device Search Engine
Shodan- That Device Search EngineShodan- That Device Search Engine
Shodan- That Device Search Engine
 
Big Data BI Simplified
Big Data BI SimplifiedBig Data BI Simplified
Big Data BI Simplified
 
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQMassively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Recently uploaded (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

Running Hadoop as Service in AltiScale Platform

  • 1. Experiences in running Hadoop As A Service chaiken@altiscale.com = #HadoopSherpa DAVID CHAIKEN • 21 NOVEMBER 2014
  • 2. Talk Outline Altiscale Company Introduction and Perspective Altiscale Architecture Use Cases: Performance, Job Analysis, Scheduling Infinite Hadoop Challenges to the Hadoop Community Copyright © 2014 Al2scale, Inc.
  • 3. Corporate Background Hadoop-as-a-Service (HaaS) innovator Company founded in 2012 (Palo Alto & Chennai) Founding team from Yahoo • Raymie Stata, CEO, Former CTO • David Chaiken, CTO, Former Chief Architect • Charles Wimmer, Head of Operations, Former SRE Employees from Yahoo, Google, Netflix, LinkedIn, VMware and others Top-tier investors Copyright © 2014 Al2scale, Inc.
  • 4. Altiscale Chennai Long-term colleagues from Yahoo and before IIT Madras Research Park (back gate of IIT-M) Architecture, Core Development, Test (Apache Bigtop) Control Plane agile development, 2-week sprints Next: Test++, Customer Support, Operations Copyright © 2014 Al2scale, Inc.
  • 5. Everybody Loves Hadoop But… Significant capex expenditure on infrastructure • Complex to manage and maintain Time to get cluster up and running is long Capacity planning is difficult Skillset is difficult to recruit, train and retain What about the cloud? Copyright © 2014 Al2scale, Inc.
  • 6. True Hadoop-as-a-Service Altiscale is the industry’s first purpose-built, petabyte scale Hadoop cloud • Altiscale operates Hadoop for you • Infrastructure optimized to run Hadoop fast and reliably • Pay for Hadoop service, not infrastructure Copyright © 2014 Al2scale, Inc.
  • 7. We Team With You To Help Deliver Insights Customer Al,scale Poten2al insights from a flood of data generated by the connected world Our Opera2ons Team and Hadoop Cloud helps realize those insights + Copyright © 2014 Al2scale, Inc.
  • 8. Customers Copyright © 2014 Al2scale, Inc.
  • 9. How We Do It Virtual Hadoop Cluster Pre-­‐configured Apps YARN Service HDFS Service More Apps Data Connect File Transfer KaRa Flume Hive Pig Oozie We op2mize the job to complete fast and cost-­‐effec2vely Our Hadoop Helpdesk gives you access to Hadoop experts Your data is migrated to HDFS and a virtual Hadoop cluster in our cloud Our Hadoop Opera2ons Team maintains the cluster and plans the job Our team monitors and manages the job through to comple2on We provide an up2me SLA so our Hadoop cloud is always available Copyright © 2014 Al2scale, Inc.
  • 10. Altiscale Architecture: Data and Control Planes Copyright © 2014 Al2scale, Inc.
  • 11. Altiscale Architecture: Data and Control Planes Copyright © 2014 Al2scale, Inc.
  • 12. Altiscale Architecture: Customer Environments Copyright © 2014 Al2scale, Inc.
  • 13. Altiscale Architecture: O&O Hadoop Cluster Copyright © 2014 Al2scale, Inc.
  • 14. Altiscale Architecture: Host Components Copyright © 2014 Al2scale, Inc.
  • 15. Altiscale Architecture: Workbenches Copyright © 2014 Al2scale, Inc.
  • 16. Altiscale Architecture: Data Transfer Copyright © 2014 Al2scale, Inc.
  • 17. Altiscale Architecture: Portal and REST API Copyright © 2014 Al2scale, Inc.
  • 18. Altiscale Architecture: Control Plane Databases Copyright © 2014 Al2scale, Inc.
  • 19. Altiscale Architecture: Control Plane Services Copyright © 2014 Al2scale, Inc.
  • 20. Altiscale Architecture: Hadoop-Based Analysis Copyright © 2014 Al2scale, Inc.
  • 21.
  • 22. Hadoop as a Service Offering Data is migrated to our HDFS service HDFS Service Data Connectors Foundry Apps Apache Mahout Cascading Revolu2on R KaRa/Camus Avro Pentaho Kele Matlab Spark Sqoop H2O Core Apps Apache Hive Apache Pig Apache Oozie Apache HCatalog Apache Flume R JDK/JRE Python HpFS FUSE LZOP, Snappy, gzip Terminal access to Hadoop cluster and associated apps Portal provides job status, billing and support information 1 2 3 Copyright © 2014 Al2scale, Inc.
  • 23. Challenges… Copyright © 2014 Al2scale, Inc.
  • 24. Performance Challenges… Disks: Configuration, Controllers, Density, Cost Network: Jumbo Packet MTU Memory: echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled Network: When does locality matter? Flash: When to use SSD? Copyright © 2014 Al2scale, Inc.
  • 25. Customer Case Study: Analyze Query Customer provided Hive query + data sets (100GBs to ~5 TBs) Needed help optimizing the query Didn’t rewrite query immediately Wanted to characterize query performance and isolate bottlenecks first
  • 26. Analyze and Tune Execution Ran original query on the datasets in our environment: • Two M/R Stages: Stage-1, Stage-2 Long running reducers run out of memory • set mapreduce.reduce.memory.mb=5120! • Reduces slots and extends reduce time Query fails to launch Stage-2 with out of memory • set HADOOP_HEAPSIZE=1024 on client machine Query has 250,000 Mappers in Stage-2 which causes failure • set mapred.max.split.size=5368709120 to reduce Mappers
  • 27. Analysis: Job Execution Characteristics Next challenge - how to visualize job execution? Existing hadoop/hive logs not sufficient for this task Wrote internal tools • parse job history files • plot mapper and reducer execution
  • 29. Analysis: Reduce (Stage-1) Long Tail. Single reduce task.
  • 32. Analysis Execution: Findings. Lone, long running reducer in first stage of query. Analyzed input data: • Query split input data by userId • Bucketizing input data by userId • One very large bucket: “invalid” userId • Discussed “invalid” userId with customer. An error value is a common pattern! • Need to differentiate between “don’t know and don’t care” and “don’t know and do care.”
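The finding above is a case of key skew: one key value receives a disproportionate share of records, so a single reducer does most of the work. A minimal sketch of how such a key could be spotted in a sample of the input follows; the function name, the 50% threshold, and the sample data are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_skewed_keys(keys: Iterable[str], threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (key, share) pairs for keys holding at least `threshold` of all records."""
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(key, count / total)
            for key, count in counts.most_common()
            if count / total >= threshold]


# Illustrative sample: the placeholder value "invalid" dominates the userId column.
sample_user_ids = ["u1", "invalid", "u2", "invalid", "invalid", "u3", "invalid", "invalid"]
skewed = find_skewed_keys(sample_user_ids)

print(skewed)
```

Whether a dominant error value should be filtered out or processed separately depends on the distinction the slide draws between values the customer cares about and values they do not.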
  • 33. Interactive (DRAM-centric) Processing Systems. Loading data into DRAM makes processing fast! Examples: Spark, Impala, 0xdata, …, [SAP HANA], … Streaming systems (Storm, DataTorrent) may be similar. Need to increase YARN container memory size.
  • 34. Hive + Interactive: Watch Out for Container Size. Caution: larger YARN container settings for interactive jobs may not be right for batch systems like Hive. Container size needs to combine vcores and memory: yarn.scheduler.maximum-allocation-vcores, yarn.nodemanager.resource.cpu-vcores, ...
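To see why container size has to account for both memory and vcores, consider a simplified model in which a node can host as many containers as its scarcer resource allows. This is a sketch under assumed numbers (the node and container sizes are hypothetical); whether YARN actually takes vcores into account depends on the scheduler's resource calculator configuration.

```python
def containers_per_node(node_mem_mb: int, node_vcores: int,
                        container_mem_mb: int, container_vcores: int) -> int:
    """Containers that fit on one node when both memory and vcores are limiting."""
    return min(node_mem_mb // container_mem_mb, node_vcores // container_vcores)


# Hypothetical node: 64 GB of memory and 16 vcores available to YARN.
NODE_MEM_MB, NODE_VCORES = 65536, 16

# Small batch-style container versus large interactive-style container (assumed sizes).
batch_slots = containers_per_node(NODE_MEM_MB, NODE_VCORES, 2048, 1)
interactive_slots = containers_per_node(NODE_MEM_MB, NODE_VCORES, 16384, 4)

print("batch containers per node:", batch_slots)
print("interactive containers per node:", interactive_slots)
```

In this model the batch containers are limited by vcores rather than memory, while the larger interactive containers cut the number of concurrent slots per node by a factor of four.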
  • 35. Hive + Interactive: Watch Out for Fragmentation. Attempting to schedule interactive systems and batch systems like Hive on the same cluster may result in fragmentation. Interactive systems may require all-or-nothing scheduling. Batch jobs with many small tasks may starve interactive jobs.
  • 36. Hive + Interactive: Watch Out for Fragmentation. Solutions for fragmentation… • Reserve interactive nodes before starting batch jobs • Reduce interactive container size (if the algorithm permits) • Node labels (YARN-726) and gang scheduling (YARN-624)
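The fragmentation problem can be illustrated with a toy model: a cluster may have enough free memory in total for an all-or-nothing interactive job, yet no single node has a hole large enough for one of its containers. The sketch below is not YARN's scheduling algorithm; the function names and all sizes are assumptions chosen for illustration.

```python
from typing import List


def containers_that_fit(free_mb_per_node: List[int], container_mb: int) -> int:
    """Count containers of a given size that fit, since a container cannot span nodes."""
    return sum(free_mb // container_mb for free_mb in free_mb_per_node)


def can_gang_schedule(free_mb_per_node: List[int], container_mb: int, needed: int) -> bool:
    """All-or-nothing check: every one of the `needed` containers must fit right now."""
    return containers_that_fit(free_mb_per_node, container_mb) >= needed


# Hypothetical cluster where batch tasks have left 6 GB free on each of 4 nodes.
free_mb = [6144, 6144, 6144, 6144]
total_free_mb = sum(free_mb)

# An interactive job wants two 8 GB containers: 16 GB total, less than the 24 GB free.
large_ok = can_gang_schedule(free_mb, 8192, 2)

# The same job with smaller 2 GB containers, if the algorithm permits, needs eight of them.
small_ok = can_gang_schedule(free_mb, 2048, 8)

print(total_free_mb, large_ok, small_ok)
```

This mirrors two of the remedies on the slide: shrinking the interactive container size lets the job fit into the fragments, while reserving nodes up front avoids creating the fragments in the first place.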
  • 37. Altiscale: Hadoop Storage and Compute. Altiscale’s point of view on Hadoop as a Service: • sell HDFS in increments of 10 TB • sell compute in increments of 10K TaskHours/Month. We market Infinite Hadoop, and provide services so that customers need not worry about cluster nodes. But Apache Hadoop user interfaces provide a node-oriented view of clusters…
  • 38. ResourceManager User Interface
  • 39. ResourceManager User Interface
  • 40. NameNode User Interface
  • 41. NameNode User Interface
  • 42. Feedback from Customers. Storage plan normally easy to estimate. Compute plan is hard to estimate: • Customer pain point: meeting computation needs sometimes requires more peak compute capacity than is provided by the number of nodes required for storage • Opportunity: average compute often requires fewer nodes than the number required for storage
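The pain point and the opportunity can be made concrete by sizing a cluster three ways: by storage, by peak compute, and by average compute. The sketch below uses entirely hypothetical figures (per-node capacity, task counts) and ignores details such as HDFS replication.

```python
def ceil_div(amount: int, per_unit: int) -> int:
    """Smallest whole number of units that covers `amount`."""
    return -(-amount // per_unit)


# Hypothetical customer profile and node capacity (assumptions for illustration).
STORAGE_TB = 120
TB_PER_NODE = 12
TASKS_PER_NODE = 16
PEAK_CONCURRENT_TASKS = 400
AVERAGE_CONCURRENT_TASKS = 80

storage_nodes = ceil_div(STORAGE_TB, TB_PER_NODE)
peak_compute_nodes = ceil_div(PEAK_CONCURRENT_TASKS, TASKS_PER_NODE)
average_compute_nodes = ceil_div(AVERAGE_CONCURRENT_TASKS, TASKS_PER_NODE)

print("nodes for storage:", storage_nodes)
print("nodes for peak compute:", peak_compute_nodes)
print("nodes for average compute:", average_compute_nodes)
```

With these numbers a cluster sized for storage falls short at peak but sits partly idle on average, which is the gap that sharing compute capacity across customers is meant to close.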
  • 43. Solution: Change Altiscale’s Product! Make “Infinite” computation available to customers. Multitenancy implementation phases, each of which includes a milestone with production deliverables: 0. Automation for burn/add/remove nodes. 1. Deploy Linux containers using Docker. 2. Decouple compute/storage + manual bursting. 3. Automation: orchestrate add/remove nodes according to allocation plan from the capacity team. 4. Optimized: predictive allocation, economic incentives.
  • 44. Physical Cluster per Customer
  • 45. NM and DN in Docker Containers
  • 46. Decouple Compute/Storage
  • 47. What Customers Get. On demand access to “Infinite” Computation. Ability to handle unexpected needs without contacting Altiscale. “Access to a $10M cluster for just $1M.” Future… Ability to package Hadoop job environment using Docker (YARN-1964).
  • 48. Challenges to the Hadoop Community. Hive + Hadoop debugging can get very complex: • Sifting through many logs and screens • Automatic transmission versus manual transmission. Static partitioning induced by the Java Virtual Machine has benefits but also induces challenges. Where there are difficulties, there’s opportunity: • Better tooling, instrumentation, integration of logs/metrics. YARN still evolving into an operating system. Just starting to build real multitenancy into Hadoop. Hadoop as a Service: aggregate and share expertise.