Hadoop on Azure
Íñigo Goiri
Commercial options
• Azure offers Hadoop and Spark:
• HDInsight
• Azure Databricks
• Our target:
• “Raw” VMs
• Pure Hadoop OSS
• Fast creation and scaling
Building OSS Hadoop on Azure
• Azure DevOps for building
• Periodic sync to trunk
• Build on VM with OSS Docker image
• Output ‘tgz’ to Azure Blob Storage
Deploying a cluster
• Azure Resource Manager (ARM) template
• JSON file describing resources
• Main resources:
• Virtual Machine Scale Sets (VMSS)
• Virtual Network
• Network Security Group
• Load Balancer
• Public IP
• Internal DNS
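
As a rough illustration of "JSON file describing resources", the sketch below builds a skeleton ARM template in Python and prints it as JSON. The resource type strings are standard ARM types, but which type each bullet corresponds to, the resource names, the `<api-version>` placeholder, and the `build_template` helper are assumptions made for illustration; the talk does not show the actual template. Internal DNS is left out because the slides do not say how it is provisioned.

```python
import json

# Standard ARM resource types; mapping each slide bullet to one is an assumption.
RESOURCE_TYPES = {
    "vmss": "Microsoft.Compute/virtualMachineScaleSets",
    "vnet": "Microsoft.Network/virtualNetworks",
    "nsg": "Microsoft.Network/networkSecurityGroups",
    "lb": "Microsoft.Network/loadBalancers",
    "public_ip": "Microsoft.Network/publicIPAddresses",
}

# Placeholder only: each resource type needs a real API version before deploying.
API_VERSION_PLACEHOLDER = "<api-version>"


def build_template(cluster_name, worker_count):
    """Return a skeleton ARM template (hypothetical helper, not from the talk)."""
    resources = []
    for short_name, resource_type in RESOURCE_TYPES.items():
        resource = {
            "type": resource_type,
            "apiVersion": API_VERSION_PLACEHOLDER,
            "name": f"{cluster_name}-{short_name}",
            "location": "[resourceGroup().location]",
        }
        if short_name == "vmss":
            # The scale set capacity is the number of VM instances to create.
            resource["sku"] = {"capacity": worker_count}
        resources.append(resource)
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": resources,
    }


template = build_template("hadooptest", 100)
print(json.dumps(template, indent=2))
```

A real template would also need VM sizes, OS images, subnets, and dependencies between resources; the skeleton only shows the overall shape.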
VM creation and startup
• Cloud-init script
• YAML syntax similar to Docker
• Kubernetes (AKS) does not add much
• Download code and install
• Hadoop, Docker, ZooKeeper, scripts,…
• Set up environment variables
• Discover other services (e.g., ZooKeeper)
• Start services
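
The startup steps above could be expressed as a cloud-init `#cloud-config` document. The following sketch renders one from Python using the standard `write_files` and `runcmd` keys. The tarball URL, install paths, the `ZK_QUORUM` variable, and the `render_cloud_config` helper are hypothetical; the talk does not show its actual script.

```python
import json


def render_cloud_config(tarball_url, env, start_commands):
    """Render a minimal #cloud-config document (hypothetical helper)."""
    lines = [
        "#cloud-config",
        "write_files:",
        "  - path: /etc/profile.d/hadoop.sh",  # assumed location for env vars
        "    content: |",
    ]
    # Environment variables, including the addresses of discovered services.
    for key, value in sorted(env.items()):
        lines.append(f"      export {key}={value}")
    lines.append("runcmd:")
    commands = [
        # Download and install the build output (paths are assumptions).
        f"curl -fsSL -o /tmp/hadoop.tgz {tarball_url}",
        "mkdir -p /opt/hadoop",
        "tar -xzf /tmp/hadoop.tgz -C /opt/hadoop --strip-components=1",
        *start_commands,
    ]
    for command in commands:
        # A JSON string is also a valid YAML double-quoted scalar.
        lines.append(f"  - {json.dumps(command)}")
    return "\n".join(lines) + "\n"


config = render_cloud_config(
    "https://example.invalid/hadoop.tgz",
    {"HADOOP_HOME": "/opt/hadoop", "ZK_QUORUM": "zk0:2181,zk1:2181,zk2:2181"},
    [
        "/opt/hadoop/bin/hdfs --daemon start datanode",
        "/opt/hadoop/bin/yarn --daemon start nodemanager",
    ],
)
print(config)
```

The start commands shown are the Hadoop 3 `--daemon start` form for a worker; other roles would pass different commands.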
VM roles (VMSS)
• 3 x NameNodes
• ZooKeeper
• Journal Nodes
• Routers (RBF)
• 2 x Resource Managers
• N x Workers
• DataNode
• Node Manager
Network
• Virtual Network for all VMs
• Load Balancer
• isActive servlet (HADOOP-15707)
• Public IPs
• External DNS
• Firewall
• Internal DNS
• Locate components (e.g., nn0, zk2, and rm1)
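
A load balancer can use the isActive servlet as a health probe so that traffic only reaches the active manager. The sketch below mimics that selection in Python: it treats an HTTP 200 from `/isActive` as "active" and anything else as "not active". The port, the host names, and both helper functions are assumptions for illustration; the exact responses returned by the servlet should be checked against HADOOP-15707.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional

IS_ACTIVE_PATH = "/isActive"


def http_probe(host: str, port: int = 9870, timeout: float = 2.0) -> int:
    """Return the HTTP status of the isActive endpoint (port is an assumption)."""
    url = f"http://{host}:{port}{IS_ACTIVE_PATH}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # Non-2xx responses still carry a status code we can inspect.
        return error.code


def find_active(hosts: Iterable[str], probe: Callable[[str], int]) -> Optional[str]:
    """Return the first host whose probe reports 200, or None."""
    for host in hosts:
        try:
            status = probe(host)
        except OSError:
            # Unreachable hosts are treated as not active.
            continue
        if status == 200:
            return host
    return None


# Offline demonstration with made-up statuses instead of real HTTP calls.
fake_statuses = {"nn0": 405, "nn1": 200, "nn2": 405}
active = find_active(["nn0", "nn1", "nn2"], fake_statuses.__getitem__)
print(active)
```

Passing `http_probe` instead of the fake lookup would query real hosts resolved through the internal DNS names.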
Worker nodes
• Node Manager (YARN)
• Docker for long-running services
• DataNode (HDFS)
• Use VM local disks
• Leverage PROVIDED storage
• Mount external storage (S3, ADLS, HDFS,…)
• Local HDFS as caching
Creation performance and scalability
• Create new cluster
• 3-5 minutes
• Add 100 workers
• <3 minutes
• Add 1000 workers
• 900 workers in <3 minutes
• Long tail for the rest (<15 minutes)
Low priority VMs
• ~80% price discount
• Can be evicted at any time
• Larger VMs more likely to be evicted
• 30-second notification
• Possible to decommission (NM and DN)
• Ideal for worker nodes
• Mix of low-priority and reserved VMs
[Figure: cluster diagram with VMs labeled Low Priority, Reserved, and Managers]
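
One way a worker could react to the eviction notice is to watch Azure Scheduled Events (served by the instance metadata service) for a `Preempt` event and then start decommissioning. The sketch below only parses an already-fetched events document and lists follow-up steps; the sample document, the VM names, and both helper functions are hypothetical, and the talk does not say how the notification is actually consumed.

```python
def pending_preemptions(events_document, vm_name):
    """Return the Preempt events in the document that target vm_name."""
    return [
        event
        for event in events_document.get("Events", [])
        if event.get("EventType") == "Preempt"
        and vm_name in event.get("Resources", [])
    ]


def eviction_plan(events_document, vm_name):
    """List decommission steps if vm_name is about to be evicted (illustrative)."""
    if not pending_preemptions(events_document, vm_name):
        return []
    return [
        f"add {vm_name} to the YARN and HDFS exclude files",
        "yarn rmadmin -refreshNodes",   # ask the RM to re-read its node lists
        "hdfs dfsadmin -refreshNodes",  # ask the NN to re-read its node lists
    ]


# Made-up document in the shape of a Scheduled Events response.
sample_document = {
    "DocumentIncarnation": 7,
    "Events": [
        {
            "EventId": "example-event-id",
            "EventType": "Preempt",
            "ResourceType": "VirtualMachine",
            "Resources": ["worker_3"],
            "EventStatus": "Scheduled",
        }
    ],
}

print(eviction_plan(sample_document, "worker_3"))
print(eviction_plan(sample_document, "worker_4"))
```

With only about 30 seconds of notice, a DataNode is unlikely to finish re-replicating its blocks, so in practice the plan mostly helps the cluster stop scheduling new work on the node.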
Proposed changes to OSS Hadoop
• Hadoop Registry to find managers
• Improve PROVIDED storage (HDFS)
• Improve Dynamic Resource for NMs (YARN)
Hadoop Registry to find Managers
• Currently:
• Script to set DNS names (e.g., nn2.hadooptest.com, rm0.hadooptest.com)
• Configuration file with hard-coded values
• Possible to use DNS resolution (HDFS-14118)
• YARN Registry to find YARN services
• Moved to Hadoop Registry
• New approach:
• Managers (e.g., NN or RM) register when starting
• Workers (e.g., DN or NM) use registry to find managers
• Dynamic subclusters (RBF)
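
The proposed register-then-lookup flow can be pictured with a toy in-memory registry. This is not the Hadoop Registry API; the class, its methods, the role names, and the addresses below are all hypothetical and only illustrate managers registering at startup and workers looking them up instead of reading hard-coded configuration.

```python
class ManagerRegistry:
    """Toy in-memory stand-in for a service registry (illustrative only)."""

    def __init__(self):
        self._managers = {}  # role -> {name: address}

    def register(self, role, name, address):
        """Called by a manager (e.g., NN or RM) when it starts."""
        self._managers.setdefault(role, {})[name] = address

    def unregister(self, role, name):
        """Called when a manager shuts down; unknown names are ignored."""
        self._managers.get(role, {}).pop(name, None)

    def lookup(self, role):
        """Called by a worker (e.g., DN or NM) to find its managers."""
        return sorted(self._managers.get(role, {}).values())


registry = ManagerRegistry()
registry.register("namenode", "nn0", "nn0.hadooptest.com:8020")
registry.register("namenode", "nn1", "nn1.hadooptest.com:8020")
registry.register("resourcemanager", "rm0", "rm0.hadooptest.com:8032")

print(registry.lookup("namenode"))
print(registry.lookup("resourcemanager"))
```

A real registry would also need liveness tracking so that entries for crashed managers expire.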
Improve PROVIDED storage
• Currently:
• Generate FS image at start time
• Propagate alias map to DNs
• New approach:
• Dynamic mount points
• HA support
• Lazy loading of replica metadata on DNs
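
To make the contrast concrete: instead of propagating the whole alias map to every DataNode up front, a DataNode could fetch the mapping for a block only when the block is first accessed and cache it afterwards. The sketch below shows that idea in isolation. `RemoteRegion`, `LazyAliasMap`, and the sample entries are hypothetical names and data, not the HDFS implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class RemoteRegion:
    """Where a block's bytes live in external storage (illustrative)."""
    path: str
    offset: int
    length: int


class LazyAliasMap:
    """Resolve block IDs on first use and cache the result."""

    def __init__(self, fetch: Callable[[int], RemoteRegion]):
        self._fetch = fetch  # e.g., a call to a remote alias map service
        self._cache: Dict[int, RemoteRegion] = {}
        self.fetch_count = 0

    def resolve(self, block_id: int) -> RemoteRegion:
        if block_id not in self._cache:
            self.fetch_count += 1
            self._cache[block_id] = self._fetch(block_id)
        return self._cache[block_id]


# Made-up backing data standing in for the remote alias map.
remote_entries = {
    1001: RemoteRegion("remote://bucket/data/part-0", 0, 134217728),
    1002: RemoteRegion("remote://bucket/data/part-0", 134217728, 1048576),
}

alias_map = LazyAliasMap(remote_entries.__getitem__)
first = alias_map.resolve(1001)
second = alias_map.resolve(1001)  # served from the cache
print(first, alias_map.fetch_count)
```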
Improve Dynamic Resource Config for NMs
• VMs can change size (CPU)
• Harvesting [OSDI’16]
• Leverage Resource Options (YARN-291, YARN-996)
• Container preemption
• Container priorities (OPPORTUNISTIC)
• Extend current interfaces
• Integrate with Resource Monitor
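
When a VM shrinks, the Node Manager has to bring its running containers back within the new capacity. The sketch below shows one possible selection policy: preempt OPPORTUNISTIC containers first and touch guaranteed ones only if that is not enough. The `Container` type, the `select_preemptions` function, and the tie-breaking rule are assumptions for illustration and do not reproduce YARN's actual scheduler logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Container:
    container_id: str
    vcores: int
    opportunistic: bool


def select_preemptions(containers: List[Container], new_vcores: int) -> List[Container]:
    """Pick containers to preempt so that usage fits within new_vcores."""
    excess = sum(c.vcores for c in containers) - new_vcores
    victims = []
    # Opportunistic containers go first; within a class, larger ones first.
    ordered = sorted(containers, key=lambda c: (not c.opportunistic, -c.vcores))
    for container in ordered:
        if excess <= 0:
            break
        victims.append(container)
        excess -= container.vcores
    return victims


running = [
    Container("c1", 4, opportunistic=False),
    Container("c2", 2, opportunistic=True),
    Container("c3", 1, opportunistic=True),
    Container("c4", 2, opportunistic=False),
]

# The VM shrinks from 9 usable vcores to 6, so 3 vcores must be freed.
victims = select_preemptions(running, new_vcores=6)
print([c.container_id for c in victims])
```

In this example the two opportunistic containers free exactly the 3 vcores needed, so no guaranteed container is preempted.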
Future work
• Improve Security
• Currently network rules
• Integration with Azure Active Directory
• Delegation token propagation
• Changes to OSS
• Hadoop Registry
• PROVIDED storage
• NM Dynamic Resource
• Open source scripts?

Hadoop Meetup Jan 2019 - Hadoop On Azure
