@ashishth
[Figure: reference architecture for a big data platform. Data sources (corporate data; devices and sensors) flow through a hot path and a cold path into a serving layer for consumers. Components shown: ingest real-time data, stream processing, real-time NoSQL store, store real-time data for long-term analysis, ingest batch data, ETL, ad hoc query in the data lake, HDFS-compliant storage (data lake), orchestration, and governance (metadata management, security / access control). Consumers: advanced analytics and data science (machine learning; R, Python, APIs), analytics and data exploration, corporate reporting, self-service BI, streaming / real-time applications, and downstream applications.]
[Figure: Azure services: Azure SDK, Azure Data Factory, Azure Import/Export service, Azure CLI, Cognitive Services, Bot Service, Azure Search, Azure Data Catalog, Azure ExpressRoute, Azure network security groups, Azure Functions, Visual Studio, Operations Management Suite, Azure Active Directory, Azure Key Management Service, Azure Storage blobs, Azure Data Lake Storage, Azure IoT Hub, Azure Event Hubs, Kafka on Azure HDInsight, Azure SQL Data Warehouse, Azure SQL DB, Azure Cosmos DB, Azure Analysis Services, Power BI, Azure HDInsight, Azure Databricks, Azure Stream Analytics, Azure ML, ML Server.]
Azure HDInsight
A secure and managed Apache Hadoop and Spark platform for building data lakes in the cloud
• The most trusted and compliant platform
ETL: Pig, Hive, or Spark?
Serving layer: Presto or Hive LLAP?
Storage: Which storage system? How to transfer the data?
Orchestration: ADF, Airflow, or Oozie?
Event processing: Spark Streaming or Storm?
Monitoring & security
Capability | Spark | Pig | Hive
Designed for | ETL | ETL | Data warehousing
Adoption | High, increasing | Low, decreasing | Stable
Number of connectors | Highest | High | High
Languages | Python, R, Scala, Java, SQL | Pig | SQL
Performance | High | Medium | Medium
Capability | Spark Structured Streaming | Storm
Adoption | High, increasing | Decreasing
Event processing guarantee | Exactly once | At least once
Throughput | High | Low
Processing model | Micro-batch | Real-time
Latency | High | Low
Event time support | Yes | Yes
Languages | Python, R, Scala, Java, SQL | Java
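The processing-model and latency rows can be illustrated with a toy calculation. The sketch below is not Spark or Storm code. It assumes a fixed micro-batch interval (a hypothetical 1,000 ms) and that each event is picked up at the first batch boundary at or after its arrival; per-event processing would add no such wait by construction. The function name and the sample timestamps are illustrative only.

```python
from typing import List


def micro_batch_wait_ms(arrival_ms: List[int], batch_interval_ms: int) -> List[int]:
    """Time each event waits for its micro-batch to close (toy model).

    Assumes batches close at multiples of batch_interval_ms and that an
    event is picked up by the first batch boundary at or after its arrival.
    """
    if batch_interval_ms <= 0:
        raise ValueError("batch_interval_ms must be positive")
    waits = []
    for t in arrival_ms:
        # Round the arrival time up to the next batch boundary (integer ceiling).
        boundary = -(-t // batch_interval_ms) * batch_interval_ms
        waits.append(boundary - t)
    return waits


arrivals = [100, 450, 999, 1000, 1001]  # hypothetical arrival times in ms
waits = micro_batch_wait_ms(arrivals, batch_interval_ms=1000)
print(waits)                    # wait added by micro-batching, per event
print(sum(waits) / len(waits))  # average added wait in ms
```

Shrinking the batch interval lowers the added wait but increases per-batch scheduling overhead, which is one way to read the throughput and latency rows together.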
Capability | Hive LLAP | (column 2, unlabeled in source) | (column 3, unlabeled in source)
Interactive query speed | High | High | Medium
Scale | High | High | Low
Caching | Yes | Yes | Early support
Result caching | Yes | No | No
Intelligent cache eviction | Yes | No | No
Materialized views | Yes | No | No
Complex fact-to-fact joins | Yes | Yes | No
Transactions | Yes | No | No
Query concurrency | High | Low | Low
Row- and column-level security | Yes [Apache Ranger + AAD] | Medium | Medium
Rich end-user tools | Yes | Yes | Yes
Language support | SQL, UDF | SQL, Scala, Python | SQL
Data source connector support | Storage handlers | Data sources | Connectors
Hive Metadata
Spark Metadata
Hive Metadata
Azure HDInsight 3.6 with Hadoop 2.6 Azure HDInsight 4.0 with Hadoop 3.x
Hive Metastore migration tool: https://azure.microsoft.com/en-us/blog/hdinsight-metastore-migration-tool-open-source-release-now-available/
Capability | ADF | Airflow | Oozie
Service management | Azure PaaS | IaaS VM | HDInsight
Code | JSON | Python | Java
GUI | ADF V2 has great UX | Good UX | Below-average UX
Community | Microsoft | Growing (12,133 stars) | Declining (483 stars)
On-demand clusters | Yes | No, but extensible | No
Extensibility | Custom action-only | Full, graph + actions | Custom action-only
Pipeline definition | JSON/UX | Python/UX | XML/Java/UX
DevOps-first design | Yes | Yes | Yes
Pipeline monitoring | Yes | Yes | Yes
Scheduling | Event, time | Event | Event, time
Data movement
Storage options and tradeoffs
Caching
Data quantity | Network bandwidth: 45 Mbps (T3) | 100 Mbps | 1 Gbps
1 TB | 2 days | 1 day | 2 hours
10 TB | 22 days | 10 days | 1 day
35 TB | 76 days | 34 days | 3 days
80 TB | 173 days | 78 days | 8 days
100 TB | 216 days | 97 days | 10 days
200 TB | 1 year | 194 days | 19 days
500 TB | 3 years | 1 year | 49 days
1 PB | 6 years | 3 years | 97 days
2 PB | 12 years | 5 years | 194 days
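The figures in the table can be approximated with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and, by default, a fully utilized link with no protocol overhead. The slide's numbers appear to include rounding or overhead, so the results match only approximately. The function name and the `efficiency` parameter are illustrative and not part of any Azure tool.

```python
def transfer_days(size_tb: float, bandwidth_mbps: float, efficiency: float = 1.0) -> float:
    """Idealized time in days to move size_tb terabytes over a link.

    efficiency is the assumed fraction of nominal bandwidth actually
    achieved (1.0 means a perfectly utilized link).
    """
    if size_tb < 0 or bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, bandwidth > 0, 0 < efficiency <= 1")
    bits = size_tb * 1e12 * 8                          # decimal terabytes to bits
    seconds = bits / (bandwidth_mbps * 1e6 * efficiency)
    return seconds / 86_400                            # seconds per day


for size in (1, 10, 100):
    row = [round(transfer_days(size, mbps), 1) for mbps in (45, 100, 1000)]
    print(f"{size} TB: {row} days")
```

An estimate like this is mainly useful for deciding between an online transfer and shipping data offline: once the result runs to weeks or months, the offline option becomes worth considering.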
Network transfer with TLS
• Over the internet
• ExpressRoute
• Data Box online transfer
Shipping data offline
• Data Box offline data transfer
USB 3.1 SSD disks: 8 TB each, order up to 5 in each pack (up to 40 TB)
Ruggedized, self-contained appliances: 100 TB or 1 PB
Use Azure Data Box to migrate data from an on-premises HDFS store to Azure Storage
Storage | Type | Latency (consistency of latency) | Workloads | Bandwidth | Key benefits
ADLS Gen 2 | Hierarchical | 10-50 ms (Medium) | HDInsight 3.6 & 4.0 | Unconstrained | Atomic rename, file/folder-level ACLs
Standard Blob | Object store | 10-50 ms (Medium) | HDInsight 3.6 & 4.0 | Unconstrained | Mature
Premium Blob | Object store | ~5 ms (High) | HBase in preview | Unconstrained | Fast
Premium Managed Disks | Hierarchical | ~5 ms (High) | Kafka, HBase in preview | Based on disk | Consistent latency
ADLS Gen 1 | Hierarchical | 10-100 ms (Low) | HDInsight 3.6 (no HBase) | High | Atomic rename, file/folder-level ACLs
[Figure: HBase RegionServer architecture. A client issues Put, Delete, and Get requests to a RegionServer, which holds a log, a flusher, and three regions, each with a Memstore and an HFile, on top of storage.]
Remote store write path challenges with Write Ahead Log
• Inconsistent latencies
• High latencies
[Figure: client (Put, Update, Get, Delete), RegionServer with log and flusher, and remote storage; the write path is labeled Insert, Update, Get, Delete and Sync Operation.]
Introducing Premium Managed Disk for WAL
• Consistent latencies
• Low latencies
• Data durability
[Figure: client (Put, Update, Get, Delete), RegionServer with log and flusher, and Premium Managed Disk(s) as storage; the write path is labeled Insert, Update, Get, Delete and Sync Operation.]
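The role of the write-ahead log in the write path can be sketched with a toy model. The class below is not HBase code: `RegionSketch`, its methods, and the flush threshold are hypothetical, and clearing the log on flush simplifies how a real RegionServer rolls and archives its WAL. It only shows the order of operations: every write is appended to the log first, which is why the latency of the storage behind the WAL bounds write latency, then applied to the memstore, and the memstore is later flushed to an immutable HFile.

```python
from typing import Dict, List, Optional, Tuple

TOMBSTONE = object()  # marker stored in place of a deleted value


class RegionSketch:
    """Toy model of one region: WAL + memstore + immutable HFiles."""

    def __init__(self, flush_threshold: int = 3) -> None:
        self.wal: List[Tuple[str, str, Optional[str]]] = []  # (op, key, value)
        self.memstore: Dict[str, object] = {}
        self.hfiles: List[Dict[str, object]] = []  # oldest first
        self.flush_threshold = flush_threshold

    def _write(self, op: str, key: str, value: Optional[str]) -> None:
        # 1. Append to the write-ahead log first. In a real system this is
        #    the synchronous step that waits on the WAL's storage.
        self.wal.append((op, key, value))
        # 2. Apply the edit to the in-memory memstore.
        self.memstore[key] = TOMBSTONE if op == "delete" else value
        # 3. Flush once the memstore holds enough distinct keys.
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def put(self, key: str, value: str) -> None:
        self._write("put", key, value)

    def delete(self, key: str) -> None:
        self._write("delete", key, None)

    def flush(self) -> None:
        """Persist the memstore as a new HFile and drop the covered log entries."""
        if self.memstore:
            self.hfiles.append(dict(self.memstore))
            self.memstore.clear()
            self.wal.clear()  # simplification: edits are now held in an HFile

    def get(self, key: str) -> Optional[str]:
        # Newest data wins: check the memstore, then HFiles from newest to oldest.
        for store in [self.memstore, *reversed(self.hfiles)]:
            if key in store:
                value = store[key]
                return None if value is TOMBSTONE else value
        return None


region = RegionSketch(flush_threshold=2)
region.put("row1", "a")
region.put("row2", "b")   # second distinct key triggers a flush
region.delete("row1")     # tombstone lives in the memstore until the next flush
print(region.get("row1"), region.get("row2"), len(region.hfiles), len(region.wal))
```

Because step 1 sits on the critical path of every Put and Delete, moving only the log to lower-latency storage can improve write latency without moving the HFiles.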
Low latency workload: HBase / small writes
[Figure: client (Put, Delete, Get), RegionServer with log, flusher, and three regions (each with a Memstore and an HFile), storage, and Premium Managed Disk(s).]
[Figure: client (Put, Delete, Get), RegionServer with log, flusher, and three regions (each with a Memstore and an HFile), Premium Blob Storage, and Premium Managed Disk(s).]
Workload | Caching options | Key benefits
Spark | Spark IO Cache | Up to ~8-10x performance improvement
HBase & Phoenix | Bucket cache | Up to 5-10x performance gains on recently read or written data
Hive + LLAP | LLAP intelligent cache / result cache | Up to ~4-100x gain on cached data
Azure Data Lake Storage
Instance | Cores | RAM | Temp SSD
D1 v2 | 1 | 3.50 GiB | 50 GiB
D2 v2 | 2 | 7.00 GiB | 100 GiB
D3 v2 | 4 | 14.00 GiB | 200 GiB
D4 v2 | 8 | 28.00 GiB | 400 GiB
D5 v2 | 16 | 56.00 GiB | 800 GiB
• Significant Spark performance speedup with IO cache (up to 9x performance gains)
• Automatic cache resource management
• DRAM + temp SSD make a large cache pool
PERIMETER
• Isolate clusters within VNETs
• Service Endpoint support for WASB, Azure DB, Cosmos DB
• Restrict outbound traffic using NVAs*
AUTHENTICATION
• Azure Active Directory
• Kerberos with Active Directory
AUTHORIZATION
• Role-Based Access Control
• Apache Ranger based Access Control
DATA PROTECTION
• Encryption on-the-wire with HTTPS enforced
• Encryption at rest using Azure Key Vault
• Auditing of all data operations and configuration changes
Apache Ranger ADLS Gen 2 ACLs
Scenario | Authorizing component
YARN: submit app | Apache Ranger: YARN plugin
Hive operations (select, drop, index, lock, read, write, masking, row-level filter) on Hive databases, tables, and columns | Apache Ranger: Hive plugin
Create/alter table with storage location reference | Apache Ranger + ADLS Gen 2 ACLs
Spark SQL access with Hive Metastore | Apache Ranger: Hive plugin
HBase access policies | Apache Ranger: HBase plugin
Kafka access policies | Apache Ranger: Kafka plugin
Access Azure Data Lake Storage Gen2 using the Spark DataFrame API | ADLS Gen 2 ACLs
Access Azure Data Lake Storage Gen2 using the RDD API | ADLS Gen 2 ACLs
HDFS operations: mkdir, ls, put, copyFromLocal, get, cat, mv, cp, etc. | ADLS Gen 2 ACLs
Running MapReduce jobs | ADLS Gen 2 ACLs
• Check whether HDFS is in safe mode: hdfs dfsadmin -D 'fs.default.name=hdfs://mycluster/' -safemode get
• A report that shows the details of HDFS state: hdfs dfsadmin -D 'fs.default.name=hdfs://mycluster/' -report
• Get HDFS out of safe mode: hdfs dfsadmin -D 'fs.default.name=hdfs://mycluster/' -safemode leave
• Get HDFS into safe mode: hdfs dfsadmin -D 'fs.default.name=hdfs://mycluster/' -safemode enter
Set up Autoscale
Customize to your own scenario
Pay for ONLY what you need
Monitor scaling history easily
Graceful scale-down
[Figure: HDInsight cluster. Gateways (https://cluster.azurehdinsight.net/APIs), Head Node 1, Head Node 2, four worker nodes, three ZooKeeper nodes, Hive Metastore, and YARN.]
Workload | DR option
Spark / Hive | Manual, partner solution
HBase | HBase replication, snapshot export, Import/Export, copy tables
Kafka | MirrorMaker
https://github.com/anagha-microsoft/hdi-spark-dr
https://github.com/anagha-microsoft/hdi-kafka-dr
https://docs.microsoft.com/en-us/azure/hdinsight/hbase/apache-hbase-backup-replication
Apache Ambari
Azure Log Analytics integration
HDInsight cluster metrics
Motivation and benefits
Architecture best practices
Infrastructure best practices
Storage best practices
Data migration best practices
Security and DevOps best practices
https://azure.microsoft.com/en-us/blog/migrating-on-premises-hadoop-infrastructure-to-azure-hdinsight/
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 

Building Big Data Applications using Spark, Hive, HBase and Kafka

  • 4. Reference architecture (diagram labels)
      Data sources: corporate data; devices & sensors
      Hot path: ingest real-time data; stream processing; real-time NoSQL store
      Cold path: ingest batch data; HDFS-compliant storage (data lake); ETL; ad hoc query in the data lake; store real-time data for long-term analysis
      Serving layer and consumers: downstream applications; advanced analytics & data science (machine learning; R, Python, APIs); analytics and data exploration; corporate reporting and self-service BI; streaming/real-time applications
      Cross-cutting: orchestration; governance (metadata management, security / access control)
  • 5. Azure services (diagram labels): Azure SDK; Azure Data Factory; Azure Import/Export service; Azure CLI; Cognitive Services; Bot Service; Azure Search; Azure Data Catalog; Azure ExpressRoute; Azure Network Security Groups; Azure Functions; Visual Studio; Operations Management Suite; Azure Active Directory; Azure Key Management Service; Azure Storage Blobs; Azure Data Lake Storage; Azure IoT Hub; Azure Event Hubs; Kafka on Azure HDInsight; Azure SQL Data Warehouse; Azure SQL DB; Azure Cosmos DB; Azure Analysis Services; Power BI; Azure HDInsight; Azure Databricks; Azure Stream Analytics; Azure ML; ML Server; Azure Databricks
  • 7. Azure HDInsight: a secure and managed Apache Hadoop and Spark platform for building data lakes in the cloud. The most trusted and compliant platform.
  • 8. Decision points by layer
      Storage: Which storage system? How to transfer the data?
      ETL: Pig, Hive or Spark?
      Event processing: Spark Streaming or Storm?
      Serving layer: Presto or Hive LLAP?
      Orchestration: ADF/Airflow or Oozie?
      Monitoring & security
  • 10.
      Criterion | Spark | Pig | Hive
      Designed for | ETL | ETL | Data warehousing
      Adoption | High, increasing | Low, decreasing | Stable
      Number of connectors | Highest | High | High
      Languages | Python, R, Scala, Java, SQL | Pig | SQL
      Performance | High | Medium | Medium
  • 11.
      Criterion | Spark Structured Streaming | Storm
      Adoption | High, increasing | Decreasing
      Event processing guarantee | Exactly once | At least once
      Throughput | High | Low
      Processing model | Micro-batch | Real-time
      Latency | High | Low
      Event time support | Yes | Yes
      Languages | Python, R, Scala, Java, SQL | Java
  • 12.
      Capability | Hive LLAP | Spark SQL | Presto
      Interactive query speed | High | High | Medium
      Scale | High | High | Low
      Caching | Yes | Yes | Early support
      Result caching | Yes | No | No
      Intelligent cache eviction | Yes | No | No
      Materialized views | Yes | No | No
      Complex fact-to-fact joins | Yes | Yes | No
      Transactions | Yes | No | No
      Query concurrency | High | Low | Low
      Row- and column-level security | Yes [Apache Ranger + AAD] | Medium | Medium
      Rich end-user tools | Yes | Yes | Yes
      Language support | SQL, UDF | SQL, Scala, Python | SQL
      Data source connector support | Storage handlers | Data sources | Connectors
  • 14. Metastore diagram labels: Hive metadata; Spark metadata; Hive metadata. Azure HDInsight 3.6 with Hadoop 2.6; Azure HDInsight 4.0 with Hadoop 3.x. Hive Metastore migration tool: https://azure.microsoft.com/en-us/blog/hdinsight-metastore-migration-tool-open-source-release-now-available/
  • 15.
      Criterion | ADF | Airflow | Oozie
      Service management | Azure PaaS | IaaS VM | HDInsight
      Code | JSON | Python | Java
      GUI | ADF V2 has great UX | Good UX | Below-average UX
      Community | Microsoft | Growing (12,133 stars) | Declining (483 stars)
      On-demand clusters | Yes | No, but extensible | No
      Extensibility | Custom action only | Full, graph + actions | Custom action only
      Pipeline definition | JSON/UX | Python/UX | XML/Java/UX
      DevOps-first design | Yes | Yes | Yes
      Pipeline monitoring | Yes | Yes | Yes
      Scheduling | Event, time | Event | Event, time
  • 18. Transfer time by data quantity and network bandwidth
      Data qty | 45 Mbps (T3) | 100 Mbps | 1 Gbps
      1 TB | 2 days | 1 day | 2 hours
      10 TB | 22 days | 10 days | 1 day
      35 TB | 76 days | 34 days | 3 days
      80 TB | 173 days | 78 days | 8 days
      100 TB | 216 days | 97 days | 10 days
      200 TB | 1 year | 194 days | 19 days
      500 TB | 3 years | 1 year | 49 days
      1 PB | 6 years | 3 years | 97 days
      2 PB | 12 years | 5 years | 194 days
  • 19. Network transfer with TLS: over the internet; ExpressRoute; Data Box online transfer. Shipping data offline: Data Box offline data transfer.
  • 20. Use Azure Data Box to migrate data from an on-premises HDFS store to Azure Storage. Data Box Disk: USB 3.1 SSD disks, 8 TB each, order up to 5 in each pack (up to 40 TB). Data Box (100 TB) and Data Box Heavy (1 PB): ruggedized, self-contained appliances.
  • 21.
      Storage | Type | Latency (consistency of latency) | Workloads | Bandwidth | Key benefits
      ADLS Gen 2 | Hierarchical | 10-50 ms (medium) | HDInsight 3.6 & 4.0 | Unconstrained | Atomic rename, file- and folder-level ACLs
      Standard Blob | Object store | 10-50 ms (medium) | HDInsight 3.6 & 4.0 | Unconstrained | Mature
      Premium Blob | Object store | ~5 ms (high) | HBase in preview | Unconstrained | Fast
      Premium Managed Disks | Hierarchical | ~5 ms (high) | Kafka, HBase in preview | Based on disk | Consistent latency
      ADLS Gen 1 | Hierarchical | 10-100 ms (low) | HDInsight 3.6 (no HBase) | High | Atomic rename, file- and folder-level ACLs
  • 24. Remote store write path challenges with the Write Ahead Log. Diagram: client (Put, Update, Get, Delete) → RegionServer (log flusher, sync operation) → storage. Challenges: inconsistent latencies; high latencies.
  • 25. Introducing Premium Managed Disks for the WAL. Diagram: client (Put, Update, Get, Delete) → RegionServer (log flusher, sync operation) → Premium Managed Disk(s). Benefits: consistent latencies; low latencies; data durability.
  • 31.
      Workload | Caching options | Key benefits
      Spark | Spark IO cache | Up to ~8-10x perf improvements
      HBase & Phoenix | Bucket cache | Up to 5-10x perf gains on recently read or written data
      Hive + LLAP | LLAP intelligent cache / result cache | Up to ~4-100x gain on cached data
  • 32. Azure Data Lake Storage
      Instance | Cores | RAM | Temp SSD
      D1 v2 | 1 | 3.50 GiB | 50 GiB
      D2 v2 | 2 | 7.00 GiB | 100 GiB
      D3 v2 | 4 | 14.00 GiB | 200 GiB
      D4 v2 | 8 | 28.00 GiB | 400 GiB
      D5 v2 | 16 | 56.00 GiB | 800 GiB
      • Significant Spark performance speed-up with IO cache (up to 9x perf gains)
      • Automatic cache resource management
      • DRAM + temp SSD makes a large cache pool
  • 34.
      Perimeter: isolate clusters within VNETs; service endpoint support for WASB, Azure DB, Cosmos DB; restrict outbound traffic using NVAs*
      Authentication: Azure Active Directory; Kerberos with Active Directory
      Authorization: role-based access control; Apache Ranger-based access control
      Data protection: encryption on the wire with HTTPS enforced; encryption at rest using Azure Key Vault; auditing of all data operations and configuration changes
  • 37. Apache Ranger; ADLS Gen 2 ACLs
  • 38.
      Scenario | Authorizing component
      YARN: submit app | Apache Ranger: YARN plugin
      Hive operations (Select, Drop, Index, Lock, Read, Write, Masking, Row-level filter) on Hive databases, tables & columns | Apache Ranger: Hive plugin
      Create/Alter table with storage location reference | Apache Ranger + ADLS Gen 2 ACLs
      Spark SQL access with Hive Metastore | Apache Ranger: Hive plugin
      HBase access policies | Apache Ranger: HBase plugin
      Kafka access policies | Apache Ranger: Kafka plugin
      Access Azure Data Lake Storage Gen2 using the Spark DataFrame API | ADLS Gen 2 ACLs
      Access Azure Data Lake Storage Gen2 using the RDD API | ADLS Gen 2 ACLs
      HDFS operations: mkdir, ls, put, copyFromLocal, get, cat, mv, cp, etc. | ADLS Gen 2 ACLs
      Running MapReduce jobs | ADLS Gen 2 ACLs
  • 41.
      # Get the current safe mode state
      hdfs dfsadmin -D 'fs.default.name=hdfs://mycluster/' -safemode get
      # A report that shows the details of HDFS state
      hdfs dfsadmin -D 'fs.default.name=hdfs://mycluster/' -report
      # Get HDFS out of safe mode
      hdfs dfsadmin -D 'fs.default.name=hdfs://mycluster/' -safemode leave
      # Get HDFS into safe mode
      hdfs dfsadmin -D 'fs.default.name=hdfs://mycluster/' -safemode enter
  • 42. Set up Autoscale: customize to your own scenario; pay for only what you need; monitor scaling history easily; graceful scale-down.
  • 43. HDInsight cluster (diagram labels): gateways; head node 1 and head node 2; four worker nodes; three ZooKeeper nodes; Hive Metastore; YARN. https://cluster.azurehdinsight.net/APIs
  • 44.
      Workload | DR option
      Spark / Hive | Manual, partner solution
      HBase | HBase replication, snapshot export, Import/Export, CopyTable
      Kafka | MirrorMaker
      https://github.com/anagha-microsoft/hdi-spark-dr
      https://github.com/anagha-microsoft/hdi-kafka-dr
      https://docs.microsoft.com/en-us/azure/hdinsight/hbase/apache-hbase-backup-replication
  • 46. Apache Ambari; Azure Log Analytics integration; HDInsight cluster metrics
  • 48. Motivation and benefits; architecture best practices; infrastructure best practices; storage best practices; data migration best practices; security and DevOps best practices. https://azure.microsoft.com/en-us/blog/migrating-on-premises-hadoop-infrastructure-to-azure-hdinsight/
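
The transfer-time table on slide 18 comes down to dividing data volume by link bandwidth. Below is a minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a link that is fully available to the transfer. The function names and the `efficiency` parameter are illustrative and do not come from the slides. The slide's figures appear to use slightly different rounding and unit assumptions, so the results will be close to the table but not identical.

```python
def transfer_seconds(data_tb: float, bandwidth_mbps: float, efficiency: float = 1.0) -> float:
    """Idealized time in seconds to move data_tb terabytes over a bandwidth_mbps link.

    efficiency is a hypothetical knob (0 < efficiency <= 1) for protocol
    overhead or link sharing; 1.0 means the whole link is usable.
    """
    if data_tb < 0 or bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("data_tb must be >= 0, bandwidth_mbps > 0, 0 < efficiency <= 1")
    bits = data_tb * 1e12 * 8                      # terabytes -> bits (decimal units)
    usable_bits_per_second = bandwidth_mbps * 1e6 * efficiency
    return bits / usable_bits_per_second


def describe(seconds: float) -> str:
    """Render a duration as hours (under one day) or days."""
    hours = seconds / 3600
    if hours < 24:
        return f"{hours:.1f} hours"
    return f"{hours / 24:.1f} days"


if __name__ == "__main__":
    # Same bandwidth columns as the slide: T3, 100 Mbps, 1 Gbps.
    for size_tb in (1, 10, 100):
        row = [describe(transfer_seconds(size_tb, mbps)) for mbps in (45, 100, 1000)]
        print(f"{size_tb} TB:", " | ".join(row))
```

Estimates like these are mainly useful for deciding when a network transfer stops being practical and an offline option such as Data Box becomes the better choice.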

Editor's Notes

  1. Azure HDInsight is a secure and managed platform for building data lakes on Azure, based on the Apache Hadoop and Spark frameworks. So what does HDInsight have to offer?

     Reliable open source analytics with an industry-leading SLA. HDInsight allows you to easily spin up open source cluster types, backed by the industry's best 99.9% SLA and 24/7 support. We guarantee this SLA for the entire big data solution, not just the VM instances. HDInsight is architected for full redundancy and high availability, including head node replication, data geo-replication, and a built-in standby NameNode, making HDInsight resilient to critical failures not addressed in standard Hadoop implementations. Azure also offers cluster monitoring and 24x7 enterprise support backed by Microsoft and Hortonworks, which together have 37 committers for Hadoop core (more than all other managed cloud providers combined) to support your deployment, with the ability to fix and commit code back to Hadoop.

     Enterprise-grade security and monitoring. HDInsight protects your data assets and easily extends your on-premises security and governance controls to the cloud. We feature single sign-on (SSO), multi-factor authentication, and seamless management of millions of identities through Azure Active Directory. You can authorize users and groups with fine-grained access control policies over all your enterprise data with Apache Ranger. HDInsight meets HIPAA, PCI, and SOC compliance, ensuring your enterprise data assets are always protected with the highest security and regulatory compliance. To ensure the highest level of business continuity, HDInsight extends capabilities for alerting, monitoring, defining pre-emptive actions, and enhanced workload protection through native integration with Azure Operations Management Suite (OMS).

     Most productive platform for developers and scientists. HDInsight offers developers tailored experiences through rich productivity suites for Hadoop and Spark, with integrated development environments using Visual Studio, Eclipse, and IntelliJ, supporting Scala, Python, R, Java, and .NET. HDInsight gives data scientists the ability to create narratives that combine code, statistical equations, and visualizations that tell a story about the data, through integration with the two most popular notebooks: Jupyter and Zeppelin. HDInsight is also the only managed cloud Hadoop solution with integration to Microsoft R Server. Multi-threaded math libraries and transparent parallelization in R Server mean handling up to 1000x more data at up to 50x faster speeds than open source R, helping you train more accurate models for better predictions than previously possible.

     Cost-effective cloud scale. HDInsight has decoupled compute and storage, enabling you to cost-effectively scale workloads up or down, independent of storage. Local storage can still be used for caching and fast I/O. Spark and interactive Hive users can choose SSD memory for interactive performance, while Kafka users can retain all streaming data in premium managed disks. You pay only for the compute and storage you use, and you can choose any Azure VM type that gives the best utilization of resources. A recent study showed HDInsight delivering 63% lower TCO than deploying Hadoop on premises over 5 years.*

     Integration with leading productivity applications. In the broader ecosystem for Hadoop, there is a thriving market of independent software vendors (ISVs) who provide value-added solutions. Through a unique design where every cluster is extended with edge nodes and script actions, HDInsight lets customers spin up Hadoop and Spark clusters pre-integrated and pre-tuned with any ISV application out of the box. Datameer, Cask, AtScale, and StreamSets are a few such applications that are very popular on the HDInsight platform today.

     Easy for administrators to manage. With HDInsight, administrators can deploy Hadoop in the cloud without buying new hardware or incurring other up-front costs. There is no time-consuming installation or setup, and no need to patch the operating system or upgrade Hadoop versions: Azure does it for you. Launch your first cluster in minutes.
  2. The new world of HDInsight 4.0 with Hadoop 3.0 brings the Spark and Hive worlds closer together. Let's see how. Before Hadoop 3.0, Spark executors would access the Hive metastore directly. While on the surface this seems like a fine thing to do, it is rife with problems. The new architecture instead requires explicit registration of Hive transactional tables as Spark external tables through the Hive Warehouse Connector. While this adds one extra step during configuration, it greatly increases the reliability of data access. The Hive Warehouse Connector supports efficient predicate pushdown and Apache Arrow-based communication between Spark executors and Hive LLAP daemons, which keeps the overall communication overhead between the two systems small. With the Hive Warehouse Connector, Apache Spark on HDInsight 4.0 gets mature transactional capabilities. The new integration between Apache Spark and Hive LLAP in HDInsight 4.0 delivers new capabilities for business analysts, data scientists, and data engineers. Business analysts get a performant SQL engine in the form of Hive LLAP (Interactive Query), while data scientists and data engineers get a great platform for ML experimentation and ETL with Apache Spark over transactional data in Hive tables.
  3. Reference https://azure.microsoft.com/en-us/blog/deploying-apache-airflow-in-azure-to-build-and-run-data-pipelines/
  4. Build 2015
  5. Transfer data over the network with TLS
     • Over the internet: You can transfer data to Azure Storage over a regular internet connection using any one of several tools, such as Azure Storage Explorer, AzCopy, Azure PowerShell, and Azure CLI. See Moving data to and from Azure Storage for more information.
     • ExpressRoute: ExpressRoute is an Azure service that lets you create private connections between Microsoft datacenters and infrastructure that's on your premises or in a colocation facility. ExpressRoute connections do not go over the public internet, and offer higher security, reliability, and speeds with lower latencies than typical connections over the internet. For more information, see Create and modify an ExpressRoute circuit.
     • Data Box online data transfer: Data Box Edge and Data Box Gateway are online data transfer products that act as network storage gateways to manage data between your site and Azure. Data Box Edge, an on-premises network device, transfers data to and from Azure and uses artificial intelligence (AI)-enabled edge compute to process data. Data Box Gateway is a virtual appliance with storage gateway capabilities. For more information, see Azure Data Box Documentation - Online Transfer.

     Shipping data offline
     • Import/Export service: You can send physical disks to Azure and they will be uploaded for you. For more information, see What is Azure Import/Export service?
     • Data Box offline data transfer: Data Box, Data Box Disk, and Data Box Heavy devices help you transfer large amounts of data to Azure when the network isn't an option. These offline data transfer devices are shipped between your organization and the Azure datacenter. They use AES encryption to help protect your data in transit, and they undergo a thorough post-upload sanitization process to delete your data from the device. For more information, see Azure Data Box Documentation - Offline Transfer.
  6. Azure Data Lake Storage Gen2 is Azure's storage platform for high-performance analytics. It is built on the strong foundation of Blob Storage, which is Azure's object storage platform and has been serving customers and various use cases (including some analytics use cases) for over a decade. ADLS Gen2 is designed with native file system semantics and optimized for high-performance analytics. For example, renaming a folder, which is very common in Spark workloads, is a single metadata operation as opposed to a large number of individual object operations. ADLS Gen2 also supports POSIX ACLs, which are an open industry standard. ADLS Gen2 currently supports only a small subset of the Blob capabilities (authentication and redundancy), but several of the other Blob capabilities will light up once we support "interoperability", which is the ability to run multiple protocols on the same account. This is planned to roll out in waves throughout the calendar year. Blob interoperability also lights up integrations such as ASA and Event Hubs, since these were previously built for Blob storage. This includes SDKs, as the Blob SDKs can be used on the account as opposed to writing brand new SDKs. ADLS Gen2 is GA, available in all Azure regions, and the recommended storage platform for analytics pipelines.
  7. Before I describe specific capabilities and value propositions of HDInsight, let us take a quick look at the architecture of an HDInsight cluster. We will build upon this when we talk about security later in the presentation. First, a key difference between an on-premises Hadoop cluster and an HDInsight cluster is that with HDInsight, the storage and compute layers are separated. This allows storage and compute to be scaled independently of each other. We have seen in numerous customer cases that combining storage and compute on a single cluster often leads to underutilization of one or the other, or both. With HDInsight, you can keep loading data into Azure Data Lake Storage Gen1 or Gen2, or into WASB, and you can create small or large clusters as and when needed. Each HDInsight cluster comes with 2 gateway nodes, 2 head nodes, and 3 ZooKeeper nodes. In most cases, these are free of charge. As we will discuss later, we provision more than one of each of these nodes to ensure high availability. Each HDInsight cluster lives within a VNET. The gateway nodes are the ONLY public endpoints accessible from outside the VNET. As we will see later, this architecture allows you to securely lock down your HDInsight cluster.
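
Note 6 says that a folder rename in a store with native file system semantics is a single metadata operation, whereas a flat object store needs one operation per object. The sketch below illustrates that difference with two toy in-memory models. It is not the ADLS Gen2 or Blob Storage API; the class names, the `operations` counter, and the paths are all hypothetical and exist only to make the operation counts visible.

```python
class FlatObjectStore:
    """Toy flat namespace: keys are full paths and 'folders' are only key prefixes."""

    def __init__(self):
        self.objects = {}
        self.operations = 0  # rename work performed so far

    def put(self, key, data):
        self.objects[key] = data

    def rename_folder(self, src, dst):
        src_prefix = src.rstrip("/") + "/"
        dst_prefix = dst.rstrip("/") + "/"
        # Every object under the prefix must be rewritten under the new key.
        for key in [k for k in self.objects if k.startswith(src_prefix)]:
            self.objects[dst_prefix + key[len(src_prefix):]] = self.objects.pop(key)
            self.operations += 1


class HierarchicalStore:
    """Toy hierarchical namespace: directories are nested dicts."""

    def __init__(self):
        self.root = {}
        self.operations = 0  # rename work performed so far

    def _walk(self, parts, create=False):
        node = self.root
        for part in parts:
            node = node.setdefault(part, {}) if create else node[part]
        return node

    def put(self, path, data):
        *dirs, name = path.strip("/").split("/")
        self._walk(dirs, create=True)[name] = data

    def rename_folder(self, src, dst):
        *src_parent, src_name = src.strip("/").split("/")
        *dst_parent, dst_name = dst.strip("/").split("/")
        # Detach the whole subtree and re-attach it under the new name: one step.
        subtree = self._walk(src_parent).pop(src_name)
        self._walk(dst_parent, create=True)[dst_name] = subtree
        self.operations += 1


if __name__ == "__main__":
    flat, tree = FlatObjectStore(), HierarchicalStore()
    for i in range(1000):
        flat.put(f"tmp/job1/part-{i}", b"x")
        tree.put(f"tmp/job1/part-{i}", b"x")
    flat.rename_folder("tmp/job1", "out/job1")
    tree.rename_folder("tmp/job1", "out/job1")
    print("flat store rename operations:", flat.operations)
    print("hierarchical store rename operations:", tree.operations)
```

Spark jobs commonly write to a temporary directory and rename it on commit, which is why the per-object cost in the flat model matters for the workloads the note describes.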