5. Oozie Metastore
Hive Metastore
ADLS
Storage (Azure)
GW1
Head Node 1
Worker Node
Worker Node
Worker Node
Worker Node
Zookeeper1
Edge Node
VNET
Subnet
GW2Head Node 2
Zookeeper2 Zookeeper3
Yarn RM
Templeton
Hive Server 2
Ambari Server
ATS
Spark Thrift Server
Livy
Zeppelin
Spark History Server
Oozie
Public Endpoint
Authentication
Routing
Zookeeper
HMaster
Node
Manager
Local HDFS
Oozie Client
Region Server
Storm Worker
6. Capability ADLS Azure Blob
Geographic Availability East US 2, Central US, North Europe All Data Centers
HDFS Yes (Web HDFS) No
Scale No Limit on Bandwidth or Storage
size
Limits
-5PB Storage
-50Gbps Bandwidth
File Folder level ACL’s Yes No
Role Based Access Yes No
Encryption Yes Yes
Geo- Replication No Yes [LRS, GRS, RA-GRS]
Cost [1PB] $40K HOT $20K
COOL $16K
GA Date Nov 16th 2016 Feb 1st 2010
12. 1. Orchestrate, monitor & schedule
• compose data processing, storage &
movement services (on premises &
cloud)
2. Automatic infrastructure mgmt
• combine pipeline intent w/ resource
allocation & mgmt
• data movement as a service (global
footprint & on premises)
3. Single pane of glass
• one place to manage your network
of data flows
24. Why Sharding?
Azure Blob storage has throttling limits
Bandwidth [20Gbps] and Storage [500TB] are top limits that we hit all the time
How to mitigate?
Shard data across multiple storage accounts
Use Azure DataLake store or larger, higher scale storage accounts
Randomize Storage account prefixes
25.
26.
27.
28.
29. Transparent to users, BI tools, etc. – HS2/JDBC is the access point
YARN Cluster
ODBC
JDBC
SQL
Queries
LLAP Daemon LLAP Daemon LLAP Daemon LLAP DaemonQuery
Coordinators
DAGs
30. 4X
Date Size – 1TB TPCDS
Worker Nodes - 32 D14_V2
File Format - ORC
Config – No special settings, default HDI
Hive on Tez Hive on Tez + WASB Driver
Improvements
Interactive Query [COLD] Interactive Query [HOT]