Hadoop has traditionally been an on-premises workload, with few notable cloud implementations. With organizations either having already jumped on the cloud bandwagon or planning their expansion into it, it is imperative to explore how Hadoop conforms to the cloud paradigm. Given the coming of age of some very useful cloud patterns and the highly seasonal nature of Big Data workloads, this is becoming a very common ask from customers. Robust architectures, elastic scale, open platforms, OSS integrations, and complex pain points will all be part of this lively talk. To implement effective Big Data solutions in the cloud, you must grasp the core design principles and understand how the cloud can amplify the benefits of parallelized analytics. Join this session to understand the nitty-gritty of implementing Big Data in the cloud and the options available. Big Data + Cloud is a killer combination.
8. Hadoop/Spark Clusters
Distributed Storage
• Files split across storage
• Files replicated
• Nearest node responds
• Abstracted administration
Extensible
• APIs to extend functionality
• Add new capabilities
• Allow for inclusion in custom environments
Automated Failover
• Unmonitored failover to replicated data
• Built for resiliency
• Metadata stored for later retrieval
Hyper-Scale
• Add resources as desired
• Built to include commodity configs
• Direct correlation of performance and resources
Distributed Compute
• Distributed processing
• Resource utilization
• Cost-efficient method calls
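To see the split-and-replicate model concretely, here is a minimal sketch that shells out to the standard hdfs CLI. It assumes a configured Hadoop client on the PATH, and the file path queried is hypothetical.

# Minimal sketch: inspect how HDFS splits and replicates a file,
# assuming a configured Hadoop client is available on the PATH.
import subprocess

def show_blocks(path: str) -> None:
    """Print the block and replica layout for an HDFS file.

    `hdfs fsck` reports each block of the file and the data nodes
    holding its replicas, illustrating the split + replicate model
    described in the bullets above.
    """
    result = subprocess.run(
        ["hdfs", "fsck", path, "-files", "-blocks", "-locations"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)

show_blocks("/data/events/part-00000")  # hypothetical path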
9. Cloud
The same five attributes hold for the cloud: distributed storage, extensibility, automated failover, hyper-scale, and distributed compute.
10. Big Data in the Cloud
Put together, those same attributes make the cloud a natural home for Big Data: distributed storage, extensibility, automated failover, hyper-scale, and distributed compute.
16. Azure HDInsight
Hadoop and Spark as a Service on Azure
• Fully managed Hadoop and Spark for the cloud
• 100% open-source Hortonworks Data Platform
• Clusters up and running in minutes
• Managed, monitored, and supported by Microsoft with the industry's best enterprise SLA
• Use familiar BI tools for analysis, or open-source notebooks for interactive data science
• 63% lower total cost of ownership than deploying your own Hadoop on-premises*
*IDC study, "The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight"
17. HDInsight Cluster
[Diagram: an HDInsight cluster (head node with backup, data nodes) backed by Azure Data Lake Storage and Azure Storage Blob, accessed with domain credentials.]
18. HDInsight Cluster Security
[Diagram: the same cluster layout joined to an AAD tenant via Azure VNET-to-VNET peering; domain credentials govern access to the HDInsight cluster, Azure Data Lake Storage, and Azure Storage Blob.]
19. Big Data as a Service
[Diagram: compute requirements expressed in U-SQL over data stored in ADLS and WASB.]
23. Azure Data Lake Store
A hyper-scale repository for big data analytics workloads
• Hadoop File System (HDFS) for the cloud
• No limits to scale
• Store any data in its native format
• Enterprise-grade access control and encryption
• Optimized for analytic workload performance
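As an illustration of the HDFS-style interface, here is a minimal sketch using the azure-datalake-store Python SDK (ADLS Gen1). The tenant, client, and store names are placeholders, and the exact auth parameters may vary by SDK version.

# Minimal sketch: browse and read an Azure Data Lake Store (Gen1) account
# through its HDFS-style filesystem API. All names below are placeholders.
from azure.datalake.store import core, lib

token = lib.auth(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
adl = core.AzureDLFileSystem(token, store_name="<store-name>")

print(adl.ls("/"))                      # list the root, as with `hdfs dfs -ls /`
with adl.open("/raw/events.csv", "rb") as f:  # hypothetical file
    print(f.read(200))                  # files are stored in their native format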
24. HDInsight cluster provisioning states
[Flowchart: Accepted → Cluster storage provisioned → Azure VM configuration → Configuring HDInsight → Customize cluster? If yes: cluster customization (custom script running) → Cluster operational → Running. Failures surface as Timed Out or Error.]
Cluster customization options (via the Azure portal or via scripting/SDK):
• Hive/Oozie metastore
• Storage accounts & VNETs
• ScriptAction
• Config values
• JAR file placement in cluster
Ad hoc alternative: RDP to the cluster and update config files (non-durable).
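One way to watch these states from code is to poll the cluster resource through the Azure Resource Manager REST API. This is a rough sketch only: the subscription, resource group, cluster name, bearer token, and api-version are assumptions, and property names can differ across API versions.

# Rough sketch: poll an HDInsight cluster's state via the ARM REST API.
# Subscription, resource group, cluster, token, api-version: placeholders.
import time
import requests

URL = ("https://management.azure.com/subscriptions/<sub-id>"
       "/resourceGroups/<rg>/providers/Microsoft.HDInsight"
       "/clusters/<cluster>?api-version=2015-03-01-preview")  # assumed version
HEADERS = {"Authorization": "Bearer <arm-access-token>"}

while True:
    cluster = requests.get(URL, headers=HEADERS).json()
    state = cluster["properties"]["clusterState"]  # assumed property name
    print("cluster state:", state)
    if state in ("Running", "Error", "TimedOut"):  # terminal states per the flowchart
        break
    time.sleep(30)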
25. Cluster integration options
Each cluster surfaces a REST endpoint for integration, secured via basic authN over SSL:
• /thrift: ODBC & JDBC
• /templeton: job submission, metadata management
• /ambari: cluster health, monitoring
• /oozie: job orchestration, scheduling
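For example, the Templeton (WebHCat) endpoint can be exercised with nothing more than HTTPS and basic auth. A minimal sketch, with the cluster name and credentials as placeholders:

# Minimal sketch: call a cluster's WebHCat (Templeton) REST endpoint over
# HTTPS with basic auth. Cluster name and credentials are placeholders.
import requests

CLUSTER = "https://<cluster-name>.azurehdinsight.net"
AUTH = ("admin", "<cluster-password>")

# Status check: a cheap way to verify the endpoint and credentials.
status = requests.get(f"{CLUSTER}/templeton/v1/status", auth=AUTH)
print(status.json())        # expect {"status": "ok", "version": "v1"}

# List jobs known to the cluster through the same endpoint.
jobs = requests.get(f"{CLUSTER}/templeton/v1/jobs", auth=AUTH,
                    params={"user.name": "admin"})
print(jobs.json())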
28. The Azure Architecture
[Diagram: a pipeline in three stages, Ingestion → Backend → Frontend.
Ingestion: Sources A, B, C, and D land in Azure Data Lake Store, pushed via Data Factory and PowerShell or streamed via Stream Analytics.
Backend: HDInsight (HiveQL), Azure Data Lake Analytics, and Azure SQL Data Warehouse (T-SQL) process the stored data.
Frontend: Azure Analysis Services (DAX) serves the analysts.]
30. Introducing Cortana Intelligence Suite
[Diagram: the suite's layers, from data sources through to action.]
• Data Sources: apps, sensors and devices, data
• Information Management: Event Hubs, Data Catalog, Data Factory
• Big Data Stores: SQL Data Warehouse, Data Lake Store
• Machine Learning and Analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
• Intelligence: Cognitive Services, Bot Framework, Cortana
• Dashboards & Visualizations: Power BI
• Action: people, automated systems, apps (web, mobile, bots)
31. Where Big Data is a cornerstone
[Diagram: the same Cortana Intelligence Suite stack as above, with the Big Data services (HDInsight, Stream Analytics, Data Lake Analytics, SQL Data Warehouse, Data Lake Store) at its core.]
Why Hadoop in the cloud?
On-premises pain points:
• Hardware acquisition (CapEx up front)
• Scale constrained to on-premises procurement (resource and capacity planning)
• Skilled Hadoop expertise needed for tuning and maintenance
You can deploy Hadoop in a traditional on-site datacenter. Some companies, including Microsoft, also offer Hadoop as a cloud-based service. One obvious question: why use Hadoop in the cloud? Here's why a growing number of organizations are choosing this option.
The cloud saves time and money
Open source doesn't mean free. Deploying Hadoop on-premises still requires servers and skilled Hadoop experts to set up, tune, and maintain them. A cloud service lets you spin up a Hadoop cluster in minutes without up-front costs.
See how Virginia Tech is using Microsoft's cloud instead of spending millions of dollars to establish its own supercomputing center.
The cloud is flexible and scales fast
In the Microsoft Azure cloud, you pay only for the compute and storage you use, when you use it. Spin up a Hadoop cluster, analyze your data, then shut it down to stop the meter.
"We quickly spun up the Azure HDInsight cluster and processed six years' worth of data in just a few hours, and then we shut it down… Processing the data in the cloud made it very affordable."
–Paul Henderson, National Health Service (U.K.)
The cloud makes you nimble
Create a Hadoop cluster in minutes, and add nodes on demand. The cloud offers organizations immediate time to value.
It was simply so much faster to do this in the cloud with Windows Azure. We were able to implement the solution and start working with data in less than a week.
–Morten Meldgaard, Chr. Hansen
This topic explores how you can get data into your Big Data solution. It describes several typical data ingestion techniques that are generally applicable to any Big Data solution, including ways to handle streaming data and to automate the ingestion process. While the focus is primarily on Microsoft Azure HDInsight, many of the techniques described here are equally relevant to solutions built on other Big Data frameworks and platforms.
The figure shows an overview of the techniques and technologies covered in this section of the guide.
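As a concrete example of the batch path, here is a minimal sketch that lands a local file in the blob container backing an HDInsight cluster, using the azure-storage-blob Python SDK. The connection string, container, and file paths are placeholders.

# Minimal sketch: upload a local file to the blob container that backs an
# HDInsight cluster (WASB). All names below are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="<cluster-container>",
                               blob="data/raw/events.csv")

with open("events.csv", "rb") as f:
    blob.upload_blob(f, overwrite=True)

# The file is now visible to the cluster at a WASB path of the form
# wasb://<cluster-container>@<account>.blob.core.windows.net/data/raw/events.csv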
Cortana Intelligence delivers an end-to-end platform with an integrated and comprehensive set of tools and services to help you build intelligent applications that let you easily take advantage of Advanced Analytics and intelligence capabilities.
First, Cortana Intelligence provides services to bring data in so that you can analyze it. It provides information-management capabilities like Azure Data Factory, which pulls data from any source (a relational database like SQL Server, or a non-relational one like your Hadoop cluster) on an automated schedule while performing the necessary transforms (such as typing certain columns as dates versus currency). Think ETL (Extract, Transform, Load) in the cloud. Event Hubs does the same for IoT-style ingestion of data streaming in from many endpoints.
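For the Event Hubs ingestion path, a minimal sketch with the azure-eventhub Python SDK (v5); the connection string and hub name are placeholders.

# Minimal sketch: stream a few telemetry events into Event Hubs, the
# IoT-style ingestion point described above. Names are placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    "<event-hubs-connection-string>", eventhub_name="<hub-name>")

with producer:
    batch = producer.create_batch()
    for reading in [{"device": "sensor-01", "temp_c": 21.4},
                    {"device": "sensor-02", "temp_c": 19.8}]:
        batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)   # one network call for the whole batch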
The data brought in then can be persisted in flexible big data storage services like Data Lake Store and Azure SQL Data Warehouse.
You can then use a wide range of analytics services, from Machine Learning to Azure Data Lake Analytics, Azure HDInsight, and Azure Stream Analytics, to analyze the data stored in those big data stores. This means you can create analytics services and models specific to your business need (say, real-time demand forecasting).
The resultant analytics services and models created by taking these steps can then be surfaced as interactive dashboards and visualizations via Power BI.
These same analytics services and models can also be integrated into various UIs (web, mobile, or rich-client apps) or with Cortana, so that end users can interact with them naturally via speech, and can be proactively notified by Cortana when the analytics model finds an anomaly (say, unusual growth in certain product purchases, in the real-time demand forecasting example above) or anything else that deserves the business users' attention. Similar integration can occur with applications built on Cognitive Services or the Bot Framework.
At a high level though, Cortana Intelligence capabilities are in three main areas: data, analytics and intelligence.
<Transition>: We’re going to dive into each one, starting with data.