Hadoop has traditionally been an on-premises workload, with few notable cloud implementations. With organizations either having already jumped on the cloud bandwagon or planning their expansion into it, it is imperative to explore how Hadoop conforms to the cloud paradigm. Given the coming of age of some very useful cloud patterns and the highly seasonal nature of Big Data workloads, this is becoming a very common ask from customers. Robust architectures, elastic scale, open platforms, OSS integrations, and complex pain points will all be part of this lively talk. To implement effective Big Data solutions in the cloud, you must grasp the core design principles and understand how the cloud can amplify the benefits of parallelized analytics. Join this session to understand the nitty-gritty of implementing Big Data in the cloud and the options available. Big Data + Cloud is a killer combination.
8. Hadoop/Spark Clusters
Distributed Storage
• Files split across storage
• Files replicated
• Nearest node responds
• Abstracted administration
Extensible
• APIs to extend functionality
• Add new capabilities
• Allow for inclusion in custom environments
Automated Failover
• Unmonitored failover to replicated data
• Built for resiliency
• Metadata stored for later retrieval
Hyper-Scale
• Add resources as desired
• Built to include commodity configs
• Direct correlation of performance and resources
Distributed Compute
• Distributed processing
• Resource utilization
• Cost-efficient method calls
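To see the split-and-replicate model concretely, here is a minimal sketch that shells out to the standard hdfs CLI. It assumes a configured Hadoop client on the PATH, and the file path queried is hypothetical.

# Minimal sketch: inspect how HDFS splits and replicates a file,
# assuming a configured Hadoop client is available on the PATH.
import subprocess

def show_blocks(path: str) -> None:
    """Print the block and replica layout for an HDFS file.

    `hdfs fsck` reports each block of the file and the data nodes
    holding its replicas, illustrating the split + replicate model
    described in the bullets above.
    """
    result = subprocess.run(
        ["hdfs", "fsck", path, "-files", "-blocks", "-locations"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)

show_blocks("/data/events/part-00000")  # hypothetical path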
9. Cloud
The same five attributes hold for the cloud: distributed storage, extensibility, automated failover, hyper-scale, and distributed compute.
10. Big Data in the Cloud
Put together, those same attributes make the cloud a natural home for Big Data: distributed storage, extensibility, automated failover, hyper-scale, and distributed compute.
16. Azure HDInsight
Hadoop and Spark as a Service on Azure
• Fully managed Hadoop and Spark for the cloud
• 100% open-source Hortonworks Data Platform
• Clusters up and running in minutes
• Managed, monitored, and supported by Microsoft with the industry's best enterprise SLA
• Use familiar BI tools for analysis, or open-source notebooks for interactive data science
• 63% lower total cost of ownership than deploying your own Hadoop on-premises*
*IDC study, "The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight"
17. HDInsight Cluster
[Diagram: an HDInsight cluster (head node with backup, data nodes) backed by Azure Data Lake Storage and Azure Storage Blob, accessed with domain credentials.]
18. HDInsight Cluster Security
[Diagram: the same cluster layout joined to an AAD tenant via Azure VNET-to-VNET peering; domain credentials govern access to the HDInsight cluster, Azure Data Lake Storage, and Azure Storage Blob.]
19. Big Data as a Service
[Diagram: compute requirements expressed in U-SQL over data stored in ADLS and WASB.]
23. Azure Data Lake Store
A hyper-scale repository for big data analytics workloads
• Hadoop File System (HDFS) for the cloud
• No limits to scale
• Store any data in its native format
• Enterprise-grade access control and encryption
• Optimized for analytic workload performance
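As an illustration of the HDFS-style interface, here is a minimal sketch using the azure-datalake-store Python SDK (ADLS Gen1). The tenant, client, and store names are placeholders, and the exact auth parameters may vary by SDK version.

# Minimal sketch: browse and read an Azure Data Lake Store (Gen1) account
# through its HDFS-style filesystem API. All names below are placeholders.
from azure.datalake.store import core, lib

token = lib.auth(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
adl = core.AzureDLFileSystem(token, store_name="<store-name>")

print(adl.ls("/"))                      # list the root, as with `hdfs dfs -ls /`
with adl.open("/raw/events.csv", "rb") as f:  # hypothetical file
    print(f.read(200))                  # files are stored in their native format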
24. HDInsight cluster provisioning states
[Flowchart: Accepted → Cluster storage provisioned → Azure VM configuration → Configuring HDInsight → Customize cluster? If yes: cluster customization (custom script running) → Cluster operational → Running. Failures surface as Timed Out or Error.]
Cluster customization options (via the Azure portal or via scripting/SDK):
• Hive/Oozie metastore
• Storage accounts & VNETs
• ScriptAction
• Config values
• JAR file placement in cluster
Ad hoc alternative: RDP to the cluster and update config files (non-durable).
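One way to watch these states from code is to poll the cluster resource through the Azure Resource Manager REST API. This is a rough sketch only: the subscription, resource group, cluster name, bearer token, and api-version are assumptions, and property names can differ across API versions.

# Rough sketch: poll an HDInsight cluster's state via the ARM REST API.
# Subscription, resource group, cluster, token, api-version: placeholders.
import time
import requests

URL = ("https://management.azure.com/subscriptions/<sub-id>"
       "/resourceGroups/<rg>/providers/Microsoft.HDInsight"
       "/clusters/<cluster>?api-version=2015-03-01-preview")  # assumed version
HEADERS = {"Authorization": "Bearer <arm-access-token>"}

while True:
    cluster = requests.get(URL, headers=HEADERS).json()
    state = cluster["properties"]["clusterState"]  # assumed property name
    print("cluster state:", state)
    if state in ("Running", "Error", "TimedOut"):  # terminal states per the flowchart
        break
    time.sleep(30)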
25. Cluster integration options
Each cluster surfaces a REST endpoint for integration, secured via basic authN over SSL:
• /thrift: ODBC & JDBC
• /templeton: job submission, metadata management
• /ambari: cluster health, monitoring
• /oozie: job orchestration, scheduling
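For example, the Templeton (WebHCat) endpoint can be exercised with nothing more than HTTPS and basic auth. A minimal sketch, with the cluster name and credentials as placeholders:

# Minimal sketch: call a cluster's WebHCat (Templeton) REST endpoint over
# HTTPS with basic auth. Cluster name and credentials are placeholders.
import requests

CLUSTER = "https://<cluster-name>.azurehdinsight.net"
AUTH = ("admin", "<cluster-password>")

# Status check: a cheap way to verify the endpoint and credentials.
status = requests.get(f"{CLUSTER}/templeton/v1/status", auth=AUTH)
print(status.json())        # expect {"status": "ok", "version": "v1"}

# List jobs known to the cluster through the same endpoint.
jobs = requests.get(f"{CLUSTER}/templeton/v1/jobs", auth=AUTH,
                    params={"user.name": "admin"})
print(jobs.json())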
28. The Azure Architecture
[Diagram: a pipeline in three stages, Ingestion → Backend → Frontend.
Ingestion: Sources A, B, C, and D land in Azure Data Lake Store, pushed via Data Factory and PowerShell or streamed via Stream Analytics.
Backend: HDInsight (HiveQL), Azure Data Lake Analytics, and Azure SQL Data Warehouse (T-SQL) process the stored data.
Frontend: Azure Analysis Services (DAX) serves the analysts.]
30. Introducing Cortana Intelligence Suite
[Diagram: the suite's layers, from data sources through to action.]
• Data Sources: apps, sensors and devices, data
• Information Management: Event Hubs, Data Catalog, Data Factory
• Big Data Stores: SQL Data Warehouse, Data Lake Store
• Machine Learning and Analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
• Intelligence: Cognitive Services, Bot Framework, Cortana
• Dashboards & Visualizations: Power BI
• Action: people, automated systems, apps (web, mobile, bots)
31. Where Big Data is a cornerstone
[Diagram: the same Cortana Intelligence Suite stack as above, with the Big Data services (HDInsight, Stream Analytics, Data Lake Analytics, SQL Data Warehouse, Data Lake Store) at its core.]
Why Hadoop in the cloud?
On-premises pain points:
• Hardware acquisition (CapEx up front)
• Scale constrained to on-premises procurement (resource and capacity planning)
• Skilled Hadoop expertise needed for tuning and maintenance
You can deploy Hadoop in a traditional on-site datacenter. Some companies, including Microsoft, also offer Hadoop as a cloud-based service. One obvious question: why use Hadoop in the cloud? Here's why a growing number of organizations are choosing this option.
The cloud saves time and money
Open source doesn't mean free. Deploying Hadoop on-premises still requires servers and skilled Hadoop experts to set up, tune, and maintain them. A cloud service lets you spin up a Hadoop cluster in minutes without up-front costs.
See how Virginia Tech is using Microsoft's cloud instead of spending millions of dollars to establish its own supercomputing center.
The cloud is flexible and scales fast
In the Microsoft Azure cloud, you pay only for the compute and storage you use, when you use it. Spin up a Hadoop cluster, analyze your data, then shut it down to stop the meter.
"We quickly spun up the Azure HDInsight cluster and processed six years' worth of data in just a few hours, and then we shut it down… Processing the data in the cloud made it very affordable."
–Paul Henderson, National Health Service (U.K.)
The cloud makes you nimble
Create a Hadoop cluster in minutes, and add nodes on demand. The cloud offers organizations immediate time to value.
It was simply so much faster to do this in the cloud with Windows Azure. We were able to implement the solution and start working with data in less than a week.
–Morten Meldgaard, Chr. Hansen
This topic explores how you can get data into your Big Data solution. It describes several typical data ingestion techniques that are generally applicable to any Big Data solution, including ways to handle streaming data and to automate the ingestion process. While the focus is primarily on Microsoft Azure HDInsight, many of the techniques described here are equally relevant to solutions built on other Big Data frameworks and platforms.
The figure shows an overview of the techniques and technologies covered in this section of the guide.
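As a concrete example of the batch path, here is a minimal sketch that lands a local file in the blob container backing an HDInsight cluster, using the azure-storage-blob Python SDK. The connection string, container, and file paths are placeholders.

# Minimal sketch: upload a local file to the blob container that backs an
# HDInsight cluster (WASB). All names below are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="<cluster-container>",
                               blob="data/raw/events.csv")

with open("events.csv", "rb") as f:
    blob.upload_blob(f, overwrite=True)

# The file is now visible to the cluster at a WASB path of the form
# wasb://<cluster-container>@<account>.blob.core.windows.net/data/raw/events.csv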
Cortana Intelligence delivers an end-to-end platform with an integrated and comprehensive set of tools and services to help you build intelligent applications that let you easily take advantage of Advanced Analytics and intelligence capabilities.
First, Cortana Intelligence provides services to bring data in so that you can analyze it. It provides information-management capabilities like Azure Data Factory, which pulls data from any source (a relational database like SQL Server, or a non-relational one like your Hadoop cluster) on an automated schedule while performing the necessary transforms (such as typing certain columns as dates versus currency). Think ETL (Extract, Transform, Load) in the cloud. Event Hubs does the same for IoT-style ingestion of data streaming in from many endpoints.
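For the Event Hubs ingestion path, a minimal sketch with the azure-eventhub Python SDK (v5); the connection string and hub name are placeholders.

# Minimal sketch: stream a few telemetry events into Event Hubs, the
# IoT-style ingestion point described above. Names are placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    "<event-hubs-connection-string>", eventhub_name="<hub-name>")

with producer:
    batch = producer.create_batch()
    for reading in [{"device": "sensor-01", "temp_c": 21.4},
                    {"device": "sensor-02", "temp_c": 19.8}]:
        batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)   # one network call for the whole batch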
The data brought in then can be persisted in flexible big data storage services like Data Lake Store and Azure SQL Data Warehouse.
You can then use a wide range of analytics services, from Machine Learning to Azure Data Lake Analytics, Azure HDInsight, and Azure Stream Analytics, to analyze the data stored in those big data stores. This means you can create analytics services and models specific to your business need (say, real-time demand forecasting).
The resultant analytics services and models created by taking these steps can then be surfaced as interactive dashboards and visualizations via Power BI.
These same analytics services and models can also be integrated into various UIs (web, mobile, or rich-client apps) or with Cortana, so that end users can interact with them naturally via speech, and can be proactively notified by Cortana when the analytics model finds an anomaly (say, unusual growth in certain product purchases, in the real-time demand forecasting example above) or anything else that deserves the business users' attention. Similar integration can occur with applications built on Cognitive Services or the Bot Framework.
At a high level though, Cortana Intelligence capabilities are in three main areas: data, analytics and intelligence.
<Transition>: We’re going to dive into each one, starting with data.