© Copyright SELA Software & Education Labs Ltd. | 14-18 Baruch Hirsch St Bnei Brak, 51202 Israel | www.selagroup.com
Eyal Ben Ivri
Building Big Data Solutions
on Azure
About me
Eyal Ben Ivri
Big Data & Cloud Architect, Sela Group
Focus On Hadoop Eco-System & Big-Data +
NoSQL Solutions
Modern Data – The Big Picture
IoT
User Data
Media Files
Documents
Machine Data
Log Files
The Light Rail Problem – TLV
Railway
Imagine the new light rail maintenance
company
IoT – Internet of Trains (and cameras, and cash
registers and carts and rails and more…)
Analyze data in stream and in batch
Dashboards
Alerts
The perfect problem
What We Need
An integrated data solution that will be:
Able to process events from external sources
Able to walk data through different pipelines
Fast and responsive
Big-Data Ready
In Other Words
Consume
BI Dashboards Applications
Process
ETL Aggregations Computation Analysis Querying
Persist
Hadoop SQL NoSQL
Ingest
IoT Structured Data Un-Structured Data
Microsoft Azure Services for
IoT and BigData
Devices
Device Connectivity: Event Hubs, Service Bus, External Data Sources
Storage: SQL Database, Table/Blob Storage, DocumentDB, Data Lake Store, External Data Sources
Analytics: Machine Learning, Stream Analytics, HDInsight, Data Factory, Data Lake Analytics
Presentation & Action: App Service, Power BI, Notification Hubs, Mobile Services, BizTalk Services
Event Hub
Messages at scale
Why not throw it into a queue, and have a
listener at the backend?
Scaling limits, because of the architecture of queues
and topics of a standard Service Bus
Event Hub uses a partition model
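To make the partition model concrete, here is a minimal sketch in plain Python, not the Event Hubs SDK. It assumes events carry a partition key (here a hypothetical train id) and that the key is hashed to pick a partition, so events for one key stay in order inside one partition and each partition can be read by its own consumer. The function names and the hash choice are assumptions for illustration; the hashing the service actually uses is internal to it.

```python
import hashlib
from collections import defaultdict


def partition_for(key: str, partition_count: int) -> int:
    """Map a partition key to a partition index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def route(events, partition_count):
    """Append (key, payload) events to per-partition logs, preserving arrival order."""
    partitions = defaultdict(list)
    for key, payload in events:
        partitions[partition_for(key, partition_count)].append((key, payload))
    return partitions


# Hypothetical sample events: two trains sending messages.
events = [
    ("train-7", "door open"),
    ("train-9", "speed 42"),
    ("train-7", "door closed"),
]
logs = route(events, partition_count=4)
```

Because consumers read partitions independently, adding partitions adds parallel readers, which is the scaling property a single queue with one listener lacks.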
Getting Started
Easy to set up
Two Configurations
Partition Count – depends on the number of consumers
(2 to 32)
Message Retention (days) – between 1 and 7 days
Secured using SAS Policies
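As an illustration of what a SAS policy means for a sender, the sketch below builds a Shared Access Signature token with the Python standard library only. It follows the commonly documented Service Bus / Event Hubs token layout (an HMAC-SHA256 signature over the URL-encoded resource URI and an expiry time). The namespace, hub name, policy name and key shown are hypothetical placeholders, and the exact format should be checked against the current Azure documentation before use.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def generate_sas_token(resource_uri, key_name, key, ttl_seconds=3600, now=None):
    """Build a SAS token string for the given resource URI and policy key."""
    # Expiry is expressed as seconds since the Unix epoch.
    expiry = int((time.time() if now is None else now) + ttl_seconds)
    encoded_uri = urllib.parse.quote_plus(resource_uri)
    # The signature covers the URL-encoded URI and the expiry, separated by a newline.
    string_to_sign = f"{encoded_uri}\n{expiry}".encode("utf-8")
    digest = hmac.new(key.encode("utf-8"), string_to_sign, hashlib.sha256).digest()
    signature = urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))
    return (
        f"SharedAccessSignature sr={encoded_uri}"
        f"&sig={signature}&se={expiry}&skn={key_name}"
    )


# Hypothetical values: replace with a real namespace, hub, policy name and key.
token = generate_sas_token(
    "https://example-ns.servicebus.windows.net/trains",
    key_name="send-policy",
    key="not-a-real-key",
)
```

The token is sent with each request instead of the key itself, so a policy with send-only rights can be handed to devices without exposing management access.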
IoT with Event Hubs
Device Connectivity & Management
Devices (RTOS, Linux, Windows, Android, iOS)
Field Gateway (Protocol Adaptation)
Cloud Gateway: Event Hubs
IoT & Data Processing Patterns
Device Connectivity & Management
Devices (RTOS, Linux, Windows, Android, iOS)
Field Gateway (Protocol Adaptation)
Cloud Gateway: Event Hubs & IoT Hub
Analytics & Operationalized Insights
Hot Path Analytics: Azure Stream Analytics, Azure HDInsight Storm
Hot Path Business Logic: Service Fabric & Actor Framework
Batch Analytics & Visualizations: Azure HDInsight, AzureML, Power BI,
Azure Data Factory
TLV Railway
Can now ingest millions of messages each
second
These messages carry data from:
Devices
End-Machines
Servers
Next, we need to use this data to create
real-time alerts when something goes wrong
Azure Stream Analytics
Automatic recovery
Monitoring and alerting
Scale on demand
Managed Cloud Service
Each unit handles 1MB/s
Can scale up to 1GB/s
SQL-like language
temporal windowing
semantics
support for reference data
Stream Analytics – Main Concepts
Inputs
Can be stream or reference data (metadata)
Stream Data sources can be Event Hub, Blob Storage
(using blobs with timestamps) or IoT Hub (preview)
Serialization types support CSV, JSON, and Avro
Query
A SQL query that selects from input(s) and
writes results to output(s)
Output
Can be Blob, SQL, Event Hub (notification), Power BI
(preview), Table storage, Service Bus or DocumentDB
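The difference between stream input and reference data can be sketched in plain Python. This is only an analogy for what a stream-to-reference join does, not Stream Analytics itself: each incoming event is enriched from a small, static lookup table, and the enriched rows are what would be written to an output. The station ids and names are hypothetical.

```python
# Reference data: slowly changing metadata, loaded once (hypothetical values).
STATIONS = {"S1": "Station A", "S2": "Station B"}


def enrich(events, reference):
    """Join each streaming event with its reference row (inner-join semantics)."""
    for event in events:
        name = reference.get(event["station_id"])
        if name is None:
            continue  # no matching reference row: the event is dropped
        yield {**event, "station_name": name}


# Stream data: events arriving over time (hypothetical sample).
stream = [
    {"train_id": "T7", "station_id": "S1"},
    {"train_id": "T9", "station_id": "S3"},  # S3 is missing from the reference data
]
output = list(enrich(stream, STATIONS))
```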
Tumbling Windows
How many trains entered each station every 5
minutes?
-- StationId is an assumed column name on EntryStream
SELECT StationId, COUNT(*) AS TrainCount FROM EntryStream
GROUP BY StationId, TumblingWindow(minute, 5)
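What the tumbling-window query computes can be sketched in plain Python. This is a simplified model, assuming events are (timestamp in seconds, station id) pairs and that a window covers the half-open interval from its start up to, but not including, its end. The service's own boundary convention may differ, and the station ids are hypothetical.

```python
from collections import Counter


def tumbling_counts(events, window_seconds=300):
    """Count events per (window_start, key) over fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Every timestamp belongs to exactly one window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical entries: (seconds since start, station id).
entries = [(10, "S1"), (200, "S1"), (299, "S2"), (300, "S1"), (650, "S2")]
result = tumbling_counts(entries)
```

Each event lands in exactly one 5-minute bucket, which is what makes the per-window counts safe to add up later.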
Temporal Windows
Tumbling Window
A series of fixed-sized, non-overlapping and
contiguous time intervals
Hopping Window
Scheduled overlapping windows
Sliding Window
Outputs events only for those points in time when
the content of the window actually changes
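The contrast between tumbling and hopping windows is easiest to see by asking which windows a single event belongs to. The sketch below is a plain-Python model, assuming windows start at multiples of the hop size and cover a half-open interval. These conventions and the function name are assumptions for illustration, not the service's implementation.

```python
def hopping_windows(timestamp, window_size, hop_size):
    """Return the start times of every window that contains the timestamp."""
    # A window [start, start + window_size) contains the timestamp
    # exactly when start <= timestamp and start > timestamp - window_size.
    start = timestamp - (timestamp % hop_size)
    starts = []
    while start > timestamp - window_size and start >= 0:
        starts.append(start)
        start -= hop_size
    return sorted(starts)


# Hopping: 10-second windows that start every 5 seconds overlap,
# so an event at t=7 falls into two of them.
overlapping = hopping_windows(7, window_size=10, hop_size=5)

# Tumbling is the special case where the hop equals the window size,
# so the same event falls into exactly one window.
single = hopping_windows(7, window_size=10, hop_size=10)
```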
TLV Railway
Can now respond in near-real-time to events as
they happen
Track and maintain malfunctioning equipment
Receive real-time data regarding customers
entering and leaving stations
Data can now be processed, so we need a place
to save it, preferably at scale.
DocumentDB and Azure Data
Services
A fully managed, scalable, queryable, schema-free JSON
document database service for modern applications
transactional processing
rich query
managed as a service
elastic scale
internet accessible http/rest
schema-free data model
arbitrary data formats
DocumentDB features
JSON Documents
SQL support
LINQ Support
REST API Support
JS Support (triggers, UDFs, stored procedures)
Automatic Index
Multiple Document Transactions
Tunable Consistency
DocumentDB Key Concept
Collection
A collection of Documents
Not a table (different entities can go into the same
collection)
Collections = Partitions
Not just logical containers, but physical ones
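The point that a collection is not a table can be shown with a small in-memory model. This is a sketch in plain Python, not the DocumentDB SDK: documents of different shapes sit side by side, and a query filters on whatever fields a document happens to have. The document fields and the `query` helper are hypothetical.

```python
# One collection holding two different kinds of entities (hypothetical documents).
collection = [
    {"id": "1", "type": "sensorReading", "trainId": "T7", "temperature": 81.5},
    {"id": "2", "type": "station", "name": "Station A", "platforms": 2},
    {"id": "3", "type": "sensorReading", "trainId": "T9", "temperature": 64.0},
]


def query(docs, **equals):
    """Return documents whose fields equal the given values; missing fields never match."""
    return [
        doc
        for doc in docs
        if all(field in doc and doc[field] == value for field, value in equals.items())
    ]


readings = query(collection, type="sensorReading")
stations = query(collection, platforms=2)
```

A common convention, used here as an assumption, is a `type` field that lets one collection serve several entity kinds without a fixed schema.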
Demo
TLV Railway – Part 1
TLV Railway
Can now store its data in a highly scalable store
Great for interactive querying of any data
Messages from sensors
Reference Data
But this data (and other data) needs to move to
other places (SQL, Batch processing, ML). How?
What is Azure Data Factory?
Azure Data Factory is a managed service to produce
trusted information from data stored in the cloud
and on-premises. Easily create, orchestrate and
schedule highly available, fault-tolerant workflows
to move and transform your data at scale.
Evolving Approaches to Analytics
Traditional: Extract Original Data, Transform in an ETL Tool (SSIS, etc.),
Load Transformed Data into an EDW (SQL Server, Teradata, etc.), BI Tools
Modern: Ingest Original Data and Streaming data into Scale-out Storage &
Compute (HDFS, Blob Storage, etc.), Transform & Load into Data Marts and
Data Lake(s), then BI Tools, Dashboards, Apps
Data Factory – Main concepts
Data Store
A data source/sink component
SQL (Azure or on-premises), Storage, DocumentDB and
more
Data Set
A defined data set that is contained inside a data store
One data store can have many data sets
Compute
A service for computation
HDInsight, Azure Batch, Data Lake Analytics, Azure ML
Data Factory – Main concepts
Pipeline
Set of instructions
“Take data from data set A and move to compute,
then store results in data set B”
Slices
Everything is time sliced
A data set (source) can declare on what time
intervals the data can be sliced, and the pipeline will
be activated when a new slice is ready
JSON
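The slicing idea can be sketched in plain Python. This is a simplified model of time-sliced activation, not the Data Factory service or its JSON definitions: a data set is cut into fixed intervals, and the pipeline activity runs once for each complete slice that has not been processed yet. The function names and the hourly interval are assumptions for illustration.

```python
from datetime import datetime, timedelta


def slices(start, end, interval):
    """Yield (slice_start, slice_end) pairs for every complete slice in [start, end)."""
    current = start
    while current + interval <= end:
        yield current, current + interval
        current += interval


def run_pipeline(ready_until, start, interval, activity, done):
    """Run the activity once per complete slice that is not yet in `done`."""
    results = []
    for window in slices(start, ready_until, interval):
        if window not in done:
            results.append(activity(*window))
            done.add(window)
    return results


def copy_activity(slice_start, slice_end):
    # Stand-in for "take data from data set A, compute, store in data set B".
    return f"copy {slice_start:%H:%M}-{slice_end:%H:%M}"


start = datetime(2016, 1, 1)
hour = timedelta(hours=1)
processed = set()

# Data is available up to 02:30, so only two full hourly slices are ready.
first_run = run_pipeline(datetime(2016, 1, 1, 2, 30), start, hour, copy_activity, processed)
# Half an hour later the third slice completes; earlier slices are not re-run.
second_run = run_pipeline(datetime(2016, 1, 1, 3, 0), start, hour, copy_activity, processed)
```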
TLV Railway
Can now integrate different services and
different data sources
Move data with ease and as little hassle as
possible
What about aggregations and deeper dives into
the data for more complex analysis?
HDInsight
Hadoop-as-a-Service
Based on the Hortonworks distribution
A few flavors:
Hadoop (Windows + Linux)
Storm (Windows + Linux)
HBase (Windows + Linux)
Spark (Windows + Linux)
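To give a feel for the kind of aggregation a Hadoop cluster runs, here is a single-process sketch of the map, shuffle and reduce phases in plain Python. On a real cluster these phases run in parallel across nodes; the record layout and function names are hypothetical, and nothing here uses the Hadoop or HDInsight APIs.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (station, 1) pair for every entry record."""
    for record in records:
        yield record["station"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical entry log: one record per passenger entering a station.
records = [
    {"station": "S1", "passenger": "a"},
    {"station": "S2", "passenger": "b"},
    {"station": "S1", "passenger": "c"},
]
entries_per_station = reduce_phase(shuffle(map_phase(records)))
```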
Hadoop vs. Relational DB
(comparison table; dimensions: Data size, Access, Updates, Structure, Integrity, Scaling)
Demo
TLV Railway – Part 2
TLV Railway - Summary
Can now perform advanced analytics on top of
large amounts of data, in a variety of formats
(not just structured, boring data)
Can integrate all the loose ends of data coming
in, with data generated in “Old-School” data
platforms like SQL that is collected from
Line-of-Business applications
We’ve covered data ingestion, responding in
real-time, querying, storing and processing
Azure Stack
Hadoop and OSS vs.
Azure IoT and BigData Ecosystem
Azure Ecosystem        OSS
Event Hubs             Kafka
Stream Analytics       Storm
HDInsight              Hadoop
Map Reduce             Map Reduce
Hive                   Hive
Spark                  Spark
HBase                  HBase
Azure ML               Mahout
Data Factory           Pig
DocumentDB             MongoDB / Couchbase
Data Lake (preview)
Is “TLV Railway” fake?
London did it first
Summary
Get started today at
http://azure.microsoft.com
Questions

Editor's Notes

  • #9 Key goal of slide: IoT is a hot area these days, and a number of players claim to be active in this space. They tend to focus on specific elements of this diagram. Microsoft has the most comprehensive portfolio of cloud services that customers need to develop and deploy end-to-end IoT solutions. Customers are adopting these services and are successfully deploying their solutions today (reference Rockwell, ThyssenKrupp).
    Talk track [Short Version for Sam’s Leadership Session]: As we think about Azure IoT services, Microsoft has the most comprehensive portfolio of cloud services that customers need to develop and deploy end-to-end IoT solutions, ranging from devices that produce data, to connecting them to cloud storage, to driving analytics that yield valuable business insights and allow enterprises to take action.
    Talk track [Long Version Chris’ Breakout Session]: As we think about Azure IoT services, there is a collection of capabilities involved.
    First there are Producers. These can be basic sensors, small form factor devices, traditional computer systems, or even complex assets made up of a number of data sources.
    Next we have the Connect Devices capabilities at the ingress level within and around Azure. The primary destination is Service Bus & Event Hubs, but this relies on client agent technology either at the edge device level or within a field or cloud gateway. We also have capabilities for other external data sources to provide data.
    As data is ingressed to Azure, there are various Storage options, and a number of destinations can be engaged. Traditional database technology, table or blob storage, or even more complex destinations like DocumentDB are possible. External or third-party technologies can also be used. This is where the flexibility and agility of a platform shows its strength, and where analysts like Gartner are forming opinions about just how robust our platform can be.
    As this data is processed in Azure, there are a number of capabilities that can be utilized. Machine Learning, HDInsight and Stream Analytics are examples of tools that can analyze the data in various ways.
    Finally, the concept of Take Actions uses Azure services. Data may populate a LOB portal, be pushed to apps, or be presented in analytics and productivity tools. These are all ways that the data gets out of these architecture points to allow organizations to use analysis to change or transform their business.
    Through all of these areas, there is the possibility of utilizing existing investments, either within your Azure environment or elsewhere.