Enabling Next Gen Analytics with
Azure Data Lake
Microsoft Azure
Microsoft Cloud
Global Trusted Hybrid
Big Data Definition
Big data is high-volume, high-velocity
and/or high-variety information assets
that demand cost-effective, innovative
forms of information processing that
enable enhanced insight, decision
making, and process automation.
– Gartner, Big Data Definition*
* Gartner, Big Data (Stamford, CT.: Gartner, 2016), URL: http://www.gartner.com/it-glossary/big-data/
Big Data as a Cornerstone of Cortana Intelligence
[Diagram: Data → Intelligence → Action]
Data Sources: Apps; Sensors and devices
Information Management: Event Hubs, Data Catalog, Data Factory
Big Data Stores: SQL Data Warehouse, Data Lake Store
Machine Learning and Analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
Intelligence: Cortana, Bot Framework, Cognitive Services
Dashboards & Visualizations: Power BI
Action: People; Automated Systems; Apps (Web, Mobile, Bots)
However, there are challenges to Big Data…
Obtaining skills
and capabilities
Determining how
to get value
Integrating with
existing IT
investments
*Gartner: Survey Analysis – Hadoop Adoption Drivers and Challenges (Stamford, CT.: Gartner, 2015)
Azure
HDInsight
A Cloud Spark and
Hadoop service for the
Enterprise
Reliable with an industry leading SLA
Enterprise-grade security and monitoring
Productive platform for developers and
scientists
Cost effective cloud scale
Integration with leading ISV applications
Easy for administrators to manage
63% lower TCO than deploying your own
Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
HDInsight Application Platform
• One-click deploy experience for installing apps.
• Fully managed PaaS offering.
• Access to entire cluster and secure by default.
• Install apps on new or existing clusters.
• Ease of authoring and deployment.
• Certified partners only.
Hybrid cloud, a reality today
74%
Enterprises believe a
hybrid cloud will enable
business growth¹
82%
Enterprises have a hybrid
cloud strategy, up from 74
percent a year ago²
Workload
requirements
Regulation
Sensitive data
Customization
Latency
Legacy support
Introduction to StreamSets
for Microsoft Azure
Who is StreamSets?
Enterprise Data DNA
StreamSets Mission
Top-tier Investors Commercial Customers Across Verticals
150,000 downloads
⅓ of the Fortune 100
Empower enterprises to harness their data in motion.
Products
StreamSets Dataflow Performance Manager™ (DPM)
StreamSets Data Collector™ (open source)
Strong Partner Ecosystem Open Source Success
StreamSets Solution
Desired Business Outcomes
● Developer & operator
efficiency
● On-time delivery
● Data trust & governance
Data in motion middleware that ensures data trust.
StreamSets Products
StreamSets Data Collector (SDC)
Open source tooling and engine to
build complex any-to-any dataflows.
StreamSets Dataflow Performance Manager (DPM)
Cloud service to
map, measure and master
dataflow operations.
DATAFLOW LIFECYCLE
DEVELOP
● Developers
● Scientists
● Architects
OPERATE
● Operators
● Stewards
● Architects
EVOLVE (Proactive)
REMEDIATE (Reactive)
StreamSets Deployment Models
Install on
Local Machine
Install on
Azure VM
StreamSets Deployment Models
StreamSets and Microsoft Azure
in Use in a Major Bank
The Customer
● Forbes Global 500 financial services company.
● Adopting and moving into the cloud at a rapid pace.
● Growing rapidly both via acquisitions and organic growth.
Key Challenges Related to
Data Movement
● A large number of legacy tools, both customer-built and vendor-built.
● Security policy changes very hard to manage.
● Lack of security governance due to fragmentation of tools and lack of
standardization.
● Difficulty onboarding new data sources as soon as they are created
(technology change).
● Data drift (unexpected changes) very hard to manage at scale.
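To make the data drift challenge concrete, here is a minimal Python sketch of how drift might be detected on a single record. It is an illustration only, not StreamSets code: the `EXPECTED_SCHEMA` fields and the `detect_drift` helper are hypothetical names chosen for this example.

```python
from typing import Any, Dict, List

# Hypothetical expected schema for incoming records (field name -> Python type).
# These field names are assumptions for illustration, not from the customer case.
EXPECTED_SCHEMA: Dict[str, type] = {
    "account_id": str,
    "amount": float,
    "currency": str,
}


def detect_drift(record: Dict[str, Any],
                 schema: Dict[str, type] = EXPECTED_SCHEMA) -> List[str]:
    """Return human-readable drift findings for one record (empty if none)."""
    findings: List[str] = []
    # Fields the schema expects but the record no longer carries.
    for field in sorted(schema.keys() - record.keys()):
        findings.append(f"missing field: {field}")
    # Fields that appeared upstream without notice.
    for field in sorted(record.keys() - schema.keys()):
        findings.append(f"unexpected field: {field}")
    # Fields whose type changed.
    for field in sorted(schema.keys() & record.keys()):
        if not isinstance(record[field], schema[field]):
            findings.append(
                f"type change: {field} expected {schema[field].__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return findings


if __name__ == "__main__":
    drifted = {"account_id": "A1", "amount": "10.0", "ccy": "USD"}
    for finding in detect_drift(drifted):
        print(finding)
```

Running a check like this per record is cheap for one source; the difficulty named above comes from doing it consistently across many sources and tools.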
Key Factors for the Customer to
Consider StreamSets
● KPIs
● Delivery guarantees
● Multiple types of origins and destinations using a single tool.
● Works natively with Microsoft Azure as part of HDInsight or on an Azure
Virtual Machine, or deployed on-premises.
● Visualization of actual data transfers.
● Define security boundaries, actors etc.
● Repeating pattern
Customer’s Business Objectives
● Short-lived compute and long-lived storage (ADLS, Azure Blob), which in
turn gives fine-grained cost control.
● Ability to build a microanalytics framework. For instance, instead of
taking the entire dataset, build small micro datasets, run the microanalytics
framework on them, and derive results faster (faster iteration).
● Move away from traditional Data Lake to Azure Data Lake to manage
cost and scale.
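The "short compute, long storage" objective can be sketched as simple arithmetic: storage is billed all month, while compute is billed only for the hours a cluster exists. The Python below is a rough illustration; every rate and size in it is a made-up placeholder, not an Azure price or a customer figure.

```python
def monthly_cost(compute_hours: float, compute_rate: float,
                 storage_tb: float, storage_rate: float) -> float:
    """Monthly cost = cluster-hours * hourly rate + stored TB * monthly TB rate."""
    return compute_hours * compute_rate + storage_tb * storage_rate


# Placeholder values (assumptions for illustration only, NOT real prices).
COMPUTE_RATE = 4.0    # cost per cluster-hour
STORAGE_RATE = 40.0   # cost per TB per month
STORAGE_TB = 50.0     # data kept in long-lived storage

# Always-on cluster: 24 hours a day for a 30-day month.
always_on = monthly_cost(24 * 30, COMPUTE_RATE, STORAGE_TB, STORAGE_RATE)

# Transient cluster: created for a 2-hour job each day, then deleted.
transient = monthly_cost(2 * 30, COMPUTE_RATE, STORAGE_TB, STORAGE_RATE)

if __name__ == "__main__":
    print(f"always-on: {always_on:.2f}")
    print(f"transient: {transient:.2f}")
```

Because the data stays in storage when the cluster is deleted, the storage term is the same in both cases and only the compute term shrinks.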
Use Cases for StreamSets
Use Cases
1. Data Movement from On-Premises to
Azure Data Lake
2. Consolidating migration tools into
a single tool
3. Building DR for HDInsight Kafka
workloads.
Resources / Q & A
StreamSets Data Collector @ Azure Marketplace
https://azure.microsoft.com/en-us/marketplace/partners/streamsets/streamsets-data-collector/
Ingest Data into Microsoft Azure Data Lake (YouTube)
https://www.youtube.com/watch?v=c1dVnOK7Luw
StreamSets Community
https://streamsets.com/community/
StreamSets Dataflow Performance Manager Product Information
https://streamsets.com/products/dpm/
Thanks!

Enabling Next Gen Analytics with Azure Data Lake and StreamSets
