Enabling Next Gen Analytics with Azure Data Lake and StreamSets
Big data and the cloud are perfect partners for companies that want to unlock maximum value from all of their unstructured, semi-structured, and structured data. The challenge has been how to create and manage a reliable end-to-end solution that spans data ingestion, storage, and analysis in the face of the volume, velocity, and variety of big data sources.
In this webinar, we will show you how to achieve big data bliss by combining StreamSets Data Collector, which specializes in creating and running complex any-to-any dataflows, with Microsoft's Azure Data Lake and Azure analytic solutions.
We will walk through an example of how a major bank is using StreamSets to transport its on-premises data to the Azure Cloud Computing Platform and Azure Data Lake to take advantage of analytics tools with unprecedented scale and performance.
Big Data Definition
Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
– Gartner, Big Data Definition*
* Gartner, Big Data (Stamford, CT.: Gartner, 2016), URL: http://www.gartner.com/it-glossary/big-data/
Big Data as a Cornerstone of
Big Data Stores
Data Lake Store
However, there are challenges to Big Data…
to get value
*Gartner: Survey Analysis – Hadoop Adoption Drivers and Challenges (Stamford, CT.: Gartner, 2015)
A Cloud Spark and Hadoop service for the
Reliable with an industry leading SLA
Enterprise-grade security and monitoring
Productive platform for developers and
Cost effective cloud scale
Integration with leading ISV applications
Easy for administrators to manage
63% lower TCO than deploying your own*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
• One-click deploy experience for installing apps.
• Fully managed PaaS offering.
• Access to entire cluster and secure by default.
• Install apps on new or existing clusters.
• Ease of authoring and deployment.
• Certified partners only.
HDInsight Application Platform
Hybrid cloud, a reality today
Enterprises believe a hybrid cloud will enable
Enterprises have a hybrid cloud strategy, up from 74 percent a year ago
Introduction to StreamSets for Microsoft Azure
Who is StreamSets?
Enterprise Data DNA
Top-tier Investors
Commercial Customers Across Verticals
⅓ of the Fortune 100
Empower enterprises to harness their data in motion.
StreamSets Dataflow Performance Manager™ (DPM)
StreamSets Data Collector™ (open source)
Strong Partner Ecosystem
Open Source Success
Desired Business Outcomes
● Developer & operator
● On-time delivery
● Data trust & governance
Data in motion middleware that ensures data trust.
Data Collector (SDC)
Open source tooling and engine to build complex any-to-any dataflows.
Dataflow Performance Manager (DPM)
Cloud service to map, measure and master
StreamSets Deployment Models
StreamSets and Microsoft Azure in Use in a Major Bank
● Forbes Global 500 financial services company.
● Adopting and moving into the cloud at a rapid pace.
● Growing rapidly both via acquisitions and organic growth.
Key Challenges Related to
● Numerous legacy tools, both customer-built and vendor-built.
● Security policy changes are very hard to manage.
● Lack of security governance due to fragmentation of tools and lack of
● Difficulty onboarding new data sources as soon as they are created.
● Data drift (unexpected changes) is very hard to manage at scale.
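To make the data drift challenge above concrete, here is a minimal sketch of how unexpected structural changes between two batches of records could be flagged. It is an illustration only and does not represent how StreamSets handles drift; the function names, field names, and sample records are all hypothetical.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def infer_schema(records: List[Record]) -> Dict[str, str]:
    """Map each field name to the type name of its first non-null value."""
    schema: Dict[str, str] = {}
    for record in records:
        for field, value in record.items():
            if value is not None and field not in schema:
                schema[field] = type(value).__name__
    return schema


def detect_drift(baseline: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two schemas and report added, removed, and retyped fields."""
    shared = set(baseline) & set(current)
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "retyped": sorted(f for f in shared if baseline[f] != current[f]),
    }


# Hypothetical batches from the same source on two different days.
yesterday = [{"account_id": 1, "amount": 10.5}]
today = [{"account_id": "A-1", "amount": 12.0, "currency": "USD"}]

report = detect_drift(infer_schema(yesterday), infer_schema(today))
print(report)
```

In this sample, the new "currency" field is reported as added and "account_id" as retyped because its values changed from integers to strings. At scale, a check like this would have to run continuously across every source, which is what makes drift hard to manage by hand.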
Key Factors for the Customer to
● Delivery guarantees
● Multiple types of origins and destinations using a single tool.
● Works natively with Microsoft Azure as part of HDInsight or an Azure Virtual Machine, or deployed on-premises.
● Visualization of actual data transfers.
● Define security boundaries, actors, etc.
● Repeating pattern
Customer’s Business Objectives
● Short Compute and Long Storage (ADLS, Azure Blob) in turn fine-grained
● Ability to build a microanalytics framework. For instance, instead of taking the entire dataset, build micro datasets, run microanalytics on them, and derive results faster (faster iteration).
● Move away from a traditional data lake to Azure Data Lake to manage cost and scale.
Use Cases for StreamSets
1. Data Movement from On-Premises to Azure Data Lake
2. Consolidating Migration tools into
3. Building DR for HDInsight Kafka
Resources / Q & A
StreamSets Data Collector @ Azure Marketplace
Ingest Data into Microsoft Azure Data Lake (YouTube)
StreamSets Dataflow Performance Manager Product Information