Andrei Varanovich, InSpark
Lambda Architecture in
the Cloud with Azure
Databricks
#SAISDev6
Selfie
Data & AI Lead
@DrGigabit
andrei.varanovich
andrei.varanovich@inspark.nl
Data
Programmability
Cloud
High-
performance
teams
Neural
Networks
##SAISDev6
Big Data problem is many
small data problems
##SAISDev6
4
2.500.000 visitors per year
8.000 objects of art and history
1.000.000 objects stored from
the year 1200
##SAISDev6
Under the hood
5
Retail
Sponsorship engagement
Occupancy rate
Service Management
Incident Management
Marketing Performance
Warehouse Management
Capacity Planning
Financial Performance
Ticket Sales
##SAISDev6
THE DATA JOURNEY
6##SAISDev6
Consolidation
Insights
Innovation
Consolidate data
in a centralized
store
Organizational
processes and
efficiency
New ideas,
leveraging
machine learning
7
IN THE NEED FOR THE PLATFORM
START
• Begin small and
focused
• Prove value
GROW
• Grow organically
as more use cases
arise
SCALE
• Go production and
scale to the revel
required
We are in the need of the truly elastic data platform, to avoid any upfront planning, deployment and
operations expenses, and put business value discovery first. The platform should support the [big]data
projects in any stage, without the need to reengineer the whole solution.
##SAISDev6
Lambda Architecture on Azure
8
INGEST BATCH
INGEST STREAM STORE ANALYZE
Azure Data
Lake Store
Azure Data Factory
Azure Databricks
(managed Spark;
batch & streaming)
Social
LOB
Graph
IoT
Image
CRM
AI models/ APIs
Cognitive Services
Azure container
Service & registry
Insight sharing
Power BI/ other tools
Event Hubs
Stream
Batch
SECURITY &
MANAGEMENT Azure Log
Analytics
Azure Graph
API
Cost
monitoring
Azure Active Directory
##SAISDev6
9
Optimized Databricks Runtime Engine
DATABRICKS I/O SERVERLESS
Collaborative workspace
Cloud storage
Data warehouses
Hadoop storage
IoT / streaming data
Rest APIs
Machine learning models
BI tools
Data exports
Data warehouses
Azure Databricks
Production jobs & workflows
APACHE SPARK
MULTI-STAGE PIPELINES
DATA ENGINEER
JOB SCHEDULER NOTIFICATION & LOGS
DATA SCIENTIST BUSINESS ANALYST
##SAISDev6
10
Simplicity is the ultimate sophistication
Leonardo da Vinci
##SAISDev6
11
LAMBDA TO THE RESCUE
12
##SAISDev6
##SAISDev6 13
Composition of
functions is applying
one function to the
result of another
14
f(x) = x+1
(g º f)(x) = g(f(x))
(g º f)(x) = (x+1)2
g(x) = x2
input input+1 input2
##SAISDev6
input+1
(input+1)2
15
Transformation pipeline as a series of transitions
s1
f3
f2
f1
sum
Optimized Databricks Runtime Engine
DATABRICKS I/O SERVERLESS
Collaborative workspace
Cloud storage
Data warehouses
Hadoop storage
IoT / streaming data
Rest APIs
Machine learning models
BI tools
Data exports
Data warehouses
Azure Databricks
Production jobs & workflows
APACHE SPARK
MULTI-STAGE PIPELINES
DATA ENGINEER
JOB SCHEDULER NOTIFICATION & LOGS
DATA SCIENTIST BUSINESS ANALYST
##SAISDev6
Conclusions
17
… with proper design, the features come cheaply. This
approach is arduous, but continues to succeed.
—Dennis Ritchie
##SAISDev6
• Standardization on Apache Spark allows us to move forward without
introducing extra complexity.
• 100% PaaS offering is important – no need to maintain the
infrastructure. All components we use offered as PaaS on Azure.
• Data pipelines as function composition allows us to ensure end-to-end
consistency and spot the errors quickly.
• Saving intermediate states allows to quickly inspect the data sets.
Thank you!
18
##SAISDev6
Questions?
19
##SAISDev6

Lambda Architecture in the Cloud with Azure Databricks with Andrei Varanovich

  • 1.
    Andrei Varanovich, InSpark LambdaArchitecture in the Cloud with Azure Databricks #SAISDev6
  • 2.
    Selfie Data & AILead @DrGigabit andrei.varanovich andrei.varanovich@inspark.nl Data Programmability Cloud High- performance teams Neural Networks ##SAISDev6
  • 3.
    Big Data problemis many small data problems ##SAISDev6
  • 4.
    4 2.500.000 visitors peryear 8.000 objects of art and history 1.000.000 objects stored from the year 1200 ##SAISDev6
  • 5.
    Under the hood 5 Retail Sponsorshipengagement Occupancy rate Service Management Incident Management Marketing Performance Warehouse Management Capacity Planning Financial Performance Ticket Sales ##SAISDev6
  • 6.
    THE DATA JOURNEY 6##SAISDev6 Consolidation Insights Innovation Consolidatedata in a centralized store Organizational processes and efficiency New ideas, leveraging machine learning
  • 7.
    7 IN THE NEEDFOR THE PLATFORM START • Begin small and focused • Prove value GROW • Grow organically as more use cases arise SCALE • Go production and scale to the revel required We are in the need of the truly elastic data platform, to avoid any upfront planning, deployment and operations expenses, and put business value discovery first. The platform should support the [big]data projects in any stage, without the need to reengineer the whole solution. ##SAISDev6
  • 8.
    Lambda Architecture onAzure 8 INGEST BATCH INGEST STREAM STORE ANALYZE Azure Data Lake Store Azure Data Factory Azure Databricks (managed Spark; batch & streaming) Social LOB Graph IoT Image CRM AI models/ APIs Cognitive Services Azure container Service & registry Insight sharing Power BI/ other tools Event Hubs Stream Batch SECURITY & MANAGEMENT Azure Log Analytics Azure Graph API Cost monitoring Azure Active Directory ##SAISDev6
  • 9.
    9 Optimized Databricks RuntimeEngine DATABRICKS I/O SERVERLESS Collaborative workspace Cloud storage Data warehouses Hadoop storage IoT / streaming data Rest APIs Machine learning models BI tools Data exports Data warehouses Azure Databricks Production jobs & workflows APACHE SPARK MULTI-STAGE PIPELINES DATA ENGINEER JOB SCHEDULER NOTIFICATION & LOGS DATA SCIENTIST BUSINESS ANALYST ##SAISDev6
  • 10.
    10 Simplicity is theultimate sophistication Leonardo da Vinci ##SAISDev6
  • 11.
  • 12.
    LAMBDA TO THERESCUE 12 ##SAISDev6
  • 13.
    ##SAISDev6 13 Composition of functionsis applying one function to the result of another
  • 14.
    14 f(x) = x+1 (gº f)(x) = g(f(x)) (g º f)(x) = (x+1)2 g(x) = x2 input input+1 input2 ##SAISDev6 input+1 (input+1)2
  • 15.
    15 Transformation pipeline asa series of transitions s1 f3 f2 f1 sum
  • 16.
    Optimized Databricks RuntimeEngine DATABRICKS I/O SERVERLESS Collaborative workspace Cloud storage Data warehouses Hadoop storage IoT / streaming data Rest APIs Machine learning models BI tools Data exports Data warehouses Azure Databricks Production jobs & workflows APACHE SPARK MULTI-STAGE PIPELINES DATA ENGINEER JOB SCHEDULER NOTIFICATION & LOGS DATA SCIENTIST BUSINESS ANALYST ##SAISDev6
  • 17.
    Conclusions 17 … with properdesign, the features come cheaply. This approach is arduous, but continues to succeed. —Dennis Ritchie ##SAISDev6 • Standardization on Apache Spark allows us to move forward without introducing extra complexity. • 100% PaaS offering is important – no need to maintain the infrastructure. All components we use offered as PaaS on Azure. • Data pipelines as function composition allows us to ensure end-to-end consistency and spot the errors quickly. • Saving intermediate states allows to quickly inspect the data sets.
  • 18.
  • 19.