Despite the increased availability of ready-to-use generic tools, more and more enterprises are deciding to build in-house data platforms. This practice, common for some time in research labs and digital-native companies, is now making waves across large enterprises that traditionally used proprietary solutions and outsourced most of their IT. The availability of large volumes of data, coupled with increasingly complex analytical use cases driven by innovations in data science, has made these traditional on-premise architectures obsolete in favor of cloud architectures powered by open source technologies.
Building an in-house platform at a large enterprise comes with challenges of its own: building an architecture that combines the best elements of data lakes and data warehouses to accommodate all kinds of use cases, from BI to ML; the need to interoperate with all of the company's data and technology, including legacy systems; and cultural transformation, including a commitment to adopt agile processes and data-driven approaches.
This presentation describes the success story of building a Lakehouse at LIDL, a successful chain of grocery stores operating in 32 countries worldwide. We will dive into the cloud-based architecture for batch and streaming workloads fed by many different source systems of the enterprise, and into how we applied security to both the architecture and the data. We will detail the creation of a curated Data Lake comprising several layers, from a raw ingestion layer up to a layer that presents cleansed and enriched data to the business units as a kind of Data Marketplace.
A lot of focus and effort went into building a semantic Data Lake as a sustainable and easy-to-use basis for the Lakehouse, as opposed to just dumping source data into it. The first use case applied to the Lakehouse is the Lidl Plus Loyalty Program. It is already deployed to production in 26 countries, with data from more than 30 million customers being analyzed on a daily basis. In parallel to productionizing the Lakehouse, a cultural and organizational change process was undertaken to get all involved units to buy into the new data-driven approach.
8. 3/6/21
What do we need to solve?
Phar Challenges and Goals
Data Consumers
Data Producers
What about?
Number of integrations?
Same Data Consistency?
Same Data Quality?
How to Monitor?
How to Manage Access?
How to share?
Time to market?
9. 3/6/21
Platform and Market All-in-One
Phar Challenges and Goals
Phar Data Platform Data Consumers
Data Producers
10. 3/6/21
Engine
Analytics Democratization
Phar Challenges and Goals
Ingestion
Batch
Real-time
WS1
WS2
WS3
WS4
WS5
Workspaces
Batch Processing
Stream Processing
Streaming Market
Cross Services
Security & Access
Monitoring & Logging
Governance & Lineage
Batch Market
Phar Data Platform Data Consumers
Data Producers
13. 3/6/21
Bringing the Data Platform and the Data Market together
Data Market
WS2
WS1
WS3
L1 NN
RW
Gross Market
Primary Market
Domain Market
Just Historification
Event Based
Use Case Based
17. 3/6/21
Platform as a Service
Governance
Data Hub
Architecture, FW & Tooling, Security, Monitoring
Actionability
Data Market Analytics Workspace
Source Product
Phar Data Platform
Layered Architecture
Phar a Mix of Paradigms
19. 3/6/21
Phar a Mix of Paradigms
Delta Lake Approach
Data
Source
RW L1 NN
Data
Product
20. 3/6/21
Data Product
NN (Data Workspace)
L1 (Data Market)
RW (Data History)
Sources
Data Source
RW (History)
L1 (Market)
NN (Workspace)
Data Product
NN
Phar Data Platform
Product DB
L1
Processing
Integration
Source
Ingestion Consumption
RW
Data Platform Data Product
Phar a Mix of Paradigms
Delta Lake Approach
21. 3/6/21
Data Product
NN (Data Workspace)
L1 (Data Market)
RW (Data History)
Data Ingestion
RW stores all the events as ingested, in raw format, following a contract.
L1 is an explorable market classified by entities, acting as a Semantic Data Lake.
NN is a working space available to the platform customers and their models.
Data ready for BI and Advanced Analytics products.
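The three layers described on this slide can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions, not the platform's implementation: the `Lake` class, the function names, the `type`/`payload` event shape, and the mapping of the `favoritestorechange` event to a `customer` entity are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Lake:
    """In-memory stand-in for the three layers named on the slide (hypothetical)."""
    rw: list = field(default_factory=list)                       # raw events, as ingested
    l1: dict = field(default_factory=lambda: defaultdict(list))  # entity -> records
    nn: dict = field(default_factory=lambda: defaultdict(dict))  # workspace -> datasets


def ingest(lake, event):
    """RW: append a copy of the event unchanged, so history is never rewritten."""
    lake.rw.append(dict(event))


def publish_to_market(lake, entity_of):
    """L1: rebuild the market by classifying raw events by entity.

    Events whose type has no known entity stay in RW only.
    """
    lake.l1.clear()
    for event in lake.rw:
        entity = entity_of.get(event["type"])
        if entity is not None:
            lake.l1[entity].append(event["payload"])


def build_product(lake, workspace, name, entity, transform):
    """NN: a team derives its own dataset from an L1 entity inside its workspace."""
    lake.nn[workspace][name] = [transform(r) for r in lake.l1[entity]]
    return lake.nn[workspace][name]


# Usage with made-up data: two raw events, one of which maps to an entity.
lake = Lake()
ingest(lake, {"type": "favoritestorechange", "payload": {"user": "u1", "store": "s9"}})
ingest(lake, {"type": "unmapped", "payload": {}})
publish_to_market(lake, {"favoritestorechange": "customer"})
product = build_product(lake, "ws1", "stores", "customer", lambda r: r["store"])
```

The point of the separation is that RW keeps everything exactly as received, L1 can be rebuilt from RW at any time, and each workspace only ever writes into its own NN area.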
Phar a Mix of Paradigms
Delta Lake Approach
23. 3/6/21
Phar a Mix of Paradigms
Event-Driven Architecture
BATCH
Landing Zone
REALTIME
Landing Zone
Streaming Source
Batch Source
Data Hub
Messages (Single Event)
Files (Collection of Events)
NN
L1
RW
Data Market
Other Platforms
Data Contract
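The slide shows messages (a single event) and files (a collection of events) both entering the platform under a data contract. A minimal sketch of that idea follows, assuming a contract is simply a set of required fields with expected types. The `Contract` class, the helper names, and the `user_id`/`country` fields are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """Hypothetical contract: required field names and their Python types."""
    datatype: str
    fields: dict  # field name -> expected type


def violations(contract, event):
    """Return a list of problems; an empty list means the event conforms."""
    problems = []
    for name, expected in contract.fields.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected):
            problems.append(f"wrong type for {name}: {type(event[name]).__name__}")
    return problems


def accept_message(contract, event):
    """Real-time path: one message carries a single event."""
    return not violations(contract, event)


def accept_file(contract, events):
    """Batch path: a file is a collection of events; split valid from rejected."""
    valid, rejected = [], []
    for event in events:
        (rejected if violations(contract, event) else valid).append(event)
    return valid, rejected


# Usage with made-up events for the "countrychange" datatype.
contract = Contract("countrychange", {"user_id": str, "country": str})
ok = {"user_id": "u1", "country": "ES"}
bad = {"user_id": 7}
valid, rejected = accept_file(contract, [ok, bad])
```

Using the same `violations` check on both paths means a batch file is treated as nothing more than many single events, which matches the "Collection of Events" label on the slide.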
32. 3/6/21
Data Market Analytics Workspace
Data Hub
Phar as a Lakehouse
A Lakehouse before the Lakehouse
Streaming FW
BATCH
Landing Zone
REALTIME
Landing Zone
NN
L1
RW
NN
L1
Process Consumption
RW
Integration
Validation
Ingestion
BI support
Support for diverse workloads
Batch FW
Schema enforcement and governance
Storage is decoupled from compute
Openness
Support for diverse data types ranging from unstructured to structured data
End-to-end streaming
Transaction support
34. 3/6/21
Semantic Data Lake
What's new in Phar Architecture.
USERTRK
ssouserregisration
ssoclientlegaltermsdeclined
ssouserdelete
ssoclientlegaltermsaccepted
ssouserprofilechange
contactnewloyalty
favoritestorechange
countrychange
deviceuserlanguage
ssonotificationchange
ssouserrecovery
pushchange
notificationchange
RW L1
Source
Data Contracts
Entities by Domains (Semantic Data Lake)
Data Owner
LZ
Datatypes Collection
35. 3/6/21
Data Contracts
What's new in Phar Architecture
Phar Data Platform
T1
T2
T3
T4
TN
Entities
Datatypes
Contract IN
Contract OUT
Data Consumers
Data Producers
36. 3/6/21
Workspaces and Multitenant Metastore
What's new in Phar Architecture.
Diagram: three workspaces, each with NOTEBOOKS, CLUSTERS, JOBS, LOCAL STORAGE and a Data Lake NN Layer; a Data Lake L1 Layer; a HIVE Metastore covering L1 + NN; access modes Read Only and Read + Write.
Schemas Access Control:
- Public
- Private
- Shared
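The access model on this slide can be sketched as two small predicates. The reading used here is an assumption, since the slide only lists labels: L1 is taken to be read-only for every workspace, each NN schema is taken to be writable only by its owning workspace, and Public, Private and Shared are taken to control who else may read it. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Visibility(Enum):
    PUBLIC = "public"    # readable from every workspace
    PRIVATE = "private"  # readable only by the owning workspace
    SHARED = "shared"    # readable by the owner and explicitly listed workspaces


@dataclass
class Schema:
    name: str
    layer: str                      # "L1" or "NN"
    owner: str = ""                 # owning workspace; unused for L1
    visibility: Visibility = Visibility.PRIVATE
    shared_with: frozenset = field(default_factory=frozenset)


def can_read(workspace, schema):
    if schema.layer == "L1":
        return True  # assumed: the market is open for reading to all workspaces
    if workspace == schema.owner or schema.visibility is Visibility.PUBLIC:
        return True
    return schema.visibility is Visibility.SHARED and workspace in schema.shared_with


def can_write(workspace, schema):
    # Assumed: only the owning workspace writes, and only in its own NN layer.
    return schema.layer == "NN" and workspace == schema.owner


# Usage with made-up schemas and workspace names.
market = Schema("customer", "L1")
models = Schema("churn_models", "NN", owner="ws1",
                visibility=Visibility.SHARED, shared_with=frozenset({"ws2"}))
```

Under these assumptions `ws2` can read `churn_models` but not write to it, and no workspace can write to the L1 `customer` schema.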
38. 3/6/21
Numbers
Numbers and Lessons Learned
01 Data from 26 countries
02 Data from more than 30 million users
03 700 TB available in the Market
04 More than 20 analytical teams using it
39. 3/6/21
How did we achieve it?
Numbers and Lessons Learned
Machine Learning
Data Platform
Data Management
Customer Insights
Personalization
MLaaS
Data Model
Data Integration
Data Catalogue
Domain Knowledge
Architecture
Monitoring
Framework & Tooling
Security
Actionability
Governance
Data as Asset
40. 3/6/21
When do we become Data-Driven?
When Producers are also Consumers
Ingest, Storage, Model, Certify, Enable
Describe
Agree
Connect
Secure
Sort
Replica
Context
Structure
Catalogue
Validity
Acceptance
Value
Availability
Access
walkthrough
Validity
Acceptance
Value
Numbers and Lessons Learned
When Analytics starts in the Source