To mesh or mess up your data organisation - Jochem van Grondelle (Prosus/OLX Group)

Jochem van Grondelle, OLX Europe
Prosus Data Speaker Series, Feb 2021
To mesh or... mess up your data organisation

2
Jochem van Grondelle
Data Engineering Manager
@ OLX Group since 2015

3
OLX GROUP IS PART OF PROSUS
A collection of leading
companies and exciting
businesses!

4
PROSUS IS A $120B MARKET CAP. COMPANY
A global internet and entertainment
group and one of the largest
technology investors in the world.
▪ US$120bn market capitalisation
▪ US$44.5bn revenues over the last 3 years
▪ US$3.4bn trading profits over the last 3 years
▪ US$47M average invested in M&A per annum
▪ US$18.3bn FY18-19 revenues
▪ US$3.4bn FY18-19 trading profit
▪ Present in 13 of the top 20 fastest-growing economies*
*IMF World Economic Outlook, based on 2019E GDP growth estimates for countries with a population of over 50 million

5
OLX GROUP TODAY:
THE WORLD'S #1 CLASSIFIEDS BUSINESS
[Figure: OLX Group brand logos grouped into horizontals, real estate verticals, car verticals, other verticals (furniture, heavy machinery, services, jobs) and convenient transactions, labelled by market: global, Europe, Turkey, Russia, UAE, Africa and Philippines, Portugal, Poland, Romania, Egypt, South Africa, India, Latin America and Asia.]

6
WHO WE ARE - OLX GROUP
We are a global product and tech group.
★ +20 brands
★ 15 time zones
★ +10,000 people
★ One mindset
We are a team of 10,000+ ambitious,
curious people building market-leading
trading platforms that empower 300
million people every month to upgrade
their lives.

7
Agenda
1. Challenges in data organizations today
2. What is data mesh?
3. Our data journey
4. Next steps and challenges

8
Concepts in this presentation are based on the data mesh
architecture abstracted and promoted by Zhamak Dehghani

Challenges in data
organizations today

10
The biggest challenge with big data is…
Complexity

11
There has been a revolution in how operational applications
are being run
▪ Over the last 20 years there has been a continuous trend away from the
monolith and towards distributed, domain-driven architectures

12
However, data engineers often lag behind by ingesting all that
data into one central data lake - the biggest monolith of them all
▪ The original data warehousing approach took data from many different,
complex domains and put it all into one big, fat database.
▪ As volume and complexity grew, architectures evolved into the data lake:
skip the up-front modeling, just get the data out of the operational systems
and bring it into one big, fat data lake in its original form.
https://martinfowler.com/articles/data-monolith-to-mesh.html

13
The data team responsible for storing big data is mostly
disconnected from consumers trying to make sense of that data.
Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture

14
Data teams are trying to break down their architecture by
functional areas

15
However, these data engineers are still siloed in between the
world of operational systems and the world of consumers

16
Have pity on your data engineers
▪ They often depend on teams that have
no incentive to provide meaningful,
truthful and correct data
▪ They have little understanding of the
source domains that generate the data
and lack domain expertise in their
teams
▪ They need to provide data for a diverse
set of needs without access to the
consuming domains' experts

17
In summary: We need to revolutionize our data strategy.
How can we apply domain-driven architecture to data?
▪ The cycle of innovation requires constant
adaptation to the data.
▪ This centralized system simply doesn't
scale.
▪ It has divided the work based on the
technical operation, implemented by one
or more silos of data engineers.

19
Data mesh sets a foundation for getting value from analytical
data at scale – using 4 principles.
▪ Domain-oriented decentralized data ownership and architecture
▪ Data as a product
▪ Self-service data infrastructure as a platform
▪ Federated governance

20
Principle 1: Domain-oriented decentralized data ownership
▪ Although DDD has influenced modern
architectural thinking, the notion of business
domains has been disregarded in data.
▪ All raw data is in the lake, but there is often no
clear separation of business domains.
▪ Rather than merely ingesting raw data from
domains into a centrally owned data lake,
domains need to own and serve their
domain datasets in an easily consumable
way.

21
Domain-driven design moves from a ‘big ball of mud’…
“A BIG BALL OF MUD is haphazardly
structured, sprawling, sloppy, duct-tape and
bailing wire, spaghetti code jungle. We’ve all
seen them. These systems show
unmistakable signs of unregulated growth,
and repeated, expedient repair. Information is
shared promiscuously among distant
elements of the system, often to the point
where nearly all the important information
becomes global or duplicated. The overall
structure of the system may never have been
well defined. If it was, it may have eroded
beyond recognition.”

22
…to contextual models
Great article about DDD: https://medium.com/raa-labs/part-1-domain-driven-design-like-a-pro-f9e78d081f10
▪ Focused
▪ Small
▪ Decoupled
▪ Easy to change
▪ Enables autonomy
▪ Ubiquitous language

23
Some domains are more source oriented while other domains
are more consumer oriented.
Domains aligned with the source: chat messages, browsing interactions, deliveries, customer support tickets
Domains aligned with consumption: item recommendations, user segmentation, fraud detection

24
Principle 2: Data as a product
▪ Domain data teams must consider their data assets as their products
and the rest of the organization as their customers.
– Discoverable
– Addressable
– Trustworthy and truthful
– Self-describing semantics and syntax
– Interoperable and governed by global standards
Design: “Build what matters”
Marketing: “Tell people about it”
Engineering: “Ship it!”
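To make the product qualities on this slide concrete, here is a minimal sketch of a data product descriptor that can report which qualities it does not yet meet. The class, its fields, the S3-style address and the chat example are all hypothetical illustrations, not part of OLX's platform or of any data mesh standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataProduct:
    """Hypothetical descriptor for a domain data product (illustrative only)."""
    name: str                      # discoverable: the name it is listed under
    owner_domain: str              # the domain team accountable for the product
    address: str                   # addressable: a stable location for consumers
    description: str = ""          # self-describing semantics
    schema: Dict[str, str] = field(default_factory=dict)  # self-describing syntax
    freshness_sla_minutes: Optional[int] = None  # trustworthy: a published guarantee

    def missing_qualities(self) -> List[str]:
        """Return the product qualities this descriptor does not yet satisfy."""
        missing = []
        if not self.name or not self.owner_domain:
            missing.append("discoverable")
        if not self.address:
            missing.append("addressable")
        if self.freshness_sla_minutes is None:
            missing.append("trustworthy")
        if not self.description or not self.schema:
            missing.append("self-describing")
        return missing


# Made-up example: the chat domain publishes its messages as a product.
chat_messages = DataProduct(
    name="chat_messages",
    owner_domain="chat",
    address="s3://example-bucket/chat/messages/v1/",  # assumed location
    description="One row per chat message sent between a buyer and a seller.",
    schema={"message_id": "string", "sender_user_id": "string", "sent_at": "timestamp"},
)
print(chat_messages.missing_qualities())
```

A check like this could run when a product is registered, so that gaps are visible to the domain data product owner before consumers find them.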

25
Establish the responsibility of domain data product owner –
which could simply be an additional hat to any type of engineer
▪ Makes decisions around the vision and the roadmap for the data products
▪ Concerns themselves with satisfaction of their consumers
▪ Continuously measures and improves the quality of the data
▪ Responsible for the lifecycle of the domain datasets
▪ Defines success criteria and business-aligned metrics

26
Principle 3: Self-service data infra as a platform
▪ High-level abstraction of infrastructure enables teams to autonomously own
their data products
▪ Must include tooling that supports a developer’s workflow of creating,
maintaining and running data products with less specialized knowledge
than existing technologies assume
▪ Domain agnostic
▪ Hides underlying complexity and is designed in a self-service manner
▪ But: Treat domain data ownership as the primary concern, and tooling and
pipelines as secondary

27
The self-service data infra platform can include many generic
elements aimed at making domain data producers more efficient
– Data product versioning
– Data product schema
– Unified data access control and logging
– Data pipeline implementation and orchestration
– Data product discovery, catalog registration and publishing
– Data governance and standardization
– Data product lineage
– Data product monitoring/alerting/logging
– Data product quality metrics (collection and sharing)
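As an illustration of the last element, quality metrics collection, the sketch below computes two simple metrics over a batch of records. The function name, the metric definitions and the sample delivery records are assumptions made for this example; a real platform would define its own metrics and publish them alongside the data product.

```python
def quality_metrics(rows, key, required):
    """Compute simple, illustrative quality metrics for a list of dict records.

    completeness: share of rows where every required column is non-null.
    duplicate_key_rate: share of rows that repeat an already-seen key value.
    """
    total = len(rows)
    if total == 0:
        return {"row_count": 0, "completeness": None, "duplicate_key_rate": None}
    complete = sum(
        1 for row in rows if all(row.get(col) is not None for col in required)
    )
    distinct_keys = len({row.get(key) for row in rows})
    return {
        "row_count": total,
        "completeness": complete / total,
        "duplicate_key_rate": (total - distinct_keys) / total,
    }


# Made-up sample batch from a hypothetical deliveries domain.
deliveries = [
    {"delivery_id": "d1", "status": "shipped", "delivered_at": None},
    {"delivery_id": "d2", "status": "delivered", "delivered_at": "2021-02-01"},
    {"delivery_id": "d2", "status": "delivered", "delivered_at": "2021-02-01"},
    {"delivery_id": "d3", "status": None, "delivered_at": None},
]
metrics = quality_metrics(
    deliveries, key="delivery_id", required=["delivery_id", "status"]
)
print(metrics)
```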

28
Principle 4: (Computational) federated governance
▪ Independent data products need to
interoperate through global standardization
▪ Naming conventions, identifiers, nulls
▪ It is an art to find a balance between what shall
be standardized globally, and what shall be left
to the domains to decide.
– For example, the semantics of ‘chat replies’
could be left to the chat team
– However, a ‘buyer’, as a population of
‘users’, is a global concern.
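The "computational" part of federated governance can be pictured as global rules that are checked automatically against each domain's schema. The sketch below is one possible shape for such a check; the snake_case naming rule, the set of globally governed identifiers and their types, and the chat schema are all assumptions invented for illustration.

```python
import re

# Assumed global standards: column names are snake_case, and a few identifiers
# shared across domains must use one agreed type everywhere.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
GLOBAL_IDENTIFIERS = {"user_id": "string", "listing_id": "string"}


def check_schema(schema):
    """Return a list of violations of the (assumed) global standards."""
    violations = []
    for column, col_type in schema.items():
        if not SNAKE_CASE.match(column):
            violations.append(f"{column}: name is not snake_case")
        expected = GLOBAL_IDENTIFIERS.get(column)
        if expected is not None and col_type != expected:
            violations.append(
                f"{column}: type is {col_type}, global standard is {expected}"
            )
    return violations


# The meaning of reply_latency_s stays with the chat team; user_id does not.
chat_replies = {"user_id": "int", "replyText": "string", "reply_latency_s": "double"}
for violation in check_schema(chat_replies):
    print(violation)
```

Only the shared identifier and the naming rule are enforced here; everything else about the dataset remains a decision for the owning domain.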

29
In summary: The “great divide”.
When it still looked simple.

30
In summary: The “greater divide”.
When it was still manageable.

31
In summary: The “best divide”.
When we thought we could still handle it.

32
In summary: Data mesh
The paradigm shift
Source: Data mesh 101 – Everything you need to know to get started

34
A mature data infrastructure is in place and managed globally for
OLX Group to serve, ingest and consume data
Services across COLLECTION, GOVERNANCE and CONSUMPTION:
▪ Ninja: set of libraries that unify the tracking integration
▪ Hydra: a platform-agnostic raw data HTTP collector
▪ Lazarus: service that synchronises the data from databases
▪ Cerberus: a service to process real-time events
▪ Data Lake: data storage compliant with data protection laws
▪ Reservoir: a dedicated and reduced data storage for a specific purpose
▪ Schema Management: easy querying via Athena, Spectrum and Presto
▪ Catalog: self-service tool for data management, consumption and discovery
▪ Laquesis: self-service tool for running experiments, feature flags and surveys
▪ Odyn: operational data hub consisting of a scheduler, operators and storage
▪ KaaS: packaged solution for Apache Kylin usage
▪ DAPI: generic and scalable data API
Data sources: databases, user devices, microservices
Consumers: data engineers, analysts, data scientists, product managers (analytics, machine learning, real-time processing)
Data latency: real time, 5 minutes, 5 minutes - 1 hour

35
Odyn is the operational data hub consisting of a scheduler,
extendable set of operators and storage
▪ Scheduler: ETL-as-code platform with an advanced dependency management
system. Odyn decides which tasks should run, when and where, to achieve
maximum efficiency. It supports templating, step expansion and notifications.
▪ Operators: Odyn makes it easy to transform and query data using SQL.
Out of the box it supports Athena, Presto, Hive, Redshift, AWS Batch, Kylin
and Spark. It can easily be extended with custom operators written in Python.
▪ Storage: Odyn provides cheap and reliable storage as part of the Data
Reservoir. It contains a built-in solution for compliance with all the data
protection laws. It is fully integrated with other data services.
▪ Operational Data Hub: one place for data access, so that many point-to-point
connections between callers and data suppliers do not need to be made.
Odyn allows blazing fast data processing as well as collaboration and
sharing of datasets between users.
180 users
200 active DAGs
11K daily tasks
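The slide does not show how Odyn resolves task dependencies, so the following is only a generic sketch of dependency-aware scheduling using Python's standard `graphlib` module, not Odyn's actual API. The task names and the dependency graph are invented for illustration; each "wave" contains tasks whose upstream dependencies have all finished and which could therefore run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each task maps to the set of tasks it depends on.
dependencies = {
    "load_raw_events": set(),
    "load_raw_listings": set(),
    "sessionize_events": {"load_raw_events"},
    "daily_listing_report": {"sessionize_events", "load_raw_listings"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

waves = []
while sorter.is_active():
    # Tasks returned here have no unfinished upstream dependencies.
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    # A real scheduler would execute the tasks before marking them done.
    sorter.done(*ready)

for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

A production scheduler additionally has to decide when and where each task runs, handle retries and send notifications; this sketch covers the ordering step only.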

36
OLX Europe’s data team leverages the global data infra and
provides additional data products adapted to regional needs
EU Data
▪ Build and maintain a best-in-class data analytics platform that enables
easy, timely, fast data discovery and consumption
▪ Enable top-notch data mastery in our company by providing the right training
and coaching to our team and our users
▪ Assure top quality of prepared data for the right purpose at the right time,
following product and business strategy
▪ Solve new problems in innovative and pragmatic ways, grabbing
opportunities for quick value delivery
More data played back to our external customers
Data-driven product development lifecycle
Business decisions fueled by data

37
For example, Sherlock as part of the analytical data platform
enables our users to discover, understand, and explore datasets

38
Our internal data academy helps anyone in OLX get more
familiar with data concepts, technologies and our platforms

39
Finally, Yamato is a large AWS Redshift database empowering
many data teams to process data blazingly fast in an easy way
250 users
150K queries daily
16 x dc2.8xlarge nodes - 40 TB

40
Some of our consumer-oriented data teams are already
checking all the boxes for data mesh
▪ Customer support
▪ Sales CRM integration
▪ Recommendations
▪ Customer classification

41
Challenges in data teams
▪ Data teams are often a bottleneck and cannot keep up
with the pace of product development
▪ Data teams are the go-to point for expertise about domain
data, although they are not the domain specialists
▪ Lack of governance across data teams
▪ Duplication of work and/or reinventing the wheel
▪ Lack of software engineering principles in data processing

42
Challenges in operational teams in OLX
▪ Technical design for new features does not include requirements for analytics,
experimentation and machine learning projects, so data needs are not always
covered from the start
▪ Data ownership is mostly limited to what is required to run a feature
▪ Operational teams are not always aware of the value of data

43
So, although the doors are now open across the data
organization, there is still a divide between data and tech.

45
However, many things are going well - and we are getting ready
for the paradigm shift!
▪ Self-service data infrastructure is mostly in place already
▪ Data publication is possible for anyone
▪ GDPR, security and access governance are in place
▪ There is growing acceptance of domain data ownership in operational teams
▪ We already have hundreds of product managers, so no lack of product thinking

47
Let’s remind ourselves of the 4 principles of data mesh – and
first assess what is in place.
▪ Domain-oriented decentralized data ownership and architecture
▪ Data as a product
▪ Self-service data infra as a platform

48
Start small: Pilot partial embedding of domain data engineers in
a few selected operational teams on a project basis
▪ Data engineers will ensure that operational teams integrate data
requirements into the design of new features
▪ Data engineers will set the foundation for domain datasets, after which
ownership remains in these operational teams
▪ Data engineers will team up with product managers
▪ Data engineers will learn from software engineers
and adopt software engineering best practices
▪ Data engineers will facilitate training
▪ Ensure engineering leaders are on board!

49
Make both sides aware that this is a win-win situation!
▪ Data engineers often lack standard software
engineering practices when it comes to building
data assets.
▪ Software engineers who are building operational
systems often have no experience with data
engineering tool sets, or even with the
concept of ‘datasets’.
▪ Removing the skill set silos will lead to the
creation of a larger and deeper pool of data
engineering skills available to the
organization!

50
Meanwhile, start mapping out the major domains, identify
ownership and develop a data maturity framework by domain

51
Define how to measure success and set a baseline
Data as a product and self-service data infra, measured by:
▪ Data quality
▪ Domain-data
▪ Data consumer needs covered
▪ Documented and useful datasets
▪ Data discoverability
▪ Usage
▪ Satisfaction
▪ Speed/reliability
▪ Skill levels
▪ Ease of use
▪ Risk & Governance
▪ Cost & Compliance
▪ Ubiquitous language
52
It sounds too good to be true. What is the fine print?
▪ Data mesh is primarily about mindset and
organization; technology comes second
▪ Success depends on converging operational and
data roles – the organization needs to be ready
▪ The organization needs to be big enough to benefit
▪ Data mesh is a vision that needs to be tailored to
your organization – there is no plug-and-play solution
▪ The data lake can still exist in this architecture, but
it becomes just another node in the mesh, rather
than being the centerpiece.

53
You are not alone! Other companies are taking steps towards a
data mesh architecture – and a learning community is live
▪ https://launchpass.com/data-mesh-learning

54
Jochem van Grondelle
Data Engineering Manager
linkedin.com/in/jochemvangrondelle
jochem.vangrondelle@olx.com
Thank you! Feel free to reach out to discuss further
