Data Mesh @ Yelp - 2019

Yelp’s Mission
Connecting people with great local businesses

Who am I?
My name is Steven, and my preferred pronoun is “he”
I graduated from UC Berkeley EECS in 2005
This is my second term at Yelp (2017 - now)
My previous term was 2011 - 2015
I consider myself a generalist in the field

Who am I?
I work on the metrics-data team within metrics-platform

Data powers decision making
OnLine Transaction Processing (OLTP)
We use MySQL to power yelp.com
Each transaction interacts with a small amount of data
e.g. display the reviews, photos, and tips of a business
OLTP queries are expected to return results quickly
No one wants to wait more than 2 seconds for a business page to load

OLTP example: find the titles an author has written. Take advantage of an index.
https://en.wikipedia.org/wiki/Library_catalog#/media/File:Schlagwortkatalog.jpg

Data powers decision making
Developers want to find out which local business has the most reviews
Table scan on the review table?
OnLine Analytical Processing (OLAP)
Queries that scan the majority of the stored data
Need a specialized system to support such queries
Yelp uses AWS Redshift as a data warehouse to support OLAP queries
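
A minimal sketch of the two query shapes, using an in-memory SQLite database purely for illustration (the talk names MySQL and Redshift; the `review` table, its columns, and the sample rows here are hypothetical):

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE review (
        id INTEGER PRIMARY KEY,
        business_id INTEGER NOT NULL,
        stars INTEGER NOT NULL
    );
    CREATE INDEX idx_review_business ON review (business_id);
""")
rows = [(1, 5), (1, 4), (2, 3), (3, 5), (3, 2), (3, 4)]
conn.executemany("INSERT INTO review (business_id, stars) VALUES (?, ?)", rows)

# OLTP-style: fetch the reviews of one business.
# The index on business_id lets the database touch only the matching rows.
reviews_for_business = conn.execute(
    "SELECT stars FROM review WHERE business_id = ? ORDER BY id", (1,)
).fetchall()

# OLAP-style: which business has the most reviews?
# Every review has to be counted, so the whole table (or index) is visited.
most_reviewed = conn.execute(
    "SELECT business_id, COUNT(*) AS n FROM review "
    "GROUP BY business_id ORDER BY n DESC LIMIT 1"
).fetchone()

print(reviews_for_business, most_reviewed)
```

The first query stays cheap as the table grows; the cost of the second grows with the total number of reviews, which is why it tends to be moved to a separate analytical system.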

OLAP example: average number of pages in a book stored inside the main stacks. Need to scan all the titles.
https://www.dailycal.org/2013/12/08/best-worst-foods-sneak-main-stacks/

Data Fabric
We want to avoid n * m programs to transport data
n is the number of sources, and m is the number of sinks
Domain-specific data stores are here to stay
Stonebraker, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone”
Stream-Table Duality
We can formulate the transport of data as streams

https://docs.confluent.io/current/streams/concepts.html
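
One way to picture stream-table duality is the sketch below. It assumes a stream is a list of (key, value) change events and a table is a plain dict; the function names and the `biz-*` keys are hypothetical and are not taken from any Kafka API.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# A change event is (key, new_value); a value of None means "delete this key".
ChangeEvent = Tuple[str, Optional[int]]


def table_from_stream(events: Iterable[ChangeEvent],
                      start: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Replay a changelog stream to materialize the latest value per key."""
    table = dict(start or {})
    for key, value in events:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


def stream_from_table_changes(old: Dict[str, int],
                              new: Dict[str, int]) -> List[ChangeEvent]:
    """Describe the difference between two table snapshots as change events."""
    events: List[ChangeEvent] = []
    for key in sorted(old.keys() | new.keys()):
        if key not in new:
            events.append((key, None))
        elif old.get(key) != new[key]:
            events.append((key, new[key]))
    return events


changelog = [("biz-1", 10), ("biz-2", 3), ("biz-1", 11), ("biz-2", None)]
snapshot = table_from_stream(changelog)
print(snapshot)
```

Replaying the events produced by `stream_from_table_changes(old, new)` on top of `old` gives back `new`, which is the sense in which moving a table between stores can be expressed as moving a stream.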

Image source: https://images-na.ssl-images-amazon.com/images/I/71UfEHhZ2uL._SL1000_.jpg

Benefits
Connector Ecosystem
Lower the barrier to entry
It’s easy to move data between data stores
High-performance implementation
Each data store has its own performance characteristics.
Stream processing over batch processing
Near real-time data availability

Image source: https://images-na.ssl-images-amazon.com/images/I/71GmEqny4NL._SL1000_.jpg

Lessons Learned
Connector Ecosystem
Schematized data is good
Lessens the likelihood of malformed data
Schema evolution can be difficult
Making incompatible schema changes can break many things. Discourage them in the registration phase.
Decouple data producers and data consumers
We need automation to inform data producers how to manage the data life cycle, as producers do not think about who uses the data.
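
A rough sketch of what discouraging incompatible changes at registration time could look like. The schema representation, the compatibility rules, and the `register` helper are all simplifying assumptions for illustration, not Yelp's actual registration logic.

```python
from typing import Dict, List

# Hypothetical schema shape: field name -> {"type": ..., optional "default": ...}
Schema = Dict[str, dict]


def incompatible_changes(old: Schema, new: Schema) -> List[str]:
    """List changes that could break existing producers or consumers.

    Assumed rules: removing a field, changing a field's type, or adding
    a field without a default value counts as incompatible.
    """
    problems: List[str] = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"field '{name}' was removed")
        elif new[name]["type"] != spec["type"]:
            problems.append(f"field '{name}' changed type")
    for name, spec in new.items():
        if name not in old and "default" not in spec:
            problems.append(f"new field '{name}' has no default")
    return problems


def register(registry: Dict[str, Schema], topic: str, new: Schema) -> None:
    """Accept a schema only if it is compatible with the registered one."""
    old = registry.get(topic)
    if old is not None:
        problems = incompatible_changes(old, new)
        if problems:
            raise ValueError("; ".join(problems))
    registry[topic] = new


registry: Dict[str, Schema] = {}
v1 = {"review_id": {"type": "int"}, "stars": {"type": "int"}}
v2 = {**v1, "language": {"type": "string", "default": "en"}}
register(registry, "review", v1)
register(registry, "review", v2)  # accepted: the new field has a default
```

Rejecting the change before anything is published keeps the failure with the producer who made it, rather than with the consumers downstream.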

Image source: https://i.ytimg.com/vi/03y8DJrzzjA/maxresdefault.jpg

Desirable Improvements
Data producers should own their data life cycle
A specific connector owner does not have visibility into data semantics.
Data consumers are stakeholders
Consumers don’t want to find out about incompatible changes after they have been rolled out.
Self-serve mechanism accelerates changes
The only way to evolve rapidly is to self-serve

Data Mesh
Data specifications are like microservice APIs
They are contracts between producers and consumers
Each team owns its data specifications
To avoid accidental abstraction leakage
Decentralization allows rapid experiments
Common conventions are promoted to minimize friction among different domain systems

https://martinfowler.com/articles/data-monolith-to-mesh.html
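
To make the "data specification as a contract" idea concrete, here is a small sketch. The `DataSpec` class, its fields, and the team and stream names are hypothetical; the point is only that the owning team publishes the spec and both sides can check records against it.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class DataSpec:
    """A hypothetical data specification published by the owning team."""
    name: str
    owner: str                 # team that owns the data and its life cycle
    fields: Dict[str, type]    # field name -> expected Python type

    def violations(self, record: Dict[str, Any]) -> List[str]:
        """Return the ways a record breaks the contract (empty if none)."""
        found: List[str] = []
        for field_name, expected in self.fields.items():
            if field_name not in record:
                found.append(f"missing field '{field_name}'")
            elif not isinstance(record[field_name], expected):
                found.append(f"field '{field_name}' is not {expected.__name__}")
        return found


spec = DataSpec(
    name="review_created",
    owner="reviews-team",
    fields={"review_id": int, "stars": int},
)
print(spec.violations({"review_id": 1, "stars": 5}))
print(spec.violations({"review_id": "abc"}))
```

Much like a microservice API, consumers depend only on what the spec declares, so the owning team stays free to change everything behind it.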

yelp.com/dataset_challenge
Academic dataset from 10 cities across the globe!
Your academic project, research or visualizations submitted by December 31, 2019 = a $5,000 prize*!
*See full terms on website
6M reviews
1M business attributes
190K businesses
200K photos

Questions/Suggestions?
smoy@yelp.com
