Data Mesh @ Yelp - 2019

At Yelp, we have operated a connector ecosystem to feed vital data to domain-specific teams and data stores. We share some of what we have learned from operating such a system, and touch on the next phase of its evolution.

  1. Data Mesh @ Yelp, Sep 12, 2018
  2. Yelp’s Mission: Connecting people with great local businesses
  3. Who am I? My name is Steven; my preferred pronoun is “he”. I graduated from UC Berkeley EECS in 2005. This is my second term at Yelp (2017 - now); my first term was 2011 - 2015. I consider myself a generalist in the field.
  4. Who am I? I work on the metrics-data team within metrics-platform.
  5. Who am I? I work on the metrics-data team within metrics-platform.
  6. Data powers decision making. OnLine Transaction Processing (OLTP): We use MySQL to power yelp.com. Each transaction interacts with a small amount of data: display the reviews, photos, and tips of a business. OLTP query results are expected to return quickly: no one wants to wait more than 2 seconds for a business page to load.
  7. OLTP example: find the titles an author has written. Take advantage of an index. https://en.wikipedia.org/wiki/Library_catalog#/media/File:Schlagwortkatalog.jpg
  8. Data powers decision making. Developers want to find out which local business has the most reviews. Table scan on the review table? OnLine Analytical Processing (OLAP): queries that scan a majority of the stored data. A specialized system is needed to support such queries. Yelp uses AWS Redshift as a data warehouse to support OLAP queries.
  9. OLAP example: the average number of pages per book stored in the main stacks. Need to scan all the titles. https://www.dailycal.org/2013/12/08/best-worst-foods-sneak-main-stacks/
  10. More throughput, lower latency
  11. More throughput, lower latency
  12. Data Fabric. We want to avoid writing n * m programs to transport data, where n is the number of sources and m is the number of sinks. Domain-specific data stores are here to stay (Stonebraker, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone”). Stream-Table Duality: we can formulate the transport of data as streams.
  13. https://docs.confluent.io/current/streams/concepts.html
  14. https://docs.confluent.io/current/streams/concepts.html
  15. Image source: https://images-na.ssl-images-amazon.com/images/I/71UfEHhZ2uL._SL1000_.jpg
  16. Benefits: Connector Ecosystem. Lower barrier to entry: it’s easy to move data between data stores. High-performance implementation: each data store has its own performance characteristics. Stream processing over batch processing: near real-time data availability.
  17. Image source: https://images-na.ssl-images-amazon.com/images/I/71GmEqny4NL._SL1000_.jpg
  18. Lessons Learned: Connector Ecosystem. Schematized data is good: it lessens the likelihood of malformed data. Schema evolution can be difficult: making incompatible schema changes can break many things, so discourage them in the registration phase. Decouple data producers and data consumers: we need automation to inform data producers how to manage the data life cycle, as producers do not think about who uses the data.
  19. Image source: https://i.ytimg.com/vi/03y8DJrzzjA/maxresdefault.jpg
  20. Desirable Improvements. Data producers should own their data life cycle: a specific connector owner does not have visibility into data semantics. Data consumers are stakeholders: consumers don’t want to find out about incompatible changes after they’ve been rolled out. Self-serve mechanisms accelerate changes: the only way to evolve rapidly is to self-serve.
  21. Data Mesh. Data specifications are like microservice APIs: they are contracts between producers and consumers. Each team owns its data specifications, to avoid accidental abstraction leakage. Decentralization allows rapid experiments: common conventions are promoted to minimize friction among different domain systems.
  22. https://martinfowler.com/articles/data-monolith-to-mesh.html
  23. yelp.com/dataset_challenge: Academic dataset from 10 cities across the globe! Your academic project, research, or visualizations submitted by December 31, 2019 = a $5,000 prize*! (*See full terms on website.) 6M reviews, 1M business attributes, 190K businesses, 200K photos.
  24. Questions/Suggestions? smoy@yelp.com
  25. Thank you.
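
The two ideas on the Data Fabric slide (slide 12) can be illustrated with a small Python sketch. This is not Yelp's implementation: the function names, the event shape (a key plus a value, with `None` meaning "deleted"), and the sample business records are all assumptions made for illustration. The first pair of functions compares the number of transport programs needed with and without a shared stream; the second pair shows stream-table duality, where a table is rebuilt by replaying a changelog and a changelog is derived from two table snapshots.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical change event: (key, value). A value of None means the row was deleted.
ChangeEvent = Tuple[str, Optional[dict]]


def point_to_point_programs(n_sources: int, m_sinks: int) -> int:
    """One bespoke transport program per (source, sink) pair."""
    return n_sources * m_sinks


def connector_programs(n_sources: int, m_sinks: int) -> int:
    """One connector per source plus one per sink, all sharing the stream."""
    return n_sources + m_sinks


def table_from_stream(events: Iterable[ChangeEvent]) -> Dict[str, dict]:
    """Replay a changelog in order; the latest event for each key wins."""
    table: Dict[str, dict] = {}
    for key, value in events:
        if value is None:
            table.pop(key, None)  # deletion event
        else:
            table[key] = value    # insert or update
    return table


def stream_from_table(old: Dict[str, dict], new: Dict[str, dict]) -> List[ChangeEvent]:
    """Emit the change events that turn snapshot `old` into snapshot `new`."""
    events: List[ChangeEvent] = []
    for key in sorted(old.keys() - new.keys()):
        events.append((key, None))
    for key, value in new.items():
        if old.get(key) != value:
            events.append((key, value))
    return events


# Made-up sample data, for illustration only.
changelog: List[ChangeEvent] = [
    ("biz-1", {"name": "Cafe A", "reviews": 10}),
    ("biz-2", {"name": "Diner B", "reviews": 3}),
    ("biz-1", {"name": "Cafe A", "reviews": 11}),
    ("biz-2", None),
]
table = table_from_stream(changelog)
```

With 4 sources and 5 sinks, the point-to-point approach needs 20 programs while the connector approach needs 9. Replaying `changelog` leaves only the latest state of `biz-1` in `table`, and feeding `stream_from_table({}, table)` back through `table_from_stream` reproduces the same table.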

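Slide 18 suggests discouraging incompatible schema changes in the registration phase. The sketch below shows one way such a gate could look. It is not Yelp's registry and does not reproduce the rules of any real schema system: the `Field` class, the `register` function, the three rules (no removed fields, no type changes, new fields need a default), and the sample `review` schemas are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

_NO_DEFAULT = object()  # sentinel meaning "this field has no default value"


@dataclass(frozen=True)
class Field:
    """Hypothetical field description: a type name and an optional default."""
    type: str
    default: Any = _NO_DEFAULT

    @property
    def has_default(self) -> bool:
        return self.default is not _NO_DEFAULT


Schema = Dict[str, Field]


def incompatible_changes(old: Schema, new: Schema) -> List[str]:
    """List the changes in `new` that could break consumers of `old`."""
    problems: List[str] = []
    for name, old_field in old.items():
        if name not in new:
            problems.append(f"field '{name}' was removed")
        elif new[name].type != old_field.type:
            problems.append(
                f"field '{name}' changed type {old_field.type} -> {new[name].type}"
            )
    for name, new_field in new.items():
        if name not in old and not new_field.has_default:
            problems.append(f"new field '{name}' has no default")
    return problems


def register(registry: Dict[str, Schema], topic: str, new: Schema) -> None:
    """Accept `new` only if it is compatible with the registered schema."""
    old = registry.get(topic)
    if old is not None:
        problems = incompatible_changes(old, new)
        if problems:
            raise ValueError(f"incompatible schema for {topic}: " + "; ".join(problems))
    registry[topic] = new


# Made-up schemas, for illustration only.
registry: Dict[str, Schema] = {}
v1 = {"business_id": Field("string"), "stars": Field("int")}
v2 = {**v1, "text": Field("string", default="")}           # adds a field with a default
v3 = {"business_id": Field("string"), "stars": Field("float")}  # retypes and removes

register(registry, "review", v1)
register(registry, "review", v2)  # accepted: compatible with v1
```

Calling `register(registry, "review", v3)` at this point raises `ValueError`, because `v3` changes the type of `stars` and drops `text`. Rejecting the change at registration means the producer learns about the problem before any consumer does.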