Choosing the Right Database - Facebook DevC Malang Hackdays 2017

Published in: Technology

  1. Choosing the Right Database. Rendy B. Junior, Data Eng Lead @ Traveloka. Facebook DevC Malang Hackdays 2017
  2. About Traveloka
     Traveloka is a leading Southeast Asia technology company that provides a wide range of travel needs in one platform. Its app has been downloaded more than 20 million times, making it one of the most popular travel booking apps in the region.
     Rendy B Junior, Data Engineering Lead, Traveloka. Has been through the revolution of the Traveloka data pipeline since the early phase of the organization. Established both batch and real-time data platforms, which power organization insights and tens of data-intensive application use cases. Managed to handle 10,000x growth of data over the past 3 years.
  3. #EnablingMobility
  4. How we use our data
     ● Business Intelligence
     ● Analytics
     ● Personalization
     ● Fraud Detection
     ● Ads optimization
     ● Cross selling
     ● AB Test
     ● etc.
  5. Why do you need a database?
  6. Why do you need a database?
     So that our service can remember something, i.e. to store state. Our case:
     - Flights / hotels available (inventory)
     - Customer bookings
     - Payment
  7. Why do you need a database?
     Common client-server architecture. The database serves a backend service (don't expose it to clients / the public).
     Diagram labels: Android App (internal storage, etc.); Web (browser cookie); Backend Service; Database.
  8. Back to basics: what's your requirement?
  9. Know your use case
     Examples:
     ● A login feature needs to store user data
     ● E-commerce with search, add to cart, and payment needs to store inventory, cart, and payment data
  10. Common requirements
     Common requirement for early-stage apps: transactional / OLTP (that's it).
     What usually works: MySQL, PostgreSQL.
     Logical modeling: biz requirement → identify entities → ERD → normalize.
     Physical modeling: define the schema (fields & types), the PK, and the indexes.
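As a rough illustration of the physical-modeling step on slide 10, the sketch below defines a schema, primary keys, and one index. It uses Python's built-in `sqlite3` module only so that it runs anywhere; the slide itself recommends MySQL or PostgreSQL. The table, column, and index names (`users`, `bookings`, `idx_bookings_user`) and the sample rows are assumptions made up for this example.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Schema: fields and types, a primary key per table, and one secondary index.
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    email   TEXT NOT NULL UNIQUE
);
CREATE TABLE bookings (
    booking_id INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    status     TEXT NOT NULL,
    created_at TEXT NOT NULL
);
-- Index chosen from the query pattern: "all bookings of one user".
CREATE INDEX idx_bookings_user ON bookings(user_id);
""")

conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?, ?)",
    [(10, 1, "booked", "2017-11-01"), (11, 1, "paid", "2017-11-02")],
)

# The query the index was designed for.
rows = conn.execute(
    "SELECT booking_id, status FROM bookings WHERE user_id = ? ORDER BY booking_id",
    (1,),
).fetchall()

# Ask SQLite how it would run the lookup; the plan should mention the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT booking_id FROM bookings WHERE user_id = 1"
).fetchall()
```

The index is derived from the query pattern rather than added up front, which is the same habit the appendix recommends under logical data modeling.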
  11. How about analytics?
  12. I need analytics on my transaction!
     Common (temporary) solution for analytics in early-stage apps: a replica / slave of the transactional DB. Example: MySQL master-slave replication.
     ● Don't run analytics on the master, ever.
     ● Don't use SQL scripts; use analytics tools (there are many open source tools, e.g. Superset).
     ● Don't stay with this solution! Use it temporarily (explained in a later slide).
     Diagram labels: Master, Slave, Analytics Tools.
  13. I need user activity insights!
     Several things are not really transactional, but we still need insights from them.
     ● How many users use the sort feature? Has it ever been clicked?
     ● How many searches happen per day? What is the trend over the past 7 days?
     We need to send and store user activity and be able to get insights from it. How?
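The questions on slide 13 boil down to counting tracked events per day. Here is a minimal sketch in plain Python, assuming each tracked event is stored as an (event name, date) pair; the event names `search` and `sort_click`, the sample data, and the helper `daily_counts` are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical tracking events: (event_name, ISO date on which it happened).
events = [
    ("search", "2017-11-01"),
    ("search", "2017-11-01"),
    ("sort_click", "2017-11-01"),
    ("search", "2017-11-02"),
    ("search", "2017-11-03"),
    ("search", "2017-11-03"),
    ("search", "2017-11-03"),
]


def daily_counts(events, name, end_day, days=7):
    """Count events called `name` per day for the `days` days ending at `end_day`."""
    counts = Counter(day for event, day in events if event == name)
    # Oldest day first, so the result reads as a trend.
    window = [end_day - timedelta(days=offset) for offset in range(days - 1, -1, -1)]
    return [(d.isoformat(), counts.get(d.isoformat(), 0)) for d in window]


# Searches per day over a 3-day window (the slide asks for 7; 3 keeps the sample small).
search_trend = daily_counts(events, "search", date(2017, 11, 3), days=3)

# Has the sort feature ever been clicked?
sort_was_clicked = any(event == "sort_click" for event, _ in events)
```

An in-memory list like this only works for tiny volumes; the next slides explain why the storage behind it has to change as activity grows.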
  14. I need user activity insights!
     Send your user activity to the backend (usually called tracking). An RDBMS like MySQL usually works for small data, but won't work for huge data:
     ● It is not designed for high write activity; at some point your throughput will get stuck
     ● Normally it tops out at about 6TB, and user activity is usually a lot more than that
     ● It is not designed for analytics workloads (aggregation over a huge number of rows)
     More activity = slow user activity database = slow app. tl;dr: eventually it will slow down, and by then you'll need something different.
     Diagram labels: Android App, Backend Service, Database.
  15. I need user activity insights!
     Common solution (so-called big data):
     ● High write activity? Handled by a datahub such as Kafka.
     ● Huge data size? Handled by a datalake such as HDFS or S3.
     ● Failed analytics queries on huge data? Handled by Hadoop or Spark.
     Important note: use these when you really need them! Those (fancy) things are hard to manage. Don't bother if your scale is not there yet.
     Diagram labels: Android App, Backend Service, Datahub, Datalake.
  16. Remember our replica / slave?
     (Cont.) Don't stay on the replica / slave. Use it for a short period only (the early stage of your app). Why?
     ● Eventually the replica will not be able to handle your load
     ● You will eventually want to join data from different databases (not only MySQL)
     ● You will want to store your processed data as well
     Overall, accessing your db this way is not a good approach; it goes against best practice.
  17. Transaction and user activity are not so different
     So? Treat your transaction data like user activity data: send it as tracking. E.g. you have booking data with two statuses: booked and paid. Send it as two tracking events, 1) a booking event and 2) a payment event, along with details such as booking id, product id, timestamp, etc. So instead of one booking table with two possible statuses, you'll have two tables, one for each status. This is called immutable data, a log.
     Read more: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
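The idea on slide 17 can be sketched as an append-only list of events from which the current status is derived instead of being updated in place. This is only an illustration in plain Python: the names `TrackingEvent`, `track`, `current_status`, and `STATUS_BY_EVENT`, as well as the sample ids, are assumptions, and a real pipeline would send the events to a tracking service rather than keep them in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an event cannot be changed once created
class TrackingEvent:
    event_type: str  # "booking" or "payment"
    booking_id: str
    product_id: str
    timestamp: str   # ISO 8601, so string order equals time order


# Assumed mapping from event type to the status it implies.
STATUS_BY_EVENT = {"booking": "booked", "payment": "paid"}

event_log = []  # append-only: events are added, never updated or deleted


def track(event):
    """Record one event at the end of the log."""
    event_log.append(event)


def current_status(booking_id):
    """Derive the latest status of a booking by replaying its events in time order."""
    status = None
    for event in sorted(event_log, key=lambda e: e.timestamp):
        if event.booking_id == booking_id:
            status = STATUS_BY_EVENT[event.event_type]
    return status


track(TrackingEvent("booking", "B1", "flight-42", "2017-11-01T10:00:00"))
track(TrackingEvent("payment", "B1", "flight-42", "2017-11-01T10:05:00"))
track(TrackingEvent("booking", "B2", "hotel-7", "2017-11-01T11:00:00"))
```

Because nothing is overwritten, the booking event for `B1` is still in the log after the payment arrives, which is what makes the same data usable for both transactions and analytics.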
  18. One architecture for analytics & transaction
     Both transaction and analytics events flow through here, so there is one architecture for analytics and transactions.
     Diagram labels: Android App, Backend, Tracking Service, Datahub, Datalake.
  19. Fully managed alternative
     Alternative: Google Analytics, Flurry, etc.
     Pros: good for the early stage of an app / website; the solution is already quite complete and simple to manage.
     Cons: you don't own your data.
     ● Eventually you'll need to query it yourself
     ● Or build data processing that enables a feature such as recommendation
     Eventually you have to build your own. Data is an important asset. ML, a promising field for competitive advantage, is also based on data.
  20. How about not-so-common application requirements?
  21. Not-so-common application requirements
     Several examples:
     ● I want to store text and then search it. Use an inverted index like Solr or ElasticSearch.
     ● I want to create a social network. Use a graph database like Neo4j.
     ● I need to store system metrics and aggregate them. Use a time series db like Graphite.
     We no longer live in an RDBMS-only world...
  22. Step 3: Know databases available out there
     The problem, as well as the blessing: there are tons of them!
     ● RDBMS: MySQL, PostgreSQL
     ● Document based: MongoDB, DynamoDB
     ● Columnar database: Redshift, BigQuery
     ● NewSQL: Aurora, Spanner
     ● Time series: InfluxDB
     ● Inverted index: ElasticSearch
     ● Distributed filesystem: HDFS, S3, GCS
     ● Graph: Neo4j, ArangoDB
     ● Cache: Redis, Memcached
     ● Key value: HBase, DynamoDB
     And many more… Knowing the concepts and common use cases is the key, e.g. a transactional / operational db is best served by an RDBMS, while columnar databases are usually used for analytics.
  23. Lesson Learned
  24. Sample case in Traveloka
     Use cases: business intelligence, personalization, fraud detection, AB test, etc.
     Diagram labels: Consumer of Data; Streaming; Batch; Traveloka App (Android, iOS); Kafka; ETL; Data Warehouse; S3 Data Lake; Batch Ingest; DOMO Analytics UI; NoSQL DB; Traveloka Services; Ingest: Cloud Pub/Sub; Storage: Cloud Storage; Pipelines: Cloud Dataflow; Analytics: BigQuery; Monitoring: Logging; Query: Hive, Presto.
  25. Step 6: Epilogue: Lessons learned from Traveloka
     ● Always choose technology based on your requirements, not because it is fancy
     ● Be careful of gotchas! There's no silver bullet (example: Google “MySQL limitation”)
     ● Eventually, you'll need to adapt to your growth by:
       ○ Splitting use cases based on query pattern and latency (see appendix)
       ○ Choosing scalable tech based on growth estimation (need to test!)
       ○ Autoscaling! (and managed services if possible)
     ● Keep learning; new databases keep coming
     ● New to these things? Ask around! It saves you a lot of time
  26. Thank you
  27. Appendix: Methodical Approach
  28. Outline
     1. Define your database requirement
     2. Logical data modeling
     3. Know databases available out there
     4. Choose database, physical data modeling
     5. Proof of concept, test
     6. Epilogue: Lessons learned from Traveloka
  29. Step 1: Define database requirement
     ● Understand the biz requirement
     ● Query patterns: try to write your query as a sentence or in SQL
     ● Expected queries/s, inserts/s, updates/s, at peak and at low hours
     ● Latency SLA: subsecond? seconds? minutes?
     ● Consistency: strong or eventual?
     ● Consider data growth: will the db survive in 3y? 5y?
     ● Retention policy: is data > 1 year old still relevant?
     Don't forget basic requirements
     ● e.g. high availability, reliability
     Be aware of constraints
     ● Cost, budget
     ● Maintainability (managed / self-service, community / proprietary)
  30. Step 1: Define database requirement (case study)
     ● Biz requirement:
       ○ We want to recommend items based on the type of a user's latest purchase, showing relevant items on the home page to increase user engagement
     ● Write your query as a sentence:
       ○ For user id = 1234, return the latest purchase item type
       ○ or in SQL: SELECT latest_purchase_item_type FROM user WHERE id = 1234
     ● Expected queries per second:
       ○ The recommendation is shown on the homepage, which is viewed 1,000 times/s during peak and 100 times/s during low hours
     ● Latency SLA:
       ○ Max query latency of 200ms at the 95th percentile, for a convenient UX
  31. Step 1: Define database requirement (case study, cont.)
     ● Consistency: strong or eventual?
       ○ Eventually consistent is OK, with a data lag of at most 1 hour
     ● Consider data growth
       ○ The business expects user growth of 3x next year and 10x in 3 years
     ● Retention policy: is data > 1 year old still relevant?
       ○ We cannot delete user data, but purchases can be deleted after 1 year
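The growth figures in the case study translate into a target query rate with simple arithmetic. The sketch below assumes, as a simplification that is not stated in the slides, that homepage views scale linearly with the number of users; the variable names are illustrative.

```python
# Figures from the case study: homepage views per second today.
peak_qps_now = 1_000
low_qps_now = 100

# Expected user growth factors from the case study.
growth_factors = {"next year": 3, "in 3 years": 10}

# Assumption: query rate grows in proportion to the number of users.
projected_qps = {
    label: {"peak": peak_qps_now * factor, "low": low_qps_now * factor}
    for label, factor in growth_factors.items()
}

for label, qps in projected_qps.items():
    print(f"{label}: peak {qps['peak']:,}/s, low {qps['low']:,}/s")
```

Under that assumption the 3-year peak comes to 10,000 queries/s, which lines up with the read rate used for capacity planning in the proof-of-concept slide.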
  32. Step 2: Logical data modeling
     ● Think of the entities and their properties
     ● PK: what makes each row unique?
     ● Indexes you might need: refer back to your query patterns
     Case study
     ● There is a user entity, with the latest purchase item type as one of its properties
     ● The user id is the ideal PK
     ● No other index is needed
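The case-study model is small enough to try out directly. The sketch below uses Python's built-in `sqlite3` purely for convenience (the deck later shortlists key-value stores for this case); the table name `user`, the sample rows, and the helper `latest_purchase_item_type` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One entity (user), keyed by user id, with the latest purchase item type as a property.
conn.execute(
    "CREATE TABLE user (id INTEGER PRIMARY KEY, latest_purchase_item_type TEXT)"
)
conn.executemany(
    "INSERT INTO user VALUES (?, ?)",
    [(1234, "flight"), (5678, "hotel")],  # made-up sample rows
)


def latest_purchase_item_type(user_id):
    """Look up a single user by primary key; return None for an unknown user."""
    row = conn.execute(
        "SELECT latest_purchase_item_type FROM user WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None
```

Calling `latest_purchase_item_type(1234)` returns the stored item type for that user. Since every query goes through the primary key, no secondary index is needed, matching the conclusion on the slide.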
  33. Step 3: Know databases available out there
     The problem, as well as the blessing: there are tons of them!
     ● RDBMS: MySQL, PostgreSQL
     ● Document based: MongoDB, DynamoDB
     ● Columnar database: Redshift, BigQuery
     ● NewSQL: Aurora, Spanner
     ● Time series: InfluxDB
     ● Inverted index: ElasticSearch
     ● Distributed filesystem: HDFS, S3, GCS
     ● Graph: Neo4j, ArangoDB
     ● Cache: Redis, Memcached
     ● Key value: HBase, DynamoDB
     And many more… Knowing the concepts and common use cases is the key, e.g. a transactional / operational db is best served by an RDBMS, while columnar databases are usually used for analytics.
  34. Step 4: Choose database, physical data modeling
     Choose database
     ● Use your database requirement.
       ○ Our case study: queries are always by key, so it is a key-value use case
     ● Shortlist the dbs that commonly solve your use case.
       ○ Our case study: MongoDB, DynamoDB, HBase, Bigtable
     ● Do a cost-benefit analysis and find the trade-offs
     ● Look for gotchas; google something like “problems with xxDB”
  35. Step 4: Choose database, physical data modeling
     Physical data modeling in the old days:
     ● Choose between MySQL and PostgreSQL
     ● Create an ERD, convert it to normalized form
     ● Define the PK and indexes
     Nowadays: how data is modeled physically differs from one database to another! Several examples:
     ● In MongoDB, a logical object with nested properties can be stored as is
     ● In HBase, defining the row key is crucial
  36. Step 5: Proof of concept, test!
     ● Do capacity planning
       ○ Row count: ~500 million
       ○ Number of columns: ~100
       ○ Size: ~10TB
       ○ Updates: 100-1,000 / s; reads: 10,000 / s
       ○ etc.
     ● Plan the test with part of the expected capacity, e.g. 1/10
       ○ Define success criteria, e.g. query latency of 200ms at the 95th percentile
     ● Load test data
     ● Test queries and inserts/updates
       ○ Cross-check against the success criteria
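The capacity figures above imply a row size and a test size, and the success criterion can be checked mechanically. A rough sketch in plain Python, assuming decimal terabytes and a nearest-rank percentile; the function names and the latency samples are made up for illustration and are not measurements.

```python
# Capacity figures from the slide (approximate).
row_count = 500_000_000
num_columns = 100
total_bytes = 10 * 10**12  # ~10TB, assuming decimal terabytes

# Implied average sizes.
bytes_per_row = total_bytes // row_count        # 20,000 bytes per row
bytes_per_column = bytes_per_row // num_columns  # 200 bytes per column

# Test with 1/10 of the expected capacity, as the slide suggests.
test_rows = row_count // 10     # 50 million rows
test_bytes = total_bytes // 10  # ~1TB


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of the sample count
    return ordered[max(rank, 1) - 1]


def meets_sla(latencies_ms, limit_ms=200, pct=95):
    """Success criterion from the slide: 95th percentile latency within 200ms."""
    return percentile(latencies_ms, pct) <= limit_ms


# Made-up latency samples: 95 fast queries and 5 slow ones still pass,
# while 94 fast and 6 slow ones fail.
passing_run = [50] * 95 + [500] * 5
failing_run = [50] * 94 + [500] * 6
```

Checking `meets_sla` against every test run keeps the cross-check with the success criteria from becoming a judgment call.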
