Presto Bangalore Meetup1 Repertoire@Myntra

Talk at Presto Bangalore Meetup on Myntra's Data Serving Platform: Repertoire


  1. Repertoire: Myntra’s Data Serving Platform (Deepak Batra, Nishant Sharma, Rijo Joseph)
  2. Myntra - What do we do?
     ● Started as a customisation company in 2007, Myntra is the largest fashion e-tailer in India today [1, 2]
     ● In 2016, acquired Jabong to become India’s largest fashion platform
     ● 60M+ app downloads for the Myntra and Jabong apps on the Google Play store
     ● Myntra + Jabong list over 9M items for sale
     ● EORS (End of Reason Sale) is the flagship sale event, with over 2.8M orders, which are fulfilled within 7 days
     ● Focus on innovation with AI, AR/VR and omni-channel based products
     [1] Estimated based on publicly available numbers and research reports for FY 2018
     [2] For core fashion categories of apparel, accessories & footwear for FY 2018
     [Chart: FY18 online share in fashion [2]]
  3. Tech at Myntra
     ● Key tech focus areas for Myntra + Jabong
       ○ Storefront: apps and web platform
       ○ Supply chain: end-to-end inventory & order management
       ○ Data tech: powering data-based insights and intelligent automation in all business areas
     ● Data tech covers all the sources and consumers of data within Myntra + Jabong
     ● Data sources include
       ○ Streaming data from apps and IoT devices
       ○ Content and campaign management systems for storefront apps
       ○ Transactional data from supply chain systems like order management (OMS), warehouse management (WMS) and logistics management (LMS)
     ● Data is processed and served in both real-time and batch modes
     ● Consumers of data include reports/dashboards, tech products and data science models
  4. Data Tech @ Myntra
  5. Challenges
     ● Tiered SLAs
       ○ Low-latency data serving
       ○ What to cache and how to cache
     ● Compute
       ○ Roll-ups and drill-downs on the fly
     ● Multi-modal
       ○ Support for key-value and SQL-type queries
  6. Challenges
     ● Query triaging
       ○ Execution based on SLAs
     ● Low-latency ingestion
       ○ Low ingestion overhead for real-time & batch data
     ● Fault tolerance and NFRs
       ○ Availability, horizontal scalability, isolation
  7. Open Source Solutions
     ● Apache Ignite
       ○ Pros:
         ■ Indexes
         ■ Disk-backed cache
       ○ Cons:
         ■ Batch ingestion
         ■ Uncompressed data in cache
     ● Presto on S3
       ○ Pros:
         ■ Stability
         ■ Out of the box
       ○ Cons:
         ■ No data co-locality
         ■ Movement to Azure
     ● Spark on Alluxio
       ○ Pros:
         ■ Data co-locality
         ■ In-memory cache
       ○ Cons:
         ■ No fixed SLAs
         ■ Concurrency
     ● Presto on Alluxio
       ○ Pros:
         ■ Data co-locality
         ■ Consistent query SLAs
       ○ Cons:
         ■ No in-memory cache
         ■ Limited ML support
  8. Architecture
  9. Architecture
  10. Reference Example
      Distribution of sessions by operating system (OS), city and gender, based on an event type.
      SQL representation:
      SELECT os, city, gender, hll_cardinality(hll_merge(session_id))
      FROM events
      WHERE event_type = 'addToCart'
      GROUP BY os, city, gender;
  11. Architecture
  12. Metric Meta Store
      Information about datasets and their storage
      Constructs:
      ● Namespaces
      ● Cubes
      ● Pre-fetcher
      ● Cache Manager
      ● Caching Policy
  13. Architecture
  14. Prefetcher Service
      Availability/scheduling-based dataset caching
      ● Extract smaller datasets and cache them
      Constructs:
      ● Sources
      ● Transformations
      ● Fetch Frequency
      ● Cache Level
      ● Sinks
  15. Query Flow
  16. Prefetch Flow
  17. Architecture
  18. Alluxio
      ● Open-source virtual distributed file system
      ● Memory-centric architecture
  19. Alluxio
  20. Alluxio
      ● Data locality and short-circuit reads
      ● Tiered storage
      ● Multiple caching policies: LRU, LRFU, FIFO
      ● Pluggable under storage
      ● Pin/unpin data
      Performance tuning:
      ● Read location policy: DeterministicHashPolicy
      ● Disabled passive cache
      ● Write location policy: RoundRobinPolicy
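The two location policies named in the tuning list can be illustrated with a small sketch. This is not Alluxio's implementation: the function and class names and the worker host names are hypothetical, and the code only shows the general idea that a deterministic hash always sends the same block to the same worker, while round robin spreads successive writes evenly.

```python
import hashlib
from itertools import cycle
from typing import List


def deterministic_hash_worker(block_id: str, workers: List[str]) -> str:
    """Pick a worker from a stable hash of the block id (hypothetical helper)."""
    digest = hashlib.md5(block_id.encode("utf-8")).digest()
    return workers[int.from_bytes(digest[:8], "big") % len(workers)]


class RoundRobinWriter:
    """Hand out workers in rotation so writes are spread evenly (hypothetical)."""

    def __init__(self, workers: List[str]) -> None:
        self._workers = cycle(workers)

    def next_worker(self) -> str:
        return next(self._workers)


workers = ["worker-a", "worker-b", "worker-c"]  # made-up host names

# Repeated reads of one block always resolve to the same worker, so the
# block is fetched from the under storage and cached only once.
first = deterministic_hash_worker("block-42", workers)
second = deterministic_hash_worker("block-42", workers)

# Successive writes rotate through the workers.
writer = RoundRobinWriter(workers)
write_targets = [writer.next_worker() for _ in range(4)]
```

Under these assumptions, the read side avoids caching the same block on several workers, and the write side avoids filling one worker ahead of the others.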
  21. HyperLogLogPlus
      ● Probabilistic cardinality estimation algorithm
      ● Why?
        ○ Approximate cardinality without O(N) memory
      SELECT os, city, gender, hll_cardinality(hll_merge(session_id))
      FROM events
      WHERE event_type = 'addToCart'
      GROUP BY os, city, gender;
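To make the merge-then-estimate pattern in the query concrete, here is a minimal sketch of a plain HyperLogLog in Python. It is an illustration only, not the UDAF used in Repertoire and not the full HyperLogLogPlus algorithm (no sparse mode or bias correction). The class name `TinyHLL`, the SHA-1 hash and the sample session ids are assumptions made for this example.

```python
import hashlib
import math


class TinyHLL:
    """Simplified HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        # 64-bit hash: the first p bits choose a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "TinyHLL") -> "TinyHLL":
        # Merging two sketches is an element-wise max of their registers.
        assert self.p == other.p
        out = TinyHLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def cardinality(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


# Two partitions with overlapping session ids; the true union has 10,000 ids.
part_a, part_b, union = TinyHLL(), TinyHLL(), TinyHLL()
for i in range(6000):
    part_a.add(f"s{i}")
for i in range(4000, 10000):
    part_b.add(f"s{i}")
for i in range(10000):
    union.add(f"s{i}")

merged = part_a.merge(part_b)
estimate = merged.cardinality()
```

Because merging is a register-wise maximum, per-partition sketches can be stored and combined later for any grouping, which is what makes pre-aggregated sketches useful for roll-ups and drill-downs.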
  22. HyperLogLogPlus
      ● Precision parameters
        ○ p: tunes accuracy in dense mode
        ○ sp: controls the sparse mode
      ● Relative accuracy: 1.054 / sqrt(2^p)
      ● Implemented as a Spark and Presto UDAF
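The accuracy formula can be tabulated to see how the precision parameter trades memory for error. The sketch below takes the slide's constant of 1.054 at face value (the original HyperLogLog paper quotes 1.04 / sqrt(m) with m = 2^p); the function name is a hypothetical helper.

```python
import math


def relative_error(p: int, constant: float = 1.054) -> float:
    """Relative accuracy for 2**p registers, using the constant from the slide."""
    return constant / math.sqrt(2 ** p)


# Each extra bit of precision doubles the register count and shrinks
# the error by a factor of sqrt(2).
for p in (10, 12, 14, 16):
    print(f"p={p:2d} registers={2 ** p:6d} error={relative_error(p):.4f}")
```

Going from p to p + 2 halves the error at the cost of four times as many registers.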
  23. Event Agnostic Aggregates
      ● Read only required data/event(s)
      ● Partition by events?
        ○ Too many small files
      ● Global sort?
        ○ Too expensive
      ● Bloom filters?
        ○ Not supported by Presto
      ● Localize data and sort within partitions!
  24. ORC Optimizations
      ● Sorting:
        ○ Bin partitioner
        ○ Sort within partition
      ● File size / number of files: ~1GB per file
      ● Stripe size: ~64MB
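The binning-and-sorting idea can be sketched in a few lines of Python. This is an assumption about how a bin partitioner might work (a stable hash of the event type modulo a fixed bin count), not the code used in Repertoire; the function name and the sample rows are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def bin_and_sort(rows: List[dict], key: str, num_bins: int) -> Dict[int, List[dict]]:
    """Group rows into a fixed number of bins by hashing `key`, then sort each bin."""
    bins: Dict[int, List[dict]] = defaultdict(list)
    for row in rows:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        bins[zlib.crc32(row[key].encode("utf-8")) % num_bins].append(row)
    # Sorting inside each bin keeps the rows of one event type contiguous.
    return {b: sorted(rs, key=lambda r: r[key]) for b, rs in bins.items()}


rows = [
    {"event_type": "addToCart", "session_id": "s1"},
    {"event_type": "search", "session_id": "s2"},
    {"event_type": "addToCart", "session_id": "s3"},
    {"event_type": "purchase", "session_id": "s1"},
    {"event_type": "search", "session_id": "s4"},
]
binned = bin_and_sort(rows, key="event_type", num_bins=2)
```

With a fixed bin count the number of files stays bounded, every event type lands in exactly one bin, and the per-bin sort lets a reader skip ranges that cannot contain the requested event without paying for a global sort.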
  25. Funnel Analysis
      Funnel aggregate: def funnel(funnel_def, events_list) => [1, 1, 0]
      device_id | session_id | dim1 | dim2 | events
      d1        | s1         | v1   | v2   | [e1,e2,e3,e4...]
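One plausible reading of the funnel signature is sketched below, assuming that `funnel_def` is an ordered list of steps and that the result holds one 0/1 flag per step, set when that step was reached in order within the session. The slide does not spell out these semantics, so treat them as an assumption; the event names other than `addToCart` are made up.

```python
from typing import List


def funnel(funnel_def: List[str], events_list: List[str]) -> List[int]:
    """Return one 0/1 flag per funnel step, set if the step was reached in order."""
    flags = [0] * len(funnel_def)
    step = 0  # index of the next funnel step we are waiting for
    for event in events_list:
        if step < len(funnel_def) and event == funnel_def[step]:
            flags[step] = 1
            step += 1
    return flags


result = funnel(
    ["view", "addToCart", "purchase"],
    ["view", "search", "addToCart", "view"],
)
print(result)  # [1, 1, 0]: viewed and added to cart, but never purchased
```

Summing these per-session flag vectors across sessions gives the number of sessions that reached each step.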
  26. Some Benchmarks - Benchto
      ● Input rows: 27.4M
      ● Query runtime improved by 30-35%
      Query complexity           | Presto (with Alluxio) | Presto (with S3)
      Light (sum)                | 23 sec                | 37 sec
      Medium (HLL on one field)  | 44 sec                | 63 sec
      Heavy (HLL on multi field) | 49 sec                | 72 sec
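The quoted improvement can be checked directly from the benchmark timings. The short calculation below computes the runtime reduction of Presto with Alluxio relative to Presto with S3; the dictionary and function names are only for illustration.

```python
# (Presto with Alluxio, Presto with S3) runtimes in seconds, from the slide.
benchmarks = {
    "Light (sum)": (23, 37),
    "Medium (HLL on one field)": (44, 63),
    "Heavy (HLL on multi field)": (49, 72),
}


def improvement_pct(alluxio_s: float, s3_s: float) -> float:
    """Runtime reduction relative to the S3 baseline, in percent."""
    return (s3_s - alluxio_s) / s3_s * 100


improvements = {name: improvement_pct(a, s) for name, (a, s) in benchmarks.items()}
for name, pct in improvements.items():
    print(f"{name}: {pct:.1f}% faster")
```

The medium and heavy queries come out at about 30% and 32%, inside the quoted range, while the light query works out to about 38%, slightly above it.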
  27. Learnings
      Presto
      ● Network bottlenecks: using a 10Gbps line
      ● Enabling disk spills
      ORC Optimizations
      ● Binning and sorting data
      ● Limiting the number of files
      ● Stripe size adherence
      Alluxio
      ● Deterministic Hash Policy for reads from UnderFS
      ● Disabling passive cache
      ● Round Robin Policy for writes
  28. Inflight
      ● Cache management
      ● Prefetch enhancements
        ○ Different sources/sinks
      ● Query triaging
      ● Apache Atlas integration
      ● Dedicated metric meta-store
  29. Down the road
      ● Compute engines
        ○ Hive on Spark
        ○ Spark
      ● Caching intelligently
      ● Different key-store evaluation
  30. Thank You
