Myntra.com's Big Data Platform

This is the presentation given at the Fifth Elephant Conference 2013. It describes how we created a cloud-based big data platform that is low on maintenance and running cost. Key technologies used are Twitter Finagle, Apache Kafka, Apache ZooKeeper, Amazon S3 and Amazon EMR.


Presentation Transcript

  • Cloud-based, low-cost, low-maintenance, scalable data platform
    Apoorva Gaurav
  • Why hunt elephant to sell shoes? WHOM, HOW, WHAT
  • Use case: List products based on CTR
      ● Take all impressions of a product and the action performed on each
      ● Some products are more attractive than others
      ● Give benefit to such products
  • Use case: List products based on CTR
      ● select product_id, sum(clicked)/sum(appeared) as ctr from tbl_prod_log group by product_id order by ctr desc
      ● >100K products, >500 million impressions a day: DIFFICULT TO SCALE
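
The CTR query above can be sketched as a single pass over impression records, which is roughly the shape a map/reduce job over the same data would take. This is a minimal illustration and not the platform's actual job: the record layout `(product_id, clicked)` with one record per impression, and the name `ctr_by_product`, are assumptions made for the example.

```python
from collections import defaultdict

def ctr_by_product(impressions):
    """Return [(product_id, ctr)] sorted by CTR, highest first.

    `impressions` is an iterable of (product_id, clicked) pairs, one per
    impression; clicked is 1 if the impression led to a click, else 0.
    (Assumed layout, not the real tbl_prod_log schema.)
    """
    appeared = defaultdict(int)   # product_id -> number of impressions
    clicked = defaultdict(int)    # product_id -> number of clicks
    for product_id, was_clicked in impressions:
        appeared[product_id] += 1
        clicked[product_id] += was_clicked
    ctr = {pid: clicked[pid] / appeared[pid] for pid in appeared}
    return sorted(ctr.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample data: five impressions across three products.
sample = [("p1", 1), ("p1", 0), ("p2", 1), ("p2", 1), ("p3", 0)]
ranking = ctr_by_product(sample)
```

The two counters are plain sums, so partial results from separate log files can be added together before the final division.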
  • Use case: User segmentation
      ● Different users have different browsing patterns
      ● Segment them based on their history
      ● Provide them a different experience
  • Use case: User segmentation
      ● select depth, count(cookie_id) from user_log group by depth
      ● >1 million users daily, multiple browsers and devices
      ● DIFFICULT TO SCALE
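
The segmentation query above groups cookies by browsing depth. The sketch below does the same over a raw stream of page views, assuming that "depth" means the number of pages a cookie viewed; that reading, the input layout and the function name are assumptions for illustration only.

```python
from collections import Counter

def users_per_depth(page_views):
    """Count how many cookies reached each browsing depth.

    `page_views` is an iterable of cookie ids, one entry per page viewed
    (assumed layout). Depth is taken to be pages viewed per cookie.
    """
    depth_by_cookie = Counter(page_views)         # cookie_id -> pages viewed
    return Counter(depth_by_cookie.values())      # depth -> number of cookies

# Hypothetical sample: cookie "a" viewed 3 pages, "c" viewed 2, "b" and "d" 1 each.
views = ["a", "a", "a", "b", "c", "c", "d"]
segments = users_per_depth(views)
```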
  • Use case: Recommend similar products
      ● Compute a score for each product based on various attributes
      ● Compute a score for each user based on the products (s)he browses
      ● Recommend similar products
  • Use case: Recommend similar products
      ● select id, (w1.att1 + w2.att2 + ... + wN.attN) as score from products
      ● select userid, (v1.score1 + v2.score2 + ... + vN.scoreN)
      ● >1 million users, >100K products: DIFFICULT TO COMPUTE
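
The scoring queries above are weighted sums, reading `w1.att1` as weight times attribute. A minimal sketch of that calculation follows. The slides do not say how "similar" is decided, so ranking by closeness of scores is an assumption here, as are the weights, the attribute vectors and the function names.

```python
def weighted_score(values, weights):
    """Weighted sum w1*x1 + w2*x2 + ... + wN*xN."""
    return sum(w * v for w, v in zip(weights, values))

def most_similar(target_id, product_scores, k=2):
    """Return the k product ids whose scores are closest to the target's.

    Closeness of scores as a similarity measure is an assumption made
    for this example; the slides do not specify the measure.
    """
    target = product_scores[target_id]
    others = [
        (pid, abs(score - target))
        for pid, score in product_scores.items()
        if pid != target_id
    ]
    others.sort(key=lambda item: item[1])
    return [pid for pid, _ in others[:k]]

# Hypothetical weights and attribute vectors.
weights = [0.5, 0.3, 0.2]
products = {
    "shoe_a": [1.0, 0.0, 1.0],
    "shoe_b": [1.0, 0.0, 0.5],
    "bag_c": [0.0, 1.0, 0.0],
}
scores = {pid: weighted_score(attrs, weights) for pid, attrs in products.items()}
similar = most_similar("shoe_a", scores, k=1)
```

A user's score can be computed with the same `weighted_score` helper, passing the scores of the browsed products as `values`.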
  • Constraints
      ● Fast paced
      ● Tangible results
      ● Limited budget
      ● Low engineering bandwidth
  • Design goals
      ● Solution should be able to scale up and down
      ● Record data now, ask questions later
      ● Generic data model
      ● Segregate reads from writes
      ● Low running cost
      ● Low maintenance overhead
  • Cloud computing
      Pros
      ● No setup cost
      ● Pay as you use
      ● Scaling is a breeze
      ● Managed services
      Cons
      ● Performance
      ● Reliability
      ● Data security
      ● Control
  • A very basic Big Data system: Highly available; Very low latency; Initial filtering; Storage agnostic; Scale up and down easily; Essentially distributed; Very easy to use; Highly reliable; Huge capacity; Cater to any data model; Cheap
  • Architecture Diagram: Hadoop on cloud; Easy to scale up and down; Pay as you use; Infinite capacity; 11 nines of durability; Flat file storage; Cheap; Persistent distributed queue; 100K msg/sec; Events can be played back; Highly concurrent server; Very easy to use; Flexible; Much easier to introduce HA, reliability etc.; Both server and client side data; Segregate and upload events to S3; Scales horizontally; Distributed config management; Fault tolerant
  • Some numbers
      ● ~20 million events logged daily
      ● Corresponds to ~800 million data points and ~25 GB
      ● Close to 100 jobs a day
      ● The biggest job has a footprint of ~2 billion events
      ● Platform costs ~$20 daily; jobs ~$15 daily
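
The figures above imply a few per-event numbers that are easy to derive. The arithmetic below uses only the values from the slide; since those are approximate, the results are rough estimates, and taking 1 GB as 10^9 bytes is an assumption.

```python
events_per_day = 20_000_000
data_points_per_day = 800_000_000
bytes_per_day = 25 * 10**9          # ~25 GB, assuming 1 GB = 10^9 bytes
daily_cost_usd = 20 + 15            # platform + jobs

# Average number of data points carried by one event.
data_points_per_event = data_points_per_day / events_per_day

# Average size of one event in bytes.
bytes_per_event = bytes_per_day / events_per_day

# Total daily cost spread over each million events logged.
cost_per_million_events = daily_cost_usd / (events_per_day / 1_000_000)
```

That works out to about 40 data points and 1.25 KB per event, at roughly $1.75 per million events.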
  • Some key learnings
      ● One can code in English (Finagle): myService = handleExceptions andThen recordInKafka andThen respond
      ● Need not be in C or Erlang to be performant (Kafka)
      ● Can search without an index:
        s3://<BUCKET>/addToCart/y=2013/m=06/d=14/h=13/min=30
        s3://<BUCKET>/orderConfirmation/y=2013/m=06/d=14/h=13/min=30
      ● Spot EMR clusters efficiently
      ● m1.small are not small
      ● awk + grep = awesome
      ● Apache mailing lists SUCK!!!
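
The "search without index" learning relies on encoding the event type and time into the S3 key, so that listing a key prefix finds the right data. A minimal sketch of building such keys is below. The function names, the example bucket name and the 30-minute bucketing (guessed from `min=30` in the paths above) are assumptions, not details confirmed by the slides.

```python
from datetime import datetime

def event_prefix(bucket, event_type, ts, bucket_minutes=30):
    """Build the time-partitioned S3 prefix for one event.

    Minutes are floored to a multiple of `bucket_minutes`; the default
    of 30 is an assumption, not something the slides state.
    """
    minute = (ts.minute // bucket_minutes) * bucket_minutes
    return (
        f"s3://{bucket}/{event_type}/"
        f"y={ts:%Y}/m={ts:%m}/d={ts:%d}/h={ts:%H}/min={minute:02d}"
    )

def day_prefix(bucket, event_type, ts):
    """Prefix that covers every key of one event type for a whole day."""
    return f"s3://{bucket}/{event_type}/y={ts:%Y}/m={ts:%m}/d={ts:%d}/"

# "example-bucket" is a placeholder bucket name.
when = datetime(2013, 6, 14, 13, 42)
cart_key = event_prefix("example-bucket", "addToCart", when)
orders_day = day_prefix("example-bucket", "orderConfirmation", when)
```

Dropping trailing path components widens the search: the day prefix matches all hours and minute buckets under it, so a job can select its input by prefix alone.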
  • Thank you!! & we are hiring apoorva.gaurav@myntra.com