Myntra.com's Big Data Platform

This is the presentation given at the Fifth Elephant Conference 2013. It describes how we created a cloud-based big data platform that is low on maintenance and running cost. The key technologies used are Twitter Finagle, Apache Kafka, Apache ZooKeeper, Amazon S3 and Amazon EMR.

Published in: Technology
Transcript of "Myntra.com's Big Data Platform"

  1. Cloud-based, low-cost, low-maintenance, scalable data platform
     Apoorva Gaurav
  2. Why hunt an elephant to sell shoes?
  3. Why hunt an elephant to sell shoes?
     WHOM
     HOW
     WHAT
  4. Use case: List products based on CTR
     ● Take all impressions of a product and the action performed
     ● Some products are more attractive than others
     ● Give benefit to such products
  5. Use case: List products based on CTR
     ● select product_id, sum(clicked)/sum(appeared) as ctr from tbl_prod_log group by product_id order by ctr desc
     ● >100K products, >500 million impressions a day
     ● DIFFICULT TO SCALE
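The CTR query on this slide can be sketched in plain Python to make the aggregation explicit. This is a minimal in-memory illustration, not the platform's implementation: the tuple layout `(product_id, clicked, appeared)` mirrors the columns named in the query, while the function name `rank_by_ctr` and the sample rows are assumptions made up for the example.

```python
from collections import defaultdict


def rank_by_ctr(rows):
    """Rank products by click-through rate, highest first.

    rows: iterable of (product_id, clicked, appeared) tuples, where
    clicked and appeared are 0/1 flags or counts (assumed layout).
    """
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for product_id, clicked, appeared in rows:
        clicks[product_id] += clicked
        impressions[product_id] += appeared

    # Same as sum(clicked)/sum(appeared) grouped by product_id;
    # products with no impressions are skipped to avoid dividing by zero.
    ctr = {
        pid: clicks[pid] / impressions[pid]
        for pid in impressions
        if impressions[pid] > 0
    }
    # Same as "order by ctr desc".
    return sorted(ctr.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample log: one row per impression.
sample_log = [
    ("shoe-1", 1, 1),
    ("shoe-1", 0, 1),
    ("bag-7", 1, 1),
    ("bag-7", 1, 1),
    ("tee-3", 0, 1),
]
ranking = rank_by_ctr(sample_log)
```

The logic is trivial on a handful of rows. The difficulty the slide points at is running the same group-by over hundreds of millions of impressions a day.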
  6. Use case: User segmentation
     ● Different users have different browsing patterns
     ● Segment them based on their history
     ● Provide them a different experience
  7. Use case: User segmentation
     ● select depth, count(cookie_id) from user_log group by depth
     ● >1m users daily, multiple browsers, devices
     ● DIFFICULT TO SCALE
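The segmentation query amounts to a histogram of users over browsing depth. A minimal sketch follows, assuming that `depth` is some per-user measure of how far a visitor browses (the slides do not define it) and that each log row is a `(cookie_id, depth)` pair. The function name and sample data are hypothetical.

```python
from collections import Counter


def users_per_depth(user_log):
    """Count log rows per depth value.

    user_log: iterable of (cookie_id, depth) pairs (assumed layout).
    Like count(cookie_id) in the query, this counts rows, not distinct
    cookies.
    """
    return Counter(depth for _cookie_id, depth in user_log)


# Hypothetical sample: four cookies with their browsing depth.
sample_user_log = [("c1", 1), ("c2", 3), ("c3", 1), ("c4", 7)]
segments = users_per_depth(sample_user_log)
```

Each depth value then identifies one segment, and the count shows how many visitors fall into it.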
  8. Use case: Recommend similar products
     ● Compute score of products based on various attributes
     ● Compute score of a user based on products (s)he browses
     ● Recommend similar products
  9. Use case: Recommend similar products
     ● select id, (w1.att1 + w2.att2 + ... + wN.attN) as score from products
     ● select userid, (v1.score1 + v2.score2 + ... + vN.scoreN)
     ● >1m users, >100K products
     ● DIFFICULT TO COMPUTE
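The two scoring queries can be read as weighted sums: a product score over its attributes, and a user score over the scores of the products browsed. The sketch below makes that reading concrete. It assumes that `w1.att1` means weight times attribute value, and that "similar" means "closest in score to the user". The slides state neither, so all function names, weights and sample values are hypothetical.

```python
def product_score(attributes, weights):
    """Weighted sum w1*att1 + ... + wN*attN for one product."""
    return sum(w * a for w, a in zip(weights, attributes))


def user_score(browsed_scores, view_weights):
    """Weighted sum v1*score1 + ... + vN*scoreN over browsed products."""
    return sum(v * s for v, s in zip(view_weights, browsed_scores))


def recommend(target_score, product_scores, exclude=(), k=1):
    """Return up to k product ids whose score is closest to target_score.

    Closeness in score is an assumed stand-in for "similar".
    """
    candidates = [pid for pid in product_scores if pid not in exclude]
    candidates.sort(key=lambda pid: abs(product_scores[pid] - target_score))
    return candidates[:k]


# Hypothetical attribute vectors and weights.
weights = (0.5, 0.25)
catalog = {"p1": (2, 4), "p2": (4, 4), "p3": (8, 8), "p4": (2, 8)}
scores = {pid: product_score(attrs, weights) for pid, attrs in catalog.items()}

# A user who browsed p1 and p2, weighted equally.
browsed = ["p1", "p2"]
score_of_user = user_score([scores[p] for p in browsed], (0.5, 0.5))
suggestions = recommend(score_of_user, scores, exclude=browsed, k=1)
```

The per-user step has to run for every user against the whole catalog, which is where the ">1m users, >100K products" cost comes from.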
  10. Constraints
      ● Fast paced
      ● Tangible results
      ● Limited budget
      ● Low engineering bandwidth
  11. Design goals
      ● Solution should be able to scale up and down
      ● Record data now, ask questions later
      ● Generic data model
      ● Segregate reads from writes
      ● Low running cost
      ● Low maintenance overhead
  12. Cloud computing
      Pros
      ● No setup cost
      ● Pay as you use
      ● Scaling is a breeze
      ● Managed services
      Cons
      ● Performance
      ● Reliability
      ● Data security
      ● Control
  13. A very basic Big Data system
      ● Highly available
      ● Very low latency
      ● Initial filtering
      ● Storage agnostic
      ● Scale up and down easily
      ● Essentially distributed
      ● Very easy to use
      ● Highly reliable
      ● Huge capacity
      ● Cater to any data model
      ● Cheap
  14. Architecture Diagram
  15. Architecture Diagram (labels on the diagram)
      ● Hadoop on cloud
      ● Easy to scale up and down
      ● Pay as you use
      ● Infinite capacity
      ● 11 nines of durability
      ● Flat file storage
      ● Cheap
      ● Persistent distributed queue
      ● 100K msg/sec
      ● Events can be played back
      ● Highly concurrent server
      ● Very easy to use
      ● Flexible
      ● Much easier to introduce HA, reliability, etc.
      ● Both server and client side data
      ● Segregate and upload events to S3
      ● Scales horizontally
      ● Distributed config management
      ● Fault tolerant
  16. Some numbers
      ● ~20 million events logged daily
      ● Corresponds to ~800 million data points
      ● and ~25 GB
      ● Close to 100 jobs a day
      ● The biggest job has a footprint of ~2 billion events
      ● Platform costs ~$20 daily; jobs ~$15 daily
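A quick back-of-the-envelope pass over these figures shows what they imply per event. The inputs are the rounded values from the slide; reading 1 GB as 10^9 bytes is an assumption, and the derived numbers are only as precise as the "~" on the originals.

```python
# Figures quoted on the slide (all approximate).
EVENTS_PER_DAY = 20_000_000
DATA_POINTS_PER_DAY = 800_000_000
BYTES_PER_DAY = 25 * 10**9          # ~25 GB, assuming decimal gigabytes
PLATFORM_USD_PER_DAY = 20
JOBS_USD_PER_DAY = 15

# Derived values.
data_points_per_event = DATA_POINTS_PER_DAY // EVENTS_PER_DAY
bytes_per_event = BYTES_PER_DAY // EVENTS_PER_DAY
total_usd_per_day = PLATFORM_USD_PER_DAY + JOBS_USD_PER_DAY
usd_per_million_events = total_usd_per_day / (EVENTS_PER_DAY / 1_000_000)

print(f"{data_points_per_event} data points per event")
print(f"{bytes_per_event} bytes per event")
print(f"${total_usd_per_day} per day, ${usd_per_million_events:.2f} per million events")
```

On these numbers an event carries about 40 data points in roughly 1.25 KB, and the whole pipeline costs under $2 per million events.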
  17. Some key learnings
      ● One can code in English (Finagle)
        myService = handleExceptions andThen recordInKafka andThen respond
      ● Need not be in C or Erlang to be performant (Kafka)
      ● Can search without index
        s3://<BUCKET>/addToCart/y=2013/m=06/d=14/h=13/min=30
        s3://<BUCKET>/orderConfirmation/y=2013/m=06/d=14/h=13/min=30
      ● Spot EMR clusters efficiently
      ● m1.small are not small
      ● awk + grep = awesome
      ● Apache mailing lists SUCK!!!
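The "search without index" point rests on the key layout shown in the two S3 paths: event type and time are encoded in the prefix, so a job can select its input by building prefixes instead of querying an index. The sketch below generates such prefixes. The layout follows the paths on the slide, but the 30-minute bucket size is an assumption (the slide only shows `min=30`), and the function names and the bucket name `my-bucket` are hypothetical.

```python
from datetime import datetime, timedelta


def event_prefix(bucket, event_type, ts, bucket_minutes=30):
    """Build the S3 prefix holding events of one type for one time bucket.

    bucket_minutes is an assumed bucket size and should divide 60.
    """
    minute = (ts.minute // bucket_minutes) * bucket_minutes
    return (
        f"s3://{bucket}/{event_type}"
        f"/y={ts.year:04d}/m={ts.month:02d}/d={ts.day:02d}"
        f"/h={ts.hour:02d}/min={minute:02d}"
    )


def prefixes_between(bucket, event_type, start, end, bucket_minutes=30):
    """List the prefixes covering the half-open interval [start, end)."""
    aligned_minute = (start.minute // bucket_minutes) * bucket_minutes
    current = start.replace(minute=aligned_minute, second=0, microsecond=0)
    step = timedelta(minutes=bucket_minutes)
    prefixes = []
    while current < end:
        prefixes.append(event_prefix(bucket, event_type, current, bucket_minutes))
        current += step
    return prefixes


# One hour of addToCart events on 14 June 2013.
hour_of_cart_events = prefixes_between(
    "my-bucket", "addToCart",
    datetime(2013, 6, 14, 13, 0), datetime(2013, 6, 14, 14, 0),
)
```

A job that needs one hour of `addToCart` events then reads only the listed prefixes and never touches other event types or time ranges.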
  18. Thank you!!
      & we are hiring
      apoorva.gaurav@myntra.com